Description
Non-targeted analysis has emerged as a crucial method for identifying contaminants in environmental samples. This technique usually uses comprehensive two-dimensional gas chromatography coupled with time-of-flight mass spectrometry (GC × GC/ToF-MS) to generate raw data, which are then analyzed with the LECO® ChromaTOF® software. However, this software is prone to errors when matching an analyte in a sample to its best spectral match in the National Institute of Standards and Technology (NIST) database, which forces the researcher to classify the software’s spectral matches as “high” or “low” quality through an extensive manual review. To reduce review time and manual error, an algorithm that classifies these matches was written in Python. Because this algorithm uses hard-coded rules with strict thresholds, a compound may be misclassified if it fails even one of the requirements for a high classification. We therefore developed several supervised machine learning models in Python to evaluate how well machine learning performs on this classification task. Machine learning models train on data sets with known outcomes and use what they learn to classify unknown data. Because they do not rely on strict rules, they may correctly classify matches that the algorithm misses. The models were tested on two data sets: one from the Environmental Protection Agency and another of stormwater runoff samples from the 2017 Northern California wildfires. In initial testing, the rule-based algorithm achieved higher accuracy than the machine learning models. However, the models detected high-quality spectral matches that the algorithm missed. In addition, after a class balancing technique was applied, the models achieved higher accuracy on the stormwater data set.
These results suggest that machine learning may be a viable alternative for classifying mass spectral matches.
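The contrast between strict thresholds and class balancing can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration: the score names (`similarity`, `reverse`, `probability`), the threshold values, the example matches, and the use of random oversampling as the balancing step. The abstract does not state which rules or which balancing technique were actually used.

```python
import random
from collections import Counter

# Hypothetical thresholds: the actual rules are not given in the abstract.
THRESHOLDS = {"similarity": 800, "reverse": 800, "probability": 5000}


def rule_based_label(match):
    """Label a match 'high' only if every score meets its threshold."""
    passes_all = all(match[name] >= limit for name, limit in THRESHOLDS.items())
    return "high" if passes_all else "low"


def oversample_minority(rows, labels, seed=0):
    """Randomly duplicate rows of smaller classes until all classes are equal in size.

    Random oversampling is a stand-in here, not necessarily the technique
    used in the study.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Made-up example matches (not data from the study).
matches = [
    {"similarity": 910, "reverse": 925, "probability": 8200},
    {"similarity": 795, "reverse": 930, "probability": 9100},  # misses one threshold by 5
    {"similarity": 610, "reverse": 640, "probability": 1200},
    {"similarity": 580, "reverse": 600, "probability": 900},
]

rule_labels = [rule_based_label(m) for m in matches]
balanced_rows, balanced_labels = oversample_minority(matches, rule_labels)
balanced_counts = Counter(balanced_labels)
```

In this sketch the second match is labeled "low" because one score falls just under its threshold, which is the kind of borderline case a trained model might classify differently. The balanced set could then be passed to any supervised classifier as training data.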