Description
Non-targeted analysis of environmental pollutants is important because it can identify novel contaminants that could cause deleterious biological effects. Non-targeted analysis performed using comprehensive two-dimensional gas chromatography coupled with time-of-flight mass spectrometry (GC×GC/ToF-MS) generates large numbers of possible analytes and presents data analysis challenges. In addition, the default spectral library search algorithm produces many false positive matches, which must be reviewed manually. This makes non-targeted analysis tedious and prone to human error when large volumes of data are handled. To improve the speed and accuracy of non-targeted analysis, we developed CINeMA.py (Classification Is Never Manual Again), which automates GC×GC/ToF-MS data interpretation by predicting the quality of the spectrum match (High or Low) between the suggested analyte mass spectrum and the mass spectrum of the library hit from the NIST library search, as generated by the LECO® ChromaTOF® software. Our software allows the user to evaluate the quality of the match using two different approaches: an algorithmic approach and a machine learning (neural network) approach. In addition, the software allows the user to adjust various parameters (e.g., similarity threshold, percent library hit threshold, epochs) and study their effect on prediction accuracy. We used data from an EPA environmental sample to assess the effectiveness of CINeMA.py. Accuracies of 80% and 74% were obtained for the algorithmic and machine learning approaches, respectively, using as the reference an analysis of the same data sets performed by a highly trained individual. The software processed 700 suggested analytes in 10 seconds, whereas the manual data analysis took 3 months.
We also encountered errors in the manual review used as the ground truth; these errors arose from heavy data handling and lowered the measured accuracy of the software. CINeMA.py significantly reduces manual analysis time, improves accuracy by reducing manual errors, and gives the user additional analysis options for working with large data sets in non-targeted analysis, a substantial improvement over manual analysis.
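
The High/Low labeling and the accuracy comparison described above can be illustrated with a minimal Python sketch. It is not the CINeMA.py implementation: it assumes, purely for illustration, that match quality is scored with a cosine similarity between two spectra and compared against a single similarity threshold. The function names, the dictionary representation of a spectrum, and the default threshold of 0.8 are all hypothetical.

```python
import math


def cosine_similarity(spectrum_a, spectrum_b):
    """Cosine similarity between two spectra given as {m/z: intensity} dicts.

    Assumption: spectra are already binned to the same m/z values.
    """
    shared = set(spectrum_a) & set(spectrum_b)
    dot = sum(spectrum_a[mz] * spectrum_b[mz] for mz in shared)
    norm_a = math.sqrt(sum(v * v for v in spectrum_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spectrum_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify_match(analyte, library_hit, similarity_threshold=0.8):
    """Label a match "High" or "Low" (hypothetical rule, illustrative threshold)."""
    score = cosine_similarity(analyte, library_hit)
    return "High" if score >= similarity_threshold else "Low"


def accuracy(predicted, reference):
    """Fraction of predicted labels that agree with the manual reference labels."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("reference labels must not be empty")
    return sum(p == r for p, r in zip(predicted, reference)) / len(reference)
```

Raising or lowering similarity_threshold in a sketch like this changes how many matches are labeled High, which is the kind of parameter effect on prediction accuracy that the software lets the user study.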