Description
Sequencing platforms unavoidably introduce errors into genomic data, and these errors can cause problems in downstream genomic analysis pipelines. Sequencing errors are significantly more prevalent in Continuous Long Reads (CLRs). Platforms that produce CLRs, such as PacBio and Nanopore, are often used to sequence prokaryotic genomes because long reads preserve translocations that would otherwise have been fixed by a De Bruijn-based assembly, while also bridging long repeat regions. The drawback of CLRs is their high base error rate. Using Google’s open-source TensorFlow machine learning library together with BioPython and Pysam, we created DeepVCF, a deep learning variant caller based on a Convolutional Neural Network, and show here that it outperforms traditional Hidden Markov Model based variant callers, such as BCFtools, on erroneous genomic data produced by sequencing platforms. To do so, we use the high-confidence Genome in a Bottle (GIAB) variant dataset as a baseline to establish model validity, and we train and test on datasets from 10 prokaryotic species with variants created in silico. DeepVCF provides dynamic parameters that let the user alter the dimensions of the training tensors, the heterozygous threshold for false positive training, the minimum base quality, and the minimum read coverage, and it gives complete control over the Keras layers within the machine learning model. The current drawback of existing deep learning variant callers, such as Google’s DeepVariant, is their fixed parameters, which achieve award-winning accuracy on ideal training datasets from GIAB but underperform on less-than-optimal prokaryotic variant datasets. By giving more control to the user, we show that DeepVCF can provide insight into erroneous genomic data and help better determine novel variant calls with simple dynamic models.
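
To make the user-adjustable filters mentioned above more concrete, the following is a minimal sketch of how a minimum base quality, a minimum read coverage, and a heterozygous threshold might be applied to the bases observed at a single position. It is not DeepVCF's code or API: the names `CallerConfig` and `candidate_allele`, the default values, and the reading of the heterozygous threshold as a minimum alternate-allele fraction are all assumptions made for illustration. Tensor dimensions and Keras layers are left out to keep the example dependency-free.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class CallerConfig:
    """Hypothetical settings; names and defaults are illustrative only."""
    min_base_quality: int = 10            # Phred score below which a base is ignored
    min_read_coverage: int = 5            # minimum number of usable bases at a site
    heterozygous_threshold: float = 0.25  # assumed: minimum alt-allele fraction


def candidate_allele(
    pileup: List[Tuple[str, int]], ref_base: str, cfg: CallerConfig
) -> Optional[Tuple[str, float]]:
    """Return (alt_base, fraction) if the site passes all filters, else None.

    `pileup` is a list of (base, phred_quality) pairs observed at one position.
    """
    # Drop low-quality bases before counting coverage.
    usable = [base for base, qual in pileup if qual >= cfg.min_base_quality]
    if len(usable) < cfg.min_read_coverage:
        return None
    # Count only bases that disagree with the reference.
    alt_counts = Counter(base for base in usable if base != ref_base)
    if not alt_counts:
        return None
    alt, count = alt_counts.most_common(1)[0]
    fraction = count / len(usable)
    if fraction < cfg.heterozygous_threshold:
        return None
    return alt, fraction


# Six reference bases, four high-quality G bases, three low-quality T bases.
pileup = [("A", 30)] * 6 + [("G", 30)] * 4 + [("T", 2)] * 3
print(candidate_allele(pileup, "A", CallerConfig()))  # ('G', 0.4)
```

Raising `min_read_coverage` or `heterozygous_threshold` in this sketch makes the same site return `None`, which illustrates the kind of trade-off such parameters let a user tune for noisy long-read data.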