Description
Somatic structural variation (SSV) is an established hallmark of cancer and is emerging as an important contributor to the pathogenesis of neurological diseases such as autism and schizophrenia, but detecting SSVs in silico is challenging because of the extreme imbalance between false positive predictions and true positive events. While many methods detect germline structural variation (GSV) successfully, unbiased SSV detection remains particularly difficult because the signal from SSVs is weaker than that from GSVs. Existing programs that identify somatic variants are limited to detecting single nucleotide variants and mobile-element insertions, target tumor cells, or are based on microarrays, which suffer from low resolution. Here, I developed a machine learning method to detect and filter GSV and SSV, reducing an initial prediction set of millions of SVs to a manageable number of high-confidence predictions. The program, named CHONK, uses alignments from bulk whole genome sequencing to aid in the discovery and filtering of SSV. CHONK operates on unpaired samples and confidently identifies GSV, as well as SSV across a wide range of allele frequencies (down to 1%) and lengths (50 bp to 1 Mb). The GSV classifier employs a random forest model and genotypes with a false discovery rate (FDR) of 3-12%, a sensitivity of 88-99%, and an area under the receiver operating characteristic curve (ROC) of 91-99%, depending on the size and type of SV. CHONK’s germline genotyping performance is comparable to that of commonly used genotypers, and it outperforms them when genotyping large (≥1 kb) deletions and duplications. The SSV classifier also uses a random forest model and predicts somatic events with an FDR of 9-27%, a sensitivity of 31-86%, an ROC of 65-92%, and a Matthews Correlation Coefficient of 0.5-0.9, depending on the size and type of SV.
This application will aid the detection of these elusive mutations and help researchers and clinicians further their understanding of the effects of mosaicism on evolution and disease.
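For readers unfamiliar with the metrics quoted above, the following minimal sketch shows how the false discovery rate, sensitivity, and Matthews Correlation Coefficient can be computed from confusion-matrix counts. It is not taken from CHONK; the function name `classification_metrics` and the example counts are hypothetical and chosen only for illustration.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute FDR, sensitivity, and MCC from confusion-matrix counts.

    Hypothetical helper for illustration; not part of CHONK.
    Undefined ratios (zero denominators) are reported as 0.0.
    """
    # FDR: fraction of positive predictions that are false.
    fdr = fp / (tp + fp) if (tp + fp) else 0.0
    # Sensitivity (recall): fraction of true events that are recovered.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # MCC: correlation between predicted and true labels, in [-1, 1].
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"fdr": fdr, "sensitivity": sensitivity, "mcc": mcc}


# Made-up counts, not results from the study.
m = classification_metrics(tp=80, fp=20, tn=880, fn=20)
print(m)
```

Because the MCC uses all four cells of the confusion matrix, it is less easily inflated by a large majority class than accuracy is, which is one reason it is often reported for imbalanced problems such as somatic variant detection.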