Description
Members of Phycodnaviridae are large, icosahedral, double-stranded DNA (dsDNA) eukaryotic viruses with genome lengths ranging from 170 to 560 kbp and particle sizes ranging from 100 to 220 nm. The family comprises six genera (Chlorovirus, Coccolithovirus, Phaeovirus, Prasinovirus, Prymnesiovirus and Raphidovirus). Members of Phycodnaviridae play an essential role in marine ecology by infecting phytoplankton communities and macroalgae. One method of studying viruses is metagenomics. Metagenomes are the genetic material retrieved directly from environmental samples. Because of the genomic variation between genera, identifying Phycodnaviridae sequences in a metagenome, which is dominated by marine phages, is difficult. Current efforts to identify Phycodnaviridae sequences depend on comparative approaches such as BLAST. Comparative approaches can identify sequences of known Phycodnaviridae with much success, but are not useful for identifying novel Phycodnaviridae. To address this challenge, I have developed a bioinformatics tool that uses a Random Forest classifier to differentiate between Phycodnaviridae and phage sequences in metagenomes. The tool was developed using differentiating genomic signatures (GC content, gene length, gene abundance, codon usage and amino acid frequency) and conserved sequences. However, only sixteen whole genomes of the Phycodnaviridae family have been sequenced so far, which makes it difficult to obtain reliable genomic signatures and conserved sequences. To overcome this low genome number and to match the metagenomic test data, Grinder (http://sourceforge.net/projects/biogrinder/), an open-source bioinformatics tool, was used to generate contigs of specified mean length and standard deviation. The dataset of genomic signatures and conserved sequences from these simulated contigs was used to train the Random Forest classifier.
The recognition tool achieved an accuracy of 98.86% on the testing dataset with no variable selection. With the significant variables selected, it achieved an accuracy of 99.20% on the testing dataset.
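To make the genomic signatures named above more concrete, the following is a minimal Python sketch of how GC content, codon usage and amino acid frequency could be computed from simulated contigs. It is an illustration under stated assumptions, not the tool described here: the function names are hypothetical, `simulate_contigs` is a simple stand-in for Grinder (it only draws substrings with normally distributed lengths), and codons are read in frame 0 of each contig rather than from predicted genes. Gene length and gene abundance are omitted because they would require gene calls.

```python
import random
from collections import Counter

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def gc_content(seq):
    """Fraction of bases in seq that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def codon_usage(seq):
    """Relative frequency of each codon, reading seq in frame 0 (assumption)."""
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    counts = Counter(c for c in codons if c in CODON_TABLE)  # skip ambiguous codons
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()} if total else {}


def amino_acid_frequency(seq):
    """Relative frequency of each amino acid in the frame-0 translation, stops excluded."""
    seq = seq.upper()
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    residues = [CODON_TABLE[c] for c in codons if c in CODON_TABLE]
    counts = Counter(r for r in residues if r != "*")
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()} if total else {}


def simulate_contigs(genome, n, mean_len, sd_len, seed=None):
    """Hypothetical stand-in for Grinder: draw n substrings of genome whose
    lengths follow a normal distribution, clamped to [1, len(genome)]."""
    rng = random.Random(seed)
    contigs = []
    for _ in range(n):
        length = int(round(rng.gauss(mean_len, sd_len)))
        length = max(1, min(length, len(genome)))
        start = rng.randint(0, len(genome) - length)
        contigs.append(genome[start:start + length])
    return contigs


def feature_vector(contig):
    """Flatten the signatures into one dict per contig, ready for a classifier."""
    features = {"gc": gc_content(contig)}
    features.update({"codon_" + c: f for c, f in codon_usage(contig).items()})
    features.update({"aa_" + a: f for a, f in amino_acid_frequency(contig).items()})
    return features
```

Each contig then yields one feature dictionary, for example `[feature_vector(c) for c in simulate_contigs(genome, 100, 1000, 200, seed=0)]`, and rows of this kind, labeled as Phycodnaviridae or phage, could be passed to a random forest implementation of the reader's choice.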