Description
The invention of next-generation sequencing (NGS) technology made it possible to generate high-throughput, low-cost sequencing data, which scientists now use to address a diverse range of biological problems. Several data analysis algorithms have been developed in the last few years to best exploit NGS data, and new tools and methods have been implemented to better understand these data. This dissertation presents several novel techniques involving NGS datasets. The first, qudaich, is a novel sequence aligner that can serve as a key part of NGS data analysis. Qudaich generates pairwise local alignments of a query dataset against a database. It processes large volumes of data efficiently and is well suited to next-generation read datasets. The aligner handles both DNA and protein sequences and tries to generate the best possible alignment for each query sequence. Compared with other contemporary aligners, qudaich is more efficient in terms of execution time and accuracy. Next, I show different ways to extract useful genomic information from NGS data, which in turn point to promising directions for solving existing biological problems such as prophage prediction. Prophages are viruses that have integrated into, and replicate as part of, the bacterial genome. These genetic elements can have a tremendous impact on their hosts. Most other phage-finding tools rely mainly on homology-based approaches to prophage prediction, which limits the de novo discovery of novel prophages. This dissertation presents a novel algorithm, PhiSpy, to predict prophages in bacterial genomes. PhiSpy combines similarity-based and composition-based strategies to identify prophages. It finds 94% of the known prophages in 50 complete bacterial genomes, with a 6% false negative rate and a 0.66% false positive rate. As a result, PhiSpy predicted the largest set of prophages among the prophage-finding applications compared. Finally, this dissertation demonstrates that information theory can be applied effectively to find informative sequences, to predict the lifestyle restrictions of an organism, and to analyze how the amino acid utilization profile deviates across metabolic processes in different organisms. Together, these tools will enable the next generation of sequence analyses using next-generation sequence data.