Description
Genome-scale metabolic models are constructed by inferring the network of metabolic reactions present in an organism from the organism’s genome annotations. These models aim to summarize the organism’s full metabolic capacity, but they are nearly always incomplete because of missing annotations and gaps in biochemical and organismal knowledge. Missing reactions, known as gaps, must be filled by adding reactions to the draft network so that the model can produce biomass and be used to simulate cell phenotypes accurately. The most common gap-filling methods are parsimony based: they add only the minimal set of reactions needed to complete the network and enable growth in silico. Such approaches do not take evidence from the organism’s genome into account and may not produce biologically relevant gap-filling solutions.
This thesis investigated a novel gap-filling approach based on linear programming optimization that incorporates k-mer distance evidence in order to produce more accurate gap-filling solutions supported by the genome of interest. A random forest model was trained to distinguish correct from incorrect functional assignments for protein sequences using k-mer distances calculated at different values of k; this classifier achieved a prediction accuracy of 89.92%. Feature importance values for the various k-mer distances were assessed to identify the most informative k-mer distance metric for use in gap-filling. A likelihood-based gap-filling procedure was then used to gap-fill a draft model for Citrobacter sedlakii on a set of 90 minimal media conditions. Candidate missing reactions were proposed for the model, and k-mer distances were calculated between proteins in the genome of interest and a database of known proteins whose functions are associated with the functional roles linked to the missing reactions. These k-mer distances were used to estimate probabilities for the missing reactions, and a weighted linear programming formulation was used to select the reactions to add to the model to enable biomass production and growth. The final gap-filled metabolic model achieved a growth prediction accuracy of 91.1%.
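As a rough illustration of what a k-mer distance between two protein sequences can look like, the sketch below computes a distance at several values of k, giving one feature per k of the kind a classifier could consume. The abstract does not state which distance metric the thesis used, so the Jaccard distance between k-mer sets is an assumption here, and the function names `kmers`, `kmer_distance`, and `distance_features` are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq (empty if len(seq) < k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_distance(a, b, k):
    """Jaccard distance between the k-mer sets of two sequences.

    Assumption: Jaccard distance stands in for the thesis's metric,
    which the abstract does not specify.
    """
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        # Neither sequence is long enough to contain a k-mer:
        # treat the pair as maximally distant.
        return 1.0
    return 1.0 - len(ka & kb) / len(union)


def distance_features(query, reference, ks=(2, 3, 4)):
    """One distance per value of k, in the order given by ks."""
    return [kmer_distance(query, reference, k) for k in ks]
```

A distance of 0 means the two sequences share every k-mer, and 1 means they share none. Larger values of k are more specific, so closely related proteins tend to stay near 0 while unrelated ones approach 1 sooner.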