Description
Translation initiation at alternative sites directed by non-AUG codons is being observed in mammalian mRNAs with increasing frequency. Recent research has shown that alternative initiation is used for translational control of important regulatory proteins, yielding protein products with distinct biological functions. The phenomenon was previously considered rare and, as a result, has not been taken into account in gene prediction or expression studies in eukaryotes. This study investigated the untranslated regions (UTRs) of mRNAs to define consensus sequence properties and structural features that provide quantitative evidence for the selection of non-AUG start sites. A bioinformatic evaluation of the 5'-UTR sequences of mammalian mRNAs was conducted to classify and identify alternative translation initiation sites (aTIS). Primary sequence and secondary structure parameters were quantified on mRNAs that have been experimentally demonstrated to use alternative non-AUG initiation sites. These parameters represent three groups of features: unique consensus sequence patterns near the initiation codon, primary sequence characteristics of the 5'-UTR, and secondary structures. The resulting metrics were used to train a supervised machine learning method known as a classification tree, or decision tree. This produced a C4.5 decision tree that was able to accurately assign mRNA sequences to one of two categories: those with a potential aTIS and those without. Most importantly, this study successfully defined the unique properties of the 5'-UTR that lead to the selection of non-AUG start sites during translation. These results are important for the development and training of gene prediction algorithms and for the subsequent identification of novel gene products required for cellular functions in health and disease.
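As a rough illustration of the kind of pipeline described above, the following Python sketch quantifies a few primary sequence features of a 5'-UTR and computes the gain ratio, the split criterion used by C4.5, for a threshold on one numeric feature. It is a minimal sketch under stated assumptions, not the method used in this study: the feature set, the list of near-cognate codons, and the function names `utr_features`, `entropy`, and `gain_ratio` are all hypothetical choices made for the example.

```python
import math
from collections import Counter

# Codons that differ from AUG by a single base. Using exactly these four
# is an illustrative assumption, not the feature set of the study.
NEAR_COGNATE = ("CUG", "GUG", "UUG", "ACG")


def utr_features(utr):
    """Return simple primary sequence features of a 5'-UTR (hypothetical set)."""
    utr = utr.upper().replace("T", "U")  # accept DNA or RNA alphabet
    length = len(utr)
    gc = (utr.count("G") + utr.count("C")) / length if length else 0.0
    # Count near-cognate codons in every overlapping 3-nucleotide window.
    near = sum(1 for i in range(length - 2) if utr[i:i + 3] in NEAR_COGNATE)
    return {"length": length, "gc_content": gc, "near_cognate_count": near}


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split `value <= threshold` (C4.5 criterion)."""
    left = [lab for v, lab in zip(values, labels) if v <= threshold]
    right = [lab for v, lab in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing carries no information
    total = len(labels)
    weights = (len(left) / total, len(right) / total)
    info_gain = (entropy(labels)
                 - weights[0] * entropy(left)
                 - weights[1] * entropy(right))
    split_info = -sum(w * math.log2(w) for w in weights)
    return info_gain / split_info
```

In a complete workflow, a feature vector of this kind would be computed for each mRNA with and without an experimentally confirmed aTIS, and a tree learner would repeatedly choose the feature and threshold with the highest gain ratio. Secondary structure features would require a separate RNA folding tool and are omitted here.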