Machine learning techniques for identification of commonalities and shared origin of language scripts
Bellur Keshavamurthy, Vyshak Athreya
Liu, Xiaobai; Kassegne, Samuel
Understanding unknown scripts, finding their origin, and establishing their relationship to known scripts has long intrigued linguists. Many language scripts share characters that are usually derived from other scripts. Scripts borrow from one another, sometimes differing only by a small percentage. This work takes a scientific approach to measuring the structural similarity between different language scripts, using machine learning techniques to compute a similarity percentage. First, the most significant set of features is extracted using Principal Component Analysis, Independent Component Analysis, Factor Analysis, and non-negative components. These features are then compared using k-means clustering and k-nearest neighbours. The Euclidean, Manhattan, cosine, Chebyshev, and correlation distance metrics are compared to choose the one that best mimics the human visual system. Character-wise similarity is measured with a mathematical score. Two scripts are compared by the number of characters they share with a similarity score of at least 50%. A pie chart is generated for each script indicating the percentage share. A tree structure is also derived from the distance metrics to estimate the character hierarchy both within a single script and across different scripts. About eight scripts from different parts of the world are considered. Human history has been durably recorded through writing systems, and a similarity metric found through machine learning techniques will help discover and understand many language scripts worldwide.
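The comparison step described in the abstract can be illustrated with a minimal sketch. It assumes each character has already been reduced to a flat feature vector (for example, by one of the decomposition methods named above). The function names, the use of `1 - cosine distance` as the similarity score, and the best-match counting rule are illustrative assumptions, not the thesis's actual implementation; only the five distance metrics and the 50% threshold come from the abstract.

```python
import math

def euclidean(u, v):
    # Straight-line distance between two feature vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    # Sum of absolute coordinate differences.
    return sum(abs(a - b) for a, b in zip(u, v))

def chebyshev(u, v):
    # Largest absolute coordinate difference.
    return max(abs(a - b) for a, b in zip(u, v))

def cosine(u, v):
    # Cosine distance: 1 minus the cosine of the angle between u and v.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumption: treat a zero vector as maximally dissimilar
    return 1.0 - dot / (norm_u * norm_v)

def correlation(u, v):
    # Correlation distance: cosine distance between mean-centred vectors.
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine([a - mean_u for a in u], [b - mean_v for b in v])

def shared_characters(script_a, script_b, threshold=0.5):
    # Hypothetical counting rule: a character of script_a counts as shared
    # when its best match in script_b has similarity >= threshold, with
    # similarity assumed to be 1 - cosine distance.
    count = 0
    for features_a in script_a.values():
        best = max(1.0 - cosine(features_a, features_b)
                   for features_b in script_b.values())
        if best >= threshold:
            count += 1
    return count
```

Calling `shared_characters(script_a, script_b)` with two dictionaries that map character names to feature vectors returns the number of characters in the first script that clear the threshold. Dividing that count by the size of the first script would give one possible percentage share of the kind the pie charts report.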
San Diego State University
Master of Science (M.S.) San Diego State University, 2019