Investigation Of Phylogenetic Relations Using Graph Data Science Algorithms
Master's thesis
Data science and AI (MPDSC), MSc
Subramanian, Guru Prakash
Biological databases are growing rapidly, and the NCBI taxonomy database now categorizes about 2.1 million species on the basis of DNA, RNA, protein, or genome sequences. This thesis presents a comprehensive analysis of the classified taxonomic branches and nodes in that database using various graph algorithms. Converting the taxonomy data into a Neo4j database yielded a super graph with 2,121,053 unique branches and 2,323,131 intermediate and end nodes in a rooted tree structure. In contrast to the classic Linnaean system with eight major ranks (from domain to species), 37 additional taxonomic ranks have been used to describe the complicated phylogeny of the accumulated species. Surprisingly, nearly 10% of the taxonomic nodes carry the rank "norank" or "clade"; these nodes remain unclassified and await systematic assignment. In addition, an incomplete investigation of skipped taxonomic ranks revealed thousands of lineages that lack one or more major ranks. These lineages deviate from the classic taxon hierarchy defined in the Linnaean system, which appears to lag behind the pace of current biological advancement and should be revisited and updated. Finally, a bioinformatic tool for estimating the phylogenetic distance between any two given organisms was developed and provided with a graphical interface for user exploration.
Data science, Neo4j, Graph Analytics, GraphXR, Neo4j browser, Cypher, NCBI, Phylogenetics
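As an illustration of the distance estimate mentioned in the abstract, the following is a minimal sketch and not the thesis's tool, which is built on Neo4j. It assumes that the distance between two taxa is the number of edges on the path through their lowest common ancestor in the rooted tree. The `PARENT` map, the helper names, and the toy taxa are hypothetical and chosen only for this example.

```python
from typing import Dict, List, Optional


def lineage(parent: Dict[str, Optional[str]], node: str) -> List[str]:
    """Return the path from `node` up to the root, both inclusive."""
    path: List[str] = []
    current: Optional[str] = node
    while current is not None:
        path.append(current)
        current = parent[current]
    return path


def tree_distance(parent: Dict[str, Optional[str]], a: str, b: str) -> int:
    """Count the edges between `a` and `b` via their lowest common ancestor."""
    # Map every ancestor of `a` (including `a` itself) to its distance from `a`.
    steps_from_a = {name: i for i, name in enumerate(lineage(parent, a))}
    # Walk up from `b`; the first node that is also an ancestor of `a` is the LCA.
    for steps_from_b, name in enumerate(lineage(parent, b)):
        if name in steps_from_a:
            return steps_from_a[name] + steps_from_b
    raise ValueError("nodes do not share a common root")


# Hypothetical toy taxonomy: child -> parent (the root maps to None).
PARENT: Dict[str, Optional[str]] = {
    "root": None,
    "Eukaryota": "root",
    "Metazoa": "Eukaryota",
    "Fungi": "Eukaryota",
    "Homo sapiens": "Metazoa",
    "Mus musculus": "Metazoa",
    "Saccharomyces cerevisiae": "Fungi",
}

if __name__ == "__main__":
    print(tree_distance(PARENT, "Homo sapiens", "Mus musculus"))
    print(tree_distance(PARENT, "Homo sapiens", "Saccharomyces cerevisiae"))
```

In a graph database the same quantity could be obtained from a shortest-path query between the two taxon nodes. The dictionary walk above only makes the underlying idea explicit.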