Investigation Of Phylogenetic Relations Using Graph Data Science Algorithms
Type
Master's thesis
Program
Data science and AI (MPDSC), MSc
Published
2021
Authors
Rahavachari, Ankita
Subramanian, Guru Prakash
Abstract
Driven by the rapid growth of biological databases, a total of 2.1 million
diverse species have been catalogued in the NCBI taxonomy database on the basis of
DNA, RNA, protein, or genome sequences. This thesis presents a comprehensive
analysis of the classified taxonomic branches and nodes in the taxonomy
database using various graph algorithms. Converting the taxonomy
data into a Neo4j database yielded a super graph with 2,121,053 unique branches and
2,323,131 intermediate and end nodes in a rooted tree structure. In
contrast to the classic Linnaean system with eight major ranks (from domain to
species), 37 additional taxonomic ranks are used to describe
the complicated phylogeny of the accumulated species. Surprisingly, nearly 10% of
the taxonomic nodes carry the rank "norank" or "clade"; they remain
unclassified and await systematic assignment. In addition, a non-exhaustive investigation
of skipped taxonomic ranks revealed thousands of lineages that lack
one or more major ranks. These lineages deviate from the classic taxon hierarchy defined in
the Linnaean system, which appears to lag behind the pace of current biological
advancement and should be revisited and updated. Finally, a bioinformatic tool for
estimating the phylogenetic distance between any two given organisms was developed
and provided with a graphical interface for user exploration.
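
The abstract does not state how the distance between two organisms is computed. As a minimal sketch, assuming the distance is the number of edges on the path through the lowest common ancestor in the rooted taxonomy tree, the idea can be illustrated in Python. The `PARENT` map and the function names are hypothetical, and the lineages are a deliberately simplified toy example, not NCBI data and not the thesis's Neo4j implementation.

```python
# Toy child -> parent map (hypothetical, heavily simplified lineages).
PARENT = {
    "Homo sapiens": "Homo",
    "Homo": "Hominidae",
    "Pan troglodytes": "Pan",
    "Pan": "Hominidae",
    "Hominidae": "Primates",
    "Primates": "Mammalia",
    "Mus musculus": "Mus",
    "Mus": "Rodentia",
    "Rodentia": "Mammalia",
}


def path_to_root(node, parent):
    """Return the nodes from `node` up to the root, inclusive."""
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path


def tree_distance(a, b, parent):
    """Count the edges between a and b via their lowest common ancestor.

    Assumption: "phylogenetic distance" is taken to mean path length
    in the rooted tree; the source does not define the measure.
    """
    # Number of steps from a to each of its ancestors (a itself is 0).
    steps_from_a = {n: i for i, n in enumerate(path_to_root(a, parent))}
    # Walk up from b; the first node also above a is the common ancestor.
    for steps_from_b, n in enumerate(path_to_root(b, parent)):
        if n in steps_from_a:
            return steps_from_a[n] + steps_from_b
    raise ValueError("nodes do not share a root")


if __name__ == "__main__":
    print(tree_distance("Homo sapiens", "Pan troglodytes", PARENT))
    print(tree_distance("Homo sapiens", "Mus musculus", PARENT))
```

In this toy tree the two primates meet at Hominidae after two steps each, while the human and mouse lineages only meet at Mammalia, so the second distance is larger. In a graph database the same quantity could be obtained with a path query between the two nodes; the dictionary here only stands in for that structure.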
Subject/keywords
Data science, Neo4j, Graph Analytics, GraphXR, Neo4j browser, Cypher, NCBI, Phylogenetics