Predicting evolutionary distances from variable length Markov chains with deep regression
Type
Master's thesis
Abstract
Next-generation sequencing technologies are producing sequence data faster than traditional alignment-based genome comparison can keep up with, which has driven growing interest in more efficient alignment-free comparison methods. One such method is based on variable length Markov chains (VLMCs).
In this thesis, we explore the use of VLMCs as genomic signatures for estimating evolutionary distances. We combine deep regression models with alignment-free distances between VLMCs, computed with a recently developed distance measure (d∗v).
Genomic data from more than 300 species were downloaded from the National Center for Biotechnology Information (NCBI) database and used to train multiple deep regression models. The thesis has two complementary parts. First, we develop and evaluate models that estimate evolutionary distances from VLMC distances on sequences with synthetic mutations. Second, we develop models that estimate the divergence times of species from VLMC distances and data from TimeTree, a public knowledge base compiled from thousands of published studies.
The results show that regression models built on carefully selected features outperform linear regression benchmarks at predicting evolutionary distances in both parts of the project and across all evaluation metrics. The thesis demonstrates that VLMCs and d∗v are promising inputs to deep regression models for predicting evolutionary distances, and it highlights areas for future research that could improve model accuracy.
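To make the first part of the setup concrete, the following is a minimal sketch of a synthetic-mutation experiment with a linear benchmark, and it is not the thesis implementation. It rests on several assumptions that do not come from the thesis: a plain 6-mer frequency profile with Euclidean distance stands in for the VLMC signature and the d∗v measure, the "genome" is a random 2000-base string, mutations are substitutions only, and the regressor is a one-variable least-squares line rather than a deep model. All function names are hypothetical.

```python
import math
import random
from collections import Counter

def mutate(seq, rate, rng):
    """Substitute each base with probability `rate`, always to a different base."""
    out = []
    for base in seq:
        if rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)

def kmer_profile(seq, k=6):
    """Relative k-mer frequencies (a stand-in signature, NOT a VLMC)."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}

def profile_distance(p, q):
    """Euclidean distance between two profiles (a stand-in, NOT d*v)."""
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(key, 0.0) - q.get(key, 0.0)) ** 2 for key in keys))

def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

rng = random.Random(0)  # fixed seed so the sketch is reproducible
ancestor = "".join(rng.choice("ACGT") for _ in range(2000))
ancestor_profile = kmer_profile(ancestor)

# Known substitution rates play the role of the "true" evolutionary distance.
rates = [i / 100 for i in range(1, 31)]
dists = [
    profile_distance(ancestor_profile, kmer_profile(mutate(ancestor, r, rng)))
    for r in rates
]

# Linear benchmark: predict the substitution rate from the signature distance.
slope, intercept = fit_line(dists, rates)
preds = [slope * d + intercept for d in dists]
mae = sum(abs(p - r) for p, r in zip(preds, rates)) / len(rates)
print(f"slope={slope:.3f} intercept={intercept:.3f} MAE={mae:.4f}")
```

In this toy setting the signature distance saturates as the substitution rate grows, so a straight line fits only approximately; a nonlinear regressor can model that curvature, which is one plausible reason to compare against a linear benchmark.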
Subject/keywords
Computer science, Bioinformatics, Master’s thesis, Markov chains, Variable length Markov chains, Deep regression, Evolutionary distance, Alignment-free sequence analysis, Genomic signatures, DNA