Predicting evolutionary distances from variable length Markov chains with deep regression

dc.contributor.author: Helmroth, Filip
dc.contributor.author: Söderpalm, Erik
dc.contributor.department: Chalmers tekniska högskola / Institutionen för data och informationsteknik [sv]
dc.contributor.department: Chalmers University of Technology / Department of Computer Science and Engineering [en]
dc.contributor.examiner: Axelson-Fisk, Marina
dc.contributor.supervisor: Schliep, Alexander
dc.date.accessioned: 2023-12-20T15:49:23Z
dc.date.available: 2023-12-20T15:49:23Z
dc.date.issued: 2023
dc.date.submitted: 2023
dc.description.abstract: Next-generation sequencing technologies have accelerated the growth of sequence data, and traditional alignment-based approaches to genome comparison are being outpaced, leading to rising interest in more efficient alignment-free comparison methods. One such method is based on Variable Length Markov Chains (VLMCs). In this thesis, we explore the application of VLMCs as genomic signatures to estimate evolutionary distances. We employ deep regression models and alignment-free VLMC distances, which are computed through a recently developed distance measure (d∗v) for VLMCs. Genomic data from over 300 species are downloaded from the National Center for Biotechnology Information database and used to train multiple deep regression models. The thesis is structured in two complementary parts. First, we develop and evaluate models for estimating evolutionary distances through VLMC distances on synthetic mutations. Second, we develop models for estimating divergence times of various species using VLMC distances and data from TimeTree, a public knowledge base derived from thousands of published studies. The results show that regression models built on carefully selected features outperform linear regression benchmarks in predicting evolutionary distances in both parts of the project and across all evaluation metrics. The thesis demonstrates the promise of VLMCs and d∗v in deep regression models for predicting evolutionary distances and highlights potential areas for future research to improve model accuracy.
dc.identifier.coursecode: DATX05
dc.identifier.uri: http://hdl.handle.net/20.500.12380/307456
dc.language.iso: eng
dc.setspec.uppsok: Technology
dc.subject: Computer science
dc.subject: Bioinformatics
dc.subject: Master’s thesis
dc.subject: Markov chains
dc.subject: Variable length Markov chains
dc.subject: Deep regression
dc.subject: Evolutionary distance
dc.subject: Alignment-free sequence analysis
dc.subject: Genomic signatures
dc.subject: DNA
dc.title: Predicting evolutionary distances from variable length Markov chains with deep regression
dc.type.degree: Examensarbete för masterexamen [sv]
dc.type.degree: Master's Thesis [en]
dc.type.uppsok: H
local.programme: Computer science – algorithms, languages and logic (MPALG), MSc

Download

Original bundle

Name: CSE 23-78 FH ES.pdf
Size: 38.28 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 2.35 KB
Format: Item-specific license agreed upon to submission