DNA Sequence Classification Using Variable Length Markov Models

Publicerad

Typ

Examensarbete för masterexamen

Program

Modellbyggare

Tidskriftstitel

ISSN

Volymtitel

Utgivare

Sammanfattning

Pathogens such as viruses and bacteria are a major health concern today. To effectively treat these it is important to identify known pathogens and potential new ones from DNA samples. Modern methods are however not good enough at classifying rare, previously undocumented pathogens. This thesis explores nearest neighbor classification using variable length Markov chains (VLMC) as a possible solution. A vantage point tree is used to store the database of VLMC being queried against. This gives promising results when classifying VLMC from complete genomes or chromosomes. Multiple techniques, both greedy approximations and new lower bounds are explored. This results in order of magnitude faster classification than previous research. However the technique ultimately fails at classifying shorter DNA sequences of lengths typically found when sequencing DNA. Multiple reasons for this are given with a possible way forward if further research is deemed relevant.

Beskrivning

Ämne/nyckelord

Computer science, Bioinformatics, Master’s thesis, vantage point tree, metric space, Variable length Markov chains, Markov Models, DNA, Classification

Citation

Arkitekt (konstruktör)

Geografisk plats

Byggnad (typ)

Byggår

Modelltyp

Skala

Teknik / material

Index

item.page.endorsement

item.page.review

item.page.supplemented

item.page.referenced