Clustering genomic signatures: A new distance measure for variable length Markov chains

Type
Master Thesis
Program
Computer science – algorithms, languages and logic (MPALG), MSc
Published
2018
Authors
Gustafsson, Joel
Norlander, Erik
Abstract
Pathogens such as bacteria and viruses are leading causes of disease worldwide, which makes it essential to identify them in DNA samples. Instead of analysing raw DNA sequences, mathematical models based on Variable Length Markov Chains (VLMCs), known as genomic signatures, make it possible to classify DNA samples faster than with traditional alignment-based methods. To analyse a set of genomic signatures, we use clustering, which is an unsupervised machine-learning method. Clustering VLMCs requires an accurate and fast similarity measure (distance function). To analyse distance functions and clusters, we define metrics based primarily on the taxonomic ranks of the underlying organisms. For the distance functions, we primarily analyse whether the VLMCs within the same taxonomic rank are closest to each other. For the cluster analysis, we use the silhouette metric to determine how well separated the clusters are and define the average percentages, sensitivity, and specificity of the captured taxonomic ranks. We present a new distance function for VLMCs, called Frobenius-intersection, which correlates closely with the well-known Kullback-Leibler distance function, while also being several orders of magnitude faster. We use average-link clustering together with the Frobenius-intersection distance to cluster data sets of known viruses and bacteria with relatively short DNA sequences. The clusters of VLMCs correspond accurately to the Baltimore types of the viruses as well as the viruses’ and bacteria’s taxonomic families. However, most of the classifications of viruses are also subdivided into multiple clusters. Moreover, when combining the set of bacteria and viruses, the clusters start to mix the viruses and bacteria before finding all of the taxonomic families. The clustering of the genomic signatures is accurate with respect to, for instance, taxonomic ordering. Therefore, it can help in identifying unclassified pathogens. Future research may reveal other causes of similarity between the genomic signatures.
Subject/keywords
Computer and Information Science