Clustering genomic signatures: A new distance measure for variable length Markov chains

dc.contributor.author: Gustafsson, Joel
dc.contributor.author: Norlander, Erik
dc.contributor.department: Chalmers tekniska högskola / Institutionen för data- och informationsteknik (Chalmers) [sv]
dc.contributor.department: Chalmers University of Technology / Department of Computer Science and Engineering (Chalmers) [en]
dc.date.accessioned: 2019-07-03T14:48:46Z
dc.date.available: 2019-07-03T14:48:46Z
dc.date.issued: 2018
dc.description.abstract: Pathogens such as bacteria and viruses are leading causes of disease worldwide, which makes it essential to identify them in DNA samples. Instead of analysing raw DNA sequences, mathematical models based on Variable Length Markov Chains (VLMCs), known as genomic signatures, make it possible to classify DNA samples faster than with traditional alignment-based methods. To analyse a set of genomic signatures, we use clustering, which is an unsupervised machine-learning method. Clustering VLMCs requires an accurate and fast similarity measure (distance function). To analyse distance functions and clusters, we define metrics based primarily on the taxonomic ranks of the underlying organisms. For the distance functions, we primarily analysed whether the VLMCs within the same taxonomic rank were closest to each other. For the cluster analysis, we use the silhouette metric to determine how well separated the clusters are and define the average percentages, sensitivity, and specificity of the captured taxonomic ranks. We present a new distance function for VLMCs, called Frobenius-intersection, which correlates accurately with the well-known Kullback-Leibler distance function while being several orders of magnitude faster. We use average-link clustering together with the Frobenius-intersection distance to cluster data sets of known viruses and bacteria with relatively short DNA sequences. The clusters of VLMCs correspond accurately to the Baltimore types of the viruses as well as the viruses’ and bacteria’s taxonomic families. However, most of the classifications of viruses are also subdivided into multiple clusters. Moreover, when the sets of bacteria and viruses are combined, the clusters start to mix viruses and bacteria before all of the taxonomic families have been found. The clustering of the genomic signatures is accurate with respect to, for instance, taxonomic ordering. Therefore, it can help in identifying unclassified pathogens. Future research may reveal other causes of similarity between the genomic signatures.
dc.identifier.uri: https://hdl.handle.net/20.500.12380/255511
dc.language.iso: eng
dc.setspec.uppsok: Technology
dc.subject: Data- och informationsvetenskap
dc.subject: Computer and Information Science
dc.title: Clustering genomic signatures: A new distance measure for variable length Markov chains
dc.type.degree: Examensarbete för masterexamen [sv]
dc.type.degree: Master Thesis [en]
dc.type.uppsok: H
local.programme: Computer science – algorithms, languages and logic (MPALG), MSc
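
The abstract describes average-link clustering over pairwise VLMC distances, scored with the silhouette metric. As a rough illustration only, the following Python sketch runs both steps on a small hand-made distance matrix. The function names `average_link` and `silhouette`, the toy matrix, and the choice of two clusters are assumptions made for this example; they are not taken from the thesis, and the matrix merely stands in for distances that a function such as Frobenius-intersection would produce.

```python
from itertools import combinations


def average_link(dist, n_clusters):
    """Agglomerative average-link clustering on a precomputed distance matrix.

    dist is a symmetric list of lists; returns clusters as lists of indices.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Average link: mean pairwise distance between the two clusters.
            total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
            d = total / (len(clusters[a]) * len(clusters[b]))
            if best is None or d < best[0]:
                best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return clusters


def silhouette(dist, clusters):
    """Mean silhouette coefficient; needs at least two clusters.

    Singleton clusters score 0 by convention.
    """
    scores = []
    for c in clusters:
        for i in c:
            if len(c) == 1:
                scores.append(0.0)
                continue
            # a: mean distance to the other members of the same cluster.
            a = sum(dist[i][j] for j in c if j != i) / (len(c) - 1)
            # b: mean distance to the nearest other cluster.
            b = min(
                sum(dist[i][j] for j in other) / len(other)
                for other in clusters
                if other is not c
            )
            denom = max(a, b)
            scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy (hypothetical) distances: items 0 and 1 are close, as are items 2 and 3.
toy_dist = [
    [0.00, 0.10, 0.90, 0.80],
    [0.10, 0.00, 0.85, 0.90],
    [0.90, 0.85, 0.00, 0.20],
    [0.80, 0.90, 0.20, 0.00],
]
toy_clusters = average_link(toy_dist, 2)
toy_score = silhouette(toy_dist, toy_clusters)
print(toy_clusters, round(toy_score, 3))
```

A silhouette value close to 1 indicates well-separated clusters. The sketch favours readability over speed: each merge step rescans all cluster pairs, so for data sets of realistic size an optimised library implementation would be the more sensible choice.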
Download
Original bundle
Name: 255511.pdf
Size: 1.71 MB
Format: Adobe Portable Document Format
Description: Fulltext