Population Based Microsatellite Genotyping
Master's thesis
Computer science – algorithms, languages and logic (MPALG), MSc
Microsatellites, also known as short tandem repeats (STRs), are short DNA sequences consisting of repeated motifs of 2 to 6 bases. The number of repeats varies between individuals, and the repeat counts occurring in a population are known as the alleles of a microsatellite. Each individual carries two copies of each chromosome and hence two alleles of each microsatellite. At least 250,000 microsatellites have a known location on a human reference genome; the most common form is the dinucleotide repeat. The range of applications for microsatellite analysis is very wide and includes, among other things, medical genetics, forensics and genetic genealogy. However, microsatellite variations are rarely considered in whole-genome sequencing studies, largely because of a lack of tools capable of analyzing them. The goal of this thesis is to create a microsatellite genotype caller which is faster and more accurate than those previously presented. To accomplish this goal, two things were examined. First, we reduced by 87% the amount of sequencing data necessary for creating microsatellite profiles from previously aligned sequencing data. This was achieved by filtering the input to contain only reads aligned to known microsatellite locations and unaligned reads, as these should be the ones useful for profiling. The results indicate that when performing microsatellite profiling on previously aligned data, it is possible to significantly reduce running time with negligible effects on the resulting profile. Second, the accuracy of the microsatellite profiler was increased from 87.5% to 96.3%. The improvements included using population information to train microsatellite-specific and individual-specific error profiles. This was done by adding parameters to the model as well as using sequencing data from multiple individuals to improve parameter estimates.
Combining these two procedures, we were able to give a practical implementation of microsatellite genotyping which is both much faster and more accurate than previously presented solutions.
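The read-filtering step described in the abstract (keep only reads aligned to known microsatellite locations, plus unaligned reads) can be illustrated with a minimal Python sketch. This is not the thesis implementation: the `Read` record, the `Loci` layout, and the function names are hypothetical, reads are plain in-memory records rather than entries of an alignment file, and the locus intervals are assumed to be sorted and non-overlapping within each chromosome.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Optional, Tuple


@dataclass(frozen=True)
class Read:
    """Hypothetical, simplified read record (not a real alignment-file entry)."""
    name: str
    chrom: Optional[str]  # None means the read is unaligned
    start: int = 0        # 0-based, inclusive
    end: int = 0          # 0-based, exclusive


# Assumed layout: per chromosome, a list of half-open (start, end) intervals,
# sorted by start and non-overlapping.
Loci = Dict[str, List[Tuple[int, int]]]


def overlaps_locus(read: Read, loci: Loci) -> bool:
    """Return True if the aligned read overlaps any known microsatellite locus."""
    intervals = loci.get(read.chrom, [])
    # Index of the first interval whose start is >= read.end; that interval and
    # all later ones begin after the read, so only the one just before it can
    # overlap (this relies on the intervals being non-overlapping).
    i = bisect_left(intervals, (read.end,))
    return i > 0 and intervals[i - 1][1] > read.start


def filter_reads(reads: Iterable[Read], loci: Loci) -> Iterator[Read]:
    """Yield unaligned reads and reads overlapping a known microsatellite locus."""
    for read in reads:
        if read.chrom is None or overlaps_locus(read, loci):
            yield read
```

Unaligned reads are kept because, as the abstract notes, they may still be useful for profiling; everything aligned elsewhere in the genome is dropped before genotyping. A practical version would probably also pad each locus with a flanking margin, which this sketch omits.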
Computer and Information Science