Experimental Design for Comparative Metagenomics: Investigating and optimising the trade-off between number of samples and sequencing depth
Type
Master's thesis
Programme
Engineering mathematics and computational science (MPENM), MSc
Published
2020
Authors
Conti, Sofia
Hermanova Billstein, Martina
Abstract
In comparative metagenomics, samples from different environments are compared with the aim
to identify differentially abundant genes. It is important to have a sound experimental design in
such studies, including a sufficiently large number of samples from each environment as well as a
sufficiently high sequencing depth in each sample.
The aim of this master’s thesis was to provide guidance on the required number of samples and
sequencing depth for experimental designs in future comparative metagenomic studies. In order
to do so, various experimental designs with different numbers of samples and sequencing depths
were evaluated based on their statistical performance. For each design, a large number of artificial
datasets were created by resampling real metagenomic data. Three real datasets were used and
the analyses were conducted in R.
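
The resampling idea described above can be illustrated with a minimal sketch. It is written in Python for illustration, not in the R used for the thesis, and it is not the authors' procedure: the function names `downsample` and `resample_design`, the toy count data and the choice to draw reads without replacement are all assumptions. The sketch only builds one artificial dataset for a given design (samples per group and sequencing depth); introducing an effect and testing for differentially abundant genes are left out.

```python
import random
from collections import Counter

def downsample(gene_counts, depth, rng):
    """Draw `depth` reads without replacement from one sample's gene counts.

    Assumes depth does not exceed the total number of reads in the sample.
    """
    # Expand the counts into one entry per read, labelled by gene index.
    reads = [g for g, c in enumerate(gene_counts) for _ in range(c)]
    picked = Counter(rng.sample(reads, depth))
    return [picked.get(g, 0) for g in range(len(gene_counts))]

def resample_design(samples, n_per_group, depth, rng):
    """Build one artificial two-group dataset (hypothetical helper)."""
    # Pick distinct real samples, then reduce each to the target depth.
    chosen = rng.sample(samples, 2 * n_per_group)
    sub = [downsample(s, depth, rng) for s in chosen]
    return sub[:n_per_group], sub[n_per_group:]

# Toy data (made up): 12 samples, 20 genes, roughly 1000 reads per sample.
rng = random.Random(0)
toy = [[rng.randint(0, 100) for _ in range(20)] for _ in range(12)]

group_a, group_b = resample_design(toy, n_per_group=3, depth=200, rng=rng)
```

Repeating the call many times for each combination of `n_per_group` and `depth` would give the collection of artificial datasets on which a design's statistical performance can be estimated.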
The performances of all the investigated designs were shown to improve when the effect size of
the studied phenomenon was large as well as when the studied genes had high abundance or
low variability. It was further found that the performance of the designs increased both with
increasing sequencing depth and with increasing number of samples in each group. A sequencing
depth of ten thousand reads was generally too low to yield an acceptable performance. Likewise,
having only three samples in each group was found to be insufficient unless the studied genes had
high abundance or low variability. The main result was that the performance improved more
with increasing number of samples than with increasing sequencing depth. However, when taking
the economic aspect into account, a larger number of samples became less cost-effective due to the
high sequencing cost per sample. A final conclusion was that an experimental design may be
less extensive and use fewer samples if the effect size is large or if the studied genes have high
abundance or low variability.
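
The economic trade-off can be made concrete with a small Python sketch. The cost model is an assumption, not taken from the thesis: each sample is taken to have a fixed cost plus a cost proportional to its sequencing depth, and the prices and the function name `max_samples_per_group` are hypothetical.

```python
def max_samples_per_group(budget, depth, cost_per_sample,
                          cost_per_million_reads, n_groups=2):
    """Largest number of samples per group affordable under a fixed budget.

    Assumed cost model: every sample costs a fixed amount plus a
    depth-dependent sequencing cost.
    """
    per_sample = cost_per_sample + depth * cost_per_million_reads / 1e6
    return int(budget // (n_groups * per_sample))

# Hypothetical prices: 100 per sample, 100 per million reads, budget 10000.
designs = {
    depth: max_samples_per_group(10_000, depth, 100, 100)
    for depth in (10_000, 1_000_000, 10_000_000)
}
print(designs)  # {10000: 49, 1000000: 25, 10000000: 4}
```

Under these made-up prices, lowering the depth from one million to ten thousand reads saves almost nothing per sample, because the fixed per-sample cost dominates. That is one way a fixed per-sample cost can limit how many samples a budget allows.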
Subject/keywords
bioinformatics, performance, statistical power, economic impact, false discovery rate (FDR), effect size, gene abundance, gene variability, differentially abundant genes (DAGs), R.