Machine Learning for Prediction of Antibiotic Resistance
Type
Master's thesis
Program
Engineering mathematics and computational science (MPENM), MSc
Published
2019
Author
Carlsson, Emil
Abstract
This thesis investigates whether machine learning can be used to diagnose
whether a bacterium is resistant to a given antibiotic. This is done
by building a model that predicts the minimum inhibitory concentration.
Minimum inhibitory concentration is defined as the minimum dosage of a drug
needed to inhibit an infection or disease.
To do this, a labeled dataset consisting of 4964 genomes from Salmonella bacteria
with corresponding minimum inhibitory concentrations for up to 15 antibiotics was
used alongside an unlabeled dataset of Salmonella genomes taken from NCBI GenBank.
Further, because the dataset is small compared to the length of a Salmonella
genome, more than 4 000 000 nucleotides, we divided each genome into k-mers and
viewed each k-mer as a word. The genome can then be viewed as a document, and
the problem at hand becomes classifying this document with respect to antibiotic resistance.
To classify this document we took a bag-of-words approach, counting the occurrences
of each k-mer and then producing a vector of these counts for each genome.
The bag-of-words approach resulted in a loss of information about the context of
certain k-mers but made further processing feasible.
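The k-mer counting step can be illustrated with a minimal Python sketch. It assumes overlapping k-mers over the alphabet A, C, G, T and a toy value of k; the abstract does not state the k used, whether k-mers overlap, or how the vocabulary was built, so the function names and choices here are illustrative only.

```python
from collections import Counter
from itertools import product
from typing import List


def kmer_counts(genome: str, k: int) -> Counter:
    """Count every overlapping k-mer in a nucleotide string."""
    genome = genome.upper()
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def bag_of_kmers(genome: str, k: int) -> List[int]:
    """Return a fixed-length count vector over all 4**k possible k-mers."""
    # Assumption: the vocabulary is every k-mer over A, C, G, T in lexicographic order.
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = kmer_counts(genome, k)
    return [counts[kmer] for kmer in vocabulary]


# Toy example: a very short "genome" and k = 2 (hypothetical values).
toy_genome = "ACGTACGA"
vector = bag_of_kmers(toy_genome, k=2)
```

The vector has one entry per possible k-mer, so its length grows as 4**k while the order in which the k-mers occur in the genome is discarded, which is the context loss mentioned above.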
Furthermore, we considered two different machine learning models for the given
task: a standard feedforward neural network trained in a supervised setting and
a ladder network trained in a semi-supervised setting. We trained the networks
to predict the minimum inhibitory concentration for all 15 antibiotics simultaneously.
To handle missing labels in the data we constructed a customized output layer
consisting of 15 concatenated softmax layers. Given a missing label, we simply
ignored the gradient from the corresponding softmax layer. The training set was also
over-sampled using two different techniques based on bootstrapping and synthetic
minority over-sampling.
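One way to picture the handling of missing labels is a loss in which a softmax head without a label contributes nothing, so no gradient flows from it. The sketch below is an assumption about how this could be done for a single sample in plain Python; the thesis may sum rather than average over heads, and the names `log_softmax` and `masked_multi_softmax_loss` are hypothetical.

```python
import math
from typing import List, Optional


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax of one head's logits."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_norm for v in logits]


def masked_multi_softmax_loss(
    head_logits: List[List[float]], labels: List[Optional[int]]
) -> float:
    """Mean cross-entropy over the heads whose label is observed.

    head_logits[j] holds the logits of the softmax head for antibiotic j;
    labels[j] is the class index of the measured concentration, or None if missing.
    """
    # Heads with a missing label are skipped, so they add no loss and no gradient.
    losses = [-log_softmax(z)[y] for z, y in zip(head_logits, labels) if y is not None]
    return sum(losses) / len(losses) if losses else 0.0


# Toy example with three heads, the second of which has no label.
head_logits = [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3], [0.0, 0.0, 0.0]]
labels = [0, None, 2]
loss = masked_multi_softmax_loss(head_logits, labels)
```

Because the unlabeled head is excluded from the sum, changing its logits leaves the loss unchanged, which is equivalent to ignoring its gradient during training.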
Moreover, hyperparameter tuning using the Tree-structured Parzen
Estimator showed that semi-supervised learning did not improve the accuracy and that a
standard feedforward neural network had the best accuracy when predicting
the exact minimum inhibitory concentration. Our feedforward neural network
was then compared to a baseline model, which was based on the distribution of labels
in the dataset, and to an existing machine learning model trained on the
same dataset.
Our feedforward neural network outperformed both of these models
at predicting minimum inhibitory concentrations. The average
accuracy for predicting the exact minimum inhibitory concentration was 0.78, and
when the result was translated to the labels sensitive, intermediate and resistant,
the model achieved an average accuracy of 0.97.
In addition, we evaluated our model with respect to the error rates defined by
the National Antimicrobial Resistance Monitoring System, and the error rates were
found not to be low enough for use in a clinical setting. We think that this is due to a
combination of the limitations of the bag-of-words approach and the lack of data.
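To make the translation from a concentration to the labels sensitive, intermediate and resistant concrete, the sketch below maps a value to a category using two breakpoints and then computes two error rates. The breakpoint values are hypothetical and do not belong to any real antibiotic, and the error definitions (a very major error is a truly resistant isolate predicted sensitive, a major error is a truly sensitive isolate predicted resistant) follow a common convention that is assumed here, not taken from the thesis.

```python
from typing import List, Tuple


def sir_category(mic: float, susceptible_max: float, resistant_min: float) -> str:
    """Map a concentration to S (sensitive), I (intermediate) or R (resistant)."""
    if mic <= susceptible_max:
        return "S"
    if mic >= resistant_min:
        return "R"
    return "I"


def error_rates(
    true_mics: List[float],
    predicted_mics: List[float],
    susceptible_max: float,
    resistant_min: float,
) -> Tuple[float, float]:
    """Return (very major error rate, major error rate) under the assumed definitions."""
    true_cat = [sir_category(m, susceptible_max, resistant_min) for m in true_mics]
    pred_cat = [sir_category(m, susceptible_max, resistant_min) for m in predicted_mics]
    n_resistant = true_cat.count("R")
    n_susceptible = true_cat.count("S")
    very_major = sum(t == "R" and p == "S" for t, p in zip(true_cat, pred_cat))
    major = sum(t == "S" and p == "R" for t, p in zip(true_cat, pred_cat))
    vme_rate = very_major / n_resistant if n_resistant else 0.0
    me_rate = major / n_susceptible if n_susceptible else 0.0
    return vme_rate, me_rate


# Hypothetical breakpoints and toy measurements, for illustration only.
vme, me = error_rates([2, 16, 64, 64], [2, 32, 4, 64], susceptible_max=8, resistant_min=32)
```

A model can score well on category accuracy and still miss clinical thresholds, because these rates are computed only over the truly resistant or truly sensitive isolates.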
Nevertheless, from this work we can conclude that machine learning is an interesting
and promising approach to autonomous prediction of minimum inhibitory
concentration and diagnosis of antibiotic resistance. However, several problems, such as
the interpretability of the models and skewness in the datasets, are yet to be solved
before a machine learning model can be used in a clinical setting for this purpose.
We end this thesis with a discussion of future work that could solve many
of the problems encountered throughout this thesis.
Subject/keywords
Machine Learning, Salmonella, Antibiotic Resistance, Minimum Inhibitory Concentration, Neural Network, Ladder Network, Bayesian Optimization