Machine Learning for Prediction of Antibiotic Resistance
Master's thesis
Engineering mathematics and computational science (MPENM), MSc
This thesis investigates whether machine learning can be used to diagnose whether a bacterium is resistant to a given antibiotic. This is done by building a model that predicts minimum inhibitory concentration, defined as the minimum dosage of a drug needed to inhibit an infection or disease. A labeled dataset of 4964 Salmonella genomes, with corresponding minimum inhibitory concentrations for up to 15 antibiotics, was used alongside an unlabeled dataset of Salmonella genomes taken from NCBI GenBank. Because the dataset is small compared to the length of a Salmonella genome (more than 4 000 000 nucleotides), we divided each genome into k-mers and viewed each k-mer as a word. The genome can then be viewed as a document, and the problem becomes one of classifying this document with respect to antibiotic resistance. To classify the document we took a bag-of-words approach: we counted the occurrences of each k-mer and produced a count vector for each genome. The bag-of-words approach loses information about the context in which a k-mer appears, but it made further processing feasible. We considered two machine learning models for the task: a standard feedforward neural network trained in a supervised setting and a ladder network trained in a semi-supervised setting. We trained the networks to predict the minimum inhibitory concentration for all 15 antibiotics simultaneously. To handle missing labels in the data we constructed a customized output layer consisting of 15 concatenated softmax layers. When a label was missing, we simply ignored the gradient from the corresponding softmax layer. The training set was also over-sampled using two different techniques, based on bootstrapping and on synthetic minority over-sampling.
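As a minimal sketch of the k-mer bag-of-words representation described above, the following Python snippet counts overlapping k-mers in a nucleotide string and maps the counts onto a fixed vocabulary. The function names, the stride of one nucleotide, the vocabulary of all 4^k strings over A, C, G, T, and the toy sequence are assumptions made for illustration; the abstract does not state the value of k or the exact counting scheme used in the thesis.

```python
from collections import Counter
from itertools import product


def kmer_counts(genome: str, k: int) -> Counter:
    """Count overlapping k-mers (stride 1, an assumption) in a nucleotide string."""
    genome = genome.upper()
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def bag_of_kmers(genome: str, k: int) -> list:
    """Return a count vector over all 4**k possible k-mers in lexicographic order."""
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = kmer_counts(genome, k)
    # k-mers absent from the genome get a count of 0 (Counter returns 0 for missing keys).
    return [counts[kmer] for kmer in vocabulary]


# Toy example with a hypothetical 8-nucleotide "genome" and k = 2.
toy_genome = "ACGTACGA"
vector = bag_of_kmers(toy_genome, k=2)
print(vector)
```

The resulting vector has a fixed length regardless of genome length, which is what makes it usable as network input; the order in which the k-mers occurred is discarded, which is the context loss the abstract mentions.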
Hyperparameter tuning with the Tree-structured Parzen Estimator showed that semi-supervised learning did not improve accuracy; the standard feedforward neural network had the best accuracy for predicting the exact minimum inhibitory concentration. Our feedforward neural network was then compared to a baseline model, based on the distribution of labels in the dataset, and to an existing machine learning model trained on the same dataset. Our feedforward neural network outperformed both of these models at predicting minimum inhibitory concentrations. The average accuracy for predicting the exact minimum inhibitory concentration was 0.78, and when the result was translated into the labels sensitive, intermediate and resistant, the model reached an average accuracy of 0.97. In addition, we evaluated our model with respect to the error rates defined by the National Antimicrobial Resistance Monitoring System, and the error rates were found not to be low enough for use in a clinical setting. We think this is due to a combination of the limitations of the bag-of-words approach and the lack of data. Nevertheless, we conclude from this work that machine learning is an interesting and promising approach to autonomous prediction of minimum inhibitory concentration and diagnosis of antibiotic resistance. However, several problems, such as the interpretability of the models and the skewness of the datasets, remain to be solved before a machine learning model can be used in a clinical setting for this purpose. We end the thesis with a discussion of future work that could address many of the problems encountered.
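The handling of missing labels in the multi-antibiotic output layer, as described in the abstract, can be illustrated with a small pure-Python sketch. Each antibiotic gets its own softmax head, and a head whose label is missing contributes neither loss nor gradient. The function names, the use of `None` to mark a missing label, the summed (rather than averaged) cross-entropy, and the three toy heads are assumptions for illustration, not the thesis implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def masked_multihead_loss(logits_per_head, labels):
    """Cross-entropy summed over heads that have a label.

    logits_per_head: one list of logits per antibiotic (one softmax head each).
    labels: class index per head, or None when the label is missing (assumed encoding).
    Returns (loss, gradients w.r.t. the logits, number of labeled heads).
    """
    loss = 0.0
    grads = []
    n_labeled = 0
    for logits, label in zip(logits_per_head, labels):
        if label is None:
            # Missing label: this head is ignored, so its gradient is zero.
            grads.append([0.0] * len(logits))
            continue
        probs = softmax(logits)
        loss += -math.log(probs[label])
        # Gradient of softmax cross-entropy w.r.t. logits: probs minus one-hot target.
        grads.append([p - (1.0 if j == label else 0.0) for j, p in enumerate(probs)])
        n_labeled += 1
    return loss, grads, n_labeled


# Toy example: three heads instead of 15; the second label is missing.
toy_logits = [[0.0, 0.0], [1.0, 2.0, 3.0], [0.5, -0.5]]
toy_labels = [0, None, 1]
loss, grads, n_labeled = masked_multihead_loss(toy_logits, toy_labels)
```

Because the gradient for an unlabeled head is exactly zero, a genome with measurements for only some antibiotics still updates the shared layers through the heads that do have labels.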
Machine Learning, Salmonella, Antibiotic Resistance, Minimum Inhibitory Concentration, Neural Network, Ladder Network, Bayesian Optimization