A machine learning approach for predicting bacteria content in drinking water
Type
Master's Thesis
Program
Data science and AI (MPDSC), MSc
Published
2023
Author
Eric, Jonsson
Abstract
The current method for determining whether drinking water is bacterially contaminated
is slow: it can take up to eight days before the results
are obtained. During this time, a significant proportion of the population may
have contracted diseases from contaminated water. As a mitigating action, this
thesis aimed to understand whether machine learning could be a promising method for
forecasting the bacteria level and how such a model could be designed. The project
was performed in association with a case company called Nocoli, a spin-out of
Chalmers Ventures that wanted the potential implementation examined. A
literature review covering eight case studies of how machine learning has
previously been applied in the field was conducted, together with three
semi-structured interviews with industry-specific stakeholders. This research
methodology was chosen because both an overview of the current industry situation
and an assessment of machine learning applicability were required. Moreover, using
an extracted theory of which machine learning algorithms suit different
objectives, the case studies were evaluated to
find patterns that could meet the case company's demands.
It was found that machine learning is promising and desired in the industry as a way to
improve current operations. The Random Forest algorithm was recommended for
the initial stage because of its trade-off between accuracy and interpretability. Data
on bacterial content and other factors, including weather, was the intended data
source. The recommendation included a 3:1:1 split between training, validation,
and test sets as well as a recursive feature selection algorithm. Additionally,
a combination of error measures was recommended, including Mean Squared Error
with an out-of-bag supplement to reduce overfitting. Furthermore, although no data
could be obtained to evaluate the recommended model, it was concluded that machine
learning could have a positive impact on today's approach and contribute to
improved water management and safety by enabling reliable forecasts.
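
The 3:1:1 split and the Mean Squared Error measure mentioned in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the helper names `split_3_1_1` and `mean_squared_error`, the synthetic rainfall/bacteria pairs, and the mean-value baseline are hypothetical and are not taken from the thesis, which reports that no data could be obtained. The Random Forest itself, the recursive feature selection, and the out-of-bag estimate are not shown here; they would typically come from a library.

```python
import random


def split_3_1_1(samples, seed=0):
    """Shuffle samples and split them into train/validation/test at 3:1:1."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = (3 * n) // 5  # three fifths for training
    n_val = n // 5          # one fifth for validation
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder (about one fifth) for testing
    return train, val, test


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Synthetic, hypothetical data: (rainfall_mm, bacteria_level) pairs.
rng = random.Random(42)
data = []
for _ in range(100):
    rainfall = rng.uniform(0.0, 20.0)
    bacteria = 10.0 + 2.0 * rainfall + rng.gauss(0.0, 1.0)
    data.append((rainfall, bacteria))

train, val, test = split_3_1_1(data)

# Naive baseline: always predict the mean bacteria level seen in training.
train_mean = sum(y for _, y in train) / len(train)
val_targets = [y for _, y in val]
val_mse = mean_squared_error(val_targets, [train_mean] * len(val_targets))
print(f"train={len(train)} val={len(val)} test={len(test)} baseline val MSE={val_mse:.2f}")
```

A trained model would be compared against this baseline on the validation set, with the test set held back for a single final evaluation. For time-ordered measurements, a chronological split may be preferable to the random shuffle assumed here.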
Subject/keywords
machine learning, forecasting, drinking water quality, contaminated water, drinking water treatment, Escherichia coli prediction, HPC method, Random Forest