A machine learning approach for predicting bacteria content in drinking water
Master's thesis
Data science and AI (MPDSC), MSc
The current method for determining whether drinking water is contaminated with bacteria is slow: it can take up to eight days before results are available. During this time, a significant proportion of the population may have contracted diseases from contaminated water. To help mitigate this, this thesis aimed to understand whether machine learning could be a promising method for forecasting bacteria levels and how such a model could be designed. The project was performed in association with a case company, Nocoli, a spin-out from Chalmers Ventures that wanted the potential implementation examined. A literature review covering eight case studies of how machine learning has previously been applied in the field was conducted, together with three semi-structured interviews with industry-specific stakeholders. This methodology was chosen because both an overview of the current industry situation and an assessment of machine learning applicability were required. Moreover, using a theoretical framework of machine learning algorithms for different objectives, extracted from the literature, the case studies were evaluated to find patterns that could meet the case company's demands. It was found that machine learning is promising and desired in the industry to improve current operations. The Random Forest algorithm was recommended for the initial stage because of its trade-off between accuracy and interpretability. Data on bacterial content and other factors, including weather, were intended as the data source. The recommendation included a 3:1:1 split between training, validation, and test sets as well as the use of a recursive feature selection algorithm. Additionally, a combination of error measures was recommended, including Mean Squared Error with an out-of-bag supplement to reduce overfitting.
Furthermore, although no data could be obtained to evaluate the recommended model, it was concluded that machine learning could have a positive impact on today’s approach and contribute to improved water management and safety by enabling reliable forecasts.
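As an illustration of two of the recommended ingredients, the 3:1:1 split and the Mean Squared Error measure, the following is a minimal sketch in plain Python. It is not the thesis's implementation: the function names `split_3_1_1` and `mean_squared_error`, the synthetic `samples`, and the mean-value baseline are all assumptions made for the example, and the Random Forest, recursive feature selection, and out-of-bag steps are not shown.

```python
import random


def split_3_1_1(rows, seed=0):
    """Shuffle rows and split them 3:1:1 into training, validation, and test sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n = len(shuffled)
    n_train = (3 * n) // 5  # three fifths for training
    n_val = n // 5          # one fifth for validation; the remainder is the test set
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between observed and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical data: (feature, bacteria level) pairs, invented for this sketch.
samples = [(float(i), 2.0 * i + 1.0) for i in range(100)]
train, val, test = split_3_1_1(samples)

# Assumed baseline: predict the mean training target for every validation row.
baseline = sum(y for _, y in train) / len(train)
val_mse = mean_squared_error([y for _, y in val], [baseline] * len(val))
print(len(train), len(val), len(test), val_mse)
```

A fitted model would replace the baseline prediction, with the validation set used for tuning and the test set held back for the final error estimate. The sketch shuffles rows at random; for a forecasting task a chronological split may be more appropriate, which the abstract does not specify.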
machine learning, forecasting, drinking water quality, contaminated water, drinking water treatment, Escherichia coli prediction, HPC method, Random Forest.