Predictive analysis of E. coli levels to assess water quality in the river Göta älv

dc.contributor.author: Ravishankar, Ramachandran
dc.contributor.department: Chalmers tekniska högskola / Institutionen för arkitektur och samhällsbyggnadsteknik (ACE)
dc.contributor.examiner: Bondelind, Mia
dc.contributor.supervisor: Sokolova, Ekaterina
dc.contributor.supervisor: Bondelind, Mia
dc.date.accessioned: 2021-07-26T13:59:19Z
dc.date.available: 2021-07-26T13:59:19Z
dc.date.issued: 2021
dc.date.submitted: 2020
dc.description.abstract: Water quality is one of the most important factors in a clean and hygienic environment. Sewage from the city contains harmful faecal pathogens that, when discharged into the river, may degrade the quality of the water. In this study, levels of a widely used faecal indicator, Escherichia coli (E. coli), are predicted at the Lärjeholm drinking water intake. An initial dataset was compiled from raw data points obtained from Göteborg Kretslopp och Vatten and the Swedish Meteorological and Hydrological Institute (SMHI). Data preprocessing steps, such as log10(x + 1) transformation, time indexing, removing duplicate values, filling missing values, and defining lag values, were carried out on the initial dataset. After preprocessing, the initial dataset was split into a baseline dataset and a complex dataset. The baseline dataset contains lag values of precipitation at Komperöd and Vänersborg and of water temperature at Lärjeholm to predict E. coli levels at Lärjeholm. The complex dataset extends the baseline dataset with additional features: lag values of E. coli at Garn, turbidity at Lärjeholm, and coliforms at Lärjeholm and Garn. Two linear models, multivariate adaptive regression splines (MARS) and Elasticnet regression, and one non-linear tree-based model, Extreme Gradient Boosting (XGBoost) regression, were used to predict E. coli levels. Elasticnet regression performed best, with a mean absolute error of 77 CFU/100 ml, a root mean squared error of 125 CFU/100 ml and an R² score of 0.46. MARS performed worst, with a mean absolute error of 86 CFU/100 ml, a root mean squared error of 154 CFU/100 ml and an R² score of 0.22. Although XGBoost was expected to outperform a linear model such as Elasticnet, it did not.
However, the relative error change (∆ error) for XGBoost from the baseline to the complex dataset was around 43%, the largest improvement among the three models when the new features were added. The study uses machine learning algorithms as a complement to expensive lab analysis to analyse and predict E. coli levels, so that precautionary actions can be taken if the levels exceed a certain threshold. The study can be expanded to include other faecal and physico-chemical indicators to improve the accuracy of the models. Further enhancements could include other machine learning or deep learning algorithms to predict E. coli levels.
dc.identifier.coursecode: ACEX30
dc.identifier.uri: https://hdl.handle.net/20.500.12380/303811
dc.language.iso: eng
dc.setspec.uppsok: Technology
dc.subject: E. coli
dc.subject: XGBoost
dc.subject: Elasticnet regression
dc.subject: MARS
dc.subject: predictive analysis
dc.title: Predictive analysis of E. coli levels to assess water quality in the river Göta älv
dc.type.degree: Master's thesis
dc.type.uppsok: H
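
The abstract names a log10(x + 1) transformation, lag values, and three evaluation metrics (mean absolute error, root mean squared error, R²), plus a relative error change between the baseline and complex datasets. The sketch below illustrates those calculations in plain Python. It is not taken from the thesis: all function names are hypothetical, the demo values are invented toy numbers rather than measurements, and the formula for the relative error change is an assumed definition (percentage reduction relative to the baseline error).

```python
import math


def log10p1(values):
    """Apply the log10(x + 1) transform element-wise (defined for x >= 0)."""
    return [math.log10(v + 1.0) for v in values]


def inv_log10p1(values):
    """Invert the log10(x + 1) transform to return to the original units."""
    return [10.0 ** v - 1.0 for v in values]


def lag(values, k):
    """Shift a series k steps back in time; the first k entries have no history."""
    n = len(values)
    k = min(k, n)
    return [None] * k + list(values[: n - k])


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def relative_error_change(baseline_error, complex_error):
    """Assumed definition: percentage reduction in error from baseline to complex."""
    return 100.0 * (baseline_error - complex_error) / baseline_error


if __name__ == "__main__":
    # Toy values only; not data from the thesis.
    counts = [0.0, 9.0, 99.0, 40.0]
    print(log10p1(counts))
    print(lag(counts, 1))
    print(mae(counts, [1.0, 10.0, 90.0, 45.0]))
```

Computing the metrics after inverting the transform keeps the errors in CFU/100 ml, the unit reported in the abstract; whether the thesis evaluated on the original or the transformed scale is not stated in this record.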

Download

Original bundle

Name: ACEX30 Ravishankar, Ramachandran.pdf
Size: 3.15 MB
Format: Adobe Portable Document Format
