Log Classification using NLP Techniques: Data-Driven Fault Categorization of Multimodal Logs using Natural Language Processing Techniques

Wirehed, Adam; Suhren Gustafsson, Adam

dc.contributor.author	Wirehed, Adam
dc.contributor.author	Suhren Gustafsson, Adam
dc.contributor.department	Chalmers tekniska högskola / Institutionen för matematiska vetenskaper	sv
dc.contributor.examiner	Axelson-Fisk, Marina
dc.contributor.supervisor	Jonasson, Johan
dc.date.accessioned	2021-06-08T16:32:41Z
dc.date.available	2021-06-08T16:32:41Z
dc.date.issued	2021	sv
dc.date.submitted	2020
dc.description.abstract	System logs record system states to facilitate debugging of issues and failures. At Ericsson, several logs are analyzed when faulty baseband hardware is returned from customer networks. Classifying a unit given several logs can be considered a multimodal classification problem in which each log represents one mode of the system. As systems grow in size and complexity, the resources that subject matter experts need to analyze these logs increase to a point where manual analysis is no longer efficient. Ericsson has therefore used machine learning models that rely on manually defined feature extraction patterns, chosen according to the best current understanding of which features should be used for classification. However, this manual feature engineering offers no guarantee that the chosen features are the representation of the logs most relevant to the output of the classification model. In this thesis, we show that a data-driven NLP approach, in which concatenated bag-of-words representations of each log file are fitted with an XGBoost classifier, was able to match the production model used by Ericsson. Attempts to incorporate sequential representations of the log entries and parameter lists produced by the Spell and Drain log parsers did not yield improved results. In addition, while deep learning models such as Transformers combined with neural Word2Vec embeddings were able to produce similar results, they are prohibitively complex in relation to the simpler solution. Our findings indicate that the baseband unit logs do not show the same high variability in sentence structure as human-written text, nor do the different hardware or software faults seem to depend on the sequential structure of the logs. We also propose that care be taken when treating logs like the texts found in classical NLP tasks such as sentiment analysis or document classification, where the text is written directly by humans rather than generated by automatic logging systems.
All tested models were evaluated on a holdout test dataset used by the current production model. The existing Ericsson model achieved a macro F1-score of 0.866, the XGBoost model 0.885, and the Transformer model 0.861.	sv
dc.identifier.coursecode	MVEX03	sv
dc.identifier.uri	https://hdl.handle.net/20.500.12380/302416
dc.language.iso	eng	sv
dc.setspec.uppsok	PhysicsChemistryMaths
dc.subject	NLP, log, classification, machine learning, word embedding, LSTM, transformer, XGBoost, Spell, Drain.	sv
dc.title	Log Classification using NLP Techniques: Data-Driven Fault Categorization of Multimodal Logs using Natural Language Processing Techniques	sv
dc.type.degree	Master's thesis (Examensarbete för masterexamen)	sv
dc.type.uppsok	H
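
The abstract describes concatenating one bag-of-words representation per log file into a single feature vector and evaluating models by macro F1-score. The sketch below illustrates those two ideas in plain Python under stated assumptions: the log names (`syslog`, `hwlog`), the example log texts, the labels, and all function names are hypothetical, and whitespace tokenization is an assumption rather than the thesis's preprocessing. The classifier itself (XGBoost in the thesis) is not shown; rows like these would be passed to it as input features.

```python
from collections import Counter


def build_vocab(texts):
    """Map each distinct whitespace token in `texts` to a column index."""
    tokens = sorted({tok for text in texts for tok in text.lower().split()})
    return {tok: i for i, tok in enumerate(tokens)}


def bow_vector(text, vocab):
    """Count occurrences of each vocabulary token in one log text."""
    counts = Counter(text.lower().split())
    # dicts preserve insertion order, so columns follow the sorted vocabulary
    return [counts.get(tok, 0) for tok in vocab]


def concat_bow(unit_logs, vocabs):
    """Concatenate one bag-of-words vector per log type into a single row."""
    row = []
    for log_name, vocab in vocabs.items():
        row.extend(bow_vector(unit_logs.get(log_name, ""), vocab))
    return row


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical units, each with two log "modes" (made-up contents).
units = [
    {"syslog": "link down retry", "hwlog": "temp high"},
    {"syslog": "link up", "hwlog": "temp ok temp ok"},
]
log_names = ("syslog", "hwlog")
# One vocabulary per log type, so each mode keeps its own block of columns.
vocabs = {name: build_vocab([u[name] for u in units]) for name in log_names}
features = [concat_bow(u, vocabs) for u in units]
```

Keeping a separate vocabulary per log type means the same token appearing in two different logs ends up in two different columns, which is one simple way to preserve the multimodal structure mentioned in the abstract.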

Name:: Masters_Thesis_Adam Wirehed och Adam Suhren Gustafsson 210604.pdf
Size:: 2.99 MB
Format:: Adobe Portable Document Format
Description:: Log Classification using NLP Techniques: Data-Driven Fault Categorization of Multimodal Logs using Natural Language Processing Techniques
