Leveraging Data Augmentation for Better Named Entity Recognition in Low-Resource Settings
dc.contributor.author | Björnerud, Philip | |
dc.contributor.department | Chalmers tekniska högskola / Institutionen för data och informationsteknik | sv |
dc.contributor.department | Chalmers University of Technology / Department of Computer Science and Engineering | en |
dc.contributor.examiner | Bernardy, Jean-Philippe | |
dc.contributor.supervisor | Dannélls, Dana | |
dc.contributor.supervisor | Kokkinakis, Dimitrios | |
dc.date.accessioned | 2024-02-21T09:39:17Z | |
dc.date.available | 2024-02-21T09:39:17Z | |
dc.date.issued | 2024 | |
dc.date.submitted | 2023 | |
dc.description.abstract | This thesis investigates challenges in Natural Language Processing (NLP), with a focus on Named Entity Recognition (NER), a subtask of NLP that involves classifying entities. To address data scarcity, which is particularly critical for non-English languages such as Swedish, the study evaluates several data augmentation methods by fine-tuning the transformer-based model KB-BERT. Low-resource settings are simulated, drawing inspiration from the work of X. Dai and H. Adel (2020) [1], using three training sets containing 50, 150, and 500 instances, respectively. The thesis also explores whether a newly developed state-of-the-art data augmentation method can outperform other data augmentation methods in enhancing an NLP model, centering on three methods: synonym replacement, mention replacement, and AugGPT, the last being the state-of-the-art method. The findings show that synonym replacement was the most effective data augmentation method across the low-resource settings, achieving the highest F1-score increase in all scenarios. AugGPT achieved the second-highest average F1-score, while mention replacement achieved the lowest across the tested settings. | |
dc.identifier.coursecode | DATX05 | |
dc.identifier.uri | http://hdl.handle.net/20.500.12380/307588 | |
dc.language.iso | eng | |
dc.setspec.uppsok | Technology | |
dc.subject | Named Entity Recognition | |
dc.subject | Data Augmentation | |
dc.subject | Low-Resource Settings | |
dc.subject | Synonym Replacement | |
dc.subject | Mention Replacement | |
dc.subject | AugGPT | |
dc.title | Leveraging Data Augmentation for Better Named Entity Recognition in Low-Resource Settings | |
dc.type.degree | Examensarbete för masterexamen | sv |
dc.type.degree | Master's Thesis | en |
dc.type.uppsok | H | |
local.programme | Computer science – algorithms, languages and logic (MPALG), MSc |