Leveraging Data Augmentation for Better Named Entity Recognition in Low-Resource Settings
Type
Master's Thesis
Abstract
This thesis addresses Named Entity Recognition (NER), a subtask of Natural Language Processing (NLP) that involves classifying entities. To address data scarcity, which is particularly critical for non-English languages such as Swedish, the study evaluates several data augmentation methods by fine-tuning the transformer-based model KB-BERT. Low-resource settings are simulated, following X. Dai and H. Adel (2020) [1], with three training sets containing 50, 150, and 500 instances, respectively. The thesis also examines whether a recently developed state-of-the-art data augmentation method can outperform other data augmentation methods in improving an NLP model. Three methods are compared: synonym replacement, mention replacement, and AugGPT, the last being the state-of-the-art method. The results show that synonym replacement was the most effective method, achieving the highest F1-score increase in all three low-resource settings. AugGPT achieved the second-highest average F1-score, while mention replacement achieved the lowest across the tested settings.
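As an illustration of one of the compared methods, mention replacement can be sketched as follows. This is a minimal sketch, not the thesis's implementation: it assumes BIO-tagged sentences and swaps each entity mention, with probability `p`, for another training-set mention of the same type, in the spirit of Dai and Adel (2020). The function names, the `p` parameter, and the `PER`/`LOC` tag names are hypothetical.

```python
import random


def extract_mentions(tokens, tags):
    """Return (start, end, entity_type) spans from BIO tags; end is exclusive."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-"):
            if start is not None:
                spans.append((start, i, etype))
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is not None and tag[2:] == etype:
            continue  # still inside the current mention
        else:
            if start is not None:
                spans.append((start, i, etype))
            start, etype = None, None
    if start is not None:
        spans.append((start, len(tags), etype))
    return spans


def build_mention_pool(dataset):
    """Collect every mention in the training set, grouped by entity type."""
    pool = {}
    for tokens, tags in dataset:
        for start, end, etype in extract_mentions(tokens, tags):
            pool.setdefault(etype, []).append(tokens[start:end])
    return pool


def mention_replacement(tokens, tags, mention_pool, p=0.5, rng=None):
    """Replace each mention with a same-type mention from the pool with probability p."""
    rng = rng or random.Random()
    new_tokens, new_tags, cursor = [], [], 0
    for start, end, etype in extract_mentions(tokens, tags):
        # Copy the non-entity tokens before this mention unchanged.
        new_tokens.extend(tokens[cursor:start])
        new_tags.extend(tags[cursor:start])
        candidates = mention_pool.get(etype, [])
        if candidates and rng.random() < p:
            mention = rng.choice(candidates)
        else:
            mention = tokens[start:end]
        # Re-tag, because the new mention may have a different length.
        new_tokens.extend(mention)
        new_tags.extend(["B-" + etype] + ["I-" + etype] * (len(mention) - 1))
        cursor = end
    new_tokens.extend(tokens[cursor:])
    new_tags.extend(tags[cursor:])
    return new_tokens, new_tags


# Toy training set (illustrative data only).
train = [
    (["Anna", "bor", "i", "Stockholm"], ["B-PER", "O", "O", "B-LOC"]),
    (["Karl", "Johansson", "besökte", "Göteborg"], ["B-PER", "I-PER", "O", "B-LOC"]),
]
pool = build_mention_pool(train)
augmented = mention_replacement(*train[0], pool, p=1.0, rng=random.Random(0))
```

Because the label sequence is rebuilt for each inserted mention, tokens and tags stay aligned even when a one-token mention is replaced by a multi-token one.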
Subject/keywords
Named Entity Recognition, Data Augmentation, Low-Resource Settings, Synonym Replacement, Mention Replacement, AugGPT