Leveraging Data Augmentation for Better Named Entity Recognition in Low-Resource Settings

dc.contributor.author: Björnerud, Philip
dc.contributor.department: Chalmers tekniska högskola / Institutionen för data och informationsteknik [sv]
dc.contributor.department: Chalmers University of Technology / Department of Computer Science and Engineering [en]
dc.contributor.examiner: Bernardy, Jean-Philippe
dc.contributor.supervisor: Dannélls, Dana
dc.contributor.supervisor: Kokkinakis, Dimitrios
dc.date.accessioned: 2024-02-21T09:39:17Z
dc.date.available: 2024-02-21T09:39:17Z
dc.date.issued: 2024
dc.date.submitted: 2023
dc.description.abstract: This thesis investigates challenges in Natural Language Processing (NLP), with a focus on Named Entity Recognition (NER), a subtask of NLP that involves classifying entities. To address data scarcity, which is particularly critical for non-English languages such as Swedish, the study evaluates several data augmentation methods by fine-tuning the transformer-based model KB-BERT. Low-resource settings are simulated, following the work of X. Dai and H. Adel (2020) [1], using three training sets containing 50, 150, and 500 instances respectively. The thesis also explores whether a newly developed state-of-the-art data augmentation method can outperform other data augmentation methods in enhancing an NLP model. It centers on three methods: synonym replacement, mention replacement, and AugGPT, the last being the state-of-the-art method. The findings show that synonym replacement was the most effective data augmentation method across the low-resource settings, achieving the highest F1-score increase in all scenarios. AugGPT achieved the second-highest average F1-score, while mention replacement achieved the lowest across the tested settings.
dc.identifier.coursecode: DATX05
dc.identifier.uri: http://hdl.handle.net/20.500.12380/307588
dc.language.iso: eng
dc.setspec.uppsok: Technology
dc.subject: Named Entity Recognition
dc.subject: Data Augmentation
dc.subject: Low-Resource Settings
dc.subject: Synonym Replacement
dc.subject: Mention Replacement
dc.subject: AugGPT
dc.title: Leveraging Data Augmentation for Better Named Entity Recognition in Low-Resource Settings
dc.type.degree: Examensarbete för masterexamen [sv]
dc.type.degree: Master's Thesis [en]
dc.type.uppsok: H
local.programme: Computer science – algorithms, languages and logic (MPALG), MSc
Download (original bundle): CSE 24-03 PB.pdf, 6.68 MB, Adobe Portable Document Format