Leveraging Data Augmentation for Better Named Entity Recognition in Low-Resource Settings

Björnerud, Philip

dc.contributor.author	Björnerud, Philip
dc.contributor.department	Chalmers tekniska högskola / Institutionen för data och informationsteknik	sv
dc.contributor.department	Chalmers University of Technology / Department of Computer Science and Engineering	en
dc.contributor.examiner	Bernardy, Jean-Philippe
dc.contributor.supervisor	Dannélls, Dana
dc.contributor.supervisor	Kokkinakis, Dimitrios
dc.date.accessioned	2024-02-21T09:39:17Z
dc.date.available	2024-02-21T09:39:17Z
dc.date.issued	2024
dc.date.submitted	2023
dc.description.abstract	This thesis addresses data scarcity in Named Entity Recognition (NER), a subtask of Natural Language Processing (NLP) that involves classifying entities. Data scarcity is particularly critical for non-English languages such as Swedish. The study evaluates several data augmentation methods by fine-tuning the transformer-based model KB-BERT. Low-resource settings are simulated, drawing inspiration from the work of X. Dai and H. Adel (2020) [1], using three training sets containing 50, 150, and 500 instances respectively. The thesis also explores whether a newly developed state-of-the-art data augmentation method can outperform other data augmentation methods in enhancing an NLP model. It centers on three methods: synonym replacement, mention replacement, and AugGPT, the last being the state-of-the-art method. The findings show that synonym replacement was the most effective method, achieving the highest F1-score increase in all three low-resource settings. AugGPT achieved the second-highest average F1-score, while mention replacement achieved the lowest across the tested settings.
dc.identifier.coursecode	DATX05
dc.identifier.uri	http://hdl.handle.net/20.500.12380/307588
dc.language.iso	eng
dc.setspec.uppsok	Technology
dc.subject	Named Entity Recognition
dc.subject	Data Augmentation
dc.subject	Low-Resource Settings
dc.subject	Synonym Replacement
dc.subject	Mention Replacement
dc.subject	AugGPT
dc.title	Leveraging Data Augmentation for Better Named Entity Recognition in Low-Resource Settings
dc.type.degree	Examensarbete för masterexamen	sv
dc.type.degree	Master's Thesis	en
dc.type.uppsok	H
local.programme	Computer science – algorithms, languages and logic (MPALG), MSc

Original bundle

Name: CSE 24-03 PB.pdf
Size: 6.68 MB
Format: Adobe Portable Document Format


Collections

Master's theses