Improving Algorithmic Text Moderation via Context-Based Representations of Word Semantics

Nordén, Felix

Improving Algorithmic Text Moderation via Context-Based Representations of Word Semantics

dc.contributor.author	Nordén, Felix
dc.contributor.department	Chalmers tekniska högskola / Institutionen för matematiska vetenskaper	sv
dc.contributor.examiner	Jonasson, Johan
dc.contributor.supervisor	Johansson, Fredrik
dc.date.accessioned	2021-05-04T09:35:13Z
dc.date.available	2021-05-04T09:35:13Z
dc.date.issued	2021	sv
dc.date.submitted	2020
dc.description.abstract	Abstract Reliable text moderation requires proper domain knowledge. With scaling requirements increasing as platforms of the Internet grow larger and larger, the prevalence of algorithmic text moderation has increased with the intention to alleviate, or even replace, its manual counterpart. Nonetheless, these algorithm-based solutions are harder to interpret, evaluate, and risk being biased in their decision making, resulting in more rigid and error-prone behavior when changes in context end up shifting the semantics of the text itself. To solve these shortcomings, this thesis presents an approach that learns semantic nuances within shorter pieces of text when given a related context represented by various layers of information. For this purpose, the sentence transformer architecture is employed which jointly learns embeddings of the short-form text and its context. The embeddings are used as input to a Log-loss optimized, fully-connected network to classify the appropriacy of the text. Furthermore, the thesis investigates the tradeoff between gained performance and added time- and implementation complexity for each additional layer of information. The approach is evaluated on chat data from Twitch – a live-streaming service – where the related context for each message is built up incrementally; first by introducing a layer of stream metadata and then augmenting the stream metadata by introducing a layer of related game metadata provided by IGDB – the Internet Game Database. From the results, the approach demonstrates that representing a context using both stream- and game metadata has a significant impact on the performance; yielding an F1 score of 0.37 compared to 0.18 and an AUROC score of 0.63 compared to 0.45 of the best-performing baseline. Furthermore, a linear time complexity dependence is identified on the number of sentences to embed per datapoint, causing a forward pass to take at worst 78 ms. per datapoint. With this, it is concluded that contextual information is able to improve predictive performance for algorithmic text moderation on shorter pieces of text. Additionally, exploring contextual relevance of data is easy when using sentence transformers, albeit with a linear growth in time complexity.	sv
dc.identifier.coursecode	MVEX03	sv
dc.identifier.uri	https://hdl.handle.net/20.500.12380/302340
dc.language.iso	eng	sv
dc.setspec.uppsok	PhysicsChemistryMaths
dc.subject	Natural Language Processing, Machine Learning, Deep Learning, Algorithmic Text Moderation, Sentence Transformers, Master’s Thesis	sv
dc.title	Improving Algorithmic Text Moderation via Context-Based Representations of Word Semantics	sv
dc.type.degree	Examensarbete för masterexamen	sv
dc.type.uppsok	H
local.programme	Data science and AI (MPDSC), MSc

Ladda ner

Original bundle

Visar 1 - 1 av 1

Namn:: Masters_thesis-Felix-Norden.pdf
Size:: 1.24 MB
Format:: Adobe Portable Document Format
Description:: Improving Algorithmic Text Moderation via Context-Based Representations of Word Semantics

Ladda ner

License bundle

Visar 1 - 1 av 1

Namn:: license.txt
Size:: 1.14 KB
Format:: Item-specific license agreed upon to submission
Description:

Ladda ner

Samlingar

Examensarbeten för masterexamen