Improving Algorithmic Text Moderation via Context-Based Representations of Word Semantics

dc.contributor.author: Nordén, Felix
dc.contributor.department: Chalmers tekniska högskola / Institutionen för matematiska vetenskaper [sv]
dc.contributor.examiner: Jonasson, Johan
dc.contributor.supervisor: Johansson, Fredrik
dc.date.accessioned: 2021-05-04T09:35:13Z
dc.date.available: 2021-05-04T09:35:13Z
dc.date.issued: 2021 [sv]
dc.date.submitted: 2020
dc.description.abstract: Reliable text moderation requires proper domain knowledge. As Internet platforms grow and scaling requirements increase, algorithmic text moderation has become more prevalent, with the intention of alleviating, or even replacing, its manual counterpart. Nonetheless, these algorithm-based solutions are harder to interpret and evaluate, and they risk being biased in their decision making, resulting in more rigid and error-prone behavior when changes in context shift the semantics of the text itself. To address these shortcomings, this thesis presents an approach that learns semantic nuances within shorter pieces of text when given a related context represented by various layers of information. For this purpose, the sentence transformer architecture is employed, which jointly learns embeddings of the short-form text and its context. The embeddings are used as input to a log-loss-optimized, fully connected network that classifies the appropriateness of the text. Furthermore, the thesis investigates the tradeoff between gained performance and added time and implementation complexity for each additional layer of information. The approach is evaluated on chat data from Twitch, a live-streaming service, where the related context for each message is built up incrementally: first by introducing a layer of stream metadata, and then by augmenting it with a layer of related game metadata provided by IGDB, the Internet Game Database. The results show that representing a context using both stream and game metadata has a significant impact on performance, yielding an F1 score of 0.37 compared to 0.18 and an AUROC score of 0.63 compared to 0.45 for the best-performing baseline. Furthermore, the time complexity is found to depend linearly on the number of sentences to embed per datapoint, causing a forward pass to take at worst 78 ms per datapoint.
It is concluded that contextual information can improve predictive performance for algorithmic text moderation on shorter pieces of text. Additionally, exploring the contextual relevance of data is straightforward with sentence transformers, albeit with linear growth in time complexity. [sv]
dc.identifier.coursecode: MVEX03 [sv]
dc.identifier.uri: https://hdl.handle.net/20.500.12380/302340
dc.language.iso: eng [sv]
dc.setspec.uppsok: PhysicsChemistryMaths
dc.subject: Natural Language Processing, Machine Learning, Deep Learning, Algorithmic Text Moderation, Sentence Transformers, Master’s Thesis [sv]
dc.title: Improving Algorithmic Text Moderation via Context-Based Representations of Word Semantics [sv]
dc.type.degree: Master's thesis (Examensarbete för masterexamen) [sv]
dc.type.uppsok: H
local.programme: Data science and AI (MPDSC), MSc
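
The pipeline described in the abstract (embed the message and each context sentence, then classify the combined embeddings with a fully connected network scored by log loss) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: `toy_embed` is a hypothetical hashed bag-of-words stand-in for a sentence transformer, concatenation is only one assumed way of combining the embeddings, and the layer sizes, weights, and example strings are arbitrary.

```python
import math
import random
import zlib
from typing import List, Sequence, Tuple

DIM = 16  # toy embedding size (assumption); real sentence encoders are far larger

Layer = Tuple[List[List[float]], List[float]]  # (weight rows, biases)


def toy_embed(sentence: str, dim: int = DIM) -> List[float]:
    """Hypothetical stand-in for a sentence encoder: hashed bag of words, L2-normalised."""
    vec = [0.0] * dim
    for token in sentence.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def build_features(message: str, context: Sequence[str]) -> List[float]:
    """Embed the message and every context sentence, then concatenate.

    One encoder call per sentence, so the cost grows linearly with len(context).
    """
    features = toy_embed(message)
    for sentence in context:
        features.extend(toy_embed(sentence))
    return features


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    """Small random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(features: Sequence[float], hidden: Layer, output: Layer) -> float:
    """One ReLU hidden layer followed by a sigmoid output unit."""
    w1, b1 = hidden
    w2, b2 = output
    h = [max(0.0, sum(w * x for w, x in zip(row, features)) + b)
         for row, b in zip(w1, b1)]
    z = sum(w * x for w, x in zip(w2[0], h)) + b2[0]
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(y_true: int, p: float, eps: float = 1e-12) -> float:
    """Binary cross-entropy for a single prediction, clipped for numerical safety."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))


if __name__ == "__main__":
    # Made-up message and context layers (stream metadata, then game metadata).
    context = ["stream title: speedrun practice", "game genre: platformer"]
    feats = build_features("gg well played", context)
    rng = random.Random(0)
    hidden = init_layer(len(feats), 8, rng)
    output = init_layer(8, 1, rng)
    p = forward(feats, hidden, output)
    print(f"P(inappropriate) = {p:.3f}, loss if label is 0 = {log_loss(0, p):.3f}")
```

Because each added context layer contributes at least one more sentence to embed, the feature vector and the number of encoder calls both grow with the context, which is consistent with the linear time dependence the abstract reports.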
Download
Original bundle
Name: Masters_thesis-Felix-Norden.pdf
Size: 1.24 MB
Format: Adobe Portable Document Format
Description: Improving Algorithmic Text Moderation via Context-Based Representations of Word Semantics