Unsupervised Word-Sense Disambiguation for Product Description Texts
Published
Author
Type
Master's thesis
Program
Model builders
Journal title
ISSN
Volume title
Publisher
Abstract
As the name suggests, word-sense disambiguation is the task of determining the
correct meaning, or sense, of words that can have multiple interpretations. Textual,
a company with a product that automatically generates product description texts in
multiple languages, can make use of word-sense disambiguation to improve the quality
of their texts. This project attempts to solve that task. Word alignment is used to
define and label the senses of words as quadruples of translations in English,
Swedish, French and Spanish, which turns word-sense disambiguation into a supervised
task. Contextually similar quadruples are then merged using a permutation test and a
novel merging algorithm. In word-sense disambiguation it is natural to represent a
word together with its context as a vector in a higher-dimensional vector space. For
this, different BERT models are used, as well as the simpler Bag-of-Words and
contextual Word2Vec models. The results on 69 different
word types show an average accuracy of 91.97% compared to 58.35% for the
baseline classifier, the classifier that always predicts the most frequent sense. On
unseen data from new fashion sites, the average accuracy on 8 word types is 85.19%
compared to 56.89% for the baseline classifier.
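
To make the baseline concrete, the following is a minimal sketch of a most-frequent-sense classifier for a single word type, assuming sense labels are plain strings. The function name and the example labels are hypothetical and are not taken from the thesis; the sketch only illustrates how such a baseline accuracy can be computed.

```python
from collections import Counter


def most_frequent_sense_baseline(train_labels, test_labels):
    """Predict the most frequent training sense for every test instance.

    Returns the predicted sense and the resulting accuracy on test_labels.
    """
    # The sense seen most often in the training labels.
    majority_sense, _ = Counter(train_labels).most_common(1)[0]
    # The baseline is correct exactly when the true label is that sense.
    correct = sum(1 for label in test_labels if label == majority_sense)
    return majority_sense, correct / len(test_labels)


# Hypothetical sense labels for one ambiguous word type (illustrative only).
train = ["sense_a", "sense_a", "sense_b", "sense_a", "sense_c"]
test = ["sense_a", "sense_b", "sense_a", "sense_a"]

sense, acc = most_frequent_sense_baseline(train, test)
print(sense, acc)
```

An average over word types, as reported above, would apply this per word type and then take the mean of the per-type accuracies. Whether the thesis weights word types equally is an assumption here.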
Description
Subject/keywords
word-sense disambiguation, natural language processing, BERT, word alignment, machine learning, artificial intelligence