Relevant Phrase Generation for Language Learners
Type
Master's Thesis
Program
Complex adaptive systems (MPCAS), MSc
Published
2023
Authors
Lidholm, Edvin
Pinti, Davide
Abstract
In recent years Artificial Intelligence, and Natural Language Processing (NLP) in particular, has spread widely. Most people now use it, directly or indirectly, and it appears almost everywhere: in social networks, in generating images from text prompts, and in automatic grammar checking of written text. Within NLP, text generation with large language models (LLMs) has become increasingly dominant, especially since the introduction of the Transformer in 2017. The objective of this study was to leverage these NLP tools to generate contextually appropriate English sentences for language learners, given a specific English keyword as input. Earlier work in Intelligent Computer-Assisted Language Learning (ICALL) has produced example sentences with retrieval- or ranking-based models, but this is the first time generative language models have been used in this context. We claim that the model developed in this project, SpeakEasy, is useful for language learners. The generated sentences have three important characteristics: (1) they are contextually relevant to the keyword; (2) they are written in “simple” English suitable for language learners; (3) they always include the exact form of the keyword given in the prompt. We achieve this by first fine-tuning a GPT-2 model, as implemented by HuggingFace, on a dataset of human-written sentences from Tatoeba.org specifically tailored to language learners. On top of this model, a two-step decoding algorithm was implemented. First, following Keyword2Text, the next-token probability distribution is shifted toward the context around the keyword for controlled generation. Second, using locally typical sampling, the vocabulary is truncated to words whose information content is close to the expected information content (entropy) of the language model itself, as picked up during domain adaptation. The generated sentences are similar to the human-written examples, with a MAUVE score of 0.755 and an average cosine similarity of 0.577. Moreover, only 1.83% of the generated sentences were identical to entries in the dataset, and generating a sentence took 0.45 seconds on average. Qualitative human evaluation showed that examples generated by SpeakEasy not only beat a fine-tuned GPT-2 with a hybrid top-k and nucleus sampling scheme, but also outperformed the human-written sentences, with an average rank of 2.13 on a scale from 1 (best) to 4 (worst).
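The two-step decoding described above can be illustrated with a minimal sketch. This is not the thesis implementation: the cosine-similarity logit bonus stands in for the Keyword2Text shift, the values of `lam` and `tau` are illustrative, and the toy 4-token vocabulary with 2-dimensional embeddings is invented for the example.

```python
import numpy as np

def keyword_shift(logits, vocab_embs, keyword_emb, lam=2.0):
    """Step 1 (Keyword2Text-style, simplified): add a bonus to each
    token's logit proportional to its cosine similarity with the
    keyword embedding, shifting the distribution toward the keyword."""
    sims = vocab_embs @ keyword_emb / (
        np.linalg.norm(vocab_embs, axis=1) * np.linalg.norm(keyword_emb) + 1e-12
    )
    return logits + lam * sims

def locally_typical_filter(probs, tau=0.9):
    """Step 2 (locally typical sampling): keep the smallest set of
    tokens whose information content -log p is closest to the entropy
    of the distribution and whose cumulative mass reaches tau, then
    renormalize before sampling."""
    info = -np.log(probs + 1e-12)
    entropy = np.sum(probs * info)
    order = np.argsort(np.abs(info - entropy))  # most "typical" tokens first
    cutoff = np.searchsorted(np.cumsum(probs[order]), tau) + 1
    filtered = np.zeros_like(probs)
    keep = order[:cutoff]
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Toy example: 4-token vocabulary with 2-d embeddings; the keyword
# embedding points in the same direction as token 0.
vocab_embs = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
keyword_emb = np.array([1.0, 0.0])
logits = np.zeros(4)

shifted = keyword_shift(logits, vocab_embs, keyword_emb)
probs = locally_typical_filter(softmax(shifted), tau=0.5)
```

At each generation step one would sample the next token from `probs` instead of the raw model distribution; in practice both steps operate on the logits of the fine-tuned GPT-2 over its full vocabulary.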
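As a complement, the cosine-similarity evaluation can be sketched as follows. The pairing scheme here (each generated sentence matched to its most similar human-written reference, then averaged) is an assumption for illustration; the thesis may aggregate the similarities differently, and the toy embeddings are invented.

```python
import numpy as np

def avg_cosine_similarity(gen_embs, ref_embs):
    """Average, over generated sentences, of the cosine similarity to
    the most similar human-written reference embedding.
    NOTE: this pairing is an illustrative assumption, not necessarily
    the aggregation used in the thesis."""
    g = gen_embs / np.linalg.norm(gen_embs, axis=1, keepdims=True)
    r = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    sims = g @ r.T                    # all pairwise cosine similarities
    return float(sims.max(axis=1).mean())

# Toy sentence embeddings: one generated sentence identical in
# direction to the reference, one orthogonal to it.
gen = np.array([[1.0, 0.0], [0.0, 1.0]])
ref = np.array([[1.0, 0.0]])
score = avg_cosine_similarity(gen, ref)
```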
Subject/keywords
Natural Language Processing, Automatic Text Generation, Language Learning, Transformer, GPT-2