Relevant Phrase Generation for Language Learners
dc.contributor.author | Lidholm, Edvin | |
dc.contributor.author | Pinti, Davide | |
dc.contributor.department | Chalmers tekniska högskola / Institutionen för data och informationsteknik | sv |
dc.contributor.department | Chalmers University of Technology / Department of Computer Science and Engineering | en |
dc.date.accessioned | 2023-11-22T15:05:38Z | |
dc.date.available | 2023-11-22T15:05:38Z | |
dc.date.issued | 2023 | |
dc.date.submitted | 2023 | |
dc.description.abstract | In recent years, Artificial Intelligence, and Natural Language Processing (NLP) in particular, has spread widely. Most people now use it, directly or indirectly, and it can be found almost everywhere: from social networks, to image generation from text prompts, to automatic grammar checking of written text. Within NLP, text generation with large language models (LLMs) has become increasingly dominant, especially since the introduction of the Transformer in 2017. The objective of this study was to leverage these tools to generate contextually appropriate English sentences for language learners, given a specific English keyword as input. Earlier work in Intelligent Computer-Assisted Language Learning (ICALL) has produced examples for language learners with retrieval- or ranking-based models, but this is the first time generative language models have been used in this context. We claim that the model developed in this project, SpeakEasy, is useful for people learning a new language. The generated sentences have three important characteristics: (1) they are relevant in context to the keyword; (2) they are written in “simple” English suitable for language learners; (3) they always include the exact form of the keyword given in the prompt. We achieve this by first fine-tuning a GPT–2 model, as implemented by HuggingFace, on a dataset of human-written sentences from Tatoeba.org specifically tailored to language learners. A two-step decoding algorithm was then implemented. First, Keyword2Text was used for controlled generation, shifting the probability distribution toward the context around the keyword. Second, locally typical sampling was used to truncate the vocabulary to words whose information content is close to the expected information content of the language model itself, as picked up during domain adaptation.
The generated sentences are similar to the human-written examples, with a MAUVE score of 0.755 and an average cosine similarity of 0.577. Moreover, only 1.83% of the generated sentences were identical to entries in the dataset, and a sentence took 0.45 seconds to generate on average. In a qualitative human evaluation, examples generated by SpeakEasy not only beat a fine-tuned version of GPT–2 with a hybrid top-k and nucleus sampling scheme, but also outcompeted the human-written sentences, with an average rank of 2.13 on a scale from 1 (best) to 4 (worst). | |
dc.identifier.coursecode | DATX05 | |
dc.identifier.uri | http://hdl.handle.net/20.500.12380/307388 | |
dc.language.iso | eng | |
dc.setspec.uppsok | Technology | |
dc.subject | Natural Language Processing | |
dc.subject | Automatic Text Generation | |
dc.subject | Language Learning | |
dc.subject | Transformer | |
dc.subject | GPT–2 | |
dc.title | Relevant Phrase Generation for Language Learners | |
dc.type.degree | Examensarbete för masterexamen | sv |
dc.type.degree | Master's Thesis | en |
dc.type.uppsok | H | |
local.programme | Complex adaptive systems (MPCAS), MSc |
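
Illustrative note on the decoding described in the abstract: the two steps (a keyword-driven shift of the next-token distribution, followed by locally typical truncation) can be sketched roughly as below. This is a minimal toy version under stated assumptions, not the thesis implementation and not the actual Keyword2Text code. The function names, the `similarity` scores, the `strength` weight, and the `mass` threshold are all hypothetical, and the shift is a simplified stand-in that just adds a bonus to tokens deemed similar to the keyword.

```python
import math


def softmax(logits):
    """Turn a {token: logit} dict into a {token: probability} dict."""
    top = max(logits.values())
    exps = {tok: math.exp(val - top) for tok, val in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def shift_toward_keyword(logits, similarity, strength):
    """Step 1 (simplified stand-in for a keyword-guided shift).

    Adds strength * max(0, similarity) to each logit, so tokens that are
    semantically close to the keyword become more likely. The similarity
    scores are assumed to be given (e.g. embedding cosine similarities).
    """
    return {
        tok: val + strength * max(0.0, similarity.get(tok, 0.0))
        for tok, val in logits.items()
    }


def locally_typical_filter(probs, mass=0.9):
    """Step 2: locally typical truncation of the vocabulary.

    Keeps the tokens whose information content (-log p) is closest to the
    entropy of the distribution, until their cumulative probability reaches
    `mass`, then renormalises the kept probabilities.
    """
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
    ranked = sorted(
        (tok for tok, p in probs.items() if p > 0),
        key=lambda tok: abs(-math.log(probs[tok]) - entropy),
    )
    kept, total = [], 0.0
    for tok in ranked:
        kept.append(tok)
        total += probs[tok]
        if total >= mass:
            break
    return {tok: probs[tok] / total for tok in kept}


def next_token_distribution(logits, similarity, strength=2.0, mass=0.9):
    """Chain both steps: shift the logits, normalise, then truncate."""
    shifted = shift_toward_keyword(logits, similarity, strength)
    return locally_typical_filter(softmax(shifted), mass)
```

A next token would then be sampled from the dictionary returned by `next_token_distribution`. Note that the locally typical filter does not necessarily keep the single most probable token first: it ranks tokens by how close their information content is to the entropy, which is what distinguishes it from top-k or nucleus truncation.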