Relevant Phrase Generation for Language Learners Master’s thesis in Computer science and engineering - Data Science & AI EDVIN LIDHOLM DAVIDE PINTI Department of Computer Science and Engineering CHALMERS UNIVERSITY OF TECHNOLOGY UNIVERSITY OF GOTHENBURG Gothenburg, Sweden 2023 Master’s thesis 2023 Relevant Phrase Generation for Language Learners SpeakEasy: A language model that makes it easy for students to practice their language skills by generating example sentences EDVIN LIDHOLM DAVIDE PINTI Department of Computer Science and Engineering Chalmers University of Technology University of Gothenburg Gothenburg, Sweden 2023 Relevant Phrase Generation for Language Learners SpeakEasy: A language model that makes it easy for students to practice their language skills by generating example sentences EDVIN LIDHOLM DAVIDE PINTI © EDVIN LIDHOLM, 2023. © DAVIDE PINTI, 2023. Supervisor: Richard Johansson, Department of Computer Science and Engineering Examiner: Moa Johansson, Department of Computer Science and Engineering Master’s Thesis 2023 Department of Computer Science and Engineering Chalmers University of Technology and University of Gothenburg SE-412 96 Gothenburg Telephone +46 31 772 1000 Cover: Image generated by Canva.com based on a description of our project as image prompt. Typeset in LATEX Gothenburg, Sweden 2023 iv https://www.canva.com/ Relevant Phrase Generation for Language Learners SpeakEasy: A language model that makes it easy for students to practice their language skills by generating example sentences EDVIN LIDHOLM DAVIDE PINTI Department of Computer Science and Engineering Chalmers University of Technology and University of Gothenburg Abstract In recent times Artificial Intelligence, Natural Language Processing (NLP) especially, has spread widely. Nowadays, most people use it, either directly or indirectly, and you can find it almost everywhere: from social networks, to generating images based on text prompts, to the automatic grammar checker of written text. In the field of NLP, the generation of text through large language models (LLMs) is becoming more and more dominant, especially since the release of the Transformer in 2017. The objective of this study was to leverage the powerful tools of NLP to generate con- textually appropriate English sentences for language learners, when given a specific English keyword as input. Other papers in the field of Intelligent Computer-Assisted Language Learning (ICALL) have generated examples for language learners by ei- ther retrieval- or ranking-based models, but this is the first time generative language models have been used in this context. We claim that our model developed in this project, SpeakEasy, is useful for language learners that are trying to learn a new language. The generated sentences have three important characteristics: (1) They are relevant in context to the keyword; (2) They are in “simple” English suitable for language learners; (3) They always include the exact form of the keyword given in the prompt. We achieve this by first fine-tuning a GPT–2 model implemented by HuggingFace to a dataset of human-written sentences specifically tailored to lan- guage learners from Tatoeba.org. A decoding algorithm consisting of two steps was implemented. Initially, a shift to the probability distribution the context around the keyword was applied using Keyword2Text for controlled generation. 
Subsequently, the vocabulary is truncated to only include words that have an information content close to the language model itself, picked up during domain adaptation, using lo- cally typical sampling. The generated sentences are similar to the human-written examples with a MAUVE score of 0.755 and an average cosine similarity of 0.577. Moreover, only 1.83% of sentences generated were identical to entries in the dataset and a sentence took on average 0.45 seconds to be generated. Qualitative human evaluation showed that examples generated by SpeakEasy did not only beat a fine- tuned version of GPT–2 with hybrid top-k and nucleus sampling scheme, but also out competing the human-written sentences with an average rank of 2.13 on a scale from 1 (best) to 4 (worst). Keywords: Natural Language Processing, Automatic Text Generation, Language Learning, Transformer, GPT–2. v https://tatoeba.org/ Acknowledgements We would like to express our sincere gratitude to our supervisor Richard Johansson, from the Department of Computer Science and Engineering at Chalmers University of Technology, who has provided guidance, support, and valuable insights through- out the project. Their expertise and constructive feedback have been invaluable in shaping our research. We also extend our appreciation to our examiner, Moa Johansson, for their time and effort in evaluating our work and providing valuable suggestions for improvement. Finally, a big thanks goes out to all our family mem- bers and friends who have supported us during our studies. Edvin Lidholm & Davide Pinti, Gothenburg, 2023-06-12 In loving memory of Anna-Lena Lidholm, 1968–2021. ♡ vii List of Acronyms Below is the list of acronyms that have been used throughout this thesis listed in alphabetical order: ANN Artificial Neural Network ATG Automatic Text Generation BERT Bidirectional Encoder Representations from Transformers CBoW Continuous Bag-of-Words BLEU BiLingual Evaluation Understudy DL Deep Learning GloVe Global Vectors for Word Representation GPT Generative Pretrained Transformer GPU Graphics Processing Unit ICALL Intelligent Computer-Assisted Language Learning LM Language Model LSTM Long Short-Term Memory ML Machine Learning MLM Masked Language Modeling NLG Natural Language Generation NLP Natural Language Processing NLU Natural Language Understanding NSP Next Sequence Prediction PE Positional Encoding PPL Perplexity ReLU Rectified Linear Unit RL ROUGE–L RN ROUGE–N RNN Recurrent Neural Network ROUGE Recall-Oriented Understudy for Gisting Evaluation SOTA State-Of-The-Art ix Contents List of Acronyms ix List of Figures xv List of Tables xvii 1 Introduction 1 1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Aim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Specification of Issue Under Investigation . . . . . . . . . . . . . . . . 2 1.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.5 Delimitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.6 Structure of the Thesis Report . . . . . . . . . . . . . . . . . . . . . . 4 2 Theory 5 2.1 Research Area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.1.1 Natural Language Processing . . . . . . . . . . . . . . . . . . 5 2.1.2 Automatic Text Generation . . . . . . . . . . . . . . . . . . . 6 2.2 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2.2 RNN and LSTM . 
. . . . . . . . . . . . . . . . . . . . . . . . 8 2.2.3 Transformer Architecture . . . . . . . . . . . . . . . . . . . . . 9 2.2.3.1 Encoder . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2.3.2 Decoder . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.2.3.3 HuggingFace . . . . . . . . . . . . . . . . . . . . . . 12 2.3 Word Tokenization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.3.1 Word Embeddings . . . . . . . . . . . . . . . . . . . . . . . . 14 2.3.1.1 Word2Vec . . . . . . . . . . . . . . . . . . . . . . . . 14 2.3.1.2 GloVe . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.3.1.3 FastText . . . . . . . . . . . . . . . . . . . . . . . . . 16 2.4 BERT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.4.1 Sentence-BERT . . . . . . . . . . . . . . . . . . . . . . . . . . 17 2.4.1.1 Siamese and Triplet Network Architecture . . . . . . 18 2.4.2 KeyBERT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.5 Autoregressive Language Models . . . . . . . . . . . . . . . . . . . . . 18 2.5.1 Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . 20 xi Contents 2.5.2 GPT–2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 2.5.2.1 Background and Description . . . . . . . . . . . . . . 20 2.6 Decoding Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.6.1 Standard Approaches to Decoding . . . . . . . . . . . . . . . . 21 2.6.1.1 Maximization-Based Decoding . . . . . . . . . . . . . 21 2.6.1.2 Random Sampling . . . . . . . . . . . . . . . . . . . 23 2.6.2 Domain-Relevant Decoding Algorithms . . . . . . . . . . . . . 25 2.6.2.1 Keyword2Text Decoding . . . . . . . . . . . . . . . . 25 2.6.2.2 Typical Sampling . . . . . . . . . . . . . . . . . . . . 26 2.7 Evaluation Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.7.1 Cosine Similarity . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.7.1.1 Cosine Similarity for SBERT Sentence Embeddings . 27 2.7.2 Perplexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 2.7.3 MAUVE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3 Methods 31 3.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.1.1 Dataset Description . . . . . . . . . . . . . . . . . . . . . . . . 31 3.1.2 Pre-Processing of Data . . . . . . . . . . . . . . . . . . . . . . 31 3.2 Model Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.2.1 Fine-Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.2.2 Decoding Algorithms . . . . . . . . . . . . . . . . . . . . . . . 35 3.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 3.3.1 Evaluating Fine-Tuning . . . . . . . . . . . . . . . . . . . . . 36 3.3.2 Evaluating Generated Text . . . . . . . . . . . . . . . . . . . . 37 3.3.2.1 Human Evaluation . . . . . . . . . . . . . . . . . . . 37 4 Results and Analysis 41 4.1 Fine-Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.2 Example SpeakEasy Outputs . . . . . . . . . . . . . . . . . . . . . . 42 4.3 Evaluation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4.3.1 Automatic Evaluation . . . . . . . . . . . . . . . . . . . . . . 42 4.3.2 Human Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 45 4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 5 Discussion 49 5.1 Potential Errors in Methods and Results . . . . . . . . . . . . . . . . 49 5.1.1 Dataset . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . 49 5.1.2 Model Implementation . . . . . . . . . . . . . . . . . . . . . . 50 5.1.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 5.1.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 5.2 Ethical Considerations of the Project . . . . . . . . . . . . . . . . . . 52 5.3 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 6 Conclusion 55 6.1 Project summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56 xii Contents Bibliography 57 A Appendix 1 I A.1 Removed Sentences . . . . . . . . . . . . . . . . . . . . . . . . . . . . I xiii Contents xiv List of Figures 2.1 Schematic representation of a single neuron in an ANN . . . . . . . . 7 2.2 A multi-layer perceptron neural network with one hidden layer of five neurons. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.3 a) shows a recurrent neural network and b) shows the same RNN unfolded in time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.4 Schematic representation of the LSTM architecture. . . . . . . . . . . 9 2.5 The Transformer model architecture according to the paper by Vaswani et al. (2017) [31]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.6 Structure of the Word2Vec network, where W and S are V × E and E × V matrices, respectively. . . . . . . . . . . . . . . . . . . . . . . 15 2.7 Weighting function of the GloVe word embedding model for various values of α, but with fixed xmax. . . . . . . . . . . . . . . . . . . . . . 16 2.8 Illustration of autoregressive text generation from a language model, where the input to the model at time step t is the sequence of previ- ously generated tokens. . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.9 Example of teacher forcing where the gold-standard reference sentence used is “The lion hunts...”. . . . . . . . . . . . . . . . . . . . . . . . . 20 2.10 Probability distribution over a vocabulary consisting of five words given a starting prompt, X. . . . . . . . . . . . . . . . . . . . . . . . 22 2.11 Example of a sequence generation where a greedy decoding strat- egy yields a suboptimal output. The “Beginning of Sequence”-token, , tells the model to start generating text. . . . . . . . . . . . . 22 2.12 Illustration of how different values of temperature, T , effects the prob- ability distribution from the example in Figure 2.10 . . . . . . . . . . 25 2.13 Illustration of the pipeline for the text comparison metric. . . . . . . 28 3.1 Schema of the pre-processing pipeline for our dataset. . . . . . . . . . 32 3.2 Histogram showing distribution of sentence length (in terms of num- ber of words) of dataset before and after truncation. . . . . . . . . . . 34 3.3 Illustration of the step-by-step effect our proposed decoding algorithm has on the probability distribution before sampling from the language model, where the x-axis are tokens in the vocabulary and the y-axis is the sampling probability distribution. The first step is a shift to the original distribution by the K2T method, and the second step represents the truncation of the vocabulary by the locally typical sampling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 xv List of Figures 3.4 Example of a question in the human evaluation questionnaire. . . . . 
39 4.1 Plot of training- and validation loss during fine-tuning of the GPT–2 model for three epochs. . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.2 Distribution of perplexity scores for all four example sentence sources, where the count in each bin is normalized to a percentage of the total number of sentences. . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.3 Histograms for the individual sentence sources from the evaluation survey. Sample means are marked by dashed black lines. . . . . . . . 46 4.4 Bar graph indicating how the different ranks were distributed between the example sentence sources. . . . . . . . . . . . . . . . . . . . . . . 46 4.5 Box plot showing the spread of the average rank per question over all respondents for every example source. . . . . . . . . . . . . . . . . . . 47 xvi List of Tables 2.1 Character sub-N-grams in the FastText representation of the word dogs with 3 ≤ N ≤ 6. . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.1 Examples of keywords extracted from sentences by KeyBERT [55]. . . 33 3.2 Average number of words per sentence (µ) and standard deviation (σ) of the dataset before and after truncation. . . . . . . . . . . . . . . . 34 3.3 Fine-tuning parameters for model. . . . . . . . . . . . . . . . . . . . . 35 3.4 Hyperparameters used for K2T probability boosting of keyword and semantically similar words as well as locally typical sampling. . . . . 36 3.5 Parameters used for evaluation during fine-tuning of model. . . . . . 36 4.1 Examples of sentences generated by SpeakEasy. . . . . . . . . . . . . 42 4.2 Automatic quality metrics for the different model stages throughout the project, as described in Sections 2.7 and 3.3.2. . . . . . . . . . . . 43 4.3 Percentages of sampled sentences and successfully generated keywords from the examples generated by the autoregressive Transformer lan- guage models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.4 Minimium, median and maximum perplexity scores for the Tatoeba dataset, fine-tuned GPT–2, K2T-controlled generation and SpeakEasy. 44 4.5 Mean rank afforded to each type of example sentence during human evaluation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 A.1 Sentences manually removed from dataset. . . . . . . . . . . . . . . . I xvii List of Tables xviii 1 Introduction This chapter provides an introduction to the research subject. First, the context for the project is presented in Section 1.1, followed by a description of the aim of the study in Section 1.2. Section 1.3 outlines the the research questions that our study aims to address. An overview of the related research in this field can be found in Section 1.4, and the delimitations of the project in Section 1.5. 1.1 Background In today’s era, the pursuit of learning new languages has become widespread due to the increasing demands of travel, work, and personal interests. Consequently, the methods of language acquisition have also evolved. The advent of advanced technology has brought about a revolution, enabling easy access to the internet and a vast array of resources at our fingertips. This progress has led to the develop- ment of language learning applications such as Duolingo, Babbel, and others. Here, the field of natural language processing (NLP) comes into play as a crucial compo- nent, addressing the need for tools that can generate words and sentences to assist students. 
The domain of language learning holds immense significance and is rapidly expand- ing, with its applications spanning across education, business, and personal devel- opment [1]. Among the primary challenges in language acquisition is the ability to comprehend and produce relevant and coherent language within various contexts. Leveraging NLP techniques has the potential to transform the way we teach and learn languages by automating the generation of meaningful phrases and sentences for learners to practice with. This paradigm shift can enhance both the efficiency and effectiveness of language learning endeavours. By employing a large language model, we aim to create an exemplary sentence gen- eration system for language learners, surpassing existing retrieval or ranking-based methods, mentioned in Section 1.4. One reason to use NLP models to generate sentences exemplifying the use of a word is that the more advanced methods can grasp intricate grammar rules, idiomatic expressions, and cultural nuances, creat- ing diverse and authentic sentences that closely resemble human language. This adaptability allows learners to encounter a wide range of sentence structures and vocabulary. Furthermore, Machine Learning (ML) models are highly scalable. They efficiently handle large volumes of data, enabling them to learn from extensive lan- 1 1. Introduction guage corpora and generate a plethora of example sentences. This scalability ensures access to diverse and relevant examples. Lastly, ML models facilitate continuous im- provement. They can be trained and fine-tuned using user feedback, allowing them to adapt and enhance their sentence generation capabilities over time. By incorpo- rating user preferences, corrections, and linguistic insights, the models become more accurate and better aligned with learners’ specific needs and preferences. 1.2 Aim The aim of this project is to achieve a method of generating English phrases of suitable level for language learners correlated with a specific keyword. A keyword is a single word working as an input for generating the sentences with context around it. Furthermore, these sentences should have some level of “quality”, meaning they are not just trivial phrases, but with information relevant to the keyword. To illustrate this one could think of an example where a student wants to practice using the word “bicycle”. In this case a sentence about riding bicycles, or falling while riding, like “I learned how to ride a bicycle when I was 6 years old.”, would be useful, but “It is a bicycle.” would be quite uninteresting. 1.3 Specification of Issue Under Investigation The following four research questions will be examined: • Can an NLP-based system be trained to generate contextually appropriate English phrases and sentences for language learners, when given a specific English keyword as input? • How can a decoding algorithm be developed to guide the text generation towards suitable phrases and sentences for said learners? • How can the generated phrases and sentences be evaluated for “quality”, in terms of their relevance and usefulness when learning a new language? • Can the proposed method of generating phrases for language learners be ap- plied in a novel way, beyond what has been previously considered in existing research? 
1.4 Related Work

While we are not aware of any previous work that used language models to generate examples from scratch, several projects in the area of intelligent computer-assisted language learning (ICALL) have investigated the use of retrieval from large corpora for finding illustrative examples for purposes such as lexicon development, vocabulary exemplification for learners, and exercise generation. Pilán et al. (2016) [2] give an overview of work in this area.

Early work includes the GDEX algorithm [3], which retrieves examples to illustrate word use for lexicographers using a rule-based approach. Firstly, the GDEX algorithm prioritizes sentences within the range of 10 to 25 words, penalizing those that are longer or shorter. Moreover, it penalizes sentences containing uncommon or rare words, meaning words that are not among the 17,000 most frequently used English words. Sentences with pronouns and anaphors lacking self-contained meaning (e.g., this, that, it, or one) receive penalties due to their need for additional context. Furthermore, preferred sentences begin with a capital letter and conclude with a full stop, exclamation mark, or question mark. The authors note that effective examples often introduce contextual information and position the keyword towards the sentence's end, enabling users to infer its meaning.

In ICALL specifically, Pilán et al. (2013) [4] proposed a ranker similar to GDEX for selecting examples suitable for learners. They extended the rule-based ranker with a machine learning-based approach to classify sentences into their corresponding CEFR (Common European Framework of Reference) language levels. Results showed that 70% of the sentences retrieved by the model were deemed to be of an appropriate language level by human evaluators. Furthermore, 60% of the retrieved sentences were suitable as illustrative examples of word usage.

Example selection has also been used in ICALL to generate exercises, typically in the form of cloze questions (fill-in-the-blank) [5], [6]. Some of these approaches used language models in order to rank examples according to the typicality of a word in a context; for instance, Wojatzki et al. (2016) [6] used an N-gram language model for this purpose. Results showed that their model's proposed exercises reduced the ambiguity of the "fill-in-the-blank"-type questions. Furthermore, the authors also introduced a disambiguation measure, which was proven to effectively discard exercises that were too ambiguous.

1.5 Delimitations

This thesis project has several limitations that should be considered. Firstly, it is only focused on English as the target language. This means that the model and decoding algorithm are only able to generate phrases and sentences in English, and do not support other languages.

Another limitation of the project is that it only considers one-word prompts for text generation. This means that the model only generates phrases and sentences based on a single-word input, rather than multiple words or whole sentences. This may affect the model's ability to generate contextually appropriate phrases and sentences, as one-word prompts may not provide enough context for the model to generate accurate and meaningful output.

Additionally, we did not develop a way of adapting the level of the sentences to the student's proficiency in the target language. This means that the generated phrases and sentences are not tailored to the specific language level of the student,
thus making the generated phrases and sentences less relevant and useful for some students who are at different stages of language learning.

In summary, this thesis has limitations in terms of the target language, the type of prompts used, and the lack of adaptability to students' prior proficiency level.

1.6 Structure of the Thesis Report

In this thesis, Chapter 2 provides a comprehensive overview of the relevant background information, including the research field, relevant models, and previous studies. The methods employed to address the research questions are outlined in Chapter 3, while the results of the study are presented and analyzed in Chapter 4. Finally, in Chapters 5 and 6, the methodology and results are thoroughly discussed along with suggestions for future research directions and the conclusion, respectively.

2 Theory

This chapter provides the theoretical framework for the subject matter. In Section 2.1, an overview of NLP and its sub-field automatic text generation (ATG) is presented. Section 2.2 explores the field of deep learning (DL) and introduces commonly used models for NLP tasks, such as recurrent neural networks (RNNs), long short-term memory networks (LSTMs), and the Transformer architecture. In Section 2.3, tokenization and word embeddings are introduced, while Section 2.4 provides a description of the BERT model with some relevant adaptations of the core model. Section 2.5 provides a background for autoregressive language models used for generating text, finishing by introducing the GPT–2 model used in this project. Additionally, Section 2.6 delves into the algorithms used for decoding, which translate the probabilities generated by the decoder model into text. Finally, in Section 2.7, the tools used to evaluate the results are described.

2.1 Research Area

This section introduces the broader research domains of natural language processing and the automated generation of text.

2.1.1 Natural Language Processing

NLP is a rapidly growing field that involves the use of computer algorithms to analyze, understand, and generate human language. It is a multidisciplinary field that combines the fields of computer science, linguistics, and cognitive psychology to enable computers to interact with human language [7].

The two primary subfields of NLP are natural language understanding (NLU) and natural language generation (NLG). NLU focuses on enabling machines to comprehend human language by analyzing and processing linguistic structures such as syntax, semantics, and pragmatics [8]. NLG, on the other hand, is concerned with producing language and involves generating human-like text that is coherent, grammatically correct, and semantically meaningful [9].

NLP has many applications and is used in a wide range of fields, including machine translation, text classification, question answering, sentiment analysis, and speech recognition. Machine translation is the process of translating one language into another [10], while text classification involves categorizing text into different classes or categories [11]. Question answering systems allow machines to understand and respond to human language queries [12], while sentiment analysis involves analyzing the emotional content of text [13].

In recent times, DL, a subset of machine learning (ML) that involves the use of artificial neural networks (ANNs) to analyze and process large amounts of data, has emerged as the most commonly used approach to create NLP models [14].
DL algorithms have been shown to outperform traditional statistical methods in many NLP tasks, and have led to significant advancements in this field [15]. In particular, the use of autoregressive language models (introduced in Section 2.5) with Transformer architectures has revolutionized the field of text generation, allowing for the creation of high-quality, human-like text.

Overall, NLP is an exciting and rapidly evolving field that has the potential to transform the way we interact with machines and with each other through language. Its applications are wide-ranging and continue to grow as new technologies and techniques are developed.

2.1.2 Automatic Text Generation

As a subfield of NLG, ATG focuses on generating human-like text automatically. It is a challenging task as it requires the computer to produce text that is coherent, grammatically correct, and semantically meaningful. Previous text generation models have been statistical-based [16], case-based [17], and rule-based [18]. However, these models often fail to capture the nuances and complexities of human language, resulting in stilted and unnatural text [19].

Currently, the state-of-the-art (SOTA) approach to text generation uses deep learning and autoregressive language models with a Transformer architecture [20]. These models are able to generate convincing natural text by training on large amounts of data and learning the patterns and structures of human language. The Transformer architecture allows for capturing long-term dependencies and contextual information, leading to better text coherence and fluency [21]. Furthermore, recent advancements in pre-trained models such as ChatGPT and GPT–4 have shown promising results in generating high-quality text that can mimic human-like writing style and tone [22], [23]. With these new technologies, ATG is becoming an increasingly important area of research, with potential applications in content creation, virtual assistants, chatbots, and more.

2.2 Deep Learning

This section provides an overview of the deep learning field and highlights the commonly employed models for ATG.

Figure 2.1: Schematic representation of a single neuron in an ANN.

2.2.1 Background

Deep Learning is a sub-field of ML that uses ANNs. These networks are inspired by the structure of the human brain and consist of interconnected nodes, also called neurons, that exchange information [14], [24]. Each node receives inputs from other nodes and generates a single output, which is then sent on to multiple nodes. The connections between neurons are weighted, and these weights determine the influence of inputs on the output. ANNs are composed of multiple layers, where deeper layers allow the network to learn more complex patterns. Typically, neurons in one layer are connected to all neurons in the preceding and following layers [25]; such a network is also called a fully connected multi-layer perceptron. This structure allows DL algorithms to learn abstract representations of data, making deep learning a powerful tool in addressing NLP challenges such as automatic text generation [26].

Figure 2.1 provides a visualization of a single neuron in a neural network. The inputs x_1, ..., x_n coming from neurons in the preceding layer are multiplied by the corresponding weights w_1, ..., w_n and then summed up together with a bias, θ.
The result of this addition is then passed through an activation function, σ, to produce the output

y = σ(θ + Σ_{i=1}^{n} x_i · w_i),

which is transmitted to the next set of neurons in the network [25].

In Figure 2.2, an ANN with one hidden layer of neurons is illustrated. The input layer is responsible for receiving the initial data, while the output layer delivers the model's result. The hidden layer, positioned between the input and output layers, performs the computational operations.

In the training process of an ANN, the weights and biases are iteratively adjusted with the aim of reducing the value of the cost function, which measures the discrepancy between the predicted output of the network and the actual target output. This is typically achieved through the utilization of backpropagation [27], which allows for the calculation of the gradient with respect to each weight based on the difference between the target output and the actual output [28]. In the context of ATG, recurrent neural networks and long short-term memory networks have been widely used in the past, but the current SOTA models follow the Transformer architecture [14]. In the following sections, these various models will be discussed.

Figure 2.2: A multi-layer perceptron neural network with one hidden layer of five neurons.

Figure 2.3: a) shows a recurrent neural network and b) shows the same RNN unfolded in time.

2.2.2 RNN and LSTM

An RNN is a type of neural network that utilizes feedback loops to incorporate information from previous time steps into its computation. This structure is depicted in Figure 2.3, where U, V, and W are edge weights, and x(t) and y(t) are the input and output states, respectively. At each time step, the hidden state h(t) is passed back into the network to be processed again, enabling the network to handle inputs of varying length and to preserve the order of input sequences, which is crucial in the field of ATG, where sentences or word sequences have variable length and the arrangement of words affects their meaning [26]. However, RNNs face challenges in dealing with long-term dependencies, such as gradient vanishing or gradient explosion, where the gradient can become very small or very large over time [29].

An LSTM unit is a type of RNN that utilizes gates to control the memorization process. This unit has three distinct gates, the input, output, and forget gate, which regulate the flow and modification of information through the neuron. This advancement in RNN architecture addressed the issues of vanishing gradients and long-term memory loss [30], and is depicted in Figure 2.4, where c_t, h_t, and x_t are the cell, hidden, and input states at time t, respectively.

Figure 2.4: Schematic representation of the LSTM architecture.

2.2.3 Transformer Architecture

Vaswani et al. introduced the Transformer network [31] as a type of deep neural network that has become the SOTA model for language modeling and NLP tasks [20]. This model allows for greater parallelization, making efficient use of modern GPU capabilities for parallel computation. The Transformer architecture has proven effective in automatic text generation as well. Figure 2.5 depicts the Transformer architecture as presented by Vaswani et al. (2017), and the following paragraphs provide a detailed description of the various layers of the model, based on the original publication.
Figure 2.5: The Transformer model architecture according to the paper by Vaswani et al. (2017) [31].

2.2.3.1 Encoder

This section describes the encoder block, the components in the left part of Figure 2.5.

Input Embedding: This component involves converting the input sequence of words into embeddings, which are numerical vectors that represent each word in the sequence. The embeddings are created in such a way that words with similar meanings are represented by similar vectors. In the paper [31], 512-dimensional embedding vectors are used, but other implementations may use embeddings of different dimensions.

Positional Encoding: This step is added to encode the position of each word in the input sequence, which is essential for NLP tasks as word order conveys meaning, and this information is not naturally preserved when computations are done in parallel. Vaswani et al. achieve this by generating a 512-dimensional vector for each position in the sequence, as in Equation 2.1 [31]:

PE_{pos,2i} = sin(pos / 10000^{2i/d_model}),
PE_{pos,2i+1} = cos(pos / 10000^{2i/d_model}).    (2.1)

Multi-Head Attention: The self-attention mechanism is a crucial part of the Transformer architecture that helps the model to understand the relationships between words in a sentence. Specifically, this mechanism allows the input sequence to attend to itself. With the help of trainable weight matrices W_i^Q, W_i^K, W_i^V and the input X, the model obtains three representations of the same word for each attention head i:

• Queries: Q_i = X W_i^Q
• Keys: K_i = X W_i^K
• Values: V_i = X W_i^V

From these representations, the self-attention is calculated by multiplying the query and key matrices and dividing by the square root of their dimension. The softmax is then applied, and the result is multiplied by the value matrix, see Equation 2.2 [31]:

Attention(Q_i, K_i, V_i) = softmax(Q_i K_i^T / √d_k) V_i.    (2.2)

To pass the attention scores through the feed-forward network, the model concatenates the output of all heads and multiplies by an output weight matrix W^O, as

MultiHead(Q, K, V) = concat(head_1, head_2, ..., head_h) W^O,    (2.3)

where

head_i = Attention(Q W_i^Q, K W_i^K, V W_i^V).    (2.4)

Add and Norm: The word embeddings from the previous layer are added to the output from the Multi-Head Attention mechanism and normalized to ensure that the mean and variance of each word representation are standardized to 0 and 1, respectively [31].

Feed Forward: The feed-forward network is composed of two linear layers separated by a ReLU activation function [32]. The output of this layer is added to its input, and then the result is normalized using a new Add & Norm layer as above [31].
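To make Equations (2.2)–(2.4) concrete, the following minimal NumPy sketch computes scaled dot-product attention for a single head on a toy input. The sequence length, embedding dimension, and random weight matrices are illustrative placeholders rather than the dimensions used in the actual Transformer.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Equation (2.2): softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (seq_len, seq_len) attention scores
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V                         # weighted sum of the value vectors

# Toy example: 4 token embeddings of dimension 8 (hypothetical sizes).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_Q, W_K, W_V = rng.normal(size=(3, 8, 8))     # one head's trainable projections
Q, K, V = X @ W_Q, X @ W_K, X @ W_V            # queries, keys and values
print(attention(Q, K, V).shape)                # -> (4, 8)
```

In the multi-head case of Equations (2.3)–(2.4), this computation is repeated with separate projection matrices per head, and the head outputs are concatenated and multiplied by W^O.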
2.2.3.2 Decoder

This section describes the components in the right part of Figure 2.5, the decoder block. Only the layers that differ from their respective counterpart in the encoder are described.

Masked Multi-Head Attention: The decoder of the Transformer architecture generates the output sequence word by word, and its output at each step should only depend on the previous words in the sequence. To achieve this, the decoder applies a mask to the future tokens. This mask is a matrix, M, filled with 0's for the unmasked tokens and −∞ for the masked tokens.

During the computation of the attention score, masking is performed to serve two purposes. Firstly, masking is used to zero out attention outputs where there is padding in the input sentences, to prevent padding from contributing to self-attention. Secondly, it is used to prevent the decoder from "peeking" ahead at the rest of the target sentence when predicting the next word. To achieve this, the decoder masks out input words that appear later in the sequence. The masked elements are set to negative infinity just before the softmax calculation, see Equation (2.5), to ensure that the softmax turns those values to zero [31]:

MaskedAttention(Q, K, V) = softmax((Q K^T + M) / √d_k) V.    (2.5)

Encoder-Decoder Attention: In the decoder block's self-attention layer, the encoder output serves as both the values and keys, whilst the Masked Multi-Head Attention outputs function as queries. This allows the decoder to focus on the relevant encoder inputs, leading to a more efficient and precise decoding process. The encoder-decoder attention mechanism is similar to that of sequence-to-sequence models and enables every decoder position to attend over all positions in the input sequence [31].

Linear: The Linear layer in the Transformer architecture is responsible for projecting the decoder output vector into word scores, with a score value assigned to each unique word in the target vocabulary, at every position in the sentence. This means that for an output sentence with n words and a target vocabulary with V unique words, we generate V score values for each of the n words. These score values indicate the likelihood of occurrence for each word in the vocabulary at that specific position of the sentence. Essentially, the Linear layer acts as a classifier, producing a vector of the same length as the vocabulary [31].

Softmax: At the end of the decoder block, there is a layer that produces a probability distribution over the tokens in the vocabulary, with each token assigned a score between 0 and 1 (which all add up to 1.0). This is accomplished by applying the softmax function to the output of the previous linear transformation layer [31].

2.2.3.3 HuggingFace

HuggingFace is an open-source NLP technologies company that provides a platform for researchers to upload pre-trained models for public use. The Transformers library, created by HuggingFace and available at https://github.com/huggingface/transformers, contains various well-known Transformer architecture variants [33]. The library's primary objective is to make pre-trained models more accessible and more comfortable to read, develop, and deploy. The community around HuggingFace has contributed thousands of fine-tuned models that researchers can download and use for research and educational purposes.

Researchers can download a model's architecture and pre-train it from scratch, or download pre-trained models by defining a checkpoint, which contains the model's current state of weights and tokenizer.
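As a small illustration of the checkpoint workflow described above, the sketch below downloads the pre-trained GPT–2 language-modeling checkpoint and its tokenizer from the HuggingFace hub and generates a short continuation. The prompt and the decoding settings are arbitrary examples, not the configuration used in this project.

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# A checkpoint bundles the model weights and the matching tokenizer.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "I learned how to ride a bicycle"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Illustrative generation call; the sampling parameters are placeholders.
output_ids = model.generate(input_ids, max_length=25, do_sample=True, top_k=50)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```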
In this thesis, we use the GPT–2 model, one of the models implemented in the Transformers library, as an autoregressive language model. The library also contains tokenizers that handle the specific encoding and decoding of tokens for each model. Our specific model is "GPT2LMHeadModel", which is a GPT–2 implementation specifically for language modeling.

2.3 Word Tokenization

In NLP, tokenization is a crucial preprocessing step that breaks down text into smaller pieces, or tokens [34]. The tokenization process can group tokens into words, subwords, or even units as granular as characters, depending on the chosen tokenizer, and studies suggest that these strategies can significantly affect model performance [35], [36]. Tokens are mapped to token IDs and tracked in a vocabulary.

While word- and character-based tokenization methods are relatively straightforward to grasp, they have some issues. Word-based tokenization can lead to unmanageably large vocabularies, as every word has its own token. To combat this issue, a vocabulary of only the most common words can be created, with an "unknown" token assigned for words not included. However, this approach can lead to a loss of performance, as information is lost for every unknown token used. Furthermore, semantically similar words (e.g. "bicycle" and "bicycles") can have different representations in the model, and will therefore falsely be treated as completely separate entities.

While character-based tokenization mitigates the problem of large vocabularies, it fails to capture the emergent properties of language, where a single character may not carry as much meaning as a word (for example, Japanese Kanji characters can carry a lot of meaning, but this mostly holds for languages using the Latin alphabet, e.g., English). Therefore, the model has to look at several tokens to interpret the meaning of a word. Additionally, character-based tokenization requires the model to handle larger inputs, as a word-based input of only a small number of tokens can be split up into a large number of characters.

The subword-based tokenization method is a combination of the word- and character-based approaches and is the most common method used by current SOTA models in NLP [37]. This method splits uncommon words into subwords while leaving common sequences of characters intact. The model can build every word in the document by stitching together the subwords. Subword-based tokenization can learn prefixes, suffixes, and grammatical word endings, allowing the model to see the similarity between the aforementioned singular and plural example [38]. However, the partitioning of subword tokens depends on the data used to train the tokenizer and may not always result in an optimal partition.

The WordPiece tokenizer is a common approach that aims to express the input corpus with a fixed-size vocabulary [37]. If a word is not found in the vocabulary, it is split into subwords in such a way as to minimize the number of tokens needed. However, this recursive process of splitting can result in excessive splitting and significantly lengthen the input text if it is dense in out-of-vocabulary words and subwords.
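To make the subword idea concrete, the snippet below inspects how the GPT–2 tokenizer used later in this thesis (a byte-level BPE tokenizer rather than WordPiece) splits a few words. The exact splits depend on the data the tokenizer was trained on, so the behaviour described in the comments is indicative only.

```python
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

for word in ["bicycle", "bicycles", "unrideable"]:
    # The leading space matters: GPT-2 marks the start of a word with a special 'Ġ' symbol.
    tokens = tokenizer.tokenize(" " + word)
    ids = tokenizer.convert_tokens_to_ids(tokens)
    print(f"{word!r} -> {tokens} -> {ids}")

# Common words typically remain a single token, while rarer words are split
# into several subword pieces that together reconstruct the original word.
```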
2.3.1 Word Embeddings

Word embeddings are used in NLP models to represent words as real-valued vectors instead of one-hot encodings, which are often insufficient for deeper understanding [39]. The goal of word embeddings is to map words with semantic similarities to similar vectors in a high-dimensional space, where the distance between vectors indicates the degree of semantic relatedness between words, with cosine similarity as the measure of closeness (see Section 2.7.1). This is not possible with one-hot encodings, because each word is represented by a binary vector with identical distance to every other word. The larger the dimension of the space, the better the representation, but the trade-off is that it requires more data or is slower to train. By using word embeddings, NLP models can capture semantic relationships between words, enabling them to perform more accurate and sophisticated language processing tasks.

2.3.1.1 Word2Vec

Word2Vec is an embedding model which uses a neural network comprising one hidden layer and can be trained using two different methods. Developed by Mikolov et al. (2013) [40], [41] at Google, this algorithm is used to predict words close to the word to be embedded via a two-layer neural network, the structure of which is illustrated in Figure 2.6, and then uses the obtained parameters as embeddings. The first method is the Continuous Bag-of-Words (CBoW) method, which learns the word embeddings by placing a context window around a word, after which the network tries to predict the word in question. Similarly, the Skip-gram method also uses a context window, but trains the model to predict the surrounding words in the training set.

Figure 2.6: Structure of the Word2Vec network, where W and S are V × E and E × V matrices, respectively.
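As a brief sketch of how such embeddings can be trained in practice, the snippet below uses the Gensim library (which is not used in this thesis) to fit a Skip-gram Word2Vec model on a toy corpus; the corpus, vector size, and window are arbitrary illustrative values.

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens.
corpus = [
    ["i", "ride", "my", "bicycle", "to", "school"],
    ["she", "rides", "her", "bike", "to", "work"],
    ["he", "drives", "a", "car", "to", "work"],
]

# sg=1 selects the Skip-gram objective; sg=0 would select CBoW (Section 2.3.1.1).
model = Word2Vec(sentences=corpus, vector_size=16, window=2, min_count=1, sg=1, epochs=50)

vector = model.wv["bicycle"]                         # 16-dimensional embedding vector
print(model.wv.most_similar("bicycle", topn=3))      # nearest words by cosine similarity
```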
2.3.1.2 GloVe

GloVe (Global Vectors for Word Representation) is a well-known algorithm for obtaining word embeddings from large text corpora [42]. Unlike other techniques such as Word2Vec, GloVe not only considers local context-based information but also leverages global co-occurrence information to create embedding vectors that capture both semantic and syntactic properties of words. The training of the model is done by constructing a co-occurrence matrix that stores the frequency of adjacent words occurring together in a given window size for every word in the training corpus.

One of the distinctive features of GloVe is its use of matrix factorization to optimize the vector representations of words by minimizing a loss function that compares the dot product of two word vectors and the logarithm of their co-occurrence probability. GloVe's resulting word embeddings are applicable to various NLP tasks, such as text classification [43], information retrieval [44], and machine translation [45], [46].

GloVe embeddings are calculated by first letting X_i be the window size multiplied by the number of times word v_i occurs in the training corpus. Furthermore, let X_{i,j} denote the number of times that v_j occurs in a window of the given size around word v_i. By defining p_{j|i} = X_{i,j}/X_i, GloVe deems words v_i and v_j to be "close" if the ratio p_{k|i}/p_{k|j} is close to 1 for most words v_k. Similarly to Word2Vec, GloVe also uses target vectors, t_j, and context vectors, c_j. They are trained by fitting a parametric model, with the target and context vectors as parameters:

t_i^T c_k − t_j^T c_k = log(p_{k|i} / p_{k|j}) = log((X_{i,k}/X_i) / (X_{j,k}/X_j)).    (2.6)

Equation (2.6) holds if t_i^T c_k = log X_{i,k} − log X_i. Next, two biases b_j and b_j^{(c)} are introduced for each j such that log X_i + log X_k = b_i + b_k^{(c)} + b_k + b_i^{(c)}. This works if t_i^T c_k + b_i + b_k^{(c)} = log X_{i,k}. Now the embeddings are calculated by picking target and context vectors so as to minimize the double sum

J = Σ_{i,k=1}^{V} f(X_{i,k}) (t_i^T c_k + b_i + b_k^{(c)} − log X_{i,k})^2,    (2.7)

where the weighting function f(·) is introduced to mitigate the logarithm blowing up if X_{i,k} = 0, and has the following three properties:

• f(0) = 0 and it vanishes faster than log x as x → 0+.
• It is non-decreasing.
• f(x) = 1 for x above some cutoff x_max > 0.

In their paper, Pennington et al. (2014) use a weighting function parameterized as

f(x) = (x/x_max)^α if x < x_max, and f(x) = 1 otherwise,    (2.8)

with values x_max = 100 and α = 3/4 [42]. An illustration of how Equation (2.8) behaves for different values of α can be found in Figure 2.7.

Figure 2.7: Weighting function of the GloVe word embedding model for various values of α, but with fixed x_max.

2.3.1.3 FastText

Developed by Facebook's AI Research team [47], FastText is a word embedding algorithm that extends the previously mentioned Word2Vec model. FastText represents words as bags of character N-grams. This approach enables the model to capture morphological details and effectively handle out-of-vocabulary words. This stands in contrast to Word2Vec and GloVe, which fail to provide vector representations for words outside their pre-defined dictionaries. By representing words using character N-grams, FastText gains the ability to capture the meaning of shorter words and understand suffixes and prefixes.

This model can be seen as a bag of words with a sliding window over each word, where the order of the N-grams within the window is not significant. Each word is represented by itself and all sub-N-grams for K1 ≤ N ≤ K2 (usually K1 = 3 and K2 = 6). Pre-processing of any word is done by enclosing the word in angle brackets (e.g., "dogs" becomes "<dogs>"). The full representation of dogs can be found in Table 2.1. The training process involves employing a skip-gram model that learns embeddings based on these character N-grams.

Table 2.1: Character sub-N-grams in the FastText representation of the word dogs with 3 ≤ N ≤ 6.
  N = 3: <do   dog   ogs   gs>
  N = 4: <dog   dogs   ogs>
  N = 5: <dogs   dogs>
  N = 6: <dogs>
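The sub-N-grams in Table 2.1 can be enumerated with a few lines of code; the sketch below reproduces them for the word dogs, using the angle-bracket boundary markers described above.

```python
def char_ngrams(word: str, n_min: int = 3, n_max: int = 6) -> list[str]:
    """Return all character sub-N-grams of a word, FastText style."""
    marked = f"<{word}>"                              # add boundary markers
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(marked[i:i + n] for i in range(len(marked) - n + 1))
    return grams

print(char_ngrams("dogs"))
# ['<do', 'dog', 'ogs', 'gs>', '<dog', 'dogs', 'ogs>', '<dogs', 'dogs>', '<dogs>']
```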
2.4 BERT

Bidirectional Encoder Representations from Transformers (BERT) is a language model that was developed by Google in 2018 [48]. Its architectural design closely resembles that of the Transformer encoder block, illustrated in the left part of Figure 2.5. BERT has revolutionized the field of NLP, as it achieves remarkable results on various benchmarking tests in the NLP domain [48]. The bidirectional aspect of BERT allows the model to consider both preceding and succeeding positions when encoding information.

The initial pre-training of the BERT language model involves training on a large unlabeled corpus, and subsequent fine-tuning enables it to be tailored for specific downstream NLP tasks with minimal architectural modifications. This fine-tuning process requires less data and training compared to the initial pre-training phase, making BERT highly efficient and adaptable to new tasks after the initial language modeling is performed.

BERT is pre-trained on two innovative language tasks: Masked Language Modeling (MLM) and Next Sequence Prediction (NSP). In MLM, a percentage of tokens in an input sequence are replaced with masking tokens, and the model is trained to predict the original tokens. During BERT's pre-training, approximately 15% of tokens were masked. NSP, on the other hand, involves predicting the relatedness of two sentences, facilitating a better understanding of sentence relationships in tasks like question answering. Furthermore, during the pre-training phase, BERT also learns positional embeddings, which enable the model to internalize the position of words within a sequence [48].

One of the key advantages of BERT is its ability to capture the context-dependent meaning of words in a sentence. This means that the BERT embeddings of two sentences that are similar in meaning will be closer together in the embedding space than embeddings of two sentences that are not similar in meaning. As a result, BERT embeddings have become an important tool for text comparison tasks [48].

2.4.1 Sentence-BERT

Sentence-BERT (SBERT) is an extension of the BERT model that uses a Transformer-based encoder to create fixed-length vector representations for sentences [49]. SBERT employs siamese or triplet network architectures, described in further detail in Section 2.4.1.1, to acquire sentence embeddings that capture semantic similarity or relatedness between sentences.

SBERT produces a concise and fixed-length vector representation, known as a sentence embedding, for each input sentence. These embeddings effectively encode contextual information, encapsulating the semantic essence of the sentences. SBERT's versatility extends to various applications, including semantic similarity score computation and clustering of similar sentences [49], [50].

Thorough evaluations on diverse sentence semantics tasks have proven SBERT's superior performance when compared to previous methods of sentence embedding. Additionally, SBERT achieves this enhanced performance while maintaining computational efficiency, enabling its application across a wide range of tasks [49].

2.4.1.1 Siamese and Triplet Network Architecture

The siamese and triplet network architectures in NLP are neural network models designed to compare and measure the similarity between two input sequences [51]. They are commonly used for tasks such as text similarity detection, paraphrase identification, and sentiment analysis. The name "siamese" comes from the idea that the architecture consists of two identical sub-networks. Each arm processes one of the input sequences independently and produces a fixed-length representation of the input. These representations are then compared to determine the similarity between the two sequences [52].

The triplet network architecture is based on the concept of triplets, which consist of three input sequences: an anchor, a positive example, and a negative example. The goal of the model is to learn representations in such a way that the anchor and positive example are closer to each other in the representation space, while the anchor and negative example are farther apart [53], [54].

2.4.2 KeyBERT

KeyBERT is another extension of the BERT model. It is used to extract the most relevant words from a sentence or document, called keywords. KeyBERT uses Sentence-BERT for sentence embedding of the input and computes the cosine similarity between each word embedding and the embedding of the entire sentence. The highest-scoring words are deemed the most characteristic of the sentence as a whole. After calculating the similarities, KeyBERT returns the k most important words, determined by the hyperparameter k [55].
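A minimal usage sketch of KeyBERT (which is used in Chapter 3 to extract keywords from the training sentences) could look as follows; the underlying SBERT checkpoint name and the parameter values are illustrative defaults, not necessarily the settings used in this project.

```python
from keybert import KeyBERT

# KeyBERT embeds the sentence and its candidate words with an SBERT model and
# ranks the words by cosine similarity to the full-sentence embedding.
kw_model = KeyBERT(model="all-MiniLM-L6-v2")          # illustrative SBERT checkpoint

sentence = "I learned how to ride a bicycle when I was six years old."
keywords = kw_model.extract_keywords(sentence, keyphrase_ngram_range=(1, 1), top_n=3)
print(keywords)   # list of (keyword, similarity) pairs, highest-scoring first
```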
2.5 Autoregressive Language Models

Language models aim to capture the probability distribution of generated text [56]. The main objective is to compute the probability P(X) for a given text X = x_1, x_2, ..., x_m, where every x is a word in the vocabulary, and leverage this probability model to generate text.

Figure 2.8: Illustration of autoregressive text generation from a language model, where the input to the model at time step t is the sequence of previously generated tokens.

The language model outputs a probability distribution over the vocabulary, representing the likelihood of each word in the vocabulary being the next word in the sentence. This distribution is obtained by applying the chain rule of probabilities, where the likelihood of generating the word sequence X can be expressed mathematically using Equation (2.9), where x_t denotes a word in the dictionary and X_{<t} = x_1, ..., x_{t−1} denotes the sequence of preceding words:

P(X) = Π_{t=1}^{m} P(x_t | X_{<t}).    (2.9)

Figure 2.9: Example of teacher forcing where the gold-standard reference sentence used is "The lion hunts...".

To ensure effective generation, autoregressive language models are trained on massive datasets, with models like GPT–3 being trained on a staggering 400 billion tokens. This extensive training allows the model to capture rich linguistic patterns and generate high-quality and contextually relevant text [58].
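The chain rule in Equation (2.9) can be evaluated directly with a pre-trained model. The sketch below scores a sentence under GPT–2 by summing the log-probabilities of each token given its predecessors; it is a simplified illustration (the probability of the very first token is not included) rather than the evaluation code used in this thesis.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

ids = tokenizer("The book is good.", return_tensors="pt").input_ids   # shape (1, m)

with torch.no_grad():
    logits = model(ids).logits                                         # shape (1, m, |V|)

# Equation (2.9): log P(X) = sum_t log P(x_t | X_<t).
log_probs = torch.log_softmax(logits[:, :-1], dim=-1)                  # predictions for tokens 2..m
token_log_probs = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
print("log P(X) ≈", token_log_probs.sum().item())
```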
Much like the original GPT, GPT–2 was an unsupervised transformer model trained to generate text by predicting the next word in a sequence of tokens. GPT–2 was trained on a dataset of 8 million web pages and has 1.5 billion parameters (smaller versions with 117, 345, and 762 million parameters also exist). It was evaluated on its performance on tasks in a zero-shot setting, meaning it was not specifically trained for these tasks [64]. The use of the Transformer architecture enabled GPT-series models to be trained on larger corpora than previous NLP models due to the ability to parallelize and self-supervise the training process. GPT–2 was trained on a new corpus, known as WebText, which was generated by scraping only pages linked to by Reddit posts that had received at least three upvotes prior to December 2017. The corpus was subsequently cleaned by parsing HTML documents into plain text, eliminating duplicate pages, and removing Wikipedia pages (since their presence in many other datasets could have induced overfitting) [64].

2.6 Decoding Algorithms

The autoregressive language model described in the previous section will output a probability distribution over the vocabulary V, which represents the likelihood of each word in the vocabulary being the next word in the sentence. Figure 2.10 illustrates a toy example of such a distribution for a vocabulary consisting of only five words. The way in which words are sampled from this distribution is the decoding algorithm.

Figure 2.10: Probability distribution over a vocabulary consisting of five words (good: 0.35, bad: 0.30, long: 0.15, mine: 0.13, car: 0.07) given the starting prompt X = “The book is”.

2.6.1 Standard Approaches to Decoding

This section introduces the standard strategies used when generating text from a neural LM, such as greedy decoding, beam search, nucleus- and top-k sampling, and temperature.

2.6.1.1 Maximization-Based Decoding

Greedy decoding implies selecting the highest-scoring word at each step of the generation process, as described in Algorithm 1 [65]. While this approach is computationally efficient, with a time complexity of O(|V| · T), where |V| is the size of the vocabulary and T is the maximum sequence length, and can work well for short sequences, it can lead to problems with longer ones. One issue with greedy search is that it only considers the highest conditional probability for each token in the vocabulary, which can result in suboptimal output sequences. This is because greedy search hides high probabilities that may be found in subsequent tokens. Figure 2.11 provides an illustration of this drawback, where the algorithm generates the sequence “This dog” (marked in bold with a total score of 0.3 × 0.1 = 0.03), while the highest-scoring sequence in actuality is “The book” (marked in red with a total score of 0.2 × 0.3 = 0.06).

Figure 2.11: Example of a sequence generation where a greedy decoding strategy yields a suboptimal output. The “Beginning of Sequence” token tells the model to start generating text.

Algorithm 1 Pseudocode for greedy decoding
  Let X = x1, . . . , xm be some initial token sequence
  for (i = m + 1, . . . until stopping criterion is met) do
    xi ← arg max_x P(x | X)
    append xi to X
  end for
  return X

Beam search improves upon greedy decoding by considering multiple potential next words at each step and selecting the top N candidates based on their likelihood, as described in Algorithm 2. The number of options considered is the number of “beams” used in the search, i.e., the beam width. This allows for exploration of multiple levels of the output and assessment of the quality of all of these tokens combined [66]. Illustratively, a beam search with N = 2 would find the optimal solution to the example in Figure 2.11, where a greedy solution fell short. However, beam search can lead to repetitive and uninformative text if it degrades into selecting the most probable option repeatedly [67]–[69]. Furthermore, it does not guarantee finding the output sequence with the highest score [70]. Increasing the beam size improves the quality of the output sequence, but at the cost of reduced decoder speed, since the computational complexity O(|V| · T · N) increases linearly with the beam width, N. Additionally, there is a saturation point beyond which a further increase in beam size does not improve the quality of decoding anymore [71].

Algorithm 2 Pseudocode for beam search
  Let N be the beam width
  Let X = x1, . . . , xm be some initial token sequence
  B ← [X]
  for (i = m + 1, . . . until stopping criterion is met) do
    C ← [ ]
    for (each b ∈ B) do
      compute P(x | b)
      add b + [x] to C for all x in the vocabulary, V
    end for
    B ← select N top-scoring candidates from C
  end for
  return top-scoring beam in B

2.6.1.2 Random Sampling

Sampling is a stochastic process where the next word/token is selected randomly based on the probabilities, as described in Algorithm 3. Deterministic methods such as greedy decoding and beam search have a problem of repetition and blandness, respectively, and random sampling offers a trade-off between coherence and diversity. However, this method can lead to incoherent outputs due to excessive randomness, as there is no guarantee that the words will fit together.

Algorithm 3 Pseudocode for random sampling
  Let X = x1, . . . , xm be some initial token sequence
  for (i = m + 1, . . . until stopping criterion is met) do
    xi ∼ P(x | X)
    append xi to X
  end for
  return X

When sampling from a large vocabulary, the probability of each token becomes small, and the possibility of selecting a low-probability token is not negligible. If the selected token is not suitable, the subsequent text generated may become nonsensical, which is why sampling from only a truncated subset of the vocabulary distribution is preferred.

Nucleus sampling was proposed by Holtzman et al. (2020) [67]. Instead of sampling from the entire vocabulary, this approach considers a subset called the top-p vocabulary, which is defined as the smallest set of tokens with cumulative probability mass exceeding a pre-determined threshold p. More formally, the truncation set V(p) ⊆ V is the solution to the optimization problem in Equation (2.11). The probability distribution is then re-scaled with regard to this smaller set, from which the next word is sampled. E.g., nucleus sampling with p = 0.6 applied to the example distribution in Figure 2.10 would result in the truncation set V(p) = {good, bad}.

\min_{V^{(p)}} \left| V^{(p)} \right| \quad \text{s.t.} \quad \sum_{x \in V^{(p)}} P(x \mid X) \geq p, \qquad (2.11)

The size of the truncated vocabulary is determined dynamically based on the shape of the probability distribution at each time step.
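As a concrete illustration of the truncation in Equation (2.11), the following minimal Python sketch (illustrative only, not the thesis code) computes the top-p vocabulary for the toy distribution of Figure 2.10 and reproduces the V(p) = {good, bad} example for p = 0.6.

import numpy as np

# Toy distribution from Figure 2.10: P(x | "The book is").
vocab = ["good", "bad", "long", "mine", "car"]
probs = np.array([0.35, 0.30, 0.15, 0.13, 0.07])

def top_p_truncation(vocab, probs, p=0.6):
    """Smallest set of tokens whose cumulative probability mass reaches p (Eq. 2.11)."""
    order = np.argsort(probs)[::-1]                    # tokens by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1   # first index where the mass >= p
    keep = order[:cutoff]
    rescaled = probs[keep] / probs[keep].sum()         # re-scale to a valid distribution
    return [vocab[i] for i in keep], rescaled

print(top_p_truncation(vocab, probs, p=0.6))           # (['good', 'bad'], array([0.538..., 0.461...]))

The same skeleton yields top-k sampling by replacing the cumulative-mass criterion with a fixed cutoff of k tokens.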
For high values of p, the top-p vocabulary consists of a small subset of the vocabulary that contains the vast majority of the probability mass [67]. Furthermore, applying nucleus sampling has a computational complexity of O(|V| · log |V|) for every word sampled, resulting in a total complexity of O(|V| · log |V| · T).

Top-k sampling is used to sample from a truncated set of only the k most probable words at each time step, with a computational complexity similar to that of nucleus sampling. The truncation set is defined as the top k highest-probability tokens in the distribution, or the solution V(k) ⊆ V to the maximization problem in Equation (2.12). E.g., top-k sampling with k = 3 applied to the example distribution in Figure 2.10 would result in the truncation set V(k) = {good, bad, long}.

\max_{V^{(k)}} \sum_{x \in V^{(k)}} P(x \mid X) \quad \text{s.t.} \quad \left| V^{(k)} \right| \leq k, \qquad (2.12)

By removing the tail of the probability distribution, top-k sampling can improve the quality of the generated text and make it less likely to go off-topic. However, the optimal value of k can vary between different time steps, and selecting an appropriate value of k can be challenging. This is because the distribution of words can change at each time step, which means that a value of k that works well in one step may not work as well in another [67].

Temperature plays a crucial role in sampling-based generation from LMs and provides a flexible mechanism to control the balance between exploration and exploitation in ATG [72]. To apply temperature, the logits are divided by a chosen temperature value, denoted as T, before either sampling directly or further truncating the vocabulary, e.g., by nucleus- or top-k sampling [73]. This rescaling of the logits adjusts the distribution of probabilities generated by the softmax function, and its effect is illustrated in Figure 2.12.

Figure 2.12: Illustration of how different values of the temperature, T, affect the probability distribution from the example in Figure 2.10 (shown for T = 0.5, T = 1.0, and T = 2.0).

When T is set between 0 and 1, the distribution becomes skewed towards high-probability tokens, effectively reducing the mass in the tail of the distribution. This adjustment biases the model towards more confident predictions, resulting in a narrower range of sampled tokens. However, it is worth noting that analyses have highlighted a trade-off between generation quality and diversity when lowering the temperature [74], [75]. Conversely, higher temperature values, greater than 1, introduce more randomness into the sampling process. This increased randomness leads to a broader exploration of the probability distribution and encourages the model to consider a wider range of potential tokens. In extreme cases, when the temperature approaches infinity, the sampling becomes uniform, meaning that all tokens have an equal chance of being selected [67].

2.6.2 Domain-Relevant Decoding Algorithms

The decoding algorithms described above are not suitable for our project because they do not take into account the specific language-learning context and may generate text that is not appropriate for students, as it can be both uninformative and repetitive. Furthermore, they do not offer a way to guarantee that the prompt word occurs in the generated sentences, which is a key requirement for our model.
This section goes more into depth on the specific methods we used to mitigate these drawbacks of the standard decoding algorithms.

2.6.2.1 Keyword2Text Decoding

Pascual et al. [76] present an approach to controlled language generation using large pre-trained language models (like GPT–2). They propose a decoding method called Keyword2Text (K2T) that, similarly to temperature, involves adding a shift to the probability distribution over the vocabulary towards words that are semantically similar to a given topic or keyword. The authors use the cosine similarity of the respective GloVe embeddings [42] to measure the semantic similarity between words. Using the following definition of the score function, score(· | X) = log P(· | X), the suggested method produces the following shift of the probability distribution to guide generation toward the semantic space of the given prompt word, w:

\text{score}'(x, w \mid X) = \text{score}(x \mid X) + \lambda \cdot \max\{0, \cos(\gamma(x), \gamma(w))\}, \qquad (2.13)

where γ(·) is the GloVe embedding of a word and λ is the strength of the probability modification. Only words with positive similarity to the prompt word are “boosted”, so as to not negatively affect words that would otherwise be favourable according to the original score function. To ensure the eventual generation of the prompt word, the authors propose an exponential growth of the λ-parameter throughout the generated sentence:

\lambda_t = \begin{cases} \lambda_0 \exp\left( \frac{c \cdot t}{T} \right) & \text{if } t < T \\ \infty & \text{otherwise,} \end{cases} \qquad (2.14)

at step t. T is the maximum length of the generated string, while λ0 and c are hyperparameters that control the initial value and the growth of λ, respectively. This boosting of the probabilities for certain words only grows until the keyword has been generated, at which point λ is set to 0.

The authors demonstrate that this simple method can be used to impose hard constraints on language generation, and show that it performs well in practice, leading to diverse and fluent sentences while ensuring the appearance of the given guide words.

2.6.2.2 Typical Sampling

Meister et al. (2022) developed a method to generate text with higher “informativeness” [77]. In the article, the authors propose a new approach to generating text using probabilistic language models. They argue that current models often underperform when generating text, and suggest that this may be due to a lack of consideration for the ways in which humans use language as a communication channel. The authors propose a method they call typical sampling, which involves sampling words from the set of words with information content close to the conditional entropy of the model, rather than always choosing words from the high-probability region of the distribution. Similar to nucleus- and top-k sampling, this is the solution to a minimization problem where the truncation set V(τ) ⊆ V optimizes the following:

\min_{V^{(\tau)}} \sum_{x \in V^{(\tau)}} \left| H(x \mid X) + \log P(x \mid X) \right| \quad \text{s.t.} \quad \sum_{x \in V^{(\tau)}} P(x \mid X) \geq \tau, \qquad (2.15)

where H(·) is the Shannon entropy, H(x) = −∑_{x∈χ} p(x) log p(x), or the expected information content, of a random variable with support χ [78], and τ is a hyperparameter determining what probability mass to include in the truncation. This is done with a computational complexity of O(|V| · log |V| · T), equivalent to both nucleus- and top-k sampling.
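Reusing the toy distribution of Figure 2.10, the following minimal sketch (illustrative only, not the reference implementation of [77]) shows how the truncation set of Equation (2.15) can be computed:

import numpy as np

vocab = ["good", "bad", "long", "mine", "car"]           # toy distribution from Figure 2.10
probs = np.array([0.35, 0.30, 0.15, 0.13, 0.07])

def typical_truncation(vocab, probs, tau=0.5):
    """Keep the tokens whose information content -log P(x|X) lies closest to the
    conditional entropy H(x|X), until their probability mass reaches tau (Eq. 2.15)."""
    entropy = -np.sum(probs * np.log(probs))             # H(x|X) of the current distribution
    deviation = np.abs(entropy + np.log(probs))          # |H(x|X) + log P(x|X)|
    order = np.argsort(deviation)                        # most "typical" tokens first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, tau)) + 1
    keep = order[:cutoff]
    return [vocab[i] for i in keep], probs[keep] / probs[keep].sum()

print(typical_truncation(vocab, probs, tau=0.5))         # (['bad', 'good'], array([0.461..., 0.538...]))

Note that, unlike nucleus sampling, the highest-probability token is not guaranteed to be included if its information content lies far from the conditional entropy.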
The authors demonstrate that this approach offers competitive performance in terms of quality while consistently reducing the number of repetitions, and suggest that it could be a promising approach for improving the performance of probabilistic language models in text generation tasks [77].

2.7 Evaluation Methods

In this section, the assessment techniques used in the project are presented.

2.7.1 Cosine Similarity

Cosine similarity is a useful measure for calculating the similarity between two non-zero vectors in an inner product space. It is calculated as the cosine of the angle between the two vectors, and the resulting score ranges from −1 to 1. If the score is 1, the vectors have the same orientation, while a score of 0 indicates that they are orthogonal. The magnitude of the vectors does not affect the cosine similarity score. This measure has various applications in natural language processing, including determining the similarity between two strings or measuring how similar two documents are based on the number of occurrences of each word in the document. A significant advantage of cosine similarity is its computational efficiency, particularly for sparse vectors, as only non-zero coordinates need to be considered. It is defined as:

\text{cosine similarity}(u, v) = \cos(\theta) = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}, \qquad (2.16)

where θ is the angle between the vectors u and v.

2.7.1.1 Cosine Similarity for SBERT Sentence Embeddings

It is common to combine cosine similarity with the sentence embeddings from the BERT encoder to measure how similar the generated sentences are to the fine-tuning dataset; the procedure is illustrated step by step in Figure 2.13. The advantage of using cosine similarity with BERT embeddings is that it provides a simple and effective way to measure the similarity between two sentences in the high-dimensional embedding space. Furthermore, the use of BERT sentence embeddings, introduced in Section 2.4, allows us to capture the contextual information of the sentences in a fixed-size embedding, which is particularly relevant for tasks that require an understanding of the meaning of the sentences.

Figure 2.13: Illustration of the pipeline for the text comparison metric: the generated sentence and a list of reference sentences are embedded with SBERT, the pairwise cosine similarities are computed, and the maximum similarity is returned as the SBERT cosine score.

2.7.2 Perplexity

Perplexity is an evaluation metric for generative language models, calculated as a deterministic transformation of the log-likelihood into an information-theoretic quantity. It is defined as:

\text{PPL}(X) = 2^{-\frac{l(X)}{M}}, \qquad (2.17)

where M is the total number of tokens in the held-out corpus and l is the log-likelihood of the word sequence X = x1, x2, . . . , xm,

l(X) = \sum_{t=1}^{M} \log P(x_t \mid x_1, \ldots, x_{t-1}).

[…] 15 standard deviations away from the mean.

Figure 3.2: Histogram showing the distribution of sentence length (in terms of number of words) of the dataset before and after truncation.

Table 3.2: Average number of words per sentence (µ) and standard deviation (σ) of the dataset before and after truncation.

                    µ      σ
Before truncation   7.50   3.83
After truncation    7.28   3.63

3.2 Model Implementation

For the implementation of the model in this research, the HuggingFace library’s GPT–2 model was selected, specifically the GPT2LMHeadModel variant (https://huggingface.co/docs/transformers/v4.30.0/en/model_doc/gpt2#transformers.GPT2LMHeadModel), which is pre-trained and designed for language modeling.
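Loading this variant together with its matching tokenizer takes only a few lines with the Transformers library; the following is a minimal sketch (the identifier "gpt2" refers to the publicly released pre-trained weights, and the printed parameter count is only indicative):

from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# GPT2LMHeadModel is the GPT-2 transformer with a language-modeling head on top,
# i.e., the variant that predicts the next token and can therefore generate text.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

print(f"{model.num_parameters():,} parameters")   # roughly 124M for the small checkpoint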
The library also includes other implementations of GPT–2, such as GPT2DoubleHeadsModel and GPT2ForTokenClassification, which serve different purposes.

3.2.1 Fine-Tuning

To fine-tune the pretrained GPT2LMHeadModel on additional data, the Trainer class from the HuggingFace library was utilized. This class enabled the smooth fine-tuning of the model and was selected as the HuggingFace library is a SOTA library for NLP model implementations. The previously extracted keyword was given to the model as a prompt together with the sentence it was extracted from: “BOS w SEP EOS”, where BOS, SEP, and EOS are tokens that indicate, respectively, Beginning-Of-Sentence, SEParation, and End-Of-Sentence. The training was then carried out via teacher forcing, as described in Section 2.5. The model was fine-tuned on an NVIDIA Tesla V100-PCIE GPU (16 GB) for three epochs.

Figure 3.3: Illustration of the step-by-step effect our proposed decoding algorithm has on the probability distribution before sampling from the language model, where the x-axis represents the tokens in the vocabulary and the y-axis the sampling probability. The first step is a shift of the original distribution by the K2T method, and the second step represents the truncation of the vocabulary by locally typical sampling.

The default settings of the HuggingFace Trainer were used, but due to hardware-related limitations, the batch size during training was reduced to 10 (still using an evaluation batch size of 32). The parameters used during fine-tuning are presented in Table 3.3.

Table 3.3: Fine-tuning parameters for the model.

Training batch size     10
Number of epochs        3
Starting learning rate  5e−5
Number of warmup steps  200
Weight decay            0.01

3.2.2 Decoding Algorithms

As described in Section 2.6, there are several shortcomings of the commonly used decoding algorithms. To mitigate these drawbacks, a two-step approach was implemented to (1) guarantee that the keyword was generated every time and (2) increase the informational content of the generated sentences to better exemplify the keyword. An illustration of the combined effect of this method can be found in Figure 3.3.

In a similar manner to temperature being applied before random sampling [67], the first step of the decoding process is a shift to the probability distribution output from the language model by the K2T algorithm described in Section 2.6.2.1. Implementation of this algorithm is very user-friendly and compatible with HuggingFace’s library and PyTorch (https://pytorch.org/), since all source code is available with instructions on the authors’ GitHub repository (https://github.com/dapascual/K2T). Table 3.4 shows the hyperparameter values used in Equation (2.14), where T refers to the maximum sentence length, not temperature. These parameters were chosen after running some examples with the evaluation dictionary to see which values are optimal for this task.
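The project relied on the authors' published K2T code (see the repository referenced above); purely to illustrate the shift of Equations (2.13)–(2.14) with the hyperparameters of Table 3.4, a simplified sketch could look as follows, where similarity is assumed to be a precomputed vector of GloVe cosine similarities between every vocabulary token and the keyword, and keyword_id is the keyword's token index:

import numpy as np

# Hyperparameters from Table 3.4; lambda_0, c, and T refer to Equation (2.14).
LAMBDA_0, C, T_MAX = 0.8, 0.25, 30

def k2t_shift(scores, similarity, step, keyword_id, keyword_generated):
    """First decoding step (simplified sketch): shift the word scores log P(x|X)
    towards tokens semantically similar to the keyword, Equations (2.13)-(2.14)."""
    if keyword_generated:
        return scores                                   # lambda is set to 0 once the keyword appears
    if step >= T_MAX:
        forced = np.full_like(scores, -np.inf)          # t >= T: force the keyword itself
        forced[keyword_id] = 0.0
        return forced
    lam = LAMBDA_0 * np.exp(C * step / T_MAX)           # exponentially growing boost strength
    return scores + lam * np.maximum(0.0, similarity)   # boost only positively similar words

The shifted scores then replace the original log-probabilities before the second decoding step described next.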
After the word scores have been shifted in the first step, the vocabulary from which sampling is performed is truncated to mimic the informational density of the dataset, which the model learned during fine-tuning. This is illustrated as the second step in Figure 3.3 and is equally intuitive to implement as the K2T probability shift: a TypicalLogitsWarper (https://huggingface.co/docs/transformers/internal/generation_utils#transformers.TypicalLogitsWarper), implemented directly in the Transformers library by HuggingFace, was used. This object accepts the probability mass hyperparameter, τ in Equation (2.15), as an argument and works by setting the word scores of tokens excluded by the truncation to −∞, effectively reducing their probability of being sampled to 0. When generating sentences, a parameter value of τ = 0.5 was used, which is also included in Table 3.4.

Table 3.4: Hyperparameters used for the K2T probability boosting of the keyword and semantically similar words, as well as for locally typical sampling.

λ0  0.8
c   0.25
T   30
τ   0.5

3.3 Evaluation

This section presents the pipeline used for evaluating the performance of the model throughout the distinct stages of the project.

3.3.1 Evaluating Fine-Tuning

As mentioned in Section 3.2.1, fine-tuning was carried out for 3 epochs, but to avoid overfitting, the model was evaluated every 1,500 steps on the validation dataset of bin 4.2 in Figure 3.1. A validation batch size of 32 was used, and early stopping was implemented whereby training would stop if the validation loss did not improve for 3 evaluations on the validation set. At the end of training, the best-performing model on the validation dataset was saved and used during the rest of the project. The parameters used to evaluate the fine-tuning are presented in Table 3.5, and the loss data can be found in Section 4.1.

Table 3.5: Parameters used for evaluation during fine-tuning of the model.

Evaluation batch size    32
Evaluation steps         1,500
Early stopping patience  3

3.3.2 Evaluating Generated Text

To evaluate the sentences generated by our model, a list of 500 keywords was constructed using the top 1,000 most frequently used English words, ranked based on the one-billion-word Corpus of Contemporary American English (COCA) [90] (samples available at https://www.wordfrequency.info/samples.asp). From this ranking, the words in places 501–1,000 were extracted; the first 500 words were discarded on the reasoning that they were either so common as to be uninteresting (e.g., a, the, of, and) or common enough to be taught very early in the language learning process and thus not really in need of exemplification (e.g., house or they). For every word in this list, three sentences were generated, giving a total of 1,500 sentences, and on these sentences a suite of five metrics was calculated. Apart from cosine similarity, perplexity, and MAUVE, which were introduced in Section 2.7, the proportion of sentences in which the keyword was generated and the proportion of sentences that were identical to an example in the dataset were calculated. The percentage of generated sentences containing the prompt word is important to measure since it is one of the key requirements of our model.
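Both proportions are straightforward to compute; the following is a minimal sketch, assuming the generated sentences, their prompt words, and the Tatoeba reference sentences are available as plain Python lists (punctuation handling is omitted for brevity):

def keyword_and_duplicate_rates(generated, keywords, reference_sentences):
    """Proportion of generated sentences that contain their prompt word verbatim,
    and proportion that are identical to a sentence in the fine-tuning dataset."""
    references = set(reference_sentences)
    contains_keyword = sum(
        keyword in sentence.split() for sentence, keyword in zip(generated, keywords)
    )
    duplicates = sum(sentence in references for sentence in generated)
    total = len(generated)
    return contains_keyword / total, duplicates / total

A whitespace split is of course a crude tokenization; in practice, punctuation would be stripped before checking for the exact keyword form.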
Furthermore, it is also important to keep track of the degree to which the model generates sentences that are already in the dataset because, if we generate sentences that are identical to some examples in the dataset, why not just use the existing sentences from Tatoeba to practice using the keyword?

The evaluation pipeline described above was computed three times during the project. First, it was computed on sentences generated after fine-tuning to the target dataset; for generation, a temperature of 0.5 was used with a hybrid top-k and nucleus sampling scheme with k = 75 and p = 0.75. Second, to evaluate the effect of adding the K2T probability shift, the temperature was set to 1 and K2T was implemented with the hyperparameters found in Table 3.4. This intermediary model used the same hybrid sampling scheme as the one described above. Lastly, the full implementation of SpeakEasy was evaluated with the hyperparameter set found in Table 3.4.

There is a significant possibility that automatically applied metrics are not able to capture and reflect the specific aspects of what makes a sentence a good example for a language learner to practice on. To evaluate the evolution of the “usefulness” of the generated sentences, human evaluation was used.

3.3.2.1 Human Evaluation

One big factor to take into account when evaluating the performance of a model in the specific domain of language learning is how useful the generated sentences are for exemplifying the proper use of a word to someone learning a new language. This is a rather intangible property of a text and something which is not easily reflected via an automatically calculated metric.

To perform this evaluation, a questionnaire was used containing 100 questions, each of which prompted respondents to rank four sentences containing the same keyword on how useful they were to language learners. Examples of questions are presented in Figure 3.4. The keywords were the first 100 words in the list used for the automatic evaluation described above, i.e., the 501st to the 600th most frequently used English words. One of the four sentences was sampled randomly from the human-written sentences in the Tatoeba dataset containing the keyword, while the remaining three were generated, one from each of the models evaluated in the automatic evaluation. One thing to keep in mind is that the evaluators were not professional linguists but rather friends and students, so the results will not be viewed from a pedagogical perspective.

The four example sentences were presented in a random order, and respondents were instructed to rank them from 1 to 4, where 4 represented the worst and 1 the best sentence. Furthermore, the instructions for the survey included the following list of criteria to have in mind when ranking the examples, in descending order of importance:

1. Grammatical correctness and sense-making: The most crucial aspect is that the sentence is grammatically correct and makes sense. If a sentence fails in this regard, please rank it last.

2. Exact inclusion of the keyword: Check if the keyword is included in the sentence exactly as it is given in the question. If it is not, it should be ranked lower.

3. Exemplification of the prompt word: If both the grammar and keyword criteria are fulfilled, distinguish between sentences based on how well they exemplify the given prompt word.
Consider whether the sentence effectively conveys the meaning of the prompt word.

4. Usefulness to language learners: Finally, evaluate how useful each sentence would be for a language learner to train on. Consider whether the sentence provides meaningful context and aids in language comprehension and learning.

Figure 3.4: Example of a question in the human evaluation questionnaire.

4 Results and Analysis

In this chapter, the results of the experiments conducted during this project are presented and analyzed. First, Section 4.1 presents the results of the fine-tuning of the GPT–2 model. Then, examples of sentences generated by SpeakEasy are presented in Section 4.2. Section 4.3 provides a presentation and analysis of the results from the automatic- and human evaluation.

4.1 Fine-Tuning

Fine-tuning of the model lasted approximately eight hours, and all three epochs elapsed without early stopping. Average (over batch) loss values on both the training- and validation data are presented in Figure 4.1, where the loss function is the cross entropy of Equation (2.10). After an initial transient at the beginning of each epoch, the model quickly converges, and small improvements are made between epochs.

Figure 4.1: Plot of training- and validation loss during fine-tuning of the GPT–2 model for three epochs.

4.2 Example SpeakEasy Outputs

Table 4.1 presents some examples of sentences generated by SpeakEasy given the keyword as input to the model. Generating an example sentence from a given keyword with SpeakEasy takes on average 0.45 ± 0.19 seconds, measured over 1,000 sentence generations. This time includes tokenizing the prompt word, generating the sentence token by token, and decoding the generated sentence from the word tokens. Although we have not found any studies, nor performed one of our own, investigating the average time for a human to write example sentences, we are convinced a human would not be faster than SpeakEasy. This is especially true when generating a large number of sentences in short succession, since a human would most probably experience fatigue and slow down with time, but a computer would not.

Table 4.1: Examples of sentences generated by SpeakEasy.

Keyword   Sentence generated
ways      There are many ways to express this.
voice     His voice began to be heard over the whole room.
ready     We’ll be ready in five minutes.
strong    Tom is a strong person.
society   As a society, we can provide food and clothes to the poor.
single    She has a single mother.
results   We could have found better results.
student   She is a student at this school.
hair      He shaved his hair.
medical   A medical emergency is an urgent and necessary one.

4.3 Evaluation Results

This section presents the resu