Relevant Phrase Generation for
Language Learners

Master’s thesis in Computer science and engineering - Data Science & AI

EDVIN LIDHOLM
DAVIDE PINTI

Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2023


Master’s thesis 2023

Relevant Phrase Generation for
Language Learners

SpeakEasy: A language model that makes it easy for students to
practice their language skills by generating example sentences

EDVIN LIDHOLM
DAVIDE PINTI

Department of Computer Science and Engineering
Chalmers University of Technology

University of Gothenburg
Gothenburg, Sweden 2023


Relevant Phrase Generation for Language Learners
SpeakEasy: A language model that makes it easy for students to practice their
language skills by generating example sentences
EDVIN LIDHOLM
DAVIDE PINTI

© EDVIN LIDHOLM, 2023.
© DAVIDE PINTI, 2023.

Supervisor: Richard Johansson, Department of Computer Science and Engineering
Examiner: Moa Johansson, Department of Computer Science and Engineering

Master’s Thesis 2023
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Image generated by Canva.com based on a description of our project as
image prompt.

Typeset in LATEX
Gothenburg, Sweden 2023

iv

https://www.canva.com/


Relevant Phrase Generation for Language Learners
SpeakEasy: A language model that makes it easy for students to practice their
language skills by generating example sentences
EDVIN LIDHOLM
DAVIDE PINTI
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg

Abstract
In recent times Artificial Intelligence, Natural Language Processing (NLP) especially,
has spread widely. Nowadays, most people use it, either directly or indirectly, and
you can find it almost everywhere: from social networks, to generating images based
on text prompts, to the automatic grammar checker of written text. In the field
of NLP, the generation of text through large language models (LLMs) is becoming
more and more dominant, especially since the release of the Transformer in 2017.
The objective of this study was to leverage the powerful tools of NLP to generate con-
textually appropriate English sentences for language learners, when given a specific
English keyword as input. Other papers in the field of Intelligent Computer-Assisted
Language Learning (ICALL) have generated examples for language learners by ei-
ther retrieval- or ranking-based models, but this is the first time generative language
models have been used in this context. We claim that our model developed in this
project, SpeakEasy, is useful for language learners that are trying to learn a new
language. The generated sentences have three important characteristics: (1) They
are relevant in context to the keyword; (2) They are in “simple” English suitable
for language learners; (3) They always include the exact form of the keyword given
in the prompt. We achieve this by first fine-tuning a GPT–2 model implemented
by HuggingFace to a dataset of human-written sentences specifically tailored to lan-
guage learners from Tatoeba.org. A decoding algorithm consisting of two steps was
implemented. Initially, a shift to the probability distribution the context around the
keyword was applied using Keyword2Text for controlled generation. Subsequently,
the vocabulary is truncated to only include words that have an information content
close to the language model itself, picked up during domain adaptation, using lo-
cally typical sampling. The generated sentences are similar to the human-written
examples with a MAUVE score of 0.755 and an average cosine similarity of 0.577.
Moreover, only 1.83% of sentences generated were identical to entries in the dataset
and a sentence took on average 0.45 seconds to be generated. Qualitative human
evaluation showed that examples generated by SpeakEasy did not only beat a fine-
tuned version of GPT–2 with hybrid top-k and nucleus sampling scheme, but also
out competing the human-written sentences with an average rank of 2.13 on a scale
from 1 (best) to 4 (worst).

Keywords: Natural Language Processing, Automatic Text Generation, Language
Learning, Transformer, GPT–2.

v

https://tatoeba.org/


Acknowledgements
We would like to express our sincere gratitude to our supervisor Richard Johansson,
from the Department of Computer Science and Engineering at Chalmers University
of Technology, who has provided guidance, support, and valuable insights through-
out the project. Their expertise and constructive feedback have been invaluable
in shaping our research. We also extend our appreciation to our examiner, Moa
Johansson, for their time and effort in evaluating our work and providing valuable
suggestions for improvement. Finally, a big thanks goes out to all our family mem-
bers and friends who have supported us during our studies.

Edvin Lidholm & Davide Pinti, Gothenburg, 2023-06-12

In loving memory of Anna-Lena Lidholm, 1968–2021. ♡

vii


List of Acronyms

Below is the list of acronyms that have been used throughout this thesis listed in
alphabetical order:

ANN Artificial Neural Network
ATG Automatic Text Generation
BERT Bidirectional Encoder Representations from Transformers
CBoW Continuous Bag-of-Words
BLEU BiLingual Evaluation Understudy
DL Deep Learning
GloVe Global Vectors for Word Representation
GPT Generative Pretrained Transformer
GPU Graphics Processing Unit
ICALL Intelligent Computer-Assisted Language Learning
LM Language Model
LSTM Long Short-Term Memory
ML Machine Learning
MLM Masked Language Modeling
NLG Natural Language Generation
NLP Natural Language Processing
NLU Natural Language Understanding
NSP Next Sequence Prediction
PE Positional Encoding
PPL Perplexity
ReLU Rectified Linear Unit
RL ROUGE–L
RN ROUGE–N
RNN Recurrent Neural Network
ROUGE Recall-Oriented Understudy for Gisting Evaluation
SOTA State-Of-The-Art

ix


Contents

List of Acronyms ix

List of Figures xv

List of Tables xvii

1 Introduction 1
1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Aim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Specification of Issue Under Investigation . . . . . . . . . . . . . . . . 2
1.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.5 Delimitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.6 Structure of the Thesis Report . . . . . . . . . . . . . . . . . . . . . . 4

2 Theory 5
2.1 Research Area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2.1.1 Natural Language Processing . . . . . . . . . . . . . . . . . . 5
2.1.2 Automatic Text Generation . . . . . . . . . . . . . . . . . . . 6

2.2 Deep Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.2 RNN and LSTM . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2.3 Transformer Architecture . . . . . . . . . . . . . . . . . . . . . 9

2.2.3.1 Encoder . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2.3.2 Decoder . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2.3.3 HuggingFace . . . . . . . . . . . . . . . . . . . . . . 12

2.3 Word Tokenization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.3.1 Word Embeddings . . . . . . . . . . . . . . . . . . . . . . . . 14

2.3.1.1 Word2Vec . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.1.2 GloVe . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3.1.3 FastText . . . . . . . . . . . . . . . . . . . . . . . . . 16

2.4 BERT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4.1 Sentence-BERT . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.4.1.1 Siamese and Triplet Network Architecture . . . . . . 18
2.4.2 KeyBERT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.5 Autoregressive Language Models . . . . . . . . . . . . . . . . . . . . . 18
2.5.1 Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . 20

xi


Contents

2.5.2 GPT–2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.5.2.1 Background and Description . . . . . . . . . . . . . . 20

2.6 Decoding Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.6.1 Standard Approaches to Decoding . . . . . . . . . . . . . . . . 21

2.6.1.1 Maximization-Based Decoding . . . . . . . . . . . . . 21
2.6.1.2 Random Sampling . . . . . . . . . . . . . . . . . . . 23

2.6.2 Domain-Relevant Decoding Algorithms . . . . . . . . . . . . . 25
2.6.2.1 Keyword2Text Decoding . . . . . . . . . . . . . . . . 25
2.6.2.2 Typical Sampling . . . . . . . . . . . . . . . . . . . . 26

2.7 Evaluation Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.7.1 Cosine Similarity . . . . . . . . . . . . . . . . . . . . . . . . . 27

2.7.1.1 Cosine Similarity for SBERT Sentence Embeddings . 27
2.7.2 Perplexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.7.3 MAUVE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3 Methods 31
3.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.1.1 Dataset Description . . . . . . . . . . . . . . . . . . . . . . . . 31
3.1.2 Pre-Processing of Data . . . . . . . . . . . . . . . . . . . . . . 31

3.2 Model Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.1 Fine-Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.2.2 Decoding Algorithms . . . . . . . . . . . . . . . . . . . . . . . 35

3.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
3.3.1 Evaluating Fine-Tuning . . . . . . . . . . . . . . . . . . . . . 36
3.3.2 Evaluating Generated Text . . . . . . . . . . . . . . . . . . . . 37

3.3.2.1 Human Evaluation . . . . . . . . . . . . . . . . . . . 37

4 Results and Analysis 41
4.1 Fine-Tuning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 Example SpeakEasy Outputs . . . . . . . . . . . . . . . . . . . . . . 42
4.3 Evaluation Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.3.1 Automatic Evaluation . . . . . . . . . . . . . . . . . . . . . . 42
4.3.2 Human Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 45

4.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

5 Discussion 49
5.1 Potential Errors in Methods and Results . . . . . . . . . . . . . . . . 49

5.1.1 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.1.2 Model Implementation . . . . . . . . . . . . . . . . . . . . . . 50
5.1.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.1.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

5.2 Ethical Considerations of the Project . . . . . . . . . . . . . . . . . . 52
5.3 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

6 Conclusion 55
6.1 Project summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

xii


Contents

Bibliography 57

A Appendix 1 I
A.1 Removed Sentences . . . . . . . . . . . . . . . . . . . . . . . . . . . . I

xiii


Contents

xiv


List of Figures

2.1 Schematic representation of a single neuron in an ANN . . . . . . . . 7
2.2 A multi-layer perceptron neural network with one hidden layer of five

neurons. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 a) shows a recurrent neural network and b) shows the same RNN

unfolded in time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.4 Schematic representation of the LSTM architecture. . . . . . . . . . . 9
2.5 The Transformer model architecture according to the paper by Vaswani

et al. (2017) [31]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.6 Structure of the Word2Vec network, where W and S are V × E and

E × V matrices, respectively. . . . . . . . . . . . . . . . . . . . . . . 15
2.7 Weighting function of the GloVe word embedding model for various

values of α, but with fixed xmax. . . . . . . . . . . . . . . . . . . . . . 16
2.8 Illustration of autoregressive text generation from a language model,

where the input to the model at time step t is the sequence of previ-
ously generated tokens. . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.9 Example of teacher forcing where the gold-standard reference sentence
used is “The lion hunts...”. . . . . . . . . . . . . . . . . . . . . . . . . 20

2.10 Probability distribution over a vocabulary consisting of five words
given a starting prompt, X. . . . . . . . . . . . . . . . . . . . . . . . 22

2.11 Example of a sequence generation where a greedy decoding strat-
egy yields a suboptimal output. The “Beginning of Sequence”-token,
<BOS>, tells the model to start generating text. . . . . . . . . . . . . 22

2.12 Illustration of how different values of temperature, T , effects the prob-
ability distribution from the example in Figure 2.10 . . . . . . . . . . 25

2.13 Illustration of the pipeline for the text comparison metric. . . . . . . 28

3.1 Schema of the pre-processing pipeline for our dataset. . . . . . . . . . 32
3.2 Histogram showing distribution of sentence length (in terms of num-

ber of words) of dataset before and after truncation. . . . . . . . . . . 34
3.3 Illustration of the step-by-step effect our proposed decoding algorithm

has on the probability distribution before sampling from the language
model, where the x-axis are tokens in the vocabulary and the y-axis
is the sampling probability distribution. The first step is a shift to
the original distribution by the K2T method, and the second step
represents the truncation of the vocabulary by the locally typical
sampling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

xv


List of Figures

3.4 Example of a question in the human evaluation questionnaire. . . . . 39

4.1 Plot of training- and validation loss during fine-tuning of the GPT–2
model for three epochs. . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4.2 Distribution of perplexity scores for all four example sentence sources,
where the count in each bin is normalized to a percentage of the total
number of sentences. . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

4.3 Histograms for the individual sentence sources from the evaluation
survey. Sample means are marked by dashed black lines. . . . . . . . 46

4.4 Bar graph indicating how the different ranks were distributed between
the example sentence sources. . . . . . . . . . . . . . . . . . . . . . . 46

4.5 Box plot showing the spread of the average rank per question over all
respondents for every example source. . . . . . . . . . . . . . . . . . . 47

xvi


List of Tables

2.1 Character sub-N-grams in the FastText representation of the word
dogs with 3 ≤ N ≤ 6. . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.1 Examples of keywords extracted from sentences by KeyBERT [55]. . . 33
3.2 Average number of words per sentence (µ) and standard deviation (σ)

of the dataset before and after truncation. . . . . . . . . . . . . . . . 34
3.3 Fine-tuning parameters for model. . . . . . . . . . . . . . . . . . . . . 35
3.4 Hyperparameters used for K2T probability boosting of keyword and

semantically similar words as well as locally typical sampling. . . . . 36
3.5 Parameters used for evaluation during fine-tuning of model. . . . . . 36

4.1 Examples of sentences generated by SpeakEasy. . . . . . . . . . . . . 42
4.2 Automatic quality metrics for the different model stages throughout

the project, as described in Sections 2.7 and 3.3.2. . . . . . . . . . . . 43
4.3 Percentages of sampled sentences and successfully generated keywords

from the examples generated by the autoregressive Transformer lan-
guage models. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

4.4 Minimium, median and maximum perplexity scores for the Tatoeba
dataset, fine-tuned GPT–2, K2T-controlled generation and SpeakEasy. 44

4.5 Mean rank afforded to each type of example sentence during human
evaluation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

A.1 Sentences manually removed from dataset. . . . . . . . . . . . . . . . I

xvii


List of Tables

xviii


1
Introduction

This chapter provides an introduction to the research subject. First, the context for
the project is presented in Section 1.1, followed by a description of the aim of the
study in Section 1.2. Section 1.3 outlines the the research questions that our study
aims to address. An overview of the related research in this field can be found in
Section 1.4, and the delimitations of the project in Section 1.5.

1.1 Background
In today’s era, the pursuit of learning new languages has become widespread due
to the increasing demands of travel, work, and personal interests. Consequently,
the methods of language acquisition have also evolved. The advent of advanced
technology has brought about a revolution, enabling easy access to the internet and
a vast array of resources at our fingertips. This progress has led to the develop-
ment of language learning applications such as Duolingo, Babbel, and others. Here,
the field of natural language processing (NLP) comes into play as a crucial compo-
nent, addressing the need for tools that can generate words and sentences to assist
students.

The domain of language learning holds immense significance and is rapidly expand-
ing, with its applications spanning across education, business, and personal devel-
opment [1]. Among the primary challenges in language acquisition is the ability to
comprehend and produce relevant and coherent language within various contexts.
Leveraging NLP techniques has the potential to transform the way we teach and
learn languages by automating the generation of meaningful phrases and sentences
for learners to practice with. This paradigm shift can enhance both the efficiency
and effectiveness of language learning endeavours.

By employing a large language model, we aim to create an exemplary sentence gen-
eration system for language learners, surpassing existing retrieval or ranking-based
methods, mentioned in Section 1.4. One reason to use NLP models to generate
sentences exemplifying the use of a word is that the more advanced methods can
grasp intricate grammar rules, idiomatic expressions, and cultural nuances, creat-
ing diverse and authentic sentences that closely resemble human language. This
adaptability allows learners to encounter a wide range of sentence structures and
vocabulary. Furthermore, Machine Learning (ML) models are highly scalable. They
efficiently handle large volumes of data, enabling them to learn from extensive lan-

1


1. Introduction

guage corpora and generate a plethora of example sentences. This scalability ensures
access to diverse and relevant examples. Lastly, ML models facilitate continuous im-
provement. They can be trained and fine-tuned using user feedback, allowing them
to adapt and enhance their sentence generation capabilities over time. By incorpo-
rating user preferences, corrections, and linguistic insights, the models become more
accurate and better aligned with learners’ specific needs and preferences.

1.2 Aim
The aim of this project is to achieve a method of generating English phrases of
suitable level for language learners correlated with a specific keyword. A keyword is
a single word working as an input for generating the sentences with context around it.
Furthermore, these sentences should have some level of “quality”, meaning they are
not just trivial phrases, but with information relevant to the keyword. To illustrate
this one could think of an example where a student wants to practice using the word
“bicycle”. In this case a sentence about riding bicycles, or falling while riding, like
“I learned how to ride a bicycle when I was 6 years old.”, would be useful, but “It
is a bicycle.” would be quite uninteresting.

1.3 Specification of Issue Under Investigation
The following four research questions will be examined:

• Can an NLP-based system be trained to generate contextually appropriate
English phrases and sentences for language learners, when given a specific
English keyword as input?

• How can a decoding algorithm be developed to guide the text generation
towards suitable phrases and sentences for said learners?

• How can the generated phrases and sentences be evaluated for “quality”, in
terms of their relevance and usefulness when learning a new language?

• Can the proposed method of generating phrases for language learners be ap-
plied in a novel way, beyond what has been previously considered in existing
research?

1.4 Related Work
While we are not aware of any previous work that used language models to generate
examples from scratch, several projects in the area of intelligent computer-assisted
language learning (ICALL) have investigated the use of retrieval from large corpora
for finding illustrative examples for purposes such as lexicon development, vocab-
ulary exemplification for learners, and exercise generation. Pilán et al. (2016) [2]
gives an overview of work in this area.

2


1. Introduction

Early work includes the GDEX algorithm [3], which retrieves examples to illustrate
word use for lexicographers using a rule-based approach. Firstly, the GDEX algo-
rithm prioritizes sentences within the range of 10 to 25 words, penalizing those that
are longer or shorter. Moreover, it penalizes sentences containing uncommon or rare
words, meaning they are not part of the 17,000 most frequently used English words.
Sentences with pronouns and anaphors lacking self-contained meaning1 receive penal-
ties due to their need for additional context. Furthermore, preferred sentences begin
with a capital letter and conclude with a full stop, exclamation mark, or question
mark. The authors note that effective examples often introduce contextual informa-
tion and position the keyword towards the sentence’s end, enabling users to infer its
meaning.

In ICALL specifically, Pilán et al. (2013) [4] proposed a ranker similar to GDEX
for selecting examples suitable for learners. They extended the rule-based ranker by
using a machine learning-based approach to classify sentences into their correspond-
ing CERF (Common European Framework of Reference) language levels. Results
showed that 70% of the sentences retrieved by the model were deemed to be of an
appropriate language level by human evaluators. Furthermore, 60% of the retrieved
sentences were suitable as an illustrative example of word usage.

Example selection has also been used in ICALL to generate exercises, typically in
the form of cloze questions (fill-in-the-blank) [5], [6]. Some of these approaches used
language models in order to rank examples according to the typicality of a word in
a context; for instance, Wojatzki et al. (2016) [6] used an N -gram language model
for this purpose. Results showed that their model’s proposed exercises reduced
the ambiguity in the “fill-in-the-blank”-type questions. Furthermore, the authors
also introduced a disambiguation measure, which was proven to effectively discard
exercises that were too ambiguous.

1.5 Delimitations
This thesis project has several limitations that should be considered. Firstly, it is
only focused on English as the target language. This means that the model and
decoding algorithm only is able to generate phrases and sentences in English, and
does not support other languages.

Another limitation of the project is that it only considers one-word prompts for
text generation. This means that the model only generates phrases and sentences
based on a single word input, rather than multiple words or whole sentences. This
may affect the model’s ability to generate contextually appropriate phrases and
sentences, as one-word prompts may not provide enough context for the model to
generate accurate and meaningful output.

Additionally, we did not develop a way of adapting the level of the sentences towards
the student proficiency in the target language. This means that the generated
phrases and sentences are not tailored to the specific language level of the student,

1E.g., this, that, it or one.

3


1. Introduction

thus making the generated phrases and sentences less relevant and useful for some
students who are at different stages of language learning.

In summary, this thesis has limitations in terms of the target language, the type of
prompts used, and the lack of adaptability to students’ prior proficiency level.

1.6 Structure of the Thesis Report
In this thesis, Chapter 2 provides a comprehensive overview of the relevant back-
ground information, including the research field, relevant models, and previous stud-
ies. The methods employed to address the research question is outlined in Chapter 3,
while the results of the study are presented and analyzed in Chapter 4. Finally, in
Chapters 5 and 6, the methodology and results are thoroughly discussed along with
suggestions for future research directions and conclusion, repectively.

4


2
Theory

This chapter provides the theoretical framework for the subject matter. In Sec-
tion 2.1, an overview of NLP and its sub-field automatic text generation (ATG) is
presented. Section 2.2 explores the field of deep learning (DL) and introduces com-
monly used models for NLP tasks, such as recurrent neural networks (RNNs), long
short-term memorys (LSTMs), and the Transformer architecture. In Section 2.3,
tokenization and word embedding are introduced, while Section 2.4 provides a de-
scription of the BERT model with some relevant adaptations of the core model.
Section 2.5 provides a background for autoregressive language models used for gen-
erating text, finishing by introducing the GPT–2 model used in this project. Addi-
tionally, Section 2.6 delves into the algorithms used for decoding, which translate
the probabilities generated by the decoder model into text. Finally, in Section 2.7,
the tools used to evaluate the results are described.

2.1 Research Area
This section introduces the broader research domains of natural language processing,
and the automated generation of text.

2.1.1 Natural Language Processing
NLP is a rapidly growing field that involves the use of computer algorithms to
analyze, understand, and generate human language. It is a multidisciplinary field
that combines the fields of computer science, linguistics, and cognitive psychology
to enable computers to interact with human language [7].

The two primary subfields of NLP are natural language understanding (NLU) and
natural language generation (NLG). NLU focuses on enabling machines to com-
prehend human language by analyzing and processing linguistic structures such as
syntax, semantics, and pragmatics [8]. NLG, on the other hand, is concerned with
producing language and involves generating human-like text that is coherent, gram-
matically correct, and semantically meaningful [9].

NLP has many applications and is used in a wide range of fields, including machine
translation, text classification, question answering, sentiment analysis, and speech
recognition. Machine translation is the process of translating one language into
another [10], while text classification involves categorizing text into different classes

5


2. Theory

or categories [11]. Question answering systems allow machines to understand and
respond to human language queries [12], while sentiment analysis involves analyzing
the emotional content of text [13].

In recent times, DL has emerged as the most commonly used approach to create
NLP models [14], which is a subset of machine learning (ML) that involves the
use of artificial neural networks (ANNs) to analyze and process large amounts of
data. DL algorithms have been shown to outperform traditional statistical methods
in many NLP tasks, and have led to significant advancements in this field [15]. In
particular, the use of autoregressive language models (introduced in Section 2.5) with
transformer architectures has revolutionized the field of text generation, allowing for
the creation of high-quality, human-like text.

Overall, NLP is an exciting and rapidly evolving field that has the potential to
transform the way we interact with machines and with each other through language.
Its applications are wide-ranging and continue to grow as new technologies and
techniques are developed.

2.1.2 Automatic Text Generation

As a subfield of NLG, ATG focuses on generating human-like text automatically. It
is a challenging task as it requires the computer to produce text that is coherent,
grammatically correct, and semantically meaningful. Previous text generation mod-
els have been statistical-based [16], cased-based [17], and rule-based [18]. However,
these models often fail to capture the nuances and complexities of human language,
resulting in stilted and unnatural text [19].

Recently, the state-of-the-art (SOTA) technology in text generation is using deep
learning and autoregressive language models with a transformer architecture [20].
These models are able to generate convincing natural text by training on large
amounts of data and learning the patterns and structures of human language. The
transformer architecture allows for capturing long-term dependencies and contextual
information, leading to better text coherence and fluency [21].

Furthermore, recent advancements in pre-training techniques such as ChatGPT and
GPT–4 have shown promising results in generating high-quality text that can mimic
human-like writing style and tone [22], [23]. With these new technologies, ATG is
becoming an increasingly important area of research, with potential applications in
content creation, virtual assistants, chatbots, and more.

2.2 Deep Learning

This section provides an overview of the deep learning field and highlights the com-
monly employed models for ATG.

6


2. Theory

x1

x2

xn

Σ

θ

w1
w2

wn
σ y

Figure 2.1: Schematic representation of a single neuron in an ANN

2.2.1 Background

Deep Learning is a sub-field in ML that uses ANNs. These networks are inspired by
the structure of the human brain and consist of interconnected nodes that exchange
information, also called neurons [14], [24]. Each node receives inputs from other
nodes and generates a single output, which is then sent to multiple nodes. The
connections between neurons are weighted and these weights determine the influence
of inputs on the output. ANNs are composed of multiple layers, where deeper layers
allows the network to learn more complex patterns. Typically, neurons in one layer
are connected to all neurons in the preceding and following layers [25], also called
a fully connected multi-layer perceptron. This structure allows DL algorithms to
learn abstract representations of data, making it a powerful tool in addressing NLP
challenges, such as automatic text generation [26].

Figure 2.1 provides a visualization of a single neuron in a neural network. The
inputs x1, . . . , xn coming from neurons in the preceding layer are multiplied by the
corresponding weights w1, . . . , wn and then summed up together with a bias, θ. The
result of this addition is then passed through an activation function, σ, to produce
the output y = σ (θ +∑n

i=1 xi · wi) that is transmitted to the next set of neurons in
the network [25].

In Figure 2.2, an ANN with one hidden layer of neurons is illustrated. The input
layer is responsible for receiving the initial data, while the output layer delivers the
model’s result. The hidden layer, positioned between the input and output layers,
performs the computational operations.

In the training process of an ANN, the weights and biases are iteratively adjusted
with the aim of reducing the value of the cost function, which measures the discrep-
ancy between the predicted output of the network and the actual target output. This
is typically achieved through the utilization of backpropagation [27], which allows
for the calculation of the gradient with respect to each weight based on the difference
between the target output and the actual output [28]. In the context of ATG, recur-
rent neural networks and long short-term memory networks have been widely used
in the past, but the current SOTA models follow the Transformer architecture [14].
In the following sections, these various models will be discussed.

7


2. Theory

Input Layer

Hidden Layer

Output Layer

Figure 2.2: A multi-layer perceptron neural network with one hidden layer of five
neurons.

x(t)

U

W

h(t)

y(t)

V

(a)
x(t-1)
U

W

h(t-1)

y(t-1)

x(t)
U

W

h(t)

y(t)

x(t+1)
U

W

h(t+1)

y(t+1)

V V V V

(b)

Figure 2.3: a) shows a recurrent neural network and b) shows the same RNN un-
folded in time.

2.2.2 RNN and LSTM
An RNN is a type of neural network that utilizes feedback loops to incorporate in-
formation from previous time steps into its computation. This structure is depicted
in Figure 2.3, where U , V and W are edge weights; And x (t) and y (t) are input
and output states, respectively. At each time step, the hidden state h (t) is passed
back into the network to be processed again, enabling the network to handle inputs
of varying length and to preserve the order of input sequences, which is crucial in
the field of ATG where sentences or word sequences have variable length and the
arrangement of words affects their meaning [26]. However, RNNs face challenges in
dealing with long-term dependencies, such as gradient vanishing or gradient explo-
sion, where the gradient can become very small or very large over time [29].

An LSTM unit is a type of RNN that utilizes gates to control the memorization

8


2. Theory

process. This unit has three distinct gates, the input, output, and forget gate,
which regulate the flow and modification of information through the neuron. This
advancement in RNN architecture addressed the issues of vanishing gradients and
long-term memory loss [30], and is depicted in Figure 2.4. ct, ht and xt are the cell-,
hidden- and input states at time t, respectively.

ct

ht

ct+1

xt

σ σ tanh

σ

tanh

× +

×

× ht+1

Figure 2.4: Schematic representation of the LSTM architecture.

2.2.3 Transformer Architecture
Vaswani et al. introduced the Transformer network [31] as a type of deep neural net-
work that has become the SOTA model for language modeling and NLP tasks [20].
This model allows for greater parallelization, making efficient use of modern GPU
capabilities for parallel computation. The Transformer architecture has proven ef-
fective in automatic text generation as well. Figure 2.5 depicts the Transformer
architecture as presented by Vaswani et al. (2017), and the subsequent paragraphs
provides a detailed description of the various layers of the model, based on the
original publication.

2.2.3.1 Encoder

This section describes the encoder block, the components in the left part of Fig-
ure 2.5.

Input Embedding: This component involves converting the input sequence of
words into embeddings, which are numerical vectors that represent each word in
the sequence. The embeddings are created in such a way that words with similar
meanings are represented by similar vectors. In the paper [31], 512-dimensional
embedding vectors are used, but other implementations may use embeddings of
different dimensions.

Positional Encoding: This step is added to encode the position of each word in
the input sequence, which is essential for NLP tasks as word order conveys meaning,
and this information is not naturally preserved when computations are done in
parallel. Vaswani et al. achieves this by generating a 512-dimensional vector for
each position in the sequence as in Equation 2.1 [31].

9


2. Theory

Inputs Outputs
(shifted right)

Input Embedding Output Embedding

+ + Positional
Encoding

Positional
Encoding

Multi-Head
Attention

Add & Norm

Feed
Forward

Add & Norm Multi-Head
Attention

Add & Norm

Feed
Forward

Add & Norm

Masked
Multi-Head
Attention

Add & Norm

Linear

Softmax

Output
Probabilities

Figure 2.5: The Transformer model architecture according to the paper by Vaswani
et al. (2017) [31].

10


2. Theory

PEpos,2i = sin
(
pos/10, 0002i/dmodel

)
PEpos,2i+1 = cos

(
pos/10, 0002i/dmodel

) (2.1)

Multi-Head Attention: The self-attention mechanism is a crucial part of the
Transformer architecture that helps the model to understand the relationships be-
tween words in a sentence. Specifically, this mechanism allows the input sequence
to attend to itself. With the help of trainable weight matrices W Q

i , W K
i , W V

i and
the input X the model obtains three representations of the same word for each
attention-head i as such:

• Queries: Qi = XW Q
i

• Keys: Ki = XW K
i

• Values: Vi = XW V
i

From these vectors you calculate the self-attention by multiplying the queries and
keys vectors and dividing by the square root of their dimension. The softmax is
then applied and the values vector is multiplied to the result, see Equation 2.2 [31].

Attention (Qi, Ki, Vi) = softmax

(
QiK

T
i√

dk

)
Vi (2.2)

To pass the attention scores through the feed-forward network the model concate-
nates the output of all heads and multiplies by an output weight matrix W O, as

MultiHead (Q, K, V ) = concat (head1, head2, . . . , headh) W O, (2.3)

where
headi = Attention

(
QW Q

i , KW K
i , V W V

i

)
. (2.4)

Add and Norm: The word embeddings from the previous layer are added with
the output from the Multi-Head Attention mechanism and normalized to ensure
that the mean and variance of each word representation are standardized to 0 and
1, respectively [31].

Feed Forward: The feed-forward network is composed of two linear layers sepa-
rated by a ReLU activation function [32]. The output of this layer is added to its
input, and then the result is normalized using a new Add & Norm layer as above [31].

2.2.3.2 Decoder

This section describes the components in the right part of Figure 2.5, the decoder
block. Only the layers that differ from their respective counterpart in the encoder
are described.

11


2. Theory

Masked Multi-Head Attention: The decoder of the Transformer architecture
generates the output sequence word-by-word, and its output at each step should
only depend on the previous words in the sequence. To achieve this, the decoder
applies a mask to the future tokens. This mask is a matrix, M , filled with 0’s for
the unmasked tokens and −∞ for the masked tokens.

During the computation of the attention score, masking is performed to serve
two purposes. Firstly, masking is used to zero out attention outputs where there
is padding in the input sentences, to prevent padding from contributing to self-
attention. Secondly, it is used to prevent the decoder from “peeking” ahead at the
rest of the target sentence when predicting the next word. To achieve this, the de-
coder masks out input words that appear later in the sequence. The masked elements
are set to negative infinity just before the softmax calculation, see Equation (2.5),
to ensure that softmax turns those values to zero [31].

MaskedAttention (Q, K, V ) = softmax

(
QKT + M√

dk

)
Vi (2.5)

Encoder-Decoder Attention: In the decoder block’s self-attention layer, the
encoder output serves as both the values and keys, whilst the Masked Multi-Head
attention outputs function as queries. This allows the decoder to focus on the
relevant encoder inputs, leading to a more efficient and precise decoding process.
The encoder-decoder attention mechanism is similar to that of sequence-to-sequence
models and enables every decoder position to attend over all positions in the input
sequence [31].

Linear: The Linear layer in the Transformer architecture is responsible for pro-
jecting the decoder output vector into word scores, with a score value assigned to
each unique word in the target vocabulary, at every position in the sentence. This
means that for an output sentence with n words and a target vocabulary with V
unique words, we generate V score values for each of the n words. These score values
indicate the likelihood of occurrence for each word in the vocabulary at that specific
position of the sentence. Essentially, the Linear layer acts as a classifier, producing
a vector of the same length as the vocabulary [31].

Softmax: At the end of the decoder block, there is a layer that produces a proba-
bility distribution for each token in the vocabulary, with each token assigned a score
between 0 and 1 (which all add up to 1.0). This is accomplished by applying the
softmax function to the output of the previous linear transformation layer [31].

2.2.3.3 HuggingFace

HuggingFace is an open-source NLP technologies company that provides a plat-
form for researchers to upload pre-trained models for public use. The Transformers
library1, created by HuggingFace, contains various well-known Transformers archi-

1The library is available at: https://github.com/huggingface/transformers.

12

https://github.com/huggingface/transformers


2. Theory

tecture variants [33]. The library’s primary objective is to make pre-trained models
more accessible and more comfortable to read, develop, and deploy. The commu-
nity around HuggingFace has contributed thousands of fine-tuned models that re-
searchers can download and use for research and educational purposes.

Researchers can download a model’s architecture and pre-train it from scratch or
download pre-trained models by defining a checkpoint, which contains the model’s
current state of weights and tokenizer.

In this thesis, we use the GPT–2 model, one of the models implemented in the
Transformers library, as an autoregressive language model. The library also con-
tains tokenizers that handle the specific encoding and decoding of tokens for each
model. Our specific model is “GPT2LMHeadModel”, which is a GPT–2 implementa-
tion specifically for language modeling.

2.3 Word Tokenization
In NLP, tokenization is a crucial preprocessing step that breaks down text into
smaller pieces, or tokens [34]. The tokenization process can group tokens into words,
subwords, or even as granular units as characters, depending on the chosen tok-
enizer, and studies suggest that these strategies can significantly affect model per-
formance [35], [36]. Tokens are mapped to token-IDs and tracked in a vocabulary.

While word- and character-based tokenization methods are relatively straightforward
to grasp, they have some issues. Word-based tokenization can lead to unmanageably
vocabularies, as every word has its own token. To combat this issue, a vocabulary
of only the most common words can be created, with an “unknown” token assigned
for words not included. However, this approach can lead to a loss of performance as
information is lost for every used unknown token. Furthermore, semantically similar
words (e.g. “bicycle” and “bicycles”) can have different representations in the model,
and will therefore falsely be treated as completely separate entities.

While character-based tokenization mitigates the problem of large vocabularies, it
fails to capture the emergent properties of language where a single character may
not carry as much meaning as a word2. Therefore, the model has to look at several
tokens to interpret the meaning of a word. Additionally, character-based tokeniza-
tion requires the model to handle larger inputs, as a word-based input of only a
small number of tokens can be split up into a large number of characters.

The subword-based tokenization method is a combination of the word- and character-
based approaches and is the most common method used by current SOTA models
in NLP [37]. This method splits uncommon words into subwords while leaving
common sequences of characters intact. The model can build every word in the
document by stitching together the subwords. Subword-based tokenization can learn
pre- and suffixes and grammatical word endings, allowing the model to see the
similarity between the aforementioned singular contra plural example [38]. However,

2For example Japanese Kanji characters can carry a lot of meaning, but this is mostly true for
languages using the Latin alphabet (e.g, English).

13


2. Theory

the partitioning of subword tokens depends on the data used to train the tokenizer
and may not always result in an optimal partition.

The WordPiece tokenizer is a common approach that aims to express the input cor-
pus with a fixed-size vocabulary [37]. If a word is not found in the vocabulary, it
is split into subwords in such a way as to minimize the number of tokens needed.
However, this recursive process of splitting can result in excessive splitting and sig-
nificantly lengthen input text if it is dense in out-of-vocabulary words and subwords.

2.3.1 Word Embeddings
Word embeddings are used in NLP models to represent words as real-valued vectors
instead of one-hot encodings, which are often insufficient for deeper understand-
ing [39]. The goal of word embeddings is to map words with semantic similarities
to similar vectors in a high-dimensional space, where distance between vectors indi-
cates the degree of semantic relatedness between words and cosine similarity as the
measure of closeness (see Section 2.7.1). This is not possible with one-hot encodings
because each word is represented by a binary vector with identical distance to every
other word. The larger the dimension of the space, the better the representation, but
the trade-off is that it requires more data or is slower to train. By using word em-
beddings, NLP models can capture semantic relationships between words, enabling
them to perform more accurate and sophisticated language processing tasks.

2.3.1.1 Word2Vec

Word2Vec is an embedding model which uses a neural network comprising one hid-
den layer and can be trained using two different methods. Developed by Mikolov
et al. (2013) [40], [41] at Google, this algorithm is used to predict words close to
the word to be embedded via a two-layer neural network, the structure of which is
illustrated in Figure 2.6, and then uses the obtained parameters as embeddings.

The first method used is the Continuous Bag-of-Words (CBoW) method, which
learns the word embeddings by implementing a context window around a word, and
then the network tries to predict the word in question. Similarly, the Skip-gram
method also uses a context window, but trains the model to predict the surrounding
words in the training set.

2.3.1.2 GloVe

GloVe (Global Vectors for Word Representation) is a well-known algorithm for ob-
taining word embeddings from large text corpora [42]. Unlike other techniques such
as Word2Vec, GloVe not only considers local context-based information but also
leverages global co-occurrence information to create embedding vectors that cap-
ture both semantic and syntactic properties of words. The training of the model
is done by constructing a co-occurrence matrix that stores the frequency of adja-
cent words occurring together in a given window size for every word in the training
corpus.

14


2. Theory

Input layer 

x1

x2

xi

xV

Output layer 

y1

y2

yk

yV

Hidden layer 

h1

hj

hE

W S

Figure 2.6: Structure of the Word2Vec network, where W and S are V × E and
E × V matrices, respectively.

One of the distinctive features of GloVe is its use of matrix factorization to optimize
the vector representations of words by minimizing a loss function that compares the
dot product of two word vectors and the logarithm of their co-occurrence probability.
GloVe’s resulting word embeddings are applicable to various NLP tasks, such as text
classification [43], information retrieval [44], and machine translation [45], [46].

GloVe embeddings are calculated by first letting Xi be the window size times the
amount of times word vi occurs in the training corpus. Furthermore, let Xi,j denote
the number of times that vj occurs in a window of the given size around word vi.
By defining pj|i = Xi,j/Xi, GloVe deems words vi and vj to be “close” if the ratio
pk|i/pk|j is close to 1 for most words vk.

Similarly to Word2Vec, GloVe also uses target vectors, tj, and context vectors, cj.
They are trained by fitting a parametric model, with the target- and context vectors
as parameters:

tT
i ck − tT

j ck = log
pk|i

pk|j
= log Xi,k/Xi

Xj,k/Xj

. (2.6)

Equation (2.6) holds if tT
i ck = log Xi,k − log Xi. Next, two biases bj and b

(c)
j are

introduced for each j such that log Xi + log Xk = bi + b
(c)
k + bk + b

(c)
i . This works

if tT
i ck + bi + b

(c)
k = log Xi,k. Now the embeddings are calculated by picking target-

and context vectors so as to minimize the double sum,

J =
V∑

i,k=1
f (Xi,k)

(
tT
i ck + bi + b

(c)
k − log Xi,k

)2
, (2.7)

where they weighting function f (·) is introduced to mitigate the logarithm blowing
up if Xi,k = 0, and has the following three properties:

15


2. Theory

xmax
0

0.2

0.4

0.6

0.8

1.0

Xi,j

f
(X

i,
j
)

α = 0.5
α = 0.75
α = 1.0

Figure 2.7: Weighting function of the GloVe word embedding model for various
values of α, but with fixed xmax.

• f (0) = 0 and it vanishes faster than log x as x→ 0+.

• It is non-decreasing.

• f (x) = 1 when x > xmax > 0 after some cutoff xmax.

In their paper, Pennington et al. (2014) use a weighting function parameterized as,

f (x) =

(x/xmax)α if x < xmax

1 otherwise,
(2.8)

with values xmax = 100 and α = 3/4 [42]. An illustration of how Equation (2.8)
behaves for different values of α can be found in Figure 2.7.

2.3.1.3 FastText

Developed by Facebook’s AI Research team [47], FastText is a word embedding algo-
rithm that extends the previously mentioned Word2Vec model. FastText represents
words as bags of character N-grams. This approach enables the model to capture
morphological details and effectively handle out-of-vocabulary words. This stands
in contrast to Word2Vec and GloVe, which fail to provide vector representations for
words outside their pre-defined dictionaries.

By representing words using character N-grams, fastText gains the ability to capture
the meaning of shorter words and understand suffixes and prefixes. This model can
be seen as a bag of words with a sliding window over each word, where the order of the
N-grams within the window is not significant. Each word is represented by itself and
all sub-N-grams for K1 ≤ N ≤ K2

3. Pre-processing of any word is made by enclosing
the word in brackets (e.g., “dogs” becomes “<dogs>”). The full representation of
dogs can be found in Table 2.1. The training process involves employing a skip-gram
model that learns embeddings based on these character N-grams.

3Usually K1 = 3 and K2 = 6.

16


2. Theory

sub-N-grams
N = 3 <do, dog, ogs, gs>
N = 4 <dog, dogs, ogs>
N = 5 <dogs, dogs>
N = 6 <dogs>

Table 2.1: Character sub-N-grams in the FastText representation of the word dogs
with 3 ≤ N ≤ 6.

2.4 BERT
Bidirectional Encoder Representations from Transformers (BERT), is a language
model that was developed by Google in 2018 [48]. Its architectural design closely
resembles that of the Transformer encoder block, illustrated in the left part of Fig-
ure 2.5. BERT has revolutionized the field of NLP as it achieves remarkable results
on various benchmarking tests in the NLP domain [48]. The bidirectional aspect of
BERT allows the model to consider both preceding and succeeding positions when
encoding information.

The initial pre-training of the BERT language model involves training on a large
unlabeled corpus, and subsequent fine-tuning enables it to be tailored for specific
downstream NLP tasks with minimal architectural modifications. This fine-tuning
process requires less data and training compared to the initial pre-training phase,
making BERT highly efficient and adaptable to new tasks after the initial language
modeling is performed.

BERT is pre-trained on two innovative language tasks: Masked Language Modeling
(MLM) and Next Sequence Prediction (NSP). In MLM, a percentage of tokens in
an input sequence are replaced with masking tokens, and the model is trained to
predict the original tokens. During BERT’s pre-training, approximately 15% of
tokens were masked. NSP, on the other hand, involves predicting the relatedness of
two sentences, facilitating a better understanding of sentence relationships in tasks
like question answering. Furthermore, during the pre-training phase, BERT also
learns positional embeddings, which enable the model to internalize the position of
words within a sequence [48].

One of the key advantages of BERT is its ability to capture the context-dependent
meaning of words in a sentence. This means that the BERT embeddings of two
sentences that are similar in meaning will be closer together in the embedding space
compared to embeddings of two sentences that are not similar in meaning. As
a result, BERT embeddings have become an important tool for text comparison
tasks [48].

2.4.1 Sentence-BERT
Sentence-BERT (SBERT) is an extension of the BERT model that uses a Transformer-
based encoder to create fixed-length vector representations for sentences [49]. SBERT

17


2. Theory

employs siamese or triplet network architectures, described in further detail in Sec-
tion 2.4.1.1, to acquire sentence embeddings that capture semantic similarity or
relatedness between sentences.

SBERT produces a concise and fixed-length vector representation, known as a sen-
tence embedding, for each input sentence. These embeddings effectively encode con-
textual information, encapsulating the semantic essence of the sentences. SBERT’s
versatility extends to various applications, including semantic similarity score com-
putation and clustering of similar sentences [49], [50].

Thorough evaluations on diverse sentence semantics tasks have proven SBERT’s
superior performance when compared to previous methods of sentence embedding.
Additionally, SBERT achieves this enhanced performance while maintaining compu-
tational efficiency, enabling its application across a wide range of tasks [49].

2.4.1.1 Siamese and Triplet Network Architecture

The siamese and triplet network architectures in NLP are neural network models
designed to compare and measure the similarity between two input sequences [51].
They are commonly used for tasks such as text similarity detection, paraphrase
identification, and sentiment analysis. The name “siamese” comes from the idea
that the architecture consists of two identical sub-networks. Each arm processes
one of the input sequences independently and produces a fixed-length representation
of the input. These representations are then compared to determine the similarity
between the two sequences [52].

The triplet network architecture is based on the concept of triplets, which consist
of three input sequences: an anchor, a positive example, and a negative example.
The goal of the model is to learn representations in such a way that the anchor
and positive example are closer to each other in the representation space, while the
anchor and negative example are farther apart [53], [54].

2.4.2 KeyBERT
KeyBERT is another extension of the BERT model. It is used as to extract the
most relevant words from a sentences or document, called keywords. KeyBERT
uses SentenceBERT for sentence embedding of the input and computes the cosine
similarity between each word embedding and the embedding of the entire sentence.
The highest scoring words are deemed as the most characteristic of the sentence as a
whole. After calculating similarities KeyBERT returns the k most important words,
determined by the hyperparameter k [55].

2.5 Autoregressive Language Models
Language models aim to capture the probability distribution of generated text [56].
The main objective is to compute the probability P (X) for a given text X =
x1, x2, . . . , xm, where every x is a word in the vocabulary, and leverage this proba-
bility model to generate text.

18


2. Theory

LM

<BOS>

The

LM

The

book

LM

The book

is

Figure 2.8: Illustration of autoregressive text generation from a language model,
where the input to the model at time step t is the sequence of previously generated
tokens.

The language model outputs a probability distribution over the vocabulary, repre-
senting the likelihood of each word in the vocabulary being the next word in the sen-
tence. This distribution is obtained by applying the chain rule of probabilities where
the likelihood of generating word sequence X can be expressed mathematically using
Equation (2.9), where x denotes a word in the dictionary and X<t = x1, . . . , xt−1
represents the sequence of words generated up to the current time step, t.

P (X) =P (x1) · P (x2|x1) · P (x3|x1, x2) · . . . · P (xm|X<m) =

=
m∏

t=1
P (xt|X<t)

(2.9)

In this context, autoregressive refers to the fact that the model applies itself to its
own outputs from previous steps, as is illustrated in Figure 2.8, allowing for se-
quential generation. The autoregressive language model, following the previously
described left-to-right decomposition of text probability, generates the continuation
token by token using a specific decoding strategy. The decoding algorithm deter-
mines the manner in which words are sampled from the probability distribution and
several commonly employed decoding strategies are described in Section 2.6.

To train these language models, raw text data is used, and the training process
involves maximizing the likelihood, which is equivalent to minimizing the cross-
entropy loss

L (X) =
m∑

t=1
− log P (xt|X<t). (2.10)

When training a model that depends on its own predictions a common approach
is to use teacher forcing, where only the gold-standard reference is used as input,
regardless of previous outputs. By doing so, the model learns to make its next
prediction based on the correct input available in the training data, rather than using
the potentially erroneous prediction from the previous time step. This approach
prevents the model from being trained on inaccurate predictions. Figure 2.9 provides
an example demonstrating the application of teacher forcing in practice. Note that
the third input to the model is not based on the words the model has output so far,
contrary to the process in Figure 2.8 [57].

19


2. Theory

LM

<BOS>

The

LM

The

book

LM

The lion

roars

Figure 2.9: Example of teacher forcing where the gold-standard reference sentence
used is “The lion hunts...”.

To ensure effective generation, autoregressive language models are trained on mas-
sive datasets, with models like GPT–3 being trained on a staggering 400 billion
tokens. This extensive training allows the model to capture rich linguistic patterns
and generate high-quality and contextually relevant text [58].

2.5.1 Transfer Learning
The technique of transfer learning, or fine-tuning, involves utilizing a model that
is pre-trained on a source dataset to solve a related problem by further training
it on a target dataset [59]. The aim of transfer learning is to leverage the knowl-
edge gained during pre-training, thereby reducing the amount of data required for
training the model in the target domain. This results in reduced time and com-
putational resources compared to training the model from scratch [60]. Currently,
fine-tuning existing language models is considered a leading method in NLP and
language modeling [61].

2.5.2 GPT–2
This section provides an overview of the GPT–2 model developed by OpenAI and
the HuggingFace Transformers library utilized to perform fine-tuning.

2.5.2.1 Background and Description

Introduced in 2018, the Generative pre-trained Transformer (GPT) is a large lan-
guage model developed by OpenAI [56]. It was designed as a semi-supervised model,
utilizing both unsupervised pre-training and supervised fine-tuning stages. Unlike
previous models that relied solely on supervised learning, GPT’s semi-supervised
approach allowed it to perform well on datasets that were not well-annotated and
to train extremely large models more efficiently [56], [62]. GPT’s architecture was
based on a twelve-layer decoder-only transformer with twelve masked self-attention
heads, and it used the Adam optimization algorithm [56], [63].

GPT was able to achieve robust transfer performance across diverse tasks due to
its use of a transformer architecture, which provided a more structured memory
than could be achieved through previous techniques. During transfer, GPT utilized
task-specific input adaptations derived from traversal-style approaches that process

20


2. Theory

structured text input as a single contiguous sequence of tokens. Despite not being
specifically tailored to individual tasks, GPT was able to outperform discriminatively
trained models with task-oriented architectures on a variety of language processing
tasks [56].

GPT–2 [64] was created as a scale-up of GPT, with its parameter count and dataset
size increased by an entire order of magnitude. Much like the original GPT, GPT–2
was an unsupervised transformer model trained to generate text by predicting the
next word in a sequence of tokens. GPT–2 was trained on a dataset of 8 million web
pages and has 1.5 billion parameters.4 It was evaluated on its performance on tasks
in a zero-shot setting, meaning it was not specifically trained for these tasks [64].

The use of the Transformer architecture enabled GPT-series models to be trained
on larger corpora than previous NLP models due to the ability to parallelize and
self-supervize the training process. GPT–2 was trained on a new corpus, known as
WebText, which was generated by scraping only pages linked to by Reddit posts
that had received at least three upvotes prior to December 2017. The corpus was
subsequently cleaned by parsing HTML documents into plain text, eliminating du-
plicate pages, and removing Wikipedia pages (since their presence in many other
datasets could have induced overfitting) [64].

2.6 Decoding Algorithms
The autoregressive language model described in the previous section will output
a probability distribution over the vocabulary V , which represents the likelihood
of each word in the vocabulary being the next word in the sentence. Figure 2.10
illustrates a toy example of such a distribution for a vocabulary consisting of only five
words. The way in which words are sampled from this distribution is the decoding
algorithm.

2.6.1 Standard Approaches to Decoding
This section introduces the standard strategies used when generating text from a
neural LM, such as greedy decoding, beam search, nucleus- and top-k sampling, and
temperature.

2.6.1.1 Maximization-Based Decoding

Greedy decoding implies selecting the highest-scoring word at each step of the
generation process, as described in Algorithm 1 [65]. While this approach is com-
putationally efficient, with a time complexity of O (|V| · T )5, and can work well for
short sequences, it can lead to problems with longer ones. One issue with greedy
search is that it only considers the highest conditional probability for each token in
the vocabulary, which can result in suboptimal output sequences. This is because

4There also exist smaller versions with 117, 345, 762 million parameters, respectively.
5Where |V| is the size of the vocabulary and T is the maximum sequence length.

21


2. Theory

0.35
0.30

0.15 0.13
0.07

good bad long mine car0

0.2

0.4

0.6

Next token, [x]

P
(x
|X

)

X = The book is

Figure 2.10: Probability distribution over a vocabulary consisting of five words given
a starting prompt, X.

<BOS>

This

I

The

Lamp

0.3

0.2

0.1

0.05

dog

lamp

book

bike

0.1

0.01

0.3

0.05

Figure 2.11: Example of a sequence generation where a greedy decoding strategy
yields a suboptimal output. The “Beginning of Sequence”-token, <BOS>, tells the
model to start generating text.

greedy search hides high probabilities that may be found in subsequent tokens. Fig-
ure 2.11 provides and illustration of this drawback where the algorithm generates
the sequence “This dog” (marked in bold with a total score of 0.3 × 0.1 = 0.03),
while the highest-scoring sequence in actuality is “The book” (marked in red with a
total score of 0.2× 0.3 = 0.06).

Algorithm 1 Pseudocode for greedy decoding
Let X = x1, . . . , xm be some initial token sequence
for (i = m + 1, . . . until stopping criterion is met) do

xi ← arg maxxP (x|X)
append xi to X

end for
return X

Beam search improves upon greedy decoding by considering multiple potential
next words at each step and selects the top N candidates based on their likelihood,
as described in Algorithm 2. The number of options considered is the number of

22


2. Theory

“beams” used in the search, i.e., the beam width. This allows for exploration of
multiple levels of the output and assessment of the quality of all of these tokens
combined [66]. Illustratively, a beam search with N = 2 would find the optimal
solution to the example in Figure 2.11, where a greedy solution fell short.

However, beam search can lead to repetitive and uninformative text if it degrades
into selecting the most probable option repeatedly [67]–[69]. Furthermore, it does
not guarantee finding the output sequence with the highest score [70]. Increasing the
beam size improves the quality of the output sequence, but at the cost of reduced
decoder speed, since the computational complexity O (|V| · T ·N) increase linearly
with the beam width, N . Additionally, there is a saturation point where further
increase in beam size does not improve the quality of decoding anymore [71].

Algorithm 2 Pseudocode for beam search
Let N be the beam width
Let X = x1, . . . , xm be some initial token sequence
B ← [X]
for (i = m + 1, . . . until stopping criterion is met) do

C ← [ ]
for (each b ∈ B) do

compute P (x|b)
add b + [x] to C for all x in the vocabulary, V

end for
B ← select N top-scoring candidates from C

end for
return top-scoring beam in B

2.6.1.2 Random Sampling

Sampling is a stochastic process where the next word/token is selected randomly
based on the probabilities, as described in Algorithm 3. Deterministic methods such
as greedy decoding and beam search have a problem of repetition and blandness,
respectively, and random sampling offers a trade-off between coherence and diversity.
However, this method can lead to incoherent outputs due to excessive randomness,
as there is no guarantee that the words will fit together.

Algorithm 3 Pseudocode for random sampling
Let X = x1, . . . , xm be some initial token sequence
for (i = m + 1, . . . until stopping criterion is met) do

xi ∼ P (x|X)
append xi to X

end for
return X

When sampling from a large vocabulary, the probability of each token becomes
small, and the possibility of selecting a low-probability token is not negligible. If the

23


2. Theory

selected token is not suitable, the subsequent text generated may become nonsensical,
which is why sampling from only a truncated subset of the vocabulary distribution
is preferred.

Nucleus sampling was proposed by Holtzman et al. (2020) [67]. Instead of sam-
pling from the entire vocabulary, this approach considers a subset called the top-p
vocabulary, which is defined as the smallest set of tokens with cumulative probabil-
ity mass exceeding a pre-determined threshold p. More formally, the truncation set,
V(p) ⊆ V is the solution to the optimization problem in Equation (2.11). The proba-
bility distribution is then re-scaled with regards to this smaller set, from which the
next word is sampled. E.g., nucleus sampling with p = 0.6 applied to the example
distribution in Figure 2.10 would result in the truncation set V(p) = {good, bad}.

min
V(p)

∣∣∣V(p)
∣∣∣

s.t.
∑

x∈V(p)

P (x|X) ≥ p,
(2.11)

The size of the truncated vocabulary is determined dynamically based on the shape
of the probability distribution at each time step. For high values of p, the top-
p vocabulary consists of a small subset of the vocabulary that contains the vast
majority of the probability mass [67]. Furthermore, applying nucleus sampling has
a computational complexity of O (|V| · log |V|) for every word sampled, resulting in
a total complexity of O (|V| · log |V| · T )

Top-k sampling is used to sample from a truncated set of only the k most prob-
able words at each time step, with a similar computational complexity as nucleus
sampling. The truncation set is defined as the top k highest-probability tokens in
the distribution, or the solution V(k) ⊆ V to the maximization problem in Equa-
tion (2.12). E.g., top-k sampling with k = 3 applied to the example distribution in
Figure 2.10 would result in the truncation set V(k) = {good, bad, long}.

max
V(k)

∑
x∈V(k)

P (x|X)

s.t.
∣∣∣V(k)

∣∣∣ ≤ k,

(2.12)

By removing the tail of the probability distribution, top-k sampling can improve the
quality of the generated text and make it less likely to go off-topic. However, the
optimal value of k can vary between different time steps, and selecting an appropriate
value of k can be challenging. This is because the distribution of words can change
at each time step, which means that a value of k that works well in one step may
not work as well in another [67].

Temperature plays a crucial role in sampling-based generation from LMs and
provides a flexible mechanism to control the balance between exploration and ex-
ploitation in ATG [72]. To apply temperature, the logits are divided by a chosen

24


2. Theory

0.47

0.36

0.09
0.06

0.02

0.35
0.30

0.15 0.13
0.07

0.27 0.26

0.18 0.17
0.12

T = 0.5 T = 1.0 T = 2.0
0

0.1

0.2

0.3

0.4

0.5

0.6

Next token, [x]

P
(x
|X

)

Figure 2.12: Illustration of how different values of temperature, T , effects the prob-
ability distribution from the example in Figure 2.10

temperature value, denoted as T , before either sampling directly or further truncat-
ing the vocabulary, e.g., by nucleus- or top-k sampling [73]. This rescaling of logits
helps adjust the distribution of probabilities generated by the softmax function and
its effect is illustrated in Figure 2.12.

When T is set between 0 and 1, the distribution becomes skewed towards high-
probability tokens, effectively reducing the mass in the tail of the distribution. This
adjustment biases the model towards more confident predictions, resulting in a nar-
rower range of sampled tokens. However, it is worth noting that analyses have
highlighted a trade-off between generation quality and diversity when lowering the
temperature [74], [75]. Conversely, higher temperature values, greater than 1, intro-
duce more randomness into the sampling process. This increased randomness leads
to a broader exploration of the probability distribution and encourages the model to
consider a wider range of potential tokens. In extreme cases, when the temperature
approaches infinity, the sampling becomes uniform, meaning that all tokens have an
equal chance of being selected [67].

2.6.2 Domain-Relevant Decoding Algorithms
The decoding algorithms described above are not suitable for our project because
they do not take into account the specific language learning context and may gen-
erate text that is not appropriate for students as it can be both uninformative and
repetitive. Furthermore, they do not offer a way to guarantee that the prompt word
occurs in the generated sentences, which is a key requirement for our model. This
section goes more into depth on the specific methods we used to mitigate these
drawbacks of the standard decoding algorithms.

2.6.2.1 Keyword2Text Decoding

Pascual et al. [76] present an approach to controlled language generation using large
pre-trained language models (like GPT–2). They propose a decoding method called

25


2. Theory

Keyword2Text (K2T), that, similarly to temperature, involves adding a shift of
the probability distribution over the vocabulary towards semantically similar words
based on a given topic or keyword. The authors use cosine similarity of their re-
spective GloVe embedding [42] to measure the semantic similarity between words.
Using the following definition of the score function: score (·|X) = log P (·|X), the
suggested method produces the following shift of the probability distribution to
guide generation toward the semantic space of the given prompt word, w:

score′ (x, w| X) = score (x|X)
+ λ ·max {0, cos (γ (x) , γ (w))} ,

(2.13)

where γ (·) is the GloVe embedding of a word and λ is the strength of the probability
modification. Only words with positive similarity to the prompt word are “boosted”
so as to not negatively effect words that would otherwise be favourable according to
the original score function.

To ensure the eventual generation of the prompt word, the authors propose an
exponential growth of the λ-parameter throughout the generated sentence:

λt =

λ0 exp
{

c·t
T

}
if t < T

∞ otherwise,
(2.14)

at step t. T is the maximum length of the generated string, while λ0 and c are
hyperparameters that control the initial value and growth of λ, respectively. This
boosting of the probabilities for certain words only grows until the keyword has been
generated, at which point λ is set to 0.

The authors demonstrate that this simple method can be used to impose hard con-
straints on language generation, and show that it performs well in practice, leading
to diverse and fluent sentences while ensuring the appearance of given guide words.

2.6.2.2 Typical Sampling

Meister et al. (2022) developed a method to generate text with higher “informative-
ness” [77]. In the article, the authors propose a new approach to generating text
using probabilistic language models. They argue that current models often under-
perform when generating text, and suggest that this may be due to a lack of consider-
ation for the ways in which humans use language as a communication channel. The
authors propose a method they call typical sampling, which involves sampling words
from the set of words with information content close to the conditional entropy of
the model, rather than always choosing words from the high–probability region of
the distribution. Similar to nucleus- and top-k sampling, this is the solution to a
minimization problem where the truncation set, V(τ) ⊆ V , optimizes the following:

min
V(τ)

∑
x∈V(τ)

|H (x|X) + log P (x|X)|

s.t.
∑

x∈V(τ)

P (x|X) ≥ τ,
(2.15)

26


2. Theory

where H (·) is the Shannon entropy6, or the expected information content, of a
random variable with support χ [78] and τ is a hyperparameter determining what
probability mass to include in the truncation. This is done with a computational
complexity of O (|V| · log |V| · T ), equivalent to both nucleus- and top-k sampling.

The authors demonstrate that this approach offers competitive performance in terms
of quality while consistently reducing the number of repetitions, and suggest that
it could be a promising approach for improving the performance of probabilistic
language models in text generation tasks [77].

2.7 Evaluation Methods
In this section, the assessment techniques used in the project are presented.

2.7.1 Cosine Similarity
Cosine similarity is a useful measure for calculating the similarity between two non-
zero vectors in an inner product space. It is calculated as the cosine of the angle
between the two vectors, and the resulting score ranges from −1 to 1. If the score
is 1, the vectors have the same orientation, while a score of 0 indicates that they
are orthogonal. The magnitude of the vectors does not affect the cosine similarity
score. This measure has various applications in natural language processing, includ-
ing determining the similarity between two strings or measuring how similar two
documents are based on the number of occurrences of each word in the document.
A significant advantage of cosine similarity is its computational efficiency, particu-
larly for sparse vectors, as only non-zero coordinates need to be considered. It is
defined as:

cosine similarity (u, v) = cos (θ) = u · v
||u|| ||v||

, (2.16)

where θ is the angle between vectors u and v.

2.7.1.1 Cosine Similarity for SBERT Sentence Embeddings

It’s common to combine the cosine similarity of vectors and the sentence embeddings
from the BERT encoder to measure how similar the generated sentences are to the
fine-tuning dataset and it is illustrated step-by-step in Figure 2.13. The advantage
of using cosine similarity with BERT embeddings is that it provides a simple and ef-
fective way to measure the similarity between two sentences in the high-dimensional
embedding space. Furthermore, the use of BERT sentence embeddings, introduced
in Section 2.4, allows us to capture the contextual information of the sentences
in a fixed-size embedding, which is particularly relevant for tasks that require an
understanding of the meaning of the sentences.

6H (x) = −
∑

x∈χ p (x) log p (x).

27


2. Theory

Generated sentence

List of reference
sentences

SBERT

SBERT

Sentence
Embedding

Pairwise Cosine
Similarity

0.72

0.09

0.21

0.87

0.35

0.54

Maximum
Similarity

SBERT cosine
score

0.87

Figure 2.13: Illustration of the pipeline for the text comparison metric.

2.7.2 Perplexity
Perplexity is an evaluation metric for generative language models, calculated as a
deterministic transformation of log-likelihood into an information-theoretic quantity.
It is defined as:

PPL (X) = 2− l(X)
M , (2.17)

where M is the total number of tokens in the held-out corpus and l is log-likelihood
of word sequence X = x1, x2, . . . , xm,

l (X) =
M∑

t=1
log P (xt|X<t) . (2.18)

Lower perplexities correspond to higher likelihoods, indicating better performance
on this metric. Perfect language models achieve perplexity scores of 1, while the
worst reasonable case scenario is a uniform, unigram model with a vocabulary size
of V, which assigns equal probability 1

V
to all words [79].

In practice, language models tend to produce perplexities in the range of 1 to
V , with state-of-the-art models achieving increasingly lower perplexities on larger
datasets [79]. For example, on the Penn Treebank dataset [80], a well-smoothed
5-gram model achieves a perplexity of 141 [81], while an LSTM language model can
achieve a perplexity of approximately 80 [82], and enhancements to the LSTM archi-
tecture can bring the perplexity below 60 [83]. On larger datasets like the 1B Word
Benchmark [84], perplexities of around 25 can be obtained by averaging together
multiple LSTM language models [85].

2.7.3 MAUVE
In recent years, the development of open-ended text generation has led to the need
for reliable evaluation metrics that can measure the similarity between machine-
generated text and human-written language [86]. A proposed solution to this prob-
lem is MAUVE [87], a measure of comparison that evaluates the statistical gap
between two text distributions: one from a text generation model and the other

28


2. Theory

from human-written text. MAUVE is a library built on PyTorch and implemented
by HuggingFace Transformers.

The measure employs divergence frontiers and scales up to modern text genera-
tion models by computing information divergences in a quantized embedding space.
The effectiveness of MAUVE is achieved by reducing the measurement to computing
Kullback-Leibler divergences [88] in a quantized, low-dimensional space after embed-
ding samples from each distribution with an external language model. The MAUVE
measure consist of the area under the divergence curve, providing a summary of two
type of errors: Type I is when the model put high mass of the probability distribution
in an area which is unlikely to occur human natural language; Type II, instead, is
when the model cannot generate text which is plausible in human content. MAUVE
produces a number in the interval (0,1], where 1 indicates that machine-generated
and human text are the same [87].

The implementation of MAUVE is user-friendly as it yields a scalar measure of the
gap between neural text and human text. The measure can quantify differences
in the quality of generated text based on the size of the model, the decoding al-
gorithm, and the length of the generated text. In an extensive empirical study on
three open-ended generation tasks, MAUVE was found to identify known properties
of generated text, scale naturally with model size, and correlate with human judge-
ments. Moreover, it was observed that MAUVE has fewer restrictions than existing
distributional evaluation metrics [87].

29


2. Theory

30


3
Methods

In this chapter, an outline of the methods and setup used to achieve the objectives
of the thesis will be presented. The dataset selection and processing procedures
are described in Section 3.1. Section 3.2 outlines the model that was utilized along
with its implementation process. Additionally, Section 3.3 provides a comprehensive
explanation of the evaluation approach that was adopted for this research.

3.1 Dataset
This section will provide information about the data used in fine-tuning our model
and the pre-processing pipeline. For further discussion pertaining to the dataset,
see Section 5.1.1

3.1.1 Dataset Description
Tatoeba.org is a website that provides a large collection of example sentences in
various languages [89]. The website allows users to download the data in the form
of a TSV-file1, which can then be used for building a task-specific dataset for fine-
tuning a language model. Downloading the language data from there is relatively
simple, it starts by accessing the Tatoeba website, then selecting the language and
format you want the data to be in.

We chose Tatoeba as it is a good resource for building a task-specific dataset for fine-
tuning a language model for four key reasons. Firstly, it provides a vast collection of
sentences, which allows for building a large dataset for domain-adaptation. Secondly,
the website’s sentences are contributed by users, which provides a diverse set of
sentences which lowers the risk of ideological biases within the dataset. Furthermore,
the data provided by Tatoeba is free and open-source, which allows for easy access
and usage. Lastly, the platform provides sentences in multiple languages, which
makes it easier to extend the work done in this project to other language domains.

3.1.2 Pre-Processing of Data
The pipeline of the dataset pre-processing is illustrated in Figure 3.1.

1The dataset is available at: https://downloads.tatoeba.org/exports/per_language/
eng/.

31

https://tatoeba.org/
https://downloads.tatoeba.org/exports/per_language/eng/
https://downloads.tatoeba.org/exports/per_language/eng/


3. Methods

1

English 
Sentences

1,689,908

2

English 
Sentences
(Cleaned)
1,689,891

Removing duplicates 

3

Complete
Dataset

510,000

Truncating

4

Fine-Tuning
Dataset

500,000

5

Evaluation
Dictionary

5,360

4.1

Training
Dataset

450,000

4.2

Validation
Dataset

50,000

Randomly split
the data into a 
training- and a 

validation dataset
with a 90:10 ratio

dictionary

key-
word

list of
sentences

dataframe

sentence, keyword

Figure 3.1: Schema of the pre-processing pipeline for our dataset.

32


3. Methods

The initial pre-processing included loading the aforementioned TSV-file into a Pan-
das DataFrame and removing empty strings and duplicate sentences. This process
left 1,689,891 sentences after removing just 17 duplicates, and is represented by bin
number 2 in Figure 3.1.

The vast amount of available sentences proved to make training impossible with
the computing resources we had at our disposal. To circumvent this limitation and
to avoid “CUDA Out Of Memory” errors we opted to truncate the dataset to only
the first 510,000 sentences; 500,000 of which made up the target dataset for the
fine-tuning of our model, found in bin number 4 of Figure 3.1.

From each sentence in the fine-tuning dataset, one word was selected as the des-
ignated keyword using KeyBERT [55] and Table 3.1 shows 10 examples of keyword-
sentence pairs extracted by this method. 90% of these sentences were then randomly
selected for the training dataset and the rest were used in the validation dataset,
represented by bins 4.1 and 4.2, respectively. A more in depth description of how
the fine-tuning was carried out can be found in Section 3.2.1.

Table 3.1: Examples of keywords extracted from sentences by KeyBERT [55].

Sentence Keyword
I have to go to sleep. sleep
I was in the mountains. mountains
Is it a recent picture? picture
I’ll do my best not to disturb your studying. disturb
For some reason I feel more alive at night. night
It depends on the context. context
Are you freaking kidding me?! kidding
I don’t want to be lame; I want to be cool!! cool
I don’t intend to be selfish. selfish
Innocence is a beautiful thing. innocence

To evaluate how well the fine-tuning let our model adapt to the domain of educa-
tional sentences we set aside 10,000 sentences and set up a hash table represented
by bin 5. One key-value pair consists of a word and a list of all sentences where the
keyword occurs in these 10,000 sentences. In total the dictionary consists of 5,360
words and for further explanation of how it was used in evaluating the fine-tuning,
see Section 3.3.1.

Figure 3.2 shows how the sentence length distribution of the dataset is affected by the
truncation process in step 2 of Figure 3.1. The blue bars represent the distribution of
bin number 2 in Figure 3.1, while the red bars represent bin number 3. Initially, the
average sentence length was µ = 7.50 words per sentence with a standard deviation
of σ = 3.83. After truncating the dataset, both the average sentence length and
standard deviation decreases to 7.28 and 3.63, respectively. This is summarized in
Table 3.2. Furthermore, ten extremely long2 sentences were removed before fine-
tuning, which can be found in Appendix A.1.

2In this case, “extremely long” means > 15 standard deviations away from the mean.

33


3. Methods

0 5 10 15 20 25 300

5

10

15

20

# of words

%
of

se
nt

en
ce

s

Sentence length distribution

Before truncation
After truncation

Figure 3.2: Histogram showing distribution of sentence length (in terms of number
of words) of dataset before and after truncation.

Table 3.2: Average number of words per sentence (µ) and standard deviation (σ) of
the dataset before and after truncation.

µ σ
Before truncation 7.50 3.83
After truncation 7.28 3.63

3.2 Model Implementation
For the implementation of the model in this research, the HuggingFace library’s
GPT–2 model was selected, specifically the GPT2LMHeadModel3 variant, which is pre-
trained and designed for language modeling. The library also includes other imple-
mentations of GPT–2, such as GPT2DoubleHeadsModel and GPT2ForTokenClassi-
fication, which serve different purposes.

3.2.1 Fine-Tuning
To fine-tune the pretrained GPT2LMHeadModel on additional data, the Trainer class
from the HuggingFace library was utilized. This class enabled the smooth fine-
tuning of the model and was selected as the HuggingFace library is a SOTA library
for NLP model implementations. The previously extracted keyword was given to the
model as a prompt together with the sentence it was extracted from: “BOS w SEP
<sentence> EOS”, where BOS, SEP and EOS are tokens that indicate respectively,
Beginning-Of-Sentence, SEParation and End-Of-Sentence, and then the training
was carried out via teacher forcing, as described in Section 2.5. The model was
fine-tuned on an NVIDIA Tesla V100-PCIE GPU (16GB) for three epochs.

3https://huggingface.co/docs/transformers/v4.30.0/en/model_doc/gpt2#
transformers.GPT2LMHeadModel

34

https://huggingface.co/docs/transformers/v4.30.0/en/model_doc/gpt2#transformers.GPT2LMHeadModel
https://huggingface.co/docs/transformers/v4.30.0/en/model_doc/gpt2#transformers.GPT2LMHeadModel


3. Methods

x x x

PPP

K2T
Typical
Sampling

Figure 3.3: Illustration of the step-by-step effect our proposed decoding algorithm
has on the probability distribution before sampling from the language model, where
the x-axis are tokens in the vocabulary and the y-axis is the sampling probability
distribution. The first step is a shift to the original distribution by the K2T method,
and the second step represents the truncation of the vocabulary by the locally typical
sampling.

The default settings in the HuggingFace trainer was used, but due to hardware
related limitations, batch size during training was reduced to 10 (still using an
evaluation batch size of 32). Parameters used during fine-tuning are presented in
Table 3.3.

Table 3.3: Fine-tuning parameters for model.

Training batch size 10
Number of epochs 3
Starting learning rate 5e−5
Number of warmup steps 200
Weight decay 0.01

3.2.2 Decoding Algorithms
As described in Section 2.6, there are several shortcomings of the commonly used
decoding algorithms. To mitigate these drawbacks a two-step approach was imple-
mented to (1) guarantee that the keyword was generated every time and (2) increase
the informational content of the generated sentences to better exemplify the keyword.
An illustration of the combined effect of this method can be found in Figure 3.3.

In a similar manner to temperature being applied before the random sampling [67],
the first step of the decoding process is a shift to the probability distribution out-
put from the language model by the K2T algorithm described in Section 2.6.2.1.
Implementation of this algorithm is very user-friendly and compatible with Hug-
gingFace’s library and PyTorch4 since all source code is available with instructions
on the author’s GitHub repository5. Table 3.4 shows the hyper parameters values
used from Equation (2.14), where T refers to the maximum sentence length, not
temperature. These parameters were chosen after running some example with the
evaluation dictionary to see what values are optimal for this task.

4https://pytorch.org/
5https://github.com/dapascual/K2T

35

https://pytorch.org/
https://github.com/dapascual/K2T
https://pytorch.org/
https://github.com/dapascual/K2T


3. Methods

After the word scores have been shifted in the first step, the vocabulary, from
which sampling is performed, is truncated to mimic the informational density of
the dataset, which the model learned during fine-tuning. This is illustrated as the
second step in Figure 3.3 and is equally as intuitively implemented as the K2T prob-
ability shift. A TypicalLogitsWarper6 implemented in the Transformer library
directly by HuggingFace. This object accepts the probability mass hyperparamter,
τ in Equation (2.15), as argument and works by setting the word scores of tokens
excluded by the truncation to −infinity, effectively reducing their probability to be
sampled to 0. When generating sentences, a parameter value of τ = 0.5 was used,
which is also included in Table 3.4.

Table 3.4: Hyperparameters used for K2T probability boosting of keyword and
semantically similar words as well as locally typical sampling.

λ0 0.8
c 0.25
T 30
τ 0.5

3.3 Evaluation
This section presents the pipeline used for evaluating the performance of the model
throughout the distinct stages of the project.

3.3.1 Evaluating Fine-Tuning
As mentioned in Section 3.2.1, fine-tuning was carried out for 3 epochs, but to avoid
overfitting, the model was evaluated every 1,500 steps on the validation dataset of
bin 4.2 in Figure 3.1. A validation batch size of 32 was used and an early stopping
was implemented where training would stop if validation loss did not improve for
3 evaluations on the validation set. At the end of training, the best performing
model on the validation dataset was saved and used during the rest of the project.
Parameters used to evaluate the fine-tuning are presented in Table 3.5 and loss data
can be found in Section 4.1.

Table 3.5: Parameters used for evaluation during fine-tuning of model.

Evaluation batch size 32
Evaluation steps 1,500
Early stopping patience 3

6https://huggingface.co/docs/transformers/internal/generation_utils#
transformers.TypicalLogitsWarper

36

https://huggingface.co/docs/transformers/internal/generation_utils#transformers.TypicalLogitsWarper
https://huggingface.co/docs/transformers/internal/generation_utils#transformers.TypicalLogitsWarper


3. Methods

3.3.2 Evaluating Generated Text
To evaluate the sentences generated by our model, a list of 500 keywords was con-
structed by using the top 1000 most frequently used English words, ranked based
on the one billion word Corpus of Contemporary American English (COCA) [90]7.
From this ranking the words in places 501-1,000 were extracted, discarding the first
500 words reasoning that they were common enough to either be uninteresting; e.g.,
a, the, of, and; or common enough to be taught really early on in the language
learning process and not really need exemplification; e.g., house or they.

For every word in this list three sentences were generated, giving a total of 1,500
sentences and on these sentences a suite of five metrics were calculated. Apart from
cosine similarity, perplexity and MAUVE, which were introduced in Section 2.7, the
proportion of sentences in which the keyword was generated and the proportion of
sentences that were identical to an example in the dataset were calculated. The
percentage of generated sentences containing the prompt word is important to mea-
sure since it is one of the key requirements of our model. Furthermore, it is also
important to keep track of to what degree the model generates sentences that are
already in the dataset because, if we generate sentences that are identical to some
examples in the dataset, why not just use the existing sentences from Tatoeba to
practice using the keyword?

The evaluation pipeline described above was computed three times during the project.
First, it was evaluated on sentences generated after the fine-tuning to the target
dataset and for generation a temperature of 0.5 was used with a hybrid top-k- and
nucleus sampling scheme with k = 75 and p = 0.75. Second, to evaluate the effect
of adding the K2T probability shift, the temperature was set to 1 and K2T was im-
plemented with the hyperparameters found in Table 3.4. This intermediary model
used the same hybrid sampling scheme as the previous one described above. Lastly,
the full implementation of SpeakEasy was evaluated with the hyperparameter set
found in Table 3.4.

There is a significant possibility that automatically applied metrics are not able to
capture and reflect the specific aspects of what makes a sentence a good example to
practice on for a language learner. To evaluate the evolution of the “usefulness” of
the generated sentences, human evaluation was used.

3.3.2.1 Human Evaluation

One big factor to take in to account when evaluating the performance of a model in
the specific domain of language learning is how useful the generated sentences are
when exemplifying the proper use of a word for someone learning a new language.
This is a rather intangible property of a text and something which is not easily
reflected via an automatically calculated metric.

To perform this evaluation, a questionnaire was used containing 100 questions which
prompted respondents to rank four sentences containing the same keyword on how

7Available at: https://www.wordfrequency.info/samples.asp

37

https://www.wordfrequency.info/samples.asp


3. Methods

useful they were to language learners. Examples of questions are presented in Fig-
ure 3.4. The keywords were the first 100 words in the list used for the automatic
evaluation described above, the 501st to the 600th most frequently used English
words. One of the four sentences was sampled randomly from the human written
sentences in the Tatoeba dataset containing the keyword, while the remaining three
were generated one from each model evaluated in the automatic evaluation, respec-
tively. One thing to keep in mind is that the evaluators are non-professional linguists
but rather friends and students, so the results will not be viewed from a pedagogical
perspective.

The four example sentences were presented in a random order, and respondents were
instructed to rank them from 1 to 4, where 4 represented the worst and 1 the best
sentence. Furthermore, the instructions for the survey included the following list
of criteria to have in mind when ranking the examples, in a descending order of
importance:

1. Grammatical correctness and sense-making: The most crucial aspect is
that the sentence is grammatically correct and makes sense. If a sentence fails
in this regard, please rank it last.

2. Exact inclusion of the keyword: Check if the keyword is included in the
sentence exactly as it is given in the question. If it is not, it should be ranked
lower.

3. Exemplification of the prompt word: If both the grammar and keyword
criteria are fulfilled, distinguish between sentences based on how well they
exemplify the given prompt word. Consider whether the sentence effectively
conveys the meaning of the prompt word.

4. Usefulness to language learners: Finally, evaluate how useful each sen-
tence would be for a language learner to train on. Consider whether the sen-
tence provides meaningful context and aids in language comprehension and
learning.

38


3. Methods

Figure 3.4: Example of a question in the human evaluation questionnaire.

39


3. Methods

40


4
Results and Analysis

In this chapter, the results of the experiments conducted during this project are
presented and analyzed. First, Section 4.1 presents the results of the fine-tuning
of the GPT–2 model. Then, examples of sentences generated by SpeakEasy are
presented in Section 4.2. Sections 4.3 provides a presentation and analysis of the
results from the automatic- and human evaluation.

4.1 Fine-Tuning
Fine-tuning of the model lasted during approximately eight hours and all three
epochs elapsed without early stopping. Average (over batch) loss values on both the
training- and validation data are presented in Figure 4.1, where the loss function
is the cross entropy of Equation (2.10). After an initial transient during the begin-
ning of each epoch the model quickly converges, and small improvements are made
between epochs.

1 2 30

1

2

3

4

5

6

7

Epochs

Lo
ss

Training loss
Validation loss

Figure 4.1: Plot of training- and validation loss during fine-tuning of the GPT–2
model for three epochs.

41


4. Results and Analysis

4.2 Example SpeakEasy Outputs
The table 4.1 presents some example of sentences generated by SpeakEasy given
the keyword as input to the model. Generating an example sentence from a given
keyword with SpeakEasy takes on average 0.45± 0.19 seconds, measured over 1000
sentence generations. This time includes tokenizing the prompt word, generating the
sentence token-by-token and decoding the generated sentence from the word tokens.
Although we have not found any studies, nor performed one of our own, investigating
the average time for a human to write example sentences, we are convinced a human
would not be faster than SpeakEasy. This is especially true when generating a
large amount of sentences in short succession, since a human most probably would
experience fatigue and slow down with time, but a computer would not.

Table 4.1: Examples of sentences generated by SpeakEasy.

Keyword Sentence generated
ways There are many ways to express this.
voice His voice began to be heard over the whole room.
ready We’ll be ready in five minutes.
strong Tom is a strong person.
society As a society, we can provide food and clothes to the poor.
single She has a single mother.
results We could have found better results.
student She is a student at this school.
hair He shaved his hair.
medical A medical emergency is an urgent and necessary one.

4.3 Evaluation Results
This section presents the resu