Generating subtitles with controllable length using natural language processing
Master's thesis in Computer Science and Engineering

Joakim Svensson
Victor Troksch

Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2022

Master's Thesis 2022

© Joakim Svensson 2022.
© Victor Troksch 2022.

Supervisor: Richard Johansson, Department of Computer Science and Engineering, Chalmers
Advisor: Niklas Jansson & Peter Eklund, Plint AB
Examiner: Moa Johansson, Department of Computer Science and Engineering, Chalmers

Master's Thesis 2022
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX
Gothenburg, Sweden 2022

Abstract

Creating subtitles for video content is a task that has traditionally been performed manually by subtitlers. When creating a subtitle, there are rules and guidelines for how the text should be presented to the viewer. Therefore, a subtitle translated from one language to another often contains linguistic compression in the form of paraphrasing or removing parts of the dialogue. With advances in natural language processing, subtitlers now have tools for machine translation and automatic speech recognition to assist them in their work. This thesis explores various methods for controlling the generated output length of a sequence-to-sequence model, the type of model typically used for text generation and therefore also for machine translation. We apply different modifications to both the model itself and the data to control the output. Furthermore, this project makes use of transfer learning and pre-trained models based on the Transformer architecture. The length ratio method produced the best results, making it possible to effectively control the output length of a generated subtitle. We also found that this method could be applied to a translation model. Although it is a relatively simple method, it produced the desired results with linguistic correctness.

Keywords: Natural Language Processing, NLP, Transformer, seq2seq, text generation, BART, subtitles

Acknowledgements

We want to express our thanks and gratitude to our supervisors, who have helped and supported us throughout this project. At Plint, we would like to thank Niklas Jansson & Peter Eklund for their involvement and support in our daily work. We would also like to thank our supervisor Richard Johansson, from the Department of Computer Science and Engineering at Chalmers University of Technology, who has provided us with important feedback along the way. Finally, we would like to thank all of our family members and friends who have supported us during this time.

Joakim Svensson & Victor Troksch, Gothenburg, June 2022

Contents

List of Figures
List of Tables
1 Introduction
1.1 Background
1.2 Problem Definition
1.3 Aim
1.4 Limitations
1.5 Related Work
1.5.1 Neural Machine Translation with Constraints
1.5.2 Text Summarization
1.5.3 Sentence Simplifications
1.6 Outline
2 Theory
2.1 Natural Language Processing
2.1.1 Word Tokens
2.1.2 Word Embeddings
2.2 Artificial Neural Networks
2.2.1 Feed Forward Neural Network
2.2.2 Recurrent Neural Networks
2.3 Sequence-to-sequence Models
2.3.1 Attention
2.3.2 Teacher Forcing
2.3.3 Autoregressive generation
2.4 Transformer
2.4.1 Self-attention
2.4.2 Positional Encodings
2.5 Pre-Trained Models
2.5.1 BERT
2.5.2 GPT
2.5.3 BART
2.5.4 Marian & MarianMT
2.6 Metrics
2.6.1 Cosine similarity
2.6.2 ROUGE-N
2.6.3 BLEU
2.6.4 METEOR
3 Methods
3.1 Plint Data
3.2 Public Data
3.2.1 Backtranslation with OpenSubtitles
3.2.2 OpenSubtitles - English to Swedish
3.2.3 WikiOpen
3.3 Model
3.3.1 Hugging Face
3.3.2 Transformer with Length Encoding
3.3.3 Transformer with Length Token
3.4 Training
3.5 Evaluation
4 Results
4.1 BART with Length Encodings and Tokens
4.1.1 Baseline
4.1.2 Length Encodings and Length Tokens
4.2 BART with Ratio Tokens
4.2.1 Baseline
4.2.2 Category-based Ratio Tokens
4.2.3 Value-based Ratio Tokens
4.2.4 Manual Token Evaluation
4.3 Marian with Ratio Tokens
4.3.1 Baseline
4.3.2 Category-based Ratio Tokens
4.3.3 Value-based Ratio Tokens
4.3.4 Manual Token Evaluation
4.4 Summary
5 Discussion and Conclusion
5.1 Discussion
5.1.1 Methods of choice
5.1.2 Reasoning about the data
5.1.3 Evaluating the models
5.1.4 Analysing the results
5.2 Conclusions
5.2.1 Future work
Bibliography
A Appendix 1
B Appendix 1

List of Figures

2.1 A visualization of a many-to-many RNN. The network is unrolled in the right-hand side of the figure to visualize the effect of the hidden states h.
2.2 The attention mechanism visualized with an example sequence.
2.3 Teacher forcing exemplified with an RNN. The input at x(3) is not the output from y(2); instead it is the actual ground truth at the associated time step.
2.4 Beam and greedy search visualized by an example. The yellow lines represent the greedy approach, which takes the highest probability for each word and results in a score of 0.20. The green lines represent the beam search, which takes multiple combinations into consideration and results in a score of 0.32.
2.5 Visualization of the Transformer architecture as depicted in the original paper. The yellow part represents the encoder block and the green the decoder block.
2.6 Visualization of a sequence containing 6 words, also with an embedding dimension of 6. Worth noting is that the denominator used for this visualization is altered in order to make the shapes better visible on small example sentences.
2.7 The resulting positional encoding values from Figure 2.6.
3.1 Visualization of the distributions for subword lengths and subword ratios of the OpenBack dataset.
3.2 Visualization of the distributions for subword lengths and subword ratios of the OpenSubtitles dataset.
3.3 Visualization of the distributions for subword lengths and subword ratios of the WikiOpen dataset.
4.1 Average training loss plot of the initial baseline model.
4.2 Distribution of target subword length in the dataset from Plint.
4.3 Average training loss plot of the initial model.
4.4 Visualization of the implemented sinusoidal positional encodings and the original learned positional embeddings of the BART checkpoint. The sinusoidal positional encodings range from -1 to 1, while the original learned embeddings are mostly centered around 0.
4.5 Distribution of the test set based on ratio categories.
4.6 Distribution of the test set based on ratio values.
4.7 The distribution of the category-based ratio tokens in WikiOpen and OpenBack.
4.8 The distribution of each value-based ratio token for the WikiOpen and OpenBack datasets. The x-axis can be interpreted as the corresponding ratio token.
4.9 Distribution of the test set based on ratio categories.
4.10 Distribution of the test set based on ratio values.
4.11 Distribution of the training set for category-based tokens.
4.12 Distribution of the training set for value-based tokens.

List of Tables

3.1 A constructed example of what an entry in the Plint dataset looks like.
4.1 Generated sequences by the BART baseline model fine-tuned on the Plint dataset for 5 epochs. Gen corresponds to a beam search with a beam size of 5 and Gen_pen to a bi-gram penalised beam search with a beam size of 5.
4.2 Generated sequences by the BART model with additional length encodings and length tokens, fine-tuned on the Plint dataset for 5 epochs. Gen corresponds to a beam search with a beam size of 5 and Gen_pen to a bi-gram penalised beam search with a beam size of 5.
4.3 The mean length ratios against the source (LR_src) for each dataset with the baseline models are presented in the columns of this table. The results were evaluated from a fine-tuned BART model on the WikiOpen and OpenBack datasets without any ratio tokens.
4.4 The metrics evaluated for each dataset with the baseline models. The results were evaluated from a fine-tuned BART model on the WikiOpen and OpenBack datasets without any ratio tokens.
4.5 The resulting mean length ratios against the source (LR_src) with category-based ratio tokens on the WikiOpen and OpenBack datasets. The leftmost column contains the evaluated token.
4.6 The evaluated metrics for each dataset with the category-based tokens. The results are evaluated from a fine-tuned BART model on the WikiOpen and OpenBack datasets with tokens representing short, normal, and long sentence ratios.
4.7 The resulting mean length ratios against the source (LR_src) with value-based ratio tokens on the WikiOpen and OpenBack datasets. The leftmost column contains the evaluated tokens and the other two columns contain the corresponding mean length ratios for each dataset.
4.8 The evaluated metrics for each dataset with the value-based ratio tokens. The results are evaluated from a fine-tuned BART model on the WikiOpen and OpenBack datasets with 20 tokens representing length ratios between 0 and 2, with an interval of 0.1. The leftmost column shows the token used to evaluate the model.
4.9 Sentences generated with the BART model fine-tuned on WikiOpen.
4.10 Sentences generated with the BART model fine-tuned on OpenBack.
4.11 Sentences generated with the BART model fine-tuned on WikiOpen.
4.12 Sentences generated with the BART model fine-tuned on OpenBack.
4.13 The mean length ratio against the source (LR_src) for the Marian baseline model is presented in the columns above. It was created by fine-tuning the Marian model on OpenSubtitles data without any additional length tokens.
4.14 The metrics were evaluated with category-based tokens on the OpenSubtitles data using a fine-tuned Marian model. The left column contains the evaluated tokens and the right corresponds to their value.
4.15 The metrics evaluated for each dataset with category-based tokens. The result is evaluated from a Marian model fine-tuned on the OpenSubtitles dataset with tokens representing short, normal, and long sentence ratios.
4.16 The resulting mean length ratios against the source (LR_src) with value-based ratio tokens on the OpenSubtitles dataset. The leftmost column contains the evaluated tokens and the right column corresponds to the mean length ratio for each token.
4.17 The evaluated metrics with the value-based ratio tokens. The results are evaluated from a fine-tuned Marian model on the OpenSubtitles dataset with 20 tokens representing length ratios between 0 and 2, with an interval of 0.1. The leftmost column shows the token used to evaluate the model.
4.18 Examples of sentences generated with the Marian model.
4.19 Examples of sentences generated with the Marian model.
B.1 The evaluated metrics for each dataset with the value-based ratio tokens. The results are evaluated from a fine-tuned BART model on the WikiOpen and OpenBack datasets with 20 tokens representing length ratios between 0 and 2, with an interval of 0.1. The leftmost column shows the token used to evaluate the model.
B.2 The evaluated metrics with the value-based ratio tokens. The results are evaluated from a fine-tuned Marian model on the OpenSubtitles dataset with 20 tokens representing length ratios between 0 and 2, with an interval of 0.1. The leftmost column shows the token used to evaluate the model.

1 Introduction

Creating subtitles for video content is a challenging task that is performed by translators and linguists working as subtitlers. The amount of content that needs subtitling is increasing due to the growth of streaming platforms and international services. There are also regulations, such as the directive from the European Union in 2020 [9], which states that all material provided by the public sector in the union (and therefore in Sweden as well) needs subtitling. The creation of subtitles is a manual process, and partially automating it would be highly beneficial given the work and time resources that it requires.

1.1 Background

In our daily life, we encounter subtitles, whether we think about it or not. It can be, for example, when watching the news or our favourite TV show. With the increase of streaming services and international content comes an increased demand for subtitles. At first glance, it may seem that creating subtitles is an easy task, but there is more to it than just translating dialogues from one language to another. When subtitling, the text must be translated while following certain length rules and specifications [4]. For example, the length restriction exists to ensure that the subtitle actually fits the screen. The length of a subtitle is also affected by the reading speed, which means that it must be possible to read the subtitle in a short period of time, as new subtitles will follow. Furthermore, there are also specifications and rules involving how certain words must be preserved (or censored), when to break lines, how to handle scene cuts, and how to identify speakers (such as a narrator), to mention a few. The challenge lies in creating subtitles that follow these rules while still maintaining the essence of the dialogue.

The creation of subtitles is a task traditionally performed by subtitlers who translate spoken content into subtitles for a specific language. Subtitlers also have to handle the task of segmentation, meaning that they have to partition the subtitles to match the dialogue. Segmentation is usually performed by adding time stamps to each subtitle, which determine when the subtitle will be visible to the viewer. If the segmentation is incorrect, the subtitles can appear at the wrong time and therefore confuse and annoy the viewer.

Another challenge in creating subtitles is that a dialogue can be highly nuanced and could contain things such as colloquialisms, cultural references, and humour. Furthermore, it is also hard to translate certain sayings that are typical in the source language but do not exist in the target language. These are things that are easy for a human to spot and find a different translation for, but much more difficult for a machine.

Recent developments of technologies within natural language processing (NLP), such as automatic speech recognition (ASR) and machine translation (MT), have affected the way subtitlers work. With ASR, they can now get a transcript of the spoken content to use as a template when creating subtitles. This transcript was previously created by listening to the content and manually writing it down. Developments like this can significantly increase the productivity of subtitlers, as was also shown in a study by Campbell in 2019 [6]. Although subtitlers can use aids to assist them in their work, the AudioVisual Translators Europe (AVTE) association claims that fully automated machine translation models are far from taking over the work of media translators [10]. However, they agree on the point of using ASR and MT as a complement in their work as linguists and media translators. At Plint, the company where this project takes place, a software platform is developed and maintained on which its large pool of freelancing subtitlers works to create subtitles that the company can later deliver to its customers.

1.2 Problem Definition

To keep a subtitle within the length limitation, the task of creating subtitles is often also the task of summarizing the content along with translating it. This is a concept called linguistic or semantic compression, which means that the semantics of the sentence are maintained while paraphrasing parts of the sentence or removing redundant words.
Therefore, this problem can be divided into two sub-problems within NLP, namely text summarization and machine translation. This project intends to focus on the summarization part, which means that a subtitle should be generated with a controllable length. The project intends to explore various methods to approach and solve this problem. Furthermore, this involves leveraging current state-of-the-art models within NLP. These models have been trained on large amounts of data and are a suitable starting point, rather than developing a model from scratch.

1.3 Aim

The aim of this project is to generate subtitles with controllable lengths. Controlling the output length is important for a subtitle due to the limited number of characters that it is allowed to have. Restricting the output length of a generated sequence would not only be beneficial for subtitle generation; it would also be beneficial in other tasks involving language modelling, such as machine translation and text summarization.

More specifically, the project will consist of exploring and implementing models with controllable output lengths through either additional tokens, length encodings, or both. These methods will be implemented with existing pre-trained models such as BART [19]. Furthermore, the aim is to evaluate the methods with both quantitative and qualitative measurements to see how the implemented methods influence the model.

1.4 Limitations

The project is limited to working only with models based on the Transformer architecture [38]. This is motivated by the fact that most of the current state-of-the-art models within various NLP tasks use this architecture. Furthermore, the project is limited to only considering pre-trained models. Training a language model from scratch would require more computing power and resources than available. The final limitation is that the project only considers subtitle generation from English to English and from English to Swedish. The limitation of considering only two languages is motivated by the time constraint of the thesis.

1.5 Related Work

The project has taken inspiration from various articles and papers within the fields of machine translation, text summarization, and sentence simplification. This section will mention a few of the articles related to and mentioned in this thesis.

1.5.1 Neural Machine Translation with Constraints

Neural machine translation (NMT) is a well-researched field within NLP, and current state-of-the-art models are all based on the Transformer architecture [38]. However, when generating subtitles, the length of the subtitle is crucial, and hence this task is closely related to the work of constraining the output length of NMT. Many attempts to restrict the output length of a sequence-to-sequence model are inspired by the work of Takase & Okazaki [37]. To preserve a length constraint, they implemented a modified positional encoding in the decoder of the Transformer, which encodes the position with respect to how many tokens are left to be generated in the sequence, rather than how many tokens have been generated. This work inspired Lakew et al. [18] and Niehaus [28], who both use this encoding in their work. They also experimented with the implementation of special tokens to represent a specific length or ratio to their models.

1.5.2 Text Summarization

Translating a wordy dialogue into a subtitle is a complex task and hence a challenge to approach with text summarization. Text summarization is the task of creating a summary that contains the most important and relevant content of the original text. The two most common methods are extractive summarization and abstractive summarization [40]. Extractive summarization works by selecting a subset of words present in the original text and stitching them together to create a summary, while abstractive summarization can paraphrase and insert new words to create a more fluent and coherent summary. Rush et al. [33] developed a model that implements abstractive summarization to create summaries of fixed length. Their summary generation is carried out using a beam search algorithm [12] constrained to a fixed length. Although their model has proven to be effective, it is limited to working within a single language.

1.5.3 Sentence Simplifications

Text simplification is the subfield within NLP that focuses on the automatic simplification of sentences. Simplifying sentences is beneficial for people who are not fluent in a language, such as children, language learners, or people with reading disorders. Simplifying a text or sentence consists of modifying the content and rewriting the structure while still preserving the original meaning. Modifications can consist of paraphrasing words, removing redundant words, or splitting a sentence into two if it helps simplify the original sentence. The relevant work for this thesis is the work of Martin et al. [23][24], which is based on creating Control Tokens to control the text generation of a Transformer simplification model. These tokens are created to represent different features of the relation between the source and target sentences, such as the length difference and the amount of paraphrasing between the two sentences. This method has been proven to work on Transformer models trained from scratch, but also on pre-trained models such as BART [19].

1.6 Outline

This chapter of the thesis has introduced the background of the project, as well as the definition, aim, and limitations of the project. Furthermore, this chapter also covers a short introduction to previous work related to the project. The outline of the remaining chapters of the thesis can be seen in the list below:

• Chapter 2 introduces the theory involved in this thesis. Concepts and theory specific to NLP, such as tokenization and word embeddings, are described along with more general machine learning theory, such as sequence-to-sequence models and the Transformer, together with an introduction to pre-trained models. The chapter is finalised with an introduction to the NLP metrics that are used to evaluate language models.

• Chapter 3 describes the methods used in this thesis. The chapter starts by introducing the data used in the project and the associated pre-processing techniques. In addition, it explains what models and what methods were used. Finally, the chapter describes how the model is trained and evaluated.

• Chapter 4 presents the results for the evaluated methods and also shows a few examples of the generated sequences.

• Chapter 5 discusses and draws conclusions based on the methods and the results. Improvements and suggestions for future work are also presented here.

2 Theory

This chapter introduces and describes the theory and concepts used in this thesis. The chapter starts with a short introduction to Natural Language Processing (NLP) in general and the most important concepts within the field.
Following the introduction of NLP comes an in-depth explanation of some of the most influential deep learning models for NLP, like the sequence-to-sequence (seq2seq) architecture and the Transformer, and an introduction to the pre-trained models used in this thesis. Lastly comes an explanation of the most common metrics and scores used to evaluate our methods and experiments.

2.1 Natural Language Processing

Natural language processing (NLP) is the field within machine learning that involves text and natural language. The field can be described as the interaction between computers and natural language, and the bridge that gives a computer the ability to process and handle it. Natural language is defined as the native speech of people, in contrast to artificial languages that are designed, for example, to control a computer. There are many different topics and subfields within NLP; to name a few, there are tasks like machine translation, text generation, text classification, and question answering.

Modern NLP models are typically constructed and trained with neural networks. This implies that the data require special features that can be interpreted by a machine learning model. Therefore, the data have to go through a number of pre-processing steps in order to transform the raw text into numerical data that a computer can understand. The following sections will cover the most essential methods for representing text with numbers.

2.1.1 Word Tokens

Tokenization is one of the most important preprocessing steps in NLP. Tokenization is a technique used to split up a text, sentence, or document into smaller pieces that are called tokens. The tokens can be divided into groups of words, subwords, or even characters, depending on what tokenizer is used. The tokenizer keeps track of all tokens in a vocabulary, meaning that it maps every token to a specific token ID. To get an understanding of the different tokenization methods, consider the following sentence:

The kids are playing football.

Word-based tokenization: [The, kids, are, playing, football]
Character-based tokenization: [T, h, e, k, i, d, s, a, r, e, p, l, a, y, i, n, g, f, o, o, t, b, a, l, l]
Subword-based tokenization: [The, kid, s, are, play, ing, foot, ball]

Word- and character-based tokenization are probably the easiest to interpret. However, these two methods have some issues. With the word-based method, the vocabulary tends to become very large because every word has its own token (the English language contains more than 500,000 unique words). This issue can be solved by creating a vocabulary consisting of only the most common words and then assigning an "unknown token" to words not included in the vocabulary. However, this method will lead to a loss of performance for the model, since there will be a loss of information at each of the unknown tokens. Another problem is that similar words, like "bird" and "birds", will initially have completely different representations in the model due to the individual tokens.

The problem with large vocabularies is countered by basing the tokenization on characters. However, a single character often does not say much on its own, compared to words in languages using the Latin alphabet (Chinese characters carry more information, for example). This means that the model has to look at several tokens to interpret the meaning of a single word. The model also has to handle larger inputs for every query: a word-based input of 5 tokens could be equivalent to more than 30 character tokens.
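As a concrete illustration of the granularities discussed above, the minimal sketch below splits the example sentence at the word, character, and subword level. It assumes the Hugging Face transformers library; the checkpoint name is chosen purely for illustration, and the exact subword split depends on the vocabulary that tokenizer was trained with, so it will not reproduce the example split above exactly.

```python
# Word-, character-, and subword-level tokenization of the example sentence.
# The subword split comes from a pretrained tokenizer and is checkpoint-dependent.
from transformers import AutoTokenizer

sentence = "The kids are playing football."

word_tokens = sentence.split()                  # word-based split (punctuation stays attached)
char_tokens = list(sentence.replace(" ", ""))   # character-based split, whitespace removed

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")  # illustrative checkpoint
subword_tokens = tokenizer.tokenize(sentence)
token_ids = tokenizer.convert_tokens_to_ids(subword_tokens)

print(len(word_tokens), len(char_tokens), len(subword_tokens))
print(subword_tokens)   # subword strings from the tokenizer's vocabulary
print(token_ids)        # the unique integer IDs assigned to each token
```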
The subword-based tokenization is a combination of the two previously mentioned methods and is also the most common one; it is used by most of the current state-of-the-art models within NLP. Common character sequences, such as short words, are left intact, while longer and more uncommon ones are split into subwords. This method has the advantage that it can build every word in the document by stitching together said subwords. This means that the vocabulary does not need to be as big as for a word-based tokenizer, but also that the model does not have to handle as many tokens as with a character-based tokenizer. This method can also learn prefixes and suffixes along with grammatical word endings, which can be seen in the example sentence above: The, kid, s, are, play, ing, foot, ball. This will allow the model to see the similarity between "kid" and "kids", for example. However, the partitioning does not necessarily turn out in such a favourable way. The sentence might as well be tokenized as: Th, ki, ds, are, pla, ying, foo, tball. The partitioning of the subword tokens depends on the data that was used to train the tokenizer.

There are several algorithms to create the subword tokens, where Byte-Pair encoding [35], WordPiece [34] and SentencePiece [17] are the most common ones. Byte-Pair encoding (BPE) creates the vocabulary by first finding every unique word in a corpus, called pre-tokenization, where also the word frequency is saved. The next step is to create a base vocabulary consisting of every character present in the unique words. Starting with the base vocabulary, the training data are tokenized into temporary tokens consisting of neighbouring existing token pairs. The most frequent temporary token can be determined from the earlier word counts and is added to the vocabulary. The process is repeated, with one token added per iteration, until the desired vocabulary size (a specified hyperparameter) is reached.

WordPiece tokenization is comparable to BPE. The algorithm starts by initializing a base vocabulary consisting of every character that occurs in the training data. However, the pair selection is not based on the highest frequency, but rather on whether the pair maximizes the likelihood of the training data when it is added to the vocabulary. In other words, the pair that is merged is the one whose addition increases the likelihood of the training data the most. SentencePiece [17] is a method that also utilizes the BPE method, but includes whitespace in the set of available characters. The tokens that are created make up the final vocabulary, and, as mentioned earlier, every token in the vocabulary is assigned a unique integer as ID or index.

2.1.2 Word Embeddings

Many NLP models have the words represented by one-hot encodings, meaning a binary vector where every element is equal to zero apart from one, which represents the token ID. However, assigning a single binary vector to each word is often not enough for a machine learning model to understand words on a deeper level. The most common way to represent words is therefore to use word embeddings. A word embedding is a mathematical representation of each word that works by assigning a vector of real-valued numbers to each word in the vocabulary. Words that have semantic similarities are also expected to have similar word embeddings in the vector space. This is not possible with one-hot encoding, since binary vectors entail the same distance between every pair of words in the vocabulary.
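To make the contrast with one-hot vectors concrete, the minimal PyTorch sketch below maps token IDs to dense vectors with an embedding layer. A freshly initialized embedding is random, so the similarities it produces are meaningless until the layer has been trained as part of a model; the vocabulary size and dimension are arbitrary illustration values.

```python
# Minimal sketch of an embedding lookup table in PyTorch.
import torch
import torch.nn as nn

vocab_size, embedding_dim = 10_000, 8
embedding = nn.Embedding(vocab_size, embedding_dim)   # one trainable vector per token ID

token_ids = torch.tensor([12, 431, 97])   # IDs produced by a tokenizer
vectors = embedding(token_ids)            # shape (3, 8): dense, real-valued representations

# Unlike one-hot vectors, dense embeddings can express graded similarity,
# here measured with cosine similarity (see Section 2.6.1).
sim = torch.cosine_similarity(vectors[0], vectors[1], dim=0)
print(vectors.shape, sim.item())
```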
To create word embeddings, there are a few available methods, but the most common ones involve machine learning or statistics.

Word2Vec [27][26] is a statistical method that is essentially a shallow neural network consisting of one hidden layer, which can be trained with two different methods. The first is the Continuous Bag-of-Words (CBOW) method, which learns the word embeddings by placing a context window around a word and letting the network try to predict that word. The weights corresponding to the word later act as the embedding. The Skip-gram method is similar to CBOW but works the other way around, meaning that the model is trained to predict the surrounding words in the training set. In both methods, the size of the context window is specified when the model is created, and the embedding dimension is likewise a fixed hyperparameter of the model.

GloVe (Global Vectors for Word Representation) [30] is an embedding technique for distributed word representation, which utilizes unsupervised learning to obtain embedding vectors. Unlike Word2Vec, GloVe does not only look at the local statistics, meaning the surrounding words, but also considers global word co-occurrence statistics. The model is trained by creating a co-occurrence matrix over every word in the training corpus. For every word, the matrix stores the number of occurrences together with adjacent words within a specified window size.

Word embeddings can also be constructed by training a regular embedding layer, as in [11][31][19]. The embedding layer acts as a lookup table, where every word corresponds to a vector of a specified embedding dimension. The weights are initialized randomly and updated with respect to the loss function of the language model.

2.2 Artificial Neural Networks

Artificial neural networks (ANNs) are a type of computing system inspired by the human brain. ANNs can come in many different shapes and sizes depending on the task they are used to perform. This section will cover a short introduction to the most essential network types relevant to this thesis and how they are constructed.

2.2.1 Feed Forward Neural Network

The feedforward neural network (FFNN) was the first type of ANN to be invented and is also probably one of the simplest network architectures to date. The network is called a feedforward neural network because the information flows forward through the network without any cycles or loops. The simplest version of a FFNN can be viewed as a single perceptron, which means that it is constructed with only an input layer and an output layer. This version is also known as a single-layer perceptron. The output is calculated according to Equation 2.1, where $w$ represents the weights, $x$ the input, $b$ the bias, and $\sigma$ the non-linear activation function. A single-layer perceptron is a linear classifier.

$$y = \sigma\left(\sum_{i=0}^{n} w_i x_i + b\right) \qquad (2.1)$$

The other version of a FFNN is called the multi-layer perceptron (MLP), which is composed of many perceptrons. MLPs are constructed of at least three layers: an input layer, a hidden layer, and an output layer. This enables the network to handle problems that are not linearly separable and hence makes it suitable for tasks such as classification in supervised learning.
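A minimal NumPy sketch of Equation 2.1 is given below; a sigmoid is used as the non-linear activation $\sigma$, and the weights, bias, and inputs are arbitrary example values.

```python
# Single-layer perceptron: y = sigma(sum_i w_i * x_i + b), i.e. Equation 2.1.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(x, w, b):
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.8, 0.1, -0.4])   # weights
b = 0.2                          # bias
print(perceptron(x, w, b))
```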
2.2.2 Recurrent Neural Networks

Recurrent neural networks (RNNs) are a type of ANN used to process sequential data. Unlike ordinary FFNNs, RNNs contain directed cycles, which means that the information does not flow in a strictly forward order. These cycles enable the network to carry information about previous steps in the computation, which can be seen as the network having memory. Due to their ability to process sequential data, RNNs are typically useful in NLP tasks.

RNNs work by receiving an input x and, for each time step t, calculating an output y(t) based on the input x(t) and the previous output y(t - 1). The output of the previous step is saved in a hidden state, denoted h(t). Furthermore, an RNN can have different sizes of inputs and outputs. A one-to-many network takes one input and generates a sequence of outputs. A many-to-one network works in the opposite way, which means that it takes many inputs to generate one output. Lastly, many-to-many RNNs take an input sequence and generate a sequence as output; hence this type is suitable for machine translation. An illustration of a many-to-many network can be seen in Figure 2.1.

Figure 2.1: A visualization of a many-to-many RNN. The network is unrolled in the right-hand side of the figure to visualize the effect of the hidden states h.

2.3 Sequence-to-sequence Models

A sequence-to-sequence (seq2seq) model, or encoder-decoder model, is a special class of ANN architectures that takes sequential data as input and generates a new sequence as output. The seq2seq model is composed of an encoder and a decoder, where both components are usually constructed from RNNs. The encoder is used to compress the input sequence into a context vector that the decoder can use to generate the new output.

The encoder works by encoding each word in the input sequence, computing a hidden state $h_i$ for each time step $i$. The final hidden state, at time step $n$, is denoted $h_n$, and this state is equivalent to the context vector that is sent to the decoder. The decoder generates an output $y_i$ for each time step, depending on the previous state. The initial state of the decoder is the context vector $h_n$.

One drawback of seq2seq models and the encoder-decoder architecture is that they have issues with long sentences. Cho et al. [8] showed that when the input sequence becomes longer, it is harder for the model to encapsulate all the important information in the context vector.

2.3.1 Attention

To solve the problem that arises with long input sequences, Bahdanau et al. [2] introduced the attention mechanism. This mechanism is created with the intention of mimicking cognitive attention by deciding which parts of the input sequence are of importance. It works by creating a context vector $c_i$ from a linear combination of the hidden states $h_j$ in the encoder and the attention weights $\alpha_{ij}$ for each time step $j$, see Equation 2.2. The context vector $c_i$ is also influenced by the previous hidden state $s_{i-1}$, as can be seen in Equation 2.4.

$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j \qquad (2.2)$$

Here, the weight of each $\alpha_{ij}$ is equal to:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})} \qquad (2.3)$$

Here, $e_{ij}$ is the alignment score of a FFNN described by the function $a$:

$$e_{ij} = a(s_{i-1}, h_j) \qquad (2.4)$$

By passing all context vectors $c_i$ to the decoder, the model can decide what to focus on in the sequence while decoding the next step. See Figure 2.2 for an illustration of how the attention mechanism is calculated.

Figure 2.2: The attention mechanism visualized with an example sequence.
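The following NumPy sketch walks through Equations 2.2-2.4 for one decoder step. The alignment function $a$ is realized as a small feed-forward scorer whose weight matrices (W_a, U_a, v_a) are random stand-ins chosen for illustration, not parameters taken from any trained model.

```python
# Minimal sketch of additive attention (Equations 2.2-2.4) for one decoder step.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(s_prev, H, W_a, U_a, v_a):
    """s_prev: previous decoder state (d,), H: encoder hidden states (T_x, d)."""
    # Equation 2.4: alignment scores e_ij = a(s_{i-1}, h_j)
    e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h) for h in H])
    # Equation 2.3: attention weights via a softmax over the scores
    alpha = softmax(e)
    # Equation 2.2: context vector as the weighted sum of encoder states
    return alpha @ H, alpha

d, T_x = 4, 3
rng = np.random.default_rng(0)
H = rng.normal(size=(T_x, d))     # encoder hidden states h_1 .. h_Tx
s_prev = rng.normal(size=d)       # previous decoder state s_{i-1}
W_a = rng.normal(size=(d, d))
U_a = rng.normal(size=(d, d))
v_a = rng.normal(size=d)

c, alpha = attention_context(s_prev, H, W_a, U_a, v_a)
print(alpha, c)   # attention weights sum to 1; c is the context vector c_i
```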
2.3.2 Teacher Forcing

Teacher forcing is a common method used to train seq2seq models quickly and efficiently. Consider an arbitrary seq2seq model that predicts an output y(t) given an input x(t). The method works by letting the next input to the model, x(t + 1), be equal to the actual ground truth, regardless of the predicted output y(t). This enables the model to learn the next prediction from the correct input in the training data, rather than using the predicted output from the previous time step. The idea behind this method is not to let the model train on false predictions and hence waste valuable training time. An example of teacher forcing can be seen in Figure 2.3.

Figure 2.3: Teacher forcing exemplified with an RNN. The input at x(3) is not the output from y(2); instead it is the actual ground truth at the associated time step.

2.3.3 Autoregressive generation

A seq2seq model is, in fact, also an autoregressive model. This is a type of model used to describe time-varying processes, such as language generation. For a model to be autoregressive, it has to predict future values based on previous values. To put this into context, an autoregressive language model lets the i-th generated word in a sequence depend on all preceding i - 1 words. The probability distribution for a word sequence can be described by Equation 2.5, where $w_{1:T}$ is the generated text sequence, $W_0$ equals the initial text sequence fed to the model, and $w_{1:0}$ is the empty set, implying that no sequence has been generated in advance.

$$P(w_{1:T} \mid W_0) = \prod_{t=1}^{T} P(w_t \mid w_{1:t-1}, W_0), \quad \text{where } w_{1:0} = \emptyset \qquad (2.5)$$

With Equation 2.5, text can be generated in different ways. One method is to use the greedy search algorithm. It selects the word with the highest probability at the given time step, that is, $w_t = \operatorname{argmax}_w P(w \mid w_{1:t-1})$. This algorithm is efficient, but comes with the downside that it can miss combinations of words that have a higher probability than the predicted one. The reason for this is that a word combination with higher probability can exist where the first word has a lower probability than the generated one. An example of this can be seen in Figure 2.4. The algorithm runs until a special end-of-sequence token is generated or a specified number of words has been reached.

Figure 2.4: Beam and greedy search visualized by an example. The yellow lines represent the greedy approach, which takes the highest probability for each word and results in a score of 0.20. The green lines represent the beam search, which takes multiple combinations into consideration and results in a score of 0.32.

The beam search algorithm counters the problem of missing out on hidden high-probability words by keeping the n most likely sequences at each time step t. This creates beams in the search tree, hence the name, which can get past lower probabilities in order to find better predictions further down the beam. Because the algorithm stores several hypothetical sequences and acts greedily at the same time (since it always picks the n words with the highest probability), it is guaranteed to find a sequence at least as likely as the one found by greedy search, at the expense of computational cost. However, the algorithm is not guaranteed to find the sequence with the highest probability, since that would require a complete search over all possible word combinations.
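The sketch below contrasts greedy search and beam search using the generate method of the Hugging Face transformers library, which is also used later in the thesis (see Section 3.3.1). The checkpoint name and input sentence are illustrative only; any seq2seq checkpoint with a generation head could be substituted.

```python
# Minimal sketch of greedy search versus beam search with Hugging Face generate.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")   # illustrative checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")

inputs = tokenizer("The kids are playing football in the park.", return_tensors="pt")

# Greedy search: num_beams=1 picks the highest-probability token at every step.
greedy_ids = model.generate(**inputs, num_beams=1, do_sample=False, max_length=20)

# Beam search: keep the 5 most likely partial sequences at every step.
beam_ids = model.generate(**inputs, num_beams=5, do_sample=False, max_length=20)

print(tokenizer.decode(greedy_ids[0], skip_special_tokens=True))
print(tokenizer.decode(beam_ids[0], skip_special_tokens=True))
```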
Both the greedy and beam search algorithms can encounter the problem of repeating sequences of words. To avoid this problem, an n-gram penalty can be implemented [29]. This can be done in different ways, but one way is to set the probability to 0 for a word that would create an n-gram that has already appeared in the sequence.

It is also worth noting that human language is not always that predictable. In [15], the author shows that humans seem to prefer to be surprised by a text. This means that always picking the word with the highest probability, instead of sampling from the distribution, might not be the best practice if the goal is to mimic human-generated text.

2.4 Transformer

The Transformer is a deep learning model that was introduced by Vaswani et al. [38] in 2017. The model was developed for machine translation, but it quickly became state-of-the-art in many NLP-related tasks. The Transformer is essentially a seq2seq model, but unlike RNNs, the Transformer does not need to process data in sequential order. Without having to process the data in order, the training can be parallelized. To capture the structure and order of a sequence, the Transformer uses positional encodings (see Section 2.4.2) and the self-attention mechanism (see Section 2.4.1).

The Transformer is composed of two blocks, one block that works as an encoder and another that works as a decoder. In the original paper, the Transformer has 6 layers in each block, but the number of layers can be modified to serve specific tasks. In all layers within the two blocks, there are residual layers that work as skip connections. A skip connection is an operation where part of the output skips one or more layers and is instead added to a layer deeper in the model. Residual layers prevent deep networks from losing track of the input and lead to a better-performing network as a result [13].

The first block is the encoder block, where each of the 6 layers works as an independent encoder with its own weights. Each layer in the encoder can be broken down into two sub-layers: a multi-head attention module and a feed-forward neural network (FFNN). After each sub-layer inside the encoder, there is a residual connection followed by a layer normalization. Dropout is also applied to each sub-layer.

The second block is the decoder block, which is very similar to the encoder block, with the key difference that a multi-head attention mechanism is added over the output of the encoder stack. The decoder is also modified so that it cannot attend to subsequent positions when decoding the output. This means that the model output can only depend on the previous positions in the generated sequence. This modification is also known as masked multi-head attention.

Figure 2.5: Visualization of the Transformer architecture as depicted in the original paper. The yellow part represents the encoder block and the green the decoder block.

2.4.1 Self-attention

The Transformer was the first sequence transduction model that relies only on attention. The implementation that enabled this new feature was to replace the recurrent layers with a self-attention module. This type of attention mechanism works by allowing the model to compute attention within an input sequence. By relating the positions in a sequence to other positions in the same sequence, self-attention can create an understanding of how words relate to each other.

Self-attention is computed multiple times in each layer, in parallel and independently, through what is called multi-headed attention.
This is a module that concatenates the outputs of each self-attention module before linearly transforming them to create a final representation. A detailed explanation of how self-attention is calculated follows.

The three main components in calculating self-attention are the vectors of queries, keys, and values, denoted $Q$, $K$, and $V$. All these vectors are retrieved from the word representations in the input sequence, by multiplying the input embedding with the corresponding matrix, denoted $W^Q$, $W^K$, and $W^V$. The queries and keys have dimension $d_k$ and the value vectors dimension $d_v$. The self-attention calculation can be seen as mapping a query and a set of key-value pairs to an output.

Self-attention is measured with an attention score. This score represents how much attention should be paid to other parts of the input with respect to the current position. In other words, the score represents how much attention should be paid to other words in the sequence, based on the current word. The score is computed by first taking the dot product of the query matrix with the key matrix. The result is then divided by $\sqrt{d_k}$, followed by a softmax function. Lastly, the self-attention calculation is finalized by multiplying with the value matrix, see Equation 2.6.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \qquad (2.6)$$

As mentioned above, the Transformer calculates the self-attention scores in parallel, which means that it performs the self-attention calculation multiple times. This enables the model to focus on different positions inside the sequence and, in turn, leads to better scores. The original implementation of the Transformer uses 8 different heads, which generate 8 different self-attention matrices per sequence. To pass these values through the FFNN, the matrices for each head are concatenated into one big matrix. This matrix, which contains all the multi-headed attention scores, is then multiplied with an additional weight matrix, denoted $W^O$, to create the final representation, see Equations 2.7 and 2.8.

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O \qquad (2.7)$$

$$\text{where } \mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V) \qquad (2.8)$$
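To make Equations 2.6-2.8 concrete, the NumPy sketch below computes scaled dot-product attention for a toy input and combines two heads into a multi-head output. All matrices are random stand-ins; in a real Transformer the projection matrices $W_i^Q$, $W_i^K$, $W_i^V$ and $W^O$ are learned parameters.

```python
# Minimal sketch of scaled dot-product and multi-head attention (Eqs. 2.6-2.8).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Equation 2.6: softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = softmax(Q @ K.T / np.sqrt(d_k))
    return scores @ V

def multi_head(X, heads, W_O):
    # Equations 2.7-2.8: one attention per head, concatenate, project with W_O
    outputs = [attention(X @ W_q, X @ W_k, X @ W_v) for W_q, W_k, W_v in heads]
    return np.concatenate(outputs, axis=-1) @ W_O

seq_len, d_model, n_heads = 4, 8, 2
d_k = d_model // n_heads
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))   # one embedding per position in the sequence
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(n_heads)]
W_O = rng.normal(size=(n_heads * d_k, d_model))

print(multi_head(X, heads, W_O).shape)    # (4, 8): one d_model-sized vector per position
```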
2.4.2 Positional Encodings

Unlike RNNs, the Transformer by default sees the input as an unordered collection of words rather than a sequence. Since language depends on word ordering, positional information needs to be added in order for the Transformer to function properly. One key element in the Transformer architecture is therefore the positional encoding, which takes care of the positional information. This encoding is applied to both the encoder and decoder parts of the model. An example of the importance of word ordering can be seen below, where both sentences contain the same words but mean two different things.

"He likes football but hates golf."
"He likes golf but hates football."

In [38], the authors describe two ways to encode the positions, either fixed or learned. The fixed method is based on the use of trigonometric functions to encode the positions. The positions are encoded according to Equation 2.9, where $pos$ represents the position of the word in a sentence, $d_{model}$ is the total number of embedding dimensions of the model, and $i$ is a specific dimension according to $i = 1, \ldots, d_{model}/2$. When the frequencies of the sinusoidal functions are altered, the encoded values will be different for every position. The variation of the encodings is determined by the size of the embedding dimension. Even dimensions get the sine encoding, while odd dimensions get the cosine one. The positional encodings are simply added to the input embeddings, as they have the same dimensions. A visualization of Equation 2.9 can be seen in Figure 2.6.

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) \qquad (2.9)$$

Figure 2.6: Visualization of a sequence containing 6 words, also with an embedding dimension of 6. Worth noting is that the denominator used for this visualization is altered in order to make the shapes better visible on small example sentences.

Figure 2.7: The resulting positional encoding values from Figure 2.6.

The other method is to let the model learn the positional encodings on its own. This is done using a regular embedding layer that the model learns during training, which means that the model learns a positional embedding layer. However, Vaswani et al. showed that there was very little difference between this method and the one using sinusoidal encodings. In the original Transformer paper, they chose the sinusoidal approach because of the hypothesis that it would generalize better to sequences longer than those seen during training. More recent work, which is currently state-of-the-art, has in most cases adopted the learned approach over the static approach [19][5].

2.5 Pre-Trained Models

Transfer learning is the concept of transferring knowledge from one model to another by reusing parts of a pre-trained model in the new model. The idea is that if there already is a model used for solving a task similar to the one you are dealing with, the new model can inherit some features from the pre-trained model. By using a pre-trained model instead of training a model from scratch, one can save both time and resources. Formally, the transfer learning task is defined as follows: given a source domain $D_S$ and a learning task $T_S$, with the corresponding notation $D_T$ and $T_T$ for the target domain, the aim is to improve the learning task of the target domain, $T_T$, with the help of the learning task of the source domain, $T_S$. Specifically, the transfer is based on improving the conditional probability distribution $P(Y_T \mid X_T)$ in $D_T$ with the information from $D_S$ and $T_S$. A requirement is that either $D_T \neq D_S$ or $T_T \neq T_S$, otherwise no information would be transferred.

A common use case of transfer learning within NLP is to fine-tune a pre-trained model for a downstream task. Transformer models are often pre-trained on large text corpora, meaning that they have a good understanding of a language. This enables the model to be fine-tuned for specific NLP tasks. For instance, one can easily create a sentiment classifier by adding one or more linear layers on top of a pre-trained model.

Recent development of language models such as BERT [11] and BART [19] has led to models that can perform multi-tasking, meaning that the same model can be used for different tasks. For example, the same model can be used for machine translation, question answering, and semantic analysis. Another example is the use of multilingual models, where one model can learn multiple languages at once. The following subsections cover some of the most influential NLP models and the most relevant ones for this project.
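As a minimal sketch of the fine-tuning idea described above (assuming the Hugging Face transformers library and PyTorch), the example below loads a pre-trained checkpoint with a freshly initialized classification head and takes a single gradient step on toy sentiment data. The checkpoint name, labels, and hyperparameters are illustrative only, not choices made in this thesis.

```python
# Minimal sketch of transfer learning: fine-tuning a pre-trained model
# with a new classification head on a downstream (sentiment) task.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2   # e.g. positive / negative sentiment
)

batch = tokenizer(["I loved this movie", "This was a waste of time"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

# One gradient step of fine-tuning: the pre-trained weights and the new
# classification head are updated together on the downstream data.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
print(loss.item())
```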
The name stands for Bidirectional Encoder Representations from Transformers and its architecture is almost identical to the Transformer. The model originally came in two sizes, where the first, BERTBASE , consists of 12 layers, 12 attention heads and a hidden size of 768, with a total of 110 million parameters. The second one, BERTLARGE , consists of 24 layers, 16 attention heads, and a hidden size of 1024, resulting in a total of 340 million parameters. BERT has learned positional embed- dings, which means that the embeddings are learned during pre-training. 19 2. Theory As stated in the name, BERT works with a bidirectional encoder, which enables the model to look at positions further in the sequence when encoding. This feature is suitable for pre-training on unsupervised tasks such as masked language modelling (MLM) and next sentence prediction (NSP). MLM is a pre-training method where some percentage of all tokens in an input sequence is replaced with a masking token. Then, the model’s task is to assign and predict the correct token for the mask. When pre-training BERT, 15% of all tokens were masked. NSP is another pre-training task where the model’s task is to predict if two sentences are related to each other or not. It is used to pre-train for tasks that involve an understanding between sentences, such as a question answering task. 2.5.2 GPT The first Generative Pre-trained Transformer (GPT) was introduced by OpenAI in 2018 [31]. The architecture of the GPT model is similar to that of the decoder block in the Transformer model. It stacks 12 decoder layers on top of each other to create a sequential decoding block. The output tokens generated by the model are predicted autoregressively, which means that it can be used to generate sequences. When decoding, the model can only look at previously generated words, similar to the Transformer model. At the time of its release, it achieved state-of-the-art in 9 of the 12 datasets it was evaluated on, including tasks such as question answering, semantic similarity as- sessment, and text classification. The original GPT model has 2 successors, namely GPT-2 [32] and GPT-3 [5], substantially increasing the number of parameters for each model. 2.5.3 BART BART is a seq2seq model based on the Transformer architecture which was intro- duced by Facebook AI in 2019 [19]. The two building blocks of the BART model can be generalized as using BERT as the encoder, because of the bidirectional encoder, and GPT as the decoder, because of the autoregressive decoder. The difference between the original Transformer and BART is that the latter uses GeLU [14] in- stead of ReLU [1] as activation functions. Regarding the rest of the architecture, it is very similar to BERT with the two differences that the decoder layers perform cross-attention over the final hidden layer of the encoder and that BERT uses an additional FFNN to predict the next word, which BART does not. Each of the encoder and decoder blocks are made of 6 layers in the base model and 12 layers in the large model. The BARTBASE model has approximately 140 million parameters, and the BARTLARGE model has approximately 400 million parameters. Because of the similarity to BERT, BART also uses learned positional embeddings. The model is pre-trained as a denoising autoencoder, which means that the model is trained on noisy and corrupted text. Training data are modified and corrupted in numerous ways. The modifications to the data are summarized in the list below. 20 2. 
• Token Masking: Random tokens in the input sequence are replaced with a mask token. The model is then trained to predict the correct token for each mask based on the rest of the sequence; essentially the same pre-training method as MLM in BERT.
• Token Deletion: Random tokens are deleted from the input, and the model must predict both the content of the deleted tokens and the positions they had, based on the rest of the sequence.
• Text Infilling: Similar to deletion, text infilling removes several tokens in a row and replaces them with a single mask token. The model therefore has to learn to predict both the content and the number of the missing tokens.
• Sentence Permutation: The sentences of the input document are shuffled into a random order. This forces the model to learn the relations between sentences, regardless of their order.
• Document Rotation: A random token is chosen and the document is rotated so that it begins with that token, which teaches the model to identify the beginning of documents.

After the pre-training phase, the model is ready to be fine-tuned for downstream tasks. Because of the autoregressive decoder, BART can be fine-tuned for sequence generation tasks such as text summarization and machine translation. The encoder takes an input sequence, and the output is generated autoregressively by the decoder.

2.5.4 Marian & MarianMT

Marian is an NMT framework written entirely in C++, highly optimized for machine translation. The framework is mostly developed by Microsoft and the University of Edinburgh. The NLP group at the University of Helsinki has pre-trained over 1,000 Marian models, which were made public on Hugging Face after being converted to Python. MarianMT is the name of the class provided by Hugging Face through which the pre-trained Marian models can be imported. The architecture of the MarianMT models is almost identical to the BART model, except for a few minor differences. The first difference is that the Marian models use sinusoidal positional embeddings instead of the learned positional embeddings used by BART. The second difference is that the Marian models do not apply layer normalization to the embeddings. The MarianMT models at Hugging Face are also slightly smaller than the BART_BASE model, with a total of about 74 million parameters.

2.6 Metrics

Evaluating a language model is a complex task that requires specific metrics. To illustrate one of the problems that arise when measuring and comparing texts, consider the following sentences:

I enjoyed the show
I liked the concert

The contextual meaning of these sentences is essentially the same, but they differ in the words used to describe the situation. This is a common case in sentence simplification and text summarization, where redundant words are removed or paraphrased. The problem in these situations is that most metrics look at matching n-grams, meaning n matching words in a row, and do not take the contextual meaning into consideration. Evaluation scores are typically calculated between a candidate and one or more references. The candidate is the generated translation or prediction, and the references are the correct translations, made, for example, by a translator.

2.6.1 Cosine similarity

Cosine similarity is a way to measure the similarity between two vectors. It is defined as the cosine of the angle between the two vectors. If the cosine similarity score is 1, the two vectors have the same orientation; likewise, if the score is 0, the two vectors are orthogonal. In NLP, cosine similarity can be used to measure the similarity of two strings. Cosine similarity is formally defined as:

\text{cosine similarity} = \cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\sqrt{\sum_{i=1}^{n} B_i^2}}   (2.10)
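As a concrete illustration of Equation 2.10, the sketch below computes the cosine similarity between two sentences represented as bag-of-words count vectors. The word-count representation is an assumption made for the example; Equation 2.10 itself is agnostic to how the vectors A and B are produced.

```python
import math
from collections import Counter

def cosine_similarity(sentence_a: str, sentence_b: str) -> float:
    """Cosine similarity between two sentences, using bag-of-words count vectors."""
    a, b = Counter(sentence_a.lower().split()), Counter(sentence_b.lower().split())
    vocabulary = set(a) | set(b)
    dot = sum(a[w] * b[w] for w in vocabulary)          # A . B
    norm_a = math.sqrt(sum(c * c for c in a.values()))  # ||A||
    norm_b = math.sqrt(sum(c * c for c in b.values()))  # ||B||
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine_similarity("I enjoyed the show", "I liked the concert"))  # 0.5
```

For the two example sentences above, only "I" and "the" overlap, giving a score of 0.5 even though the sentences mean roughly the same thing, which illustrates the limitation discussed at the start of this section.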
2.6.2 ROUGE-N

ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation [20], and it is a set of metrics used to evaluate machine translation and text summarization. In ROUGE-N, the N stands for the n-gram size used in the evaluation: ROUGE-1 evaluates unigrams and ROUGE-2 bigrams. ROUGE-N is composed of recall, precision, and F1 scores.

Recall is calculated by first counting the number of overlapping n-grams found in both the candidate and the reference. This number is then divided by the number of n-grams found in the reference. The recall formula can be seen in Equation 2.11, where TP stands for true positives and FN for false negatives.

\text{Recall} = \frac{TP}{TP + FN}   (2.11)

Precision is, similarly to recall, calculated by first counting the number of overlapping n-grams found in both the candidate and the reference. It is, however, obtained by dividing by the number of n-grams found in the candidate. The formula can be seen in Equation 2.12, where FP stands for false positives.

\text{Precision} = \frac{TP}{TP + FP}   (2.12)

The F1 score is a combination of recall and precision. The combination results in a score that not only rewards sequences covering as many of the reference words as possible, but also requires them to do so without outputting redundant words. The general formula for the F1 score can be found in Equation 2.13.

F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}   (2.13)

2.6.3 BLEU

Bilingual Evaluation Understudy (BLEU) [16] is an algorithm designed to evaluate the quality of machine-translated text. It was developed with the idea that the closer a machine translation is to a professional human translation, the better it is. BLEU also avoids rewarding candidates that simply repeat plausible words many times; this is achieved by calculating a modified n-gram precision score for the candidate. The n-gram precision score is calculated using the following equation:

p_n = \frac{\sum_{\text{n-gram} \in C} Count_{clipped}(\text{n-gram})}{\sum_{\text{n-gram} \in C} Count(\text{n-gram})}   (2.14)

The function Count counts the n-grams of the candidate C, while Count_clipped counts the matches against the references R_1, ..., R_m with clipping: the count of each correct n-gram in the candidate is limited to the maximum number of times it occurs in any single reference sentence. The BLEU score is usually computed by combining the modified precision scores for n-grams up to length 4, that is p_1, p_2, p_3 and p_4. To obtain the final score, a brevity penalty (BP in Equation 2.15) is also introduced, which penalizes short candidate sentences. This is needed because shorter translations are more likely to get a high modified precision score, even though they may miss a lot of content. The resulting formula is the following:

BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)   (2.15)
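The sketch below implements the core of Equations 2.14 and 2.15 for a single candidate: clipped n-gram counts, uniform weights w_n = 1/4, and a brevity penalty based on the reference closest in length. It is a simplified, illustrative implementation; production BLEU implementations additionally handle corpus-level aggregation and smoothing.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision p_n (Equation 2.14) for one candidate."""
    cand_counts = Counter(ngrams(candidate, n))
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

def sentence_bleu(candidate, references, max_n=4):
    """Sentence-level BLEU (Equation 2.15) with uniform weights and brevity penalty."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    closest_ref = min(references, key=lambda ref: abs(len(ref) - len(candidate)))
    c, r = len(candidate), len(closest_ref)
    bp = 1.0 if c >= r else math.exp(1 - r / c)   # penalize candidates shorter than the reference
    return bp * geo_mean

candidate = "it was her idea".split()
references = ["it was her idea".split(), "that was her idea".split()]
print(sentence_bleu(candidate, references))  # 1.0 for an exact match
```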
2.6.4 METEOR

Metric for Evaluation of Translation with Explicit ORdering (METEOR) [3] is a metric for evaluating machine translation. It was developed to counter some of the shortcomings of BLEU. The metric works by calculating the harmonic mean of unigram precision and recall, where recall is weighted higher. The algorithm creates an explicit alignment between the candidate and the reference. The score is calculated by first counting the number of matching unigrams between the generated and reference translations, denoted m, the number of unigrams in the generated translation, w_t, and the number of unigrams in the reference translation, w_r. With this information, the unigram recall R and precision P can be calculated according to Equation 2.16.

R = \frac{m}{w_r}, \qquad P = \frac{m}{w_t}   (2.16)

With recall and precision, the weighted harmonic mean can be calculated according to Equation 2.17, where recall is weighted 9 times more heavily than precision.

F_{mean} = \frac{10PR}{R + 9P}   (2.17)

So far, the calculation only accounts for matches of single words, not for longer segments occurring in both the generated and reference translations. To account for these, n-gram matches are used to compute an alignment penalty: mappings between the candidate and the reference that are not adjacent increase the penalty. This is done by first defining the number of chunks c, where a chunk is a group of unigrams that are adjacent in both the candidate and the reference; the longer the contiguous mappings, the fewer chunks, so a perfect match results in a single chunk. The penalty is calculated according to Equation 2.18, where u_m is the number of mapped unigrams.

p = 0.5\left(\frac{c}{u_m}\right)^3   (2.18)

The final METEOR score can then be calculated as in Equation 2.19.

METEOR = F_{mean}(1 - p)   (2.19)

3 Methods

This chapter specifies and explains the methods used in this thesis. The methods have been shaped in an iterative process, in which the results along the way have affected the chosen methods. The chapter starts with an introduction of the datasets used in this thesis, followed by how the data are pre-processed. All of the methods are based on using pre-trained Transformer models; our specific implementations and modifications of the models are described in Sections 3.3.2 and 3.3.3. The methods are then evaluated with both quantitative and qualitative analysis.

3.1 Plint Data

Plint has been in the localisation business for many years, and during this time thousands of hours of film have been translated and subtitled into many different languages. Some of the projects that the company has worked with contain JSON files where the original transcript and the subtitles can be found. These files also contain metadata with information about the subtitle, for example time stamps that determine when and for how long a subtitle is visible to the viewer.

The transcript in the project file contains the spoken content of the video being subtitled. Transcription was previously done manually, by listening to the content and writing down the words, but lately most companies have used ASR for this task. The transcript is provided with time stamps for each word, and it is therefore necessary to segment the ASR output into fitting sentences. To solve this task, the supervisors at Plint made a script that formats the ASR output to match the time stamps of the subtitles. Plint provided us with three JSON files per project: one containing the raw ASR output, one containing the ASR-transcribed segmentations, and one containing the English subtitles. In total, we received 4443 files from 1481 subtitling projects.
In order to make use of these data, and to construct a dataset, we had to perform a few pre-processing steps. To begin with, we merged the ASR segmentations and the English subtitles into a single file. We chose not to use the raw ASR output, since we already had the segmented version of the ASR transcripts. When constructing the dataset, we labeled the ASR transcriptions as "input" and the subtitles as "truth". We ignored the metadata, as it did not serve any purpose for our task.

The method of segmenting the ASR output based on time stamps produced overall well-aligned matchings between the ASR output and the subtitles. However, when evaluating the data, we found outliers in which single words at the beginning or end of a sentence had been placed incorrectly. We therefore devised an algorithm that solves this by comparing the endings and beginnings of two succeeding data pairs, where a data pair is an input-truth alignment as described above. If the endings of the first data pair do not match, the endings are compared to the beginnings of the second data pair. If the algorithm then finds a match between the endings of the first pair and the beginnings of the second pair, one of the words has been placed incorrectly, and the data pairs are adjusted. If there is a further mismatch, that is, if none of the words at the beginnings and endings match at all, the data pair is removed.

The ASR output and the English subtitles are also compared by content, in order to filter out pairs that are not related even though they might share the first and last words. This is done by calculating the cosine similarity between the two; if the similarity is less than 0.5, the data pair is discarded. Setting the threshold too low would let faulty data pairs through, while a too strict threshold would discard examples with linguistic compression, which would defeat the purpose of the project. As stated in Section 1.1, there are guidelines on how to announce which person is talking. Since the transcript contains only the spoken language, the speaker indicators in the English subtitles are also filtered out. A constructed example from the filtered dataset can be seen in Table 3.1.

Input text:  Our fine saint Donald Duck is much superior
Truth text:  Our saint Donald Duck is superior

Table 3.1: A constructed example of an entry in the Plint dataset.

3.2 Public Data

In addition to the data provided by Plint, we used two open-source datasets. This section covers the details of each of these datasets, and also how we processed and modified the data to create more suitable datasets for our purposes.

OpenSubtitles [21] is a dataset consisting of subtitles retrieved from the web page opensubtitles.org. It is composed of subtitles for 62 languages and 1782 bitexts. The set of subtitles from English to Swedish contains about 17 million subtitle pairs.

WikiLarge [41] is a dataset that is typically used for sentence simplification. It is constructed by aligning a complex version of a sentence with its corresponding simpler version. The complex sentences come from the regular English Wikipedia (en.wikipedia.org), and the simple sentences come from the Simple English Wikipedia (simple.wikipedia.org). The dataset contains 296,402 sentences used for training, all based on the complex-simple alignment.
WikiLarge also has a test and a validation set based on the Turkcorpus [39]; these two sets contain 2000 and 359 sentences, respectively.

3.2.1 Backtranslation with OpenSubtitles

By using backtranslation on the English-to-Swedish portion of the OpenSubtitles dataset, it was possible to find a large proportion of linguistic compression and paraphrasing between the subtitles. Backtranslation was performed on the Swedish subtitles, which means that the Swedish subtitles were translated back to English. This method was inspired by a paper from Netflix [25], and the translations were made with a MarianMT model from Hugging Face. The backtranslation was a computationally heavy task, and we therefore ended up translating only a subset of the subtitles available to us; in total, we translated 260,000 Swedish subtitles back to English. The checkpoint we used was "Helsinki-NLP/opus-mt-sv-en", which is the pre-trained MarianMT checkpoint for machine translation from Swedish to English.

To verify that the translations were correct and useful, we started to analyse them manually. However, this was both ineffective and time-consuming, so we needed a way to analyse the translations automatically. The manual analysis did, however, produce an important insight: when the OpenSubtitles data had been aligned poorly, the translation was also bad. A bad alignment means that parts of the subtitles do not match, for example because part of the translation has been offset to the next subtitle.

To solve this issue, we introduced a filter based on cosine similarity. In theory, this gives a low score if the translations are aligned poorly, because very few of the words will match, and a high score if the translations are aligned correctly, because more words will match between the sentences. However, this method has the downside of punishing short sentences for paraphrasing words, resulting in a very low cosine similarity score. Therefore, we had to find a sweet spot where the bad translations were removed but the good ones were kept. We went with a cosine similarity threshold of 0.5, which gave the results we were after. Furthermore, we also filtered subtitles based on length criteria, with limits on both character lengths and length ratios: we only kept subtitles with character lengths between 5 and 100 and a length ratio below 2. The resulting dataset contains 147,548 backtranslated subtitles, and a visualization of the lengths and ratios of the data can be seen in Figure 3.1. Further on, we refer to this dataset as "OpenBack".

Figure 3.1: Visualization of the distributions for subword lengths and subword ratios of the OpenBack dataset.
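To make the backtranslation step concrete, the sketch below translates a batch of Swedish subtitles back to English with the "Helsinki-NLP/opus-mt-sv-en" checkpoint mentioned above. The Swedish example sentence is purely illustrative, and the decoding settings are simply the checkpoint's defaults rather than the exact configuration used for the 260,000 subtitles.

```python
from transformers import MarianMTModel, MarianTokenizer

checkpoint = "Helsinki-NLP/opus-mt-sv-en"  # Swedish -> English, as used for the backtranslation
tokenizer = MarianTokenizer.from_pretrained(checkpoint)
model = MarianMTModel.from_pretrained(checkpoint)

swedish_subtitles = ["Det var hennes idé, inte min."]       # illustrative example sentence
batch = tokenizer(swedish_subtitles, return_tensors="pt", padding=True)
generated = model.generate(**batch)                          # the checkpoint's default decoding settings
backtranslations = tokenizer.batch_decode(generated, skip_special_tokens=True)
print(backtranslations)  # e.g. ["It was her idea, not mine."]
```

Each backtranslated English sentence can then be paired with the original English subtitle and passed through the cosine similarity and length filters described above.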
3.2.2 OpenSubtitles - English to Swedish

To train a translation model, we needed subtitle data in two languages. Because we had already used the OpenSubtitles data for backtranslation, it was suitable to use this dataset for the translation task as well. To avoid the problems of poorly aligned data, we used the same data as in the backtranslation. Using the same data would also assure us that the translations were correct, as the backtranslated data had passed our cosine similarity filter. However, the data still needed to be processed through the length filter to make sure that the length and ratio criteria were fulfilled. The length filter was the same as in the previous task, meaning a threshold of 2 on the length ratio and character lengths between 5 and 100. The resulting dataset with English-Swedish subtitles consists of 144,968 subtitles.

Figure 3.2: Visualization of the distributions for subword lengths and subword ratios of the OpenSubtitles dataset.

3.2.3 WikiOpen

The use of the WikiLarge dataset was inspired by the sentence simplification methods of Martin et al. [24][23]. The dataset contains sentences of various lengths, where the shortest are only a few characters long and the longest are up to 500 characters long. To fit our purpose, we had to filter the data so that it consisted only of sentences that could work as subtitles. Our first step was to manually evaluate the dataset, and just as in the OpenSubtitles data, there were instances of poorly aligned sentence pairs. We therefore applied the same cosine similarity filter with a threshold of 0.5, as well as the same length and ratio restrictions, which means that we only kept data with character lengths between 5 and 100 and a ratio between 0 and 2. To create a larger dataset, we combined these data with the backtranslated OpenSubtitles data. We named this dataset "WikiOpen"; it contains a total of 202,718 instances, of which 147,723 are subtitles from OpenBack and 54,995 are sentences from WikiLarge. The subword length and subword ratio distributions of this dataset can be seen in Figure 3.3.

Figure 3.3: Visualization of the distributions for subword lengths and subword ratios of the WikiOpen dataset.

3.3 Model

All the methods in this thesis are based on using a Transformer model to generate the subtitles. We found the BART model suitable for the English-to-English task and the Marian model for the Swedish-to-English task. For the most part, the project uses pre-trained models as a starting point, which means that our methods are based on models that are fine-tuned from the transferred weights of a pre-trained model. When inheriting the weights of a pre-trained model, the tokenizer and vocabulary follow as well; however, in most cases, changes or additions to the tokenizer were necessary.

3.3.1 Hugging Face

Hugging Face is a company that provides open-source NLP technologies. Research leaders such as Facebook AI, Microsoft, and Google AI all support the platform and upload their models to Hugging Face for public use. This is beneficial for both research and educational purposes, as smaller research institutes and universities can access state-of-the-art models for free. The community around Hugging Face has also contributed thousands of fine-tuned models.

When downloading a model from Hugging Face, it is possible to download only the architecture of a model and then pre-train it from scratch. It is also possible to download a pre-trained model by specifying a checkpoint. The checkpoints are essentially the current state of a model's weights and tokenizer. This gives the user access to the freely available state-of-the-art models that have been released on Hugging Face. Likewise, the user can also access the models that have been fine-tuned for downstream tasks by the community.
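The difference between downloading only an architecture and downloading a full checkpoint can be illustrated with the generic Auto classes. The sketch below is an illustration only; "facebook/bart-base" is used as the example checkpoint because it appears later in this thesis, but any model name on Hugging Face would work the same way.

```python
from transformers import AutoConfig, AutoModel

checkpoint = "facebook/bart-base"

# Option 1: architecture only. The weights are randomly initialised and would
# have to be pre-trained from scratch.
config = AutoConfig.from_pretrained(checkpoint)
untrained_model = AutoModel.from_config(config)

# Option 2: architecture plus the pre-trained weights stored in the checkpoint.
pretrained_model = AutoModel.from_pretrained(checkpoint)
```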
All of the pre-trained models that we use in this thesis are retrieved from Hugging Face.

3.3.2 Transformer with Length Encoding

To create a length-aware positional encoding, we use the method proposed by Takase and Okazaki [37]. They train a Transformer model to constrain the output length by using the length encodings in Equation 3.1, which is a slight modification of Equation 2.9. These new encodings look at the current position with respect to the remainder of the sequence, instead of with respect to the beginning of the sequence as in the original implementation.

PE_{(len,pos,2i)} = \sin\left(\frac{len - pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(len,pos,2i+1)} = \cos\left(\frac{len - pos}{10000^{2i/d_{model}}}\right)   (3.1)

Here, len represents the specified length of the output sequence, that is, the length of the final subtitle, and pos represents the current position in the sequence. The other parameters remain the same as in the original implementation (see Section 2.4.2). During training, len is the length of the target sequence, and during inference it is the desired length of the output. BART uses a BPE tokenizer, and we therefore experimented with two different ways of counting the len and pos parameters.

Subword count was the first method we used to represent the different lengths. It is based on the tokenizer's subword segmentation, meaning that the length parameters are based on the number of subwords produced by the tokenizer. All tokens are thus treated equally in length, each counting as 1.

Character count was the second method that we planned to use to represent the different lengths. In contrast to the subword count, this method takes the length of each individual subword into consideration. The position pos is calculated by summing the character lengths of the preceding subword tokens, and the length len is simply the character length of the target sequence.

3.3.3 Transformer with Length Token

In addition to the modified positional encoding, we introduced new tokens that represent different lengths to the model. A length token is prepended to each source sentence and is meant to represent the relation between the source sentence and the target sentence. The tokens were created by adding a token representing some value to the tokenizer; each token consists of the corresponding value wrapped in angle brackets, for example "<5>". We experimented with a few different methods for assigning the value when creating the length tokens, but the two main concepts were to create the tokens based on length or based on ratio.

The first method was to prepend the length of the target sentence to the source sentence. The length was retrieved by tokenizing the target sentence with the tokenizer, so the length token would be "<5>" for a target of five subwords.

The second method was to make the length tokens aware of the length ratio between the source sentence and the target sentence. To calculate the ratio, we divided the subword length of the target sentence by the subword length of the source sentence (see Equation 3.2). For this method, we used two versions of tokens: a category-based version and a value-based version.

ratio = \frac{\text{subword\_length\_target}}{\text{subword\_length\_source}}   (3.2)
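As an illustration of Equation 3.2, the sketch below computes the subword-length ratio for one training pair with the BART tokenizer. The sentence pair is the constructed example from Table 3.1; how the resulting ratio is then mapped to a token is described in the following paragraphs.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")

source = "Our fine saint Donald Duck is much superior"   # constructed example pair (cf. Table 3.1)
target = "Our saint Donald Duck is superior"

# Subword lengths, counted without the special BOS/EOS tokens.
source_len = len(tokenizer.tokenize(source))
target_len = len(tokenizer.tokenize(target))

ratio = target_len / source_len   # Equation 3.2
print(source_len, target_len, round(ratio, 2))
```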
The first version of the ratio method used only three tokens, indicating short, normal, and long target lengths. The ratio was calculated according to Equation 3.2, and the tokens were assigned using the following rules:

ratio < 0.95 ⟹ short token
0.95 < ratio < 1.05 ⟹ normal token
ratio > 1.05 ⟹ long token

The second version used value-based ratio tokens. Here we created the ratio tokens based on the actual value of the ratio between the target and source sentences rather than putting them into categories. We created 20 tokens between 0 and 2, with an interval of 0.10, and each source sentence was assigned the value-based ratio token closest to its actual ratio.

All of the tokens that we created were added to the tokenizer, where their weights were initialized randomly. This enables the model to distinguish each token individually during training. However, it also means that tokens that are close to each other in value, such as "<11>" and "<12>", might not be related to each other at all in the embedding dimensions.

3.4 Training

In this project, we made use of cloud-based computing to train our models. The company provided us with an instance on the EC2 platform at Amazon Web Services (AWS), where we had access to an NVIDIA T4 Tensor Core GPU. For legal reasons, the Plint data were not allowed to leave the AWS servers, so training on those data was done there. When training models on the public datasets, however, we were able to use free computing resources, and these models were therefore trained on GPUs at Google Colab and Kaggle.

3.5 Evaluation

To evaluate our models, we used multiple metrics combined with manual analysis. The metrics we used were BLEU, ROUGE-N, and METEOR, computed with an implementation based on the Jury library [7]. The manual analysis focused on the linguistic performance of the models, which is difficult to capture with metrics. It was performed by picking random samples from the corresponding test sets and evaluating the impact of each token.

4 Results

This chapter presents the results and evaluations obtained in this project. The results are based on experiments with different combinations of the methods described in Chapter 3. The performed experiments are listed below:

• BART with Length Encodings and Length Tokens
  – Fine-tuned with Plint data
• BART with Category-based Ratio Tokens
  – Fine-tuned with WikiOpen data
  – Fine-tuned with OpenBack data
• BART with Value-based Ratio Tokens
  – Fine-tuned with WikiOpen data
  – Fine-tuned with OpenBack data
• Marian with Category-based Ratio Tokens
  – Fine-tuned with OpenSubtitles data
• Marian with Value-based Ratio Tokens
  – Fine-tuned with OpenSubtitles data

The results are presented with tables and figures showing the metrics and length ratios for each method. Furthermore, this chapter contains examples generated by each of the evaluated models. A further discussion of the results can be found in Chapter 5.

4.1 BART with Length Encodings and Tokens

Our first experiments consisted of fine-tuning a BART model with the length encodings introduced in Section 3.3.2, together with a length token based on the subword length of the target sentence. This method utilizes the pre-trained checkpoint "facebook/bart-base" when initializing the BART model.
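Before fine-tuning, the added length tokens have to be registered with the tokenizer and the embedding matrix resized. The sketch below shows this step for subword-length tokens of the form "<1>" up to "<50>" (the maximum used in this section); it illustrates the general Hugging Face mechanism rather than the exact training code of the thesis, and the new embedding rows are randomly initialised, matching the description in Section 3.3.3.

```python
from transformers import BartForConditionalGeneration, BartTokenizer

checkpoint = "facebook/bart-base"
tokenizer = BartTokenizer.from_pretrained(checkpoint)
model = BartForConditionalGeneration.from_pretrained(checkpoint)

# Length tokens "<1>" ... "<50>", prepended to each source sentence during training.
length_tokens = [f"<{n}>" for n in range(1, 51)]
tokenizer.add_tokens(length_tokens)

# Grow the embedding matrix; the rows for the new tokens are randomly initialised.
model.resize_token_embeddings(len(tokenizer))

example = "<7> That was her idea, not mine."   # target length of 7 subwords (cf. Table 4.2)
print(tokenizer.tokenize(example)[:3])
```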
For fine-tuning the model, we used the Plint dataset, where 90% of the data was used for training and the remaining data was split evenly into a validation set and a test set of 5% each. Furthermore, we used a batch size of 64, the learning rate was set to 3 · 10^-5, and cross-entropy loss was used as the loss function.

4.1.1 Baseline

The baseline model for this method was created from the pre-trained BART checkpoint without any modifications, which means without any length encodings or length tokens. The model was fine-tuned for 5 epochs, with evaluation on the validation set at epochs 0, 2, and 4. The average (over batch) training and validation loss can be seen in Figure 4.1. A few generated examples from the baseline model can be seen in Table 4.1. Gen denotes an unconstrained beam search with a beam size of 5, while Genpen uses the same number of beams but with the bi-gram penalty mentioned in Section 2.3.3 (the probability of a word that would create a bi-gram already present in the sequence is set to 0). The generated sequences are similar to the input, with the exception that the model inserts a line-break token when the sequences tend to get longer. The model also misses the last couple of tokens in the last example.

Figure 4.1: Average training loss plot of the initial baseline model.

4.1.2 Length Encodings and Length Tokens

To evaluate the method with length encodings and length tokens, we modified the BART model to use these as additional inputs. Looking at the target length distribution of the Plint dataset in Figure 4.2, we concluded that a maximum length of 50 would be more than sufficient for our task, which means that tokens up to "<50>" could be used in our model. This model was fine-tuned for 11 epochs, with evaluation on the validation set every 5 epochs. The average (over batch) training and validation loss can be seen in Figure 4.3.

Baseline - Epoch 4
Input:   That was her idea, not mine.
Gen:     That was her idea, not mine.
Genpen:  That was her idea, not mine.
Target:  It was her idea.

Input:   You were planning to marry Mrs. Van Dorn, weren't you?
Gen:     You were planning to marry
         Mrs. Van Dorn, weren't you
Genpen:  You were planning to marry
         Mrs. Van Dorn, weren't you
Target:  You were going to get married, weren't you?

Input:   I've just had time to think things out put myself in your position.
Gen:     I've just had time to think things out,
         put myself in your
Genpen:  I've just had time to think things out,
         put myself in your
Target:  I've put myself in your position

Table 4.1: Generated sequences by the BART baseline model fine-tuned on the Plint dataset for 5 epochs. Gen corresponds to a beam search with a beam size of 5 and Genpen to a bi-gram penalised beam search with a beam size of 5.

Figure 4.2: Distribution of target subword lengths in the dataset from Plint.

Although the loss graph looks reasonable, the approach did not work, in contrast to the similar implementation in [37], and resulted in the model generating gibberish. Some examples can be seen in Table 4.2, where the model seems to begin to stutter and repeat itself. Generation was carried out in the same way as for the baseline model, that is, a beam search with 5 beams, together with a variant using a bi-gram penalty.

We also experimented with freezing parts of the model. The experiments consisted of partially freezing the encoder, meaning that the weights within the encoder were prevented from being updated. The first approach was to freeze the entire model except for the word embeddings, and the second was to also freeze the word embeddings except for our new length tokens. However, these methods did not lead to any success and are therefore not presented.

Figure 4.3: Average training loss plot of the initial model.

To gain a better understanding of the failed implementation, a visualization of the added positional encodings compared to the pre-trained BART positional embeddings can be seen in Figure 4.4. As stated in Equation 3.1, the sinusoidal approach ranges from -1 to 1, while the learned embeddings are mostly centered around 0 (with the exception of a single value of approximately -3.9, which is not visible in the figure but explains the colour bar). From this it is possible to see that the sinusoidal encodings have a much higher variance than the learned ones, which could cause problems when they are added together in the decoder.

Figure 4.4: Visualization of the implemented sinusoidal positional encodings and the original learned positional embeddings of the BART checkpoint. The sinusoidal positional encodings range from -1 to 1, while the original learned embeddings are mostly centered around 0.

Length Encodings and Length Tokens - Epoch 5
Input:   <7> That was her idea, not mine.
Gen:     That was her idea, not mine not mine. That was my idea.
         That
Genpen:  That was her idea, not mine I'm sorry. That was my idea.
Target:  It was her idea.

Input:   <13> You were planning to marry Mrs. Van Dorn, weren't you?
Gen:     You were planning to marry Mrs.
         Van Dorn
         Van Dorn Van D
Genpen:  You were planning to marry Mrs.
         Van Van Dorn
         Van. Van You
Target:  You were going to get married, weren't you?

Input:   <10> I've just had time to think things out put myself in your position.
Gen:     I've just had time to think things out. Put myself in
         to think things
Genpen:  I've just had time to think things out. Put myself in out
         to think
Target:  I've put myself in your position

Length Encodings and Length Tokens - Epoch 10
Input:   <7> That was her idea, not mine.
Gen:     That was her thatThatThatThat ThatThatThatIt wasThatThat thatThat was
Genpen:  That was her That thatThatThat ThatThatIt was That That wasThat that That
Target:  It was her idea.

Input:   <13> You were planning to marry Mrs. Van Dorn, weren't you?
Gen:     You were planningYou wereYou planningYou planning planningYou wanted planningYou would marryYou
Genpen:  You were planningYou planning planningToYou,You would marryYou wanted planning toYou
Target:  You were going to get married, weren't you?

Input:   <10> I've just had time to think things out put myself in your position.
Gen:     I just had time time timeHadIIIHad time timeI timeII
Genpen:  I just had time timeIIHad timeTime toITo think timetoI
Target:  I've put myself in your position

Table 4.2: Generated sequences by the BART model with additional length encodings and length tokens, fine-tuned on the Plint dataset; outputs are shown after 5 and 10 epochs. Gen corresponds to a beam search with a beam size of 5 and Genpen to a bi-gram penalised beam search with a beam size of 5.

4.2 BART with Ratio Tokens

Based on the methods in Section 3.3.3, the ratio tokens were created to represent different length ratios between the source and target sentences. To evaluate this method, we fine-tuned the BART model on the WikiOpen and OpenBack datasets. The datasets were divided into train, validation, and test sets with a ratio of 80/15/5. Fine-tuning was performed with a batch size of 64 and a learning rate of 3.10 · 10^-4. The checkpoint used to initialize our model prior to fine-tuning was "facebook/bart-base". All models had converged after 5 epochs, meaning that there was no further improvement in validation loss, and 5 epochs was therefore used for all models.

At inference time, we created a test set by backtranslating new data, as in Section 3.2.1. This set contains 1000 sentences, and the distributions of the category-based and value-based ratio tokens can be seen in Figures 4.5 and 4.6.

Figure 4.5: Distribution of the test set based on ratio categories.

Figure 4.6: Distribution of the test set based on ratio values.

By prepending each of the sentences in the test set with the different ratio tokens, we could generate evaluation data to analyse the impact of each token. To generate the evaluation data, we used a beam search with a beam size of 3. The evaluation was constructed in two ways. The first was to prepend every sentence in the test set with each token. The second was to divide the test set into groups per length token, so that each token was evaluated on the group it belongs to based on the ratio between the source and the corresponding target sentence; this meant creating new subsets within the test set to evaluate the performance of each token.

The evaluation measures the impact of each token by first analysing the length ratio of the generated sentence compared to the source sentence, denoted LRsrc. This was done for all sentences in the test set and therefore corresponds to the first evaluation strategy. Furthermore, the evaluation and analysis are based on the metrics in Section 3.5, carried out for each token group, which corresponds to the second evaluation strategy. We use ROUGE-2 recall to evaluate the ROUGE-N metric. Lastly, a human evaluation of random samples of the generated sentences was performed.
4.2.1 Baseline

The baseline models were created by fine-tuning BART on each of the two datasets without any length tokens, meaning that they were fine-tuned only on the source-target aligned sentences. These baselines make it possible to measure the impact of the tokens. The resulting length ratios and metrics for the baseline models can be seen in Tables 4.3 and 4.4.

BART - BASELINE
            WikiOpen   OpenBack
            LRsrc      LRsrc
Baseline    0.88       0.88

Table 4.3: The mean length ratios against the source (LRsrc) for each dataset with the baseline models. The results were evaluated from a BART model fine-tuned on the WikiOpen and OpenBack datasets without any ratio tokens.

BART - BASELINE
            WikiOpen                      OpenBack
            BLEU    ROUGE-2   METEOR      BLEU    ROUGE-2   METEOR
Baseline    50.15   62.74     76.28       50.39   62.26     75.65

Table 4.4: The metrics evaluated for each dataset with the baseline models. The results were evaluated from a BART model fine-tuned on the WikiOpen and OpenBack datasets without any ratio tokens.

4.2.2 Category-based Ratio Tokens

The first version of the method with ratio tokens uses three categories to represent short, normal, and long sentences. The distribution of the three tokens in each dataset can be found in Figure 4.7. The resulting length ratios and metrics can be found in Tables 4.5 and 4.6.

Figure 4.7: The distribution of the category-based ratio tokens in WikiOpen and OpenBack.

BART - CATEGORIES
                WikiOpen   OpenBack
Token           LRsrc      LRsrc
Short token     0.76       0.78
Normal token    0.98       0.98
Long token      1.13       1.15

Table 4.5: The resulting mean length ratios against the source (LRsrc) with category-based ratio tokens on the WikiOpen and OpenBack datasets. The leftmost column contains the evaluated token.

BART - CATEGORIES
                WikiOpen                      OpenBack
Token           BLEU    ROUGE-2   METEOR      BLEU    ROUGE-2   METEOR
Short token     46.42   57.22     71.25       46.14   56.83     70.81
Normal token    74.07   80.84     87.76       74.80   81.75     88.25
Long token      45.22   54.66     68.89       46.95   56.21     69.35

Table 4.6: The evaluated metrics for each dataset with the category-based tokens. The results are evaluated from a fine-tuned BART model on the WikiOpen and OpenBack datasets with tokens representing short, normal, and long sentence ratios.

4.2.3 Value-based Ratio Tokens

The second version of the method using ratio tokens uses 20 value-based tokens ranging from 0 to 2, with an interval of 0.1. The distribution of the tokens in the data can be seen in Figure 4.8. The results of this method can be seen in Tables 4.7 and B.1. Due to the imbalance of the value-based ratio tokens, we only present the metrics for tokens in the range 0.5 to 1.2; for a full presentation of the results for each token, we refer to Appendix B.

Figure 4.8: The distribution of each value-based ratio token for the WikiOpen and OpenBack datasets. The x-axis can be interpreted as the correspondi