Generating subtitles with controllable length using natural language processing
Master's thesis in Computer Science and Engineering

Joakim Svensson
Victor Troksch

Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2022

Master's Thesis 2022

© Joakim Svensson 2022.
© Victor Troksch 2022.

Supervisor: Richard Johansson, Department of Computer Science and Engineering, Chalmers
Advisor: Niklas Jansson & Peter Eklund, Plint AB
Examiner: Moa Johansson, Department of Computer Science and Engineering, Chalmers

Master's Thesis 2022
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX
Gothenburg, Sweden 2022

Abstract

Creating subtitles for video content is a task that has traditionally been performed manually by subtitlers. When creating a subtitle, there are rules and guidelines for how the text should be presented to the viewer. Therefore, a subtitle translated from one language to another often contains linguistic compression in the form of paraphrasing or removing parts of the dialogue. With advances in natural language processing, subtitlers now have tools for machine translation and automatic speech recognition to assist them in their work. This thesis explores various methods for controlling the generated output length of a sequence-to-sequence model, the type of model typically used for text generation and therefore also for machine translation. We apply different modifications to both the model itself and the data to control the output. Furthermore, this project makes use of transfer learning and pre-trained models based on the Transformer architecture. The length ratio method produced the best results, making it possible to effectively control the output length of a generated subtitle. We also found that this method could be applied to a translation model. Although it is a relatively simple method, it produced the desired results with linguistic correctness.

Keywords: Natural Language Processing, NLP, Transformer, seq2seq, text generation, BART, subtitles

Acknowledgements

We want to express our thanks and gratitude to our supervisors, who have helped and supported us throughout this project. At Plint, we would like to thank Niklas Jansson & Peter Eklund for their involvement and support in our daily work. We would also like to thank our supervisor Richard Johansson, from the Department of Computer Science and Engineering at Chalmers University of Technology, who has provided us with important feedback along the way. Finally, we would like to thank all of our family members and friends who have supported us during this time.

Joakim Svensson & Victor Troksch, Gothenburg, June 2022

Contents

List of Figures
List of Tables
1 Introduction
1.1 Background
1.2 Problem Definition
1.3 Aim
1.4 Limitations
1.5 Related Work
1.5.1 Neural Machine Translation with Constraints
1.5.2 Text Summarization
1.5.3 Sentence Simplifications
1.6 Outline
2 Theory
2.1 Natural Language Processing
2.1.1 Word Tokens
2.1.2 Word Embeddings
2.2 Artificial Neural Networks
2.2.1 Feed Forward Neural Network
2.2.2 Recurrent Neural Networks
2.3 Sequence-to-sequence Models
2.3.1 Attention
2.3.2 Teacher Forcing
2.3.3 Autoregressive generation
2.4 Transformer
2.4.1 Self-attention
2.4.2 Positional Encodings
2.5 Pre-Trained Models
2.5.1 BERT
2.5.2 GPT
2.5.3 BART
2.5.4 Marian & MarianMT
2.6 Metrics
2.6.1 Cosine similarity
2.6.2 ROUGE-N
2.6.3 BLEU
2.6.4 METEOR
3 Methods
3.1 Plint Data
3.2 Public Data
3.2.1 Backtranslation with OpenSubtitles
3.2.2 OpenSubtitles - English to Swedish
3.2.3 WikiOpen
3.3 Model
3.3.1 Hugging Face
3.3.2 Transformer with Length Encoding
3.3.3 Transformer with Length Token
3.4 Training
3.5 Evaluation
4 Results
4.1 BART with Length Encodings and Tokens
4.1.1 Baseline
4.1.2 Length Encodings and Length Tokens
4.2 BART with Ratio Tokens
4.2.1 Baseline
4.2.2 Category-based Ratio Tokens
4.2.3 Value-based Ratio Tokens
4.2.4 Manual Token Evaluation
4.3 Marian with Ratio Tokens
4.3.1 Baseline
4.3.2 Category-based Ratio Tokens
4.3.3 Value-based Ratio Tokens
4.3.4 Manual Token Evaluation
4.4 Summary
5 Discussion and Conclusion
5.1 Discussion
5.1.1 Methods of choice
5.1.2 Reasoning about the data
5.1.3 Evaluating the models
5.1.4 Analysing the results
5.2 Conclusions
5.2.1 Future work
Bibliography
A Appendix 1
B Appendix 1

List of Figures

2.1 A visualization of a many-to-many RNN. The network is unrolled in the right-hand side of the figure to visualize the effect of the hidden states h.
2.2 The attention mechanism visualized with an example sequence.
2.3 Teacher forcing exemplified with an RNN. The input at x(3) is not the output from y(2); instead it is the actual ground truth at the associated time step.
2.4 Beam and greedy search visualized by an example. The yellow lines represent the greedy approach, which takes the highest probability for each word and results in a score of 0.20. The green lines represent the beam search, which takes multiple combinations into consideration and results in a score of 0.32.
2.5 Visualization of the Transformer architecture as depicted in the original paper. The yellow part represents the encoder block and the green the decoder block.
2.6 Visualization of a sequence containing 6 words, also with an embedding dimension of 6. Worth noting is that the denominator used for this visualization is altered in order to make the shapes better visible on small example sentences.
2.7 The resulting positional encoding values from Figure 2.6.
3.1 Visualization of the distributions for subword lengths and subword ratios of the OpenBack dataset.
3.2 Visualization of the distributions for subword lengths and subword ratios of the OpenSubtitles dataset.
3.3 Visualization of the distributions for subword lengths and subword ratios of the WikiOpen dataset.
4.1 Average training loss plot of the initial baseline model.
4.2 Distribution of target subword length in the dataset from Plint.
4.3 Average training loss plot of the initial model.
4.4 Visualization of the implemented sinusoidal positional encodings and the original learned positional embeddings of the BART checkpoint. The sinusoidal positional encodings range from -1 to 1, while the original learned embeddings are mostly centered around 0.
4.5 Distribution of the test set based on ratio categories.
4.6 Distribution of the test set based on ratio values.
4.7 The distribution of the category-based ratio tokens in WikiOpen and OpenBack.
4.8 The distribution of each value-based ratio token for the WikiOpen and OpenBack datasets. The x-axis can be interpreted as the corresponding ratio token.
4.9 Distribution of the test set based on ratio categories.
4.10 Distribution of the test set based on ratio values.
4.11 Distribution of the training set for category-based tokens.
4.12 Distribution of the training set for value-based tokens.

List of Tables

3.1 A constructed example of what an entry in the Plint dataset looks like.
4.1 Generated sequences by the BART baseline model fine-tuned on the Plint dataset for 5 epochs. Gen corresponds to a beam search with a beam size of 5 and Gen_pen to a bi-gram penalised beam search with a beam size of 5.
4.2 Generated sequences by the BART model with additional length encodings and length tokens, fine-tuned on the Plint dataset for 5 epochs. Gen corresponds to a beam search with a beam size of 5 and Gen_pen to a bi-gram penalised beam search with a beam size of 5.
4.3 The mean length ratios against the source (LR_src) for each dataset with the baseline models are presented in the columns of this table. The results were evaluated from a fine-tuned BART model on the WikiOpen and OpenBack datasets without any ratio tokens.
4.4 The metrics evaluated for each dataset with the baseline models. The results were evaluated from a fine-tuned BART model on the WikiOpen and OpenBack datasets without any ratio tokens.
4.5 The resulting mean length ratios against the source (LR_src) with category-based ratio tokens on the WikiOpen and OpenBack datasets. The leftmost column contains the evaluated token.
4.6 The evaluated metrics for each dataset with the category-based tokens. The results are evaluated from a fine-tuned BART model on the WikiOpen and OpenBack datasets with tokens representing short, normal, and long sentence ratios.
4.7 The resulting mean length ratios against the source (LR_src) with value-based ratio tokens on the WikiOpen and OpenBack datasets. The leftmost column contains the evaluated tokens and the other two columns contain the corresponding mean length ratios for each dataset.
4.8 The evaluated metrics for each dataset with the value-based ratio tokens. The results are evaluated from a fine-tuned BART model on the WikiOpen and OpenBack datasets with 20 tokens representing length ratios between 0 and 2, with an interval of 0.1. The leftmost column shows the token used to evaluate the model.
4.9 Sentences generated with the BART model fine-tuned on WikiOpen.
4.10 Sentences generated with the BART model fine-tuned on OpenBack.
4.11 Sentences generated with the BART model fine-tuned on WikiOpen.
4.12 Sentences generated with the BART model fine-tuned on OpenBack.
4.13 The mean length ratio against the source (LR_src) for the Marian baseline model is presented in the columns above. It was created by fine-tuning the Marian model on OpenSubtitles data without any additional length tokens.
4.14 The metrics were evaluated with category-based tokens on the OpenSubtitles data using a fine-tuned Marian model. The left column contains the evaluated tokens and the right corresponds to their value.
4.15 The metrics evaluated for each dataset with category-based tokens. The result is evaluated from a Marian model fine-tuned on the OpenSubtitles dataset with tokens representing short, normal, and long sentence ratios.
4.16 The resulting mean length ratios against the source (LR_src) with value-based ratio tokens on the OpenSubtitles dataset. The leftmost column contains the evaluated tokens and the right column corresponds to the mean length ratio for each token.
4.17 The evaluated metrics with the value-based ratio tokens. The results are evaluated from a fine-tuned Marian model on the OpenSubtitles dataset with 20 tokens representing length ratios between 0 and 2, with an interval of 0.1. The leftmost column shows the token used to evaluate the model.
4.18 Examples of sentences generated with the Marian model.
4.19 Examples of sentences generated with the Marian model.
B.1 The evaluated metrics for each dataset with the value-based ratio tokens. The results are evaluated from a fine-tuned BART model on the WikiOpen and OpenBack datasets with 20 tokens representing length ratios between 0 and 2, with an interval of 0.1. The leftmost column shows the token used to evaluate the model.
B.2 The evaluated metrics with the value-based ratio tokens. The results are evaluated from a fine-tuned Marian model on the OpenSubtitles dataset with 20 tokens representing length ratios between 0 and 2, with an interval of 0.1. The leftmost column shows the token used to evaluate the model.

1 Introduction

Creating subtitles for video content is a challenging task that is performed by translators and linguists working as subtitlers. The amount of content that needs subtitling is increasing due to the growth of streaming platforms and international services. There are also regulations, such as the directive from the European Union in 2020 [9], which states that all material provided by the public sector in the union (and therefore in Sweden as well) needs subtitling. The creation of subtitles is a manual process, and partially automating it would be highly beneficial given the work and time resources that it requires.

1.1 Background

In our daily life, we encounter subtitles, whether we think about it or not. It can be, for example, when watching the news or our favourite TV show. With the increase of streaming services and international content comes an increased demand for subtitles. At first glance, it may seem that creating subtitles is an easy task, but there is more to it than just translating dialogues from one language to another. When subtitling, the text must be translated while following certain length rules and specifications [4]. For example, the length restriction exists to ensure that the subtitle actually fits the screen. The length of a subtitle is also affected by the reading speed, which means that it must be possible to read the subtitle in a short period of time, as new subtitles will follow. Furthermore, there are also specifications and rules involving how certain words must be preserved (or censored), when to break lines, how to handle scene cuts, and how to identify speakers (such as a narrator), to mention a few. The challenge lies in creating subtitles that follow these rules while still maintaining the essence of the dialogue.

The creation of subtitles is a task traditionally performed by subtitlers who translate spoken content into subtitles for a specific language. Subtitlers also have to handle the task of segmentation, meaning that they have to partition the subtitles to match the dialogue. Segmentation is usually performed by adding time stamps to each subtitle, which determine when the subtitle will be visible to the viewer. If the segmentation is incorrect, the subtitles can appear at the wrong time and therefore confuse and annoy the viewer.

Another challenge in creating subtitles is that a dialogue can be highly nuanced and could contain things such as colloquialisms, cultural references, and humour. Furthermore, it is also hard to translate certain sayings that are typical in the source language but do not exist in the target language. These are things that are easy for a human to spot and find a different translation for, but much more difficult for a machine.

Recent developments of technologies within natural language processing (NLP), such as automatic speech recognition (ASR) and machine translation (MT), have affected the way subtitlers work. With ASR, they can now get a transcript of the spoken content to use as a template when creating subtitles. This transcript was previously created by listening to the content and manually writing it down. Developments like this can significantly increase the productivity of subtitlers, as was also shown in a study by Campbell in 2019 [6]. Although subtitlers can use aids to assist them in their work, the AudioVisual Translators Europe (AVTE) association claims that fully automated machine translation models are far from taking over the work of media translators [10]. However, they agree on the point of using ASR and MT as a complement in their work as linguists and media translators. At Plint, the company where this project takes place, a software platform is developed and maintained on which its large pool of freelancing subtitlers works to create subtitles that the company can later deliver to its customers.

1.2 Problem Definition

To keep a subtitle within the length limitation, the task of creating subtitles is often also the task of summarizing the content along with translating it. This is a concept called linguistic or semantic compression, which means that the semantics of the sentence are maintained while paraphrasing parts of the sentence or removing redundant words.
Therefore, this problem can be divided into two sub-problems within NLP, namely text summarization and machine translation. This project intends to focus on the summarization part, which means that a subtitle should be generated with a controllable length. The project intends to explore various methods to approach and solve this problem. Furthermore, this involves leveraging current state-of-the-art models within NLP. These models have been trained on large amounts of data and are a suitable starting point, rather than developing a model from scratch.

1.3 Aim

The aim of this project is to generate subtitles with controllable lengths. Controlling the output length is important for a subtitle due to the limited number of characters that it is allowed to have. Restricting the output length of a generated sequence would not only be beneficial for subtitle generation; it would also be beneficial in other tasks involving language modelling, such as machine translation and text summarization.

More specifically, the project will consist of exploring and implementing models with controllable output lengths through either additional tokens, length encodings, or both. These methods will be implemented with existing pre-trained models such as BART [19]. Furthermore, the aim is to evaluate the methods with both quantitative and qualitative measurements to see how the implemented methods influence the model.

1.4 Limitations

The project is limited to working only with models based on the Transformer architecture [38]. This is motivated by the fact that most of the current state-of-the-art models within various NLP tasks use this architecture. Furthermore, the project is limited to only considering pre-trained models. Training a language model from scratch would require more computing power and resources than available. The final limitation is that the project only considers subtitle generation from English to English and from English to Swedish. The limitation of considering only two languages is motivated by the time constraint of the thesis.

1.5 Related Work

The project has taken inspiration from various articles and papers within the fields of machine translation, text summarization, and sentence simplification. This section will mention a few of the articles related to and mentioned in this thesis.

1.5.1 Neural Machine Translation with Constraints

Neural machine translation (NMT) is a well-researched field within NLP, and current state-of-the-art models are all based on the Transformer architecture [38]. However, when generating subtitles, the length of the subtitle is crucial, and hence this task is closely related to the work of constraining the output length of NMT. Many attempts to restrict the output length of a sequence-to-sequence model are inspired by the work of Takase & Okazaki [37]. To preserve a length constraint, they implemented a modified positional encoding in the decoder of the Transformer, which encodes the position with respect to how many tokens are left to be generated in the sequence, rather than how many tokens have been generated. This work inspired Lakew et al. [18] and Niehaus [28], who both use this encoding in their work. They also experimented with the implementation of special tokens to represent a specific length or ratio to their models.

1.5.2 Text Summarization

Translating a wordy dialogue into a subtitle is a complex task and hence a challenge to approach with text summarization. Text summarization is the task of creating a summary that contains the most important and relevant content of the original text. The two most common methods are extractive summarization and abstractive summarization [40]. Extractive summarization works by selecting a subset of words present in the original text and stitching them together to create a summary, while abstractive summarization can paraphrase and insert new words to create a more fluent and coherent summary. Rush et al. [33] developed a model that implements abstractive summarization to create summaries of fixed length. Their summary generation is carried out using a beam search algorithm [12] constrained to a fixed length. Although their model has proven to be effective, it is limited to working within a single language.

1.5.3 Sentence Simplifications

Text simplification is the subfield within NLP that focuses on the automatic simplification of sentences. Simplifying sentences is beneficial for people who are not fluent in a language, such as children, language learners, or people with reading disorders. Simplifying a text or sentence consists of modifying the content and rewriting the structure while still preserving the original meaning. Modifications can consist of paraphrasing words, removing redundant words, or splitting a sentence into two if it helps simplify the original sentence. The relevant work for this thesis is the work of Martin et al. [23][24], which is based on creating Control Tokens to control the text generation of a Transformer simplification model. These tokens are created to represent different features of the relation between the source and target sentences, such as the length difference and the amount of paraphrasing between the two sentences. This method has been proven to work on Transformer models trained from scratch, but also on pre-trained models such as BART [19].

1.6 Outline

This chapter of the thesis has introduced the background of the project, as well as the definition, aim, and limitations of the project. Furthermore, this chapter also covers a short introduction to previous work related to the project. The outline of the remaining chapters of the thesis can be seen in the list below:

• Chapter 2 introduces the theory involved in this thesis. Concepts and theory specific to NLP, such as tokenization and word embeddings, are described along with more general machine learning theory, such as sequence-to-sequence models and the Transformer, together with an introduction to pre-trained models. The chapter is finalised with an introduction to the NLP metrics that are used to evaluate language models.

• Chapter 3 describes the methods used in this thesis. The chapter starts by introducing the data used in the project and the associated pre-processing techniques. In addition, it explains what models and what methods were used. Finally, the chapter describes how the model is trained and evaluated.

• Chapter 4 presents the results for the evaluated methods and also shows a few examples of the generated sequences.

• Chapter 5 discusses and draws conclusions based on the methods and the results. Improvements and suggestions for future work are also presented here.

2 Theory

This chapter introduces and describes the theory and concepts used in this thesis. The chapter starts with a short introduction to Natural Language Processing (NLP) in general and the most important concepts within the field.
Following the introduction of NLP comes an in-depth explanation of some of the most influential deep learning models for NLP, like the sequence-to-sequence (seq2seq) architecture and the Transformer, and an introduction to the pre-trained models used in this thesis. Lastly comes an explanation of the most common metrics and scores used to evaluate our methods and experiments.

2.1 Natural Language Processing

Natural language processing (NLP) is the field within machine learning that involves text and natural language. The field can be described as the interaction between computers and natural language, and the bridge that gives a computer the ability to process and handle it. Natural language is defined as the native speech of people, in contrast to artificial languages that are designed, for example, to control a computer. There are many different topics and subfields within NLP; to name a few, there are tasks like machine translation, text generation, text classification, and question answering.

Modern NLP models are typically constructed and trained with neural networks. This implies that the data require special features that can be interpreted by a machine learning model. Therefore, the data have to go through a number of pre-processing steps in order to transform the raw text into numerical data that a computer can understand. The following sections will cover the most essential methods for representing text with numbers.

2.1.1 Word Tokens

Tokenization is one of the most important preprocessing steps in NLP. Tokenization is a technique used to split up a text, sentence, or document into smaller pieces that are called tokens. The tokens can be divided into groups of words, subwords, or even characters, depending on what tokenizer is used. The tokenizer keeps track of all tokens in a vocabulary, meaning that it maps every token to a specific token ID. To get an understanding of the different tokenization methods, consider the following sentence:

The kids are playing football.

Word-based tokenization: [The, kids, are, playing, football]
Character-based tokenization: [T, h, e, k, i, d, s, a, r, e, p, l, a, y, i, n, g, f, o, o, t, b, a, l, l]
Subword-based tokenization: [The, kid, s, are, play, ing, foot, ball]

Word- and character-based tokenization are probably the easiest to interpret. However, these two methods have some issues. With the word-based method, the vocabulary tends to become very large because every word has its own token (the English language contains more than 500,000 unique words). This issue can be solved by creating a vocabulary consisting of only the most common words and then assigning an "unknown token" to words not included in the vocabulary. However, this method will lead to a loss of performance for the model, since there will be a loss of information at each of the unknown tokens. Another problem is that similar words, like "bird" and "birds", will initially have completely different representations in the model due to the individual tokens.

The problem with large vocabularies is countered by basing the tokenization on characters. However, a single character often does not say much on its own, compared to words in languages using the Latin alphabet (Chinese characters carry more information, for example). This means that the model has to look at several tokens to interpret the meaning of a single word. The model also has to handle larger inputs for every query: a word-based input of 5 tokens could be equivalent to more than 30 character tokens.
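As a concrete illustration of the granularities discussed above, the minimal sketch below splits the example sentence at the word, character, and subword level. It assumes the Hugging Face transformers library; the checkpoint name is chosen purely for illustration, and the exact subword split depends on the vocabulary that tokenizer was trained with, so it will not reproduce the example split above exactly.

```python
# Word-, character-, and subword-level tokenization of the example sentence.
# The subword split comes from a pretrained tokenizer and is checkpoint-dependent.
from transformers import AutoTokenizer

sentence = "The kids are playing football."

word_tokens = sentence.split()                  # word-based split (punctuation stays attached)
char_tokens = list(sentence.replace(" ", ""))   # character-based split, whitespace removed

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")  # illustrative checkpoint
subword_tokens = tokenizer.tokenize(sentence)
token_ids = tokenizer.convert_tokens_to_ids(subword_tokens)

print(len(word_tokens), len(char_tokens), len(subword_tokens))
print(subword_tokens)   # subword strings from the tokenizer's vocabulary
print(token_ids)        # the unique integer IDs assigned to each token
```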
The subword-based tokenization is a combination of the two previously mentioned methods and is also the most common one; it is used by most of the current state-of-the-art models within NLP. Common character sequences, such as short words, are left intact, while longer and more uncommon ones are split into subwords. This method has the advantage that it can build every word in the document by stitching together said subwords. This means that the vocabulary does not need to be as big as for a word-based tokenizer, but also that the model does not have to handle as many tokens as with a character-based tokenizer. This method can also learn prefixes and suffixes along with grammatical word endings, which can be seen in the example sentence above: The, kid, s, are, play, ing, foot, ball. This will allow the model to see the similarity between "kid" and "kids", for example. However, the partitioning does not necessarily turn out in such a favourable way. The sentence might as well be tokenized as: Th, ki, ds, are, pla, ying, foo, tball. The partitioning of the subword tokens depends on the data that was used to train the tokenizer.

There are several algorithms to create the subword tokens, where Byte-Pair encoding [35], WordPiece [34] and SentencePiece [17] are the most common ones. Byte-Pair encoding (BPE) creates the vocabulary by first finding every unique word in a corpus, called pre-tokenization, where also the word frequency is saved. The next step is to create a base vocabulary consisting of every character present in the unique words. Starting with the base vocabulary, the training data are tokenized into temporary tokens consisting of neighbouring existing token pairs. The most frequent temporary token can be determined from the earlier word counts and is added to the vocabulary. The process is repeated, with one token added per iteration, until the desired vocabulary size (a specified hyperparameter) is reached.

WordPiece tokenization is comparable to BPE. The algorithm starts by initializing a base vocabulary consisting of every character that occurs in the training data. However, the pair selection is not based on the highest frequency, but rather on whether the pair maximizes the likelihood of the training data when it is added to the vocabulary. In other words, the pair that is merged is the one whose addition increases the likelihood of the training data the most. SentencePiece [17] is a method that also utilizes the BPE method, but includes whitespace in the set of available characters. The tokens that are created make up the final vocabulary, and, as mentioned earlier, every token in the vocabulary is assigned a unique integer as ID or index.

2.1.2 Word Embeddings

Many NLP models have the words represented by one-hot encodings, meaning a binary vector where every element is equal to zero apart from one, which represents the token ID. However, assigning a single binary vector to each word is often not enough for a machine learning model to understand words on a deeper level. The most common way to represent words is therefore to use word embeddings. A word embedding is a mathematical representation of each word that works by assigning a vector of real-valued numbers to each word in the vocabulary. Words that have semantic similarities are also expected to have similar word embeddings in the vector space. This is not possible with one-hot encoding, since binary vectors entail the same distance between every pair of words in the vocabulary.
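To make the contrast with one-hot vectors concrete, the minimal PyTorch sketch below maps token IDs to dense vectors with an embedding layer. A freshly initialized embedding is random, so the similarities it produces are meaningless until the layer has been trained as part of a model; the vocabulary size and dimension are arbitrary illustration values.

```python
# Minimal sketch of an embedding lookup table in PyTorch.
import torch
import torch.nn as nn

vocab_size, embedding_dim = 10_000, 8
embedding = nn.Embedding(vocab_size, embedding_dim)   # one trainable vector per token ID

token_ids = torch.tensor([12, 431, 97])   # IDs produced by a tokenizer
vectors = embedding(token_ids)            # shape (3, 8): dense, real-valued representations

# Unlike one-hot vectors, dense embeddings can express graded similarity,
# here measured with cosine similarity (see Section 2.6.1).
sim = torch.cosine_similarity(vectors[0], vectors[1], dim=0)
print(vectors.shape, sim.item())
```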
To create word embeddings, there are a few available methods, but the most common ones involve machine learning or statistics.

Word2Vec [27][26] is a statistical method that is essentially a shallow neural network consisting of one hidden layer, which can be trained with two different methods. The first is the Continuous Bag-of-Words (CBOW) method, which learns the word embeddings by placing a context window around a word and letting the network try to predict that word. The weights corresponding to the word later act as the embedding. The Skip-gram method is similar to CBOW but works the other way around, meaning that the model is trained to predict the surrounding words in the training set. In both methods, the size of the context window is specified when the model is created, and the embedding dimension is likewise a fixed hyperparameter of the model.

GloVe (Global Vectors for Word Representation) [30] is an embedding technique for distributed word representation, which utilizes unsupervised learning to obtain embedding vectors. Unlike Word2Vec, GloVe does not only look at the local statistics, meaning the surrounding words, but also considers global word co-occurrence statistics. The model is trained by creating a co-occurrence matrix over every word in the training corpus. For every word, the matrix stores the number of occurrences together with adjacent words within a specified window size.

Word embeddings can also be constructed by training a regular embedding layer, as in [11][31][19]. The embedding layer acts as a lookup table, where every word corresponds to a vector of a specified embedding dimension. The weights are initialized randomly and updated with respect to the loss function of the language model.

2.2 Artificial Neural Networks

Artificial neural networks (ANNs) are a type of computing system inspired by the human brain. ANNs can come in many different shapes and sizes depending on the task they are used to perform. This section will cover a short introduction to the most essential network types relevant to this thesis and how they are constructed.

2.2.1 Feed Forward Neural Network

The feedforward neural network (FFNN) was the first type of ANN to be invented and is also probably one of the simplest network architectures to date. The network is called a feedforward neural network because the information flows forward through the network without any cycles or loops. The simplest version of a FFNN can be viewed as a single perceptron, which means that it is constructed with only an input layer and an output layer. This version is also known as a single-layer perceptron. The output is calculated according to Equation 2.1, where $w$ represents the weights, $x$ the input, $b$ the bias, and $\sigma$ the non-linear activation function. A single-layer perceptron is a linear classifier.

$$y = \sigma\left(\sum_{i=0}^{n} w_i x_i + b\right) \qquad (2.1)$$

The other version of a FFNN is called the multi-layer perceptron (MLP), which is composed of many perceptrons. MLPs are constructed of at least three layers: an input layer, a hidden layer, and an output layer. This enables the network to handle problems that are not linearly separable and hence makes it suitable for tasks such as classification in supervised learning.
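A minimal NumPy sketch of Equation 2.1 is given below; a sigmoid is used as the non-linear activation $\sigma$, and the weights, bias, and inputs are arbitrary example values.

```python
# Single-layer perceptron: y = sigma(sum_i w_i * x_i + b), i.e. Equation 2.1.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(x, w, b):
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.8, 0.1, -0.4])   # weights
b = 0.2                          # bias
print(perceptron(x, w, b))
```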
2.2.2 Recurrent Neural Networks

Recurrent neural networks (RNNs) are a type of ANN used to process sequential data. Unlike ordinary FFNNs, RNNs contain directed cycles, which means that the information does not flow in a strictly forward order. These cycles enable the network to carry information about previous steps in the computation, which can be seen as the network having memory. Due to their ability to process sequential data, RNNs are typically useful in NLP tasks.

RNNs work by receiving an input x and, for each time step t, calculating an output y(t) based on the input x(t) and the previous output y(t - 1). The output of the previous step is saved in a hidden state, denoted h(t). Furthermore, an RNN can have different sizes of inputs and outputs. A one-to-many network takes one input and generates a sequence of outputs. A many-to-one network works in the opposite way, which means that it takes many inputs to generate one output. Lastly, many-to-many RNNs take an input sequence and generate a sequence as output; hence this type is suitable for machine translation. An illustration of a many-to-many network can be seen in Figure 2.1.

Figure 2.1: A visualization of a many-to-many RNN. The network is unrolled in the right-hand side of the figure to visualize the effect of the hidden states h.

2.3 Sequence-to-sequence Models

A sequence-to-sequence (seq2seq) model, or encoder-decoder model, is a special class of ANN architectures that takes sequential data as input and generates a new sequence as output. The seq2seq model is composed of an encoder and a decoder, where both components are usually constructed from RNNs. The encoder is used to compress the input sequence into a context vector that the decoder can use to generate the new output.

The encoder works by encoding each word in the input sequence, computing a hidden state $h_i$ for each time step $i$. The final hidden state, at time step $n$, is denoted $h_n$, and this state is equivalent to the context vector that is sent to the decoder. The decoder generates an output $y_i$ for each time step, depending on the previous state. The initial state of the decoder is the context vector $h_n$.

One drawback of seq2seq models and the encoder-decoder architecture is that they have issues with long sentences. Cho et al. [8] showed that when the input sequence becomes longer, it is harder for the model to encapsulate all the important information in the context vector.

2.3.1 Attention

To solve the problem that arises with long input sequences, Bahdanau et al. [2] introduced the attention mechanism. This mechanism is created with the intention of mimicking cognitive attention by deciding which parts of the input sequence are of importance. It works by creating a context vector $c_i$ from a linear combination of the hidden states $h_j$ in the encoder and the attention weights $\alpha_{ij}$ for each time step $j$, see Equation 2.2. The context vector $c_i$ is also influenced by the previous hidden state $s_{i-1}$, as can be seen in Equation 2.4.

$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j \qquad (2.2)$$

Here, the weight of each $\alpha_{ij}$ is equal to:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})} \qquad (2.3)$$

Here, $e_{ij}$ is the alignment score of a FFNN described by the function $a$:

$$e_{ij} = a(s_{i-1}, h_j) \qquad (2.4)$$

By passing all context vectors $c_i$ to the decoder, the model can decide what to focus on in the sequence while decoding the next step. See Figure 2.2 for an illustration of how the attention mechanism is calculated.

Figure 2.2: The attention mechanism visualized with an example sequence.
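The following NumPy sketch walks through Equations 2.2-2.4 for one decoder step. The alignment function $a$ is realized as a small feed-forward scorer whose weight matrices (W_a, U_a, v_a) are random stand-ins chosen for illustration, not parameters taken from any trained model.

```python
# Minimal sketch of additive attention (Equations 2.2-2.4) for one decoder step.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_context(s_prev, H, W_a, U_a, v_a):
    """s_prev: previous decoder state (d,), H: encoder hidden states (T_x, d)."""
    # Equation 2.4: alignment scores e_ij = a(s_{i-1}, h_j)
    e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h) for h in H])
    # Equation 2.3: attention weights via a softmax over the scores
    alpha = softmax(e)
    # Equation 2.2: context vector as the weighted sum of encoder states
    return alpha @ H, alpha

d, T_x = 4, 3
rng = np.random.default_rng(0)
H = rng.normal(size=(T_x, d))     # encoder hidden states h_1 .. h_Tx
s_prev = rng.normal(size=d)       # previous decoder state s_{i-1}
W_a = rng.normal(size=(d, d))
U_a = rng.normal(size=(d, d))
v_a = rng.normal(size=d)

c, alpha = attention_context(s_prev, H, W_a, U_a, v_a)
print(alpha, c)   # attention weights sum to 1; c is the context vector c_i
```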
2.3.2 Teacher Forcing

Teacher forcing is a common method used to train seq2seq models quickly and efficiently. Consider an arbitrary seq2seq model that predicts an output y(t) given an input x(t). The method works by letting the next input to the model, x(t + 1), be equal to the actual ground truth, regardless of the predicted output y(t). This enables the model to learn the next prediction from the correct input in the training data, rather than using the predicted output from the previous time step. The idea behind this method is not to let the model train on false predictions and hence waste valuable training time. An example of teacher forcing can be seen in Figure 2.3.

Figure 2.3: Teacher forcing exemplified with an RNN. The input at x(3) is not the output from y(2); instead it is the actual ground truth at the associated time step.

2.3.3 Autoregressive generation

A seq2seq model is, in fact, also an autoregressive model. This is a type of model used to describe time-varying processes, such as language generation. For a model to be autoregressive, it has to predict future values based on previous values. To put this into context, an autoregressive language model lets the i-th generated word in a sequence depend on all preceding i - 1 words. The probability distribution for a word sequence can be described by Equation 2.5, where $w_{1:T}$ is the generated text sequence, $W_0$ equals the initial text sequence fed to the model, and $w_{1:0}$ is the empty set, implying that no sequence has been generated in advance.

$$P(w_{1:T} \mid W_0) = \prod_{t=1}^{T} P(w_t \mid w_{1:t-1}, W_0), \quad \text{where } w_{1:0} = \emptyset \qquad (2.5)$$

With Equation 2.5, text can be generated in different ways. One method is to use the greedy search algorithm. It selects the word with the highest probability at the given time step, that is, $w_t = \operatorname{argmax}_w P(w \mid w_{1:t-1})$. This algorithm is efficient, but comes with the downside that it can miss combinations of words that have a higher probability than the predicted one. The reason for this is that a word combination with higher probability can exist where the first word has a lower probability than the generated one. An example of this can be seen in Figure 2.4. The algorithm runs until a special end-of-sequence token is generated or a specified number of words has been reached.

Figure 2.4: Beam and greedy search visualized by an example. The yellow lines represent the greedy approach, which takes the highest probability for each word and results in a score of 0.20. The green lines represent the beam search, which takes multiple combinations into consideration and results in a score of 0.32.

The beam search algorithm counters the problem of missing out on hidden high-probability words by keeping the n most likely sequences at each time step t. This creates beams in the search tree, hence the name, which can get past lower probabilities in order to find better predictions further down the beam. Because the algorithm stores several hypothetical sequences and acts greedily at the same time (since it always picks the n words with the highest probability), it is guaranteed to find a sequence at least as likely as the one found by greedy search, at the expense of computational cost. However, the algorithm is not guaranteed to find the sequence with the highest probability, since that would require a complete search over all possible word combinations.
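The sketch below contrasts greedy search and beam search using the generate method of the Hugging Face transformers library, which is also used later in the thesis (see Section 3.3.1). The checkpoint name and input sentence are illustrative only; any seq2seq checkpoint with a generation head could be substituted.

```python
# Minimal sketch of greedy search versus beam search with Hugging Face generate.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")   # illustrative checkpoint
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")

inputs = tokenizer("The kids are playing football in the park.", return_tensors="pt")

# Greedy search: num_beams=1 picks the highest-probability token at every step.
greedy_ids = model.generate(**inputs, num_beams=1, do_sample=False, max_length=20)

# Beam search: keep the 5 most likely partial sequences at every step.
beam_ids = model.generate(**inputs, num_beams=5, do_sample=False, max_length=20)

print(tokenizer.decode(greedy_ids[0], skip_special_tokens=True))
print(tokenizer.decode(beam_ids[0], skip_special_tokens=True))
```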
Both the greedy and beam search algorithms can encounter the problem of repeating sequences of words. To avoid this problem, an n-gram penalty can be implemented [29]. This can be done in different ways, but one way is to set the probability to 0 for a word that would create an n-gram that has already appeared in the sequence.

It is also worth noting that human language is not always that predictable. In [15], the author shows that humans seem to prefer to be surprised by a text. This means that always picking the word with the highest probability, instead of sampling from the distribution, might not be the best practice if the goal is to mimic human-generated text.

2.4 Transformer

The Transformer is a deep learning model that was introduced by Vaswani et al. [38] in 2017. The model was developed for machine translation, but it quickly became state-of-the-art in many NLP-related tasks. The Transformer is essentially a seq2seq model, but unlike RNNs, the Transformer does not need to process data in sequential order. Without having to process the data in order, the training can be parallelized. To capture the structure and order of a sequence, the Transformer uses positional encodings (see Section 2.4.2) and the self-attention mechanism (see Section 2.4.1).

The Transformer is composed of two blocks, one block that works as an encoder and another that works as a decoder. In the original paper, the Transformer has 6 layers in each block, but the number of layers can be modified to serve specific tasks. In all layers within the two blocks, there are residual layers that work as skip connections. A skip connection is an operation where part of the output skips one or more layers and is instead added to a layer deeper in the model. Residual layers prevent deep networks from losing track of the input and lead to a better-performing network as a result [13].

The first block is the encoder block, where each of the 6 layers works as an independent encoder with its own weights. Each layer in the encoder can be broken down into two sub-layers: a multi-head attention module and a feed-forward neural network (FFNN). After each sub-layer inside the encoder, there is a residual connection followed by a layer normalization. Dropout is also applied to each sub-layer.

The second block is the decoder block, which is very similar to the encoder block, with the key difference that a multi-head attention mechanism is added over the output of the encoder stack. The decoder is also modified so that it cannot attend to subsequent positions when decoding the output. This means that the model output can only depend on the previous positions in the generated sequence. This modification is also known as masked multi-head attention.

Figure 2.5: Visualization of the Transformer architecture as depicted in the original paper. The yellow part represents the encoder block and the green the decoder block.

2.4.1 Self-attention

The Transformer was the first sequence transduction model that relies only on attention. The implementation that enabled this new feature was to replace the recurrent layers with a self-attention module. This type of attention mechanism works by allowing the model to compute attention within an input sequence. By relating the positions in a sequence to other positions in the same sequence, self-attention can create an understanding of how words relate to each other.

Self-attention is computed multiple times in each layer, in parallel and independently, through what is called multi-headed attention.
This is a module that concatenates the outputs of each self-attention module before linearly transforming them to create a final representation. A detailed explanation of how self-attention is calculated follows.

The three main components in calculating self-attention are the vectors of queries, keys, and values, denoted $Q$, $K$, and $V$. All these vectors are retrieved from the word representations in the input sequence, by multiplying the input embedding with the corresponding matrix, denoted $W^Q$, $W^K$, and $W^V$. The queries and keys have dimension $d_k$ and the value vectors dimension $d_v$. The self-attention calculation can be seen as mapping a query and a set of key-value pairs to an output.

Self-attention is measured with an attention score. This score represents how much attention should be paid to other parts of the input with respect to the current position. In other words, the score represents how much attention should be paid to other words in the sequence, based on the current word. The score is computed by first taking the dot product of the query matrix with the key matrix. The result is then divided by $\sqrt{d_k}$, followed by a softmax function. Lastly, the self-attention calculation is finalized by multiplying with the value matrix, see Equation 2.6.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \qquad (2.6)$$

As mentioned above, the Transformer calculates the self-attention scores in parallel, which means that it performs the self-attention calculation multiple times. This enables the model to focus on different positions inside the sequence and, in turn, leads to better scores. The original implementation of the Transformer uses 8 different heads, which generate 8 different self-attention matrices per sequence. To pass these values through the FFNN, the matrices for each head are concatenated into one big matrix. This matrix, which contains all the multi-headed attention scores, is then multiplied with an additional weight matrix, denoted $W^O$, to create the final representation, see Equations 2.7 and 2.8.

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^O \qquad (2.7)$$

$$\text{where } \mathrm{head}_i = \mathrm{Attention}(QW_i^Q, KW_i^K, VW_i^V) \qquad (2.8)$$
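To make Equations 2.6-2.8 concrete, the NumPy sketch below computes scaled dot-product attention for a toy input and combines two heads into a multi-head output. All matrices are random stand-ins; in a real Transformer the projection matrices $W_i^Q$, $W_i^K$, $W_i^V$ and $W^O$ are learned parameters.

```python
# Minimal sketch of scaled dot-product and multi-head attention (Eqs. 2.6-2.8).
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Equation 2.6: softmax(Q K^T / sqrt(d_k)) V
    d_k = K.shape[-1]
    scores = softmax(Q @ K.T / np.sqrt(d_k))
    return scores @ V

def multi_head(X, heads, W_O):
    # Equations 2.7-2.8: one attention per head, concatenate, project with W_O
    outputs = [attention(X @ W_q, X @ W_k, X @ W_v) for W_q, W_k, W_v in heads]
    return np.concatenate(outputs, axis=-1) @ W_O

seq_len, d_model, n_heads = 4, 8, 2
d_k = d_model // n_heads
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))   # one embedding per position in the sequence
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3)) for _ in range(n_heads)]
W_O = rng.normal(size=(n_heads * d_k, d_model))

print(multi_head(X, heads, W_O).shape)    # (4, 8): one d_model-sized vector per position
```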
2.4.2 Positional Encodings

Unlike RNNs, the Transformer by default sees the input as an unordered collection of words rather than a sequence. Since language depends on word ordering, positional information needs to be added in order for the Transformer to function properly. One key element in the Transformer architecture is therefore the positional encoding, which takes care of the positional information. This encoding is applied to both the encoder and decoder parts of the model. An example of the importance of word ordering can be seen below, where both sentences contain the same words but mean two different things.

"He likes football but hates golf."
"He likes golf but hates football."

In [38], the authors describe two ways to encode the positions, either fixed or learned. The fixed method is based on the use of trigonometric functions to encode the positions. The positions are encoded according to Equation 2.9, where $pos$ represents the position of the word in a sentence, $d_{model}$ is the total number of embedding dimensions of the model, and $i$ is a specific dimension according to $i = 1, \ldots, d_{model}/2$. When the frequencies of the sinusoidal functions are altered, the encoded values will be different for every position. The variation of the encodings is determined by the size of the embedding dimension. Even dimensions get the sine encoding, while odd dimensions get the cosine one. The positional encodings are simply added to the input embeddings, as they have the same dimensions. A visualization of Equation 2.9 can be seen in Figure 2.6.

$$PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) \qquad (2.9)$$

Figure 2.6: Visualization of a sequence containing 6 words, also with an embedding dimension of 6. Worth noting is that the denominator used for this visualization is altered in order to make the shapes better visible on small example sentences.

Figure 2.7: The resulting positional encoding values from Figure 2.6.

The other method is to let the model learn the positional encodings on its own. This is done using a regular embedding layer that the model learns during training, which means that the model learns a positional embedding layer. However, Vaswani et al. showed that there was very little difference between this method and the one using sinusoidal encodings. In the original Transformer paper, they chose the sinusoidal approach because of the hypothesis that it would generalize better to sequences longer than those seen during training. More recent work, which is currently state-of-the-art, has in most cases adopted the learned approach over the static approach [19][5].

2.5 Pre-Trained Models

Transfer learning is the concept of transferring knowledge from one model to another by reusing parts of a pre-trained model in the new model. The idea is that if there already is a model used for solving a task similar to the one you are dealing with, the new model can inherit some features from the pre-trained model. By using a pre-trained model instead of training a model from scratch, one can save both time and resources. Formally, the transfer learning task is defined as follows: given a source domain $D_S$ and a learning task $T_S$, with the corresponding notation $D_T$ and $T_T$ for the target domain, the aim is to improve the learning task of the target domain, $T_T$, with the help of the learning task of the source domain, $T_S$. Specifically, the transfer is based on improving the conditional probability distribution $P(Y_T \mid X_T)$ in $D_T$ with the information from $D_S$ and $T_S$. A requirement is that either $D_T \neq D_S$ or $T_T \neq T_S$, otherwise no information would be transferred.

A common use case of transfer learning within NLP is to fine-tune a pre-trained model for a downstream task. Transformer models are often pre-trained on large text corpora, meaning that they have a good understanding of a language. This enables the model to be fine-tuned for specific NLP tasks. For instance, one can easily create a sentiment classifier by adding one or more linear layers on top of a pre-trained model.

Recent development of language models such as BERT [11] and BART [19] has led to models that can perform multi-tasking, meaning that the same model can be used for different tasks. For example, the same model can be used for machine translation, question answering, and semantic analysis. Another example is the use of multilingual models, where one model can learn multiple languages at once. The following subsections cover some of the most influential NLP models and the most relevant ones for this project.
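As a minimal sketch of the fine-tuning idea described above (assuming the Hugging Face transformers library and PyTorch), the example below loads a pre-trained checkpoint with a freshly initialized classification head and takes a single gradient step on toy sentiment data. The checkpoint name, labels, and hyperparameters are illustrative only, not choices made in this thesis.

```python
# Minimal sketch of transfer learning: fine-tuning a pre-trained model
# with a new classification head on a downstream (sentiment) task.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2   # e.g. positive / negative sentiment
)

batch = tokenizer(["I loved this movie", "This was a waste of time"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

# One gradient step of fine-tuning: the pre-trained weights and the new
# classification head are updated together on the downstream data.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
print(loss.item())
```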
The name stands for Bidirectional Encoder Representations from Transformers and its architecture is almost identical to the Transformer. The model originally came in two sizes, where the first, BERTBASE , consists of 12 layers, 12 attention heads and a hidden size of 768, with a total of 110 million parameters. The second one, BERTLARGE , consists of 24 layers, 16 attention heads, and a hidden size of 1024, resulting in a total of 340 million parameters. BERT has learned positional embed- dings, which means that the embeddings are learned during pre-training. 19 2. Theory As stated in the name, BERT works with a bidirectional encoder, which enables the model to look at positions further in the sequence when encoding. This feature is suitable for pre-training on unsupervised tasks such as masked language modelling (MLM) and next sentence prediction (NSP). MLM is a pre-training method where some percentage of all tokens in an input sequence is replaced with a masking token. Then, the model’s task is to assign and predict the correct token for the mask. When pre-training BERT, 15% of all tokens were masked. NSP is another pre-training task where the model’s task is to predict if two sentences are related to each other or not. It is used to pre-train for tasks that involve an understanding between sentences, such as a question answering task. 2.5.2 GPT The first Generative Pre-trained Transformer (GPT) was introduced by OpenAI in 2018 [31]. The architecture of the GPT model is similar to that of the decoder block in the Transformer model. It stacks 12 decoder layers on top of each other to create a sequential decoding block. The output tokens generated by the model are predicted autoregressively, which means that it can be used to generate sequences. When decoding, the model can only look at previously generated words, similar to the Transformer model. At the time of its release, it achieved state-of-the-art in 9 of the 12 datasets it was evaluated on, including tasks such as question answering, semantic similarity as- sessment, and text classification. The original GPT model has 2 successors, namely GPT-2 [32] and GPT-3 [5], substantially increasing the number of parameters for each model. 2.5.3 BART BART is a seq2seq model based on the Transformer architecture which was intro- duced by Facebook AI in 2019 [19]. The two building blocks of the BART model can be generalized as using BERT as the encoder, because of the bidirectional encoder, and GPT as the decoder, because of the autoregressive decoder. The difference between the original Transformer and BART is that the latter uses GeLU [14] in- stead of ReLU [1] as activation functions. Regarding the rest of the architecture, it is very similar to BERT with the two differences that the decoder layers perform cross-attention over the final hidden layer of the encoder and that BERT uses an additional FFNN to predict the next word, which BART does not. Each of the encoder and decoder blocks are made of 6 layers in the base model and 12 layers in the large model. The BARTBASE model has approximately 140 million parameters, and the BARTLARGE model has approximately 400 million parameters. Because of the similarity to BERT, BART also uses learned positional embeddings. The model is pre-trained as a denoising autoencoder, which means that the model is trained on noisy and corrupted text. Training data are modified and corrupted in numerous ways. The modifications to the data are summarized in the list below. 20 2. 
• Token Masking: Random tokens in the input sequence are replaced with a mask token. The model is then trained to predict the correct token for each mask based on the rest of the sequence; essentially the same pre-training method as MLM in BERT.
• Token Deletion: Random tokens are deleted from the input, and the model must predict both the content of the deleted tokens and the positions they had, based on the rest of the sequence.
• Text Infilling: Similar to deletion, text infilling removes several tokens in a row and replaces them with a single mask token. The model therefore has to learn to predict both the content and the number of the missing tokens.
• Sentence Permutation: The sentences of the input document are shuffled into a random order. This forces the model to learn the relations between sentences, regardless of their order.
• Document Rotation: A random token is chosen and the document is rotated so that it begins with that token, which teaches the model to identify the beginning of documents.

After the pre-training phase, the model is ready to be fine-tuned for downstream tasks. Because of the autoregressive decoder, BART can be fine-tuned for sequence generation tasks such as text summarization and machine translation. The encoder takes an input sequence, and the output is generated autoregressively by the decoder.

2.5.4 Marian & MarianMT

Marian is an NMT framework written entirely in C++, highly optimized for machine translation. The framework is mostly developed by Microsoft and the University of Edinburgh. The NLP group at the University of Helsinki has pre-trained over 1,000 Marian models, which were made public on Hugging Face after being converted to Python. MarianMT is the name of the class provided by Hugging Face through which the pre-trained Marian models can be imported. The architecture of the MarianMT models is almost identical to the BART model, except for a few minor differences. The first difference is that the Marian models use sinusoidal positional embeddings instead of the learned positional embeddings used by BART. The second difference is that the Marian models do not apply layer normalization to the embeddings. The MarianMT models at Hugging Face are also slightly smaller than the BART_BASE model, with a total of about 74 million parameters.

2.6 Metrics

Evaluating a language model is a complex task that requires specific metrics. To illustrate one of the problems that arise when measuring and comparing texts, consider the following sentences:

I enjoyed the show
I liked the concert

The contextual meaning of these sentences is essentially the same, but they differ in the words used to describe the situation. This is a common case in sentence simplification and text summarization, where redundant words are removed or paraphrased. The problem in these situations is that most metrics look at matching n-grams, meaning n matching words in a row, and do not take the contextual meaning into consideration. Evaluation scores are typically calculated between a candidate and one or more references. The candidate is the generated translation or prediction, and the references are the correct translations, made, for example, by a translator.

2.6.1 Cosine similarity

Cosine similarity is a way to measure the similarity between two vectors. It is defined as the cosine of the angle between the two vectors. If the cosine similarity score is 1, the two vectors have the same orientation; likewise, if the score is 0, the two vectors are orthogonal. In NLP, cosine similarity can be used to measure the similarity of two strings. Cosine similarity is formally defined as:

\text{cosine similarity} = \cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\sqrt{\sum_{i=1}^{n} B_i^2}}   (2.10)
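As a concrete illustration of Equation 2.10, the sketch below computes the cosine similarity between two sentences represented as bag-of-words count vectors. The word-count representation is an assumption made for the example; Equation 2.10 itself is agnostic to how the vectors A and B are produced.

```python
import math
from collections import Counter

def cosine_similarity(sentence_a: str, sentence_b: str) -> float:
    """Cosine similarity between two sentences, using bag-of-words count vectors."""
    a, b = Counter(sentence_a.lower().split()), Counter(sentence_b.lower().split())
    vocabulary = set(a) | set(b)
    dot = sum(a[w] * b[w] for w in vocabulary)          # A . B
    norm_a = math.sqrt(sum(c * c for c in a.values()))  # ||A||
    norm_b = math.sqrt(sum(c * c for c in b.values()))  # ||B||
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

print(cosine_similarity("I enjoyed the show", "I liked the concert"))  # 0.5
```

For the two example sentences above, only "I" and "the" overlap, giving a score of 0.5 even though the sentences mean roughly the same thing, which illustrates the limitation discussed at the start of this section.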
2.6.2 ROUGE-N

ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation [20], and it is a set of metrics used to evaluate machine translation and text summarization. In ROUGE-N, the N stands for the n-gram size used in the evaluation: ROUGE-1 evaluates unigrams and ROUGE-2 bigrams. ROUGE-N is composed of recall, precision, and F1 scores.

Recall is calculated by first counting the number of overlapping n-grams found in both the candidate and the reference. This number is then divided by the number of n-grams found in the reference. The recall formula can be seen in Equation 2.11, where TP stands for true positives and FN for false negatives.

\text{Recall} = \frac{TP}{TP + FN}   (2.11)

Precision is, similarly to recall, calculated by first counting the number of overlapping n-grams found in both the candidate and the reference. It is, however, obtained by dividing by the number of n-grams found in the candidate. The formula can be seen in Equation 2.12, where FP stands for false positives.

\text{Precision} = \frac{TP}{TP + FP}   (2.12)

The F1 score is a combination of recall and precision. The combination results in a score that not only rewards sequences covering as many of the reference words as possible, but also requires them to do so without outputting redundant words. The general formula for the F1 score can be found in Equation 2.13.

F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}   (2.13)

2.6.3 BLEU

Bilingual Evaluation Understudy (BLEU) [16] is an algorithm designed to evaluate the quality of machine-translated text. It was developed with the idea that the closer a machine translation is to a professional human translation, the better it is. BLEU also avoids rewarding candidates that simply repeat plausible words many times; this is achieved by calculating a modified n-gram precision score for the candidate. The n-gram precision score is calculated using the following equation:

p_n = \frac{\sum_{\text{n-gram} \in C} Count_{clipped}(\text{n-gram})}{\sum_{\text{n-gram} \in C} Count(\text{n-gram})}   (2.14)

The function Count counts the n-grams of the candidate C, while Count_clipped counts the matches against the references R_1, ..., R_m with clipping: the count of each correct n-gram in the candidate is limited to the maximum number of times it occurs in any single reference sentence. The BLEU score is usually computed by combining the modified precision scores for n-grams up to length 4, that is p_1, p_2, p_3 and p_4. To obtain the final score, a brevity penalty (BP in Equation 2.15) is also introduced, which penalizes short candidate sentences. This is needed because shorter translations are more likely to get a high modified precision score, even though they may miss a lot of content. The resulting formula is the following:

BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)   (2.15)
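The sketch below implements the core of Equations 2.14 and 2.15 for a single candidate: clipped n-gram counts, uniform weights w_n = 1/4, and a brevity penalty based on the reference closest in length. It is a simplified, illustrative implementation; production BLEU implementations additionally handle corpus-level aggregation and smoothing.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision p_n (Equation 2.14) for one candidate."""
    cand_counts = Counter(ngrams(candidate, n))
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

def sentence_bleu(candidate, references, max_n=4):
    """Sentence-level BLEU (Equation 2.15) with uniform weights and brevity penalty."""
    precisions = [modified_precision(candidate, references, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    closest_ref = min(references, key=lambda ref: abs(len(ref) - len(candidate)))
    c, r = len(candidate), len(closest_ref)
    bp = 1.0 if c >= r else math.exp(1 - r / c)   # penalize candidates shorter than the reference
    return bp * geo_mean

candidate = "it was her idea".split()
references = ["it was her idea".split(), "that was her idea".split()]
print(sentence_bleu(candidate, references))  # 1.0 for an exact match
```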
2.6.4 METEOR

Metric for Evaluation of Translation with Explicit ORdering (METEOR) [3] is a metric for evaluating machine translation. It was developed to counter some of the shortcomings of BLEU. The metric works by calculating the harmonic mean of unigram precision and recall, where recall is weighted higher. The algorithm creates an explicit alignment between the candidate and the reference. The score is calculated by first counting the number of matching unigrams between the generated and reference translations, denoted m, the number of unigrams in the generated translation, w_t, and the number of unigrams in the reference translation, w_r. With this information, the unigram recall R and precision P can be calculated according to Equation 2.16.

R = \frac{m}{w_r}, \qquad P = \frac{m}{w_t}   (2.16)

With recall and precision, the weighted harmonic mean can be calculated according to Equation 2.17, where recall is weighted 9 times more heavily than precision.

F_{mean} = \frac{10PR}{R + 9P}   (2.17)

So far, the calculation only accounts for matches of single words, not for longer segments occurring in both the generated and reference translations. To account for these, n-gram matches are used to compute an alignment penalty: mappings between the candidate and the reference that are not adjacent increase the penalty. This is done by first defining the number of chunks c, where a chunk is a group of unigrams that are adjacent in both the candidate and the reference; the longer the contiguous mappings, the fewer chunks, so a perfect match results in a single chunk. The penalty is calculated according to Equation 2.18, where u_m is the number of mapped unigrams.

p = 0.5\left(\frac{c}{u_m}\right)^3   (2.18)

The final METEOR score can then be calculated as in Equation 2.19.

METEOR = F_{mean}(1 - p)   (2.19)

3 Methods

This chapter specifies and explains the methods used in this thesis. The methods have been shaped in an iterative process, in which the results along the way have affected the chosen methods. The chapter starts with an introduction of the datasets used in this thesis, followed by how the data are pre-processed. All of the methods are based on using pre-trained Transformer models; our specific implementations and modifications of the models are described in Sections 3.3.2 and 3.3.3. The methods are then evaluated with both quantitative and qualitative analysis.

3.1 Plint Data

Plint has been in the localisation business for many years, and during this time thousands of hours of film have been translated and subtitled into many different languages. Some of the projects that the company has worked with contain JSON files where the original transcript and the subtitles can be found. These files also contain metadata with information about the subtitle, for example time stamps that determine when and for how long a subtitle is visible to the viewer.

The transcript in the project file contains the spoken content of the video being subtitled. Transcription was previously done manually, by listening to the content and writing down the words, but lately most companies have used ASR for this task. The transcript is provided with time stamps for each word, and it is therefore necessary to segment the ASR output into fitting sentences. To solve this task, the supervisors at Plint made a script that formats the ASR output to match the time stamps of the subtitles. Plint provided us with three JSON files per project: one containing the raw ASR output, one containing the ASR-transcribed segmentations, and one containing the English subtitles. In total, we received 4443 files from 1481 subtitling projects.
In order to make use of these data, and to construct a dataset, we had to perform a few pre-processing steps. To begin with, we merged the ASR segmentations and the English subtitles into a single file. We chose not to use the raw ASR output, since we already had the segmented version of the ASR transcripts. When constructing the dataset, we labeled the ASR transcriptions as "input" and the subtitles as "truth". We ignored the metadata, as it did not serve any purpose for our task.

The method of segmenting the ASR output based on time stamps produced overall well-aligned matchings between the ASR output and the subtitles. However, when evaluating the data, we found outliers in which single words at the beginning or end of a sentence had been placed incorrectly. We therefore devised an algorithm that solves this by comparing the endings and beginnings of two succeeding data pairs, where a data pair is an input-truth alignment as described above. If the endings of the first data pair do not match, the endings are compared to the beginnings of the second data pair. If the algorithm then finds a match between the endings of the first pair and the beginnings of the second pair, one of the words has been placed incorrectly, and the data pairs are adjusted. If there is a further mismatch, that is, if none of the words at the beginnings and endings match at all, the data pair is removed.

The ASR output and the English subtitles are also compared by content, in order to filter out pairs that are not related even though they might share the first and last words. This is done by calculating the cosine similarity between the two; if the similarity is less than 0.5, the data pair is discarded. Setting the threshold too low would let faulty data pairs through, while a too strict threshold would discard examples with linguistic compression, which would defeat the purpose of the project. As stated in Section 1.1, there are guidelines on how to announce which person is talking. Since the transcript contains only the spoken language, the speaker indicators in the English subtitles are also filtered out. A constructed example from the filtered dataset can be seen in Table 3.1.

Input text:  Our fine saint Donald Duck is much superior
Truth text:  Our saint Donald Duck is superior

Table 3.1: A constructed example of an entry in the Plint dataset.

3.2 Public Data

In addition to the data provided by Plint, we used two open-source datasets. This section covers the details of each of these datasets, and also how we processed and modified the data to create more suitable datasets for our purposes.

OpenSubtitles [21] is a dataset consisting of subtitles retrieved from the web page opensubtitles.org. It is composed of subtitles for 62 languages and 1782 bitexts. The set of subtitles from English to Swedish contains about 17 million subtitle pairs.

WikiLarge [41] is a dataset that is typically used for sentence simplification. It is constructed by aligning a complex version of a sentence with its corresponding simpler version. The complex sentences come from the regular English Wikipedia (en.wikipedia.org), and the simple sentences come from the Simple English Wikipedia (simple.wikipedia.org). The dataset contains 296,402 sentences used for training, all based on the complex-simple alignment.
WikiLarge also has a test and a validation set based on the Turkcorpus [39]; these two sets contain 2000 and 359 sentences, respectively.

3.2.1 Backtranslation with OpenSubtitles

By using backtranslation on the English-to-Swedish portion of the OpenSubtitles dataset, it was possible to find a large proportion of linguistic compression and paraphrasing between the subtitles. Backtranslation was performed on the Swedish subtitles, which means that the Swedish subtitles were translated back to English. This method was inspired by a paper from Netflix [25], and the translations were made with a MarianMT model from Hugging Face. The backtranslation was a computationally heavy task, and we therefore ended up translating only a subset of the subtitles available to us; in total, we translated 260,000 Swedish subtitles back to English. The checkpoint we used was "Helsinki-NLP/opus-mt-sv-en", which is the pre-trained MarianMT checkpoint for machine translation from Swedish to English.

To verify that the translations were correct and useful, we started to analyse them manually. However, this was both ineffective and time-consuming, so we needed a way to analyse the translations automatically. The manual analysis did, however, produce an important insight: when the OpenSubtitles data had been aligned poorly, the translation was also bad. A bad alignment means that parts of the subtitles do not match, for example because part of the translation has been offset to the next subtitle.

To solve this issue, we introduced a filter based on cosine similarity. In theory, this gives a low score if the translations are aligned poorly, because very few of the words will match, and a high score if the translations are aligned correctly, because more words will match between the sentences. However, this method has the downside of punishing short sentences for paraphrasing words, resulting in a very low cosine similarity score. Therefore, we had to find a sweet spot where the bad translations were removed but the good ones were kept. We went with a cosine similarity threshold of 0.5, which gave the results we were after. Furthermore, we also filtered subtitles based on length criteria, with limits on both character lengths and length ratios: we only kept subtitles with character lengths between 5 and 100 and a length ratio below 2. The resulting dataset contains 147,548 backtranslated subtitles, and a visualization of the lengths and ratios of the data can be seen in Figure 3.1. Further on, we refer to this dataset as "OpenBack".

Figure 3.1: Visualization of the distributions for subword lengths and subword ratios of the OpenBack dataset.
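To make the backtranslation step concrete, the sketch below translates a batch of Swedish subtitles back to English with the "Helsinki-NLP/opus-mt-sv-en" checkpoint mentioned above. The Swedish example sentence is purely illustrative, and the decoding settings are simply the checkpoint's defaults rather than the exact configuration used for the 260,000 subtitles.

```python
from transformers import MarianMTModel, MarianTokenizer

checkpoint = "Helsinki-NLP/opus-mt-sv-en"  # Swedish -> English, as used for the backtranslation
tokenizer = MarianTokenizer.from_pretrained(checkpoint)
model = MarianMTModel.from_pretrained(checkpoint)

swedish_subtitles = ["Det var hennes idé, inte min."]       # illustrative example sentence
batch = tokenizer(swedish_subtitles, return_tensors="pt", padding=True)
generated = model.generate(**batch)                          # the checkpoint's default decoding settings
backtranslations = tokenizer.batch_decode(generated, skip_special_tokens=True)
print(backtranslations)  # e.g. ["It was her idea, not mine."]
```

Each backtranslated English sentence can then be paired with the original English subtitle and passed through the cosine similarity and length filters described above.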
3.2.2 OpenSubtitles - English to Swedish

To train a translation model, we needed subtitle data in two languages. Because we had already used the OpenSubtitles data for backtranslation, it was suitable to use this dataset for the translation task as well. To avoid the problems of poorly aligned data, we used the same data as in the backtranslation. Using the same data would also assure us that the translations were correct, as the backtranslated data had passed our cosine similarity filter. However, the data still needed to be processed through the length filter to make sure that the length and ratio criteria were fulfilled. The length filter was the same as in the previous task, meaning a threshold of 2 on the length ratio and character lengths between 5 and 100. The resulting dataset with English-Swedish subtitles consists of 144,968 subtitles.

Figure 3.2: Visualization of the distributions for subword lengths and subword ratios of the OpenSubtitles dataset.

3.2.3 WikiOpen

The use of the WikiLarge dataset was inspired by the sentence simplification methods of Martin et al. [24][23]. The dataset contains sentences of various lengths, where the shortest are only a few characters long and the longest are up to 500 characters long. To fit our purpose, we had to filter the data so that it consisted only of sentences that could work as subtitles. Our first step was to manually evaluate the dataset, and just as in the OpenSubtitles data, there were instances of poorly aligned sentence pairs. We therefore applied the same cosine similarity filter with a threshold of 0.5, as well as the same length and ratio restrictions, which means that we only kept data with character lengths between 5 and 100 and a ratio between 0 and 2. To create a larger dataset, we combined these data with the backtranslated OpenSubtitles data. We named this dataset "WikiOpen"; it contains a total of 202,718 instances, of which 147,723 are subtitles from OpenBack and 54,995 are sentences from WikiLarge. The subword length and subword ratio distributions of this dataset can be seen in Figure 3.3.

Figure 3.3: Visualization of the distributions for subword lengths and subword ratios of the WikiOpen dataset.

3.3 Model

All the methods in this thesis are based on using a Transformer model to generate the subtitles. We found the BART model suitable for the English-to-English task and the Marian model for the Swedish-to-English task. For the most part, the project uses pre-trained models as a starting point, which means that our methods are based on models that are fine-tuned from the transferred weights of a pre-trained model. When inheriting the weights of a pre-trained model, the tokenizer and vocabulary follow as well; however, in most cases, changes or additions to the tokenizer were necessary.

3.3.1 Hugging Face

Hugging Face is a company that provides open-source NLP technologies. Research leaders such as Facebook AI, Microsoft, and Google AI all support the platform and upload their models to Hugging Face for public use. This is beneficial for both research and educational purposes, as smaller research institutes and universities can access state-of-the-art models for free. The community around Hugging Face has also contributed thousands of fine-tuned models.

When downloading a model from Hugging Face, it is possible to download only the architecture of a model and then pre-train it from scratch. It is also possible to download a pre-trained model by specifying a checkpoint. The checkpoints are essentially the current state of a model's weights and tokenizer. This gives the user access to the freely available state-of-the-art models that have been released on Hugging Face. Likewise, the user can also access the models that have been fine-tuned for downstream tasks by the community.
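The difference between downloading only an architecture and downloading a full checkpoint can be illustrated with the generic Auto classes. The sketch below is an illustration only; "facebook/bart-base" is used as the example checkpoint because it appears later in this thesis, but any model name on Hugging Face would work the same way.

```python
from transformers import AutoConfig, AutoModel

checkpoint = "facebook/bart-base"

# Option 1: architecture only. The weights are randomly initialised and would
# have to be pre-trained from scratch.
config = AutoConfig.from_pretrained(checkpoint)
untrained_model = AutoModel.from_config(config)

# Option 2: architecture plus the pre-trained weights stored in the checkpoint.
pretrained_model = AutoModel.from_pretrained(checkpoint)
```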
All of the pre-trained models that we use in this thesis are retrieved from Hugging Face.

3.3.2 Transformer with Length Encoding

To create a length-aware positional encoding, we use the method proposed by Takase and Okazaki [37]. They train a Transformer model to constrain the output length by using the length encodings in Equation 3.1, which is a slight modification of Equation 2.9. These new encodings look at the current position with respect to the remainder of the sequence, instead of with respect to the beginning of the sequence as in the original implementation.

PE_{(len,pos,2i)} = \sin\left(\frac{len - pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(len,pos,2i+1)} = \cos\left(\frac{len - pos}{10000^{2i/d_{model}}}\right)   (3.1)

Here, len represents the specified length of the output sequence, that is, the length of the final subtitle, and pos represents the current position in the sequence. The other parameters remain the same as in the original implementation (see Section 2.4.2). During training, len is the length of the target sequence, and during inference it is the desired length of the output. BART uses a BPE tokenizer, and we therefore experimented with two different ways of counting the len and pos parameters.

Subword count was the first method we used to represent the different lengths. It is based on the tokenizer's subword segmentation, meaning that the length parameters are based on the number of subwords produced by the tokenizer. All tokens are thus treated equally in length, each counting as 1.

Character count was the second method that we planned to use to represent the different lengths. In contrast to the subword count, this method takes the length of each individual subword into consideration. The position pos is calculated by summing the character lengths of the preceding subword tokens, and the length len is simply the character length of the target sequence.

3.3.3 Transformer with Length Token

In addition to the modified positional encoding, we introduced new tokens that represent different lengths to the model. A length token is prepended to each source sentence and is meant to represent the relation between the source sentence and the target sentence. The tokens were created by adding a token representing some value to the tokenizer; each token consists of the corresponding value wrapped in angle brackets, for example "<5>". We experimented with a few different methods for assigning the value when creating the length tokens, but the two main concepts were to create the tokens based on length or based on ratio.

The first method was to prepend the length of the target sentence to the source sentence. The length was retrieved by tokenizing the target sentence with the tokenizer, so the length token would be "<5>" for a target of five subwords.

The second method was to make the length tokens aware of the length ratio between the source sentence and the target sentence. To calculate the ratio, we divided the subword length of the target sentence by the subword length of the source sentence (see Equation 3.2). For this method, we used two versions of tokens: a category-based version and a value-based version.

ratio = \frac{\text{subword\_length\_target}}{\text{subword\_length\_source}}   (3.2)
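As an illustration of Equation 3.2, the sketch below computes the subword-length ratio for one training pair with the BART tokenizer. The sentence pair is the constructed example from Table 3.1; how the resulting ratio is then mapped to a token is described in the following paragraphs.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")

source = "Our fine saint Donald Duck is much superior"   # constructed example pair (cf. Table 3.1)
target = "Our saint Donald Duck is superior"

# Subword lengths, counted without the special BOS/EOS tokens.
source_len = len(tokenizer.tokenize(source))
target_len = len(tokenizer.tokenize(target))

ratio = target_len / source_len   # Equation 3.2
print(source_len, target_len, round(ratio, 2))
```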
The first version of the ratio method used only three tokens, indicating short, normal, and long target lengths. The ratio was calculated according to Equation 3.2, and the tokens were assigned using the following rules:

ratio < 0.95 ⟹ short token
0.95 < ratio < 1.05 ⟹ normal token
ratio > 1.05 ⟹ long token

The second version used value-based ratio tokens. Here we created the ratio tokens based on the actual value of the ratio between the target and source sentences rather than putting them into categories. We created 20 tokens between 0 and 2, with an interval of 0.10, and each source sentence was assigned the value-based ratio token closest to its actual ratio.

All of the tokens that we created were added to the tokenizer, where their weights were initialized randomly. This enables the model to distinguish each token individually during training. However, it also means that tokens that are close to each other in value, such as "<11>" and "<12>", might not be related to each other at all in the embedding dimensions.

3.4 Training

In this project, we made use of cloud-based computing to train our models. The company provided us with an instance on the EC2 platform at Amazon Web Services (AWS), where we had access to an NVIDIA T4 Tensor Core GPU. For legal reasons, the Plint data were not allowed to leave the AWS servers, so training on those data was done there. When training models on the public datasets, however, we were able to use free computing resources, and these models were therefore trained on GPUs at Google Colab and Kaggle.

3.5 Evaluation

To evaluate our models, we used multiple metrics combined with manual analysis. The metrics we used were BLEU, ROUGE-N, and METEOR, computed with an implementation based on the Jury library [7]. The manual analysis focused on the linguistic performance of the models, which is difficult to capture with metrics. It was performed by picking random samples from the corresponding test sets and evaluating the impact of each token.

4 Results

This chapter presents the results and evaluations obtained in this project. The results are based on experiments with different combinations of the methods described in Chapter 3. The performed experiments are listed below:

• BART with Length Encodings and Length Tokens
  – Fine-tuned with Plint data
• BART with Category-based Ratio Tokens
  – Fine-tuned with WikiOpen data
  – Fine-tuned with OpenBack data
• BART with Value-based Ratio Tokens
  – Fine-tuned with WikiOpen data
  – Fine-tuned with OpenBack data
• Marian with Category-based Ratio Tokens
  – Fine-tuned with OpenSubtitles data
• Marian with Value-based Ratio Tokens
  – Fine-tuned with OpenSubtitles data

The results are presented with tables and figures showing the metrics and length ratios for each method. Furthermore, this chapter contains examples generated by each of the evaluated models. A further discussion of the results can be found in Chapter 5.

4.1 BART with Length Encodings and Tokens

Our first experiments consisted of fine-tuning a BART model with the length encodings introduced in Section 3.3.2, together with a length token based on the subword length of the target sentence. This method utilizes the pre-trained checkpoint "facebook/bart-base" when initializing the BART model.
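Before fine-tuning, the added length tokens have to be registered with the tokenizer and the embedding matrix resized. The sketch below shows this step for subword-length tokens of the form "<1>" up to "<50>" (the maximum used in this section); it illustrates the general Hugging Face mechanism rather than the exact training code of the thesis, and the new embedding rows are randomly initialised, matching the description in Section 3.3.3.

```python
from transformers import BartForConditionalGeneration, BartTokenizer

checkpoint = "facebook/bart-base"
tokenizer = BartTokenizer.from_pretrained(checkpoint)
model = BartForConditionalGeneration.from_pretrained(checkpoint)

# Length tokens "<1>" ... "<50>", prepended to each source sentence during training.
length_tokens = [f"<{n}>" for n in range(1, 51)]
tokenizer.add_tokens(length_tokens)

# Grow the embedding matrix; the rows for the new tokens are randomly initialised.
model.resize_token_embeddings(len(tokenizer))

example = "<7> That was her idea, not mine."   # target length of 7 subwords (cf. Table 4.2)
print(tokenizer.tokenize(example)[:3])
```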
For fine-tuning the model, we used the Plint dataset, where 90% of the data was used for training and the remaining data was split evenly into a validation set and a test set of 5% each. Furthermore, we used a batch size of 64, the learning rate was set to 3 · 10^-5, and cross-entropy loss was used as the loss function.

4.1.1 Baseline

The baseline model for this method was created from the pre-trained BART checkpoint without any modifications, which means without any length encodings or length tokens. The model was fine-tuned for 5 epochs, with evaluation on the validation set at epochs 0, 2, and 4. The average (over batch) training and validation loss can be seen in Figure 4.1. A few generated examples from the baseline model can be seen in Table 4.1. Gen denotes an unconstrained beam search with a beam size of 5, while Genpen uses the same number of beams but with the bi-gram penalty mentioned in Section 2.3.3 (the probability of a word that would create a bi-gram already present in the sequence is set to 0). The generated sequences are similar to the input, with the exception that the model inserts a line-break token when the sequences tend to get longer. The model also misses the last couple of tokens in the last example.

Figure 4.1: Average training loss plot of the initial baseline model.

4.1.2 Length Encodings and Length Tokens

To evaluate the method with length encodings and length tokens, we modified the BART model to use these as additional inputs. Looking at the target length distribution of the Plint dataset in Figure 4.2, we concluded that a maximum length of 50 would be more than sufficient for our task, which means that tokens up to "<50>" could be used in our model. This model was fine-tuned for 11 epochs, with evaluation on the validation set every 5 epochs. The average (over batch) training and validation loss can be seen in Figure 4.3.

Baseline - Epoch 4
Input:   That was her idea, not mine.
Gen:     That was her idea, not mine.
Genpen:  That was her idea, not mine.
Target:  It was her idea.

Input:   You were planning to marry Mrs. Van Dorn, weren't you?
Gen:     You were planning to marry
         Mrs. Van Dorn, weren't you
Genpen:  You were planning to marry
         Mrs. Van Dorn, weren't you
Target:  You were going to get married, weren't you?

Input:   I've just had time to think things out put myself in your position.
Gen:     I've just had time to think things out,
         put myself in your
Genpen:  I've just had time to think things out,
         put myself in your
Target:  I've put myself in your position

Table 4.1: Generated sequences by the BART baseline model fine-tuned on the Plint dataset for 5 epochs. Gen corresponds to a beam search with a beam size of 5 and Genpen to a bi-gram penalised beam search with a beam size of 5.

Figure 4.2: Distribution of target subword lengths in the dataset from Plint.

Although the loss graph looks reasonable, the approach did not work, in contrast to the similar implementation in [37], and resulted in the model generating gibberish. Some examples can be seen in Table 4.2, where the model seems to begin to stutter and repeat itself. Generation was carried out in the same way as for the baseline model, that is, a beam search with 5 beams, together with a variant using a bi-gram penalty.

We also experimented with freezing parts of the model. The experiments consisted of partially freezing the encoder, meaning that the weights within the encoder were prevented from being updated. The first approach was to freeze the entire model except for the word embeddings, and the second was to also freeze the word embeddings except for our new length tokens. However, these methods did not lead to any success and are therefore not presented.

Figure 4.3: Average training loss plot of the initial model.

To gain a better understanding of the failed implementation, a visualization of the added positional encodings compared to the pre-trained BART positional embeddings can be seen in Figure 4.4. As stated in Equation 3.1, the sinusoidal approach ranges from -1 to 1, while the learned embeddings are mostly centered around 0 (with the exception of a single value of approximately -3.9, which is not visible in the figure but explains the colour bar). From this it is possible to see that the sinusoidal encodings have a much higher variance than the learned ones, which could cause problems when they are added together in the decoder.

Figure 4.4: Visualization of the implemented sinusoidal positional encodings and the original learned positional embeddings of the BART checkpoint. The sinusoidal positional encodings range from -1 to 1, while the original learned embeddings are mostly centered around 0.

Length Encodings and Length Tokens - Epoch 5
Input:   <7> That was her idea, not mine.
Gen:     That was her idea, not mine not mine. That was my idea.
         That
Genpen:  That was her idea, not mine I'm sorry. That was my idea.
Target:  It was her idea.

Input:   <13> You were planning to marry Mrs. Van Dorn, weren't you?
Gen:     You were planning to marry Mrs.
         Van Dorn
         Van Dorn Van D
Genpen:  You were planning to marry Mrs.
         Van Van Dorn
         Van. Van You
Target:  You were going to get married, weren't you?

Input:   <10> I've just had time to think things out put myself in your position.
Gen:     I've just had time to think things out. Put myself in
         to think things
Genpen:  I've just had time to think things out. Put myself in out
         to think
Target:  I've put myself in your position

Length Encodings and Length Tokens - Epoch 10
Input:   <7> That was her idea, not mine.
Gen:     That was her thatThatThatThat ThatThatThatIt wasThatThat thatThat was
Genpen:  That was her That thatThatThat ThatThatIt was That That wasThat that That
Target:  It was her idea.

Input:   <13> You were planning to marry Mrs. Van Dorn, weren't you?
Gen:     You were planningYou wereYou planningYou planning planningYou wanted planningYou would marryYou
Genpen:  You were planningYou planning planningToYou,You would marryYou wanted planning toYou
Target:  You were going to get married, weren't you?

Input:   <10> I've just had time to think things out put myself in your position.
Gen:     I just had time time timeHadIIIHad time timeI timeII
Genpen:  I just had time timeIIHad timeTime toITo think timetoI
Target:  I've put myself in your position

Table 4.2: Generated sequences by the BART model with additional length encodings and length tokens, fine-tuned on the Plint dataset; outputs are shown after 5 and 10 epochs. Gen corresponds to a beam search with a beam size of 5 and Genpen to a bi-gram penalised beam search with a beam size of 5.

4.2 BART with Ratio Tokens

Based on the methods in Section 3.3.3, the ratio tokens were created to represent different length ratios between the source and target sentences. To evaluate this method, we fine-tuned the BART model on the WikiOpen and OpenBack datasets. The datasets were divided into train, validation, and test sets with a ratio of 80/15/5. Fine-tuning was performed with a batch size of 64 and a learning rate of 3.10 · 10^-4. The checkpoint used to initialize our model prior to fine-tuning was "facebook/bart-base". All models had converged after 5 epochs, meaning that there was no further improvement in validation loss, and 5 epochs was therefore used for all models.

At inference time, we created a test set by backtranslating new data, as in Section 3.2.1. This set contains 1000 sentences, and the distributions of the category-based and value-based ratio tokens can be seen in Figures 4.5 and 4.6.

Figure 4.5: Distribution of the test set based on ratio categories.

Figure 4.6: Distribution of the test set based on ratio values.

By prepending each of the sentences in the test set with the different ratio tokens, we could generate evaluation data to analyse the impact of each token. To generate the evaluation data, we used a beam search with a beam size of 3. The evaluation was constructed in two ways. The first was to prepend every sentence in the test set with each token. The second was to divide the test set into groups per length token, so that each token was evaluated on the group it belongs to based on the ratio between the source and the corresponding target sentence; this meant creating new subsets within the test set to evaluate the performance of each token.

The evaluation measures the impact of each token by first analysing the length ratio of the generated sentence compared to the source sentence, denoted LRsrc. This was done for all sentences in the test set and therefore corresponds to the first evaluation strategy. Furthermore, the evaluation and analysis are based on the metrics in Section 3.5, carried out for each token group, which corresponds to the second evaluation strategy. We use ROUGE-2 recall to evaluate the ROUGE-N metric. Lastly, a human evaluation of random samples of the generated sentences was performed.
4.2.1 Baseline

The baseline models were created by fine-tuning BART on each of the two datasets without any length tokens, meaning that they were fine-tuned only on the source-target aligned sentences. These baselines make it possible to measure the impact of the tokens. The resulting length ratios and metrics for the baseline models can be seen in Tables 4.3 and 4.4.

BART - BASELINE
            WikiOpen   OpenBack
            LRsrc      LRsrc
Baseline    0.88       0.88

Table 4.3: The mean length ratios against the source (LRsrc) for each dataset with the baseline models. The results were evaluated from a BART model fine-tuned on the WikiOpen and OpenBack datasets without any ratio tokens.

BART - BASELINE
            WikiOpen                      OpenBack
            BLEU    ROUGE-2   METEOR      BLEU    ROUGE-2   METEOR
Baseline    50.15   62.74     76.28       50.39   62.26     75.65

Table 4.4: The metrics evaluated for each dataset with the baseline models. The results were evaluated from a BART model fine-tuned on the WikiOpen and OpenBack datasets without any ratio tokens.

4.2.2 Category-based Ratio Tokens

The first version of the method with ratio tokens uses three categories to represent short, normal, and long sentences. The distribution of the three tokens in each dataset can be found in Figure 4.7. The resulting length ratios and metrics can be found in Tables 4.5 and 4.6.

Figure 4.7: The distribution of the category-based ratio tokens in WikiOpen and OpenBack.

BART - CATEGORIES
                WikiOpen   OpenBack
Token           LRsrc      LRsrc
Short token     0.76       0.78
Normal token    0.98       0.98
Long token      1.13       1.15

Table 4.5: The resulting mean length ratios against the source (LRsrc) with category-based ratio tokens on the WikiOpen and OpenBack datasets. The leftmost column contains the evaluated token.

BART - CATEGORIES
                WikiOpen                      OpenBack
Token           BLEU    ROUGE-2   METEOR      BLEU    ROUGE-2   METEOR
Short token     46.42   57.22     71.25       46.14   56.83     70.81
Normal token    74.07   80.84     87.76       74.80   81.75     88.25
Long token      45.22   54.66     68.89       46.95   56.21     69.35

Table 4.6: The evaluated metrics for each dataset with the category-based tokens. The results are evaluated from a fine-tuned BART model on the WikiOpen and OpenBack datasets with tokens representing short, normal, and long sentence ratios.

4.2.3 Value-based Ratio Tokens

The second version of the method using ratio tokens uses 20 value-based tokens ranging from 0 to 2, with an interval of 0.1. The distribution of the tokens in the data can be seen in Figure 4.8. The results of this method can be seen in Tables 4.7 and B.1. Due to the imbalance of the value-based ratio tokens, we only present the metrics for tokens in the range 0.5 to 1.2; for a full presentation of the results for each token, we refer to Appendix B.

Figure 4.8: The distribution of each value-based ratio token for the WikiOpen and OpenBack datasets. The x-axis can be interpreted as the correspondi