Using pre-trained language models for extractive text summarisation of academic papers
Master's thesis in Computer Science and Engineering

ERIK HERMANSSON
CHARLOTTE BODDIEN

Department of Mechanics and Maritime Sciences
Division of Vehicle Safety
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2020

© ERIK HERMANSSON, CHARLOTTE BODDIEN, 2020.
Supervisor: Selpi, Department of Mechanics and Maritime Sciences
Examiner: Selpi, Department of Mechanics and Maritime Sciences
Master's Thesis 2020:01
Department of Mechanics and Maritime Sciences
Division of Vehicle Safety
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000
Printed by Department of Mechanics and Maritime Sciences
Gothenburg, Sweden 2020

Abstract
Given the overwhelming amount of textual information on the internet and elsewhere, the automatic creation of high-quality summaries is becoming increasingly important. With the development of neural networks and pre-trained models within natural language processing in recent years, such as BERT and its derivatives, the field of automatic text summarisation has seen a lot of progress. These models have been pre-trained on large amounts of data for general knowledge and are then fine-tuned for specific tasks. Datasets are a limiting factor for training summarisation models, as they require a large number of manually created summaries. Most current summarisation models have been trained and evaluated on textual data from the news domain. However, pre-trained models fine-tuned on data from the news domain could potentially generalise and perform well on other data as well. The main objective of this thesis is to investigate the suitability of several pre-trained language models for automatic text summarisation. The chosen models were fine-tuned on readily available news data, and evaluated on a very different dataset of academic texts to determine their ability to generalise. There were only slight differences between the models on the news data. More interestingly, the results on the academic texts showed significant differences between the models. The results indicate that the more robustly pre-trained models are able to generalise better and, according to the metrics, perform quite well. However, human evaluation puts this into question, showing that even the high-scoring summaries did not necessarily read well. This highlights the need for better evaluation methods and metrics.

Keywords: natural language processing, nlp, machine learning, deep learning, automatic text summarisation, extractive summarisation, transformer, bert, roberta, xlnet

Acknowledgements
We would like to thank our supervisor Selpi for her ongoing support and guidance during the project.
For providing computational resources enabling the work in this thesis and support we would like to thank Chalmers Centre for Computational Science and Engineer- ing (C3SE) provided by the Swedish National Infrastructure for Computing (SNIC). We thank SAFER for providing access to their facilities. Erik Hermansson, Charlotte Boddien, Gothenburg, February 2020 vii Contents List of Figures xiii List of Tables xv 1 Introduction 1 1.1 Objective and Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2 Theory 3 2.1 Automatic Text Summarisation . . . . . . . . . . . . . . . . . . . . . 3 2.1.1 Extractive Summarisation Methods . . . . . . . . . . . . . . . 3 2.1.1.1 Score and Select . . . . . . . . . . . . . . . . . . . . 3 2.1.1.2 Sequence Labeling . . . . . . . . . . . . . . . . . . . 4 2.2 Summarisation Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 4 2.2.1 ROUGE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.2.2 Other evaluation metrics . . . . . . . . . . . . . . . . . . . . . 6 2.2.2.1 BLEU . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2.2.2 Precision, Recall and F-Score . . . . . . . . . . . . . 6 2.2.2.3 Cosine Similarity . . . . . . . . . . . . . . . . . . . . 6 2.3 Text Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3.1 Word Embeddings . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3.2 Sentence Embedding . . . . . . . . . . . . . . . . . . . . . . . 8 2.3.3 Document Embedding . . . . . . . . . . . . . . . . . . . . . . 8 2.4 Language Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.5 Machine Learning Models for NLP . . . . . . . . . . . . . . . . . . . 8 2.5.1 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . 9 2.5.2 Sequential Models . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.5.2.1 Recurrent Neural Network (RNN) . . . . . . . . . . . 11 2.5.2.2 Stacked RNN . . . . . . . . . . . . . . . . . . . . . . 11 2.5.2.3 Bidirectional RNN . . . . . . . . . . . . . . . . . . . 12 2.5.2.4 Simple RNN . . . . . . . . . . . . . . . . . . . . . . 12 2.5.2.5 Long Short Term Memory (LSTM) . . . . . . . . . . 13 2.5.2.6 RNN Modes . . . . . . . . . . . . . . . . . . . . . . . 13 2.5.2.7 Attention . . . . . . . . . . . . . . . . . . . . . . . . 14 2.5.2.8 Transformer . . . . . . . . . . . . . . . . . . . . . . . 15 2.5.2.9 Transformer-XL . . . . . . . . . . . . . . . . . . . . 17 ix Contents 2.5.3 Pre-Trained Language Models . . . . . . . . . . . . . . . . . . 18 2.5.3.1 BERT . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.5.3.2 RoBERTa . . . . . . . . . . . . . . . . . . . . . . . . 20 2.5.3.3 DistilBert . . . . . . . . . . . . . . . . . . . . . . . . 21 2.5.3.4 XLNet . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.5.4 Task Specific Models . . . . . . . . . . . . . . . . . . . . . . . 22 2.5.4.1 BERTSum . . . . . . . . . . . . . . . . . . . . . . . 23 2.5.4.2 Sentence-BERT (SBERT) . . . . . . . . . . . . . . . 24 3 Methods 27 3.1 Changes in the direction of the project . . . . . . . . . . . . . . . . . 27 3.2 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.2.1 CNN/DM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.2.1.1 Label Generation . . . . . . . . . . . . . . . . . . . . 28 3.2.2 Academic Paper Dataset . . . . . . . . . . . . . . . . . . . . . 29 3.2.2.1 Text extraction . . . . . . . . . . . . . . . . . . . . . 
29 3.2.2.2 Pre-processing . . . . . . . . . . . . . . . . . . . . . 29 3.2.2.3 Obtaining reference summaries . . . . . . . . . . . . 30 3.3 Summary Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.3.1 Adressing BERTSum’s Token Limit . . . . . . . . . . . . . . . 31 3.3.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.4 Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.5.1 ROUGE Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 34 3.5.2 Sentence Similarity Evaluation . . . . . . . . . . . . . . . . . . 34 3.5.3 Evaluation on the CNN/DM dataset . . . . . . . . . . . . . . 34 3.5.4 Evaluation on the Academic Paper dataset . . . . . . . . . . . 35 3.5.5 Human Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 35 3.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4 Results and Discussion 39 4.1 Training and Validation . . . . . . . . . . . . . . . . . . . . . . . . . 39 4.2 CNN/DM Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4.2.1 Truncated Results . . . . . . . . . . . . . . . . . . . . . . . . 42 4.2.2 Full Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.3 Academic Paper Dataset . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.3.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.3.2 Single Sample Results . . . . . . . . . . . . . . . . . . . . . . 50 4.3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.4 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.4.1 ROUGE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.4.2 Sentence Similarity . . . . . . . . . . . . . . . . . . . . . . . . 53 4.4.3 Human Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 54 4.5 CNN/DM Dataset Positional Bias . . . . . . . . . . . . . . . . . . . . 54 4.6 Model Positional Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.6.1 CNN/DM Dataset . . . . . . . . . . . . . . . . . . . . . . . . 55 x Contents 4.6.2 Academic Paper Dataset . . . . . . . . . . . . . . . . . . . . . 56 4.7 Model Confidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 5 Conclusions and Future Work 61 5.1 Ethical Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . 61 5.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 5.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 5.4 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 Bibliography 67 A Appendix 1 I A.1 Full Text Scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I A.2 Academic Texts: Single Sample Scores . . . . . . . . . . . . . . . . . I A.3 Full Confidence Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . III A.4 Text Excerpts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . IV A.4.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . IV A.4.2 Manual Extractive Summary . . . . . . . . . . . . . . . . . . . IV A.4.3 Every-7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . V A.4.4 DistilBERT . . . . . . . . . . . . . . . . . . . . . . . . . . . . VII A.4.5 XLNet Mem. . . . . . . . . . . . . . . . . . . . . . . . . . . . 
VIII xi Contents xii List of Figures 2.1 A simplified illustration of an ANN. . . . . . . . . . . . . . . . . . . . 9 2.2 Single RNN cell. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3 RNN unrolled over 4 steps. . . . . . . . . . . . . . . . . . . . . . . . . 11 2.4 Stacked RNN. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.5 Bidirectional RNN. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.6 Encoder/Decoder model. . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.7 Translation of a French sentence into an English one. Figure taken from [29] with permission from the authors. . . . . . . . . . . . . . . 14 2.8 Transformer architecture. Taken from [30] with the authors’ permission. 16 2.9 BERT input representation. Reproduced from [19] with the authors’ permission. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.10 BERTSum architecture. Reproduced from [2] with the authors’ per- mission. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.11 The SBERT architecture for a classification objective function. . . . . 25 2.12 The SBERT architecture for a regressive objective function. This can be used to compute similarity scores. . . . . . . . . . . . . . . . . . . 26 3.1 An overview of the experiments we performed. . . . . . . . . . . . . . 37 4.1 The plot of training loss with BERT,DistilBERT, RoBERTa and XLNet 40 4.2 Plots of validation loss and score of BERT, DistilBERT, RoBERTa and XLNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.3 The combined scores on the truncated CNN/DM dataset. . . . . . . . 43 4.4 The combined scores on the full CNN/DM dataset. (S: Score, SBS: Sentence BERT Score) . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.5 The scores on the Academic Paper dataset. . . . . . . . . . . . . . . . 49 4.6 Selection scores of the sentences of the CNN/DM dataset with respect to sentence position. . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.7 Selection scores of the sentences with and without randomised positions. 56 4.8 Averaged scores of XLNet and XLNet Mem. on the Academic Paper dataset with regards to sentence position. (Red Line signifies block split) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.9 Plots of RoBERTa Confidence scores . . . . . . . . . . . . . . . . . . 58 4.10 Plots of XLNet and XLNet Mem. Confidence scores . . . . . . . . . . 59 4.11 Plots of RoBERTa S Confidence Metrics . . . . . . . . . . . . . . . . 60 A.1 The confidence score of all the models on the CNN/DM dataset . . . III xiii List of Figures A.2 The confidence score of all the models on the Academic Texts dataset III xiv List of Tables 3.1 Statistics for CNN/DM dataset . . . . . . . . . . . . . . . . . . . . . 28 4.1 Size in MB and required training time for all the models used in our experiments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 4.2 Scores on the truncated CNN/DM dataset. (S: Score, SBS: Sentence BERT Score) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.3 Scores on the full CNN/DM dataset. (S: Score, SBS: Sentence BERT Score) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.4 The averaged scores the different models achieved on the Academic Paper dataset. In parenthesis, the difference between the score achieved when using reference summary 1 and the one achieved when using ref- erence summary 2. . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . 48 4.5 Rankings of the Academic Paper dataset summaries according to the different scores. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.6 Rankings of the sample summaries according to the different scores. . 51 4.7 Statistic of RoBERTa confidence scores . . . . . . . . . . . . . . . . . 59 4.8 Statistics of RoBERTa S Confidence Scores . . . . . . . . . . . . . . . 60 A.1 Scores on Text Dataset against both the reference summaries.(1: com- paired against reference summaries 1, 2: compaired against reference summaries 2, GS: Greedy Score, SBS: Sentence Bert Similarity) . . . I A.2 The scores the different models achieved on the selected sample of the Academic Paper dataset. In parenthesis, the difference between the score achieved when using reference summary 1 and the one achieved when using reference summary 2. . . . . . . . . . . . . . . . . . . . . II xv List of Tables xvi 1 Introduction When trying to investigate a topic, researchers nowadays have access to huge amounts of data - news articles, papers, blog posts, etc. However, due to the sheer volume of data, surveying it for relevant information can be a difficult task. Even when summaries or abstracts are provided, as is often the case with papers, they do not necessarily constitute a good summary of the actual document. Instead, many of them are written with the intention of enticing the reader to read the rest, not with the intention of providing them with all the most interesting information up front. Reading the full documents, however, can be an impossible task, and reading only parts of them selectively risks missing out on important pieces of information. Ideally, what a researcher in this position would want is for someone to sort through the material and provide them with summaries comprised of all the most important bits of the documents. These summaries can then help them to get a good overview over the topic(s) and to decide which documents to read in full. Since producing such summaries is a very time-intensive task when done by humans, many attempts have been made over the past decades to automate this process, especially in the news domain. In this thesis, we aim to investigate the current state-of-the-art meth- ods and apply them to the summarisation of academic papers from the field of traffic safety. Natural Language Processing (NLP) in general and automatic text summarisation in particular are still very active fields of research with new papers and models being released all the time. Most of the current state-of-the-art models for various NLP tasks are still very new and have not been tested very extensively on the task of text summarisation or on data sets comprised of anything else than relatively short news articles. This is what we wish to investigate in this thesis. 1.1 Objective and Scope The main objective of this thesis is to investigate the suitability of several pre- trained language models for automatic text summarisation. For this purpose we will investigate several sub-tasks: • As many different models exist, we will compare how the chosen models per- form against each other for the purpose of extractive text summarisation. • Since few datasets for summarisation exist outside the news domain, and cre- ating one is out of scope of this thesis, we will instead investigate how well a 1 1. Introduction model fine-tuned on a dataset consisting of news articles is able to generalize and perform on the very different academic texts. 
• To properly evaluate the models, adequate evaluation methods are required. Currently used evaluation methods have limitations, for example requiring ex- act word matching. Therefore, we will investigate a method based on sentence similarity. We will employ this and the current standard metrics to evaluate the generated summaries and compare results against human judgement of the summaries’ quality. The scope of this thesis project is limited in the following ways: 1. Developing a new method for text summarisation or creating a new model for text representation is out of scope for this project. 2. Development of a complete method to automatically pre-process scientific doc- uments (e.g., from PDF to clean text) for summarisation is outside the scope of this project. 3. No graphical interface for any end-user will be created. This thesis focuses solely on the scientific investigation of methods for automatic text summari- sation, not on the development of a finished software product. 4. The aim is to produce summaries of individual scientific papers from the field of traffic safety. Multi-document summaries (producing a single summary combining the information from several source documents) are out of scope for this project. 1.2 Outline The rest of the thesis is structured as follows: In Chapter 2 we will provide an overview over the current state of the art in the fields of NLP and automatic text summarisation. We will offer definitions and descriptions of the most important concepts and methods in NLP and will in particular introduce the most promising approaches to the automatic creation of text summaries and to their evaluation. Chapter 3 will detail how we performed our experiments. We will describe the implementations of the models and methods we used, how we obtained our training, test and evaluation data, and what kind of experiments we performed. The results of those experiments will be presented and discussed in Chapter 4. In Chapter 5 we will look at some ethical considerations regarding automatic text summarisation in general, review our most important findings with respect to the limitations of our project, and suggest possible directions for future work. 2 2 Theory Automatic text summarisation is part of a research area called natural language processing (NLP). In the following, different approaches to text summarisation and evaluation of text summarisation will be presented in Sections 2.1 and 2.2, respec- tively. In Sections 2.3 and 2.4, the two important concepts of text embedding and language modeling will be explained. Section 2.5 follows the development of increas- ingly powerful machine learning models that have been developed for various NLP tasks and are useful for automatic text summarisation. 2.1 Automatic Text Summarisation Automatic text summarisation techniques can be divided into two different ap- proaches: extractive and abstractive. As described by See in [1], extractive sum- marisation techniques produce summaries by directly picking a subset of relevant sentences/phrases/words from the source document(s). Abstractive methods gener- ate summaries using a separate vocabulary, rebuilding each sentence from scratch thus allowing the summary to contain words that don’t necessarily appear in the source documents. This has the potential to result in a more cohesive text, but can also distort facts. Abstractive methods require encoding a much deeper level of understanding of natural language to be successful. 
Most research so far has been focused on extractive methods, as they don't require the machine to "understand" the text semantically and are easier to implement. For the above reasons, and to be able to draw on this wealth of available research, this thesis will be focusing on extractive summarisation.

2.1.1 Extractive Summarisation Methods
The summaries produced by the extractive method are a subset of the sentences of the source document(s). In the following subsections, we will look at two main approaches to this task: Score and Select, and Sequence Labeling.

2.1.1.1 Score and Select
In the score and select method, summarisation is treated as a problem of assigning each sentence a score which captures how important it is to include in the summary. Then, the n highest-ranking sentences are selected to form the summary. Alternatively, the sentence selection can be approached as an optimisation problem: importance and coherence are to be maximised, redundancy minimised.

2.1.1.2 Sequence Labeling
Another way to model the extractive summarisation task is to treat it as a binary classification problem: each sentence needs to be labeled either as a summary sentence (which will be included in the summary) or a non-summary sentence (which will not be). Usually, a neural network (see Section 2.5.1) is trained for this task, using training data of texts and their sentence labels. Many current state-of-the-art methods use this approach, like BERTSum as introduced by Liu [2], which will be described in Section 2.5.4.1.

2.2 Summarisation Evaluation
Automatically creating summaries is a difficult task in itself, and evaluating the quality of these summaries is another non-trivial task. Evaluation of summaries can be done manually: people may read at least a sizable part of the documents as well as the produced summary and subjectively judge how well it summarises the documents. Another often used human-judgement metric is how well the given summary can be used to answer certain queries. However, there can be a significant amount of variation in how different people judge the quality of the same summary.

Formulating objective quality measures and automating the evaluation process can help with this problem. Being able to obtain an exactly defined score for each summary makes it possible to better compare the results of different summarisation approaches with one another. According to Allahyari et al. [3], the most widely used metric for automatic evaluation is ROUGE (Recall-Oriented Understudy for Gisting Evaluation). This metric can be used to compare the produced summaries against a set of typically manually created reference summaries according to certain criteria.

2.2.1 ROUGE
ROUGE, introduced in 2004 by Lin [4], is a set of measures that can be used for the evaluation of text summaries by comparing them to reference summaries assumed to be ideal. The comparison is performed by evaluating the overlap of text units, such as single words, word pairs (bi-grams) or word sequences, between the summary to be evaluated (also called the candidate summary) and the reference summary. Using ROUGE, the evaluation of summaries can be completely automated, which both saves time and allows for a more objective measure to compare different summarisation methods against each other.

In [4], Lin introduces five different ROUGE measures: ROUGE-N, ROUGE-L, ROUGE-W, ROUGE-S and ROUGE-SU. Each of them will be briefly described in the next few paragraphs.
ROUGE-N is defined by the overlap of n-grams between the candidate summary and the reference summary. An n-gram is a sequence of n adjacent words. In particular, the two commonly used measures ROUGE-1 and ROUGE-2 refer to the overlap of single words and the overlap of bi-grams, respectively.

ROUGE-L refers to the Longest Common Subsequence (LCS, see [4]). Its score is based on the longest matching sequence of words that can be found between the candidate summary sentences and the reference summary sentences. The total ROUGE-L score of the summaries is computed from the LCS scores of the individual candidate-reference sentence pairs.

ROUGE-W is a weighted variant of ROUGE-L that favours consecutive LCSs.

ROUGE-S is similar to ROUGE-N, but measures the co-occurrence of skip-bigrams rather than n-grams. Skip-bigrams allow for arbitrary gaps between the two words of the bigram. For example, the sentence "I had lunch today." contains the following skip-bigrams: "I had", "I lunch", "I today", "had lunch", "had today" and "lunch today". Note that the order in which the words appear matters.

The last one is ROUGE-SU. This is an extension of ROUGE-S, which additionally takes unigrams into account. Using ROUGE-S, there would be no skip-bigram match between the two sentences "I had lunch today" and "today lunch had I", as the second sentence is the exact reverse of the first. With ROUGE-SU, however, we get four unigram matches. ROUGE-SU can be obtained from ROUGE-S by adding a begin-of-sentence token at the beginning of each candidate and reference sentence. For the example above, this would give us the two sentences "[START] I had lunch today" and "[START] today lunch had I", which would give us the following ROUGE-S matches between them: "[START] I", "[START] had", "[START] lunch" and "[START] today".

Lin [4] concludes that the ROUGE scores correlate well with human judgement for single-document summarisation, but less well for multi-document summarisation. ROUGE-1, ROUGE-2, ROUGE-S4, ROUGE-S9, ROUGE-SU4 and ROUGE-SU9, however, performed "reasonably well when stopwords were excluded from matching" [4]. Correlations with human judgement can be further increased by using multiple reference summaries per document.

ROUGE, in particular ROUGE-1, ROUGE-2 and ROUGE-L, has been used widely in recent papers on automatic summarisation techniques such as [5], [2] and [6] to evaluate their performance. These evaluations are very commonly done on the CNN/Daily Mail dataset [7]. This is because it has long been one of the biggest available datasets of texts and reference summaries, and because evaluating different methods on the same dataset facilitates easy comparison between them.

2.2.2 Other evaluation metrics

2.2.2.1 BLEU
Another once very commonly used metric, originally developed by Papineni et al. to evaluate machine translations, is BLEU (bilingual evaluation understudy, [8]). It was used to evaluate machine-generated text against human-generated text. Subsequent papers, like [9] and [10], however, have called the usefulness of BLEU for anything other than the evaluation of machine translations into question. In our research, we have not come across BLEU being used for the evaluation of text summarisation today, which is why we will not discuss it here any further.

Steinberger and Ježek [11] give a good overview of various approaches to text summarisation evaluation. In the following, we want to mention in particular: precision, recall and F-score, and cosine similarity.
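To make the n-gram overlap behind ROUGE-N more concrete, the following minimal Python sketch computes the clipped n-gram overlap between a candidate and a reference summary and reports it as recall, precision and F-score (the same quantities discussed in the next subsection). This is only an illustration under our own naming choices (ngrams, rouge_n); it is not the ROUGE implementation used for the experiments in this thesis.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1, beta=1.0):
    """Compute ROUGE-N recall, precision and F-score between two token lists.

    The overlap is clipped so that each reference n-gram can only be matched
    as many times as it actually occurs in the reference.
    """
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())            # clipped n-gram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    if precision + recall == 0:
        return recall, precision, 0.0
    f_score = (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)
    return recall, precision, f_score

# Example: unigram (ROUGE-1) and bigram (ROUGE-2) overlap
candidate = "the cat was found under the bed".split()
reference = "the cat was under the bed".split()
print(rouge_n(candidate, reference, n=1))
print(rouge_n(candidate, reference, n=2))
```

The same pattern extends to skip-bigrams for ROUGE-S by generating all ordered word pairs instead of adjacent ones.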
2.2.2.2 Precision, Recall and F-Score
As extractive summarisation is essentially a binary classification problem, we can use precision and recall to evaluate generated summaries against extractive reference summaries in the following way: precision (P) is the number of sentences in the generated summary that are also in the reference summary, divided by the number of all the sentences in the generated summary. Recall (R) is the number of sentences in the generated summary that are also in the reference summary, divided by the number of all the sentences in the reference summary.

It is worth noting that a very high precision score can be achieved by a summary that includes only a single sentence, as long as that sentence is also in the reference summary. High recall, on the other hand, can be achieved by a generated summary that contains a multitude of irrelevant sentences, as long as it also includes many of the sentences in the reference summary. In the case of extractive summarisation, the optimal summary is likely to lie somewhere in between those two extremes. A more useful metric than precision or recall alone is therefore the F-score, which combines the two:

F = \frac{(\beta^2 + 1) \cdot P \cdot R}{\beta^2 \cdot P + R}    (2.1)

where \beta is a weighting parameter that favours recall when chosen greater than 1 and precision when smaller than 1.

2.2.2.3 Cosine Similarity
Cosine similarity is a measure of the similarity of two vectors. If the candidate and reference summaries have been converted into vector space, where similar summaries are closer to each other (more on how this is done can be read in Section 2.3), then cosine similarity can be used to determine how similar two summaries are, using the following measure to compare them:

\cos(X, Y) = \frac{\sum_i x_i \cdot y_i}{\sqrt{\sum_i x_i^2} \cdot \sqrt{\sum_i y_i^2}}    (2.2)

2.3 Text Embedding

2.3.1 Word Embeddings
For a computer to be able to work with text, we need to first convert that text into numerical input, usually into vectors. This process is called text embedding and can be done on word, sentence or document level. The input text itself is treated as a sequence of tokens, which usually are either words or sub-word entities (like "play" and "#ing" for the word "playing"). Ideally, these embeddings are not just arbitrary, but capture some syntactic and semantic information. The goal is to embed the words in such a way that words with similar meanings are embedded similarly, meaning they are close together in the vector space. The linguist Zellig Harris noted already in 1956 [12] that words that appear in similar contexts tend to have similar meanings, which is known today as the distributional hypothesis. This hypothesis implies that, with a large enough corpus of text to train on, word embeddings can be learned unsupervised by neural networks, simply by observing the contexts (other words) they often appear in.

A very commonly used word embedding method applying the distributional hypothesis with very good results is word2vec [13]. The embeddings the algorithm produces clearly capture semantic properties of words, as the following example, taken from [13], illustrates: "vector("King") - vector("Man") + vector("Woman") results in a vector that is closest to the vector representation of the word Queen". However, one of the drawbacks of word2vec and similar embedding methods like GloVe [14] is that they embed each word only as a single, fixed vector, regardless of the specific context it appears in. Consider, for example, the sentence "I lost my cell phone in the prison cell".
The word vector for "cell" would be the same in both occurrences, capturing some mixture of all the different meanings that "cell" can take on. Recently, new approaches for text embedding that utilize deep learning and attention have improved on this, like BERT (see Section 2.5.3.1 for details) and XLNet (see Section 2.5.3.4 for details). The concept of attention will be described in Section 2.5.2.7. Rather than returning fixed vectors for each word, these models, after training, can be used to obtain context-specific word vectors. In our previous example, the two occurrences of "cell" would be represented as two different vectors, one most likely being much closer to the vectors of words like "telephone" and "conversation" and the other being closer to words like "crime" and "punishment".

2.3.2 Sentence Embedding
Sentence embeddings can be computed from the vectors of the words they are made of (the simplest approach being to take their average). Language models like BERT and XLNet can also be trained to produce sentence embeddings for specific tasks.

2.3.3 Document Embedding
Just like sentence embeddings can be obtained by aggregating word embeddings in some way, document embeddings can be obtained from the embeddings of the sentences they contain. Again, the simplest approach would be to simply average the sentence vectors.

Another method is Doc2Vec [15]. Doc2Vec, also called Paragraph Vector, is an unsupervised algorithm that extends the basic concept of Word2Vec to variable-length pieces of text. These may be single sentences or long documents. Doc2Vec learns fixed-length vector representations of these text pieces by trying to predict words in them. This method of document encoding is able to capture semantic information about the text unit it is given, much like Word2Vec does for words, and some researchers like Campr and Ježek [16] found it useful for the evaluation of automatic text summarisation. Dai et al. [17], too, investigated Doc2Vec's general usefulness for measuring the similarity of two texts and found that it performed better than or on par with other methods of document embedding. They also found that vector operations can be performed on the vectors, much like with word2vec.

2.4 Language Modeling
Another important concept for the field of NLP is that of language modeling, which means representing a language as a probability distribution over sequences of words. Jozefowicz et al. [18] give a good overview of the developments in language modeling up to 2016. Ideally, a language model is able to capture both grammatical and semantic information, assigning high probabilities to sentences that are both grammatically correct and likely to appear in the context of the corpus, which is often limited to texts belonging to a certain topic, and low probabilities otherwise. Language models are used for many NLP tasks like speech recognition, machine translation and text summarisation. In the past, RNNs (see Section 2.5.2.1) were very commonly used to train such models. However, as of 2020, when this thesis was written, two of the most promising models for language modeling are BERT [19] and XLNet [20]; both employ the Transformer architecture (as described in Section 2.5.2.8). In the following sections, these models will be described in more detail.

2.5 Machine Learning Models for NLP
In more recent years, machine learning, in particular neural networks, has enabled great progress in NLP in general and automatic text summarisation in particular.
In the following, we will trace the development of these techniques. Section 2.5.1 gives an introduction to Artificial Neural Networks. In Section 2.5.2, networks for handling sequential data, such as text, are introduced, from the early Recurrent Neural Networks (Section 2.5.2.1) to the more recent Transformer (Section 2.5.2.8). Section 2.5.3 introduces several pre-trained language models. Section 2.5.4 introduces two task-specific models that build on pre-trained models.

2.5.1 Artificial Neural Networks
Artificial Neural Networks (ANN, often also just referred to as neural networks; see [21] for a more detailed overview) are computing systems inspired by the biological neural networks found in the brain. As a biological network consists of neurons, an artificial neural network is made up of artificial neurons (from here on just referred to as neurons), which are essentially functions. Each such neuron receives input and performs some computation on it, before passing on the result of that computation, multiplied by some weight, to one or more neurons of the next layer it is connected to, until the ones in the last layer produce the output of the network.

Figure 2.1: A simplified illustration of an ANN.

In order to perform a specific task, a neural network, once set up, needs to be trained. To do this, a set of training data is required, meaning a set of inputs with the corresponding outputs we want the network to produce. If the network does not produce the desired outputs, the weights of the neurons are adjusted through a process called backpropagation. For details on this process, the interested reader is referred to [21].

Each ANN model has an objective function, which captures the desired outcome. For example, the objective function of an ANN trying to guess a number correctly might be to minimize |number_guess − number_true|. A so-called loss function expresses how well the model fits the training data. The loss function depends on the parameters of the model (the weights). The aim of training is to find the parameters/weights that minimize the loss function. This is done via a process called gradient descent.

A gradient is the multi-dimensional equivalent of a function's derivative, which measures the slope of a function. This slope will be 0 for parameter values at which the function has a maximum/minimum. The aim of training is to find the minimum of the loss function, and therefore to find parameters for which the gradient is 0. We "descend" the gradient until we hit its low point of zero. Computing the gradient of a function produces a vector that points in the "uphill" direction of the gradient, which is why gradient descent happens in two steps:
In this case, the training data is split into so-called batches, which are processed one by one. This also affects the gradient calculation since we only have access to a random subset of the data. In this case, a stochastic approximation is used, stochastic gradi- ent descent (SGD). After each batch has been processed, backpropagation is applied. Neural Networks form the base for all the models that will be described in the following sections. 2.5.2 Sequential Models For NLP tasks, input data often takes the form of sequences: A sentence, for ex- ample, is a sequence of words and a document is a sequence of sentences. Such sequences will vary in length, which poses a problem for the neural networks. By encoding the input, using Continuous Bag Of Word representations like [13], it is possible for neural networks to process such data. But this process is limited, as it does not take the order of words into consideration. The word order, however, can be very important for the meaning of a sentence: For example, “bad, not good” and “not bad, good” have very different meanings, even though they contain the exact same words. Convolutional Neural Network models, [22], would be able to represent such relations and dependencies, but are limited to only local ordering and have trouble with relations over large distances, such as a long sentence. This is because of the convolution process, which generally only covers a short range. For an ex- ample of why this is important, imagine a text describing somebody’s biography. The first sentence of this text might be something like: "XYZ was born in France." Then many other sentences may follow, which don’t refer to XYZ’s country of birth, until the last sentence: "XYZ returned to his country of birth and died there." In order to know what "his country of birth" refers to, it is important to remember the information from the beginning of the text. CNN models would struggle with this and potentially not be able to resolve that "his country of birth" and "France" refer to the same country. To model sequence dependencies over large distances, other models are required. One such model that is specifically designed to model dependencies between sequential inputs, is the so-called Recurrent Neural Network, described in the next section. 10 2. Theory 2.5.2.1 Recurrent Neural Network (RNN) Recurrent Neural Networks (RNN) can process data of varying length, while main- taining structured relations and dependencies. A RNN takes as input a sequence of vectors, each of which is processed in a step-by-step fashion, outputting a state vector which is used to pass on information to the next step. As more of the in- put is processed, the state vector gathers more information, better representing the sequence. Figure 2.2 illustrates the basic RNN architecture. For the first step an initial randomized state vector is used, for each subsequent step the previous state vector is used as input. Figure 2.2: Single RNN cell. If the length of the input sequence is known in advance, the network can be unrolled to display the full network, as illustrated in Figure 2.3 . Figure 2.3: RNN unrolled over 4 steps. When unrolled, it can be seen that the RNN is a deep neural network and can thus be trained like a feed-forward NN by backpropagation through time. 2.5.2.2 Stacked RNN RNNs can be stacked [23], such that the output from one layer is used as the input to the next layer. This creates hierarchical structures, often called Deep RNNs. 
As Goldberg writes in [24], stacked RNNs often perform better on various NLP tasks, but it is not theoretically clear why.

Figure 2.4: Stacked RNN.

2.5.2.3 Bidirectional RNN
An issue with RNNs is that they can only use past states for predictions; future states, however, might also contain useful information. Additionally, when processing a sequence, later states will contain more information than earlier states, so accuracy improves as more of the sequence is processed. Bidirectional RNNs attempt to solve this by utilizing two layers. Each layer processes the same inputs, but in opposite directions, i.e., one does so from front to back, as can be seen in Figure 2.5, the other from back to front. The output of each step is a combination of the two layers' outputs at that step. This allows the network to use past and future states with more accumulated information.

Figure 2.5: Bidirectional RNN.

2.5.2.4 Simple RNN
A simple version of the RNN was proposed by Elman [25]. In this version, the state vector is simply the linear combination of the previous state and of the current input, passed through a non-linear activation function. This simple architecture suffers from the exploding/vanishing gradient problem (EVGP). Hanin [26] explains how and under which circumstances the EVGP occurs. It means that when the weights of the network are updated, the increment in which this is done is either too big and therefore too imprecise, or too small and therefore effectively meaningless. For the simple RNN, this happens especially when handling long sequences. Over long sequences information is lost and thus the ability to represent dependencies is compromised.

2.5.2.5 Long Short Term Memory (LSTM)
The LSTM architecture was developed by Hochreiter et al. [27] to solve the vanishing gradient problem, among others. The main addition is the use of a memory cell in combination with a number of "gates" that control it. The gates are values computed using the previous and current steps of the sequence. As each segment of a sequence is processed, the gates influence what should be added to the memory, what should be forgotten, and what the new output should be. The gating components allow gradients to be passed through the memory cell over longer ranges.

2.5.2.6 RNN Modes
There are different modes for handling the outputs produced by an RNN:

Acceptor: The output is based on the information contained in the final state. For the example of sentence classification, a classification would be produced after all words in the sentence have been processed.

Encoder: The output is the final state vector, an encoding of the sequence into a single vector, often used in combination with a decoder. For the sentence example, after all words have been processed, an encoder outputs an encoding of the sentence.

Transducer: The output is based on the combined information of each step's state vector. For the sentence example, there would be a single output after each word has been processed.

Encoder/Decoder: An Encoder-Decoder architecture is often used for sequence-to-sequence NLP tasks. The RNN Encoder-Decoder was first introduced by Cho et al. [28]. The encoder encodes an input sequence into an intermediate vector representation, as described above. This vector is then used as the initial state for the decoder. A decoder is often autoregressive, meaning that it consumes its own output, using the output produced so far as input in the next step.
Using the example of translation, a sentence is given as input for the encoder, producing the intermediate vector. The decoder, with this vector as the initial state, works step-by-step producing and consuming the translated sentence word-by-word until it finds itself generating an end-token. This architecture is illustrated in Figure 2.6. Figure 2.6: Encoder/Decoder model. 13 2. Theory 2.5.2.7 Attention The problem with the Encoder/Decoder sequence-to-sequence model described in Section 2.5.2.6 is that it encodes the entire input sequence into a single, fixed-length context vector, which the decoder then uses to generate the output. In order to produce this vector, the input sequence is processed sequentially from beginning to end, and at each step only some of the information from the previous step is passed on. This means, that the final output of the encoder is much more influenced by the last couple of tokens than it is by the first. For very long sequences, this can lead to important information from the beginning of the sequence simply being “forgotten”. Even LSTM can not fully solve this problem. In order to alleviate this problem, Bahdanau et al. [29] suggest a new way of processing sequential data: Instead of using an encoder to produce a single context vector while discarding all the intermediary hidden states of the encoder, the authors propose to utilize all the encoder states. The goal of training such a model is then no longer to produce the one context vector that perfectly encodes the input sequence, but rather to learn which parts of the input sequence to pay attention to in order to generate each part of the output sequence. For illustration purposes, Figure 2.7 shows a machine translation example from [29]. It shows the attention that was paid to each French word of the input sequence to produce each word of the English output sequence. Note, for example, that in order to generate the English word “Syria”, full attention was paid to both the French words “la” and “Syrie” and little to no attention to any of the other words in the sentence. Figure 2.7: Translation of a French sentence into an English one. Figure taken from [29] with permission from the authors. 14 2. Theory 2.5.2.8 Transformer Another architecture for sequence modeling is the Transformer, introduced by Vaswani et al. in the paper "Attention Is All You Need" [30]. This model follows the en- coder/decoder structure introduced in 2.5.2.6, but as the title of the paper suggests, it relies primarily on attention. It was developed to solve some of the issues with existing models, which were largely based on RNN (see section 2.5.2.1) and CNN [22]: Most prominently the problem of retaining information over many steps when encoding long input sequences, and the limited possibility of parallelization, since every step of the encoding and decoding requires the output of the previous step. Even when RNN models were enhanced by the addition of attention to alleviate the former problem, the problems with parallelization remained. CNN models, on the other hand, can be parallelized but suffer from an increased path length in the network as sequence length increases, which increases the amount of information that is potentially lost. The Transformer architecture discards the recurrent approach of sequence modelling and utilizes at- tention instead, as described in Section 2.5.2.7. 
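To make the idea of attention more concrete, the sketch below implements dot-product attention over a set of encoder states in plain NumPy. This is an illustrative sketch rather than the exact mechanism of [29] (which uses an additive scoring function); the scaled dot-product variant shown here is the one used by the Transformer described in the next section, and the names (softmax, dot_product_attention) are our own.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(queries, keys, values):
    """Scaled dot-product attention.

    queries: (m, d) states asking "what should I look at?"
    keys:    (n, d) encoder states used to score relevance
    values:  (n, d) encoder states that are mixed into the result
    Returns the context vectors (m, d) and the attention weights (m, n).
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)     # relevance of every input position
    weights = softmax(scores, axis=-1)         # each row sums to 1
    context = weights @ values                 # weighted sum of the input states
    return context, weights

# Toy example: one decoder state attending over four encoder states
rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(4, 8))
decoder_state = rng.normal(size=(1, 8))
context, weights = dot_product_attention(decoder_state, encoder_states, encoder_states)
print(weights)   # how much attention each input position receives (row sums to 1)
```

The weights play the same role as the word-alignment pattern in Figure 2.7: each output step distributes its attention over all input positions instead of relying on a single fixed context vector.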
Due to its non-sequential approach, this method is highly parallelizable while only requiring a constant, O(1), number of operations and path lengths. This allows for faster training and better retention of information. The basic architecture of the Transformer model can be seen in Figure 2.8 and will be described in the following paragraphs. Encoder The encoder of the transformer consists of 12 stacked encoder layers. Each such layer consists of two sub-layers, an attention layer and a feedforward network layer. The input to the first of these layers is a sequence of embeddings, very commonly word embeddings. Before these embeddings are passed to the model, they are sup- plemented with positional encoding to be able to retain the information of word order, despite no longer processing the input in a sequential manner. With the po- sitional encoding, the same token at different positions is encoded differently, and their final embeddings will have some meaningful distances in vector space. The attention sub-layers allow the system to focus on the most relevant parts of a sequence. During the encoding process this is used to determine how much a word relates to all other words in the sequence. The attention mechanism used by the Transformer is called Multi-Head Self-attention: "Self", because it encodes each word of the sequence in relation to the other words in the sequence, and "multi-head" because each attention layer utilizes not just one, but several different attention weights ("attention heads") [30]. This means that multiple attention processes are performed in parallel. The idea is for each to focus on different aspects of the sequence. The second sub-layer is a fully connected feed-forward neural network. This network is applied to each element of the sequence separately but identically. 15 2. Theory Figure 2.8: Transformer architecture. Taken from [30] with the authors’ permis- sion. 16 2. Theory All sub-layers of the encoder also contain residual connections, their purpose is to combine the input, which has not been affected by the layers, with the output pro- duced by the layers. In the Transformer architecture, these residual connections are used to restore positional encoding after processing the word embeddings. Without these connections the performance suffers greatly, as positional information gets lost. Decoder The decoder differs only slightly from the encoder. It contains the same two sub- layers, but has an additional sub-layer, an encoder-decoder attention layer, between them. The decoders role is to produce output, using the information produced by the encoder. In the Transformer it does this by performing multi-head attention over the output of the encoder and the so far produced output by the decoder. This differs from the other forms of attention in the model as it is no longer self-attention, instead it uses multiple sequences. 2.5.2.9 Transformer-XL The Transformer model described above, as proposed by Vaswani et al. [30] handles the whole input at once, as such there must be some limit on the length of the input, due to computational and resource limitations. The default implementation uses a token limit of 512. This means the transformer can only consider any token’s context in 512 token blocks. Solutions for longer texts have been suggested in later works, such as [31], where the longer corpus is split into multiple 512 length blocks. 
This, however, has two problems: firstly, no contextual information is shared between blocks, and secondly, the splitting of the corpus is often done without any respect for sentence or semantic structure, leading to context fragmentation.

Transformer-XL [32] is an architecture proposed to solve the problem of fixed-length contexts. Its main contribution is the re-introduction of recurrence to the Transformer, which allows context to flow between blocks of a split corpus. To achieve context flowing over the boundaries of blocks, the previous block's attention vectors are saved and can be "looked back" on for context, resulting in better long-term dependency modelling and avoiding the fragmentation problem. Applying this method to every two consecutive blocks creates a combined context that can represent context over much more than just two blocks. The method could also be extended to allow for further connections, beyond just two blocks.

For this method to work, another type of positional encoding is required. Since the model looks back at previous blocks, the absolute positional encoding employed by the Transformer no longer works, since each token will appear in multiple positions and tokens in different segments would be assigned the same positions. Instead, a relative positional encoding based on the distance between tokens is used. The attention score, too, is calculated slightly differently from the Transformer. The Transformer-XL model is able to generalise quite well from training on short sequences to much longer sequences. For example, Dai et al. detail in [32] how the model was trained with an attention length of 784 tokens, evaluated on a corpus of 3,800 tokens, and achieved a new state-of-the-art result.

2.5.3 Pre-Trained Language Models
In the following, we will introduce several pre-trained models for NLP tasks: BERT (Section 2.5.3.1) and XLNet (Section 2.5.3.4), which are built on the Transformer and Transformer-XL architectures, respectively. Additionally, a number of variations of BERT will be presented: RoBERTa (Section 2.5.3.2) and DistilBERT (Section 2.5.3.3).

2.5.3.1 BERT
BERT is a Transformer-based model for language encoding, introduced in [19] by Devlin et al. The authors identify the fact that models could only be trained unidirectionally as one of the big limitations of previous approaches to language modeling. This meant that a token could only be encoded using the information of either the tokens to its left or the tokens to its right, but never using information from both combined at the same time. The objective in creating BERT (Bidirectional Encoder Representations from Transformers) was to create a model that could take the full context of a token into account, left and right.

In order to use BERT for some downstream task (like machine translation or text summarisation), two steps are necessary:

1. The model needs to be pre-trained. This means the model is not yet trained in any task-specific way, but instead is taught to encode language itself in a sensible way. This is done so the same model can be used for several different downstream tasks without needing to be trained from scratch. Task-specific training (fine-tuning) is done in the next step. Pre-training results in a general-purpose Transformer that can encode input tokens. This pre-training is done on unlabeled training data over two different training tasks, described later in this section.
2. The model can then, once it is initialized with the parameters obtained through pre-training, be fine-tuned to be used for a particular downstream task. Importantly, in order to do this, the architecture of the model itself does not need to be changed. Instead, the same pre-trained model can be applied to several different tasks, by layering task-specific layers on top and training the model on labeled training data pertaining to the desired downstream task.

Since the authors made their pre-trained BERT models (a larger and a smaller one) available for download and free to use, this means that with relatively little effort, these already pre-trained models can be applied to a wide variety of text-based tasks.

The architecture of the model itself is almost identical to the Transformer architecture described in Section 2.5.2.8. Perhaps more interesting is how textual input is processed and how the model is (pre-)trained.

As input, BERT accepts textual sequences that may each be composed of either a single sentence or a pair of sentences, where a "sentence" means any arbitrary span of contiguous text, not necessarily a sentence in the grammatical sense. Each such sequence is preceded by a classification token ([CLS]). The final hidden state for this token can be trained to obtain an aggregated representation of the entire sequence. This is useful for some classification tasks, like summary/non-summary sentence classification for extractive summarisation. If the sequence consists of two sentences, then they are separated by a [SEP] token. Additionally, BERT adds a so-called segment embedding to each token, which indicates whether it belongs to Sentence A or Sentence B. The input representation of each token is obtained by adding together the token's WordPiece embedding (see [33] for details), segment embedding and positional embedding. The latter encodes where in the sequence the token is located. This is necessary, as BERT, being a Transformer model, does not go through the tokens sequentially, and therefore does not "know" the order of the input tokens. Figure 2.9 illustrates the BERT input representation.

Figure 2.9: BERT input representation. Reproduced from [19] with the authors' permission.

Once the input representation is obtained, BERT is pre-trained by trying to solve two tasks, as mentioned above. These two tasks are the following:

1. Masked Language Model (MLM): Some percentage of the input tokens (in the paper the authors chose 15%) is masked and BERT is tasked with predicting them by using the entire context, left and right. Notably, only these masked tokens are predicted by the model, and no attempt is made to reconstruct the entire input.

2. Next Sentence Prediction (NSP): This task is meant to help the model learn the relationships between sentences. To create the pre-training dataset, for each training instance two sentences A and B are picked from the training corpus. With 50% probability, sentence B will be sentence A's successor; with 50% probability it will be a random sentence from anywhere else in the corpus. BERT is tasked with predicting (binary) whether sentence B is indeed sentence A's successor.

For pre-training, the authors used BooksCorpus [34] and English Wikipedia texts. The pre-trained models BERT_LARGE and BERT_BASE are publicly available at https://github.com/google-research/bert.
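As an illustration of how such a pre-trained model can be used before any fine-tuning, the sketch below loads a pre-trained BERT via the HuggingFace Transformers library (the same library that distributes DistilBERT, see Section 2.5.3.3) and extracts the context-dependent token vectors and the [CLS] representation for a single sentence. Exact class names and output formats depend on the library version, so treat this as a sketch rather than the setup used for our experiments.

```python
import torch
from transformers import BertTokenizer, BertModel

# Load the pre-trained (but not yet fine-tuned) model and its WordPiece tokenizer.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

sentence = "Automatic summarisation selects the most important sentences."
# The tokenizer adds the [CLS] and [SEP] tokens and builds the segment/position ids.
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

last_hidden = outputs[0]            # (batch, tokens, hidden): one context-aware vector per token
cls_vector = last_hidden[:, 0, :]   # the [CLS] position, usable as a sequence representation
print(cls_vector.shape)             # torch.Size([1, 768]) for bert-base
```

Fine-tuning for a downstream task then amounts to placing a small task-specific layer on top of these representations and continuing training on labeled data, as described in step 2 above.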
2.5.3.2 RoBERTa
In [35], the authors claim that BERT, as introduced in the original paper [19], is actually undertrained, and show that with some modifications significantly better results can be achieved, competitive with the performance of every model published after BERT. They name their modified BERT version RoBERTa (a Robustly optimized BERT approach). Apart from changing some of BERT's hyperparameters, the main differences from BERT pertain to how the model is pretrained. The main changes from the training suggested in [19] are the following:

Dynamic Masking
In the original BERT, the masking of the input sequences is done in a static way: only once, in pre-processing. To ensure that BERT will still encounter the same sequences with different masking patterns, the training data instances are duplicated ten times before masking. RoBERTa instead applies dynamic masking: a new masking pattern is generated every time a sequence is fed to the model. This means that the model will encounter many more different masking patterns of the same instance, which in turn removes the need to drastically increase the number of training instances.

Full sentences
As opposed to BERT, RoBERTa uses exclusively full sentences as input to the model. Such sentences are sampled contiguously from the documents, such that the total length does not exceed 512 tokens. If document boundaries are crossed while sampling, a special inter-document separator token is inserted.

Training in large mini-batches
BERT was trained in 1 million training steps with a batch size of 256 sequences. RoBERTa, on the other hand, was trained in only 125,000 steps, using a much larger batch size of 2,000 sequences. The authors express uncertainty over whether they have already found the ideal batch size with this, but it produces better results than the original BERT while taking less time to train.

Larger byte-level BPE
BPE (byte-pair encoding) is a hybrid between character- and word-level text encoding. It bases its encodings on subword units; for example, instead of encoding the word "playing" or each of its letters separately, BPE might encode "play" and "#ing", building blocks which can be re-used to form other words as well. This allows a much larger vocabulary to be covered than would otherwise be possible. However, in BPE a large portion of the encodings are often encodings of single unicode characters, which limits the total number of words that can be captured. RoBERTa instead makes use of a variation of BPE introduced in [36], which is based on bytes instead of characters. This means that fewer subword units are needed to encode a larger vocabulary. While BERT uses character-level BPE with a vocabulary of 30,000 subword units and requires the input to be tokenised in preprocessing, RoBERTa requires no such preprocessing and uses byte-level BPE with a vocabulary of 50,000 subword units.

Longer pretraining on larger data sets
RoBERTa was ultimately pretrained for up to 500,000 steps with the large batch size and on 160GB of textual training data, resulting in much better end-task performance. The pretrained RoBERTa model is publicly available at https://github.com/pytorch/fairseq

2.5.3.3 DistilBert
Pre-trained language models such as the ones described in the previous sections can easily have several hundred million parameters. This means that they require great amounts of memory and computational power to be trained and run.
Sanh et al. therefore set out to create a much smaller language model, which is less resource demanding. They did so using the method of knowledge distillation ([37], [38]), creating a much smaller Transformer model, which they called DistilBERT. DistilBERT has the same general architecture as BERT (see Section 2.5.3.1), but fewer layers. The authors also pre-trained a BERT model according to the best-practice suggestions of [35] (see the section on RoBERTa: 2.5.3.2). DistilBERT was then trained, using the same corpus of training data, to produce the same outputs as this BERT model. This method is also called teacher-student knowledge distillation, as the BERT model functions as a teacher, whose behaviour the DistilBERT model, the student, tries to replicate. The resulting pre-trained DistilBERT model is only 40% the size of the original BERT and, according to the authors, 60% faster. Through experiments, Sanh et al. showed that despite being so much smaller, DistilBERT retains 97% of BERT's language understanding capabilities, as measured by its performance on various language understanding tasks. DistilBERT is openly available in the Transformers library from HuggingFace (https://github.com/huggingface/transformers).

2.5.3.4 XLNet
XLNet takes a slightly different approach to language modeling than BERT and its variants. It is an autoregressive model for capturing bidirectional dependencies, which the authors developed to solve some perceived problems of BERT. Firstly, the method of masking words and then predicting them corrupts the input, creating texts filled with [MASK] tokens that are not seen in regular texts. This causes a discrepancy between pre-training and fine-tuning, since the latter does not contain any [MASK] tokens. Secondly, BERT assumes that masked tokens are independent, but for a sentence containing multiple masked tokens, these may in fact be dependent on each other. BERT also has a fixed-length context of 512 tokens, while XLNet builds on the Transformer-XL architecture to be more suitable for longer texts.

Despite these problems, BERT is still very good at capturing bidirectional context, which is what led to its gains over previous models. XLNet captures bidirectional dependencies slightly differently, by utilizing permutation language modeling, which works by making predictions in a random order. Given a sentence of 5 words, for instance, the model could be asked to predict the words in the random order [word5, word1, word2, word4, word3]. This allows the model to learn bidirectional dependencies. For example, when predicting word2, the context will contain words that, in the original sentence, occur both earlier (word1) and later (word5) than it. XLNet takes this approach with all possible permutations of the factorization order. It does so by utilizing masking inside the Transformer, so as to not change the order of the actual input, as this would create unrealistic text combinations and thus discrepancies between pre-training and fine-tuning.

XLNet Architecture
XLNet is a Transformer-XL based architecture with some modifications. When pre-training a Transformer-based model, the embedding for the token being predicted is masked, including its positional embedding. This, however, is potentially useful information: when predicting a token, only the position should be known, not the content.
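To make the permutation idea more concrete before describing XLNet's architectural solution, the following toy sketch (not XLNet code, only an illustration written for this explanation) shows how one random factorisation order determines which positions are visible as context when each token is predicted.

```python
import random

def permutation_contexts(seq_len, seed=0):
    """For one random factorisation order, return which positions are visible
    as context when each position is predicted (toy illustration only)."""
    order = list(range(1, seq_len + 1))
    random.Random(seed).shuffle(order)        # one possible order, e.g. [5, 1, 2, 4, 3]
    visible = {}
    for step, pos in enumerate(order):
        visible[pos] = sorted(order[:step])   # positions that were predicted earlier
    return order, visible

order, visible = permutation_contexts(5)
print("factorisation order:", order)
for pos in sorted(visible):
    print(f"when predicting word{pos}, the model may attend to {visible[pos]}")
```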
The solution is a two-stream self-attention architecture consisting of a content stream and a query stream. The content stream is a standard self-attention model without masking, which allows access to the full context and token content. The query stream has access only to the context of the previous steps and the current token's position. The query stream is only used for pre-training and can be dropped during the fine-tuning process, turning the model into a normal Transformer-based model.

To handle multiple segments, XLNet utilizes a relative positional encoding scheme similar to the one proposed by Transformer-XL (see Section 2.5.2.9). Segments are encoded relative to each other: the encoding only indicates whether or not any two words belong to the same segment. This means that it does not encode which specific segment a word belongs to or where exactly its position in that segment is, only whether or not two words come from different segments. This has the additional benefit of allowing more than two segments to be encoded, which BERT does not allow.

2.5.4 Task Specific Models
In this section we introduce two task-specific uses of pre-trained models: BERTSum (Section 2.5.4.1) for the task of summarisation and SBERT (Section 2.5.4.2) for producing sentence embeddings.

Figure 2.10: BERTSum architecture. Reproduced from [2] with the authors' permission.

2.5.4.1 BERTSum
BERTSum [2] is a variant of BERT fine-tuned for extractive single-document summarisation. It is publicly available at https://github.com/nlpyang/PreSumm. For the purpose of extractive summarisation, two problems need to be overcome:

1. Each sentence needs to be labeled as either a summary sentence or a non-summary sentence. By default, however, BERT outputs token representations, not sentence representations, and no classifications either.

2. BERT accepts inputs of either a single sentence or a pair of sentences. For summarisation purposes, however, the model should be able to process documents containing multiple sentences.

In order to overcome these problems, the author of [2] modified both the input sequence and the embedding slightly, extending them to sequences of multiple sentences:

1. In the input sequence, each sentence is preceded by a [CLS] token and succeeded by a [SEP] token.

2. In the original BERT, two sentences A and B are distinguished by the segment embedding: each token of sentence A is embedded using EA and each token of sentence B using EB. For BERTSum, this is extended so that the tokens of the i-th sentence are embedded using EA if i is odd and EB if i is even.

Thus, the output of the top BERT layer for each [CLS] token is treated as the sentence representation of the sentence following that token. The architecture of the BERTSum model, and especially the input embedding, is shown in Figure 2.10.

Having obtained sentence representations for multiple sentences, there are several ways the author suggests to fine-tune BERT for extractive summarisation:

1. Adding a single sigmoid classification layer on top of the BERT outputs.

2. Adding more Transformer layers on top of the BERT outputs (an inter-sentence Transformer), followed by a sigmoid classification layer.

3. Adding an LSTM on top of the BERT outputs, followed by a sigmoid classification layer.

Liu's experiments in [2] showed that the option of adding a two-layer Transformer and a single sigmoid classification layer on top of BERT produced the best results.
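As a rough sketch (not the authors' preprocessing code; the helper function, sentences and model name are chosen here purely for illustration), the BERTSum-style multi-sentence input could be assembled with the HuggingFace tokenizer roughly as follows:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def bertsum_style_input(sentences):
    """Prepend [CLS] and append [SEP] to every sentence and alternate the
    segment id per sentence, in the spirit of BERTSum (sketch only)."""
    input_ids, segment_ids, cls_positions = [], [], []
    for i, sentence in enumerate(sentences):
        tokens = ["[CLS]"] + tokenizer.tokenize(sentence) + ["[SEP]"]
        cls_positions.append(len(input_ids))        # index of this sentence's [CLS] token
        input_ids += tokenizer.convert_tokens_to_ids(tokens)
        segment_ids += [i % 2] * len(tokens)        # alternate EA / EB per sentence
    return input_ids, segment_ids, cls_positions

ids, segments, cls_positions = bertsum_style_input(
    ["The first sentence.", "Another sentence.", "A third sentence."])
# The model's output vectors at cls_positions serve as the sentence representations.
print(cls_positions)
```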
Like BERT, BERTSum has an input limit of 512 tokens.

2.5.4.2 Sentence-BERT (SBERT)
Reimers et al. [39] propose a modified BERT model for semantic textual similarity (STS) tasks. The default BERT implementation can be used for STS tasks by feeding it sentence pairs, but this is computationally expensive, as every possible pair of sentences would need to be compared. The authors give the example that finding the most similar pair in a collection of 10,000 sentences using conventional BERT requires n(n−1)/2 = 49,995,000 operations, which would take roughly 65 hours.

A common solution to these types of problems is to map the inputs to some vector space, which can then be compared via, for example, clustering. Sentence embeddings can also be produced using BERT. This is typically done by feeding the model a sentence and either averaging the output layer or using the [CLS] token. However, without additional fine-tuning these embeddings are not very useful for semantic textual similarity tasks.

In order to mitigate these problems, the authors developed a modification of the BERT model, which they called Sentence-BERT or SBERT. SBERT is fine-tuned to produce semantically meaningful fixed-length sentence embeddings (i.e., so that semantically similar sentences are close together in the vector space) which can be easily compared with the cosine similarity score (see Section 2.2.2.3). This makes finding the most similar pair of sentences much quicker. Using the same example as before, finding the most similar pair of sentences in a collection of 10,000 sentences takes, according to the authors, only a few seconds with SBERT, as opposed to 65 hours with BERT.

The authors did not find that using RoBERTa instead of BERT resulted in any improvements for their purposes. They also found that XLNet performed even worse than BERT on STS tasks, which is why they used BERT as the basis of their work. The authors also tried alternative similarity measures, like the Manhattan and negative Euclidean measures, but found that they had no advantages compared to cosine similarity.

Reimers et al. developed several possible architectures for SBERT, depending on the kind of training data available. SBERT can be built with a siamese or triplet network structure [40], which means that the same weights are used to process two or three input sentences at the same time. SBERT also adds a pooling operation to the output of BERT to derive fixed-size sentence embeddings from it. The authors tried different pooling strategies, but found that taking the mean of all the output vectors produced the best results. There are different objective functions available, depending on the task to be trained for:

1. Classification Objective Function. The sentence embeddings are concatenated with their element-wise difference, and a softmax function (which normalises a k-dimensional vector into a probability distribution of k probabilities, proportional to the original inputs and summing to 1) is applied to obtain the classification label. This structure is depicted in Figure 2.11.

2. Regression Objective Function. The sentence embeddings are used to compute the cosine similarity score. This structure is depicted in Figure 2.12.

3. Triplet Objective Function. The network is fed three sentences, one of which is the so-called anchor sentence, while the other two are the so-called positive and negative sentences. The training objective is to make sure that the distance between the positive sentence and the anchor sentence is always smaller than the distance between the negative sentence and the anchor sentence.
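As a rough usage sketch (see Figures 2.11 and 2.12 below for the training architectures), the authors' sentence-transformers library can be used to find the most similar pair described above; the model name here is one example of a published SBERT model, and the sentences are arbitrary.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bert-base-nli-mean-tokens")   # example pre-trained SBERT model

sentences = ["A man is eating food.",
             "Someone is eating a meal.",
             "The weather is cold today."]
embeddings = model.encode(sentences)        # one fixed-size vector per sentence

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare all n(n-1)/2 pairs of embeddings -- cheap, since each sentence is
# encoded only once instead of once per pair as with plain BERT.
pairs = [(i, j, cosine(embeddings[i], embeddings[j]))
         for i in range(len(sentences)) for j in range(i + 1, len(sentences))]
i, j, score = max(pairs, key=lambda p: p[2])
print(sentences[i], "<->", sentences[j], score)
```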
Figure 2.11: The SBERT architecture for a classification objective function.

Figure 2.12: The SBERT architecture for a regression objective function. This can be used to compute similarity scores.

SBERT raised the state of the art for sentence embeddings and several STS tasks.

3 Methods
In this chapter the methods used to achieve the goals of the thesis are described. Section 3.1 gives an overview of the changes that this project went through during the course of its conception and execution. In Section 3.2 we describe the datasets we used during training and evaluation, their properties and how we obtained our training data from them. In Section 3.3 we describe some of the challenges of using BERTSum for summary generation and how we solved them, and in Section 3.3.2 our implementation and training related details. Section 3.4 briefly describes the hardware used for training. Section 3.5 describes our evaluation metrics. Finally, we describe the experiments we performed using different pre-trained models in Section 3.6.

3.1 Changes in the direction of the project
The initial plan for this thesis project was to use a dataset of academic papers to fine-tune a BERTSum model for summarisation. But it became clear early on that creating a large enough dataset would not be possible within the scope of the project. We identified a few bottlenecks for such a project, which will be described in the section on future work (Section 5.4). Instead, we decided to investigate how well a BERTSum model fine-tuned on the widely used CNN/DM news dataset would transfer and perform on our academic papers dataset. We will also investigate how different pre-trained models compare against each other. In the next section we will go into more detail on the datasets we used and how we obtained them.

3.2 Datasets
In the following, we will describe the datasets we used for our experiments: the CNN/DM dataset, which we used to fine-tune the models, and the "Academic Papers" dataset, which is a very small dataset we created ourselves for the purpose of evaluation. A dataset for training on the extractive summarisation task requires a text (a sequence of sentences) and reference labels. For sequence labeling, each sentence needs to be labeled as 1 (summary) or 0 (non-summary); this means that the summary length must be known before label selection. For score-and-select, the label for each sentence is a score indicating its importance to the summary, which does not require the summary length to be known in advance.

3.2.1 CNN/DM
Few datasets exist for the task of text summarisation, especially extractive text summarisation. The CNN and DM datasets (both available for download at https://cs.nyu.edu/~kcho/DMQA/, which is where we obtained our data; last accessed 14.02.2020), which contain news articles gathered from CNN and the Daily Mail, each accompanied by a short abstractive summary, are commonly used for training and evaluating models for summarisation tasks. This dataset exists in an anonymised and a non-anonymised version; the anonymised version replaces identifying entities with non-identifiable placeholders. For better comparison with previous works, which largely used the non-anonymised version, we chose that one for our experiments, too. The dataset was split for training, validation and testing as suggested by Hermann et al. in [41]; statistics can be seen in Table 3.1.

Table 3.1: Statistics for the CNN/DM dataset

                               Train      Validation   Test
No. of samples                 287,083    13,367       11,489
Avg. sent. length              35.56      32.24        32.62
Avg. number of tokens          927.96     910.17       921.96
Summary avg. sent. length      3.73       4.11         3.88
3.2.1.1 Label Generation
The summaries included in the CNN/DM dataset are so-called "highlights": a few sentences for each news article, which aim to summarise it. These summaries are abstractive and can therefore not be used directly for training an extractive summariser. Instead, labels were generated using these abstractive summaries as guidance. We generated three sets of label data.

1. Binary Labels: As in the BERTSum paper [2], we generated binary labels for the sequence-labeling problem definition. Up to three sentences (the pre-determined summary length) are selected from each news article by maximising the ROUGE scores against the abstractive summary. For this purpose, Liu [2] proposes two different algorithms. The first is a greedy algorithm, which is fast but does not consider all combinations of sentences. The second algorithm does consider all combinations, but is slower. Liu opted for the faster but less exact algorithm, and so did we, because the differences between the two in terms of score are insignificant.

In addition to this, we propose two additional label selection schemes for the score-and-select problem definition:

2. Score Labels: We assigned each sentence a score, based on its ROUGE score against the abstractive summary. We hope that this method will allow models to generalise better for a wider range of summary lengths.

3. Sentence BERT Score (SBS) Labels: Similar to the previous score labels, but instead each sentence is assigned a score based on the cosine similarity between its sentence embedding and that of the summary, as produced by SBERT (see Section 2.5.4.2).

3.2.2 Academic Paper Dataset
This dataset consists of a small number of papers on driving styles in the domain of traffic safety. The dataset was too small to perform meaningful fine-tuning on, but we did use it to evaluate our models' ability to transfer from the news data they were fine-tuned on to this different type of text. The papers were provided to us in PDF form, which was a limiting factor for the number of documents we were able to include in the dataset, because of the additional work required to extract and pre-process the texts. To obtain labeled data, we created extractive reference summaries from scratch. In the following sections, we will describe how we extracted the text from the PDFs, how we pre-processed these texts and how we obtained the reference summaries.

3.2.2.1 Text extraction
For our experiments, we collected 31 PDF documents. These are scientific publications on the topic of traffic safety, between 5 and 33 pages in length, with an average of 14 pages. The first step in using these documents was to extract the text from each of the PDFs. Since developing a dedicated tool for text extraction from PDF was out of scope for this thesis project, we utilised an existing one, pdftotext (http://www.xpdfreader.com/pdftotext-man.html) from the open source toolset Xpdf. We configured the pdftotext tool to cut out all images from the PDFs and then had it produce a TXT document for each original PDF document containing only its extracted text, which we then cleaned up manually. This manual clean-up was necessary, because even the best text extraction tool we were able to find had various issues, which will be described in the next section.
3.2.2.2 Pre-processing
The automatically extracted TXT files, while giving us a good base to work from, had various problems, which would have limited the quality of potential summaries:

1. Tables, headers, footers and page numbers were not recognised as not being part of the text. Therefore, they appeared in the extracted TXT files, breaking up the actual text in unfortunate ways. Often, these artifacts would be inserted mid-sentence.

2. Occasionally letters, words or phrases would be printed repeatedly. In rare cases, even nonsensical streams of letters were produced for no apparent reason.

3. Sometimes the original texts would contain periods in headlines or mid-sentence, which pdftotext would preserve, impeding our ability to automatically recognise the beginning and end of sentences.

Because of the relatively small number of documents, we came to the conclusion that trying to develop a piece of software to solve these problems automatically would take more time than simply cleaning up the TXT files by hand. The following manual changes were made to the automatically extracted texts:

1. Headers, footers, page numbers, tables etc. were removed from the texts wherever we found them.

2. Corrupted words (such as "trraaaafffffiiiiiiic", "tra?c", "traf-fic") were corrected ("traffic"), and where whole sentences were corrupted, we manually copied over the correct text from the source PDF.

3. Periods that were used in any other way than to end a sentence were removed. Periods were added after headlines so as to prevent them from being interpreted as the beginning of the following sentence.

4. Anything before the "Introduction" section and after the "Conclusion" was removed from the document. In particular, we removed the abstracts from the text, as we intended to use them during evaluation.

5. Formulas, which pdftotext was rarely able to extract in a readable format, were either cleaned up or deleted from the text, where we deemed their exact content irrelevant to the surrounding text.

6. The headlines were translated into a machine-readable format in the following fashion: "1. Headline" -> "# Headline.", "1.1. Second headline" -> "## Second headline.", etc. We thought that the information on sections and subsections of the text might be interesting to preserve, though we ended up not using it.

3.2.2.3 Obtaining reference summaries
Next, we had to create reference summaries for each of the documents, to be used for evaluation. This, too, was done manually. Due to time constraints, we only did this for ten of the documents. The summaries were created from the cleaned-up TXT files in the following way: First, we read carefully through the entire document to familiarise ourselves with its content. We then went through the document again from top to bottom and removed all sentences that did not seem essential to convey the most important information of the document. We repeated this step until we felt that no more sentences could be removed without leaving out important information. We did this individually, each of us producing a separate summary for each of the ten documents. We thus ended up with two reference summaries per document. This was done to have a measure of how much we can expect two summaries of the same text to differ from one another, even under the assumption that both of them are optimal. Knowing this will help us better interpret the quality of our generated summaries.
All the summaries except one were roughly 10-20% of the length of the original documents, measured in word count. (The single outlier was almost 50% of the document length, but with 1,676 words in total, it was a short document to begin with.)

3.3 Summary Generation
In this section we describe the problems we had to overcome to be able to generate summaries for the Academic Paper dataset, and how we implemented the summarisation models.

3.3.1 Addressing BERTSum's Token Limit
As mentioned in Section 2.5.4.1, the BERTSum model has a limit of 512 input tokens. For short news data like the CNN/DM dataset, texts are usually truncated to fit this limitation. This, however, affects summary generation, as it only allows the model to select sentences contained within these first 512 tokens. For the Academic Paper dataset, which consists of texts much longer than this limit, this is unreasonable, as it would disregard the majority of each text. A possible solution, suggested by the BERTSum authors [2], is to simply extend the token limit of the model; but as the pre-trained models it uses to generate token embeddings have the same limit, the additional positions would not benefit from pre-training. Additionally, the academic papers are of such a length that the token limit would have to be increased 20-fold to fit some of the texts, and we would run into computational and memory limitations. Another solution was needed.

As suggested by Al-Rfou et al. in [31], we instead split the input texts into multiple blocks below the limit and feed each block to the model individually, later combining the outputs. This allows the model to generate summaries for longer texts. This method, however, is not without problems: since no information is shared between the blocks, contextual information is lost, which will have an effect on the generated summaries. Positions will also be repeated within each block, and thus any potential positional bias will be repeated as well. To allow summary generation for texts of any length, the following steps were taken:

1. Input texts are split into blocks of a maximum length of 512 tokens, with a maximum sentence length of 200 tokens. Longer sentences are truncated to avoid single-sentence blocks.

2. Blocks are created sentence by sentence: if adding another sentence would exceed the limit, a new block is created. Thus, sentence integrity is maintained.

3. Each block is put through the model separately, which outputs a score for each sentence in the block.

4. The scores of all blocks are combined and the top-scoring sentences are selected to form the summary.

3.3.2 Implementation
We used the BERTSum model implementation by Liu (https://github.com/nlpyang/BertSum), as presented in [2], as the base for our implementations. This BERTSum implementation is built on top of "Open Source Neural Machine Translation in PyTorch" (https://opennmt.net/), an open source framework for sequence models. To better facilitate our goal of using multiple pre-trained models, we made some changes to the original implementation, which will be described in the next sections. We used the Transformers library, maintained by Hugging Face (https://github.com/huggingface/transformers), which contains standardised PyTorch implementations of many of the newest Transformer-based models. This is the latest version of the commonly used pytorch-transformers library (which BERTSum utilises). Our own implementation is based on their examples. In the following sections we will explain some key alterations we made.
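Before detailing these alterations, the block-splitting procedure from Section 3.3.1 can be sketched roughly as follows. This is a simplified illustration, not our exact code: the tokenise and score_block callables stand in for the model-specific tokeniser and the fine-tuned scoring model, and the toy usage at the end only demonstrates the mechanics.

```python
def split_into_blocks(sentences, tokenise, block_limit=512, sentence_limit=200):
    """Split a list of sentences into blocks of at most block_limit tokens,
    keeping sentences intact and truncating overlong sentences (sketch only)."""
    blocks, current, current_len = [], [], 0
    for sentence in sentences:
        tokens = tokenise(sentence)[:sentence_limit]   # truncate very long sentences
        if current and current_len + len(tokens) > block_limit:
            blocks.append(current)                     # start a new block
            current, current_len = [], 0
        current.append(tokens)
        current_len += len(tokens)
    if current:
        blocks.append(current)
    return blocks

def summarise(sentences, tokenise, score_block, summary_size=3):
    """Score each block separately, combine the scores and pick the top sentences."""
    scores = []
    for block in split_into_blocks(sentences, tokenise):
        scores.extend(score_block(block))              # one score per sentence in the block
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:summary_size]
    return [sentences[i] for i in sorted(top)]         # keep original sentence order

# toy usage with a whitespace tokeniser and a dummy scorer based on sentence length
demo = ["Short sentence.", "A somewhat longer sentence about the topic.", "Noise."]
print(summarise(demo, tokenise=str.split,
                score_block=lambda block: [len(tokens) for tokens in block],
                summary_size=1))
```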
BERTSum model alterations: The BERTSum model, which originally builds on just BERT, was altered to support multiple pre-trained models for generating sentence embeddings. Since some of the pre-trained models we intend to use, like RoBERTa, do not use segment embeddings for pre-training, and since the BERTSum paper [2] showed only small differences, we also removed the segment embedding from our altered BERTSum model. Support for the XLNet-specific feature of context sharing between blocks was implemented.

Data-loading: We utilise a different data-loading process than BERTSum, which uses a dataset already containing pre-computed word token representations specifically for BERT. As some of the other pre-trained models use different tokens, this method would require us to create a new dataset for each model, containing such pre-computed tokens. Instead, we used a PyTorch-style data loader that performs model-specific tokenisation during data-loading. This causes some computational overhead, which can be mitigated by utilising several worker processes.

Hyper-Parameters: Trying to find the optimal hyper-parameters was out of scope for this project. Instead we refer to previous work and use the hyper-parameters suggested by the authors of each pre-trained model's paper. The authors of the original paper on BERT [19] suggest that hyper-parameter tuning is of less importance when the fine-tuning dataset is large (as is the case for the CNN/DM dataset), which also matched our initial findings when trying different parameters. Therefore, we simply stick to the suggested hyper-parameters for each model.

Optimisers and Schedulers: In addition to the BERTSum optimiser/scheduler we also implemented support for the PyTorch implementation of AdamW with linear schedule decay. We used AdamW as the optimiser and scheduler for our experiments to better match how the models were originally pre-trained.

Loss Functions: BERTSum uses the summed binary cross entropy as its loss function, which is suitable for the binary classification task of sequence labeling. To also support the score-and-select training objective, we implemented support for a mean squared error (MSE) loss function as well.

Selection Layer: The BERTSum paper [2] explores several different selection layers and comes to the conclusion that a Transformer layer produces the best results. This is therefore what we used as well. The authors also used a tri-gram blocking scheme, which blocks the addition of a sentence to the summary if it contains a tri-gram that overlaps with the summary so far. This ensures a more diverse text and led to improved scores for the authors.

Checkpoint Averaging: During training, checkpoints of the model are saved at regular intervals. Checkpoint averaging is a method where a number of checkpoints are combined and averaged into a single, supposedly more robust, model. The authors of the BERTSum paper used multiple checkpoints of the model saved during training and combined the weights of the top 3 performing checkpoints (on the validation set) into the final model. We also employed this method.
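A minimal sketch of the checkpoint-averaging step is shown below. It assumes the checkpoints are saved as plain PyTorch state dicts; the file names are placeholders, and the real implementation follows the BERTSum code rather than this sketch.

```python
import torch

def average_checkpoints(paths):
    """Average the parameter tensors of several saved checkpoints into one
    state dict (sketch of the checkpoint-averaging idea)."""
    average = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if average is None:
            average = {name: tensor.clone().float() for name, tensor in state.items()}
        else:
            for name, tensor in state.items():
                average[name] += tensor.float()
    return {name: tensor / len(paths) for name, tensor in average.items()}

# e.g. the three best-performing checkpoints on the validation set (placeholder names):
# averaged = average_checkpoints(["ckpt_1.pt", "ckpt_2.pt", "ckpt_3.pt"])
# model.load_state_dict(averaged)
```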
3.4 Hardware
Most of the training was done on the high-performance clusters of C3SE (Chalmers Centre for Computational Science and Engineering, https://www.c3se.chalmers.se/), the centre for scientific and technical computing at Chalmers University of Technology in Gothenburg, Sweden. C3SE is part of the Swedish National Infrastructure for Computing, SNIC (https://www.snic.se/). The training was performed on the GPU nodes of the Vera cluster, which are outfitted with Tesla V100 32GB GPUs. These GPUs support the half-precision float format (FP16, 16-bit floats), which allows for mixed-precision training. Utilising this can lead to significantly faster training. Mixed-precision training (https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) uses FP16 for operations, while important network information is stored in single precision (FP32). This reduces memory requirements and allows for larger models and batches. FP16 structures are also faster to access and transfer than FP32. The loss of precision, which can lead to small numbers being interpreted as 0, is combated by a technique called loss scaling, which helps preserve small gradients.

3.5 Evaluation
In this section we describe the metrics used to evaluate the models' performance.

3.5.1 ROUGE Evaluation
The three commonly used ROUGE metrics for evaluating summaries, ROUGE-1, ROUGE-2 and ROUGE-L, are used for evaluation. For each of these we use the F1 score, which combines precision and recall. For a more detailed description of these metrics, see Section 2.2. Several suites exist for performing ROUGE evaluation. We opted for a Python implementation of ROUGE (https://github.com/pltrdy/ROUGE). This implementation produces slightly different scores compared to the implementation used in the BERTSum paper. This hurts comparison against previous works, but it is faster, and as our goal required evaluating many models against each other, we deemed it the better option.

3.5.2 Sentence Similarity Evaluation
We also explored another evaluation metric based on sentence similarity. We hoped that this metric would be more accurate when, for example, comparing against abstractive summaries, since it does not require exact word matches as ROUGE does. We use SBERT (see Section 2.5.4.2) to produce an embedding for each sentence. These are then averaged into a single combined document embedding. This is a naive approach based on the same methodology that SBERT itself employs, in taking the mean of word embeddings to create the sentence embedding. The score for each summary is based on the cosine distance between the document embeddings of the reference summary and the generated candidate summary. To bring it into a similar range and thus make it more comparable with the other metrics, we take (1 − dist_cosine) · 100 as the final score.
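A rough sketch of this metric, again using the sentence-transformers library (the model name is an example of a pre-trained SBERT model, not necessarily the exact one used in our experiments):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bert-base-nli-mean-tokens")   # example SBERT model

def document_embedding(sentences):
    """Average the SBERT sentence embeddings into one document embedding."""
    return np.mean(model.encode(sentences), axis=0)

def similarity_score(reference_sentences, candidate_sentences):
    """(1 - cosine distance) * 100 between the two document embeddings,
    which is simply the cosine similarity scaled to roughly 0-100."""
    a = document_embedding(reference_sentences)
    b = document_embedding(candidate_sentences)
    cosine_similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return 100.0 * cosine_similarity
```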
3.5.3 Evaluation on the CNN/DM dataset
The held-out test data of the CNN/DM dataset (see Section 3.2) was used to evaluate the models' performance. The fine-tuned models were used to generate three-sentence summaries, and these were evaluated against the abstractive summaries included in the CNN/DM dataset, using the methods outlined above. Having no explicitly extractive reference summaries likely limits the ROUGE score that can be reached, as there might not always be exact word matches between the provided abstractive summary and the generated extractive one. The sentence similarity metric, however, should not be as dependent on word matches. Evaluation was performed on both a truncated and a full version of the CNN/DM dataset. For the truncated version, the BERTSum token limit is enforced by simply truncating the texts. This will bias the results, as the model can only select from sentences that appear before this limit is reached. The full-dataset evaluation used the block splitting method introduced in Section 3.3.1 to allow the model to select from the full range of sentences.

3.5.4 Evaluation on the Academic Paper dataset
For this dataset the evaluation of the generated summaries was performed against the manually created reference summaries. Since these reference summaries are extractive, we also measured sentence overlap, in addition to the above evaluation methods. Sentence overlap is measured as follows: let Sc be the set of sentences in the candidate summary and Sr the set of sentences in the reference summary. Then

Score = |Sc ∩ Sr| / |Sc ∪ Sr|

3.5.5 Human Evaluation
Extensive human evaluation was not performed, due to constraints in time and resources. We did, however, want to have some measure of how well the different models performed by human standards. We obtained this in the following way: a random text was selected from the Academic Paper dataset, for which we evaluated and ranked the generated summaries of all models. Summaries were ranked on relevant sentence selection, cohesion and readability. More formally, the ranking was performed as follows: The origin of the generated summaries was obscured so as to not influence our rankings. We assigned each sentence a score of 1 if it was a "good" sentence (according to our subjective judgement); a sequence of good sentences was assigned an increasing score to capture a notion of cohesion. Neutral sentences were assigned a score of 0, and bad sentences a score of -1. Finally, we summed the sentence scores to give a final score for the summary. When performing the final rankings, tie breakers or close scores were resolved by subjectively assessing the whole summary on the above criteria.
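The sentence-overlap score from Section 3.5.4, read as the ratio of the sizes of the intersection and the union of the two sentence sets (a Jaccard-style score), can be sketched as follows; the example sentences are arbitrary.

```python
def sentence_overlap(candidate_sentences, reference_sentences):
    """Sentence overlap between an extractive candidate summary and an
    extractive reference summary: |Sc ∩ Sr| / |Sc ∪ Sr| (sketch)."""
    candidate, reference = set(candidate_sentences), set(reference_sentences)
    return len(candidate & reference) / len(candidate | reference)

print(sentence_overlap(["A.", "B.", "C."], ["B.", "C.", "D."]))   # 2 shared / 4 total = 0.5
```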
3.6 Experiments
We performed several experiments to evaluate and compare the pre-trained models against each other on the task of extractive text summarisation. The pre-trained models we decided to investigate are:

1. BERT
2. DistilBERT
3. RoBERTa
4. XLNet

Most of the models are also available in larger versions with a deeper network; we opted to only train the smaller "base" versions of the models. Previous work has shown gains for the large models, but for our purpose of comparing several models we decided that the base versions were a better choice because of the lower resource and training time requirements.

Experiment 1: BERTSum Reference
For the first experiment, no fine-tuning was performed. We only evaluated the BERTSum model published by Liu and Lapata [2] as is on the Academic Paper dataset, to obtain a baseline. Their model was trained for 50,000 iterations using 3 Nvidia 1080 Ti GPUs with a gradient accumulation of 2, resulting in a combined batch size of approximately 36. Training for 50,000 iterations with this batch size corresponded to approximately 6 epochs.

Experiments 2-7
For these experiments, we used some different parameters than for Experiment 1. The main differences concern the warm-up and weight decay: we opted to run for 4 epochs, using 10% of the total training steps as warm-up steps and a linear learning rate decay, motivated by the suggestions in each pre-trained model's paper, the time to train and resource availability. We decided to use the same batch size as BERTSum, 36. With the available hardware and using mixed-precision training, we were able to fit the entire batch onto one GPU without gradient accumulation for all models except one: when fine-tuning the XLNet model we could not fit an entire batch onto a single GPU, so it was instead trained using two GPUs.

For these experiments, the pre-trained models were fine-tuned as sequence-labeling models using the binary label data for the CNN/DM dataset, as described in Section 3.2.1. We evaluated their performance on the held-out portion of the dataset. Additionally, we evaluated their performance on our Academic Paper dataset, to measure how well the models would transfer to the new task. The models used in our experiments were the following:

• Experiment 2: The BERT base pre-trained model, which has 12 encoding layers, 12 attention heads and a total of 110M parameters. This model will in the following be referred to as "BERT".

• Experiment 3: The RoBERTa base model, which has 12 encoding layers, 12 attention heads and 125M parameters. This model will in the following be referred to as "RoBERTa".

• Experiment 4: The DistilBERT base model, which has 6 encoding layers, 12 attention heads and a total of 66M parameters. This model will in the following be referred to as "DistilBERT".

• Experiment 5: The XLNet base model, which has 12 encoding layers, 12 attention heads and a total of 110M parameters. This model will in the following be referred to as "XLNet".