Using pre-trained language models for extractive text summarisation of academic papers
Master's thesis in Computer Science and Engineering

ERIK HERMANSSON
CHARLOTTE BODDIEN

Department of Mechanics and Maritime Sciences
Division of Vehicle Safety
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2020

© ERIK HERMANSSON, CHARLOTTE BODDIEN, 2020.
Supervisor: Selpi, Department of Mechanics and Maritime Sciences
Examiner: Selpi, Department of Mechanics and Maritime Sciences
Master's Thesis 2020:01
Department of Mechanics and Maritime Sciences
Division of Vehicle Safety
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000
Printed by Department of Mechanics and Maritime Sciences
Gothenburg, Sweden 2020

Abstract
Given the overwhelming amount of textual information on the internet and elsewhere, the automatic creation of high-quality summaries is becoming increasingly important. With the development of neural networks and pre-trained models within natural language processing in recent years, such as BERT and its derivatives, the field of automatic text summarisation has seen a lot of progress. These models have been pre-trained on large amounts of data for general knowledge and are then fine-tuned for specific tasks. Datasets are a limiting factor for training summarisation models, as they require a large number of manually created summaries. Most current summarisation models have been trained and evaluated on textual data from the news domain. However, pre-trained models fine-tuned on data from the news domain could potentially generalise and perform well on other data as well. The main objective of this thesis is to investigate the suitability of several pre-trained language models for automatic text summarisation. The chosen models were fine-tuned on readily available news data, and evaluated on a very different dataset of academic texts to determine their ability to generalise. There were only slight differences between the models on the news data. More interestingly, the results on the academic texts showed significant differences between the models. The results indicate that the more robustly pre-trained models are able to generalise better and, according to the metrics, perform quite well. However, human evaluation puts this into question, showing that even the high-scoring summaries did not necessarily read well. This highlights the need for better evaluation methods and metrics.

Keywords: natural language processing, nlp, machine learning, deep learning, automatic text summarisation, extractive summarisation, transformer, bert, roberta, xlnet

Acknowledgements
We would like to thank our supervisor Selpi for her ongoing support and guidance during the project.
For providing computational resources enabling the work in this thesis and support we would like to thank Chalmers Centre for Computational Science and Engineer- ing (C3SE) provided by the Swedish National Infrastructure for Computing (SNIC). We thank SAFER for providing access to their facilities. Erik Hermansson, Charlotte Boddien, Gothenburg, February 2020 vii Contents List of Figures xiii List of Tables xv 1 Introduction 1 1.1 Objective and Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 2 Theory 3 2.1 Automatic Text Summarisation . . . . . . . . . . . . . . . . . . . . . 3 2.1.1 Extractive Summarisation Methods . . . . . . . . . . . . . . . 3 2.1.1.1 Score and Select . . . . . . . . . . . . . . . . . . . . 3 2.1.1.2 Sequence Labeling . . . . . . . . . . . . . . . . . . . 4 2.2 Summarisation Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 4 2.2.1 ROUGE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2.2.2 Other evaluation metrics . . . . . . . . . . . . . . . . . . . . . 6 2.2.2.1 BLEU . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2.2.2 Precision, Recall and F-Score . . . . . . . . . . . . . 6 2.2.2.3 Cosine Similarity . . . . . . . . . . . . . . . . . . . . 6 2.3 Text Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3.1 Word Embeddings . . . . . . . . . . . . . . . . . . . . . . . . 7 2.3.2 Sentence Embedding . . . . . . . . . . . . . . . . . . . . . . . 8 2.3.3 Document Embedding . . . . . . . . . . . . . . . . . . . . . . 8 2.4 Language Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.5 Machine Learning Models for NLP . . . . . . . . . . . . . . . . . . . 8 2.5.1 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . 9 2.5.2 Sequential Models . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.5.2.1 Recurrent Neural Network (RNN) . . . . . . . . . . . 11 2.5.2.2 Stacked RNN . . . . . . . . . . . . . . . . . . . . . . 11 2.5.2.3 Bidirectional RNN . . . . . . . . . . . . . . . . . . . 12 2.5.2.4 Simple RNN . . . . . . . . . . . . . . . . . . . . . . 12 2.5.2.5 Long Short Term Memory (LSTM) . . . . . . . . . . 13 2.5.2.6 RNN Modes . . . . . . . . . . . . . . . . . . . . . . . 13 2.5.2.7 Attention . . . . . . . . . . . . . . . . . . . . . . . . 14 2.5.2.8 Transformer . . . . . . . . . . . . . . . . . . . . . . . 15 2.5.2.9 Transformer-XL . . . . . . . . . . . . . . . . . . . . 17 ix Contents 2.5.3 Pre-Trained Language Models . . . . . . . . . . . . . . . . . . 18 2.5.3.1 BERT . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.5.3.2 RoBERTa . . . . . . . . . . . . . . . . . . . . . . . . 20 2.5.3.3 DistilBert . . . . . . . . . . . . . . . . . . . . . . . . 21 2.5.3.4 XLNet . . . . . . . . . . . . . . . . . . . . . . . . . . 21 2.5.4 Task Specific Models . . . . . . . . . . . . . . . . . . . . . . . 22 2.5.4.1 BERTSum . . . . . . . . . . . . . . . . . . . . . . . 23 2.5.4.2 Sentence-BERT (SBERT) . . . . . . . . . . . . . . . 24 3 Methods 27 3.1 Changes in the direction of the project . . . . . . . . . . . . . . . . . 27 3.2 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.2.1 CNN/DM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.2.1.1 Label Generation . . . . . . . . . . . . . . . . . . . . 28 3.2.2 Academic Paper Dataset . . . . . . . . . . . . . . . . . . . . . 29 3.2.2.1 Text extraction . . . . . . . . . . . . . . . . . . . . . 
29 3.2.2.2 Pre-processing . . . . . . . . . . . . . . . . . . . . . 29 3.2.2.3 Obtaining reference summaries . . . . . . . . . . . . 30 3.3 Summary Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 3.3.1 Adressing BERTSum’s Token Limit . . . . . . . . . . . . . . . 31 3.3.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 32 3.4 Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 3.5.1 ROUGE Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 34 3.5.2 Sentence Similarity Evaluation . . . . . . . . . . . . . . . . . . 34 3.5.3 Evaluation on the CNN/DM dataset . . . . . . . . . . . . . . 34 3.5.4 Evaluation on the Academic Paper dataset . . . . . . . . . . . 35 3.5.5 Human Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 35 3.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4 Results and Discussion 39 4.1 Training and Validation . . . . . . . . . . . . . . . . . . . . . . . . . 39 4.2 CNN/DM Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4.2.1 Truncated Results . . . . . . . . . . . . . . . . . . . . . . . . 42 4.2.2 Full Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.2.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.3 Academic Paper Dataset . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.3.1 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.3.2 Single Sample Results . . . . . . . . . . . . . . . . . . . . . . 50 4.3.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 4.4 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.4.1 ROUGE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.4.2 Sentence Similarity . . . . . . . . . . . . . . . . . . . . . . . . 53 4.4.3 Human Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 54 4.5 CNN/DM Dataset Positional Bias . . . . . . . . . . . . . . . . . . . . 54 4.6 Model Positional Bias . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.6.1 CNN/DM Dataset . . . . . . . . . . . . . . . . . . . . . . . . 55 x Contents 4.6.2 Academic Paper Dataset . . . . . . . . . . . . . . . . . . . . . 56 4.7 Model Confidence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 5 Conclusions and Future Work 61 5.1 Ethical Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . 61 5.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 5.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 5.4 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 Bibliography 67 A Appendix 1 I A.1 Full Text Scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . I A.2 Academic Texts: Single Sample Scores . . . . . . . . . . . . . . . . . I A.3 Full Confidence Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . III A.4 Text Excerpts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . IV A.4.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . IV A.4.2 Manual Extractive Summary . . . . . . . . . . . . . . . . . . . IV A.4.3 Every-7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . V A.4.4 DistilBERT . . . . . . . . . . . . . . . . . . . . . . . . . . . . VII A.4.5 XLNet Mem. . . . . . . . . . . . . . . . . . . . . . . . . . . . 
VIII xi Contents xii List of Figures 2.1 A simplified illustration of an ANN. . . . . . . . . . . . . . . . . . . . 9 2.2 Single RNN cell. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3 RNN unrolled over 4 steps. . . . . . . . . . . . . . . . . . . . . . . . . 11 2.4 Stacked RNN. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.5 Bidirectional RNN. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.6 Encoder/Decoder model. . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.7 Translation of a French sentence into an English one. Figure taken from [29] with permission from the authors. . . . . . . . . . . . . . . 14 2.8 Transformer architecture. Taken from [30] with the authors’ permission. 16 2.9 BERT input representation. Reproduced from [19] with the authors’ permission. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 2.10 BERTSum architecture. Reproduced from [2] with the authors’ per- mission. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 2.11 The SBERT architecture for a classification objective function. . . . . 25 2.12 The SBERT architecture for a regressive objective function. This can be used to compute similarity scores. . . . . . . . . . . . . . . . . . . 26 3.1 An overview of the experiments we performed. . . . . . . . . . . . . . 37 4.1 The plot of training loss with BERT,DistilBERT, RoBERTa and XLNet 40 4.2 Plots of validation loss and score of BERT, DistilBERT, RoBERTa and XLNet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 4.3 The combined scores on the truncated CNN/DM dataset. . . . . . . . 43 4.4 The combined scores on the full CNN/DM dataset. (S: Score, SBS: Sentence BERT Score) . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.5 The scores on the Academic Paper dataset. . . . . . . . . . . . . . . . 49 4.6 Selection scores of the sentences of the CNN/DM dataset with respect to sentence position. . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.7 Selection scores of the sentences with and without randomised positions. 56 4.8 Averaged scores of XLNet and XLNet Mem. on the Academic Paper dataset with regards to sentence position. (Red Line signifies block split) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 4.9 Plots of RoBERTa Confidence scores . . . . . . . . . . . . . . . . . . 58 4.10 Plots of XLNet and XLNet Mem. Confidence scores . . . . . . . . . . 59 4.11 Plots of RoBERTa S Confidence Metrics . . . . . . . . . . . . . . . . 60 A.1 The confidence score of all the models on the CNN/DM dataset . . . III xiii List of Figures A.2 The confidence score of all the models on the Academic Texts dataset III xiv List of Tables 3.1 Statistics for CNN/DM dataset . . . . . . . . . . . . . . . . . . . . . 28 4.1 Size in MB and required training time for all the models used in our experiments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 4.2 Scores on the truncated CNN/DM dataset. (S: Score, SBS: Sentence BERT Score) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.3 Scores on the full CNN/DM dataset. (S: Score, SBS: Sentence BERT Score) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.4 The averaged scores the different models achieved on the Academic Paper dataset. In parenthesis, the difference between the score achieved when using reference summary 1 and the one achieved when using ref- erence summary 2. . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . 48 4.5 Rankings of the Academic Paper dataset summaries according to the different scores. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50 4.6 Rankings of the sample summaries according to the different scores. . 51 4.7 Statistic of RoBERTa confidence scores . . . . . . . . . . . . . . . . . 59 4.8 Statistics of RoBERTa S Confidence Scores . . . . . . . . . . . . . . . 60 A.1 Scores on Text Dataset against both the reference summaries.(1: com- paired against reference summaries 1, 2: compaired against reference summaries 2, GS: Greedy Score, SBS: Sentence Bert Similarity) . . . I A.2 The scores the different models achieved on the selected sample of the Academic Paper dataset. In parenthesis, the difference between the score achieved when using reference summary 1 and the one achieved when using reference summary 2. . . . . . . . . . . . . . . . . . . . . II xv List of Tables xvi 1 Introduction When trying to investigate a topic, researchers nowadays have access to huge amounts of data - news articles, papers, blog posts, etc. However, due to the sheer volume of data, surveying it for relevant information can be a difficult task. Even when summaries or abstracts are provided, as is often the case with papers, they do not necessarily constitute a good summary of the actual document. Instead, many of them are written with the intention of enticing the reader to read the rest, not with the intention of providing them with all the most interesting information up front. Reading the full documents, however, can be an impossible task, and reading only parts of them selectively risks missing out on important pieces of information. Ideally, what a researcher in this position would want is for someone to sort through the material and provide them with summaries comprised of all the most important bits of the documents. These summaries can then help them to get a good overview over the topic(s) and to decide which documents to read in full. Since producing such summaries is a very time-intensive task when done by humans, many attempts have been made over the past decades to automate this process, especially in the news domain. In this thesis, we aim to investigate the current state-of-the-art meth- ods and apply them to the summarisation of academic papers from the field of traffic safety. Natural Language Processing (NLP) in general and automatic text summarisation in particular are still very active fields of research with new papers and models being released all the time. Most of the current state-of-the-art models for various NLP tasks are still very new and have not been tested very extensively on the task of text summarisation or on data sets comprised of anything else than relatively short news articles. This is what we wish to investigate in this thesis. 1.1 Objective and Scope The main objective of this thesis is to investigate the suitability of several pre- trained language models for automatic text summarisation. For this purpose we will investigate several sub-tasks: • As many different models exist, we will compare how the chosen models per- form against each other for the purpose of extractive text summarisation. • Since few datasets for summarisation exist outside the news domain, and cre- ating one is out of scope of this thesis, we will instead investigate how well a 1 1. Introduction model fine-tuned on a dataset consisting of news articles is able to generalize and perform on the very different academic texts. 
• To properly evaluate the models, adequate evaluation methods are required. Currently used evaluation methods have limitations, for example requiring ex- act word matching. Therefore, we will investigate a method based on sentence similarity. We will employ this and the current standard metrics to evaluate the generated summaries and compare results against human judgement of the summaries’ quality. The scope of this thesis project is limited in the following ways: 1. Developing a new method for text summarisation or creating a new model for text representation is out of scope for this project. 2. Development of a complete method to automatically pre-process scientific doc- uments (e.g., from PDF to clean text) for summarisation is outside the scope of this project. 3. No graphical interface for any end-user will be created. This thesis focuses solely on the scientific investigation of methods for automatic text summari- sation, not on the development of a finished software product. 4. The aim is to produce summaries of individual scientific papers from the field of traffic safety. Multi-document summaries (producing a single summary combining the information from several source documents) are out of scope for this project. 1.2 Outline The rest of the thesis is structured as follows: In Chapter 2 we will provide an overview over the current state of the art in the fields of NLP and automatic text summarisation. We will offer definitions and descriptions of the most important concepts and methods in NLP and will in particular introduce the most promising approaches to the automatic creation of text summaries and to their evaluation. Chapter 3 will detail how we performed our experiments. We will describe the implementations of the models and methods we used, how we obtained our training, test and evaluation data, and what kind of experiments we performed. The results of those experiments will be presented and discussed in Chapter 4. In Chapter 5 we will look at some ethical considerations regarding automatic text summarisation in general, review our most important findings with respect to the limitations of our project, and suggest possible directions for future work. 2 2 Theory Automatic text summarisation is part of a research area called natural language processing (NLP). In the following, different approaches to text summarisation and evaluation of text summarisation will be presented in Sections 2.1 and 2.2, respec- tively. In Sections 2.3 and 2.4, the two important concepts of text embedding and language modeling will be explained. Section 2.5 follows the development of increas- ingly powerful machine learning models that have been developed for various NLP tasks and are useful for automatic text summarisation. 2.1 Automatic Text Summarisation Automatic text summarisation techniques can be divided into two different ap- proaches: extractive and abstractive. As described by See in [1], extractive sum- marisation techniques produce summaries by directly picking a subset of relevant sentences/phrases/words from the source document(s). Abstractive methods gener- ate summaries using a separate vocabulary, rebuilding each sentence from scratch thus allowing the summary to contain words that don’t necessarily appear in the source documents. This has the potential to result in a more cohesive text, but can also distort facts. Abstractive methods require encoding a much deeper level of understanding of natural language to be successful. 
Most research so far has been focused on extractive methods, as they don't require the machine to "understand" the text semantically and are easier to implement. For the above reasons, and to be able to draw on this wealth of available research, this thesis will be focusing on extractive summarisation.

2.1.1 Extractive Summarisation Methods
The summaries produced by the extractive method are a subset of the sentences of the source document(s). In the following subsections, we will look at two main approaches to this task: Score and Select, and Sequence Labeling.

2.1.1.1 Score and Select
In the score and select method, summarisation is treated as a problem of assigning each sentence a score which captures how important it is to include in the summary. Then, the n highest-ranking sentences are selected to form the summary. Alternatively, the sentence selection can be approached as an optimisation problem: importance and coherence are to be maximised, redundancy minimised.

2.1.1.2 Sequence Labeling
Another way to model the extractive summarisation task is to treat it as a binary classification problem: each sentence needs to be labeled either as a summary sentence (which will be included in the summary) or a non-summary sentence (which will not be). Usually, a neural network (see Section 2.5.1) is trained for this task, using training data of texts and their sentence labels. Many current state-of-the-art methods use this approach, like BERTSum as introduced by Liu [2], which will be described in Section 2.5.4.1.

2.2 Summarisation Evaluation
Automatically creating summaries is a difficult task in itself, and evaluating the quality of these summaries is another non-trivial task. Evaluation of summaries can be done manually: people may read at least a sizable part of the documents as well as the produced summary and subjectively judge how well it summarises the documents. Another often used human-judgement metric is how well the given summary can be used to answer certain queries. However, there can be a significant amount of variation in how different people judge the quality of the same summary.

Formulating objective quality measures and automating the evaluation process can help with this problem. Being able to obtain an exactly defined score for each summary makes it possible to better compare the results of different summarisation approaches with one another. According to Allahyari et al. [3], the most widely used metric for automatic evaluation is ROUGE (Recall-Oriented Understudy for Gisting Evaluation). This metric can be used to compare the produced summaries against a set of typically manually created reference summaries according to certain criteria.

2.2.1 ROUGE
ROUGE, introduced in 2004 by Lin [4], is a set of measures that can be used for the evaluation of text summaries by comparing them to reference summaries assumed to be ideal. The comparison is performed by evaluating the overlap of text units, such as single words, word pairs (bi-grams) or word sequences, between the summary to be evaluated (also called the candidate summary) and the reference summary. Using ROUGE, the evaluation of summaries can be completely automated, which both saves time and allows for a more objective measure to compare different summarisation methods against each other.

In [4], Lin introduces five different ROUGE measures: ROUGE-N, ROUGE-L, ROUGE-W, ROUGE-S and ROUGE-SU. Each of them will be briefly described in the next few paragraphs.
ROUGE-N is defined by the overlap of n-grams between the candidate summary and the reference summary. An n-gram is a sequence of n adjacent words. In particular, the two commonly used measures ROUGE-1 and ROUGE-2 refer to the overlap of single words and the overlap of bi-grams, respectively.

ROUGE-L refers to the Longest Common Subsequence (LCS, see [4]). Its score is based on the longest matching sequence of words that can be found between the candidate summary sentences and the reference summary sentences. The total ROUGE-L score of the summaries is computed from the LCS scores of the individual candidate-reference sentence pairs.

ROUGE-W is a weighted variant of ROUGE-L that favours consecutive LCSs.

ROUGE-S is similar to ROUGE-N, but measures the co-occurrence of skip-bigrams rather than n-grams. Skip-bigrams allow for arbitrary gaps between the two words of the bigram. For example, the sentence "I had lunch today." contains the following skip-bigrams: "I had", "I lunch", "I today", "had lunch", "had today" and "lunch today". Note that the order in which the words appear matters.

The last one is ROUGE-SU. This is an extension of ROUGE-S, which additionally takes unigrams into account. Using ROUGE-S, there would be no skip-bigram match between the two sentences "I had lunch today" and "today lunch had I", as the second sentence is the exact reverse of the first. With ROUGE-SU, however, we get four unigram matches. ROUGE-SU can be obtained from ROUGE-S by adding a begin-of-sentence token at the beginning of each candidate and reference sentence. For the example above, this would give us the two sentences "[START] I had lunch today" and "[START] today lunch had I", which would give us the following ROUGE-S matches between them: "[START] I", "[START] had", "[START] lunch" and "[START] today".

Lin [4] concludes that the ROUGE scores correlate well with human judgement for single-document summarisation, but less well for multi-document summarisation. ROUGE-1, ROUGE-2, ROUGE-S4, ROUGE-S9, ROUGE-SU4 and ROUGE-SU9, however, performed "reasonably well when stopwords were excluded from matching" [4]. Correlations with human judgement can be further increased by using multiple reference summaries per document.

ROUGE, in particular ROUGE-1, ROUGE-2 and ROUGE-L, has been used widely in recent papers on automatic summarisation techniques such as [5], [2] and [6] to evaluate their performance. These evaluations are very commonly done on the CNN/Daily Mail dataset [7]. This is because it has long been one of the biggest available datasets of texts and reference summaries, and because evaluating different methods on the same dataset facilitates easy comparison between them.

2.2.2 Other evaluation metrics

2.2.2.1 BLEU
Another once very commonly used metric, originally developed by Papineni et al. to evaluate machine translations, is BLEU (bilingual evaluation understudy, [8]). It was used to evaluate machine-generated text against human-generated text. Subsequent papers, like [9] and [10], however, have called the usefulness of BLEU for anything other than the evaluation of machine translations into question. In our research, we have not come across BLEU being used for the evaluation of text summarisation today, which is why we will not discuss it here any further.

Steinberger and Ježek [11] give a good overview of various approaches to text summarisation evaluation. In the following, we want to mention in particular: precision, recall and F-score, and cosine similarity.
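To make the n-gram overlap behind ROUGE-N more concrete, the following minimal Python sketch computes the clipped n-gram overlap between a candidate and a reference summary and reports it as recall, precision and F-score (the same quantities discussed in the next subsection). This is only an illustration under our own naming choices (ngrams, rouge_n); it is not the ROUGE implementation used for the experiments in this thesis.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1, beta=1.0):
    """Compute ROUGE-N recall, precision and F-score between two token lists.

    The overlap is clipped so that each reference n-gram can only be matched
    as many times as it actually occurs in the reference.
    """
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())            # clipped n-gram matches
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    if precision + recall == 0:
        return recall, precision, 0.0
    f_score = (beta ** 2 + 1) * precision * recall / (beta ** 2 * precision + recall)
    return recall, precision, f_score

# Example: unigram (ROUGE-1) and bigram (ROUGE-2) overlap
candidate = "the cat was found under the bed".split()
reference = "the cat was under the bed".split()
print(rouge_n(candidate, reference, n=1))
print(rouge_n(candidate, reference, n=2))
```

The same pattern extends to skip-bigrams for ROUGE-S by generating all ordered word pairs instead of adjacent ones.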
2.2.2.2 Precision, Recall and F-Score
As extractive summarisation is essentially a binary classification problem, we can use precision and recall to evaluate generated summaries against extractive reference summaries in the following way: precision (P) is the number of sentences in the generated summary that are also in the reference summary, divided by the number of all the sentences in the generated summary. Recall (R) is the number of sentences in the generated summary that are also in the reference summary, divided by the number of all the sentences in the reference summary.

It is worth noting that a very high precision score can be achieved by a summary that includes only a single sentence, as long as that sentence is also in the reference summary. High recall, on the other hand, can be achieved by a generated summary that contains a multitude of irrelevant sentences, as long as it also includes many of the sentences in the reference summary. In the case of extractive summarisation, the optimal summary is likely to lie somewhere in between those two extremes. A more useful metric than precision or recall alone is therefore the F-score, which combines the two:

F = \frac{(\beta^2 + 1) \cdot P \cdot R}{\beta^2 \cdot P + R}    (2.1)

where \beta is a weighting parameter that favours recall when chosen greater than 1 and precision when smaller than 1.

2.2.2.3 Cosine Similarity
Cosine similarity is a measure of the similarity of two vectors. If the candidate and reference summaries have been converted into vector space, where similar summaries are closer to each other (more on how this is done can be read in Section 2.3), then cosine similarity can be used to determine how similar two summaries are, using the following measure to compare them:

\cos(X, Y) = \frac{\sum_i x_i \cdot y_i}{\sqrt{\sum_i x_i^2} \cdot \sqrt{\sum_i y_i^2}}    (2.2)

2.3 Text Embedding

2.3.1 Word Embeddings
For a computer to be able to work with text, we need to first convert that text into numerical input, usually into vectors. This process is called text embedding and can be done on word, sentence or document level. The input text itself is treated as a sequence of tokens, which usually are either words or sub-word entities (like "play" and "#ing" for the word "playing"). Ideally, these embeddings are not just arbitrary, but capture some syntactic and semantic information. The goal is to embed the words in such a way that words with similar meanings are embedded similarly, meaning they are close together in the vector space. The linguist Zellig Harris noted already in 1956 [12] that words that appear in similar contexts tend to have similar meanings, which is known today as the distributional hypothesis. This hypothesis implies that, with a large enough corpus of text to train on, word embeddings can be learned unsupervised by neural networks, simply by observing the contexts (other words) they often appear in.

A very commonly used word embedding method applying the distributional hypothesis with very good results is word2vec [13]. The embeddings the algorithm produces clearly capture semantic properties of words, as the following example, taken from [13], illustrates: "vector("King") - vector("Man") + vector("Woman") results in a vector that is closest to the vector representation of the word Queen". However, one of the drawbacks of word2vec and similar embedding methods like GloVe [14] is that they embed each word only as a single, fixed vector, regardless of the specific context it appears in. Consider, for example, the sentence "I lost my cell phone in the prison cell".
The word vector for "cell" would be the same in both occurrences, capturing some mixture of all the different meanings that "cell" can take on. Recently, new approaches for text embedding that utilize deep learning and attention have improved on this, like BERT (see Section 2.5.3.1 for details) and XLNet (see Section 2.5.3.4 for details). The concept of attention will be described in Section 2.5.2.7. Rather than returning fixed vectors for each word, these models, after training, can be used to obtain context-specific word vectors. In our previous example, the two occurrences of "cell" would be represented as two different vectors, one most likely being much closer to the vectors of words like "telephone" and "conversation" and the other being closer to words like "crime" and "punishment".

2.3.2 Sentence Embedding
Sentence embeddings can be computed from the vectors of the words they are made of (the simplest approach being to take their average). Language models like BERT and XLNet can also be trained to produce sentence embeddings for specific tasks.

2.3.3 Document Embedding
Just like sentence embeddings can be obtained by aggregating word embeddings in some way, document embeddings can be obtained from the embeddings of the sentences they contain. Again, the simplest approach would be to simply average the sentence vectors.

Another method is Doc2Vec [15]. Doc2Vec, also called Paragraph Vector, is an unsupervised algorithm that extends the basic concept of Word2Vec to variable-length pieces of text. These may be single sentences or long documents. Doc2Vec learns fixed-length vector representations of these text pieces by trying to predict words in them. This method of document encoding is able to capture semantic information about the text unit it is given, much like Word2Vec does for words, and some researchers like Campr and Ježek [16] found it useful for the evaluation of automatic text summarisation. Dai et al. [17], too, investigated Doc2Vec's general usefulness for measuring the similarity of two texts and found that it performed better than or on par with other methods of document embedding. They also found that vector operations can be performed on the vectors, much like with word2vec.

2.4 Language Modeling
Another important concept for the field of NLP is that of language modeling, which means representing a language as a probability distribution over sequences of words. Jozefowicz et al. [18] give a good overview of the developments in language modeling up to 2016. Ideally, a language model is able to capture both grammatical and semantic information, assigning high probabilities to sentences that are both grammatically correct and likely to appear in the context of the corpus, which is often limited to texts belonging to a certain topic, and low probabilities otherwise. Language models are used for many NLP tasks like speech recognition, machine translation and text summarisation. In the past, RNNs (see Section 2.5.2.1) were very commonly used to train such models. However, as of 2020, when this thesis was written, two of the most promising models for language modeling are BERT [19] and XLNet [20]; both employ the Transformer architecture (as described in Section 2.5.2.8). In the following sections, these models will be described in more detail.

2.5 Machine Learning Models for NLP
In more recent years, machine learning, in particular neural networks, has enabled great progress in NLP in general and automatic text summarisation in particular.
In the following, we will trace the development of these techniques. Section 2.5.1 gives an introduction to Artificial Neural Networks. In Section 2.5.2, networks for handling sequential data, such as text, are introduced, from the early Recurrent Neural Networks (Section 2.5.2.1) to the more recent Transformer (Section 2.5.2.8). Section 2.5.3 introduces several pre-trained language models. Section 2.5.4 introduces two task-specific models that build on pre-trained models.

2.5.1 Artificial Neural Networks
Artificial Neural Networks (ANN, often also just referred to as neural networks; see [21] for a more detailed overview) are computing systems inspired by the biological neural networks found in the brain. As a biological network consists of neurons, an artificial neural network is made up of artificial neurons (from here on just referred to as neurons), which are essentially functions. Each such neuron receives input and performs some computation on it, before passing on the result of that computation, multiplied by some weight, to one or more neurons of the next layer it is connected to, until the ones in the last layer produce the output of the network.

Figure 2.1: A simplified illustration of an ANN.

In order to perform a specific task, a neural network, once set up, needs to be trained. To do this, a set of training data is required, meaning a set of inputs with the corresponding outputs we want the network to produce. If the network does not produce the desired outputs, the weights of the neurons are adjusted through a process called backpropagation. For details on this process, the interested reader is referred to [21].

Each ANN model has an objective function, which captures the desired outcome. For example, the objective function of an ANN trying to guess a number correctly might be to minimize |number_guess − number_true|. A so-called loss function expresses how well the model fits the training data. The loss function depends on the parameters of the model (the weights). The aim of training is to find the parameters/weights that minimize the loss function. This is done via a process called gradient descent.

A gradient is the multi-dimensional equivalent of a function's derivative, which measures the slope of a function. This slope will be 0 for parameter values at which the function has a maximum/minimum. The aim of training is to find the minimum of the loss function, and therefore to find parameters for which the gradient is 0. We "descend" the gradient until we hit its low point of zero. Computing the gradient of a function produces a vector that points in the "uphill" direction of the gradient, which is why gradient descent happens in two steps:
In this case, the training data is split into so-called batches, which are processed one by one. This also affects the gradient calculation since we only have access to a random subset of the data. In this case, a stochastic approximation is used, stochastic gradi- ent descent (SGD). After each batch has been processed, backpropagation is applied. Neural Networks form the base for all the models that will be described in the following sections. 2.5.2 Sequential Models For NLP tasks, input data often takes the form of sequences: A sentence, for ex- ample, is a sequence of words and a document is a sequence of sentences. Such sequences will vary in length, which poses a problem for the neural networks. By encoding the input, using Continuous Bag Of Word representations like [13], it is possible for neural networks to process such data. But this process is limited, as it does not take the order of words into consideration. The word order, however, can be very important for the meaning of a sentence: For example, “bad, not good” and “not bad, good” have very different meanings, even though they contain the exact same words. Convolutional Neural Network models, [22], would be able to represent such relations and dependencies, but are limited to only local ordering and have trouble with relations over large distances, such as a long sentence. This is because of the convolution process, which generally only covers a short range. For an ex- ample of why this is important, imagine a text describing somebody’s biography. The first sentence of this text might be something like: "XYZ was born in France." Then many other sentences may follow, which don’t refer to XYZ’s country of birth, until the last sentence: "XYZ returned to his country of birth and died there." In order to know what "his country of birth" refers to, it is important to remember the information from the beginning of the text. CNN models would struggle with this and potentially not be able to resolve that "his country of birth" and "France" refer to the same country. To model sequence dependencies over large distances, other models are required. One such model that is specifically designed to model dependencies between sequential inputs, is the so-called Recurrent Neural Network, described in the next section. 10 2. Theory 2.5.2.1 Recurrent Neural Network (RNN) Recurrent Neural Networks (RNN) can process data of varying length, while main- taining structured relations and dependencies. A RNN takes as input a sequence of vectors, each of which is processed in a step-by-step fashion, outputting a state vector which is used to pass on information to the next step. As more of the in- put is processed, the state vector gathers more information, better representing the sequence. Figure 2.2 illustrates the basic RNN architecture. For the first step an initial randomized state vector is used, for each subsequent step the previous state vector is used as input. Figure 2.2: Single RNN cell. If the length of the input sequence is known in advance, the network can be unrolled to display the full network, as illustrated in Figure 2.3 . Figure 2.3: RNN unrolled over 4 steps. When unrolled, it can be seen that the RNN is a deep neural network and can thus be trained like a feed-forward NN by backpropagation through time. 2.5.2.2 Stacked RNN RNNs can be stacked [23], such that the output from one layer is used as the input to the next layer. This creates hierarchical structures, often called Deep RNNs. 
As Goldberg writes in [24], stacked RNNs often perform better on various NLP tasks, but it is not theoretically clear why.

Figure 2.4: Stacked RNN.

2.5.2.3 Bidirectional RNN
An issue with RNNs is that they can only use past states for predictions; future states, however, might also contain useful information. Additionally, when processing a sequence, later states will contain more information than earlier states, so accuracy improves as more of the sequence is processed. Bidirectional RNNs attempt to solve this by utilizing two layers. Each layer processes the same inputs, but in opposite directions, i.e., one does so from front to back, as can be seen in Figure 2.5, the other from back to front. The output of each step is a combination of the two layers' outputs at that step. This allows the network to use past and future states with more accumulated information.

Figure 2.5: Bidirectional RNN.

2.5.2.4 Simple RNN
A simple version of the RNN was proposed by Elman [25]. In this version, the state vector is simply the linear combination of the previous state and of the current input, passed through a non-linear activation function. This simple architecture suffers from the exploding/vanishing gradient problem (EVGP). Hanin [26] explains how and under which circumstances the EVGP occurs. It means that when the weights of the network are updated, the increment in which this is done is either too big and therefore too imprecise, or too small and therefore effectively meaningless. For the simple RNN, this happens especially when handling long sequences. Over long sequences information is lost and thus the ability to represent dependencies is compromised.

2.5.2.5 Long Short Term Memory (LSTM)
The LSTM architecture was developed by Hochreiter et al. [27] to solve the vanishing gradient problem, among others. The main addition is the use of a memory cell in combination with a number of "gates" that control it. The gates are values computed using the previous and current steps of the sequence. As each segment of a sequence is processed, the gates influence what should be added to the memory, what should be forgotten, and what the new output should be. The gating components allow gradients to be passed through the memory cell over longer ranges.

2.5.2.6 RNN Modes
There are different modes for handling the outputs produced by an RNN:

Acceptor: The output is based on the information contained in the final state. For the example of sentence classification, a classification would be produced after all words in the sentence have been processed.

Encoder: The output is the final state vector, an encoding of the sequence into a single vector, often used in combination with a decoder. For the sentence example, after all words have been processed, an encoder outputs an encoding of the sentence.

Transducer: The output is based on the combined information of each step's state vector. For the sentence example, there would be a single output after each word has been processed.

Encoder/Decoder: An Encoder-Decoder architecture is often used for sequence-to-sequence NLP tasks. The RNN Encoder-Decoder was first introduced by Cho et al. [28]. The encoder encodes an input sequence into an intermediate vector representation, as described above. This vector is then used as the initial state for the decoder. A decoder is often autoregressive, meaning that it consumes its own output, using the output produced so far as input in the next step.
Using the example of translation, a sentence is given as input for the encoder, producing the intermediate vector. The decoder, with this vector as the initial state, works step-by-step producing and consuming the translated sentence word-by-word until it finds itself generating an end-token. This architecture is illustrated in Figure 2.6. Figure 2.6: Encoder/Decoder model. 13 2. Theory 2.5.2.7 Attention The problem with the Encoder/Decoder sequence-to-sequence model described in Section 2.5.2.6 is that it encodes the entire input sequence into a single, fixed-length context vector, which the decoder then uses to generate the output. In order to produce this vector, the input sequence is processed sequentially from beginning to end, and at each step only some of the information from the previous step is passed on. This means, that the final output of the encoder is much more influenced by the last couple of tokens than it is by the first. For very long sequences, this can lead to important information from the beginning of the sequence simply being “forgotten”. Even LSTM can not fully solve this problem. In order to alleviate this problem, Bahdanau et al. [29] suggest a new way of processing sequential data: Instead of using an encoder to produce a single context vector while discarding all the intermediary hidden states of the encoder, the authors propose to utilize all the encoder states. The goal of training such a model is then no longer to produce the one context vector that perfectly encodes the input sequence, but rather to learn which parts of the input sequence to pay attention to in order to generate each part of the output sequence. For illustration purposes, Figure 2.7 shows a machine translation example from [29]. It shows the attention that was paid to each French word of the input sequence to produce each word of the English output sequence. Note, for example, that in order to generate the English word “Syria”, full attention was paid to both the French words “la” and “Syrie” and little to no attention to any of the other words in the sentence. Figure 2.7: Translation of a French sentence into an English one. Figure taken from [29] with permission from the authors. 14 2. Theory 2.5.2.8 Transformer Another architecture for sequence modeling is the Transformer, introduced by Vaswani et al. in the paper "Attention Is All You Need" [30]. This model follows the en- coder/decoder structure introduced in 2.5.2.6, but as the title of the paper suggests, it relies primarily on attention. It was developed to solve some of the issues with existing models, which were largely based on RNN (see section 2.5.2.1) and CNN [22]: Most prominently the problem of retaining information over many steps when encoding long input sequences, and the limited possibility of parallelization, since every step of the encoding and decoding requires the output of the previous step. Even when RNN models were enhanced by the addition of attention to alleviate the former problem, the problems with parallelization remained. CNN models, on the other hand, can be parallelized but suffer from an increased path length in the network as sequence length increases, which increases the amount of information that is potentially lost. The Transformer architecture discards the recurrent approach of sequence modelling and utilizes at- tention instead, as described in Section 2.5.2.7. 
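To make the idea of attention more concrete, the sketch below implements dot-product attention over a set of encoder states in plain NumPy. This is an illustrative sketch rather than the exact mechanism of [29] (which uses an additive scoring function); the scaled dot-product variant shown here is the one used by the Transformer described in the next section, and the names (softmax, dot_product_attention) are our own.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dot_product_attention(queries, keys, values):
    """Scaled dot-product attention.

    queries: (m, d) states asking "what should I look at?"
    keys:    (n, d) encoder states used to score relevance
    values:  (n, d) encoder states that are mixed into the result
    Returns the context vectors (m, d) and the attention weights (m, n).
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)     # relevance of every input position
    weights = softmax(scores, axis=-1)         # each row sums to 1
    context = weights @ values                 # weighted sum of the input states
    return context, weights

# Toy example: one decoder state attending over four encoder states
rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(4, 8))
decoder_state = rng.normal(size=(1, 8))
context, weights = dot_product_attention(decoder_state, encoder_states, encoder_states)
print(weights)   # how much attention each input position receives (row sums to 1)
```

The weights play the same role as the word-alignment pattern in Figure 2.7: each output step distributes its attention over all input positions instead of relying on a single fixed context vector.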
Due to its non-sequential approach, this method is highly parallelizable while only requiring a constant, O(1), number of operations and path lengths. This allows for faster training and better retention of information. The basic architecture of the Transformer model can be seen in Figure 2.8 and will be described in the following paragraphs. Encoder The encoder of the transformer consists of 12 stacked encoder layers. Each such layer consists of two sub-layers, an attention layer and a feedforward network layer. The input to the first of these layers is a sequence of embeddings, very commonly word embeddings. Before these embeddings are passed to the model, they are sup- plemented with positional encoding to be able to retain the information of word order, despite no longer processing the input in a sequential manner. With the po- sitional encoding, the same token at different positions is encoded differently, and their final embeddings will have some meaningful distances in vector space. The attention sub-layers allow the system to focus on the most relevant parts of a sequence. During the encoding process this is used to determine how much a word relates to all other words in the sequence. The attention mechanism used by the Transformer is called Multi-Head Self-attention: "Self", because it encodes each word of the sequence in relation to the other words in the sequence, and "multi-head" because each attention layer utilizes not just one, but several different attention weights ("attention heads") [30]. This means that multiple attention processes are performed in parallel. The idea is for each to focus on different aspects of the sequence. The second sub-layer is a fully connected feed-forward neural network. This network is applied to each element of the sequence separately but identically. 15 2. Theory Figure 2.8: Transformer architecture. Taken from [30] with the authors’ permis- sion. 16 2. Theory All sub-layers of the encoder also contain residual connections, their purpose is to combine the input, which has not been affected by the layers, with the output pro- duced by the layers. In the Transformer architecture, these residual connections are used to restore positional encoding after processing the word embeddings. Without these connections the performance suffers greatly, as positional information gets lost. Decoder The decoder differs only slightly from the encoder. It contains the same two sub- layers, but has an additional sub-layer, an encoder-decoder attention layer, between them. The decoders role is to produce output, using the information produced by the encoder. In the Transformer it does this by performing multi-head attention over the output of the encoder and the so far produced output by the decoder. This differs from the other forms of attention in the model as it is no longer self-attention, instead it uses multiple sequences. 2.5.2.9 Transformer-XL The Transformer model described above, as proposed by Vaswani et al. [30] handles the whole input at once, as such there must be some limit on the length of the input, due to computational and resource limitations. The default implementation uses a token limit of 512. This means the transformer can only consider any token’s context in 512 token blocks. Solutions for longer texts have been suggested in later works, such as [31], where the longer corpus is split into multiple 512 length blocks. 
This, however, has two problems: firstly, no contextual information is shared between blocks, and secondly, the splitting of the corpus is often done without any respect for sentence or semantic structure, leading to context fragmentation.

Transformer-XL [32] is an architecture proposed to solve the problem of fixed-length contexts. Its main contribution is the re-introduction of recurrence to the Transformer, which allows context to flow between blocks of a split corpus. To achieve context flowing over the boundaries of blocks, the previous block's attention vectors are saved and can be "looked back" on for context, resulting in better long-term dependency modelling and avoiding the fragmentation problem. Applying this method to every two consecutive blocks creates a combined context that can represent context over much more than just two blocks. The method could also be extended to allow for further connections, beyond just two blocks.

For this method to work, another type of positional encoding is required. Since the model looks back at previous blocks, the absolute positional encoding employed by the Transformer no longer works, since each token will appear in multiple positions and tokens in different segments would be assigned the same positions. Instead, a relative positional encoding based on the distance between tokens is used. The attention score, too, is calculated slightly differently from the Transformer. The Transformer-XL model is able to generalise quite well from training on short sequences to much longer sequences. For example, Dai et al. detail in [32] how the model was trained with an attention length of 784 tokens, evaluated on a corpus of 3,800 tokens, and achieved a new state-of-the-art result.

2.5.3 Pre-Trained Language Models
In the following, we will introduce several pre-trained models for NLP tasks: BERT (Section 2.5.3.1) and XLNet (Section 2.5.3.4), which are built on the Transformer and Transformer-XL architectures, respectively. Additionally, a number of variations of BERT will be presented: RoBERTa (Section 2.5.3.2) and DistilBERT (Section 2.5.3.3).

2.5.3.1 BERT
BERT is a Transformer-based model for language encoding, introduced in [19] by Devlin et al. The authors identify the fact that models could only be trained unidirectionally as one of the big limitations of previous approaches to language modeling. This meant that a token could only be encoded using the information of either the tokens to its left or the tokens to its right, but never using information from both combined at the same time. The objective in creating BERT (Bidirectional Encoder Representations from Transformers) was to create a model that could take the full context of a token into account, left and right.

In order to use BERT for some downstream task (like machine translation or text summarisation), two steps are necessary:

1. The model needs to be pre-trained. This means the model is not yet trained in any task-specific way, but instead is taught to encode language itself in a sensible way. This is done so the same model can be used for several different downstream tasks without needing to be trained from scratch. Task-specific training (fine-tuning) is done in the next step. Pre-training results in a general-purpose Transformer that can encode input tokens. This pre-training is done on unlabeled training data over two different training tasks, described later in this section.
2. The model can then, once it is initialized with the parameters obtained through pre-training, be fine-tuned to be used for a particular downstream task. Importantly, in order to do this, the architecture of the model itself does not need to be changed. Instead, the same pre-trained model can be applied to several different tasks, by layering task-specific layers on top and training the model on labeled training data pertaining to the desired downstream task.

Since the authors made their pre-trained BERT models (a larger and a smaller one) available for download and free to use, this means that with relatively little effort, these already pre-trained models can be applied to a wide variety of text-based tasks.

The architecture of the model itself is almost identical to the Transformer architecture described in Section 2.5.2.8. Perhaps more interesting is how textual input is processed and how the model is (pre-)trained.

As input, BERT accepts textual sequences that may each be composed of either a single sentence or a pair of sentences, where a "sentence" means any arbitrary span of contiguous text, not necessarily a sentence in the grammatical sense. Each such sequence is preceded by a classification token ([CLS]). The final hidden state for this token can be trained to obtain an aggregated representation of the entire sequence. This is useful for some classification tasks, like summary/non-summary sentence classification for extractive summarisation. If the sequence consists of two sentences, then they are separated by a [SEP] token. Additionally, BERT adds a so-called segment embedding to each token, which indicates whether it belongs to Sentence A or Sentence B. The input representation of each token is obtained by adding together the token's WordPiece embedding (see [33] for details), segment embedding and positional embedding. The latter encodes where in the sequence the token is located. This is necessary, as BERT, being a Transformer model, does not go through the tokens sequentially, and therefore does not "know" the order of the input tokens. Figure 2.9 illustrates the BERT input representation.

Figure 2.9: BERT input representation. Reproduced from [19] with the authors' permission.

Once the input representation is obtained, BERT is pre-trained by trying to solve two tasks, as mentioned above. These two tasks are the following:

1. Masked Language Model (MLM): Some percentage of the input tokens (in the paper the authors chose 15%) is masked and BERT is tasked with predicting them by using the entire context, left and right. Notably, only these masked tokens are predicted by the model, and no attempt is made to reconstruct the entire input.

2. Next Sentence Prediction (NSP): This task is meant to help the model learn the relationships between sentences. To create the pre-training dataset, for each training instance two sentences A and B are picked from the training corpus. With 50% probability, sentence B will be sentence A's successor; with 50% probability it will be a random sentence from anywhere else in the corpus. BERT is tasked with predicting (binary) whether sentence B is indeed sentence A's successor.

For pre-training, the authors used BooksCorpus [34] and English Wikipedia texts. The pre-trained models BERT_LARGE and BERT_BASE are publicly available at https://github.com/google-research/bert.
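As an illustration of how such a pre-trained model can be used before any fine-tuning, the sketch below loads a pre-trained BERT via the HuggingFace Transformers library (the same library that distributes DistilBERT, see Section 2.5.3.3) and extracts the context-dependent token vectors and the [CLS] representation for a single sentence. Exact class names and output formats depend on the library version, so treat this as a sketch rather than the setup used for our experiments.

```python
import torch
from transformers import BertTokenizer, BertModel

# Load the pre-trained (but not yet fine-tuned) model and its WordPiece tokenizer.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

sentence = "Automatic summarisation selects the most important sentences."
# The tokenizer adds the [CLS] and [SEP] tokens and builds the segment/position ids.
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

last_hidden = outputs[0]            # (batch, tokens, hidden): one context-aware vector per token
cls_vector = last_hidden[:, 0, :]   # the [CLS] position, usable as a sequence representation
print(cls_vector.shape)             # torch.Size([1, 768]) for bert-base
```

Fine-tuning for a downstream task then amounts to placing a small task-specific layer on top of these representations and continuing training on labeled data, as described in step 2 above.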
2.5.3.2 RoBERTa
In [35], the authors claim that BERT, as introduced in the original paper [19], is actually undertrained, and show that with some modifications significantly better results can be achieved, competitive with the performance of every model published after BERT. They name their modified BERT version RoBERTa (a Robustly optimized BERT approach). Apart from changing some of BERT's hyperparameters, the main differences from BERT pertain to how the model is pretrained. The main changes from the training suggested in [19] are the following:

Dynamic Masking
In the original BERT, the masking of the input sequences is done in a static way: only once, in pre-processing. To ensure that BERT will still encounter the same sequences with different masking patterns, the training data instances are duplicated ten times before masking. RoBERTa instead applies dynamic masking: a new masking pattern is generated every time a sequence is fed to the model. This means that the model will encounter many more different masking patterns of the same instance, which in turn removes the need to drastically increase the number of training instances.

Full sentences
As opposed to BERT, RoBERTa uses exclusively full sentences as input to the model. Such sentences are sampled contiguously from the documents, such that the total length does not exceed 512 tokens. If document boundaries are crossed while sampling, a special inter-document separator token is inserted.

Training in large mini-batches
BERT was trained in 1 million training steps with a batch size of 256 sequences. RoBERTa, on the other hand, was trained in only 125,000 steps, using a much larger batch size of 2,000 sequences. The authors express uncertainty over whether they have already found the ideal batch size with this, but it produces better results than the original BERT while taking less time to train.

Larger byte-level BPE
BPE (byte-pair encoding) is a hybrid between character- and word-level text encoding. It bases its encodings on subword units; for example, instead of encoding the word "playing" or each of its letters separately, BPE might encode "play" and "#ing", building blocks which can be re-used to form other words as well. This allows a much larger vocabulary to be covered than would otherwise be possible. However, in BPE a large portion of the encodings are often encodings of single unicode characters, which limits the total number of words that can be captured. RoBERTa instead makes use of a variation of BPE introduced in [36], which is based on bytes instead of characters. This means that fewer subword units are needed to encode a larger vocabulary. While BERT uses character-level BPE with a vocabulary of 30,000 subword units and requires the input to be tokenised in preprocessing, RoBERTa requires no such preprocessing and uses byte-level BPE with a vocabulary of 50,000 subword units.

Longer pretraining on larger data sets
RoBERTa was ultimately pretrained for up to 500,000 steps with the large batch size and on 160GB of textual training data, resulting in much better end-task performance. The pretrained RoBERTa model is publicly available at https://github.com/pytorch/fairseq

2.5.3.3 DistilBert
Pre-trained language models such as the ones described in the previous sections can easily have several hundred million parameters. This means that they require great amounts of memory and computational power to be trained and run.
Sanh et al. therefore set out to create a much smaller language model, which is less resource demanding. They did so using the method of knowledge distillation ([37], [38]), creating a much smaller Transformer model, which they called DistilBERT. DistilBERT has the same general architecture as BERT (see Section 2.5.3.1), but fewer layers. The authors also pre-trained a BERT model according to the best-practice suggestions of [35] (see the section on RoBERTa: 2.5.3.2). DistilBERT was then trained, using the same corpus of training data, to produce the same outputs as this BERT model. This method is also called teacher-student knowledge distillation, as the BERT model functions as a teacher, whose behaviour the DistilBERT model, the student, tries to replicate. The resulting pre-trained DistilBERT model is only 40% the size of the original BERT and, according to the authors, 60% faster. Through experiments, Sanh et al. showed that despite being so much smaller, DistilBERT retains 97% of BERT's language understanding capabilities, as measured by its performance on various language understanding tasks. DistilBERT is openly available in the Transformers library from HuggingFace (https://github.com/huggingface/transformers).

2.5.3.4 XLNet
XLNet takes a slightly different approach to language modeling than BERT and its variants. It is an autoregressive model for capturing bidirectional dependencies, which the authors developed to solve some perceived problems of BERT. Firstly, the method of masking words and then predicting them corrupts the input, creating texts filled with [MASK] tokens that are not seen in regular texts. This causes a discrepancy between pre-training and fine-tuning, since the latter does not contain any [MASK] tokens. Secondly, BERT assumes that masked tokens are independent, but for a sentence containing multiple masked tokens, these may in fact be dependent on each other. BERT also has a fixed-length context of 512 tokens, while XLNet builds on the Transformer-XL architecture to be more suitable for longer texts.

Despite these problems, BERT is still very good at capturing bidirectional context, which is what led to its gains over previous models. XLNet captures bidirectional dependencies slightly differently, by utilizing permutation language modeling, which works by making predictions in a random order. Given a sentence of 5 words, for instance, the model could be asked to predict the words in the random order [word5, word1, word2, word4, word3]. This allows the model to learn bidirectional dependencies. For example, when predicting word2, the context will contain words that, in the original sentence, occur both earlier (word1) and later (word5) than it. XLNet takes this approach with all possible permutations of the factorization order. It does so by utilizing masking inside the Transformer, so as to not change the order of the actual input, as this would create unrealistic text combinations and thus discrepancies between pre-training and fine-tuning.

XLNet Architecture
XLNet is a Transformer-XL based architecture with some modifications. When pre-training a Transformer-based model, the embedding for the token being predicted is masked, including its positional embedding. This, however, is potentially useful information: when predicting a token, only the position should be known, not the content.
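To make the permutation idea more concrete before describing XLNet's architectural solution, the following toy sketch (not XLNet code, only an illustration written for this explanation) shows how one random factorisation order determines which positions are visible as context when each token is predicted.

```python
import random

def permutation_contexts(seq_len, seed=0):
    """For one random factorisation order, return which positions are visible
    as context when each position is predicted (toy illustration only)."""
    order = list(range(1, seq_len + 1))
    random.Random(seed).shuffle(order)        # one possible order, e.g. [5, 1, 2, 4, 3]
    visible = {}
    for step, pos in enumerate(order):
        visible[pos] = sorted(order[:step])   # positions that were predicted earlier
    return order, visible

order, visible = permutation_contexts(5)
print("factorisation order:", order)
for pos in sorted(visible):
    print(f"when predicting word{pos}, the model may attend to {visible[pos]}")
```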
The solution is a two-stream self-attention architecture consisting of a content stream and a query stream. The content stream is a standard self-attention model without masking, which allows access to the full context and token content. The query stream has access only to the context of the previous steps and the current token's position. The query stream is only used for pre-training and can be dropped during the fine-tuning process, turning the model into a normal Transformer-based model.

To handle multiple segments, XLNet utilizes a relative positional encoding scheme similar to the one proposed by Transformer-XL (see Section 2.5.2.9). Segments are encoded relative to each other: the encoding only indicates whether or not any two words belong to the same segment. This means that it does not encode which specific segment a word belongs to or where exactly its position in that segment is, only whether or not two words come from different segments. This has the additional benefit of allowing more than two segments to be encoded, which BERT does not allow.

2.5.4 Task Specific Models
In this section we introduce two task-specific uses of pre-trained models: BERTSum (Section 2.5.4.1) for the task of summarisation and SBERT (Section 2.5.4.2) for producing sentence embeddings.

Figure 2.10: BERTSum architecture. Reproduced from [2] with the authors' permission.

2.5.4.1 BERTSum
BERTSum [2] is a variant of BERT fine-tuned for extractive single-document summarisation. It is publicly available at https://github.com/nlpyang/PreSumm. For the purpose of extractive summarisation, two problems need to be overcome:

1. Each sentence needs to be labeled as either a summary sentence or a non-summary sentence. By default, however, BERT outputs token representations, not sentence representations, and no classifications either.

2. BERT accepts inputs of either a single sentence or a pair of sentences. For summarisation purposes, however, the model should be able to process documents containing multiple sentences.

In order to overcome these problems, the author of [2] modified both the input sequence and the embedding slightly, extending them to sequences of multiple sentences:

1. In the input sequence, each sentence is preceded by a [CLS] token and succeeded by a [SEP] token.

2. In the original BERT, two sentences A and B are distinguished by the segment embedding: each token of sentence A is embedded using EA and each token of sentence B using EB. For BERTSum, this is extended so that the tokens of the i-th sentence are embedded using EA if i is odd and EB if i is even.

Thus, the output of the top BERT layer for each [CLS] token is treated as the sentence representation of the sentence following that token. The architecture of the BERTSum model, and especially the input embedding, is shown in Figure 2.10.

Having obtained sentence representations for multiple sentences, there are several ways the author suggests to fine-tune BERT for extractive summarisation:

1. Adding a single sigmoid classification layer on top of the BERT outputs.

2. Adding more Transformer layers on top of the BERT outputs (an inter-sentence Transformer), followed by a sigmoid classification layer.

3. Adding an LSTM on top of the BERT outputs, followed by a sigmoid classification layer.

Liu's experiments in [2] showed that the option of adding a two-layer Transformer and a single sigmoid classification layer on top of BERT produced the best results.
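As a rough sketch (not the authors' preprocessing code; the helper function, sentences and model name are chosen here purely for illustration), the BERTSum-style multi-sentence input could be assembled with the HuggingFace tokenizer roughly as follows:

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def bertsum_style_input(sentences):
    """Prepend [CLS] and append [SEP] to every sentence and alternate the
    segment id per sentence, in the spirit of BERTSum (sketch only)."""
    input_ids, segment_ids, cls_positions = [], [], []
    for i, sentence in enumerate(sentences):
        tokens = ["[CLS]"] + tokenizer.tokenize(sentence) + ["[SEP]"]
        cls_positions.append(len(input_ids))        # index of this sentence's [CLS] token
        input_ids += tokenizer.convert_tokens_to_ids(tokens)
        segment_ids += [i % 2] * len(tokens)        # alternate EA / EB per sentence
    return input_ids, segment_ids, cls_positions

ids, segments, cls_positions = bertsum_style_input(
    ["The first sentence.", "Another sentence.", "A third sentence."])
# The model's output vectors at cls_positions serve as the sentence representations.
print(cls_positions)
```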
Like BERT, BERTSum has an input limit of 512 tokens.

2.5.4.2 Sentence-BERT (SBERT)
Reimers et al. [39] propose a modified BERT model for semantic textual similarity (STS) tasks. The default BERT implementation can be used for STS tasks by feeding it sentence pairs, but this is computationally expensive, as every possible pair of sentences would need to be compared. The authors give the example that finding the most similar pair in a collection of 10,000 sentences using conventional BERT requires n(n−1)/2 = 49,995,000 operations, which would take roughly 65 hours.

A common solution to these types of problems is to map the inputs to some vector space, which can then be compared via, for example, clustering. Sentence embeddings can also be produced using BERT. This is typically done by feeding the model a sentence and either averaging the output layer or using the [CLS] token. However, without additional fine-tuning these embeddings are not very useful for semantic textual similarity tasks.

In order to mitigate these problems, the authors developed a modification of the BERT model, which they called Sentence-BERT or SBERT. SBERT is fine-tuned to produce semantically meaningful fixed-length sentence embeddings (i.e., so that semantically similar sentences are close together in the vector space) which can be easily compared with the cosine similarity score (see Section 2.2.2.3). This makes finding the most similar pair of sentences much quicker. Using the same example as before, finding the most similar pair of sentences in a collection of 10,000 sentences takes, according to the authors, only a few seconds with SBERT, as opposed to 65 hours with BERT.

The authors did not find that using RoBERTa instead of BERT resulted in any improvements for their purposes. They also found that XLNet performed even worse than BERT on STS tasks, which is why they used BERT as the basis of their work. The authors also tried alternative similarity measures, like the Manhattan and negative Euclidean measures, but found that they had no advantages compared to cosine similarity.

Reimers et al. developed several possible architectures for SBERT, depending on the kind of training data available. SBERT can be built with a siamese or triplet network structure [40], which means that the same weights are used to process two or three input sentences at the same time. SBERT also adds a pooling operation to the output of BERT to derive fixed-size sentence embeddings from it. The authors tried different pooling strategies, but found that taking the mean of all the output vectors produced the best results. There are different objective functions available, depending on the task to be trained for:

1. Classification Objective Function. The sentence embeddings are concatenated with their element-wise difference, and a softmax function (which normalises a k-dimensional vector into a probability distribution of k probabilities, proportional to the original inputs and summing to 1) is applied to obtain the classification label. This structure is depicted in Figure 2.11.

2. Regression Objective Function. The sentence embeddings are used to compute the cosine similarity score. This structure is depicted in Figure 2.12.

3. Triplet Objective Function. The network is fed three sentences, one of which is the so-called anchor sentence, while the other two are the so-called positive and negative sentences. The training objective is to make sure that the distance between the positive sentence and the anchor sentence is always smaller than the distance between the negative sentence and the anchor sentence.
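As a rough usage sketch (see Figures 2.11 and 2.12 below for the training architectures), the authors' sentence-transformers library can be used to find the most similar pair described above; the model name here is one example of a published SBERT model, and the sentences are arbitrary.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bert-base-nli-mean-tokens")   # example pre-trained SBERT model

sentences = ["A man is eating food.",
             "Someone is eating a meal.",
             "The weather is cold today."]
embeddings = model.encode(sentences)        # one fixed-size vector per sentence

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare all n(n-1)/2 pairs of embeddings -- cheap, since each sentence is
# encoded only once instead of once per pair as with plain BERT.
pairs = [(i, j, cosine(embeddings[i], embeddings[j]))
         for i in range(len(sentences)) for j in range(i + 1, len(sentences))]
i, j, score = max(pairs, key=lambda p: p[2])
print(sentences[i], "<->", sentences[j], score)
```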
Figure 2.11: The SBERT architecture for a classification objective function.

Figure 2.12: The SBERT architecture for a regression objective function. This can be used to compute similarity scores.

SBERT raised the state of the art for sentence embeddings and several STS tasks.

3 Methods
In this chapter the methods used to achieve the goals of the thesis are described. Section 3.1 gives an overview of the changes that this project went through during the course of its conception and execution. In Section 3.2 we describe the datasets we used during training and evaluation, their properties and how we obtained our training data from them. In Section 3.3 we describe some of the challenges of using BERTSum for summary generation and how we solved them, and in Section 3.3.2 our implementation and training related details. Section 3.4 briefly describes the hardware used for training. Section 3.5 describes our evaluation metrics. Finally, we describe the experiments we performed using different pre-trained models in Section 3.6.

3.1 Changes in the direction of the project
The initial plan for this thesis project was to use a dataset of academic papers to fine-tune a BERTSum model for summarisation. But it became clear early on that creating a large enough dataset would not be possible within the scope of the project. We identified a few bottlenecks for such a project, which will be described in the section on future work (Section 5.4). Instead, we decided to investigate how well a BERTSum model fine-tuned on the widely used CNN/DM news dataset would transfer and perform on our academic papers dataset. We will also investigate how different pre-trained models compare against each other. In the next section we will go into more detail on the datasets we used and how we obtained them.

3.2 Datasets
In the following, we will describe the datasets we used for our experiments: the CNN/DM dataset, which we used to fine-tune the models, and the "Academic Papers" dataset, which is a very small dataset we created ourselves for the purpose of evaluation. A dataset for training on the extractive summarisation task requires a text (a sequence of sentences) and reference labels. For sequence labeling, each sentence needs to be labeled as 1 (summary) or 0 (non-summary); this means that the summary length must be known before label selection. For score-and-select, the label for each sentence is a score indicating its importance to the summary, which does not require the summary length to be known in advance.

3.2.1 CNN/DM
Few datasets exist for the task of text summarisation, especially extractive text summarisation. The CNN and DM datasets (both available for download at https://cs.nyu.edu/~kcho/DMQA/, which is where we obtained our data; last accessed 14.02.2020), which contain news articles gathered from CNN and the Daily Mail, each accompanied by a short abstractive summary, are commonly used for training and evaluating models for summarisation tasks. This dataset exists in an anonymised and a non-anonymised version; the anonymised version replaces identifying entities with non-identifiable placeholders. For better comparison with previous works, which largely used the non-anonymised version, we chose that one for our experiments, too. The dataset was split for training, validation and testing as suggested by Hermann et al. in [41]; statistics can be seen in Table 3.1.

Table 3.1: Statistics for the CNN/DM dataset

                               Train      Validation   Test
No. of samples                 287,083    13,367       11,489
Avg. sent. length              35.56      32.24        32.62
Avg. number of tokens          927.96     910.17       921.96
Summary avg. sent. length      3.73       4.11         3.88
3.2.1.1 Label Generation
The summaries included in the CNN/DM dataset are so-called "highlights": a few sentences for each news article, which aim to summarise it. These summaries are abstractive and can therefore not be used directly for training an extractive summariser. Instead, labels were generated using these abstractive summaries as guidance. We generated three sets of label data.

1. Binary Labels: As in the BERTSum paper [2], we generated binary labels for the sequence-labeling problem definition. Up to three sentences (the pre-determined summary length) are selected from each news article by maximising the ROUGE scores against the abstractive summary. For this purpose, Liu [2] proposes two different algorithms. The first is a greedy algorithm, which is fast but does not consider all combinations of sentences. The second algorithm does consider all combinations, but is slower. Liu opted for the faster but less exact algorithm, and so did we, because the differences between the two in terms of score are insignificant.

In addition to this, we propose two additional label selection schemes for the score-and-select problem definition:

2. Score Labels: We assigned each sentence a score, based on its ROUGE score against the abstractive summary. We hope that this method will allow models to generalise better for a wider range of summary lengths.

3. Sentence BERT Score (SBS) Labels: Similar to the previous score labels, but instead each sentence is assigned a score based on the cosine similarity between its sentence embedding and that of the summary, as produced by SBERT (see Section 2.5.4.2).

3.2.2 Academic Paper Dataset
This dataset consists of a small number of papers on driving styles in the domain of traffic safety. The dataset was too small to perform meaningful fine-tuning on, but we did use it to evaluate our models' ability to transfer from the news data they were fine-tuned on to this different type of text. The papers were provided to us in PDF form, which was a limiting factor for the number of documents we were able to include in the dataset, because of the additional work required to extract and pre-process the texts. To obtain labeled data, we created extractive reference summaries from scratch. In the following sections, we will describe how we extracted the text from the PDFs, how we pre-processed these texts and how we obtained the reference summaries.

3.2.2.1 Text extraction
For our experiments, we collected 31 PDF documents. These are scientific publications on the topic of traffic safety, between 5 and 33 pages in length, with an average of 14 pages. The first step in using these documents was to extract the text from each of the PDFs. Since developing a dedicated tool for text extraction from PDF was out of scope for this thesis project, we utilised an existing one, pdftotext (http://www.xpdfreader.com/pdftotext-man.html) from the open source toolset Xpdf. We configured the pdftotext tool to cut out all images from the PDFs and then had it produce a TXT document for each original PDF document containing only its extracted text, which we then cleaned up manually. This manual clean-up was necessary, because even the best text extraction tool we were able to find had various issues, which will be described in the next section.
3.2.2.2 Pre-processing
The automatically extracted TXT files, while giving us a good base to work from, had various problems, which would have limited the quality of potential summaries:

1. Tables, headers, footers and page numbers were not recognised as not being part of the text. Therefore, they appeared in the extracted TXT files, breaking up the actual text in unfortunate ways. Often, these artifacts would be inserted mid-sentence.

2. Occasionally letters, words or phrases would be printed repeatedly. In rare cases, even nonsensical streams of letters were produced for no apparent reason.

3. Sometimes the original texts would contain periods in headlines or mid-sentence, which pdftotext would preserve, impeding our ability to automatically recognise the beginning and end of sentences.

Because of the relatively small number of documents, we came to the conclusion that trying to develop a piece of software to solve these problems automatically would take more time than simply cleaning up the TXT files by hand. The following manual changes were made to the automatically extracted texts:

1. Headers, footers, page numbers, tables etc. were removed from the texts wherever we found them.

2. Corrupted words (such as "trraaaafffffiiiiiiic", "tra?c", "traf-fic") were corrected ("traffic"), and where whole sentences were corrupted, we manually copied over the correct text from the source PDF.

3. Periods that were used in any other way than to end a sentence were removed. Periods were added after headlines so as to prevent them from being interpreted as the beginning of the following sentence.

4. Anything before the "Introduction" section and after the "Conclusion" was removed from the document. In particular, we removed the abstracts from the text, as we intended to use them during evaluation.

5. Formulas, which pdftotext was rarely able to extract in a readable format, were either cleaned up or deleted from the text, where we deemed their exact content irrelevant to the surrounding text.

6. The headlines were translated into a machine-readable format in the following fashion: "1. Headline" -> "# Headline.", "1.1. Second headline" -> "## Second headline.", etc. We thought that the information on sections and subsections of the text might be interesting to preserve, though we ended up not using it.

3.2.2.3 Obtaining reference summaries
Next, we had to create reference summaries for each of the documents, to be used for evaluation. This, too, was done manually. Due to time constraints, we only did this for ten of the documents. The summaries were created from the cleaned-up TXT files in the following way: First, we read carefully through the entire document to familiarise ourselves with its content. We then went through the document again from top to bottom and removed all sentences that did not seem essential to convey the most important information of the document. We repeated this step until we felt that no more sentences could be removed without leaving out important information. We did this individually, each of us producing a separate summary for each of the ten documents. We thus ended up with two reference summaries per document. This was done to have a measure of how much we can expect two summaries of the same text to differ from one another, even under the assumption that both of them are optimal. Knowing this will help us better interpret the quality of our generated summaries.
All the summaries except one were roughly 10-20% of the length of the original documents, measured in word count. (The single outlier was almost 50% of the document length, but with 1,676 words in total, it was a short document to begin with.)

3.3 Summary Generation
In this section we describe the problems we had to overcome to be able to generate summaries for the Academic Paper dataset, and how we implemented the summarisation models.

3.3.1 Addressing BERTSum's Token Limit
As mentioned in Section 2.5.4.1, the BERTSum model has a limit of 512 input tokens. For short news data like the CNN/DM dataset, texts are usually truncated to fit this limitation. This, however, affects summary generation, as it only allows the model to select sentences contained within these first 512 tokens. For the Academic Paper dataset, which consists of texts much longer than this limit, this is unreasonable, as it would disregard the majority of each text. A possible solution, suggested by the BERTSum authors [2], is to simply extend the token limit of the model; but as the pre-trained models it uses to generate token embeddings have the same limit, the additional positions would not benefit from pre-training. Additionally, the academic papers are of such a length that the token limit would have to be increased 20-fold to fit some of the texts, and we would run into computational and memory limitations. Another solution was needed.

As suggested by Al-Rfou et al. in [31], we instead split the input texts into multiple blocks below the limit and feed each block to the model individually, later combining the outputs. This allows the model to generate summaries for longer texts. This method, however, is not without problems: since no information is shared between the blocks, contextual information is lost, which will have an effect on the generated summaries. Positions will also be repeated within each block, and thus any potential positional bias will be repeated as well. To allow summary generation for texts of any length, the following steps were taken:

1. Input texts are split into blocks of a maximum length of 512 tokens, with a maximum sentence length of 200 tokens. Longer sentences are truncated to avoid single-sentence blocks.

2. Blocks are created sentence by sentence: if adding another sentence would exceed the limit, a new block is created. Thus, sentence integrity is maintained.

3. Each block is put through the model separately, which outputs a score for each sentence in the block.

4. The scores of all blocks are combined and the top-scoring sentences are selected to form the summary.

3.3.2 Implementation
We used the BERTSum model implementation by Liu (https://github.com/nlpyang/BertSum), as presented in [2], as the base for our implementations. This BERTSum implementation is built on top of "Open Source Neural Machine Translation in PyTorch" (https://opennmt.net/), an open source framework for sequence models. To better facilitate our goal of using multiple pre-trained models, we made some changes to the original implementation, which will be described in the next sections. We used the Transformers library, maintained by Hugging Face (https://github.com/huggingface/transformers), which contains standardised PyTorch implementations of many of the newest Transformer-based models. This is the latest version of the commonly used pytorch-transformers library (which BERTSum utilises). Our own implementation is based on their examples. In the following sections we will explain some key alterations we made.
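Before detailing these alterations, the block-splitting procedure from Section 3.3.1 can be sketched roughly as follows. This is a simplified illustration, not our exact code: the tokenise and score_block callables stand in for the model-specific tokeniser and the fine-tuned scoring model, and the toy usage at the end only demonstrates the mechanics.

```python
def split_into_blocks(sentences, tokenise, block_limit=512, sentence_limit=200):
    """Split a list of sentences into blocks of at most block_limit tokens,
    keeping sentences intact and truncating overlong sentences (sketch only)."""
    blocks, current, current_len = [], [], 0
    for sentence in sentences:
        tokens = tokenise(sentence)[:sentence_limit]   # truncate very long sentences
        if current and current_len + len(tokens) > block_limit:
            blocks.append(current)                     # start a new block
            current, current_len = [], 0
        current.append(tokens)
        current_len += len(tokens)
    if current:
        blocks.append(current)
    return blocks

def summarise(sentences, tokenise, score_block, summary_size=3):
    """Score each block separately, combine the scores and pick the top sentences."""
    scores = []
    for block in split_into_blocks(sentences, tokenise):
        scores.extend(score_block(block))              # one score per sentence in the block
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:summary_size]
    return [sentences[i] for i in sorted(top)]         # keep original sentence order

# toy usage with a whitespace tokeniser and a dummy scorer based on sentence length
demo = ["Short sentence.", "A somewhat longer sentence about the topic.", "Noise."]
print(summarise(demo, tokenise=str.split,
                score_block=lambda block: [len(tokens) for tokens in block],
                summary_size=1))
```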
BERTSum model alterations: The BERTSum model, which originally builds on just BERT, was altered to support multiple pre-trained models for generating sentence embeddings. Since some of the pre-trained models we intend to use, like RoBERTa, do not use segment embeddings for pre-training, and since the BERTSum paper [2] showed only small differences, we also removed the segment embedding from our altered BERTSum model. Support for the XLNet-specific feature of context sharing between blocks was implemented.

Data-loading: We utilise a different data-loading process than BERTSum, which uses a dataset already containing pre-computed word token representations specifically for BERT. As some of the other pre-trained models use different tokens, this method would require us to create a new dataset for each model, containing such pre-computed tokens. Instead, we used a PyTorch-style data loader that performs model-specific tokenisation during data-loading. This causes some computational overhead, which can be mitigated by utilising several worker processes.

Hyper-Parameters: Trying to find the optimal hyper-parameters was out of scope for this project. Instead we refer to previous work and use the hyper-parameters suggested by the authors of each pre-trained model's paper. The authors of the original paper on BERT [19] suggest that hyper-parameter tuning is of less importance when the fine-tuning dataset is large (as is the case for the CNN/DM dataset), which also matched our initial findings when trying different parameters. Therefore, we simply stick to the suggested hyper-parameters for each model.

Optimisers and Schedulers: In addition to the BERTSum optimiser/scheduler we also implemented support for the PyTorch implementation of AdamW with linear schedule decay. We used AdamW as the optimiser and scheduler for our experiments to better match how the models were originally pre-trained.

Loss Functions: BERTSum uses the summed binary cross entropy as its loss function, which is suitable for the binary classification task of sequence labeling. To also support the score-and-select training objective, we implemented support for a mean squared error (MSE) loss function as well.

Selection Layer: The BERTSum paper [2] explores several different selection layers and comes to the conclusion that a Transformer layer produces the best results. This is therefore what we used as well. The authors also used a tri-gram blocking scheme, which blocks the addition of a sentence to the summary if it contains a tri-gram that overlaps with the summary so far. This ensures a more diverse text and led to improved scores for the authors.

Checkpoint Averaging: During training, checkpoints of the model are saved at regular intervals. Checkpoint averaging is a method where a number of checkpoints are combined and averaged into a single, supposedly more robust, model. The authors of the BERTSum paper used multiple checkpoints of the model saved during training and combined the weights of the top 3 performing checkpoints (on the validation set) into the final model. We also employed this method.
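A minimal sketch of the checkpoint-averaging step is shown below. It assumes the checkpoints are saved as plain PyTorch state dicts; the file names are placeholders, and the real implementation follows the BERTSum code rather than this sketch.

```python
import torch

def average_checkpoints(paths):
    """Average the parameter tensors of several saved checkpoints into one
    state dict (sketch of the checkpoint-averaging idea)."""
    average = None
    for path in paths:
        state = torch.load(path, map_location="cpu")
        if average is None:
            average = {name: tensor.clone().float() for name, tensor in state.items()}
        else:
            for name, tensor in state.items():
                average[name] += tensor.float()
    return {name: tensor / len(paths) for name, tensor in average.items()}

# e.g. the three best-performing checkpoints on the validation set (placeholder names):
# averaged = average_checkpoints(["ckpt_1.pt", "ckpt_2.pt", "ckpt_3.pt"])
# model.load_state_dict(averaged)
```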
3.4 Hardware
Most of the training was done on the high-performance clusters of C3SE (Chalmers Centre for Computational Science and Engineering, https://www.c3se.chalmers.se/), the centre for scientific and technical computing at Chalmers University of Technology in Gothenburg, Sweden. C3SE is part of the Swedish National Infrastructure for Computing, SNIC (https://www.snic.se/). The training was performed on the GPU nodes of the Vera cluster, which are outfitted with Tesla V100 32GB GPUs. These GPUs support the half-precision float format (FP16, 16-bit floats), which allows for mixed-precision training. Utilising this can lead to significantly faster training. Mixed-precision training (https://docs.nvidia.com/deeplearning/sdk/mixed-precision-training/index.html) uses FP16 for operations, while important network information is stored in single precision (FP32). This reduces memory requirements and allows for larger models and batches. FP16 structures are also faster to access and transfer than FP32. The loss of precision, which can lead to small numbers being interpreted as 0, is combated by a technique called loss scaling, which helps preserve small gradients.

3.5 Evaluation
In this section we describe the metrics used to evaluate the models' performance.

3.5.1 ROUGE Evaluation
The three commonly used ROUGE metrics for evaluating summaries, ROUGE-1, ROUGE-2 and ROUGE-L, are used for evaluation. For each of these we use the F1 score, which combines precision and recall. For a more detailed description of these metrics, see Section 2.2. Several suites exist for performing ROUGE evaluation. We opted for a Python implementation of ROUGE (https://github.com/pltrdy/ROUGE). This implementation produces slightly different scores compared to the implementation used in the BERTSum paper. This hurts comparison against previous works, but it is faster, and as our goal required evaluating many models against each other, we deemed it the better option.

3.5.2 Sentence Similarity Evaluation
We also explored another evaluation metric based on sentence similarity. We hoped that this metric would be more accurate when, for example, comparing against abstractive summaries, since it does not require exact word matches as ROUGE does. We use SBERT (see Section 2.5.4.2) to produce an embedding for each sentence. These are then averaged into a single combined document embedding. This is a naive approach based on the same methodology that SBERT itself employs, in taking the mean of word embeddings to create the sentence embedding. The score for each summary is based on the cosine distance between the document embeddings of the reference summary and the generated candidate summary. To bring it into a similar range and thus make it more comparable with the other metrics, we take (1 − dist_cosine) · 100 as the final score.
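A rough sketch of this metric, again using the sentence-transformers library (the model name is an example of a pre-trained SBERT model, not necessarily the exact one used in our experiments):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bert-base-nli-mean-tokens")   # example SBERT model

def document_embedding(sentences):
    """Average the SBERT sentence embeddings into one document embedding."""
    return np.mean(model.encode(sentences), axis=0)

def similarity_score(reference_sentences, candidate_sentences):
    """(1 - cosine distance) * 100 between the two document embeddings,
    which is simply the cosine similarity scaled to roughly 0-100."""
    a = document_embedding(reference_sentences)
    b = document_embedding(candidate_sentences)
    cosine_similarity = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return 100.0 * cosine_similarity
```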
3.5.3 Evaluation on the CNN/DM dataset
The held-out test data of the CNN/DM dataset (see Section 3.2) was used to evaluate the models' performance. The fine-tuned models were used to generate three-sentence summaries, and these were evaluated against the abstractive summaries included in the CNN/DM dataset, using the methods outlined above. Having no explicitly extractive reference summaries likely limits the ROUGE score that can be reached, as there might not always be exact word matches between the provided abstractive summary and the generated extractive one. The sentence similarity metric, however, should not be as dependent on word matches. Evaluation was performed on both a truncated and a full version of the CNN/DM dataset. For the truncated version, the BERTSum token limit is enforced by simply truncating the texts. This will bias the results, as the model can only select from sentences that appear before this limit is reached. The full-dataset evaluation used the block splitting method introduced in Section 3.3.1 to allow the model to select from the full range of sentences.

3.5.4 Evaluation on the Academic Paper dataset
For this dataset the evaluation of the generated summaries was performed against the manually created reference summaries. Since these reference summaries are extractive, we also measured sentence overlap, in addition to the above evaluation methods. Sentence overlap is measured as follows: let Sc be the set of sentences in the candidate summary and Sr the set of sentences in the reference summary. Then

Score = |Sc ∩ Sr| / |Sc ∪ Sr|

3.5.5 Human Evaluation
Extensive human evaluation was not performed, due to constraints in time and resources. We did, however, want to have some measure of how well the different models performed by human standards. We obtained this in the following way: a random text was selected from the Academic Paper dataset, for which we evaluated and ranked the generated summaries of all models. Summaries were ranked on relevant sentence selection, cohesion and readability. More formally, the ranking was performed as follows: The origin of the generated summaries was obscured so as to not influence our rankings. We assigned each sentence a score of 1 if it was a "good" sentence (according to our subjective judgement); a sequence of good sentences was assigned an increasing score to capture a notion of cohesion. Neutral sentences were assigned a score of 0, and bad sentences a score of -1. Finally, we summed the sentence scores to give a final score for the summary. When performing the final rankings, tie breakers or close scores were resolved by subjectively assessing the whole summary on the above criteria.
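The sentence-overlap score from Section 3.5.4, read as the ratio of the sizes of the intersection and the union of the two sentence sets (a Jaccard-style score), can be sketched as follows; the example sentences are arbitrary.

```python
def sentence_overlap(candidate_sentences, reference_sentences):
    """Sentence overlap between an extractive candidate summary and an
    extractive reference summary: |Sc ∩ Sr| / |Sc ∪ Sr| (sketch)."""
    candidate, reference = set(candidate_sentences), set(reference_sentences)
    return len(candidate & reference) / len(candidate | reference)

print(sentence_overlap(["A.", "B.", "C."], ["B.", "C.", "D."]))   # 2 shared / 4 total = 0.5
```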
3.6 Experiments
We performed several experiments to evaluate and compare the pre-trained models against each other on the task of extractive text summarisation. The pre-trained models we decided to investigate are:

1. BERT
2. DistilBERT
3. RoBERTa
4. XLNet

Most of the models are also available in larger versions with a deeper network; we opted to only train the smaller "base" versions of the models. Previous work has shown gains for the large models, but for our purpose of comparing several models we decided that the base versions were a better choice because of the lower resource and training time requirements.

Experiment 1: BERTSum Reference
For the first experiment, no fine-tuning was performed. We only evaluated the BERTSum model published by Liu and Lapata [2] as is on the Academic Paper dataset, to obtain a baseline. Their model was trained for 50,000 iterations using 3 Nvidia 1080 Ti GPUs with a gradient accumulation of 2, resulting in a combined batch size of approximately 36. Training for 50,000 iterations with this batch size corresponded to approximately 6 epochs.

Experiments 2-7
For these experiments, we used some different parameters than for Experiment 1. The main differences concern the warm-up and weight decay: we opted to run for 4 epochs, using 10% of the total training steps as warm-up steps and a linear learning rate decay, motivated by the suggestions in each pre-trained model's paper, the time to train and resource availability. We decided to use the same batch size as BERTSum, 36. With the available hardware and using mixed-precision training, we were able to fit the entire batch onto one GPU without gradient accumulation for all models except one: when fine-tuning the XLNet model we could not fit an entire batch onto a single GPU, so it was instead trained using two GPUs.

For these experiments, the pre-trained models were fine-tuned as sequence-labeling models using the binary label data for the CNN/DM dataset, as described in Section 3.2.1. We evaluated their performance on the held-out portion of the dataset. Additionally, we evaluated their performance on our Academic Paper dataset, to measure how well the models would transfer to the new task. The models used in our experiments were the following:

• Experiment 2: The BERT base pre-trained model, which has 12 encoding layers, 12 attention heads and a total of 110M parameters. This model will in the following be referred to as "BERT".

• Experiment 3: The RoBERTa base model, which has 12 encoding layers, 12 attention heads and 125M parameters. This model will in the following be referred to as "RoBERTa".

• Experiment 4: The DistilBERT base model, which has 6 encoding layers, 12 attention heads and a total of 66M parameters. This model will in the following be referred to as "DistilBERT".

• Experiment 5: The XLNet base model, which has 12 encoding layers, 12 attention heads and a total of 110M parameters. This model will in the following be referred to as "XLNet".