Using pre-trained language models for extractive text summarisation of academic papers
Type
Master's thesis
Program
Computer systems and networks (MPCSN), MSc
Published
2020
Authors
Hermansson, Erik
Boddien, Charlotte
Abstract
Given the overwhelming amount of textual information on the internet and elsewhere,
the automatic creation of high-quality summaries is becoming increasingly
important. With the development of neural networks and pre-trained models within
natural language processing in recent years, such as BERT and its derivatives, the
field of automatic text summarisation has seen a lot of progress. These models are
pre-trained on large amounts of data for general linguistic knowledge and then
fine-tuned for specific tasks. Datasets are a limiting factor for training summarisation
models, as they require large numbers of manually written summaries. Most
current summarisation models have been trained and evaluated on textual data
from the news domain. However, pre-trained models fine-tuned on news data could
potentially generalise and perform well on other kinds of text as well.
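As a rough illustration of the pre-train/fine-tune paradigm described above, the sketch below loads a pre-trained encoder and scores sentences for inclusion in an extractive summary. It is not the authors' exact pipeline; it assumes the HuggingFace transformers and torch packages, uses the hypothetical checkpoint bert-base-uncased as a stand-in for any of the studied models, and the classifier head would in practice be fine-tuned on a labelled news summarisation dataset.

```python
# Minimal sketch (not the thesis pipeline): pre-trained encoder + sentence scorer.
import torch
from transformers import AutoTokenizer, AutoModel

MODEL_NAME = "bert-base-uncased"  # assumption; could equally be a RoBERTa or XLNet checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
encoder = AutoModel.from_pretrained(MODEL_NAME)

# Hypothetical sentence-level head; in a real setup it is fine-tuned on
# labelled summaries (e.g. news articles paired with highlights).
classifier = torch.nn.Linear(encoder.config.hidden_size, 1)

def score_sentences(sentences):
    """Return one inclusion score per sentence."""
    scores = []
    for sent in sentences:
        inputs = tokenizer(sent, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = encoder(**inputs).last_hidden_state[:, 0]  # [CLS] embedding
        scores.append(classifier(hidden).item())
    return scores

sentences = [
    "Pre-trained language models capture general linguistic knowledge.",
    "They can be fine-tuned for tasks such as extractive summarisation.",
    "The summary is built from the highest-scoring sentences.",
]
ranked = sorted(zip(score_sentences(sentences), sentences), reverse=True)
print([s for _, s in ranked[:2]])  # top-2 sentences form the extractive summary
```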
The main objective of this thesis is to investigate the suitability of several pre-trained
language models for automatic text summarisation. The chosen models were fine-tuned
on readily available news data and evaluated on a very different dataset of
academic texts to determine their ability to generalise.
There were only slight differences between the models on the news data. More
interestingly, the results on the academic texts showed significant differences between
the models. They indicate that the more robustly pre-trained models
generalise better and, according to the metrics, perform quite well. However,
human evaluation puts this into question, showing that even high-scoring
summaries did not necessarily read well. This highlights the need for better evaluation
methods and metrics.
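The gap between metric scores and human judgement noted above is easy to reproduce. The abstract does not name the metrics used, but ROUGE is the de-facto standard for summarisation; the sketch below, assuming the rouge-score package, shows how a summary with scrambled, unreadable word order can still achieve a perfect ROUGE-1 score because the metric only counts n-gram overlap with a reference.

```python
# Minimal sketch assuming the rouge-score package (an assumption; the thesis
# metrics are not named here). ROUGE measures n-gram overlap only, so a
# high score does not guarantee a readable summary.
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "the models generalise well to academic texts"
candidate = "academic texts the models to well generalise"  # scrambled word order

print(scorer.score(reference, candidate))
# ROUGE-1 is perfect despite the candidate being unreadable; ROUGE-2 and
# ROUGE-L drop, but none of them directly measure fluency.
```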
Subject/keywords
natural language processing, nlp, machine learning, deep learning, automatic text summarisation, extractive summarisation, transformer, bert, roberta, xlnet