Using pre-trained language models for extractive text summarisation of academic papers

Type
Master's thesis
Programme
Computer systems and networks (MPCSN), MSc
Published
2020
Authors
Hermansson, Erik
Boddien, Charlotte
Abstract
Given the overwhelming amount of textual information on the internet and elsewhere, the automatic creation of high-quality summaries is becoming increasingly important. With the development of neural networks and pre-trained models within natural language processing in recent years, such as BERT and its derivatives, the field of automatic text summarisation has seen considerable progress. These models are pre-trained on large amounts of data to acquire general language knowledge and are then fine-tuned for specific tasks. Datasets are a limiting factor for training summarisation models, as they require a large number of manually written summaries. Most current summarisation models have been trained and evaluated on textual data drawn largely from the news domain. However, pre-trained models fine-tuned on news data could potentially generalise and perform well on other kinds of text. The main objective of this thesis is to investigate the suitability of several pre-trained language models for automatic text summarisation. The chosen models were fine-tuned on readily available news data and evaluated on a very different dataset of academic texts to determine their ability to generalise. On the news data there were only slight differences between the models, but the results on the academic texts showed significant differences. They indicate that the more robustly pre-trained models generalise better and, according to the metrics, perform quite well. However, human evaluation puts this into question, showing that even high-scoring summaries did not necessarily read well. This highlights the need for better evaluation methods and metrics.
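To illustrate the kind of pipeline the abstract describes, the sketch below selects summary sentences with a pre-trained encoder and scores the result against a reference with ROUGE. This is a minimal, hedged example: the checkpoint name (bert-base-uncased), the naive sentence splitter, the centroid-based selection heuristic, the value of top_k and the placeholder texts are assumptions made for illustration, not the fine-tuned models, data or evaluation protocol used in the thesis.

```python
# Illustrative sketch: centroid-based extractive selection with a pre-trained
# encoder, followed by ROUGE scoring. All names and texts below are assumed
# placeholders for the example, not the thesis setup.
import torch
from transformers import AutoTokenizer, AutoModel
from rouge_score import rouge_scorer

MODEL_NAME = "bert-base-uncased"  # assumed checkpoint; the thesis compares BERT, RoBERTa and XLNet

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()


def extract_summary(document: str, top_k: int = 3) -> str:
    """Pick the top_k sentences whose [CLS] embeddings lie closest to the document centroid."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]  # naive splitter, illustration only
    inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state
    sent_emb = hidden[:, 0]                        # [CLS] vector per sentence
    centroid = sent_emb.mean(dim=0, keepdim=True)  # rough whole-document representation
    sims = torch.nn.functional.cosine_similarity(sent_emb, centroid, dim=1)
    chosen = sorted(sims.topk(min(top_k, len(sentences))).indices.tolist())
    return ". ".join(sentences[i] for i in chosen) + "."


# ROUGE evaluation of the extracted summary against a (placeholder) reference.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
candidate = extract_summary(
    "Some long academic text about a method. It reports several experiments. It also discusses limitations."
)
reference = "A short reference summary of the method and its experiments."
print(scorer.score(reference, candidate))
```

The mismatch the abstract points to, summaries that score well on such metrics yet read poorly to humans, is exactly why automatic scores of this kind are complemented with human evaluation.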
Subject / keywords
natural language processing, nlp, machine learning, deep learning, automatic text summarisation, extractive summarisation, transformer, bert, roberta, xlnet