Deep learning for post-OCR error correction on Swedish texts
Type
Master's thesis
Abstract
As society becomes increasingly digital, the need to digitize physical documents
and texts also increases. The most common technology for this purpose is Optical
Character Recognition (OCR). Today’s OCR systems are unable to guarantee a
totally accurate scan. The quality of digitization varies and is often negatively
impacted by features of the source material. Post-OCR correction is often performed
on the text produced by the system with the aim of correcting any errors that are
present. To our knowledge, there is currently no neural machine learning based post-OCR correction model available for Swedish. The purpose of this thesis is to develop and train a
neural machine learning post-OCR correction model on a set of digitized and OCRed
Swedish newspaper texts. When developing the model we took advantage of machine
translation techniques as we view the problem as translating incorrect text to correct
text. Several configurations of the model were tested, and the model improved
all evaluation metrics on the held-out validation and test sets.
These improvements are, however, rather small: the model corrects only certain
errors while missing many others. Additionally, the system sometimes introduces
new errors. While the results show improvement, they are not entirely satisfactory
and we believe that additional tuning of hyperparameters and further research into
synthetic data generation could lead to better results.
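
The abstract does not name the evaluation metrics used. As a minimal sketch, assuming character error rate (CER) is one reasonable choice for post-OCR correction, the comparison between raw OCR output and corrected output against a ground-truth transcription might look like the following. The function names and the example strings are hypothetical and are not taken from the thesis.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep on a match
            ))
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(hypothesis, reference) / len(reference)


# Hypothetical example: ground truth, raw OCR output, and a partial correction.
reference = "Svenska tidningar"
ocr_output = "Svcnska tidn1ngar"   # two character substitutions
corrected = "Svenska tidn1ngar"    # one error fixed, one left in place

cer_before = cer(ocr_output, reference)
cer_after = cer(corrected, reference)
improved = cer_after < cer_before
```

A correction model helps on a given line only when the CER after correction is lower than before; a model that introduces new errors, as the abstract notes can happen, would raise it instead.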
Subject/keywords
Computer Science, Thesis, Machine Learning, Neural Networks, Deep Learning, Natural Language Processing, OCR, Post-OCR, Swedish