Deep learning for post-OCR error correction on Swedish texts

Lundberg, Arvid; Torstensson, Mattias

dc.contributor.author	Lundberg, Arvid
dc.contributor.author	Torstensson, Mattias
dc.contributor.department	Chalmers tekniska högskola / Institutionen för data och informationsteknik	sv
dc.contributor.examiner	Johansson, Richard
dc.contributor.supervisor	Dannélls, Dana
dc.date.accessioned	2021-07-09T10:39:55Z
dc.date.available	2021-07-09T10:39:55Z
dc.date.issued	2021	sv
dc.date.submitted	2020
dc.description.abstract	As society becomes increasingly digital, the need to digitize physical documents and texts also increases. The most common technology for this purpose is Optical Character Recognition (OCR). Today’s OCR systems cannot guarantee a fully accurate scan. The quality of digitization varies and is often negatively affected by features of the source material. Post-OCR correction is therefore often applied to the text produced by the system, with the aim of correcting any errors that are present. To our knowledge, there is currently no neural machine learning based post-OCR correction model available for Swedish. The purpose of this thesis is to develop and train a neural machine learning post-OCR correction model on a set of digitized and OCRed Swedish newspaper texts. When developing the model, we took advantage of machine translation techniques, as we view the problem as translating incorrect text into correct text. Several configurations of the model were tested, and the model improved all evaluation metrics on the withheld validation and test sets. These improvements are, however, rather small: the model corrects certain errors while skipping many others. Additionally, the system sometimes introduces new errors. While the results show improvement, they are not entirely satisfactory, and we believe that additional tuning of hyperparameters and further research into synthetic data generation could lead to better results.	sv
dc.identifier.coursecode	MPDSC	sv
dc.identifier.uri	https://hdl.handle.net/20.500.12380/303714
dc.language.iso	eng	sv
dc.setspec.uppsok	Technology
dc.subject	Computer Science	sv
dc.subject	Thesis	sv
dc.subject	Machine Learning	sv
dc.subject	Neural Networks	sv
dc.subject	Deep Learning	sv
dc.subject	Natural Language Processing	sv
dc.subject	OCR	sv
dc.subject	Post-OCR	sv
dc.subject	Swedish	sv
dc.title	Deep learning for post-OCR error correction on Swedish texts	sv
dc.type.degree	Examensarbete för masterexamen	sv
dc.type.uppsok	H

Original bundle

Name: CSE 21-66 Lundberg Torstensson.pdf
Size: 1.9 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 1.51 KB
Format: Item-specific license agreed to upon submission

Collections

Examensarbeten för masterexamen (Master's theses)