Deep learning for post-OCR error correction on Swedish texts

dc.contributor.author: Lundberg, Arvid
dc.contributor.author: Torstensson, Mattias
dc.contributor.department: Chalmers tekniska högskola / Institutionen för data och informationsteknik
dc.contributor.examiner: Johansson, Richard
dc.contributor.supervisor: Dannélls, Dana
dc.date.accessioned: 2021-07-09T10:39:55Z
dc.date.available: 2021-07-09T10:39:55Z
dc.date.issued: 2021
dc.date.submitted: 2020
dc.description.abstract: As society becomes increasingly digital, the need to digitize physical documents and texts also increases. The most common technology for this purpose is Optical Character Recognition (OCR). Today's OCR systems cannot guarantee a fully accurate scan. The quality of digitization varies and is often negatively affected by features of the source material. Post-OCR correction is therefore often performed on the text produced by the system, with the aim of correcting any errors that are present. To our knowledge, there is currently no neural machine-learning-based post-OCR correction model available for Swedish. The purpose of this thesis is to develop and train a neural post-OCR correction model on a set of digitized and OCRed Swedish newspaper texts. When developing the model we took advantage of machine translation techniques, as we view the problem as translating incorrect text into correct text. Several configurations of the model were tested, and the model improved all evaluation metrics on the withheld validation and test sets. These improvements are, however, rather small: the model corrects only certain errors while skipping many others, and it sometimes introduces new errors. While the results show improvement, they are not entirely satisfactory, and we believe that additional tuning of hyperparameters and further research into synthetic data generation could lead to better results.
dc.identifier.coursecode: MPDSC
dc.identifier.uri: https://hdl.handle.net/20.500.12380/303714
dc.language.iso: eng
dc.setspec.uppsok: Technology
dc.subject: Computer Science
dc.subject: Thesis
dc.subject: Machine Learning
dc.subject: Neural Networks
dc.subject: Deep Learning
dc.subject: Natural Language Processing
dc.subject: OCR
dc.subject: Post-OCR
dc.subject: Swedish
dc.title: Deep learning for post-OCR error correction on Swedish texts
dc.type.degree: Master's thesis (Examensarbete för masterexamen)
dc.type.uppsok: H
Download
Original bundle
Name: CSE 21-66 Lundberg Torstensson.pdf
Size: 1.9 MB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.51 KB
Format: Item-specific license agreed upon to submission