OCR correction of Swedish newspaper texts using deep CNN–LSTM neural networks

dc.contributor.author: BRANDT SKELBYE, MOLLY
dc.contributor.department: Chalmers tekniska högskola / Institutionen för data och informationsteknik
dc.contributor.examiner: Johansson, Moa
dc.contributor.supervisor: Dannélls, Dana
dc.date.accessioned: 2021-08-18T07:39:51Z
dc.date.available: 2021-08-18T07:39:51Z
dc.date.issued: 2021
dc.date.submitted: 2020
dc.description.abstract: Optical Character Recognition (OCR) refers to the technology used to convert digital documents into machine-readable, searchable and editable text data. As a means of boosting efficiency, OCR has been an important step in the large-scale digitization of paper-based collections, such as historical newspapers, which are an integral part of our cultural heritage. OCR has improved significantly, partly due to the wide adoption of deep learning algorithms. Today, OCR reaches high accuracy rates for modern prints. However, it remains a challenging task for historical prints, and applying OCR to historical documents often results in output of poor quality. This is also the case for the OCR quality of the historical Swedish newspaper texts of the KubHist corpus. State-of-the-art OCR systems are generally not adapted to the diverse historical domain. Instead, a successful approach to achieving acceptable OCR accuracy for historical text is to train individual character recognition models using deep CNN–LSTM hybrid neural networks. This method has been shown to outperform previous methods based on shallow LSTM neural networks. In this thesis work, we have trained models based on deep CNN–LSTM hybrid networks, using the OCR engine Calamari. A new state-of-the-art result on 19th century Swedish newspaper text was achieved, with an average character accuracy rate (CAR) of 97.43%. In addition, we have demonstrated that cross-fold training, in combination with confidence-based voting, further improves the results.
dc.identifier.uri: https://hdl.handle.net/20.500.12380/303910
dc.language.iso: eng
dc.setspec.uppsok: Technology
dc.subject: deep learning
dc.subject: convolutional neural network (CNN)
dc.subject: optical character recognition (OCR)
dc.subject: long short-term memory network (LSTM)
dc.subject: computer science
dc.title: OCR correction of Swedish newspaper texts using deep CNN–LSTM neural networks
dc.type.degree: Master's thesis
dc.type.uppsok: H
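
The abstract reports results as a character accuracy rate (CAR) but this record does not say how it was computed. The sketch below assumes a common convention, CAR = 1 - (edit distance / reference length), with unit-cost Levenshtein distance. The function names and the sample strings are hypothetical and chosen only for illustration; the thesis may define or normalize the metric differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance (insert, delete, substitute) between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution (0 if chars match)
            ))
        previous = current
    return previous[-1]


def character_accuracy_rate(reference: str, hypothesis: str) -> float:
    """Assumed definition: 1 - edit_distance / len(reference)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return 1.0 - levenshtein(reference, hypothesis) / len(reference)


# Hypothetical example: one substituted character in a seven-character word.
car = character_accuracy_rate("tidning", "tidnlng")
print(f"CAR = {car:.4f}")
```

Under this definition a hypothesis with many spurious insertions can yield a negative value, so some evaluations clip the result at zero; whether the thesis does so is not stated here.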
Download
Original bundle
Name: CSE 21-122 Brandt Skelbye.pdf
Size: 221.04 KB
Format: Adobe Portable Document Format