OCR correction of Swedish newspaper texts using deep CNN–LSTM neural networks

Master's thesis

Please use this identifier to cite or link to this item: https://hdl.handle.net/20.500.12380/303910
Download file(s):
File: CSE 21-122 Brandt Skelbye.pdf; Size: 221.04 kB; Format: Adobe PDF
Bibliographical item details
Type: Master's thesis
Title: OCR correction of Swedish newspaper texts using deep CNN–LSTM neural networks
Abstract: Optical Character Recognition (OCR) refers to the technology used to convert digital documents into machine-readable, searchable and editable text data. As a means of boosting efficiency, OCR has been an important step in the large-scale digitization of paper-based collections, such as historical newspapers, which are an integral part of our cultural heritage. OCR has improved significantly, partly due to the wide adoption of deep learning algorithms, and today reaches high accuracy rates for modern prints. However, it remains a challenging task for historical prints, so applying OCR to historical documents often results in output of poor quality. This is also the case for the historical Swedish newspaper texts of the KubHist corpus. State-of-the-art OCR systems are generally not adapted to the diverse historical domain. Instead, a successful approach to achieving acceptable OCR accuracy for historical text is to train individual character recognition models using deep CNN–LSTM hybrid neural networks. This method has been shown to outperform previous methods based on shallow LSTM neural networks. In this thesis work, we have trained models based on deep CNN–LSTM hybrid networks using the OCR engine Calamari. A new state-of-the-art result on 19th century Swedish newspaper text was achieved, with an average character accuracy rate (CAR) of 97.43%. In addition, we have demonstrated that cross-fold training, in combination with confidence-based voting, improves the results further.
Keywords: deep learning; convolutional neural network (CNN); optical character recognition (OCR); long short-term memory network (LSTM); computer science
Issue Date: 2021
Publisher: Chalmers tekniska högskola / Institutionen för data och informationsteknik
URI: https://hdl.handle.net/20.500.12380/303910
Collection: Examensarbeten för masterexamen // Master Theses
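
Illustrative note: the abstract reports results as a character accuracy rate (CAR) and mentions confidence-based voting across models from cross-fold training. The sketch below is not taken from the thesis or from Calamari. It assumes CAR is defined as one minus the Levenshtein distance divided by the ground-truth length, and it assumes the per-model predictions have already been aligned position by position, which a real voter would have to do itself. The function names `character_accuracy_rate` and `confidence_vote` are hypothetical.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy_rate(ground_truth: str, ocr_output: str) -> float:
    """Assumed definition: CAR = 1 - edit_distance / len(ground_truth)."""
    if not ground_truth:
        raise ValueError("ground truth must be non-empty")
    return 1.0 - levenshtein(ground_truth, ocr_output) / len(ground_truth)


def confidence_vote(aligned_predictions) -> str:
    """Pick, per position, the character with the highest summed confidence.

    aligned_predictions: one entry per character position; each entry is a
    list of (character, confidence) pairs, one pair per model (assumed to be
    pre-aligned).
    """
    result = []
    for position in aligned_predictions:
        totals = defaultdict(float)
        for char, conf in position:
            totals[char] += conf
        result.append(max(totals, key=totals.get))
    return "".join(result)
```

For example, `character_accuracy_rate("tidning", "tidnlng")` has one substitution in seven characters, giving a CAR of 6/7. Summing confidences rather than counting votes lets two moderately confident models outweigh one model that is confident but wrong.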
