OCR correction of Swedish newspaper texts using deep CNN–LSTM neural networks
Master's thesis
BRANDT SKELBYE, MOLLY
Optical Character Recognition (OCR) refers to the technology used to convert digital documents into machine-readable, searchable and editable text data. As a means of boosting efficiency, OCR has been an important step in the large-scale digitization of paper-based collections, such as historical newspapers, which are an integral part of our cultural heritage. OCR has improved significantly, partly due to the wide adoption of deep learning algorithms, and today reaches high accuracy rates for modern prints. However, it remains a challenging task for historical prints, so applying OCR to historical documents often results in output of poor quality. This is also the case for the historical Swedish newspaper texts of the KubHist corpus. State-of-the-art OCR systems are generally not adapted to the diverse historical domain. Instead, a successful approach to achieving acceptable OCR accuracy for historical text is to train individual character recognition models using deep CNN–LSTM hybrid neural networks. This method has been shown to outperform previous methods based on shallow LSTM neural networks. In this thesis work, we have trained models based on deep CNN–LSTM hybrid networks, using the OCR engine Calamari. A new state-of-the-art result on 19th century Swedish newspaper text was achieved, with an average character accuracy rate (CAR) of 97.43%. In addition, we have demonstrated that cross-fold training, in combination with confidence-based voting, further improves the results.
deep learning, convolutional neural network (CNN), optical character recognition (OCR), long short-term memory network (LSTM), computer science
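
The abstract reports results as a character accuracy rate (CAR) but does not spell out the formula. As a minimal sketch, assuming the common definition CAR = 1 - (edit distance / number of ground-truth characters), the metric could be computed as below. The function names `levenshtein` and `character_accuracy_rate` are illustrative and are not taken from the thesis or from Calamari; the thesis may define or average the metric differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def character_accuracy_rate(ground_truth: str, ocr_output: str) -> float:
    """Assumed definition: 1 - edit_distance / len(ground_truth).

    The value can become negative if the OCR output contains many
    spurious characters; it is not clamped here.
    """
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    errors = levenshtein(ground_truth, ocr_output)
    return 1.0 - errors / len(ground_truth)


if __name__ == "__main__":
    # One substituted character in a seven-character word.
    print(character_accuracy_rate("tidning", "tidnlng"))
```

Under this assumed definition, a CAR of 97.43% corresponds to roughly 2.6 character errors per 100 ground-truth characters.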