OCR correction of Swedish newspaper texts using deep CNN–LSTM neural networks
Type
Master's thesis
Published
2021
Author
BRANDT SKELBYE, MOLLY
Abstract
Optical Character Recognition (OCR) refers to the technology used to convert
digital documents into machine-readable, searchable and editable text data.
As a means of boosting efficiency, OCR has been an important step in the large-scale
digitization of paper-based collections, such as historical newspapers, which are an
integral part of our cultural heritage. OCR has improved significantly, partly due to the
wide adoption of deep learning algorithms. Today, OCR reaches high accuracy rates for modern
prints. However, it remains a challenging task for historical prints, so applying
OCR to historical documents often results in output of poor quality. This is also the case
for the OCR quality of the historical Swedish newspaper texts in the KubHist corpus.
State-of-the-art OCR systems are generally not adapted to the diverse historical domain. Instead,
to achieve an acceptable OCR accuracy for historical text, a successful
approach is to train individual character recognition models using deep CNN–LSTM hybrid
neural networks. This method has been shown to outperform previous methods
based on shallow LSTM neural networks. In this thesis work, we have trained models
based on deep CNN–LSTM hybrid networks, using the OCR engine Calamari. A new
state-of-the-art result on 19th century Swedish newspaper text was achieved, with an
average character accuracy rate (CAR) of 97.43%. In addition, we have demonstrated that
cross-fold training, in combination with confidence-based voting, further improves the
results.
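
The two evaluation ideas named in the abstract can be illustrated with a minimal Python sketch. It is not taken from the thesis or from Calamari. It assumes that CAR is defined as one minus the Levenshtein edit distance divided by the reference length, which is a common convention but is not stated in the abstract. The voting function is a simplified stand-in: it assumes the outputs of the fold models are already aligned position by position and that each position carries a single (character, confidence) pair. The function names `levenshtein`, `character_accuracy_rate` and `confidence_vote` are hypothetical.

```python
from collections import defaultdict


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy_rate(reference: str, hypothesis: str) -> float:
    """Assumed definition: CAR = 1 - edit_distance / len(reference)."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return 1.0 - levenshtein(reference, hypothesis) / len(reference)


def confidence_vote(predictions: list[list[tuple[str, float]]]) -> str:
    """Pick, per position, the character with the highest summed confidence.

    Simplifying assumption: every model output has the same length and is
    already aligned, so position k refers to the same glyph in all outputs.
    """
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all model outputs must have the same length")
    voted = []
    for pos in range(length):
        scores: dict[str, float] = defaultdict(float)
        for model_output in predictions:
            char, confidence = model_output[pos]
            scores[char] += confidence
        voted.append(max(scores, key=scores.get))
    return "".join(voted)


if __name__ == "__main__":
    # One substituted character in a seven-character reference.
    print(character_accuracy_rate("tidning", "tidnlng"))
    # One confident model outweighs two unsure models at the only position.
    folds = [[("a", 0.9)], [("o", 0.4)], [("o", 0.4)]]
    print(confidence_vote(folds))
```

Summing confidences rather than counting votes lets a single confident model override several uncertain ones, which is the intuition behind confidence-based voting. A real system would also need to align outputs of different lengths before voting.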
Subject/keywords
deep learning, convolutional neural network (CNN), optical character recognition (OCR), long short-term memory network (LSTM), computer science