OCR correction of Swedish newspaper texts using deep CNN–LSTM neural networks

BRANDT SKELBYE, MOLLY

Download

CSE 21-122 Brandt Skelbye.pdf (221.04 KB)

Published

2021

Author

BRANDT SKELBYE, MOLLY

Type

Master's thesis

Abstract

Optical Character Recognition (OCR) refers to the technology used to convert digital documents into machine-readable, searchable and editable text data. As a means of boosting efficiency, OCR has been an important step in the large-scale digitization of paper-based collections, such as historical newspapers, which are an integral part of our cultural heritage. OCR has improved significantly, partly due to the wide adoption of deep learning algorithms, and today it reaches high accuracy rates for modern prints. It remains a challenging task for historical prints, however, so applying OCR to historical documents often results in output of poor quality. This is also the case for the historical Swedish newspaper texts of the KubHist corpus. State-of-the-art OCR systems are generally not adapted to the diverse historical domain. Instead, a successful approach to achieving acceptable OCR accuracy for historical text is to train individual character recognition models using deep CNN–LSTM hybrid neural networks. This method has been shown to outperform previous methods based on shallow LSTM neural networks. In this thesis work, we have trained models based on deep CNN–LSTM hybrid networks, using the OCR engine Calamari. A new state-of-the-art result on 19th-century Swedish newspaper text was achieved, with an average character accuracy rate (CAR) of 97.43%. In addition, we have demonstrated that cross-fold training, in combination with confidence-based voting, further improves the results.
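The two evaluation ideas named in the abstract, character accuracy rate and confidence-based voting, can be illustrated with a minimal sketch. It is not the thesis's or Calamari's implementation. It assumes CAR is defined as one minus the Levenshtein distance divided by the ground-truth length, which is a common convention, and it assumes the outputs of the individual models are already aligned position by position. The function names `levenshtein`, `character_accuracy_rate` and `confidence_vote` are hypothetical.

```python
from typing import Dict, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy_rate(ground_truth: str, ocr_output: str) -> float:
    """Assumed definition: CAR = 1 - edit_distance / len(ground_truth)."""
    if not ground_truth:
        raise ValueError("ground truth must be non-empty")
    return 1.0 - levenshtein(ground_truth, ocr_output) / len(ground_truth)


def confidence_vote(predictions: List[List[Tuple[str, float]]]) -> str:
    """Pick, per position, the character with the highest summed confidence.

    Each inner list holds one model's (character, confidence) pairs.
    Assumption: all models' outputs are pre-aligned and of equal length.
    """
    voted = []
    for position in zip(*predictions):
        totals: Dict[str, float] = {}
        for char, conf in position:
            totals[char] = totals.get(char, 0.0) + conf
        voted.append(max(totals, key=totals.get))
    return "".join(voted)
```

In this sketch a character that several models agree on with moderate confidence can outvote a single model's more confident alternative, which is the intuition behind combining models trained on different folds.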

Subject/keywords

deep learning, convolutional neural network (CNN), optical character recognition (OCR), long short–term memory network (LSTM), computer science

URI

https://hdl.handle.net/20.500.12380/303910

Collections

Master's theses
