OCR post-processing of historical Swedish text using machine learning techniques

Type

Master's thesis

Abstract

We present an OCR post-processing method that uses machine learning techniques and targets historical Swedish texts. The method is developed completely independently of any OCR tool, with the aim of avoiding bias towards any single tool. Its purpose is to solve the two main problems of OCR post-processing: detecting and correcting errors introduced by the OCR tool. The method is divided into two main parts, each of which solves one of these problems. Error detection is handled by a Support Vector Machine (SVM) that classifies each word as either valid or erroneous. For the SVM to classify the words, each word is converted into a feature vector containing several word features that indicate the validity of the word. The error correction part of the method takes the words that have been classified as erroneous and tries to replace them with the correct word. The error correction algorithm is based on Levenshtein edit distance combined with frequency wordlists. We OCR-processed a collection of 400 documents from the 19th century using three OCR tools (Ocropus, Tesseract and ABBYY) and used the output from each tool to develop our method. Experiments and evaluations were carried out against the ground truth of the documents. The method is built in a modular fashion, and each module was evaluated separately. We report quantitative and qualitative results showing varying degrees of OCR post-processing complexity.
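
The error correction step described in the abstract can be illustrated with a minimal sketch. It assumes that a word already flagged as erroneous is replaced by the wordlist entry with the smallest Levenshtein distance, with ties broken by frequency. The function names, the `max_distance` threshold, and the toy wordlist with its counts are hypothetical choices for illustration and are not taken from the thesis.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Levenshtein edit distance via row-by-row dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def correct(word: str, frequencies: Counter, max_distance: int = 2) -> str:
    """Return the closest wordlist entry, preferring frequent words on ties.

    The word is returned unchanged when no entry lies within max_distance
    (an assumed cut-off, not a value from the thesis).
    """
    best = word
    best_key = None
    for candidate, count in frequencies.items():
        distance = levenshtein(word, candidate)
        if distance > max_distance:
            continue
        key = (distance, -count)  # smaller distance first, then higher count
        if best_key is None or key < best_key:
            best, best_key = candidate, key
    return best


# Hypothetical toy frequency wordlist; the counts are made up.
WORDLIST = Counter({"och": 120, "att": 95, "konung": 12, "konungen": 7})
```

With this toy wordlist, `correct("konnng", WORDLIST)` selects `konung`, which is one substitution away. A word with no candidate within the threshold is left as it is, which keeps the sketch from forcing a replacement onto names or rare words that the wordlist does not cover.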

Subject/keywords

machine-learning, NLP, OCR, post-processing
