OCR post-processing of historical Swedish text using machine learning techniques
Type
Master's thesis
Abstract
We present an OCR post-processing method that utilizes machine learning techniques
and is targeted at historical Swedish texts. The method is developed completely
independently of any OCR-tool, with the aim of avoiding bias towards any single
OCR-tool. The purpose of the method is to solve the two main problems of OCR
post-processing, i.e. detecting and correcting errors caused by the OCR-tool. Our
method is divided into two main parts, each of which solves one of these problems. Error
detection is handled by a Support Vector Machine (SVM) that classifies each word as
either valid or erroneous. In order for the SVM to classify the words, each word is
converted into a feature vector that contains several word features indicating the
validity of the word. The error correction part of the method takes the words that
have been classified as erroneous and tries to replace them with the correct word.
The error correction algorithm is based on Levenshtein edit distance combined
with frequency wordlists. We OCR-processed a collection of 400 documents from
the 19th century using three OCR-tools: Ocropus, Tesseract and ABBYY, and used
the output from each tool to develop our method. Experiments and evaluations
were carried out against the ground truth of the documents. The method is
built in a modular fashion and evaluation was performed on each module. We report
quantitative and qualitative results showing varying degrees of OCR post-processing
complexity.
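
The error correction idea described above (Levenshtein edit distance combined with a frequency wordlist) can be illustrated with a minimal Python sketch. This is not the thesis implementation: the function names, the `max_distance` threshold, the tie-breaking rule (smallest distance first, then highest frequency) and the toy wordlist `FREQ` are all assumptions made for illustration only.

```python
from typing import Dict, Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def correct(word: str, freq: Dict[str, int], max_distance: int = 2) -> Optional[str]:
    """Return the closest wordlist entry within max_distance, or None.

    Assumed ranking: smaller edit distance wins; ties are broken by
    higher frequency in the wordlist.
    """
    best = None
    best_key = None
    for candidate, count in freq.items():
        distance = levenshtein(word, candidate)
        if distance > max_distance:
            continue
        key = (distance, -count)
        if best_key is None or key < best_key:
            best, best_key = candidate, key
    return best


# Hypothetical toy frequency wordlist; the counts are invented.
FREQ = {"och": 950, "ock": 40, "konung": 120, "kyrka": 80}

if __name__ == "__main__":
    for token in ["oeh", "konnng", "oc", "xyzzy"]:
        print(token, "->", correct(token, FREQ))
```

In this sketch, a word flagged as erroneous by the detection step would be passed to `correct`; a result of `None` means no wordlist entry lies within the assumed distance threshold, so the word would be left unchanged.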
Subject/keywords
machine-learning, NLP, OCR, post-processing