OCR post-processing of historical Swedish text using machine learning techniques

dc.contributor.author: Persson, Simon
dc.contributor.department: Chalmers tekniska högskola / Institutionen för data och informationsteknik
dc.contributor.examiner: Angelov, Krasimir
dc.date.accessioned: 2019-12-19T08:56:16Z
dc.date.available: 2019-12-19T08:56:16Z
dc.date.issued: 2019
dc.date.submitted: 2019
dc.description.abstract: We present an OCR post-processing method that uses machine learning techniques and targets historical Swedish texts. The method is developed completely independently of any OCR tool, with the aim of avoiding bias towards any single tool. The purpose of the method is to solve the two main problems of OCR post-processing, i.e. detecting and correcting errors caused by the OCR tool. Our method is divided into two main parts, each of which solves one of these problems. Error detection is handled by a Support Vector Machine (SVM) that classifies each word as either valid or erroneous. For the SVM to classify the words, each word is converted into a feature vector containing several word features that indicate the validity of the word. The error correction part of the method takes the words that have been classified as erroneous and tries to replace them with the correct word. The error correction algorithm is based on Levenshtein edit distance combined with frequency wordlists. We OCR-processed a collection of 400 documents from the 19th century using three OCR tools (Ocropus, Tesseract and ABBYY) and used the output from each tool to develop our method. Experiments and evaluations were carried out against the ground truth of the documents. The method is built in a modular fashion, and each module was evaluated separately. We report quantitative and qualitative results showing varying degrees of OCR post-processing complexity.
dc.identifier.coursecode: DATX05
dc.identifier.uri: https://hdl.handle.net/20.500.12380/300605
dc.language.iso: eng
dc.setspec.uppsok: Technology
dc.subject: machine-learning
dc.subject: NLP
dc.subject: OCR
dc.subject: post-processing
dc.title: OCR post-processing of historical Swedish text using machine learning techniques
dc.type.degree: Master's thesis (Examensarbete för masterexamen)
dc.type.uppsok: H
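
The abstract describes error correction as Levenshtein edit distance combined with frequency wordlists. The following is a minimal sketch of that general idea, not the thesis's actual implementation: the function names `levenshtein` and `correct`, the `max_distance` cutoff, and the tie-breaking rule (smallest distance first, then highest frequency) are all assumptions made for illustration.

```python
from typing import Dict


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as the row to save memory
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def correct(word: str, frequencies: Dict[str, int], max_distance: int = 2) -> str:
    """Return the closest wordlist entry, preferring more frequent words.

    Hypothetical ranking: candidates within max_distance are ordered by
    (distance, -frequency). If none qualifies, the word is left unchanged.
    """
    best = None
    for candidate, count in frequencies.items():
        distance = levenshtein(word, candidate)
        if distance > max_distance:
            continue
        key = (distance, -count, candidate)
        if best is None or key < best:
            best = key
    return best[2] if best is not None else word


# Illustrative wordlist with made-up counts (not data from the thesis).
wordlist = {"konungen": 120, "konungar": 40, "och": 900}
suggestion = correct("konungcn", wordlist)
```

In this sketch only words already flagged as erroneous by the detection step would be passed to `correct`, which matches the two-part structure described in the abstract; a linear scan over the wordlist is used for clarity rather than speed.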

Download

Original bundle

Name: CSE 19-88 ODR Persson.pdf
Size: 7.75 MB
Format: Adobe Portable Document Format
Description: CSE 19-88 Persson

License bundle

Name: license.txt
Size: 1.14 KB
Format: Item-specific license agreed upon to submission