OCR post-processing of historical Swedish text using machine learning techniques

Persson, Simon


dc.contributor.author	Persson, Simon
dc.contributor.department	Chalmers tekniska högskola / Institutionen för data och informationsteknik	sv
dc.contributor.examiner	Angelov, Krasimir
dc.date.accessioned	2019-12-19T08:56:16Z
dc.date.available	2019-12-19T08:56:16Z
dc.date.issued	2019	sv
dc.date.submitted	2019
dc.description.abstract	We present an OCR post-processing method that uses machine learning techniques and targets historical Swedish texts. The method is developed completely independently of any OCR tool, with the aim of avoiding bias towards any single OCR tool. The purpose of the method is to solve the two main problems of OCR post-processing, i.e. detecting and correcting errors caused by the OCR tool. Our method is divided into two main parts, each of which solves one of these problems. Error detection is handled by a Support Vector Machine (SVM) that classifies each word as either valid or erroneous. For the SVM to classify the words, each word is converted into a feature vector containing several word features that indicate the validity of the word. The error correction part of the method takes the words that have been classified as erroneous and tries to replace them with the correct word. The error correction algorithm is based on Levenshtein edit distance combined with frequency wordlists. We OCR-processed a collection of 400 documents from the 19th century using three OCR tools: Ocropus, Tesseract and ABBYY, and used the output from each tool to develop our method. Experiments and evaluations were carried out against the ground truth of the documents. The method is built in a modular fashion, and evaluation was performed on each module. We report quantitative and qualitative results showing varying degrees of OCR post-processing complexity.	sv
dc.identifier.coursecode	DATX05	sv
dc.identifier.uri	https://hdl.handle.net/20.500.12380/300605
dc.language.iso	eng	sv
dc.setspec.uppsok	Technology
dc.subject	machine-learning	sv
dc.subject	NLP	sv
dc.subject	OCR	sv
dc.subject	post-processing	sv
dc.title	OCR post-processing of historical Swedish text using machine learning techniques	sv
dc.type.degree	Master's thesis	sv
dc.type.uppsok	H
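
The abstract describes error correction as Levenshtein edit distance combined with frequency wordlists. The sketch below illustrates that general idea only; it is not the thesis implementation. The function names, the `max_distance` cutoff, the ranking rule (smallest distance first, then highest frequency) and the tiny word list are all assumptions made for illustration, and the SVM-based detection step is not covered.

```python
from typing import Dict, Optional


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def correct(word: str, frequencies: Dict[str, int],
            max_distance: int = 2) -> Optional[str]:
    """Return the best replacement for `word`, or None if nothing is close.

    Assumed ranking: prefer the smallest edit distance, and break ties
    by the higher corpus frequency. The thesis may weigh these differently.
    """
    best: Optional[str] = None
    best_key = None
    for candidate, count in frequencies.items():
        distance = levenshtein(word, candidate)
        if distance > max_distance:
            continue
        key = (distance, -count)
        if best_key is None or key < best_key:
            best, best_key = candidate, key
    return best


# Hypothetical frequency word list; the counts are invented for the example.
FREQUENCIES = {"och": 1000, "ock": 50, "hus": 300, "häst": 120}

if __name__ == "__main__":
    for token in ["ocb", "hnst", "xyzxyz"]:
        print(token, "->", correct(token, FREQUENCIES))
```

In this sketch a word flagged as erroneous is compared against every entry in the word list, which is adequate for small lists; a larger lexicon would likely need candidate pruning or an index.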


Original bundle


Name:: CSE 19-88 ODR Persson.pdf
Size:: 7.75 MB
Format:: Adobe Portable Document Format
Description:: CSE 19-88 Persson


Collections

Master's theses