New tools for old news

Löfgren, Victoria

dc.contributor.author	Löfgren, Victoria
dc.contributor.department	Chalmers tekniska högskola / Institutionen för data och informationsteknik	sv
dc.contributor.department	Chalmers University of Technology / Department of Computer Science and Engineering	en
dc.contributor.examiner	Axelson-Fisk, Marina
dc.contributor.supervisor	Dannélls, Dana
dc.date.accessioned	2025-04-23T11:17:49Z
dc.date.issued	2025
dc.date.submitted
dc.description.abstract	Many collections of digitized newspapers suffer from poor OCR quality, which impairs readability, information retrieval, and analysis of the material. Errors in OCR output can be reduced by applying machine translation models to “translate” it into a corrected version. Although transformer models show promising results in post-OCR correction and related tasks in other languages, they have not yet been used to correct OCR errors in Swedish texts. This thesis presents a post-OCR correction model for Swedish 19th and 20th century newspapers, based on the pre-trained transformer model ByT5. Three versions of the model were trained on different mixes of training data. The best model, which achieved a 37% reduction in character error rate (CER), will be integrated into Språkbanken Text’s annotation pipeline Sparv.
dc.identifier.coursecode	DATX05
dc.identifier.uri	https://hdl.handle.net/20.500.12380/309276
dc.language.iso	eng
dc.relation.ispartofseries	CSE 24-184
dc.setspec.uppsok	Technology
dc.subject	Post-OCR correction, ByT5, newspaper digitization
dc.title	New tools for old news
dc.type.degree	Examensarbete för masterexamen	sv
dc.type.degree	Master's Thesis	en
dc.type.uppsok	H
local.programme	Computer science – algorithms, languages and logic (MPALG), MSc
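
The abstract reports model quality as a reduction in CER. As a minimal sketch of how such a figure can be computed, the Python below defines CER as the character-level edit distance between an OCR (or corrected) string and a reference transcription, divided by the reference length. The record does not describe the thesis's actual evaluation setup, so everything here is an assumption for illustration: the function names, the invented example strings, and in particular the reading of "reduction" as a relative reduction.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(hypothesis, reference) / len(reference)


def relative_reduction(cer_before: float, cer_after: float) -> float:
    """Fraction of the original CER removed by correction (assumed definition)."""
    return (cer_before - cer_after) / cer_before


# Invented example strings, not taken from the thesis data.
reference = "gamla nyheter"
ocr_output = "gam1a nyhcter"   # two character substitutions
corrected = "gamla nyhcter"    # one error remains after correction

before = cer(ocr_output, reference)
after = cer(corrected, reference)
print(f"CER before: {before:.3f}, after: {after:.3f}, "
      f"relative reduction: {relative_reduction(before, after):.0%}")
```

Under this definition, a 37% reduction would mean that the corrected text has 0.63 times the CER of the raw OCR output, measured against the same reference.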

Original bundle

Name: CSE 24-184 VL.pdf
Size: 11.1 MB
Format: Adobe Portable Document Format

Collections

Master's Theses