New tools for old news

dc.contributor.author: Löfgren, Victoria
dc.contributor.department: Chalmers tekniska högskola / Institutionen för data och informationsteknik [sv]
dc.contributor.department: Chalmers University of Technology / Department of Computer Science and Engineering [en]
dc.contributor.examiner: Axelson-Fisk, Marina
dc.contributor.supervisor: Dannélls, Dana
dc.date.accessioned: 2025-04-23T11:17:49Z
dc.date.issued: 2025
dc.date.submitted:
dc.description.abstract: Many collections of digitized newspapers suffer from poor OCR quality, which impairs readability, information retrieval, and analysis of the material. Errors in OCR output can be reduced by applying machine translation models to “translate” the output into a corrected version. Although transformer models show promising results in post-OCR correction and related tasks in other languages, they have not yet been used for correcting OCR errors in Swedish texts. This thesis presents a post-OCR correction model for Swedish 19th- and 20th-century newspapers, based on the pre-trained transformer model ByT5. Three versions of the model were trained on different mixes of training data. The best model, which achieved a 37% reduction in character error rate (CER), will be integrated into Språkbanken Text’s annotation pipeline Sparv.
dc.identifier.coursecode: DATX05
dc.identifier.uri: http://hdl.handle.net/20.500.12380/309276
dc.language.iso: eng
dc.relation.ispartofseries: CSE 24-184
dc.setspec.uppsok: Technology
dc.subject: Post-OCR correction, ByT5, newspaper digitization
dc.title: New tools for old news
dc.type.degree: Examensarbete för masterexamen [sv]
dc.type.degree: Master's Thesis [en]
dc.type.uppsok: H
local.programme: Computer science – algorithms, languages and logic (MPALG), MSc
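
The abstract reports its headline result as a relative reduction in CER. The sketch below shows how such a figure is commonly computed. It assumes the usual definition of CER as character-level Levenshtein distance divided by reference length; the thesis may define or aggregate the metric differently. The function names and sample strings are hypothetical and are not taken from the thesis.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate of hypothesis against a ground-truth reference."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(hypothesis, reference) / len(reference)


def relative_cer_reduction(cer_before: float, cer_after: float) -> float:
    """Fraction of the original CER removed by correction (0.37 means 37%)."""
    if cer_before <= 0:
        raise ValueError("cer_before must be positive")
    return (cer_before - cer_after) / cer_before


if __name__ == "__main__":
    # Hypothetical strings, for illustration only.
    reference = "Stockholm den 3 maj"
    ocr_output = "Stockbolm den 8 maj"
    corrected = "Stockholm den 8 maj"
    before = cer(ocr_output, reference)
    after = cer(corrected, reference)
    print(f"CER before: {before:.3f}, after: {after:.3f}, "
          f"relative reduction: {relative_cer_reduction(before, after):.0%}")
```

Under this definition, a drop in CER from 0.100 to 0.063 would correspond to a 37% relative reduction. These two values are illustrative only and are not results reported in the thesis.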

Download

Original bundle

Name: CSE 24-184 VL.pdf
Size: 11.1 MB
Format: Adobe Portable Document Format