New tools for old news

Löfgren, Victoria

Download

CSE 24-184 VL.pdf (11.1 MB)

Published

2025

Author

Löfgren, Victoria

Type

Master's Thesis

Program

Computer science – algorithms, languages and logic (MPALG), MSc

Abstract

Many collections of digitized newspapers suffer from poor OCR quality, which impairs readability, information retrieval, and analysis of the material. Errors in OCR output can be reduced by applying machine translation models that “translate” the raw output into a corrected version. Although transformer models show promising results in post-OCR correction and related tasks in other languages, they have not yet been used to correct OCR errors in Swedish texts. This thesis presents a post-OCR correction model for Swedish 19th- and 20th-century newspapers based on the pre-trained transformer model ByT5. Three versions of the model were trained on different mixes of training data. The best model, which achieved a 37% reduction in character error rate (CER), will be integrated into Språkbanken Text’s annotation pipeline Sparv.
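To make the reported metric concrete, here is a minimal sketch of how a character error rate and its relative reduction are commonly computed. It assumes CER is the character-level edit distance divided by the length of the reference text; the abstract does not describe the thesis's exact evaluation procedure, so it may differ. The function names and the sample strings are made up for illustration and are not taken from the thesis.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Assumed definition: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(hypothesis, reference) / len(reference)


def relative_cer_reduction(cer_before: float, cer_after: float) -> float:
    """Fraction of the original CER removed by correction."""
    if cer_before == 0:
        raise ValueError("cer_before must be greater than zero")
    return (cer_before - cer_after) / cer_before


if __name__ == "__main__":
    # Made-up example strings, not data from the thesis.
    reference = "Stockholm den 3 maj"
    ocr_output = "Stockhohn den 8 maj"
    corrected = "Stockholm den 8 maj"

    before = cer(ocr_output, reference)
    after = cer(corrected, reference)
    print(f"CER before correction: {before:.3f}")
    print(f"CER after correction:  {after:.3f}")
    print(f"Relative reduction:    {relative_cer_reduction(before, after):.1%}")
```

Under this definition, a 37% reduction means the corrected text retains 63% of the character errors measured in the raw OCR output, relative to the same reference transcription.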

Subject/keywords

Post-OCR correction, ByT5, newspaper digitization

URI

https://hdl.handle.net/20.500.12380/309276

Collections

Master's theses
