Reading Key Figures from Annual Reports

dc.contributor.author: Nordin Hällgren, Sara
dc.contributor.department: Chalmers tekniska högskola / Institutionen för matematiska vetenskaper
dc.contributor.examiner: Gerlee, Philip
dc.contributor.supervisor: Ivarsson, Oscar
dc.date.accessioned: 2021-06-09T12:23:04Z
dc.date.available: 2021-06-09T12:23:04Z
dc.date.issued: 2021
dc.date.submitted: 2020
dc.description.abstract: This thesis presents methods for extracting key figures from scanned annual reports. A two-step approach is suggested, in which a classifier locates the desired section and a separate algorithm then identifies and extracts key figures within this context. Optical Character Recognition is carried out using Tesseract 4.1.1. The data consists of 280 annual reports submitted by Swedish companies, for which page labels as well as four different key figures are annotated. For the page classification task, a Random Forest classifier trained on TF-IDF-embedded pages is found to achieve a test accuracy of 99.6%. To locate and extract a given key figure, an approximate string matching algorithm is found to perform best, achieving an extraction accuracy of 92.9% on training documents and 89.6% on test documents. Accurate extraction is hampered by noise, so different image processing techniques are explored. The RCC filter is seen to improve extraction accuracy from 73.8% to 83.8% on a subset of difficult documents. Further improvements could be made by using an image processing technique based on deep learning.
dc.identifier.coursecode: MVEX03
dc.identifier.uri: https://hdl.handle.net/20.500.12380/302433
dc.language.iso: eng
dc.setspec.uppsok: PhysicsChemistryMaths
dc.subject: Annual reports, extract information, reading from tables, optical character recognition, Tesseract, image processing, remove noise, binary images, scanned documents, page classification
dc.title: Reading Key Figures from Annual Reports
dc.type.degree: Master's thesis (Examensarbete för masterexamen)
dc.type.uppsok: H
local.programme: Engineering mathematics and computational science (MPENM), MSc
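
The abstract mentions that an approximate string matching algorithm is used to locate and extract a key figure from OCR output. The thesis's actual algorithm is not described in this record, so the following is only a minimal sketch of the general idea, assuming Python's standard-library difflib as the similarity measure. The function name extract_key_figure, the 0.8 threshold, the Swedish labels, and the sample OCR lines are all hypothetical and chosen for illustration.

```python
import re
from difflib import SequenceMatcher

# A number at the end of a line, preceded by whitespace; spaces are allowed
# as thousands separators (e.g. "12 345"). This format is an assumption.
NUMBER_AT_END = re.compile(r"(?<=\s)(-?\d[\d ]*)$")


def extract_key_figure(ocr_lines, label, threshold=0.8):
    """Return the number on the line whose text best matches `label`.

    Hypothetical helper: the similarity score is difflib's ratio, not
    necessarily the measure used in the thesis. Returns None when no line
    is similar enough to the label.
    """
    best_score, best_value = 0.0, None
    for line in ocr_lines:
        line = line.strip()
        match = NUMBER_AT_END.search(line)
        if match is None:
            continue  # no trailing number, so nothing to extract here
        candidate = line[:match.start()].strip().lower()
        score = SequenceMatcher(None, candidate, label.lower()).ratio()
        if score > best_score:
            best_score = score
            best_value = int(match.group(1).replace(" ", ""))
    return best_value if best_score >= threshold else None


# Hypothetical OCR output with one misread character ("0" instead of "o").
sample_lines = [
    "Resultaträkning",
    "Nett0omsättning 12 345",
    "Rörelseresultat -1 200",
    "Summa tillgångar 98 765",
]

net_sales = extract_key_figure(sample_lines, "Nettoomsättning")
operating_result = extract_key_figure(sample_lines, "Rörelseresultat")
```

Matching on similarity rather than exact equality is what lets the misread label still be found; the threshold trades missed extractions against false matches and would need tuning on annotated documents.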
Original bundle
Name: Master_thesis_Sara_Nordin_Hällgren_210604.pdf
Size: 2.45 MB
Format: Adobe Portable Document Format
Description: Reading Key Figures from Annual Reports