Reading Key Figures from Annual Reports
Published
Author
Type
Master's thesis
Abstract
This thesis presents methods for extracting key figures from scanned annual reports.
A two-step approach is suggested, where a classifier locates the desired section and
a separate algorithm then proceeds to identify and extract key figures within this
context. Optical Character Recognition is carried out using Tesseract 4.1.1. The
data consists of 280 annual reports submitted by Swedish companies, for which page
labels as well as four different key figures are annotated. For the page classification
task, a Random Forest classifier trained on TF-IDF embedded pages is found to
achieve a test accuracy of 99.6%. To locate and extract a given key figure, it is
found that an approximate string matching algorithm performs best, achieving an
extraction accuracy of 92.9% on training documents and 89.6% on test documents.
Accurate extraction is hampered by noise, so different image processing techniques
are explored. The RCC filter is seen to improve extraction accuracy from 73.8% to
83.8% on a subset of difficult documents. Further improvements could be made by
using an image processing technique based on deep learning.
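To illustrate the extraction step, the following is a minimal sketch of approximate string matching on OCR output. It is not the algorithm used in the thesis, which the abstract does not specify. The function names, the similarity threshold, the sample lines, and the assumed layout (one label followed by number columns, thousands groups separated by single spaces, columns separated by wider gaps) are all hypothetical. Similarity is computed with `difflib.SequenceMatcher` from the Python standard library.

```python
import re
from difflib import SequenceMatcher

# Assumed layout: thousands groups are separated by single spaces,
# e.g. "1 234 567"; plain integers such as "12345" are also accepted.
NUMBER = re.compile(r"-?\d{1,3}(?: \d{3})+|-?\d+")


def best_matching_line(lines, label):
    """Return (score, line) for the line whose label text is most similar to `label`."""
    best_score, best_line = 0.0, None
    for line in lines:
        # Compare only the text before the first digit, so that the
        # number columns do not dilute the similarity score.
        text = re.split(r"\d", line, maxsplit=1)[0].strip().lower()
        score = SequenceMatcher(None, text, label.lower()).ratio()
        if score > best_score:
            best_score, best_line = score, line
    return best_score, best_line


def extract_figure(lines, label, threshold=0.75):
    """Return the first number on the best-matching line, or None if no line is close enough."""
    score, line = best_matching_line(lines, label)
    if line is None or score < threshold:
        return None
    match = NUMBER.search(line)
    if match is None:
        return None
    return int(match.group().replace(" ", ""))


# Hypothetical OCR output; "Nettoornsattning" mimics a typical misreading
# of "Nettoomsättning" (net sales).
ocr_lines = [
    "Resultat efter finansiella poster  45 300  39 100",
    "Nettoornsattning  1 234 567  1 100 200",
    "Summa tillgångar  2 000 000",
]
net_sales = extract_figure(ocr_lines, "Nettoomsättning")
```

The threshold trades off tolerance to OCR errors against the risk of matching the wrong row, and would need tuning on annotated documents.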
Subject/keywords
Annual reports, extract information, reading from tables, optical character recognition, Tesseract, image processing, remove noise, binary images, scanned documents, page classification