Large-Scale Content Extraction from Heterogeneous Sources

Typ
Examensarbete för masterexamen
Master Thesis
Program
Engineering mathematics and computational science (MPENM), MSc
Publicerad
2015
Författare
Langkilde, Daniel
Modellbyggare
Tidskriftstitel
ISSN
Volymtitel
Utgivare
Sammanfattning
In this thesis report we describe a novel approach to large scale content extraction from heterogenous web sources. This task is a very important step in a range of web crawling, indexing and data mining tasks. The described approach makes calculations on the Document Object Model (DOM) in order to uncover which nodes contain relevant content, and which do not. We set out with the hypothesis that the DOM tree can be modeled as a hidden Markov tree model where the hidden state of each node indicates if its relevant content or not. Using Gibbs samling we uncover the hidden states of the node, and show that competative performance can be achieved using this approach.
Beskrivning
Ämne/nyckelord
Informations- och kommunikationsteknik , Data- och informationsvetenskap , Information & Communication Technology , Computer and Information Science
Citation
Arkitekt (konstruktör)
Geografisk plats
Byggnad (typ)
Byggår
Modelltyp
Skala
Teknik / material
Index