Large-Scale Content Extraction from Heterogeneous Sources

Langkilde, Daniel

Download

Primary file 219477.pdf (4.27 MB)

Published

2015

Author

Langkilde, Daniel

Type

Master's thesis

Program

Engineering mathematics and computational science (MPENM), MSc

Abstract

In this thesis report we describe a novel approach to large-scale content extraction from heterogeneous web sources. This task is an important step in a range of web crawling, indexing and data mining tasks. The described approach performs calculations on the Document Object Model (DOM) in order to uncover which nodes contain relevant content and which do not. We set out with the hypothesis that the DOM tree can be modeled as a hidden Markov tree model, where the hidden state of each node indicates whether it is relevant content or not. Using Gibbs sampling we uncover the hidden states of the nodes, and show that competitive performance can be achieved using this approach.

Subject/keywords

Information & Communication Technology, Computer and Information Science

URI

https://hdl.handle.net/20.500.12380/219477

Collections

Master's theses
