Large-Scale Content Extraction from Heterogeneous Sources

Langkilde, Daniel

dc.contributor.author	Langkilde, Daniel
dc.contributor.department	Chalmers tekniska högskola / Institutionen för data- och informationsteknik (Chalmers)	sv
dc.contributor.department	Chalmers University of Technology / Department of Computer Science and Engineering (Chalmers)	en
dc.date.accessioned	2019-07-03T13:45:05Z
dc.date.available	2019-07-03T13:45:05Z
dc.date.issued	2015
dc.description.abstract	In this thesis report we describe a novel approach to large-scale content extraction from heterogeneous web sources. This task is an important step in a range of web crawling, indexing, and data mining tasks. The approach performs calculations on the Document Object Model (DOM) to determine which nodes contain relevant content and which do not. We set out with the hypothesis that the DOM tree can be modeled as a hidden Markov tree model in which the hidden state of each node indicates whether it is relevant content or not. Using Gibbs sampling, we uncover the hidden states of the nodes and show that competitive performance can be achieved with this approach.
dc.identifier.uri	https://hdl.handle.net/20.500.12380/219477
dc.language.iso	eng
dc.setspec.uppsok	Technology
dc.subject	Informations- och kommunikationsteknik
dc.subject	Data- och informationsvetenskap
dc.subject	Information & Communication Technology
dc.subject	Computer and Information Science
dc.title	Large-Scale Content Extraction from Heterogeneous Sources
dc.type.degree	Examensarbete för masterexamen	sv
dc.type.degree	Master Thesis	en
dc.type.uppsok	H
local.programme	Engineering mathematics and computational science (MPENM), MSc

Original bundle

Name: 219477.pdf
Size: 4.27 MB
Format: Adobe Portable Document Format
Description: Fulltext

Collections

Master's theses (Examensarbeten för masterexamen)