Large-Scale Content Extraction from Heterogeneous Sources

Examensarbete för masterexamen

Please use this identifier to cite or link to this item: https://hdl.handle.net/20.500.12380/219477
Download file(s):
File Description SizeFormat 
219477.pdfFulltext4.37 MBAdobe PDFView/Open
Type: Examensarbete för masterexamen
Master Thesis
Title: Large-Scale Content Extraction from Heterogeneous Sources
Authors: Langkilde, Daniel
Abstract: In this thesis report we describe a novel approach to large scale content extraction from heterogenous web sources. This task is a very important step in a range of web crawling, indexing and data mining tasks. The described approach makes calculations on the Document Object Model (DOM) in order to uncover which nodes contain relevant content, and which do not. We set out with the hypothesis that the DOM tree can be modeled as a hidden Markov tree model where the hidden state of each node indicates if its relevant content or not. Using Gibbs samling we uncover the hidden states of the node, and show that competative performance can be achieved using this approach.
Keywords: Informations- och kommunikationsteknik;Data- och informationsvetenskap;Information & Communication Technology;Computer and Information Science
Issue Date: 2015
Publisher: Chalmers tekniska högskola / Institutionen för data- och informationsteknik (Chalmers)
Chalmers University of Technology / Department of Computer Science and Engineering (Chalmers)
URI: https://hdl.handle.net/20.500.12380/219477
Collection:Examensarbeten för masterexamen // Master Theses



Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.