Large-Scale Content Extraction from Heterogeneous Sources

dc.contributor.author: Langkilde, Daniel
dc.contributor.department: Chalmers tekniska högskola / Institutionen för data- och informationsteknik (Chalmers) [sv]
dc.contributor.department: Chalmers University of Technology / Department of Computer Science and Engineering (Chalmers) [en]
dc.date.accessioned: 2019-07-03T13:45:05Z
dc.date.available: 2019-07-03T13:45:05Z
dc.date.issued: 2015
dc.description.abstract: In this thesis report we describe a novel approach to large-scale content extraction from heterogeneous web sources. This task is a very important step in a range of web crawling, indexing and data mining tasks. The described approach performs calculations on the Document Object Model (DOM) in order to uncover which nodes contain relevant content and which do not. We set out with the hypothesis that the DOM tree can be modeled as a hidden Markov tree model in which the hidden state of each node indicates whether it is relevant content or not. Using Gibbs sampling we uncover the hidden states of the nodes, and show that competitive performance can be achieved using this approach.
dc.identifier.uri: https://hdl.handle.net/20.500.12380/219477
dc.language.iso: eng
dc.setspec.uppsok: Technology
dc.subject: Informations- och kommunikationsteknik
dc.subject: Data- och informationsvetenskap
dc.subject: Information & Communication Technology
dc.subject: Computer and Information Science
dc.title: Large-Scale Content Extraction from Heterogeneous Sources
dc.type.degree: Examensarbete för masterexamen [sv]
dc.type.degree: Master Thesis [en]
dc.type.uppsok: H
local.programme: Engineering mathematics and computational science (MPENM), MSc
Download
Original bundle
Name: 219477.pdf
Size: 4.27 MB
Format: Adobe Portable Document Format
Description: Full text