Towards Topic Detection Using Minimal New Sets

Master's thesis

Download file(s):
File | Description | Size | Format
253923.pdf | Fulltext | 2.26 MB | Adobe PDF
Type: Master's thesis
Title: Towards Topic Detection Using Minimal New Sets
Authors: Appert, Madeleine
Stenberg, Lisa
Abstract: A common way of detecting topics in a collection of articles is to use Hierarchical Clustering. This has been shown to be a successful way of clustering texts, and thereby of detecting topics. However, it is computationally expensive, since the similarities between all articles are compared pairwise. This thesis examines whether smaller amounts of data could be used to detect topics. Building on the work of Damaschke [9] and Guðjónsson [13], we processed articles as a sequence of chronologically ordered documents and represented each document by its previously unseen word combinations, more formally known as minimal new sets of words. Using the words that now represent each article, we selected articles containing a given word or word combination. We compared this selection to a ground truth created with Hierarchical Clustering, to see whether minimal new sets can be used to approximate clustering in a streaming setting. We performed three experiments and evaluated their results. In the first, we selected articles based on one given word. In the second, we selected articles based on a given two-word combination. The third built on the second, but separated the selected articles whenever two consecutive articles were not published within a given time limit. Of the three experiments, tracking a pair of words gave the best result. Additionally, we found that the Jaccard index of the word combination affected the result: words that appear together more often gave better results. The results indicate that minimal new sets can be used to detect topics, and our model shows significantly better results than the corresponding random model. However, we do not yet consider our model to hold up against established methods. We therefore do not think that our current method is suitable for a topic detection system, but rather that it could be possible to build on our methods.
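The ideas named in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation. It assumes that a word set is "new" for a document if no earlier document contains all of its words, and "minimal" if none of its proper subsets is also new. The function names (`minimal_new_sets`, `jaccard`, `split_by_gap`), the cap of two words per set, and the use of plain numbers as publication times are all assumptions made for illustration.

```python
from itertools import combinations


def minimal_new_sets(docs, max_size=2):
    """For each document (in stream order), list its minimal new word sets.

    Only sets of at most `max_size` words are considered (an assumption
    made to keep the sketch small).
    """
    seen = set()  # every word combination contained in an earlier document
    result = []
    for doc in docs:
        words = sorted(set(doc))
        candidates = [
            frozenset(c)
            for k in range(1, max_size + 1)
            for c in combinations(words, k)
        ]
        # New: not contained in any earlier document.
        new = {c for c in candidates if c not in seen}
        # Minimal: no proper non-empty subset is itself new.
        minimal = [
            c for c in candidates
            if c in new and not any(
                frozenset(s) in new
                for k in range(1, len(c))
                for s in combinations(c, k)
            )
        ]
        result.append(minimal)
        seen.update(candidates)
    return result


def jaccard(docs, a, b):
    """Jaccard index of two words: documents with both / documents with either."""
    with_a = {i for i, d in enumerate(docs) if a in d}
    with_b = {i for i, d in enumerate(docs) if b in d}
    union = with_a | with_b
    return len(with_a & with_b) / len(union) if union else 0.0


def split_by_gap(times, limit):
    """Split sorted publication times into groups wherever the gap between
    two consecutive articles exceeds `limit` (as in the third experiment)."""
    groups = []
    for t in times:
        if groups and t - groups[-1][-1] <= limit:
            groups[-1].append(t)
        else:
            groups.append([t])
    return groups


# Hypothetical three-article stream, for illustration only.
docs = [["storm", "coast"], ["storm", "flood"], ["coast", "flood"]]
for sets in minimal_new_sets(docs):
    print([sorted(s) for s in sets])
print(jaccard(docs, "storm", "coast"))
print(split_by_gap([1, 2, 10, 11, 30], limit=3))
```

In this toy stream the third article introduces no new single word, so under the stated definition it is represented only by a word pair, which is the kind of representation the two-word experiment tracks.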
Keywords: Computer and Information Science
Issue Date: 2017
Publisher: Chalmers University of Technology / Department of Computer Science and Engineering (Chalmers tekniska högskola / Institutionen för data- och informationsteknik)
Collection: Master Theses
