Finding Top-k Similar Document Pairs - speeding up a multi-document summarization approach

dc.contributor.author: Toft, Johan
dc.contributor.author: Bogren, Emma
dc.contributor.department: Chalmers University of Technology / Department of Computer Science and Engineering, Computing Science (Chalmers)
dc.date.accessioned: 2019-07-03T13:24:05Z
dc.date.available: 2019-07-03T13:24:05Z
dc.date.issued: 2014
dc.description.abstract: Many approaches exist for multi-document summarization, the task of automatically creating a summary from multiple sources. This can be a complicated problem, but this thesis focuses on a simple approach that is currently being researched: find the pairs of documents with the largest overlaps in words and use these to produce a summary. The original algorithm for this was very naive, comparing each word of every possible pair of documents, and the aim of this project has been to make it more efficient. We reviewed existing algorithms used for similar problems and found two useful ones: TOP-MATA and Top-k Join. We also created a new algorithm, which we call the Segment Bounding Algorithm (SBA). The approaches were evaluated on two data sets, TREC and Opinosis. The experimental results showed that SBA was the most efficient on the documents of TREC, while Top-k Join performed slightly better than SBA on the shorter documents of Opinosis. In the end, SBA was proposed as an improvement to the simple summarization approach.
dc.identifier.uri: https://hdl.handle.net/20.500.12380/199411
dc.language.iso: eng
dc.setspec.uppsok: Technology
dc.subject: Computer and Information Science
dc.subject: Information & Communication Technology
dc.title: Finding Top-k Similar Document Pairs - speeding up a multi-document summarization approach
dc.type.degree: Master Thesis
dc.type.uppsok: H
local.programme: Computer science – algorithms, languages and logic (MPALG), MSc
Name: 199411.pdf
Size: 1.39 MB
Format: Adobe Portable Document Format
Description: Fulltext
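
The naive baseline mentioned in the abstract (scoring every possible pair of documents by word overlap and keeping the k best) might look roughly like the sketch below. It is only an illustration under stated assumptions, not the thesis's implementation: the overlap measure (number of distinct shared lowercase, whitespace-separated tokens) and the names `word_overlap` and `top_k_pairs_naive` are hypothetical, and TOP-MATA, Top-k Join and SBA are not shown.

```python
import heapq
from itertools import combinations


def word_overlap(words_a, words_b):
    """Assumed overlap measure: the number of distinct words two documents share."""
    return len(words_a & words_b)


def top_k_pairs_naive(docs, k):
    """Score every document pair and return the k highest-scoring ones.

    Returns a list of (overlap, i, j) tuples, best first, where i < j are
    indices into docs. All n*(n-1)/2 pairs are compared, which is the
    quadratic cost that faster algorithms try to avoid.
    """
    # Tokenize once per document (assumption: lowercase, split on whitespace).
    word_sets = [set(doc.lower().split()) for doc in docs]
    scored = (
        (word_overlap(word_sets[i], word_sets[j]), i, j)
        for i, j in combinations(range(len(docs)), 2)
    )
    return heapq.nlargest(k, scored)


docs = [
    "the cat sat on the mat",
    "the cat lay on the rug",
    "dogs bark loudly",
]
best = top_k_pairs_naive(docs, 1)
```

Here documents 0 and 1 share the words "the", "cat" and "on", so they form the top pair. Using `heapq.nlargest` keeps only k candidates in memory, but every pair is still scored, so the running time grows quadratically with the number of documents.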