Expressive corpus search in a modern framework - Developing expressiveness for Korpsearch, a more efficient tool by which to query a corpus

dc.contributor.author: SALOMONSSON, VICTOR
dc.contributor.author: THORESSON, MIJO
dc.contributor.department: Chalmers tekniska högskola / Institutionen för data och informationsteknik (sv)
dc.contributor.department: Chalmers University of Technology / Department of Computer Science and Engineering (en)
dc.contributor.examiner: Kemp, Graham
dc.contributor.supervisor: Smallbone, Nicholas
dc.date.accessioned: 2025-02-25T13:13:59Z
dc.date.available: 2025-02-25T13:13:59Z
dc.date.issued: 2024
dc.date.submitted:
dc.description.abstract: In this thesis we developed a new corpus querying program designed to be both fast and expressive. We achieved this by implementing queries on prefixes, suffixes, substrings (contains), and Python regular expressions, as well as disjunctions, while maintaining high speed. Using our program, execution times for queries were on average half those of an established corpus querying program, Corpus Workbench. For instance, our program executed disjunctive queries in 4% of the time that Corpus Workbench required. This thesis shows that one can implement disjunction by dividing a disjunctive query into all the possible permutations of subqueries and then finding their results with a fast intersection-finding program. This extends to finding all words that contain a certain string, and to general Python regex matches. Our program also shows that, in special cases, one can depart from the disjunction solution path, that is, finding all matching words and then forming a disjunctive query between them. Special solutions were made for prefix and suffix queries, and these achieved shorter execution times for those kinds of queries.
dc.identifier.coursecode: DATX05
dc.identifier.uri: http://hdl.handle.net/20.500.12380/309157
dc.language.iso: eng
dc.setspec.uppsok: Technology
dc.subject: corpus
dc.subject: corpora
dc.subject: Corpus Workbench
dc.subject: Korpsearch
dc.subject: efficiently
dc.subject: query
dc.subject: Computer
dc.subject: science
dc.subject: computer science
dc.subject: engineering
dc.subject: project
dc.subject: thesis
dc.title: Expressive corpus search in a modern framework - Developing expressiveness for Korpsearch, a more efficient tool by which to query a corpus
dc.type.degree: Examensarbete för masterexamen (sv)
dc.type.degree: Master's Thesis (en)
dc.type.uppsok: H
local.programme: Data science and AI (MPDSC), MSc
local.programme: High-performance computer systems (MPHPC), MSc
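
The disjunction approach summarized in the abstract can be illustrated with a minimal Python sketch. It is not the Korpsearch implementation: the toy in-memory inverted index and the names `build_index`, `match_sequence`, `match_disjunctive`, and `words_with_prefix` are all hypothetical, and reading "permutations of subqueries" as "every combination of one alternative per token position" is an assumption. Each combination is resolved by intersecting sets of candidate start positions, and the results are merged. The prefix helper follows the general path only (find all matching words, then form a disjunction over them), not the thesis's special-case prefix solution.

```python
from collections import defaultdict
from itertools import product


def build_index(tokens):
    """Toy inverted index: word -> set of positions where it occurs."""
    index = defaultdict(set)
    for pos, tok in enumerate(tokens):
        index[tok].add(pos)
    return index


def match_sequence(index, words):
    """Start positions where `words` occur consecutively.

    Each word's positions are shifted back by its offset in the query,
    so a match is a start position common to all shifted sets.
    """
    result = None
    for offset, word in enumerate(words):
        starts = {p - offset for p in index.get(word, ())}
        result = starts if result is None else result & starts
        if not result:
            return set()
    return result or set()


def match_disjunctive(index, alternatives):
    """`alternatives` holds one collection of allowed words per token slot.

    The query is expanded into every combination of one word per slot,
    each combination is solved by intersection, and the hits are merged.
    """
    hits = set()
    for words in product(*alternatives):
        hits |= match_sequence(index, words)
    return hits


def words_with_prefix(index, prefix):
    """General path for a prefix query: collect every indexed word that matches."""
    return sorted(w for w in index if w.startswith(prefix))


tokens = "the cat sat on the mat while the dog sat".split()
index = build_index(tokens)

# "(cat | dog) sat" expands into the subqueries "cat sat" and "dog sat".
animal_hits = match_disjunctive(index, [("cat", "dog"), ("sat",)])

# "the ma*" becomes a disjunction over all indexed words starting with "ma".
prefix_hits = match_disjunctive(index, [("the",), words_with_prefix(index, "ma")])
```

The number of subqueries in this sketch grows with the product of the alternatives per slot, which is why the speed of the intersection step matters for this approach.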

Download

Original bundle

Name: CSE 24-109 MT VS.pdf
Size: 1.86 MB
Format: Adobe Portable Document Format
