Expressive corpus search in a modern framework - Developing expressiveness for Korpsearch, a more efficient tool by which to query a corpus
Type
Examensarbete för masterexamen
Master's Thesis
Abstract
In this thesis we have developed a new corpus querying program, whose purpose
is to be fast but also expressive. We achieved this by implementing
prefix, suffix, substring (contains), and Python regular expression queries, as well as disjunctions,
whilst maintaining high speeds. Using our program, execution times for
queries were on average half those of an established corpus querying program, Corpus
Workbench. For instance, our program executed disjunctive queries in
4% of the time Corpus Workbench required. This thesis shows that one
can implement disjunction by dividing disjunctive queries into all the possible permutations
of subqueries. Then one can find their result through a quick intersection
finding program. This extends to being able to find all words containing a certain
string, and general Python regex matches. Our program also shows that, in special
cases, one can depart from the disjunction approach, which is to find all matching
words and then form a disjunctive query over them. Dedicated solutions
for prefix and suffix queries achieve shorter execution
times for those kinds of queries.
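
The approach described above can be illustrated with a minimal sketch. This is not Korpsearch's actual code: the toy corpus, the in-memory index, and all function names are hypothetical, "all the possible permutations of subqueries" is interpreted here as the Cartesian product of the alternatives at each token position, and a plain Python set intersection stands in for the thesis's fast intersection program. The prefix helper shows one plausible special case, assuming a sorted vocabulary in which words sharing a prefix are contiguous.

```python
import re
from bisect import bisect_left
from itertools import product

# Toy corpus (an assumption for illustration): one token per position.
CORPUS = "the cat sat on the mat while the dog sat on the cat".split()


def build_index(corpus):
    """Map each word to the list of positions where it occurs."""
    index = {}
    for pos, word in enumerate(corpus):
        index.setdefault(word, []).append(pos)
    return index


def expand_regex(pattern, vocabulary):
    """Rewrite a regex as the sorted list of vocabulary words it fully matches."""
    rx = re.compile(pattern)
    return sorted(w for w in vocabulary if rx.fullmatch(w))


def expand_prefix(prefix, sorted_vocab):
    """Special case: find the contiguous run of words starting with `prefix`."""
    start = bisect_left(sorted_vocab, prefix)
    end = start
    while end < len(sorted_vocab) and sorted_vocab[end].startswith(prefix):
        end += 1
    return sorted_vocab[start:end]


def match_sequence(words, index):
    """Find start positions of one non-disjunctive subquery by intersection."""
    if not words:
        return set()
    starts = set(index.get(words[0], []))
    for offset, word in enumerate(words[1:], start=1):
        # Shift each position back so it refers to the start of the match.
        starts &= {p - offset for p in index.get(word, [])}
    return starts


def search(query, index):
    """`query` holds one list of alternative words per token position."""
    hits = set()
    # Each combination of alternatives is one plain subquery.
    for words in product(*query):
        hits |= match_sequence(words, index)
    return sorted(hits)
```

For example, with `index = build_index(CORPUS)`, the query `[["cat", "dog"], ["sat"]]` expands into the two subqueries "cat sat" and "dog sat", and a regex such as `.at` can be turned into a disjunction with `expand_regex(".at", index)` before searching. The number of subqueries grows with the product of the alternatives per position, which is why a fast intersection step matters in this design.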
Keywords
corpus, corpora, Corpus Workbench, Korpsearch, efficiently, query, Computer, science, computer science, engineering, project, thesis