Expressive corpus search in a modern framework - Developing expressiveness for Korpsearch, a more efficient tool by which to query a corpus

Type

Examensarbete för masterexamen
Master's Thesis

Abstract

In this thesis we have developed a new corpus querying program that aims to be both fast and expressive. It supports queries on prefixes, suffixes, substrings (contains), Python regular expressions, and disjunctions, while maintaining high speed. On average, queries executed in half the time required by an established corpus querying program, Corpus Workbench; disjunctive queries, for instance, took 4% of the time Corpus Workbench required. The thesis shows that disjunction can be implemented by dividing a disjunctive query into all possible permutations of subqueries and then computing their results with a fast intersection-finding program. The same approach extends to finding all words that contain a given string and to general Python regex matches. The program also shows that, in special cases, one can depart from this disjunction-based path, which finds all matching words and then forms a disjunctive query over them. Dedicated solutions were implemented for prefix and suffix queries, and these achieve shorter execution times for those query types.
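
As a rough illustration of the disjunction approach described in the abstract, the following Python sketch expands each pattern into the vocabulary words it matches, enumerates every combination of alternatives as a conjunctive subquery, and intersects sorted position lists to find the hits. It is not taken from the thesis: the function names (build_index, expand, intersect_sorted, search), the toy corpus, and the in-memory dictionary index are all assumptions made for illustration only.

```python
import re
from itertools import product


def build_index(tokens):
    """Map each word to the sorted list of positions where it occurs."""
    index = {}
    for pos, word in enumerate(tokens):
        index.setdefault(word, []).append(pos)
    return index


def intersect_sorted(a, b):
    """Merge-style intersection of two sorted lists of integers."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def expand(pattern, vocabulary):
    """Return all vocabulary words that fully match a Python regex."""
    rx = re.compile(pattern)
    return sorted(w for w in vocabulary if rx.fullmatch(w))


def search(index, patterns):
    """Return start positions where consecutive tokens match the patterns."""
    if not patterns:
        return []
    # Each pattern becomes a list of concrete alternative words.
    alternatives = [expand(p, index.keys()) for p in patterns]
    hits = set()
    # Every combination of alternatives is one conjunctive subquery.
    for words in product(*alternatives):
        starts = index[words[0]]
        for offset, word in enumerate(words[1:], start=1):
            # Shift positions so they refer to the start of the match.
            shifted = [p - offset for p in index[word]]
            starts = intersect_sorted(starts, shifted)
            if not starts:
                break
        hits.update(starts)
    return sorted(hits)


# Hypothetical toy corpus used only for this sketch.
tokens = "the cat sat on the mat and the dog sat".split()
index = build_index(tokens)
```

With this toy corpus, search(index, ["the", "cat|dog"]) looks for "the" followed by either "cat" or "dog", and search(index, [".*a.*"]) plays the role of a contains query. A pattern such as "s.*" expresses a prefix query through the same general path; the dedicated prefix and suffix solutions mentioned in the abstract are not reproduced here.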

Keywords

corpus, corpora, Corpus Workbench, Korpsearch, efficiently, query, computer science, engineering, project, thesis
