Algorithm Selection for Document Retrieval
Ladda ner
Publicerad
Författare
Typ
Examensarbete för masterexamen
Master's Thesis
Master's Thesis
Modellbyggare
Tidskriftstitel
ISSN
Volymtitel
Utgivare
Sammanfattning
Energy consumption in data centers is increasing and is expected to use a significant share of the world’s energy in the future. As the quantities of data are also increasing, searching this data uses substantial computational resources. One way to reduce the energy consumption of search clusters is to reduce the search latency. The type of search used for unstructured data, such as in web search engines, is called top-k document retrieval. Several different top-k document retrieval algorithms exist, but no single algorithm performs best in all cases. A technique that can be used in problems with this characteristic is algorithm selection. Algorithm selection is a technique that involves predicting and using one, out of a set of complementary algorithms, that will perform the best in a specific problem instance. In this thesis, we study whether algorithm selection is a viable approach for improving query latency, over state-of-the-art query processing algorithms. To do this, we first measured the performance of a set of relevant state-of-the-art top-k document retrieval algorithms in different queries. Using this performance data, we trained a machine learning classifier to predict which algorithm would perform the best, given a query. We show that the potential latency improvement of the algorithm selection approach varies with the database size. The largest potential improvement we observed is near 25%. We present an algorithm selector which achieves 17 out of the 25%. We also discuss how this could reduce the energy consumption of the querying process by around 15%, but also some practical considerations.
Beskrivning
Ämne/nyckelord
information retrieval, top-k document retrieval, algorithm selection, query processing, energy consumption, machine learning, random forest, decision tree
