Analysing a modified ranking algorithm
for exploratory search
Master’s thesis in Engineering Mathematics and Computational Science

MARKUS FÄLLMAN

Department of Mathematical Sciences
Chalmers University of Technology
Gothenburg, Sweden 2020


MASTER’S THESIS 2020:NN

Analysing a modified ranking algorithm for exploratory
search

MARKUS FÄLLMAN

Department of Mathematical Sciences
CHALMERS UNIVERSITY OF TECHNOLOGY

Gothenburg, Sweden 2020


Analysing a modified ranking algorithm for exploratory search
MARKUS FÄLLMAN

© MARKUS FÄLLMAN, 2020.

Supervisor: Johan Jonasson, Mathematical Sciences
Examiner: Serik Sagitov, Mathematical Sciences

Master’s Thesis 2020:NN
Department of Mathematical Sciences
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LATEX
Gothenburg, Sweden 2020

iv


Analysing a modified ranking algorithm for exploratory search
MARKUS FÄLLMAN
Department of Mathematical Sciences
Chalmers University of Technology

Abstract
Exploratory Search is a small emerging field within Information Retrieval, studying a
type of searching called exploratory searching. This type of search is directed towards
learning and investigating, and has recently started to draw attention. However, the field
of Exploratory Search struggles with its methodology. A central problem is the difficulty
to measure improvements due to that exploratory searching by definition lacks precise
goals. New tools and ideas are therefore often evaluated with user studies. By focusing
on describing how tools and ideas work, researchers can avoid the difficulty and contribute
to the field. Such an indirect approach allows formulating measures that can be applied
to ranked lists, which, in turn, allow using simulations with many benefits. This study
showcases the approach.

The aim is to determine if a ranking algorithm modification influence the formation
of groups in lists of ranked articles returned from an academic search engine. The data are
generated by simulated searches and a Linear Mixed Model is used for the analysis. The
main covariates represent how the ranking of a standard ranking algorithm is weighted
together with the ranking according to two new criteria. The response variable consists
of scores on how tightly connected the ranked articles are, with the importance of links
decreasing with the depth, and comes from the application of a measure developed in the
thesis.

The main result is that the level of interconnectedness between high ranking articles
can be clearly and statistically significantly influenced by the modification, although the
influence varies with the randomly generated queries. While more research is needed, this
might be useful for controlling the articles interconnectedness when constructing a search
engine. On a different level, the thesis shows how the indirect approach can be applied,
that it enables using simulations, and it indicates that the approach can produce results
interesting for exploratory searching.

Keywords: exploratory search, ranking algorithm, rank biased measure, citation expan-
sion, linear mixed model.

v


Contents

List of Figures ix

List of Tables xi

1 Introduction 1
1.1 Aim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Societal, ethical, and ecological concerns . . . . . . . . . . . . . . . . . 2
1.3 Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

2 Background on search engines 3
2.1 Introduction to Information Retrieval . . . . . . . . . . . . . . . . . . . . 3
2.2 Introduction to search engines . . . . . . . . . . . . . . . . . . . . . . . 3
2.3 Introduction to retrieval models . . . . . . . . . . . . . . . . . . . . . . . 3
2.4 Introduction to open-source search engines . . . . . . . . . . . . . . . . 5
2.5 Introduction to academic search engines . . . . . . . . . . . . . . . . . . 5
2.6 Introduction to citation expansion . . . . . . . . . . . . . . . . . . . . . 6

3 Method 7
3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
3.2 Choosing size of citation expansion . . . . . . . . . . . . . . . . . . . . 9
3.3 Choosing parameter values for the simulations . . . . . . . . . . . . . . . 9
3.4 Choosing a statistical model . . . . . . . . . . . . . . . . . . . . . . . . 10

3.4.1 Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3.4.2 Repeated measures and mixed effects . . . . . . . . . . . . . . . 11
3.4.3 Linear Mixed Models . . . . . . . . . . . . . . . . . . . . . . . . 11

3.5 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.5.1 Article database . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.5.2 Elasticsearch . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

4 Theory 15
4.1 The ranking algorithm modification . . . . . . . . . . . . . . . . . . . . 15
4.2 The Rank Biased Cluster-measure . . . . . . . . . . . . . . . . . . . . . 17

4.2.1 Rank Biased measures . . . . . . . . . . . . . . . . . . . . . . . 17
4.2.2 The Rank Biased Cluster-measure . . . . . . . . . . . . . . . . . 18

4.3 The Linear Mixed Model . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3.1 Linear Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3.2 The Linear Mixed Model . . . . . . . . . . . . . . . . . . . . . . 19

vii


Contents

4.3.3 The Likelihood Ratio Test . . . . . . . . . . . . . . . . . . . . . 20
4.4 Intraclass correlation coefficient and effective sample size . . . . . . . . . 21

5 Analysis 23
5.1 Exploring the data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

5.1.1 Overview of the data set . . . . . . . . . . . . . . . . . . . . . . 23
5.1.2 Variance due to the random queries . . . . . . . . . . . . . . . . 24
5.1.3 Main variables . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
5.1.4 New single main variable . . . . . . . . . . . . . . . . . . . . . . 26
5.1.5 The variable d . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5.1.6 The variable l . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

5.2 Linear Mixed Model analysis . . . . . . . . . . . . . . . . . . . . . . . . 30
5.2.1 Model data set . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.2.2 Model formulation . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.2.3 Model assumptions . . . . . . . . . . . . . . . . . . . . . . . . . 31
5.2.4 Final Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

6 Discussion 35

7 Conclusion 37
7.1 Suggestions for future work . . . . . . . . . . . . . . . . . . . . . . . . . 37

Bibliography 39

viii


List of Figures

3.1 Overview of the steps in the simulation from the query generation to score. The vertical
arrows represents parameters used in the respective box. . . . . . . . . . . . . . . 7

3.2 Overview of how the data sets’ variables are collected from the simulations. For each
simulation, a value for each variable is stored. The vertical arrows shows where the
values of the variables are taken from. . . . . . . . . . . . . . . . . . . . . . . 9

4.1 Depth needed in rank biased measures to reach 95% of the maximal score for different
values of d. The calculation assumes an infinite lists with a maximal score for each rank. 18

5.1 Overview of the distribution of scores for the exploratory data set. Left: the distribution
of all scores. Right: the distribution of mean scores for each query. . . . . . . . . . . 24

5.2 The mean scores over the values of the three main variables. Left: the mean for each the
main variables. Right: the mean for two of the main variables, while the third is set to
zero. The variables set to zero are, in turn from the left subpanel, α , β , and γ . . . . . . 25

5.3 The scores over of interpolations between main variables. Each interpolation is a pair-
wise combination of the three main variables, while keeping the respective third variable
at zero. It takes values from the variable written first in the label. Left: the mean scores
for each interpolation. Right: boxplots for each interpolation. The two vacancies in
the right panel are due to that extra simulations was done only to test combinations
with high values of α and low values of β and γ . There are therefore no corresponding
combination in the data set between only β and γ . . . . . . . . . . . . . . . . . . 26

5.4 The influence of the depth variable d on the score for each value of tαγ . The different
values of d are separated by different colors. Left: mean score over tαγ . Right: boxplots
of scores for each tαγ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

5.5 Distribution of scores for values of tαγ and of d. The different values of d are separated
by different colors. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

5.6 The influence of l on the score for each value of tαγ . The different values of l are sepa-
rated by different colors. Top left: mean score over tαγ . Top right: boxplots of scores for
each tαγ . Bottom: the distribution of scores for each tαγ . . . . . . . . . . . . . . . 29

5.7 Diagnostic plots for the Linear Mixed Model. Top left: the density of the residuals, with
the orange curve representing a corresponding normal distribution. Top right: the actual
values over the predicted values, with the straight line representing a perfect match.
Lower left: a QQ-plot, contrasting the residual scores against the normal distribution.
Lower right: the residuals over the predicted values, with the diagonal lines representing
the maximal residual for that predicted value. . . . . . . . . . . . . . . . . . . . 32

5.8 Scatterplot of 1000 samples from the distribution of the random coefficient vector η . . . 33

ix


List of Figures

x


List of Tables

5.1 Variables in the exploratory data. The variable s is the Rank Biased Cluster score, q
is an arbitrary assigned ID for each query, and l is the length of the query (the number
terms). The variables α,β and γ represents the three parameters controlling the ranking
algorithm modification, and the variable d represent the depth parameter used in the
measure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

5.2 The intraclass correlation coefficient and effective sample size for the exploratory data set. 25
5.3 Variables in the model data set. The variable s represents the scores, q represents the

randomly generated queries, the variable d corresponds to the parameter with the same
name used in the measure, and the main variable, tαγ , represent an interpolation between
the original BM25 ranking score and each document’s average vicinity score. . . . . . 30

5.4 The intraclass correlation coefficient and effective sample size for the model data set. . . 30

xi


List of Tables

xii


1
Introduction

Searches can be classified as either look-up or exploratory, with the former having specific
targets and the latter aims for investigating and learning [27]. The former has historically
been the main goal in the research field Information Retrieval [2] and its precise targets
make measuring success relatively easy [10, 2]. Success in exploratory searching is harder
to judge. Its open-ended goals may have several correct answers [39] and exploratory
searchers often struggle with expressing their information need [36]. This makes using
look-up search tools for exploratory search difficult. Supporting exploratory searching
directly has, therefore, started to draw more attention.

In scholarly search, this trend shows in that many academic search engines have
started to help exploration through summarising search results and offering advanced fil-
tering options, with notable implementations at Microsoft Academic1, Dimensions2, and
Semantic Scholar3. The recent increase in ambitious new academic search engines is
partly due to the opening up of access to metadata [24], supported by for example the
Initiative for Open Citations, I4OC4. The improved access to data has also opened up
opportunities for research.

When it comes to the research, however, the field of Exploratory Search is still un-
developed [42]. One central methodological issue relates to the complexity of exploratory
searching. The existing measures do not capture how well new ideas improve exploratory
searching and the lack of well-defined targets make it difficult to find new suitable meth-
ods [42]. The most common approach is to perform user studies aimed at describing the
interaction between humans and the search system [42, 26], which is expensive and de-
manding. It is also difficult to do large enough user studies to determine how the impact
of proposed changes interacts with the many different parts of a search engine. It would,
therefore, be helpful to have measures that can be applied to simulations.

For the emerging field Exploratory Search, it is perhaps more important to assess
modifications conveniently and for free, than to capture the impact on the essence of
exploratory searching. Suitable alternatives are measures that only describe aspects of
exploratory searching. Since these are easier to formulate, they can be applied directly
to ranked lists. This opens for using simulations instead of user studies, which has many
benefits.

By directing focus toward how proposed modifications change the search engine,
compared to if the changes made it better or worse, the researcher can establish how the
modifications can be used. Such knowledge can then become part of an array of tools that

1Microsoft Academic: academic.microsoft.com/home
2Dimensions:app.dimensions.ai/discover/publication
3Semantic Scholar: www.semanticscholar.org/
4I4OC: https://i4oc.orga

1


1. Introduction

can be used when aiming more directly at exploratory searching.
To summarise, studying how relevant modifications affect the search engine might

be fruitful since it can enable using simulations. Even though it does not try to capture the
impact on exploratory searching, it can contribute indirectly. This study tries to exemplify
this view and is a proof of concept of the approach.

1.1 Aim
This thesis analyses how changes to an academic search engine’s standard ranking algo-
rithm affects how the citation links between the articles in the ranked lists produced by
the search engine form groups. More specifically, the aim is to determine if the ranked
articles’ interconnectedness can be controlled by three variables weighting together the
default ranking with the ranking according to two new criteria.

1.2 Societal, ethical, and ecological concerns
There is no substantial societal, ethical, and ecological concerns with regards to this thesis.
The thesis aims to contribute to navigating scientific domains. As such it could help
research no matter direction and therefore contribute to unethical research as well, but
given that science, in general, is still warranted, so is helping science in general. When it
comes to the implementation of the project, there is no relevant harmful effect, since the
project has been based around simulations implemented on a personal laptop. As such,
the projects climate footprint is not much larger than the use of a single laptop, which is
acceptably small for a master thesis project.

1.3 Structure
The rest of this thesis is divided into six sections: background on search engines, method,
theory, analysis, discussion, and conclusion.

The method section describes how the data used in the analysis is generated through
simulations and motivates choice made during the project.

The theory section describes the modification of the ranking algorithm, the measure
used to score how interconnected the articles are, and introduces the Linear Mixed Model
which is used to analyse the data generated from the simulations.

The analysis section consists of an investigative analysis of the data, which overviews
and discusses trends and relationships, and an analysis using the Linear Mixed Model.
The latter includes formulating the relevant Linear Mixed Model with its respective de-
pendent and independent variables, checking the model’s assumptions, and drawing con-
clusions about the data using the model.

2


2
Background on search engines

This sections gives an introduction to search engines and to citation expansion in the
context of searching.

2.1 Introduction to Information Retrieval
Information Retrieval is a large domain centred around searching, with relevance judge-
ment, performance evaluation, and user information needs as the larger topics [9]. Ac-
cording to Robertson [32], the field started in the 1960s, with Salton et al as a central
group. Through the 1980s, many groups, such as Croft’s and Robertson’s, created the
original versions of most of the retrieval models used today.

The commercial side started in the 1990s. Towards 2000, Google took over as the
dominating search engine, due to its interface, crawl, and speed. The PageRank algo-
rithm was part of the success, although it was not as important as usually thought [32,
9]. Nowadays, commercial web search engines incorporate hundreds of features in their
ranking algorithms that are based around huge amounts of user interaction data [9].

2.2 Introduction to search engines
Search engines are systems for finding information in a collection, which consist of doc-
uments such as articles or websites [9]. There are three main parts to a search engine: the
index, the query processing, and the user interface. The index is the database in which
the search algorithm searches. It consists mainly of a vocabulary index and inverted in-
dexes. The former contains all terms in the entire collection of documents while the latter
maps each term to the documents that include it. When the index is set-up, the included
documents are often processed in various ways to improve and speed up the later searches.

The query process is often centred around a ranking algorithm based on a formal
retrieval model [9]. Besides the ranking algorithm, the query process usually includes
several other steps, such as processing the query by removing too common words (stop
words), formatting the words (stemming), and adding synonyms (smoothing). The rank-
ing often includes how relevant the documents are for the query as well as a static ranking
independent of the query.

The last of the main parts is the interface which enables the user interaction [9].

2.3 Introduction to retrieval models
Retrieval models are formal models for assigning each document a relevance score for a
given query [9]. There are many popular ranking models, such as vector space models,
probabilistic models, language models, and learning to rank models [22]. There is no
clear best model. Croft et al assert that the language models are the dominant paradigm

3


2. Background on search engines

[9], while the vector space models are central in implementation [38, 5, 16] and the open-
source search provider Elasticsearch considers the probabilistic BM25 model as leading
[16]. On top of this, machine learning, such as learning to rank models, has become
successful due to the huge amount of data that web search engines have access to [32, 9],
although the data outside web search is often too limited for this approach [9].

The most used model is the vector space model [5]. It was originally mentioned
in the 1970s by Salton et al [9, 38], although there is a confusion about where [13]. In
essence, the model represents each term present in the entire collection of documents with
a dimension in a vector space. In this space, both documents and queries are then repre-
sented as vectors. For a given query, each document is then assigned a similarity score
and ranked accordingly [35, 38, 9]. The currently best similarity measure is the cosine
correlation [9, 38], which scores according to the size of the angle between the query
vector and the document vector [9]. Aside from the similarity measure, the scoring also
depends on how documents and queries are represented by vectors in the vector space.
The representation is done through assigning each document or query a weight along
each dimension. The weights are based on how the terms in the collection are present in
the document or query. The most common weight function is the term frequency-inverse
document frequency (tf-idf), which is often combined with a document length normaliza-
tion [38, 9].

Another group of models is language models, which represent documents with
probability distributions for all words. An example of a language model is the rele-
vance model. Croft et al explains that it fits probability distributions over the number
of occurrences of the words present in the document or query using maximum likelihood
estimation for queries and each document and then compares the distributions using the
Kullback-Leibler divergence measure [9].

The last retrieval model introduced here is the BM25. Since the BM25 retrieval
model is used in the implementation it is here described a little more thoroughly than the
other retrieval models.

The BM25 ranking algorithm is a Bayes classifier [9] developed by Spärck Jones,
Walker, and Robertson in 2000 [37]. It uses a probabilistic framework, where the rel-
evance of a document is formulated as the probability of relevance given the document
specification. By using Bayes’ Theorem and some independence assumptions, the rele-
vance of the document can be formulated as a sum of how probable the content of each
document field is given a relevance specification. Here, document field refers to differ-
ent parts of a document, such as the title or the abstract. These probabilities are then
estimated using weight functions. The weight functions are designed to score how im-
portant matches are between the relevance specification, such as a term in the query, and
the document specification, such as the number of occurrences of the query term in the
document.[37] The weight functions are, however, not specified by the framework. The
commonly used versions are similar to both the weights used in the vector space model
and to the maximum likelihood estimations used in the language models, which lessens
the difference between the three retrieval models. In this thesis, the implementation of
BM25 in the open-source search engine Elasticsearch is used. In Elasticsearch’ imple-

4


2. Background on search engines

mentation[30, 29], the weight function for a single field (such as the title) is

|Q|

∑
i=1

ln
(

1+
N −ni +0.5

ni +0.5

)
fi(k+1)

fi + k(1−b+b ld
lavg

)
,

where |Q| is the number of terms in the query, N is the number of documents, ni is the
number of documents with the ith query term in the field, ld is the length of the field, lavg
is the field’s average length for all documents with that field, and fi is the frequency of the
ith query term in the field. The variables b and k are parameters, controlling the length
normalization and how much a single query term can affect the score.

2.4 Introduction to open-source search engines
There are many open-source search engines, such as Lemur Toolkit & Indri Search En-
gine1, Apache Solr2, and Elasticsearch3 (known as Elastic). The two latter are both based
on Apache Lucene search library4. In this work, the implementation is based around
Elastic.

2.5 Introduction to academic search engines
There are a number of fully developed search engines for scholarly search, from the
most used academic search engine Google Scholar5 [24] and its web search engine twin
Google6, to recent competitors such as Dimensions7, Semantic Scholar,8 and Microsoft
Academic9, to open source attempts such as CrossRef10 and CiteSeerX11. There is also
the search systems provided by academic publishers such as Web of Science12, Science
Direct13, Springer Nature14, and Research Gate15, as well as the numerous search engines
offered by universities, often through the search provider Ebesco16.

There is little information available on how most of these search engines work.
When it comes to Google Scholar, Beel et al analyzed their ranking algorithm and came
to the conclusion that the words in the title have a dominating influence together with
citation counts, and that exact matches are important [4].

The academic search engines CrossRef, CiteSeerX, and Semantic Scholar all dis-
close that they use standard open source solutions together with static ranking and some

1http://www.lemurproject.org/
2https://lucene.apache.org/solr/
3https://www.elastic.co/
4https://lucene.apache.org/
5Google Scholar: scholar.google.com/
6Google: www.google.com
7Dimensions:app.dimensions.ai/discover/publication
8Semantic Scholar: www.semanticscholar.org/
9Microsoft Academic: academic.microsoft.com/home

10CrossRef: https://www.crossref.org/
11CiteSeerX: https://citeseerx.ist.psu.edu/index
12Web of Science: https://login.webofknowledge.com
13Science Direct: https://www.sciencedirect.com/
14Springer Nature: https://www.springernature.com/gp
15Research Gate: https://www.researchgate.net/search?q=
16Ebesco: https://www.ebsco.com/academic-libraries

5


2. Background on search engines

additional features. CrossRef implements a version of the vector space model in the
Apache Solr search engine (although CrossRef is transitioning to Elasticsearch accord-
ing to a developer at CrossRef in a personal mail conversation). CiteSeerX, which is
built upon CiteSeer, also uses Apache Solr [44]. Semantic Scholar uses a vector space
model implemented in Elastic Search, combined with document features in a learning to
rank architecture as well as with a static ranking based on the number of citations, recent
citations, and the time of publication [45, 1].

Researchers’ interest in academic search is increasing [5]. When it comes to the
ranking techniques used in research projects, the most common is the vector space model
with the cosine similarity [24, 5]. Apart from the vector space model, many also use
BM25, which is considered best for scholarly retrieval [24]. Static rankings of papers are
also important with using citation-based metrics as the most common approach [31, 40].

2.6 Introduction to citation expansion
In a search context, citation expansion means assigning relevance to articles based on
their citation links. Until recently, research on citation expansions has been hindered by
the limited access to citation databases [8], but research still suggests that it improves
academic search [24, 25] and the interest in citation expansion is growing [19, 6].

The main argument for the usefulness of citation expansion is that citations, and es-
pecially multiple citations, are a judgement of relevance by researchers. Supporting this,
Belter finds that documents with multiple citation links to relevant papers often are rele-
vant as well [6]. Another argument for citation expansion is that it can up for vocabulary
mismatch, which Croft explains happens when relevant documents do not match a query
because of different wordings [9].

Citation expansion can follow citation links both forward and backward, and there
are three types of relations: direct citation, bibliographic coupling, and co-citation [19,
25]. Bibliographic coupling means that two documents are referring to a third document
[19] and co-citation that a pair of documents are cited by the same third document [19,
24, 20]. In research, many studies base the expansion around seed articles [23, 19, 20, 6],
or expands a co-citation network by using words extracted from cited documents [20].

6


3
Method

This section clarifies how the different parts of this study tie together, what is done in
the implementation, and how the parts are related to the aim of the thesis. First is an
overview of the steps leading from the article database to the data sets discussed in the
analysis. After the overview, different choices in the implementation and the choice of
the statistical model are motivated. Lastly, the database and the search engine set-up are
described.

3.1 Overview
As described at the beginning of the thesis, the aim is to exemplify how modifications to
a search engine can be analysed in the context of exploratory searching. Specifically, this
study examines modifications of a common ranking algorithm and the analysis applies a
Linear Mixed Model to data gathered from search simulations.

The simulations consist of four larger steps: generating random queries, ranking
with the Elasticsearch search engine, reranking according to the modifications, and scor-
ing the result with the measure developed in this thesis. Figure 3.1 provides an overview
and shows the parameters used in each step.

Figure 3.1 – Overview of the steps in the simulation from the query generation to score. The
vertical arrows represents parameters used in the respective box.

In more detail, the simulations are based around the open-source search engine
Elasticsearch (Elastic). The implementation of Elastic indexes a snapshot of the article
metadata used in the academic search engine Semantic Search, consisting of more than
200 million documents [1]. The random query generation is based around a feature in
Elastic that returns l number of random documents, with l being a parameter set in the
simulation. From each of the documents’ titles, a single term is selected and added to
the query. The query is then sent to Elastic, which ranks according to their version of the
algorithm BM25 and returns a ranked list. The documents in that list have a relevance
score which the ranking algorithm modification developed in this study uses to rerank
the documents. Depending on the value of three parameters, α , β , and γ , the modifica-
tion reranks according to either the original BM25 score, the total BM25 score in each
document’s vicinity, the average score in each document’s vicinity, or a combination of

7


3. Method

the three. Here the vicinity of a document means the document itself, all its referenced
documents, and all documents citing it. The weight for the original BM25 score is set
by α , the weight for the vicinity total by β , and the weight for the vicinity average by γ .
Since querying Elastic is a major bottleneck in the simulations, each combination of the
specified values for the ranking algorithm parameters are tried for each query. In total, the
ranked lists are affected by the randomly generated queries and four parameters. These
parameters are the number of terms in the query, l, and the three weights given to the
different ranking criteria, α , β , and γ .

After the ranked lists have been produced, the Rank Biased Cluster-measure, devel-
oped in Section 4.2.2, is used to score how tightly connected the articles are in the ranked
list. The point with this measure is to capture an aspect useful for designing search engines
for exploratory academic searching while avoiding the difficulty of measuring exploratory
search success directly. Since a user start at the top of the ranked lists and stop at some
point before the lists’ end, it matters more if there is a connection early on compared to
further down. The influence of connections is therefore decreased with increasing depth,
with the geometric distribution controlling the decline and with a parameter d controlling
the distribution. This makes the measure rank biased. Similar to how all combinations of
the chosen values of the three ranking algorithm parameters are applied to each query to
speed up the simulations, the measure is applied once for every value of d for each com-
bination of query, α , β , and γ . The score resulting from applying the measure is confined
to the interval [0,1]. A score at exactly zero means that all articles form separate clusters
and a score at exactly one means that all articles are connected with at least one article
higher up in the list, creating a single cluster of articles. In practice, however, both scores
at zero and one can be reached by rounding down or up, respectively. This is because
connections found far from the top of the list have little influence on the total score.

In summary, in each simulation, the ranking algorithm modification developed pro-
duces a ranked list. The list is then scored according to how closely connected the articles
are using the Rank Biased Cluster-measure.

After the ranked lists have been scored, the data for the analysis is collected. For
each simulation, one value is stored for each variable in the analysis data set. There is
one dependent variable, the score, and six independent variables. The labelling of the
variables as dependent and independent comes from the Linear Model framework used
later in the thesis. The custom refers to that the values of the dependent variable are
assumed to depend on the variables used to predict it. Returning to the variables, the
independent variables correspond, in turn, to the parameter controlling the query length,
the ID for the random query, to the three parameters used to modify the ranking, and
the parameter controlling the measure. How the variables relate to the simulations are
presented in Figure 3.2.

Besides these variables, there are numerous other candidate variables, since there
are many different ways set-up in Elastic. A few examples of what can be done is to
change the ranking algorithm used in Elastic, add a static ranking, remove stop words
differently, and add a stemmer. Such changes would likely affect the resulting score and
also likely interact with the ranking algorithm modifications. In a larger study, it would
be important to examine many such interactions to perform what Ferro and Silvello call a
component-wise analysis [21]. Here, however, the time constraints on the thesis limit the
study to the variables already mentioned above.

8


3. Method

Figure 3.2 – Overview of how the data sets’ variables are collected from the simulations. For each
simulation, a value for each variable is stored. The vertical arrows shows where the values of the
variables are taken from.

3.2 Choosing size of citation expansion
The ranking algorithm only considers each documents’ references and citations. It is,
however, easy on a theoretical level to consider the score of documents further away. The
reason why only direct references and citations are used is that the networks for academic
documents often grow fast, making the implementation difficult when documents further
away are included.

3.3 Choosing parameter values for the simulations
In the data sets, there are five covariates that correspond to parameters in the simulations.
These are l, α , β , γ , and d. Below, each parameter used in the simulation is discussed
briefly and the chosen parameters values stated and argued for. These values then become
the values of the covariates in the data sets.

The main reason for not trying a larger range of values is that the number of obser-
vations grows quickly when each combination of the relevant four parameters is tried for
each query (the number of terms in the query, l, is unique for each query, so it does not
add to the number of observations).

The parameter l controls the number of terms in the query. In the simulations, it
takes the values {3,5,7}. The values were chosen to represent typical query lengths and
although there probably are many queries with only two words and also some queries
longer than seven, these lengths are fairly typical.

The values of the modification parameters α , β , and γ was first chosen as {0,0.25,
0.5,0.75,1}, aiming for an even spread over the interval [0,1]. While it is possible to
choose values outside [0,1], values larger than one are uninteresting and negative values
would make the interpretation messy. The reason why large values are uninteresting is
that what matters is the relation between the three parameters, which means that the ef-
fect of increasing the value of one of the parameters can already be achieved by decreasing
the others. After some preliminary analysis, the values {0,0.25,0.5,0.75,1} were com-
plemented with the values {0.9,0.95} for α and with {0.1,0.05} for β and γ in order
to follow up on an interesting pattern in the data. For the final analysis with the Linear
Mixed Model, the original values in {0,0.25,0.5,0.75,1} are used.

The parameter d controls how fast the importance of entries the ranked lists de-
creases with the depth. The parameter takes the values {0.8,0.9,0.95}. The value d = 0.8

9


3. Method

was included since it is used by Ferro and Silvello in their study [21] and roughly corre-
sponds to that the 15 first entries in the ranked lists determine 95% percent of the final
score. For the two other values, 0.9 and 0.95, the corresponding sizes are roughly the 30
and 50 first entries, respectively. These values were judged reasonable since the number
of entries needed for setting the 95% of the score are relatively evenly spread out.

3.4 Choosing a statistical model
This part of the method section explains why the Linear Mixed Model is a good choice
for basing the analysis on and comments on what the alternatives are. First comes a
motivation for selecting a linear model and a short introduction to different linear models,
then an explanation of two aspects of the simulations important for selecting the right
type of linear model, and finally arguments for selecting the Linear Mixed Model from
the remaining relevant options.

3.4.1 Linear Models
The two most common ways to quantify the analysis in all of science is through using
the frameworks regression and analysis of variance (ANOVA) [33], while ANOVA and
different t-tests are dominant in the Information Retrieval field [34]. These frameworks
can be generalised to the regression model General Linear Model [33]. The familiarity
and range of the General Linear Model and its generalisations, therefore, makes it a fitting
choice for a discussion of a more proper statistical approach in a subfield to Information
Retrieval.

Linear Regression models are, however, a broad class of models. Common to all
is that the mean of a dependent variable is modelled as a linear combination of one or
more independent variables [7]. There are many related, more or less general, versions
of linear models and, unfortunately, some confusion about how to name the versions. For
clarification, some of the relevant models are described below, using the names the way
they are used in Wikipedia’s series on Regression analysis.

The simplest of the linear regression model is called Simple Regression, which
predicts a single normally distributed dependent variable from an independent variable.
The closest larger model is the Multiple Linear Regression, which can have any number
of independent variables. The next, more versatile model, is the General Linear Model,
which allows for predicting many dependent variables and for using both continuous and
discrete variables.

From the General Linear Model, there are two different types of abstractions rele-
vant for this study. The first is the Generalized Linear Models, which allows for assuming
other distributions for the predictor variables than the normal distribution. The other type
of model is the Linear Mixed Model, which is developed to handle data that is grouped
into clusters centred around a random factor. The two type models, Generalized Lin-
ear Models and Linear Mixed Model, can also be combined into the Generalized Linear
Mixed Model.

In this study, the Linear Mixed Model is used. This is also the model recommended
by Ferro and Silvello, when showing how to take into account the interaction between the
proposed changes and other parts of the search engine [21].

10


3. Method

3.4.2 Repeated measures and mixed effects
There are two particularly relevant aspects of the simulations for choosing the Linear
Mixed Model as the statistical model. The first is that the simulations use repeated mea-
surements, and the second is the studied effects are a mix of fixed and random effects.

Starting with the explanation of the first aspect, repeated measures are defined by
that the dependent variable (the simulation scores) are grouped by a source of random-
ness (the random queries). Since the same query is tried with all ranking algorithm mod-
ifications, the scores for that query are statistically dependent.[33] The reason for using
repeated measures in the simulations is that the size of the database makes querying Elas-
tic the by far largest bottleneck. It is, therefore, important to send queries to Elastic as
seldom as possible. However, according to Rutherford, there is also a benefit to using the
repeated measurements in that the variance due to the random queries is likely to be less
important [33].

The second aspect is that the simulations create a mix of fixed and random effects.
When the independent variables are analysed as fixed-effects it is assumed that all their
relevant parts are represented [33]. A simplified example is if a variable represents the
age of a person in a study of length, both children and teenagers need to be included, but
it is not as important to include adults (since it is known that adults no longer grows after
a certain age). In the analysis data sets, all variables that correspond to parameters in the
simulations are studied as fixed. The reason is that their values, although few, are chosen
to be representative. In contrast, independent variables analysed as random-effects are
assumed to be only a random sample of a population of values [33]. A typical example
of the latter is using human subjects. It is seldom reasonable to handpick people for a
study since it would be difficult to be sure that they are representative and participants are
therefore selected through random sampling. In the simulations, the queries are randomly
sampled from a vast number of possible queries and it is hard to know how representative
they are. When analysing the data sets, the variable representing the random queries, q,
thus needs to be studied as a random effect.

In total the analysis data sets consists of variables representing mixed-effects created
from repeated measurements. There are clusters around random factors in the data, which
is what Mixed Models are designed to handle.

3.4.3 Linear Mixed Models
The reasons for choosing the Linear Mixed Model instead of other Mixed Models are dis-
cussed here. Besides the Linear Mixed Model, the two main alternatives are the Logistic
Regression, i.e. a Generalized Linear Mixed Model with the binomial distribution, and
the Beta Regression Model.

The argument for the Linear Mixed Model is that it is the simplest Mixed Model,
making it the default choice. There is a problem, though, in that the dependent variable
here only takes values in the interval [0,1] which do not fit the normal distribution assumed
in the Linear Mixed Model. Breaking the normality assumption this way is, however,
common when using measures in the Information Retrieval field [7].

The closest alternative is to use the Generalized Linear Mixed Model with the bino-
mial distribution, which does give predictions in the right interval. The problem with the
binomial distributions is that it is intended to predict the probability that the dependent
variable equals one instead of predicting the dependent variables value directly. It can be

11


3. Method

used, but using the model becomes more difficult.
The third relevant model is the Beta Regression Model. The argument for it is

that the data fulfils all the model assumptions. The Beta Regression Model is also very
flexible. The problem with it is that it is rather more involved and the theory behind is not
as well developed.[12]

Given the time constraints for this thesis, the most reasonable of the three alterna-
tives is the Linear Mixed Model. Although the data do not fit its assumptions completely
it can still be used. This is supported by Carterette, who, when examining the influence of
breaking the distributional assumption in Information Retrieval experiments, concludes
that the violation only has an acceptably small effect on the resulting model [7]. Another
reason against the two other alternatives is that they are much more technical than the typ-
ical analysis used in Information Retrieval. This makes them unnecessarily complicated
compared to the Linear Mixed Model.

3.5 Implementation
This part describes the implementation in more detail.

3.5.1 Article database
The article data that is indexed with the Elastic search engine is a snapshot of the article
metadata from the Semantic Scholar academic search engine which has been made avail-
able for research. The data is thoroughly described in Ammar et al’s article Construction
of the Literature Graph in Semantic Scholar [1]. In summary, it consists of the metadata
of close to 200 million documents (the actual number of documents is 177 443 710), each
with many fields such as title, abstracts, incitations, and outcitations. For this study, only
the fields title, incitations, and outcitations were kept from the metadata to increase the
speed of the Elastic search implementation. The reason why the abstract was not included
was that the size of the database made it difficult to handle on the laptop used for the
simulations.

The data set could have been trimmed further by e.g. removing all articles but
those written in English. However, using a large set of metadata in the implementation is
important since when the citation expansion modifies the ranking, the documents that are
linked to must also be in the metadata set if the expansion is to work properly. The full
data set was therefore kept.

3.5.2 Elasticsearch
The search engine used in the experiments is the open-source search engine Elasticsearch
(known as Elastic) [14]. Elastic is built upon the Apache Lucene search library. [15]. It
was used with the default settings. Elastic was chosen both since it is relatively easy to
operate and since it is a popular choice.

When setting up the index, the data was put in without any relevant pre-processing,
except the already mentioned discarding of other fields than ID, title, citations, and refer-
ences. It was not necessary to remove stop words (common words like ’and’) to improve
the precision of queries since the retrieval model BM25 assigns them very little weight.
However, doing so might have increased the speed of the database and also have improved
the generation of the randomised queries. In hindsight, it would also have been useful to
apply a stemmer that would have reduced words to common forms.

12


3. Method

When it comes to the ranking algorithm, the ranking algorithm BM25 was used
with the weight function as described in Section 2.3. The two parameters in the weight
function are left at their default value. There are several other ranking functions avail-
able in Elastic, such as the commonly used vector space model [18], and even some more
advanced smoothing algorithms [17]. Although it would have been interesting to try dif-
ferent ranking algorithms, doing so requires reindexing the database and it was, therefore,
not done due to the project’s time constraints.

The set-up differs from a complete search engine in several ways. Online academic
search engines naturally also index the abstracts and the body of the articles, although this
difference should not affect the results as much as one might believe since most weight is
usually given to the title. Another difference is that the implementation in the thesis does
not use any advanced static ranking (such as Page Rank) to boost quality articles, although
a feature is used that punishes articles without either references or citations. Another
similar static ranking that is not added is the recency boost that can be given newer articles
to compensate for that older articles have accumulated more citations. Furthermore, it was
also out of the scope of the thesis to use more advanced additional information, such as
author prominence, upcoming events, trends, and so forth.

13


3. Method

14


4
Theory

This section has three large parts: a description of the ranking algorithm modifications,
a description of the Rank Biased Cluster-Measure, and a description of the Linear Mixed
Model framework.

4.1 The ranking algorithm modification
The ranking algorithm modification is described in the following section. Even if the
context is exploratory academic searching, the modification is discussed in a more general
wording, by for example referring to documents as nodes. The idea is to allow it to
be applied to other areas than articles with citation links, such as the more general area
knowledge graphs. Along this line, the modification is formulated independently of which
retrieval models that it is used together with and the retrieval model used is referred only
to as an independent scoring function.

Moving on to the mathematical description, assume that there is an undirected graph
with nodes, with Di being the ith node. Further assume that there is some arbitrary scoring
function SO(·), with O for original, which can be used to score each node. In the context
of this study, the nodes are documents, the scoring is done through the BM25 algorithm,
and the links consist of references and citations. Given a natural number, r, define a r
step-vicinity, Vr(D), around a node, D, as the set consisting of D together with all nodes
up to r steps away in the graph. In the simulations, the vicinity V1(D) is used and it
includes the document D itself together with its referenced documents and the documents
that cite it.

Define a vicinity scoring function, SV (D,SO,r,βr,γr) ∈ (−∞,∞), which calculates
the score of a node D, based on the score given by the original scoring function, SO,
to the nodes in the r step-vicinity around the node. The function uses two parameters,
βr ∈ (−∞,∞) and γr ∈ (−∞,∞), to combine the total vicinity SO score and the average
vicinity score. If the values for βr are restricted to [0,1] and γr is set to 1 − βr, the
combination becomes an interpolation between the total and the average vicinity score,
but there is no need to commit to such a restriction. In other words, SV (D,SO,r,βr,γr) is
defined as

SV (D,SO,r,βr,γr) = βr ∑
j:D j∈Vr(D)

SO(D j)+ γr ∑
j:D j∈Vr(D)

SO(D j)

|Vr(D)|
.

The vicinity scoring function can now be used to create a new scoring function,
SM, with M for modification, that combine different sized vicinities. As discussed in
Section 3.2, SM is here limited to combining SV (D,SO,0,β0,γ0) and SV (D,SO,1,β1,γ1),
and only this case is described. Let ω ∈ (−∞,∞) and η ∈ (−∞,∞) be two parameters that

15


4. Theory

weight the two vicinity functions. Define SM(D,SO,ω,η ,β0,γ0,β1,γ1) as follows

SM(D,SO,ω,η ,β0,γ0,β1,γ1) = ωSV (D,SO,0,β0,γ0)+ηSV (D,SO,1,β1,γ1) =

= ω
(

β0 ∑
j:D j∈V0(D)

SO(D j)+ γ0 ∑
j:D j∈V0(D)

SO(D j)

|V0,SO(D)|

)
+

+η
(

β1 ∑
j:D j∈V1(D)

SO(D j)+ γ1 ∑
j:D j∈V1(D)

SO(D j)

|V1(D)|

)
.

Next, the formula can be simplified since |V0(D)| equals one. This leads to that

SM(D,SO,ω,η ,β0,γ0,β1,γ1) = ωSV (D,SO,0,β0,γ0)+ηSV (D,SO,1,β1,γ1) =

=

[
ω
(

β0 ∑
j:D j∈V0(D)

SO(D j)+ γ0 ∑
j:D j∈V0(D)

SO(D j)

|V0,SO(D)|

)
= ω(β0 + γ0)SO(D) =

= [ω(β0 + γ0) = α] = αSO(D)

]
= αSO(D)+ηSV (D,SO,1,β1,γ1) =

= αSO(D)+ηβ1 ∑
j:D j∈V1(D)

SO(D j)+ηγ1 ∑
j:D j∈V1(D)

SO(D j)

|V1(D)|
= [ηβ1 = β ,ηγ1 = γ] =

= αSO(D)+β ∑
j:D j∈V1(D)

SO(D j)+ γ ∑
j:D j∈V1(D)

SO(D j)

|V1(D)|
= SM(D,SO,α,β ,γ),

with α ∈ (−∞,∞), β ∈ (−∞,∞), and γ ∈ (−∞,∞). As with SV (D,SO,r,β j,γ j), the param-
eters α,β , and γ can be chosen so that the combination becomes an interpolation between
the original retrieval model matching score given by SO(·) and between the vicinity’s total
and average scores. Furthermore, even though the parameters can formally take any real
values, it is only interesting to define them in relation to each other. This is because the
ranking of the nodes does not change if the scores are multiplied by some constant. In the
simulations, the values for the parameters are chosen within [0,1].

16


4. Theory

4.2 The Rank Biased Cluster-measure
This section describes the Rank Biased Cluster-measure that is used to score how closely
interconnected the ranked lists produced in the simulations are. Although this measure
will be used to score the documents ranked according to their score given by the ranking
algorithm modification described in the previous section, this measure is separate from
how the ranked lists were ranked and from the score that the lists are ranked according to.
Instead, it only uses the articles’ citation links and position in the list. It can, therefore, be
used with any ranking algorithm.

4.2.1 Rank Biased measures
Rank Biased measures are a type of measure developed by Webber, Moffat, and Zobel for
their Rank Biased Overlap-measure [41]. This type of measures give more weight to the
beginning of ranked lists, since a user starts to examine ranked list at the top and stop at
some depth, usually without having seen the entire list.

The tool used by Webber, Moffat, and Zobel to model the user’s behaviour when
formulating their Rank Biased Overlap-measure, is to bias a scoring at each depth by a
decreasing probability. This leads to that the impact of the beginning is not overshadowed
by the rest of the list. The probabilistic interpretation is that the user starts at the top and
then have a given probability to stop before each next entry. The aspect of the list that the
measure tries to capture is judged (i.e. scored) at the depth where the user stops. Since this
depth is random, the Rank Biased measures score the list according to the expected value
of user scores.[41] This idea will be used to develop the Rank Biased Cluster-measure
below. First, though, is a description of the general form of Rank Biased measures.

Building on Webber, Moffat, and Zobel’s approach, consider the family of measures
of the form:

M(A,w) =
∞

∑
k=1

wkAk

where w is a vector of weights capturing the distribution of where the user stops and A is a
vector of scores between zero and one for each depth. How the vector of scores is defined
depends on which Rank Biased measure it is. In the Rank Biased Cluster measure below,
Ak will represent the interconnectedness between the k highest ranked articles. Then
0 ≤ M ≤ ∑∞

k=1 wk = 1, and each Ak has a fixed contribution wk.[41]
For the weights, Webber, Moffat, and Zobel use the geometric distribution [41],

which will also be used here. A difference in how they define the distribution is that
the parameter normally used, p ∈ (0,1], is defined as 1− p. This thesis continues their
usage. Another difference in the description below compared to the standard usage of the
geometric distribution is that the parameter p is called d. This is to avoid confusion in
the analysis section when discussing both the distribution parameter and p-values, i.e. the
probability of getting at least as extreme values given a tested hypothesis. To sum up, the
parameter used here is d ∈ [0,1).

Following Webber, Moffat, and Zobel [41], let the kth weight have the value (1−
d)dk−1, for d ∈ [0,1). The rank-biased measure then becomes

MRB(A,d) = (1−d)
∞

∑
k=1

dk−1Ak.

17


4. Theory

0

20

40

60

0.
6

0.
7

0.
75 0.

8
0.

85 0.
9

0.
95

d

d
e
p

th

Figure 4.1 – Depth needed in rank biased measures to reach 95% of the maximal score for different
values of d. The calculation assumes an infinite lists with a maximal score for each rank.

In the formula, d decides how fast the weights decrease, with a smaller d leading
to a larger focus on the start of the list. By choosing d close to one, the important part of
the infinite ranked list can become arbitrary long. Real lists produced by search engines
do naturally not have an infinite length. Despite this, it is a reasonable approximation,
since most search result lists in practical cases are long enough for the weights to become
insignificantly small (for reasonable choices of d). Furthermore, the application of the
measures will in a practical application often need to use a cut-off point to speed up the
computations. This then gives a lower bound on the score. Where such cut-off points
can be set depends on both the precision needed and on the used parameter d. The latter
dependence is illustrated in Figure 4.1, where the depth needed to reach 95% of the max
score (given a maximal score at each rank) is shown for different values of d.

4.2.2 The Rank Biased Cluster-measure
This measure attempts to capture how connected a single ordered list of results is while
giving more importance to connections between the top items. It will be referred to as the
Rank Biased Cluster-measure.

Let Ck be the number of clusters in the list at depth k, with k starting at two and
with a cluster defined as a group of documents in the list above or at depth k in which all
documents can be reached from any of other using an arbitrary number of intermediate
steps within the group.

Then define the score at depth k, Ak, as

Ak =
(k−1)− (Ck −1)

k−1
=

k−Ck

k−1
,

which fulfils the requirement that Ak ∈ [0,1]. The subtraction of one from each term is
needed since there is always at least one cluster, while, on the other end, can be at most k
clusters. The minuses therefore allow Ak to take the values zero and one.

18


4. Theory

The score srcb(d) for the Rank Biased Cluster measure becomes

srcb(d) = (1−d)
∞

∑
k=2

dk−2 k−Ck

k−1
.

4.3 The Linear Mixed Model
In this section, the theory behind the Linear Mixed Model is described. Since the Linear
Mixed Model is an extension of Linear Regression, the General Linear Model is first
described as a brief introduction to linear models.

4.3.1 Linear Models
General Linear Models describe, according to Rutherford [33], each observation of the
dependent variable, y, with the formula y = Prediction+Error. The prediction term cap-
tures the influence of the independent variables on y. It consists of a linear combination
between the independent variables together with a constant and their respective estimated
influence. The constant is called intercept and adds a default value for y. The error term
captures that y might not be completely determined by the intercept and the indepen-
dent variables. In the General Linear Model, the error term is assumed to be normally
distributed, with a zero mean and the variance estimated when fitting the model.

Following Olofsson and Andersson [28], let the n observations, Y , be an n×1 vector
of random variables, the k independent variables (including the intercept) be a n×k matrix
X with know values, λ be a k× 1 vector, and the errors, ε , be a n× 1 vector with each
element independently distributed according to N (0,σ2). The General Linear Model
can then be written as

Y = Xλ + ε,

where λ and σ2 are estimated when fitting the model.

4.3.2 The Linear Mixed Model
The Mixed Model inherits most of the framework of the General Linear Model [11].
The difference is that the formula for the Linear Mixed Model includes a random term
as well on top of the sum of the fixed prediction term and the error term used in the
General Linear Model. This is due to that in the Linear Mixed Model, the data is assumed
to be organised in clusters within which the observations are dependent. The clusters
themselves are, however, assumed to be independent of each other. As Wiley and Wiley
explain [43], there are many types of possible dependencies within clusters, each with
its corresponding Linear Mixed Model. A good example is the most simple version, the
random intercept-only model. It has a different mean for each cluster, but apart from that
the observations behave as in the General Linear Model around their respective cluster
mean.

The version of the Linear Mixed Model that will be used in this study is the random
slope and intercept-model, which adds random slopes to the random intercept-only model.
On top of having different means for each cluster, Wiley and Wiley explain [43], the
random slopes let the influence of random factors vary between clusters. Since both the
random intercept and random slopes depend on the cluster, they are often correlated and
therefore described using a multivariate normal distribution.

19


4. Theory

The reason why the random slope and intercept-model is suitable for the data gen-
erated in this study is discussed in section 3.4, but in short, the scores from the measure
likely varies with the different queries and the ranking algorithm modification depend on
which ranked list it modifies. Thus, the random query’s influence on the score creates
a random intercept and the query’s influence on the effect of the modifications creates a
random slope.

The Linear Mixed Model has five main assumptions [43]. The first three concern
the fixed term and the error term. These are that the relationship between the fixed factors
and the overall mean is linear, that there is no trend (homogeneity) in the residuals, i.e.
the differences between the predicted values and the actual values, and that the residuals
are normally distributed. The last two assumptions concern the random effects. These
are assumed to be independent between the different cluster and multivariate normally
distributed within.

Following Demidenko [11], let i indicate the ith cluster with yi ∈ RK×1 being the
K observations within the ith cluster. Let the values of the m variables with a fixed con-
tribution, including the fixed intercept, be described by the matrix Xi ∈ RK×m and the
vector λ ∈ Rm×1 be the coefficients describing their influence. Let the values of the l
variables with a random contribution, including a random intercept, be described by the
matrix Zi ∈ RK×l and ηi ∈ Rl×1 be the random coefficients describing their influence.
Finally, let the independently and identically normally distributed errors be described by
the vector εi ∈ RK×1. The model can then be written as

yi = X iλ +Ziη i + ε i,

η i ∼ N (0,Σ),
ε i ∼ N (0,σ2I),

(4.1)

with X iλ capturing the fixed contribution to the observations, Ziη i the random contri-
bution, and ε i the variance left unexplained by the model. In this formulation, it is the
parameters λ ,Σ, and σ2 that are estimated when the model is adapted to the data.

4.3.3 The Likelihood Ratio Test
The Linear Mixed Model is harder to work with than regular Linear Models since it is
more complex. The former is also more of an open area of research and the community
much smaller, leading to that there are often situations where there are no clearly correct
option. There are, for example, no standard tests comparable to the F-test (since it is not
always possible to know the degrees of freedom) nor is there a default goodness of fit
statistic comparable to the Linear Model’s adjusted R2 value.

Despite the difficulties, it is important to test how significant the different variables
in a chosen model are. Through such testing, variables that are not central to the model
can be eliminated. This is important in order to avoid over-fitting, which happens when a
too advanced model finds trends in the data that are only there by chance.

In a Linear Mixed Model, there are several approaches for testing the significance
of variables. The one chosen for this thesis is to compare nested models with likelihood
ratio tests. A nested model is a smaller version of another model, i.e. the full model
includes all variables from the smaller model and some more. When the likelihood ratio
test is applied in the analysis, the difference between the nested model and the full model

20


4. Theory

will only be a single variable. This is because the test is then more accurate for Linear
Mixed Models.

Likelihood ratio tests compare two proposed models about the data, quantifying if
the difference between them is significant. The test estimates how likely the data is if the
model is true for each model. The likelihood of the data given the smaller model is then
divided with the likelihood given the larger model. The version of the test used in the
analysis, provided by the anova function in R car package, computes the double negative
logarithm of the ratio and compares it with a χ2 distribution. This approach is based on
Wilk’s theorem.

4.4 Intraclass correlation coefficient and effective sample
size

Two tools that can be used to get an understanding of clustered data are the effective
sample size, Ne f f ective and the intraclass correlation coefficient, IICC.

The effective sample size gives an indication on how many independent samples
the data corresponds to. This is useful for getting an intuition about the data, since inde-
pendent samples are easier to understand. As Wiley and Wiley explains [43], the formula
for Ne f f ective depends on the number of independent clusters, N, and the number of ob-
servations per cluster, k, and the intraclass correlation coefficient (described below). The
formula is as follows

Ne f f ective =
Nk

1+(k−1)IICC
.

The intraclass correlation coefficient captures how much of the variance in the data
that comes from having different clusters in the data. It does so by dividing the variance
between clusters with the sum of the variance between and within the clusters. According
to Wiley and Wiley [43], the estimation of the variances depends on the implementation,
but given the variances the formula for the intraclass correlation coefficient is

IICC =
σ2

between

σ2
between +σ2

within
.

21


4. Theory

22


5
Analysis

The following section is divided into two parts, where the first explores the results from
the simulations and the second part applies the Linear Mixed Model. The two parts use
two different data sets, which will be referred to the exploratory data set and the model
data set below. Both data sets are generated through the simulations. The difference
between them is that the model data set has fewer variables and more observations per
variable combination. The reason why a second data set was generated is that more ob-
servations per variable combination helps the model to handle the large variance in the
data that is due to the randomly generated queries.

5.1 Exploring the data
This part of the results overviews and explores the relations between the variables in the
data. Besides building an understanding of the relationships between the variables in
the data set, it is the observations in this part that is used to select the variables that are
included in the Linear Mixed Model.

5.1.1 Overview of the data set
The variables in the exploratory data set are displayed in table 5.1. The meaning of each
variable is summarized here (see Section 3.1 for a more complete overview). The scores
given by the Rank Biased Cluster-measure are denoted with s and form the dependent
variable later in the model. The other variables are all among the model’s independent
variables.

The variable q represents an ID of the randomly generated queries, the variable l
represents the number of words in the query, and the variable d corresponds to a parameter
used in the Rank Biased Cluster-measure, with larger values of d representing that a larger
part of the final ranked list can influence the score.

The three remaining variables, α,β , and γ , are the variables of main interest. To-
gether they correspond to how the ranking score given by the search engines default BM25
ranking algorithm is combined with the total score and the average score in each docu-
ment’s vicinity. The combination α = 1,β = 0,γ = 0 ranks according to the B25 al-
gorithm, α = 0,β = 1,γ = 0 according to the vicinity total, and α = 0,β = 0,γ = 1
according to the vicinity average.

As can be seen in Table 5.1, the total number of scored observations (srbc), is much
larger than the number of separate queries. This is because the simulations apply each
combination of the main parameters, α,β , and γ , to each query and then evaluates each
resulting ranked list once for each value of d. The total number of observations thus
becomes 300× (7×7×7−1)×3 = 307800. The minus one account for that the combi-
nation α = β = γ = 0 has been removed since it gives a zero score for all documents.

23


5. Analysis

Table 5.1 – Variables in the exploratory data. The variable s is the Rank Biased Cluster score, q
is an arbitrary assigned ID for each query, and l is the length of the query (the number terms). The
variables α,β and γ represents the three parameters controlling the ranking algorithm modification,
and the variable d represent the depth parameter used in the measure.

Variable Values Total
s ∈ [0,1] 307800
q ∈ N 300
l {3,5,7}
α {0, 0.25, 0.5, 0.75, 0.9, 0.95, 1}
β {0, 0.05, 0.1, 0.25, 0.5, 0.75, 1}
γ {0, 0.05, 0.1, 0.25, 0.5, 0.75, 1}
d {0.8, 0.9, 0.95}

5.1.2 Variance due to the random queries
The left panel in Figure 5.1 overviews the distribution of the scores. Particularly interest-
ing is a large spike in the density close to the maximal score 1. Something else to notice
is that the density never comes close to zero for the lower end of the scores, which will
be important to consider when fitting the model. These points will be returned to further
down after the influence of the different factors have been examined.

The right panel in Figure 5.1 shows the distribution of mean scores for each query.
The averaging over the main variable and the depth variable does not influence the overall
appearance very much. The one notable change is the disappearance of the large spike in
the density for scores close to one. This means that the spike was not there for all variable
combinations.

Figure 5.1 – Overview of the distribution of scores for the exploratory data set. Left: the distribu-
tion of all scores. Right: the distribution of mean scores for each query.

The intraclass correlation coefficient and the effective sample size for the exploratory
data set are displayed in Table 5.2. The strikingly large variation between the queries,
which each form an independent cluster of measurements, accounts for almost 80% of
the total variation. This large variation between the queries means that a large sample is
needed to investigate the main factors. This can also be seen in that the actual sample size
is almost 800 times larger than the calculated effective sample size.

24


5. Analysis

Table 5.2 – The intraclass correlation coefficient and effective sample size for the exploratory data
set.

Type ICC Total
Queries (independent clusters) 0.76 300
Residual 0.24
Sample Size 307800
Effective Sample Size 394

5.1.3 Main variables
In the following part, the relation between the variables α,β , and γ , and the score is
explored. As previously described, these three variables correspond in turn to weights
given to each document’s original BM25 ranking score, its total vicinity score and its
vicinity average. Further down, the analysis will focus only on an interpolation between
two of these three variables while setting the third to zero. Before discussing that though,
the three variables and their relationships to the score are introduced here.

To get an initial feel for that the influence of the three variables on the score depend
on each other, the mean scores over the variable values are calculated separately for each
variable and plotted together in the left panel of Figure 5.2. From the mostly horizontal
lines in the figure, it is evident that neither of the main variables dominates the result
on their own. This can be compared with a much larger change in mean score shown
in the three panels to the right of the figure. Here the mean score over variable values is
calculated for each of the three variables, given that one of the other main variables is zero.
The reason for only setting one of the variables to zero at a time is that the modification to
the ranking algorithm theoretically only changes with the relation between its parameters.
Therefore, if two variables are zero, changes in the third does not change the score.

When comparing the different panels in the right part of Figure 5.2, it is clear that
the relation between either α and β or α and γ is of the most interest, which is reasonable
since those to relations concerns how much the original ranking is changed, while the
relation between β and γ is about whether the citation expansion is normalised.

0.
00

0.
25

0.
50

0.
75

1.
00

0.
00

0.
25

0.
50

0.
75

1.
00

Parameter value

M
a

rg
in

a
l 

m
e

a
n

 s
c

o
re

Alpha
Beta
Gamma

Alpha = 0 Beta = 0 Gamma = 0

0.
00

0.
25

0.
50

0.
75

1.
00

0.
00

0.
25

0.
50

0.
75

1.
00

0.
00

0.
25

0.
50

0.
75

1.
000.

00

0.
25

0.
50

0.
75

1.
00

Parameter value

M
a
rg

in
a
l 
m

e
a
n

 s
c
o

re

Alpha
Beta
Gamma

Figure 5.2 – The mean scores over the values of the three main variables. Left: the mean for each
the main variables. Right: the mean for two of the main variables, while the third is set to zero. The
variables set to zero are, in turn from the left subpanel, α , β , and γ .

To investigate the impact of the ratio between pairwise combination of the main
variables, three new variables is formed by pairwise interpolations of the three main vari-

25


5. Analysis

ables. The new variables are tαβ ∈ [0,1], tαγ ∈ [0,1], and tγβ ∈ [0,1] and they fulfilled the
following relations:

tαβ : α = tαβ , β = 1− tαβ ,

tαγ : α = tαγ , γ = 1− tαγ ,

tγβ : γ = tγβ , β = 1− tγβ .

Figure 5.3 shows mean scores and boxplots over the values of the interpolations,
each calculated with the respective third variable at zero. Although the information con-
veyed by plotting the means is also included in the boxplots, the mean scores are shown
separately in the figure’s left panel for a clearer view. As noted above, the trade-off be-
tween ranking according to the vicinity total and the vicinity average, which in the figure
is represented by the interpolation between β and γ , only weakly affect the score. The two
other trade-offs, tαβ and tαγ , both clearly show that the mean score decreases when the
original BM25 score is given a larger influence. When comparing tαβ and tαγ , it is clear
that while tαβ changes the mean score less than tαγ , it does so much more drastically. A
likely reason for the more drastic effect of vicinity total is that it gives a larger bonus to
documents and therefore dominates small values of α .

0.
00

0.
25

0.
50

0.
75

1.
00

0.
00

0.
25

0.
50

0.
75

1.
00

Interpolation parameter

M
e
a
n

 s
c
o

re

Interpolations

Alpha&Beta
Alpha&Gamma
Gamma&Beta

alpha_beta alpha_gamma gamma_beta

0
0.

25 0.
5

0.
75 0.

9
0.

97
5

1 0
0.

25 0.
5

0.
75 0.

9
0.

97
5

1 0
0.

25 0.
5

0.
75 0.

9
0.

97
5

10.
00

0.
25

0.
50

0.
75

1.
00

Interpolation parameter

S
c

o
re

Alpha&Beta Alpha&Gamma Gamma&Beta

Figure 5.3 – The scores over of interpolations between main variables. Each interpolation is a
pairwise combination of the three main variables, while keeping the respective third variable at
zero. It takes values from the variable written first in the label. Left: the mean scores for each
interpolation. Right: boxplots for each interpolation. The two vacancies in the right panel are due
to that extra simulations was done only to test combinations with high values of α and low values
of β and γ . There are therefore no corresponding combination in the data set between only β and
γ .

The right part of Figure 5.3 contains the boxplots showing the spread of values
around the mean scores. The most important observation here is including the vicinity
total leads to large variance. This can be seen both in that the spread for tαβ is larger than
for tαγ and that the spread drops when tαβ = 1 or tγβ = 1 since β = 0 in both cases.

5.1.4 New single main variable
For the rest of the analysis, the discussion will focus on the interpolations tαγ . Shifting
the focus to the trade-off between the different types of scoring is more interesting since it
better illustrates the effect of the modification. While it is possible to create interpolations
between all three of α , β , and γ , limiting the trade-offs to two of the variables makes the
result easier to intuitively interpret.

26


5. Analysis

One reason for only selecting one trade-off to focus on is that its influence on the
score will be clearest if it is included alone in the model. Another reason is that the
three different trade-offs are somewhat collinear. This means that the interpolations have
overlapping effects on the score, with the influence of tαβ and tαγ as an example. The
difference between collinearity and interaction effects is that in interaction the factors
affect each other’s influence while in collinearity they have the same effect. This overlap
makes it important to remove collinearities from the model.

Finally, selecting either tαβ or tαγ is more suitable, since they represent the trade-
off between the original score and a type of modification. Between these two, tαγ has
the largest span in mean scores, the more gradual change in mean score, and the lesser
variance. It is, therefore, better from an applied perspective. Importantly, tαγ also fit the
assumption behind the Linear Mixed Model much better, even when the tαβ variable is
transformed to level out the drastic change in scores for values of tαβ close to one.

As part of the investigation into tαγ , the statistical significance of including it as a
fixed effect and as a random effect was tested using the likelihood ratio test. First the
inclusion of t as only a fixed effect was tested. That is, a model with a fixed intecept, a
fixed t, and a random intercept was tested against the nested model. The smaller model
only included a fixed intercept and a random intercept. When comparing the two models,
the likelihood ratio test accepts the inclusion of t (χ2 = 846, p =< 1e−16). The p-value
given by the test says that it is very unlikely to get as an extreme difference between the
two models if the extra variable included in the full model was not important.

When the likelihood ratio test is applied here and further down, the values of tαγ
are limited to {0,0.25,0.5,0.75} unless otherwise specified. This is done since it these
values that are used when fitting the model.

Next, after establishing that the fixed influence of t is significant, the inclusion of t
as a random variable is tested. When comparing the full model, where t is included both
as fixed and random, with the nested model where t is only fixed, the likelihood ratio test
accepts the larger model (χ2 = 549, p =< 1e−16).

5.1.5 The variable d
The variable d represents the parameter used in the Rank Biased Measure to control the
bias towards high ranked entries in the lists. A larger value of d means that a larger number
of entries of the ranked lists are relevant. Although the variable d can take continuous
values in [0,1), it is best thought of as having discrete levels. The corresponding parameter
does not have a linear function in the measure and it is, therefore, confusing to treat the
variable as linear in the Linear Mixed Model. Furthermore, while there might be a linear
relationship between the values of d and the score in the data set used here, this is unlikely
to be true for more extreme values of d.

The left panel of Figure 5.4 shows the mean score over tαγ for each value of d. The
different levels of d have a clear impact on the mean score. The upper left panel also
indicates that there is a slight interaction effect between d and the interpolation, which
shows in that the span between the different levels of d decreases with larger values of
tαγ .

The right panel of Figure 5.4 shows separate boxplots of scores for each tαγ and
d. Apart from the already mentioned difference in means for different values of d, the
spread decreases with d. An exception is for high values of tαγ , where the trend is re-

27


5. Analysis

0.
00

0.
25

0.
50

0.
75

1.
00

0.
0

0.
2

0.
4

0.
6

Interpolation parameter

M
e
a
n

 s
c
o

re

d = 0.8

d = 0.9

d = 0.95

0 0.25 0.5 0.75

0
.0

0

0
.2

5

0
.5

0

0
.7

5

1
.0

0

S
c

o
r
e

d = 0.8 d = 0.9 d = 0.95

Figure 5.4 – The influence of the depth variable d on the score for each value of tαγ . The different
values of d are separated by different colors. Left: mean score over tαγ . Right: boxplots of scores
for each tαγ .

0 0.25 0.5 0.75

0.
0

0.
5

1.
0

1.
5

2.
0

Score

D
e
n

s
it

y

d = 0.8

d = 0.9

d = 0.95

Figure 5.5 – Distribution of scores for values of tαγ and of d. The different values of d are
separated by different colors.

versed. A possible reason for this is that when the ranking is dominated by the original
scoring the emphasis is on a smaller part of the ranked lists lessens the chance for spurious
connections between documents.

In Figure 5.5, the density of the scores are displayed for each tαγ and each d. As
can be seen, there is a large influence of tαγ , with high values leading to density peaks
close to scores at zero and lower values of tαγ being more normally distributed around
the scores at 0.5. The different values of d do however not have any obvious impact aside
from the already mentioned changes in spread and mean.

From the above results, the variable d appears to be important for predicting the
score. That the inclusion of d is significant is supported by the likelihood ratio test (χ2 =
1417, p =< 1e − 16), when comparing two applications of the relevant Linear Mixed
Models with and without d.

The importance of the interaction between d and tαγ is not as clear from the figures.
The likelihood ratio can again be applied by comparing a full model which adds both d
and its interaction with tαγ against a nested model that only adds d. The test either accepts
or rejects the inclusion of the interaction depending on which data set that is used. The
test accepts the full model for the data set used in this first part of the analysis (more
correctly, the test rejects that the interaction is zero, which implies that the full model

28


5. Analysis

should be used). However, when the test is applied to the larger data set used to fit the
final model, it only accepts the inclusion of the interaction if tαγ includes high values
such as {0.9,0.95,1} and otherwise rejects it (χ2 = 2.3, p = 0.3). Since those high values
are not included when fitting the final model, neither is the interaction between d and tαγ .
Dropping the weak interaction between tαγ and d makes the final model easier to interpret
and less likely to overfit.

5.1.6 The variable l
The variable l represents the number of words in the queries used in the simulations. The
upper left panel of Figure 5.6 shows the mean score over tαγ for each value of l, while the
upper right panel shows the corresponding boxplots. As can be seen, neither the means
nor the spread of the score is influenced much by the query length.

0.
00

0.
25

0.
50

0.
75

1.
00

0.
00

0.
25

0.
50

0.
75

1.
00

Interpolation parameter

M
e
a
n

 s
c
o

re

l=3

l=5

l=7

0 0.25 0.5 0.75 0.9 0.975 1

0
.0

0

0
.2

5

0
.5

0

0
.7

5

1
.0

0

S
c

o
r
e

l=3 l=5 l=7

0 0.25 0.5 0.75 0.9 0.975 1

0

1

2

3

Score

D
e
n

s
it

y

l=3

l=5

l=7

Figure 5.6 – The influence of l on the score for each value of tαγ . The different values of l are
separated by different colors. Top left: mean score over tαγ . Top right: boxplots of scores for each
tαγ . Bottom: the distribution of scores for each tαγ .

The lower panel of Figure 5.6, shows the distribution of the scores for tαγ and for l.
As with the mean and spread, l do not have any obvious influence.

Applying the likelihood ratio test leads to rejecting the inclusion of l (χ2 = 1e−
04, p > 0.9) when comparing a relevant Linear Mixed Model with l versus a model with-
out l. The test also rejects adding only an interaction between tαγ and l (χ2 = 0.03, p =
0.9) . Unlike the rejection of the interaction between d and tαγ , the rejection of any in-
clusion of l do not depend on the values of tαγ nor on which of the two data sets that are
used.

29


5. Analysis

5.2 Linear Mixed Model analysis
This section formulates the Linear Mixed Model, tests its assumptions, and applies it to
data generated by the simulations.

5.2.1 Model data set
For the application of the Linear Mixed Model, a new data set was generated with more
observations corresponding to the interpolation tα,γ . The relevant variables for the model
data set are displayed in Table 5.3. Although the variable l is not included in the de-
scription or the model, the length of the queries still varies between three, five, and seven
words, since there was no reason to choose a single length.

Another difference between the exploratory data set used above and the model data
set is that the value of tαγ is there restricted to {0,0.25,0.5,0.75}. The observations
corresponding to high values of tαγ do not fit well with the assumptions behind the Linear
Mixed Model and removing them, therefore, improves the model fit.

Table 5.3 – Variables in the model data set. The variable s represents the scores, q represents the
randomly generated queries, the variable d corresponds to the parameter with the same name used
in the measure, and the main variable, tαγ , represent an interpolation between the original BM25
ranking score and each document’s average vicinity score.

Variables Values Total
s ∈ [0,1] 7152
q ∈ N 596
d {0.8, 0.9, 0.95}
tα,γ {0, 0.25, 0.5, 0.75}

The intraclass correlation coefficient and the effective sample size for the model
data set are shown in Table 5.4. Compared to the exploratory data set, the intraclass
correlation coefficient drops with 0.15. This means that the variance within each query’s
observations accounts for 15 percentage points more in the model data set. This also
shows in that the effective sample size is a much larger fraction of the total number of
observations. The variance reduction comes from the β variable had a larger variance
than both α and γ (see Figure 5.3).

Table 5.4 – The intraclass correlation coefficient and effective sample size for the model data set.

Type Sigma ICC Total
Independent clusters (var. q) 0.03 0.61 596
Residual 0.02 0.38
Sample Size 7152
Effective Sample Size 922

5.2.2 Model formulation
Here the Linear Mixed Model that is used to describe the relationship between the

selected independent variables and the score is formulated.
The dependent variable predicted by the model is the Rank Biased Cluster score. In

the model, this variable is denoted with y, and all the scores relating to the ith query are

30


5. Analysis

denoted with the vector yi. The independent variables are divided into fixed and random
factors, where fixed consists of tαγ and the three different levels of d. Their values for
the ith query are written as the vectors tαγ i, 1i, d0.9i, and d0.95i, with 1i being a vector of
ones corresponding to the default level of d (d = 0.8). This means that the relationships
between the dependent variable and d0.9 or d0.95 capture the influence of changing the
value of d from the default d0.8. An example is that if the default d = 0.8 gives a score at
0.5 and the model estimates the fixed effect of d0.9 to 0.1, the actual estimated value for
d0.9 is 0.5+0.1 = 0.6.

The main random factor in the model is the variable q, representing the randomly
generated queries. There is an interaction between the random queries and tαγ , as shown
in the first part of the analysis. This interaction means that the effect of tαγ on the score
varies around the value of its fixed part according to its random part.

With these considerations in mind, let qi be the ith query, yi be the observations
from the ith query, and xi and zi be the corresponding values of the fixed and random
independent variables. The value of the q and the default level of d is always at one and
written with vectors of ones, i.e. with 1i. For the vector d0.9i, the jth element equals one
if the jth d equals 0.9 and otherwise the element is zero. The values in the vector d0.95i
are analogous. The model is formulated as

yi = xiλ + ziη + ε i,

xi =
[
1i (d0.9)i (d0.95)i (tαγ)i

]
,

zi =
[
1i (tαγ)i

]
,

η ∼ N (0,Σ)
ε i ∼ N (0i,σ2Ii),

(5.1)

where λ , Σ, and σ2 are estimated when fitting the model to the data.

5.2.3 Model assumptions
There are five assumptions behind the Linear Mixed Model, which all can be inferred
from Equation (5.1). They are:

• the observations (scores) for each query are multivariate normally distributed,
• the observations for each query are independent of observations from other queries,
• the means of the multivariate normal distributions have a linear relationship with

the fixed factors (d and tαβ ),
• there is no trend in the residuals (differences between actual observations and values

predicted by the model),
• the error term (variance not explained by the multivariate normal distribution) is

independent of other observations and drawn from a normal distribution with zero
mean and a fixed variance.

According to Wiley and Wiley, there is no good test for the independence between scores
in different queries but the four others can be assessed visually using diagnostic plots [43].
The relevant diagnostic plots are shown in Figure 5.7.

The assumption that the scores come from the expected multivariate normal dis-
tribution and the individual error term is independently and normally distributed can be
assessed together by examining the distribution of the residuals. The residuals represent
the deviation from the predicted mean and should, if the scores are normally distributed

31


5. Analysis

0.
0

0.
5

1.
0

1.
5

2.
0

−0
.5

0

−0
.2

5

0.
00

0.
25

0.
50

0.
75

Residuals

D
e
n

s
it

y

0
.0

0

0
.2

5

0
.5

0

0
.7

5

1
.0

0

0
.3

0
.4

0
.5

0
.6

Predicted value

A
c

tu
a

l 
v
a

lu
e

10
20
30
40
50
60

count

−0
.5

0

−0
.2

5

0.
00

0.
25

0.
50

0.
75

−4 −2 0 2 4
Theoretical

S
a
m

p
le

−0
.5

0

−0
.2

5

0.
00

0.
25

0.
50

0.
3

0.
4

0.
5

0.
6

Predicted value

R
e

s
id

u
a

l

20

40

60

count

Figure 5.7 – Diagnostic plots for the Linear Mixed Model. Top left: the density of the residu-
als, with the orange curve representing a corresponding normal distribution. Top right: the actual
values over the predicted values, with the straight line representing a perfect match. Lower left: a
QQ-plot, contrasting the residual scores against the normal distribution. Lower right: the resid-
uals over the predicted values, with the diagonal lines representing the maximal residual for that
predicted value.

around its random mean, be normally distributed around zero. The upper and lower left
panels of Figure 5.7 show the distribution of the residuals and a corresponding QQ-plot.
As can be seen, the normal distribution is not a perfect fit. However, as the QQ-plot in-
dicates, this is mainly due to there being too few large negative residuals. This might be
an effect of that the density of scores does not approach zero for low scores. The model
assumes that the scores take values from the real line and therefore fails to deal properly
with the lack of negative scores.

The homogeneity of the residuals, i.e. that there is no trend in the residuals, can be
assessed visually in the lower right panel of Figure 5.7, where the residuals are plotted
for the predicted values. The grey diagonal lines in the figure represent the maximum
and minimum possible residuals, given that the scores are restricted to between zero and
one. On average, there is no trend between the predicted values and the residuals. An
exception is a somewhat larger density in the lower-left corner, which likely is due to the
already discussed discrepancy between the normality assumption and that restriction of
the scores within zero and one.

The linear relation between the fixed factors and the mean scores can be assessed
visually in the upper right panel of Figure 5.7, which shows the actual scores over the
corresponding prediction by the model described in Equation (5.2). Although the figure
shows that the scores vary a lot around the linear prediction, most of the data spread out
symmetrically around the linearly increasing mean, represented in the figure by the solid
line. A discrepancy to notice is the density of scores in the lower-left corner. A possible

32


5. Analysis

reason for this might again be that the scores are cut off at zero, limiting the room for
negative variation.

In summary, the assumptions fit the data in general, with the exception probably
caused by the limiting of the score to between zero and one. This problem is discussed
with the reasons for choosing the Linear Mixed Model in Section 3.4 and deemed accept-
able given the aim and constraints of the thesis.

5.2.4 Final Model
The final model is reached by fitting the model specification to the model data using

the R lme4 function [3]. The resulting model is described by

yi =
(

0.481i +0.09(d0.9)i +0.13(d0.95)i −0.24(tαγ)i

)
+
(

η11i +η2(tαγ)i

)
+ ε i,

η ∼ N

([
0
0

]
,

[
0.04 −0.02
−0.02 0.06

])
,

ε i ∼ N (0i,0.006Ii),

(5.2)

where i represents the ith query, the first set of large parenthesis denotes the fixed effects
and the second denotes the random, and ε i is the randomness not explained by the model.

Starting with the variable d, 0.481i means that the default level of d at 0.8 sets
the baseline score at 0.48. If d instead equals 0.9 for the jth observation within the ith
query, the corresponding element (d0.9)i j then equals 1 adding 0.09 to the baseline score.
Similarly, if d equals 0.95, 0.13 is added.

The fixed effect of the variable tαγ is estimated to decrease the score with −0.24
when tαγ is increased with one. This means that the model predicts that if tαγ is changed
from 0.5 to 0.75 the score will on average decrease with −0.24× (0.75−0.5) = 0.06.

−1
.0

−0
.5

0.
0

0.
5

1.
0

−1
.0

−0
.5

0.
0

0.
5

1.
0

q

t_
a
g

Figure 5.8 – Scatterplot of 1000 samples from the distribution of the random coefficient vector η .

The random effects consists of random changes to the mean, captured by η1, and
to the influence of tαγ , captured by η2, for each query. The two random changes are de-
scribed by the distribution of η , detailed in the Equation (5.2), and illustrated in Figure 5.8
where 1000 samples from the distribution are shown. This means that on top of the fixed
influence common to all queries, the scores in each query has an individual addition of
η11i +η2(tαγ)i. As an example, η might for some query take the value [0.15,−0.15].

33


5. Analysis

Then 0.15×1i adds 0.15 to the mean for all scores from that query. When it comes to tαγ ,
its impact will be a combination of the fixed and the random part, so a unit change in tαγ
impacts the score with (−0.24)+(−0.15) =−0.39.

Finally, the error term ε i adds a random error independently and identically gener-
ated for each score with the variance 0.006.

A limitation with the model is that if the random terms and the individual random
error take somewhat extreme values, the model can predict scores outside the interval
[0,1]. This is due to the previously noted discrepancy between the model assumption that
the scores can take any real value and the scores actual interval [0,1].

34


6
Discussion

The main result is that varying the variables in the modified ranking algorithm tangi-
bly and statistically significantly influences the interconnectedness in the ranked lists,
although there is a large variance between different queries.

A related result is that using the vicinity average score, represented by the variable
γ , lead to higher connectedness than the vicinity total, represented by the variable β .
Furthermore, the average also leads to lower variance compared to expansion without
normalisation. A possible explanation of these differences is that articles with a mix of
relevant and irrelevant citation links are penalized when using the average. For queries
where the top documents have similar scores and where there are many connections in the
relevant part of the list, using the vicinity average may lead to that documents only rises
high up in the ranking if they are connected to many top hits. This is likely to happen for
a few documents, but these will also be more connected. On a similar track, for queries
where there are large differences in the score between the top-ranking documents or where
there are few connections in the relevant part of the list, using a total vicinity score might
lead to that all documents related to the top document(s) dominate the reranking. That
being said, it is not certain why this is the case and it would have been interesting to look
further into the differences between the total and average vicinity scores.

The results also show that the depth variable, d, significantly determine how con-
nected the lists are judged to be, with a larger d leading to a higher degree of interconnect-
edness. That there is a significant difference relating to different values of d is expected,
since changing between the three different values of d leads to a large difference in the
size of the influential part of the lists. It is, however, surprising that increasing d corre-
lated with more connections in the lists. A possible reason for this is that the articles in
the top of the ranked lists often belonged to clusters but that the connections were first
revealed a bit further down.

Another also somewhat surprising result was the insignificance of the query speci-
ficity, especially given the significance of the topic. The expectation was that specific
queries would return articles from more specific topics, which would then increase the
connections between high ranking articles, while vague queries would retrieve articles
from a broad range of topics, leading to few connections. The different amount of con-
nections would then, in turn, lead to different impact of the ranking algorithm modifica-
tion. A suspected reason for why the query specificity was not important is a deficiency in
the implementation. The randomly generated query terms are processed with stop word
lists, removing common words like ’and’, ’or’, ’a’ etc., but they do not limit the query
to a single language. While English is the dominating language for the sampled query
terms, with about three out of four articles being in English, other languages such as
Chinese, French, German etc., are relatively common as well. It is therefore likely that

35


6. Discussion

longer queries will include terms from different languages and since the titles of articles
are usually only written in one language, the query simply becomes an integration of two
or more disjoint queries. The result is therefore that a longer query, in the cases where
terms from different languages are included, creates a broader query instead of a more
specific. A possible partial solution would have been to limit the query terms to letters in
the Latin alphabet, which was not implemented due to the time constraints. The problem
would, however, have remained for the different European languages, with no obvious
solution. One attempt would have been to match titles against a list of English stop words
(e.g. ’and’) and only sample from the titles that match.

36


7
Conclusion

This thesis aims to determine if the ranked articles’ interconnectedness can be controlled
by three variables weighting together the default ranking with the ranking according to the
vicinity’s total and average scores. The analysis shows that the variables can influence the
interconnectedness. Furthermore, it shows that it is enough to replace the three variables
with an interpolation between two of the three.

The point with this aim was to showcase an indirect approach to studying ex-
ploratory searching. In this thesis, the focus on how the search engine is affected enabled
developing the Rank Biased Cluster-measure. This, in