Rank-based annotation system for subjective assessments in supervised learning
Application in computed tomography of the lungs

Master's thesis in Computer science and engineering

Herman Bergström
Hanna Tärnåsen

Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
Gothenburg, Sweden 2023

© Herman Bergström, Hanna Tärnåsen, 2023.

Supervisor: Ida Häggström, Department of Electrical Engineering
Supervisor: Mats Lidén, School of Medical Sciences, Örebro
Examiner: Ida Häggström, Department of Electrical Engineering

Master's Thesis 2023
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Illustration of the rank-based annotation system created in this thesis.

Typeset in LaTeX
Gothenburg, Sweden 2023

Abstract

Supervised learning has become a common approach for extracting information from images. To effectively train a model, a large amount of labeled data is required. While some image annotation tasks are objective and well-defined, others require the annotators to make a subjective assessment. The difficulty and subjective nature of these annotation tasks cause the standard rating-based annotation techniques to suffer from inconsistencies between annotators, implying that two different annotators could assign highly differing labels based on their personal biases. The overarching goal of this thesis is to provide an alternative rank-based system for annotating subjective data that can be applied to supervised learning, with the hope of increasing the quality of labels.

The target application for this project is the annotation of the degree of bronchial wall thickening seen in CT scans of the lungs in patients with chronic obstructive pulmonary disease (COPD). Four potential implementations are compared, and consistency, as well as resource demands, are evaluated in several stages. These include imitating the annotation process with simulation, user evaluation with arbitrary subjective assessments, and lastly evaluating bronchial wall thickening with radiologists.

After evaluation, it is observed that the implementation showing the most potential is one based on the TrueSkill algorithm, which employs Bayesian inference and assumes that underlying scores are not definite but instead follow a normal distribution. The findings presented in this thesis indicate a clear increase in inter-annotator agreement for this rank-based system, and the study demonstrates that the indirect approach of evaluating images creates more reliable labels than the direct rating-based method.
Keywords: Approximate Sorting, Noisy Sorting, TrueSkill, Rank-Based Annotation, Subjective Annotations, Supervised Machine Learning

Acknowledgements

We would first and foremost like to thank our supervisor, Mats Lidén, for his continued guidance and engagement. This thesis would not be possible without the expertise you offered in the field of medical imaging, and the efforts you made to gather annotated data. We look forward to working with you in the future.

Further, we would also like to thank our examiner and Chalmers supervisor Ida Häggström. We greatly appreciate the feedback you offered during the initial stages of this thesis, as well as the opportunity to reach out to you with any and all questions.

Finally, we would like to offer our sincerest gratitude to all the people who volunteered to annotate data for this project. This includes the 13 friends and acquaintances who offered their time during the initial user evaluation. It also includes the four experts at Örebro University Hospital who provided the annotations of bronchial wall thickening, essential for this thesis. The data each of you helped us gather has been invaluable when attempting to achieve the goals of this project.

Herman Bergström and Hanna Tärnåsen, Gothenburg, June 2023

Contents

List of Figures
List of Tables
1 Introduction
  1.1 Chronic Obstructive Pulmonary Disease
  1.2 Objective and Method
  1.3 Contributions
  1.4 Thesis Outline
2 Theory
  2.1 Time Complexity
  2.2 Merge Sort
  2.3 Noisy Sorting
  2.4 Bayesian Inference
  2.5 Gaussian Distribution
  2.6 TrueSkill
  2.7 Borda Count
  2.8 Confidence Bounds in Multi-Armed Bandits
  2.9 Hamming LUCB
  2.10 Evaluation Methods
    2.10.1 Krippendorff's Alpha
    2.10.2 Spearman's Footrule Distance
3 Method
  3.1 Implementing Sorting Algorithms
    3.1.1 Merge Sort
    3.1.2 Borda Count
    3.1.3 Hamming LUCB
    3.1.4 TrueSkill
  3.2 Evaluation Structure
4 Evaluation of Simulation
  4.1 Experimental Setup
  4.2 Results
  4.3 Conclusion
5 Evaluation of Arbitrary Subjective Annotation
  5.1 Experimental Setup
    5.1.1 Group Testing Setup
    5.1.2 Individual Testing Setup
  5.2 Results
    5.2.1 Group Testing Results
    5.2.2 Individual Testing Results
  5.3 Conclusion
6 Evaluation of Bronchial Wall Thickening Annotation
  6.1 Experimental Setup
  6.2 Results
    6.2.1 Rating-Based Annotation Quality
    6.2.2 Rank-Based Annotation Quality
    6.2.3 Resource Requirements
  6.3 Conclusion
7 Conclusion
  7.1 Discussion
  7.2 Future Work
Bibliography
A Appendix 1 - Graphical User Interface of Application

List of Figures

1.1 CT image demonstrations of phenotypes prominent in COPD.
2.1 In this image a list with the elements 2, 3, 4, and 1 has been divided into sublists only containing one element. Thereafter the first two sublists are compared with the conclusion that 2 < 3, merging them to the sublist {2, 3}. Similarly, {4} and {1} are merged to the sublist {1, 4}.
2.2 This image visualizes the process after the creation of the two sorted subsets created in Figure 2.1. It shows step by step how the lists [2, 3] and [1, 4] merge into the final sorted list [1, 2, 3, 4].
3.1 A visualization of how sorting the elements with the values [1, 2, 3, 4] could be performed with Borda Count.
3.2 A visualization of how sorting the elements with integer values 1 through 10 could be performed with Hamming LUCB.
3.3 A visualization of how sorting the list [1, 2, 3, 4] could be performed with TrueSkill.
3.4 Visualization of the evaluation steps that will be made, and how algorithms will be excluded from further testing.
4.1 A visualization of how the probability of a correct comparison is affected by different comparison clearness. This assumes that the list only contains integers between 0 and 9 (i.e. m = 9).
4.2 The normalized Spearman's footrule distance, with different comparison clearness values, between the current estimation of the algorithm over 1000 comparisons, averaged over 10 runs.
5.1 An example screenshot of the user interface when sorting by pairwise comparisons.
5.2 An example screenshot of the user interface when ordering a shorter list.
5.3 The average normalized Spearman's footrule distance between the three orderings produced by each algorithm.
5.4 Moving average (with a window size of 20) of the RMSE of the estimated TrueSkill scores of the current iteration, compared to the one before. The dashed lines indicate a change in user annotating.
5.5 Similarity of orderings produced by different users on separate sets of images, where the y-axis represents the normalized Spearman's footrule distance.
6.1 An example screenshot of the user interface when making pairwise comparisons of CT images.
6.2 An example screenshot of the user interface when rating the degree of bronchial wall thickening in a single image.
6.3 Slice selection of a CT scan. Slices highlighted in green were extracted to be annotated.
6.4 The inter-annotator agreement when distributing the four labels provided during rating according to the average observed distribution after every comparison. The dashed line represents the agreement achieved by the rating-based approach.
6.5 The inter-annotator agreement over the averaged time.
A.1 The initial view when introduced to the application. A mandatory pop-up urging the user to state who will be annotating the current session.
A.2 The main page view essentially provides some overview of the application. On the left side of the screen are all available algorithms as well as the active algorithm, the number of images in the set, and the total number of annotations made. In the top-right corner the currently annotating user is presented, and below are some instructions for the user.
A.3 The fields the user is offered when creating a new ranking instance. When Scrolling is enabled, the user can load nifti files containing multiple slices and scroll between them.
A.4 There are three algorithms to choose from, the rank-based algorithms TrueSkill and Merge Sort, as well as the option to rate images. The comparison size field differs depending on the selected algorithm.
A.5 The view when annotating pairwise comparisons. The user provides order and difference level by using one of the five buttons. Here it is also visible to the user that it is possible to undo the last annotation.
A.6 The interface when ordering a list of three images. The annotator orders the images from the least to the most visible bronchial wall thickening, left to right. Difference levels are assigned by using the radio buttons between two images.

List of Tables

5.1 The distribution of users across groups. Two different datasets of 100 images were used for the different steps.
5.2 The distribution of images and how they were sorted by participants.
5.3 The distribution of differences supplied by annotators when ranking the apparent age of people in images using the TrueSkill system.
5.4 The average time per annotation of users when making pairwise comparisons, as well as when ordering shorter lists.
6.1 The distribution of differences supplied by annotators when annotating bronchial wall thickening using the rank-based system.
1 Introduction

Since the rise of convolutional neural networks (CNN), the approach of supervised learning has become increasingly popular for solving image analysis tasks [1]. Common applications include image classification, object detection, and segmentation. However, in order to train a CNN, a large amount of labeled data is required. The process of creating labeled data is referred to as annotation, and the exact nature of the annotation task can vary depending on the target application. Many of these tasks have an objective ground truth and do not entail much cognitive workload, such as drawing bounding boxes around cars or stating whether an image contains a human or not. In contrast, some annotation tasks instead have participants make a subjective assessment. Such assessments could for example be establishing the quality of a video [2] or determining the emotional state of a person [3].

An issue that can be encountered when applying the standard rating-based method to these tasks, where the annotator simply assigns a number or label to an image, is that results vary heavily depending on the annotator. It not only becomes an issue of how the annotator perceives the image but also of how they perceive the ratings or labels of the annotation task. Furthermore, assessments do not necessarily have an underlying ground truth, implying that results will vary regardless of the competence of annotators. A proposed solution that has shown promise is to instead use a rank-based approach [4], [5]. This involves creating an ordering of the images by comparing them, rather than assigning labels directly.

Although a rank-based system has the potential to increase annotator consistency, it also demands a different amount of resources compared to a rating-based one. As the set of data points that are to be ordered increases in size, the number of possible pairwise comparisons increases quadratically. Consequently, if annotation resources are limited, exhaustive pairwise comparisons will quickly become infeasible. One field where this is the case is medical imaging, where annotation requires trained radiologists. While there have been studies within this field to limit the number of comparisons required to rank Computed Tomography (CT) images [6], these have mostly concerned individual radiologists sorting a list of images by themselves. As the comparisons are subjective, the opinions of radiologists will differ, inducing noise into the annotations.

The objective of this thesis is to construct a novel rank-based method for subjective image annotation which can manage noisy comparisons. The system should be able to account for the opinions of multiple annotators while handling disagreements. With a target application in medical imaging, the desired system should be able to create a reasonable ordering of elements given only a sample of noisy comparisons. Further, the system should allow radiologists to produce more consistent labels than a rating-based system would.

1.1 Chronic Obstructive Pulmonary Disease

Chronic Obstructive Pulmonary Disease (COPD) is a respiratory disease generally caused by smoking [7]. Typical symptoms are shortness of breath and cough. According to the World Health Organization, it is the third leading cause of death worldwide [8]. COPD is diagnosed with a pulmonary function test (spirometry) revealing chronic airway obstruction [9], [10]. The disorder causes a variable combination of emphysema (i.e.
destruction of the alveoli) and small airway disease, and usually, the larger airways are also affected. The relative amount of emphysema and airway involvement in a patient defines different COPD subtypes - phenotypes - that cannot readily be distinguished with the pulmonary function test. Computed tomography is the best non-invasive method for visualizing the lungs, making a visual assessment of emphysema and bronchial involvement in COPD possible [11]. Emphysema causes dark cystic air spaces in CT images of the lungs, an example of which can be seen in Figure 1.1a. Affected airways can be visible as a thickening of the bronchial walls, highlighted in Figure 1.1b.

(a) An example of the emphysema phenotype. The arrow highlights the destroyed lung tissue.
(b) An example of bronchial wall thickening. The arrow highlights the swollen wall.
Figure 1.1: CT image demonstrations of phenotypes prominent in COPD.

While spirometry is used for diagnosing COPD, quantifying the degree of bronchial wall thickening and emphysema in CT images can differentiate COPD phenotypes [12]. Phenotyping is important in research and clinical medicine, for example when investigating therapies directed towards one of the disease components. Furthermore, visual and quantitative CT evaluation has been shown to add prognostic information in COPD, even when spirometry data is present [13].

One project that aims to predict and prevent pulmonary disease is the Swedish cardiopulmonary bioimage study (SCAPIS) [14], [15]. The SCAPIS study is a Swedish population-based cohort for research in Cardiovascular Disease and COPD. A total of 30,000 randomly chosen Swedish individuals between the ages of 50 and 64 have received thorough medical examinations as part of the study, including blood tests, CT imaging, and spirometry. The dataset provides a large number of CT scans of the lungs in individuals of varying health, including people suffering from COPD where the aforementioned phenotypes can be observed. Quantifying the degree of bronchial wall thickening among subjects in SCAPIS could help lead to insights and further understanding of the pathology of COPD.

Nevertheless, extracting reliable airway measurements from CT scan slices has proven to be a difficult task, with most research having been focused on using classical image analysis methods, such as segmentation, to measure the airway wall [16], [17]. This makes it difficult to create an automated process for quantifying the degree of bronchial wall thickening visible in an image. One of the main issues is that the size of small airways is close to the resolution of most CT scans, meaning that they are only barely measurable [18]. Despite this, radiologists are able to get an impression and make assessments when looking at these images. In an attempt to mimic this, an interesting new approach would be to instead train a CNN to predict the degree of bronchial wall thickening, as a number in an arbitrary range. However, as this method is a case of supervised machine learning, it consequently requires labeled data.

1.2 Objective and Method

The primary objective of this thesis is to implement and evaluate a rank-based annotation system where images are ordered through comparisons. The long-term goal is to create a system that would ultimately enable more consistent and reliable annotation of subjective data. The target application, used to evaluate the performance of the system, is the annotation of bronchial wall thickening visible in CT images.
However, the developed system should:

• Allow for any type of comparison where a subjective assessment is made.
• Produce an approximate ordering of the elements given a sample of potentially noisy comparisons.
• Support groups of annotators ordering data together by managing disagreements.

The approach taken has been to look into the area of sorting algorithms in order to compare various solutions and their applicability, and to improve on those that were recognized as relevant. This will be covered in more detail in Section 3.1.

1.3 Contributions

This thesis work incorporates the following specific advancements in the field:

• Experimental evaluation of the performance of different sorting algorithms in the case of noisy comparisons.
• A novel method for subjective annotation of data.
• Empirical evidence that annotation via pairwise comparison can increase consistency and inter-annotator agreement compared to a rating-based system when annotating bronchial wall thickening.

1.4 Thesis Outline

Chapter 2 will present the theoretical background for understanding the thesis. It will cover basic concepts such as sorting regular and noisy data. It will also explain some probabilistic topics as well as offer some more in-depth explanations of parts relevant to the implemented systems. Furthermore, evaluation methods, including inter-annotator agreement and Spearman's footrule distance, will also be discussed.

Chapter 3 will detail the methodology used when implementing the sorting algorithms, including Merge Sort, Borda Count, Hamming LUCB, and TrueSkill, and the overall evaluation structure used in this project.

Chapter 4 will present the results of the simulation experiment, including the outcomes for the various sorting algorithms. The chapter will also provide a conclusion that summarizes the findings, which will then be further evaluated in the next chapter.

Chapter 5 will detail the evaluation of the system on arbitrary subjective annotations. The results will include group and individual testing, and their setup will be explained. The chapter will conclude with information that can be passed down to the next evaluation step.

Chapter 6 will present the final evaluation of the thesis, the annotation of bronchial wall thickening in CT scans. Here the results and conclusions from chapters four and five will be used for the final assessment and testing.

Finally, Chapter 7 will conclude with a summary of the main findings and their implications, limitations of the study, and suggestions for future research. This conclusion will tie together the various sections of the thesis and provide a comprehensive overview of the study.

2 Theory

This chapter covers the technical background essential to comprehend the key concepts presented in this thesis. We start by defining time complexity and addressing how it can be used to estimate the efficiency of generic algorithms. Thereafter, the theoretical sorting of a list with merge sort is described using a more concrete example. The idea of noisy sorting, or sorting data with noise, is then introduced. Following that, a probabilistic perspective is thoroughly explored, beginning with an explanation of Bayesian inference and proceeding with the Gaussian distribution. To better understand some of the implementations in this thesis, the two major topics TrueSkill and confidence bounds in multi-armed bandits are also covered. Finally, this chapter will discuss various evaluation techniques that will be used in the report.
2.1 Time Complexity

The efficiency of sorting algorithms can be measured by their time complexity, which is defined as the number of operations required to sort a given set of data. Time complexity is often expressed in terms of the input size n and is used to estimate how the algorithm's performance would scale with the increase or decrease in the input data [19]. The big O notation is a commonly used metric in computer science to describe the time and space complexity of an algorithm. It is written as O(f(n)) and provides a way to categorize an algorithm's growth rate as the input size increases.

There exist some primary types of complexities that are used to evaluate the efficiency of algorithms based on their time complexity. These include constant time complexity (O(1)), logarithmic time complexity (O(log n)), linear time complexity (O(n)), linearithmic time complexity (O(n log n)), quadratic time complexity (O(n²)), exponential time complexity (O(2ⁿ)), and factorial time complexity (O(n!)) [20]. The complexities can be ordered from the best to the worst, with constant time complexity being the most efficient and factorial time complexity being the least efficient (1 ≪ log n ≪ n ≪ n log n ≪ n² ≪ 2ⁿ ≪ n!) [19].

2.2 Merge Sort

Merge Sort is a well-known sorting algorithm with the expected time complexity of O(n log n) [19]. As the name suggests, the core of the algorithm is centered around merging already sorted sublists. This is done by first splitting the list into sublists only containing one element. The sublists are then merged two and two until completion, where elements are put in the correct order for each merge. The process of merging two lists can be seen in Algorithm 1. The number of comparisons this requires is reduced as a consequence of the sublists themselves already being sorted.

Algorithm 1 General process for merging two sorted lists
function MergeTwoSortedArrays(leftSubArray, rightSubArray)
    i ← 0, j ← 0
    merged ← initialize an empty list
    while i < len(leftSubArray) and j < len(rightSubArray) do
        if leftSubArray[i] ≤ rightSubArray[j] then
            Append leftSubArray[i] to merged
            i ← i + 1
        else
            Append rightSubArray[j] to merged
            j ← j + 1
        end if
    end while
    Append the remaining elements of leftSubArray to merged
    Append the remaining elements of rightSubArray to merged
    return merged
end function

To further grasp the concept, the process is exemplified in Figure 2.1 and Figure 2.2.

Figure 2.1: In this image a list with the elements 2, 3, 4, and 1 has been divided into sublists only containing one element. Thereafter the first two sublists are compared with the conclusion that 2 < 3, merging them to the sublist {2, 3}. Similarly, {4} and {1} are merged to the sublist {1, 4}.

(a) Merging setup with the two first sublists that are to be merged, where in this case these are the only two sublists available: [2, 3] and [1, 4].
(b) Comparing the first two elements in the sublists, 2 and 1. Since 1 < 2, it can be deduced that 1 will be the smallest element. It is subsequently moved from the sublist to the sorted merged list.
(c) Comparing the first two elements in the current sublists, 2 and 4. Concludes in 2 entering the merged list as the second smallest element.
(d) Comparing again the first two elements in the current sublists, 3 and 4. Concludes in 3 entering the merged list as the third smallest element.
(e) With elements only remaining in one of the sublists, the rest of this list can simply be appended, subsequently completing the merging.
Figure 2.2: This image visualizes the process after the creation of the two sorted subsets created in Figure 2.1. It shows step by step how the lists [2, 3] and [1, 4] merge into the final sorted list [1, 2, 3, 4].

While merge sort is not constructed to manage noisy comparisons but rather assumes deterministic data, it still makes for an interesting benchmark as it does have the goal of minimizing the number of comparisons required.

2.3 Noisy Sorting

In contrast to the deterministic sorting setting, noisy sorting assumes that the outcome of a comparison will vary according to some probability. For example, in the model introduced by Braverman et al. [21], there is an unknown true order, and the probability of a correct comparison is 1/2 + λ, where λ > 0 is a constant. As a proposed solution, the authors apply maximum likelihood estimation to estimate an ordering of the elements given observed noisy comparisons. However, there are additional approaches to address the noisy sorting problem. One such approach is to view the issue as a multi-armed bandit problem [22], which refers to a hypothetical scenario in which an individual chooses between actions that have different rewards. In contrast, another approach is attempting to minimize a sorting cost function, such as that produced by a probabilistic interpretation of the least squares function [23].

While there are potentially many different attributes the same set of data can be ordered according to, the problem can be generalized. Common to all orderings is that the data is being assessed based on some quality. To simplify the explanations in this thesis, if a comparison between two elements i and j determines that i is of greater quality (e.g. the comparison implies i > j), this thesis will refer to this as element i winning a comparison against element j.

2.4 Bayesian Inference

Bayesian inference is a framework that revolves around updating the beliefs about a distribution according to some observed data [24]. Let y ∈ Y be the observed data, and θ ∈ Θ be the parameters that describe a likelihood function P(y | θ). Here, both θ and y could potentially be multi-dimensional. After having made observations of y, and given a prior distribution P(θ), the distribution of θ can be updated using Bayes' rule, which states that

P(θ | y) = P(y | θ)P(θ) / P(y) = P(y | θ)P(θ) / ∫_Θ P(y | θ)P(θ) dθ.

In this case, P(θ | y) is referred to as the posterior distribution. As the marginal distribution P(y) can become very costly to calculate, it is common to utilize the notion of conjugate priors. That is, if the posterior P(θ | y) and the prior P(θ) are in the same distribution family, then the posterior can be calculated using a closed-form expression, avoiding integration. It is then said that P(θ) is a conjugate prior to P(y | θ). For example, let the likelihood function P(y | θ) follow a binomial distribution such that y | θ ∼ Bin(n, θ). Further, assume that the parameter θ follows a beta distribution with parameters α and β, that is θ ∼ Beta(α, β). After observing a sample y, the posterior will also follow a beta distribution. More specifically, θ | y ∼ Beta(α + y, β + n − y). For more information regarding conjugate families, we refer to the book by Johnson et al. [25].

2.5 Gaussian Distribution

The Gaussian distribution (often called the normal distribution) is a prevalent continuous distribution family [26]. It is defined by two parameters: the mean µ and the standard deviation σ.
For simplicity, the distribution is most commonly denoted N(µ, σ²), and the probability density function is defined as

pdf(x) = (1 / (σ√(2π))) · e^(−(1/2)((x − µ)/σ)²).

With a bell-shaped curve, the distribution is symmetric around its mean. This makes it commonly used for the estimation of values where the positive deviation should be about the same as the negative deviation.

2.6 TrueSkill

The TrueSkill algorithm is a skill rating system originally developed by Microsoft for Xbox Live online matchmaking [27]. The algorithm is often seen as a generalization of the Elo system, which was originally introduced to rank chess players [28]. However, TrueSkill also utilizes Bayesian inference and factor graphs to estimate the underlying skills of players, while also keeping track of the uncertainty of its estimations. The main idea is that the skill s of a player follows a Gaussian distribution N(µ, σ²), modeling the uncertainty of the true skill of a player. As matches are played, the current expected skill µ, as well as the confidence σ, will be updated.

For a match with n players, let s := (s1, ..., sn) be the vector containing the skills of each player. Further, team assignments are defined by k non-overlapping subsets Aj ⊂ {1, ..., n}, and A denotes the set of all teams {A1, ..., Ak}. Subsequently, the vector r := (r1, ..., rk) represents the outcome (rank) of each team. The posterior update of a game is given by the equation

P(s | r, A) = P(r | s, A)P(s) / P(r | A).

For the details on how this posterior is calculated, we refer to the original paper by Herbrich et al. [27]. Apart from updating the estimation of a player's skill level, the system also predicts the quality of a match, given the players on each team, defined as the probability of a draw.

While originally designed for video game matchmaking, the TrueSkill algorithm can be applied to any system where comparisons between two or more items are being made, and the outcomes of the comparisons are inconsistent (e.g. noisy). For example, one can imagine that instead of comparing players, we compare CT scans of bronchial wall thickening. Furthermore, instead of attempting to estimate an underlying skill of a player, one can attempt to estimate the degree of bronchial wall thickening present in an image. The problem can be reformulated to fit any annotation task where a subjective assessment is made.

2.7 Borda Count

Borda Count is a simple, yet robust, noise-resistant sorting algorithm [29]. The idea is that every element in a list has an associated score. After a pairwise comparison between two elements, the scores of both are updated. The element that wins the comparison has its score increased by 1, and conversely, the element that loses has its score decreased by 1. Which two elements to compare is chosen at random,
To do this, the algorithms keep track of the currently estimated expected reward µ̄(a) of the arms, as well as how confident they are in that estimation. To represent this, confidence intervals, a notion common in statistics [31], is used. More specifically, a confidence radius r(a) is introduced for each arm. The main idea is that the probability that µ̄(a) and µ(a) differ by more than r(a) is small. More formally, P (|µ̄(a)− µ(a)| ≤ r(a)) ≥ 1− λ, where λ is some small constant derived from the definition of r. Following this, an Upper Confidence Bound (UCB), as well as a Lower Confidence Bound (LCB), can be defined as UCB(a) = µ̄(a) + r(a) and LCB(a) = µ̄(a)− r(a) respectively. The interval [LCB(a), UCB(a)] is referred to as the confidence in- terval. Within the multi-armed bandits’ field, algorithms can utilize these bounds when deciding which arm to pull next. For example, given two arms a and a′, if UCB(a) < LCB(a′), then it can with high probability be assumed that µ(a) < µ(a′). One algorithm of interest for this project which utilizes confidence intervals is the Hamming LUCB sorting algorithm. 2.9 Hamming LUCB The Hamming LUCB algorithm is similar to Borda Count, but it utilizes a few parameters to determine the next comparison. The algorithm was initially developed by the authors Heckel et al. for an approximate ranking task [22]. The primary parameter that differentiates Hamming LUCB from Borda Count is the number of subsets that the list is split into, as well as their respective sizes. Rather than estimating the exact ordering of elements, the algorithm aims to split the list into n subsets in such a way that we can be confident enough that the elements in subset i are larger than those in subset i− 1. If n is equal to the size of the list, this gives the exact ordering. 10 2. Theory In addition to keeping track of an estimated score, each element also has an associ- ated confidence radius. By ensuring that the lower and upper confidence bounds of two elements chosen from two neighboring subsets do not overlap, the approximate sorting is produced. The permitted error margin between the subgroups is the sec- ond parameter. This is necessary because it would require numerous comparisons before the confidence intervals of two adjacent elements stopped overlapping. In- stead, the number of necessary comparisons can be drastically decreased by adding a gap between the subsets and making sure they dont overlap. The elements within the error margin may, however, not be correctly assigned to the subset according to this method. 2.10 Evaluation Methods The following sections offer background on the evaluation methods that are utilized in this thesis. 2.10.1 Krippendorff’s Alpha A measure of the degree of agreement or consensus among annotators who have independently annotated the same data is called inter-annotator agreement (IAA) [32] (also referred to under various similar names such as inter-rater reliability, inter- observer reliability etc.). The degree of the annotators’ agreement beyond chance is often expressed as a statistical measure. These measurements adjust for the possibility of random agreement and take into consideration both the annotators’ agreement and disagreement. A high level of IAA is favored since it shows that the annotations are dependable and consistent, which boosts the trust in the data and the models that are based on it. In contrast, a low level of IAA indicates inconsistent or ambiguous annotations. 
Krippendorff's Alpha [33] is one statistical measurement that uses a variety of techniques to produce an IAA measurement. The fundamental form of Krippendorff's alpha is

α = 1 − Do/De,

where Do denotes the observed difference and De is the difference anticipated by chance. The outcome α equating to 0 would essentially be the result of the observed annotations being indistinguishable from annotations selected at random. In contrast, an α equal to 1 would be indicative of full agreement among annotators and subsequently imply perfect reliability. There is no definite distinction as to what alpha could be considered reliable. However, alphas above 0.8 are generally thought to be entirely reliable, while alphas above 0.667 are commonly considered acceptable to make educated assumptions [33]. In essence, depending on the circumstances, each situation needs to be interpreted differently.

While there are many other measurements of IAA, such as Fleiss' Kappa [34] and Cohen's Kappa [35], we decided to use Krippendorff's Alpha due to its flexible nature, as it can handle more than two annotators, missing values, as well as many different types of labels (including ordinal). Although this is not a standardized metric in the medical field, it has been utilized in some recent literature including the studies by Vikgren et al. [36] and Lidén et al. [37].

To measure the discrepancy between values assigned to data points by different annotators, numerous so-called difference functions can be used. These essentially quantify the difference between two potential value assignments, c and k. The functions are used to define Do and De as

Do = (1/n) Σ_c Σ_k o_ck δ²_ck  and  De = (1/(n(n − 1))) Σ_c Σ_k n_c n_k δ²_ck,

where o_ck, n, n_c, and n_k are the frequencies in the respective coincidence matrix, which keeps track of agreements and disagreements amongst annotators, and δ²_ck represents the selected appropriate difference function [38].

The simplest difference function implementation is the nominal metric approach. An observed pair of annotations is essentially assessed as either the same or different. This means that if two annotators are in disagreement, the difference is the same regardless of what labels they provided. The formula for the nominal difference function is

δ²_ck = 0 if c = k, and δ²_ck = 1 if c ≠ k.

Another function more suitable for annotations where there is an underlying ordering is the ordinal difference function. This approach takes into account both the frequencies in the ranks as well as the deviations in annotators' ranks. The formula for Krippendorff's alpha ordinal metric differences is

δ²_ck = ( Σ_{g=c}^{k} n_g − (n_c + n_k)/2 )²,

where c and k are different ranks. For more information regarding the available difference functions we refer to the paper by K. Krippendorff [38].

2.10.2 Spearman's Footrule Distance

Spearman's footrule distance is a method of estimating how much one list differs from another through the sum of the absolute differences between the indexes of each element in the two lists [39]. Thereafter it is possible to divide the sum by the overall number of list entries to obtain the normalized distance. The Spearman's footrule distance essentially calculates the degree to which the entries in two ranked lists differ from one another. Because it is non-parametric, it can be used to compare lists of any size without making any assumptions about how the data are distributed in general.
The distance is more formally defined as

D(π, σ) = Σ_{i=1}^{n} |π(i) − σ(i)|,

where π(i) and σ(i) represent the rank of element i in the lists π and σ respectively.

3 Method

This chapter presents the comprehensive methodology adopted in this thesis, which is applicable to all the evaluation segments. The evaluation segments encompass the simulation evaluation, arbitrary subjective annotation evaluation, and bronchial wall thickening annotation evaluation. The methodology details the implementation of sorting algorithms, including the necessary components every algorithm should incorporate. It also provides an in-depth description of the individual sorting algorithms developed. Furthermore, the chapter outlines the various steps that were taken when evaluating the algorithms, providing a holistic view of the project's structure.

3.1 Implementing Sorting Algorithms

To achieve the primary objective of this thesis, it is imperative to devise a system that can be used with simulated annotations, as well as with real annotations provided through a graphical user interface (GUI). Furthermore, it was of interest to explore the possibility for users to compare more than two elements at a time. This was explored as it could potentially increase annotation speed, and subsequently reduce the resources required. As such, an essential attribute of the algorithms was that annotations could not only be submitted in a pairwise manner but also as a list of data points that had been ordered. To ensure this, an interface that could be used across different algorithms was designed.

The first step was to identify the essential components that every sorting algorithm should possess. The main components identified were the ability to retrieve the next desired comparison, and the ability to infer the current sorting of the data based on the results of that comparison. The algorithms were also required to hold the data, which was represented with corresponding keys that are unique for all data points. To facilitate the development of these algorithms, an abstract class was created to encapsulate the fundamental components required for each algorithm, including the sorting mechanism and the necessary data. The class included two primary methods that every instance of the algorithm should implement: get_comparison() and inference().

The former method, get_comparison(user_id), returns the list keys, which effectively contains the next elements the algorithm suggests should be compared. The number of elements returned directly reflects the size of comparisons. For example, two elements indicate a pairwise comparison, while four elements imply that the user is supposed to order this list of four elements. The size is determined through a parameter that is provided upon initialization of the sorting algorithm. The user who is presently annotating is passed as a parameter since in some circumstances it is considered relevant to offer comparisons on an individual basis.

The second function, inference(user_id, keys, diff_lvls), entails appropriately processing the user's annotations and incorporating them into the corresponding algorithm. The user's response is represented as the lists sorted_keys and diff_lvls, with the latter being a representation of the level of difference between the elements. Furthermore, there are three more required functions: get_result(), is_finished(), and comparison_is_available().
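As an illustration, a minimal sketch of what this abstract class could look like is given below, assuming a Python implementation. The method names follow the text, but the exact signatures in the thesis codebase may differ.

from abc import ABC, abstractmethod

class SortingAlgorithm(ABC):
    """Sketch of the shared interface for the rank-based sorting algorithms."""

    def __init__(self, data):
        # data: mapping from unique keys to the data points being sorted
        self.data = data

    @abstractmethod
    def get_comparison(self, user_id):
        """Return the keys of the elements this user should compare next."""

    @abstractmethod
    def inference(self, user_id, sorted_keys, diff_lvls):
        """Incorporate an ordered response and its perceived difference levels."""

    @abstractmethod
    def get_result(self):
        """Return the current estimated ordering of all keys."""

    @abstractmethod
    def is_finished(self):
        """Return True once the sorting can be considered finished."""

    @abstractmethod
    def comparison_is_available(self, user_id):
        """Return True if a comparison can currently be offered to this user."""

Each concrete algorithm (Merge Sort, Borda Count, Hamming LUCB, and TrueSkill) would then subclass this interface.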
In essence, get_result() returns the current sorted list, is_finished() checks if the sorting can be considered finished, and comparison_is_available() checks on an individual basis if there are any possible comparisons to be made. The following pseudocode elucidates this process.

Algorithm 2 General process for ordering data with a sorting algorithm
Initiate sort_alg with data
while not sort_alg.is_finished() do
    keys ← sort_alg.get_comparison(user_id)
    sorted_keys, diff_lvls ← response from user when given keys
    sort_alg.inference(user_id, sorted_keys, diff_lvls)
end while

The sorted_keys provided by the user represent a sorted list that indicates the order they believe to be correct. Additionally, the diff_lvls list represents the perceived difference level between adjacent elements in the list. Since there are n sorted_keys in the list, there are n − 1 perceived differences in the diff_lvls list. For instance, if sorted_keys = [x1, x2, x3], then diff_lvls = [difference between x1 & x2, difference between x2 & x3].

Representing the difference levels involves using an enum with three values:

• None, corresponding to no perceived difference.
• Normal, corresponding to a perceived difference.
• Large, corresponding to a large perceived difference.

These values indicate the extent of difference between adjacent elements in the list. We decided upon these levels as we reasoned that there would be scenarios where a user thinks that there is no difference between two elements, as well as scenarios where the user thinks the difference is very clear. By representing the difference levels in this way, more information can be gained from the user's sorting preferences.

When sorting lists with more than two elements, the ordered response must be translated into information the algorithms can process. To do this, all n(n − 1)/2 implied pairwise comparisons are extracted from the ordered list, with each being treated as a pairwise comparison. The maximum difference level between the two compared data points can then be determined and stated as their difference level. For example, let us consider a list of sorted elements [x1, x2, x3, x4] and their corresponding difference levels [None, Large, Normal]. When sorting this list, there are actually 4 · 3/2 = 6 pairwise comparisons. By adopting the largest difference level between the two compared data points for each pairwise comparison, the result becomes (x1, x2): None, (x1, x3): Large, (x1, x4): Large, (x2, x3): Large, (x2, x4): Large, and (x3, x4): Normal. As such, in the individual algorithms, it is possible to represent the original sorted list as these multiple pairwise comparisons.

The following sections will cover the algorithms which have been implemented following this interface. When choosing which algorithms to evaluate in the project, we have attempted to take a multitude of approaches, utilizing different fields of theory, into account. The four algorithms which have been implemented are Merge Sort, Borda Count, Hamming LUCB, and TrueSkill. Firstly, as Merge Sort is one of the more popular algorithms used in the deterministic sorting setting, we wanted to include it as a benchmark algorithm which does not take noise into consideration. Secondly, Borda Count was implemented due to its combined simplicity and robustness, making it an excellent benchmark for noisy sorting algorithms. Thirdly, Hamming LUCB was of interest as it took inspiration from the multi-armed bandits literature while also offering termination criteria for when the list is approximately sorted.
Lastly, the TrueSkill algorithm was integrated as it is able to choose the next comparison in a more dynamic fashion than the aforementioned algorithms, while utilizing the Bayesian inference framework.

3.1.1 Merge Sort

The Merge Sort algorithm follows a simple and intuitive approach. The developed algorithm instantiates by splitting the data into sub-lists, each containing one element. These sub-lists are then merged two at a time, with each element being appended in the correct order, eventually resulting in the complete list being sorted. For instance, let us consider an unsorted list of elements [x1, x2, ..., xn−1, xn]. When Merge Sort is instantiated with regard to this data, it will create a list of sub-lists [[x1], [x2], ..., [xn−1], [xn]].

The get_comparison() function is straightforward, as it returns the first key in each of the first two sub-lists. However, the inference step, inference(), of this algorithm can be quite complex, as it requires determining which elements to merge and in what order. When merging two sub-lists consisting of only one element each, the merged list is quite clear. However, when the sub-lists are larger it is not quite as intuitive. Essentially, the chosen approach is to compare the first elements of the two sub-lists to determine the smallest remaining element across both subsets. After having determined the smallest element, it is simply moved to the end of the new list, and the process is repeated until both of the sub-lists are emptied.

Unfortunately, Merge Sort has a significant flaw that has yet to be addressed: it can only manage comparisons of exactly two elements at a time. As such, the algorithm will in practice only be used with pairwise comparisons.

3.1.2 Borda Count

The Borda Count algorithm takes a different approach compared to the Merge Sort algorithm. The Borda Count method includes assessing the plausibility of a win against a random element rather than assuming full certainty. At the start of the algorithm, a list of data points is created with corresponding scores, initially set to 0. The primary objective is to generate random comparisons that update the corresponding score for each data point. A win increases the score, while a loss decreases it. The final ordering of the list is achieved by sorting the elements based on their acquired scores.

For the algorithmic details, the selection of which elements to compare is arbitrary, and in our implementation, it is done by randomly pairing the elements in the list. This ensures that all data points are compared equally, as each element occurs once in the list of planned comparisons (except for a few elements in the cases where the size of the data is not divisible by the comparison size). When the algorithm has compared all paired data points, it randomly pairs elements once more.

For inference, all n(n − 1)/2 implied pairwise comparisons and their respective maximum difference levels are obtained. If there is no observed difference between the two elements, no score is updated for either element. For a win, a value of +1 is added to the winning element, and a value of −1 is added to the losing element. If the difference level is Large, the values added are +2 and −2, respectively.

Consider the simple case where the objective is to sort a list of elements, each with an individual underlying value, where the underlying values are the discrete numbers 1 through 4. This process is visualized in Figure 3.1, where the x-axis represents the underlying true values of the elements.
At first, as can be seen in Figure 3.1a, all values would be initialized to the same value, in this case 0. After one iteration, meaning outputting an arbitrary comparison and inputting the result of said comparison, the updated scores can be seen in Figure 3.1b. It is here clear that the get_comparison() function yielded the element pair with the true values 2 and 4, and that the user returned the information that 4 was perceived as larger than 2, meaning the elements with values 2 and 4 got their scores updated with −1 and +1 respectively. With the same basic concept the algorithm can continue, and after further iterations the sorting, as seen in Figure 3.1c, is seemingly correct, where the lowest value also has the lowest score.

(a) The different elements are initialized to the same value, 0.
(b) An arbitrary comparison is chosen, 2 and 4, and their individual scores are subsequently updated by +1 for a win and −1 for a loss.
(c) After 20 comparisons, the scores for the elements and their respective values could look like this.
Figure 3.1: A visualization of how sorting the elements with the values [1, 2, 3, 4] could be performed with Borda Count.

One significant detail about Borda Count is that it does not have a finishing condition, and consequently requires a maximum number of comparisons. This means that the algorithm will continue to make comparisons and update scores until it is explicitly stopped.

3.1.3 Hamming LUCB

As previously covered, Hamming LUCB attempts to split a list into a predetermined number of sets with a given error margin, while keeping track of an estimated score as well as a confidence radius for each element. The radius used in this project is the same as introduced in the original paper by Heckel et al. [22]. Let u(a) be the number of times element a has been compared. Further, let s be the size of the list and d ∈ (0, 1) be a tolerance parameter, where larger values imply a higher error tolerance. The radius is then defined as

r(a) = √(β(a, d/s) / (2u(a))),

where β is defined as

β(a, d′) = log(1/d′) + 0.75 log(log(1/d′)) + 1.5 log(1 + log(u(a)/2)).

When selecting the next comparison, the algorithm considers the elements from each subset where the confidence interval is furthest outside the set, as well as each element that is currently within an error margin. Out of these, the algorithm selects the element with the largest confidence radius and compares it with a random element from the list. This approach prioritizes elements that are near the boundaries of a subset or within one of the error margins. As for the inference, it is quite similar to Borda Count as the score of an element is increased for a given win and decreased for
During initialization of the algorithm, each element is compared once and given a score of 1 if it wins, or 0 if it loses, as can be seen in Figure 3.2a. During the next step, the results of which are depicted in Figure 3.2b, 2 is compared against 4 and loses. The exact choice of elements to compare here was arbitrary as each previously had the same radius. Lastly, Figure 3.2c shows the underlying representation of the algorithm after the termination criteria have been met. (a) During the initial round, each element is compared once. (b) The value 2 is compared to 4 and loses. (c) The final sorting of the algorithm before it terminates. Note that no element in Set 1 over- laps with an element in Set 2. Further, note that elements closer to the edges of the sets have been compared more and subsequently have a smaller radius. Figure 3.2: A visualization of how sorting the elements with integer values 1 through 10 could be performed with Hamming LUCB. 18 3. Method 3.1.4 TrueSkill The final algorithm that was implemented was TrueSkill. At its core, the algorithm uses the TrueSkill ranking system which assumes that the score of every element follows a normal distribution. This method allows the ranking of the data points with a Bayesian statistical perspective. A significant difference from Borda Count and Hamming LUCB is foremost that the certainty of a data point can not be directly derived from its number of comparisons, but rather, the TrueSkill measurement is dynamic and adaptive depending on the results of the comparisons. When selecting the next comparison, an adaptation was made of a previously de- veloped algorithm called TSSort [40]. This algorithm considers the combination of elements that have the most overlap in the 95% confidence interval. Furthermore, it also takes into account the width of the widest confidence interval in the set. This is because there may be cases where the overlap between two sets of elements is nearly 100%, but their variance is very small, indicating that the data points are nearly equal in skill. Conversely, there may be another set of elements where the overlap is smaller, but the widest confidence interval is very large, indicating sig- nificant uncertainty in the ordering of those elements. By doing so, the algorithm ensures that it makes informed comparisons that take into account not simply the point estimates of the skill levels of the elements but also the level of confidence associated with those estimates. The overlap between two elements i and j can be calculated with the lower and upper limits of the data points with regard to two standard deviations. Then let a, b := µi ± 2σi, and c, d := µj ± 2σj representing the lower and upper limits of the data points, then the formula for the weighted overlap is defined as Overlapi,j := min(b, d)−max(a, c) max(b, d)−min(a, c) ·max(b− a, d− c). When working with more than two elements, the choice is essentially made by search- ing for the combination with the highest additive sum. Additionally, modifications were made to prevent the same comparisons from being presented to the same user multiple times. This risk is particularly high in situations where the data points are similar. We reason that presenting the same comparison to a user does not really offer any relevant information for this thesis, while also potentially being frustrating for the annotator. Instead, the algorithm identifies the highest-scored comparison that has not yet been compared by the user and presents that as the next compari- son. 
Regarding the inference step, TrueSkill uses a Bayesian approach to update the estimated skill levels of the elements. After each comparison, the algorithm updates the mean and standard deviation of the normal distribution representing the skill level of each element. The update process takes into account the expected outcome of the comparison based on the skill levels of the elements and the uncertainty associated with each skill level. Similar to Hamming LUCB and Borda Count, a comparison labeled Large is treated as 2 consecutive wins. When the difference provided is None, TrueSkill counts it as a draw.

The algorithm's execution when ordering the values 1 through 4 can be seen in Figure 3.3. The method instantiates with the default parameters of the TrueSkill algorithm. This means that the score of each element is assumed to follow the Gaussian distribution N(25, 8.33²), as shown in Figure 3.3a. The following step, Figure 3.3b, illustrates the scenario where 2 and 1 are compared, yielding new mean and variance values. The comparison results in higher certainty regarding the points because of the newly acquired knowledge; in essence, these are the values' posterior distributions. The final estimation after 20 comparisons, as shown in Figure 3.3c, has significantly smaller variances, indicating greater certainty in the estimated scores and more exact placements of the values after performing this process.

(a) Each element is initialized with an estimated score of 25 and a variance of 8.33². (b) During the next step, 2 wins over 1. The posterior distributions are calculated for both and the order is updated. (c) After another 20 comparisons, the current estimations of the distributions could look as above.

Figure 3.3: A visualization of how sorting the list [1,2,3,4] could be performed with TrueSkill.

3.2 Evaluation Structure

The primary goal of this thesis is to compare a rank-based annotation system to a rating-based one when labeling CT scans containing varying degrees of bronchial wall thickening. However, as there are multiple ways to construct a rank-based system, and annotation for this specific task requires trained radiologists, a different form of evaluation will be needed earlier in the development process. For this reason, the project has been split into three stages. These are:

1. Simulation, covered in Section 4.
2. Arbitrary Subjective Annotation, covered in Section 5.
3. Annotation of Bronchial Wall Thickening, covered in Section 6.

The idea was to initially test all the algorithms presented in Section 3.1 using simulated noisy pairwise comparisons. By making this early comparison, algorithms that showed less promise could be excluded before any user tests were performed. Consequently, annotation resources could be more meaningfully distributed during later evaluation. The second step was to test the remaining algorithms on an arbitrary subjective annotation task that did not require any expert knowledge. This way, the results of the algorithms when people with differing opinions annotate could be compared. The best-performing sorting algorithm would then be used as the rank-based system for the final step, where it is compared to a rating-based system. A visualization of this process can be seen in Figure 3.4.

Figure 3.4: Visualization of the evaluation steps that will be made, and how algorithms will be excluded from further testing.
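To close out the description of the method, the inference step of Section 3.1.4 can be made concrete. The sketch below uses the open-source trueskill Python package; treating a Large difference as two consecutive wins and a None difference as a draw follows the description above, while the function name is our own.

```python
from trueskill import Rating, rate_1vs1

def infer(winner, loser, difference):
    # Update two ratings given one comparison; difference is one of
    # 'None', 'Normal', 'Large'. 'None' is a draw and 'Large' counts
    # as two consecutive wins, as described in Section 3.1.4.
    if difference == 'None':
        return rate_1vs1(winner, loser, drawn=True)
    winner, loser = rate_1vs1(winner, loser)
    if difference == 'Large':
        winner, loser = rate_1vs1(winner, loser)
    return winner, loser

a, b = Rating(), Rating()    # defaults: mu = 25, sigma = 25/3, i.e. about 8.33
a, b = infer(a, b, 'Large')  # a's mean rises and both sigmas shrink
```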
4 Evaluation of Simulation

Each of the four algorithms covered in Section 3.1 (i.e. Merge Sort, Borda Count, Hamming LUCB, and TrueSkill) was first evaluated using simulation. The purpose of this evaluation was to see how the algorithms performed in a noisy setting. Before running any user tests, we wanted to see which algorithms showed the most promise, so as to not waste annotation resources evaluating an algorithm that required an infeasible number of comparisons. To this end, the algorithms were used to order a list, with a known ground truth ordering, using simulated noisy pairwise comparisons.

4.1 Experimental Setup

A list of integers was to be ordered. The sorting algorithms decided which elements to compare, and the simulated annotator would subsequently answer which number was the largest. The probability that the correct number was chosen as the largest was 1/2 + γ, where γ depends on the similarity of the values. The main idea is that γ is small for values that are close to each other, and large for values that are far apart. It was assumed that this would best reflect the noise encountered in subjective assessments. For this reason, we defined the function γ as

\[
  \gamma(x_1, x_2) = \frac{\log(c \cdot |x_1 - x_2|)}{2 \cdot \log(c \cdot m)},
\]

where x_1 and x_2 are two different integers from the list. Here, m denotes the largest achievable difference in the set (e.g. if the list consists of numbers between 1 and 100, then m = 100 − 1 = 99), and c is a parameter representing the clearness of the comparison. The impact of different levels of comparison clearness on the probability of a correct assessment is visualized in Figure 4.1. Figure 4.1a shows the probability of a correct comparison in the case of c = 1. In contrast, Figure 4.1b shows the case where c = 100, which ensures that almost all rulings are accurate. This is demonstrated by the fact that neighboring items in Figure 4.1b are almost certain to be correctly assessed, whereas adjacent elements in Figure 4.1a still have a 50 % chance of being assessed incorrectly. In situations where x_1 = x_2 (i.e. where the gamma function is undefined), the comparison is arbitrary, and the probability of both possible outcomes was 50 %.

(a) Comparison clearness equal to 1. (b) Comparison clearness equal to 100.

Figure 4.1: A visualization of how the probability of a correct comparison is affected by different comparison clearness. This assumes that the list only contains integers between 0 and 9 (i.e. m = 9).

This way, the probability of a correct comparison ranges from 1/2 when the values are next to each other, to 1 when the values are as far apart as possible. The parameter c can be set to any value greater than or equal to 1 in order to control how quickly the function converges; larger values for c decrease the noise of the comparisons. Only the Normal difference level was used, as we reason that modeling the probability of the other levels appearing would not offer any more information, the simulation being only a rough approximation of a real annotation scenario. While this simple model will not be able to capture what the noise of the comparisons in the final application will look like, it does give us an indication of how the different algorithms perform compared to each other as the noise varies.

A list of 100 numbers between 1 and 200 was generated through uniform sampling. Borda Count, Hamming LUCB, and TrueSkill were run for 1000 pairwise comparisons (i.e. simulated annotations).
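A minimal sketch of this simulated annotator, together with the normalized Spearman's footrule distance (Section 2.10.2) used to evaluate the estimated orderings, is given below; the function names are ours, and the footrule implementation assumes distinct list values.

```python
import math
import random

def gamma(x1, x2, c, m):
    # Advantage of the correct answer: 0 for adjacent values (with c = 1)
    # and 1/2 when the values are m apart, so p_correct spans 1/2 to 1.
    return math.log(c * abs(x1 - x2)) / (2 * math.log(c * m))

def simulated_comparison(x1, x2, c, m):
    # Return the value judged larger; correct with probability 1/2 + gamma.
    if x1 == x2:
        return random.choice((x1, x2))
    p_correct = 0.5 + gamma(x1, x2, c, m)
    return max(x1, x2) if random.random() < p_correct else min(x1, x2)

def normalized_footrule(estimate, truth):
    # Sum of absolute rank displacements, normalized by its maximum
    # (approximately n^2 / 2 for a list of n distinct elements).
    rank = {v: i for i, v in enumerate(truth)}
    total = sum(abs(i - rank[v]) for i, v in enumerate(estimate))
    return total / (len(truth) ** 2 / 2)
```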
After each comparison, the normalized Spearman's footrule distance was measured between the current estimation of the algorithm and the correct order of the list. Merge Sort was run for the number of comparisons it required, and the footrule distance was only calculated once, using the final sorting.

In the case of Hamming LUCB, a few parameters had to be set. Firstly, it was decided to let the algorithm attempt to split the list into 5 different subsets of equal size. Further, a margin of 4 elements between each set was used. Lastly, the tolerance parameter d was set to 0.99. For the TrueSkill algorithm, each element was initiated with the default parameters, that is µ = 25 and σ² = 8.33².

(a) Graph of the normalized Spearman's footrule distance with c = 1. (b) Graph of the normalized Spearman's footrule distance with c = 10. (c) Graph of the normalized Spearman's footrule distance with c = 100.

Figure 4.2: The normalized Spearman's footrule distance, for different comparison clearness values, between the current estimation of the algorithm and the true ordering over 1000 comparisons, averaged over 10 runs.

4.2 Results

After running these simulations ten times, each with a newly generated list, the average distance per comparison was calculated. The result of this with the comparison clearness parameter c set to 1 can be seen in Figure 4.2a. Likewise, the results of running the same experiments with c = 10 and c = 100 can be seen in Figures 4.2b and 4.2c respectively. In the graphs, the number of comparisons for Merge Sort is the average over the 10 runs.

4.3 Conclusion

There are several takeaways from these results. Most interesting is the fact that TrueSkill appears to converge significantly faster than any of the other algorithms. Not only does it achieve the lowest distance after 1000 comparisons, but it also outperforms Merge Sort given the same number of comparisons. It should however also be noted that, as the clearness parameter is increased, Merge Sort shows the greatest increase in performance. This is to be expected, as the algorithm is constructed to find the optimal ordering, with as few comparisons as possible, in a noiseless setting.

Furthermore, it would seem that Borda Count and Hamming LUCB both perform poorly in comparison. Perhaps the largest reason for this similar performance is the fact that both algorithms always compare the element of interest to another randomly selected one. This non-dynamic selection technique will lead to difficulties distinguishing between elements that are of similar quality. Although there are parameters for the Hamming LUCB algorithm that can be tuned, testing during implementation indicated that tweaking these would not have a noticeable effect compared to the aforementioned issue. These parameters instead have a greater impact on the termination criteria, which were not reached during any of the 10 runs. While robust and noise resistant, these algorithms do not utilize information gained from previous comparisons when selecting the next one, most likely making them too resource-demanding for the target application.

5 Evaluation of Arbitrary Subjective Annotation

With the knowledge obtained from running the simulations, the next step was to test the most promising sorting methods on users performing an actual annotation task. It was decided that the subjective annotation task in question would be to order images of people based on how old they appeared to be.
We reasoned that this task would be simple enough for the average person to perform, while also being difficult and subjective enough to cause disagreements (i.e. noise). Although there is no ground truth ordering, we can still measure how consistent the results produced by the algorithms are. We speculate that if multiple groups were to order the same list, a well-performing, noise-resistant algorithm would produce more similar outputs than one that is not. Moreover, the potential added efficiency of letting the user order a shorter list, as opposed to making pairwise comparisons, can also be measured, with the caveat that the results are only representative of this specific annotation task, as the cognitive workload could increase if the annotation task became more difficult. As such, the goals of this experiment were to see which system produced higher-quality orderings in terms of reliability, as well as to evaluate the efficiency of different comparison sizes. With these goals in mind, the two main questions posed for these initial tests were:

1. How consistent are the sorted lists produced by the different algorithms?
2. In the case of this annotation task, is there a gain in letting the user sort a list of four images instead of making pairwise comparisons?

To answer these questions, two tests were performed. The first test, discussed further in Section 5.1.1, had annotators order lists as groups via pairwise comparisons using different algorithms. The orderings produced by the groups could subsequently be compared to see if any algorithm seemed to produce more consistent results. The second test, covered in Section 5.1.2, instead had participants annotate individually, both by making pairwise comparisons and by ordering a shorter list. The purpose of this test was to measure the time required per annotation, to see if either of the options proved more efficient.

5.1 Experimental Setup

In order to perform the tests, a graphical user interface was developed. This interface supported both pairwise comparisons, as can be seen in Figure 5.1, and the ordering of lists of size three or four, as can be seen in Figure 5.2. More images and further descriptions of the final version of the interface can be found in Appendix A. For the pairwise comparisons, the options Of Similar Age, Younger/Older, and Clearly Younger/Older directly correspond to the difference levels None, Normal, and Large. Similarly, for the ordering of lists, these levels are instead displayed above the images, with No Difference and Large Difference naturally corresponding to None and Large, whilst no text corresponds to Normal. Apart from saving the current state of the sorting algorithm after every comparison, the result of each comparison was also stored as a row in a CSV file, along with auxiliary information such as the time it took to make the annotation and which user performed it. This CSV file was vital as it would allow for later recreation of the annotations, step by step, helping us investigate convergence.

Figure 5.1: An example screenshot of the user interface when sorting by pairwise comparisons.

Figure 5.2: An example screenshot of the user interface when ordering a shorter list.

The images used for the experiments were collected from Kaggle [41]. An initial cleaning of the dataset was made to remove images of low quality.
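The exact log format is not specified in the text; the following is a minimal sketch of how each comparison could be appended to such a CSV file, with hypothetical file and column names.

```python
import csv
import time
from pathlib import Path

LOG = Path("annotations.csv")  # hypothetical file name
FIELDS = ["user", "timestamp", "duration_s", "left", "right",
          "winner", "difference"]

def log_comparison(user, left, right, winner, difference, duration_s):
    # Append one comparison so the annotation process can be replayed later.
    new_file = not LOG.exists()
    with LOG.open("a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if new_file:
            writer.writeheader()
        writer.writerow({"user": user, "timestamp": time.time(),
                         "duration_s": duration_s, "left": left,
                         "right": right, "winner": winner,
                         "difference": difference})
```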
As this is a publicly available dataset, most of the images are of celebrities. Participants of the study were however asked to only take the appearance in the images into account, disregarding any prior knowledge of the person. The dataset also included labels in the form of ages, but these were ignored, as the objective ground truth was not of interest for this specific study.

5.1.1 Group Testing Setup

As previously discussed, the main goal of the group annotation test was to compare how consistent the outputs of the algorithms were. The results from the simulation indicated that TrueSkill showed the most promise, so we subsequently wanted to test it against a benchmark algorithm. It was decided that this benchmark would be Merge Sort as it, apart from TrueSkill, seemed to be the best-performing one given a reasonable number of comparisons. Nine annotators were recruited and subsequently split into three groups. Each group was to order the same list of 100 images, denoted data set A (with different initial shuffled orderings), according to how old the person in the image appeared to be. This was done using the TrueSkill algorithm, with each member of the group making 220 pairwise comparisons, one after another, totaling 660 comparisons. This total was selected as previous simulations indicated that the algorithm had converged to a reasonable extent at this point (see Figure 4.2). Similar to Section 4, the default parameters were used when initializing TrueSkill. The groups were then randomly shuffled, and the same procedure was performed for another list of 100 images, denoted data set B, this time using the Merge Sort algorithm. Each member except the last was asked to make 180 comparisons, with the final member making comparisons until the algorithm was finished. Table 5.1 shows the group assignment for the two steps in more detail.

  User   Algorithm   Data   Group        User   Algorithm   Data   Group
  1      TrueSkill   A      1            1      Merge Sort  B      3
  2      TrueSkill   A      1            2      Merge Sort  B      3
  3      TrueSkill   A      1            3      Merge Sort  B      2
  4      TrueSkill   A      2            4      Merge Sort  B      1
  5      TrueSkill   A      2            5      Merge Sort  B      2
  6      TrueSkill   A      2            6      Merge Sort  B      1
  7      TrueSkill   A      3            7      Merge Sort  B      2
  8      TrueSkill   A      3            8      Merge Sort  B      1
  9      TrueSkill   A      3            9      Merge Sort  B      3

Table 5.1: The distribution of users across groups. Two different data sets of 100 images were used for the two steps.

5.1.2 Individual Testing Setup

For the individual tests, four users were recruited, each ordering two lists by themselves. Half of the users started by ordering a list using pairwise comparisons, while the other half annotated by ordering lists of four images. The lists consisted of 50 images, and the participants making pairwise comparisons were asked to make 220 annotations, while the ones ordering shorter lists were asked to make 60. Similarly to the group tests, these numbers were based on convergence observed during simulation, as well as on how much total time we estimated that the users would have to spend to complete the task. After having finished this first round of annotations, each user annotated a second list of images using the method not initially utilized. The algorithm used was TrueSkill for all scenarios. The exact distribution of comparison sizes and corresponding image sets can be seen in Table 5.2.
  User   Algorithm   Size   Data         User   Algorithm   Size   Data
  10     TrueSkill   2      C            10     TrueSkill   4      D
  11     TrueSkill   2      D            11     TrueSkill   4      C
  12     TrueSkill   2      C            12     TrueSkill   4      D
  13     TrueSkill   2      D            13     TrueSkill   4      C

Table 5.2: The distribution of images and how they were sorted by participants.

5.2 Results

This section will cover the results of the two initial user tests performed. First, the consistency of the outputs of the group tests will be discussed. Thereafter, the efficiency tests of ordering a shorter list will be covered.

5.2.1 Group Testing Results

After the trials had concluded, the annotation process for each group was reconstructed using the collected data. When doing this, the distances between the three different orderings could be calculated after each comparison. More specifically, the normalized Spearman's footrule was used to measure the distance between groups 1 and 2, groups 1 and 3, and groups 2 and 3. Figure 5.3 shows the average distance between these group orderings estimated by TrueSkill at each time step. It also shows the average distance between the final orderings produced by Merge Sort (the exact number of comparisons required has been averaged over the three runs). The lowest average distance achieved by TrueSkill is approximately 0.151, while Merge Sort reaches a distance of about 0.233. Apart from this significant difference, it would also appear that TrueSkill converges more quickly, reaching the average distance acquired by Merge Sort in 368 comparisons as opposed to 550.

Figure 5.3: The average normalized Spearman's footrule distance between the three orderings produced by each algorithm.

Apart from looking at how the orderings produced by different groups converge toward each other, the convergence of the individual orderings was also measured. Figure 5.4 shows the moving average of the root mean squared error of the estimated scores of the current iteration, compared to the one before. The dashed vertical lines highlight changes of annotator. The interesting thing to note is that for two of the three groups, the change between annotators had a great effect on the current rate of convergence.

Lastly, when groups annotated with the TrueSkill algorithm, they had the option to state that one person was Of Similar Age, Younger/Older, or Clearly Younger/Older. To better understand the perceived difficulty of the task, the frequency of the use of these different options was calculated. The distribution of these difference levels can be seen in Table 5.3. This will be further discussed in Section 6.3.

Figure 5.4: Moving average (with a window size of 20) of the RMSE of the estimated TrueSkill scores of the current iteration, compared to the one before. The dashed lines indicate a change of annotating user.

  Group   Of Similar Age   Younger/Older   Clearly Younger/Older
  1       183 (27.7 %)     362 (54.8 %)    115 (17.4 %)
  2       230 (34.8 %)     381 (57.7 %)    49 (7.4 %)
  3       241 (36.5 %)     377 (57.1 %)    42 (6.4 %)
  All     33.0 %           56.6 %          10.4 %

Table 5.3: The distribution of differences supplied by annotators when ranking the apparent age of people in images using the TrueSkill system.

5.2.2 Individual Testing Results

The time per annotation was recorded for each user. Table 5.4 shows the average time on a user and method level. Exactly how to compare these numbers is not obvious. Each annotation made by ordering a list of four images leads to six pairwise comparisons being made in the background, as the sketch below illustrates.
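A minimal sketch of this expansion step: an ordering of k images yields k(k − 1)/2 pairwise outcomes, so a list of four yields six. The function name is ours.

```python
from itertools import combinations

def ordering_to_pairs(ordered):
    # Expand a user-supplied ordering (e.g. youngest to oldest) into the
    # pairwise outcomes fed to the underlying pairwise algorithm.
    return list(combinations(ordered, 2))

pairs = ordering_to_pairs(["img_a", "img_b", "img_c", "img_d"])
assert len(pairs) == 6  # 4 * 3 / 2 pairwise comparisons from one ordering
```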
Using this measurement, each of the users made more comparisons per second using the list-ordering method. However, the selection strategies are fundamentally different, meaning that the six comparisons made would potentially not have been the same six as the pairwise algorithm would have selected. Furthermore, as the annotation task itself becomes different, the quality of each comparison can vary: if one annotation task were more difficult, noise could increase.

To exemplify the problem of comparing the annotation speeds, we can compare the orderings of the different users at different times.

  User   Average Time Per Pairwise Comparison   Average Time Per List Ordering
  10     3.9 s                                  17.8 s
  11     2.9 s                                  13.9 s
  12     5.6 s                                  27.7 s
  13     4.4 s                                  11.2 s

Table 5.4: The average time per annotation of users when making pairwise comparisons, as well as when ordering shorter lists.

As can be seen in Table 5.2, users 10 and 12 ordered the same collections of images using the same techniques. The same is true for users 11 and 13. Consequently, the rate at which the orderings converge toward each other can be compared. Figure 5.5a shows the similarity of the orderings for users 10 and 12 when they utilized pairwise comparisons, as well as when ordering shorter lists. Similarly, Figure 5.5b shows the distance between the orderings produced by users 11 and 13. The time of each annotation has been calculated by averaging the times of the two annotators. While these examples are too few to draw any conclusions (as the users' subjective opinions could differ to varying degrees), it can be seen that the cases where the annotators ordered shorter lists did not converge faster.

(a) Similarity of orderings produced by users 10 and 12 over time. (b) Similarity of orderings produced by users 11 and 13 over time.

Figure 5.5: Similarity of orderings produced by different users on separate sets of images, where the y-axis represents the normalized Spearman's footrule distance.

5.3 Conclusion

The results from the group tests indicate that TrueSkill is better at producing consistent orderings. We reason that this is because the algorithm is better able to manage disagreements and inconsistencies among annotators, allowing it to capture the general consensus of a group. Moreover, TrueSkill is able to reach this better ordering in fewer comparisons than what would be required of Merge Sort.

When looking at the convergence of the scores of TrueSkill, as seen in Figure 5.4, it becomes clear that annotator preference is important. As different annotators perceive the annotation task differently, some will be quicker to state that there is no difference between two images, while others may be more inclined to try to decide upon a winner. Furthermore, as the annotation task is subjective, an annotator who is in agreement with the current ordering will make fewer updates than one who is not. For these reasons, it can be difficult to use the current rate of convergence as a well-defined termination condition for the TrueSkill algorithm. Instead, convergence on a per-user basis can be one factor to consider when deciding if more comparisons are required.

Lastly, it is difficult to draw any conclusions from the individual tests.
While the time per comparison did decrease when ordering a shorter list, there is still a lack of understanding of the quality of these comparisons compared to the standard pairwise version of the algorithm. The consequence of this is that even though the annotator is technically able to make more comparisons in a given amount of time, the ordering may still not converge faster than in the pairwise case. For this reason, we believe that, while ordering shorter lists shows some promise, further user testing is required before drawing any conclusions.

6 Evaluation of Bronchial Wall Thickening Annotation

The final step of the project was to evaluate the annotation system on the application area of interest. That is, to rank CT images of the lungs according to the severity of the bronchial wall thickening present. The goal was to take the most promising rank-based algorithm and compare it to a rating-based system. This evaluation step aimed to address two key aspects: the quality of the annotations and the resource requirements. In order to evaluate these, the following questions were considered:

1. Do the annotations produced by the radiologists align better when using the rank-based system compared to a rating-based one?
2. For the rank-based system, how is the similarity of the produced labels affected as the number of comparisons increases?

Through the previous evaluations, it has become evident that TrueSkill is the algorithm displaying the greatest potential. Not only is it noise resistant, but it also displays quick convergence. For these reasons, TrueSkill will be used as the rank-based algorithm for this experiment. Moreover, as the results from Section 5.2.2 were inconclusive, it was decided that only pairwise comparisons would be used for the test. As a lack of resources did not permit us to test both solutions, it was reasoned that the standard pairwise alternative would be the safer option, as the annotation task would potentially be more difficult.

6.1 Experimental Setup

Slight adjustments were made to the interface in order to better support comparing images of CT scans. Figure 6.1 shows an example screenshot of a pairwise comparison. The annotators had five options: stating that A & B Are Equal corresponded to the None difference level, while the A/B More Severe and A/B Much More Severe options corresponded to the Normal and Large levels respectively. In order to conduct the experiment, the annotators also had to be able to provide ratings for single images. As such, a simple image rating system was developed, an example image of which can be seen in Figure 6.2. For this system, the annotators were allowed to assign one of four labels to the image, stating that the bronchial wall thickening appeared to be: None, Mild, Moderate, or Severe.

Figure 6.1: An example screenshot of the user interface when making pairwise comparisons of CT images.

Figure 6.2: An example screenshot of the user interface when rating the degree of bronchial wall thickening in a single image.

The CT scans used for this experiment are a small subset from the SCAPIS study [14]. An initial sample of 77 subjects was selected for the experiment. For each patient, a series of slices covering the entire lungs was available. The slices were spaced at an interval of 0.6 mm, providing comprehensive coverage of the lungs.
Additionally, two previously generated segmentations were provided, giving the regions corresponding to the left and right lung respectively. These segmentations could be used to display only one lung at a time, making sure the image presented to the annotator contained only the relevant information.

Additional preprocessing of the data was performed in collaboration with an experienced radiologist in the field. In order to obtain more diverse data, it was first necessary to remove slices that were adjacent to one another, since such slices frequently show similar, overlapping bronchial wall thickenings. To mitigate this, a decision was made to consider slices at regular intervals of 1 cm. This interval was approximated by selecting every 17th slice (17 × 0.6 mm ≈ 1 cm), as depicted in Figure 6.3 with black lines. Moreover, it was observed that certain regions at the top and bottom of the lungs rarely displayed visible bronchial wall thickening. Consequently, slices in these specific regions were excluded from further analysis. To ensure a diverse range of slices for evaluation, every fourth slice was retained while discarding the first and last extracted slices for each individual lung. Figure 6.3 illustrates an example of this selection process, with the remaining slices of interest indicated by the green lines for each individual.

Figure 6.3: Slice selection of a CT scan. Slices highlighted in green were extracted to be annotated.

However, in order to get an interesting collection of slices for the experiment, further selection had to be made. There are a few reasons for this. Firstly, the distribution of degrees of bronchial wall thickening is very unbalanced, with most slices showing no signs of bronchial wall thickening and very few showing severe signs. Secondly, some slices did not have any visible bronchial walls at all, making them uninteresting for the target experiment. To make this selection, a radiologist, independent from the annotation evaluation that was to be done, organized a sample of the previously extracted slices into the categories Non-Representative, None, Mild, Moderate, or Severe. We could then use this rough categorization to select the final 75 images such that all images had visible bronchial walls and the degrees of wall thickening were more reasonably distributed.

Thereafter, four experts, one radiologist and three radiology residents (in training), were recruited for the test. Each annotator was to both rate all 75 slices and create an ordering of them by making 500 pairwise comparisons. Two participants began by rating, while the other two started by ordering the images.

6.2 Results

For the final experiment, a group of trained experts was involved in annotating CT scans of the lungs to assess the extent of visible bronchial wall thickening. The objective was to measure the level of agreement among the annotators regarding both the direct labels they assigned and the labels and rankings provided by the ranking system for the 75 ordered CT images.

6.2.1 Rating-Based Annotation Quality

Once the radiologists completed the task of annotating the CT scans and assigning ratings to indicate the amount of visible bronchial wall thickening, the collected ratings were analyzed to assess the inter-annotator agreement.
To measure this agreement, Krippendorff's alpha was calculated, utilizing the ordinal difference function, as it was deemed to be the most appropriate metric given the ordinal nature of the labels. The resulting agreement score, approximately 0.38, was observed to be relatively low. A score in this range suggests that there are significant inter-annotator variations in the assessment of bronchial wall thickenings.

6.2.2 Rank-Based Annotation Quality

There are multiple ways in which the annotator agreement can be compared using the ordered lists of images. One quite apparent method is to simply compare the ranks of the elements. When comparing the rankings of the 75 CT scans by the annotators, Krippendorff's alpha was once again employed using the ordinal difference function. Each of the 75 ranks was treated as a distinct label. This approach yielded an agreement score of approximately 0.73, which is considered acceptable for alpha.

While Krippendorff's alpha is designed to function on any span of numbers, the case can be made that it is difficult to compare the value when using 75 labels instead of four. To further evaluate the rank-based method, the agreement was also calculated after distributing the four labels used in the rating-based system across the ordered lists. The distribution used for each of the four orderings was the average distribution observed from the rating-based system. When using this system, 35 % of the cases were labeled as None, 43 % as Mild, 20 % as Moderate, and 2 % as Severe. Assigning labels to the ordered lists based on this average distribution resulted in an alpha value of approximately 0.65.

Lastly, to gain additional understanding of the effect of the rank-based system, the four labels were once again distributed over the orderings. However, this time the distribution used was the individual distribution obtained by each annotator during the rating experiment (meaning that each of the four orderings used a different distribution). Using these labels, a much-decreased agreement score of approximately 0.40 was observed. This highlights the benefit of being able to apply the same distribution, and is something which will be discussed further in Section 6.3.

When ranking, the radiologists were to classify the images in a comparison as Equal, or one of them as More Severe or Much More Severe. Table 6.1 visualizes the distribution of differences provided by each annotator when making the comparisons. A classification of Equal was assigned to roughly 65.7 % of the comparisons. Even though Equal was the classification for the majority of comparisons, More Severe classifications were regularly seen, in about 32.7 % of the comparisons. Notably, the Much More Severe classification was found in only 1.6 % of the instances.

  User   Equal          More Severe    Much More Severe
  1      327 (65.4 %)   165 (33.0 %)   8 (1.6 %)
  2      409 (81.8 %)   88 (17.6 %)    3 (0.6 %)
  3      284 (56.8 %)   199 (39.8 %)   17 (3.4 %)
  4      294 (58.8 %)   202 (40.4 %)   4 (0.8 %)
  All    65.7 %         32.7 %         1.6 %

Table 6.1: The distribution of differences supplied by annotators when annotating bronchial wall thickening using the rank-based system.

6.2.3 Resource Requirements

Using the saved annotations, the ordering of the images could be recreated after every comparison. Indirect labels could subsequently be distributed, which allows us to investigate how the inter-annotator agreement converges.
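A minimal sketch of this indirect labeling step and the agreement computation, assuming the krippendorff Python package; the distribution follows the averages reported above, while the function names and the toy orderings are ours.

```python
import numpy as np
import krippendorff

def labels_from_ordering(ordering, distribution):
    # Distribute ordinal labels 0..3 (None..Severe) over a ranked list,
    # ordered from least to most severe bronchial wall thickening.
    counts = np.round(np.array(distribution) * len(ordering)).astype(int)
    counts[-1] += len(ordering) - counts.sum()  # absorb rounding error
    labels, idx = {}, 0
    for label, count in enumerate(counts):
        for image in ordering[idx:idx + count]:
            labels[image] = label
        idx += count
    return labels

avg_dist = [0.35, 0.43, 0.20, 0.02]  # None, Mild, Moderate, Severe
images = list(range(75))

# Toy stand-ins for the four annotators' orderings; one row per annotator.
orderings = [np.random.permutation(images).tolist() for _ in range(4)]
rows = [labels_from_ordering(o, avg_dist) for o in orderings]
reliability = [[row[i] for i in images] for row in rows]
alpha = krippendorff.alpha(reliability_data=reliability,
                           level_of_measurement="ordinal")
```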
For this analysis, it was decided to use four labels and distribute them according to the average distribution observed during rating, as this makes the agreement score most directly comparable to the rating-based system. Figure 6.4 shows Krippendorff's alpha after every comparison. Furthermore, as the time of each annotation was recorded, the resources spent on the two different methods can be quantified. Figure 6.5 shows how alpha varies over time, where the time has been calculated by averaging the time it took the four annotators to make a specific comparison (e.g. if it took on average 10 minutes for the annotators to make 100 comparisons, then the alpha obtained after 100 comparisons is placed at 10 minutes in the graph).

Figure 6.4: The inter-annotator agreement, when distributing the four labels provided during rating according to the average observed distribution, after every comparison. The dashed line represents the agreement achieved by the rating-based approach.

Figure 6.5: The inter-annotator agreement over the averaged time.

6.3 Conclusion

When letting radiologists annotate using both methods, it has become clear that the rank-based method can produce better-aligned labels. The primary reason for this is the ability to use the same distribution of labels on all orderings. While it can be argued that this is an unfair comparison, we reason that it is instead one of the advantages of rank-based systems. In this case, one of the difficult and subjective assessments radiologists had to make was where to draw the line between two different degrees of bronchial wall thickening (e.g. deciding where the images go from mild to moderate). In some sense, the indirect labeling avoids this problem by instead focusing on creating an internal ordering of the data points. The implication is that if one were to label a set of images, the choice of annotator would not have as large an impact on the end result if the rank-based version were used with a predetermined distribution.

Another interesting thing to note is the apparent difficulty of the task. As can be seen in Table 6.1, the majority of presented comparisons were deemed to be equal. This is in clear contrast to the distribution of difference levels observed during the initial user tests, as seen in Table 5.3. There are multiple possible explanations for this. Firstly, an uneven distribution could mean that there are in fact many images displaying a similar degree of bronchial wall thickening, implying that there is no large difference between elements. Secondly, the subjective assessment could also simply be very difficult, making it hard to state which of two images has more visible bronchial wall thickening. The interesting thing to note is that regardless of this, an agreement score of 0.73 was achieved when comparing the ranks produced. This implies that while the task may have been difficult, there are clear similarities in the orderings produced by the different annotators.

Finally, it is clear that the rank-based method is more resource-demanding. As the trade-off between annotation quality and resources is of interest, it is important to note that it took on average 50 minutes to make 500 comparisons, as opposed to the meager 5 minutes spent on average directly rating the images.
It should, however, be noted that while 500 comparisons were made to achieve the alpha values discussed, Figure 6.4 shows that it only took about 100 comparisons before the rank-based system started to outperform the rating-based one. This is important, as 100 pairwise comparisons is a small sample compared to the 2775 possible (75 images yield 75 · 74 / 2 = 2775 unique pairs), implying that the TrueSkill algorithm is able to make effective use of the resources.

7 Conclusion

This thesis has produced and justified a novel rank-based annotation method through gradual testing and evaluation. The following sections will discuss the implications of the results of this project. The advantages of this system, as opposed to existing systems, will be highlighted, while the current limitations and areas which require further testing will also be brought up.

7.1 Discussion

The efficiency of any rank-based system will rely on the sorting algorithm used. Throughout all stages of evaluation, TrueSkill has proven to be a high-performing algorithm in multiple regards. Not only is it able to produce more consistent outputs compared to Merge Sort, as shown in Section 5.2, but it also converges toward these more consistent orderings in fewer comparisons than Merge Sort requires. This combination of quick convergence and noise resistance makes it a good candidate for any rank-based system. While Borda Count and Hamming LUCB would most likely be able to produce equally consistent orderings given enough annotations, the simulation covered in Section 4 indicated that the number of comparisons this would require would in all likelihood be infeasible for the target application. In the case of annotation of bronchial wall thickeni