Deep Active Learning for Swedish Named Entity Recognition
An empirical evaluation of active learning algorithms for Named Entity Recognition

Master's thesis in Computer Science - Algorithms, Language and Logic

Nadim Hagatulah
Kalle Arvidsson

Department of Computer Science and Engineering
Division of Data Science and AI
Chalmers University of Technology
Gothenburg, Sweden 2021

© Nadim Hagatulah, 2021. © Kalle Arvidsson, 2021.

Academic Supervisor: Richard Johansson, Department of Computer Science and Engineering
Company Supervisor: Petter Wolff, Sahlgrenska Science Park
Examiner: Krasimir Angelov, Professor, Department of Computer Science and Engineering

Master's Thesis 2021
Department of Computer Science and Engineering
Division of Data Science and AI
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: The box illustrates a set with circles as data points. The checkmarks represent data points in the labeled set, question marks represent data points queried to an oracle, and the triple dots represent the unannotated set.

Typeset in LaTeX, template by Magnus Gustaver
Printed by Chalmers Reproservice
Gothenburg, Sweden 2021

Abstract

Named entity recognition holds promise for numerous practical applications involving text data, such as keyword extraction and automated anonymization. However, successfully training a machine learning model for named entity recognition is challenging due to the amount of annotated data required, especially for a language that is not globally common, such as Swedish. In such cases, using a deep pre-trained model such as BERT in conjunction with the practice of active learning may be preferred. To obtain some insight into the implementation of such an approach, this thesis serves as an empirical study of various active learning strategies used in conjunction with BERT-based named entity recognition. The performance of different active learning algorithms and the effect of acquisition size on the performance of active learning are the main focus of this study. In conclusion, after comparing and evaluating 17 different active learning methods, the study's empirical results demonstrate entropy sampling to be the best-performing active learning algorithm for named entity recognition of Swedish texts, and show that the choice of acquisition size has a practically negligible effect on performance.

Keywords: Active Learning, Deep Learning, Transformer, BERT, NLP, Named Entity Recognition, Diversity-Based Sampling, Uncertainty-Based Sampling, Pool-Based Sampling, Cumulative Training.

Acknowledgements

The authors would like to thank our supervisors Richard Johansson and Petter Wolff, who, throughout the whole project, have given invaluable help and feedback.
The thesis would not have been possible without both of their involvement and helpful support.

Nadim Hagatulah and Kalle Arvidsson, Gothenburg, June 2021

Contents

List of Figures
List of Tables
1 Introduction
  1.1 Purpose
  1.2 Related Works/Previous Works
2 Theory
  2.1 Learning Algorithms
    2.1.1 Supervised Learning
    2.1.2 Unsupervised Learning
  2.2 Active Learning
    2.2.1 Active Learning in Deep Learning
      2.2.1.1 Batch Awareness
      2.2.1.2 Model Uncertainty
    2.2.2 Uncertainty Sampling
    2.2.3 Diversity Sampling
    2.2.4 Clustering
    2.2.5 Diverse mini-batch Active Learning
    2.2.6 BatchBALD
    2.2.7 Coreset
    2.2.8 Discriminative Active Learning
    2.2.9 Expected Gradient Length
    2.2.10 Query By Committee
  2.3 Natural Language Processing
    2.3.1 Text Representation
    2.3.2 Named Entity Recognition
      2.3.2.1 Tagging Schemes
  2.4 Transfer Learning
  2.5 Transformer
    2.5.1 BERT
3 Method
  3.1 Hardware
  3.2 Software Framework
    3.2.1 Hyperparameters
    3.2.2 Sentence Embedding
  3.3 Data Set
    3.3.1 SUC 3.0
  3.4 Metrics
    3.4.1 Accuracy
    3.4.2 Precision
    3.4.3 Recall
    3.4.4 F1-Score
    3.4.5 Time
  3.5 Learning Model
    3.5.1 BERT for NER
4 Results
  4.1 Acquisition size 16
    4.1.1 Uncertainty Sampling
    4.1.2 Diversity Sampling
    4.1.3 Time
  4.2 Acquisition size 50
    4.2.1 Uncertainty Sampling
    4.2.2 Diversity Sampling
    4.2.3 Time
  4.3 Comparison based upon acquisition size
  4.4 Consistency of performance
5 Discussion
  5.1 Performances of Query Methods
  5.2 Impact of Acquisition Size
  5.3 Future Works
6 Conclusion
A Appendix 1
B Appendix 2
C Appendix 3
D Appendix 4

List of Figures

2.1 The active learning loop, which begins at the first stage, predicting samples from the unlabeled data set. In the second stage, a query method ranks the informativeness score of the predictions and picks the N best-scoring samples for the oracle to manually annotate. After the oracle has annotated the samples, the labeled data set is extended and the model is retrained.
2.2 The active learning loop with DAL. Steps 1, 2, and 3 are repeated N times for each mini-query before step 4 is performed, which introduces batch awareness and allows the DAL loop to scale to large batch sizes.
2.3 Simplified view of the Skip-Gram architecture with the example sentence "Kalle ska äta pizza deg" where "äta" is the main word.
2.4 Simplified view of the CBOW architecture with the example sentence "Kalle ska äta pizza deg" where "äta" is the main word.
2.5 The Transformer architecture presented in Vaswani et al. [1]. Note that positional encoding is applied to the input, as the attention-based model is otherwise incapable of distinguishing sequential relationships.
2.6 The intended training procedure of BERT according to Devlin et al. [2]. Fine-tuning for a specific task can be done by adapting the output layer and the input formulation to the intended task.
3.1 The frequency of sentence lengths in the SUC 3.0 data set, starting at sentence length one, which represents a one-word sentence.
4.1 Average F1-score across three different seeds versus number of iterations for uncertainty-based query methods, using acquisition size 16.
4.2 Average F1-score across three different seeds versus number of iterations for diversity-based query methods and combined methods, using acquisition size 16.
4.3 Accumulated query time in seconds for tested query methods, using acquisition size 16.
4.4 Average F1-score across three different seeds versus number of iterations for uncertainty-based query methods, using acquisition size 50.
4.5 Average F1-score across three different seeds versus number of iterations for diversity-based query methods and combined methods, using acquisition size 50.
4.6 Accumulated query time in seconds for tested query methods, using acquisition size 50.
A.1 16 samples per iteration, uncertainty-based sampling algorithms; the first graph has seed 100, the second seed 200 and the third seed 300.
A.2 16 samples per iteration, diversity-based sampling algorithms; the first graph has seed 100, the second seed 200 and the third seed 300.
B.1 50 samples per iteration, uncertainty-based sampling algorithms; the first graph has seed 100, the second seed 200 and the third seed 300.
B.2 50 samples per iteration, diversity-based sampling algorithms; the first graph has seed 100, the second seed 200 and the third seed 300.
C.1 Comparison of the naive clustering algorithm combined with different uncertainty-based sampling algorithms. The first graph is with acquisition size 16 and the second with acquisition size 50.
C.2 Comparison of the DBAL algorithm combined with different uncertainty-based sampling algorithms. The first graph is with acquisition size 16 and the second with acquisition size 50.
C.3 Comparison of different distance metrics for the Coreset algorithm. The first graph is for acquisition size 16 and the second for acquisition size 50. The Coreset legend uses the cosine distance metric.
D.1 Mean and deviations of BatchBALD, Coreset and DAL.
D.2 Mean and deviations of C-Entropy, DBAL-Entropy and Entropy.
D.3 Mean and deviations of margin, BALD, least confidence and random sampling.

List of Tables

2.1 Example of the different schemes applied to the sentence "Kalle Arvidsson wants to watch the movie Justice League or Naruto."
3.1 Two shared computing resource servers offered by Chalmers, Titan and Bayes, where Bayes includes Controller, Shannon and Markov, and Titan is the single computer in the Titan server.
3.2 The algorithms that are supported by the framework. MHS denotes Mahalanobis, CRL denotes Correlation and CB denotes Cityblock (Manhattan).
3.3 The mapping of a sentence to a numeric sentence vector embedding.
3.4 Example of how two sentences look in the data frame.
3.5 The entities provided by the SUC 3.0 data set. The second column corresponds to the mapping to more common tags that are easier to work with.
3.6 Example of how a named tag in SUC 3.0 looks.
3.7 Frequency of tokens in the SUC 3.0 data set after the preprocessing steps.
3.8 An example of predicted tokens in the second column, with the BIO scheme applied to the sentence "Kalle ska äta på Condeco" in the first column, and the correct tokens in the third column.
3.9 NER experiments performed with the different pretrained BERT models that support the Swedish language. Time is in the format (minutes:seconds).
3.10 An example of how the informativeness score is calculated for NER. The sentence "Kalle Arvidsson reser till Kina" is fed into the pretrained KB-BERT model (before fine-tuning) with the predefined labels PRS, LOC, ORG, MISC and O (no tagging scheme). Each word in the sentence gains probabilities for each label. Applying the Least Confidence query method to each word yields the informativeness score shown in the bottom row. The far-right column shows the resulting informativeness score for the whole sentence.
4.1 This table captures the regions of interest, mainly from iterations 9-29, in Figure 4.1. The iteration index is zero-based and the acquisition size is 16. The F1-scores are rounded to two decimal places.
4.2 This table captures the regions of interest, mainly from iterations 9-29, in Figure 4.2. The iteration index is zero-based and the acquisition size is 16. The F1-scores are rounded to two decimal places.
4.3 This table captures the regions of interest, mainly from iterations 3-11, in Figure 4.4. The iteration index is zero-based and the acquisition size is 50. The F1-scores are rounded to two decimal places.
4.4 This table captures the regions of interest, mainly from iterations 3-10, in Figure 4.5. The iteration index is zero-based and the acquisition size is 50. The F1-scores are rounded to two decimal places.
4.5 This table provides, for each uncertainty-based method, a comparison of the F1-scores between acquisition sizes at points where the amounts of sampled data are similar. The value to the left of the bar in a cell is from acquisition size 16 and the value to the right is from acquisition size 50.
4.6 This table provides, for each diversity-related method, a comparison of the F1-scores between acquisition sizes at points where the amounts of sampled data are similar. The value to the left of the bar in a cell is from acquisition size 16 and the value to the right is from acquisition size 50.

1 Introduction

Unstructured text data, such as patient stories [3], holds promising potential for various applications. However, to make proper use of such unstructured text data, information extraction methods are required. Extracting information from unstructured text data has been a commonly researched theme within the field of Natural Language Processing (NLP), with Named Entity Recognition (NER) being a prime example of such a subject. Using NER, one can highlight the valuable information or omit privacy-sensitive information within a body of text.

At the time of this thesis, a common approach to handling NLP tasks in general is through supervised machine learning (ML). However, a common challenge within the NLP field is the lack of annotated data. To train an accurate supervised machine learning model, large amounts of accurately annotated data are required. For an information extraction task involving unstructured text data, accurate annotations would require manual work by domain experts, which is costly and inefficient [4]. There is therefore an interest in reducing the amount of annotated data required to successfully train a supervised machine learning model.

1.1 Purpose

There have been numerous studies on information extraction from unstructured texts, but relatively few attempts to apply such techniques to Swedish texts [5]. As such, information extraction for Swedish is still a subject that requires further development. To that end, sizeable, accurately annotated data sets are required. The high-level goal of the project is then to find a way to efficiently acquire accurate annotations for unstructured text data. To this end, Named Entity Recognition for unstructured Swedish texts has been selected as a representative task, but the general idea should be extensible to other languages and NLP tasks as well.
To handle the task of extracting information from unstructured Swedish texts despite the lack of annotated data, this thesis proposes using active learning in combination with a transformer-based transfer learning model. The purpose of this study is then to gain insight into the practical usage of active learning, with consideration of performance and scalability, specifically for the case of information extraction from unstructured texts using supervised machine learning. In order to fulfill this purpose, the objective of this study is to investigate the performance, in particular the convergence rate of model performance but also the time consumption, of various active learning algorithms when training transformer-based NLP models on Swedish texts. To this end, the idea is to develop an annotation framework that can be used to evaluate active learning methods for reducing the amount of annotated data required for the model to reach a reasonable performance [6].

1.2 Related Works/Previous Works

Active learning for machine learning on NLP tasks has been explored in numerous studies in conjunction with a variety of machine learning models such as conditional random fields (CRF), convolutional neural networks (CNN), recurrent neural networks (RNN), and variations of the BERT model. In all of those studies, various types of active learning algorithms were shown to have an advantage over randomly sampling training data in NLP tasks.

In studies related to CRF-based sequence labeling tasks, uncertainty-based methods were shown to have an advantage over other methods such as query-by-committee [7], information density [8], and diversity-based sampling [9]. Additionally, Kholghi et al. (2015) showed in their study concerning CRF-based entity recognition that not retraining the entire model from scratch for each active learning iteration reduces the amount of labeled data required to reach performance comparable to that of fully supervised training [8].

As for studies concerning active learning in conjunction with neural-network-based NLP, Shen et al. (2018) evaluated three different uncertainty-based active learning algorithms on named entity recognition using a CNN-LSTM model [6]. All three algorithms performed similarly, with the least confidence method underperforming slightly compared to mean normalized log-probability and Bayesian active learning by disagreement. The study also noted the computational advantage of mean normalized log-probability over Bayesian active learning by disagreement, as the latter required multiple forward passes.

The study most similar to this one is Ein-Dor et al. (2020), which evaluated and compared a variety of active learning algorithms on binary text classification tasks using BERT [10]. Experiments were performed across a number of data sets with different categories to classify, with the target class prior of the data sets ranging from 10% to 50%. The study showed that none of the tested active learning algorithms consistently outperformed its counterparts in the case of BERT-based binary text classification. The run-time differences between the active learning algorithms were also noted in the study.
2 Theory

This chapter discusses the theory behind the key concepts of the study. The background and general idea of active learning are thoroughly discussed. Popular acquisition methods from the literature that will be used in the experiments are also introduced and motivated. An overview of the NLP task that will be solved with active learning is then given, and finally the transformer model is explained in detail.

2.1 Learning Algorithms

A machine learning model generally requires a learning algorithm and training data. The ML model artifact is created by the training process, and the goal of the training process is for the ML model to capture the patterns which lie within the training data. It is therefore important for the training data to be sufficiently representative of the domain in order to produce a well-functioning model which correctly generalizes to predicting new unseen data. There are multiple different training algorithms that utilize the training data in different ways in order to best capture the patterns for the ML model.

2.1.1 Supervised Learning

In supervised learning, a model is typically trained by feeding it training data with labels that represent the desired mapping of the data points. The weight parameters of the model are then adjusted until a loss function is reduced to a satisfactory level, which indicates that the model has learned the patterns of the training data well enough.

Supervised learning has some challenges, including the need for domain experts to produce a sufficient amount of labeled data for a model to be able to converge to a satisfactory level [11]. However, factors including experience, emotions, and circumstances often affect a person's judgment during annotation. As such, annotations from a single person regularly contain bias and other mistakes, which means they are not reliable. It is therefore generally required to have multiple annotators to guarantee the integrity of the data annotations through redundancy, which in turn makes the process expensive and time-consuming. A popular method to measure agreement between annotators is Cohen's kappa [12], which assigns a score based on the level of agreement between the annotators.

2.1.2 Unsupervised Learning

In contrast to supervised learning, unsupervised learning does not require any annotations of the training data. This is because the goal of an unsupervised learning algorithm is to find structure in the training data on its own by finding patterns based on the data points' features, often in the form of clusters or density estimation. The challenges that unsupervised learning presents are that it is a more complex method than supervised learning and therefore generally requires more data and time to converge during training. The results can also be hard to interpret, and, because no labels are given, the correct answers cannot be accurately determined, which introduces a risk of erroneous results [13]. This would require external evaluation from a human domain expert or an internal evaluation in the form of an objective function.

2.2 Active Learning

Annotating data for supervised learning is usually a task with high demands on time and manpower, especially if the task requires expertise in a certain field. In order to reduce the amount of annotated data required to train a model, one may apply the practice of active learning: using an algorithm to find the sample set within an unannotated data set that is most likely to yield the most significant performance improvement for a given ML model [14][7], should that sample set be manually annotated and used to train the model.
There are generally three active learning scenarios [15]. Membership Query Synthesis, where the learner generates fictional data point instances based on the underlying distribution of the unlabeled pool. The generated data points are queried to an oracle for a label, and a generated data point is accepted if the oracle can recognize it and assign a label; otherwise the data point is discarded. Stream-Based Selective Sampling, where unlabeled data points are examined one by one against some informativeness criterion. If the examined data point is accepted according to the informativeness criterion, it is queried to the oracle for a label assignment; otherwise it is discarded. Pool-Based Sampling, which considers the whole unlabeled data set by assigning all data points an informativeness score and then selecting the top-scoring data points.

Membership query synthesis is only useful for cases where the unlabeled data pool is small [15] and the focus therefore is on expanding the data set. The Stream-Based Selective Sampling and Pool-Based Sampling scenarios both rely on an informativeness score. The Pool-Based Sampling method has to perform an expensive informativeness assignment over the whole unlabeled data set before deciding which data points to include, while Stream-Based Selective Sampling decides immediately for each data point. Therefore, if resources and hardware are limited, Stream-Based Selective Sampling is the appropriate method; otherwise, Pool-Based Sampling generally produces better results since it has an underlying knowledge of the whole unlabeled data set [16].

The algorithm which determines the informativeness of the sample set drawn from the unlabeled data set is called a query method. There exists a variety of query methods [7], with the most common approaches being uncertainty-based sampling and diversity-based sampling. In uncertainty-based sampling, the sample set is constructed from the samples which the model is most uncertain of, whereas in diversity-based sampling the algorithm tries to select a diverse set of queried samples.

Figure 2.1: The active learning loop, which begins at the first stage, predicting samples from the unlabeled data set. In the second stage, a query method ranks the informativeness score of the predictions and picks the N best-scoring samples for the oracle to manually annotate. After the oracle has annotated the samples, the labeled data set is extended and the model is retrained.

The samples are assigned an informativeness score and then the most informative data is queried to an oracle, which is an agent such as a human annotator. The oracle annotates the sample with a ground-truth label so that it can be used for the supervised training of the model. The training in active learning can be done in two ways. The first is incremental learning, where the already trained model is kept and its weights are updated with only the newly acquired data from the current iteration of the loop. The second is cumulative training, where the model weights are reset every iteration and the model is then retrained from scratch with all the labeled data accumulated during the active learning process [17]. The whole general active learning loop is described in Figure 2.1.
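As a minimal sketch of the pool-based loop with cumulative training described above (assuming a generic model object with fit/predict_proba methods, an acquisition function and an oracle callable; all names here are illustrative, not the framework of Chapter 3):

```python
import numpy as np

def active_learning_loop(model, acquire, x_pool, x_labeled, y_labeled,
                         oracle, acquisition_size, n_iterations):
    """Generic pool-based active learning loop (cumulative training).

    model   -- object with reset(), fit(X, y) and predict_proba(X) (assumed interface)
    acquire -- query method: (probabilities, K) -> indices of the K most informative samples
    oracle  -- callable returning ground-truth labels for the queried samples
    """
    for _ in range(n_iterations):
        # 1. Predict the unlabeled pool.
        probs = model.predict_proba(x_pool)
        # 2-3. Score informativeness and pick the K best-scoring samples.
        query_idx = acquire(probs, acquisition_size)
        # 4. The oracle labels the queried samples; extend the labeled set.
        x_labeled = np.concatenate([x_labeled, x_pool[query_idx]])
        y_labeled = np.concatenate([y_labeled, oracle(x_pool[query_idx])])
        x_pool = np.delete(x_pool, query_idx, axis=0)
        # 5. Retrain from scratch on all accumulated labeled data (cumulative training).
        model.reset()
        model.fit(x_labeled, y_labeled)
    return model, x_labeled, y_labeled
```

Incremental learning would instead skip the reset and fit only on the newly acquired samples in step 5.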
2.2.1 Active Learning in Deep Learning

Combining deep learning with active learning presents some challenges. The main issues are that the Softmax output from deep learning models tends not to correctly capture the uncertainty of the model over the classes [18], and that most active learning algorithms are intended for one-by-one querying and are not batch aware.

2.2.1.1 Batch Awareness

In theory, uncertainty-based active learning should query one sample per iteration. This is impractical in cases involving deep learning models for two reasons. First, deep learning models are usually trained with batches of multiple samples for the purpose of parallelization and stabilization of the training. Second, retraining the model from scratch for each sample added to the training set is inefficient, as deep learning models are usually resource-intensive to train. Thus, it is more practical to query samples in batches of size K, where K is an acquisition size greater than one. However, the intuitive approach of naively choosing the K highest-scoring samples will usually result in K near-identical samples [19].

A method to alleviate the problem is to choose K to be small enough between training rounds. If at some point model M is uncertain in predicting some class c, and K is chosen to be large, a large number of similar samples belonging to class c would be chosen. The training would be inefficient because the model would always be skewed towards some dominating class c, which affects the generalization of the model negatively. Instead, if K is chosen to be small, only a few such near-identical points are chosen, so the model's generalization is not hurt while its weak predictions on class c are still improved.

Another way to address the issue is through diversity-based sampling, which seeks to prevent the querying of near-identical samples. Such an approach requires the diversity-based method to perform well on the unannotated data. For instance, in order for diversity sampling with clusters to succeed, the cluster sizes and centroid positions need to be well placed, which requires a good representation of the data points. A poorly chosen number of clusters can create one large dominating cluster and several small clusters; if the smaller clusters are depleted, querying will only occur from the dominating cluster, which results in only near-identical points being chosen.

Algorithms such as DAL, DBAL, Coreset, and BatchBALD are developed with batch awareness in mind and are discussed more thoroughly in the subsequent sections.
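For concreteness, the naive top-K acquisition criticized at the beginning of this subsection amounts to nothing more than a sort over the per-sample scores, with no notion of similarity between the chosen samples (an illustrative sketch, not the thesis's implementation):

```python
import numpy as np

def naive_top_k(scores: np.ndarray, k: int) -> np.ndarray:
    """Naive batch acquisition: pick the K samples with the highest
    informativeness scores, ignoring how similar they are to each other."""
    return np.argsort(scores)[-k:]

# Near-identical samples tend to receive near-identical scores, so a large K
# can fill the whole batch with redundant points.
scores = np.array([0.91, 0.90, 0.90, 0.89, 0.12, 0.08])
print(naive_top_k(scores, k=3))  # selects indices 0, 1, 2 (three almost identical samples)
```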
2.2.1.2 Model Uncertainty

Usually, the uncertainty-based query methods rely on the predictions of the learning model. However, the raw Softmax output from a deep learning model tends to be overconfident [20]; for example, high confidence can be assigned to unseen data. The Softmax output of the final layer is therefore unreliable as a measure of confidence and has been shown in the literature to perform worse than random sampling [21].

A method to estimate uncertainty is to cast the deep learning model as a Bayesian model, thus leveraging Bayesian probability theory, which offers tools to reason about the model's confidence in its predictions. By enabling dropout during the predictions of the model, the outputs are no longer deterministic. Each run produces different results, which are interpreted as samples from a probability distribution. Thus, averaging over multiple Softmax outputs allows the model to be interpreted as a Bayesian approximation of a Gaussian process [20]. Using dropout during evaluation is called the Monte Carlo dropout (MC) method, where every neuron in some layer has a probability p of being dropped (the value of the neuron is multiplied by 0). Applying Monte Carlo dropout multiple times and averaging the results is equivalent to Monte Carlo integration. Thus, for a sample x, the probability of class c can be represented through the following formula:

p(y = c \mid x) = \int p(y = c \mid x, \omega)\, p(\omega)\, d\omega

Here, p(\omega) denotes the probability distribution over the weights, that is, whether each neuron is dropped or not; the integration over \omega thus considers all possible weight configurations weighted by their probability. Therefore:

p(y = c \mid x) \approx \int p(y = c \mid x, \omega)\, q^{*}(\omega)\, d\omega

where p(\omega) \approx q^{*}(\omega) and q^{*}(\omega) denotes the dropout distribution. The Monte Carlo integral can further be approximated by running T stochastic forward passes and averaging the probabilities for each class c, which finally gives:

p(y = c \mid x) \approx \frac{1}{T} \sum_{t} p(y = c \mid x, \omega_t) = \frac{1}{T} \sum_{t} p_c^{t}

This yields the output probability for every class c. A higher number of runs T samples more from the assumed probability distribution and therefore gives a better approximation of it.

2.2.2 Uncertainty Sampling

As uncertainty sampling methods are based upon the idea of selecting the samples which the model is most uncertain of, the question is then how the uncertainty is measured. As mentioned, the common approach is to evaluate the predictions made by the model in the form of the probability distribution over the classes. Based upon the evaluation function used to measure uncertainty, there are multiple uncertainty-based query methods, including the following examples:

Least confidence (LC) is the most basic query method, which measures uncertainty as the difference between the most confident prediction and absolute certainty:

\mathrm{LC} = 1 - \max_{c \in C} p(y = c \mid x)

This method only considers the highest-valued model output, and high informativeness scores are given to outputs with low confidence. The intuition is that training will be most rewarding on the samples the model is most uncertain of. It only considers the single most confident class and discards the remaining information.

Margin confidence (MC) is a method that looks at the margin between the two most confident predictions and then at the difference of that margin from 100% confidence:

\mathrm{MC} = 1 - \Big( \max_{c \in C} p(y = c \mid x) - \max_{d \in C \setminus \{c\}} p(y = d \mid x) \Big)

In MC, both the most confident and the second most confident predictions contribute to the informativeness score, thereby amending some of the weaknesses of LC [22] by using more information. MC queries samples that have a small margin between classes and thus should allow the model to more easily discriminate between those classes during training.

Entropy score (ES) is a query method based on the average information inherent in a variable's possible outcomes, as defined by Shannon [22]:

\mathrm{ES} = H[y \mid x] = - \sum_{c \in C} p(y = c \mid x) \log_2 p(y = c \mid x)

This method is the most general and uses all available information; every class probability is used to measure the informativeness of the sample. It thus benefits models with a large number of classes [22]. If the class probabilities are evenly distributed, a high informativeness score is given, because this indicates that the model is confused between the classes.
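A minimal PyTorch-flavoured sketch of the ideas above, assuming a classifier with dropout layers whose forward pass returns logits; the function and variable names are illustrative and not the framework described in Chapter 3. It averages T stochastic forward passes (Monte Carlo dropout) and then computes the LC, MC and ES scores from the averaged class probabilities:

```python
import torch
import torch.nn.functional as F

def mc_dropout_probs(model, x, T=20):
    """Average T stochastic forward passes with dropout kept active (MC dropout)."""
    model.train()  # keeps dropout layers active during prediction
    with torch.no_grad():
        probs = torch.stack([F.softmax(model(x), dim=-1) for _ in range(T)])
    return probs.mean(dim=0)          # p(y = c | x), shape (batch, classes)

def least_confidence(p):
    return 1.0 - p.max(dim=-1).values

def margin_confidence(p):
    top2 = p.topk(2, dim=-1).values   # two most confident classes
    return 1.0 - (top2[..., 0] - top2[..., 1])

def entropy_score(p, eps=1e-12):
    return -(p * torch.log2(p + eps)).sum(dim=-1)
```

In all three cases a higher score means a more informative sample, so a batch is formed by taking the highest-scoring candidates.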
Bayesian Active Learning by Disagreement (BALD) [23] is a method that evaluates uncertainty from a Bayesian statistical perspective. It is intended to be used in conjunction with a Bayesian framework:

\mathrm{BALD} = H[y \mid x] - \mathbb{E}\big[ H[y \mid x, \omega] \big]

Here H[y \mid x] is the general entropy of the output, which captures the overall uncertainty of the model, and H[y \mid x, \omega] is the entropy of the output given model parameters \omega, which captures the uncertainty for a specific model setting. As, in a Bayesian framework, the model parameters \omega can be considered a random variable, \mathbb{E}[H[y \mid x, \omega]] is then the expected entropy of the output over specific model settings. Intuitively, the objective is to find the samples which the model is most uncertain of because there are settings of the parameters which produce predictions that are confident yet contradictory. This approach is technically equivalent to finding the samples which cause a maximal decrease in the expected uncertainty of the parameters of a Bayesian model [23].

2.2.3 Diversity Sampling

Uncertainty sampling is sensitive to selecting outliers and data which does not represent the data set [24]. In contrast, diversity sampling alleviates such issues by selecting the subset of samples that is considered to cover the entire data set as thoroughly as possible. Depending on the approach used to generate the subset, there are numerous types of diversity sampling methods, including the following [25]:

Model-based Outliers, which explores the logits and gradients of a model, grouping samples and assigning them informativeness scores based on how the neurons of the model react to the unlabeled data points; this includes methods such as EGL (Section 2.2.9).

Representative Sampling, where each sample is scored with an informativeness score by evaluating how representative it is of the whole data set relative to the labeled set. This includes methods such as DAL (Section 2.2.8), Coreset (Section 2.2.7), and BatchBALD (Section 2.2.6).

Cluster-based Sampling, where an unsupervised method is used to find structure and trends in the unlabeled data pool by grouping similar samples together into clusters; this includes methods such as DBAL (Section 2.2.5) and Clustering (Section 2.2.4).

Clustering samples can be done using various algorithms, but the most commonly used is K-means, which is defined as follows: given a predefined integer K for the number of clusters and a set of N samples, the samples are clustered into K clusters, each with a centroid point \mu_j [26], such that the inertia is minimized:

\arg\min \sum_{j=0}^{K} \sum_{i=0}^{N} \lVert x_i^j - \mu_j \rVert^2

where x_i^j denotes the data point x_i that belongs to cluster j and \mu_j is the centroid of that cluster. The sum of squared errors uses the Euclidean metric, which is not efficient at capturing similarity in the high-dimensional spaces typical of document classification [27]. Because of the curse of dimensionality, cosine distance is the preferred metric for document similarity, but to alleviate the problem with Euclidean distance, L2-normalization can be applied to the data set before the Euclidean distance is computed, which then acts as a proxy for cosine distance. L2-normalization unit-normalizes each vector by setting the sum of squares of its elements to 1. For two vectors X and Y, the Euclidean distance is denoted \lVert X - Y \rVert, and for the squared distance used in the inertia minimization:

\lVert X - Y \rVert^2 = (X - Y)^{T}(X - Y) = \lVert X \rVert^2 + \lVert Y \rVert^2 - 2 X^{T} Y

Because X and Y are normalized, \lVert X \rVert^2 = \lVert Y \rVert^2 = 1, and the expression simplifies to 2(1 - \cos(X, Y)), which shows how Euclidean distance and cosine similarity relate and that the two metrics are order-equivalent.
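A small numerical sketch of the relation derived above (illustrative only; the embedding dimension and variable names are assumptions): after L2-normalization, the squared Euclidean distance equals 2(1 - cosine similarity), so K-means on the normalized vectors preserves the ordering given by cosine distance.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 768))                      # e.g. five sentence embeddings
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # L2-normalize each vector

a, b = Xn[0], Xn[1]
sq_euclidean = np.sum((a - b) ** 2)
cosine_sim = a @ b
print(np.isclose(sq_euclidean, 2 * (1 - cosine_sim)))  # True
```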
2.2.4 Clustering

A naive approach to the cluster-based diversity sampling method is to split the whole unlabeled set into clusters and then pick samples from each cluster and add them to the labeled set, so that the labeled set is balanced with regard to the defined clusters. It is expected that this method ensures more diversity than uncertainty sampling, and the quality of the diverse data depends on the predefined number of clusters K. Choosing samples from the clusters can be done randomly, but can also be combined with an uncertainty sampling method, thus choosing the most informative sample in each cluster.

2.2.5 Diverse mini-batch Active Learning

Another diversity sampling method that utilizes clusters is Diverse mini-batch Active Learning (DBAL). The method also combines diversity with uncertainty sampling, but instead of first clustering and then ranking by an informativeness score assigned by the uncertainty acquisition function, as in naive clustering, DBAL first ranks the whole unlabeled data set by informativeness score and then clusters in a second step [28]. The samples closest to the centroids are selected and added to the labeled set.

Concretely, the whole set is first scored on informativeness by an uncertainty sampling method; then the top β·K samples are prefiltered, where β is a prefilter constant and K is the number of samples to query per iteration. The set is prefiltered for efficiency purposes, since running K-means can be computationally expensive. The prefiltered set is then clustered with weighted K-means clustering into K clusters, where the weights are the informativeness scores assigned by the uncertainty method for each sample. The K-means clustering objective function is then modified to:

\arg\min \sum_{j=0}^{K} \sum_{i=0}^{N} s_i^j \, \lVert x_i^j - \mu_j \rVert^2

Here s_i^j represents the uncertainty informativeness of sample x_i^j and acts as a weight. Finally, a batch of K samples is chosen by picking the sample closest to each of the K centroids.
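A sketch of the DBAL selection step using scikit-learn, under the assumption that sentence embeddings and per-sample uncertainty scores are already computed; the helper names and the default β are illustrative, not taken from the thesis framework.

```python
import numpy as np
from sklearn.cluster import KMeans

def dbal_select(embeddings, scores, k, beta=10, seed=0):
    """Diverse mini-batch selection: prefilter the beta*k most informative
    samples, run informativeness-weighted K-means, and return the index of
    the sample closest to each of the k centroids."""
    prefilter = np.argsort(scores)[-beta * k:]           # top beta*k by uncertainty
    X, w = embeddings[prefilter], scores[prefilter]
    km = KMeans(n_clusters=k, random_state=seed).fit(X, sample_weight=w)
    chosen = []
    for centroid in km.cluster_centers_:
        distances = np.linalg.norm(X - centroid, axis=1)
        chosen.append(prefilter[np.argmin(distances)])   # nearest sample to this centroid
    # Two centroids may share a nearest sample, so duplicates are dropped.
    return np.unique(chosen)
```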
2.2.6 BatchBALD

BatchBALD is an extension of BALD that, instead of scoring the informativeness of a single sample, calculates the informativeness of a whole batch of K samples at once, where K denotes the acquisition size. BatchBALD is thus a batch-aware method that both scores a whole batch and finds the batch that best represents the data set. It is therefore a representative diversity sampling method combined with uncertainty sampling. Naively calculating the informativeness of all possible batches is not feasible because there exists an exponential number of possible subsets [19]. To find good enough subsets with a greedy algorithm, the submodularity property is exploited, which yields a 1 - 1/e approximate solution [19]. The subsets are scored as:

\mathrm{BatchBALD} = H[y_0, \ldots, y_n \mid x_0, \ldots, x_n] - \mathbb{E}\big[ H[y_0, \ldots, y_n \mid x_0, \ldots, x_n, \omega] \big]

Just as in the standard BALD algorithm, the first term captures the general uncertainty of the selected batch, and the second term computes the expected uncertainty of the selected batch. High informativeness is then assigned to batches with low confidence in their predictions, which is represented by disagreement about which labels to assign and yields a large first term and a small second term [19]. Notably, BatchBALD increases the amount of computation required to select a batch compared to the naive BALD algorithm [19].

2.2.7 Coreset

Coreset is a batch-aware active learning algorithm that queries a batch of K samples, where K is a predefined acquisition size [29]. The algorithm selects a subset of size K from the unlabeled set that represents the larger population of the unlabeled data set by using the Coreset selection algorithm [30]; it thus falls into the representative diversity category and is also a purely diversity-based method. A bound between the average Coreset loss of a subset of the data and the rest of the unlabeled data set is defined via a geometric representation; minimizing this bound then finds the subset of samples that best represents the whole unlabeled data set, which is what makes it valuable as an active learning algorithm. Minimizing the bound is equivalent to solving the k-center problem [29][31]. The algorithm then selects K points such that the largest distance between any sample and its nearest center is minimized. The problem is defined as:

\min_{s^1 : |s^1| \le K} \; \max_{i} \; \min_{j \in s^1 \cup s^0} \Delta(x_i, x_j)

where s^0 denotes the already labeled set, s^1 the newly queried batch, and \Delta(\cdot, \cdot) the distance between two samples.
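The exact k-center objective above is hard to solve directly; a common practical variant, sketched here as an assumption rather than the thesis's exact implementation, is the greedy 2-approximation (farthest-first traversal): repeatedly add the unlabeled point whose distance to its nearest already-chosen center (including the labeled set) is largest.

```python
import numpy as np

def kcenter_greedy(embeddings, labeled_idx, k):
    """Greedy 2-approximation of the k-center objective (farthest-first traversal)."""
    centers = list(labeled_idx)
    # Distance from every point to its nearest current center.
    min_dist = np.min(
        np.linalg.norm(embeddings[:, None, :] - embeddings[centers][None, :, :], axis=-1),
        axis=1,
    )
    selected = []
    for _ in range(k):
        new_center = int(np.argmax(min_dist))           # farthest point from all centers
        selected.append(new_center)
        dist_to_new = np.linalg.norm(embeddings - embeddings[new_center], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)    # update nearest-center distances
    return selected
```

The distance here is Euclidean for simplicity; as noted in Section 2.2.3, cosine distance (or Euclidean distance on L2-normalized embeddings) can be used instead, which is the metric variant compared in Appendix C.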