Deep Active Learning for Swedish Named Entity Recognition
An empirical evaluation of active learning algorithms for Named Entity Recognition

Master's thesis in Computer Science - Algorithms, Language and Logic

Nadim Hagatulah
Kalle Arvidsson

Department of Computer Science and Engineering
Division of Data Science and AI
Chalmers University of Technology
Gothenburg, Sweden 2021

© Nadim Hagatulah, 2021. © Kalle Arvidsson, 2021.

Academic Supervisor: Richard Johansson, Department of Computer Science and Engineering
Company Supervisor: Petter Wolff, Sahlgrenska Science Park
Examiner: Krasimir Angelov, Professor, Department of Computer Science and Engineering

Master's Thesis 2021
Department of Computer Science and Engineering
Division of Data Science and AI
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: The box illustrates a set with circles as data points. The checkmarks represent data points in the labeled set, question marks represent data points queried to an oracle, and the triple dots represent the unannotated set.

Typeset in LaTeX, template by Magnus Gustaver
Printed by Chalmers Reproservice
Gothenburg, Sweden 2021

Abstract

Named entity recognition holds promise for numerous practical applications involving text data, such as keyword extraction and automated anonymization. However, successfully training a machine learning model for named entity recognition is challenging due to the amount of annotated data required, especially for a language that is not globally common, such as Swedish. In such cases, using a deep pre-trained model such as BERT in conjunction with the practice of active learning may be preferred. To obtain some insight into the implementation of such an approach, this thesis serves as an empirical study of various active learning strategies used in conjunction with BERT-based named entity recognition. The performance of different active learning algorithms and the effect of acquisition size on the performance of active learning are the main focus of this study. In conclusion, after comparing and evaluating 17 different active learning methods, the study's empirical results demonstrate entropy sampling to be the best-performing active learning algorithm for named entity recognition of Swedish texts, and show that the choice of acquisition size has a practically negligible effect on performance.

Keywords: Active Learning, Deep Learning, Transformer, BERT, NLP, Named Entity Recognition, Diversity-Based Sampling, Uncertainty-Based Sampling, Pool-Based Sampling, Cumulative Training.

Acknowledgements

The authors would like to thank our supervisors Richard Johansson and Petter Wolff, who, throughout the whole project, have given invaluable help and feedback.
The thesis would not have been possible without both of their involvement and helpful support.

Nadim Hagatulah and Kalle Arvidsson, Gothenburg, June 2021

Contents

List of Figures
List of Tables
1 Introduction
  1.1 Purpose
  1.2 Related Works/Previous Works
2 Theory
  2.1 Learning Algorithms
    2.1.1 Supervised Learning
    2.1.2 Unsupervised Learning
  2.2 Active Learning
    2.2.1 Active Learning in Deep Learning
      2.2.1.1 Batch Awareness
      2.2.1.2 Model Uncertainty
    2.2.2 Uncertainty Sampling
    2.2.3 Diversity Sampling
    2.2.4 Clustering
    2.2.5 Diverse mini-batch Active Learning
    2.2.6 BatchBALD
    2.2.7 Coreset
    2.2.8 Discriminative Active Learning
    2.2.9 Expected Gradient Length
    2.2.10 Query By Committee
  2.3 Natural Language Processing
    2.3.1 Text Representation
    2.3.2 Named Entity Recognition
      2.3.2.1 Tagging Schemes
  2.4 Transfer Learning
  2.5 Transformer
    2.5.1 BERT
3 Method
  3.1 Hardware
  3.2 Software Framework
    3.2.1 Hyperparameters
    3.2.2 Sentence Embedding
  3.3 Data Set
    3.3.1 SUC 3.0
  3.4 Metrics
    3.4.1 Accuracy
    3.4.2 Precision
    3.4.3 Recall
    3.4.4 F1-Score
    3.4.5 Time
  3.5 Learning Model
    3.5.1 BERT for NER
4 Results
  4.1 Acquisition size 16
    4.1.1 Uncertainty Sampling
    4.1.2 Diversity Sampling
    4.1.3 Time
  4.2 Acquisition size 50
    4.2.1 Uncertainty Sampling
    4.2.2 Diversity Sampling
    4.2.3 Time
  4.3 Comparison based upon acquisition size
  4.4 Consistency of performance
5 Discussion
  5.1 Performances of Query Methods
  5.2 Impact of Acquisition Size
  5.3 Future Works
6 Conclusion
A Appendix 1
B Appendix 2
C Appendix 3
D Appendix 4

List of Figures

2.1 The active learning loop, which begins at the first stage, predicting samples from the unlabeled data set. In the second stage, a query method ranks the informativeness score of the predictions and picks the N best-scoring samples for the oracle to manually annotate. After the oracle has annotated the samples, the labeled data set is extended and the model is retrained.
2.2 The active learning loop with DAL. Steps 1, 2, and 3 are repeated N times for each mini-query before step 4 is performed, which introduces batch awareness and allows the DAL loop to scale to large batch sizes.
2.3 Simplified view of the Skip-Gram architecture with the example sentence "Kalle ska äta pizza deg" where "äta" is the main word.
2.4 Simplified view of the CBOW architecture with the example sentence "Kalle ska äta pizza deg" where "äta" is the main word.
2.5 The Transformer architecture presented in Vaswani et al. [1]. Note that positional encoding is applied to the input, as the attention-based model is otherwise incapable of distinguishing sequential relationships.
2.6 The intended training procedure of BERT according to Devlin et al. [2]. Fine-tuning for a specific task can be done by adapting the output layer and the input formulation to the intended task.
3.1 The frequency of sentence lengths in the SUC 3.0 data set, starting at sentence length one, which represents a one-word sentence.
4.1 Average F1-score across three different seeds versus number of iterations for uncertainty-based query methods, using acquisition size 16.
4.2 Average F1-score across three different seeds versus number of iterations for diversity-based query methods and combined methods, using acquisition size 16.
4.3 Accumulated query time in seconds for tested query methods, using acquisition size 16.
4.4 Average F1-score across three different seeds versus number of iterations for uncertainty-based query methods, using acquisition size 50.
4.5 Average F1-score across three different seeds versus number of iterations for diversity-based query methods and combined methods, using acquisition size 50.
4.6 Accumulated query time in seconds for tested query methods, using acquisition size 50.
A.1 16 samples per iteration, uncertainty-based sampling algorithms; the first graph has seed 100, the second seed 200 and the third seed 300.
A.2 16 samples per iteration, diversity-based sampling algorithms; the first graph has seed 100, the second seed 200 and the third seed 300.
B.1 50 samples per iteration, uncertainty-based sampling algorithms; the first graph has seed 100, the second seed 200 and the third seed 300.
B.2 50 samples per iteration, diversity-based sampling algorithms; the first graph has seed 100, the second seed 200 and the third seed 300.
C.1 Comparison of the naive clustering algorithm combined with different uncertainty-based sampling algorithms. The first graph is with acquisition size 16 and the second with acquisition size 50.
C.2 Comparison of the DBAL algorithm combined with different uncertainty-based sampling algorithms. The first graph is with acquisition size 16 and the second with acquisition size 50.
C.3 Comparison of different distance metrics for the Coreset algorithm. The first graph is for acquisition size 16 and the second for acquisition size 50. The Coreset legend uses the cosine distance metric.
D.1 Mean and deviations of BatchBALD, Coreset and DAL.
D.2 Mean and deviations of C-Entropy, DBAL-Entropy and Entropy.
D.3 Mean and deviations of margin, BALD, least confidence and random sampling.

List of Tables

2.1 Example of the different schemes applied to the sentence "Kalle Arvidsson wants to watch the movie Justice League or Naruto."
3.1 Two shared computing resource servers offered by Chalmers, Titan and Bayes, where Bayes includes Controller, Shannon and Markov, and Titan is the single computer in the Titan server.
3.2 The algorithms that are supported by the framework. MHS denotes Mahalanobis, CRL denotes Correlation and CB denotes Cityblock (Manhattan).
3.3 The mapping of a sentence to a numeric sentence vector embedding.
3.4 Example of how two sentences look in the data frame.
3.5 The entities provided by the SUC 3.0 data set. The second column corresponds to the mapping to more common tags that are easier to work with.
3.6 Example of how a named tag in SUC 3.0 looks.
3.7 Frequency of tokens in the SUC 3.0 data set after the preprocessing steps.
3.8 An example of predicted tokens in the second column, with the BIO scheme applied to the sentence "Kalle ska äta på Condeco" in the first column, and the correct tokens in the third column.
3.9 NER experiments performed with the different pretrained BERT models that support the Swedish language. Time is in the format (minutes:seconds).
3.10 An example of how the informativeness score is calculated for NER. The sentence "Kalle Arvidsson reser till Kina" is fed into the pretrained KB-BERT model (before fine-tuning) with the predefined labels PRS, LOC, ORG, MISC and O (no tagging scheme). Each word in the sentence gains probabilities for each label. Applying the Least Confidence query method to each word yields the informativeness score shown in the bottom row. The far-right column shows the resulting informativeness score for the whole sentence.
4.1 This table captures the regions of interest, mainly from iterations 9-29, in Figure 4.1. The iteration index is zero-based and the acquisition size is 16. The F1-scores are rounded to two decimal places.
4.2 This table captures the regions of interest, mainly from iterations 9-29, in Figure 4.2. The iteration index is zero-based and the acquisition size is 16. The F1-scores are rounded to two decimal places.
4.3 This table captures the regions of interest, mainly from iterations 3-11, in Figure 4.4. The iteration index is zero-based and the acquisition size is 50. The F1-scores are rounded to two decimal places.
4.4 This table captures the regions of interest, mainly from iterations 3-10, in Figure 4.5. The iteration index is zero-based and the acquisition size is 50. The F1-scores are rounded to two decimal places.
4.5 This table provides, for each uncertainty-based method, a comparison of the F1-scores between acquisition sizes at points where the amounts of sampled data are similar. The value to the left of the bar in a cell is from acquisition size 16 and the value to the right is from acquisition size 50.
4.6 This table provides, for each diversity-related method, a comparison of the F1-scores between acquisition sizes at points where the amounts of sampled data are similar. The value to the left of the bar in a cell is from acquisition size 16 and the value to the right is from acquisition size 50.

1 Introduction

Unstructured text data, such as patient stories [3], holds promising potential for various applications. However, to make proper use of such unstructured text data, information extraction methods are required. Extracting information from unstructured text data has been a commonly researched theme within the field of Natural Language Processing (NLP), with Named Entity Recognition (NER) being a prime example of such a subject. Using NER, one can highlight the valuable information or omit privacy-sensitive information within a body of text.

At the time of this thesis, a common approach to handling NLP tasks in general is through supervised machine learning (ML). However, a common challenge within the NLP field is the lack of annotated data. To train an accurate supervised machine learning model, large amounts of accurately annotated data are required. For an information extraction task involving unstructured text data, accurate annotations would require manual work by domain experts, which is costly and inefficient [4]. There is therefore an interest in reducing the amount of annotated data required to successfully train a supervised machine learning model.

1.1 Purpose

There have been numerous studies on information extraction from unstructured texts, but relatively few attempts to apply such techniques to Swedish texts [5]. As such, information extraction for Swedish is still a subject that requires further development. To that end, sizeable, accurately annotated data sets are required. The high-level goal of the project is then to find a way to efficiently acquire accurate annotations for unstructured text data. To this end, Named Entity Recognition for unstructured Swedish texts has been selected as a representative task, but the general idea should be extensible to other languages and NLP tasks as well.
To handle the task of extracting information from unstructured Swedish texts despite the lack of annotated data, this thesis proposes using active learning in combination with a transformer-based transfer learning model. The purpose of this study is then to gain insight into the practical usage of active learning, with consideration of performance and scalability, specifically for the case of information extraction from unstructured texts using supervised machine learning. In order to fulfill this purpose, the objective of this study is to investigate the performance, in particular the convergence rate of model performance but also the time consumption, of various active learning algorithms when training transformer-based NLP models on Swedish texts. To this end, the idea is to develop an annotation framework that can be used to evaluate active learning methods for reducing the amount of annotated data required for the model to reach a reasonable performance [6].

1.2 Related Works/Previous Works

Active learning for machine learning on NLP tasks has been explored in numerous studies in conjunction with a variety of machine learning models such as conditional random fields (CRF), convolutional neural networks (CNN), recurrent neural networks (RNN), and variations of the BERT model. In all of those studies, various types of active learning algorithms were shown to have an advantage over randomly sampling training data in NLP tasks.

In studies related to CRF-based sequence labeling tasks, uncertainty-based methods were shown to have an advantage over other methods such as query-by-committee [7], information density [8], and diversity-based sampling [9]. Additionally, Kholghi et al. (2015) showed in their study concerning CRF-based entity recognition that not retraining the entire model from scratch for each active learning iteration reduces the amount of labeled data required to reach performance comparable to that of fully supervised training [8].

As for studies concerning active learning in conjunction with neural-network-based NLP, Shen et al. (2018) evaluated three different uncertainty-based active learning algorithms on named entity recognition using a CNN-LSTM model [6]. All three algorithms performed similarly, with the least confidence method underperforming slightly compared to mean normalized log-probability and Bayesian active learning by disagreement. The study also noted the computational advantage of mean normalized log-probability over Bayesian active learning by disagreement, as the latter required multiple forward passes.

The study most similar to this one is Ein-Dor et al. (2020), which evaluated and compared a variety of active learning algorithms on binary text classification tasks using BERT [10]. Experiments were performed across a number of data sets with different categories to classify, with the target class prior of the data sets ranging from 10% to 50%. The study showed that none of the tested active learning algorithms consistently outperformed its counterparts in the case of BERT-based binary text classification. The run-time differences between the active learning algorithms were also noted in the study.
2 Theory

This chapter discusses the theory behind the key concepts of the study. The background and general idea of active learning are thoroughly discussed. Popular acquisition methods from the literature that will be used in the experiments are also introduced and motivated. An overview of the NLP task that will be solved with active learning is then given, and finally the transformer model is explained in detail.

2.1 Learning Algorithms

A machine learning model generally requires a learning algorithm and training data. The ML model artifact is created by the training process, and the goal of the training process is for the ML model to capture the patterns which lie within the training data. It is therefore important for the training data to be sufficiently representative of the domain in order to produce a well-functioning model which correctly generalizes to predicting new unseen data. There are multiple different training algorithms that utilize the training data in different ways in order to best capture the patterns for the ML model.

2.1.1 Supervised Learning

In supervised learning, a model is typically trained by feeding it training data with labels that represent the desired mapping of the data points. The weight parameters of the model are then adjusted until a loss function is reduced to a satisfactory level, which indicates that the model has learned the patterns of the training data well enough.

Supervised learning has some challenges, including the need for domain experts to produce a sufficient amount of labeled data for a model to be able to converge to a satisfactory level [11]. However, factors including experience, emotions, and circumstances often affect a person's judgment during annotation. As such, annotations from a single person regularly contain bias and other mistakes, which means they are not reliable. It is therefore generally required to have multiple annotators to guarantee the integrity of the data annotations through redundancy, which in turn makes the process expensive and time-consuming. A popular method to measure agreement between annotators is Cohen's kappa [12], which assigns a score based on the level of agreement between the annotators.

2.1.2 Unsupervised Learning

In contrast to supervised learning, unsupervised learning does not require any annotations of the training data. This is because the goal of an unsupervised learning algorithm is to find structure in the training data on its own by finding patterns based on the data points' features, often in the form of clusters or density estimation. The challenges that unsupervised learning presents are that it is a more complex method than supervised learning and therefore generally requires more data and time to converge during training. The results can also be hard to interpret, and, because no labels are given, the correct answers cannot be accurately determined, which introduces a risk of erroneous results [13]. This would require external evaluation from a human domain expert or an internal evaluation in the form of an objective function.

2.2 Active Learning

Annotating data for supervised learning is usually a task with high demands on time and manpower, especially if the task requires expertise in a certain field. In order to reduce the amount of annotated data required to train a model, one may apply the practice of active learning: using an algorithm to find the sample set within an unannotated data set that is most likely to yield the most significant performance improvement for a given ML model [14][7], should that sample set be manually annotated and used to train the model.
There are generally three active learning scenarios [15]. Membership Query Synthesis, where the learner generates fictional data point instances based on the underlying distribution of the unlabeled pool. The generated data points are queried to an oracle for a label, and a generated data point is accepted if the oracle can recognize it and assign a label; otherwise the data point is discarded. Stream-Based Selective Sampling, where unlabeled data points are examined one by one against some informativeness criterion. If the examined data point is accepted according to the informativeness criterion, it is queried to the oracle for a label assignment; otherwise it is discarded. Pool-Based Sampling, which considers the whole unlabeled data set by assigning all data points an informativeness score and then selecting the top-scoring data points.

Membership query synthesis is only useful for cases where the unlabeled data pool is small [15] and the focus therefore is on expanding the data set. The Stream-Based Selective Sampling and Pool-Based Sampling scenarios both rely on an informativeness score. The Pool-Based Sampling method has to perform an expensive informativeness assignment over the whole unlabeled data set before deciding which data points to include, while Stream-Based Selective Sampling decides immediately for each data point. Therefore, if resources and hardware are limited, Stream-Based Selective Sampling is the appropriate method; otherwise, Pool-Based Sampling generally produces better results since it has an underlying knowledge of the whole unlabeled data set [16].

The algorithm which determines the informativeness of the sample set drawn from the unlabeled data set is called a query method. There exists a variety of query methods [7], with the most common approaches being uncertainty-based sampling and diversity-based sampling. In uncertainty-based sampling, the sample set is constructed from the samples which the model is most uncertain of, whereas in diversity-based sampling the algorithm tries to select a diverse set of queried samples.

Figure 2.1: The active learning loop, which begins at the first stage, predicting samples from the unlabeled data set. In the second stage, a query method ranks the informativeness score of the predictions and picks the N best-scoring samples for the oracle to manually annotate. After the oracle has annotated the samples, the labeled data set is extended and the model is retrained.

The samples are assigned an informativeness score and then the most informative data is queried to an oracle, which is an agent such as a human annotator. The oracle annotates the sample with a ground-truth label so that it can be used for the supervised training of the model. The training in active learning can be done in two ways. The first is incremental learning, where the already trained model is kept and its weights are updated with only the newly acquired data from the current iteration of the loop. The second is cumulative training, where the model weights are reset every iteration and the model is then retrained from scratch with all the labeled data accumulated during the active learning process [17]. The whole general active learning loop is described in Figure 2.1.
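As a minimal sketch of the pool-based loop with cumulative training described above (assuming a generic model object with fit/predict_proba methods, an acquisition function and an oracle callable; all names here are illustrative, not the framework of Chapter 3):

```python
import numpy as np

def active_learning_loop(model, acquire, x_pool, x_labeled, y_labeled,
                         oracle, acquisition_size, n_iterations):
    """Generic pool-based active learning loop (cumulative training).

    model   -- object with reset(), fit(X, y) and predict_proba(X) (assumed interface)
    acquire -- query method: (probabilities, K) -> indices of the K most informative samples
    oracle  -- callable returning ground-truth labels for the queried samples
    """
    for _ in range(n_iterations):
        # 1. Predict the unlabeled pool.
        probs = model.predict_proba(x_pool)
        # 2-3. Score informativeness and pick the K best-scoring samples.
        query_idx = acquire(probs, acquisition_size)
        # 4. The oracle labels the queried samples; extend the labeled set.
        x_labeled = np.concatenate([x_labeled, x_pool[query_idx]])
        y_labeled = np.concatenate([y_labeled, oracle(x_pool[query_idx])])
        x_pool = np.delete(x_pool, query_idx, axis=0)
        # 5. Retrain from scratch on all accumulated labeled data (cumulative training).
        model.reset()
        model.fit(x_labeled, y_labeled)
    return model, x_labeled, y_labeled
```

Incremental learning would instead skip the reset and fit only on the newly acquired samples in step 5.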
2.2.1 Active Learning in Deep Learning

Combining deep learning with active learning presents some challenges. The main issues are that the Softmax output from deep learning models tends not to correctly capture the uncertainty of the model over the classes [18], and that most active learning algorithms are intended for one-by-one querying and are not batch aware.

2.2.1.1 Batch Awareness

In theory, uncertainty-based active learning should query one sample per iteration. This is impractical in cases involving deep learning models for two reasons. First, deep learning models are usually trained with batches of multiple samples for the purpose of parallelization and stabilization of the training. Second, retraining the model from scratch for each sample added to the training set is inefficient, as deep learning models are usually resource-intensive to train. Thus, it is more practical to query samples in batches of size K, where K is an acquisition size greater than one. However, the intuitive approach of naively choosing the K highest-scoring samples will usually result in K near-identical samples [19].

A method to alleviate the problem is to choose K to be small enough between training rounds. If at some point model M is uncertain in predicting some class c, and K is chosen to be large, a large number of similar samples belonging to class c would be chosen. The training would be inefficient because the model would always be skewed towards some dominating class c, which affects the generalization of the model negatively. Instead, if K is chosen to be small, only a few such near-identical points are chosen, so the model's generalization is not hurt while its weak predictions on class c are still improved.

Another way to address the issue is through diversity-based sampling, which seeks to prevent the querying of near-identical samples. Such an approach requires the diversity-based method to perform well on the unannotated data. For instance, in order for diversity sampling with clusters to succeed, the cluster sizes and centroid positions need to be well placed, which requires a good representation of the data points. A poorly chosen number of clusters can create one large dominating cluster and several small clusters; if the smaller clusters are depleted, querying will only occur from the dominating cluster, which results in only near-identical points being chosen.

Algorithms such as DAL, DBAL, Coreset, and BatchBALD are developed with batch awareness in mind and are discussed more thoroughly in the subsequent sections.
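For concreteness, the naive top-K acquisition criticized at the beginning of this subsection amounts to nothing more than a sort over the per-sample scores, with no notion of similarity between the chosen samples (an illustrative sketch, not the thesis's implementation):

```python
import numpy as np

def naive_top_k(scores: np.ndarray, k: int) -> np.ndarray:
    """Naive batch acquisition: pick the K samples with the highest
    informativeness scores, ignoring how similar they are to each other."""
    return np.argsort(scores)[-k:]

# Near-identical samples tend to receive near-identical scores, so a large K
# can fill the whole batch with redundant points.
scores = np.array([0.91, 0.90, 0.90, 0.89, 0.12, 0.08])
print(naive_top_k(scores, k=3))  # selects indices 0, 1, 2 (three almost identical samples)
```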
2.2.1.2 Model Uncertainty

Usually, the uncertainty-based query methods rely on the predictions of the learning model. However, the raw Softmax output from a deep learning model tends to be overconfident [20]; for example, high confidence can be assigned to unseen data. The Softmax output of the final layer is therefore unreliable as a measure of confidence and has been shown in the literature to perform worse than random sampling [21].

A method to estimate uncertainty is to cast the deep learning model as a Bayesian model, thus leveraging Bayesian probability theory, which offers tools to reason about the model's confidence in its predictions. By enabling dropout during the predictions of the model, the outputs are no longer deterministic. Each run produces different results, which are interpreted as samples from a probability distribution. Thus, averaging over multiple Softmax outputs allows the model to be interpreted as a Bayesian approximation of a Gaussian process [20]. Using dropout during evaluation is called the Monte Carlo dropout (MC) method, where every neuron in some layer has a probability p of being dropped (the value of the neuron is multiplied by 0). Applying Monte Carlo dropout multiple times and averaging the results is equivalent to Monte Carlo integration. Thus, for a sample x, the probability of class c can be represented through the following formula:

p(y = c \mid x) = \int p(y = c \mid x, \omega)\, p(\omega)\, d\omega

Here, p(\omega) denotes the probability distribution over the weights, that is, whether each neuron is dropped or not; the integration over \omega thus considers all possible weight configurations weighted by their probability. Therefore:

p(y = c \mid x) \approx \int p(y = c \mid x, \omega)\, q^{*}(\omega)\, d\omega

where p(\omega) \approx q^{*}(\omega) and q^{*}(\omega) denotes the dropout distribution. The Monte Carlo integral can further be approximated by running T stochastic forward passes and averaging the probabilities for each class c, which finally gives:

p(y = c \mid x) \approx \frac{1}{T} \sum_{t} p(y = c \mid x, \omega_t) = \frac{1}{T} \sum_{t} p_c^{t}

This yields the output probability for every class c. A higher number of runs T samples more from the assumed probability distribution and therefore gives a better approximation of it.

2.2.2 Uncertainty Sampling

As uncertainty sampling methods are based upon the idea of selecting the samples which the model is most uncertain of, the question is then how the uncertainty is measured. As mentioned, the common approach is to evaluate the predictions made by the model in the form of the probability distribution over the classes. Based upon the evaluation function used to measure uncertainty, there are multiple uncertainty-based query methods, including the following examples:

Least confidence (LC) is the most basic query method, which measures uncertainty as the difference between the most confident prediction and absolute certainty:

\mathrm{LC} = 1 - \max_{c \in C} p(y = c \mid x)

This method only considers the highest-valued model output, and high informativeness scores are given to outputs with low confidence. The intuition is that training will be most rewarding on the samples the model is most uncertain of. It only considers the single most confident class and discards the remaining information.

Margin confidence (MC) is a method that looks at the margin between the two most confident predictions and then at the difference of that margin from 100% confidence:

\mathrm{MC} = 1 - \Big( \max_{c \in C} p(y = c \mid x) - \max_{d \in C \setminus \{c\}} p(y = d \mid x) \Big)

In MC, both the most confident and the second most confident predictions contribute to the informativeness score, thereby amending some of the weaknesses of LC [22] by using more information. MC queries samples that have a small margin between classes and thus should allow the model to more easily discriminate between those classes during training.

Entropy score (ES) is a query method based on the average information inherent in a variable's possible outcomes, as defined by Shannon [22]:

\mathrm{ES} = H[y \mid x] = - \sum_{c \in C} p(y = c \mid x) \log_2 p(y = c \mid x)

This method is the most general and uses all available information; every class probability is used to measure the informativeness of the sample. It thus benefits models with a large number of classes [22]. If the class probabilities are evenly distributed, a high informativeness score is given, because this indicates that the model is confused between the classes.
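A minimal PyTorch-flavoured sketch of the ideas above, assuming a classifier with dropout layers whose forward pass returns logits; the function and variable names are illustrative and not the framework described in Chapter 3. It averages T stochastic forward passes (Monte Carlo dropout) and then computes the LC, MC and ES scores from the averaged class probabilities:

```python
import torch
import torch.nn.functional as F

def mc_dropout_probs(model, x, T=20):
    """Average T stochastic forward passes with dropout kept active (MC dropout)."""
    model.train()  # keeps dropout layers active during prediction
    with torch.no_grad():
        probs = torch.stack([F.softmax(model(x), dim=-1) for _ in range(T)])
    return probs.mean(dim=0)          # p(y = c | x), shape (batch, classes)

def least_confidence(p):
    return 1.0 - p.max(dim=-1).values

def margin_confidence(p):
    top2 = p.topk(2, dim=-1).values   # two most confident classes
    return 1.0 - (top2[..., 0] - top2[..., 1])

def entropy_score(p, eps=1e-12):
    return -(p * torch.log2(p + eps)).sum(dim=-1)
```

In all three cases a higher score means a more informative sample, so a batch is formed by taking the highest-scoring candidates.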
Bayesian Active Learning by Disagreement (BALD) [23] is a method that evaluates uncertainty from a Bayesian statistical perspective. It is intended to be used in conjunction with a Bayesian framework:

\mathrm{BALD} = H[y \mid x] - \mathbb{E}\big[ H[y \mid x, \omega] \big]

Here H[y \mid x] is the general entropy of the output, which captures the overall uncertainty of the model, and H[y \mid x, \omega] is the entropy of the output given model parameters \omega, which captures the uncertainty for a specific model setting. As, in a Bayesian framework, the model parameters \omega can be considered a random variable, \mathbb{E}[H[y \mid x, \omega]] is then the expected entropy of the output over specific model settings. Intuitively, the objective is to find the samples which the model is most uncertain of because there are settings of the parameters which produce predictions that are confident yet contradictory. This approach is technically equivalent to finding the samples which cause a maximal decrease in the expected uncertainty of the parameters of a Bayesian model [23].

2.2.3 Diversity Sampling

Uncertainty sampling is sensitive to selecting outliers and data which does not represent the data set [24]. In contrast, diversity sampling alleviates such issues by selecting the subset of samples that is considered to cover the entire data set as thoroughly as possible. Depending on the approach used to generate the subset, there are numerous types of diversity sampling methods, including the following [25]:

Model-based Outliers, which explores the logits and gradients of a model, grouping samples and assigning them informativeness scores based on how the neurons of the model react to the unlabeled data points; this includes methods such as EGL (Section 2.2.9).

Representative Sampling, where each sample is scored with an informativeness score by evaluating how representative it is of the whole data set relative to the labeled set. This includes methods such as DAL (Section 2.2.8), Coreset (Section 2.2.7), and BatchBALD (Section 2.2.6).

Cluster-based Sampling, where an unsupervised method is used to find structure and trends in the unlabeled data pool by grouping similar samples together into clusters; this includes methods such as DBAL (Section 2.2.5) and Clustering (Section 2.2.4).

Clustering samples can be done using various algorithms, but the most commonly used is K-means, which is defined as follows: given a predefined integer K for the number of clusters and a set of N samples, the samples are clustered into K clusters, each with a centroid point \mu_j [26], such that the inertia is minimized:

\arg\min \sum_{j=0}^{K} \sum_{i=0}^{N} \lVert x_i^j - \mu_j \rVert^2

where x_i^j denotes the data point x_i that belongs to cluster j and \mu_j is the centroid of that cluster. The sum of squared errors uses the Euclidean metric, which is not efficient at capturing similarity in the high-dimensional spaces typical of document classification [27]. Because of the curse of dimensionality, cosine distance is the preferred metric for document similarity, but to alleviate the problem with Euclidean distance, L2-normalization can be applied to the data set before the Euclidean distance is computed, which then acts as a proxy for cosine distance. L2-normalization unit-normalizes each vector by setting the sum of squares of its elements to 1. For two vectors X and Y, the Euclidean distance is denoted \lVert X - Y \rVert, and for the squared distance used in the inertia minimization:

\lVert X - Y \rVert^2 = (X - Y)^{T}(X - Y) = \lVert X \rVert^2 + \lVert Y \rVert^2 - 2 X^{T} Y

Because X and Y are normalized, \lVert X \rVert^2 = \lVert Y \rVert^2 = 1, and the expression simplifies to 2(1 - \cos(X, Y)), which shows how Euclidean distance and cosine similarity relate and that the two metrics are order-equivalent.
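A small numerical sketch of the relation derived above (illustrative only; the embedding dimension and variable names are assumptions): after L2-normalization, the squared Euclidean distance equals 2(1 - cosine similarity), so K-means on the normalized vectors preserves the ordering given by cosine distance.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 768))                      # e.g. five sentence embeddings
Xn = X / np.linalg.norm(X, axis=1, keepdims=True)  # L2-normalize each vector

a, b = Xn[0], Xn[1]
sq_euclidean = np.sum((a - b) ** 2)
cosine_sim = a @ b
print(np.isclose(sq_euclidean, 2 * (1 - cosine_sim)))  # True
```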
2.2.4 Clustering

A naive approach to the cluster-based diversity sampling method is to split the whole unlabeled set into clusters and then pick samples from each cluster and add them to the labeled set, so that the labeled set is balanced with regard to the defined clusters. It is expected that this method ensures more diversity than uncertainty sampling, and the quality of the diverse data depends on the predefined number of clusters K. Choosing samples from the clusters can be done randomly, but can also be combined with an uncertainty sampling method, thus choosing the most informative sample in each cluster.

2.2.5 Diverse mini-batch Active Learning

Another diversity sampling method that utilizes clusters is Diverse mini-batch Active Learning (DBAL). The method also combines diversity with uncertainty sampling, but instead of first clustering and then ranking by an informativeness score assigned by the uncertainty acquisition function, as in naive clustering, DBAL first ranks the whole unlabeled data set by informativeness score and then clusters in a second step [28]. The samples closest to the centroids are selected and added to the labeled set.

Concretely, the whole set is first scored on informativeness by an uncertainty sampling method; then the top β·K samples are prefiltered, where β is a prefilter constant and K is the number of samples to query per iteration. The set is prefiltered for efficiency purposes, since running K-means can be computationally expensive. The prefiltered set is then clustered with weighted K-means clustering into K clusters, where the weights are the informativeness scores assigned by the uncertainty method for each sample. The K-means clustering objective function is then modified to:

\arg\min \sum_{j=0}^{K} \sum_{i=0}^{N} s_i^j \, \lVert x_i^j - \mu_j \rVert^2

Here s_i^j represents the uncertainty informativeness of sample x_i^j and acts as a weight. Finally, a batch of K samples is chosen by picking the sample closest to each of the K centroids.
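A sketch of the DBAL selection step using scikit-learn, under the assumption that sentence embeddings and per-sample uncertainty scores are already computed; the helper names and the default β are illustrative, not taken from the thesis framework.

```python
import numpy as np
from sklearn.cluster import KMeans

def dbal_select(embeddings, scores, k, beta=10, seed=0):
    """Diverse mini-batch selection: prefilter the beta*k most informative
    samples, run informativeness-weighted K-means, and return the index of
    the sample closest to each of the k centroids."""
    prefilter = np.argsort(scores)[-beta * k:]           # top beta*k by uncertainty
    X, w = embeddings[prefilter], scores[prefilter]
    km = KMeans(n_clusters=k, random_state=seed).fit(X, sample_weight=w)
    chosen = []
    for centroid in km.cluster_centers_:
        distances = np.linalg.norm(X - centroid, axis=1)
        chosen.append(prefilter[np.argmin(distances)])   # nearest sample to this centroid
    # Two centroids may share a nearest sample, so duplicates are dropped.
    return np.unique(chosen)
```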
2.2.6 BatchBALD

BatchBALD is an extension of BALD that, instead of scoring the informativeness of a single sample, calculates the informativeness of a whole batch of K samples at once, where K denotes the acquisition size. BatchBALD is thus a batch-aware method that both scores a whole batch and finds the batch that best represents the data set. It is therefore a representative diversity sampling method combined with uncertainty sampling. Naively calculating the informativeness of all possible batches is not feasible because there exists an exponential number of possible subsets [19]. To find good enough subsets with a greedy algorithm, the submodularity property is exploited, which yields a 1 - 1/e approximate solution [19]. The subsets are scored as:

\mathrm{BatchBALD} = H[y_0, \ldots, y_n \mid x_0, \ldots, x_n] - \mathbb{E}\big[ H[y_0, \ldots, y_n \mid x_0, \ldots, x_n, \omega] \big]

Just as in the standard BALD algorithm, the first term captures the general uncertainty of the selected batch, and the second term computes the expected uncertainty of the selected batch. High informativeness is then assigned to batches with low confidence in their predictions, which is represented by disagreement about which labels to assign and yields a large first term and a small second term [19]. Notably, BatchBALD increases the amount of computation required to select a batch compared to the naive BALD algorithm [19].

2.2.7 Coreset

Coreset is a batch-aware active learning algorithm that queries a batch of K samples, where K is a predefined acquisition size [29]. The algorithm selects a subset of size K from the unlabeled set that represents the larger population of the unlabeled data set by using the Coreset selection algorithm [30]; it thus falls into the representative diversity category and is also a purely diversity-based method. A bound between the average Coreset loss of a subset of the data and the rest of the unlabeled data set is defined via a geometric representation; minimizing this bound then finds the subset of samples that best represents the whole unlabeled data set, which is what makes it valuable as an active learning algorithm. Minimizing the bound is equivalent to solving the k-center problem [29][31]. The algorithm then selects K points such that the largest distance between any sample and its nearest center is minimized. The problem is defined as:

\min_{s^1 : |s^1| \le K} \; \max_{i} \; \min_{j \in s^1 \cup s^0} \Delta(x_i, x_j)

where s^0 denotes the already labeled set, s^1 the newly queried batch, and \Delta(\cdot, \cdot) the distance between two samples.
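The exact k-center objective above is hard to solve directly; a common practical variant, sketched here as an assumption rather than the thesis's exact implementation, is the greedy 2-approximation (farthest-first traversal): repeatedly add the unlabeled point whose distance to its nearest already-chosen center (including the labeled set) is largest.

```python
import numpy as np

def kcenter_greedy(embeddings, labeled_idx, k):
    """Greedy 2-approximation of the k-center objective (farthest-first traversal)."""
    centers = list(labeled_idx)
    # Distance from every point to its nearest current center.
    min_dist = np.min(
        np.linalg.norm(embeddings[:, None, :] - embeddings[centers][None, :, :], axis=-1),
        axis=1,
    )
    selected = []
    for _ in range(k):
        new_center = int(np.argmax(min_dist))           # farthest point from all centers
        selected.append(new_center)
        dist_to_new = np.linalg.norm(embeddings - embeddings[new_center], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)    # update nearest-center distances
    return selected
```

The distance here is Euclidean for simplicity; as noted in Section 2.2.3, cosine distance (or Euclidean distance on L2-normalized embeddings) can be used instead, which is the metric variant compared in Appendix C.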