Methods for Optimizing BERT Model on Edge Devices
Accelerating Biomedical NLP with Pruned and Quantized BERT Models

Master's thesis in Complex Adaptive Systems

Atefeh Mirzabeigi
Amir Ali Barani

Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
Gothenburg, Sweden 2025

Master's Thesis 2025

© Atefeh Mirzabeigi, Amir Ali Barani 2025.

Supervisor: Mehrdad Farahani, Computer Science and Engineering
Advisors: Miguel Carmona, Michaël Ughetto, AstraZeneca
Examiner: Richard Johansson, Computer Science and Engineering
Examiner: Mats Granath, Physics

Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX
Gothenburg, Sweden 2025

Abstract

Named-entity recognition (NER) of clinical efficacy endpoints in oncology abstracts supports downstream discovery pipelines at AstraZeneca. Yet the fine-tuned transformer models currently in use are too slow and over-parameterized for large-scale CPU deployment. This thesis evaluates whether post-training model compression techniques can accelerate inference without retraining or harming extraction quality.

In the first stage of this project, standard BERT and BioBERT were individually pruned with a three-stage, Fisher-guided structured pruning workflow at three levels of sparsity. In the second stage, dynamic 8-bit integer quantization using ONNX Runtime was applied to standard BERT, BioBERT, and DistilBERT. The third stage combined pruning and quantization, further optimizing the pre-trained standard BERT and BioBERT transformers. Experiments were run on annotated MEDLINE sentences covering 25 efficacy labels, with F1 score and inference latency per sample serving as the primary metrics.

A 25% structured-sparsity level yielded no measurable drop in F1 score, and the additional 8-bit dynamic quantization step cut latency further. The best configuration, 25%-pruned + 8-bit BioBERT, reduced mean CPU inference time from 32.52 ms to 12.02 ms (a 2.6-fold speed-up) while accuracy fell only from 0.982 to 0.980 and F1 score from 0.954 to 0.948.

Post-training structured pruning combined with 8-bit dynamic quantization therefore makes the oncology-NER pipeline about three times faster at inference on standard CPUs without compromising extraction quality or requiring special hardware or libraries.
Keywords: Natural language processing, Named entity recognition, Post-training quantization, Structured pruning, Model compression

Acknowledgements

We want to start by expressing our gratitude to AstraZeneca for the opportunity to work on this thesis and for welcoming us, even for a short time, among such kind and supportive people, as well as for providing all the resources and support needed to complete this project and to learn more about real applications of machine learning and AI in industry. We are especially grateful to our supervisors, Mehrdad Farahani, Michaël Ughetto, and Miguel Carmona, for their invaluable guidance, support, and dedication throughout the project. Additionally, we would like to thank our examiner, Richard Johansson, for his insightful guidance and contributions, which have helped us complete this project.

Atefeh Mirzabeigi & Amir Ali Barani, Gothenburg, 2025-06-12

List of Acronyms

Below is the list of acronyms that have been used throughout this thesis:

AI Artificial Intelligence
BERT Bidirectional Encoder Representations from Transformers
CPU Central Processing Unit
FFN Feed-Forward Network
FLOPs Floating-Point Operations
GELU Gaussian Error Linear Unit
GPU Graphics Processing Unit
I2E Interactive Information Extraction
KD Knowledge Distillation
LLM Large Language Model
LSTM Long Short-Term Memory
MEDLINE MEDLINE biomedical citations database
MLM Masked Language Modelling
NER Named-Entity Recognition
NLP Natural Language Processing
NSP Next-Sentence Prediction
ONNX Open Neural Network eXchange
PTQ Post-Training Quantization
QAT Quantization-Aware Training
RNN Recurrent Neural Network

Contents

List of Figures
List of Tables
1 Introduction
  1.1 Limitations
2 Theory
  2.1 Transformer
  2.2 Attention
  2.3 BERT
    2.3.1 DistilBERT
    2.3.2 BioBERT
  2.4 Pruning
    2.4.1 Pruning Units
      2.4.1.1 Unstructured Pruning
      2.4.1.2 Structured Pruning
    2.4.2 Pruning Metrics
      2.4.2.1 Magnitude-based Metrics
      2.4.2.2 Loss-based Metrics
      2.4.2.3 Second-order Methods
  2.5 Quantization
    2.5.1 Uniform and Non-uniform Quantization
      2.5.1.1 Post-Training Quantization (PTQ)
      2.5.1.2 Quantization-Aware Training (QAT)
  2.6 Optimization
    2.6.1 Operator Fusion
    2.6.2 Node Elimination
  2.7 Performance Metrics
    2.7.1 Accuracy
    2.7.2 Recall
    2.7.3 Precision
    2.7.4 F1 scores
3 Methods
  3.1 Data
  3.2 Pruning
    3.2.1 Fisher-based Mask Search
    3.2.2 Mask Rearrangement
    3.2.3 Mask Tuning
  3.3 Model Quantization and Optimization
    3.3.1 Optimization
    3.3.2 Post Training Quantization
    3.3.3 Quantization Aware Training: Integer Only BERT
  3.4 Pruning and Quantization
  3.5 Training
    3.5.1 Hardware
  3.6 Testing And Evaluation
4 Results
  4.1 Structural Pruning
  4.2 Quantization
  4.3 Pruning and Quantization
5 Discussion
  5.1 Conclusion
Bibliography
A Appendix 1

List of Figures

2.1 The Transformer architecture, including the encoder and decoder components. Reproduced from Dive into Deep Learning [23], under the Apache 2.0 license.
3.1 Overview of Post-Training Pruning Framework. (a) The mask variables are applied as 1. Then they undergo the three-stage process of (b) mask search, (c) rearrangement and (d) rescale. [34]
4.1 Global metrics comparison of BERT-base (green), 20% pruned (blue), 25% pruned (red), and 30% pruned (orange).
4.2 Global metrics comparison of BioBERT (green), 20% pruned (blue), 25% pruned (red), and 30% pruned (orange).
4.3 Radar chart comparing class-wise F1 scores for the standard BERT model (blue) and the version with 20% and 25% structured pruning (green).
4.4 Radar chart comparing class-wise F1 scores for the standard BioBERT model (blue) and the version with 20% and 25% structured pruning (green).
4.5 Inference latency and throughput per sample (ms) for standard BERT and quantized BERT on the baseline test dataset on CPU.
4.6 Radar charts comparing the F1 scores of quantized BERT (green) and standard BERT (blue) across all entities.
4.7 Inference latency and throughput per sample (ms) for standard BioBERT and quantized BioBERT on the baseline test dataset on CPU.
4.8 Radar charts comparing the F1 scores of quantized BioBERT (green) and standard BioBERT (blue) across all entities.
4.9 Inference latency and throughput per sample (ms) for standard DistilBERT and quantized DistilBERT on the baseline test dataset on CPU.
4.10 Radar charts comparing the F1 scores of quantized DistilBERT (green) and standard DistilBERT (blue) across all entities.
4.11 Inference latency and throughput per sample (ms) of standard BERT (green), 20% and 25% pruned BERT (blue), and 20% and 25% pruned and quantized BERT (red).
4.12 Inference latency and throughput per sample (ms) of standard BioBERT (green), 20% and 25% pruned BioBERT (blue), and 20% and 25% pruned and quantized BioBERT (red).
4.13 Radar charts comparing F1 score performance of the standard BERT (blue), 20% and 25% pruned BERT (green), and 20% and 25% pruned and quantized BERT (red) models across all defined entity endpoints.
4.14 Radar charts comparing F1 score performance of the standard BioBERT (blue), 20% and 25% pruned BioBERT (green), and 20% and 25% pruned and quantized BioBERT (red) models across all defined entity endpoints.
4.15 Global performance metrics of the standard BERT (blue), 20% and 25% pruned BERT (red), and 20% and 25% pruned and quantized BERT (green) models.
4.16 Global performance metrics of the standard BioBERT (blue), 20% and 25% pruned BioBERT (red), and 20% and 25% pruned and quantized BioBERT (green) models.
A.1 Heat map of neuron and attention-head sparsity for the 30% pruned BioBERT.
A.2 Radar charts of BERT (above) and BioBERT (below): the standard model (green), the 30% pruned model (blue), and the 30% pruned and quantized model (red).
A.3 Inference time of BERT (above) and BioBERT (below): the standard model (green), the 30% pruned model (blue), and the 30% pruned and quantized model (red).
A.4 Global performance metrics of standard BERT (above) and BioBERT (below): the standard model (blue), the 30% pruned model (red), and the 30% pruned and quantized model (green).

List of Tables

3.1 Total number of sentences in the training and test datasets, as well as the number of endpoint mentions (words in entities) across datasets from two different annotators. Different formats of endpoints, such as durations, percentages, and confidence intervals, are treated as distinct entities. [11]
4.1 Comparison of global performance metrics for I-BERT, standard and quantized BERT, BioBERT, and DistilBERT.

1 Introduction

The idea of creating an artificial intelligence that mirrors the capability of the human brain has likely existed for millennia. A compelling example of this dream can be found in the ancient Greek myth of Talos, as told by the poets Hesiod and Homer around 700 B.C. [1] The myth describes Talos as a giant bronze man built by Hephaestus, the Greek god of invention and blacksmithing. Tasked with guarding the island of Crete, Talos would circle the island three times a day, hurling boulders at approaching enemy ships. While Talos was a product of imagination and divine craftsmanship, modern advancements in artificial intelligence (AI) have transformed this ancient dream into reality.
Over the past century, AI has transformed from simple rule-based systems to sophisticated machine learning models that can learn and adapt from data. Among these achievements, machine learning has become a crucial part of AI's success. It allows systems to recognize patterns, make decisions, and generate human-like language. Within machine learning, natural language models such as GPT-3/4, BERT [2], RoBERTa [3], and others have shown impressive performance in solving complex tasks. They have repeatedly yielded better outcomes than traditional methods in areas such as text classification, named entity recognition (NER), and question answering.

Among the tasks mentioned, NER is considered essential in scientific research, because extracting relevant information from a vast and rapidly growing corpus of scientific literature is difficult. This is especially true in fields like medicine, where countless articles and discoveries are published daily. To access the most recent results efficiently, scientists can leverage models like BERT. Using BERT for NER not only makes searching through millions of articles faster and more convenient, but also enables researchers to identify key concepts and relationships that would be laborious and difficult to find manually. However, despite their impressive capabilities, training and developing these models remains time-, memory-, and energy-intensive [4]–[6]. As these models continue to achieve better performance and become more sophisticated, they also tend to grow larger and over-parameterized. This raises the question of how to reduce their size while maintaining the same performance level as the original model, or at least delivering results at an acceptable rate.

For the project in question, the company AstraZeneca has a BERT model trained on two datasets generated using an index of MEDLINE in I2E [7]. The current model has a high runtime and significant memory usage, making it a priority for AstraZeneca to reduce these costs. To extract oncology efficacy endpoints from scientific literature, the company utilizes multiple fine-tuned pipelines. These pipelines consist of various BERT-based models, including BioBERT [8], PubMedBERT [9], DistilBERT [10], and the foundational BERT architecture [11]. However, training large deep-learning models is both computationally expensive and time-consuming. Given that hundreds of thousands of new articles are published monthly, using these models to extract the latest information becomes increasingly slow and resource-intensive. To address this challenge, potential solutions are being explored to reduce the models' runtime and computational costs.

A significant amount of research has focused on minimizing the size of BERT models while maintaining their performance. For instance, BinaryBERT [12], Q8BERT [13], I-BERT [14], and BiBERT [15] use quantization techniques to minimize model size while maintaining reasonable performance metrics. Other approaches, including DistilBERT [10] and TinyBERT [16], reduce model size by transferring knowledge from larger models to smaller ones. Additionally, pruning techniques, as applied in oBERT [17], reduce the number of parameters to minimize model size while limiting the impact on accuracy.
Although these methods are efficient and deliver excellent results, some require specialized GPUs that may not be readily accessible, while others demand significant time and resources, such as re-training BERT models. For this project, the focus is on methods that can be applied post-training and are compatible with a wide range of models. Among these, post-training quantization and pruning stand out as particularly promising approaches due to their ability to reduce model size and computational demands without significantly compromising performance. Building on this foundation, the project aims to investigate the following research questions:

• To what extent is it feasible to optimize the inference time of these BERT-based models, and what methodologies can be implemented to achieve this optimization without compromising the critical performance benchmarks essential for accurate oncology efficacy endpoint extraction?
• To what extent can the memory footprint of these BERT-based models be minimized without compromising their operational efficiency and scalability in real-world oncology endpoint extraction tasks?
• What strategies can be employed to effectively reduce the architectural size of these deep learning models while preserving their competitive performance metrics, particularly in terms of accuracy and F1-score?

1.1 Limitations

The main constraint encountered during this research stemmed from the necessity of optimizing the models on CPU nodes. Since extensive research on quantization has focused on accelerating inference and reducing the memory footprint on GPU nodes, optimizing these BERT-based models while preserving performance metrics on CPUs was challenging. Additionally, Integer-Only BERT (I-BERT) was implemented to quantize the model with 8-bit integers. Although it worked on CPU and GPU, fully leveraging the promised inference acceleration of I-BERT requires specialized hardware. Consequently, the decision was made to move on to other methods that work effectively on CPUs, based on the company's requirements.

Furthermore, the Hugging Face website was internally filtered by AstraZeneca. Given that the training and inference of the existing models were based on Hugging Face [18], this made training the models on the internal Scientific Computing Platform (SCP) CPU and GPU nodes more difficult, and the training process was subsequently migrated to Azure Databricks [19].

2 Theory

In the following chapter, the theoretical foundations are introduced to support a clearer understanding of the models and techniques used throughout the study.

2.1 Transformer

Transformers have become the dominant models in almost all natural language processing tasks. Before the introduction of the article "Attention Is All You Need" [20], recurrent neural networks (RNNs) and long short-term memory networks (LSTMs) faced challenges with sequential processing, which made training slow and inefficient for long sequences. Additionally, their limited ability to capture long-term dependencies due to the vanishing gradient problem was a significant obstacle. The improvements made to encoder-decoder RNNs for sequence-to-sequence applications, such as machine translation [21], can be considered the foundation behind the Transformer model.
In RNN models, the entire input sequence is processed sequentially and compressed into a single fixed-length vector, typically the final hidden state of the encoder, which is then fed into the decoder to generate the output sequence. The key innovation of the attention mechanism was that, instead of relying on the same input vector at each decoding step, the decoder could focus on specific parts of the input sequence as needed. At each decoding step, the decoder receives a vector consisting of a weighted sum of the input representations, which allows it to dynamically emphasize the most relevant information [22]. In this weighted sum, each weight reflects the importance of a specific input token. Crucially, these weights must be distinct and well-defined so that the model can effectively learn and capture the nuances within its parameters.

In 2017, Vaswani et al. introduced the state-of-the-art Transformer architecture, featuring an attention mechanism that could not only attend to different parts of a sequence by assigning different weights to them but also read and process all words simultaneously. This parallel processing makes Transformers faster and more efficient, especially when handling large datasets and long sequences. An illustration of the Transformer model architecture can be seen in Figure 2.1. It indicates the use of multi-head self-attention and feed-forward layers in both the encoder and decoder structures, which enables efficient modeling of sequence relationships.

Figure 2.1: The Transformer architecture, including the encoder and decoder components. Reproduced from Dive into Deep Learning [23], under the Apache 2.0 license.

2.2 Attention

Attention is the fundamental mechanism in transformer models like BERT that allows the model to capture relationships between different parts of the input sequence, regardless of how far apart they are. The core idea behind attention is to assign a weight to each input, which allows the model to compute a weighted combination that indicates where the focus should be and how much attention each part requires. In other words, it expresses the relation between one token and another in an input, and which word or words are more important than others in a specific sequence. Formally, the attention mechanism can be described as:

\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad (2.1)

where Q (query), K (key), and V (value) are projections of the input embeddings, and d_k is the dimensionality of the key vectors. The Softmax function ensures that the attention weights are normalized and can be interpreted as probabilities.

This mechanism allows the model to capture the meaning of the entire sentence more effectively, as not all words contribute equally to the overall meaning. Certain words may carry more information depending on the context, and attention enables the model to recognize and leverage this information. This mechanism is one of the key reasons why transformer models achieve better performance on tasks such as translation, text classification, and question answering. In practice, transformers do not have just a single attention computation component; rather, they contain a multi-head attention mechanism. This mechanism allows the model to learn different types of relationships simultaneously between different parts of the input by projecting the input into lower-dimensional spaces, applying attention separately in each space, and then finally combining the results. The computation for multi-head attention can be formally expressed as follows:

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)W^{O} \qquad (2.2)

where

\mathrm{head}_i = \mathrm{Attention}(QW^{Q}_i, KW^{K}_i, VW^{V}_i) \qquad (2.3)
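To make the attention computation in Equations (2.1)–(2.3) concrete, the following minimal PyTorch sketch implements scaled dot-product attention and a naive multi-head wrapper. It is illustrative only: the tensor shapes, the absence of masking and dropout, and the randomly initialized projection matrices are simplifying assumptions, not the implementation used inside BERT.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (..., seq_len, d_k); Eq. (2.1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    weights = torch.softmax(scores, dim=-1)   # attention weights sum to 1 per query
    return weights @ v

def multi_head_attention(x, w_q, w_k, w_v, w_o, num_heads):
    # x: (batch, seq_len, d_model); w_*: (d_model, d_model); Eqs. (2.2)-(2.3)
    batch, seq_len, d_model = x.shape
    d_head = d_model // num_heads

    def split(t):  # (batch, seq_len, d_model) -> (batch, heads, seq_len, d_head)
        return t.view(batch, seq_len, num_heads, d_head).transpose(1, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    heads = scaled_dot_product_attention(q, k, v)            # per-head attention
    concat = heads.transpose(1, 2).reshape(batch, seq_len, d_model)
    return concat @ w_o                                      # output projection W^O

# Tiny usage example with random tensors
x = torch.randn(2, 5, 8)
w = [torch.randn(8, 8) for _ in range(4)]
print(multi_head_attention(x, *w, num_heads=2).shape)        # torch.Size([2, 5, 8])
```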
2.3 BERT

BERT (Bidirectional Encoder Representations from Transformers) is a transformer-based machine learning model for natural language processing (NLP). It was introduced in 2019 by researchers at Google and is one of the first models to deeply understand the full context of a text from both directions, left to right and right to left, which is why it is called "bidirectional". This means that BERT reads and understands the entire sentence at once. Although transformer-based models typically consist of both an encoder and a decoder, BERT is based only on the encoder part of the Transformer architecture. Each encoder layer contains multi-head self-attention, a feed-forward neural network (a fully connected layer applied after attention), layer normalization, residual connections, and dropout to help with training.

BERT was pre-trained on two major tasks before being fine-tuned: masked language modeling (MLM) and next sentence prediction (NSP). In the MLM task, 15% of the words in a sentence are randomly masked, and the model is trained to predict the missing (masked) words. MLM requires BERT to understand the complete content of a sentence, both before and after the masked word. In the NSP task, BERT is trained to predict whether two given sentences are consecutive in the original text. This helps BERT learn the relationships between sentences, not just the relationships between words inside a sentence.

2.3.1 DistilBERT

In deep learning, different compression techniques such as pruning, knowledge distillation, and quantization play important roles in reducing computational costs and accelerating models' inference time. Among these techniques, knowledge distillation (or model distillation) is the process of transferring knowledge from a large pre-trained model (teacher) to a smaller model (student) with minimal loss of performance and validity. The distilled models can be effectively deployed on resource-constrained hardware since they are faster and less expensive to evaluate [24].

Knowledge distillation (KD) is typically applied to large deep neural networks with massive numbers of parameters and layers. This makes it especially useful in Natural Language Processing (NLP) and the domain of Large Language Models (LLMs), which often involve models with millions of parameters. The primary objective of KD is to train a more compact student model to mimic the predictions made by a more complex teacher model.

In 2020, Sanh et al. [10] proposed a distilled version of BERT with almost the same architecture; the main difference between DistilBERT and standard BERT is in the number of Transformer encoder layers and the number of parameters. DistilBERT consists of 6 Transformer encoder layers and 66 million parameters, which is 40% smaller than standard BERT. Although BERT is pre-trained using two tasks, Masked Language Modeling (MLM) and Next Sentence Prediction (NSP), in DistilBERT the NSP objective was removed completely, which simplifies the training and reduces overall model complexity.
Furthermore, during the initial training of DistilBERT, cross-entropy loss, MLM loss, and cosine embedding loss were utilized so that DistilBERT (the student model) would emulate the behavior of BERT (the teacher model) while preserving as much of its performance as possible. The cross-entropy loss used in the distillation process compares the student's predicted probabilities to those of the teacher model. In addition, cosine embedding loss was incorporated alongside the cross-entropy loss to align the directions of the student and teacher hidden-state vectors. Finally, MLM loss was added to the other two losses so that the student model learns to predict masked tokens. Despite this reduction, DistilBERT achieves 97% of BERT's performance on many NLP tasks, which makes it an efficient alternative for real-world applications.

2.3.2 BioBERT

BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining) is a variant of BERT, published in 2019 by Lee et al. from Korea University and Clova AI Research, NAVER Corporation. BioBERT shares the same architecture as the BERT model, consisting of 12 transformer layers with 768 hidden units and 12 attention heads, but the main difference is the training corpus of BioBERT. While BERT was pre-trained on general-domain corpora (BookCorpus and Wikipedia), BioBERT was initialized with BERT's weights and then further pre-trained on approximately 18 billion words of biomedical text from PubMed abstracts and PMC full-text articles, using the same masked language modeling and next sentence prediction objectives. This additional domain-specific training enables BioBERT to have a deeper contextual understanding of biomedical terminology and language patterns that the general model may struggle with.

2.4 Pruning

Pruning is a fundamental method for compressing neural network models by eliminating non-essential parameters while preserving the performance of the model at a level equivalent to the original. Language models have reached sizes of billions of parameters, and pruning has become an important method alongside other compression techniques to accelerate inference in resource-constrained environments. There are two critical questions regarding pruning approaches: (1) what to prune, and (2) how to prune. In the following sections, we answer these questions.

2.4.1 Pruning Units

Pruning units refer to the elements that are selected for removal during the pruning process, such as weights, neurons, layers, and attention heads. Based on pruning units, pruning methods can be categorized as structured and unstructured pruning [25].

2.4.1.1 Unstructured Pruning

Unstructured pruning, or weight-wise pruning, is the finest-grained case. In general, individual weights are identified based on importance criteria and zeroed out; in practice, for small or medium models like BERT, unstructured pruning usually does not directly set the weights to 0, but rather sets their corresponding masks M to 0. The mathematical representation of unstructured pruning can be written as the following constrained optimization problem:

\min_{w, M} L(w \odot M; D) = \min_{w, M} \frac{1}{N}\sum_{i=1}^{N} \ell\left(w \odot M; (x_i, y_i)\right) \qquad (2.4)
\text{s.t.} \quad \|M\|_0 \le k

where the binary mask M is multiplied element-wise into the model by ⊙, w = \{w_1, w_2, \dots, w_M\} represents the neural network weights, D is a dataset composed of N input x_i and output y_i pairs, and k is the target number of non-zero (retained) mask entries. However, one shortcoming of the unstructured pruning method is that it results in irregular sparsity patterns, which limits the computational-efficiency benefits on standard hardware architectures that are not designed for sparse computation.
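As a minimal illustration of the w ⊙ M formulation in Equation (2.4), the sketch below constructs a binary mask that zeroes out a given fraction of the smallest-magnitude weights in a tensor. It uses a simple magnitude criterion on a hypothetical weight matrix and sparsity level; it is not the Fisher-based, structured criterion applied later in this thesis.

```python
import torch

def magnitude_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Return a binary mask M that zeroes the `sparsity` fraction of
    smallest-magnitude weights (unstructured, weight-wise pruning)."""
    k = int(weight.numel() * sparsity)              # number of weights to remove
    if k == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()       # 1 = keep, 0 = prune

# Toy usage: prune 30% of a random weight matrix and apply w * M
w = torch.randn(768, 768)
mask = magnitude_mask(w, sparsity=0.30)
w_pruned = w * mask                                 # element-wise w ⊙ M
print(f"actual sparsity: {1 - mask.mean().item():.2f}")
```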
2.4.1.2 Structured Pruning

Structured pruning removes entire structural units, such as attention heads, neurons in feed-forward networks (FFNs), or entire layers. While this method typically achieves lower sparsity ratios than unstructured approaches, it does not require special hardware or software support and can directly speed up networks and reduce their size.

2.4.2 Pruning Metrics

Pruning metrics can be categorized into three principal theoretical frameworks:

2.4.2.1 Magnitude-based Metrics

Magnitude-based metrics employ the magnitude (absolute values) of weights or activations to determine importance. The theoretical principle underlying this approach is that weights with smaller magnitudes contribute proportionally less to the model outputs.

2.4.2.2 Loss-based Metrics

Loss-based metrics constitute a more theoretically sophisticated approach that assesses a pruning unit's importance based on its impact on the loss function. The central principle is that removing parameters that minimally affect the loss function will preserve model performance. Loss-based metrics can be further subdivided into first-order methods, which estimate importance from the gradient of the loss, and second-order methods, described next.

2.4.2.3 Second-order Methods

These methods incorporate the Hessian matrix (second-order derivatives) to more accurately approximate the change in loss when parameters are removed. The expression for importance in second-order methods is:

I = \frac{1}{2}(w - w^{*})^{T} H_{L}(w^{*})(w - w^{*}) \qquad (2.5)

where H_{L}(w^{*}) is the Hessian matrix. This formulation captures parameter interactions and provides a more accurate estimate of importance, though at a higher computational cost.

The second-order methods derive their theoretical foundation from a Taylor expansion of the loss function. For a well-trained model with weights w^{*}, the change in loss when moving to a pruned state w can be approximated as:

L(w) - L(w^{*}) \approx (w - w^{*})^{T}\nabla L(w^{*}) + \frac{1}{2}(w - w^{*})^{T} H_{L}(w^{*})(w - w^{*}) \qquad (2.6)

Since \nabla L(w^{*}) \approx 0 for a well-trained model, the second-order term dominates the approximation. This insight suggests that the importance of a parameter can be effectively estimated using the second-order information contained in the Hessian matrix.

2.5 Quantization

Quantization is a technique that maps values from a large (often continuous) set to a smaller, finite set to reduce the memory and computational costs of large language models during inference. It transforms high-precision floating-point values (typically 32-bit) into lower-precision formats (such as 8-bit or 4-bit integers), significantly reducing model size and improving inference speed, especially on hardware optimized for low-bit operations. For example, quantizing weights from 32-bit float to 4-bit integer compresses the model to approximately 1/8 of its original size.

2.5.1 Uniform and Non-uniform Quantization

Under uniform symmetric quantization, a real number x is mapped to an integer value q ∈ [−2^{b−1}, 2^{b−1} − 1], where b specifies the quantization bit precision:

q = Q(x, b, S) = \mathrm{Int}\left(\frac{\mathrm{clip}(x, -\alpha, \alpha)}{S}\right) \qquad (2.7)

where Q is the quantization operator, Int is the integer map, clip is the truncation function, α is the clipping parameter for outlier control, and S is the scaling factor, defined as \alpha / (2^{b-1} - 1). The dequantization process is:

\tilde{x} = DQ(q, S) = S\,q \approx x \qquad (2.8)
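Equations (2.7) and (2.8) can be sanity-checked with a short floating-point simulation such as the one below. It is a sketch only; in particular, taking the clipping parameter α as the maximum absolute value of the tensor is one simple calibration choice among several.

```python
import torch

def quantize_uniform_symmetric(x: torch.Tensor, bits: int = 8):
    """Uniform symmetric quantization, Eq. (2.7): q = Int(clip(x, -a, a) / S)."""
    alpha = x.abs().max()                      # clipping parameter (assumed: max-abs calibration)
    qmax = 2 ** (bits - 1) - 1                 # e.g. 127 for int8
    scale = alpha / qmax                       # S = alpha / (2^(b-1) - 1)
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Dequantization, Eq. (2.8): x~ = S * q ~= x."""
    return q.float() * scale

# Round-trip a random weight tensor and inspect the quantization error
w = torch.randn(4, 4)
q, s = quantize_uniform_symmetric(w)
print((w - dequantize(q, s)).abs().max())      # small reconstruction error
```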
Non-uniform quantization does not use equally spaced intervals, which allows a better representation of neural network weight distributions. The general formula is:

Q(r) = Q_i, \quad \text{if } r \in [\Delta_i, \Delta_{i+1}) \qquad (2.9)

where Q_i represents the quantization levels and \Delta_i defines the intervals. While non-uniform methods may better capture parameter distributions, they often require lookup tables that create deployment overhead on hardware.

2.5.1.1 Post-Training Quantization (PTQ)

Quantization techniques are broadly categorized into Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT) approaches. PTQ determines quantization parameters without re-training, making it fast and computationally efficient, but potentially less accurate in low-precision settings. The model can be used directly for inference after quantization, and PTQ is generally easier to implement.

2.5.1.2 Quantization-Aware Training (QAT)

Conversely, QAT methods involve re-training: the model is trained with simulated quantization effects, which helps recover the error introduced by quantization. Three different quantization schemes for the self-attention layer in the Transformer architecture can be distinguished. The first is a common quantization approach in which the parameters are quantized and stored in 8-bit integer format; since the operations are performed in floating point, the outputs are dequantized before each operation. In the second approach, the model is partially quantized and uses integer arithmetic; before each Softmax operation, the inputs are dequantized so that the operation can be performed in floating point, and after the Softmax the outputs are quantized again to feed the matrix multiplication operation. The third approach corresponds to I-BERT quantization, which is discussed further in the Methods chapter; in this case, the model is fully quantized, and no dequantization or floating-point arithmetic is involved throughout the process.

Additionally, QAT addresses the perturbation caused by quantization either by simulating the quantization process during re-training or by using additional parameters to fine-tune the quantized model. Training with this perturbation present helps the model converge to a point with better post-quantization performance. While QAT typically achieves higher accuracy, it requires significant computational resources for re-training, which may not always be possible for models with billions of parameters.

2.6 Optimization

Apart from quantization, various optimization techniques can be employed to enhance the performance of transformer models such as BERT. These optimizations aim to reduce memory usage and computational complexity while preserving model accuracy. By implementing these techniques, inference time can be sped up significantly, making these powerful models more efficient for real-world applications. Additionally, a core part of transformer models like BERT is the attention mechanism, which is one of the most computationally expensive parts of the model. By optimizing attention, the model can be improved effectively. Sparse attention, attention-head pruning, and optimized matrix multiplication are attention optimization techniques that enable models like BERT to handle longer sequences and reduce their memory requirements.
2.6.1 Operator Fusion

Operator fusion is a key optimization technique that combines multiple operations into a single, more efficient operation without changing the output. This process reduces memory access and computational overhead, which leads to performance improvements [26]. Fusing operations minimizes data movement between memory and processing units, which is often a bottleneck in model execution. It also reduces the number of memory read/write operations and kernel launches, resulting in lower latency and better utilization of hardware resources. However, it is important to note that the implementation of operator fusion must carefully consider numerical stability and potential interactions with other optimizations [27], [28].

2.6.2 Node Elimination

Another optimization technique is node elimination. This process involves identifying and removing unnecessary or redundant nodes from the computational graph of the model that do not contribute to the final output. By simplifying the graph structure, both the memory footprint and the computational requirements of the model can be reduced [29]. In BERT and similar transformer models, common targets for node elimination include identity operations that do not modify the data, redundant reshape or transpose operations, and unused outputs from multi-output operations, the removal of which can reduce the model size [30].

2.7 Performance Metrics

To evaluate how well a machine learning model performs on a given task, performance metrics are used. In the case of transformer-based architectures (such as BERT, DistilBERT, or BioBERT), the choice of metrics depends on the specific task, such as classification, sequence labeling, question answering, or text generation. For sequence labeling tasks (e.g., Named Entity Recognition or Part-of-Speech tagging), where transformers assign a label to each token, common evaluation metrics include:

• Token-level accuracy, which measures the overall correctness of token predictions.
• Entity-level precision, recall, and F1-score, which evaluate the correctness of predicted entity spans.
• Span-level F1-score, which considers entire entity spans rather than individual token matches.

The standard evaluation metrics for classification tasks are commonly defined as follows: true positives (TP), which are correctly identified positive cases; false positives (FP), which are negative cases incorrectly labeled as positive; true negatives (TN), which represent correctly identified negative cases; and false negatives (FN), which occur when positive cases are mistakenly classified as negative.

2.7.1 Accuracy

Accuracy is the ratio of correct predictions to total predictions and measures the overall correctness of the model:

\mathrm{Accuracy} = \frac{\text{number of correct predictions}}{\text{number of total predictions}} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (2.10)

2.7.2 Recall

Recall refers to the ability of the model to find all instances of the positive class:

\mathrm{Recall} = \frac{TP}{TP + FN} \qquad (2.11)

2.7.3 Precision

Precision refers to the fraction of positive predictions that were accurately predicted by the model. It determines how often the predicted positive cases are actually positive:

\mathrm{Precision} = \frac{TP}{TP + FP} \qquad (2.12)

2.7.4 F1 scores

The F1 score is the harmonic mean of precision and recall. It provides an overall trade-off between precision and recall and is especially helpful when the class distribution is imbalanced:

F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} = \frac{2 \times TP}{2 \times TP + FP + FN} \qquad (2.13)
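To make Equations (2.10)–(2.13) concrete, the small helper below computes all four metrics from raw confusion-matrix counts. The counts in the usage line are hypothetical; the entity-level NER scores reported in this thesis are computed with the seqeval library described in the Methods chapter.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int):
    """Accuracy, precision, recall and F1 from confusion-matrix counts (Eqs. 2.10-2.13)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return accuracy, precision, recall, f1

# Hypothetical counts for one entity class
print(classification_metrics(tp=90, fp=10, tn=880, fn=20))
# -> (0.97, 0.9, 0.818..., 0.857...)
```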
3 Methods

Building upon the theoretical foundation established in the previous chapter, this chapter presents the practical approaches and models utilized in our research, conducted as part of the AstraZeneca NER (Named Entity Recognition) project. The focus here is on the implementation and optimization of transformer-based models for biomedical text classification.

In this project, three BERT-based architectures have been utilized: BERT-base, BioBERT, and DistilBERT. All of these models have been pre-trained on large text corpora, with BioBERT and PubMedBERT specifically adapted for biomedical NLP tasks. The objective was to optimize these existing models in two primary ways:

• Quantization, to accelerate inference time and reduce computational costs.
• Pruning, to decrease model size and memory footprint.

Both techniques aim to make the models more efficient for deployment, particularly in environments with limited resources, without compromising the predictive performance metrics discussed in the previous chapter.

3.1 Data

The dataset used in this project is derived from the work of Gendrin-Brokmann et al. [11], who developed it to train BERT-based models for extracting efficacy endpoints from biomedical literature, with a particular focus on oncology trials. The corpus comprises annotated sentences sampled from the MEDLINE database [31], with initial sentence extraction and preprocessing performed using the I2E [32] natural language query platform developed by Linguamatics. I2E is designed to extract relevant facts, relationships, and entities from large collections of unstructured and semi-structured text data, especially in domains such as life sciences and healthcare. It uses NLP capabilities, which often combine linguistic rules with statistical methods and machine learning, to read text and pull out specific pieces of information (such as efficacy endpoints, diseases, genes, and compounds).

For the purpose of our analysis, the dataset contains entities annotated as clinically relevant efficacy endpoints, such as survival metrics (e.g., overall survival, progression-free survival), response rates, and associated statistics such as confidence intervals, hazard ratios, and p-values. These entities appear in various textual formats, such as percentage-based or duration-based expressions, and were manually labeled to support fine-grained entity recognition. The annotation process was carried out with the Label Studio annotation tool [33] and involved two domain experts in clinical information science. Each sentence was independently labeled by both annotators, and any disagreements were subsequently resolved through review, ensuring high-quality and consistent labeling.

According to the summary statistics in Table 3.1, the training set comprises 5,392 annotated sentences, while the test set contains 983 sentences. The dataset is specifically tailored for token classification tasks and is well suited for evaluating Named Entity Recognition (NER) models in the biomedical domain. In this work, the dataset is reused to train our models and to systematically evaluate the impact of different fine-tuning and model compression techniques, such as quantization and pruning, on the task of efficacy endpoint extraction.
3.2 Pruning

To make the models smaller and faster during inference, the decision was made to perform pruning in addition to quantization. The goal was to remove the redundant parts of the model while maintaining baseline NER performance. We opted for structured pruning, which involves removing masked attention heads and FFN neurons. Although higher sparsity could have been achieved with unstructured pruning and minimal accuracy drop, we chose structured pruning because unstructured sparsity patterns do not necessarily lead to inference speedups on standard hardware [25]. Since structured pruning produces a smaller and denser model that can run on common hardware without the need for special libraries, three levels of structured sparsity (20%, 25%, and 30%) were applied to both standard BERT and BioBERT in this project to reduce their memory footprint and accelerate inference time.

Our structured pruning process was inspired by the method introduced in the article A Fast Post-Training Pruning Framework for Transformers [34]. In their post-training method, the entire dataset is not used for retraining; instead, a subset of the training dataset is sufficient. A sample of the dataset, along with a few algorithmic steps, is used to identify and remove the unimportant parts of the model. Their approach comprises three stages: Fisher Information mask search, mask rearrangement, and mask tuning. The first two stages of this approach were used as-is, and the third stage was modified to suit our practical considerations. The pruning procedure can be summarized as follows.

Table 3.1: Total number of sentences in the training and test datasets, as well as the number of endpoint mentions (words in entities) across datasets from two different annotators. Different formats of endpoints, such as durations, percentages, and confidence intervals, are treated as distinct entities. [11]

Endpoint Type      Training Dataset (words in entities)   Test Dataset (words in entities)
DFS                2907                                    56
DFS_CIH            1282                                    0
DFS_CIL            1176                                    0
DFS_percent        6847                                    1654
DFS_percent_CIH    1980                                    73
DFS_percent_CIL    1770                                    67
DoR                1931                                    23
DoR_CIH            677                                     6
DoR_CIL            737                                     6
ORR                3320                                    509
ORR_CIH            541                                     104
ORR_CIL            503                                     95
OS                 5051                                    1092
OS_CIH             1418                                    198
OS_CIL             1372                                    180
OS_percent         5805                                    5489
OS_percent_CIH     2063                                    379
OS_percent_CIL     1880                                    343
PFS                5139                                    536
PFS_CIH            1332                                    143
PFS_CIL            1256                                    121
PFS_percent        1758                                    353
PFS_percent_CIH    959                                     24
PFS_percent_CIL    981                                     16
time_point         5925                                    2591
Total sentences: Training = 5392, Test = 983

Figure 3.1: Overview of the post-training pruning framework. (a) The mask variables are initialized to 1. They then undergo the three-stage process of (b) mask search, (c) rearrangement, and (d) rescaling. [34]

3.2.1 Fisher-based Mask Search

First, the importance of each attention head and of each hidden neuron in the feed-forward layers was evaluated using a Fisher Information Matrix score. In this step, we measure how much the removal of each of those components' weights will impact the model's loss. The mask search was performed by computing the diagonal approximation of the Fisher Information Matrix of the model parameters, using a forward and backward pass over a small sample of the training dataset. Components with the lowest Fisher-based importance were marked as pruning candidates. A binary mask was then applied over all heads and neurons such that components selected for pruning received a mask value of 0, while the rest kept a mask value of 1.
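The sketch below illustrates how a diagonal (empirical) Fisher importance score can be accumulated from squared gradients over a small data sample. It is a simplified, per-parameter illustration; the actual mask search follows the framework of [34] and scores attention-head and FFN-neuron mask variables rather than raw parameters. The model, data loader, and loss function here are assumed placeholders.

```python
import torch

def diagonal_fisher_importance(model, data_loader, loss_fn, num_batches=8):
    """Approximate per-parameter importance as the mean squared gradient
    (diagonal empirical Fisher) over a small sample of the training data."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters() if p.requires_grad}
    model.eval()
    for step, (inputs, labels) in enumerate(data_loader):
        if step >= num_batches:
            break
        model.zero_grad()
        loss = loss_fn(model(inputs), labels)   # forward pass on a small sample
        loss.backward()                         # backward pass produces gradients
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
    return {n: f / num_batches for n, f in fisher.items()}

# Components (heads/neurons) whose parameters carry the lowest accumulated
# Fisher scores would then be marked as pruning candidates (mask value 0).
```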
3.2.2 Mask Rearrangement

After applying the binary mask, the pruning selection was refined through a mask rearrangement step. In the first stage, the Fisher Information Matrix was simplified through a diagonal approximation. That stage alone, however, does not result in an optimal masking pattern; hence, the second step accounts for intra-layer interactions and for the distribution of pruned units across the different layers. Removing too many components from a single layer might harm model performance, even if each component is of low importance in isolation. Thus, we adjusted the pruning mask to better balance the masking pattern in each layer.

Essentially, this stage improves upon the initial Fisher mask by considering inter-layer trade-offs to maintain overall performance. We iterated this two-stage process until the mask pattern satisfied the global FLOPs constraint while keeping the loss minimal. At this point, the structured mask was finalized, defining which attention heads and FFN neurons to remove, and those components were pruned from the model architecture.

3.2.3 Mask Tuning

In the previous two stages, the mask values were binary (1 or 0); in this stage, the non-zero values from the previous stage are tuned to real values to recover some of the accuracy loss. In the earlier stages, the model was pruned to satisfy the latency/FLOPs constraint by applying the binary mask. This binary mask selection and intra-layer rearrangement produced a pruned transformer, in which the binary mask causes some degradation in accuracy. The mask tuning stage is designed to recover this lost accuracy by adjusting the mask values of the remaining components without changing the model weights. By allowing the remaining masks to deviate from 1, the model can compensate for the removed components.

3.3 Model Quantization and Optimization

3.3.1 Optimization

To reduce the model size and complexity and to improve performance, the model was optimized with the ONNX Runtime [35] transformer optimization tool. This step involves a series of graph-level transformations specifically designed for BERT-like architectures. The goal is to streamline the model's computational graph, reduce inference time, and prepare the model for further compression through quantization by eliminating inefficiencies. The optimizations applied include operator fusion, removal of redundant operations, constant folding, and reordering and simplification of the computation graph. These techniques help consolidate operations, minimize runtime overhead, and make the model more efficient for quantization.

3.3.2 Post Training Quantization

In addition to pruning and optimization, another strategy applied to the existing models was ONNX Runtime dynamic quantization [36], which aimed to reduce model size and improve inference efficiency, particularly on CPU-based systems. The weights of the model, originally stored in 32-bit floating-point format, were quantized offline to 8-bit integers. Each weight tensor was assigned a scale and a zero-point to map the 32-bit floating-point range onto the 8-bit integer range, reducing memory usage and speeding up computations.

During inference, the activations (the outputs from previous layers) remain in 32-bit floating point but are dynamically quantized right before they are used in quantized operations. This dynamic quantization converts the activations into 8-bit integers, which is more efficient for computation, while still retaining enough precision to ensure accurate results. After the quantized operations are performed, the output is dequantized back into 32-bit floating point for further processing, maintaining the integrity of the model's computation.
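A minimal sketch of this optimize-then-quantize workflow with ONNX Runtime is shown below. The file names and the num_heads/hidden_size values are placeholders for this project's exported models, and the exact arguments may differ between onnxruntime versions.

```python
from onnxruntime.transformers import optimizer
from onnxruntime.quantization import quantize_dynamic, QuantType

# 1) Graph-level optimization for a BERT-like ONNX model (operator fusion,
#    redundant-node removal, constant folding, ...).
opt_model = optimizer.optimize_model(
    "bert_ner.onnx",          # hypothetical exported model path
    model_type="bert",
    num_heads=12,
    hidden_size=768,
)
opt_model.save_model_to_file("bert_ner_opt.onnx")

# 2) Dynamic post-training quantization: weights are stored as INT8 offline,
#    activations are quantized on the fly during inference.
quantize_dynamic(
    model_input="bert_ner_opt.onnx",
    model_output="bert_ner_opt_int8.onnx",
    weight_type=QuantType.QInt8,
)
```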
In practice, ONNX Runtime applies quantization selectively to certain linear operations, such as fully connected layers (GEMM), matrix multiplications, and element-wise additions. This targeted quantization improves inference performance by accelerating the execution of these key operations without the need for model retraining or additional calibration data. By quantizing only specific operations, ONNX Runtime strikes a balance between speed optimization and model accuracy. The quantization of weights can be represented as:

\mathrm{quantized_{val}} = \mathrm{round}\left(\frac{\mathrm{float_{val}}}{\mathrm{scale}}\right) + \mathrm{zero_{point}} \qquad (3.1)

where:
• quantized_val is the resulting quantized value.
• float_val is the original floating-point value.
• scale is the scaling factor that maps the floating-point range to the integer range.
• zero_point is the integer offset that aligns the real value zero with an integer value.

This equation converts a floating-point value (float_val) to an integer representation (quantized_val) by scaling and shifting, ensuring that the quantized values are as close as possible to the original floating-point values while using the reduced bit-width for storage and computation.

3.3.3 Quantization Aware Training: Integer Only BERT

Furthermore, I-BERT was chosen as one of the quantization methods for this project. I-BERT focuses on uniform symmetric quantization and employs static quantization, where the scaling factors remain fixed during inference. To enable fully integer-only computation, not just for the weights but for the entire model pipeline, Sehoon Kim et al. [14] approximated the non-linear activation functions such as GELU and softmax with polynomials. Since polynomials involve only addition and multiplication, these operations can be executed efficiently using integer arithmetic alone, eliminating the need for floating-point operations during inference.

Additionally, matrix multiplications and embeddings in I-BERT are carried out using 8-bit integer precision, while the non-linear functions like GELU, softmax, and Layer Normalization are computed using 32-bit integer precision. The reason for using 32-bit integers in these cases is that it avoids the need for dequantization and introduces minimal computational overhead. Although this approach was applied to RoBERTa-Base, the same technique can be extended to other transformer-based models, as long as they rely on similar non-linear operations.

The following equations approximate the GELU activation: first, a polynomial approximation L(x) to the Gaussian error function is defined, and i-GELU is then defined in terms of it. Here a = −0.2888, b = −1.769, and sgn(x) is the sign function.

L(x) = \mathrm{sgn}(x)\left[a\left(\mathrm{clip}(|x|, \max = -b) + b\right)^{2} + 1\right] \qquad (3.2)

\text{i-GELU}(x) := x \cdot \frac{1}{2}\left(1 + L\!\left(\frac{x}{\sqrt{2}}\right)\right) \qquad (3.3)
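The polynomial approximation in Equations (3.2)–(3.3) is easy to verify numerically. The sketch below evaluates i-GELU in ordinary floating point and compares it against the exact GELU; it is illustrative only and does not reproduce the integer-only arithmetic that I-BERT actually uses.

```python
import numpy as np
from math import erf, sqrt

A, B = -0.2888, -1.769   # polynomial coefficients from the I-BERT paper [14]

def l_erf(x):
    """Second-order polynomial approximation of erf(x), Eq. (3.2)."""
    clipped = np.minimum(np.abs(x), -B)                # clip(|x|, max = -b)
    return np.sign(x) * (A * (clipped + B) ** 2 + 1.0)

def i_gelu(x):
    """i-GELU(x) = x * 1/2 * (1 + L(x / sqrt(2))), Eq. (3.3)."""
    return x * 0.5 * (1.0 + l_erf(x / sqrt(2.0)))

def exact_gelu(x):
    return x * 0.5 * (1.0 + np.array([erf(v / sqrt(2.0)) for v in x]))

x = np.linspace(-4.0, 4.0, 81)
print(np.abs(i_gelu(x) - exact_gelu(x)).max())         # small approximation error
```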
For the softmax function, I-BERT uses a second-order polynomial to approximate the exponential function over the interval (−ln 2, 0]. The following approximation is derived by minimizing the distance between the exponential function and the polynomial over this domain:

L(p) = 0.3585\,(p + 1.353)^{2} + 0.344 \approx \exp(p) \qquad (3.4)

This leads to the integer-only exponential approximation:

\text{i-exp}(\tilde{x}) := \frac{L(p)}{2^{z}} \qquad (3.5)

where

z = \left\lfloor \frac{-\tilde{x}}{\ln 2} \right\rfloor, \qquad p = \tilde{x} + z \ln 2 \qquad (3.6)

This method avoids computing actual exponentials by relying on polynomial approximations and bit-shifting. Furthermore, for Layer Normalization, I-BERT adopts the standard normalization process but reformulates it to be compatible with integer arithmetic. The procedure is as follows:

\tilde{x} = \frac{x - \mu}{\sigma} \qquad (3.7)

where the mean \mu and standard deviation \sigma are computed as:

\mu = \frac{1}{C}\sum_{i=1}^{C} x_i \qquad (3.8)

\sigma = \sqrt{\frac{1}{C}\sum_{i=1}^{C}(x_i - \mu)^{2}} \qquad (3.9)

This expression is then implemented with fixed-point approximations to make the operation integer-friendly while maintaining numerical stability.

Although I-BERT was implemented within this project and its performance metrics are reported in the Results chapter, a decision was made not to proceed with I-BERT in the final approach. This was primarily because it requires specialized hardware to fully benefit from 8-bit precision. Moreover, in comparison to the other techniques utilized in this work, I-BERT showed notable accuracy and F1 drops as well as higher inference times.

3.4 Pruning and Quantization

To leverage the benefits of both quantization and pruning to compress the model and further accelerate inference, dynamic quantization was applied to the pruned model. The model was first pruned using the Fisher Information Matrix pruning method on the provided GPUs, and subsequently quantized with ONNX Runtime's dynamic quantization library on CPU nodes. For the pruned models that were quantized, the same pruning amounts (20%, 25%, and 30%) were applied to the neurons and attention heads in the model.

3.5 Training

To train the original models (BERT-base, DistilBERT, and BioBERT) on the training dataset, the data was tokenized using the Hugging Face AutoTokenizer. In the tokenization process, sentences are divided into individual tokens, which are words or sub-word units, and each token is assigned one label. The models were fine-tuned with code adapted from the Hugging Face Transformers library. During training, the models learned to classify each token according to a predefined set of NER tags extracted from the training data. All artifacts, including model checkpoints, configuration files, and training logs, were systematically organized in timestamped output directories for reproducibility, post-training analysis, and quantization.

3.5.1 Hardware

The models were trained and evaluated on the Azure Databricks platform, where access was provided to 36 CPU cores, 440 GB of memory, and one NVIDIA A10 GPU. The GPU resources were particularly essential during the model pruning process, where compute-intensive operations like the Fisher Information Matrix pruning approach required parallel processing capabilities. This hardware configuration, provided by AstraZeneca, offered sufficient computational power for both the initial fine-tuning phases and the subsequent model optimization and quantization experiments.

3.6 Testing And Evaluation

To evaluate the quantization, pruning, and pruning-quantization techniques developed in this work and to compare their performance with the original BERT-variant models, they need to be tested thoroughly to ensure their robustness and functionality. Since only the weights are quantized in the dynamically quantized model, and it comes in a single configuration, it is not possible to test that model with different parameters. However, for both pruning alone and combined pruning and quantization, the models and implementations can be tested by varying the amount of pruning. This allows us to determine precisely how much the model can be pruned without sacrificing performance metrics or compromising its functionality.

In the inference phase of the project, structural pruning was evaluated at three fixed sparsity levels (20%, 25%, and 30%) using the Fisher Information Matrix approach and standard PyTorch inference [37] with Hugging Face's tokenization pipeline. For both the quantization and the pruning-quantization experiments, the models were exported to ONNX format using torch.onnx.export. In the pruning-quantization experiments, as in the pruning experiments, the models were evaluated at the same three fixed sparsity levels (20%, 25%, and 30%). In all experiments implemented in this project, including pruning, quantization, and pruning-quantization, performance metrics such as F1-score, accuracy, precision, and recall were calculated using the seqeval library [38]. Additionally, benchmarks were conducted to measure inference latency and throughput with a batch size of 16.
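A condensed sketch of this evaluation loop is given below. It is illustrative only: the checkpoint name, ONNX file, label map, and batch size of 16 are placeholders, the ONNX input names are assumed to match the tokenizer outputs, and the sub-token-to-entity label alignment needed for seqeval is omitted for brevity.

```python
import time
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer
from seqeval.metrics import f1_score, classification_report

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")       # placeholder checkpoint
session = ort.InferenceSession("bert_ner_opt_int8.onnx",           # quantized model from earlier
                               providers=["CPUExecutionProvider"])
id2label = {0: "O", 1: "B-OS", 2: "I-OS"}                           # placeholder label map

def predict(sentences, batch_size=16):
    all_preds, total_time = [], 0.0
    for i in range(0, len(sentences), batch_size):
        enc = tokenizer(sentences[i:i + batch_size], padding=True,
                        truncation=True, return_tensors="np")
        feed = {inp.name: enc[inp.name].astype("int64") for inp in session.get_inputs()}
        start = time.perf_counter()
        logits = session.run(None, feed)[0]
        total_time += time.perf_counter() - start
        all_preds.extend(np.argmax(logits, axis=-1).tolist())
    latency_ms = 1000.0 * total_time / len(sentences)               # ms per sample
    return all_preds, latency_ms

# true_tags / pred_tags are per-sentence label sequences, e.g. [["O", "B-OS", ...], ...]
# print(f1_score(true_tags, pred_tags))            # entity-level F1 via seqeval
# print(classification_report(true_tags, pred_tags))
```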
3.5.1 Hardware

The models were trained and evaluated on the Azure Databricks platform, where access was provided to 36 CPU cores, 440 GB of memory, and one NVIDIA A10 GPU. The GPU resources were particularly essential during the model pruning process, where compute-intensive operations such as the Fisher Information Matrix pruning approach required parallel processing capabilities. This hardware configuration, provided by AstraZeneca, offered sufficient computational power for both the initial fine-tuning phases and the subsequent model optimization and quantization experiments.

3.6 Testing And Evaluation

To evaluate the quantization, pruning, and pruning-quantization techniques developed in this work, and to compare their performance with the original BERT-variant models, the compressed models need to be tested thoroughly to ensure their robustness and functionality. Since only the weights are quantized in the dynamically quantized model and it comes in a single configuration, it cannot be tested with different parameters. However, for both pruning alone and combined pruning and quantization, the models and implementations can be tested by varying the amount of pruning. This makes it possible to determine precisely how much a model can be pruned without sacrificing performance metrics or compromising its functionality.

In the inference phase of the project, structural pruning was evaluated at three fixed sparsity levels (20%, 25%, and 30%) using the Fisher Information Matrix approach and standard PyTorch inference [37] with Hugging Face's tokenization pipeline. For the quantization and pruning-quantization experiments, the models were exported to ONNX format using torch.onnx.export. In the pruning-quantization experiments, as in the pruning experiments, the models were evaluated at the same three fixed sparsity levels (20%, 25%, and 30%). In all experiments implemented in this project, including pruning, quantization, and pruning-quantization, performance metrics such as F1 score, accuracy, precision, and recall were calculated using the seqeval library [38]. Additionally, benchmarks were conducted to measure inference latency and throughput with a batch size of 16.
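The following is a condensed sketch of the export, dynamic quantization, and CPU benchmarking steps described above. The file names, the fine-tuned model object, the eval_batches iterable, and the id_sequences_to_tags helper are assumptions for illustration, not the exact code used in this project.

```python
import time
import torch
import onnxruntime as ort
from onnxruntime.quantization import quantize_dynamic, QuantType
from seqeval.metrics import f1_score

# 1) Export the fine-tuned PyTorch model (assumed to exist as `model`) to ONNX.
dummy_ids = torch.ones(1, 128, dtype=torch.long)
dummy_mask = torch.ones(1, 128, dtype=torch.long)
torch.onnx.export(model, (dummy_ids, dummy_mask), "ner_model.onnx",
                  input_names=["input_ids", "attention_mask"], output_names=["logits"],
                  dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                                "attention_mask": {0: "batch", 1: "seq"},
                                "logits": {0: "batch", 1: "seq"}})

# 2) Dynamic post-training quantization: weights stored as INT8, activations quantized on the fly.
quantize_dynamic("ner_model.onnx", "ner_model_int8.onnx", weight_type=QuantType.QInt8)

# 3) CPU inference, per-sample latency, and entity-level F1 with seqeval.
session = ort.InferenceSession("ner_model_int8.onnx", providers=["CPUExecutionProvider"])
all_true, all_pred, n_samples = [], [], 0
start = time.perf_counter()
for input_ids, attention_mask, gold_tags in eval_batches:  # numpy int64 arrays + gold tag sequences
    logits = session.run(["logits"], {"input_ids": input_ids,
                                      "attention_mask": attention_mask})[0]
    all_true.extend(gold_tags)                                # e.g. [["B-PFS", "O", ...], ...]
    all_pred.extend(id_sequences_to_tags(logits.argmax(-1)))  # assumed helper: label ids -> tag strings
    n_samples += input_ids.shape[0]
elapsed = time.perf_counter() - start
print(f"{1000 * elapsed / n_samples:.2f} ms/sample, "
      f"{n_samples / elapsed:.1f} samples/s, F1 = {f1_score(all_true, all_pred):.3f}")
```

Only the timing loop and metric computation are shown; batching and tag decoding follow the description above.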
4 Results

Building upon the methodologies detailed in the previous chapter, this chapter presents the results derived from our quantization, pruning, and combined pruning-and-quantization experiments.

4.1 Structural Pruning

This section quantifies the effect of pruning on the BERT and BioBERT models within the NER pipeline. A three-stage structural-pruning procedure was applied, comprising Fisher-based mask search, mask rearrangement, and mask tuning. Figures 4.1 and 4.2 report performance metrics such as accuracy, F1 score, precision, and recall for the standard models and for sparse models in which 20%, 25%, and 30% of attention heads and FFN neurons had been removed. The global performance of standard BERT remained flat up to 25% structured sparsity; accuracy slipped by 0.2% (0.973 to 0.971) and the F1 score by only 1% (0.928 to 0.918). However, at 30% sparsity, recall collapsed by 16.5% (0.906 to 0.779) and the F1 score fell to 0.854. Paradoxically, precision increased with greater sparsity and peaked at 0.944 at 30% sparsity. In contrast, BioBERT, which was pretrained on the medical domain, proved more resilient. Even at 30% sparsity, it retains 0.975 accuracy and a 0.936 F1 score, only 0.7% and 1.8% below baseline, and its recall remains 12.5% higher than BERT's at the same sparsity level. As in standard BERT, precision rose with higher sparsity.

Figure 4.1: Global metrics comparison of BERT-base (green), 20% pruned (blue), 25% pruned (red), and 30% pruned (orange).

Figure 4.2: Global metrics comparison of BioBERT (green), 20% pruned (blue), 25% pruned (red), and 30% pruned (orange).

To visualize the effect of structural pruning on each oncology efficacy endpoint, Figure 4.4 shows radar charts of the F1 scores for the standard BioBERT (blue) against the versions pruned by 20% and 25% (green). The plots indicate that most medical efficacy endpoints suffer little or no degradation after pruning. The two polygons coincide on every high-count label, such as DFS, PFS, and OR, showing that 20% and 25% structural pruning is virtually loss-free. For the rare DoR classes (DoR, DoR_CIH, DoR_CIL), which have very few examples in both the training and test sets, structured pruning actually improves performance.

Figure 4.3: Radar chart comparing class-wise F1 scores for the standard BERT model (blue) and the versions with 20% and 25% structured pruning (green).

Figure 4.4: Radar chart comparing class-wise F1 scores for the standard BioBERT model (blue) and the versions with 20% and 25% structured pruning (green).

4.2 Quantization

In this section, the results of our evaluations are presented by comparing the performance of BERT, DistilBERT, and BioBERT before and after applying dynamic quantization. Our primary objective is to optimize inference time and reduce model size while maintaining competitive accuracy. For each model, inference latency was calculated as the total time taken to process all samples in the test dataset, divided by the total number of samples and expressed in milliseconds per sample. Throughput was simultaneously measured as the number of samples processed per second, obtained by dividing the total number of samples by the total inference time. To structure the presentation of our findings, we first analyze the impact of dynamic weight quantization on inference time and throughput. Then we examine the models' F1 scores for each entity to assess how well the models generalize to the named-entity recognition (NER) task post-quantization. Finally, we examine the models' global performance in terms of accuracy, F1 score, recall, and precision across all models.

The standard BERT model, as depicted in Figure 4.5, achieved an inference latency of 30.28 ms per sample, while the quantized version reduced it to 11.63 ms, which yields 2.6-fold faster inference. Additionally, throughput for BERT after quantization increased from 33.0 to 86.0 samples per second, meaning the quantized model can handle about 2.6 times more samples within the same time frame. Regarding F1 scores, Figure 4.6 presents the per-entity F1 scores of standard BERT and quantized BERT. For entities with a sufficient number of samples, the F1 drop due to quantization was small or negligible in most cases. In the PFS entities, except for PFS-present, the quantized model showed a lower F1 score than the standard BERT model. Furthermore, the quantized model achieved better F1 scores for DoR, DoR-CIH, and DoR-CIL, despite the limited number of samples for these entities.

Figure 4.5: Inference latency (ms per sample) and throughput (samples per second) for standard BERT and quantized BERT on the baseline test dataset on CPU.

Figure 4.6: Radar charts comparing the F1 scores of quantized BERT (green) and standard BERT (blue) across all entities.

For the BioBERT model before and after applying dynamic quantization, as shown in Figure 4.7, the standard model achieved an inference time of 30.94 ms per sample, which dropped to 14.14 ms per sample after quantization. Throughput increased from 32.2 to 70.7 samples per second. These observations correspond to a 2.4-fold faster inference time, indicating that the model can handle 2.4 times more samples in the same amount of time. The F1 scores for all entities of BioBERT before and after dynamic quantization are shown in Figure 4.8. BioBERT experienced a clear drop, particularly in the PFS entities except for PFS-present.
While the overall F1 drop was negligible, the quantized model exhibited the largest F1 drops for the entities with fewer than 30 samples, such as DoR-CIH, DoR-CIL, and PFS-percent-CIL, although the F1 score for DoR was higher in the quantized model.

Figure 4.7: Inference latency (ms per sample) and throughput (samples per second) for standard BioBERT and quantized BioBERT on the baseline test dataset on CPU.

Figure 4.8: Radar charts comparing the F1 scores of quantized BioBERT (green) and standard BioBERT (blue) across all entities.

Similarly, for the DistilBERT model shown in Figure 4.9, the unquantized baseline exhibited an inference latency of 16.61 ms per sample. In the quantized model, inference latency was reduced to 7.14 ms, which represents a 2.3-fold reduction in latency. Moreover, throughput for DistilBERT improved from 60.2 to 140.1 samples per second, a 2.3-fold increase in processing capacity. Lastly, the F1 scores of the DistilBERT model before and after dynamic quantization are shown in Figure 4.10. Similar to the other two models, the quantized DistilBERT also exhibits a better F1 score for DoR, despite this entity having fewer samples than the others. While the overall performance of the quantized DistilBERT remains acceptable, it experienced F1 score drops relative to the standard DistilBERT in entities such as PFS, PFS-CIL, PFS-CIH, OS-percent-CIL, and OS-percent-CIH.

Figure 4.9: Inference latency (ms per sample) and throughput (samples per second) for standard DistilBERT and quantized DistilBERT on the baseline test dataset on CPU.

Figure 4.10: Radar charts comparing the F1 scores of quantized DistilBERT (green) and standard DistilBERT (blue) across all entities.

The overall performance metrics across all models before and after applying quantization are presented in Table 4.1. Regarding overall accuracy, negligible drops were observed after quantization: BioBERT experienced a decrease of 0.007 (from 0.985 to 0.978), BERT a drop of 0.003 (from 0.969 to 0.966), and DistilBERT a reduction of 0.007 (from 0.968 to 0.961). In terms of overall F1 scores, a minor decrease was noted for BERT (0.006, from 0.912 to 0.906), while BioBERT and DistilBERT experienced slightly larger reductions of 0.018 (from 0.959 to 0.941) and 0.020 (from 0.911 to 0.891), respectively. Moreover, precision generally saw slight improvements for BioBERT (increasing by 0.011, from 0.952 to 0.963) and BERT (increasing by 0.013, from 0.891 to 0.904), whereas DistilBERT experienced a minor decrease of 0.003 (from 0.903 to 0.900). Recall decreased more than the other metrics across all models: BioBERT exhibited the largest reduction of 0.047 (from 0.967 to 0.920), followed by DistilBERT with a drop of 0.038 (from 0.920 to 0.882) and BERT with a decrease of 0.027 (from 0.934 to 0.907). The I-BERT model, included in Table 4.1 for comparison, consistently demonstrated lower overall performance metrics (accuracy 0.952, F1 0.881, precision 0.821, recall 0.950) compared to the standard (unquantized) BERT, BioBERT, and DistilBERT models. As a result, the decision was made not to use I-BERT due to its poor performance compared to the other quantized models.
Model                  Accuracy   F1 Score   Precision   Recall
Standard BioBERT       0.985      0.959      0.952       0.967
Quantized BioBERT      0.978      0.941      0.963       0.920
Standard BERT          0.969      0.912      0.891       0.934
Quantized BERT         0.966      0.906      0.904       0.907
Standard DistilBERT    0.968      0.911      0.903       0.920
Quantized DistilBERT   0.961      0.891      0.900       0.882
I-BERT                 0.952      0.881      0.821       0.950

Table 4.1: Comparison of global performance metrics for I-BERT, standard and quantized BERT, BioBERT, and DistilBERT.

4.3 Pruning and Quantization

With the results of the individual quantization and pruning experiments at hand, the combined impact of applying both techniques to the models is investigated, with the goal of obtaining smaller models and faster inference. To examine the effect of quantization on the pruned models, dynamic quantization was applied to the BERT and BioBERT models at 20% and 25% fixed sparsity. Figure 4.11 shows the impact of the two compression stages on the BERT model's inference latency and throughput. Removing 20% of the parameters already lowers the inference latency from 30.1 ms to 24.4 ms per sample (−19%) and raises throughput from 33.2 to 41.1 samples per second (+24%). With a quarter of the model removed, latency drops to 23.7 ms and throughput rises to 42.2 samples per second. Applying dynamic quantization to either pruned model roughly halves the runtime again, to 12.41 ms per sample and 80.6 samples per second (20% mask) and 11.7 ms per sample and 85.8 samples per second (25% mask). The 2.5 to 2.6-fold acceleration indicates that combining pruning and quantization gives a cumulative speed advantage over the dense model on CPU.

Figures 4.13 (for standard BERT) and 4.14 (for BioBERT) depict radar charts of the F1 scores for all efficacy endpoints: the dense model (blue), the 20% and 25% pruned models (green), and their quantized versions (red). For standard BERT (Figure 4.13), the compressed polygons overlap the baseline on every high-count label (e.g., DFS, OS, PFS, ORR; n > 1000); the only noticeable decline appears on the rare DoR-CIH and DoR-CIL classes, where F1 falls from 0.83 to 0.58 at 25% sparsity plus quantization. As with standard BERT, BioBERT (Figure 4.14) shows no degradation on high-count endpoints up to 25% sparsity plus quantization, and it even records noticeable gains on low-count labels such as DoR-CIH and DoR-CIL (+0.30 F1), confirming that the compression pipeline maintains or occasionally improves performance across the label set.

Figure 4.11: Inference latency and throughput per sample of standard BERT (green), 20% and 25% pruned BERT (blue), and 20% and 25% pruned and quantized BERT (red).

Figure 4.12: Inference latency and throughput per sample of standard BioBERT (green), 20% and 25% pruned BioBERT (blue), and 20% and 25% pruned and quantized BioBERT (red).

Figure 4.13: Radar charts comparing the F1 scores of the standard BERT (blue), 20% and 25% pruned BERT (green), and 20% and 25% pruned and quantized BERT (red) models across all defined entity endpoints.

Figure 4.14: Radar charts comparing the F1 scores of the standard BioBERT (blue), 20% and 25% pruned BioBERT (green), and 20% and 25% pruned and quantized BioBERT (red) models across all defined entity endpoints.

Figures 4.15 (for standard BERT) and 4.16 (for BioBERT) present the performance metrics accuracy, F1 score, precision, and recall for the dense models (blue), the models pruned by 20% and 25% (red), and the corresponding pruned models quantized to 8-bit integers (green).
For standard BERT, after applying both compression methods at 20% sparsity, accuracy drops from 0.973 to 0.970 and the F1 score from 0.928 to 0.916, while precision increases from 0.912 to 0.930; the loss is driven entirely by a roughly 3% dip in recall. At the same sparsity level for BioBERT, pruning and quantization leave accuracy essentially unchanged (0.982 to 0.982) and reduce the F1 score from 0.954 to 0.948. As with standard BERT, precision increases by 1.5% and recall decreases by 2.7%. Pushing to 25% sparsity, BioBERT performance remains acceptable (F1 0.932, accuracy 0.974), whereas the corresponding BERT model shows a performance drop.

Figure 4.15: Global performance metrics of the standard BERT (blue), 20% and 25% pruned BERT (red), and 20% and 25% pruned and quantized BERT (green) models.

Figure 4.16: Global performance metrics of the standard BioBERT (blue), 20% and 25% pruned BioBERT (red), and 20% and 25% pruned and quantized BioBERT (green) models.

5 Discussion

This study investigated dynamic quantization of three BERT variants, BERT, BioBERT, and DistilBERT, and structural pruning of BERT and BioBERT across three fixed sparsity levels. Subsequently, the combined effect of these quantization and pruning strategies was evaluated. The objective of this project was to reduce the memory footprint and inference time of these models without compromising their performance. Our analysis begins with insights derived from the structural pruning experiments on BERT and BioBERT and examines their impact on performance metrics, inference time, and the observed patterns in pruned neurons and heads.

As demonstrated in the previous chapter in Figures 4.1 and 4.2, the performance of both models after 20% pruning generally remained robust. For pruned BERT, the results showed a minor impact on performance, with accuracy dropping by 0.001, the F1 score by 0.005, and recall by 0.035. Notably, for BioBERT at 20% sparsity, the F1 score, accuracy, and precision improved compared to the standard BioBERT before structural pruning, while its recall experienced a slight drop of only 0.006.

The enhanced performance of the pruned BioBERT model can be interpreted as follows. Given that BioBERT was pretrained on specialized data in the medical domain, the selective removal of less informative parameters (specifically, a neuron sparsity of 0.39 and a head sparsity of 0.02 in the 20% pruned BioBERT) likely enabled the remaining parameters, which presumably carry more important information, to contribute more effectively to generalization. These remaining parameters were thus able to perform better after pruning. Additionally, this improvement was critically influenced by the strategic nature of the pruning method: instead of removing the model's parameters at random, the Fisher Information Matrix was a key factor in achieving these results, as its objective is to identify and eliminate the parameters that have the least impact on the model's output or loss function (a simplified sketch of this criterion is given below).

For the 25% sparsity level, the BERT model showed acceptable performance. While recall experienced a 3.8% drop, other performance metrics, such as accuracy and F1 score, remained competitive with the standard BERT and the 20% pruned BERT. Since the accuracy drop at 25% sparsity is still below 1% (at 0.002) and the F1 score experienced only a 1% reduction compared to the standard BERT model, it can be effectively used for deployment.
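To make that criterion concrete, the following is a minimal, generic sketch of how per-head importance scores can be estimated from the squared gradients of the loss with respect to a head mask (a diagonal empirical Fisher approximation). It assumes a Hugging Face token-classification model and a calibration loader that yields labelled batches; the actual three-stage procedure used in this work (mask search, mask rearrangement, and mask tuning) is more involved.

```python
import torch

def head_importance(model, calib_loader, num_layers, num_heads, device="cuda"):
    # Diagonal empirical Fisher: accumulate (dL/dm)^2 for a per-head mask m initialised to 1.
    model.eval()
    head_mask = torch.ones(num_layers, num_heads, device=device, requires_grad=True)
    importance = torch.zeros(num_layers, num_heads, device=device)
    for batch in calib_loader:  # each batch must include "labels" so the model returns a loss
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch, head_mask=head_mask).loss
        grads = torch.autograd.grad(loss, head_mask)[0]
        importance += grads.detach() ** 2
    return importance  # heads with the lowest scores are candidates for removal

# For example, masking out the bottom 20-30% of heads corresponds to the
# sparsity levels explored in this work.
```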
Regarding the 25% pruned BioBERT, the model showed the same accuracy and F1 score as the standard BioBERT. This result also shows that even after reaching a neuron sparsity of 0.41 and a head sparsity of 0.036 in BioBERT, the performance, except for recall with a 2% drop, remained largely intact. One interpretation is that for BioBERT, even at 25% sparsity, because the model is pretrained on domain-specific medical data, the informative neurons contribute more to the model once the redundant neurons and heads have been pruned.

Focusing on the pruned models' inference time, the 20% pruned BERT model showed a 19% acceleration and the 25% pruned model a 21% improvement. Since the inference times of the two pruned models are so close at these sparsity levels, one might opt for the latter, prioritizing performance over marginal inference gains. Moreover, for BioBERT, inference time after 20% pruning accelerated by 27.2%, and at 25% sparsity it was 30.8% faster. Given the nearly identical performance of the 25% sparsity level and the standard BioBERT model (as discussed previously), one might consider this level of sparsity for BioBERT to leverage both the lower inference time and the higher pruning level. Overall, the pruning experiments captured the benefits of sparsity while largely maintaining the original performance metrics.

Shifting our focus from structural pruning, we now turn to the findings from our dynamic quantization experiments. The quantized models implemented in this project exhibited notable reductions in inference latency. Dynamic quantization was applied to all three BERT variants, BERT, BioBERT, and DistilBERT, with a negligible sacrifice in performance metrics. The findings in the previous chapter show that all three models exhibited a 2.3 to 2.6-fold acceleration in inference time. This acceleration can be attributed to a combination of ONNX Runtime optimizations and dynamic quantization. The optimization step applies several graph-level transformations, such as constant folding, operator fusion, node elimination, and attention fusions. These techniques reduce computational overhead by computing constant values before inference, combining multiple smaller operations into one more efficient larger operation, and simplifying the computational graph, which results in an inference speedup.

Furthermore, dynamic quantization, particularly the conversion of weights from 32-bit floating point to 8-bit integers, is arguably the most impactful component of the inference speedup, especially on edge devices such as CPUs. Integer operations are more efficient than floating-point arithmetic on these platforms, which enables faster execution with only a minor effect on model accuracy. Moreover, dynamic quantization accelerates inference by using pre-quantized 8-bit integer weights and quantizing the activations on the fly. Additionally, 8-bit integers occupy one-fourth of the memory of 32-bit floating-point numbers, which means less data needs to be moved from memory to the CPU's processing units. This reduces memory bandwidth bottlenecks, which are often a major limitation in large-model inference.

In terms of accuracy, F1, precision, and recall, the accuracy and F1 drops of the quantized models were below 1% and 2%, respectively, across all three models, and a small
increase in precision was found. Since precision rose while recall fell, this suggests that quantization made the models more conservative for this task: they produced fewer false positives but missed more true entities, that is, more false negatives. Although BioBERT exhibited the best performance among all models both before and after quantization, its recall experienced a 4% drop, which, as mentioned earlier, reflects these missed entities, cases where the model fails to predict an entity that is actually present.

Finally, we combined pruning and dynamic quantization to leverage both the inference speedup from quantization and the efficiency of the pruned models, applying the combination at the 20%, 25%, and 30% sparsity levels. Regarding performance metrics, the 20% pruned and quantized BioBERT model notably demonstrated better performance than its solely quantized counterpart. For BERT, while the F1 score exhibited a 0.006 drop compared to the quantized model, the remaining performance metrics increased slightly after the combination of dynamic quantization and pruning. While quantization leads to an inference speedup and memory reduction, it can simultaneously introduce noise across the entire model. This noise can lead to slight shifts in weight values, distorted activations, and degradation of fine-grained, token-level embeddings. Since pruning was applied before dynamic quantization in this project, the less informative weights that contribute least to the model's outputs had already been removed by the time quantization was applied. Owing to the nature of structural pruning and the removal of less informative neurons and heads, the noise introduced during quantization decreased; consequently, the performance metrics increased slightly for the combination of pruning and quantization compared with the solely quantized model.

Besides the performance metrics, the inference time of BioBERT after combining these two techniques at the 20% sparsity level improved slightly. It is worth highlighting that, for BERT and BioBERT, based on the F1 scores across all entity endpoints, the F1 scores for the frequent entities remained strong at both the 20% and 25% sparsity levels. Also, for low-frequency entities such as DoR-CIL and DoR-CIH, the quantized BioBERT at 25% sparsity achieved better F1 scores, and for entities such as PFS-percent-CIL and PFS-percent-CIH, the quantized BERT model maintained its F1 scores at the 25% pruning level and improved them at 20% pruning. This can be attributed to the regularizing role of pruning, which can improve generalization and lead to better performance on the test set, especially for less-represented categories. The improvement in F1 scores for low-frequency entities can be interpreted as follows: because pruning was applied before quantization, the model had already been adjusted to rely on the most important weights. The combination of pruning and dynamic quantization therefore acts as a form of fine-tuning or calibration, concentrating the model's capacity on a smaller but more meaningful parameter space, which improved the F1 scores for these challenging low-frequency entities.

In summary, this analysis demonstrates the capabilities of both structural pruning and dynamic quantization in optimizing large language models such as BERT and BioBERT for deployment. While the individual techniques yielded progress in efficiency, especially in reducing inference time and memory footprint, our findings show only minor, acceptable changes in the performance metrics. Notably, BioBERT consistently showed better performance across the various compression methods.
After pruning, it maintained or even enhanced key performance metrics, and under quantization it achieved accelerated inference. The combination of these methods further indicates their suitability for real-world applications, even on edge hardware such as CPUs. These results not only affirm the practical viability of model compression for clinical NLP tasks but also provide valuable insights into the differential impacts of pruning and quantization on model generalization.

5.1 Conclusion

The journey through this research has demonstrated the potential of model compression to make large language models like BERT more practical and accessible. This study aimed to examine optimization techniques for the oncology NLP pipeline at AstraZeneca in order to reduce the memory footprint and inference time on CPU nodes without compromising performance metrics, especially accuracy and F1. The existing NLP pipeline, consisting of three main BERT variants, namely BERT, BioBERT, and DistilBERT, was trained and then prepared for subsequent optimization.

The objective was achieved by implementing structural pruning and dynamic quantization individually and, subsequently, a combination of the two. For structural pruning, the Fisher Information Matrix method was applied to BioBERT and BERT at three levels of neuron and head sparsity (20%, 25%, and 30%). At the 20% and 25% pruning levels, the models generally maintained strong performance in terms of accuracy and F1 scores. However, beyond 25% sparsity, the models' performance dropped drastically.

In the case of quantization, a post-training quantization (PTQ) approach, facilitated by ONNX Runtime dynamic quantization, was applied to BERT, BioBERT, and DistilBERT. Dynamic quantization, which converts the 32-bit floating-point weights to 8-bit integers before inference and quantizes the activations on the fly, proved its ability to decrease inference time effectively.

Additionally, BERT and BioBERT were optimized through a combination of dynamic quantization and structural pruning, using the same three sparsity levels as the individual pruning experiments. The combination of quantization and 20% pruning achieved competent performance for both BERT and BioBERT. The pruning stage provided regularization and calibration effects, so compared with quantization alone, the F1 scores and accuracy at this level of sparsity were better. Overall, BioBERT with the combination of dynamic quantization and 20% structured pruning can be considered the best model among all experiments, with only a 0.02% accuracy drop.

Finally, it is worth mentioning that, although I-BERT was implemented in this project, it was ultimately not used due to its poorer performance and its specific hardware requirements for inference. This research confirms that, with careful application of these compression techniques, the NLP models can be used in real-world deployments.

Bibliography

[1] S. News, "Ancient myths reveal early fantasies of artificial life," 2019. [Online]. Available: https://news.stanford.edu/stories/2019/02/ancient-myths-reveal-early-fantasies-artificial-life.
[2] J. Devlin, "Bert: Pre-training of deep bidirectional transformers for language understanding," arXiv preprint arXiv:1810.04805, 2018.
[3] Y. Liu, M. Ott, N. Goyal, et al., "Roberta: A robustly optimized bert pretraining approach," arXiv preprint arXiv:1907.11692, 2019.
[4] E. Mueller, J.
Hansjakob, D. Auge, and A. Knoll, "Minimizing inference time: Optimization methods for converted deep spiking neural networks," in 2021 International Joint Conference on Neural Networks (IJCNN), 2021, pp. 1–8.
[5] J. Chen and X. Ran, "Deep learning with edge computing: A review," Proceedings of the IEEE, vol. 107, no. 8, pp. 1655–1674, 2019.
[6] Z. Wang, "Sparsednn: Fast sparse deep learning inference on cpus," CoRR, vol. abs/2101.07948, 2021.
[7] IQVIA Ltd., I2e: Information extraction platform, I2E is developed and marketed by IQVIA Ltd. Further information can be obtained from http://www.linguamatics.com, n.d.
[8] J. Lee, W. Yoon, S. Kim, et al., "Biobert: A pre-trained biomedical language representation model for biomedical text mining," Bioinformatics, vol. 36, no. 4, pp. 1234–1240, 2020.
[9] Y. Gu, R. Tinn, H. Cheng, et al., "Domain-specific language model pretraining for biomedical natural language processing," ACM Trans. Comput. Healthcare, vol. 3, no. 1, Oct. 2021. doi: 10.1145/3458754. [Online]. Available: https://doi.org/10.1145/3458754.
[10] V. Sanh, L. Debut, J. Chaumond, and T. Wolf, "Distilbert, a distilled version of bert: Smaller, faster, cheaper and lighter," arXiv preprint arXiv:1910.01108, 2019.
[11] A. Gendrin-Brokmann, E. Harrison, J. Noveras, et al., "Investigating deep-learning nlp for automating the extraction of oncology efficacy endpoints from scientific literature," Information Based Medicine, 2024, Available online 4 July 2024. doi: 10.1016/j.ibmed.2024.100152.
[12] H. Bai, W. Zhang, L. Hou, et al., "Binarybert: Pushing the limit of bert quantization," arXiv preprint arXiv:2012.15701, 2020.
[13] O. Zafrir, G. Boudoukh, P. Izsak, and M. Wasserblat, "Q8bert: Quantized 8bit bert," in 2019 Fifth Workshop on Energy Efficient Machine Learning and Cognitive Computing-NeurIPS Edition (EMC2-NIPS), IEEE, 2019, pp. 36–39.
[14] S. Kim, A. Gholami, Z. Yao, M. W. Mahoney, and K. Keutzer, "I-bert: Integer-only bert quantization," in International Conference on Machine Learning, PMLR, 2021, pp. 5506–5518.
[15] H. Qin, Y. Ding, M. Zhang, et al., "Bibert: Accurate fully binarized bert," arXiv preprint arXiv:2203.06390, 2022.
[16] X. Jiao, Y. Yin, L. Shang, et al., "Tinybert: Distilling bert for natural language understanding," arXiv preprint arXiv:1909.10351, 2019.
[17] E. Kurtic, D. Campos, T. Nguyen, et al., The optimal bert surgeon: Scalable and accurate second-order pruning for large language models, 2022. arXiv: 2202.09906 [cs.LG]. [Online]. Available: https://arxiv.org/abs/2202.09906.
[18] Hugging Face, Hugging Face Website and Hub, https://huggingface.co/, Accessed: [Date of Access, e.g., May 20, 2025], 2024.
[19] Microsoft Azure, Azure Databricks Documentation, https://docs.microsoft.com/en-us/azure/databricks/, Accessed: [Date of Access, e.g., May 20, 2025], 2024.
[20] A. Vaswani, N. Shazeer, N. Parmar, et al., "Attention is all you need," Advances in Neural Information Processing Systems, vol. 30, 2017.
[21] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
[22] Dive into Deep Learning, Attention mechanisms and transformers, Accessed: Month Day, Year. [Online]. Available: https://d2l.ai/chapter_attention-mechanisms-and-transformers/index.html.
[23] A. Zhang, Z. C. Lipton, M. Li, and A. J. Smola, Dive into deep learning, https://d2l.ai/chapter_attention-mechanisms-and-transformers/transformer.html, Accessed: 2025-04-28, 2021.
[24] G. Hinton, O. Vinyals, and J. Dean, Distilling the knowledge in a neural network, 2015. arXiv: 1503.02531 [stat.ML]. [Online]. Available: https://arxiv.org/abs/1503.02531.
[25] W. Wang, W. Chen, Y. Luo, et al., "Model compression and efficient inference for large language models: A survey," arXiv preprint arXiv:2402.09748, 2024.
[26] W. Dai, H. Deng, M. Rong, et al., Flexible operator fusion for fast sparse transformer with diverse masking on gpu, 2025. arXiv: 2506.06095 [cs.LG]. [Online]. Available: https://arxiv.org/abs/2506.06095.
[27] A. Acharya, U. Bondhugula, and A. Cohen, "Effective loop fusion in polyhedral compilation using fusion conflict graphs," ACM Transactions on Architecture and Code Optimization, vol. 17, no. 4, pp. 1–26, Sep. 2020, issn: 1544-3973. doi: 10.1145/3416510. [Online]. Available: http://dx.doi.org/10.1145/3416510.
[28] A. Phani, B. Rath, and M. Boehm, "Lima: Fine-grained lineage tracing and reuse in machine learning systems," in Proceedings of the 2021 International Conference on Management of Data, ser. SIGMOD '21, Virtual Event, China: Association for Computing Machinery, 2021, pp. 1426–1439, isbn: 9781450383431. doi: 10.1145/3448016.3452788. [Online]. Available: https://doi.org/10.1145/3448016.3452788.
[29] H. Zhang, Z. Yu, G. Dai, et al., Understanding gnn computational graph: A coordinated computation, io, and memory perspective, 2021. arXiv: 2110.09524 [cs.LG]. [Online]. Available: https://arxiv.org/abs/2110.09524.
[30] T. Theodoridis, M. Rigger, and Z. Su, "Finding missed optimizations through the lens of dead code elimination," in Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ser. ASPLOS '22, Lausanne, Switzerland: Association for Computing Machinery, 2022, pp. 697–709, isbn: 9781450392051. doi: 10.1145/3503222.3507764. [Online]. Available: https://doi.org/10.1145/3503222.3507764.
[31] U.S. National Library of Medicine, MEDLINE database, https://www.nlm.nih.gov/medline/index.html, Accessed: 2025-05-21, 2023.
[32] Linguamatics, I2e natural language processing platform, https://www.linguamatics.com/products/i2e, Accessed: 2025-05-21, 2023.
[33] Heartex, Label studio: Open source data labeling tool, https://github.com/heartexlabs/label-studio, Accessed: 2025-05-21, 2020.
[34] W. Kwon, S. Kim, M. W. Mahoney, J. Hassoun, K. Keutzer, and A. Gholami, "A fast post-training pruning framework for transformers," Advances in Neural Information Processing Systems, vol. 35, pp. 24101–24116, 2022.
[35] Microsoft, ONNX Runtime: Accelerating Machine Learning Inference, https://onnxruntime.ai/, Accessed: 2025-05-21.
[36] ONNX Runtime Contributors, Onnx runtime quantization tools, https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/python/tools/quantization, Accessed: 2025-05-21, 2023.
[37] A. Paszke, S. Gross, F. Massa, et al., "Pytorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems, Curran Associates, Inc., vol. 32, 2019.
[38] H. Nakayama, Seqeval: A python framework for sequence labeling evaluation, https://github.com/chakki-works/seqeval, Accessed: 2025-05-21, 2018.

A Appendix 1

Figure A.1 shows the heat map of BioBERT pruned to 30% overall sparsity. The top panel focuses on the attention heads; at this sparsity level, the Fisher-information pruning method has dropped 3 of the 144 heads outright, and these appear in white. The remaining heads display different shades of blue. This color variation comes from the third pruning stage, rescaling, in which the binary masks on the surviving heads were fine-tuned and converted to float values to regain some of the loss introduced earlier. The bottom panel covers the FFN neurons. After the full three-stage routine, about 56.5% of the FFN neurons were removed, visible as the light-blue cells scattered across the layers.

Figure A.2 shows the radar charts for standard BERT (top panel) and standard BioBERT (bottom panel), together with their 30% sparsity pruned versions and the corresponding pruned and quantized variants. At this higher sparsity, the F1 scores for most clinical-efficacy endpoints fall significantly compared with the lower sparsities of 20% and 25%. Even so, BioBERT shows a smaller performance drop than standard BERT.

Figure A.3 shows the inference time for standard BERT (top row) and standard BioBERT (bottom row) next to their 30% sparsity pruned models and the pruned and quantized versions. At 30% sparsity, the latency is lower than at 20% and 25% sparsity, but this extra speed comes at a cost: as Figure A.4 clearly shows, the global F1 score drops to 0.827 for BERT and 0.886 for BioBERT, along with similar declines in recall, while some precision is gained.

Figure A.1: Heat map of 30% neuron and head sparsity of the pruned BioBERT.

Figure A.2: Radar charts of BERT (above) and BioBERT (below).
The standard model (green), 30% pruned model (blue), and 30% pruned and quantized model (red).

Figure A.3: Inference time of BERT (above) and BioBERT (below). The standard model (green), 30% pruned model (blue), and 30% pruned and quantized model (red).

Figure A.4: Global performance metrics of the standard BERT (above) and BioBERT (below). The standard model (blue), 30% pruned model (red), and 30% pruned and quantized model (green).