Methods for Optimizing BERT Model on Edge Devices - Accelerating Biomedical NLP with Pruned and Quantized BERT Models
Type
Master's Thesis
Abstract
Named-entity recognition (NER) of clinical efficacy endpoints in oncology abstracts
supports downstream discovery pipelines at AstraZeneca. Yet the fine-tuned transformer
models currently used are too slow and overparameterized for large-scale
CPU deployment. This thesis evaluated whether post-training model compression
techniques can accelerate inference without retraining or harming extraction quality.
In the first stage of this project, standard BERT and BioBERT were individually
pruned with a three-stage, Fisher-guided structured pruning workflow at three levels
of sparsity. Subsequently, in the second stage, dynamic 8-bit integer (INT8) quantization
using ONNX Runtime was applied to standard BERT, BioBERT, and DistilBERT.
The third stage involved combining both pruning and quantization, further optimizing
the pre-trained standard BERT and BioBERT transformers. Experiments were
run on annotated MEDLINE sentences covering 25 efficacy labels, with F1 score
and inference latency per sample serving as primary metrics.
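As a rough illustration of what "Fisher-guided" selection means, the sketch below scores each prunable unit (for example an attention head) by the sum of squared gradients of the loss with respect to that unit's mask, then zeroes out the lowest-scoring fraction. This is a minimal sketch under assumptions: the abstract does not spell out the three-stage workflow, the function names `fisher_scores` and `prune_mask` are hypothetical, and the gradient values are made up for the example.

```python
def fisher_scores(mask_grads):
    """Diagonal Fisher importance per unit.

    mask_grads[s][h] is the gradient of the loss with respect to the mask
    of unit h on sample s (illustrative values, not thesis data).
    """
    n_units = len(mask_grads[0])
    return [sum(sample[h] ** 2 for sample in mask_grads) for h in range(n_units)]


def prune_mask(scores, sparsity):
    """Return a 0/1 mask that drops the lowest-scoring fraction of units."""
    n_prune = int(len(scores) * sparsity)
    order = sorted(range(len(scores)), key=lambda h: scores[h])
    pruned = set(order[:n_prune])
    return [0 if h in pruned else 1 for h in range(len(scores))]


# Two samples, four units; unit 3 has the smallest gradients.
grads = [
    [0.1, -0.9, 0.5, 0.02],
    [-0.2, 0.8, 0.4, 0.01],
]
scores = fisher_scores(grads)
mask = prune_mask(scores, sparsity=0.25)
print(mask)
```

At 25% sparsity with four units, exactly one unit is removed, and it is the one whose mask gradients are smallest across the samples.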
A 25% structured-sparsity level yielded no measurable drop in F1 score, and the additional
dynamic 8-bit integer quantization step cut latency further. The best configuration, 25%-pruned
BioBERT with 8-bit integer quantization, reduced mean CPU inference time from 32.52 ms
to 12.02 ms (about a 2.7-fold speed-up) while accuracy fell only from 0.982 to 0.980 and F1
score from 0.954 to 0.948.
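The headline figures can be checked with a few lines of arithmetic. The snippet below uses only the numbers reported above; the variable names are illustrative.

```python
# Values reported in the abstract for BioBERT (baseline vs. 25%-pruned + 8-bit).
baseline_ms = 32.52
compressed_ms = 12.02

speedup = baseline_ms / compressed_ms                # ratio of mean latencies
latency_reduction = 1 - compressed_ms / baseline_ms  # fraction of time saved

accuracy_drop = 0.982 - 0.980
f1_drop = 0.954 - 0.948

print(f"speed-up: {speedup:.2f}x")
print(f"latency reduction: {latency_reduction:.1%}")
print(f"accuracy drop: {accuracy_drop:.3f}, F1 drop: {f1_drop:.3f}")
```

Dividing the two reported latencies gives a ratio of about 2.71, that is, roughly 63% less time per sample, for an absolute F1 decrease of 0.006.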
Post-training structured pruning combined with dynamic 8-bit integer quantization
makes the oncology NER pipeline about three times faster at inference
on standard CPUs without compromising extraction quality or requiring special
hardware or libraries.
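To make the idea behind dynamic 8-bit quantization concrete, the sketch below maps a list of floats to 8-bit integers using a scale and zero point derived from the observed value range, then maps them back. It is a minimal pure-Python illustration, not ONNX Runtime's implementation; ONNX Runtime exposes this step through `onnxruntime.quantization.quantize_dynamic`, which the sketch does not call. The helper names `affine_quantize` and `dequantize` are hypothetical.

```python
def affine_quantize(values, num_bits=8):
    """Quantize floats to unsigned integers with a range-derived scale.

    The range is widened to include 0.0 so that zero is exactly representable.
    """
    qmin, qmax = 0, 2 ** num_bits - 1
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for an all-zero input
    zero_point = round(qmin - lo / scale)
    quantized = [
        max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values
    ]
    return quantized, scale, zero_point


def dequantize(quantized, scale, zero_point):
    """Map quantized integers back to approximate float values."""
    return [(q - zero_point) * scale for q in quantized]


values = [-1.0, 0.0, 0.5, 2.0]
quantized, scale, zero_point = affine_quantize(values)
restored = dequantize(quantized, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(values, restored))
print(quantized, scale, zero_point, max_error)
```

"Dynamic" here refers to computing the activation ranges at inference time rather than from a calibration set, which is why no retraining or calibration data is needed. The round-trip error in this example stays within one quantization step (`scale`).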
Keywords
Natural language processing, Named entity recognition, Post-training quantization, Structured pruning, Model compression
