Methods for Optimizing BERT Model on Edge Devices - Accelerating Biomedical NLP with Pruned and Quantized BERT Models
Master's Thesis
Abstract
Named-entity recognition (NER) of clinical efficacy endpoints in oncology abstracts
supports downstream discovery pipelines at AstraZeneca. Yet the fine-tuned transformer
models currently in use are too slow and overparameterized for large-scale
CPU deployment. This thesis evaluated whether post-training model compression
techniques can accelerate inference without retraining and without harming extraction quality.
In the first stage of this project, standard BERT and BioBERT were each pruned
with a three-stage, Fisher-guided structured pruning workflow at three sparsity
levels. In the second stage, dynamic 8-bit integer (INT8) quantization via ONNX
Runtime was applied to standard BERT, BioBERT, and DistilBERT. The third stage
combined pruning and quantization to further optimize the pre-trained standard
BERT and BioBERT transformers. Experiments were run on annotated MEDLINE
sentences covering 25 efficacy labels, with F1 score and per-sample inference
latency as the primary metrics.
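The thesis's exact pruning implementation is not reproduced here; the following is a minimal sketch of Fisher-guided structured head pruning under stated assumptions. It assumes a fine-tuned HuggingFace token-classification checkpoint (the dmis-lab/biobert-v1.1 name and the labelled dataloader are illustrative), and approximates each attention head's importance by the diagonal empirical Fisher information, i.e. the accumulated squared gradient of the loss with respect to a per-head mask.

```python
import torch
from transformers import AutoModelForTokenClassification

# Assumption: a fine-tuned NER checkpoint; this model name is illustrative.
model = AutoModelForTokenClassification.from_pretrained("dmis-lab/biobert-v1.1")
n_layers = model.config.num_hidden_layers
n_heads = model.config.num_attention_heads

# Per-head mask; its squared gradients approximate the diagonal empirical Fisher.
head_mask = torch.ones(n_layers, n_heads, requires_grad=True)
fisher = torch.zeros(n_layers, n_heads)

model.eval()
for batch in dataloader:  # assumed: yields input_ids, attention_mask, labels
    loss = model(**batch, head_mask=head_mask).loss
    loss.backward()
    fisher += head_mask.grad.detach() ** 2
    head_mask.grad.zero_()
    model.zero_grad()

# Remove the 25% of heads with the lowest Fisher scores (one of the three
# sparsity levels evaluated).
k = int(0.25 * n_layers * n_heads)
threshold = fisher.flatten().kthvalue(k).values
to_prune = {layer: [h for h in range(n_heads) if fisher[layer, h] <= threshold]
            for layer in range(n_layers)}
model.prune_heads(to_prune)  # structurally deletes the selected attention heads
```

Because structured pruning deletes entire heads, the weight matrices themselves shrink, so the speed-up is realized on ordinary CPUs without any sparse-kernel support.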
A 25% structured-sparsity level yielded no measurable drop in F1 score, and the
additional dynamic INT8 quantization step cut latency further. The best
configuration, 25%-pruned INT8 BioBERT, reduced mean CPU inference time from
32.52 ms to 12.02 ms (a 2.7-fold speed-up), while accuracy fell only from 0.982
to 0.980 and F1 score from 0.954 to 0.948.
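The dynamic INT8 step follows ONNX Runtime's standard post-training quantization path. A minimal sketch, assuming the fine-tuned (optionally pruned) PyTorch model from the previous sketch; file names and shapes are placeholders:

```python
import torch
from onnxruntime.quantization import quantize_dynamic, QuantType

# Export the model to ONNX first; batch/sequence shapes here are illustrative.
input_ids = torch.ones(1, 128, dtype=torch.long)
attention_mask = torch.ones(1, 128, dtype=torch.long)
torch.onnx.export(
    model, (input_ids, attention_mask), "biobert.onnx",
    input_names=["input_ids", "attention_mask"], output_names=["logits"],
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "attention_mask": {0: "batch", 1: "seq"}},
)

# Dynamic quantization stores weights as INT8 and quantizes activations on
# the fly at inference time, so no calibration dataset is required.
quantize_dynamic("biobert.onnx", "biobert_int8.onnx", weight_type=QuantType.QInt8)
```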
Post-training structured pruning combined with dynamic INT8 quantization thus
makes the oncology NER pipeline nearly three times faster in inference time
on standard CPUs without compromising extraction quality or requiring specialized
hardware or libraries.
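Per-sample CPU latency figures like those above can be reproduced with a simple harness; the thesis's exact benchmarking code is not shown, so the sketch below (with a placeholder file name) simply times the quantized ONNX model under ONNX Runtime's CPU execution provider.

```python
import time
import numpy as np
import onnxruntime as ort

# Placeholder file name; single-sample feed with an illustrative sequence length.
sess = ort.InferenceSession("biobert_int8.onnx", providers=["CPUExecutionProvider"])
feed = {"input_ids": np.ones((1, 128), dtype=np.int64),
        "attention_mask": np.ones((1, 128), dtype=np.int64)}

for _ in range(10):   # warm-up so one-time initialization is not timed
    sess.run(None, feed)

timings = []
for _ in range(100):  # repeated runs for a stable mean
    start = time.perf_counter()
    sess.run(None, feed)
    timings.append(time.perf_counter() - start)

print(f"mean per-sample latency: {1000 * np.mean(timings):.2f} ms")
```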
Subject/keywords
Natural language processing, Named entity recognition, Post-training quantization, Structured pruning, Model compression
