Methods for Optimizing BERT Model on Edge Devices - Accelerating Biomedical NLP with Pruned and Quantized BERT Models

dc.contributor.author: Barani, Amir Ali
dc.contributor.author: Mirzabeigi, Atefeh
dc.contributor.department: Chalmers tekniska högskola / Institutionen för data och informationsteknik (sv)
dc.contributor.department: Chalmers University of Technology / Department of Computer Science and Engineering (en)
dc.contributor.examiner: Johansson, Richard
dc.contributor.supervisor: Farahani, Mehrdad
dc.date.accessioned: 2026-02-05T10:40:28Z
dc.date.issued: 2025
dc.date.submitted:
dc.description.abstract: Named-entity recognition (NER) of clinical efficacy endpoints in oncology abstracts supports downstream discovery pipelines at AstraZeneca. Yet the fine-tuned transformer models currently in use are too slow and over-parameterized for large-scale CPU deployment. This thesis evaluated whether post-training model compression techniques can accelerate inference without retraining or harming extraction quality. In the first stage of this project, standard BERT and BioBERT were each pruned with a three-stage, Fisher-guided structured pruning workflow at three sparsity levels. In the second stage, dynamic 8-bit integer (INT8) quantization using ONNX Runtime was applied to standard BERT, BioBERT, and DistilBERT. The third stage combined pruning and quantization to further optimize the pre-trained standard BERT and BioBERT transformers. Experiments were run on annotated MEDLINE sentences covering 25 efficacy labels, with F1 score and per-sample inference latency as the primary metrics. A 25% structured-sparsity level yielded no measurable drop in F1 score, and the additional dynamic INT8 quantization step cut latency further. The best configuration, 25%-pruned + INT8 BioBERT, reduced mean CPU inference time from 32.52 ms to 12.02 ms (a 2.6-fold speed-up) while accuracy fell only from 0.982 to 0.980 and F1 score from 0.954 to 0.948. Post-training structured pruning combined with dynamic INT8 quantization thus makes the oncology-NER pipeline roughly 2.6 times faster on standard CPUs without compromising extraction quality or requiring special hardware or libraries.
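The structured-pruning stage described in the abstract can be illustrated with a minimal pure-Python sketch. This is not the thesis code: the function name is hypothetical, and the Fisher importance score is approximated here by the per-row sum of squared gradients, a common proxy; the actual three-stage workflow operates on transformer weight matrices.

```python
def fisher_structured_prune(weight, grad, sparsity=0.25):
    """Structured pruning sketch: zero out entire rows of `weight`
    whose Fisher-style importance (sum of squared gradients per row)
    is lowest, until a `sparsity` fraction of rows is removed.

    weight, grad: lists of rows (lists of floats), same shape.
    Returns (pruned_weight, indices_of_pruned_rows).
    """
    # Per-row importance: squared-gradient proxy for the Fisher information.
    importance = [sum(g * g for g in row) for row in grad]
    n_prune = int(round(sparsity * len(weight)))
    # Indices of the least-important rows.
    prune_idx = sorted(range(len(weight)), key=lambda i: importance[i])[:n_prune]
    prune_set = set(prune_idx)
    pruned = [[0.0] * len(row) if i in prune_set else list(row)
              for i, row in enumerate(weight)]
    return pruned, sorted(prune_idx)

# Toy example: 4 rows, 25% sparsity -> one row (the one with the
# smallest gradients, row 1) is zeroed out.
W = [[0.5, -0.2], [1.5, 0.3], [0.1, 0.05], [2.0, -1.0]]
G = [[0.9, 0.4], [0.01, 0.02], [1.2, 0.8], [0.03, 0.05]]
pruned, removed = fisher_structured_prune(W, G, sparsity=0.25)
```

Zeroed rows can then be physically removed from the layer to shrink the matrix multiplications, which is what makes structured (as opposed to unstructured) pruning deliver CPU speed-ups without sparse-kernel support.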
dc.identifier.coursecode: DATX05
dc.identifier.uri: http://hdl.handle.net/20.500.12380/310964
dc.language.iso: eng
dc.setspec.uppsok: Technology
dc.subject: Natural language processing
dc.subject: Named entity recognition
dc.subject: Post-training quantization
dc.subject: Structured pruning
dc.subject: Model compression
dc.title: Methods for Optimizing BERT Model on Edge Devices - Accelerating Biomedical NLP with Pruned and Quantized BERT Models
dc.type.degree: Examensarbete för masterexamen (sv)
dc.type.degree: Master's Thesis (en)
dc.type.uppsok: H
local.programme: Complex adaptive systems (MPCAS), MSc

Download

Original bundle
Name: CSE 25-112 AA.pdf
Size: 6.2 MB
Format: Adobe Portable Document Format

License bundle
Name: license.txt
Size: 2.35 KB
Format: Item-specific license agreed upon to submission