Methods for Optimizing BERT Model on Edge Devices - Accelerating Biomedical NLP with Pruned and Quantized BERT Models
| dc.contributor.author | Barani, Amir Ali | |
| dc.contributor.author | Mirzabeigi, Atefeh | |
| dc.contributor.department | Chalmers tekniska högskola / Institutionen för data och informationsteknik | sv |
| dc.contributor.department | Chalmers University of Technology / Department of Computer Science and Engineering | en |
| dc.contributor.examiner | Johansson, Richard | |
| dc.contributor.supervisor | Farahani, Mehrdad | |
| dc.date.accessioned | 2026-02-05T10:40:28Z | |
| dc.date.issued | 2025 | |
| dc.date.submitted | | |
| dc.description.abstract | Named-entity recognition (NER) of clinical efficacy endpoints in oncology abstracts supports downstream discovery pipelines at AstraZeneca. Yet the fine-tuned transformer models currently used are too slow and over-parameterized for large-scale CPU deployment. This thesis evaluated whether post-training model compression can accelerate inference without retraining or harming extraction quality. In the first stage of this project, standard BERT and BioBERT were each pruned with a three-step, Fisher-guided structured-pruning workflow at three sparsity levels. In the second stage, dynamic 8-bit integer (INT8) quantization using ONNX Runtime was applied to standard BERT, BioBERT, and DistilBERT. The third stage combined pruning and quantization to further optimize the standard BERT and BioBERT models. Experiments were run on annotated MEDLINE sentences covering 25 efficacy labels, with F1 score and per-sample inference latency as the primary metrics. A 25% structured-sparsity level yielded no measurable drop in F1 score, and the additional dynamic INT8 step cut latency further. The best configuration, 25%-pruned INT8 BioBERT, reduced mean CPU inference time from 32.52 ms to 12.02 ms (a 2.7-fold speed-up) while accuracy fell only from 0.982 to 0.980 and F1 score from 0.954 to 0.948. Post-training structured pruning combined with dynamic INT8 quantization thus makes the oncology NER pipeline roughly 2.7 times faster on standard CPUs without compromising extraction quality or requiring special hardware or libraries. | |
| dc.identifier.coursecode | DATX05 | |
| dc.identifier.uri | http://hdl.handle.net/20.500.12380/310964 | |
| dc.language.iso | eng | |
| dc.setspec.uppsok | Technology | |
| dc.subject | Natural language processing | |
| dc.subject | Named entity recognition | |
| dc.subject | Post-training quantization | |
| dc.subject | Structured pruning | |
| dc.subject | Model compression | |
| dc.title | Methods for Optimizing BERT Model on Edge Devices - Accelerating Biomedical NLP with Pruned and Quantized BERT Models | |
| dc.type.degree | Examensarbete för masterexamen | sv |
| dc.type.degree | Master's Thesis | en |
| dc.type.uppsok | H | |
| local.programme | Complex adaptive systems (MPCAS), MSc | |
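
The first stage described in the abstract, Fisher-guided structured pruning of attention heads, can be sketched as below. This is a minimal illustration under stated assumptions, not the thesis code: it assumes a fine-tuned `BertForTokenClassification` checkpoint and a dataloader yielding `input_ids`, `attention_mask`, and `labels`. Per-head importance is approximated by a diagonal Fisher estimate (the expected squared gradient of the loss with respect to a multiplicative head mask), and the lowest-scoring heads are removed with the `prune_heads` utility built into Hugging Face `transformers`.

```python
import torch
from transformers import BertForTokenClassification

def fisher_head_scores(model, dataloader, device="cpu"):
    """Approximate per-head importance with a diagonal Fisher estimate:
    the expected squared gradient of the loss w.r.t. a head mask."""
    cfg = model.config
    head_mask = torch.ones(cfg.num_hidden_layers, cfg.num_attention_heads,
                           device=device, requires_grad=True)
    scores = torch.zeros(cfg.num_hidden_layers, cfg.num_attention_heads,
                         device=device)
    model.to(device).eval()
    for batch in dataloader:  # batches with input_ids, attention_mask, labels
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch, head_mask=head_mask).loss
        (grads,) = torch.autograd.grad(loss, head_mask)
        scores += grads.detach() ** 2  # accumulate squared gradients
    return scores

def prune_lowest_heads(model, scores, sparsity=0.25):
    """Structurally remove the globally lowest-scoring heads until the
    requested fraction of all attention heads is gone."""
    n_layers, n_heads = scores.shape
    n_prune = int(n_layers * n_heads * sparsity)
    order = torch.argsort(scores.flatten())[:n_prune]
    to_prune = {}
    for flat_idx in order.tolist():
        to_prune.setdefault(flat_idx // n_heads, []).append(flat_idx % n_heads)
    # prune_heads() deletes the matching rows/columns of the Q, K, V and
    # output projections, shrinking the weight matrices in place.
    model.prune_heads(to_prune)
    return model

# Example usage (checkpoint name and dataloader are placeholders):
# model = BertForTokenClassification.from_pretrained("your-finetuned-ner-model")
# scores = fisher_head_scores(model, eval_dataloader)
# model = prune_lowest_heads(model, scores, sparsity=0.25)
```

Because the pruning is structured (whole heads, not scattered weights), the smaller matrices run faster on plain CPUs without sparse-kernel support, which matches the abstract's claim of needing no special hardware or libraries.

The second stage, dynamic INT8 quantization with ONNX Runtime, maps onto that library's standard `quantize_dynamic` API. The file paths below are placeholders; dynamic quantization stores weights as INT8 and quantizes activations on the fly at inference time, so no calibration dataset is needed.

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Convert the FP32 ONNX export to a model with INT8 weights.
quantize_dynamic(
    model_input="bert_ner_fp32.onnx",    # placeholder: exported FP32 model
    model_output="bert_ner_int8.onnx",   # placeholder: quantized output
    weight_type=QuantType.QInt8,         # signed 8-bit integer weights
)
```

Finally, the abstract's per-sample latency metric can be measured along these lines. This is a hedged sketch: the model file, the `dmis-lab/biobert-v1.1` tokenizer checkpoint, and the assumption that the ONNX graph's input names match the tokenizer's output keys (typical for Hugging Face exports) are all illustrative, and the sample counts are arbitrary.

```python
import time
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

sess = ort.InferenceSession("bert_ner_int8.onnx",
                            providers=["CPUExecutionProvider"])
tok = AutoTokenizer.from_pretrained("dmis-lab/biobert-v1.1")

enc = tok("Median overall survival was 14.2 months.", return_tensors="np")
# Build the feed dict from the graph's own input names, assuming they
# match the tokenizer's keys (input_ids, attention_mask, token_type_ids).
feeds = {i.name: enc[i.name].astype(np.int64) for i in sess.get_inputs()}

for _ in range(10):          # warm-up runs, excluded from timing
    sess.run(None, feeds)
n = 100
t0 = time.perf_counter()
for _ in range(n):
    sess.run(None, feeds)
print(f"mean latency: {(time.perf_counter() - t0) / n * 1000:.2f} ms/sample")
```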
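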
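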
