Accelerating LLM Inference via ANN-Based Draft Model Construction - Speculative decoding with dense and FlashHead draft models

dc.contributor.author: Ma, Guangyu
dc.contributor.author: Wang, Weiyou
dc.contributor.department: Chalmers tekniska högskola / Institutionen för data och informationsteknik [sv]
dc.contributor.department: Chalmers University of Technology / Department of Computer Science and Engineering [en]
dc.contributor.examiner: Dubhashi, Devdatt
dc.contributor.supervisor: Karppa, Matti
dc.date.accessioned: 2026-06-30T06:41:37Z
dc.date.issued: 2026
dc.date.submitted:
dc.description.abstract: Large language model inference is limited by the sequential nature of autoregressive decoding. Speculative decoding reduces this cost by using a smaller draft model to propose several candidate tokens, which are then verified by a larger target model. However, this shifts part of the work to the draft side: every proposed token still requires a full-vocabulary output projection, which can become a non-negligible source of latency. This thesis studies whether an approximate-nearest-neighbor-based replacement for this output projection can reduce the draft-side cost in speculative decoding. We use an existing FlashHead model as a drop-in replacement for the dense projection of a Qwen3-0.6B draft model, while keeping the target model and the speculative verification procedure unchanged. The final experimental setup uses a Qwen3-32B-AWQ target model and compares three decoding modes: baseline decoding, speculative decoding with a dense draft model, and speculative decoding with a FlashHead draft model. The results show that FlashHead strongly accelerates the isolated draft projection, reducing single-step projection latency from 2.493 ms to 0.577 ms, a 4.321× speedup. At the level of a complete draft-model step, this shrinks to a 1.108× speedup, and in the ordinary end-to-end speculative comparison FlashHead improves throughput only from 9.769 to 9.823 tokens/s relative to the dense draft. With deferred final-token verification, FlashHead reaches 11.374 tokens/s compared with 10.731 tokens/s for the dense draft. The draft-side projection acceleration is therefore real, but it is mostly diluted by target-side verification and finalization costs.
dc.identifier.coursecode: DATX05
dc.identifier.uri: https://hdl.handle.net/20.500.12380/311637
dc.language.iso: eng
dc.setspec.uppsok: Technology
dc.subject: Large Language Models, Speculative Decoding, Approximate Nearest Neighbor, FlashHead
dc.title: Accelerating LLM Inference via ANN-Based Draft Model Construction - Speculative decoding with dense and FlashHead draft models
dc.type.degree: Examensarbete för masterexamen [sv]
dc.type.degree: Master's Thesis [en]
dc.type.uppsok: H
local.programme: High-performance computer systems (MPHPC), MSc
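
The dilution described in the abstract can be checked with a few lines of arithmetic. The sketch below is not taken from the thesis: it only recomputes ratios from the figures quoted in the abstract and, under the labeled assumption that the output projection is the only part of the draft step that changes, backs out what the 1.108× step-level speedup would imply for the total dense draft-step latency. All variable names are illustrative.

```python
# Figures quoted in the abstract.
dense_proj_ms = 2.493        # dense output projection, single step
flashhead_proj_ms = 0.577    # FlashHead output projection, single step
step_speedup = 1.108         # reported speedup of a complete draft-model step
dense_tps, flashhead_tps = 9.769, 9.823                    # ordinary speculative decoding
dense_deferred_tps, flashhead_deferred_tps = 10.731, 11.374  # deferred final-token verification

# Ratios that follow directly from the quoted numbers.
proj_speedup = dense_proj_ms / flashhead_proj_ms
e2e_gain = flashhead_tps / dense_tps
deferred_gain = flashhead_deferred_tps / dense_deferred_tps

# ASSUMPTION (not stated in the abstract): only the projection time changes
# between the dense and FlashHead draft steps. Then
#   step_speedup = T / (T - saved_ms)  =>  T = step_speedup * saved_ms / (step_speedup - 1)
saved_ms = dense_proj_ms - flashhead_proj_ms
implied_dense_step_ms = step_speedup * saved_ms / (step_speedup - 1)
projection_share = dense_proj_ms / implied_dense_step_ms

print(f"projection speedup:        {proj_speedup:.3f}x")
print(f"end-to-end gain:           {e2e_gain:.4f}x")
print(f"deferred-verification gain: {deferred_gain:.4f}x")
print(f"implied dense draft step:  {implied_dense_step_ms:.2f} ms")
print(f"projection share of step:  {projection_share:.1%}")
```

Under that assumption the dense projection accounts for only about one eighth of a draft step of roughly 20 ms, which is consistent with a large projection-level speedup turning into a small step-level one. The end-to-end gains are smaller still because target-side verification time is untouched.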

Download

Original bundle

Name:
CSE 26-29 GM WW.pdf
Size:
6.08 MB
Format:
Adobe Portable Document Format
