Accelerating LLM Inference via ANN-Based Draft Model Construction - Speculative decoding with dense and FlashHead draft models

Ma, Guangyu; Wang, Weiyou

dc.contributor.author	Ma, Guangyu
dc.contributor.author	Wang, Weiyou
dc.contributor.department	Chalmers tekniska högskola / Institutionen för data och informationsteknik	sv
dc.contributor.department	Chalmers University of Technology / Department of Computer Science and Engineering	en
dc.contributor.examiner	Dubhashi, Devdatt
dc.contributor.supervisor	Karppa, Matti
dc.date.accessioned	2026-06-30T06:41:37Z
dc.date.issued	2026
dc.date.submitted
dc.description.abstract	Large language model inference is limited by the sequential nature of autoregressive decoding. Speculative decoding reduces this cost by using a smaller draft model to propose several candidate tokens, which are then verified by a larger target model. However, this shifts part of the work to the draft side: every proposed token still requires a full-vocabulary output projection, which can become a non-negligible source of latency. This thesis studies whether an approximate-nearest-neighbor-based replacement for this output projection can reduce the draft-side cost in speculative decoding. We use an existing FlashHead model as a drop-in replacement for the dense projection of a Qwen3-0.6B draft model, while keeping the target model and the speculative verification procedure unchanged. The final experimental setup focuses on a Qwen3-32B-AWQ base model and compares three decoding modes: baseline decoding, speculative decoding with a dense draft model, and speculative decoding with a FlashHead draft model. The results show that FlashHead strongly accelerates the isolated draft projection, reducing single-step projection latency from 2.493 ms to 0.577 ms, a 4.321× speedup. At the level of the complete draft-model step, this shrinks to a 1.108× speedup, and in the ordinary end-to-end speculative comparison, FlashHead improves throughput only from 9.769 to 9.823 tokens/s relative to the dense draft. With deferred final-token verification, FlashHead reaches 11.374 tokens/s compared with 10.731 tokens/s for the dense draft. Draft-side projection acceleration is therefore real, but it is mostly diluted by target-side verification and finalization costs.
dc.identifier.coursecode	DATX05
dc.identifier.uri	https://hdl.handle.net/20.500.12380/311637
dc.language.iso	eng
dc.setspec.uppsok	Technology
dc.subject	Large Language Models, Speculative Decoding, Approximate Nearest Neighbor, FlashHead
dc.title	Accelerating LLM Inference via ANN-Based Draft Model Construction - Speculative decoding with dense and FlashHead draft models
dc.type.degree	Examensarbete för masterexamen	sv
dc.type.degree	Master's Thesis	en
dc.type.uppsok	H
local.programme	High-performance computer systems (MPHPC), MSc
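
The speedup figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below recomputes the ratios from the reported latencies and throughputs. It also estimates what share of a dense draft step the output projection would occupy, assuming a simple Amdahl's-law model in which only the projection is accelerated. That model, and the function names `latency_speedup`, `throughput_gain`, and `amdahl_fraction`, are illustrative assumptions and are not taken from the thesis.

```python
def latency_speedup(old_ms: float, new_ms: float) -> float:
    """Speedup from a latency reduction: old time divided by new time."""
    return old_ms / new_ms


def throughput_gain(old_tps: float, new_tps: float) -> float:
    """Relative gain from a throughput increase: new rate divided by old rate."""
    return new_tps / old_tps


def amdahl_fraction(part_speedup: float, overall_speedup: float) -> float:
    """Fraction f of the original step time spent in the accelerated part.

    Assumes Amdahl's law, 1 / overall = (1 - f) + f / part, solved for f.
    This is a modelling assumption, not a measurement from the thesis.
    """
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / part_speedup)


if __name__ == "__main__":
    # Figures as reported in the abstract.
    print(f"projection speedup:       {latency_speedup(2.493, 0.577):.3f}x")
    print(f"end-to-end gain:          {throughput_gain(9.769, 9.823):.3f}x")
    print(f"gain with deferred check: {throughput_gain(10.731, 11.374):.3f}x")
    # Implied share of the dense draft step spent in the output projection.
    print(f"implied projection share: {amdahl_fraction(4.321, 1.108):.3f}")
```

Under this assumption the projection would account for roughly 13% of a dense draft step, which is consistent with the abstract's observation that a large projection speedup turns into a much smaller step-level gain.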

Original bundle

Name:: CSE 26-29 GM WW.pdf
Size:: 6.08 MB
Format:: Adobe Portable Document Format

License bundle

Name:: license.txt
Size:: 2.35 KB
Format:: Item-specific license agreed upon to submission

Collections

Master's Theses