Accelerating LLM Inference via ANN-Based Draft Model Construction - Speculative decoding with dense and FlashHead draft models
| dc.contributor.author | Ma, Guangyu | |
| dc.contributor.author | Wang, Weiyou | |
| dc.contributor.department | Chalmers tekniska högskola / Institutionen för data och informationsteknik | sv |
| dc.contributor.department | Chalmers University of Technology / Department of Computer Science and Engineering | en |
| dc.contributor.examiner | Dubhashi, Devdatt | |
| dc.contributor.supervisor | Karppa, Matti | |
| dc.date.accessioned | 2026-06-30T06:41:37Z | |
| dc.date.issued | 2026 | |
| dc.date.submitted | ||
| dc.description.abstract | Large language model inference is limited by the sequential nature of autoregressive decoding. Speculative decoding reduces this cost by using a smaller draft model to propose several candidate tokens, which are then verified by a larger target model. However, this shifts part of the work to the draft side: every proposed token still requires a full-vocabulary output projection, which can become a non-negligible source of latency. This thesis studies whether an approximate-nearest-neighbor based replacement for this output projection can reduce the draft-side cost in speculative decoding. We use an existing FlashHead model as a drop-in replacement for the dense projection of a Qwen3-0.6B draft model, while keeping the target model and the speculative verification procedure unchanged. The final experimental setup focuses on a Qwen3 32B-AWQ base model and compares three decoding modes: baseline decoding, speculative decoding with a dense draft model, and speculative decoding with a FlashHead draft model. The results show that FlashHead strongly accelerates the isolated draft projection, reducing single-step projection latency from 2.493 ms to 0.577 ms, a 4.321× speedup. At the complete draft-model step level, this becomes a smaller 1.108× speedup, and in the ordinary end-to-end speculative comparison FlashHead improves throughput only from 9.769 to 9.823 tokens/s relative to the dense draft. With deferred final token verification, FlashHead reaches 11.374 tokens/s compared with 10.731 tokens/s for the dense draft, showing that draft-side projection acceleration is real but mostly diluted by target-side verification and finalization costs. | |
| dc.identifier.coursecode | DATX05 | |
| dc.identifier.uri | https://hdl.handle.net/20.500.12380/311637 | |
| dc.language.iso | eng | |
| dc.setspec.uppsok | Technology | |
| dc.subject | Large Language Models, Speculative Decoding, Approximate Nearest Neighbor, FlashHead | |
| dc.title | Accelerating LLM Inference via ANN-Based Draft Model Construction - Speculative decoding with dense and FlashHead draft models | |
| dc.type.degree | Examensarbete för masterexamen | sv |
| dc.type.degree | Master's Thesis | en |
| dc.type.uppsok | H | |
| local.programme | High-performance computer systems (MPHPC), MSc |
