Accelerating LLM Inference via ANN-Based Draft Model Construction - Speculative decoding with dense and FlashHead draft models
Published
Author
Type
Master's Thesis
Abstract
Large language model inference is limited by the sequential nature of autoregressive
decoding. Speculative decoding reduces this cost by using a smaller draft model to
propose several candidate tokens, which are then verified by a larger target model.
However, this shifts part of the work to the draft side: every proposed token still
requires a full-vocabulary output projection, which can become a non-negligible
source of latency.
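The draft-then-verify loop described above can be sketched for the greedy case as follows. This is a minimal illustration, not the thesis implementation: `speculative_step`, `toy_draft`, and `toy_target` are hypothetical names, the models are stand-in functions over integer tokens, and the verifier is called once per position, whereas a real system would score all proposed positions in a single target forward pass.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]  # greedy next-token function


def speculative_step(prefix: List[Token], draft: NextToken,
                     target: NextToken, k: int) -> List[Token]:
    """Run one draft-then-verify round and return the newly accepted tokens."""
    # Draft phase: the small model proposes k tokens autoregressively.
    proposed: List[Token] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # Verify phase: compare each proposal with the target's own greedy choice.
    accepted: List[Token] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)

    # All k proposals matched, so the target contributes one extra token.
    accepted.append(target(prefix + accepted))
    return accepted


def toy_target(ctx: List[Token]) -> Token:
    """Hypothetical target model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[Token]) -> Token:
    """Hypothetical draft model: agrees with the target except on token 5."""
    nxt = (ctx[-1] + 1) % 10
    return 0 if nxt == 5 else nxt


partial_accept = speculative_step([1], toy_draft, toy_target, k=4)
full_accept = speculative_step([5], toy_draft, toy_target, k=3)
```

In this sketch the accepted tokens always equal what the target alone would have produced, so the draft only affects speed. Every proposed token costs one draft call, which is where the output projection cost discussed above is paid.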
This thesis studies whether a replacement for this output projection based on
approximate nearest neighbor (ANN) search can reduce the draft-side cost in
speculative decoding. We use an existing FlashHead model as a drop-in replacement
for the dense projection of a Qwen3-0.6B draft model, while keeping the target model
and the speculative verification procedure unchanged. The final experimental setup
focuses on a Qwen3-32B-AWQ target model and compares three decoding modes: baseline
decoding, speculative decoding with a dense draft model, and speculative decoding
with a FlashHead draft model.
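To illustrate the general idea of replacing a dense output projection with an approximate search, the sketch below uses a generic cluster-then-rerank scheme: score a few cluster centroids first, then score only the tokens in the best clusters. This is an assumption made for illustration and is not a description of FlashHead's internals, which the abstract does not specify. All names (`dense_argmax`, `build_index`, `ann_argmax`, `n_probe`) and the toy vocabulary are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def dense_argmax(hidden: Vector, vocab: List[Vector]) -> int:
    """Dense head: score every vocabulary row against the hidden state."""
    return max(range(len(vocab)), key=lambda t: dot(hidden, vocab[t]))


def build_index(vocab: List[Vector], assignment: List[int],
                n_clusters: int) -> Tuple[List[List[float]], Dict[int, List[int]]]:
    """Group token ids by cluster and compute one mean centroid per cluster.

    Assumes every cluster id in range(n_clusters) has at least one token.
    """
    members: Dict[int, List[int]] = {c: [] for c in range(n_clusters)}
    for tok, c in enumerate(assignment):
        members[c].append(tok)
    centroids = []
    for c in range(n_clusters):
        rows = [vocab[t] for t in members[c]]
        centroids.append([sum(col) / len(rows) for col in zip(*rows)])
    return centroids, members


def ann_argmax(hidden: Vector, vocab: List[Vector],
               centroids: List[List[float]], members: Dict[int, List[int]],
               n_probe: int) -> int:
    """Approximate head: score centroids, then only tokens in the top clusters."""
    ranked = sorted(range(len(centroids)),
                    key=lambda c: dot(hidden, centroids[c]), reverse=True)
    candidates = [t for c in ranked[:n_probe] for t in members[c]]
    return max(candidates, key=lambda t: dot(hidden, vocab[t]))


# Toy 6-token vocabulary in 2 dimensions, pre-assigned to 3 clusters.
vocab = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [-1.0, 0.0], [-0.9, -0.1]]
centroids, members = build_index(vocab, assignment=[0, 0, 1, 1, 2, 2], n_clusters=3)

hidden = [0.2, 1.0]
dense_token = dense_argmax(hidden, vocab)
ann_token = ann_argmax(hidden, vocab, centroids, members, n_probe=1)
```

With `n_probe=1` the approximate head scores 3 centroids and 2 tokens instead of all 6 tokens. At a realistic vocabulary size the saving would be far larger, at the risk of missing the true best token when it lies outside the probed clusters.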
The results show that FlashHead strongly accelerates the isolated draft projection,
reducing single-step projection latency from 2.493 ms to 0.577 ms, a 4.321× speedup.
At the level of a complete draft-model step, this shrinks to a 1.108× speedup, and
in the ordinary end-to-end speculative comparison FlashHead raises throughput only
from 9.769 tokens/s with the dense draft to 9.823 tokens/s. With deferred final
token verification, FlashHead reaches 11.374 tokens/s compared with 10.731 tokens/s
for the dense draft. Draft-side projection acceleration is therefore real, but it is
mostly diluted by target-side verification and finalization costs.
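The dilution can be made concrete with a little arithmetic on the figures reported above. The sketch below recomputes the speedup ratios and then applies Amdahl's law to estimate what share of a dense draft step the projection would occupy. That last estimate is an assumption-laden inference, not a measurement from the thesis: it presumes the rest of the draft step is unchanged by FlashHead. The function names are hypothetical.

```python
def latency_speedup(before_ms: float, after_ms: float) -> float:
    """Speedup for a latency metric (lower is better)."""
    return before_ms / after_ms


def throughput_gain(before_tps: float, after_tps: float) -> float:
    """Relative gain for a throughput metric (higher is better)."""
    return after_tps / before_tps


def amdahl_fraction(part_speedup: float, overall_speedup: float) -> float:
    """Fraction f of the original time spent in the accelerated part.

    Solves 1 / overall = (1 - f) + f / part for f (Amdahl's law).
    """
    return (1 - 1 / overall_speedup) / (1 - 1 / part_speedup)


# Figures reported in the abstract.
projection = latency_speedup(2.493, 0.577)   # isolated draft projection
ordinary = throughput_gain(9.769, 9.823)     # ordinary end-to-end comparison
deferred = throughput_gain(10.731, 11.374)   # deferred final token verification

# Assumption: only the projection changes within a draft step.
projection_share = amdahl_fraction(part_speedup=4.321, overall_speedup=1.108)

print(f"projection speedup:    {projection:.3f}x")
print(f"ordinary end-to-end:   {ordinary:.4f}x")
print(f"deferred verification: {deferred:.4f}x")
print(f"implied projection share of a dense draft step: {projection_share:.1%}")
```

Under that assumption the projection accounts for roughly one eighth of a dense draft step, which is consistent with a large projection speedup producing only a modest step-level gain. The end-to-end gains are about 0.6% in the ordinary comparison and about 6% with deferred verification.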
Subject/keywords
Large Language Models, Speculative Decoding, Approximate Nearest Neighbor, FlashHead
