Accelerating LLM Inference via ANN-Based Draft Model Construction - Speculative decoding with dense and FlashHead draft models

Type

Master's Thesis

Abstract

Large language model inference is limited by the sequential nature of autoregressive decoding. Speculative decoding reduces this cost by using a smaller draft model to propose several candidate tokens, which are then verified by a larger target model. However, this shifts part of the work to the draft side: every proposed token still requires a full-vocabulary output projection, which can become a non-negligible source of latency. This thesis studies whether an approximate-nearest-neighbor (ANN) based replacement for this output projection can reduce the draft-side cost in speculative decoding. We use an existing FlashHead model as a drop-in replacement for the dense projection of a Qwen3-0.6B draft model, while keeping the target model and the speculative verification procedure unchanged. The final experimental setup focuses on a Qwen3 32B-AWQ base model and compares three decoding modes: baseline decoding, speculative decoding with a dense draft model, and speculative decoding with a FlashHead draft model. The results show that FlashHead strongly accelerates the isolated draft projection, reducing single-step projection latency from 2.493 ms to 0.577 ms, a 4.321× speedup. At the level of a complete draft-model step, this shrinks to a 1.108× speedup, and in the ordinary end-to-end speculative comparison FlashHead raises throughput only from 9.769 tokens/s (dense draft) to 9.823 tokens/s. With deferred final token verification, FlashHead reaches 11.374 tokens/s compared with 10.731 tokens/s for the dense draft. The draft-side projection acceleration is therefore real, but it is mostly diluted by target-side verification and finalization costs.
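The dilution described in the abstract can be made concrete with a short Amdahl-style calculation. The sketch below is illustrative only: the constants are the figures quoted in the abstract, while the function names and the assumption that the output projection is the only part of the draft step that changes are introduced here and are not taken from the thesis.

```python
# Figures quoted in the abstract.
DENSE_PROJ_MS = 2.493          # dense output projection, single step
FLASHHEAD_PROJ_MS = 0.577      # FlashHead output projection, single step
DRAFT_STEP_SPEEDUP = 1.108     # complete draft-model step, dense vs FlashHead
DENSE_TOKENS_PER_S = 9.769     # end-to-end, ordinary speculative decoding
FLASHHEAD_TOKENS_PER_S = 9.823
DENSE_DEFERRED_TOKENS_PER_S = 10.731      # with deferred final token verification
FLASHHEAD_DEFERRED_TOKENS_PER_S = 11.374


def ratio(before: float, after: float) -> float:
    """Speedup when a latency drops from `before` to `after`."""
    return before / after


def implied_dense_step_ms(proj_before: float, proj_after: float,
                          step_speedup: float) -> float:
    """Dense draft-step latency implied by the reported step speedup.

    ASSUMPTION (not from the thesis): only the projection changes, so
    T / (T - saved) = step_speedup, which gives T = s * saved / (s - 1).
    """
    saved = proj_before - proj_after
    return step_speedup * saved / (step_speedup - 1.0)


projection_speedup = ratio(DENSE_PROJ_MS, FLASHHEAD_PROJ_MS)
dense_step_ms = implied_dense_step_ms(
    DENSE_PROJ_MS, FLASHHEAD_PROJ_MS, DRAFT_STEP_SPEEDUP)
projection_share = DENSE_PROJ_MS / dense_step_ms

# Throughput ratios (higher is better, so FlashHead goes in the numerator).
end_to_end_gain = FLASHHEAD_TOKENS_PER_S / DENSE_TOKENS_PER_S
deferred_gain = FLASHHEAD_DEFERRED_TOKENS_PER_S / DENSE_DEFERRED_TOKENS_PER_S

print(f"projection speedup:        {projection_speedup:.3f}x")
print(f"implied dense draft step:  {dense_step_ms:.1f} ms")
print(f"projection share of step:  {projection_share:.1%}")
print(f"end-to-end gain:           {end_to_end_gain:.4f}x")
print(f"gain with deferred verify: {deferred_gain:.4f}x")
```

Under that assumption, the projection would account for roughly 13% of a dense draft step of about 20 ms. This is consistent with a large projection speedup producing only a modest step-level gain, and an even smaller end-to-end gain once target-side verification is included. Because the reported 1.108× figure is rounded, the implied step latency is approximate.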

Subject/keywords

Large Language Models, Speculative Decoding, Approximate Nearest Neighbor, FlashHead
