Accelerating Token Generation in Large Reasoning Models Using FlashHead
Type
Master's Thesis
Abstract
Large Reasoning Models (LRMs) improve performance on complex reasoning tasks
by generating long intermediate reasoning traces, but this increases inference
latency under autoregressive decoding. A significant part of this cost can come
from the language modeling head, which computes logits over the full vocabulary
at every decoding step. This thesis studies FlashHead, a training-free,
retrieval-based alternative to dense output embedding computation, as a way to
accelerate token generation in LRMs.
FlashHead replaces full-vocabulary scoring with a two-stage retrieval procedure:
centroid screening followed by exact scoring over a reduced candidate set. In
this work, FlashHead is integrated into the DeepSeek-R1-Distill-Qwen-1.5B
inference pipeline. The implementation includes offline clustering of the output
embedding matrix, construction of FlashHead assets, output embedding reordering,
CSR-style cluster metadata, and a Triton-based fused GPU kernel for candidate
scoring.
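
The two-stage procedure can be illustrated with a minimal pure-Python sketch. It is not the thesis implementation: the function names (build_assets, two_stage_argmax, dense_argmax), the num_probe parameter, and the use of a precomputed cluster assignment in place of a real clustering step are all assumptions made for illustration. It shows how reordering the embedding rows by cluster lets a CSR-style offsets array describe each cluster as one contiguous slice.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def build_assets(embeddings, assignments, num_clusters):
    """Offline step (hypothetical layout): reorder rows by cluster, build offsets.

    assignments[t] is the cluster id of token t; any clustering method could
    produce it. offsets[c]:offsets[c + 1] is the row range of cluster c.
    """
    # Stable sort keeps tokens of the same cluster in their original order.
    order = sorted(range(len(embeddings)), key=lambda t: assignments[t])
    offsets = [0] * (num_clusters + 1)
    for c in assignments:
        offsets[c + 1] += 1
    for c in range(num_clusters):
        offsets[c + 1] += offsets[c]
    reordered = [embeddings[t] for t in order]
    dim = len(embeddings[0])
    centroids = []
    for c in range(num_clusters):
        rows = reordered[offsets[c]:offsets[c + 1]]
        n = max(len(rows), 1)  # an empty cluster gets a zero centroid
        centroids.append([sum(r[d] for r in rows) / n for d in range(dim)])
    return reordered, order, offsets, centroids

def two_stage_argmax(hidden, reordered, order, offsets, centroids, num_probe):
    """Stage 1: screen centroids. Stage 2: exact logits on the candidates only."""
    scores = [dot(hidden, c) for c in centroids]
    probed = sorted(range(len(centroids)), key=lambda c: scores[c],
                    reverse=True)[:num_probe]
    best_token, best_logit = None, float("-inf")
    for c in probed:
        for row in range(offsets[c], offsets[c + 1]):
            logit = dot(hidden, reordered[row])
            if logit > best_logit:
                # order[row] maps the reordered row back to the original token id.
                best_token, best_logit = order[row], logit
    return best_token

def dense_argmax(hidden, embeddings):
    """Baseline: exact logits over the full vocabulary."""
    return max(range(len(embeddings)), key=lambda t: dot(hidden, embeddings[t]))
```

When num_probe equals the number of clusters, every token is scored and the result matches the dense baseline; smaller values trade fidelity for fewer dot products. A fused GPU kernel would replace the inner Python loops, but the data layout is the point of the sketch.
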
The system is evaluated on MMLU-Pro, IFEval, and GSM8K with chain-of-thought
prompting. We compare the dense baseline with both a PyTorch FlashHead
implementation and an optimized reorder + Triton variant, using task-level
metrics, next-token fidelity, time-per-output-token, and FlashHead-only latency.
The results show that FlashHead preserves task-level performance on IFEval and
MMLU-Pro, while incurring a modest degradation on GSM8K-CoT. Token-level
fidelity remains high, with Top-1 agreement above 0.95 across the evaluated
token-level datasets. The optimized reorder + Triton implementation reduces
FlashHead-only latency compared with the PyTorch implementation, although the
end-to-end speedup remains limited by the Transformer body and other decoding
overheads.
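
Two of the quantities in this paragraph can be made concrete with a short sketch. It assumes that Top-1 agreement means the fraction of decoding positions where the approximate head selects the same token as the dense head, and it uses Amdahl's law to show why a faster head yields a bounded end-to-end gain. The function names and all numeric inputs are hypothetical and are not measurements from the thesis.

```python
def top1_agreement(dense_tokens, approx_tokens):
    """Fraction of positions where both heads pick the same next token."""
    if not dense_tokens or len(dense_tokens) != len(approx_tokens):
        raise ValueError("need two non-empty sequences of equal length")
    matches = sum(d == a for d, a in zip(dense_tokens, approx_tokens))
    return matches / len(dense_tokens)

def end_to_end_speedup(head_fraction, head_speedup):
    """Amdahl's law: only the head's share of per-token time is accelerated.

    head_fraction: share of time-per-output-token spent in the LM head (0..1).
    head_speedup:  factor by which the head alone becomes faster (> 0).
    """
    if not 0.0 <= head_fraction <= 1.0 or head_speedup <= 0.0:
        raise ValueError("head_fraction must be in [0, 1], head_speedup > 0")
    return 1.0 / ((1.0 - head_fraction) + head_fraction / head_speedup)
```

For example, if the head hypothetically accounted for 20% of per-token time, even an infinitely fast head could not exceed a 1 / (1 - 0.2) = 1.25x end-to-end speedup, which is the sense in which the Transformer body bounds the gain.
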
These findings suggest that retrieval-based output embedding computation can
reduce the cost of token generation when combined with system-level GPU
optimization. However, the observed quality and efficiency trade-offs remain
workload-dependent, highlighting the need for further improvements in candidate
selection and end-to-end decoding performance.
Subject/keywords
Large Reasoning Models, FlashHead, inference acceleration, output embedding, approximate retrieval, Triton
