Accelerating Token Generation in Large Reasoning Models Using FlashHead

Type

Master's Thesis

Abstract

Large Reasoning Models (LRMs) improve performance on complex reasoning tasks by generating long intermediate reasoning traces, but this increases inference latency under autoregressive decoding. A significant part of this cost can come from the language modeling head, which computes logits over the full vocabulary at every decoding step. This thesis studies FlashHead, a training-free retrieval-based alternative to dense output embedding computation, as a way to accelerate token generation in LRMs. FlashHead replaces full-vocabulary scoring with a two-stage retrieval procedure: centroid screening followed by exact scoring over a reduced candidate set. In this work, FlashHead is integrated into the DeepSeek-R1-Distill-Qwen-1.5B inference pipeline. The implementation includes offline clustering of the output embedding matrix, construction of FlashHead assets, output embedding reordering, CSR-style cluster metadata, and a Triton-based fused GPU kernel for candidate scoring. The system is evaluated on MMLU-Pro, IFEval, and GSM8K with chain-of-thought prompting. We compare the dense baseline with both a PyTorch FlashHead implementation and an optimized reorder + Triton variant, using task-level metrics, next-token fidelity, time-per-output-token, and FlashHead-only latency. The results show that FlashHead preserves task-level performance on IFEval and MMLU-Pro, while incurring a modest degradation on GSM8K-CoT. Token-level fidelity remains high, with Top-1 agreement above 0.95 across the evaluated token-level datasets. The optimized reorder + Triton implementation reduces FlashHead-only latency compared with the PyTorch implementation, although the end-to-end speedup remains limited by the Transformer body and other decoding overheads. These findings suggest that retrieval-based output embedding computation can reduce the cost of token generation when combined with system-level GPU optimization. However, the observed quality and efficiency trade-offs remain workload-dependent, highlighting the need for further improvements in candidate selection and end-to-end decoding performance.
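
The two-stage procedure described in the abstract (centroid screening, then exact scoring over a reduced candidate set stored in reordered, CSR-style layout) can be illustrated with a minimal pure-Python sketch. This is not the thesis implementation, which uses PyTorch and a fused Triton kernel. The function names `build_assets` and `next_token`, the parameter `n_probe`, the use of the cluster mean as centroid, and the toy data are all assumptions made for illustration, and the cluster assignment is taken as given rather than computed.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def build_assets(W, assign, n_clusters):
    """Offline step (hypothetical helper): make each cluster's rows contiguous.

    W      : output embedding matrix as a list of rows, one per token id
    assign : assign[t] is the cluster id of token t, from any offline clustering
    """
    dim = len(W[0])
    # order[new_row] = original token id; sorted() is stable.
    order = sorted(range(len(W)), key=lambda t: assign[t])
    # CSR-style offsets: cluster c owns reordered rows offsets[c] .. offsets[c + 1] - 1.
    offsets = [0] * (n_clusters + 1)
    for c in assign:
        offsets[c + 1] += 1
    for c in range(n_clusters):
        offsets[c + 1] += offsets[c]
    W_reordered = [W[t] for t in order]
    centroids = []
    for c in range(n_clusters):
        members = W_reordered[offsets[c]:offsets[c + 1]]
        if members:
            # Assumption: the centroid is the mean of the cluster's rows.
            centroids.append([sum(m[j] for m in members) / len(members) for j in range(dim)])
        else:
            centroids.append([0.0] * dim)  # placeholder for an empty cluster
    return centroids, W_reordered, order, offsets


def next_token(h, centroids, W_reordered, order, offsets, n_probe):
    """Greedy next token; assumes at least one probed cluster is non-empty."""
    # Stage 1: rank clusters by centroid score and keep the best n_probe.
    ranked = sorted(range(len(centroids)), key=lambda c: dot(centroids[c], h), reverse=True)
    best_row, best_logit = None, float("-inf")
    # Stage 2: exact logits only for rows inside the selected clusters.
    for c in ranked[:n_probe]:
        for row in range(offsets[c], offsets[c + 1]):
            logit = dot(W_reordered[row], h)
            if logit > best_logit:
                best_row, best_logit = row, logit
    return order[best_row]  # map the reordered row back to the original token id


# Toy example: 4 tokens, 2 clusters, hidden state h pointing towards cluster 1.
W = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
assets = build_assets(W, [0, 0, 1, 1], n_clusters=2)
print(next_token([0.2, 1.0], *assets, n_probe=1))  # → 2
```

With `n_probe` equal to the number of clusters, every row is scored and the result matches the dense argmax. Smaller values score fewer rows at the risk of missing the true top token, which is the quality and efficiency trade-off the abstract refers to.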

Subject/Keywords

Large Reasoning Models, FlashHead, inference acceleration, output embedding, approximate retrieval, Triton
