Accelerating Token Generation in Large Reasoning Models Using FlashHead

dc.contributor.author: Tong, Fengming
dc.contributor.author: Tang, Yanping
dc.contributor.department: Chalmers tekniska högskola / Institutionen för data och informationsteknik (sv)
dc.contributor.department: Chalmers University of Technology / Department of Computer Science and Engineering (en)
dc.contributor.examiner: Dubhashi, Devdatt
dc.contributor.supervisor: Karppa, Matti
dc.date.accessioned: 2026-06-30T07:28:07Z
dc.date.issued: 2026
dc.date.submitted:
dc.description.abstract: Large Reasoning Models (LRMs) improve performance on complex reasoning tasks by generating long intermediate reasoning traces, but this increases inference latency under autoregressive decoding. A significant part of this cost can come from the language modeling head, which computes logits over the full vocabulary at every decoding step. This thesis studies FlashHead, a training-free retrieval-based alternative to dense output embedding computation, as a way to accelerate token generation in LRMs. FlashHead replaces full-vocabulary scoring with a two-stage retrieval procedure: centroid screening followed by exact scoring over a reduced candidate set. In this work, FlashHead is integrated into the DeepSeek-R1-Distill-Qwen-1.5B inference pipeline. The implementation includes offline clustering of the output embedding matrix, construction of FlashHead assets, output embedding reordering, CSR-style cluster metadata, and a Triton-based fused GPU kernel for candidate scoring. The system is evaluated on MMLU-Pro, IFEval, and GSM8K with chain-of-thought prompting. We compare the dense baseline with both a PyTorch FlashHead implementation and an optimized reorder + Triton variant, using task-level metrics, next-token fidelity, time-per-output-token, and FlashHead-only latency. The results show that FlashHead preserves task-level performance on IFEval and MMLU-Pro, while incurring a modest degradation on GSM8K-CoT. Token-level fidelity remains high, with Top-1 agreement above 0.95 across the evaluated token-level datasets. The optimized reorder + Triton implementation reduces FlashHead-only latency compared with the PyTorch implementation, although the end-to-end speedup remains limited by the Transformer body and other decoding overheads. These findings suggest that retrieval-based output embedding computation can reduce the cost of token generation when combined with system-level GPU optimization. However, the observed quality and efficiency trade-offs remain workload-dependent, highlighting the need for further improvements in candidate selection and end-to-end decoding performance.
dc.identifier.coursecode: DATX05
dc.identifier.uri: https://hdl.handle.net/20.500.12380/311643
dc.language.iso: eng
dc.setspec.uppsok: Technology
dc.subject: Large Reasoning Models, FlashHead, inference acceleration, output embedding, approximate retrieval, Triton
dc.title: Accelerating Token Generation in Large Reasoning Models Using FlashHead
dc.type.degree: Examensarbete för masterexamen (sv)
dc.type.degree: Master's Thesis (en)
dc.type.uppsok: H
local.programme: High-performance computer systems (MPHPC), MSc
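
The two-stage procedure named in the abstract (centroid screening, then exact scoring over a reduced candidate set, with rows reordered by cluster and CSR-style offsets) can be illustrated with a minimal pure-Python sketch. This is not the thesis implementation: the function names (`build_assets`, `two_stage_argmax`), the toy nearest-seed clustering, the matrix sizes, and the `n_probe` parameter are all assumptions made for illustration, and the sketch covers greedy (argmax) decoding only, with no Triton kernel.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def nearest(v, seeds):
    # Toy clustering step (assumption): assign a row to its closest seed row.
    return min(range(len(seeds)),
               key=lambda c: sum((a - b) ** 2 for a, b in zip(v, seeds[c])))

def build_assets(E, assign, n_clusters):
    """Offline step: reorder rows by cluster and record CSR-style offsets."""
    # perm[r] is the original token id stored at reordered row r.
    perm = sorted(range(len(E)), key=lambda t: assign[t])
    E_re = [E[t] for t in perm]
    # Rows of cluster c occupy E_re[offsets[c]:offsets[c + 1]].
    offsets = [0] * (n_clusters + 1)
    for c in assign:
        offsets[c + 1] += 1
    for c in range(n_clusters):
        offsets[c + 1] += offsets[c]
    dim = len(E[0])
    centroids = []
    for c in range(n_clusters):
        rows = E_re[offsets[c]:offsets[c + 1]]
        n = max(len(rows), 1)  # guard against an empty cluster
        centroids.append([sum(r[j] for r in rows) / n for j in range(dim)])
    return E_re, perm, offsets, centroids

def two_stage_argmax(h, E_re, perm, offsets, centroids, n_probe):
    """Online step: screen centroids, then score only the probed clusters."""
    # Stage 1: centroid screening, keep the n_probe best-scoring clusters.
    scores = [dot(h, c) for c in centroids]
    probed = sorted(range(len(centroids)),
                    key=lambda c: scores[c], reverse=True)[:n_probe]
    # Stage 2: exact logits over the contiguous row ranges of those clusters.
    candidates = (r for c in probed for r in range(offsets[c], offsets[c + 1]))
    best_row = max(candidates, key=lambda r: dot(h, E_re[r]))
    return perm[best_row]  # map back to the original token id

random.seed(0)
V, D, K = 64, 8, 4  # toy vocabulary size, hidden size, cluster count
E = [[random.gauss(0, 1) for _ in range(D)] for _ in range(V)]
seeds = random.sample(E, K)
assign = [nearest(row, seeds) for row in E]
assets = build_assets(E, assign, K)

h = [random.gauss(0, 1) for _ in range(D)]
dense = max(range(V), key=lambda t: dot(h, E[t]))    # full-vocabulary baseline
exact = two_stage_argmax(h, *assets, n_probe=K)      # probes every cluster
approx = two_stage_argmax(h, *assets, n_probe=2)     # probes a subset only
```

Probing all clusters reproduces the dense argmax by construction. Probing fewer clusters scores fewer rows but may miss the true top token, which is the quality and efficiency trade-off the abstract describes. Reordering makes each cluster a contiguous row range, so candidate scoring reads consecutive rows instead of gathering scattered ones.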

Original bundle

Name: CSE 26-35 YT.pdf
Size: 573.17 KB
Format: Adobe Portable Document Format
