Accelerating Token Generation in Large Reasoning Models Using FlashHead

Tong, Fengming; Tang, Yanping

dc.contributor.author	Tong, Fengming
dc.contributor.author	Tang, Yanping
dc.contributor.department	Chalmers tekniska högskola / Institutionen för data och informationsteknik	sv
dc.contributor.department	Chalmers University of Technology / Department of Computer Science and Engineering	en
dc.contributor.examiner	Dubhashi, Devdatt
dc.contributor.supervisor	Karppa, Matti
dc.date.accessioned	2026-06-30T07:28:07Z
dc.date.issued	2026
dc.date.submitted
dc.description.abstract	Large Reasoning Models (LRMs) improve performance on complex reasoning tasks by generating long intermediate reasoning traces, but this increases inference latency under autoregressive decoding. A significant part of this cost can come from the language modeling head, which computes logits over the full vocabulary at every decoding step. This thesis studies FlashHead, a training-free retrieval-based alternative to dense output embedding computation, as a way to accelerate token generation in LRMs. FlashHead replaces full-vocabulary scoring with a two-stage retrieval procedure: centroid screening followed by exact scoring over a reduced candidate set. In this work, FlashHead is integrated into the DeepSeek-R1-Distill-Qwen-1.5B inference pipeline. The implementation includes offline clustering of the output embedding matrix, construction of FlashHead assets, output embedding reordering, CSR-style cluster metadata, and a Triton-based fused GPU kernel for candidate scoring. The system is evaluated on MMLU-Pro, IFEval, and GSM8K with chain-of-thought prompting. We compare the dense baseline with both a PyTorch FlashHead implementation and an optimized reorder + Triton variant, using task-level metrics, next-token fidelity, time-per-output-token, and FlashHead-only latency. The results show that FlashHead preserves task-level performance on IFEval and MMLU-Pro, while incurring a modest degradation on GSM8K-CoT. Token-level fidelity remains high, with Top-1 agreement above 0.95 across the evaluated token-level datasets. The optimized reorder + Triton implementation reduces FlashHead-only latency compared with the PyTorch implementation, although the end-to-end speedup remains limited by the Transformer body and other decoding overheads. These findings suggest that retrieval-based output embedding computation can reduce the cost of token generation when combined with system-level GPU optimization.
However, the observed quality and efficiency trade-offs remain workload dependent, highlighting the need for further improvements in candidate selection and end-to-end decoding performance.
dc.identifier.coursecode	DATX05
dc.identifier.uri	https://hdl.handle.net/20.500.12380/311643
dc.language.iso	eng
dc.setspec.uppsok	Technology
dc.subject	Large Reasoning Models, FlashHead, inference acceleration, output embedding, approximate retrieval, Triton
dc.title	Accelerating Token Generation in Large Reasoning Models Using FlashHead
dc.type.degree	Examensarbete för masterexamen	sv
dc.type.degree	Master's Thesis	en
dc.type.uppsok	H
local.programme	High-performance computer systems (MPHPC), MSc
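
The two-stage retrieval summarized in the abstract (centroid screening, then exact scoring over a reduced candidate set stored with CSR-style cluster offsets) might look roughly like the toy sketch below. This is not the thesis implementation, which uses a Triton-based fused GPU kernel. All names (`two_stage_argmax`, `n_probe`, `cluster_offsets`, `token_ids`), the greedy argmax decoding, and the six-token vocabulary are assumptions made for illustration only.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def dense_argmax(hidden: Sequence[float],
                 embeddings: List[List[float]]) -> int:
    """Baseline: score every vocabulary row and return the best token id."""
    logits = [dot(hidden, row) for row in embeddings]
    return max(range(len(logits)), key=lambda t: logits[t])


def two_stage_argmax(hidden: Sequence[float],
                     centroids: List[List[float]],
                     reordered_emb: List[List[float]],
                     cluster_offsets: List[int],
                     token_ids: List[int],
                     n_probe: int) -> int:
    """Approximate argmax: screen centroids, then score only probed clusters.

    reordered_emb holds the output embedding rows grouped by cluster, so
    cluster c occupies rows cluster_offsets[c] .. cluster_offsets[c + 1] - 1
    (a CSR-style layout). token_ids maps a reordered row back to its
    original vocabulary id.
    """
    # Stage 1: centroid screening, keep the n_probe best-scoring clusters.
    scores = [dot(hidden, c) for c in centroids]
    probed = sorted(range(len(centroids)),
                    key=lambda c: scores[c], reverse=True)[:n_probe]

    # Stage 2: exact scoring restricted to rows of the probed clusters.
    best_token, best_logit = -1, float("-inf")
    for c in probed:
        for row in range(cluster_offsets[c], cluster_offsets[c + 1]):
            logit = dot(hidden, reordered_emb[row])
            if logit > best_logit:
                best_token, best_logit = token_ids[row], logit
    return best_token


# Hypothetical toy vocabulary: 6 tokens, 2-dimensional embeddings.
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1],
              [-1.0, 0.0], [0.1, 0.9], [-0.9, -0.1]]

# Assumed offline clustering result: clusters {0, 2}, {1, 4}, {3, 5}.
token_ids = [0, 2, 1, 4, 3, 5]                  # reordered row -> token id
reordered_emb = [embeddings[t] for t in token_ids]
cluster_offsets = [0, 2, 4, 6]                  # CSR-style row boundaries
centroids = [[0.95, 0.05], [0.05, 0.95], [-0.95, -0.05]]  # cluster means

hidden = [0.2, 1.0]
approx = two_stage_argmax(hidden, centroids, reordered_emb,
                          cluster_offsets, token_ids, n_probe=1)
exact = dense_argmax(hidden, embeddings)
print(approx, exact)
```

In this toy case only two of the six rows are scored exactly, and the approximate result agrees with the dense baseline. When the true best token lies outside the probed clusters the two can differ, which is the kind of fidelity loss a Top-1 agreement metric would capture.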

Original bundle

Name: CSE 26-35 YT.pdf
Size: 573.17 KB
Format: Adobe Portable Document Format

Collections

Master's theses