Accelerating Token Generation in Large Reasoning Models Using FlashHead
| dc.contributor.author | Tong, Fengming | |
| dc.contributor.author | Tang, Yanping | |
| dc.contributor.department | Chalmers tekniska högskola / Institutionen för data och informationsteknik | sv |
| dc.contributor.department | Chalmers University of Technology / Department of Computer Science and Engineering | en |
| dc.contributor.examiner | Dubhashi, Devdatt | |
| dc.contributor.supervisor | Karppa, Matti | |
| dc.date.accessioned | 2026-06-30T07:28:07Z | |
| dc.date.issued | 2026 | |
| dc.date.submitted | ||
| dc.description.abstract | Large Reasoning Models (LRMs) improve performance on complex reasoning tasks by generating long intermediate reasoning traces, but this increases inference latency under autoregressive decoding. A significant part of this cost can come from the language modeling head, which computes logits over the full vocabulary at every decoding step. This thesis studies FlashHead, a training-free retrieval-based alternative to dense output embedding computation, as a way to accelerate token generation in LRMs. FlashHead replaces full-vocabulary scoring with a two-stage retrieval procedure: centroid screening followed by exact scoring over a reduced candidate set. In this work, FlashHead is integrated into the DeepSeek-R1-Distill-Qwen-1.5B inference pipeline. The implementation includes offline clustering of the output embedding matrix, construction of FlashHead assets, output embedding reordering, CSR-style cluster metadata, and a Triton-based fused GPU kernel for candidate scoring. The system is evaluated on MMLU-Pro, IFEval, and GSM8K with chain-of-thought prompting. We compare the dense baseline with both a PyTorch FlashHead implementation and an optimized reorder + Triton variant, using task-level metrics, next-token fidelity, time-per-output-token, and FlashHead-only latency. The results show that FlashHead preserves task-level performance on IFEval and MMLU-Pro, while incurring a modest degradation on GSM8K-CoT. Token-level fidelity remains high, with Top-1 agreement above 0.95 across the evaluated token-level datasets. The optimized reorder + Triton implementation reduces FlashHead-only latency compared with the PyTorch implementation, although the end-to-end speedup remains limited by the Transformer body and other decoding overheads. These findings suggest that retrieval-based output embedding computation can reduce the cost of token generation when combined with system-level GPU optimization. However, the observed quality and efficiency trade-offs remain workload dependent, highlighting the need for further improvements in candidate selection and end-to-end decoding performance. | |
| dc.identifier.coursecode | DATX05 | |
| dc.identifier.uri | https://hdl.handle.net/20.500.12380/311643 | |
| dc.language.iso | eng | |
| dc.setspec.uppsok | Technology | |
| dc.subject | Large Reasoning Models, FlashHead, inference acceleration, output embedding, approximate retrieval, Triton | |
| dc.title | Accelerating Token Generation in Large Reasoning Models Using FlashHead | |
| dc.type.degree | Examensarbete för masterexamen | sv |
| dc.type.degree | Master's Thesis | en |
| dc.type.uppsok | H | |
| local.programme | High-performance computer systems (MPHPC), MSc |
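
The two-stage retrieval procedure named in the abstract (centroid screening followed by exact scoring over a reduced candidate set) can be illustrated with a minimal pure-Python sketch. This is not the thesis implementation: the helper names (`build_toy_assets`, `dense_argmax`, `two_stage_argmax`), the `n_probe` parameter, the toy sizes, and the arbitrary chunk-based cluster assignment are all assumptions made for illustration. The thesis itself clusters the output embedding matrix offline and scores candidates with a fused Triton kernel.

```python
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def build_toy_assets(seed=0, vocab=64, dim=8, n_clusters=4):
    """Build a random toy 'output embedding' plus cluster metadata.

    Assumption: tokens are assigned to clusters in contiguous chunks.
    A real system would derive clusters from the embedding geometry.
    """
    rng = random.Random(seed)
    embeddings = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(vocab)]
    size = vocab // n_clusters
    clusters = [list(range(c * size, (c + 1) * size)) for c in range(n_clusters)]
    # Each centroid is the mean of its member embeddings.
    centroids = [
        [sum(col) / len(members) for col in zip(*(embeddings[t] for t in members))]
        for members in clusters
    ]
    return embeddings, centroids, clusters


def dense_argmax(hidden, embeddings):
    """Baseline: score every vocabulary row and return the best token id."""
    return max(range(len(embeddings)), key=lambda t: dot(hidden, embeddings[t]))


def two_stage_argmax(hidden, embeddings, centroids, clusters, n_probe):
    """Stage 1 ranks clusters by centroid score; stage 2 scores only their members."""
    ranked = sorted(
        range(len(centroids)),
        key=lambda c: dot(hidden, centroids[c]),
        reverse=True,
    )
    candidates = [t for c in ranked[:n_probe] for t in clusters[c]]
    # Exact scoring, but restricted to the reduced candidate set.
    return max(candidates, key=lambda t: dot(hidden, embeddings[t]))
```

When `n_probe` equals the number of clusters, the candidate set covers the whole vocabulary and the result matches the dense baseline. With a smaller `n_probe`, fewer rows are scored, and the returned token can differ from the dense argmax whenever the best token lies in a cluster that was screened out. The Top-1 agreement reported in the abstract measures how often the two still agree in the actual system.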
