Adaptive KV Cache Management for Efficient Transformer-based LLM Inference - Leveraging Attention Sparsity for Memory Optimization
| dc.contributor.author | Xu, Dikai | |
| dc.contributor.department | Chalmers tekniska högskola / Institutionen för data och informationsteknik | sv |
| dc.contributor.department | Chalmers University of Technology / Department of Computer Science and Engineering | en |
| dc.contributor.examiner | Ali-Eldin Hassan, Ahmed | |
| dc.contributor.supervisor | Ali-Eldin Hassan, Ahmed | |
| dc.date.accessioned | 2026-01-16T08:47:00Z | |
| dc.date.issued | 2025 | |
| dc.date.submitted | | |
| dc.description.abstract | This Master’s thesis addresses the critical challenge of memory inefficiency in Transformer-based Large Language Models (LLMs) during inference, specifically focusing on the prohibitive memory footprint of the Key-Value (KV) cache. As LLMs scale, the KV cache becomes a significant bottleneck, limiting longer context windows and overall operational efficiency. To mitigate this issue, we propose and evaluate Adap-KV, a novel adaptive memory management strategy for the KV cache. Adap-KV employs a layer-aware dynamic allocation approach that adjusts KV cache size in real time, leveraging insights from attention sparsity patterns. Our method aims to optimize memory utilization without compromising the performance or quality of LLM inference. Experimental results demonstrate that Adap-KV significantly reduces KV cache memory consumption, thereby enhancing the efficiency and scalability of Transformer-based LLMs and making them more amenable to real-world deployment with extended context capabilities. | |
| dc.identifier.coursecode | DATX05 | |
| dc.identifier.uri | http://hdl.handle.net/20.500.12380/310903 | |
| dc.language.iso | eng | |
| dc.setspec.uppsok | Technology | |
| dc.subject | Large Language Models | |
| dc.subject | Transformers | |
| dc.subject | KV Cache | |
| dc.subject | Memory Optimization | |
| dc.subject | Adaptive Memory Management | |
| dc.subject | Attention Sparsity | |
| dc.subject | Deep Learning Inference | |
| dc.subject | Resource Efficiency | |
| dc.title | Adaptive KV Cache Management for Efficient Transformer-based LLM Inference - Leveraging Attention Sparsity for Memory Optimization | |
| dc.type.degree | Examensarbete för masterexamen | sv |
| dc.type.degree | Master's Thesis | en |
| dc.type.uppsok | H | |
| local.programme | Computer systems and networks (MPCSN), MSc |

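The abstract describes Adap-KV as a layer-aware dynamic allocation scheme that sizes the KV cache per layer based on attention sparsity. The sketch below illustrates that general idea only: it measures how many attention weights in each layer fall below a small threshold and then splits a global KV cache token budget so that denser layers keep more cached key/value pairs. The function names, the sparsity threshold, and the proportional budgeting rule are illustrative assumptions, not the thesis's actual Adap-KV algorithm.

```python
# Illustrative sketch only: a layer-aware KV cache budget heuristic driven by
# attention sparsity, loosely following the idea described in the abstract.
# All names and the budgeting rule itself are assumptions for illustration.

import numpy as np


def per_layer_sparsity(attn_weights: np.ndarray, threshold: float = 0.01) -> float:
    """Fraction of attention weights below `threshold` for one layer.

    `attn_weights` has shape (heads, query_len, key_len); each row over the
    key dimension sums to 1 (softmax output).
    """
    return float((attn_weights < threshold).mean())


def allocate_kv_budget(sparsities, total_budget_tokens, min_tokens=64):
    """Split a global KV cache token budget across layers.

    Sparser layers (more near-zero attention weights) keep fewer cached
    key/value pairs; denser layers keep more, subject to a per-layer floor.
    """
    density = np.clip(1.0 - np.asarray(sparsities, dtype=np.float64), 1e-6, None)
    shares = density / density.sum()
    budgets = np.maximum(min_tokens, (shares * total_budget_tokens).astype(int))
    return budgets.tolist()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic attention maps for a 4-layer model: (heads, queries, keys).
    # Smaller Dirichlet alpha produces sparser (more peaked) attention rows.
    layers = [rng.dirichlet(np.full(256, alpha), size=(8, 32))
              for alpha in (0.05, 0.1, 0.5, 1.0)]
    sparsities = [per_layer_sparsity(a) for a in layers]
    print("per-layer sparsity:", [round(s, 3) for s in sparsities])
    print("per-layer KV budget:", allocate_kv_budget(sparsities, total_budget_tokens=4096))
```

In this toy setup the budget is proportional to attention density with a fixed per-layer floor; the thesis's real-time adjustment and eviction policy would replace this static split.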