Adaptive KV Cache Management for Efficient Transformer-based LLM Inference - Leveraging Attention Sparsity for Memory Optimization
| dc.contributor.author | Xu, Dikai | |
| dc.contributor.department | Chalmers tekniska högskola / Institutionen för data och informationsteknik | sv |
| dc.contributor.department | Chalmers University of Technology / Department of Computer Science and Engineering | en |
| dc.contributor.examiner | Ali-Eldin Hassan, Ahmed | |
| dc.contributor.supervisor | Ali-Eldin Hassan, Ahmed | |
| dc.date.accessioned | 2026-01-16T08:47:00Z | |
| dc.date.issued | 2025 | |
| dc.date.submitted | | |
| dc.description.abstract | This Master’s thesis addresses the critical challenge of memory inefficiency in Transformer-based Large Language Models (LLMs) during inference, specifically focusing on the prohibitive memory footprint of the Key-Value (KV) cache. As LLMs scale, the KV cache becomes a significant bottleneck, limiting longer context windows and overall operational efficiency. To mitigate this issue, we propose and evaluate Adap-KV, a novel adaptive memory management strategy for the KV cache. Adap-KV employs a layer-aware dynamic allocation approach that adjusts KV cache size in real time, leveraging insights from attention sparsity patterns. Our method aims to optimize memory utilization without compromising the performance or quality of LLM inference. Experimental results demonstrate that Adap-KV significantly reduces KV cache memory consumption, thereby enhancing the efficiency and scalability of Transformer-based LLMs and making them more amenable to real-world deployment with extended context capabilities. | |
| dc.identifier.coursecode | DATX05 | |
| dc.identifier.uri | http://hdl.handle.net/20.500.12380/310903 | |
| dc.language.iso | eng | |
| dc.setspec.uppsok | Technology | |
| dc.subject | Large Language Models | |
| dc.subject | Transformers | |
| dc.subject | KV Cache | |
| dc.subject | Memory Optimization | |
| dc.subject | Adaptive Memory Management | |
| dc.subject | Attention Sparsity | |
| dc.subject | Deep Learning Inference | |
| dc.subject | Resource Efficiency | |
| dc.title | Adaptive KV Cache Management for Efficient Transformer-based LLM Inference - Leveraging Attention Sparsity for Memory Optimization | |
| dc.type.degree | Examensarbete för masterexamen | sv |
| dc.type.degree | Master's Thesis | en |
| dc.type.uppsok | H | |
| local.programme | Computer systems and networks (MPCSN), MSc |

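The abstract describes Adap-KV as a layer-aware dynamic allocation scheme that sizes the KV cache per layer based on attention sparsity. The sketch below illustrates that general idea only: it measures how many attention weights in each layer fall below a small threshold and then splits a global KV cache token budget so that denser layers keep more cached key/value pairs. The function names, the sparsity threshold, and the proportional budgeting rule are illustrative assumptions, not the thesis's actual Adap-KV algorithm.

```python
# Illustrative sketch only: a layer-aware KV cache budget heuristic driven by
# attention sparsity, loosely following the idea described in the abstract.
# All names and the budgeting rule itself are assumptions for illustration.

import numpy as np


def per_layer_sparsity(attn_weights: np.ndarray, threshold: float = 0.01) -> float:
    """Fraction of attention weights below `threshold` for one layer.

    `attn_weights` has shape (heads, query_len, key_len); each row over the
    key dimension sums to 1 (softmax output).
    """
    return float((attn_weights < threshold).mean())


def allocate_kv_budget(sparsities, total_budget_tokens, min_tokens=64):
    """Split a global KV cache token budget across layers.

    Sparser layers (more near-zero attention weights) keep fewer cached
    key/value pairs; denser layers keep more, subject to a per-layer floor.
    """
    density = np.clip(1.0 - np.asarray(sparsities, dtype=np.float64), 1e-6, None)
    shares = density / density.sum()
    budgets = np.maximum(min_tokens, (shares * total_budget_tokens).astype(int))
    return budgets.tolist()


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic attention maps for a 4-layer model: (heads, queries, keys).
    # Smaller Dirichlet alpha produces sparser (more peaked) attention rows.
    layers = [rng.dirichlet(np.full(256, alpha), size=(8, 32))
              for alpha in (0.05, 0.1, 0.5, 1.0)]
    sparsities = [per_layer_sparsity(a) for a in layers]
    print("per-layer sparsity:", [round(s, 3) for s in sparsities])
    print("per-layer KV budget:", allocate_kv_budget(sparsities, total_budget_tokens=4096))
```

In this toy setup the budget is proportional to attention density with a fixed per-layer floor; the thesis's real-time adjustment and eviction policy would replace this static split.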