 

Adaptive KV Cache Management for Efficient Transformer-based LLM Inference - Leveraging Attention Sparsity for Memory Optimization

dc.contributor.author: Xu, Dikai
dc.contributor.department: Chalmers tekniska högskola / Institutionen för data och informationsteknik (sv)
dc.contributor.department: Chalmers University of Technology / Department of Computer Science and Engineering (en)
dc.contributor.examiner: Ali-Eldin Hassan, Ahmed
dc.contributor.supervisor: Ali-Eldin Hassan, Ahmed
dc.date.accessioned: 2026-01-16T08:47:00Z
dc.date.issued: 2025
dc.date.submitted:
dc.description.abstract: This Master's thesis addresses the critical challenge of memory inefficiency in Transformer-based Large Language Models (LLMs) during inference, specifically focusing on the prohibitive memory footprint of the Key-Value (KV) cache. As LLMs scale, the KV cache becomes a significant bottleneck, limiting longer context windows and overall operational efficiency. To mitigate this issue, we propose and evaluate Adap-KV, a novel adaptive memory management strategy for the KV cache. Adap-KV employs a layer-aware dynamic allocation approach that intelligently adjusts KV cache size in real time, leveraging insights from attention sparsity patterns. Our method aims to optimize memory utilization without compromising the performance or quality of LLM inference. Experimental results demonstrate that Adap-KV significantly reduces KV cache memory consumption, thereby enhancing the efficiency and scalability of Transformer-based LLMs and making them more amenable to real-world deployments with extended context capabilities.
dc.identifier.coursecode: DATX05
dc.identifier.uri: http://hdl.handle.net/20.500.12380/310903
dc.language.iso: eng
dc.setspec.uppsok: Technology
dc.subject: Large Language Models
dc.subject: Transformers
dc.subject: KV Cache
dc.subject: Memory Optimization
dc.subject: Adaptive Memory Management
dc.subject: Attention Sparsity
dc.subject: Deep Learning Inference
dc.subject: Resource Efficiency
dc.title: Adaptive KV Cache Management for Efficient Transformer-based LLM Inference - Leveraging Attention Sparsity for Memory Optimization
dc.type.degree: Examensarbete för masterexamen (sv)
dc.type.degree: Master's Thesis (en)
dc.type.uppsok: H
local.programme: Computer systems and networks (MPCSN), MSc
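
The abstract above describes Adap-KV only at a high level; the full algorithm is given in the thesis itself. Purely as an illustration of what layer-aware, sparsity-driven KV cache sizing can look like, the following Python sketch estimates per-layer attention sparsity, splits a global token budget across layers in proportion to attention density, and evicts the least-attended cached positions. All names (layer_sparsity, allocate_budgets, evict_kv), the proportional budget rule, and the eviction criterion are assumptions made for this sketch, not Adap-KV's actual policy.

# Illustrative sketch only: the abstract does not specify Adap-KV's exact
# algorithm, so the thresholds and the budget rule below are assumptions
# chosen to illustrate layer-aware, sparsity-driven KV cache sizing.
import numpy as np

def layer_sparsity(attn_weights: np.ndarray, eps: float = 1e-3) -> float:
    """Fraction of near-zero entries in one layer's softmax-normalised
    attention weights of shape (heads, query_len, key_len)."""
    return float((attn_weights < eps).mean())

def allocate_budgets(per_layer_attn: list, total_budget: int) -> list:
    """Split a global KV-cache token budget across layers.

    Sparser layers (whose keys are barely attended to) get a smaller share;
    denser layers keep more of their KV entries. The proportional rule here
    is a placeholder, not the thesis' actual policy.
    """
    density = np.array([1.0 - layer_sparsity(a) for a in per_layer_attn])
    density = np.maximum(density, 1e-6)          # avoid zero-size caches
    shares = density / density.sum()
    return [max(1, int(round(s * total_budget))) for s in shares]

def evict_kv(keys: np.ndarray, values: np.ndarray,
             attn_weights: np.ndarray, budget: int):
    """Keep only the `budget` cached positions with the highest cumulative attention."""
    # Total attention received by each key position, summed over heads and queries.
    importance = attn_weights.sum(axis=(0, 1))            # (key_len,)
    keep = np.sort(np.argsort(importance)[-budget:])      # preserve positional order
    return keys[:, keep, :], values[:, keep, :]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_layers, heads, q_len, k_len, d = 4, 8, 16, 128, 64
    attn = [rng.dirichlet(np.ones(k_len), size=(heads, q_len)) for _ in range(n_layers)]
    budgets = allocate_budgets(attn, total_budget=256)
    print("per-layer budgets:", budgets)
    k = rng.standard_normal((heads, k_len, d)).astype(np.float32)
    v = rng.standard_normal((heads, k_len, d)).astype(np.float32)
    k2, v2 = evict_kv(k, v, attn[0], budgets[0])
    print("layer 0 cache:", k.shape, "->", k2.shape)

In a real serving stack this logic would operate on the inference framework's KV tensors at each decoding step rather than on the NumPy placeholders used here.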

Download

Original bundle (showing 1 - 1 of 1)
Name: CSE 25-152 DX.pdf
Size: 1.06 MB
Format: Adobe Portable Document Format

License bundle (showing 1 - 1 of 1)
Name: license.txt
Size: 2.35 KB
Format: Item-specific license agreed upon to submission