Full System-Level Simulation of Neural Compute Architectures
| dc.contributor.author | Kalamkar, Arjun | |
| dc.contributor.department | Chalmers tekniska högskola / Institutionen för data och informationsteknik | sv |
| dc.contributor.department | Chalmers University of Technology / Department of Computer Science and Engineering | en |
| dc.contributor.examiner | Stenström, Per | |
| dc.contributor.supervisor | Stenström, Per | |
| dc.date.accessioned | 2026-01-19T08:55:02Z | |
| dc.date.issued | 2025 | |
| dc.date.submitted | ||
| dc.description.abstract | The proliferation of large-scale artificial intelligence models necessitates specialized hardware such as Neural Processing Units (NPUs) for efficient computation. However, an NPU’s real-world performance is deeply influenced by system-level effects that are often overlooked. Existing simulation tools typically cannot model a detailed NPU microarchitecture within a full-system context, obscuring critical performance bottlenecks that arise from the operating system, device drivers, and memory contention. This thesis introduces gem5-fsnpu, a novel simulation framework that bridges this gap by integrating a reconfigurable, transaction-level, cycle-accurate NPU model into the gem5 full-system simulator [1], [2]. The framework includes a complete, vertically integrated software stack, featuring a custom Linux driver and a user-space library with a hardware-aware tiling algorithm (sketched after this record), enabling realistic hardware-software co-design studies. We demonstrate the framework’s capabilities through a comprehensive design space exploration (see the sweep sketch below), evaluating NPU performance on benchmarks ranging from general matrix multiplication (GEMM) to complex Transformer layers such as Multi-Head Attention (MHA). Architectural parameters such as systolic array dimensions (2D vs. 3D), on-chip memory size, and dataflow are varied systematically. The results show that system-level overheads are frequently the dominant performance bottleneck: for command-intensive workloads such as MHA, software control-path latency can eclipse hardware computation time and become the primary performance limiter. The study also quantifies the relationship between on-chip memory size and software tiling efficiency, demonstrating that an undersized memory can nullify the benefits of a powerful compute core. This work validates the necessity of full-system simulation for accelerator design, provides a powerful tool for researchers, and shows that a holistic hardware-software co-design approach is essential to efficient AI acceleration. | |
| dc.identifier.coursecode | DATX05 | |
| dc.identifier.uri | http://hdl.handle.net/20.500.12380/310923 | |
| dc.language.iso | eng | |
| dc.setspec.uppsok | Technology | |
| dc.subject | AI | |
| dc.subject | Heterogeneous computing | |
| dc.subject | Neural Processing Unit | |
| dc.subject | Full-system simulation | |
| dc.subject | gem5 | |
| dc.title | Full System-Level Simulation of Neural Compute Architectures | |
| dc.type.degree | Examensarbete för masterexamen | sv |
| dc.type.degree | Master's Thesis | en |
| dc.type.uppsok | H | |
| local.programme | High-performance computer systems (MPHPC), MSc |
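
The abstract’s "hardware-aware tiling" can be made concrete with a minimal sketch. The function below is an illustrative assumption, not the thesis’s actual algorithm: it greedily picks GEMM tile sizes that are aligned to the systolic array and whose combined footprint fits the on-chip SRAM, which is exactly the memory-size-versus-tiling-efficiency constraint the abstract highlights. All names (`pick_tiles`, `elem_bytes`, the greedy objective) are hypothetical.

```python
# Hypothetical sketch of hardware-aware tiling for a GEMM C = A @ B.
# The search strategy, byte width, and alignment rule are assumptions;
# the thesis's real algorithm is not reproduced in this record.
from itertools import product

def pick_tiles(M, N, K, sram_bytes, pe_rows, pe_cols, elem_bytes=2):
    """Return (Tm, Tn, Tk) maximising per-tile work while fitting on-chip SRAM.

    A Tm x Tk tile of A, a Tk x Tn tile of B, and a Tm x Tn accumulator
    tile of C must all fit in on-chip memory at the same time.
    """
    best, best_work = None, -1
    # Align tile edges to the systolic array so the PEs stay fully utilised.
    for tm, tn, tk in product(range(pe_rows, M + 1, pe_rows),
                              range(pe_cols, N + 1, pe_cols),
                              range(pe_cols, K + 1, pe_cols)):
        footprint = (tm * tk + tk * tn + tm * tn) * elem_bytes
        if footprint > sram_bytes:
            continue  # tile set would overflow on-chip memory
        work = tm * tn * tk  # MACs per tile, a proxy for data reuse
        if work > best_work:
            best, best_work = (tm, tn, tk), work
    return best

# With a small SRAM the chosen tiles shrink, cutting data reuse -- the
# effect the abstract describes as an undersized memory nullifying the
# benefits of a powerful compute core.
print(pick_tiles(M=512, N=512, K=512, sram_bytes=256 * 1024,
                 pe_rows=16, pe_cols=16))
```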
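Likewise, the design space exploration described in the abstract amounts to sweeping a Cartesian product of architectural parameters and launching one full-system simulation per point. The outline below is a sketch under stated assumptions: the gem5 binary path, config script, and `--npu-*` flags are placeholders (gem5-fsnpu’s real command line is not given in this record); only the swept axes, systolic array shape (2D vs. 3D), on-chip memory size, and dataflow, come from the abstract.

```python
# Hypothetical outline of the design-space sweep described in the abstract.
# Paths and flag names below are placeholders, not gem5-fsnpu's real CLI.
import itertools
import subprocess

ARRAY_SHAPES = [(16, 16), (32, 32), (8, 8, 8)]   # 2D and 3D systolic arrays
SRAM_KIB = [64, 128, 256, 512]                   # on-chip memory sizes
DATAFLOWS = ["weight-stationary", "output-stationary"]

for shape, sram, flow in itertools.product(ARRAY_SHAPES, SRAM_KIB, DATAFLOWS):
    tag = f"{'x'.join(map(str, shape))}_{sram}KiB_{flow}"
    # Each design point boots full-system Linux, loads the NPU driver,
    # and runs the benchmark, so OS and driver overheads are captured.
    subprocess.run([
        "./build/ARM/gem5.opt",            # placeholder gem5 binary
        "configs/fsnpu/run.py",            # placeholder config script
        f"--npu-array={'x'.join(map(str, shape))}",
        f"--npu-sram-kib={sram}",
        f"--npu-dataflow={flow}",
        f"--outdir=m5out/{tag}",
    ], check=False)
```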
