Full System-Level Simulation of Neural Compute Architectures
Publicerad
Författare
Typ
Examensarbete för masterexamen
Master's Thesis
Master's Thesis
Modellbyggare
Tidskriftstitel
ISSN
Volymtitel
Utgivare
Sammanfattning
The proliferation of large-scale artificial intelligence models necessitates specialized hardware like Neural Processing Units (NPUs) to achieve efficient computation. However, an NPU’s real-world performance is deeply influenced by system-level
effects that are often overlooked. Existing simulation tools typically lack the ability to model a detailed NPU microarchitecture within a full-system context, obscuring critical performance bottlenecks arising from the operating system, device drivers, and memory contention. This thesis introduces gem5-fsnpu, a novel simulation framework that bridges this gap by integrating a reconfigurable, transaction-level cycle-accurate NPU model into the gem5 full-system simulator [1], [2]. The framework includes a complete, vertically-integrated software stack, featuring a custom Linux driver and a user-space library with an intelligent, hardware-aware tiling algorithm, enabling realistic hardware-software co-design studies. We demonstrate the framework’s capabilities through a comprehensive Design Space Exploration, evaluating NPU performance on benchmarks including general matrix multiplication (GEMM) and complex Transformer layers like Multi-Head Attention (MHA). Architectural parameters such as systolic array dimensions (2D vs. 3D), on-chip memory size, and dataflow are systematically varied. The results reveal that system-level overheads are frequently the dominant performance bottleneck. For
instance, the framework shows how for command-intensive workloads like MHA, the software control path latency can eclipse the hardware computation time, becoming the primary performance limiter. The study also quantifies the critical relationship between on-chip memory size and software tiling efficiency, demonstrating that an undersized memory can nullify the benefits of a powerful compute core. This work validates the necessity of full-system simulation for accelerator design and provides a powerful tool for researchers, proving that a holistic, hardware-software co-design approach is paramount to achieving efficient AI acceleration.
Beskrivning
Ämne/nyckelord
AI, Heterogeneous computing, Neural Processing Unit, Full-system simulation, gem5
