JIT-Based RVV Optimization for Implicit GEMM Convolution and FlashAttention - With Cross-Architecture Autotuning, a Winograd F(4,3) Ablation, and AVX2 / AVX-512 / NEON Baselines
Published
Author
Type
Master's Thesis
Abstract
Convolutional Neural Networks (CNNs) are fundamental to modern deep learning
applications, including image classification, object detection, and autonomous systems. The computational efficiency of convolution operations is critical for real-time
inference on resource-constrained devices. Traditional implementations rely on the
im2col transformation followed by General Matrix Multiplication (GEMM), which
incurs significant memory overhead due to explicit tensor expansion.
This thesis presents the Implicit GEMM convolution algorithm, which eliminates
the im2col memory overhead by computing input coordinates dynamically during
matrix multiplication. We implement and optimize this algorithm across four
CPU vector backends—x86-64 with AVX2 and AVX-512, ARM with NEON, and
RISC-V with the Vector Extension (RVV)—and complement it with a JIT-based
FlashAttention kernel and an RVV Winograd F(4,3) ablation study.
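The idea of computing input coordinates during the matrix multiplication can be illustrated with a minimal scalar sketch. It is not the thesis's JIT kernel: the function name, the nested-list tensor layout, and the loop order are assumptions chosen for readability. The point is only that the GEMM reduction index k is decoded into (channel, kernel row, kernel column) on the fly, so no im2col buffer is ever materialized.

```python
def implicit_gemm_conv2d(x, w, stride=1, pad=0):
    """Scalar Implicit GEMM sketch (hypothetical helper, not the thesis code).

    x: input as nested lists  [C][H][W]
    w: weights as nested lists [OC][C][KH][KW]
    Returns the output as nested lists [OC][OH][OW].
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    OC, KH, KW = len(w), len(w[0][0]), len(w[0][0][0])
    OH = (H + 2 * pad - KH) // stride + 1
    OW = (W + 2 * pad - KW) // stride + 1

    # GEMM view: M = OC rows, N = OH*OW columns, K = C*KH*KW reduction length.
    N, K = OH * OW, C * KH * KW
    out = [[[0.0] * OW for _ in range(OH)] for _ in range(OC)]

    for m in range(OC):
        for n in range(N):
            oh, ow = divmod(n, OW)            # GEMM column -> output pixel
            acc = 0.0
            for k in range(K):
                c, rem = divmod(k, KH * KW)   # GEMM reduction index -> channel
                kh, kw = divmod(rem, KW)      # ... and kernel offset
                ih = oh * stride + kh - pad   # input coordinates computed on the fly
                iw = ow * stride + kw - pad
                if 0 <= ih < H and 0 <= iw < W:   # out-of-range reads act as zero padding
                    acc += w[m][c][kh][kw] * x[c][ih][iw]
            out[m][oh][ow] = acc
    return out
```

An optimized kernel would block the m and n loops into register tiles and vectorize along n; the index decoding above is the only part this sketch is meant to show.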
The contributions of this thesis are:
1. A lightweight Just-In-Time (JIT) code generation framework for RVV that
emits register-blocked, vector-length-agnostic Implicit GEMM micro-kernels at
runtime, with loop unrolling and explicit register assignment.
2. A Vector Length Agnostic (VLA) RVV implementation evaluated across six
VLEN configurations (128–8192 bits) on the gem5 simulator and on the
BananaPi-F3 development board.
3. A hand-written AVX2 micro-kernel (6×16) and a portable ARM NEON micro-kernel
(8×8) that share the same Implicit GEMM design, used to validate
cross-architecture portability.
4. A lightweight RVV autotuner over {MR, LMUL, k-unroll} with register-budget
pruning, contrasted with the AVX-512 autotuner to identify which tuning
knobs transfer across architectures and which do not.
5. A JIT FlashAttention kernel as a non-convolution case study, and an RVV
Winograd F(4,3) five-way ablation that separates algorithmic gains from
implementation-level effects.
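The register-budget pruning mentioned in contribution 4 can be sketched as a filter over the {MR, LMUL, k-unroll} search space. The cost model here is an assumption made for illustration, not the thesis's actual accounting: each of the MR accumulator rows and one loaded vector operand is taken to occupy one register group of LMUL registers, and k-unroll is assumed not to consume extra registers. The candidate value sets are likewise hypothetical. The only hard fact used is that RVV provides 32 architectural vector registers.

```python
from itertools import product

NUM_VREGS = 32  # RVV architectural vector registers


def registers_needed(mr, lmul):
    # Assumed model: MR accumulator groups plus one group for the loaded
    # vector operand, each group spanning LMUL registers.
    return (mr + 1) * lmul


def candidate_configs(mr_values=(1, 2, 4, 6, 8),
                      lmul_values=(1, 2, 4, 8),
                      k_unroll_values=(1, 2, 4)):
    """Enumerate (MR, LMUL, k_unroll) tuples that fit the register budget."""
    kept = []
    for mr, lmul, k_unroll in product(mr_values, lmul_values, k_unroll_values):
        if registers_needed(mr, lmul) <= NUM_VREGS:
            kept.append((mr, lmul, k_unroll))
    return kept
```

Under this model, MR=8 with LMUL=2 needs 18 registers and is kept, while MR=8 with LMUL=4 would need 36 and is pruned before any timing run. A real autotuner would then benchmark only the surviving configurations per layer.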
The implementations are integrated into the Intel oneDNN framework. On x86-64,
the AVX2 Implicit GEMM achieves a peak of 164.68 GFLOPS (9.5× over
oneDNN’s gemm_convolution), eliminating a 56.85 MB im2col buffer. Extended to
AVX-512 with NUMA-aware tiling and a per-layer autotuner, the same design reaches
599 GFLOPS averaged over five VGG-16 layers and a single-layer peak of 1161
GFLOPS, with cross-network speedups of 22×–228× over a non-vectorized scalar
baseline. On ARM NEON, the peak is 81.0 GFLOPS (15.4× over a scalar reference).
On RVV, the JIT kernel delivers up to 3.28× over the scalar reference (gem5,
VLEN=256) and is validated on the real BananaPi-F3 hardware (3.08×). Across the
two autotuners, every layer except a few narrow-channel cases independently selects
MR=8, indicating that the dominant tile parameter is governed by the architectural
vector-register budget rather than by ISA-specific concerns.
This work demonstrates the effectiveness of Implicit GEMM as a memory-efficient
alternative to traditional convolution methods, with particular relevance for emerging
RISC-V platforms where optimized deep learning libraries remain limited, and
provides a cross-architecture analysis of which optimization decisions transfer and
which do not.
Description
Subject/keywords
Implicit GEMM, Convolution, Neural Networks, oneDNN, JIT Compilation, Cross-Architecture Optimization, Autotuning, AVX2, AVX-512, ARM NEON, RISC-V Vector Extension, FlashAttention, Winograd, High Performance Computing
