JIT-Based RVV Optimization for Implicit GEMM Convolution and FlashAttention - With Cross-Architecture Autotuning, a Winograd F(4,3) Ablation, and AVX2 / AVX-512 / NEON Baselines

Type

Master's Thesis

Abstract

Convolutional Neural Networks (CNNs) are fundamental to modern deep learning applications, including image classification, object detection, and autonomous systems. The computational efficiency of convolution operations is critical for real-time inference on resource-constrained devices. Traditional implementations rely on the im2col transformation followed by General Matrix Multiplication (GEMM), which incurs significant memory overhead due to explicit tensor expansion. This thesis presents the Implicit GEMM convolution algorithm, which eliminates the im2col memory overhead by computing input coordinates dynamically during matrix multiplication. We implement and optimize this algorithm across four CPU vector backends (x86-64 with AVX2 and AVX-512, ARM with NEON, and RISC-V with the Vector Extension, RVV) and complement it with a JIT-based FlashAttention kernel and an RVV Winograd F(4,3) ablation study.

The contributions of this thesis are:

1. A lightweight Just-In-Time (JIT) code generation framework for RVV that emits register-blocked, vector-length-agnostic Implicit GEMM micro-kernels at runtime, with loop unrolling and explicit register assignment.
2. A Vector Length Agnostic (VLA) RVV implementation evaluated across six VLEN configurations (128–8192 bits) on the gem5 simulator and on the BananaPi-F3 development board.
3. A hand-written AVX2 micro-kernel (6×16) and a portable ARM NEON micro-kernel (8×8) that share the same Implicit GEMM design, used to validate cross-architecture portability.
4. A lightweight RVV autotuner over {MR, LMUL, k-unroll} with register-budget pruning, contrasted with the AVX-512 autotuner to identify which tuning knobs transfer across architectures and which do not.
5. A JIT FlashAttention kernel as a non-convolution case study, and an RVV Winograd F(4,3) five-way ablation that separates algorithmic gains from implementation-level effects.

The implementations are integrated into the Intel oneDNN framework.
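
The idea of computing input coordinates during the matrix multiplication, rather than materializing an im2col buffer, can be illustrated with a minimal scalar sketch. This is not the thesis's kernel: the function name, the flat CHW/KCRS layouts, and the loop order are assumptions chosen for readability, and no vectorization, register blocking, or JIT generation is shown.

```python
def conv2d_implicit_gemm(x, w, in_ch, height, width, out_ch, kh, kw,
                         pad=0, stride=1):
    """Convolution expressed as a GEMM whose B operand is never stored.

    x: flat input in (channel, row, col) order (assumed layout).
    w: flat weights in (out_ch, in_ch, kh, kw) order (assumed layout).
    Returns a flat output in (out_ch, out_row, out_col) order.
    """
    out_h = (height + 2 * pad - kh) // stride + 1
    out_w = (width + 2 * pad - kw) // stride + 1
    gemm_m = out_ch              # rows of the weight matrix
    gemm_n = out_h * out_w       # one column per output pixel
    gemm_k = in_ch * kh * kw     # reduction dimension

    out = [0.0] * (gemm_m * gemm_n)
    for m in range(gemm_m):
        for n in range(gemm_n):
            oh, ow = divmod(n, out_w)
            acc = 0.0
            for k in range(gemm_k):
                # Decode the reduction index into (channel, filter row, filter col)
                c, rs = divmod(k, kh * kw)
                r, s = divmod(rs, kw)
                # Input coordinates are derived on the fly; an im2col
                # implementation would have stored this element explicitly.
                ih = oh * stride - pad + r
                iw = ow * stride - pad + s
                if 0 <= ih < height and 0 <= iw < width:
                    acc += w[m * gemm_k + k] * x[(c * height + ih) * width + iw]
            out[m * gemm_n + n] = acc
    return out
```

The only extra work compared with a plain GEMM is the index decoding in the inner loop; in exchange, no buffer of size `gemm_k * gemm_n` is allocated.
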
On x86-64, the AVX2 Implicit GEMM achieves a peak of 164.68 GFLOPS (9.5× over oneDNN's gemm_convolution), eliminating a 56.85 MB im2col buffer. Extended to AVX-512 with NUMA-aware tiling and a per-layer autotuner, the same design reaches 599 GFLOPS averaged over five VGG-16 layers and a single-layer peak of 1161 GFLOPS, with cross-network speedups of 22×–228× over a non-vectorized scalar baseline. On ARM NEON, the peak is 81.0 GFLOPS (15.4× over a scalar reference). On RVV, the JIT kernel delivers up to 3.28× over the scalar reference (gem5, VLEN=256) and is validated on the real BananaPi-F3 hardware (3.08×). Across the two autotuners, every layer except a few narrow-channel cases independently selects MR=8, indicating that the dominant tile parameter is governed by the architectural vector-register budget rather than by ISA-specific concerns. This work demonstrates the effectiveness of Implicit GEMM as a memory-efficient alternative to traditional convolution methods, with particular relevance for emerging RISC-V platforms where optimized deep learning libraries remain limited, and provides a cross-architecture analysis of which optimization decisions transfer and which do not.
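
Register-budget pruning over an {MR, LMUL, k-unroll} search space could look roughly like the sketch below. The abstract does not give the thesis's cost model, so the one used here is hypothetical: it assumes MR accumulator register groups plus one B-vector group per unrolled k step, each group occupying LMUL of the 32 architectural vector registers that RVV provides. The candidate values are likewise illustrative.

```python
from itertools import product

NUM_VREGS = 32  # architectural vector registers in RVV


def register_cost(mr, lmul, k_unroll):
    # Hypothetical cost model (an assumption, not taken from the thesis):
    # MR accumulator groups plus one B-vector group per unrolled k step,
    # with every group spanning LMUL physical registers.
    return (mr + k_unroll) * lmul


def prune(candidates, budget=NUM_VREGS):
    # Keep only configurations whose register demand fits the budget,
    # so the autotuner never benchmarks a kernel that would spill.
    return [c for c in candidates if register_cost(*c) <= budget]


# Illustrative search space: (MR, LMUL, k_unroll)
candidates = list(product((4, 8, 16), (1, 2, 4, 8), (1, 2, 4)))
feasible = prune(candidates)
```

Under this assumed model, a configuration such as (MR=8, LMUL=2, k-unroll=1) fits the budget, while (MR=8, LMUL=4, k-unroll=1) does not, so only the former would be timed.
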

Subject/keywords

Implicit GEMM, Convolution, Neural Networks, oneDNN, JIT Compilation, Cross-Architecture Optimization, Autotuning, AVX2, AVX-512, ARM NEON, RISC-V Vector Extension, FlashAttention, Winograd, High Performance Computing
