Reducing MPI Communication Latency with FPGA-Based Hardware Compression
Type
Master's Thesis
Abstract
High-performance computing (HPC) clusters face significant communication overhead in distributed deep learning, where frequent data exchanges via the Message Passing Interface (MPI) can bottleneck overall training. This thesis explores FPGA-based hardware compression as a way to reduce MPI communication latency. We prototype the integration of an FPGA compression module into the MPI stack, enabling on-the-fly compression of message payloads using the fast lossless algorithms LZ4, Snappy, and Zstd. This hardware-accelerated compression offloads work from CPUs and GPUs and shrinks data volume before network transmission, thereby speeding up inter-node communication. On representative deep learning workloads, LZ4, Snappy, and Zstd achieved compression ratios of 1.53x, 1.51x, and 1.84x and reduced communication time by 34.6%, 33.8%, and 45.7%, yielding end-to-end training speedups of 1.34x, 1.32x, and 1.50x, respectively. Among the tested compressors, Zstd achieved the highest compression ratio, translating to the greatest latency reduction and performance gain. These results show that FPGA-based compression can substantially improve throughput in distributed training by alleviating network delays, with negligible added overhead. The proposed method offers a practical path to accelerating HPC communication and scaling deep learning workloads more efficiently.
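The reported communication-time reductions follow directly from the measured compression ratios if one assumes (our simplification, for illustration) that transfer time scales with bytes on the wire and that compression overhead is negligible:

    reduction = 1 - 1/r

For Zstd's ratio r = 1.84 this gives 1 - 1/1.84 = 0.457, matching the reported 45.7%; likewise 1 - 1/1.53 = 34.6% for LZ4 and 1 - 1/1.51 = 33.8% for Snappy.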
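To illustrate the data path described above, the following C sketch wraps MPI point-to-point calls with software LZ4, standing in for the FPGA compressor the thesis integrates into the MPI stack. This is a minimal sketch under our own assumptions, not the thesis implementation; the helper names send_compressed and recv_compressed are hypothetical, and error handling is omitted.

#include <mpi.h>
#include <lz4.h>
#include <stdlib.h>

/* Hypothetical helper (not from the thesis): compress len bytes of src
 * with LZ4, then send the original length followed by the compressed
 * payload. In the thesis, this compression step is offloaded to an FPGA. */
static void send_compressed(const char *src, int len, int dest, int tag,
                            MPI_Comm comm)
{
    int cap = LZ4_compressBound(len);            /* worst-case output size */
    char *dst = malloc((size_t)cap);
    int clen = LZ4_compress_default(src, dst, len, cap);
    MPI_Send(&len, 1, MPI_INT, dest, tag, comm); /* receiver buffer size */
    MPI_Send(dst, clen, MPI_BYTE, dest, tag, comm);
    free(dst);
}

/* Matching receive: probe for the compressed size, then decompress into
 * a caller-provided buffer of at least the original length. */
static void recv_compressed(char *out, int src_rank, int tag, MPI_Comm comm)
{
    int orig_len, clen;
    MPI_Status st;
    MPI_Recv(&orig_len, 1, MPI_INT, src_rank, tag, comm, MPI_STATUS_IGNORE);
    MPI_Probe(src_rank, tag, comm, &st);
    MPI_Get_count(&st, MPI_BYTE, &clen);
    char *buf = malloc((size_t)clen);
    MPI_Recv(buf, clen, MPI_BYTE, src_rank, tag, comm, MPI_STATUS_IGNORE);
    LZ4_decompress_safe(buf, out, clen, orig_len);
    free(buf);
}

Sending the uncompressed length ahead of the payload lets the receiver size its output buffer; the compressed length itself is recovered with MPI_Probe and MPI_Get_count rather than a second header field.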
Subject/keywords
Computer science, engineering, project, thesis, compression, FPGA, GPU, acceleration, HPC, DNN, LZ4, Zstd, Snappy, lossless compression, MPI, networking, smart-NIC
