Reducing MPI Communication Latency with FPGA-Based Hardware Compression
| dc.contributor.author | BOURBIA, ANIS | |
| dc.contributor.department | Chalmers tekniska högskola / Institutionen för data och informationsteknik | sv |
| dc.contributor.department | Chalmers University of Technology / Department of Computer Science and Engineering | en |
| dc.contributor.examiner | Petersen Moura Trancoso, Pedro | |
| dc.contributor.supervisor | Vázquez Maceiras, Mateo | |
| dc.date.accessioned | 2026-01-19T09:01:07Z | |
| dc.date.issued | 2025 | |
| dc.date.submitted | | |
| dc.description.abstract | High-performance computing (HPC) clusters face significant communication overhead in distributed deep learning, where frequent data exchanges via the Message Passing Interface (MPI) can bottleneck overall training. This thesis explores an FPGA-based hardware compression approach to reducing MPI communication latency. We prototype the integration of an FPGA compression module into the MPI stack, enabling on-the-fly compression of message payloads with the fast lossless algorithms LZ4, Snappy, and Zstd. This hardware-accelerated compression offloads work from CPUs and GPUs and shrinks data volume before network transmission, thereby speeding up inter-node communication. In our evaluation on representative deep learning workloads, LZ4, Snappy, and Zstd achieved compression ratios of 1.53x, 1.51x, and 1.84x and reduced communication time by 34.6%, 33.8%, and 45.7%, yielding end-to-end training speedups of 1.34x, 1.32x, and 1.50x, respectively. Among the tested compressors, Zstd achieved the highest compression ratio, translating into the greatest latency reduction and performance gain. These results show that FPGA-based compression can substantially improve throughput in distributed training by alleviating network delays, with negligible added overhead. The proposed method offers a practical path to accelerating HPC communications and scaling deep learning workloads more efficiently. | |
| dc.identifier.coursecode | DATX05 | |
| dc.identifier.uri | http://hdl.handle.net/20.500.12380/310924 | |
| dc.language.iso | eng | |
| dc.setspec.uppsok | Technology | |
| dc.subject | computer science | |
| dc.subject | engineering | |
| dc.subject | project | |
| dc.subject | thesis | |
| dc.subject | compression | |
| dc.subject | FPGA | |
| dc.subject | GPU | |
| dc.subject | acceleration | |
| dc.subject | HPC | |
| dc.subject | DNN | |
| dc.subject | lz4 | |
| dc.subject | zstd | |
| dc.subject | snappy | |
| dc.subject | lossless compression | |
| dc.subject | MPI | |
| dc.subject | networking | |
| dc.subject | smart-NIC | |
| dc.title | Reducing MPI Communication Latency with FPGA-Based Hardware Compression | |
| dc.type.degree | Examensarbete för masterexamen | sv |
| dc.type.degree | Master's Thesis | en |
| dc.type.uppsok | H | |
| local.programme | High-performance computer systems (MPHPC), MSc |
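The abstract's compress-before-send pipeline can be sketched in plain software. The thesis offloads compression to an FPGA module using LZ4, Snappy, or Zstd; the minimal sketch below stands in Python's stdlib `zlib` for the hardware compressor and a simple list for the network link, purely to illustrate the pattern of shrinking a payload before transmission and restoring it losslessly on receipt. The function names and the gradient-like payload are illustrative assumptions, not the thesis's actual API.

```python
import zlib

def compressed_send(payload: bytes, link: list) -> None:
    # Compress the message payload before it crosses the "network",
    # mirroring the compress-before-MPI_Send step in the thesis.
    # (zlib stands in for the FPGA-side LZ4/Snappy/Zstd engine.)
    link.append(zlib.compress(payload))

def compressed_recv(link: list) -> bytes:
    # Decompress on the receiving side, restoring the original bytes.
    return zlib.decompress(link.pop(0))

# Gradient-like payload: repetitive byte patterns compress well losslessly.
message = b"\x00\x3f\x80\x00" * 4096
link: list = []                    # stands in for the network link
compressed_send(message, link)
wire_bytes = len(link[0])          # bytes actually sent over the wire
received = compressed_recv(link)

assert received == message         # lossless round trip
print(f"compression ratio: {len(message) / wire_bytes:.2f}x")
```

The pattern matters because only `wire_bytes` traverse the network: any ratio above 1x directly cuts transfer time, which is the latency reduction the thesis measures for each compressor.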
