Development of multi-GPU parallelization for a DEM solver: A parallelization extension for an existing state of the art DEM solver
Published
Author
Type
Master's Thesis
Model builder
Journal title
ISSN
Volume title
Publisher
Abstract
The thesis presents a multi-GPU parallelization extension for an existing single-GPU
Discrete Element Method (DEM) solver. The implementation extends the solver's capability
to simulate large particle populations, narrowing the gap between simulations and
real-world particulate systems. The code is developed with high-performance computing
(HPC) in mind: the overhead added by the parallelization is kept low by minimizing
the total number of communication points between the GPUs.
The computational domain is divided among the GPUs by splitting physical space
along one of the three Cartesian axes. Although topologically simple, this decomposition
gives each GPU few communication points and makes transfers between GPUs efficient,
since memory locality is trivially achieved. The HPC GPU clusters targeted by the
solver generally have 4 to 8 GPUs, a range for which the one-dimensional domain
decomposition is well suited in most cases.
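The one-dimensional decomposition described above can be illustrated with a minimal sketch. It is not taken from the thesis code: the function names (`uniform_borders`, `owner`, `halo_neighbours`), the uniform initial split, and the `cutoff` interaction distance are assumptions made for illustration. The sketch shows why each slab has at most two neighbours, and therefore few communication points.

```python
from bisect import bisect_right

def uniform_borders(lo, hi, n_devices):
    """Interior borders that split [lo, hi] into n_devices equal slabs (assumed initial split)."""
    width = (hi - lo) / n_devices
    return [lo + width * i for i in range(1, n_devices)]

def owner(position, borders, axis=0):
    """Index of the slab (device) that owns a particle, based on one coordinate only."""
    return bisect_right(borders, position[axis])

def halo_neighbours(position, borders, cutoff, axis=0):
    """Neighbouring slabs that would need a copy of this particle.

    A particle closer than `cutoff` (hypothetical interaction distance) to a
    border can interact with particles on the other side of that border.
    """
    x = position[axis]
    dev = owner(position, borders, axis)
    neighbours = []
    if dev > 0 and x - borders[dev - 1] < cutoff:
        neighbours.append(dev - 1)      # close to the lower border
    if dev < len(borders) and borders[dev] - x < cutoff:
        neighbours.append(dev + 1)      # close to the upper border
    return neighbours

borders = uniform_borders(0.0, 4.0, 4)             # four devices along the x axis
print(owner((0.5, 0.0, 0.0), borders))
print(halo_neighbours((1.05, 0.0, 0.0), borders, cutoff=0.1))
```

Because ownership depends on a single coordinate, a device only ever exchanges data with the slab below and the slab above it, regardless of how many devices are used.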
A load balancing scheme has been developed that dynamically shifts the domain
borders to distribute the computational load between the devices. The scheme aims
for equal simulation time across the GPUs. It measures and monitors the execution
time of some key operations in the DEM algorithm and incrementally shifts the domain
borders until all solvers have close to equal execution times for these operations.
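The incremental border shifting can be sketched as follows. This is one possible update rule under stated assumptions, not the scheme from the thesis: the function name `rebalance`, the relative-imbalance measure, and the `gain` parameter are all hypothetical. Each interior border moves a small step so that the slower of its two adjacent devices gets a smaller slab.

```python
def rebalance(borders, times, lo, hi, gain=0.1):
    """One incremental update of the interior borders (illustrative rule only).

    borders: sorted interior borders, one fewer than the number of devices
    times:   measured execution time per device for the monitored operations
    gain:    step size, assumed < 0.5 so that neighbouring borders cannot cross
    """
    edges = [lo] + list(borders) + [hi]
    updated = []
    for i, b in enumerate(borders):
        t_left, t_right = times[i], times[i + 1]
        total = t_left + t_right
        # Relative imbalance in [-1, 1]; positive means the left device is slower.
        imbalance = (t_left - t_right) / total if total > 0 else 0.0
        if imbalance > 0:
            # Shrink the left slab: move the border towards lower coordinates.
            shift = -gain * imbalance * (b - edges[i])
        else:
            # Shrink the right slab: move the border towards higher coordinates.
            shift = -gain * imbalance * (edges[i + 2] - b)
        updated.append(b + shift)
    return updated

# The first device is twice as slow as the others, so its slab shrinks slightly.
print(rebalance([1.0, 2.0, 3.0], times=[2.0, 1.0, 1.0, 1.0], lo=0.0, hi=4.0))
```

Calling such a function every few time steps with freshly measured timings moves the borders gradually, which avoids large particle migrations between devices in a single step.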
Performance measurements were carried out on Amazon Web Services Accelerated
Computing instances with systems ranging from 4 to 8 GPUs. The total cost of the
parallelization relative to total execution time ranges from 2.6% to 6.5%, growing
with the number of connected GPUs. The implementation of the parallelization scheme
is therefore deemed efficient and successful. The algorithm is verified and
benchmarked on three cases. The verification shows that the physics of the single-GPU
solver is preserved in the multi-GPU solver. Dynamic load balancing is shown to be
advantageous over static decomposition, and the optimization scheme for the balancing
is verified on a simulation case with dynamic particle behavior. The overall scaling
of the algorithm is studied by benchmarking and monitoring the cost of the different
steps of the DEM algorithm. For certain steps that are part of the original single-GPU
solver, the scaling is worse than for the added implementation steps. This is analyzed
and attributed to the memory schemes used in peer-to-peer mode on the GPUs, and it
will require further attention in future work.
Description
Subject/keywords
Discrete Element Method, Parallelization, GPU, multi-GPU, HPC, Domain decomposition, Dynamic domain decomposition