Development of multi-GPU parallelization for a DEM solver: A parallelization extension for an existing state of the art DEM solver

Type

Master's Thesis

Abstract

This thesis presents a multi-GPU parallelization extension for an existing single-GPU Discrete Element Method (DEM) solver. The implementation extends the solver's capability to simulate large particle populations, narrowing the gap between simulations and real-world particulate systems. The code is developed with high-performance computing (HPC) in mind: the overhead introduced by the parallelization is kept low by minimizing the total number of communication points between the GPUs. The computational domain is divided among the GPUs by splitting physical space along one of the three Cartesian axes. Although topologically simple, this decomposition gives each GPU few communication points and makes transfers between GPUs efficient, since memory locality is trivially achieved. The HPC GPU clusters targeted by the solver generally have 4-8 GPUs, for which the one-dimensional domain decomposition is well suited in most cases. A load balancing scheme has been developed that dynamically shifts the domain borders to distribute the computational load between the devices. The scheme aims for even simulation time across the GPUs. This is achieved by measuring and monitoring the execution time of some key operations in the DEM algorithm and incrementally shifting the domain borders until all solvers have close to equal execution times for these operations. Performance measurements were carried out on Amazon Web Services Accelerated Computing instances with systems ranging from 4 to 8 GPUs. The total cost of the parallelization relative to total execution time ranges from 2.6% to 6.5%, increasing with the number of connected GPUs. The implementation of the parallelization scheme is therefore deemed efficient and successful. The chosen algorithm is verified and benchmarked on three cases. The verification shows that the physics of the single-GPU solver is preserved in the multi-GPU solver.
Dynamic load balancing is shown to have advantages over static decomposition, and the optimization scheme for the balancing is verified on a simulation case with dynamic particle behavior. The overall scaling of the algorithm is studied by benchmarking and monitoring the cost associated with the different steps of the DEM algorithm. Certain steps that are part of the original single-GPU solver are shown to scale worse than the steps added by the parallelization. This is analyzed and judged to be an effect of the memory schemes used in peer-to-peer mode on the GPUs, and will require further attention in future work.
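
The border-shifting idea described in the abstract can be illustrated with a small CPU-only sketch. It is not the thesis implementation: the function names (`count_per_slab`, `shift_borders`, `balance_demo`), the fixed `step` size, and the cost model (execution time taken as proportional to the particle count in each slab) are all assumptions made for illustration. In the solver, the times would come from measurements of key DEM operations on each GPU.

```python
from bisect import bisect_right


def count_per_slab(positions, borders):
    """Count particles in each slab; `borders` are the sorted interior borders."""
    counts = [0] * (len(borders) + 1)
    for x in positions:
        counts[bisect_right(borders, x)] += 1
    return counts


def shift_borders(borders, times, step):
    """Move each interior border by at most `step` into its slower neighbour."""
    shifted = []
    for i, border in enumerate(borders):
        left, right = times[i], times[i + 1]
        total = left + right
        # Relative imbalance in [-1, 1]; positive means the left slab is slower.
        imbalance = (left - right) / total if total > 0 else 0.0
        # A slower left slab must shrink, so its right-hand border moves left.
        shifted.append(border - step * imbalance)
    return shifted


def balance_demo(n_particles=2000, n_devices=4, step=0.01, iterations=1000):
    """Balance a non-uniform 1-D particle distribution across `n_devices` slabs."""
    # Hypothetical test distribution on [0, 1): denser near the lower end.
    positions = [(i / n_particles) ** 2 for i in range(n_particles)]
    # Static decomposition: equally wide slabs along the chosen axis.
    borders = [k / n_devices for k in range(1, n_devices)]
    initial = count_per_slab(positions, borders)
    counts = initial
    for _ in range(iterations):
        # Assumed cost model: time proportional to particle count.
        # A real solver would use measured per-device execution times here.
        times = [float(c) for c in counts]
        borders = shift_borders(borders, times, step)
        counts = count_per_slab(positions, borders)
    return initial, counts, borders


if __name__ == "__main__":
    initial, final, borders = balance_demo()
    print("particles per slab, static borders:  ", initial)
    print("particles per slab, shifted borders: ", final)
    print("final interior borders:", borders)
```

Each border only reacts to the two slabs it separates, which matches the one-dimensional decomposition where every device has at most two neighbours. The sketch assumes `step` is small compared with the slab widths so that borders never cross; a production scheme would need to guard against that explicitly.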

Subject/keywords

Discrete Element Method, Parallelization, GPU, multi-GPU, HPC, Domain decomposition, Dynamic domain decomposition
