CFD on GPUs in Aerospace Applications Benchmarking the Fluent native GPU solver on aerospace ap- plications, and how to approach purchasing GPUs for CFD as a business case. Master’s thesis in Mobility Engineering Filip Gustafsson, Gustav Rönn DEPARTMENT OF Mechanics and Maritime Sciences CHALMERS UNIVERSITY OF TECHNOLOGY Gothenburg, Sweden 2025 www.chalmers.se www.chalmers.se Master’s thesis 2025 CFD on GPUs in Aerospace Applications An evaluation of Ansys Fluent native GPU solver in aerospace applications, and GPUs for CFD simulations as an investment. Filip Gustafsson, Gustav Rönn Department of Mechanics and Maritime Sciences Division of Fluid Dynamics Chalmers University of Technology Gothenburg, Sweden 2025 CFD on GPUs in Aerospace Applications An evaluation of Ansys Fluent native GPU solver in aerospace applications, and GPUs for CFD simulations as an investment. Filip Gustafsson, Gustav Rönn © Filip Gustafsson, Gustav Rönn, 2025. Supervisors: Björn Bragée, EDRMedeso Jan Östlund, GKN Aerospace Examiner: Professor Lars Davidsson, Department of Mechanics and Maritime Sci- ences Master’s Thesis 2025 Department of Mechanics and Maritime Sciences Division of Fluid Dynamics Chalmers University of Technology SE-412 96 Gothenburg Telephone +46 31 772 1000 Cover: Simulation with pathline visualization on the TRS case from Ansys Fluent 2025 R1. Typeset in LATEX, template by Kyriaki Antoniadou-Plytaria Printed by Chalmers Reproservice Gothenburg, Sweden 2025 iv CFD on GPUs in Aerospace Applications An evaluation of Ansys Fluent native GPU solver in aerospace applications, and GPUs for CFD simulations as an investment. Filip Gustafsson, Gustav Rönn Department of Mechanics and Maritime Sciences Chalmers University of Technology Abstract Running CFD simulations on GPUs is becoming commercially viable, and one major early adopter is the automotive industry, where external aerodynamics cases can be run at a fraction of the time compared to simulations on CPUs. The aerospace industry has not yet adopted GPUs to the same extent, as aerospace cases often require support for more complex physics such as compressible flows and combustion. This study compares the performance of the Ansys Fluent GPU solver with the CPU solver in aerospace applications and is carried out with the support of GKN Aerospace and EDR & Medeso, a reseller of Ansys software. It uses a novel approach to evaluate the attractiveness of purchasing GPUs for a local cluster, compared to purchasing CPUs, from a cost, power consumption, and strategic perspective. A case-based methodology is used to compare the solvers with 3 simulation setups that are representative of typical aerospace applications. The current version of the GPU solver supports all the necessary features to run 2/3 cases, although it requires minor simplifications to the case setups. For the cases it does support, key results include GPU simulations providing a time reduction of 41-98% per iteration, an energy consumption reduction of 88-93% per iteration, a 27-73% reduction in iterations to reach convergence, a cloud computing cost reduction of 83-91% and a total cost of ownership reduction of 48-67% for systems with equivalent simulation capacity on a local cluster. If the simulation capacity demand for simulation setups that the GPU solver supports is sufficient, purchasing GPUs for CFD simulation is a cost-effective and energy-efficient solution to meet simulation capacity demands in comparison to purchasing CPUs. 
The speedups provided by the Ansys Fluent GPU solver can be leveraged to generate significant value in an engineering process by enabling more design iterations, improved simulation fidelity, and faster simulation turnaround, compared to the CPU solver. Keywords: CFD, GPU, CPU, HPC, Ansys Fluent, Native GPU Solver, Simulation, Aerospace, Business case, Turbomachinery. v Acknowledgements We would like to thank EDR & Medeso, GKN Aerospace and the people who have supported us throughout our master’s thesis. Our supervisor, Björn Bragée, for his interest in the project, great discussions, and continuous support. The rest of the staff at EDR & Medeso, especially Klas Johans- son and Tomas Jarneholt for their wide CFD knowledge and willingness to always answer all our questions. Our other supervisor, Jan Östlund at GKN Aerospace, for continuously supporting us in interpreting simulation results and evaluating setups. Our examiner, Professor Lars Davidson. Others that have contributed with their knowledge: Pekka Wikman, Marcus Lejon and Tomas Fernström, GKN Aerospace. Anders Jönsson, Tobias Berg and Didier Besette, Ansys. Torbjörn Wirdung, Volvo Cars. Adam Koc and Jan Wallenberg, GoVirtual. Lastly we want to thank our friends and families for their ongoing support through- out our entire studies. Filip Gustafsson, Gustav Rönn, Gothenburg, June 2025 vii List of Acronyms Below is the list of acronyms that have been used throughout this thesis listed in alphabetical order: ALU Algorithmic Logic Unit CFD Computational Fluid Dynamics CPU Central Processing Unit CSR Compressed Sparse Row format FLOPS Floating Point Operations per Second CUDA Compute Unified Design Architecture FP32 32-bit Floating Point number FP64 64-bit Floating Point number GPU Graphics Processing Unit HPC High Performance Computing IC Integrated Circuit SIMD Single Instruction, Multiple Data SIMT Single Instruction, Multiple Threads SMP Streaming Multi-Processor SP Streaming Processor RAM Random Access Memory RANS Reynolds-Averaged Navier-Stokes equations TDP Thermal Design Power VRAM Video Random Access Memory ix Nomenclature Below is the nomenclature of parameters and variables that have been used through- out this thesis. A Area Af Area of a specific face, f a Acceleration C Cost CP 0 Total pressure coefficient CV Control volume êi General unit vector Etot Total energy consumption fclock Clock frequency fmemory Memory frequency Fthrust Thrust force g Gravitational constant = 9.81 î Unit vector in x Isp Specific impulse Ispvac Specific impulse in a Vacuum environment k Turbulent kinetic energy kcooling Cooling factor L Length M Momentum ṁ Mass flow ṁcorr Corrected mass flow n Number of n̂ Normal vector P Pressure & Power Pd Dynamic pressure xi Pratio Pressure ratio over rotor stage Pref Reference or ambient pressure Ps Static pressure Pt Total pressure Ptot Total power consumption p Price Re Reynolds number Si Internal energy source term SM Momentum source term tsas Start and shutdown time tsim Simulation time ttot Total time T Temperature Tref Reference or ambient temperature Tt Total temperature ts Time step u Velocity in x-direction U Velocity vector v Velocity in y-direction V Volume VP Volume w Velocity in z-direction β Bandwidth Γ Diffusion coefficient µ Dynamic viscosity ϕ General fluid property vector Φ Dissipation function ρ Density τ Shear stress υ Kinematic viscosity xii Contents List of Acronyms ix Nomenclature xi List of Figures xv List of Tables xvii 1 Introduction 1 2 Theory 5 2.1 Fluid theory . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . 5 2.1.1 Basic fluid mechanics and governing equations . . . . . . . . . 5 2.1.2 Discretization . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1.3 Coupled pressure-based solver . . . . . . . . . . . . . . . . . . 8 2.1.4 Turbulence Models and Species transport . . . . . . . . . . . . 10 2.1.4.1 k − ω SST . . . . . . . . . . . . . . . . . . . . . . . . 10 2.1.5 Definitions of variables, coefficients and expressions . . . . . . 10 2.1.5.1 Coefficients . . . . . . . . . . . . . . . . . . . . . . . 11 2.2 Computer Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.2.1 Floating point numbers . . . . . . . . . . . . . . . . . . . . . . 11 2.2.1.1 Single-precision numbers . . . . . . . . . . . . . . . . 11 2.2.1.2 Double-precision numbers . . . . . . . . . . . . . . . 11 2.2.1.3 Floating point numbers in CFD . . . . . . . . . . . . 12 2.2.2 Central Processing Unit . . . . . . . . . . . . . . . . . . . . . 12 2.2.2.1 Processing cores . . . . . . . . . . . . . . . . . . . . 12 2.2.3 Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.2.4 Graphics Processing Unit . . . . . . . . . . . . . . . . . . . . 13 2.2.4.1 Utilization . . . . . . . . . . . . . . . . . . . . . . . . 13 2.2.5 High Performance Computing . . . . . . . . . . . . . . . . . . 14 2.2.5.1 Power Consumption . . . . . . . . . . . . . . . . . . 14 2.3 Business Case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 2.3.1 Cost-Benefit Analysis . . . . . . . . . . . . . . . . . . . . . . . 14 2.3.1.1 Total cost of ownership . . . . . . . . . . . . . . . . 15 3 Methods 17 3.1 Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 xiii Contents 3.1.1 Hardware specification . . . . . . . . . . . . . . . . . . . . . . 18 3.2 Calculation methods . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.2.1 Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.2.2 Energy consumption . . . . . . . . . . . . . . . . . . . . . . . 19 3.3 Simulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.3.1 Porting to GPU solver . . . . . . . . . . . . . . . . . . . . . . 20 3.3.2 TRS case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20 3.3.2.1 GPU solver setup . . . . . . . . . . . . . . . . . . . . 22 3.3.3 Nozzle case . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.3.3.1 GPU solver setup . . . . . . . . . . . . . . . . . . . . 24 3.3.4 Rotor case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 3.3.4.1 CFX setup . . . . . . . . . . . . . . . . . . . . . . . 26 3.3.4.2 Fluent GPU and CPU setup . . . . . . . . . . . . . . 26 3.4 Business case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 3.4.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 3.4.2 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 4 Results 31 4.1 TRS case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4.1.1 Simplifications and issues . . . . . . . . . . . . . . . . . . . . . 31 4.1.2 Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 4.1.3 Energy consumption . . . . . . . . . . . . . . . . . . . . . . . 34 4.1.4 Accuracy and comparison . . . . . . . . . . . . . . . . . . . . 34 4.1.4.1 Convergence . . . . . . . . . . . . . . . . . . . . . . 35 4.1.5 Cloud cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 4.2 Nozzle case . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . 36 4.3 Rotor case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 4.3.1 Simplifications and issues . . . . . . . . . . . . . . . . . . . . . 37 4.3.2 Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 4.3.3 Energy consumption . . . . . . . . . . . . . . . . . . . . . . . 37 4.3.4 Convergence . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 4.3.5 Accuracy and comparison . . . . . . . . . . . . . . . . . . . . 39 4.3.6 Cloud cost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 4.4 Business case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40 4.4.1 Total cost of ownership . . . . . . . . . . . . . . . . . . . . . . 40 4.4.2 Benefits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46 4.4.3 System power consumption . . . . . . . . . . . . . . . . . . . 47 5 Discussion 49 5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 5.2 Further research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 A Supported physics models I B Sparse matrix structure V xiv List of Figures 1.1 Maximum FP32 computational performance measured in TFlops for some CPUs and GPUs throughout the years. . . . . . . . . . . . . . . 2 1.2 Memory bandwidth β of some CPUs and GPUs throughout the years. 2 2.1 An arbitrary number represented in IEEE 754 Floating point format. 11 3.1 Isometric view of the TRS case . . . . . . . . . . . . . . . . . . . . . 21 3.2 The full 360° and the 30° section of the TRS case. . . . . . . . . . . . 23 3.3 Isometric view of the Nozzle case. . . . . . . . . . . . . . . . . . . . . 24 3.4 Isometric view of Rotor case. . . . . . . . . . . . . . . . . . . . . . . . 25 4.1 30° TRS case in single precision. . . . . . . . . . . . . . . . . . . . . . 32 4.2 360° TRS case in single precision. . . . . . . . . . . . . . . . . . . . . 32 4.3 Interconnect speed on the 30° simulation of the TRS case. . . . . . . 33 4.4 GPU solver simulation time over mesh size for the TRS case. . . . . . 33 4.5 360° single precision energy consumption for running 300 iterations. . 34 4.6 Circumferentially-averaged CP 0 over span in the outlet. . . . . . . . . 35 4.7 No. of iterations until converged for 30° TRS case in single precision. 35 4.8 No. of iterations until converged for 30° TRS case in double precision. 35 4.9 360° TRS case Rescale cost for total time of simulation. . . . . . . . . 36 4.10 Rotor case double precision time per 10 000 iterations. . . . . . . . . 37 4.11 Rotor case double precision energy consumption per 10 000 iterations. 38 4.12 No. of iterations until converged for the Rotor case. . . . . . . . . . . 38 4.13 Rotor case Rescale cost per 10 000 iterations. . . . . . . . . . . . . . 39 4.14 Visualization of equivalent simulation capacity for 2x Nvidia H100 compared to 14x Intel 6455B. . . . . . . . . . . . . . . . . . . . . . . 40 4.15 Equivalent simulation capacity for the TRS case when comparing 14x Intel 6455B with 2x Nvidia H100. . . . . . . . . . . . . . . . . . . . . 41 4.16 Local vs cloud cost over a 3 year period for 1 2x H100 machines. . . . 45 4.17 Local vs cloud cost over a 3 year period for 7 2x 6455B machines. . . 45 4.18 Local vs cloud cost over a 3 year period for 12 2x 8375C machines. . . 46 4.19 System power consumption. . . . . . . . . . . . . . . . . . . . . . . . 47 5.1 Strategic implementation decision tree. . . . . . . . . . . . . . . . . . 
50 B.1 2x2 equidistant mesh with 4 cells . . . . . . . . . . . . . . . . . . . . V B.2 3x3 equidistant mesh with 9 cells . . . . . . . . . . . . . . . . . . . . V xv List of Figures xvi List of Tables 3.1 CPUs used. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.2 GPUs used. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 3.3 Power consumption metrics for the hardware used. . . . . . . . . . . 20 3.4 Estimated purchasing prices for the hardware used. . . . . . . . . . . 29 4.1 Comparison between results from CFX and Fluent GPU solver. . . . 39 4.2 Total Cost of Ownership of 2x Nvidia H100 over a 3 year period. . . . 42 4.3 Total Cost of Ownership of 14x Intel Xeon 6455B (448 cores) over a 3 year period. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 4.4 Total Cost of Ownership of 24x Intel Xeon 8375C (768 cores) over a 3 year period. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 4.5 Pugh matrix of benefits for CPU system vs. GPU system investment. 46 A.1 Physics models and solution schemes available in the CPU and GPU solver in Ansys Fluent 2025R1 . . . . . . . . . . . . . . . . . . . . . . III xvii List of Tables xviii Chapter 1 Introduction When complex geometries are introduced into fluid mechanics problems, analyt- ical solutions are not feasible. The problems are instead solved numerically using Computational Fluid Dynamics (CFD), by discretizing the governing continuity and Navier-Stokes equations into a mesh of control volumes. An early paper on a 3D model was published in 1967 by John Hess and A.M.O. Smith [1]. Between then and now there has been major developments in the modeling of fluid flows, including turbulence models, discretization schemes, meshing developments, and multi-physics simulations. CFD codes have traditionally been written for the Central Processing Unit (CPU) of the computer, designed to execute instructions sequentially. Single Instruction, Multiple Data (SIMD) Graphics Processing Units (GPUs) were com- mercialized in the late 1990s and their performance has been increasing at a high rate. Today, server GPUs have far surpassed the server CPUs in potential comput- ing power and the gap continues to grow larger. For example, the Nvidia H100 GPU has a 40x larger theoretical maximum number of 32-bit floating point operations per second (Flops) compared to the 32-core Intel Xeon Gold 6455B CPU, both of which are used in this thesis. In Figure 1.1 the theoretical performance of some CPUs and GPUs from the last 10 years are shown. 1 1. Introduction 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 0 20 40 60 E3-1285V4 8180 8280 8375C 8488C9655P8593QK80 P100 V100 A100 H100 H200 Year F P 32 [T F L O P S ] Theoretical FP32 computational capacity CPU GPU Figure 1.1: Maximum FP32 computational performance measured in TFlops for some CPUs and GPUs throughout the years. The theoretical maximum GPU computing performance can only be reached if all processing cores are fully utilized, which is not realistic in CFD codes. Read and write operations create a major computational cost in CFD codes and the computa- tional time is therefore normally bottle-necked by the memory bandwidth, β, when solving large CFD problems. The memory bandwidth of some CPUs and GPUs from the last decade can be found in Figure 1.2. 
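To make the relationship between peak compute and memory bandwidth concrete, the short sketch below applies a simple roofline-style estimate to the two processors just mentioned. The peak FLOP/s and bandwidth values are rounded, vendor-quoted figures, and the arithmetic intensity assumed for a sparse matrix-vector product is a representative placeholder rather than a measured number.

```python
# Rough roofline-style comparison of one GPU and one CPU used in this thesis.
# Peak figures are rounded, vendor-quoted assumptions, not measurements.

hardware = {
    # name: (peak FP32 [FLOP/s], memory bandwidth [B/s])
    "Nvidia H100 NVL":       (60e12, 3.8e12),
    "Intel Xeon Gold 6455B": (1.5e12, 0.3e12),
}

# A sparse matrix-vector product in a CFD solver performs on the order of
# 0.1-0.25 FP32 operations per byte moved (assumed, low arithmetic intensity).
ai_spmv = 0.15  # [FLOP/byte]

for name, (peak_flops, bw) in hardware.items():
    balance = peak_flops / bw                    # FLOP/byte needed to reach peak
    attainable = min(peak_flops, ai_spmv * bw)   # roofline: bandwidth-limited
    print(f"{name}:")
    print(f"  machine balance          : {balance:5.1f} FLOP/byte")
    print(f"  attainable at AI={ai_spmv}     : {attainable/1e12:6.2f} TFLOP/s "
          f"({100*attainable/peak_flops:.1f}% of peak)")
```

With an arithmetic intensity far below the machine balance of either device, the attainable throughput is set by bandwidth, so the roughly 12-13x bandwidth advantage of the GPU is a better predictor of realizable solver speedup than the roughly 40x gap in peak FP32 throughput.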
2014 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 0 1 2 3 E3-1285V4 8180 8280 8375C 8488C 9655P6768PK80 P100 V100 A100 H100 H200 Year β [T B /s ] Memory Bandwidth β CPU GPU Figure 1.2: Memory bandwidth β of some CPUs and GPUs throughout the years. GPUs for CFD simulations can be considered a disruptive technology in the fluid 2 1. Introduction simulation field, showing speedups by up to 32x in the Ansys Fluent GPU solver when running on multiple GPUs, driving automotive manufacturers to invest mil- lions of Euros in GPUs for aerodynamics simulations.[2] [3] The investment in GPUs for CFD is motivated by the strive to optimize engineer and researcher time, since skilled personnel can account for 5-10x the cost compared to the next biggest expense which is normally computing hardware or software licenses. [4] Faster simulations allow engineers to incorporate more simulations earlier in the design process. [4] EDR & Medeso is supervising this master thesis together with GKN Aerospace to evaluate the Fluent GPU and CPU solvers’ performance in aerospace applications and whether an investment in GPUs would be cost-efficient and beneficial for their cluster. EDR & Medeso AB is a reseller of calculation and simulation software for engineers and they deliver training, support and consulting services in this area. Their software suite consists of products from Ansys, Rescale, Neural Concept, ScaleX Enterprise, Trimble, IDEA StatiCa and CSI. They operate in the Nordics, the UK, and Poland, and have around 160 employees worldwide, with about 40 based in Sweden. [5][6] GKN Aerospace Sweden AB manufactures and sells engine components for airplanes and rockets, and performs engine maintenance. They are based in Trollhättan and part of the GKN Aerospace Group based in the UK. They produce engine components for civil aircraft, the RM12 engine for SAAB 39 Gripen and rocket nozzles and turbines for the Ariane 6 rocket. [7] In the ANSYS Fluent suite, a solver that was natively written for running on GPUs was first included in the 2023 R1 release. The solver is continuously updated for each release that comes out. The current release of the GPU solver is limited in its functionality compared to the CPU solver. The main differences between the CPU and GPU solver are listed in the appendix, in Table A.1. In this project, the 2025 R1 release of ANSYS Fluent is used. The available literature on CFD on GPUs is limited, and in the case of aerospace applications, non-existent. This is the case since the field itself is rather new, ANSYS rolled out their GPU solver in 2020 to customers [8]. The GPU-accelerated Siemens Star-CCM+ solver shows a 20x speedup for an ex- ternal automotive aerodynamics case when comparing 2x 64-core AMD EPYC 7742 CPUs with 8x Nvidia A100 GPUs, while the Ansys Fluent Native GPU solver shows a 30x speedup for a different external automotive aerodynamics case with the same CPUs and GPUs. [4] [8] In non-automotive, time reductions in the range 5x to 20x depending on the case and GPU. [9] These were smaller cases with 11 & 16 million cells, respectively. Power consumption can also be reduced by moving from a CPU server to a GPU server. A 4x reduction in power consumption was found when comparing 64x 16-core Intel Xeon Gold 6242 with 6x Nvidia V100 GPUs without taking cooling, simula- tion time and the CPU in the GPU server into account. [8] When taking simulation time into account, the results become case-dependent, ranging from a factor of 0.67x to 8x. 
[9] It should be noted that this study accounts for the necessity of a CPU in every GPU server node but assumes that the CPU runs with maximum power 3 1. Introduction consumption throughout the whole simulation, rather than idle power consumption, overestimating the GPU node power consumption in comparison to the CPU node power consumption. The accuracy of the Ansys Fluent Native GPU solver has been studied on a laminar flow over a sphere and it was found to compute the drag coefficient to an error of 0.252%. [8] It has been shown to correlate well with experimental and CPU solver results on a further set of cases. [10] [11] The cost of CFD simulations on GPUs has also been previously studied with cost reductions of up to 83%. [9] Volvo Cars reached a 2.5x speedup when running on 8x Blackwell GPUs compared to cost-equivalent CPUs with a total of 2016 cores. [2] 4 Chapter 2 Theory To understand how GPUs can solve CFD problems faster than CPUs, it is necessary to dive into the calculations that a CFD code performs, the architecture of comput- ers, and how they relate to each other. This is complemented with some background theory on business case methods. 2.1 Fluid theory The field of fluid mechanics is vast and a commercial CFD code is a ridiculously complex software with a multitude of features and solutions for different problems. The main solution procedure is similar for most codes, however, and the goal of this chapter is to explain how the main calculation procedure is derived and why it can be sped up through parallelization. 2.1.1 Basic fluid mechanics and governing equations To start off, the characteristics of a flow depends on its velocity U . A Reynolds number Re > 4000 indicates that a flow is fully turbulent [12] and a Mach number M ≥ 1 indicates that compressibility effects are important. [13] Re = ρUL µ (2.1) M = U a (2.2) The governing equations of the flow of a compressible Newtonian fluid can be written as shown in (2.3), (2.4), (2.5), (2.6), and (2.7). [14] Continuity equation: ∂ρ ∂t + div(ρu) = 0 (2.3) Navier-Stokes equations, describing momentum in the x-, y-, and z-components: ∂(ρu) ∂t + div(ρuu) = −∂p ∂x + div(µ grad u) + SMx (2.4) 5 2. Theory ∂(ρv) ∂t + div(ρvu) = −∂p ∂y + div(µ grad v) + SMy (2.5) ∂(ρw) ∂t + div(ρwu) = −∂p ∂z + div(µ grad w) + SMz (2.6) where SM is the momentum source term. Energy equation: ∂(ρi) ∂t + div(ρiu) = −p div u + div(k grad T ) + Φ + Si (2.7) where Φ is the dissipation function and Si is the source term. The time derivative terms are not considered when solving in steady-state. 2.1.2 Discretization The governing equations are discretized over a control volume by applying the Gauss divergence theorem. Discretizing the transport equation, which is a common differ- ential form for all flow equations, yields the following steady-state integrated form where ϕ is defined as a general variable representing a fluid property, n̂ is the normal vector and Sϕ is the source term for variable ϕ. [14]∫ A n̂(ρϕu)dA = ∫ A n̂(Γgradϕ)dA + ∫ CV SϕdV (2.8) The physical interpretation of this form is that the divergence integrated over a control volume equals the fluxes over its surfaces. Splitting over a finite number of faces yields: M∑ f ∫ Af n̂f (ρϕu)dAf = M∑ f ∫ Af n̂f (Γ grad ϕ)dAf + ∫ CV SϕdV (2.9) Assuming that the fluid property varies linearly across each face allows the surface integral to be approximated to the value at the point on the face that is between the node and the neighboring node. 
M∑ f n̂fρfϕfufAf = M∑ f n̂f (Γgradϕ)fAf + ∫ CV SϕdV (2.10) The diffusion term is related to the orthogonality of the mesh. It is divided into an implicit and an explicit term, where the implicit term depends on the orthogonal component ∆ of the vector d and the explicit term depends on the non-orthogonal component k of the vector d. M∑ f=1 n̂f (Γgradϕ)fAf = M∑ f=1 (ρΓϕ)f∆(gradϕ)f + M∑ f=1 (ρΓϕ)fk(gradϕ)f (2.11) This can be discretized through the "minimum correction" approach, the "orthog- onal correction" approach or the "over-relaxed" approach. Here is the orthogonal 6 2. Theory correction approach. M∑ f=1 (ρΓϕ)f |n̂fAf |dϕN − ϕP |d|2 + M∑ f=1 (ρΓϕ)f n̂fAf (1− d |d| )(fx(gradϕ)P +(1−fx)(gradϕ)N) (2.12) (1 − d |d|) is zero for an orthogonal mesh, taking the second term out of the equation. This can be applied to all 5 governing equations. Continuity equation:∫ A n̂(ρU)dA + ∫ A n̂.(ρV)dA + ∫ A n̂.(ρW)dA = 0 (2.13) M∑ f n̂fρfUfAf + M∑ f n̂fρfVfAf + M∑ f n̂fρfWfAf = 0 (2.14) X-momentum: ∫ A n̂(ρuu − µ ∂u ∂x )dA + ∫ A n̂(ρvu − µ ∂u ∂y )dA + ∫ A n̂(ρzu − µ ∂u ∂z )dA =∫ ∆V −∂p ∂x dV + ∫ CV SMxdV (2.15) M∑ f=1 n̂f (ρfufuf − µf (∂u ∂x )f )Af + M∑ f=1 n̂f (ρfvfuf − µf (∂u ∂y )f )Af+ (2.16) M∑ f=1 n̂f (ρfzfuf − µf (∂u ∂z )f )Af = −(∂p ∂x )VP + SUVP + SP VP uP The energy equation and the Y and Z momentum equations are discretized in a similar manner. The discretized equations are then split into coefficients. For every node P , there are nb neighbours. The value at a face f between a node P and its neighbour node N is calculated through linear interpolation where fx is the distance from node P to face f . ϕf = ϕP (1 − fx) + ϕNfx (2.17) Solving the equation system for u gives: aP uP = ∑ nb anbunb + ∑ pfA · î + S (2.18) For generalized fluid property ϕ: aP ϕP = ∑ nb anbϕnb + ∑ pfAêi + S (2.19) This creates a system of algebraic equations. Aϕ = B (2.20) 7 2. Theory where A is a sparse matrix with aP on the diagonal and anb on the off-diagonal corresponding to the neighbouring cell, ϕ is a vector containing a general fluid prop- erty (for example x-velocity or pressure) and B is a vector containing source terms. The size of the A matrix will be the square of the number of cells in the domain. The number of anb coefficients will depend on the number of faces a given cell has. A cubic cell, for instance, has six faces (and six directly neighbouring cells) and therefore has six corresponding anb coefficients in the A matrix. [15] Size of matrix A: Asize = n2 cells (2.21) Number of coefficients in matrix A: Acoeff = ncells(1 + n̄faces per cell) (2.22) The number of faces per cell is defined by the number of face connections each face has, which means that a face that connects to cells will have one coefficient for each cell. The size of matrix A will grow faster than its coefficients when the mesh has a large number of cells. 2.1.3 Coupled pressure-based solver In the pressure-based solver, a flow problem can be solved using a segregated or a coupled approach. The segregated approach solves each momentum and pres- sure correction equation separately, while the coupled approach solves all equations simultaneously in one matrix equation. The structure of matrix A and vectors ϕ and B for general Aϕ = B in a segregated solver is as follows: A =  A11 . . . A1n ... . . . ... Am1 . . . Amn , ϕ =  ϕ1 ... ϕn , B =  bϕ 1 ... bϕ m  The sparsity of the A matrix increases with an increasing number of cells, since every cell is only connected to its direct neighbours. 
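To illustrate how quickly the A matrix becomes sparse as the cell count grows, the sketch below builds the coefficient pattern for an n x n equidistant 2D mesh, giving each cell one diagonal coefficient a_P and one off-diagonal coefficient a_nb per face neighbour. The use of SciPy and the placeholder coefficient values (4 and -1) are assumptions for illustration only; the point is the pattern, not the numbers.

```python
import numpy as np
from scipy.sparse import lil_matrix

def coefficient_pattern(n):
    """Sparsity pattern of A for an n x n equidistant 2D mesh.

    Each cell gets one diagonal coefficient a_P and one off-diagonal
    coefficient a_nb per face neighbour (up to 4 in 2D). The numerical
    values are placeholders; only the pattern matters here.
    """
    ncells = n * n
    A = lil_matrix((ncells, ncells))
    for i in range(n):
        for j in range(n):
            p = i * n + j
            A[p, p] = 4.0                          # a_P
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < n and 0 <= nj < n:
                    A[p, ni * n + nj] = -1.0       # a_nb
    return A.tocsr()                               # compressed sparse storage

for n in (2, 3, 10, 100):
    A = coefficient_pattern(n)
    ncells = n * n
    dense_entries = ncells ** 2
    print(f"{n:>3} x {n:<3} mesh: {ncells:>6} cells, "
          f"{dense_entries:>12} dense entries, {A.nnz:>7} non-zeros "
          f"({100 * (1 - A.nnz / dense_entries):.2f}% zero)")
```

Already at 100 x 100 cells more than 99.9% of the dense entries are zero, which is what motivates the compressed storage formats discussed below.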
Examples for 2D 2x2 and 3x3 equidistant meshes are attached in Appendix B. The sparsity of a sparse matrix can be defined by: ncells − nnnz ncells (2.23) where nnnz is the number of non-zero elements in the matrix. In a coupled solver for a 3D case the A matrix contains all three momentum equations, pressure correction terms, pressure-velocity coupling terms, and the matrix is laid out in the following way: 8 2. Theory A =  A11 . . . A1n 0 . . . 0 0 . . . 0 −a11 . . . −a1n ... . . . ... ... . . . ... ... . . . ... ... . . . ... Am1 . . . Amn 0 . . . 0 0 . . . 0 −am1 . . . −amn 0 . . . 0 A11 . . . A1n 0 . . . 0 −b11 . . . −b1n ... . . . ... ... . . . ... ... . . . ... ... . . . ... 0 . . . 0 Am1 . . . Amn 0 . . . 0 −bm1 . . . −bmn 0 . . . 0 0 . . . 0 A11 . . . A1n −c11 . . . −c1n ... . . . ... ... . . . ... ... . . . ... ... . . . ... 0 . . . 0 0 . . . 0 Am1 . . . Amn −cm1 . . . −cmn −d11 . . . −d1n −e11 . . . −e1n −f11 . . . −f1n A11 . . . A1n ... . . . ... ... . . . ... ... . . . ... ... . . . ... −dm1 . . . −dmn −em1 . . . −emn −fm1 . . . −fmn Am1 . . . Amn  with vector ϕ  u1 ... um v1 ... vm z1 ... zm p1 ... pm  and vector B  BU 1 ... BU m BV 1 ... BV m BZ 1 ... BZ m BP 1 ... BP m  The blocks in the A matrix for a coupled solver can be summarized as: [16] ∑ ij AP =  auu ij auv ij auw ij aup ij avu ij avv ij avw ij avp ij awu ij awv ij aww ij awp ij apu ij apv ij apw ij app ij  for cell i and neighbor connection j. auv ij ,auw ij ,avu ij ,avw ij ,awu ij ,awv ij contains terms only for cells that have direct contact to boundaries, which is assumed to become computa- tionally negligible on mesh sizes with a number of cells in the order of millions. An obvious conclusion when looking at the efficiency of the Aϕ = B equation for sparse matrices is that it is inefficient to multiply all these zeros together. One way of working around this is the Compressed Sparse Row (CSR) format. This transforms the sparse matrix into three vectors: one to store row indices (or row pointers), one to store column indices and one to store values. This effectively reduces the size of an A matrix block to: [17] Ablocksize = nnnz = ncells(1 + nnb) (2.24) with the row and column indices vectors being larger than ncells and smaller than ncells(1 + nnb) depending on the sparse format used. For a non-adaptive mesh the 9 2. Theory row and column indices are constant and therefore only need to be read from mem- ory the first iteration. 2.1.4 Turbulence Models and Species transport Turbulence is an unsteady flow phenomenon. To model turbulent regions in steady- state, the flow properties are approximated using a turbulence model, which is commonly done by Reynolds-averaging the Navier-Stokes equations (RANS). The most relevant models are the two equation linear eddy viscosity models k − ω and k − ϵ, with the transported variables turbulent kinetic energy k, specific turbulent dissipation rate ω and dissipation rate ϵ. 2.1.4.1 k − ω SST The k − ω baseline (BSL) turbulence model is based on the observation that the k − ω model is accurate in adverse pressure gradients and that the k − ϵ model is accurate in freestream flow. The k − ϵ model is rewritten as a k − ω formulation and a switching function is used to determine which model to use. 
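As a minimal illustration of what the switching function does, the sketch below blends one model coefficient between its k-ω and k-ε values as the blending function (F1 in (2.27) below) goes from 1 near the wall to 0 in the freestream. The coefficient values are standard literature numbers included only as an assumed example; the solver applies the same idea to every blended coefficient.

```python
def blend(phi_kw, phi_ke, F1):
    """Blend a k-omega coefficient with its k-epsilon counterpart.

    F1 -> 1 close to the wall (pure k-omega behaviour),
    F1 -> 0 in the freestream (pure k-epsilon behaviour).
    """
    return F1 * phi_kw + (1.0 - F1) * phi_ke

# Illustrative values for one coefficient in the two parent models (assumed,
# standard-literature numbers; the exact constants live in the solver).
sigma_k_omega, sigma_k_epsilon = 0.85, 1.0

# A hypothetical blending-function profile from wall (F1 = 1) to freestream (F1 = 0).
for F1 in (1.0, 0.75, 0.5, 0.25, 0.0):
    sigma_k = blend(sigma_k_omega, sigma_k_epsilon, F1)
    print(f"F1 = {F1:4.2f} -> sigma_k = {sigma_k:.3f}")
```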
The BSL model is further improved into the k − ω shear-stress transport (SST) model by assuming that the principal shear-stress is proportional to the turbulent kinetic energy in the boundary layer. This allows it to account for the transport of the turbulent shear stress. [18] BSL formulation: ∂ρk ∂t + ∂ρujk ∂xj = Pk − β∗ρωk + ∂ ∂xj [ (µ + σkµt) ∂k ∂xj ] (2.25) ∂ρω ∂t +∂ρujω ∂xj = γPω−βρω2+2ρ(1−F1)σω2 1 ω ∂k ∂xj ∂ω ∂xj + ∂ ∂xj [ (µ + σωµt) ∂ω ∂xj ] (2.26) where: F1 = tanh((max(min( √ k 0.09ωy ; 0.45ω Ω); 400ν y2ω ))4) (2.27) SST term: Dτ Dt = ∂τ ∂t + uk ∂τ ∂xk (2.28) 2.1.5 Definitions of variables, coefficients and expressions Fthrustvac = vexhaust ṁ + ∫ Aoutlet PSoutlet dA (2.29) isp = Fthrust ṁ g (2.30) ispvac = Fthrustvac ṁ g (2.31) Pratio = Pt2 Pt1 (2.32) 10 2. Theory ṁcorr = ṁ √ Tt0/Tref Pt0/Pref (2.33) where Tt0,Pt0 is the total temperature and total pressure at the inlet and Tref = 288.15 K and Pref = 101325 Pa 2.1.5.1 Coefficients Total pressure coefficient Cp0 = Pt − Pt, ref Pt, ref − Ps, ref (2.34) Pressure coefficient Cp = Ps − Ps, ref Pt, ref − Ps, ref (2.35) 2.2 Computer Architecture Computers contain many components, including a motherboard, processing units, memory, long-term storage, a power supply, and input/output systems. This sec- tion focuses on the components and functions that are most relevant for scientific computing applications such as CFD. 2.2.1 Floating point numbers Floating point binary number formats are standardized through IEEE 754. They are divided into the sign, the exponent and the significand (or mantissa) and are categorized according to the number of bits they utilize. [19] 2.2.1.1 Single-precision numbers A single-precision floating point number FP32 contains 32 bits. These bits are divided into 1 sign bit, 7 exponent bits, and 24 significand bits. The single-precision floating point is accurate to between 6-9 significant decimals.[19] This creates a risk of inaccuracy when executing operations if many significant decimals are important for the accuracy of the results. Below, in Figure 2.1, an arbitrary number is shown in IEEE 754 Floating point format. 0 1000 0101 0001 0100 0000 0000 0000 000 Sign Exponent Mantissa Figure 2.1: An arbitrary number represented in IEEE 754 Floating point format. 2.2.1.2 Double-precision numbers A double-precision floating point number FP64 contains 64 bits. These bits are divided into 1 sign bit, 10 exponent bits and 53 significand bits. The double-precision floating point format is accurate to between 15-17 significant digits. [19] 11 2. Theory 2.2.1.3 Floating point numbers in CFD To exemplify why this is relevant in CFD, consider how the total pressure Pt and static pressure PS are related to the dynamic pressure PD. PD = Pt − PS (2.36) If Pt = 100095.73 and PS = 99701.27 the resulting FP32 difference is 394.453 while the FP64 difference is 394.460. [20] This inaccuracy can cause incorrect results in certain cases. 2.2.2 Central Processing Unit The central processing unit (CPU) fetches, decodes, and executes instructions that are retrieved from a program sequentially. They can differ in architecture, but most contain some core features: an Algorithmic Logic Unit (ALU), registers, a control unit, and a bus that connect them. A simple machine architecture is the MARIE computer, which contains a CPU and a main memory. The CPU in the MARIE architecture contains an ALU, an accumulator, a memory bus register, a memory address register, an input register, an output register, an instruction register and a program counter. 
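The dynamic-pressure example in Section 2.2.1.3 can be reproduced with a few lines of NumPy; the sketch below is only an illustration of that subtraction with the same pressure values, not solver code.

```python
import numpy as np

# Reproducing the dynamic-pressure example from Section 2.2.1.3:
# P_d = P_t - P_s, where the two pressures agree in their leading digits.
p_total, p_static = 100095.73, 99701.27

p_d_fp32 = np.float32(p_total) - np.float32(p_static)
p_d_fp64 = np.float64(p_total) - np.float64(p_static)

print(f"FP32: {p_d_fp32:.3f} Pa")   # ~394.453, digits lost to cancellation
print(f"FP64: {p_d_fp64:.3f} Pa")   # ~394.460
print(f"relative error of FP32 result: {abs(p_d_fp32 - p_d_fp64) / p_d_fp64:.2e}")
```

The FP32 result is only accurate to about five significant digits here, which illustrates why double precision can matter when small differences of large, nearly equal quantities drive the solution.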
The ALU performs operations such as addition and multiplication. The operations are defined by physical circuits, and every CPU architecture has a set of operations that can be performed. Registers hold input and output instructions, instructions to execute and memory addresses. The control unit contains an instruction register and program counter, and initiates the next cycle loop which depends on what instruction it stores. The time it takes for a single-core CPU without hyper-threading to run a program can be simplified to: t program = instructions program ∗ avg.cycles instruction ∗ t cycle (2.37) where t cycle is the clock-speed of the CPU. Most cycles reads or writes instructions from the main memory, and the memory bus bandwidth is therefore also a factor in CPU performance. FP32CP U [FLOPS] = 8 · 2fclockncores (2.38) 2.2.2.1 Processing cores Multi-core processors have multiple CPUs on one integrated circuit (IC). They uti- lize multithreading, which is a parallelization technique that, in essence, divides the instructions from a program and places them into different containers. The instructions in each container are executed independently with local variables and synchronized globally when necessary. 2.2.3 Memory Memory bandwidth describes the rate of which data can be read and written to the Random Access Memory (RAM) of a computer. [21] Insufficient memory bandwidth 12 2. Theory can cause bottlenecks in a system, as the memory cannot write or read the data from the processing unit at a sufficient rate. Insufficient bandwidth leads to the cores being unutilized and remaining at idle, as the memory read and write cannot keep up with the speed of the cores. This is calculated as described in (2.39). β [B/s] = 8fmemorynchannels (2.39) 2.2.4 Graphics Processing Unit A Graphics Processing Unit (GPU) has thousands of small cores, called streaming processors (SPs) or Compute Unified Design Architecture (CUDA) cores. The SPs are organized into streaming multi-processors (SMPs) which batch them together in sets that share fast cache memory and can synchronize. It is optimized to execute many computations in parallel to maximize total throughput. As with multi-core CPUs, GPUs utilize multithreading, but take it one step further. In GPUs, mul- tithreading is implemented in the hardware and applied in a single instruction, multiple threads (SIMT) model. SIMT is an execution model that executes one instruction on multiple data using multithreading for parallel execution. A great example is large matrix operations where one instruction is applied to transform a large dataset, which is common in graphics applications, AI training algorithms, and in CFD solvers. FP32GP U [FLOPS] = 2fclockncores (2.40) GPUs have dedicated Video Random Access Memory (VRAM) built in. The amount of VRAM available determines how much data can be stored simultaneously, and the memory bandwidth determines how fast data can be transferred between the VRAM and the streaming processors. If the amount of data exceeds the available VRAM space, a memory overload occurs and a fatal error is induced. 2.2.4.1 Utilization To maximize the utility of GPUs, the computation tasks should have a low depen- dency on each other, low synchronization requirements and the processes should be oversubscribed, i.e. parallelized in excess of the inherent parallel capabilities of the GPU. 
Oversubscription is beneficial because threads that are not ready to run can be set aside while fetching data and switched to a ready-to-run thread that can be executed immediately, maximizing core utilization. [22] [23] In the case of the Fluent GPU solver, additional parameters need to be taken into account. When running a case on multiple GPUs, the case should have a minimum of 2 million cells per GPU. To avoid memory overload, all matrices need to fit into the memory simultaneously, and a rule of thumb is that 1 GB VRAM is required per 1 million cells for a hex- ahedral mesh, with single-precision accuracy. Double-precision requires 50% more memory and other settings such as polyhedra meshes, flow scheme and AMG solver aggregation type will also affect the memory required. [24] 13 2. Theory 2.2.5 High Performance Computing Aggregating computational resources by grouping individual computers into a clus- ter or building a supercomputer is used to maximize computational capabilities. 2.2.5.1 Power Consumption The power consumption of a HPC system is dependent on the components in the system, the cooling design, and other factors. It can roughly be estimated as: Ptot [W ] = ∑ Pcomponent + ∑ Pcooling (2.41) Although CPU temperature correlates closely with system power consumption in most applications, memory-intensive applications such as CFD can have a signif- icantly higher power consumption than other applications with similar CPU tem- peratures. Therefore, the intensity of RAM operations shows a relation with power consumption in memory-intensive applications. This is not the case for applications that are not memory-intensive.[25] The total global data center electricity demand for 2022 has been estimated to 240- 340 TWh, accounting for 1-1.3% of the total global electricity demand for 2022 of 26 600 TWh. [26] The total primary energy consumption worldwide 2022 was 168 500 TWh [27], therefore the total global data center electricity demand for 2022 accounted for 0.14-0.20% of the total global energy demand. HPC energy use is cur- rently increasing by 20-40% annually. [26] As computational demands are projected to continue increasing, energy efficiency in HPC systems becomes important to limit their environmental impact and electricity demand. GPUs provide an opportunity to perform computationally expensive calculations for a fraction of the energy con- sumption per computation of CPUs, and are thus an alternative with great potential to reduce the energy demand of HPC systems. [26] It should be noted that while HPC consumes energy, it also drives energy efficiency, renewable energy develop- ment, and resource optimization in many fields, including wind turbines, vehicle aerodynamics, and airplane design. 2.3 Business Case To evaluate an investment as a business case, many factors need to be taken into account. These include costs, benefits, risks, environmental aspects, competitors and other factors. 2.3.1 Cost-Benefit Analysis The goal of a Cost-Benefit Analysis is to compare an investment with other potential investments, usually presented as a Net Present Value (NPV) of the investment. The comparison aspect is evaluated using a discount rate, which is the expected rate of return on comparable investments. There are different methods for calculating the discount rate. These include the capital asset pricing model (CAPM), the build- up method and the Fama-French three-factor model.[28] In the context of HPC 14 2. 
Theory CAE applications, these do not make much sense to use, as the benefits of an HPC investment are hardly quantifiable. It may instead make more sense to divide the cost-benefit analysis into a Total Cost of Ownership (TCO) analysis and then treat the benefits in the discussion, preferably set in context with a comparison of another similar investment. A Pugh matrix is one way to present the potential benefits objectively. [29] 2.3.1.1 Total cost of ownership Total Cost of Ownership (TCO) can be divided into acquisition costs, operational costs, end-of-life costs, and indirect costs. [30] Acquisition costs are typically one- time capital expenditures, with the most significant being initial purchase prices, delivery and shipping fees, and installation costs. Operational costs are operational expenses that include maintenance and repairs, energy consumption, and labor costs. End-of-life costs are disposal fees and residual value. Indirect costs include training costs, downtime or lost productivity, and compliance or regulatory costs. A TCO analysis helps to discern between capital expenditures and operational ex- penditures. In some cases, it can be beneficial to pay the initial purchase price of the asset upfront, for example, if the residual value of the asset is high. If the asset on the other hand depreciates in value quickly, it may instead make more sense to transfer it into an operating expense by leasing it. In the case of HPC applications, ownership models can be divided into three main types: • On-premise infrastructure financed through capital expenditures • On-premise infrastructure financed through operating expenses (i.e. leasing) • Cloud computing financed through operating expenses, which can be set to vary on-demand The expected life span of on-premise HPC systems are 3-6 years for >80% of systems. [31] 15 2. Theory 16 Chapter 3 Methods A case-based methodology is used to evaluate the GPU solver performance and capabilities in typical aerospace simulation scenarios. Multiple simulations were run using various types of hardware to differentiate and evaluate the performance of the different solvers. In this chapter, these hardware configurations and any changes to the physical models will be brought up. Furthermore, the results of the performance evaluations are used as input in the business case analysis. 3.1 Hardware The simulations were run on a cluster called Rescale. The hardware used was lim- ited by the selection of processors and graphics cards available on Rescale. To draw conclusions from the simulations carried out, the hardware had to be carefully eval- uated before being selected to be used. The latest Nvidia card, the H200 GPU, was not available at the time of this study. Thus, the best alternatives available to use were the Nvidia A100 and H100 cards, Nvidias flagship server GPUs from the Ampere and Hopper architectures. The main differences between the cards are discussed in section 3.1.1. For a fair comparison, CPUs from Intel’s Ice-Lake and Sapphire Rapids architec- tures were selected, which highlights the performance difference between CPUs and GPUs with similar release dates. The exact CPUs were chosen with the same core count to avoid different CPU parallelization capabilities affecting the results. An- other factor that could influence the results and that is highly relevant for this type of application is the hardware’s ability to scale performance. This relates to the per- formance loss of using multiple components in parallel. 
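The scaling behaviour referred to here can be quantified as speedup relative to a single unit and parallel efficiency (speedup divided by the number of units); a minimal sketch follows. The timings in the example are hypothetical placeholders, not results from this thesis.

```python
def scaling_metrics(t_single, t_parallel, n_units):
    """Speedup and parallel efficiency of an n-unit run vs. a 1-unit run."""
    speedup = t_single / t_parallel
    efficiency = speedup / n_units          # 1.0 = ideal (linear) scaling
    return speedup, efficiency

# Hypothetical wall-clock times per 100 iterations (seconds); real values
# would come from the log files described in Section 3.2.1.
runs = {
    "2x GPU": (900.0, 520.0, 2),
    "2x CPU": (3600.0, 2100.0, 2),
}

for label, (t1, tn, n) in runs.items():
    s, e = scaling_metrics(t1, tn, n)
    print(f"{label}: speedup {s:.2f}x of ideal {n}x -> parallel efficiency {e:.0%}")
```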
By using two GPU cards, the simulation time can be reduced by at most 50%, and it is likely to scale worse than that. The same principles apply to CPUs. This means that running simulations with varying numbers of units is of relevance, as the scalability of the hardware can be investigated and analyzed. The scaling is also influenced by the interconnect bandwidth between components, which differs greatly depending on whether the components are placed on the same node or on different nodes.

3.1.1 Hardware specification

                            Cores   Clock speed [GHz]   Released   W_TDP [W]
Intel Xeon Platinum 8375C   32      3.48                Q3 2021    300
Intel Xeon Gold 6455B       32      3.00                Q1 2023    ~275

Table 3.1: CPUs used, with core counts and clock speeds taken directly from the process_output.log file in Rescale.

                    Memory [GB]   Memory bandwidth [GB/s]   FP32 SPs   SMPs   Released   W_TDP [W]
Nvidia A100 PCIe    80            1845.7                    6912       108    Q2 2020    300
Nvidia H100 NVL     94            3836.4                    16896      132    Q3 2022    400

Table 3.2: GPUs used, with memory, memory bandwidth, and core counts taken directly from the process_output.log file in Rescale.

All hardware is tested in 1x and 2x component configurations to test scaling. All configurations are run on one node only, to avoid interconnect bandwidth between nodes influencing the results. In Figure 4.3, the influence of interconnect speed on total simulation time can be seen. Both simulations were done on the same CPU, but with varying interconnect speeds.

3.2 Calculation methods

Equations to measure the elapsed time were defined to quantify and compare results from the simulations. The simulation time is normalized to the same number of iterations for all cases. The normalized simulation time can be combined with the iterations required for convergence to get the true simulation time. Another metric that was compared between the simulations was the power consumption of the different hardware components during the simulation, which was estimated based on the simulation time and the hardware data.

3.2.1 Time

The run time of each simulation was determined from the output log files. The startup time is defined as the time between the timestamp when Ansys is started (the first timestamp in the log file) and the timestamp of the first iteration. The simulation time t_sim is defined as the time between the timestamps of the first and the last iteration, as shown in (3.3). The shutdown time is defined as the time between the timestamp of the last iteration and the last timestamp of the log file. The start and shutdown time t_sas is defined as the sum of the startup time and the shutdown time, as defined in (3.2). The total time is then obtained by summing the time for loading, initializing, solving, and saving the case, as seen in (3.1).

t_tot [s] = t_sas + t_sim    (3.1)

where

t_sas = (ts_first iteration − ts_Fluent started) + (ts_end − ts_last iteration)    (3.2)

t_sim = ts_last iteration − ts_first iteration    (3.3)

3.2.2 Energy consumption

The energy consumption for the different cases and setups is calculated differently depending on the hardware used for that setup. For the simulations running on the CPU solver, the systems tested only have CPUs, thus only making the power draw and number of CPUs relevant, along with the total time the system is active. This results in (3.4).
The base power draw of the rest of the computer is not accounted for in the energy consumption equations, as it cannot be estimated as accurately as the energy consumption of the processing units without access to the computers.

E_tot,CPU [Wh] = (1 + k_cooling) · t_tot · (W_TDP,CPU · n_CPU) / 3600    (3.4)

where k_cooling = 1.5, and n_CPU and n_GPU represent the number of CPUs and GPUs in the current setup.

For the GPU solver, both the GPU and the CPU will be active, but at different times during the simulations. During the startup and shutdown time, the CPU is running, as this process is not handled by the GPU. During the iteration time, the CPU will have its minimum power draw, here noted as W_idle,CPU. This results in equation (3.5).

E_tot,GPU [Wh] = (1 + k_cooling) · [ t_sas · (W_TDP,CPU · n_CPU + W_idle,GPU · n_GPU) + t_sim · (W_idle,CPU · n_CPU + W_TDP,GPU · n_GPU) ] / 3600    (3.5)

The factor k_cooling is slightly conservatively estimated as 1.5 in order to include the power consumption of the cooling necessary to keep the simulation hardware running at maximum performance over extended periods of time. [32] In Table 3.3, the power draw of the components used in the simulations is listed. The two AMD CPUs are paired with the A100 and H100 GPUs, respectively, on Rescale.

                            W_TDP [W]   W_idle [W]   Paired with
Intel Xeon Platinum 8375C   300         50           N/A
Intel Xeon Gold 6455B       275         50           N/A
Nvidia A100 PCIe            300         150          AMD 7v13
Nvidia H100 NVL             400         150          AMD 9684X
AMD 7v13                    240         50           A100
AMD 9684X                   400         100          H100

Table 3.3: Power consumption metrics for the hardware used.

3.3 Simulations

All simulations were solved in Ansys Fluent 2025 R1 on both the CPU solver and the native GPU solver. The simulations were run on the Rescale HPC cluster.

3.3.1 Porting to GPU solver

It is possible to load cases from the CPU solver straight into the GPU solver. When the case is loaded into the GPU solver, some unsupported features are automatically unselected and some need to be replaced manually. If, after that, the case still cannot be run, the simulation file needs to be rebuilt from scratch. To do this, the mesh file is loaded into the GPU solver, and great care is taken to implement all supported features and settings exactly as they were implemented in the CPU solver. If the case still does not run, the same case is run on the CPU solver to check whether removing the features that are unsupported on the GPU is viable with respect to the physics. If the CPU solver converges to a solution but the GPU solver does not, the initialization and solver settings can be tweaked according to the suggestions in the Ansys Fluent User Guide. After porting a case to the GPU solver, it is recommended to verify its accuracy, especially if features and settings have been modified. This is done by choosing some relevant variables and coefficients to study and comparing the original CPU solver stock case, the CPU solver case with modified features and settings, and the GPU solver case with modified features and settings.

3.3.2 TRS case

The first case, shown in Figure 3.1, is a Turbine Rear Structure (TRS) case with outlet guide vanes (OGVs) that is validated against experimental results from a test rig at Chalmers University of Technology. This test rig is a closed-circuit, low-speed, large-scale 1.5-stage LPT-OGV facility which can achieve operating conditions that are realistic for up to large turbofan engines, with Reynolds numbers of up to 435 000. For this case, the average Reynolds number for the inlet flow is 305 500 and
Methods is based on the height of the inlet channel (outer radius minus inner radius). The total and static pressure coefficients are normalized against reference values that are measured in the bulk flow region between the OGVs. [33] Y Z X Figure 3.1: Isometric view of the TRS case The inlet boundary is modeled as a pressure inlet through profiles of total pressure with respect to the radius, and the velocity is divided into profiles of its axial, radial, and tangential components. The outlet boundary is modeled as a pressure outlet with Radial Equilibrium Pressure Distribution. • Solver – Pressure-based – Steady-state – Low-Re Correction – Coupled – Distance-based Rhie-Chow – Green-Gauss cell based spatial discretization gradient • Physics – k − ω SST – Viscous Heating – Low-Re Correction – Energy Equation • Materials – Aluminum – Air ∗ Ideal gas ∗ Sutherland viscosity ∗ Cp and thermal conductivity defined through polynomials • Inlet Boundary – Pressure Inlet – Total pressure profile 21 3. Methods – Cylindrical coordinate system velocity profiles • Outlet Boundary – Pressure Outlet – Radial Equilibrium Pressure Distribution 3.3.2.1 GPU solver setup As mentioned earlier in the report, all features and physics models are not available in the GPU solver, resulting in the need for changes in simulation setup in order to be able to run it. For the TRS case a number of changes have been done in order to comply with the models available in the GPU solver. The case file that was provided was run on the regular CPU solver in order to establish a baseline result that can be used to compare with the results from the GPU solver. Along with this, experimental data was provided by GKN, allowing for validation of the changes and simplifications made to the simulation file. First, the case file was not able to run as it was stock, so the mesh was imported and then the case was set up from scratch using the same mesh file as the stock case. To achieve the same results in both solvers, the boundary conditions needed to match as close as possible. The stock case has a profile file attached that is used to generate the correct swirl on the inlet air. This cylindrical coordinate profile data was not possible to use as the GPU solver does not support profiles for cylindrical coordinates. The only type of profiles supported was for cartesian coordinates. In order to obtain the correct values for this type of coordinate system, the stock case was run on the CPU solver and the inlet velocity vector field was then exported in cartesian coordinates. One feature that was supported for cylindrical coordinates was to re-write the profile data as mathematical expressions. This was done and the data was curve fitted as polynomials, but this caused results that did not align with the experimental data nor with the simulations from the CPU solver. After some investigation comparing the expressions in the GPU and CPU solver, it was concluded that the polynomials were a good approximation, but a bug in the GPU solver resulted in an incorrect interpretation. This bug was logged to Ansys and the chosen method was to export the profile data from the stock CPU solver case in cartesian coordinates and use that as inlet boundary condition. All other boundary conditions were set up with the same values as in the stock case, such as the inlet velocity, temperature etc. The case was only tested using the k−ω SST model. This model is the most robust and also proved to work reliably in the GPU solver. 
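For reference, the geometric relationship between the cylindrical profile components and the Cartesian components that the GPU solver accepts is shown in the sketch below. In this thesis the converted field was exported directly from the CPU solver rather than computed by hand, so this is purely illustrative; it assumes the machine axis is aligned with x, and the profile points are hypothetical.

```python
import numpy as np

def cyl_to_cartesian(y, z, v_axial, v_radial, v_tangential):
    """Convert axial/radial/tangential velocity components to u, v, w.

    Assumes the machine axis is aligned with x, so the cylindrical
    decomposition lives in the y-z plane. Adjust if the axis differs.
    """
    theta = np.arctan2(z, y)
    u = np.full_like(theta, v_axial)
    v = v_radial * np.cos(theta) - v_tangential * np.sin(theta)
    w = v_radial * np.sin(theta) + v_tangential * np.cos(theta)
    return u, v, w

# Hypothetical profile points (radius increasing along y at z = 0):
y = np.array([0.30, 0.35, 0.40])
z = np.zeros_like(y)
u, v, w = cyl_to_cartesian(y, z, v_axial=50.0, v_radial=0.0, v_tangential=12.0)
print(np.column_stack([y, z, u, v, w]))   # columns: y, z, u, v, w
```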
For the scope of this thesis, it was deemed enough to test with one turbulence model in order to compare the performance between the solvers. The realizable k − ϵ model is also supported, but is not considered necessary to test for the scope of this thesis. The transition SST model is not yet supported by the GPU solver. For a more complete list of supported features, see Appendix A. The TRS case is a relatively small case with 2 million cells, accounting for a 30° 22 3. Methods slice of the turbine rear structure. To unleash more of the potential in the GPU solver, a full 360° case with 24 million cells was constructed and tested. To further understand the performance scaling on GPUs, it was rotated to 60°, 120°, 180°, 240°, and 300° as well. Figure 3.2: The full 360° and the 30° section of the TRS case. 3.3.3 Nozzle case The second case is a truncated ideal convergent-divergent nozzle for a first stage launcher that is optimized from the open-source RFZ model, which is modeled after the SpaceX Falcon 9. [34] It has supersonic flow and contains combustion, which is modeled using species transport and volumetric reactions with 23 species. Turbu- lence is modeled using the k − ω SST model. The case takes a long time to solve, and the bulk of the calculation time is spent on calculating the volumetric reactions. Listed below are some of the most important settings used in the simulation setup. • Mesh – 41.7 e+03 cells – 125 e+03 faces – Periodic boundaries – Resolved walls • Physics – k − ω SST – Viscous heating – Production limiter – Energy equation • Species transport – Volumetric reactions – Stiff chemistry solver – ISAT – Diffusion energy source – Finite-rate/NO TCI – 23 species Chemkin • Inlet boundary – Pressure inlet – Pt = 10.5 MPa 23 3. Methods – Tt = 3614K – Species fractions • Outlet boundary – Pressure outlet – Pt = 101.3 kPa – Tt = 3615 K – No species • Solver – Coupled – Least squares cell based – Rhie-chow momentum – Global time step – Double-precision • Initialization and solution – Hybrid – 500 iterations without reactions to establish flow field – 15000 iterations with reactions XZ Y Figure 3.3: Isometric view of the Nozzle case. 3.3.3.1 GPU solver setup The nozzle case features a quite complex set of physics, using the stiff chemistry solver with volumetric reactions to solve the combustion in a rocket nozzle. Some of these features have been implemented in recent updates. A Chemkin database was imported to define correct physical properties for the species involved in the combustion process. A number of changes to the stock simulation file was made in order to try to make it run on the GPU solver. The inlet had to be changed from a pressure inlet to a velocity inlet or to a mass flow inlet in order to get the chemistry solver to run at 24 3. Methods all, avoiding floating point errors. The values used for the velocity inlet were taken from a simulation run on the CPU solver. An attempt was made to rebuild the file from the mesh, similar to the method used for the TRS case in Section 3.3.2. The original mesh was created as a 2D-mesh in ICEM. The GPU solver only supports 3D-meshes, so the 2D mesh was rotated slightly to create a 3D mesh. Since the mesh was created in ICEM it posted chal- lenges to work with the mesh and redo it in Fluent. Attempts to make the case run were also made by changing all available under- relaxation factors, and also by letting the initial flow field solution converge further before turning the volumetric reactions on. 
After struggling to reach convergence in the stiff chemistry solver on the GPU solver with a case setup that worked in the CPU solver, a bug report was filed to Ansys and the case was investigated by their combustion experts.

3.3.4 Rotor case

The Rotor case is a large 360° model of an axial compressor without the stator. This kind of high-fidelity model may be of interest when investigating transient effects or rotor-stator interaction, for example. The Rotor case, set up in CFX, is converted into a Fluent solver case which is meant to be as equivalent and representative as possible. The goal is not explicitly to find a setup which gives perfect accuracy between the two different solvers, but rather to test the speedup and feature support of the GPU solver on a compressor application. The flow in the case is transonic.

Figure 3.4: Isometric view of the Rotor case.

3.3.4.1 CFX setup

• Mesh
– 32.75 million cells
– 100.1 million faces
– 3 cell zones: inlet duct, rotor, outlet duct
– Inlet-rotor interface
– Rotor-outlet interface
– Resolved walls
• Physics
– k-ω SST
– Energy equation
– Viscous work
• Inlet boundary
– Pressure inlet
– Pt = 25.00 kPa
– Tt = 288.15 K
• Rotor cell zone
– Mesh motion
– 17000 rev/min
– Stationary walls: shroud and static hub
– Pref = 0 kPa
• Outlet boundary
– Pressure outlet
– Ps = 25.25 kPa
– Tt = 288.15 K
• Inlet duct to rotor and rotor to outlet duct interfaces
– Transient Rotor Stator frame change
• Solver
– Second-order backward Euler
– Double-precision
• Initialization and solution
– Initialized with u = 10 m/s
– 2304 time steps
– 3.06e-06 s time step size
– 2 full rotations
– Residual target 1e-05

3.3.4.2 Fluent GPU and CPU setup

The mesh was imported into Fluent and the setup was redone with the objective of matching the CFX case as closely as possible. As the case is very large, it was not possible to iterate on the case setup in a trial-and-error fashion, and this resulted in some settings differing. Most importantly, the operating pressure was set to 101.325 kPa in Fluent and 0 kPa in CFX. The case setup on the CPU and GPU solvers in Fluent is identical, however. A short consistency check of the rotational speed and time-step settings is given after the list below.

• Mesh
– 32.75 million cells
– 100.1 million faces
– 3 cell zones: inlet duct, rotor, outlet duct
– Inlet-rotor interface
– Rotor-outlet interface
– Resolved walls
• Physics
– k-ω SST
– Production limiter
– Energy equation
• Inlet boundary
– Pressure inlet
– Pt = 25.00 kPa
– Tt = 288.15 K
• Rotor cell zone
– Mesh motion
– 1780 rad/s (equal to 17000 rev/min)
– Stationary walls: shroud and static hub
– Pref = 101.325 kPa
• Outlet boundary
– Pressure outlet
– Ps = 25.25 kPa
– Tt = 288.15 K
• Solver
– SIMPLE
– Least squares cell based
– Rhie-Chow momentum
– Second-order implicit transient formulation
– Double-precision
• Initialization and solution
– Hybrid
– 2304 time steps
– 3.06e-06 s time step size
– Maximum 20 iterations per time step
– 2 full rotations
– Residual target 1e-04
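As a quick consistency check of the transient settings above, the minimal sketch below uses only the numbers from the setup lists (an editorial illustration, not part of the original workflow): the 1780 rad/s used in the Fluent rotor zone corresponds to the 17000 rev/min of the CFX setup, and two full rotations divided into 2304 time steps gives the stated time-step size of roughly 3.06e-06 s.

```python
import math

rpm = 17_000         # rotor speed from the CFX setup [rev/min]
n_revs = 2           # the solution covers two full rotations
n_steps = 2304       # number of time steps over those rotations

omega = rpm * 2 * math.pi / 60        # angular speed [rad/s]
dt = n_revs * (60 / rpm) / n_steps    # time-step size [s]

print(f"omega = {omega:.0f} rad/s")   # ~1780 rad/s, matching the Fluent rotor zone
print(f"dt    = {dt:.3e} s")          # ~3.064e-06 s, matching the listed step size
```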
3.4 Business case

To build a business case for the investment in GPU solutions in an HPC environment, a wide range of factors must be considered and evaluated. Some relevant costs and benefits include the following:

• Costs
– Direct costs
∗ Purchasing price or leasing cost (including the complete computer system)
∗ Licensing cost
∗ Cooling system cost
– Indirect costs
∗ Electricity
∗ Facilities
∗ Personnel
– Intangible costs
∗ Reduced simulation capacity of CPUs
– Opportunity costs
∗ Purchasing CPUs instead
∗ Reserving computation capacity off-site
∗ Accessing GPUs on-demand
• Benefits
– Direct benefits
∗ Reduced cost per simulation
– Indirect benefits
∗ Shorter time per simulation
– Intangible benefits
∗ Faster turnaround
∗ More design iterations
∗ Increased simulation capacity
∗ Environmental impact
– Competitive benefits
∗ Shorter delivery times
∗ Optimized products

The main opportunity cost to compare against is purchasing CPUs instead, as GKN has a policy against sending simulations off-site for security reasons. Cloud computing solutions are nevertheless investigated briefly, as cost data was acquired throughout the project.

Computer system costs are primarily estimated from the pricing of real components with the Exxact configurator. [35] This is used in conjunction with the manufacturers' recommended prices.

Benefits are compared and summarized with a Pugh matrix. The reasoning behind using a Pugh matrix for this purpose is that the benefits are hard to quantify, that the importance of each benefit is subjective and varies between use cases and existing infrastructure, and that a Pugh matrix allows for a useful overview of the expected benefits.

3.4.1 Assumptions

To simplify calculations, the server cost is split into 3 parts: CPU, GPU, and base system. The base system includes the other components needed for a complete server computer and is the same for both CPU and GPU servers for the sake of simplicity. The base system power draw is estimated from a server build with the AMD Epyc 9684X CPU and an Nvidia A6000 GPU. [36]

Base system cost: €8 000
Base system power draw: 300 W
Base system breakdown:
– Rack-mountable server
– Motherboard
– 192 GB RAM
– 2 TB SSD
– 10 TB hard drive
– 2x 1GBase-T Ethernet
– 1x 1GbE dedicated management port (IPMI)
– 2x 2600 W (1+1) redundant power supplies, 31.5" depth
– Linux OS

The installation cost is fixed at €1 000 and the delivery and shipping costs are estimated at €200 per server computer.

The prices for the CPUs and GPUs are estimated according to Table 3.4.

Hardware                      Price [EUR]   Paired with
Intel Xeon Platinum 8375C     1 650         N/A
Intel Xeon Gold 6455B         2 800         N/A
Nvidia H100 NVL               25 000        AMD 9684X
AMD 9684X                     6 000         H100

Table 3.4: Estimated purchasing prices for the hardware used.

3.4.2 Definitions

The cost-benefit analysis calls for data that needs to be calculated, estimated, and assumed through different equations and definitions. This is not an exact science; rather, a balance needs to be struck between accuracy and simplicity. For completeness, the factors not included in the calculations are also explained.

Leasing cost:

C_{\text{leasing}} = \sum_{\text{year}=1}^{3} C_{\text{leasing,year}} \quad (3.6)

Electricity cost:

C_{\text{electricity}} = 24 \cdot 365 \cdot P_{\text{tot}} \, p_{\text{electricity}} \quad (3.7)

where the electricity price p_electricity = €0.087 per kWh.

Regarding the licensing cost, Ansys has a license package called Ultimate that includes both the license for Fluent and the HPC packs. The difference between CPU and GPU license cost will thus be zero, as the same Ultimate license can be used to run simulations on both solvers, making the total licensing cost irrelevant to the cost comparisons. In addition, the cost calculations in this project compare adding systems to an already-existing cluster, while the Ultimate license package covers the entire cluster.

Power consumption:

P_{\text{tot,CPU}} \, [\mathrm{W}] = n_{\text{CPU}} \left( \frac{W_{\text{Base system}}}{n_{\text{CPU per system}}} + W_{\text{TDP,CPU}} \right) (1 + k_{\text{cooling}}) \quad (3.8)

where k_cooling = 1.5 and n_CPU/GPU is the number of CPUs or GPUs in the current setup. The power consumption calculation for the GPU system assumes that the GPUs are running at full load, that the CPU runs at idle, and that all GPUs fit into one system.

P_{\text{tot,GPU}} \, [\mathrm{W}] = \left( W_{\text{idle,CPU}} + W_{\text{TDP,GPU}} \, n_{\text{GPU}} + W_{\text{Base system}} \right) (1 + k_{\text{cooling}}) \quad (3.9)

The cooling system cost is calculated as

C_{\text{cooling system}} = \frac{C_{\text{cooling}} \, P_{\text{tot}} \, k_{\text{cooling}}}{1 + k_{\text{cooling}}} \quad (3.10)

where the cooling cost C_cooling is estimated at €1.5 per W of cooling needed. This assumes liquid cooling, that the systems evaluated have a small number of nodes, and that the customer already has a cooling tower to produce 35°C water for the Coolant Distribution Units.
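A minimal sketch of how these definitions combine in practice, using the assumptions stated above (300 W base system, k_cooling = 1.5, €0.087 per kWh, €1.5 per W of cooling). The power figure is converted from W to kW before the electricity price is applied. The TDP and idle values in the example calls are placeholders to be replaced with the actual figures for the hardware considered; the 300 W used for the Xeon 8375C is the manufacturer's TDP rating, while the GPU-node values are purely illustrative.

```python
K_COOLING = 1.5        # cooling factor, Section 3.4.2
P_ELECTRICITY = 0.087  # electricity price [EUR/kWh]
C_COOLING = 1.5        # cooling system cost [EUR per W of cooling]
W_BASE = 300.0         # base system power draw [W], Section 3.4.1

def p_tot_cpu(n_cpu, w_tdp_cpu, n_cpu_per_system=2):
    """Eq. (3.8): total power of a CPU system, including cooling overhead [W]."""
    return n_cpu * (W_BASE / n_cpu_per_system + w_tdp_cpu) * (1 + K_COOLING)

def p_tot_gpu(n_gpu, w_tdp_gpu, w_idle_cpu):
    """Eq. (3.9): total power of a GPU system (GPUs at full load, host CPU idle) [W]."""
    return (w_idle_cpu + w_tdp_gpu * n_gpu + W_BASE) * (1 + K_COOLING)

def yearly_electricity_cost(p_tot_w):
    """Eq. (3.7): electricity cost for one year of continuous operation [EUR]."""
    return 24 * 365 * (p_tot_w / 1000.0) * P_ELECTRICITY

def cooling_system_cost(p_tot_w):
    """Eq. (3.10): one-off cost of the cooling system [EUR]."""
    return C_COOLING * p_tot_w * K_COOLING / (1 + K_COOLING)

# Examples: 24 CPUs at 300 W TDP, and a 2-GPU node with placeholder 400 W / 100 W values.
p_cpu = p_tot_cpu(n_cpu=24, w_tdp_cpu=300)
p_gpu = p_tot_gpu(n_gpu=2, w_tdp_gpu=400, w_idle_cpu=100)
print(round(p_cpu), round(yearly_electricity_cost(p_cpu)), round(cooling_system_cost(p_cpu)))
print(round(p_gpu), round(yearly_electricity_cost(p_gpu)), round(cooling_system_cost(p_gpu)))
```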
Chapter 4

Results

The simulation results are split by case and used to build business cases of GPU purchases and CPU purchases that account for equivalent simulation capacity. As the GPU solver does not support all features of the CPU solver, the business case is designed to evaluate replacing a minority of the CPU simulation capacity with GPUs in a first-stage adoption of GPUs for CFD simulations.

4.1 TRS case

The TRS case compares all CPUs and GPUs in Tables 3.1 and 3.2 in a 1x and 2x configuration. The comparisons shown are for the 30° case and the 360° case.

4.1.1 Simplifications and issues

The TRS case required some setup modifications to run on the GPU solver, as outlined in Section 3.3.2.1. To summarize, the original pressure inlet was replaced with a velocity inlet, the cylindrical coordinate system velocity profiles in the inlet were replaced with a Cartesian velocity vector field imported from the CPU solver solution, the radial equilibrium pressure distribution setting in the outlet was removed, the Low-Re correction was removed from the turbulence model, and finally, the cell-based spatial discretization was changed from Green-Gauss to least squares. In addition to this, it was not possible to run the 30° TRS case on 2x Nvidia H100 GPUs.

4.1.2 Time

In the simulation time comparison of the 30° TRS case shown in Figure 4.1, it is observed that the time to perform 300 iterations is reduced on every Nvidia GPU setup compared to every Intel CPU setup. The start and shutdown time is slightly higher on the GPUs than on the CPUs. The total time reduction ranges between 41.2-77.5%. The 300-iteration time reduction ranges between 69.7-90.3%.

Figure 4.1: 30° TRS case in single precision. (300-iteration times: 1x Intel 8375C (32 cores) 647 s, 2x Intel 8375C (64 cores) 388 s, 1x Intel 6455B (32 cores) 446 s, 2x Intel 6455B (64 cores) 306 s, 1x Nvidia A100 160 s, 2x Nvidia A100 145 s, 1x Nvidia H100 180 s.)

In the simulation time comparison of the 360° TRS case shown in Figure 4.2, the reduction of the 300-iteration time on GPUs is larger than in the 30° case. The total time reduction ranges between 78.9-95.2%. The 300-iteration time reduction ranges between 86.8-98.0%.

Figure 4.2: 360° TRS case in single precision. (300-iteration times: 1x Intel 8375C 6,888 s, 2x Intel 8375C 3,417 s, 1x Intel 6455B 4,291 s, 2x Intel 6455B 2,363 s, 1x Nvidia A100 500 s, 2x Nvidia A100 462 s, 1x Nvidia H100 421 s, 2x Nvidia H100 332 s.)

In Figure 4.3, it can be seen how varying interconnect speeds between nodes affect the overall simulation time for the 30° simulation of the TRS case.
Both simulations were run using four Intel 8375C CPUs, each with 32 cores. Each node contains two CPUs, resulting in two nodes in use.

Figure 4.3: Interconnect speed on the 30° simulation of the TRS case. (Time for 300 iterations: 650 s at 50 Gbps, 310 s at 200 Gbps.)

Figure 4.4 shows the simulation time of the TRS case on single A100 and H100 GPUs for different mesh sizes. It has been obtained by copying and rotating the 30° slice of the turbine, resulting in mesh sizes ranging between 2-24 million cells.

Figure 4.4: GPU solver simulation time over mesh size for the TRS case.

4.1.3 Energy consumption

Figure 4.5 shows the energy consumption for running 300 iterations on the 360° TRS case in single precision. The reduction in energy consumption on the tested GPUs compared to the tested CPUs ranges between 87.7-93.4%.

Figure 4.5: 360° single precision energy consumption for running 300 iterations. (1x Intel 8375C 1,435 Wh, 1x Intel 6455B 819.5 Wh, 1x Nvidia A100 94.3 Wh, 1x Nvidia H100 101.7 Wh.)

4.1.4 Accuracy and comparison

The single-precision accuracy comparison in Figure 4.6 shows circumferentially-averaged data from the stock 30° CPU TRS case and compares it with the simplified GPU case, and with experimental data from the test rig. It compares the total pressure coefficient C_P0 in an outlet plane. The simplified GPU setup correlates closely with the stock CPU case for the total pressure coefficient C_P0, although the physics settings differ slightly, as discussed in Section 4.1.1. The simplified GPU case has also been compared against a CPU case with the same simplifications; C_P0 is equal up to ∼2 significant digits, while the total pressure is equal up to ∼5 significant digits. When plotting the simplified GPU case against the CPU case with the same simplifications in the same manner as in Figure 4.6, they look identical to the naked eye.

Figure 4.6: Circumferentially-averaged C_P0 over span in the outlet (experimental data, stock CPU case, and simplified GPU case).

4.1.4.1 Convergence

Convergence was tested on the 30° version of the TRS case, using the same convergence criteria for both solvers, run in both single and double precision. Figures 4.7 and 4.8 show the number of iterations required for the solution to converge. The criterion was set to 1e-04 for all residuals, and all residuals needed to reach this value for the solution to converge. The residuals were continuity, x-, y-, and z-velocities, energy, k, and ω. The data from the convergence comparisons shows that the GPU solver requires 51.4% fewer iterations than the CPU solver in single precision, and 26.7% fewer iterations in double precision.

Figure 4.7: Number of iterations until converged for the 30° TRS case in single precision (CPU 237, GPU 115).

Figure 4.8: Number of iterations until converged for the 30° TRS case in double precision (CPU 217, GPU 159).

4.1.5 Cloud cost

Running the 360° TRS case on the GPU solver on the Rescale cloud results in a cost reduction of 83.3% for the simulations in Figure 4.2 compared to running it on the CPU solver on the Rescale cloud. The GPU solver is run on 2x H100 GPUs, and the CPU solver on 2x 6455B CPUs.
Figure 4.9: 360° TRS case Rescale cost for the total time of the simulation (2x Intel 6455B $10.7, 2x Nvidia H100 $1.79 for 300 iterations).

4.2 Nozzle case

Thorough testing of the nozzle case was done, but convergence could not be reached on the GPU solver without complete changes to the case setup. The same GPU solver setup of the case was successfully solved on the CPU solver. The conclusion is that the nozzle case is not possible to solve on the GPU solver. After Ansys tested the case, an Ansys combustion expert confirmed that this nozzle case from GKN does not run on the GPU solver in version 2025 R1. The closest modifications that can be made to reach convergence are to convert the case to transient and rotate the mesh to 90°. This makes the case take a significantly longer time to run, completely erasing the potential speedup benefits of running the case on the GPU solver.

It should also be noted that the nozzle case would likely utilize GPUs poorly anyway, since the mesh only contains 41 k cells and because the bulk of the calculation time is taken by the stiff chemistry solver, which solves different equations compared to the fluid flow equations, and those equations appear to be less suited to parallelization.

4.3 Rotor case

The Rotor case was run in double precision, and since the case is computationally expensive, only two computing setups were compared: 8x Intel 8375C CPUs totaling 256 cores, and 2x Nvidia H100 GPUs. As discussed in Section 4.3.5, the Fluent case was set up differently than the CFX case. However, this does not affect the comparisons between the Fluent CPU solver and the Fluent GPU solver.

4.3.1 Simplifications and issues

The Rotor case ported into Fluent also required some simplifications. The most relevant is that the Fluent GPU solver only supports the SIMPLE pressure-velocity coupling scheme, while CFX is a coupled solver. The case also did not run on 1x H100 GPU, as the VRAM was exceeded.

4.3.2 Time

The Fluent GPU solver is compared with the Fluent CPU solver. Figure 4.10 shows a 66.6% total time reduction between 8x Intel 8375C CPUs totaling 256 cores and 2x Nvidia H100 GPUs per 10 000 iterations. It can be seen that the start time becomes negligible for cases that require this scale of computation. Figure 4.10 does not take into account the significantly faster convergence of the GPU solver.

Figure 4.10: Rotor case double precision time per 10 000 iterations (8x Intel 8375C (256 cores) 19,296 s, 2x Nvidia H100 6,449 s).

4.3.3 Energy consumption

Figure 4.11 shows the energy consumption per 10 000 iterations for the Rotor case in double precision. The energy consumption is reduced by 88.2%.

Figure 4.11: Rotor case double precision energy consumption per 10 000 iterations (8x Intel 8375C 32.16 kWh, 2x Nvidia H100 3.78 kWh).

4.3.4 Convergence

Figure 4.12 shows a 73.1% reduction in the number of iterations required for the Rotor case in double precision to converge, with the convergence criterion set to 1e-04 for all residuals, all residuals needing to reach this value for the solution to converge, and each time step allowing a maximum of 20 iterations. The residuals were continuity, x-, y-, and z-velocities, energy, k, and ω. It is noted that the continuity equation converged significantly slower on the CPU solver.
Figure 4.12 is a comparison between the Fluent CPU solver and the Fluent GPU solver with the SIMPLE pressure-velocity coupling scheme. The Rotor case was also tested with the PISO pressure-velocity coupling scheme in the CPU solver, and similar convergence behavior to the SIMPLE scheme in the CPU solver was observed. It should also be noted that the CPU solver consistently hit the maximum of 20 inner iterations per time step for most time steps for both the SIMPLE and PISO coupling schemes. Therefore, the reduction in the number of iterations required may have been even more dramatic had the 20 inner iteration maximum been removed.

Figure 4.12: Number of iterations until converged for the Rotor case (CPU 37,781, GPU 10,146).

4.3.5 Accuracy and comparison

In Table 4.1, the results from the Rotor case from the original CFX file are compared to the results obtained from the simulations in Fluent's GPU solver. The results from the Fluent CPU solver were not saved, as the simulation reached the maximum walltime before it finished running. The ambient pressure was set 5x higher in Fluent, making the actual fluid flow case compared different from the CFX case. However, it was still considered worthwhile to compare the pressure ratio and corrected mass flow between the case setups. The pressure ratio is defined in (2.32), and the corrected mass flow in (2.33). It can be seen that the pressure ratio Pratio over the rotor is about 1.3% larger in CFX, and the corrected mass flow ṁcorr is 33.4% larger in CFX. Pt1 is the total pressure in the interface between the inlet duct and the rotor, and Pt2 is the total pressure in the interface between the rotor and the outlet duct.

Quantity         CFX      Fluent GPU
Pt1 [kPa]        24.84    124.1
Pt2 [kPa]        27.55    135.9
Pratio           1.109    1.095
ṁcorr [kg/s]     2.651    1.988

Table 4.1: Comparison between results from CFX and the Fluent GPU solver.

4.3.6 Cloud cost

Running the Rotor case on the GPU solver on the Rescale cloud reduces the cost by 90.8% per 10 000 iterations compared to the CPU solver. The GPU solver is run on 2x H100 GPUs and the CPU solver is run on 8x 8375C CPUs.

Figure 4.13: Rotor case Rescale cost per 10 000 iterations (8x Intel 8375C (256 cores) $398.46, 2x Nvidia H100 $36.64).

4.4 Business case

The faster convergence of the GPU solver is not taken into account in the business case, as it is likely to vary depending on the convergence criteria that are chosen and the case that is run.

4.4.1 Total cost of ownership

An estimation of total cost of ownership (TCO) is most useful if set in relation to an opportunity cost. In this case, the TCO of 2x H100 GPUs is compared to CPUs of equivalent simulation capacity, meaning that the two different investments provide the capacity to perform the same simulations in the same timeframe. In the TRS case comparison, the CPU cases are run on one node each. The simulations on the TRS case presented in Figure 4.2 can be interpreted to show that 7 simulations can be run sequentially on the two H100 GPUs in the same timeframe that 7 simulations can be run in parallel on 14x Intel 6455B CPUs. This interpretation is visualized in Figures 4.14 and 4.15 and serves as a basis for the TCOs in the business case. Since it is possible to mount a maximum of 2 CPUs in one computer, 14 CPUs require 7 computers to house them.

Figure 4.14: Visualization of equivalent simulation capacity for 2x Nvidia H100 compared to 14x Intel 6455B.

Figure 4.15: Equivalent simulation capacity for the TRS case when comparing 14x Intel 6455B with 2x Nvidia H100.
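The equivalent-capacity interpretation above follows directly from the 300-iteration times plotted in Figure 4.2. The short sketch below reproduces that reasoning; it is an editorial illustration using the plotted values, not a calculation taken from the thesis itself.

```python
t_gpu_node = 332    # 300-iteration time on one 2x Nvidia H100 node [s] (Figure 4.2)
t_cpu_node = 2363   # 300-iteration time on one 2x Intel 6455B node  [s] (Figure 4.2)

# Number of simulations the GPU node can run back to back within the time
# one dual-CPU node needs for a single simulation:
sims_back_to_back = int(t_cpu_node // t_gpu_node)   # -> 7

# So roughly 7 dual-CPU nodes (14 CPUs), each running one simulation in parallel,
# deliver the same throughput as the single 2x H100 node over that window.
print(f"{sims_back_to_back} GPU runs fit within one CPU-node run "
      f"-> ~{sims_back_to_back} CPU nodes ({2 * sims_back_to_back} CPUs) of equivalent capacity")
```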
Table 4.2 shows the TCO of 2x Nvidia H100 GPUs. It assumes that they are fitted in one computer system with one AMD 9684X CPU. It includes the cost of expanding the cooling system, the shipping, delivery, and installation costs, and the cost of electricity. It does not include other operational costs, end-of-life costs, and indirect costs, as they either require rough estimations or disclosure of sensitive information.

Cost item                          Year 1      Year 2      Year 3      3-year cost
Acquisition costs
  GPU lease                        €17 000     €17 000     €17 000     €51 000
  CPU lease                        €2 000      €2 000      €2 000      €6 000
  Rest of computer                 €2 670      €2 670      €2 670      €8 000
  Cooling system cost              €1 250                              €1 250
  Delivery and shipping costs      €200                                €200
  Installation costs               €1 000                              €1 000
Operational costs
  Electricity                      €2 300      €2 300      €2 300      €6 900
  Maintenance and repairs          Not included
  Labor costs                      Not included
  Facilities                       Not included
  Licensing costs                  Not included
End-of-life costs
  Disposal                         N/A
  Residual value                   N/A
Indirect costs
  Training costs                   Not included
  Downtime or lost productivity    Not included
  Compliance or regulatory costs   N/A
Sum                                                                    €75 350

Table 4.2: Total Cost of Ownership of 2x Nvidia H100 over a 3-year period.

The TCO of 2x Nvidia H100 shown in Table 4.2 shows a 48.0% reduction in cost compared to the TCO of 14x Intel Xeon 6455B shown in Table 4.3. It can be seen that while the GPUs themselves are more expensive, the costs of the rest of the computers and of electricity are considerably lower for a GPU system compared to CPU systems of equivalent capacity.

Cost item                          Year 1      Year 2      Year 3      3-year cost
Acquisition costs
  CPU lease                        €13 000     €13 000     €13 000     €39 000
  Rest of computer                 €18 700     €18 700     €18 700     €56 000
  Cooling system cost              €13 400                             €13 400
  Delivery and shipping costs      €1 400                              €1 400
  Installation costs               €1 000                              €1 000
Operational costs
  Electricity                      €11 300     €11 300     €11 300     €34 000
  Maintenance and repairs          Not included
  Labor costs                      Not included
  Facilities                       Not included
  Licensing costs                  Not included
End-of-life costs
  Disposal                         N/A
  Residual value                   N/A
Indirect costs
  Training costs                   Not included
  Downtime or lost productivity    Not included
  Compliance or regulatory costs   N/A
Sum                                                                    €144 800

Table 4.3: Total Cost of Ownership of 14x Intel Xeon 6455B (448 cores) over a 3-year period.

In the Rotor case, the relation between simulation speed and processing units differs, and the opportunity cost in the business case is therefore modified to account for this. 24x Intel Xeon 8375C CPUs totaling 768 cores are estimated to represent equivalent simulation capacity to 2x H100 GPUs. The TCO of 2x Nvidia H100 shown in Table 4.2 shows a 66.5% reduction in cost compared to the TCO of 24x Intel Xeon 8375C shown in Table 4.4.
Cost item                          Year 1      Year 2      Year 3      3-year cost
Acquisition costs
  CPU lease                        €13 200     €13 200     €13 200     €39 600
  Rest of computer                 €32 000     €32 000     €32 000     €96 000
  Cooling system cost              €24 300                             €24 300
  Delivery and shipping costs      €2 400                              €2 400
  Installation costs               €1 000                              €1 000
Operational costs
  Electricity                      €20 600     €20 600     €20 600     €61 700
  Maintenance and repairs          Not included
  Labor costs                      Not included
  Facilities                       Not included
  Licensing costs                  Not included
End-of-life costs
  Disposal                         N/A
  Residual value                   N/A
Indirect costs
  Training costs                   Not included
  Downtime or lost productivity    Not included
  Compliance or regulatory costs   N/A
Sum                                                                    €225 000

Table 4.4: Total Cost of Ownership of 24x Intel Xeon 8375C (768 cores) over a 3-year period.

Below, in Figures 4.16, 4.17 and 4.18, the cost of using a local cluster solution compared to using a cloud service over a three-year period is shown. The cloud cost is based on a fixed sum for each hour used for the specific hardware of each case on Rescale. The hardware setups match in terms of simulation capacity, as calculated in Section 4.4.1.

The cost for the local cluster solution is based on the calculations from Section 4.4.1. The only running cost is the electricity price. For each of the hardware configurations, the break-even point is shown in the respective graph. This point marks where it becomes cheaper to have a local cluster compared to using an available online cloud solution. The x-axes range from 0 to 26,280 hours, which is equivalent to three years. A small sketch of the underlying break-even calculation is given after Figure 4.18.

Figure 4.16: Local vs cloud cost over a 3-year period for one 2x H100 machine (break-even at 3566 h).

Figure 4.17: Local vs cloud cost over a 3-year period for seven 2x 6455B machines (break-even at 1403 h).

Figure 4.18: Local vs cloud cost over a 3-year period for twelve 2x 8375C machines (break-even at 1278 h).
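The break-even points in Figures 4.16-4.18 follow from a simple linear cost model: the local cluster carries a fixed up-front cost plus an hourly electricity cost, while the cloud carries only an hourly rate. The sketch below is an editorial illustration of that model; the example values (a fixed cost of roughly €68 000, a 3 kW power draw, and a cloud rate of about €19.5 per hour) are rough readings and back-calculations around the 2x H100 configuration, not figures stated in the thesis.

```python
def break_even_hours(fixed_local_eur, cloud_rate_eur_per_h, power_kw,
                     electricity_eur_per_kwh=0.087):
    """Hours of use after which the local cluster becomes cheaper than the cloud.

    Local cost(t) = fixed_local_eur + power_kw * electricity_eur_per_kwh * t
    Cloud cost(t) = cloud_rate_eur_per_h * t
    """
    hourly_local = power_kw * electricity_eur_per_kwh   # electricity is the only running cost
    return fixed_local_eur / (cloud_rate_eur_per_h - hourly_local)

# Illustrative 2x H100 example (assumed values, see the text above):
hours = break_even_hours(fixed_local_eur=68_000, cloud_rate_eur_per_h=19.5, power_kw=3.0)
print(f"Break-even after roughly {hours:.0f} h of use")   # on the order of Figure 4.16
```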
4.4.2 Benefits

The benefits of CPUs and GPUs for CFD simulations are compared in Table 4.5. How these benefits should be weighted is quite subjective and use-case-dependent; therefore, no weights are applied. There is also some overlap between some of the benefits.

                                          CPU   GPU
Direct benefits
  Cost per simulation                      -     +
Indirect benefits
  Time per simulation                      -     +
  Supported physics                        +     -
  Power consumption                        -     +
Intangible benefits
  Turnaround time                          -     +
  Number of design iterations possible     -     +
  Simulation capacity                      -     +
  Environmental impact                     -     +
  Ease of implementation                   +     -
  Flexible scaling                         +     -
  Potential scalability                    -     +
Competitive benefits
  Delivery time of projects                -     +
  Product optimization                     -     +

Table 4.5: Pugh matrix of benefits for CPU system vs. GPU system investment.

4.4.3 System power consumption

The power consumption of the different systems outlined in Section 4.4.1 is shown in Figure 4.19. The 2x H100 system shows a 79.9% reduction in power consumption compared to the 14x 6455B system, and an 88.9% reduction in power consumption compared to the 24x 8375C system.

Figure 4.19: System power consumption (14x Intel 6455B (448 cores) 14,900 W, 24x Intel 8375C (768 cores) 27,000 W, 2x Nvidia H100 3,000 W).

Chapter 5

Discussion

Of the three cases that are representative of typical aerospace applications, the TRS case and the Rotor case were possible to run on the Ansys Fluent GPU solver, while the Nozzle case was not. In addition, the TRS case required multiple simplifications, and the Rotor case had to be ported from CFX to Fluent. The rather complicated process of porting the cases requires engineering time and necessitates testing and validation to ensure simulation accuracy. However, after this hurdle was overcome, the GPU solver proved its worth. Of the parameters tested (simulation time, energy consumption, convergence, and cost), all were consistently better on the GPU solver with the hardware specified in Section 3.1.1 for both the TRS case and the Rotor case. Section 4.1.4 shows that even though the TRS case required multiple simplifications to run on the GPU solver, it still provided accurate results. While the GPU solver does not yet seem mature enough to reliably handle combustion problems, it seems ready to be implemented on real-world turbomachinery problems today.

Purchasing GPU systems is estimated to be cheaper than purchasing CPU systems of equivalent simulation capacity. To reach these numbers, many assumptions and price estimations had to be made, all of which are accounted for in Sections 3.4.1 and 3.4.2. Applying these results to other businesses is possible as well, but as local prices and existing infrastructure vary from business to business, and as the equivalent simulation capacity may differ on a simulation-by-simulation basis, the TCOs should be recalculated for every business application. The calculations provided in this study can then serve as a blueprint for what to take into account.

Correctly identifying simulation cases that benefit from being run on the GPU solver is important for making sound purchasing decisions regarding GPUs, and for utilizing purchased GPUs to their full extent. In general, the bulk of the calculations done in the simulation should go toward solving the fluid flow rather than, for instance, chemistry problems. Simulations should take more than a couple of minutes to run for the speedups to be significant, as the GPU solver does not speed up the start and shutdown time, t_sas. It should also be taken into account that meshing would be done on the CPU; therefore, the total time increases significantly if the case needs to be re-meshed between every run. The solver matrix size needs to be sufficiently large to take advantage of multiple GPUs, at minimum around 2 million cells per GPU. On the other hand, the entire solver matrix needs to fit in the VRAM of the GPUs. This creates a size window that the solver matrix needs to fit in. Finally, the case setup should be compared against the supported features of the GPU solver, found in Appendix A.
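The "size window" described above can be phrased as a simple feasibility check: enough cells per GPU to keep the hardware busy, but few enough that the solver matrix fits in VRAM. The sketch below is an editorial illustration; the lower bound of 2 million cells per GPU comes from the discussion above, the 94 GB refers to the memory of an H100 NVL card, and the memory-per-cell figure is a rough assumption that should be replaced with values measured for the actual case, physics models, and precision.

```python
def fits_gpu_window(n_cells, n_gpus, vram_gb_per_gpu,
                    min_cells_per_gpu=2e6, kb_per_cell=3.0):
    """Rough check of whether a case sits inside the GPU 'size window'.

    min_cells_per_gpu : lower bound for good GPU utilisation (from the discussion above)
    kb_per_cell       : assumed solver memory footprint per cell (rough placeholder)
    """
    enough_work = n_cells / n_gpus >= min_cells_per_gpu
    required_gb = n_cells * kb_per_cell / 1e6          # kB -> GB
    fits_in_vram = required_gb <= n_gpus * vram_gb_per_gpu
    return enough_work and fits_in_vram

# Example: the 32.75 M cell Rotor case on one versus two 94 GB cards.
print(fits_gpu_window(32.75e6, 1, 94))   # False with these assumptions
print(fits_gpu_window(32.75e6, 2, 94))   # True with these assumptions
```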
Figure 5.1 shows a decision tree model for how to approach GPUs for CFD simulations as a series of business decisions over three adoption phases. After identifying simulation cases that can benefit from being run on the GPU solver, the next step is to estimate the GPU simulation capacity required for the next 3 years. If the simulation capacity required exceeds 4000 h on 2x H100, which is the break-even between 2x H100 on a local cluster and the Rescale cluster as shown in Figure 4.16, it makes economic sense to buy 2x H100 cards. The performance of and demand for GPU simulations should then be continuously evaluated on a yearly basis. This allows for further GPU purchases if the demand is high, while minimizing sunk costs if the demand decreases. While implementing GPU simulations in the early adoption phase may seem like a lot of work for the reduced cost, time, and power consumption on a limited number of simulation cases, it positions the company better for a future, larger investment in GPUs, allowing for accuracy testing and exploration of higher-fidelity simulations at an earlier stage than the competition, and preparing the company to scale up faster when the solver is deemed mature enough. This goes hand in hand with the continuous updates of the GPU solver from Ansys making the solver more mature, as well as the fast development of GPU performance.

Figure 5.1: Strategic implementation decision tree, covering an exploration phase, an early adoption phase, and a yearly iterative phase.

The speedups from running CFD simulations on GPUs can be leveraged in different ways to add maximum value to the engineering process. They can, for example, be used to allow an engineer to test more design iterations of a component in the design phase, or to validate how multiple components work together in a shorter time frame. They can also be used to increase simulation fidelity, as intensive simulations that would normally take an unreasonably long time to complete on CPUs can be completed in a reasonable time frame on GPUs. An example is turbulence modeling, where Large Eddy Simulations (LES) are more computationally expensive, but can be used to model turbulence with higher accuracy than RANS simulations in some cases.

5.1 Conclusion

This study shows that the Fluent GPU solver outperforms the CPU solver with regard to time, cost, power consumption, and convergence. Furthermore, it provides accurate results for the tested cases. It does, however, lack some features that are commonly used in aerospace applications, and the GPU solver cannot replace the CPU solver completely in its current state.

The key findings are that 2x H100 cards are equivalent to 450-750 CPU cores in simulation capacity per iteration, that the GPU solver converges all residuals to 1e-04 in 27-73% fewer iterations, that the energy consumption per iteration is reduced by 88-94%, that the cost of running simulations on the Rescale cloud is reduced by 83-91%, and that the total cost of ownership to upgrade a local cluster with a GPU system instead of a CPU system of equivalent simulation capacity is reduced by 48-67%. The Fluent native GPU solver is shown to be mature enough that it is possible to move into the early adoption phase of CFD on GPUs, either through cloud simulations or through a smaller-scale GPU purchase accounting for a minority of the total local cluster simulation capacity. In conclusion, the GPU solver should be used in all cases that it supports.
5.2 Further research

Further research on this topic could focus on expanding the range of cases on which the GPU solver has been benchmarked and incorporating additional physics models and solver settings. LES simulations, for instance, could benefit from running on the GPU solver. It could also be worthwhile to dig further into how well the GPU and CPU solvers match each other in terms of simulation accuracy, and how accurately cases originally modeled in CFX can be modeled with the Fluent GPU solver. The Fluent GPU solver could also be compared against other commercial GPU solvers. Finally, the impact of the GPU solver speedups on the engineering process could be evaluated, together with how they can be leveraged to add maximum value to a business.

Bibliography

[1] J. Hess and A. Smith, "Calculation of potential flow about arbitrary bodies," Progress in Aerospace Sciences, vol. 8, pp. 1–138, 1967.
[2] W. Slagter. (2025) Ansys speeds up Volvo EX90 aerodynamics simulations with Nvidia GPU-accelerated CFD. ANSYS Inc. [Online]. Available: https://www.ansys.com/blog/ansys-speeds-up-volvo-ex90-aerodynamics-simulations
[3] N. Alarcon. (2019) Volkswagen accelerates aerodynamics concept design with Nvidia V100 GPUs on AWS. NVIDIA Corporation. [Online]. Available: https://developer.nvidia.com/blog/volkswagen-accelerates-aerodynamics-concept-design-with-nvidia-v100-gpus-on-aws/
[4] C. Porter and N. Krishnamoorthy. (2022) The computational fluid dynamics revolution driven by GPU acceleration. NVIDIA Corporation. [Online]. Available: https://developer.nvidia.com/blog/computational-fluid-dynamics-revolution-driven-by-gpu-acceleration/
[5] EDR & Medeso. [Online]. Available: https://edrmedeso.com/
[6] EDR & Medeso AB. Allabolag.se. [Online]. Available: https://www.allabolag.se/foretag/edr-medeso-ab/v%C3%A4ster%C3%A5s/datorer-kringutrustningar/2K3GXCDI5YCUH
[7] (2025) GKN Sweden, om oss. GKN Aerospace Sweden. [Online]. Available: https://www.gknaerospace.com/se//
[8] (2022) Unleashing the full power of GPUs for Ansys Fluent, part 1. ANSYS Inc. [Online]. Available: https://www.ansys.com/blog/unleashing-the-full-power-of-gpus-for-ansys-fluent/
[9] Z. Cooper-Baldock, B. Vara Almirall, and K. Inthavong, "Speed, power and cost implications for GPU acceleration of computational fluid dynamics on HPC systems," Supercomputing Asia, 2024.
[10] "Ansys Fluent native multi-GPU solver: CFD validation studies in version 23R2," ANSYS Inc., Tech. Rep., 2023.
[11] "Speed and accuracy: First-of-its-kind broad-spectrum CFD solver built natively on GPUs," ANSYS Inc., Tech. Rep., 2022.
[12] (2024) What is Reynolds number? SimScale. [Online]. Available: https://www.simscale.com/docs/simwiki/numerics-background/what-is-the-reynolds-number/
[13] (2021) Mach number. NASA. [Online]. Available: https://www.grc.nasa.gov/www/k-12/airplane/mach.html
[14] H. K. Versteeg and W. Malalasekera, An Introduction to Computational Fluid Dynamics. Pearson Education Limited, 2007.
[15] H. Jasak, "Error analysis and estimation for the finite volume method with applications to fluid flows," Ph.D. dissertation, Imperial College London, 1996.
[16] Ansys Fluent Theory Guide, Ansys Inc., 2025.
[17] T.-R. Teschner. (2024) How to write a CFD library: The sparse matrix class. cfd.university. [Online]. Available: https://cfd.university/learn/how-to-compile-write-and-use-cfd-libraries-in-c/how-to-write-a-cfd-library-the-sparse-matrix-class/
[18] F. R. Menter, "Improved two-equation k-omega turbulence models for aerodynamic flows," NASA Technical Memorandum, 1992.
[19] W. Kahan, "IEEE Standard 754 for binary floating-point arithmetic," 1997. [Online]. Available: https://people.eecs.berkeley.edu/~wkahan/ieee754status/IEEE754.PDF
[20] Add or subtract floating point num