Creating a Microbenchmark with Wide Coverage of Memory-boundedness

Investigating memory-boundedness by creating a microbenchmark to characterize DVFS behaviour

Niklas Côté

Master's Thesis 2022
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
Gothenburg, Sweden 2022

© Niklas Côté, 2022.

Supervisor: Bhavishya Goel, Department of Computer Science and Engineering
Examiner: Miquel Pericas, Department of Computer Science and Engineering

Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX
Gothenburg, Sweden 2022

Abstract

The memory-boundedness of an application is the degree to which the performance of the application depends on the size and performance of memory instead of the CPU. The degree of memory-boundedness of an application determines its speedup when the CPU frequency is increased: an application with no memory-boundedness exhibits linear speedup as the frequency increases, while an application with 100% memory-boundedness exhibits no speedup at all. Dynamic voltage and frequency scaling (DVFS) is a power-saving technique which aims to save energy by dynamically reducing the frequency during the memory-bound phases of an application. The DVFS decision making is based on prediction models which predict the appropriate voltage and frequency for the current application phase. To increase the accuracy of prediction models, training data needs to be collected from applications which exhibit varying degrees of compute-bound and memory-bound behavior. A single microbenchmark which can simulate wide variations of memory-boundedness behaviour could reduce the time required to train the prediction models and improve prediction accuracy.

This thesis analyzes different measurement methods for memory-boundedness and proposes a new formula based on L3 cache misses, which is believed to better fit the definition of memory-boundedness. The formulas were tested on a benchmark suite from NASA called the NAS Parallel Benchmarks (NPB), where the consistency and values of the formulas were evaluated. The measurements with the new L3 cache miss formula were then used to create a microbenchmark which can produce large variations of memory-boundedness.

Keywords: DVFS, memory-boundedness, microbenchmark, benchmark, voltage, frequency, prediction models, static voltage, dynamic voltage

Acknowledgements

Most of all I would like to thank my supervisor Bhavishya Goel, for being a great teacher and mentor for this very interesting project and for always believing in me and motivating me to learn more.
I also want to thank everyone involved in the presentation and writing process of this project, Miquel Pericas, Marcus Mathiason and Edvin Leidö, for helping me clarify and further improve the thesis. Finally I would also like to thank my partner, my family and my friends for always being there for me and pushing me to be my very best.

Niklas Côté, Gothenburg, July 2022

Contents

List of Figures
List of Tables

1 Introduction
  1.1 Background
  1.2 Problem statement
  1.3 Limitations

2 Theory
  2.1 Frequency
  2.2 Voltage
  2.3 Static and Dynamic voltage
    2.3.1 Static voltage
    2.3.2 Dynamic voltage
  2.4 Dynamic voltage and frequency scaling
  2.5 Defining Memory-boundedness
  2.6 NAS benchmark suite
  2.7 Creating a Microbenchmark

3 Related work
  3.1 Measuring Memory-boundedness
  3.2 Prediction models
    3.2.1 Stall based
    3.2.2 Leading Loads
    3.2.3 CRIT

4 Method
  4.1 Measuring Memory-boundedness
  4.2 Microbenchmark
    4.2.1 Implementation and testing the microbenchmark

5 Results
  5.1 NAS benchmark suite
  5.2 Microbenchmark
  5.3 Comparison of NAS and Microbenchmark
    5.3.1 Memory-boundedness
    5.3.2 Energy and Execution Time

6 Conclusion
  6.1 Method
  6.2 Results
  6.3 Future work

Bibliography

List of Figures

2.1 Shows how the compute-bound portion scales linearly with frequency and how the memory-bound portion does not scale with frequency.
3.1 Stall based prediction model assumptions; shows how the CPU still stalls even if work is ready to be done on the CPU.
3.2 Leading loads prediction model assumptions; shows how the CPU benefits from being able to do work when ready, compared to 3.1.
4.1 Basic illustration of the pointer-chasing technique.
5.1 Memory-boundedness coverage for frequencies 0.8 GHz, 2.6 GHz and 4.4 GHz on 1 thread for the benchmarks of the NAS benchmark suite and the microbenchmark. The y-axis is memory-boundedness and the x-axis is the CPU frequency.
5.2 Memory-boundedness coverage for frequencies 0.8 GHz, 2.6 GHz and 4.4 GHz on 2 threads for the benchmarks of the NAS benchmark suite and the microbenchmark. The y-axis is memory-boundedness and the x-axis is the CPU frequency.
5.3 Memory-boundedness coverage for frequencies 0.8 GHz, 2.6 GHz and 4.4 GHz on 4 threads for the benchmarks of the NAS benchmark suite and the microbenchmark. The y-axis is memory-boundedness and the x-axis is the CPU frequency.
5.4 Normalized energy usage and normalized execution time for frequencies 0.8 GHz, 2 GHz, 3 GHz and 4.4 GHz on NAS benchmark EP and the 0% memory-boundedness setting of the microbenchmark.
5.5 Normalized energy usage and normalized execution time for frequencies 0.8 GHz, 2 GHz, 3 GHz and 4.4 GHz on NAS benchmark MG and the 25% memory-boundedness setting of the microbenchmark.
5.6 Normalized energy usage and normalized execution time for frequencies 0.8 GHz, 2 GHz, 3 GHz and 4.4 GHz on NAS benchmark CG and the 45% memory-boundedness setting of the microbenchmark.

List of Tables

4.1 Embarrassingly Parallel benchmark (EP): high intensity of computational instructions, memory accesses close to none. Expected result around 0% memory-boundedness. Negative values and values above 100% are considered errors, since such occurrences are not possible (*).
4.2 Multi-Grid benchmark (MG): high intensity of memory misses and accesses. Expected to increase when increasing frequency and number of threads. Negative values and values above 100% are considered errors, since such occurrences are not possible (*).

1 Introduction

This chapter introduces the project background, the problem statement and the project limitations.

1.1 Background

Dynamic voltage and frequency scaling (DVFS) is the mechanism of tuning the CPU frequency to match the needs of the ongoing execution and tuning the voltage to fit the CPU frequency. This has the potential to decrease the energy consumption of systems while maintaining performance within acceptable levels as defined by QoS (Quality of Service). Most modern microprocessors have the ability to harness DVFS, which can be used to optimize the amount of energy required to run a particular application. In order to maximize the efficacy of DVFS, the optimal voltage-frequency pair has to be chosen in real time. Many scheduling algorithms that employ DVFS make use of models that predict the impact of changing the frequency and voltage on performance and energy consumption [3, 6, 11, 9]. Application characteristics like memory-boundedness and concurrency decide which voltage-frequency pair is optimal [2] for energy efficiency. Creating effective DVFS prediction models requires the availability of benchmarks with wide-ranging memory-boundedness and concurrency behavior, either to generate the training set or to test the accuracy of the models. A systematic characterization of the impact of voltage and frequency changes on performance and energy consumption over a wide range of application characteristics can be very useful to generate training and/or test data. As part of this thesis, we have developed a parameterizable microbenchmark which can be finely tuned to mimic benchmarks with different memory-boundedness and concurrency behavior.
This microbenchmark can be used to collect information about the impact of voltage-frequency changes on performance and energy consumption across the entire DVFS configuration space of the platform.

The purpose of this microbenchmark is to create a deeper understanding of DVFS behavior and characteristics depending on the memory-boundedness of applications. This knowledge can be used to create more accurate prediction models in the future. These prediction models rely on already available benchmarks with varying degrees of memory-boundedness behavior. The problem is that finding the right set of benchmarks that represent a wide range of memory-boundedness can be fairly difficult. As per the data collected by the computer architecture research team at Chalmers, 40 different benchmarks from the NAS benchmark suite [4] were required to collect data depicting wide enough memory-boundedness behavior for training the DVFS prediction model. Apart from being time consuming, it is also difficult to debug the prediction model using such training data, since the memory-boundedness cannot be changed orthogonally to the rest of the benchmark behavior. A microbenchmark that can be finely tuned to exhibit a desired degree of memory-boundedness would be a great tool for a systematic and comprehensive study of the DVFS characteristics of the platform, so that more accurate DVFS prediction models can be created. To the best of our knowledge, such a microbenchmark to explore the DVFS configuration space does not exist.

1.2 Problem statement

Currently there are no tools to exclusively express different memory-boundedness behaviours; one would have to use a number of different benchmarks to do so. Different benchmarks can vary in features other than solely memory-boundedness when compared, and are therefore not ideal in our case. Large variations between benchmarks make it difficult for researchers to compare and evaluate memory-boundedness behaviour. The purpose of this thesis is to solve this issue by creating a microbenchmark that can be tuned for various memory-boundedness behaviours without changing larger parts of the application. Such a tool can lead to a deeper understanding of DVFS behavior and help prediction models increase their accuracy over application phases that exhibit varying degrees of memory-boundedness behaviour.

1.3 Limitations

For the purpose of this master's thesis, we have limited our experiments and data collection to a single platform, although the microbenchmark and the methodology should be portable to other platforms. The tests were conducted on a limited subset of available settings to reduce the experimentation run time. The scope of this thesis was limited to creating the microbenchmark and comparing its characteristics with those of the NAS benchmarks. The training and/or testing of existing DVFS prediction models using the created microbenchmark can be done as part of future work.

2 Theory

This chapter contains an overview of DVFS, frequency and voltage. Techniques to measure memory-boundedness and to create a microbenchmark are also presented. Lastly, the reasoning behind and explanation of the chosen methodology for creating the microbenchmark is covered.

2.1 Frequency

The frequency of a CPU is the speed of its clock, which determines how quickly instructions are processed; a higher frequency therefore leads to higher CPU performance.
CPU frequency has long been used as a measurement to compare CPUs, but in the age of multi-core CPU systems it is no longer the only measurement to compare by. Most CPUs today have similar maximum, minimum and average frequencies, but differ greatly in the number of instructions they execute per cycle (IPC). The number of instructions which can be executed per cycle has been increased in modern CPUs by utilizing computer architectures such as multi-core and many-core designs, together with various memory scheduling protocols for those architectures. Exploiting such architectural techniques has become a more beneficial way to increase CPU performance than increasing the frequency. The clock frequency is limited by the speed at which transistors can transfer signals, and the transistors are in turn limited by the capacitance of the material used to create them. Since the capacitance of transistors has not improved as fast as the frequency has increased, increasing the frequency further would lead to higher CPU temperatures and inefficient energy usage.

Efficient frequency utilization plays a large role in decreasing the energy consumption of all computational units. If the frequency is increased while instructions are not ready to be processed, the increase in frequency is not utilized efficiently and only leads to larger energy consumption and a higher CPU temperature. Changing the frequency to fit the needs of the execution can lead to large energy savings and an increased life-span of computational components.

2.2 Voltage

The voltage in a CPU is provided by the power source and supplies the transistors with the power they need to transfer signals. As the voltage increases, the transistors are able to send signals faster due to stronger electric fields. The voltage of a CPU has limits, since increasing the voltage leads to higher power leakage, which increases the temperature of the CPU. When the temperature of a CPU increases too much, the efficiency and durability of the transistors start to be affected negatively, resulting in system failures and slower execution times.

Voltage provides power to the CPU and is a requirement for computing instructions. To achieve efficient power savings the voltage needs to be adjusted to the frequency of the CPU. When the frequency is lowered, the voltage should be decreased as well to achieve larger power savings, since the transistors do not need to send signals as fast. The power usage is given by a well-known formula which combines a dynamic portion and a static portion, with the capacitance, voltage, static power leakage and frequency of the CPU as variables. Equation 2.3 shows that the power usage decreases as the voltage and frequency are lowered. However, the energy efficiency may still decrease, since more total energy may be required to compute the same amount of instructions when the execution takes longer. To achieve efficient energy usage, the voltage and frequency need to be balanced against performance and execution time.

P_{static} = I_{leakage} \cdot V    (2.1)

P_{dynamic} = \alpha C V^2 f    (2.2)

P_{total} = \alpha C V^2 f + I_{leakage} \cdot V    (2.3)

2.3 Static and Dynamic voltage

The voltage can be static or dynamic, but that does not change the effects of increasing or decreasing the voltage. Static voltage means that the voltage does not change during execution or while the CPU is idle. Dynamic voltage means that the voltage can change depending on the workload of the CPU and the frequency, to decrease energy consumption and maintain high performance.
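As a brief numeric illustration of the static and dynamic terms in equations 2.1-2.3 (the numbers are invented for illustration, not measurements from this thesis): scaling a core from 1.2 V and 4 GHz down to 0.8 V and 2 GHz reduces the dynamic term by

\frac{P'_{dynamic}}{P_{dynamic}} = \left(\frac{0.8}{1.2}\right)^2 \cdot \frac{2}{4} \approx 0.22,

roughly a factor of 4.5, while the static term in equation 2.1 only falls linearly with the voltage (to 0.8/1.2 ≈ 0.67 of its previous value). This is why lowering the voltage together with the frequency saves considerably more power than lowering the frequency alone.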
2.3.1 Static voltage

Static voltage means that the voltage stays at a certain level for the entire execution, even if the CPU is idle. The energy usage under static voltage is determined by the voltage and the power leakage, which is affected by temperature and transistor voltage. To limit the energy consumption when using static voltage, the leakage current, voltage and temperature of the CPU need to be limited (equation 2.1).

2.3.2 Dynamic voltage

Dynamic voltage means that the supply voltage to the CPU is switched during operation, preferably to fit the workload and frequency, which leads to energy savings by not consuming unnecessary resources. The energy consumption under dynamic voltage is determined by the capacitance of the transistors in the CPU, the voltage and the frequency. To limit energy consumption using dynamic voltage, the voltage and frequency need to be limited (equation 2.2).

2.4 Dynamic voltage and frequency scaling

There are many techniques for adjusting the CPU frequency and voltage to fit the needs of the system. Techniques such as over-clocking, under-clocking and DVFS are used to fit the user or application requirements on performance and energy. One large benefit of DVFS compared to over- and under-clocking is that DVFS is dynamic rather than static during run-time, making it able to choose frequency and voltage depending on the needs of the execution and to allocate only the necessary resources. DVFS operates by decreasing the CPU frequency and voltage when the CPU is idle or has a low workload, and increasing them when there is a higher workload, thereby utilizing energy, performance and resources more efficiently.

The DVFS decision making is controlled in the operating system by using DVFS prediction models. Prediction models can be specific to certain types of systems and application characteristics. To achieve higher accuracy of the DVFS decisions, the operating system needs to be able to predict and schedule changes of frequency and voltage. The scheduling and prediction of DVFS is a widely researched area. The reason for this is the large variety of tasks and systems DVFS can be optimized for. Prediction and scheduling models are often more accurate for certain systems and workloads, such as memory-bound or compute-bound workloads. DVFS changes the CPU frequency and voltage to save energy, but it does not change the frequency of the memory. The memory-boundedness of applications therefore complicates the extent of the usefulness and accuracy of DVFS scheduling.

2.5 Defining Memory-boundedness

Memory-boundedness is a characteristic which can be evaluated and measured using various microbenchmarks and benchmarks. Memory-boundedness has been measured differently in various papers, such as [2], [8] and [9], but essentially memory-boundedness is defined as the degree to which the work is stalled by memory. Each execution spends a portion of its workload in memory and a portion in the CPU. What is special about the memory-bound part is that when the CPU frequency changes, the stall time in memory does not scale the way the CPU portion does. An increase in CPU frequency scales the execution time of the compute-bound part, but the execution time of the memory-bound part stays the same. The memory-bound portion thus becomes a bottleneck when trying to decrease execution time by increasing the CPU frequency, as can be seen in figure 2.1.
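To make this concrete, consider a hypothetical application (the numbers are invented for illustration) that runs for 10 s at 2 GHz, of which 6 s are compute-bound and 4 s are memory-bound stall time. Doubling the frequency to 4 GHz halves only the compute-bound portion:

T_{4\,GHz} = \frac{6\,s}{2} + 4\,s = 7\,s,

a speedup of 10/7 ≈ 1.43 instead of the factor 2 that a fully compute-bound application would achieve. Note also that the application is now 4/7 ≈ 57% memory-bound instead of 40%, so each further frequency increase is utilized less efficiently than the previous one.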
The calculations of memory-boundedness vary between papers, since they measure different aspects of memory-boundedness, but figure 2.1 shows the basic concept of why memory-boundedness affects the degree of efficient frequency utilization.

Figure 2.1: Shows how the compute-bound portion scales linearly with frequency and how the memory-bound portion does not scale with frequency.

2.6 NAS benchmark suite

The NAS Parallel Benchmarks (NPB) is a benchmark suite from NASA for testing high-performance systems and supercomputers. NPB is an open source platform consisting of various benchmarks with various system characteristics. NASA provides, updates and shares these benchmarks with researchers and the public to test the performance of many systems. The benchmark suite contains a variety of benchmarks with different characteristics and working-set sizes, and the benchmarks have the ability to run in parallel. The benchmarks simulate different types of workloads, which researchers use to investigate system behaviour and to optimize their systems for these types of applications.

The version used in this thesis is NPB 3.4.2 with the OpenMP programming model. This version contains the benchmarks IS, EP, CG, MG, FT, BT, BT-IO, SP, LU, UA, DC and DT, but only IS, EP, CG, MG, FT, SP, LU and UA are used in this thesis. The motivation for using only these is that the variation of memory-boundedness the other benchmarks exhibit is already covered by the selected benchmarks.

List of NAS benchmarks tested

• IS - Integer Sort, random memory access
• EP - Embarrassingly Parallel
• CG - Conjugate Gradient, irregular memory access and communication
• MG - Multi-Grid on a sequence of meshes, long- and short-distance communication, memory intensive
• FT - discrete 3D fast Fourier Transform, all-to-all communication
• SP - Scalar Penta-diagonal solver
• LU - Lower-Upper Gauss-Seidel solver
• UA - Unstructured Adaptive mesh, dynamic and irregular memory access

2.7 Creating a Microbenchmark

Benchmarks are programs created to simulate a workload in order to evaluate and measure the performance of a system. Microbenchmarks are benchmarks used to simulate a single, specific system characteristic, often at the lower hardware level. Microbenchmarks are widely used by researchers to gain a deeper understanding of system behaviour in hardware, since microbenchmarks can be configured. Performance counters are often used to measure the system behaviour when running a microbenchmark. Depending on the data collected by the performance counters, the microbenchmark can be optimized further to generate the wanted system behaviour.

The system behaviour of interest here is a wide range of memory-boundedness. To create a microbenchmark with a wide range of memory-boundedness, instructions that produce memory stalls and instructions that avoid memory stalls are required. The instructions used to generate memory stalls are instructions which generate L3 cache misses and L1, L2 or L3 cache hits; the instructions used to avoid memory stalls are simple additions and multiplications of values.

3 Related work

This chapter contains related work on how to measure memory-boundedness and on using the knowledge of memory-boundedness behaviour to train prediction models.

3.1 Measuring Memory-boundedness

Goel et al. [2] calculate the memory-boundedness of an application by dividing the execution into two parts, a compute-bound part and a memory-bound part.
The compute-bound part is the part whose execution time scales with the CPU frequency; the part which does not scale with the CPU frequency is the memory-bound part of the execution. The variables used to calculate the compute-bound and memory-bound parts of the execution are the frequencies and execution times at two different frequency settings, as can be seen in equation 3.1. The formula is based on theoretical assumptions, but is in practice difficult to measure accurately. The memory-boundedness at one frequency can vary depending on which other frequency is used in the calculation. For example, when calculating the memory-boundedness at 4 GHz, the result is different if 2 GHz or 3 GHz is used as the second frequency in the formula. According to the formula, the memory-boundedness of an application running at a given frequency should be the same no matter which second frequency is used, but in practice this is not the case, which makes the formula unreliable.

T_{F_2} = T_{F_1} \cdot \left( M + (1 - M) \cdot \frac{F_1}{F_2} \right)    (3.1)

M = the fraction of time during which the CPU does no work, stalling for memory
T_{F_x} = execution time at frequency F_x
F_x = frequency at setting x

Equation 3.1 shows how Goel et al. calculate the memory-bound part of an execution by scaling the frequency (from F_1 to F_2) and measuring the difference in execution time.

Spiliopoulos et al. [9] measure memory-boundedness as the total number of stalls divided by the number of cycles, as presented in equation 3.2. Their formula is overall simple to use; a downside, however, is that the total stalls may also include execution-based stalls, i.e. stalls that depend on the CPU and not on memory. In tests that were expected to give results of approximately 0% memory-boundedness (a very low amount of memory accesses), formula 3.2 reported the execution as being around 15% memory-bound even though very few memory accesses were measured.

\text{Memory-boundedness} = \frac{\text{Total\_Stalls}}{\text{Cycles}}    (3.2)

Molka et al. [8] measure memory-boundedness as the maximum of the stalls caused by L1 data-cache misses and the resource stalls, divided by the number of cycles. It is a consistent formula, but it gives an overestimation of the memory-boundedness, since the L1 cache scales with the CPU frequency, and memory-boundedness should by our definition not scale with the CPU frequency.

Stalls in Store Buffer: SB = RESOURCE\_STALLS{:}SB
L1 Data Cache Miss Stalls: L1D = CYCLE\_ACTIVITY{:}STALLS\_L1D\_PENDING

\text{Memory-boundedness} = \frac{\max(SB, L1D)}{\text{Cycles}}    (3.3)

With inspiration from the previous formulas and the definition used in prediction models, the decision was made to create a new formula. The formula should be easy to use, measure and understand. It should be accurate and not overestimate the degree of memory-boundedness. The formula is based on L3 cache misses: misses in the L3 cache cause long memory stalls which are highly significant to the execution, and L3-miss stall cycles are the most significant time-consuming memory activity. Memory-boundedness should be the part of the execution which does not scale with the frequency of the CPU. Assuming that the core and uncore frequencies are in sync, the L1, L2 and L3 caches are affected by changes of the CPU frequency, so stalls on accesses served by those caches are by definition not fully memory-bound. Misses in the L3 cache, on the other hand, are fully memory-bound, since each miss waits for main memory (DRAM), which is not affected by the CPU frequency.
The L3-miss performance counter should in theory give an accurate estimate of the memory-boundedness, since it counts the cycles during which the CPU stalls on L3 misses waiting for memory. The formula can be seen in equation 3.4.

\text{Memory-boundedness} = \frac{CYCLE\_ACTIVITY{:}STALLS\_L3\_MISS}{\text{Cycles}}    (3.4)

3.2 Prediction models

Prediction models are based on statistical data from running benchmarks with varying characteristics, and are used to predict the upcoming optimal voltage-frequency pair for a program with unknown characteristics. There are several ways to create prediction models and the topic is widely researched.

Figure 3.1: Stall based prediction model assumptions; shows how the CPU still stalls even if work is ready to be done on the CPU.

3.2.1 Stall based

The stall-based prediction model [1], [5] is a simple model, which makes it widely usable in various DVFS systems. It is based on performance counters and the assumption that when a memory stall occurs the CPU becomes fully idle. This approach leads to low accuracy in out-of-order pipelines, since other instructions could in practice still execute during a memory stall. Overlapping misses can also lead the model to assume an even longer stall time: if a previous miss returns and its result could be used for execution on the CPU, the model still counts the CPU as stalled for as long as another miss is outstanding. The strengths of this model, however, are its simplicity and its usability in DVFS systems.

3.2.2 Leading Loads

The leading-loads prediction model [1], [5], [10] was developed around 2010 as an improved version of the stall-based model. The difference compared to the stall-based model is the understanding of how out-of-order pipelines can exploit memory-level parallelism, so that the amount of stall time is not overestimated. Leading loads takes into account that several memory misses can occur in parallel and overlap in out-of-order pipelines. The leading load is the first memory stall in a burst of several memory misses; the stall time of the other misses is ignored. When the leading load finishes, the next memory stall becomes the new leading load. By understanding memory-level parallelism and the occurrence of overlapping misses, the leading-loads approach managed to improve the accuracy of DVFS prediction models in many different systems. Leading loads is also simple enough to be implemented on real hardware, just like the stall-based model, but with higher accuracy in DVFS prediction. The disadvantage of the leading-loads approach is the lack of tools to evaluate memory-system behaviours such as prefetching, memory locality and memory-scheduling policies.

Figure 3.2: Leading loads prediction model assumptions; shows how the CPU benefits from being able to do work when ready, compared to figure 3.1.

3.2.3 CRIT

The CRIT prediction model [7] expands further than leading loads on tools for evaluating the functionality of memory systems. CRIT uses an algorithm to determine the memory stall time of dependent misses, i.e. the misses that contain data necessary for the CPU to continue execution. This is a more accurate assumption than that of leading loads, since a leading load could be non-dependent, in which case the CPU could still execute and thereby benefit from an increase in frequency.

4 Method

This chapter covers how memory-boundedness is defined and the tools used for measuring it. Then the structure, implementation and testing of the microbenchmark are covered, as well as the tuning of its various settings.
4.1 Measuring Memory-boundedness

Memory-boundedness is defined as the part of the execution time which does not scale with the CPU frequency. Previous work [2], [9] has shown that the memory-bound fraction of an application can be approximated by measuring the CPU stalls caused by memory accesses.

Measuring the theoretical memory-boundedness requires isolating the CPU stall cycles caused specifically by memory accesses. However, in modern out-of-order processors, multiple stall events (branch mispredictions, RS/ROB stalls, cache misses, etc.) may happen in parallel with memory-access stalls, making it very challenging to isolate the stalls due to one specific type of event. We therefore started the testing and measurement of memory-boundedness with the formulas 3.1, 3.2 and 3.3 from the Theory chapter; formula 4.1 was created and added later on. To investigate which formula to use for producing the memory-boundedness coverage results for the NAS benchmark suite, the formulas were tested on the EP and MG benchmarks at the frequencies 800 MHz, 2 GHz, 3 GHz and 4.4 GHz, and with 1, 2 and 4 threads; the results are shown in tables 4.1 and 4.2.

The results in the tables are a comparison between a benchmark with almost no memory accesses and a benchmark with a high amount of memory accesses. The formulas were evaluated on their behaviour and consistency. The behaviour of the formulas was similar when increasing the frequency and the number of threads. Of these formulas, 3.1 varied the most and was not consistent or simple enough in our case. Formula 3.2 produced a strange value for the EP benchmark: it reported around 15% memory-boundedness for a benchmark which should be close to 0%, since the execution time of the EP benchmark scaled almost perfectly with the CPU frequency. This result can only be explained by the performance counter STALLS_TOTAL also including some sort of execution stalls in its count; since those are affected by changing the CPU frequency, this is not an accurate measurement of memory-boundedness. Formula 3.3 looks more promising because of its simplicity to measure and evaluate, and also because of its consistency.

Table 4.1: Embarrassingly Parallel benchmark (EP): high intensity of computational instructions, memory accesses close to none. Expected result around 0% memory-boundedness. Negative values and values above 100% are considered errors, since such occurrences are not possible (*).

1 thread               0.8 GHz    2 GHz      3 GHz      4.4 GHz
Goel formula           -0.60%*    -0.20%*    0.10%      -0.30%*
Spiliopoulos formula   15.10%     14.90%     14.90%     14.90%
Molka formula          0.20%      0.20%      0.20%      0.10%
L3-stall formula       0.00%      0.00%      0.00%      0.00%

2 threads              0.8 GHz    2 GHz      3 GHz      4.4 GHz
Goel formula           -0.90%*    -0.40%*    0.80%      -1.10%*
Spiliopoulos formula   15.10%     15.00%     15.00%     14.90%
Molka formula          0.20%      0.20%      0.20%      0.10%
L3-stall formula       0.00%      0.00%      0.00%      0.00%

4 threads              0.8 GHz    2 GHz      3 GHz      4.4 GHz
Goel formula           -0.10%*    0.00%      -0.10%*    0.20%
Spiliopoulos formula   15.00%     15.00%     15.00%     15.10%
Molka formula          0.20%      0.20%      0.10%      0.10%
L3-stall formula       0.00%      0.00%      0.00%      0.00%

However, the counters in formula 3.3 produce an overestimation of how many instructions are memory-bound, since they include stalls which are affected by the frequency of the CPU. The later-created formula 4.1 instead uses a newly added Intel-specific performance counter, CYCLE_ACTIVITY:STALLS_L3_MISS, to calculate the total L3-miss stall cycles of all cores.
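As a concrete example of applying this L3-stall formula (the counter values are hypothetical, chosen only to illustrate the arithmetic): if a run reports CYCLE_ACTIVITY:STALLS_L3_MISS = 1.2 \cdot 10^9 over a total of 4.8 \cdot 10^9 cycles, then

\text{Memory-boundedness} = \frac{1.2 \cdot 10^9}{4.8 \cdot 10^9} = 25\%.

On Linux, both counters can typically be collected with the perf tool; on recent Intel microarchitectures the stall counter is commonly exposed under a name such as cycle_activity.stalls_l3_miss, though the exact event name varies between processor generations.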
When the L3 cache misses, the execution stalls on memory, which in theory does not scale with the CPU frequency, unlike L1, L2 and L3 cache accesses. The L3-miss formula 4.1 is based on the assumption that memory-boundedness is the part of the execution which does not scale with the CPU frequency, and it is the most consistent and simplest of the formulas. Formula 4.1 is therefore the one we used to produce the memory-boundedness results for the NAS benchmarks and, later on, for the microbenchmark.

4.2 Microbenchmark

The microbenchmark is created to cover a wide range of memory-boundedness as measured with formula 4.1. To measure memory-boundedness, the performance counters CYCLE_ACTIVITY:STALLS_L3_MISS and cycles were used (equation 4.1). To measure the energy consumption, in order to study the effects of the memory-boundedness of applications, Intel's Running Average Power Limit (RAPL) interface was used in combination with the performance counters. RAPL is a power modeling and measuring interface focused on energy consumption, and it is an accessible and powerful tool for creating an overview of the energy characteristics of an execution.

\text{Memory-boundedness} = \frac{CYCLE\_ACTIVITY{:}STALLS\_L3\_MISS}{\text{Cycles}}    (4.1)

Table 4.2: Multi-Grid benchmark (MG): high intensity of memory misses and accesses. The memory-boundedness is expected to increase when increasing the frequency and the number of threads. Negative values and values above 100% are considered errors, since such occurrences are not possible (*).

1 thread               0.8 GHz    2 GHz      3 GHz      4.4 GHz
Goel formula           26.50%     30.30%     44.60%     54.10%
Spiliopoulos formula   24.30%     42.60%     49.30%     56.70%
Molka formula          20.60%     39.70%     46.80%     54.50%
L3-stall formula       10.70%     19.50%     21.90%     24.90%

2 threads              0.8 GHz    2 GHz      3 GHz      4.4 GHz
Goel formula           6.30%      103.60%*   56.70%     65.80%
Spiliopoulos formula   30.60%     40.60%     55.30%     63.30%
Molka formula          26.80%     31.80%     52.90%     61.20%
L3-stall formula       17.50%     18.60%     32.80%     37.80%

4 threads              0.8 GHz    2 GHz      3 GHz      4.4 GHz
Goel formula           25.40%     45.60%     98.70%     99.10%
Spiliopoulos formula   32.70%     51.30%     59.90%     70.10%
Molka formula          28.70%     43.60%     53.50%     67.40%
L3-stall formula       19.70%     28.60%     34.30%     42.90%

The microbenchmark has various settings of memory-boundedness. To create these settings it is important to understand what generates the highest and the lowest amount of memory-boundedness, so that the remaining settings can be tuned between them. The lowest amount of memory-boundedness was created by looping over simple additions and multiplications of a variable. The highest amount of memory-boundedness was achieved using a technique called pointer-chasing: the first address of each cache block is accessed through a pointer, and the value stored there is the address of the next cache block to access. Pointer-chasing is used to generate L3 misses by exploiting knowledge of the cache-block size (64 bytes). Each cache block is accessed only once, so that each load instruction creates an L3 cache miss. Prefetching can be left turned on and misses will still occur, since the address for each miss is stored in the value accessed in the previous block; because only one block can be accessed at a time, misses will occur.

The other settings were created by looking at the generated memory-boundedness values and then investigating which percentages were missing in order to cover the widest possible range of memory-boundedness behaviour. Each setting between the lowest and the highest was created using a mixture of misses, hits and arithmetic instructions such as additions or multiplications, as in the sketch below.
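The following is a minimal C sketch of how such a mixed setting can be structured. It is an illustration written for this text, not the thesis source code; the MISSES/HITS/FLOPS ratios are hypothetical tuning knobs, and the 32 MB working set assumes the 64-byte cache blocks and 12 MB last-level cache described in section 4.2.1.

#include <stdlib.h>
#include <stdio.h>

#define BLOCK   64                      /* cache-block size in bytes        */
#define WSS     (32UL * 1024 * 1024)    /* working set, larger than the LLC */
#define STRIDE  (BLOCK / sizeof(size_t))
#define NBLOCKS (WSS / BLOCK)

int main(void)
{
    /* chase[] holds a pointer-chasing chain through WSS bytes, so each
     * dereference is expected to miss the LLC; hot[] is small enough to
     * stay resident in the L1 cache, so reads from it always hit. */
    size_t *chase = aligned_alloc(BLOCK, WSS);
    static double hot[512];
    for (size_t i = 0; i < NBLOCKS; i++)
        chase[i * STRIDE] = ((i + 1) % NBLOCKS) * STRIDE;

    /* Hypothetical per-iteration mix: MISSES chasing loads that stall on
     * DRAM, HITS cache-resident loads and FLOPS arithmetic operations.
     * Raising MISSES relative to the other two raises the measured
     * memory-boundedness, and vice versa. */
    enum { MISSES = 2, HITS = 8, FLOPS = 16 };
    size_t p = 0;
    double acc = 1.0;
    for (size_t i = 0; i < NBLOCKS; i++) {
        for (int m = 0; m < MISSES; m++) p = chase[p];
        for (int h = 0; h < HITS;   h++) acc += hot[(i + h) % 512];
        for (int c = 0; c < FLOPS;  c++) acc = acc * 1.000001 + 0.5;
    }
    printf("%f %zu\n", acc, p);         /* keep the work observable */
    free(chase);
    return 0;
}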
These cases were then aggregated to run sequentially, so that one run of the microbenchmark runs all settings in order, for when one wants to use it to train a DVFS predictor. The microbenchmark with the independent settings can still be used for research purposes or for reconfiguring the settings.

Figure 4.1: Basic illustration of the pointer-chasing technique.

4.2.1 Implementation and testing the microbenchmark

The aim of the microbenchmark is to offer different settings of memory-boundedness; the first goal is to create the most memory-intensive program possible. To create a high level of memory-boundedness, each memory access should be a miss. Misses cause stalls in the pipeline, because the execution waits for memory to return a value necessary for the continuation of the execution.

To generate misses, the sizes of the cache blocks and of the last-level cache need to be known. The machine used for creating this microbenchmark has a cache-block size of 64 bytes and a last-level cache of 12 MB. To generate misses, the working-set size needs to be larger than 12 MB, since otherwise the entire working set fits in the last-level cache. Each cache block is read from only once: the first address of each cache block is loaded and the rest of the cache block is ignored, so that each load instruction causes a last-level cache miss. Another technique used to increase the amount of memory-boundedness is loop unrolling, which increases the number of load instructions per loop iteration so that fewer cycles are spent as overhead for running the loop. The optimal number of instructions per loop iteration for high memory-boundedness was determined to be 16 load instructions; going higher would decrease the memory-boundedness. Combining these techniques and using the different memory-boundedness formulas, a memory-boundedness of between 73% and 85% could be achieved, depending on which formula was used.

The memory-boundedness of the microbenchmark can be configured to produce different variations. The variations of memory-boundedness are built from instructions that generate misses, instructions that hit and instructions that perform a computation. The variations can be chosen among the settings 0, 10, 20, 23, 25, 30, 35, 40, 45, 50, 55, 60, 70 and highest possible, as percentages of memory-boundedness calculated using the formula created in this project. Equation 4.1 shows how the memory-boundedness for each configuration is calculated. The microbenchmark is configured for a frequency of 4 GHz; the percentage of each setting is expected to decrease if a lower frequency is used and increase if a higher frequency is used.

To add parallelism to the microbenchmark, OpenMP was used. The working-set size scales with the number of threads and each thread works independently on its own data. The pseudo-code below shows how the microbenchmark is structured and what is required for it to function accurately. To replicate the results, the system needs to have the same cache-block size, working-set size, processor speed and memory speed. The memory-boundedness is configured by using memory and computational instructions in combination. The main ways to increase or decrease memory-boundedness are loop unrolling, memory misses, memory hits and computational instructions. The memory-boundedness instruction combination is run 1000 times to ensure more accurate and reliable measurements.
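The following is a minimal sketch of the most memory-bound kernel described above: a pointer-chasing chain with one load per 64-byte cache block over a working set larger than the 12 MB last-level cache, traversed with 16 loads per loop iteration. It is an illustration written for this text under those assumptions, not the thesis source code.

#include <stdlib.h>
#include <stdio.h>

#define BLOCK   64                      /* cache-block size in bytes      */
#define WSS     (32UL * 1024 * 1024)    /* working set: 32 MB > 12 MB LLC */
#define STRIDE  (BLOCK / sizeof(size_t))
#define NBLOCKS (WSS / BLOCK)

int main(void)
{
    /* The first word of each 64-byte block stores the index of the next
     * block to visit, so every load is a dependent (serialized) access
     * and each block is touched exactly once per pass. */
    size_t *buf = aligned_alloc(BLOCK, WSS);
    for (size_t i = 0; i < NBLOCKS; i++)
        buf[i * STRIDE] = ((i + 1) % NBLOCKS) * STRIDE;

    /* Traverse the chain with 16 loads per iteration, the unrolling
     * factor the thesis found to maximize memory-boundedness, so that
     * as few cycles as possible are spent on loop overhead. */
    size_t p = 0;
    for (size_t it = 0; it < NBLOCKS / 16; it++) {
        p = buf[p]; p = buf[p]; p = buf[p]; p = buf[p];
        p = buf[p]; p = buf[p]; p = buf[p]; p = buf[p];
        p = buf[p]; p = buf[p]; p = buf[p]; p = buf[p];
        p = buf[p]; p = buf[p]; p = buf[p]; p = buf[p];
    }
    printf("%zu\n", p);                 /* keep the chase observable */
    free(buf);
    return 0;
}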
For the memory-bound settings to be accurate, the system is required to have the same cache-block size, and the working-set size needs to be greater than the last-level cache size to generate misses. The CPU and RAM speeds need to be the same as on the machine used to configure these settings. The instructions executed for each memory-boundedness setting remain the same, but they can be adjusted for other system configurations. Algorithm 1 shows the code of the microbenchmark which runs one setting depending on user input. Algorithm 2 shows the code of the microbenchmark when running all settings in order. Algorithm 1 was used for testing and tuning the settings. Algorithm 2 is the final microbenchmark, which covers a wide range of memory-boundedness behaviours in one execution. The results of running the microbenchmark will vary depending on which platform it is run on, but will still cover a wide range of memory-boundedness even if the results differ. The setting for 23% was added to fill large gaps discovered in the results when increasing the number of threads, in order to provide a larger variety of memory-boundedness.

Algorithm 1 Code for testing each setting
Require: Cache-block size = 64 bytes
Require: Cache size