Optimizing Quantum Computer Simula- tions With Data Compression & GPU Ac- celeration Bachelor’s thesis in Computer science and engineering Erik Ljung Darko Petrov Felix Bråberg Björn Forssén Beata Ringmar Jonas Hedlund Department of Computer Science and Engineering CHALMERS UNIVERSITY OF TECHNOLOGY UNIVERSITY OF GOTHENBURG Gothenburg, Sweden 2022 Bachelor’s thesis 2022 Optimizing Quantum Computer Simulations With Data Compression & GPU Acceleration Erik Ljung Darko Petrov Felix Bråberg Björn Forssén Beata Ringmar Jonas Hedlund Department of Computer Science and Engineering Chalmers University of Technology University of Gothenburg Gothenburg, Sweden 2022 Optimizing Quantum Computer Simulations With Data Compression & GPU Ac- celeration © Erik Ljung, Darko Petrov, Felix Bråberg, Björn Forssén, Beata Ringmar, Jonas Hedlund 2022. Supervisor: Stavroula Zouzoula, Computer and Network Systems division, Depart- ment of Computer Science and Engineering Examiner: Pedro Petersen Moura Trancoso, Computer and Network Systems divi- sion, Department of Computer Science and Engineering Science Bachelor’s Thesis 2022 Department of Computer Science and Engineering Chalmers University of Technology and University of Gothenburg SE-412 96 Gothenburg Telephone +46 31 772 1000 Typeset in LATEX Gothenburg, Sweden 2022 iii Abstract Simulating quantum computers involves high memory usage and often long execu- tion times. For that reason the purpose of this project was to analyze whether data compression and GPU acceleration can be used to run simulations with more qubits than previously allowed. Ultimately this project sheds some light on how GPU ac- celeration and data compression algorithms, ZFP and FPZIP, impact the amount of qubits that are able to be simulated. The simulator tested in this project was a modified version of the Quantum Exact Simulation Toolkit (QuEST). From the results of this project it was found that data compression shows good potential in decreasing the total memory usage per qubit size. However, the use of data com- pression negatively impacted the execution time, but by using GPU acceleration the impact was reduced. iv Sammandrag Simulering av kvantdatorer har en hög minnesanvändning och ofta långa exekver- ingstider. Av den anledningen var syftet med detta projektet att analysera ifall datakomprimering och GPU-acceleration kan användas för att köra simuleringar med fler qubits än vad som var tidigare möjligt. Detta projekt belyser hur GPU- acceleration och komprimeringsalgoritmerna, ZFP och FPZIP, påverkar mängden qubits som kan simuleras. Simulatorn som testades i projektet var en modifierad version av Quantum Exact Simulation Toolkit (QuEST). Resultatet av projektet påvisade att datakomprimering visar en god potential för att minska den totala minnesanvändningen per qubit. Användningen av datakomprimering påverkade dock exekveringstiden negativt, men genom att använda GPU-acceleration minskades ef- fekterna. Acknowledgements We would like to express our deepest gratitude to our supervisor Stavroula Zouzoula who has granted invaluable insight and support through the course of this project. Contents List of Figures ix List of Tables 1 1 Introduction 3 1.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.2 Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 1.3 Scope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2 Theory 6 2.1 Simulating Quantum Circuits . . . . . . . . . . . . . . . . . . . . . . 6 2.1.1 Grover’s Search Algorithm . . . . . . . . . . . . . . . . . . . . 8 2.1.2 QuEST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2 Data Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.2.1 ZFP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10 2.2.2 FPZIP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3 Graphics Processing Unit . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3.1 CUDA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.4 Measuring Performance . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.5 Fidelity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 3 Implementation 15 3.1 Adapting QuEST to Compressed Data . . . . . . . . . . . . . . . . . 15 3.1.1 Defining a Memory Structure . . . . . . . . . . . . . . . . . . 15 3.1.2 Fetching & Modifying the State Vector . . . . . . . . . . . . . 16 3.1.3 Moving Data Across Two Different Blocks . . . . . . . . . . . 17 3.1.4 Static & Dynamic Memory Allocation . . . . . . . . . . . . . 18 3.2 Testing of Data Compression . . . . . . . . . . . . . . . . . . . . . . . 19 3.3 Determining Error Rate for Compression . . . . . . . . . . . . . . . . 21 3.4 GPU Acceleration of the Simulator . . . . . . . . . . . . . . . . . . . 21 3.4.1 Testing GPU Acceleration . . . . . . . . . . . . . . . . . . . . 22 4 Results & Discussion 23 4.1 Original QuEST . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 4.2 ZFP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 4.2.1 ZFP & Number of Qubits . . . . . . . . . . . . . . . . . . . . 26 4.2.2 Comparing modes of ZFP . . . . . . . . . . . . . . . . . . . . 27 vii Contents 4.3 FPZIP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30 4.4 Fidelity of ZFP and FPZIP . . . . . . . . . . . . . . . . . . . . . . . 32 4.5 GPU Acceleration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4.5.1 Limitations of GPU acceleration . . . . . . . . . . . . . . . . . 35 4.6 Determining the feasibility of compression . . . . . . . . . . . . . . . 35 4.7 Ethics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37 5 Conclusion 38 5.1 Project Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 5.2 Possible Sources of Error . . . . . . . . . . . . . . . . . . . . . . . . . 38 5.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 Bibliography 40 viii List of Figures 2.1 The oracle operation applied to |s⟩ . . . . . . . . . . . . . . . . . . . 9 2.2 The reflection operator . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3 Visual representation of fidelity shown as the intersection between two state vectors ρ0 and ρ1 [1] . . . . . . . . . . . . . . . . . . . . . . 14 3.1 The memory block structure 1) saving decompressed state to mem- ory 2) loading compressed state from memory 3) loading compressed block into second decompressed block . . . . . . . . . . . . . . . . . . 16 3.2 Code for handling the loading of compressed memory blocks . . . . . 17 3.3 Code for handling the process of determining which block needs to be replaced in memory . . . . . . . . . . . . . . . . . . . . . . . . . . 17 3.4 Code for handling the dynamic allocation when saving data to a com- pressed memory block . . . . . . . . . . . . . . . . . . . . . . . . . . 18 4.1 Memory usage (a) and execution time (b) of running Grover’s search algorithm on QuEST with no data compression . . . . . . . . . . . . 23 4.2 Execution time at 16 qubits over the block sizes 32-16384 floating- points values for ZFP compression modes, fixed rate (-r), fixed preci- sion (-p), fixed accuracy (-a) and dynamic allocation (-d). . . . . . . . 24 4.3 Memory used at 16 qubits over the block sizes 32-16384 floating-points for ZFP compression modes, fixed rate (-r), fixed precision (-p), fixed accuracy (-a) and dynamic allocation (-d). . . . . . . . . . . . . . . . 25 4.4 Memory usage in KB (a) and compression ratio (b) per qubit when running Grover’s search algorithm. Fixed rate (-r), fixed precision (-p), fixed accuracy (-a) and dynamic allocation (-d) . . . . . . . . . . 26 4.5 Execution time in ms (a) and original execution time in ms (b) per qubit when running Grover’s search algorithm. Fixed rate (-r), fixed precision (-p), fixed accuracy (-a) and dynamic allocation (-d) . . . . 27 4.6 Memory in KB per qubit comparing the ZFP modes, fixed-rate, fixed- precision, and fixed-accuracy compression (a) static and (b) dynamic allocation for 32-bit precision and 6 decimal point accuracy, (c) static and (d) dynamic allocation for 16-bit precision and 4 decimal point accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 4.7 Time in ms per qubit comparing the ZFP’s precision compression to fixed-rate using dynamic memory allocation for (a) 32 bit, (b) 16 bit floating-point were (c) and (d) use dynamic memory allocation . . . . 29 ix List of Figures 4.8 Memory usage (a) and execution time (b) of running Grover’s search algorithm on QuEST with FPZIP compression in range of 2048-16384 floating-points per block size for 16 quits . . . . . . . . . . . . . . . . 30 4.9 Memory usage (a) and execution time (b) per qubit running Grover’s search algorithm on QuEST with FPZIP compression . . . . . . . . . 31 4.10 Compression ratio (a) and execution time (b) per qubit for all com- pression algorithms and modes running Grover’s search algorithm on QuEST. Fixed rate (-r), fixed precision (-p), fixed accuracy (-a) and dynamic allocation (-d) . . . . . . . . . . . . . . . . . . . . . . . . . . 32 4.11 Illustration of the underlying fidelity for the different algorithms (a) all results for fidelity and (b) fidelity results without zfp -r 16 . . . . 33 4.12 Execution time on the GPU as function of block size for 16 qubits (a) complete block range and (b) last 6 block ranges . . . . . . . . . . . . 34 4.13 Best execution time (a) and speed up for best execution time (b) on the GPU. Fixed rate (-r), and dynamic allocation (-d) . . . . . . . . . 34 4.14 Execution time at the best memory usage (a) and speed up of the ex- ecution time (b) on the GPU. Fixed rate (-r), and dynamic allocation (-d) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35 x List of Tables 3.1 Test parameters for testing ZFP compression. Here -p refers to pre- cision, -r to rate and -a to accuracy. . . . . . . . . . . . . . . . . . . . 20 3.2 Test parameters for testing FPZIP compression . . . . . . . . . . . . 20 3.3 Test parameters for testing ZFP compression with GPU acceleration . 22 1 List of Tables Glossary Bloch sphere - A geometric representation of a quantum state. Bottleneck - A part of a program, in this project, that is particularly inhibiting for the overall performance. CPU - Central Processing Unit, this is where the main calculations happen in a computer. CUDA - Compute Unified Device Architecture is a parallel programming model designed by Nvidia. It’s used for accessing and programming Nvidia GPUs. Floating-point - A way to represent a large range of decimal numbers with a small number of bits by sacrificing some accuracy. GPU - Graphics Processing Unit, a hardware component dedicated to perform many parallel calculations, usually relating to graphics, and assist the CPU. heaptrack - A tool for analyzing heap memory utilization. heaptrack_print - A tool for reading output files from heaptrack and presenting them in a human readable form. Lossless compression - Compression technique that will perform reversible com- pression where no data from the original source is lost. Lossy compression - Compression technique that will perform irreversible com- pression where data from the original source is lost. Negabinary - A number represented in base −2. Quantum gate - A logic gate representing the quantum logic applied to any num- ber of qubits much like classical logic gates. RAM - Random-Access Memory, the main memory unit of a computer. VRAM - Video Random-Access Memory, the main memory unit on a graphics card. 2 1 Introduction 1.1 Background Quantum computers as a concept were first mentioned by the mathematician Yuri Manin and later gained traction when the physicist Richard Feynman furthered the concept [2]. Feynman argued that as the complexity of the system increases, simula- tions of quantum systems become infeasible on classical computers. He even stated during a keynote address at the California Institute of Technology that, "...nature isn’t classical, ... and if you want to make a simulation of nature, you’d better make it quantum mechanical, and by golly it’s a wonderful problem, because it doesn’t look so easy." [3]. The purpose for the development of quantum computers is to solve problems that are otherwise too difficult to solve on classical computers. Through the utilization of specially designed algorithms, this purpose can be achieved. One such algorithm is Shor’s algorithm, which, with a quantum computer, can perform prime factorization in polynomial time [4]. This is a significant improvement from classical algorithms that perform prime factorization in non-polynomial time. Quantum computers can, therefore, have a big impact on cyber security since modern cryptography relies on the difficulty of prime factorization. The main difference between a classic and quantum computer, is the size of the problem space in which they operate. The basic building blocks in classical com- puters are bits, which can have a value of either 1 or 0. In quantum computers, bits are instead replaced with quantum bits, or qubits. A qubit can fall into one of three different states: 1, 0, or a superposition state of 1 and 0. While classical computers can only represent one specific state at a time, quantum computers are able to rep- resent all possible states, known as quantum superposition [2]. The reason quantum computers can operate in a larger problem space is because they can utilize quan- tum mechanical properties such as superposition and quantum entanglement. To fully utilize the potential in a quantum computer, specific quantum algorithms are required. These algorithms take advantage of the quantum mechanical properties to simultaneously represent all possible states and then collapse the system into a desired end state. While, theoretically, quantum computers have a broad range of applications, the ability to construct these machines is still in early development. IBM only recently 3 1. Introduction revealed their quantum computer with 127 qubits [5]. This is when the role of a simulator becomes imperative. Simulators are often used to replicate the functional- ity of quantum algorithms by simulating the qubits on classical hardware [6]. There are multiple ways of implementing a quantum simulator on a classical computer. One such way is to represent the quantum states as state vectors. Matrices are used to represent quantum gates, and show how the state vectors can be manipulated. Quantum Exact Simulation Toolkit (QuEST) is an example of a simulator that im- plements pure state-vector representation to simulate a quantum computer [7]. A drawback with the state-vector based simulators is that resource requirements will grow by a factor of 2 for each added qubit, e.g. simulating 30 qubits will require a state-vector of size 230 to represent each possible state [8]. One of the main resource limitations when it comes to simulating quantum com- puters is the increase in memory requirement for each new qubit [8]. One way to address this problem is to use data compression [9]. Data compression works by compressing larger pieces of data into smaller representations. There are two types of compression techniques, lossless compression and lossy compression. Lossless compression refers to compression where no data is lost. Although lossless compres- sion is effective there are practical limits to how much the data can be compressed [10]. An alternative compression method is lossy compression. Lossy compression may achieve a higher compression ratio but comes with the risk of data loss. Due to this risk, lossy compression is not inherently reversible [11]. Another common way to improve performance is through hardware acceleration. It can be used to offload some of the work that the Central Processing Unit (CPU) performs and potentially do faster computations. This is because hardware ac- celerators are typically designed for a specific task and that task only. Graphical Processing Units (GPUs) are well-suited for this role as they are more able to be parallelized than a CPU, thereby allowing a higher level of concurrent executions [12]. 1.2 Purpose When simulating the full state of quantum computer circuits on classical hardware, one of the key issues is the memory and computation requirements that increase ex- ponentially with the number of qubits [8]. For example, a quantum circuit consisting of 25 qubits can be simulated on a laptop, while a 50 qubit circuit will require a supercomputer. Therefore, the purpose of this project is to analyze if data compres- sion and GPU acceleration can be used to allow the simulation of larger quantum circuits at a reasonable speed. Ultimately this project should provide insight into how different compression algorithms and GPU acceleration can impact the number of qubits that can be simulated on classical hardware. The task of testing different compression algorithms included both lossless and lossy techniques. The reason for trying both categories of compression was to provide further knowledge of how data loss may impact the performance, size, and correctness of a quantum circuit. 4 1. Introduction 1.3 Scope Certain limitations were set in order to narrow the scope of the project. One such limitation was to implement optimization techniques for only one quantum com- puter simulator. Each simulator on the market uses a different simulation method in order to focus on a different area of research. The choice to study only one simulator, allowed for a thorough investigation into how compression and hardware acceleration affect memory usage and execution time for that specific simulator. Qiskit [13] and CirQ [14] are well-established quantum simulators made by some of the biggest names in quantum computing today, Google and IBM. Both Google and IBM are also developers of actual quantum computing hardware, which may lead to the simulator’s libraries containing fewer details of the actual quantum com- putational procedure [15]. On the other hand, QuEST was built purely with the intention to be used for research purposes and as such was better suited for the needs of this study [15]. The choice to focus on one simulator also resulted in the limitation of one simulation method. The use of QuEST meant limiting the project to the state vector method of simulating general quantum circuits. While there are various methods of simulation, the state vector method is fast and versatile when memory size is sufficient and the number of qubits is low [16]. Another limitation of the project was the choice to study the effects of the opti- mization techniques using one quantum algorithm. The algorithm in question was Grover’s search algorithm which is a well-established quantum algorithm for effi- ciently finding an element in an unsorted list [17]. The choice to limit the study to only one algorithm was related to the problem statement and scope. The focus of this project was not on the quantum algorithm itself, but instead on the algorithm’s performance and whether that performance can be improved using techniques found in classical computing. The study was also limited by two compression algorithms, one lossless and one lossy. While this was partially due to the project’s time con- straint, the choice to study two different compression algorithms was deliberate. By implementing two contrasting compression methods, one could easily compare which was superior in terms of resource requirements. 5 2 Theory This chapter presents the underlying theory necessary to understand the methods of implementation and testing discussed in the subsequent chapters. The topics covered in this chapter range from the basics of quantum computer simulators, data compression, performance metrics, and GPUs. 2.1 Simulating Quantum Circuits As previously stated, quantum bits differ from classical bits. There exists a way to represent qubits using classical bits to simulate quantum states on classical hard- ware. One way to do this is to describe quantum states using state vectors. The state vector representing a single qubit contains two probabilities for collapsing into either state 0 or 1. For two qubits, there are four possible states for the system to collapse into, and as such the state vector contains four entries. The depiction of these state vectors is shown below. 1 qubit[ a0 a1 ] 2 qubits a00 a01 a10 a11  n qubits a00...0 a00...1 ... ... ... a11...1  It is common to utilize Dirac notation when representing individual states inside a state vector. Using Dirac notation simplifies the representation of states with a higher number of qubits [18]. Below is an example of how Dirac notation works for a system of 2 qubits. |00⟩ =  1 0 0 0  |01⟩ =  0 1 0 0  |10⟩ =  0 0 1 0  |11⟩ =  0 0 0 1  In conclusion, to fully define the state of a quantum computer, a state vector of 2n amplitudes is required (n represents the number of qubits to simulate). This ex- ponential growth will also lead to an exponential increase in memory requirements when simulating quantum systems. After a certain number of qubits, it becomes very resource-intensive to simulate a quantum computer [19]. When using QuEST 6 2. Theory to simulate a circuit on a personal computer, only about 30 qubits can be run as the memory usage reaches approximately 16GB. This is the limit for most personal computers today, due to the limited memory on consumer hardware and the state vector’s exponential growth. Of course, simply using state vectors to represent qubits does not create a func- tional simulator, there must also be a way to simulate quantum gates. Classical gates provide the basic functionality inside conventional computers, in much the same way as quantum logic gates provide the basic functionality for quantum com- puters. Quantum gates are represented as matrices, applied to a qubit to perform basic operations. The most common quantum gates operate on one or two qubits, meaning that multiple gates are needed to form a quantum circuit, just as conven- tional circuits are composed of multiple logic gates [20]. Some of the most imperative single-bit gates are the Pauli gates, X, Y, Z which can be represented as matrices [21]. X = σx = [ 0 1 1 0 ] Y = σy = [ 0 −i i 0 ] Z = σz = [ 1 0 0 −1 ] The Pauli-X matrix, as understood from linear algebra, is used to represent the X- gate. When the X-gate operates on a qubit, the Pauli-X matrix is multiplied by that qubit’s state vector. An example of how this gate operates on |0⟩ is below. The gate can also be expressed as a rotation of π radians along the x-axis of the Bloch Sphere. The Pauli-X operator is analogous to the NOT-gate in classical computing. The Y- and Z-gates operate similarly, except that they rotate along the y- and z-axes, respectively [21]. X|0⟩ = [ 0 1 1 0 ] [ 1 0 ] = |1⟩ |0⟩ = [ 1 0 ] |1⟩ = [ 0 1 ] The Hadamard Gate, H, is also a fundamental operation in quantum computers. When given a basis state, the Hadamard Gate creates a superposition of the states. The matrix, shown below, is used to represent the gate. The Hadamard Gate and the Pauli Gates are both implemented in Grover’s search algorithm [20]. H = 1√ 2 [ 1 1 1 −1 ] 7 2. Theory 2.1.1 Grover’s Search Algorithm Grover’s search algorithm answers a classic computing problem, namely searching for an element within a list. Grover’s search algorithm is a clear demonstration of when the performance of a quantum computer significantly outperforms that of a classical computer. With classical computation, the time complexity of finding the correct element in an unsorted list is O(N) [17]. However, when using a quan- tum computer the time complexity is only O( √ N). This decrease in complexity is achieved by utilizing an oracle and a reflection operator. Together, these operations define a technique known as amplitude amplification. Simply, this technique ampli- fies the probability of the state collapsing into the winning state, w by reducing the likelihood of it collapsing into all the other states. The first operation in the algorithm is the oracle. The oracle works by taking several inputs and changing the winning state’s sign. The inputs are encoded as the basis states of a specified number of qubits. For example, for two qubits, the inputs will be the states 00, 01, 10, and 11. It flips the sign of the winning state to its negative phase through the use of a unitary matrix. To exemplify the algorithm’s functionality, let w = 11. When the position of the winning state is still unknown, the probability of the superposition collapsing to any of the basis states has equal probability. The state that represents this is known as the superposition state, or s [22]. The following vectors express the states |w⟩ and |s⟩: |w⟩ =  0 0 0 1  = |11⟩ |s⟩ =  1 1 1 1  1 2 The oracle takes |s⟩ as an input and outputs a state where the sign of the winning state has flipped, shown from the following vector: 1 1 1 −1  This oracle operation can also be described graphically, as seen in Figure 2.1. The graph on the left shows the original vectors |w⟩ and |s⟩. It also shows the state |s′⟩ that is orthogonal to |w⟩. The graph on the right shows the transformation of |s⟩ after the oracle has been applied. It has been reflected across |s′⟩ to get a new vector. 8 2. Theory Figure 2.1: The oracle operation applied to |s⟩ The next step is to apply a reflection operator, defined as 2|s⟨s| − 1, to this new vector. The result of this operation is the reflection of the vector back across |s′⟩. From Figure 2.2, one can see the combination of these two transformations has moved the vector closer to the winning vector, |w⟩. With 2 qubits, it only takes one application of the amplitude amplification operation for our quantum state to arrive at the winning state. However, by increasing the number of qubits, the number of times the amplitude amplification must be applied increases by √ N [22]. Figure 2.2: The reflection operator 2.1.2 QuEST QuEST is a simulator for quantum circuits that are built for high performance by using multi-threading, GPU acceleration, and distributed computation [23]. This allows QuEST to run efficient simulations on everything ranging from simple laptops to powerful supercomputers. The QuEST simulator supports both state-vector and density matrix representations, thereby allowing simulations of quantum circuits with pure or mixed states. State-vector representation involves having a vector filled with complex numbers. QuEST solves this by having two separate floating- point arrays in memory, where one of the arrays stores the real parts of the complex number and the other holds the imaginary parts. Together, these two arrays define the complete state of a quantum computer with a fixed number of qubits [7]. 2.2 Data Compression The amount of available data is constantly increasing, although handling it remains a challenge. One method of tackling this issue is through data compression. Data 9 2. Theory compression is the task of taking a representation of data and determining a smaller data size that still represents the same data. Today, data compression has many purposes, most commonly in the storage of images and audio files [24]. Compressing data can be done in one of two ways, either through lossless or lossy compression. Lossless compression refers to when no data is lost and the quality remains the same. What makes lossless compression advantageous is that after de- compression, the data is restored to its original form. This function is essential when compressing confidential documents or digital images where one wants to maintain high quality. The alternative is lossy compression which can achieve a higher com- pression ratio but at the risk of permanent data loss. In this type of compression, once a file is compressed it cannot be entirely restored to its original form [11]. 2.2.1 ZFP ZFP is a lossy compression algorithm for high compression of multi-dimensional nu- merical arrays[25]. The algorithm works by splitting the multi-dimensional array, or d-dimensional, into arrays of 4d values, also known as blocks. These blocks are compressed and decompressed independently of one another [26]. Once the data has been partitioned into blocks, the floating-point values contained in each block are converted to a block-floating-point representation. Block-floating-point is a method that assigns a single exponent to all the values in the block. Essentially, each of the floating-point values becomes a signed integer [26]. The blocks undergo a process of decorrelation using a near orthogonal transform to remove redundant information. This orthogonal transform resembles the discrete cosine transform used in image processing [27]. It transforms data from a spatial domain into a frequency domain, where the higher frequencies are identified as redundant [28]. Next, the signed integer coefficients are reordered in decreasing magnitude resulting in the grouping of ones together and zeros together in the bit plane [27]. After which, the standard representation of integers, or the two’s complement signed in- tegers, are converted to their negabinary-representation. The leftmost one-bit in this representation encodes the sign and approximates the magnitude of a value. Negabinary-representation facilitates encoding due to the numbers having many leading zeros regardless of the sign. Currently, the list of 4d values is sorted by the magnitude of the coefficient. The next step is to reorder them from most to least significant bit on the bit plane [26]. Often, coefficients have many leading zeros. Only a single zero-bit is needed to represent a group consisting only of zeros, which results in less memory usage. As such, the goal is for groups of zero-bits to be encoded together. Each bit plane is losslessly compressed using embedded coding. The embedded coder will emit one bit at a time until it satisfies the criteria for stopping. This stopping criterion is defined by which mode of ZFP compression is being run, namely fixed-rate, fixed-precision, or fixed-accuracy [27]. 10 2. Theory The three compression modes each determine a different way of compressing data. Fixed-rate compression stores the compressed floating-point number as a set number of bits. In other words, it will always compress the data to a defined limit even if the data has the potential to be represented using a smaller data set. Using fixed- precision compression, the bit size of the compressed data may vary but will never become greater than some fixed limit. The difference between these two modes is that fixed-precision can represent the same amount of data using a smaller com- pressed data size. Lastly, fixed-accuracy compression limits the compressed data based on a defined accuracy for the floating-point number, maintained throughout the execution. If, for example, the tolerance is three decimal points, fixed-accuracy mode will compress the data to a size that corresponds to three decimal point ac- curacy [27]. 2.2.2 FPZIP FPZIP is a lossless compression algorithm for two or three-dimensional floating- point arrays. The algorithm parses through the array and predicts each data point based on a subset of data available to the decompressor. The data in this subset is already encoded. Next, the predicted values and the actual floating-point values, are mapped to their signed integer representation which calculates the prediction residu- als. If the residuals were to be calculated by subtraction of the original floating-point values it might cause an irreversible loss of information [29]. Then, the residual values are partitioned into entropy codes and raw bits by an entropy coder. Using a method known as range coding, the entropy coder encodes the symbols by taking a range of integers that function as a unique representation of a string. Each symbol in the string has a probability distribution that divides the original range into sub-ranges. The size of these intervals will be proportionate to the probability of the symbol it represents. These portions must be at non-overlapping intervals. The range is then reduced further to a sub-interval which corresponds to the next symbol to be encoded. This process then repeats itself for all the remaining values [29]. 2.3 Graphics Processing Unit Hardware acceleration is the process of running a dedicated set of instructions on a hardware component that is optimized to perform those specific instructions effi- ciently. One typical example of hardware acceleration is rendering computer graph- ics, where the rendering task usually is offloaded to a dedicated Graphics Processing Unit, or GPU. The GPU is much better suited for handling tasks that allow for a high level of parallelism, such as calculating the colors of dedicated pixels [30]. GPUs are different from CPUs in that they can perform many computations simultane- ously. CPUs, in contrast, typically perform complex instructions that are sequential in nature. GPUs are built to handle a large number of graphics calculations which often involve floating-point arithmetic. GPUs also differ from CPUs on an architec- tural level. GPUs were created to prevent the CPU from becoming overwhelmed 11 2. Theory when performing graphics calculations. Today, GPUs have grown to become sophis- ticated and powerful multi-core systems that can respond to high computational demands. Mainly, GPUs are either integrated on the motherboard of the respective computer or connected via a PCIe port. The latter alternative usually yields better performance [12]. 2.3.1 CUDA Compute Unified Device Architecture (CUDA) is a programming model created by Nvidia to provide programming access to their line of GPUs used in consumer soft- ware. CUDA can improve a program’s performance by carrying-out calculations on the GPU instead of the CPU. A developer can use CUDA with several programming languages such as C/C++, OpenCL, Fortran, etc. The CUDA programming model uses an abstraction hierarchy. This hierarchy consists of warps, grids, threads, and blocks, of which threads are the fundamental execution units. Multiple threads make up a block and multiple blocks make up a so-called grid. A warp is a composite of threads executing in Single-Instruction-Multiple-Thread (SIMT) mode. SIMT is when the multiple threads in a warp can simultaneously perform the same instruc- tions. There are two types of functions, those executed on the GPU, referred to as kernel functions, and those executed on the CPU. However, it is the responsibility of the CPU to transfer the data between the host (CPU) and the device (GPU). The CPU is also responsible for the invocation of the said kernel functions and the setting up of the dimensions for the blocks and grids of the device. Parameters such as these are to be defined by the programmer [31]. The architecture of a GPU, consisting of thousands of threads compared to the tens of threads on a CPU, promotes the parallel execution of tasks. While the single thread speed on a CPU is usually higher than that on a GPU, the throughput of the GPU is higher as a result of the high parallelism that it offers. Not all Nvidia GPUs support CUDA, but some that do are the GeForce, TitanX series GPUs as well as a collection of the Quadro and RTX series [32]. 2.4 Measuring Performance To achieve a formal and measurable quantity of performance, ’performance’ is de- fined as the absolute execution time of a program per the definition stated by Patter- son and Hennessy [33]. Equation 2.1 and equation 2.2 both illustrate the underlying equation. For equation 2.2, I is the number of instructions of a program, CPI is the average number of cycles per instruction and T is the period of the system clock. Execution T ime = Instructions Program × Clock Cycles Instruction × Seconds Clock Cycle (2.1) CPU Time = I × CPI × T (2.2) 12 2. Theory Not only is defining a metric for execution time important so is finding a definition of memory utilization since it is the core tool for analyzing the efficacy of the compres- sion algorithms. The metric used for measuring memory usage is compression ratio. Equation 2.3 below shows the specific formula for calculating the data compression ratio. Compression Ratio = Uncompressed Size Compressed Size (2.3) 2.5 Fidelity When information travels, it is vulnerable to changes, both expected and unex- pected. One way to describe the overall similarity between the sent and received information is fidelity. In quantum computing, fidelity is the measure of the error rate produced between the expected output and the actual output. In Quantum computer simulator when compression is applied data loss can hap- pened. To determine the data loss, fidelity can be used to represent the similarity between the original data and the data after decompression. When changes occur in the state vectors, potential errors can arise in the simulation. State vectors deter- minate how the qubits will collapse to get the resulting calculation and high fidelity minimizes the risk of qubits collapsing to the wrong state. Using Equation 2.4 fidelity can be calculated mathematically by turning into density matrix and comparing the original state vectors as ρ1 and the state vectors after using compression as ρ2. This produces a number between 0 and 1 that represent how similar the two state vectors are. F (ρ1, ρ2) = {trace[ √√ ρ1ρ2 √ ρ1]}2 (2.4) Figure 2.3 shows visual representation of how fidelity is calculated. Comparing ρ1 as the original data and ρ2 as the data after compression. Fidelity is represented in the intersecting area between the two graphs. 13 2. Theory Figure 2.3: Visual representation of fidelity shown as the intersection between two state vectors ρ0 and ρ1 [1] 14 3 Implementation This chapter details the methods used to integrate data compression into QuEST, implement GPU acceleration, and test these new techniques. Describing the method- ology of the study is crucial when contextualizing the final results. 3.1 Adapting QuEST to Compressed Data As mentioned in Chapter 2, the QuEST simulator used two arrays of floating-point numbers to represent the current state of the quantum computer. When performing operations, QuEST directly fetched and modified the floating-point numbers in these two arrays, thereby changing the underlying state of the quantum system. Given this approach, a new memory structure was created to support the compression of the state vector. If the memory structure remained the same, one would run into the same problem as when no data compression is used. Which is to say, if the entire state vector were to be decompressed at one time, it would require a memory size as large as the original state vector, defeating the purpose of using data compression. 3.1.1 Defining a Memory Structure The new memory structure took advantage of a block-style implementation. The two state vector arrays were split into multiple blocks where each block contained a fixed number of floating-point values. Through this approach, only a fixed number of blocks remained decompressed in memory at one time. This allowed the remaining blocks to remain compressed in memory and decreased the overall memory usage. Figure 3.1 illustrates how the memory block structure works. As can be seen, two decompressed blocks are used to fetch and modify the state vector’s floating-point values. 15 3. Implementation Figure 3.1: The memory block structure 1) saving decompressed state to memory 2) loading compressed state from memory 3) loading compressed block into second decompressed block For a specific index in the state vector to be fetched or modified, the QuEST simula- tor recalculated the index into two separate parts, one block index and one internal index. As the names imply, the block index determined which block was to be ac- cessed while the internal index determined the index of the value within the specific block. 3.1.2 Fetching & Modifying the State Vector By implementing a new memory block structure, the process of accessing specific values within the state vector became more complex. More explicitly, the new memory structure resulted in the handling of three potential scenarios to ensure the compression and decompression of the correct memory block. The first of these scenarios was when no compressed block was already decom- pressed in the memory. In this case, QuEST needed to find the correct block and decompress it before performing the fetch or modify operation. In the second sce- nario, the decompressed block in memory had the same block index as the index currently being accessed. As a result, no further action was necessary as QuEST had previously accessed the block. In the final scenario, the decompressed block in memory did not share the same block index as the current index, and thus QuEST had to perform two actions to access the correct value. Firstly, it had to compress the existing decompressed block and store it in the correct location in the memory structure. Secondly, QuEST had to find the correct block corresponding to the cur- rent index, decompress the block, and store it in memory, thereby allowing access to the correct value. Figure 3.2 shows the specific code for the handling of these three scenarios. 16 3. Implementation Figure 3.2: Code for handling the loading of compressed memory blocks 3.1.3 Moving Data Across Two Different Blocks The quantum gates used in QuEST involved moving data from one block of memory to another. This was done in the modified QuEST simulator by allowing two blocks to remain decompressed in memory at one time to avoid the need for repeated compression and decompression operations. The simple logic of the least recently used (LRU) determined which decompressed block was to be replaced when accessing a compressed block. In other words, the memory block that was accessed previously remained decompressed while the other memory block was replaced. Figure 3.3 below shows the underlying code used for finding which of the decompressed blocks to replace. Figure 3.3: Code for handling the process of determining which block needs to be replaced in memory 17 3. Implementation 3.1.4 Static & Dynamic Memory Allocation The block memory structure allowed the memory to be allocated in one of two ways, either statically at the beginning of the program or dynamically when memory was first accessed. These two types of memory allocation resulted in different optimiza- tion effects. Static memory allocation had the potential to allow for faster execution time as there was no temporary storage required during the compression of the orig- inal state vector. Yet, dynamic memory allocation could result in a more effective compression since the space allocated at any given moment was only that which was indispensable for storing the current compressed data block. During static memory allocation, a fixed amount of space was allocated at com- pile time, regardless of whether it was fully utilized. However, in dynamic memory allocation, a temporary storage block was allocated during compile-time and used as intermediary storage while a data block was being compressed. After compression, the size of the compressed data was used to allocate memory, and the data was copied into the correct memory block. Through this method, only the size of the currently compressed data was allocated for any given memory block. Figure 3.4 below shows the code for handling the dynamic allocation when saving data to a memory block. Figure 3.4: Code for handling the dynamic allocation when saving data to a compressed memory block Whether to use static or dynamic memory allocation depended on the compression algorithm itself. In this case, ZFP supported both static and dynamic memory allo- cation, while FPZIP only worked with dynamic allocation. This limitation is because FPZIP does not provide a maximum estimate for the final size of the compressed memory, rendering it impossible to efficiently utilize static memory allocation with- out allocating the same size as the uncompressed data. 18 3. Implementation 3.2 Testing of Data Compression To be able to understand if data compression was a valid alternative for simulating larger quantum circuits several tests had to be performed. The tests were performed by simulating an implementation of Grover’s search algorithm on the modified in- stance of QuEST. Memory usage and execution time were the main factors used when evaluating the effectiveness of the compression algorithms. Memory usage showed whether data compression allowed for the simulation of larger quantum cir- cuits. It could also easily compare the performance of the two algorithms based on the ability to compress the state vector. The feasibility of integrating compression into QuEST was measured using execution time. If execution time was too large, the simulation of higher qubit circuits would be impractical, thereby deeming the integration of ZIP and FPZIP as infeasible in practice. A test script was created to collect the memory usage and execution time data. The test script’s purpose was to ensure the tests were performed in controlled settings and with similar inputs. For memory usage, the test script used heaptrack to generate an output file. Next, heaptrack_print could extract the specific information about QuEST’s peak mem- ory utilization. To collect the data on execution time, the perf_counter_ns function within Python’s time package was used. This tool calculated the difference between the start and the end of a test run in nanoseconds. The results collected from the tests were copied to a file to be parsed by a data processing script. The new memory block structure also made it relevant to test a new independent variable, namely block size. The range of block sizes to be tested depends on the number of qubits used. If a block size was too big, the entire state vector would remain decompressed in memory, which led to no call to the compression algorithm. For example, for 15 qubits the maximum block size was 215 − 1 = 32767. Any value above the maximum size would cause the entire state vector to fit within the de- compressed block. The effect block size had on memory usage was also considered. An increase in block size means that the decompressed memory block would also increase in size and potentially result in higher memory usage. As stated in Chapter 1, one of the main resource limitations when simulating quantum computers was the increase in memory requirement for each new qubit. For this reason, the number of qubits used in the quantum circuit was another independent variable tested. In order to test a higher number of qubits, one also has to test higher block sizes. For example, 10 different block sizes were tested for a quantum circuit running 16 qubits. While limiting the number of tests run, it still provided a clear account of how block size impacts execution time and memory usage. From Table 3.1 and Table 3.2 it is possible to see the different tests performed, and that for 16 qubits ten different block sizes were used. Another aspect when testing the quantum simulator was the modes and compres- sion sizes of ZFP and FPZIP. For ZFP, the tested modes were fixed-rate (-r), fixed- precision (-p), and fixed-accuracy (-a). fixed-rate meant that the compression al- gorithm would compress the original floating-points (64 bits) down to fixed data size. In this case, 32-bit or 16-bit representation. The fixed-precision parameter 19 3. Implementation worked similarly to fixed-rate, the difference being that it would be able to com- press the data even lower than the set limit. fixed-precision was tested with the same compression size as used for fixed-rate (32-bit and 16-bit). Lastly, fixed-accuracy compressed the data down to a representation that guaranteed a fixed level of ac- curacy. In this case, the tested parameters were four decimal accuracy (1e-4) or six decimal accuracy (1e-6). Table 3.1 illustrates the different modes and compression sizes used to test the ZFP implementation. ZFP was tested using both static and dynamic allocation (-d). Table 3.1: Test parameters for testing ZFP compression. Here -p refers to precision, -r to rate and -a to accuracy. Qubits Block Size Allocation Type Parameters 10 32, 64, 128, 256 static, dynamic -r 32, -r 16, -p 32, -p 16, -a 1e-4, -a 1e-6 11 64, 128, 256, 512 static, dynamic -r 32, -r 16, -p 32, -p 16, -a 1e-4, -a 1e-6 12 128, 256, 512, 1024 static, dynamic -r 32, -r 16, -p 32, -p 16, -a 1e-4, -a 1e-6 13 256, 512, 1024, 2048 static, dynamic -r 32, -r 16, -p 32, -p 16, -a 1e-4, -a 1e-6 14 512, 1024, 2048, 4096 static, dynamic -r 32, -r 16, -p 32, -p 16, -a 1e-4, -a 1e-6 15 1024, 2048, 4096, 8192 static, dynamic -r 32, -r 16, -p 32, -p 16, -a 1e-4, -a 1e-6 16 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384 static, dynamic -r 32, -r 16, -p 32, -p 16, -a 1e-4, -a 1e-6 FPZIP had only one mode to test, namely precision mode (-p). In this case, the precision was set to 64-bit to ensure that FPZIP remained lossless. Table 3.2 il- lustrates the modes and settings used for the testing of the FPZIP. The testing of FPZIP was limited to only dynamic allocation. Table 3.2: Test parameters for testing FPZIP compression Qubits Block Size Allocation Type Parameters 10 32, 64, 128, 256 dynamic -p 64 11 64, 128, 256, 512 dynamic -p 64 12 128, 256, 512, 1024 dynamic -p 64 13 256, 512, 1024, 2048 dynamic -p 64 14 512, 1024, 2048, 4096 dynamic -p 64 15 1024, 2048, 4096, 8192 dynamic -p 64 16 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384 dynamic -p 64 20 3. Implementation 3.3 Determining Error Rate for Compression Other than memory usage and execution time, one key aspect of evaluating the most effective compression algorithm was determining the error rate. The error rate refers to the fidelity between the state vectors resulting from QuEST without compression and QuEST with compression integration. While lossy compression may lead to lower memory usage, it may also cause data loss, altering the final results. The error rate was a tool to depict the severity of this loss. If fidelity was too low, the benefits of improved memory usage would be irrelevant as the data would be ineffectual. Calculating the error rate also helped to compare the performance of the lossless and lossy compression algorithms. The fidelity function took two vectors and returned a number between 0 and 1. This value indicated the similarity between the two vectors in percentage form. Fidelity as a measure of error rate allowed one to determine the similarity between the state vectors of the unmodified QuEST compared to the modified version. 3.4 GPU Acceleration of the Simulator Integrating compression logic into the simulator could increase execution time, and therefore GPU acceleration was a way to mitigate this problem. It involved offload- ing all of the qubit operations performed by the simulator to the GPU, allowing for a higher level of parallel executions. In the original implementation of the QuEST simulator support already existed for GPU acceleration through CUDA. Because of this the new GPU acceleration code would build on the existing implementation within QuEST. When integrating GPU acceleration, the existing code was modified to handle the new memory structure, and to be able to compress and decompress the data every time a new memory block needed to be accessed. Due to the new block memory structure, the maximum number of parallel executions was thereby limited to the total number of values within a memory block. The reason for this is that the qubit operations could only be performed on decompressed values of the state vector. In addition to the qubit operations, compressing and decompressing blocks were also offloaded to the GPU. This was integrated into QuEST easily, as CUDA already sup- ported ZFP with an implementation to accelerate compression and decompression. However, this implementation only supported fixed-rate compression. Due to the lack of support, fixed-precision mode, fixed-accuracy mode, as well as FPZIP were not tested using GPU acceleration. While only testing the fixed-rate compression did not provide the complete picture, it still allowed one to see the impact of GPU accelerated code on execution time. To run the GPU accelerated implementation it was only a matter of compiling the modified QuEST simulator with a specific flag that compiled the CUDA imple- mentation of the simulator. The CUDA implementation was also limited to only 21 3. Implementation using the VRAM of the GPU for storing the state vector. This was because the existing implementation in QuEST used older version of the CUDA API that did not utilize unified memory (where data is shared between RAM and VRAM). 3.4.1 Testing GPU Acceleration Tests allowed one to study the impact of GPU acceleration on execution time. The tests ran ZFP with a fixed-rate mode of 32- and 16-bit representation. Through the results, it was possible to determine the speedup that occurred when using GPU acceleration as compared to using the original fixed-rate mode. Table 3.3 shows the list of parameters used to test the GPU accelerated ZFP implementation. As can be seen, the number of qubits tested increased to 18 qubits to show how GPU acceleration handles larger qubit sizes. Table 3.3: Test parameters for testing ZFP compression with GPU acceleration Qubits Block Size Allocation Type Parameters 10 32, 64, 128, 256 static, dynamic -r 32, -r 16 11 64, 128, 256, 512 static, dynamic -r 32, -r 16 12 128, 256, 512, 1024 static, dynamic -r 32, -r 16 13 256, 512, 1024, 2048 static, dynamic -r 32, -r 16 14 512, 1024, 2048, 4096 static, dynamic -r 32, -r 16 15 1024, 2048, 4096, 8192 static, dynamic -r 32, -r 16 16 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384 static, dynamic -r 32, -r 16 17 4096, 8192, 16384, 32768 static, dynamic -r 32, -r 16 18 8192, 16384, 32768, 65536 static, dynamic -r 32, -r 16 All the tests ran on a dedicated computer, to ensure the same hardware configu- ration. The computer was a Linux machine running Ubuntu 20.04.04 LTS, on an AMD Ryzen 5 1600x processor with 16 GB of RAM. The same hardware configura- tion was used for the GPU accelerated simulator, alongside an Nvidia GTX980 Ti graphics card. 22 4 Results & Discussion This chapter will provide the data resulting from tests of QuEST integrated with data compression algorithms (ZFP and FPZIP) and QuEST with GPU acceleration. Throughout this chapter, the different compression modes will be explored and compared to determine the optimal settings with regard to memory utilization, execution time, and fidelity. 4.1 Original QuEST In order to fully comprehend the results derived from the integration of the two compression algorithms as well as GPU acceleration, one must first look at the per- formance of the original QuEST without any integrated optimization techniques. Figure 4.1 displays how both the amount of memory utilized and execution time scales exponentially with the number of simulated qubits. When considering the results from using data compression and GPU acceleration, this is the benchmark for determining the effectiveness of the solutions. (a) (b) Figure 4.1: Memory usage (a) and execution time (b) of running Grover’s search algorithm on QuEST with no data compression 23 4. Results & Discussion 4.2 ZFP As discussed in Chapter 2, to integrate compression into QuEST, a new block- structured memory was defined. The new block structure introduced a new inde- pendent variable, namely block size. Figure 4.2 and Figure 4.3 show the effect block size has on ZFP compression and ultimately how that influences the performance of Grover’s search algorithm in terms of execution time and memory usage. Figure 4.2: Execution time at 16 qubits over the block sizes 32-16384 floating- points values for ZFP compression modes, fixed rate (-r), fixed precision (-p), fixed accuracy (-a) and dynamic allocation (-d). Figure 4.2 illustrates how the execution time changes with respect to the increase in block size for all ZFP modes at 16 qubits. As shown, the execution time decreases until around 210, after which it remains almost constant as a result of the increased overhead from more calls to compress and decompress blocks. For example, at a block size of 25, there are a total of 4096 blocks (2048 for each part of the complex state vector), while at a block size of 210, the total number of blocks is only 128. This difference impacts how many calls to compression and decompression occur during execution. One can also see from the graph that the pattern for execution time is the same across all modes of ZFP, further supporting the theory that exe- cution time is affected by the increased number of blocks resulting from low block sizes. Another conclusion that can be drawn, is that there is very little difference in execution time whether one chooses to use dynamic or static memory allocation. The reason for this is described in more detail when discussing figure 4.7 below. Other than the effect on execution time, the block size may also impact total mem- ory usage. Figure 4.3 depicts this relationship and shows that optimal memory usage 24 4. Results & Discussion is achieved at a block size of 29. Two factors can explain the U-shaped relationship between block size and memory usage. While the decompressed memory blocks are small for block sizes from 25 to 27, the overhead when creating these memory blocks is significant. For each newly created memory block, 32 bytes of data are allocated, irrespective of block size. For example, at block size 25 there are 4096 compressed memory blocks which result in an overhead of 131072 bytes. The overhead grows considerably when many blocks are needed to represent the compressed data. The second factor that impacts memory usage is the size of the decompressed memory block. As the block size increases, so does the memory requirement necessary to keep a block decompressed in memory. For example, at block size 213, the total memory required to represent the decompressed data is 262144 bytes (each floating- point value is 8 bytes), while at a block size of 29, the decompressed data size is only 16384 bytes. Therefore, the increase in memory block size negatively impacts the simulator’s total memory requirement. From analyzing the results, one can see that the optimal block size for 16 qubits is 29 and 210. At both these block sizes, the execution time and memory usages are at an optimal level for the different modes. Figure 4.3: Memory used at 16 qubits over the block sizes 32-16384 floating-points for ZFP compression modes, fixed rate (-r), fixed precision (-p), fixed accuracy (-a) and dynamic allocation (-d). 25 4. Results & Discussion 4.2.1 ZFP & Number of Qubits Figure 4.4 illustrates how the number of qubits affect the memory usage and the compression ratio for different modes of ZFP. By analysing the compression ratio in Figure 4.1 (b), one can see that all modes except zfp -a 1e-6 achieve improved memory usage compared to the original implementation of Grover’s search algo- rithm. The modes that achieve the best level of compression are zfp -a 1e-4 -d, zfp -a 1e-6 -d, and zfp -p 16 -d. These modes are able to achieve a compression ratio above 4 when running 16 qubits, which indicates that it may allow for the simulation of quantum circuits with 2 more qubits on the same hardware. (a) (b) Figure 4.4: Memory usage in KB (a) and compression ratio (b) per qubit when running Grover’s search algorithm. Fixed rate (-r), fixed precision (-p), fixed accu- racy (-a) and dynamic allocation (-d) Figure 4.5 (a) shows how execution time is affected by the number of qubits. Follow- ing a similar pattern of growth as seen in Figure 4.4 (a), the execution time for all the ZFP modes increases exponentially. From the graph, one can also see that the best performing solution is to use zfp -a 1e-4 -d. Figure 4.5 (b) shows the execution time of the original implementation of Grover’s search algorithm. By comparison, one can see that the execution time when running with compression is much higher than the original. In most cases, this difference is quite significant. For example, for zfp -r 32 -d the execution time increased from around 2200 to 170000 milliseconds for 16 qubits, which is approximately 77 times greater than the original. This may indicate that even if compression can decrease memory usage, it may lead to infea- sible execution times for simulations with a high number of qubits. Infeasible, in this context, meaning an execution time that takes days or even weeks to complete a simulation. 26 4. Results & Discussion (a) (b) Figure 4.5: Execution time in ms (a) and original execution time in ms (b) per qubit when running Grover’s search algorithm. Fixed rate (-r), fixed precision (-p), fixed accuracy (-a) and dynamic allocation (-d) 4.2.2 Comparing modes of ZFP As observed, the modes of ZFP are able to achieve varying degrees of good memory utilization and execution time. In order to choose decide which is the best perform- ing mode, one must further analyze and compare the various compression methods. Figure 4.6 shows how the different modes impact the memory usage. The graphs re- veal that if one wants to use static memory allocation, ZFP fixed-rate mode achieves the best memory usage. However, if instead dynamic memory allocation is used, fixed-precision and fixed-accuracy both outperform fixed-rate compression. One ex- planation is that fixed-rate mode always compresses the data down to a fixed size, even if it has the potential to be represented using a smaller data size. Contrary to this, fixed-precision and fixed-accuracy mode are able to dynamically compress the data to a smaller representation when it is more uniform. An example of uniform data is when the values are all zeros. Given the improved compression ratio for dynamic allocation the underlying state vector must provide certain level of unifor- mity. Because of this uniformity in the data, both fixed-precision and fixed-accuracy were able to compress the data further than fixed-rate compression. From this, it can be concluded that it is optimal to use fixed-precision and fixed-accuracy modes when using dynamic allocation. Besides the benefit of dynamic allocation, one can also see that different accuracy levels and compression sizes may also positively impact memory usage. For example, Figure 4.6 shows that a compression size of 16-bit (zfp -r 16 -d and zfp -p 16 -d) and an accuracy of 4 decimals (zfp -a 1e-4 -d) are able to achieve a much higher level of compression compared to using 32-bit compression size or 6 decimals accu- racy. This is expected, given that a lower compression size and accuracy rate will also lead to lower memory usage. When simulating 16 qubits, fixed-rate compres- sion can decrease the compression size from 638KB down to 374KB, fixed-precision 27 4. Results & Discussion compression size decreases from 397KB to 264KB, and fixed-accuracy compression size decreases from 270KB to 220KB. The largest improvement is seen for fixed-rate mode, 32-bit to 16-bit representation. On the other hand, for fixed-accuracy has no significant reduction in memory usage when the accuracy is lower. A lower com- pression size and accuracy may also lead the underlying error rate of the data to increase. Therefore the best possible solution with regards to memory usage might be to use zfp -a 1e-6 -d given the effective compression ratio and the potential for maintaining a higher accuracy. (a) (b) (c) (d) Figure 4.6: Memory in KB per qubit comparing the ZFP modes, fixed-rate, fixed- precision, and fixed-accuracy compression (a) static and (b) dynamic allocation for 32-bit precision and 6 decimal point accuracy, (c) static and (d) dynamic allocation for 16-bit precision and 4 decimal point accuracy Figure 4.7 shows the results of execution time when running the different modes of ZFP. One can observe that, independently of fixed-rate, fixed-precision, or fixed- accuracy mode, the execution time when running dynamic and static memory allo- cation is almost identical. This indicates that the added execution step of copying data from the temporary storage block into the memory block during dynamic allo- cation has only a negligible impact on the overall execution time. As such, one can conclude that dynamic allocation is most appropriate due to its ability to decrease the overall memory usage without causing a negative impact on the total execution 28 4. Results & Discussion time. (a) (b) (c) (d) Figure 4.7: Time in ms per qubit comparing the ZFP’s precision compression to fixed-rate using dynamic memory allocation for (a) 32 bit, (b) 16 bit floating-point were (c) and (d) use dynamic memory allocation Figure 4.7 shows that fixed-accuracy mode performs the most effectively of the three modes, no matter whether static or dynamic memory allocation is used. Figure 4.7 (a) and (b) reveals that at a compression size of 32-bit and an accuracy of 6 decimals, the difference between the execution time of fixed-accuracy versus fixed- rate and fixed-precision mode is more significant. For example, when running with 16 qubits, fixed-accuracy mode takes approximately 122 000 milliseconds to execute, while fixed-rate mode takes about 170 000 milliseconds to finish execution. From Figure 4.7 (c) and (d), one can also observe that using 16-bit compression size or an accuracy of 4 decimals leads to a lower execution time. However, all modes of ZFP compression increase the execution time significantly, seen in Figure 4.5 (a) and (b). For example, at 16 qubits, the execution time for zfp -r 32 -d is 170 000 milliseconds compared with 2200 milliseconds in the original implementation. This increase in execution time indicates that running ZFP compression might not be a feasible solution in practice. At a high number of qubits, the execution time might 29 4. Results & Discussion be too large to justify using compression to decrease memory usage. 4.3 FPZIP The following results describe how the use of lossless compression affects the per- formance of Grover’s search algorithm in terms of total memory utilization and execution time. Figure 4.8 (a) shows how memory usage and execution time change as the block size increases. In the same way as ZFP, the increase in block size has re- sulted in a kind of U-shaped trend. When the block size is 25, the amount of memory used is just above 800KB. As the block size increases, memory usage decreases until about 29, which is the optimal block size, using approximately 650KB of memory. The decrease in memory usage can be explained using the same logic as described when discussing how block size affects memory in ZFP. While a small block size means smaller blocks in the decompressed memory, it also means a greater amount of blocks are created. The creation of the blocks results in a significant amount of overhead. From 29, the increasing block size begins to have a negative impact on memory usage. For example, at block size 214, about 1950KB of data is utilized, or three times as much as when the block size is 29. FPZIP only uses dynamic allocation, so the size of the temporary compression block stored in memory is the same as the block size, meaning that as the block size increases, so does the memory usage. (a) (b) Figure 4.8: Memory usage (a) and execution time (b) of running Grover’s search algorithm on QuEST with FPZIP compression in range of 2048-16384 floating-points per block size for 16 quits Figure 4.8 (b) shows execution time in relation to the increasing block size. The impact block size has on the execution time for FPZIP does not follow the same linear pattern as in ZFP. The graph shows that execution time decreases as the block size increases. One explanation may be that the FPZIP algorithm can opti- mize the compression of larger quantities of data and is, therefore, more efficient at compressing the data. In conclusion, the optimal block size for lossless compression is not as apparent as for ZFP. Simply by looking at how block size affects memory 30 4. Results & Discussion usage, one would choose 29 as the most effective size. In terms of execution time, 214 would be the optimal choice, however it has a significant negative impact on memory utilization. (a) (b) Figure 4.9: Memory usage (a) and execution time (b) per qubit running Grover’s search algorithm on QuEST with FPZIP compression Another aspect to analyze from the results of FPZIP is how memory usage and ex- ecution time increases with the number of qubits. Figure 4.9 illustrates the results for FPZIP and how it performs in comparison with the original QuEST implemen- tation. Much like ZFP, memory usage and execution time follow an exponential growth pattern as the number of qubits increases. The graphs show that FPZIP can effectively compress the data. For example, at 16 qubits, approximately 658KB of memory is being used, which is a significant decrease compared to the original QuEST that used 1.13MB. FPZIP can decrease the memory usage by 30%, similar to that of zfp -r 32. However, the execution time for FPZIP increases at a much faster rate than the original QuEST implementation. Given the risk of too high execution times, FPZIP may prove to be infeasible when running a higher number of qubits. 31 4. Results & Discussion (a) (b) Figure 4.10: Compression ratio (a) and execution time (b) per qubit for all com- pression algorithms and modes running Grover’s search algorithm on QuEST. Fixed rate (-r), fixed precision (-p), fixed accuracy (-a) and dynamic allocation (-d) Figure 4.10 shows how the performance FPZIP compares to that of ZFP. From Figure 4.10 (a), one can see that FPZIP has around the same compression ratio as ZFP modes using 32 bit fixed-rate compression, but it falls far behind the other better performing ZFP modes. Figure 4.10 (b), reveals that FPZIP has the highest execution time compared to all other modes of ZFP. This, most likely, is due to FPZIP working will the full 64 bit values during compression compared to the modes of ZFP that only have to do computations on 32-bits or 16-bits. Based on this fact alone, one can claim that ZFP is the better option. However, in order to make a final conclusion, the fidelity of ZFP and FPZIP must also be considered. 4.4 Fidelity of ZFP and FPZIP This section analyzes how well the compression algorithms perform in terms of the fidelity of the state vector. Figure 4.11 (a) illustrates the fidelity with respect to the number of qubits for the different modes of ZFP and FPZIP. From the graphs, one can see that the fidelity of zfp -r 16 drops significantly for every qubit. In fact, from the downward trend, one can predict that as the qubits continue to increase, the error will also increase. The compression provided by zfp -r 16 is made irrelevant by the fact that the data changes substantially. Figure 4.11 (b) shows the behavior of the other compression modes more clearly by excluding the ZFP mode, zfp -r 16. There is no visible change in the fidelity as most of the parameters remain close to 100%. Only for zfp -p 16 and zfp -a 1e-4 is there a visible fluctuation. This level of fluctuation indicates that zfp -p 16 and zfp -a 1e-4 may achieve a lower fidelity for higher qubits. However, even by excluding these three modes, it is possible to see that ZFP can achieve a near similar level of fidelity as FPZIP. It suggests that some modes of ZFP can provide a higher compression ratio with an equal level of fidelity as FPZIP. From this, the optimal solution for 32 4. Results & Discussion memory usage, execution time, and fidelity would be zfp -a 1e-6 -d. (a) (b) Figure 4.11: Illustration of the underlying fidelity for the different algorithms (a) all results for fidelity and (b) fidelity results without zfp -r 16 4.5 GPU Acceleration Throughout this chapter, one concerning aspect has been the execution time. As shown in Figure 4.5, execution time grows more significantly than the original QuEST implementation. In turn, the risk of unreasonable execution times for run- ning a high number of qubits increases. One possible solution to address this problem would be to offload the calculations from the CPU to the GPU. Firstly, Figure 4.12 shows the correlation between block size and execution time and how it is impacted by GPU acceleration. The increase in block size leads to a decrease in the execution time. This trend is due to an increase in block size which allows for a higher level of parallelism, as more values are decompressed in memory and thereby can be computed in parallel. 33 4. Results & Discussion (a) (b) Figure 4.12: Execution time on the GPU as function of block size for 16 qubits (a) complete block range and (b) last 6 block ranges Figure 4.13 (a) shows how execution time is impacted for ZFP in fixed-rate mode. The graph describes the trend as the number of qubits increases. The conclusion drawn from Figure 4.12, is that the largest block size results in the fastest execu- tion time. Figure 4.13 (a), depicts the qubits using the largest block size. From the provided speed-up of using GPU acceleration, shown in Figure 4.13 (b), one can see that execution time has decreased significantly. For mode zfp -r 32 -d us- ing 16 qubits, execution time has decreased from 170000 to 4500 milliseconds, an improvement of about 38 times. The same rate of improvement can be seen from the other modes run on the GPU. Figure 4.13 (a) also shows that execution time has decreased compared to the original implementation of QuEST. While execution time was a big dilemma after integrating ZFP and FPZIP, these results indicate that by using GPU acceleration the compression algorithms could provide a feasible solution with a reasonable execution time. (a) (b) Figure 4.13: Best execution time (a) and speed up for best execution time (b) on the GPU. Fixed rate (-r), and dynamic allocation (-d) However, drawing a conclusion purely based on execution time and Figure 4.13 (a) creates a skewed view of the results. The figure shows the impact at the largest 34 4. Results & Discussion possible block size (which, as previously stated, significantly improves execution time), however, this will have an adverse effect on memory usage. In fact, the memory usage will be greater than ZFP without GPU acceleration, as seen in Figure 4.4. To gather a complete picture of the results, Figure 4.14 illustrates how execution time and speed-up are impacted when the block size with the lowest possible memory usage is used. Here one can see that instead of a reduction in execution time by around 38 times for 16 qubits, it is decreased to around 5.5 times, which is a much lower improvement. (a) (b) Figure 4.14: Execution time at the best memory usage (a) and speed up of the execution time (b) on the GPU. Fixed rate (-r), and dynamic allocation (-d) 4.5.1 Limitations of GPU acceleration Executing the simulator on the GPU has the potential to improve the execution time but could also limit the possible memory usage. In the current implementation of the GPU accelerated simulator, the computations and memory operations are restricted by the size of the GPU’s VRAM. However, GPUs tend to have much smaller amounts of VRAM when compared to the RAM available in most modern computers. For example, a modern computer may have a RAM size of 16 GBs, while the VRAM on the graphics card is around 6 GBs. As such, the current implementation of GPU accelerated code is limited to only running circuits that are small enough to fit inside the available VRAM of the GPU. A possible solution would be to utilize unified memory during the execution of the simulator, where memory is shared between VRAM and RAM. This would utilize the high level of parallelism in the GPU while simultaneously avoiding the GPUs memory limit. 4.6 Determining the feasibility of compression From the results, one can conclude whether data compression is a feasible alterna- tive for running larger quantum simulations using the same hardware. One must consider how the memory usage will scale with the number of qubits. Figure 4.10 (a) reveals that all implementations except zfp -a 1e-6 are able to decrease memory 35 4. Results & Discussion usage. The best alternative among the implementations are zfp -p 16 -d, zfp -a 1e-4 -d and zfp -a 1e-6 -d which are able to achieve a compression ratio above 4× for 16 qubits. If this ratio were to be maintained for higher qubits, it would theoretically allow for the simulation of circuits with 2 more qubits using the same hardware configuration. The results of ZFP and FPZIP also show that ZFP, on average, can achieve a lower compression size than FPZIP. Thus, it can be concluded that lossy compression is the best alternative to achieve a lower compression size. Although it is theoretically possible to simulate larger quantum circuits, in prac- tice it might not be feasible given the potential for high execution time. Figure 4.10 (b), shows the growth of execution time for the different compression modes. Com- paring these results with the execution time of the unmodified QuEST, shown in Figure 4.5 (b), one can see that the increase is significant. Even the best performing implementation, zfp -a 1e-4 -d, still has an execution time for 16 qubits that is more than 50 times higher than the original execution time. For larger qubits, this indi- cates that the execution time may prove infeasible and thereby make it impractical to utilize data compression. Due to the impractical nature of running the simulations on the CPU, utilizing the GPU could improve the performance. Figure 4.13 (a) illustrates the execution time for the GPU accelerated implementation. The execution time remains higher than the original implementation, but the difference is significantly less than the simulation using compression without GPU acceleration. For example, at 16 qubits the GPU accelerated implementation of zfp -r 16 -d is only 1.5 times slower than the original. The decrease in execution time shows promise in improving the feasibility of using compression with the help of GPU acceleration. The GPU works most effi- ciently when there is a higher level of parallelism, which means it works best when the block size is large. However, the large block size also leads to an increase in the compression size. Figure 4.14 (a) shows instead execution time when the best memory usage is the main priority. Here, the execution time for zfp -r 16 -d has increased at 16 qubits from 1.5 times to around 10 times slower than the original. This is still an improvement over the CPU implementation and could potentially allow for the execution of larger quantum circuits with the help of compression and GPU acceleration. Finally, when determining the feasibility of using compression, one must consider fidelity. The fidelity for each of the implementations is presented in Figure 4.11. The two graphs show that most of the algorithms maintain a fidelity near 1.0. However, zfp -r 16, zfp -a 1e-4, and zfp -p 16 diverge from the fidelity of 1.0. The mode zfp -r 16 even rapidly decreases in fidelity after just 12 qubits. This rapid decrease in fidelity gives a strong indication that zfp -r 16 is not a feasible alternative given the potential for inconsistent results at higher qubits. There are also fluctuations in zfp -a 1e-4 and zfp -p 16, which indicates that the implementations might deviate from 1.0 and provide a much lower fidelity at higher qubits. This lower level of fidelity increases the risk that the underlying state vector representation becomes inconsistent, thereby leading to unreliable results. 36 4. Results & Discussion In conclusion, zfp -a 1e-6 -d is the best solution for data compression. It can provide a memory usage level 4 times smaller than the original memory size at 16 qubits and lead to a better execution time than almost all other ZFP modes. It also maintains fidelity at around 1.0. By potentially creating a GPU accelerated implementation of zfp -a 1e-6 -d, it may become a practical solution for using data compression in quantum computer simulators. 4.7 Ethics The possibility of powerful and accessible quantum computers is certainly a highly contentious issue from an ethical standpoint. The benefits of the technology are obvious, solving tasks that would be much more difficult for a classical computer such as optimizing machine learning models [34] and simulating protein-folding [35], providing further insight into vaccine development and Alzheimer’s disease. How- ever, what could be the results if quantum computers were to be used malevolently? Today, most personal online data such as messages, passwords, and bank records, are protected by encryption methods. Most of these algorithms depend on not being able to do extremely large-scale computations [4]. If quantum computers became available, the users of these machines would potentially be able to access all of this previously secure data, clearly an immense safety concern. No external individuals were directly affected throughout this project. One could argue, however, that a likely outcome of the project is to accelerate the science of quantum computers and whether or not the development of quantum computers is ethical due to their destructive potential. The potential impacts it could have on so- ciety are indeed almost limitless [34] however, the development of these systems will happen regardless of the outcomes of this study. Primarily, the goal of optimizing quantum simulators is to make it more accessible for the everyday programmer to learn about quantum computing. The goal is not to accelerate the development of state-of-the-art quantum computers at companies like IBM or Google. With this in mind, this study did not have any significant impact on society or significant ethical ramifications. 37 5 Conclusion This chapter presents some final conclusions regarding the compression and GPU acceleration. Also presented are some final thoughts regarding error sources and potential for improvement. 5.1 Project Conclusion This report presented the results of combining QuEST with the ZFP and FPZIP compression algorithms to reduce total memory consumption when simulating Grover’s search algorithm. Furthermore, the use of GPU acceleration was analyzed to de- termine if it would lead to better execution time. From the data, it is possible to conclude that using lossy compression with ZFP may be more favorable than using lossless compression with FPZIP. Among the different ZFP modes, using fixed- accuracy mode with 6 decimal accuracy (zfp -a 1e-6 -d) has shown the most promis- ing results in regards to memory usage, execution time, and a high level of fidelity. To conclude, the results show that the limiting factor of data compression is the high execution time. The execution time of QuEST using compression was more than 50 times greater than the execution time of the original QuEST implementation. Using GPU acceleration proved to decrease this difference and could potentially be an alternative for running the QuEST simulator with data compression. Assum- ing that feasible execution time can be achieved and that the compression ratio is maintained, hypothetically it would be possible to run quantum circuits with 2 more qubits for the same hardware configuration. This is under the assumption that zfp -a 1e-6 -d is used with a compression ratio of 4. 5.2 Possible Sources of Error One factor that could affect the testing is the nature of floating-point numbers on digital computers. Some precise floating-point numbers have decimals that stretch further than the range that double-precision floating-point numbers do. This means that the number has to be rounded or cut off, which results in a slight loss of information. Ultimately, this could perhaps affect the fidelity and error rate tests. This could be the case in section 4.4 where the graphs show a fidelity that fluctuates for zfp -p 16 and zfp -a 1e-4. Another source of error is an insufficient number of tests. Several tests were performed to determine how compression and GPU 38 5. Conclusion acceleration affect memory usage and execution time. While the goal was to test as much as possible, one source of error could be that the tests were not exhaustive enough. For example, running more tests for a higher number of qubits may have allowed for a more accurate understanding of the growth rate of execution time and memory usage. 5.3 Future Work Based on the results, some key aspects to explore in future studies is how the com- pression algorithm’s performance is affected by an increasing number of qubits. For example, one could perform tests of up to 25 qubits or more, which would allow for the creation of projection graphs and determine if using data compression is a feasible solution for larger qubits. Another aspect to explore further is how utilizing unified memory between RAM and VRAM might impact the ability to decrease execution time whilst providing a lower memory utilization. For the current imple- mentation, only using VRAM limits the size of the simulated circuit to the size of the VRAM itself. By utilizing unified memory it is possible to avoid this limitation. Finally, one might also explore how data compression and GPU acceleration im- pact memory usage and execution time for quantum algorithms other than Grover’s search algorithm. One might discover that data compression is better suited for specific algorithms and use cases, with more feasible execution times. It may also be possible to determine which types of algorithms would work well with data com- pression and thereby allow dedicated memory optimizations through compression techniques. 39 Bibliography [1] C. Oh, C. Lee, L. Banchi, and et al, “Optimal measurements for quantum fidelity between gaussian states,” 01 2019. [2] M. Dyakonov, “When will useful quantum computers be constructed? not in the foreseeable future, this physicist argues. here’s why: The case against: Quantum computing,” IEEE Spectrum, vol. 56, no. 3, pp. 24–29, 2019. [3] R. P. Feynman, “Simulating physics with computers,” International Journal of Theoretical Physics, vol. 21, no. 6-7, p. 467–488, 1982. [4] P. W. Shor, “Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer,” SIAM Journal on Computing, vol. 26, no. 5, p. 1484–1509, Oct 1997. [Online]. Available: http: //dx.doi.org/10.1137/S0097539795293172 [5] P. Ball, “First quantum computer to pack 100 qubits enters crowded race,” Nature, vol. 599, no. 7886, p. 542–542, 2021. [6] T. H. Johnson, S. R. Clark, and D. Jaksch, “What is a quantum simulator?” EPJ Quantum Technology, vol. 1, no. 1, pp. 1–12, 2014. [7] T. Jones, A. Brown, I. Bush, and et al, “Quest and high performance simulation of quantum computers,” Scientific Reports, vol. 9, no. 1, 2019. [8] Y. Zhou, E. M. Stoudenmire, and X. Waintal, “What limits the simulation of quantum computers?” Phys. Rev. X, vol. 10, p. 041038, Nov 2020. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevX.10.041038 [9] K. Sayood, Introduction to data compression. Morgan Kaufmann, 2017. [10] J. H. Pujar and L. M. Kadlaskar, “A new lossless method of image compression and decompression using huffman coding techniques,” https://www.jatit.org/, vol. 15, no. 1, 2005. [Online]. Available: https://www.jatit.org/volumes/ research-papers/Vol15No1/3Vol15No1.pdf [11] P. Lindstrom, “Fixed-rate compressed floating-point arrays,” IEEE Transac- tions on Visualization and Computer Graphics, vol. 20, no. 12, pp. 2674–2683, 2014. [12] L. A. Baumgardner, A. K. Shanmugam, H. Lam, and et al, “Fast parallel tandem mass spectral library searching using gpu hardware acceleration,” Journal of Proteome Research, vol. 10, no. 6, pp. 2882–2888, 2011, pMID: 21545112. [Online]. Available: https://doi.org/10.1021/pr200074h [13] G. Aleksandrowicz, T. Alexander, P. Barkoutsos, and et al, “Qiskit: An open-source framework for quantum computing,” jan 2019. [Online]. Available: https://doi.org/10.5281/zenodo.2562111 40 http://dx.doi.org/10.1137/S0097539795293172 http://dx.doi.org/10.1137/S0097539795293172 https://link.aps.org/doi/10.1103/PhysRevX.10.041038 https://www.jatit.org/volumes/research-papers/Vol15No1/3Vol15No1.pdf https://www.jatit.org/volumes/research-papers/Vol15No1/3Vol15No1.pdf https://doi.org/10.1021/pr200074h https://doi.org/10.5281/zenodo.2562111 Bibliography [14] C. Developers, “Cirq,” Aug. 2021, See full list of authors on Github: https://github .com/quantumlib/Cirq/graphs/contributors. [Online]. Avail- able: https://doi.org/10.5281/zenodo.5182845 [15] Y. Suzuki, Y. Kawase, Y. Masumura, and et al, “Qulacs: a fast and versatile quantum circuit simulator for research purpose,” Quantum, vol. 5, p. 559, Oct 2021. [Online]. Available: http://dx.doi.org/10.22331/q-2021-10-06-559 [16] X.-C. Wu, S. Di, E. M. Dasgupta, and et al, “Full-state quantum circuit simulation by using data compression,” Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, Nov 2019. [Online]. Available: http://dx.doi.org/10.1145/3295500. 3356155 [17] P. G. Kwiat, J. R. Mitchell, P. D. D. Schwindt, and et al, “Grover's search algorithm: An optical approach,” Journal of Modern Optics, vol. 47, no. 2-3, pp. 257–266, feb 2000. [Online]. Available: https: //doi.org/10.1080%2F09500340008244040 [18] P. A. M. Dirac, “A new notation for quantum mechanics,” Mathematical Pro- ceedings of the Cambridge Philosophical Society, vol. 35, no. 3, p. 416–418, 1939. [19] R. LaRose, “Distributed memory techniques for classical simulation of quantum circuits,” 2018. [Online]. Available: https://arxiv.org/abs/1801.01037 [20] F. A. Bais and J. D. Farmer, “The physics of information,” in Philosophy of Information, ser. Handbook of the Philosophy of Science, P. Adriaans and J. van Benthem, Eds. Amsterdam: North-Holland, 2008, pp. 670– 671. [Online]. Available: https://www.sciencedirect.com/science/article/pii/ B9780444517265500200 [21] I. Djordjevic, “Chapter 3 - quantum circuits and quantum information processing fundamentals,” in Quantum Information Processing and Quantum Error Correction, I. Djordjevic, Ed. Oxford: Academic Press, 2012, p. 92. [Online]. Available: https://www.sciencedirect.com/science/article/pii/ B9780123854919000034 [22] C. Zalka, “Grover’s quantum searching algorithm is optimal,” Phys. Rev. A, vol. 60, pp. 2746–2751, Oct 1999. [Online]. Available: https: //link.aps.org/doi/10.1103/PhysRevA.60.2746 [23] Cglosser, “Cglosser/quest: Quantum electromagnetics simulation toolkit (quest).” [Online]. Available: https://github.com/cglosser/QuEST [24] O. A. Mahdi, M. A. Mohammed, and et al, “Implementing a novel approach an convert audio compression to text coding via hybrid technique,” International Journal of Computer Science Issues, vol. 9, no. 6, 2012. [25] “Floating point compression: Lossless and lossy solutions.” [Online]. Available: https://computing.llnl.gov/projects/floating-point-compression [26] P. Lindstrom, M. Salasoo, M. Larsen, and et al, “Zfp 0.5.5 documentation,” 2019. [Online]. Available: https://zfp.readthedocs.io/en/release0.5.5/index. html [27] J. Diffenderfer, A. L. Fox, J. A. Hittinger, and et al, “Error analysis of zfp compression for floating-point data,” SIAM Journal on Scientific Computing, vol. 41, no. 3, 2019. 41 https://doi.org/10.5281/zenodo.5182845 http://dx.doi.org/10.22331/q-2021-10-06-559 http://dx.doi.org/10.1145/3295500.3356155 http://dx.doi.org/10.1145/3295500.3356155 https://doi.org/10.1080%2F09500340008244040 https://doi.org/10.1080%2F09500340008244040 https://arxiv.org/abs/1801.01037 https://www.sciencedirect.com/science/article/pii/B9780444517265500200 https://www.sciencedirect.com/science/article/pii/B9780444517265500200 https://www.sciencedirect.com/science/article/pii/B9780123854919000034 https://www.sciencedirect.com/science/article/pii/B9780123854919000034 https://link.aps.org/doi/10.1103/PhysRevA.60.2746 https://link.aps.org/doi/10.1103/PhysRevA.60.2746 https://github.com/cglosser/QuEST https://computing.llnl.gov/projects/floating-point-compression https://zfp.readthedocs.io/en/release0.5.5/index.html https://zfp.readthedocs.io/en/release0.5.5/index.html Bibliography [28] B. Furht, Ed., Discrete Cosine Transform (DCT). Boston, MA: Springer US, 2008, pp. 186–188. [Online]. Available: https://doi.org/10.1007/ 978-0-387-78414-4_304 [29] P. Lindstrom and M. Isenburg, “Fast and efficient compression of floating-point data,” IEEE transactions on visualization and computer graphics, vol. 12, pp. 1245–50, 09 2006. [30] H. AI. (2022) What is hardware acceleration? [Online]. Available: https://www.heavy.ai/technical-glossary/hardware-acceleration [31] A. Amariutei and S. Caraiman, “Parallel quantum computer simulation on the gpu,” in 15th International Conference on System Theory, Control and Computing, 2011, pp. 1–6. [32] “Cuda c++ programming guide.” [Online]. Available: https://docs.nvidia. com/cuda/cuda-c-programming-guide/index.html#introduction [33] D. A. Patterson and J. L. Hennessy, Computer Organisation and Design, The Hardware/Software Interface. Morgan Kaufmann, 2013. [34] R. de Wolf, “The potential impact of quantum computers on society.” Ethics Inf Techno, vol. 19, 12 2017. [Online]. Available: https://link.springer.com/ article/10.1007/s10676-017-9439-z [35] R. A. Barkoutsos, P. W. S., and et al., “Resource-efficient quantum algorithm for protein folding,” npj Quantum Inf, vol. 7, no. 38, 2021. [Online]. Available: https://doi.org/10.1038/s41534-021-00368-4 42 https://doi.org/10.1007/978-0-387-78414-4_304 https://doi.org/10.1007/978-0-387-78414-4_304 https://www.heavy.ai/technical-glossary/hardware-acceleration https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#introduction https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#introduction https://link.springer.com/article/10.1007/s10676-017-9439-z https://link.springer.com/article/10.1007/s10676-017-9439-z https://doi.org/10.1038/s41534-021-00368-4 List of Figures List of Tables Introduction Background Purpose Scope Theory Simulating Quantum Circuits Grover's Search Algorithm QuEST Data Compression ZFP FPZIP Graphics Processing Unit CUDA Measuring Performance Fidelity Implementation Adapting QuEST to Compressed Data Defining a Memory Structure Fetching & Modifying the State Vector Moving Data Across Two Different Blocks Static & Dynamic Memory Allocation Testing of Data Compression Determining Error Rate for Compression GPU Acceleration of the Simulator Testing GPU Acceleration Results & Discussion Original QuEST ZFP ZFP & Number of Qubits Comparing modes of ZFP FPZIP Fidelity of ZFP and FPZIP GPU Acceleration Limitations of GPU acceleration Determining the feasibility of compression Ethics Conclusion Project Conclusion Possible Sources of Error Future Work Bibliography