[Cover figure: Vivado block diagram of the design, showing the Interconnect (Interconnect_0), the three accelerators (acc1_trigonometric_0, acc2_vect_trans_0, acc3_fft_0), the Clocking Wizard (clk_wiz_1), the MicroBlaze Debug Module (mdm_1) and the MicroBlaze-V core (microblaze_riscv_0) with its local memory]

Adding a Composable Extension for Custom Instructions to the MicroBlaze-V core

Master’s thesis in Embedded Electronic System Design

ARAVIND PRASANNANPILLAI SREEVILASAM
SHAILESH SURESH VELLOLI

Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2024

© ARAVIND PRASANNANPILLAI SREEVILASAM, SHAILESH SURESH VELLOLI, 2024.
Supervisor: Per Larsson Edefors, CSE Department
Company advisors: Goran Bilski and Tryggve Mathiesen, AMD
Examiner: Lena Peterson, CSE Department

Master’s Thesis 2024
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Block diagram of the design from Vivado
Typeset in LATEX

Abstract

This report presents the design and implementation of a Composable Extension (CX) for custom instructions in the MicroBlaze-V core, a customisable RISC-V core offered by AMD. The design is implemented on a Field Programmable Gate Array (FPGA), and its performance with accelerators is evaluated against the current MicroBlaze-V design. The new CX interface allows the designer to add any number of custom instructions and accelerators according to the requirement. The accelerators can either be designed from scratch or be built from existing Xilinx Intellectual Property (IP) cores with additional parameters. In this project, existing IP cores have been used as accelerators, as this demonstrates how easily IP cores can be integrated with the interface design. The accelerator functions were also implemented in software, in C, to compare and analyse the performance of the CX extension in MicroBlaze-V. Speedup, resource utilisation and power consumption were used as metrics to evaluate the efficiency of the entire system. A significant performance improvement has been observed with the accelerators, at the expense of higher resource utilisation.
Keywords: Composable Extension (CX), Custom Instructions, Field Programmable Gate Array (FPGA), MicroBlaze-V, RISC-V, Accelerators, Intellectual Property (IP), Xilinx, Evaluation Metrics

Acknowledgements

We would like to extend our profound gratitude to everyone who has helped us in the progress of this thesis work. First of all, we would like to thank our supervisors at AMD, Goran Bilski and Tryggve Mathiesen, without whom we would not have been able to make significant progress in this thesis. They listened patiently to the difficulties that we faced throughout the course of this project and provided us with the information necessary to overcome them. We are also grateful to our academic supervisor, Per Larsson Edefors, who guided us in the right direction throughout our work. We would also like to express our gratitude to our examiner, Lena Peterson, for providing significant comments and feedback on our work. We would also like to thank the management of Chalmers University of Technology for providing all the facilities required to do our project. And lastly, we would like to thank our parents and friends, who helped us during the entire course of this project and acted as a pillar of support and confidence.

Aravind Prasannanpillai Sreevilasam and Shailesh Suresh Velloli, Gothenburg, September 2024

Contents

Glossary
1 Introduction
  1.1 Related Work
  1.2 Purpose and Goal
  1.3 Thesis Outline
2 Technical Background
  2.1 Field Programmable Gate Array
  2.2 Pipelining
  2.3 RISC-V
  2.4 CX Extension
    2.4.1 CX standard encoding
  2.5 Control and Status Registers (CSRs)
  2.6 AXI4-Stream Interface
  2.7 Accelerators
  2.8 Intellectual Property (IP) Core
    2.8.1 CORDIC
    2.8.2 Fast Fourier Transform (FFT)
  2.9 Evaluation Metrics
3 Methods
  3.1 Workflow
  3.2 TestBench
  3.3 Tools Used
4 Design
  4.1 Overview
  4.2 MicroBlaze-V
    4.2.1 CX interface
    4.2.2 CSR
  4.3 Interconnect
    4.3.1 Write Phase
    4.3.2 Read Phase
  4.4 Accelerators
    4.4.1 CORDIC
    4.4.2 FFT
  4.5 Reference Model
5 Results
  5.1 Speedup
  5.2 Hardware utilisation
  5.3 Power Consumption
6 Conclusion
  6.1 Challenges
  6.2 Future Work
Bibliography
A Appendix 1
  A.1 Reference Model C code
    A.1.1 Trigonometric Function
    A.1.2 Vector Translation
    A.1.3 FFT

Glossary

AXI: An on-chip communication protocol used between two hardware modules.
CF: A single custom instruction or a group of custom instructions.
CLB: A basic logic block that executes complex logic functions and implements memory functions.
CPI: The number of cycles divided by the number of instructions.
CPU: A hardware component that is the core computation unit in a server.
CSR: A special register that holds control and status information in a processor.
CX: The interface contract of a composable extension, consisting of custom function instructions, CSRs, and their behavior.
CXU: A hardware core that implements a composable extension.
FPGA: A reconfigurable integrated circuit.
HDL: A general term for the languages used to describe the behavior of hardware.
IP: A reusable unit of logic or integrated circuit layout design.
ISA: An abstract model that defines how software controls the CPU in a computer.
register: A fast memory unit in the hardware that stores binary data.
VHDL: A language used to describe hardware.

1 Introduction

In the continuously evolving world of CPU architectures, the pursuit of enhanced performance and efficiency remains a constant driving force.
As the demand for specialised applications rises, it becomes increasingly important to have the required hardware support. This thesis project focuses on advancing processor architecture by integrating a proposed composable extension (CX) for custom instructions into the MicroBlaze-V soft-core processor.

The RISC-V processor architecture is popular for its openness and flexibility [1]. Custom extensions in RISC-V aim to optimise the execution of specialised tasks and to accelerate them using dedicated hardware. However, the custom opcode space in RISC-V is unmanaged, which makes it difficult to create an ecosystem where different parties can publish and exchange their own composable custom instructions. Open composition of this kind requires routine and robust integration of elements authored by different parties into a stable system that works together as a unit.

This thesis work involves not only the research, design, implementation, and evaluation of the proposed CX extension and its integration into the RISC-V core, but also the incorporation of dedicated accelerators to offload and accelerate critical tasks in the system, in keeping with AMD’s design philosophy, which has embraced RISC-V as a foundation of its processor designs. Through a thorough study of the RISC-V Instruction Set Architecture (ISA), our work intends to identify key areas of improvement and devise suitable solutions that take into account the performance and energy efficiency of the overall system. This thesis hopes to contribute valuable insights to the field of customisable processor architectures, thereby helping AMD’s technology remain a front-runner in the computing sector.

1.1 Related Work

The reserved opcodes for custom instructions in the RISC-V architecture enable integration of the CX extension.
The design of the CX extension in this thesis work is based on the proposed RISC-V Composable Custom Extensions specification [2]. The specification defines the procedure and the process flow for handling custom instructions. It serves as a foundation for the thesis work, as it explains the relevance of the CX extension, the parameters required, and how it functions as a whole system. It describes how the instructions can be encoded and how they can be interfaced to communicate with the processor, and this thesis work follows that description. The specification also covers the required Control and Status Registers (CSRs) and the logic interface that connects and controls the processor core and the accelerators.

Several researchers have worked on running hardware accelerators on RISC-V cores. In [3], a convolutional neural network was implemented on the RISC-V architecture using the opcode space reserved for custom instructions. In [4], CORDIC-based hardware accelerators were implemented on the VexRiscv CPU core, a 32-bit RISC-V core, and their performance and resource utilisation were assessed, similar to the metric analysis followed in this project. These works contributed significantly to the design and implementation phase of our thesis work.

1.2 Purpose and Goal

For a set of new instructions to be defined as a standard extension, it must be of general interest, of broad utility and non-proprietary. Defining a new standard extension is usually a long process managed by RISC-V International. The RISC-V architecture also allows independent vendors to define their own custom extensions. Sharing the custom opcode space among custom instructions in a 32-bit processor is critical, and in some settings it obviates the need to transition to a higher-bit processor.
In this thesis work, a CX extension interface is designed and integrated into the MicroBlaze-V soft-core processor for managing the custom instructions. Hardware accelerators are plugged in through an interconnect module so that we can demonstrate the custom instructions and analyse the performance of the core. The performance and resource utilisation of the implemented design are compared with those of the CPU without the accelerators, and the improvement is noted.

1.3 Thesis Outline

Chapter 2 describes the background knowledge required for the thesis. Chapter 3 describes the methods followed to carry out the thesis work and the tools used. Chapter 4 presents a detailed description of the design and its constraints. Chapter 5 focuses on the results obtained from the implementation of the design and compares them with the software model with respect to different evaluation metrics. Finally, Chapter 6 covers the challenges faced throughout the course of the work and its future scope.

2 Technical Background

This chapter describes the specification standards used in this thesis work and the background information required for the reader to understand the work done. The chapter begins with an introduction to Field Programmable Gate Arrays (FPGAs), followed by information on pipelining and the Reduced Instruction Set Computing (RISC) architecture. The subsequent sections describe RISC-V, the MicroBlaze-V soft-core, and the proposed CX extension. While detailing the CX extension, we describe the overall working of the extension, how the instructions are encoded and processed in the MicroBlaze-V, and how the results and status are written back to the corresponding registers. We then move on to the description of accelerators and the potential accelerators that we could use for our current work.
We end the chapter with the metrics considered for the evaluation.

2.1 Field Programmable Gate Array

Field Programmable Gate Arrays (FPGAs) are semiconductor devices that can be configured to meet the desired functionality or application. They contain configurable logic blocks (CLBs) connected by a set of programmable interconnects, which allows the designer to implement both simple and complex functions. FPGAs include different memory elements for digital storage, ranging from single-bit flip-flops to very dense memory arrays. For tasks that can exploit parallel hardware, FPGAs can provide better performance than a general-purpose CPU, and they can be reprogrammed according to the requirement. This versatility and reprogrammability allow them to be used in various applications, including image processing, wireless communications and medical diagnosis.

The FPGA was first introduced by Xilinx in 1985, and Xilinx continues to be one of the leading manufacturers today. Modern FPGAs, from Xilinx as well as from other firms such as Intel and Altera, offer a wide range of features such as high logic densities, flash memory, embedded processors and digital signal processing (DSP) blocks [5].

An FPGA is configured by specifying the behaviour of its electrical inputs and outputs and by determining how each of its resources is used and connected to form the hardware design. From a software perspective, however, FPGA design can be streamlined by using pre-designed libraries of digital circuits and functions, also known as intellectual property (IP) cores. These libraries are often offered by third-party suppliers and FPGA vendors and are available for purchase or lease, which is the case with AMD IP cores.

FPGAs are programmed by loading a bitstream, which describes the configuration of blocks such as the lookup tables, registers and other resources.
The bitstream is generated by the Vivado tool [6] from a design written in a Hardware Description Language (HDL), for example the Very High-Speed Integrated Circuit (VHSIC) Hardware Description Language (VHDL).

2.2 Pipelining

Pipelining is a technique used in digital systems, particularly in processor design, to enhance performance by allowing multiple instructions to be processed simultaneously rather than waiting for one instruction to complete execution before starting the next [7]. The execution of an instruction is divided into stages, and each stage in the pipeline performs a specific task. An instruction enters a stage as soon as the previous instruction has left that stage. This leads to more efficient use of resources and increases the overall throughput.

Figure 2.1: Five-stage processor pipelining [7] (stages IF, ID, EX, ME and WB, separated by the pipeline registers IF/ID, ID/EX, EX/ME and ME/WB)

The typical stages of a 5-stage pipeline, as seen in figure 2.1, are:

1. Fetch: The instruction is fetched from the memory, and the program counter, which holds the address of the next instruction, is incremented.
2. Decode: The opcode of the instruction is decoded and the corresponding function to be performed is established.
3. Execute: The actual operation is carried out.
4. Memory Access: The memory is accessed to load or store data, if the instruction is a memory operation.
5. Write Back: The result of the operation is written to the register.

However, when multiple instructions are in flight at the same time, data hazards can occur due to data dependencies between the instructions. In addition, instructions like jump and branch can introduce control hazards, since they disturb the order in which the instructions are expected to be executed.
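As a rough illustration of the throughput gain and of the cost of hazards, the sketch below compares the cycle count of a non-pipelined datapath with that of a pipeline. It assumes that every stage takes exactly one clock cycle and that each hazard-induced stall costs one additional cycle; the function names are hypothetical and are not part of the thesis design.

```c
#include <assert.h>
#include <stdint.h>

/* Cycles needed when each instruction must pass through all stages
 * before the next one may start (no pipelining). */
uint64_t cycles_unpipelined(uint64_t n_instr, uint64_t n_stages)
{
    return n_instr * n_stages;
}

/* Cycles needed in a pipeline (idealised model, an assumption):
 * the first instruction needs n_stages cycles to pass through the pipeline,
 * every following instruction completes one cycle later, and each
 * stall cycle caused by a data or control hazard adds one cycle. */
uint64_t cycles_pipelined(uint64_t n_instr, uint64_t n_stages,
                          uint64_t stall_cycles)
{
    assert(n_stages > 0);
    if (n_instr == 0)
        return 0;
    return n_stages + (n_instr - 1) + stall_cycles;
}
```

Under these assumptions, 100 independent instructions on a 5-stage pipeline need 104 cycles instead of 500, and every stall cycle reduces that gain by one cycle.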
Advanced techniques such as branch prediction and out-of-order execution are implemented in modern processors to mitigate data and control hazards [7].

2.3 RISC-V

RISC-V is an open-source ISA based on the RISC model. It was developed by Krste Asanovic, Andrew Waterman and Yunsup Lee of the University of California at Berkeley in 2010 [3]. RISC-V builds on earlier RISC designs, whose simple instructions lend themselves to pipelining and high operating speeds. It has become very popular due to its openness and flexibility and is used in everything from small embedded processors to high-end processor configurations. It is designed as a modular ISA composed of a small set of standard instructions and a set of extensions that can be added according to the needs of the developer. RISC-V provides both 32-bit and 64-bit instruction sets, as well as various extensions such as floating point, multiply and accumulate, and vectors [8].

MicroBlaze-V

AMD makes use of the RISC-V architecture in its MicroBlaze-V soft-core processor. The MicroBlaze-V processor offers a wide variety of customisable and easy-to-integrate microprocessor configurations based on the RISC Harvard model. Due to its flexibility, it is used in many areas, including the medical, automotive and communication markets. The MicroBlaze-V core is fully integrated into AMD’s Vivado tool, which makes it easier to use its functionality in our designs. The MicroBlaze-V soft-core processor is a 32-bit processor with thirty-two 32-bit general-purpose registers, a 32-bit program counter, and a 32-bit address bus [9]. It includes a standard set of instructions and CSRs, but its flexibility allows the developer to customise it and add more instructions as required. The architecture of the MicroBlaze-V core is shown in figure 2.2.

Figure 2.2: MicroBlaze Architecture [10]

AMD provides a few Fast Simplex Link (FSL) custom instructions that move data into and out of the processor through the AXI4-Stream interface, but there is also demand for a CX extension. The focus of this work is to integrate a Composable Extension (CX) into the current MicroBlaze-V soft-core.

2.4 CX Extension

A CX extension is a named group of custom functions that forms a contract between software and hardware, so that software libraries and hardware cores implementing the same extension can work together. CX multiplexing enables a system to be composed of individually authored and versioned components [2].
The Custom Functions (CFs), identified by CF_IDs, are executed by Composable Extension Units (CXUs), which are hardware units identified by unique CXU_IDs. Each CXU operates on operands taken from registers or immediate values and writes the result back to the destination register, while also updating the status in the CSRs. Additionally, the CSRs contain the state_id, which represents the state context of the CXU. The CXU logic interface defined between the CXUs and the CPU manages the control flow of each instruction and its corresponding result.

2.4.1 CX standard encoding

When a particular CXU is selected, the software issues custom functions to the configured CXU using one of three instruction encodings: R-type, I-type, and flex-type. For each encoding type, the instruction specifies the CF_ID, the source operands (register or immediate) and, except for the flex type, a destination register.

Custom-0 R-type encoding

This type of instruction encoding has two source register operands, a destination register, and a 10-bit CF_ID, as shown in figure 2.3. The assembly instruction can be written as:

    cx_reg cf_id, rd, rs1, rs2

Figure 2.3: CX R-type instruction encoding [2]
  Bits:   31:25       24:20  19:15  14:12       11:7  6:0
  Field:  cf_id[9:3]  rs2    rs1    cf_id[2:0]  rd    0001011 (custom-0)

Custom-1 I-type encoding

This type of instruction issues a CXU request for a zero-extended 4-bit CF_ID with one source register operand and one sign-extended 8-bit immediate operand, and the result is written back to the destination register. The assembly instruction can be written as:

    cx_imm cf_id, rd, rs1, imm

The instruction encoding is shown in figure 2.4.

Figure 2.4: CX I-type instruction encoding [2]
  Bits:   31:24     23:20       19:15  14:12  11:7  6:0
  Field:  imm[7:0]  cf_id[3:0]  rs1    000    rd    0101011 (custom-1)

Custom-2 flex type encoding

This instruction issues a CXU request for a zero-extended 10-bit CF_ID with two source register operands. No destination register is involved in this operation and the response data is discarded. The assembly instruction can be written as:

    cx_flex cf_id, rs1, rs2

The instruction encoding is shown in figure 2.5.
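The three encodings described in this section can be sketched as small C helpers that pack the fields into a 32-bit instruction word. This is only an illustrative sketch: the bit positions follow our reading of the encoding figures in [2] and should be checked against the specification, the opcode constants are the standard RISC-V custom-0, custom-1 and custom-2 major opcodes, and the function names and the `custom` argument of the flex-type helper are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* RISC-V major opcodes of the custom-0/1/2 spaces (bits 6:0). */
#define CX_OPC_CUSTOM0 0x0Bu /* 0001011, R-type    */
#define CX_OPC_CUSTOM1 0x2Bu /* 0101011, I-type    */
#define CX_OPC_CUSTOM2 0x5Bu /* 1011011, flex-type */

/* cx_reg cf_id, rd, rs1, rs2 */
uint32_t cx_encode_reg(uint32_t cf_id, uint32_t rd, uint32_t rs1, uint32_t rs2)
{
    assert(cf_id < 1024u && rd < 32u && rs1 < 32u && rs2 < 32u);
    return ((cf_id >> 3) << 25)     /* cf_id[9:3] -> bits 31:25 */
         | (rs2 << 20)              /* rs2        -> bits 24:20 */
         | (rs1 << 15)              /* rs1        -> bits 19:15 */
         | ((cf_id & 0x7u) << 12)   /* cf_id[2:0] -> bits 14:12 */
         | (rd << 7)                /* rd         -> bits 11:7  */
         | CX_OPC_CUSTOM0;
}

/* cx_imm cf_id, rd, rs1, imm (imm is a signed 8-bit value) */
uint32_t cx_encode_imm(uint32_t cf_id, uint32_t rd, uint32_t rs1, int32_t imm)
{
    assert(cf_id < 16u && rd < 32u && rs1 < 32u);
    assert(imm >= -128 && imm <= 127);
    return (((uint32_t)imm & 0xFFu) << 24) /* imm[7:0]   -> bits 31:24 */
         | (cf_id << 20)                   /* cf_id[3:0] -> bits 23:20 */
         | (rs1 << 15)                     /* rs1        -> bits 19:15 */
                                           /* bits 14:12 stay 000      */
         | (rd << 7)                       /* rd         -> bits 11:7  */
         | CX_OPC_CUSTOM1;
}

/* cx_flex cf_id, rs1, rs2; bits 11:7 carry a CXU-defined custom field
 * instead of a destination register (pass 0 if unused). */
uint32_t cx_encode_flex(uint32_t cf_id, uint32_t custom,
                        uint32_t rs1, uint32_t rs2)
{
    assert(cf_id < 1024u && custom < 32u && rs1 < 32u && rs2 < 32u);
    return ((cf_id >> 3) << 25)     /* cf_id[9:3] -> bits 31:25 */
         | (rs2 << 20)
         | (rs1 << 15)
         | ((cf_id & 0x7u) << 12)   /* cf_id[2:0] -> bits 14:12 */
         | (custom << 7)            /* custom     -> bits 11:7  */
         | CX_OPC_CUSTOM2;
}
```

For example, under these assumptions `cx_encode_reg(5, 10, 11, 12)` corresponds to `cx_reg 5, x10, x11, x12`. Splitting the CF_ID across the funct7 and funct3 positions keeps rd, rs1 and rs2 in their usual RISC-V positions, so the existing register decode logic can be reused.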
Figure 2.5: CX flex-type instruction encoding [2] (fields from bit 0 upwards: custom-2 opcode in bits 0-6, custom in bits 7-11, cf_id[2:0] in bits 12-14, rs1 in bits 15-19, rs2 in bits 20-24, cf_id[9:3] in bits 25-31)

2.5 Control and Status Registers (CSRs)

The Control and Status Registers (CSRs) are special registers in a processor that monitor and manage the operation of the system. They store the control and status information of the different units of the processor. They allow software to set parameters and to initiate and control operations in the processor. They also provide feedback about the current state of the processor and its hardware components through flags or status bits. Examples include interrupt mask registers, interrupt status registers and error status registers. The CSRs are crucial for debugging and for understanding the state of the system, as they provide information about what the processor is doing at a given moment. They control access to critical resources by enforcing privilege levels, thus ensuring the safety and integrity of the system. Tuning the CSRs can optimise the performance of the system by enabling or disabling certain hardware features.

The CSRs are handled at specific privilege levels, mainly in machine mode, which has full access to all the controls in the processor. They are accessed through specific instructions provided by the ISA of the processor. The MicroBlaze-V core has some standard CSRs implemented in it.
Standard CSRs such as mstatus (describing the machine status) and misa (describing the supported ISA) are already defined in MicroBlaze-V [9].

2.6 AXI4-Stream Interface

The Advanced eXtensible Interface (AXI) Stream protocol is a standard interface for exchanging data between connected components [11]. It provides high-speed, efficient data transfer, which makes it useful for this thesis work. The AXI4-Stream interface conforms to the ARM AMBA AXI4-Stream Protocol Specification [11]. It works on the principle of handshaking between the transmitter and the receiver. The transmitter is treated as the Master and the receiver as the Slave. The AXI4-Stream interface mainly uses three signals:
1. TVALID: driven by the Master to indicate that the data on the stream is valid and that it wants to transfer it.
2. TREADY: driven by the Slave to indicate that it is ready to receive the data.
3. TDATA: the data payload sent over the AXI4-Stream from the Master to the Slave.
A data transfer happens only in a clock cycle where both TVALID and TREADY are asserted, irrespective of the order in which they were asserted, as shown in figure 2.6. TREADY can be omitted when the receiver can always accept a transfer. In such cases, TREADY is assumed to be always HIGH.
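The handshake rule can be illustrated with a small cycle-by-cycle model. The following Python sketch is not part of the thesis design; the function name and the per-cycle list representation of the signals are assumptions made for illustration only.

```python
def axi_stream_transfers(tvalid, tready, tdata):
    """Return (cycle, data) for every clock cycle in which a transfer completes.

    Each argument is a list with one entry per clock cycle. A transfer
    completes only in cycles where both TVALID and TREADY are high.
    """
    transfers = []
    for cycle, (valid, ready, data) in enumerate(zip(tvalid, tready, tdata)):
        if valid and ready:
            transfers.append((cycle, data))
    return transfers


if __name__ == "__main__":
    # TREADY asserted before TVALID: the transfer occurs once TVALID goes high.
    print(axi_stream_transfers(
        tvalid=[0, 0, 0, 1],
        tready=[0, 1, 1, 1],
        tdata=[None, None, None, 0xA5]))

    # TVALID and TREADY asserted in the same cycle: the transfer is immediate.
    print(axi_stream_transfers(
        tvalid=[0, 0, 1, 0],
        tready=[0, 0, 1, 0],
        tdata=[None, None, 0x3C, None]))
```

The model treats a missing TREADY the same way as a list of all ones, which matches the case where the receiver can always accept a transfer.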
Figure 2.6: Timing Diagram of AXI4-Stream Data Transfer (handshake with TREADY asserted before TVALID, and with TVALID and TREADY asserted simultaneously)

In addition to these, a signal called TLAST can be configured in some cases to indicate packet boundaries. Asserting TLAST marks the final transfer of a packet, and no further data belonging to that packet follows it.

2.7 Accelerators

Accelerators are used to speed up a given task that would otherwise require a considerable amount of time to execute. In our project, the accelerators function as the hardware units that execute the custom function and write the result and error status back to the processor. A system composed of a processor and a hardware accelerator retains software programmability for most of the software run on the processor while improving power and performance for the functions run on the hardware accelerator. Since hardware accelerators are designed specifically to handle a particular function, their resources are matched to the precision of the operation, unlike typical processors, which offer resources in widths that are multiples of 8 bits. Hardware accelerators also reduce execution time by avoiding the overhead of branch prediction, data caching and the other complex operations involved in modern processors. Moreover, as hardware accelerators provide only the hardware resources required to perform the given operation, they also reduce the power consumed in executing a stream of instructions.
FPGA-based accelerators are convenient due to their flexibility and adaptability. They can easily be reprogrammed to run a different function. Moreover, using an FPGA reduces latency and power consumption compared to using a CPU [12]. In this project, the accelerators are used to demonstrate the newly integrated extension.

2.8 Intellectual Property (IP) Core

An Intellectual Property (IP) core is a standalone, reusable functional logic unit that performs a complex task and can be used in several digital designs. IP cores are developed using hardware description languages such as Verilog and VHDL. Several IP cores in the Xilinx library are available as plug-and-play modules, and a few of them can be used in our design.

2.8.1 CORDIC

CORDIC stands for Coordinate Rotation Digital Computer. The algorithm was initially developed by Volder to solve trigonometric equations iteratively, and was later generalised by Walther to include more functions such as hyperbolic and square root functions [13]. The architecture of the CORDIC core can be configured in two ways:
• Fully parallel configuration with single-cycle data throughput: the CORDIC core implements the operations in parallel using an array of shift-addsub stages.
• Word-serial implementation with multiple-cycle throughput: the shift-addsub operations are performed serially using a single shift-addsub stage in a feedback loop.
The CORDIC IP core can be used to implement many mathematical functions, including trigonometric functions, square root, hyperbolic functions and rectangular-polar conversions. The required function is configured in the IP core and the operands are selected accordingly, depending on whether a phase or a Cartesian operand is needed. The pin diagram of the CORDIC IP core is shown in figure 2.7.
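The iterative shift-addsub principle can be sketched in a few lines of Python. This is a floating-point illustration of CORDIC in rotation mode and not the implementation used inside the Xilinx IP core: the multiplications by 2^-i stand in for the hardware shifts, and the function name and the default of 16 iterations are assumptions.

```python
import math


def cordic_sin_cos(angle, iterations=16):
    """Approximate (sin, cos) of `angle` (radians, within [-pi/2, pi/2]).

    Rotation mode: the vector (x, y) is rotated by +/- atan(2^-i) in each
    iteration so that the residual angle z is driven towards zero.
    """
    # Elementary rotation angles; in hardware these come from a small table.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]

    # Every micro-rotation stretches the vector; start with 1/gain to compensate.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    x, y, z = 1.0 / gain, 0.0, angle
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0          # rotation direction
        # One shift-addsub stage; 2^-i corresponds to a right shift by i bits.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return y, x                               # (sin, cos)


if __name__ == "__main__":
    s, c = cordic_sin_cos(math.pi / 6)
    print(f"sin = {s:.5f}, cos = {c:.5f}")
```

A fully parallel core would unroll the loop into one hardware stage per iteration, while a word-serial core would reuse a single stage for all iterations.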
Figure 2.7: Pin Diagram of CORDIC IP core [13] (ports: aclk, aresetn, aclken; input channels s_axis_cartesian and s_axis_phase and output channel m_axis_dout, each with tdata, tvalid, tready, tuser and tlast).

Trigonometric Functions

Trigonometric functions such as sine, cosine and tangent have a wide range of applications. These range from astronomy, where they are used to find the distance from Earth to other planets and stars, to navigation, construction and marine engineering. The CORDIC IP core calculates the sine and cosine of a given phase value (in radians). Since the phase value is the required input, only the phase input parameters are enabled in this case. The coarse rotation module in the IP core limits the input angle to the range between −π and +π. The input angle is expressed as a fixed-point two's complement number with an integer width of 3 bits, while the output vector is expressed as a pair of fixed-point two's complement numbers with an integer width of 2 bits [13].
The IP core returns both the sine and cosine of the phase value in a single 32-bit output vector, with the lower 16 bits representing the sine and the upper 16 bits representing the cosine. The output format is given in figure 2.8.

Figure 2.8: Output vector format of sin and cos function in CORDIC (SINE in bits 0-15, COSINE in bits 16-31).

Vector Translation

The vector translation function, which converts rectangular to polar coordinates, is used in areas including navigation, robotics and signal processing. The CORDIC IP core performs vector translation by rotating the input vector through an angle θ until its Y component equals zero. The scaled magnitude and the phase of the input vector are obtained as outputs. In this case, both the Cartesian and phase input parameters are enabled. The vector translation shows linear behaviour with respect to magnitude. The number of significant magnitude bits of the input vector limits the accuracy of the phase output from CORDIC. The input and output representation is similar to that of the trigonometric function: the output magnitude is expressed as a fixed-point two's complement number with an integer width of 2 bits and the phase angle as a fixed-point two's complement number with an integer width of 3 bits. The output vector is again 32 bits wide, with the magnitude in the lower 16 bits and the phase in the upper 16 bits. The output vector representation is shown in figure 2.9.

Figure 2.9: Output vector format of Translate function in CORDIC (MAGNITUDE in bits 0-15, PHASE in bits 16-31).

The CORDIC IP core uses the AXI4-Stream protocol for its inputs and outputs. It works by basic handshaking between the tvalid and tready signals. The AXI4-Stream interface of the CORDIC core can be configured in Non-blocking as well as Blocking mode.
In Non-blocking mode there is no tready signal, and it is assumed to be always asserted. In Blocking mode the tready signal is present and is used to control the data flow through the core, making sure that the output buffer is not overloaded with data.

2.8.2 Fast Fourier Transform (FFT)

FFT is a computationally efficient algorithm for computing the Discrete Fourier Transform (DFT) of a signal [14]. The FFT IP core provided in the Xilinx library uses the Cooley-Tukey FFT algorithm [15], which is one of the most widely used methods for calculating the DFT. The DFT $X(k)$, $k = 0, \ldots, N-1$ of the sequence $x(n)$, $n = 0, \ldots, N-1$ is defined as (2.1), where $N$ is the transform length. The Inverse Discrete Fourier Transform (IDFT) is given by (2.2).

\[ X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-j 2\pi nk/N} \tag{2.1} \]

\[ x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, e^{j 2\pi nk/N} \tag{2.2} \]

Cooley-Tukey FFT algorithm

The Cooley-Tukey algorithm decomposes a large DFT into smaller sub-DFTs and combines their results. This divide-and-conquer strategy reduces the computational complexity of the DFT from $O(N^2)$ to $O(N \log N)$. The radix-2 algorithm, the most common variant of the Cooley-Tukey algorithm, recursively decomposes each DFT into smaller DFTs of half the size until the computations can be performed directly.

Xilinx FFT IP core

The FFT IP core has two types of input signals, one for the input data and one for the config instructions, and one output signal that provides the result. The pin diagram of the FFT IP core is given in figure 2.10.
Figure 2.10: Pin Diagram of FFT IP core [16] (ports: aclk, aresetn, aclken; configuration channel s_axis_config, data input channel s_axis_data, data output channel m_axis_data, status channel m_axis_status, and event_* outputs).

The Xilinx FFT IP core provides four architectures for implementing the FFT computation. They process the data either as a continuous stream (pipelined streaming I/O architecture) or as independent data frames (burst I/O architectures). The pipelined streaming solution uses the Decimation in Frequency (DIF) method for the FFT computation, and several radix-2 butterfly processing engines implement continuous data processing. Each processing engine has its own memory units to store the input and the intermediate results.
The burst I/O architecture uses the Decimation in Time (DIT) method for the FFT computation and is available as radix-4, radix-2 and radix-2 lite architectures. In the burst I/O architecture, the FFT core first loads the data, then computes the transform, and finally unloads the result. The radix-4 and radix-2 architectures use radix-4 and radix-2 butterfly structures, respectively. The radix-2 lite architecture uses the same radix-2 butterfly but shares a single adder/subtractor within the processing engine, which reduces resources at the expense of an additional delay per butterfly calculation [16]. The Radix-2 Burst I/O architecture is illustrated in figure 2.11. The transform length determines the number of stages of the FFT algorithm. It can be configured both at design time and at run time. The configuration port of the core selects between FFT and Inverse Fast Fourier Transform (IFFT) and sets the scaling performed in each stage. The streaming I/O architecture gives the highest throughput while using the most resources, and the radix-2 lite burst I/O architecture has the lowest throughput. The performance of each architecture is given in figure 2.12.
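To make the radix-2 decimation-in-time decomposition concrete, the following Python sketch compares a direct evaluation of the DFT sum over n = 0, ..., N − 1 with a recursive radix-2 FFT. It is a software illustration only and says nothing about the internal structure of the Xilinx core. The function names are chosen for this example.

```python
import cmath


def dft(x):
    """Direct O(N^2) evaluation of the DFT definition."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
            for k in range(n)]


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 2:
        raise ValueError("transform length must be a power of two")
    even = fft_radix2(x[0::2])               # DFT of the even-indexed samples
    odd = fft_radix2(x[1::2])                # DFT of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        # Radix-2 butterfly: one twiddle multiplication, one add, one subtract.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


if __name__ == "__main__":
    samples = [1, 2, 3, 4, 0, 0, 0, 0]
    error = max(abs(a - b) for a, b in zip(dft(samples), fft_radix2(samples)))
    print(f"largest difference between DFT and FFT: {error:.2e}")
```

For an 8-point transform the recursion has three levels, which corresponds to the three butterfly stages determined by the transform length.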
Figure 2.11: Radix-2 Burst I/O architecture [16].

Figure 2.12: Resource versus throughput for different architecture types (architectures shown: Radix-2 lite Burst I/O, Radix-2 Burst I/O, Radix-4 Burst I/O and Streaming).

The data formats of the input, output and configuration instructions are shown in figure 2.13.

Figure 2.13: Data format of instructions in FFT IP core: (a) input and output instructions, (b) config instruction.

Here, XK_RE and XK_IM represent the real and imaginary parts of the data, while the FWD bit selects whether an FFT or an IFFT is performed.

2.9 Evaluation Metrics

Performance becomes important whenever something is integrated into an existing processor design. It is of interest to assess the benefit of running a function in hardware alongside the processor, rather than running the same function in software. The performance of the processor and the hardware usage can be evaluated using various metrics. Some of the commonly used metrics are given below:
1. Cycles Per Instruction (CPI) - This denotes the average number of clock cycles that each instruction takes to execute on the machine [7].
\[ \mathrm{CPI} = \frac{T_{exe}}{IC \cdot T_C} \tag{2.3} \]
where $T_{exe}$ is the execution time, $IC$ is the instruction count and $T_C$ is the machine cycle time.
2.
Speedup - The speedup of a machine over a reference machine is defined as the ratio of the execution time that a program takes on the reference machine to the execution time that the same program takes on the machine being evaluated.
\[ \mathrm{Speedup} = \frac{T_{exe,Ref}}{T_{exe}} \tag{2.4} \]
where $T_{exe,Ref}$ is the execution time of the program on the reference machine and $T_{exe}$ is the execution time of the program on the machine that is being evaluated [7].
3. Resource utilisation - This denotes the amount of hardware used for executing a program on the specified machine.
4. Power Consumption - This denotes the amount of power consumed (both static and dynamic) for executing a program on the specified machine.

3 Methods

This chapter describes the overall process flow of the thesis work. This is followed by an overview of the tools used for the design and implementation of the prototype and for testing the entire module on an FPGA.

3.1 Workflow

The first step in our thesis work was to understand the MicroBlaze-V core and the relevance of integrating a CX extension into it. The first few weeks were spent on a literature survey of custom instructions and the proposed specification for the CX extension. Once we had a good understanding of the work to be done, we started by designing a block diagram of the entire system. We designed, on paper, an interface for the CX extension that would be implemented in RTL and later integrated into the MicroBlaze-V core. In parallel, we surveyed accelerators that could be used for demonstrating the CX extension.

After the design had been completed on paper, we proceeded to the RTL implementation of the selected accelerators and tested them as a unit using a testbench. We then moved on to developing the RTL for the design and verifying its behaviour and timing using Vivado.
After verification, the design was implemented using Vivado, and the power and timing constraints were checked to make sure that the requirements were met. Later, the newly developed interface was integrated into the existing MicroBlaze-V core and the same process of RTL design and verification was repeated. After the bugs and timing issues were fixed, we implemented the entire core design on the KCU105 FPGA board to test the behaviour on hardware.

In parallel to the RTL design, we also worked on the software implementation of the accelerators in C. The C programs for computing sine and cosine, the magnitude and phase of a vector, and the FFT were written and built in Vitis and later simulated using QuestaSim. To record the results, we logged them from the waveform to an output file via UART using a TCL script. The execution times of all three programs were noted so that they could be compared with the execution times of the accelerators on the CX-enabled MicroBlaze-V. The resource utilisation of the core with and without the interface and accelerators was also reported for further comparison.

The entire project plan was prepared in an Excel sheet listing all the tasks and sub-tasks and the person assigned to each task. In addition to this, we had a weekly sync meeting with our advisors at AMD every Tuesday, where we reported the work done in the previous week and discussed our plans for moving forward.

3.2 TestBench

A top-level test bench was designed for the functional verification of the CX interface along with the interconnect and accelerator modules. The test bench was designed to imitate the software libraries that supply the instructions. The test vectors are generated using a Python script by concatenating the input of the IP core with the CXU_ID and CF_ID. The input to the IP core was extracted from the built-in test bench of the IP core library.
The test bench is also self-checking: the result computed by the accelerators is compared against the expected result obtained from the IP core library. Errors are detected automatically using the assert and report statements in VHDL. The test bench design is illustrated by the flowchart in figure 3.1.

3.3 Tools Used

The tools listed in table 3.1 were used during the thesis work.

Table 3.1: Software and Hardware tools used in the thesis work
Tools                               Purpose
Emacs                               Text editor for the VHDL code.
QuestaSim                           To perform functional verification of the design.
Vivado                              To implement design on FPGA.
Vitis                               Environment to test the functionality in software.
Python                              Used to generate the test vectors for the testbench.
Xilinx Kintex Ultrascale KCU105     FPGA board used to run the program.
Latex                               For documentation
Inkscape                            For sketching

Figure 3.1: Top-level TestBench Design Flow (Start; i = 0; feed test vector at i; accelerator process; if expected output = actual output, then i = i + 1 and continue, otherwise report an output mismatch error; End)

4 Design

In this section, the design of each component facilitating the implementation of the CX extension will be described in detail, including the process flow. First, an overview of the design will be introduced in Section 4.1, followed in Section 4.2 by the modifications in MicroBlaze-V, including the pipeline flow, the design of the CX interface, and the additional CSRs required for the implementation of CX. In Section 4.3, the Interconnect that acts as a bridge between the CPU and the accelerators will be described in detail. Section 4.4 introduces the design of the implemented accelerators and, last, the reference model used for the performance evaluation will be described in Section 4.5.

4.1 Overview

Implementing the CX extension in the RISC-V architecture involves modifications to the MicroBlaze-V core and integrating software libraries and hardware modules.
These modifications to the core include the definition of an additional hardware module (the CX interface), the definition of new CSRs, and the required changes in the decode and write-back pipeline stages to handle the execution of custom instructions. The custom instructions required to execute a custom function are supplied with the help of software libraries, which generate the instruction codes and feed them to the IF stage in the RISC-V architecture. When a custom instruction is decoded, the CX interface handles its execution with the help of external accelerators.

AMD's support for the MicroBlaze RISC-V processor allows the hardware specification defined in Vivado to be combined with the software code built in the Vitis IDE. The Electronic Design Automation (EDA) support in Vitis allows the current hardware design to be exported as a platform on which applications can be built. The application generates the required instruction codes, using inline assembly in C, in ELF format. By importing the generated ELF file into the Vivado environment, MicroBlaze-V can be implemented on the FPGA together with the custom instructions. The generated instruction codes are stored in the external memory (LMB RAM) of the MicroBlaze-V. When the design runs on hardware, one instruction is fetched in each clock cycle and executed by the MicroBlaze-V processor.

To demonstrate the CX extension, the FFT and CORDIC function accelerators are used as CXUs. Unique CXU_IDs are defined for the accelerators, along with function IDs (CF_ID) for each custom instruction. These IDs are obtained during execution from the custom instructions and the CSRs. As shown in figure 4.1, the CXU_ID is fetched from the CSR called mcx_selector while the CF_ID is fetched from the instruction itself.

Figure 4.1: Block diagram of the overall design.

As the accelerators considered do not have any state context, the state ID is ignored.
This information, along with the operand values, is sent to the CXU after the version compatibility has been verified. When the CXU completes its execution, the result and the status are written back to the MicroBlaze-V core.

4.2 MicroBlaze-V

In this thesis work, the MicroBlaze-V core has been modified to implement the CX extension. To handle the execution of custom instructions, a hardware module called the CX interface has been designed and integrated into the processor core. To support this new module, some of the other components also had to be modified. This includes integrating the new CSRs and updating the decoder module to support the custom instructions. Figure 4.2 shows the architecture of the modified MicroBlaze-V core that handles the CX extension.

Figure 4.2: Architecture of the modified MicroBlaze-V core. The modules highlighted in yellow are the blocks that are modified to integrate the CX interface. (Blocks shown: instruction-side and data-side bus interfaces with I-Cache and D-Cache, program counter, branch target cache, instruction buffer, instruction decode, compressed instruction decode, control and status registers, register file, floating point register file, ALU, barrel shifter, multiplier, divider, FPU, bit manipulation and CX_Interface. The legend distinguishes optional features and modifications for the CX extension.)

4.2.1 CX interface

The CX interface designed in MicroBlaze-V handles the execution of custom instructions through the series of operations shown in figure 4.3. When the ID stage decodes a custom instruction, the operand values and the CF_ID from the instruction code, along with the CXU_ID and version from the CSRs, are forwarded to the CX interface. The CX interface takes this information and compares the version in the CSR with the MicroBlaze-V version.
If the versions match, the operand values and the identifiers (CXU_ID, CF_ID) are encoded and transmitted via the AXI4-Stream interface. If the version is incorrect, the IV flag in the CSR is set and the custom instruction is dropped. After the data has been transmitted, the pipeline is stalled while MicroBlaze-V waits for the result.

Figure 4.3: Workflow of the CX interface

The functional block diagram of the CX interface is shown in figure 4.4. The execution of a custom instruction is controlled by the EX_CX_Instr register, which is asserted while a custom instruction is being decoded. EX_CX_Firstcycle denotes the first cycle in the EX stage and helps with the handshake signals of the AXI4-Stream protocol. EX_CX_Write determines whether the custom instruction expects a result or not. If this signal is low during the execution of a custom instruction, the CX interface treats the instruction as a write instruction and does not wait for the response. EX_Piperun indicates whether the pipeline is currently in the EX stage. The EX_CX_Stall flag is used to stall the pipeline and EX_CX_Result contains the result of the execution. EX_MCX_Sel and EX_New_MCX_Status are the CSRs designed for the CX extension, and the EX_Write_MCX_Status flag is used as an enable signal for writing to the CSRs. The CX interface communicates with the accelerators via the AXI4-Stream protocol. The master interface is used to write data and the slave interface is used to read data. The control and data signals of the master and slave AXI interfaces can also be seen in the block diagram.
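The control flow described above can be summarised as a small behavioural model. The following Python sketch is an illustration under stated assumptions and is not the RTL of the thesis: only the bus widths (88-bit request, 40-bit response), the version check and the IV flag are taken from the text, whereas the field order inside the two words, the supported version value and all function names are hypothetical.

```python
SUPPORTED_VERSION = 1      # assumption: CX multiplexing version implemented by the core
IV_FLAG = 1 << 0           # invalid-version flag, assumed to be bit 0 of cx_status
MASK32 = 0xFFFFFFFF


def pack_request(cxu_id, cf_id, op_a, op_b):
    """Pack a request into an 88-bit word (hypothetical field order)."""
    return ((cxu_id & 0xFF) << 74) | ((cf_id & 0x3FF) << 64) \
        | ((op_b & MASK32) << 32) | (op_a & MASK32)


def unpack_response(word):
    """Split a 40-bit response into (status, result) (hypothetical field order)."""
    return (word >> 32) & 0xFF, word & MASK32


def execute_custom(version, cxu_id, cf_id, op_a, op_b, expects_result, cxu):
    """Return (result, status) for one custom instruction.

    `cxu` is a callable standing in for the accelerator behind the AXI4-Stream
    links: it takes the request word and returns the response word.
    """
    if version != SUPPORTED_VERSION:
        return None, IV_FLAG                 # instruction dropped, IV flag set
    response = cxu(pack_request(cxu_id, cf_id, op_a, op_b))
    if not expects_result:
        return None, 0                       # write instruction: no response awaited
    status, result = unpack_response(response)
    return result, status                    # pipeline resumes with the result


def adder_cxu(request):
    """Toy accelerator used for the example: adds the two operands."""
    op_a = request & MASK32
    op_b = (request >> 32) & MASK32
    return (op_a + op_b) & MASK32            # status byte left at zero


if __name__ == "__main__":
    print(execute_custom(1, cxu_id=2, cf_id=0, op_a=3, op_b=4,
                         expects_result=True, cxu=adder_cxu))
```

In this model the call to `cxu` corresponds to the period in which the pipeline is stalled. In hardware this wait spans one or more clock cycles, depending on the accelerator.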
Figure 4.4: Design of CX interface with signals (ports: Clk, Reset, EX_CX_Instr, EX_CX_Write, EX_CX_Firstcycle, EX_Piperun, EX_MCX_Sel, EX_New_MCX_Status, EX_Write_MCX_Status, EX_CX_Stall, EX_CX_Result, CX_M_AXIS_Tdata(87 downto 0), CX_M_AXIS_Tvalid, CX_M_AXIS_Tready, CX_S_AXIS_Tdata(39 downto 0), CX_S_AXIS_Tvalid, CX_S_AXIS_Tready)

According to the custom instruction encoding format, the custom-2 flex-type encoding has no destination register address. The custom-2 flex-type instructions are therefore treated as write instructions, for which the MicroBlaze-V core does not expect a result. For write instructions, the pipeline is stalled until the transfer of the payload on the master interface of the AXI4-Stream has finished.

The execution of an instruction of a CX extension might require more than one clock cycle. Stalling the pipeline is therefore critical, as letting it continue could lead to various pipeline hazards. In this design, the pipeline is stalled in the decode stage during the execution of a custom instruction and resumes when the custom instruction finishes its execution.

The AXI master and slave interfaces have been designed for the transfer of information between MicroBlaze-V and the accelerators. The parameters needed for the execution of the custom instruction are encoded as shown in figure 4.5 and transferred to the accelerators via the master interface. Correspondingly, the result of the custom instruction, along with the status of execution, is read by MicroBlaze-V through the slave interface. The received information, shown in figure 4.6, is decoded and the destination register and the status CSR are updated.

Figure 4.5: Encoded datastream write by the RISC-V core

Figure 4.6: Encoded datastream read by the RISC-V core

4.2.2 CSR

The implementation of CX requires the definition of additional CSRs for extension multiplexing and custom instruction execution.
For the execution of a collision-free instruction, the compatibility of the MicroBlaze-V IP core as well as of the accelerators needs to be verified. As the execution of a custom instruction follows the specified procedure, verifying the status of the execution of a custom instruction is also important.

1. mcx_selector - The mcx_selector CSR enables CX multiplexing and allows the developer to select the CXU required to run a particular instruction. It can be read or written only at machine level. The format of this CSR is given in figure 4.7.

Figure 4.7: Definition of mcx selector CSR (excerpt from the Draft Proposed RISC-V Composable Custom Extensions Specification, section 2.2 "New CX control / status registers", page 14). The excerpt notes that, in a privileged architecture system, user-level read access to mcx_selector values could reveal goings-on in other software threads and thus facilitate side channel attacks, and it leaves open which CSRs and access permissions are required in a privileged architecture with M/S/U levels. It shows the mcx_selector CSR (0xBC0) in two layouts:
• Version 0 (legacy custom instructions): bits 0-27 reserved, bit 28 cxe, bits 29-31 version = 0.
• Version 1 (extension multiplexing): bits 0-7 cxu_id, bits 8-15 reserved, bits 16-23 state_id, bits 24-27 reserved, bit 28 cxe, bits 29-31 version = 1.
The fields are defined in the excerpt as follows:
• version (extension multiplexing version): When version=0, composable extension multiplexing is disabled. When cxe=0, custom-[0123] instructions execute the CPU's built-in custom instructions and custom CSR addresses select the CPU's built-in custom CSRs. When cxe=1, custom-[0123] instructions and custom CSR accesses raise an illegal-instruction exception. When version=1, version-1 composable extension multiplexing is enabled and the cxu_id and state_id fields select the current CXU and state context. When cxe=0, custom-[012] instructions issue CXU requests, and custom CSR accesses access CX CSRs, of the CXU and state context identified by cxu_id and state_id. When cxe=1, custom-[012] instructions and custom CSR accesses raise an illegal-instruction exception. Version values 2-7 are reserved.
• cxe (custom operation exception enable): When (version=0 or version=1) and cxe=1, a custom operation raises an illegal-instruction exception.
• cxu_id (selects the hart's current CXU): A valid cxu_id identifies a configured CXU. When enabled, if cxu_id does not identify a configured CXU, executing a custom operation instruction causes an invalid CXU_ID error: the cx_status.CX error bit is set and the instruction's destination register, if any, is zeroed.
• state_id (selects the current state context of the hart's current CXU): A valid state_id identifies a state context of a CXU. When enabled, if cxu_id is valid but state_id does not identify a state context of the current CXU, executing a custom operation instruction causes an invalid STATE_ID error: the cx_status.IS error bit is set and the instruction's destination register, if any, is zeroed.
No error occurs when mcx_selector is CSR-written with an invalid CX selector, i.e., when cxu_id or state_id is invalid; the error is raised only when a custom operation instruction is subsequently executed.

Here, the cxu_id field identifies the corresponding CXU and the state_id field identifies the corresponding state.

2. cx_status - This CSR accumulates the CXU error flags and may be read and written at all privilege levels. By default, the application software sets all fields of this CSR to 0 before the execution of an instruction; the CXU then updates the values if any errors occur. The fields of cx_status are given in figure 4.8.

Figure 4.8: Definition of cx status CSR (excerpt from the Draft Proposed RISC-V Composable Custom Extensions Specification, section 2.2 "New CX control / status registers", page 13). The excerpt shows the cx_status CSR (0x801) with the accrued error flags: bit 0 IV, bit 1 IC, bit 2 IS, bit 3 OF, bit 4 IF, bit 5 OP, bit 6 CU, bits 7-31 reserved. It also notes that writing mcx_selector does not automatically zero cx_status, so that software can clear the errors, issue a series of custom function instructions across multiple extensions, and then check for errors.

The flags seen in figure 4.8 are described below.
• IV - Invalid CX version error: set when the mcx_selector version is invalid.
• IC - Invalid CXU_ID error: set when the cxu_id provided to the mcx_selector is invalid.
• IS - Invalid STATE_ID error: set when cxu_id is valid but the state_id is invalid.
• OF - State context is off error: set when cxu_id and state_id are valid but the state context is in the off state.
• IF - Invalid CF_ID error: set when cxu_id and state_id are valid but the CF_ID of the instruction is invalid.
• OP - CXU operation error: set when cxu_id, state_id and CF_ID are valid but there is an error in the requested operation or operands.
• CU - Custom CXU operation error: set when cxu_id, state_id and CF_ID are valid but there is an error in the requested operation or operands, with its custom error state available.

4.3 Interconnect

The interconnect has been designed to act as a bridge between the MicroBlaze-V core and the accelerators. The interconnect has independent AXI-4 stream interfaces for communicating with the MicroBlaze-V processor and the accelerators, as shown in figure 4.9. The overall execution of a custom instruction is handled in the interconnect in two phases, the write phase and the read phase. The write phase handles the communication from the MicroBlaze to the accelerators and the read phase handles the opposite direction. A side lobe channel, CXU Hit, has also been designed to indicate which accelerator is executing the custom instruction. Because the information is broadcast to all the accelerators, the interconnect need not know the header ID (CXU_ID) of the accelerators, which avoids the need for lookup tables in the interconnect module. The accelerator with the correct CXU_ID will respond to the broadcast request and the interconnect will wait for the response.
If none of the accelerators responds to the request, the custom instruction fails with a CXU error.

Figure 4.9: AXI-4 stream interfaces in the design

4.3.1 Write Phase

The write phase of the interconnect, handled by the broadcaster, evaluator, and error handling units in figure 4.10, establishes the communication from the MicroBlaze to the accelerators. During the write phase, the interconnect broadcasts the payload from the master interface of the CX interface to all the accelerators. Broadcasting the information eliminates the need to store the header IDs of the accelerators, including the CXU ID and CF ID, in the MicroBlaze or in the interconnect. The accelerator with the matching CXU_ID responds to the request and signals this to the interconnect through the side lobe channel (CXU Hit). The error handling in the interconnect uses the CXU Hit flag from the accelerators to check for a CXU error. In case of a mismatch, the CXU error flag is set and accumulated. Accumulating the CXU error preserves any previous invalid CXU errors.

Figure 4.10: Interconnect functionality in the write phase

4.3.2 Read Phase

During the read phase, the computed result from the accelerators, along with the accumulated error flag, is read by the MicroBlaze-V processor. The functions performed by the interconnect during the read phase are shown in figure 4.11. The mixer waits for the response from the accelerators along with the status of the operation, appends the accumulated CXU error flag, and writes the information to the CX interface. In this step, a bitwise OR operation is performed on the responses from the accelerators. Therefore, the response of the non-triggered accelerators should be set to void or 0. If none of the accelerators is triggered, the error handling identifies the error and writes it back to the CX interface. An OR function has been used to distinguish between the response from the mixer function and that from the error handling unit. The evaluator unit generates the response based on the instruction type (read/write) by analysing the ready flag in the slave interface of the CX interface. When the ready flag is high, the instruction is identified as a read instruction, for which the CX interface expects a result from the accelerator. When the ready flag is low, the instruction is identified as a write instruction, for which the CX interface does not expect a result.

Figure 4.11: Interconnect functionality in the read phase

4.4 Accelerators

The accelerator executes the custom instruction by performing the computations based on the parameters provided and writes the result back to the MicroBlaze-V processor through the interconnect module. The slave and master interfaces defined in the accelerator handle the reading and writing of the data, respectively. Additionally, the CXU Hit channel indicates whether the custom instruction has the correct CXU_ID. Each accelerator has a unique CXU_ID, which acts as its identifier. This thesis work considers the FFT and the CORDIC IP cores from the Xilinx library for hardware acceleration. Table 4.1 lists the identifiers for each accelerator and their corresponding function.

Table 4.1: Accelerator identifier
Accelerator          Instruction Type      CXU ID   CF ID
Trigonometry         Custom-0 R type       0x20     0b10
Vector Translation   Custom-0 R type       0x25     0b01
FFT                  Custom-2 flex type    0x30     0b01
                     Custom-2 flex type    0x30     0b10
                     Custom-0 R type       0x30     0b11

A wrapper module defined over the IP cores, shown in figure 4.12, manages the execution of the custom instruction. Upon receiving the tvalid and tdata signals, the wrapper module samples the incoming bit stream and checks its integrity. If the CXU_ID is valid, the data stream is received by asserting the ready signal of the AXI slave interface, and this is noted as a CXU hit.
In case of a CXU mismatch, the data stream is not received and the CXU Hit signal is not asserted. The parameters required by the IP core are forwarded after the CF_ID check and the operand check have been performed. If either check fails, the error handling provides an error response back to the MicroBlaze. The wrapper module waits until the IP core has computed the result, and the computed result is encoded together with the status to generate the response back to the MicroBlaze. This combinational logic helps the accelerators respond faster. The AXI interfaces of the wrapper and the IP cores are connected through the AXI master and slave interfaces defined in the wrapper. These interfaces connect the AXI handshake signals (valid and ready) and the data channel.

As mentioned before, the custom instructions are distinguished by whether they write back a result. The accelerator module follows the normal process flow when the custom instruction expects a result back. If the custom instruction does not expect a result, the accelerator does not write the result back to the MicroBlaze. The AXI master interface defined in the wrapper handles this functionality. When such custom instructions produce a CF error or an op error, the error handling stores and accumulates the corresponding status. The accumulated status is written back to the MicroBlaze-V later, when the accelerator runs a read instruction.

Figure 4.12: Wrapper module for the accelerators (blocks: AXI Slave Interface, Sampling, CXU Check, CF Check, OP Check, IP core, Error Handling, Encoder, AXI Master Interface; wrapper ports: Acc_s_axis_tdata, Acc_s_axis_tvalid, Acc_s_axis_tready, CXU_hit, Acc_m_axis_tdata, Acc_m_axis_tvalid, Acc_m_axis_tready; internal signals: s_axis_tdata, s_axis_tvalid, s_axis_tready, m_axis_tdata, m_axis_tvalid, m_axis_tready, cf error, op error, error_tvalid, CX_Status)

4.4.1 CORDIC

This thesis work implements the trigonometric and vector translation functionalities of the CORDIC IP core from Vivado. These functionalities are implemented as two independent CXUs with unique CXU_IDs, as listed in table 4.1. As each CORDIC accelerator requires only one instruction to perform its computation, the two least significant bits of the instruction code are sufficient to represent the CF_ID. As mentioned before, the IP core is executed only after the error checking has been performed. For the CORDIC functionalities, the state ID is ignored, as the computation does not require any state context. Since both CORDIC functions produce a result, the custom-0 type instruction encoding is used. Therefore, the accelerator wrapper follows the normal process flow and returns the result to the MicroBlaze-V.

The trigonometric IP core has been configured to operate on a 16-bit fixed-point input value and produces a 32-bit fixed-point value, which allows the computed result to be written back directly. The 32-bit output value contains the 16-bit sine and cosine values in the LSB and MSB halves, respectively. As the operand values in the instruction code allow 32 bits, the input for the IP core is taken from the least significant bits and the remaining bits are ignored. The vector translation function of the CORDIC IP core is configured for a 32-bit Cartesian input and produces a 32-bit result containing the scaled magnitude and the rotated phase angle. Both CORDIC functions are configured for maximum performance by using the maximum pipelining mode and the parallel architectural configuration. Truncation of the data limits the result to 32 bits in both CORDIC functions.

Similar wrapper modules have been designed for both CORDIC functions. They follow the normal process flow but differ in input handling. This difference in the sampling unit ensures that the correct input is driven to the IP cores.
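The packing of the trigonometric result described above (sine in the lower 16 bits, cosine in the upper 16 bits) can be illustrated with a short C sketch for unpacking the 32-bit value on the software side. The names sincos_raw_t, cordic_unpack and fixed_to_double are hypothetical, and the number of fractional bits is left as a parameter because the text does not state the fixed-point format; it would have to be taken from the IP core configuration.

```c
#include <assert.h>
#include <stdint.h>

/* Raw fixed-point halves of the 32-bit trigonometric result. */
typedef struct {
    int16_t sine;   /* bits 15..0 of the result  */
    int16_t cosine; /* bits 31..16 of the result */
} sincos_raw_t;

/* Split the 32-bit result into its two signed 16-bit halves. */
static sincos_raw_t cordic_unpack(uint32_t result)
{
    sincos_raw_t out;
    /* The unsigned-to-signed casts assume two's complement wrap-around,
     * which holds on the usual compilers for this kind of target. */
    out.sine   = (int16_t)(uint16_t)(result & 0xFFFFu);
    out.cosine = (int16_t)(uint16_t)(result >> 16);
    return out;
}

/*
 * Convert a raw fixed-point value to double.
 * frac_bits is an assumption supplied by the caller; it is not
 * specified in the text and depends on the IP core configuration.
 */
static double fixed_to_double(int16_t raw, unsigned frac_bits)
{
    assert(frac_bits < 16u);
    return (double)raw / (double)(1u << frac_bits);
}
```

A caller would read the destination register of the custom instruction, pass it to cordic_unpack, and scale both halves with the fractional width chosen for the core.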
The overall timing required for the wrapper to perform these computations is shown in figure 4.13. As mentioned, when a bit stream with a valid CXU_ID is received, the CXU Hit flag is asserted, and it is de-asserted after the execution is completed. This flag also informs the interconnect that a custom instruction is being executed. The wrapper module waits for the valid signal, which indicates that the result is ready, and returns the result along with the status on the accelerator master interface. By enabling the blocking mode of the IP core, the tready signal can be controlled manually.

Figure 4.13: Timing diagram of accelerators using CORDIC IP core (signals: Clk, Acc_s_axis_valid, Acc_s_axis_data, Acc_s_axis_tready, CXU_HIT, IP_s_axis_valid, IP_s_axis_data, IP_s_axis_ready, IP_m_axis_valid, IP_m_axis_ready, IP_m_axis_data, Acc_m_axis_valid, Acc_m_axis_ready, Acc_m_axis_data)

4.4.2 FFT

The FFT IP core has been configured with the radix-2 burst I/O architecture and a transform length of 256, which sets the number of elements in a data frame (N or k in equations 2.1 and 2.2). Therefore, the number of butterfly stages in the FFT algorithm is log2 256 = 8. The input to the FFT core has been designed to have 16 bits for both the real and the imaginary component. Therefore, the output also has 16-bit real and 16-bit imaginary components. The IP core has been configured for fixed-point computations.

The burst I/O architecture allows each data frame to be written and read in separate phases. Therefore, three custom instructions are required to configure, write, and read data from the FFT core. The design constraints set in the software manage the order of execution of these phases. In the write phase, the configuration and the inputs required for the computation are fed into the FFT core. During the read phase, the MicroBlaze reads the computed result using each instruction.
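Since the order of the three FFT instructions is left to the software, a host-side check of an intended instruction sequence can be sketched as follows. This is only an illustration under assumptions: the enum fft_op_t and the function fft_order_ok are hypothetical names, and the accepted pattern (one configuration instruction, then one or more input instructions, then one or more output instructions) is an assumed reading of the configure, write, read order, not a rule stated by the IP core documentation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The three kinds of FFT custom instruction (hypothetical names). */
typedef enum { FFT_CONFIG, FFT_INPUT, FFT_OUTPUT } fft_op_t;

/*
 * Returns true when the sequence for one data frame has the assumed shape:
 *   CONFIG, then at least one INPUT, then at least one OUTPUT,
 * with nothing before, between, or after.
 */
static bool fft_order_ok(const fft_op_t *ops, size_t n)
{
    size_t i = 0;
    size_t inputs = 0;
    size_t outputs = 0;

    assert(ops != NULL || n == 0);

    /* The frame must start with the configuration instruction. */
    if (n == 0 || ops[0] != FFT_CONFIG) {
        return false;
    }
    i = 1;

    /* Write phase: feed the input samples. */
    while (i < n && ops[i] == FFT_INPUT) {
        i++;
        inputs++;
    }

    /* Read phase: fetch the computed results. */
    while (i < n && ops[i] == FFT_OUTPUT) {
        i++;
        outputs++;
    }

    /* Every instruction must have been consumed by one of the phases. */
    return inputs > 0 && outputs > 0 && i == n;
}
```

Such a check could run in a test harness before the inline assembly sequence is issued, so that an out-of-order sequence is caught in software rather than producing an unexpected result from the accelerator.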
The wrapper module designed around the FFT core manages these independent phases along with the normal process flow, including sampling and error checking. When a bit stream with a valid CXU_ID is received, each phase of operation of the FFT acceleration is differentiated by its CF_ID with the help of a demultiplexer unit, as shown in figure 4.14. Since the IP core provides independent AXI interfaces for configuring, writing to, and reading from the core, gating the tvalid signal makes it possible to control each phase of operation. This allows tdata to be connected to the IP core directly, as tvalid prevents the IP core from reading the data. Similarly, tready in the slave interface of the wrapper has been made dependent on the control signals in the AXI interface of the IP core, as shown in the block diagram. For an output instruction, the acc_s_axis_tready flag depends on the tvalid in the output interface of the IP core. Because of the delay in computing the output, the wrapper module needs to receive the instruction and wait for the response. Therefore, the valid signal in the output interface has been logically OR'ed and forwarded to the demux.

The FFT wrapper writes back to the MicroBlaze only for the custom instruction that reads the output. In case of a write instruction, the error handling unit stores and accumulates the error. This means that the cf error and the operation error are accumulated in the error handling unit, as shown in figure 4.15, and are flushed out when a read instruction is processed. The ready flag in the master interface of the wrapper has been connected to the ready flag in the output interface. When the output instruction returns the result to the MicroBlaze, the ready flag is de-asserted so that the following data is not read.
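The accumulate-and-flush behaviour of the error handling unit described above can be modelled in a few lines of C. This is a behavioural sketch only, not the hardware description: the names err_unit_t, err_on_write and err_on_read are hypothetical, and the bit positions for the CF error (IF, bit 4) and the operation error (OP, bit 5) are assumptions taken from the cx_status layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cx_status bit positions for the two wrapper-level errors. */
#define CX_STATUS_IF (1u << 4) /* invalid CF_ID    */
#define CX_STATUS_OP (1u << 5) /* operation error  */

/* State of the error handling unit: the sticky, accumulated flags. */
typedef struct {
    uint32_t accrued;
} err_unit_t;

/* Write instruction: record the errors, nothing is returned to the core. */
static void err_on_write(err_unit_t *u, bool cf_error, bool op_error)
{
    assert(u != NULL);
    if (cf_error) {
        u->accrued |= CX_STATUS_IF;
    }
    if (op_error) {
        u->accrued |= CX_STATUS_OP;
    }
}

/*
 * Read instruction: merge the errors of this instruction with the
 * accumulated ones, report them as the status, and flush the unit.
 */
static uint32_t err_on_read(err_unit_t *u, bool cf_error, bool op_error)
{
    uint32_t status;

    err_on_write(u, cf_error, op_error);
    status = u->accrued;
    u->accrued = 0;
    return status;
}
```

In this model an error raised by a configuration or input instruction stays pending until the next output instruction, which is when the status reaches the MicroBlaze-V.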
Figure 4.14: Block diagram of accelerator design using the FFT IP core (blocks: Sampler, CF_ID demultiplexer, IP core with Configuration Interface, Input Interface and Output Interface, OR gate, Encoder; signals: acc_s_axis_tdata, acc_s_axis_tvalid, acc_s_axis_tready, acc_m_axis_tdata, acc_m_axis_tready, CXU valid, CF Error, cx_status, s_axis_config_tdata, s_axis_config_tvalid, s_axis_config_tready, s_axis_tdata, s_axis_tvalid, s_axis_tready, m_axis_tdata, m_axis_tvalid, m_axis_tready)

Figure 4.15: Accumulation of the errors by the error handling unit: (a) cf error, (b) op error

When executing an input or configuration instruction, the FFT wrapper follows the same methodology and timing. The timing of the FFT write process for a valid instruction is shown in figure 4.16. For an output instruction, the wrapper module needs to wait until the IP core generates the output. The timing behaviour of an output instruction can be seen in figure 4.17.

Figure 4.16: Timing diagram of input and config instructions in the FFT accelerator (signals: Clk, Acc_s_axis_valid, Acc_s_axis_data, Acc_s_axis_tready, CXU_HIT, IP_s_axis_valid, IP_s_axis_data, IP_s_axis_ready)

Figure 4.17: Timing diagram of output instruction in the FFT accelerator (signals: Clk, Acc_s_axis_valid, Acc_s_axis_data, Acc_s_axis_tready, CXU_HIT, IP_m_axis_valid, IP_m_axis_ready, IP_m_axis_data, Acc_m_axis_valid, Acc_m_axis_ready, Acc_m_axis_data)

4.5 Reference Model

It is necessary to compare the performance of the newly designed core with that of executing the same functions in software. For that reason, the trigonometric, vector translation, and FFT functions were written in the C programming language and run in the existing MicroBlaze-V environment. The standard C math library has been used for the trigonometric, square root and other mathematical operations involved in the calculations.
Moreover, care was taken that the program does not grow significantly in size, so that it fits inside the core memory. A few design decisions were needed to satisfy this requirement. One of them was to use the Xilinx-developed xil_printf() function instead of the normal printf() in order to reduce the size. The size of the LMB RAM had to be increased to fit the nested loop operations involved in the FFT calculation. Once the code was written, each program was compiled and built using Vitis, which generated an ELF file. This ELF file was imported into Vivado, which in turn allows the MicroBlaze design to communicate with the software code. Finally, this model was implemented and run on the KCU105 FPGA to analyze the performance metrics described in Section 2.9.

5 Results

In this section, we show the results obtained in this work. Through this work, AMD has gained a new interface in its MicroBlaze-V core for plugging in any accelerator that executes custom instructions. However, the accelerators must implement AXI-4 stream communication and must have a similar wrapper designed to interface with the MicroBlaze core.

In this thesis work, we compared the results obtained by running the custom instructions using accelerators with those obtained by running the same functions in the core without the accelerators. It was noted that the output value of the function written in C does not match the output of the IP core. This is expected, because the functions are implemented differently in the IP core. For example, the truncation parameters are specific to the FFT IP core and might not match the conventional algorithm followed when programming the same function in C. As a result, it is not ideal to compare the outputs of the operations; it makes more sense to compare the performance in terms of speedup, resource usage and the other metrics described in Section 2.9.
These metrics were recorded by running the design on the KCU105 FPGA at a clock frequency of 100 MHz and the obtained results are illustrated below. As mentioned earlier, these metrics were recorded for both the existing MicroBlaze-V design and the modified core with the CX extension and accelerators. These results are then compared to show the benefits achieved by having an interface for executing custom instructions.

5.1 Speedup

The speedup can be calculated from the execution times of the two MicroBlaze-V core designs. The execution time was recorded by counting the number of cycles using the CSR called cycles available in MicroBlaze-V. The cycles CSR is sampled at the beginning of the execution and again when the execution completes. The difference between the two samples gives the total number of cycles taken to execute the specific instruction, that is, the CPI for that instruction. The execution time is obtained by multiplying the CPI by the period of the system clock (10 ns). The reported execution times are given in table 5.1.

Table 5.1: Execution time of the given functions in the modified design and current design.
Function             Execution time,            Execution time,
                     CX-enabled MicroBlaze-V    MicroBlaze-V
Sin and Cos          0.31 us                    16.59 us
Vector Translation   0.31 us                    43.68 us
FFT                  57.69 us                   17.85 ms

It makes more sense to calculate the speedup for each function from the execution time. This can be done using equation 2.4. The calculated speedup values are plotted for comparison in figure 5.1.

Figure 5.1: Speedup comparison of functions from the original version and by using accelerators (x-axis: Functions, with Sin and Cos, Vect Trans, FFT; y-axis: Speedup, 0 to 350)

From the figure, it is clear that the execution time has improved significantly for all the functions. The accelerators make the functions faster by executing them in hardware, whereas in software they would take multiple loops to execute.
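The arithmetic behind these numbers can be reproduced with a short C sketch. It assumes that the speedup is the usual ratio of the reference execution time to the accelerated execution time (equation 2.4 itself is not repeated in this chapter), and the function names exec_time_us and speedup are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Execution time in microseconds from a cycle count and a clock period in ns. */
static double exec_time_us(uint64_t cycles, double clk_period_ns)
{
    assert(clk_period_ns > 0.0);
    return (double)cycles * clk_period_ns / 1000.0;
}

/*
 * Speedup as the ratio of the reference (software-only) execution time
 * to the accelerated execution time. Both times must use the same unit.
 */
static double speedup(double t_reference_us, double t_accelerated_us)
{
    assert(t_accelerated_us > 0.0);
    return t_reference_us / t_accelerated_us;
}
```

With the values of table 5.1, speedup(16.59, 0.31) gives roughly 53.5 for the sine and cosine function, speedup(43.68, 0.31) roughly 140.9 for vector translation, and speedup(17850.0, 57.69) roughly 309.4 for the FFT, where 17.85 ms has been converted to microseconds. At a 10 ns clock period, 0.31 us corresponds to 31 cycles.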
5.2 Hardware utilisation

The resource utilisation of the hardware after the addition of the CXUs and the interface is an essential parameter for the system cost. The resource utilisation of the system with and without accelerators, relative to the KCU105 FPGA, is shown in figure 5.2.

Figure 5.2: Resource utilisation percentage comparison (x-axis: Resources, with LUT, LUTRAM, FF, BRAM, DSP; y-axis: Utilization %, 0 to 3; series: Without CX and accelerators, With CX and accelerators)

This comparison shows the additional resources required to implement the CX extension and perform the selected computations. The extra usage of resources is expected, as the newly designed interface and accelerators require more flip-flops, LUTs, DSPs and other resources. For instance, the accelerators, including the FFT, have been configured to use DSPs for a faster response time. Since we consider a single extension that connects all the accelerators, it is sufficient to compare the resource utilisation.

5.3 Power Consumption

The extra hardware resources also consume more power, so the power should be reported as well, and it should be ensured that it stays well within the constraints and does not raise any warning in Vivado. Both dynamic and static power were compared between the two designs and the results are illustrated in figure 5.3.

Figure 5.3: Comparison of power consumption (x-axis: Dynamic Power, Static Power; y-axis: Power (W), 0 to 0.5; series: Without CX and accelerators, With CX and accelerators)

The implementation report from Vivado, shown in figure 5.3, describes the power dissipation in the FPGA board and does not reflect the actual power consumption of the core with acceleration during runtime. In this case, we should have considered the energy consumption instead, as this parameter would be expected to show the improvement with acceleration.
Our thesis work did not explore this area due to time constraints, which is one of its limitations.

The above results show that the introduction of the CX interface to the MicroBlaze-V core has enabled the execution of custom instructions with ease and at a higher performance. As the amount of hardware used is still within satisfactory limits, the design has been successful and is a possible addition to AMD's future customer releases.

6 Conclusion

The introduction of the CX interface to the MicroBlaze-V enables plug-and-play use of different accelerator modules. The current state of the CX extension in MicroBlaze-V supports any accelerator with an AXI4-stream interface. In the case of multiple accelerators, the wrapper module of section 4.4 can be redefined for the identification of the accelerator and for error handling. The instruction codes for enabling the accelerators are fed into the MicroBlaze-V core using inline assembly. When the module is run, the instructions along with the operands are taken in from the software, processed in the core and the accelerators, and the result is output. With the help of the reference model, the accelerated functions have been compared with the same functions implemented on MicroBlaze-V in C.

To implement complex functions including FFT and trigonometry, a long sequence of standard instructions has to be fed to the MicroBlaze-V. The same function can be executed with minimum CPU time by implementing a CX extension that uses external accelerators. An improvement of roughly 300 times in the execution time can be observed when using the FFT accelerator compared with the normal C functions. Similarly, improvements of roughly 50 times and 140 times in the CPU time can be observed for the trigonometric and vector translation functions, respectively. This improvement requires additional hardware resources in the FPGA, including FFs, DSPs, etc.
Therefore, the FPGA device dissipates more total power than the standard MicroBlaze-V implementation. After implementing the CX extension, the MicroBlaze-V is expected to consume less power for the custom function.

6.1 Challenges

Initially, investigating and proposing potential accelerators that are industrially relevant was challenging. We were planning to design an accelerator function from scratch. Because of the limited duration of this thesis work, and as the main focus was on enabling MicroBlaze-V to support custom instructions, we decided to use the standalone IP cores from AMD's Vivado library. Therefore, from an industrial perspective, our thesis work demonstrates the ease of connecting a standalone IP core to the MicroBlaze-V without extra effort.

Another challenge we faced was getting the software implementation of the accelerator functions to work. As these implementations, especially the FFT, require multiple loops, they require more memory space in BRAM, due to which the results were initially not obtained as expected. However, we managed to unroll the loops and increase the BRAM memory so that the expected output was obtained.

We also faced some difficulties when modifying the processor core to support the custom instructions, especially during the write-back to the registers and when updating the CSRs. Due to our inexperience and unfamiliarity with the MicroBlaze-V, we had to rely on our advisors at AMD for the necessary modifications in the core.

6.2 Future Work

The current state of the CX extension in MicroBlaze-V supports the implementation of FFT acceleration with a set of limitations defined in the software. That is, the custom instructions for the FFT must be executed in the chronological order of configuration, input, and output instructions. This execution order could instead be managed by the hardware, so that the user does not have to take care of it.
With the recent modification in the specification for enabling the CX extension, the status of execution of a custom instruction should be updated in the CSRs.

The designed interconnect block managing the number of accelerators can be released as a standalone IP core with peripheral support for the MicroBlaze-V, thereby allowing the user to seamlessly interface multiple accelerators. Furthermore, the wrapper module functions can be predefined in Vivado, enabling the users to install independent accelerators on the MicroBlaze-V.

References

[1] E. Cui, T. Li, and Q. Wei, “RISC-V Instruction Set Architecture Extensions: A Survey,” IEEE Access, vol. 11, pp. 24696–24711, 2023.
[2] “Draft proposed RISC-V composable custom extensions specification,” https://github.com/grayresearch/CX.
[3] Z. Li, W. Hu, and S. Chen, “Design and Implementation of CNN Custom Processor Based on RISC-V Architecture,” in 2019 IEEE 21st International Conference on High Performance Computing and Communications; IEEE 17th International Conference on Smart City; IEEE 5th International Conference on Data Science and Systems (HPCC/SmartCity/DSS), 2019, pp. 1945–1950.
[4] K.-D. Nguyen, D. T. Kiet, T.-T. Hoang, N. Q. N. Quynh, and C.-K. Pham, “A CORDIC-based Trigonometric Hardware Accelerator with Custom Instruction in 32-bit RISC-V System-on-Chip,” in 2021 IEEE Hot Chips 33 Symposium (HCS), 2021, pp. 1–13.
[5] “What is a field programmable gate array (FPGA)?” https://www.ibm.com/think/topics/field-programmable-gate-arrays.
[6] Vivado Design Suite User Guide, Advanced Micro Devices, Inc, 10 2023.
[7] M. Dubois, M. Annavaram, and P. Stenström, Parallel Computer Organization and Design, 1st ed. Cambridge University Press, 2012.
[8] J. L. Hennessy and D. A. Patterson, Computer Architecture - A Quantitative Approach, 6th ed. Morgan Kaufmann, 2019.
[9] MicroBlaze Processor Reference Guide (UG984), Advanced Micro Devices, Inc, 6 2021.
[10] I. Mhadhbi, S. Ben Othman, and S. Ben Saoud, “Impact of MicroBlaze FPGAs Design Methodologies of the Embedded Systems Performances,” in International Conference on Control, Engineering Information Technology, 2014, pp. 146–151.
[11] AMBA AXI-Stream Protocol Specification, ARM Limited, 4 2021.
[12] G. Park, T. Taing, and H. Kim, “High-speed FPGA-to-FPGA Interface for a Multi-Chip CNN Accelerator,” in 2023 20th International SoC Design Conference (ISOCC), 2023, pp. 333–334.
[13] CORDIC v6.0 Product Guide (PG105), Advanced Micro Devices, Inc, 8 2021.
[14] P. Duhamel and M. Vetterli, “Fast Fourier transforms: a tutorial review and a state of the art,” Signal processing, vol. 19, no. 4, pp. 259–299, 1990.
[15] J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation of complex Fourier series,” Mathematics of computation, vol. 19, no. 90, pp. 297–301, 1965.
[16] Fast Fourier Transform v9.1 Product Guide (PG109), Advanced Micro Devices, Inc, 5 2022.

A Appendix 1

A.1 Reference Model C code

A.1.1 Trigonometric Function

# include
# include

int main ()
{
    int angle = 0;

    // printing the sine value of angle
    xil_printf ("sin (%d) = %d", angle ,sin(angle));
    xil_printf ("cos (%d) = %d", angle ,co