[Cover figure: Vivado block diagram of the design. Blocks: microblaze_riscv_0 (MicroBlaze V), microblaze_riscv_0_local_memory, mdm_1 (MicroBlaze Debug Module (MDM) V), clk_wiz_1 (Clocking Wizard), Interconnect_0 (Interconnect_v1_0), acc1_trigonometric_0, acc2_vect_trans_0 and acc3_fft_0. Interconnect_0 has cx_m_axis and cx_s_axis ports, plus m_axis, s_axis and cxu_hit ports for each of the three accelerators.]

Adding a Composable Extension for
Custom Instructions to the MicroBlaze-V core

Master’s thesis in Embedded Electronic System Design

ARAVIND PRASANNANPILLAI SREEVILASAM

SHAILESH SURESH VELLOLI

Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2024


© ARAVIND PRASANNANPILLAI SREEVILASAM
SHAILESH SURESH VELLOLI, 2024.

Supervisor: Per Larsson Edefors, CSE Department
Company advisors: Goran Bilski and Tryggve Mathiesen, AMD
Examiner: Lena Peterson, CSE Department

Master’s Thesis 2024
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Block diagram of the design from Vivado

Typeset in LaTeX
Gothenburg, Sweden 2024



Adding a Composable Extension for Custom Instructions to the MicroBlaze-V core
ARAVIND PRASANNANPILLAI SREEVILASAM
SHAILESH SURESH VELLOLI
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg

Abstract
This report presents the design and implementation of a Composable Extension
(CX) for custom instructions in the MicroBlaze-V core, which is a customisable
RISC-V core offered by AMD. The implementation is done on a Field Programmable
Gate Array (FPGA) and the performance is evaluated with accelerators against the
current MicroBlaze-V design.

Integration of a new CX interface allows the designer to add any number of custom
instructions and accelerators according to the requirement. The accelerators can
either be designed from scratch or be built from existing Xilinx Intellectual
Property (IP) cores with additional parameters. In this project, existing IP cores
have been used as accelerators, as this demonstrates how easy it is to integrate
the IP cores with the interface design.

The accelerator functions were also programmed in software using C to compare and
analyze the performance of the CX extension in MicroBlaze-V. Different metrics like
speedup, resource utilisation and power consumption were considered to evaluate
the efficiency of the entire system. A significant performance improvement has been
observed with the accelerators at the expense of higher resource utilisation.

Keywords: Composable Extension (CX), Custom Instructions, Field Programmable
Gate Array (FPGA), MicroBlaze-V, RISC-V, Accelerators, Intellectual Property
(IP), Xilinx, Evaluation Metrics



Acknowledgements
We would like to extend our profound gratitude to each and every one, who has
helped us in the progress of this thesis work. First of all, we would like to thank
our supervisors at AMD, Goran Bilski and Tryggve Mathiesen without whom we
would not have been able to make significant progress in this thesis. They patiently
heard about the difficulties that we faced throughout the course of this project
and provided us with the necessary information to overcome them. We are also
grateful to our academic supervisor Per Larsson Edefors who guided us in the right
direction throughout our work. We would also like to express our gratitude towards
our examiner Lena Peterson for providing significant comments and feedback on
our work. We would also like to thank the management of Chalmers University of
Technology for providing all the facilities required to do our project. And lastly, we
would like to thank our parents and friends who helped us during the entire course
of this project and acted as a pillar of support and confidence.

Aravind Prasannanpillai Sreevilasam and Shailesh Suresh Velloli,
Gothenburg, September 2024



Contents

Glossary

1 Introduction
1.1 Related Work
1.2 Purpose and Goal
1.3 Thesis Outline

2 Technical Background
2.1 Field Programmable Gate Array
2.2 Pipelining
2.3 RISC-V
2.4 CX Extension
2.4.1 CX standard encoding
2.5 Control and Status Registers (CSRs)
2.6 AXI4-Stream Interface
2.7 Accelerators
2.8 Intellectual Property (IP) Core
2.8.1 CORDIC
2.8.2 Fast Fourier Transform (FFT)
2.9 Evaluation Metrics

3 Methods
3.1 Workflow
3.2 TestBench
3.3 Tools Used

4 Design
4.1 Overview
4.2 MicroBlaze-V
4.2.1 CX interface
4.2.2 CSR
4.3 Interconnect
4.3.1 Write Phase
4.3.2 Read Phase
4.4 Accelerators
4.4.1 CORDIC
4.4.2 FFT
4.5 Reference Model

5 Results
5.1 Speedup
5.2 Hardware utilisation
5.3 Power Consumption

6 Conclusion
6.1 Challenges
6.2 Future Work

Bibliography

A Appendix 1
A.1 Reference Model C code
A.1.1 Trigonometric Function
A.1.2 Vector Translation
A.1.3 FFT


Glossary

AXI An on-chip communication protocol used between two hardware modules.
CF A single custom instruction or a group of custom instructions.
CLB A basic logic block that executes complex logic functions and implements memory functions.
CPI The number of cycles divided by the number of instructions.
CPU A hardware component that is the core computation unit in a server.
CSR A special register that holds control and status information in a processor.
CX The interface contract of a composable extension, consisting of custom function instructions, CSRs, and their behavior.
CXU A hardware core that implements a composable extension.
FPGA A reconfigurable integrated circuit.
HDL A general term for the languages used to describe the behavior of hardware.
IP A reusable unit of logic or integrated circuit layout design.
ISA An abstract model that defines how software controls the CPU in a computer.
register A fast memory unit in the hardware that stores binary data.
VHDL A language used to describe hardware.



1
Introduction

In the continuously evolving world of CPU architectures, the pursuit of enhanced
performance and efficiency remains a constant driving force. As the demand for
specialised applications rises, it becomes increasingly important to have matching
hardware support. This thesis project focuses on advancing processor architecture
by integrating a proposed composable extension (CX) for custom instructions into
the MicroBlaze-V soft-core processor.
The RISC-V processor architecture is popular for its openness and flexibility [1].
Custom extensions in a RISC-V core are aimed at optimising the execution of
specialised tasks and accelerating them using dedicated hardware. However, the
custom opcode space in RISC-V is unmanaged, which makes it difficult to create an
ecosystem where different parties can publish and exchange their own composable
custom instructions. Open composition of this kind requires routine and robust
integration of elements authored by different parties into a stable system that
works together as a unit.
This thesis work involves not only the research, design, implementation and
evaluation of the proposed CX extension and its integration into the RISC-V core,
but also the incorporation of dedicated accelerators that offload and accelerate
critical tasks in the system. The work follows the design philosophy of AMD, which
has embraced RISC-V as a foundation of its processor designs. Through a thorough
study of the RISC-V Instruction Set Architecture (ISA), our work intends to identify
key areas of improvement and devise suitable solutions that take the energy
efficiency and performance of the overall system into account. This thesis hopes to
contribute valuable insights into customisable processor architectures, thereby
helping AMD’s technology remain a front-runner in the computing sector.

1.1 Related Work
The opcodes reserved for custom instructions in the RISC-V architecture enable the
integration of the CX extension. The design of the CX extension in this thesis work
is based on the proposed RISC-V Composable Custom Extensions specification [2].
The specification defines the procedure and the process flow for handling custom
instructions. It serves as the foundation for the thesis work, as it explains the
relevance of the CX extension, the parameters required, and how it functions as a
whole system. It describes how the instructions can be encoded and how they can be
interfaced to communicate with the processor, and this thesis work follows that
description. The specification also covers the required Control and Status
Registers (CSRs) and the logic interface used to connect and control the processor
core and the accelerators.
Several researchers have worked on running hardware accelerators on RISC-V cores.
In [3], a convolutional neural network was implemented on the RISC-V architecture
using the opcode space reserved for custom instructions. In [4], CORDIC-based
hardware accelerators were implemented on the VexRiscv CPU core, a 32-bit RISC-V
core, and their performance and resource utilisation were assessed in a way similar
to the metric analysis followed in this project. These works contributed
significantly to the design and implementation phases of our thesis work.

1.2 Purpose and Goal
For a set of new instructions to be defined as a standard extension, it must be of
general interest, have broad utility and be non-proprietary. Defining a new standard
extension is usually a long process managed by RISC-V International. The RISC-V
architecture also allows independent vendors to define their own CX extensions.
Sharing the custom opcode space among custom instructions is critical in a 32-bit
processor, and in some settings it obviates the need to transition to a higher-bit
processor.
In this thesis work, a CX extension interface will be designed and integrated into the
MicroBlaze-V soft-core processor for managing the custom instructions. Potential
hardware accelerators will be plugged in through an interconnect module so that
we can demonstrate the custom instructions and analyze the performance of the
core. The performance and resource utilisation of the implemented design will be
compared with the CPU without the accelerators, and the improvement will be
noted.

1.3 Thesis Outline
Chapter 2 describes the background knowledge required for the thesis. Chapter 3
presents the methods followed to carry out the thesis work and the tools used. A
detailed description of the design and its constraints is given in Chapter 4.
Chapter 5 focuses on the results obtained from the implementation of the design and
compares them with the software model with respect to different evaluation metrics.
Finally, Chapter 6 covers the challenges faced throughout the course of the work and
its future scope.



2
Technical Background

This chapter describes the specification standards used in this thesis work and the
background information required for the reader to understand the work done. The
chapter begins with an introduction to Field Programmable Gate Arrays (FPGAs),
followed by information on pipelining and the Reduced Instruction Set Computing
(RISC) architecture. The subsequent sections describe RISC-V, the MicroBlaze-V
soft-core and the proposed CX extension. While detailing the CX extension, we
describe the overall working of the extension, how the instructions are encoded and
processed in the MicroBlaze-V, and how the results and status are written back to
the corresponding registers. We then describe accelerators in general and the
candidate accelerators that could be used for the current work. The chapter ends
with the metrics considered for evaluation.

2.1 Field Programmable Gate Array
Field Programmable Gate Arrays (FPGAs) are semiconductor devices that can be
configured to meet the desired functionality or application. They contain
configurable logic blocks (CLBs) connected by a set of programmable interconnects,
which allows the designer to implement both simple and complex tasks. FPGAs include
different memory elements for digital storage, ranging from single-bit flip-flops
to very dense memory arrays. For tasks that can be parallelised in hardware, FPGAs
can provide better performance than a general-purpose CPU, and they can be
reprogrammed according to the requirement. This versatility and re-programmability
allow FPGAs to be used in various applications, including image processing,
wireless communications and medical diagnosis.
The FPGA was first introduced by Xilinx in 1985, and Xilinx continues to be one of
the leading manufacturers today. Modern FPGAs from Xilinx and from other firms such
as Intel and Altera offer a large range of features, such as impressive logic
densities, flash memory, embedded processors and digital signal processing (DSP)
blocks [5]. An FPGA is configured by altering its electrical inputs and outputs and
by determining how each resource is used and connected to form the hardware design.
From a software perspective, however, FPGA designs can be streamlined by using
pre-designed libraries of digital circuits and functions, also known as intellectual
property (IP) cores. These libraries are often offered by third-party suppliers and
FPGA vendors for purchase or lease, as is the case with AMD IP cores.


FPGAs are programmed by loading a bitstream that describes the configuration of
blocks such as the lookup tables, registers and other resources. The bitstream is
generated by a tool such as Vivado [6] from a design written in a Hardware
Description Language (HDL), for example, Very High-Speed Integrated Circuit (VHSIC)
Hardware Description Language (VHDL).

2.2 Pipelining
Pipelining is a technique used in digital systems, particularly in processor design,
to enhance performance by allowing multiple instructions to be processed
simultaneously rather than waiting for one instruction to complete execution before
starting the next [7]. The overall process is divided into stages, and each stage
in the pipeline performs a specific task. An instruction enters a stage as soon as
the previous instruction has moved on to the next stage, so several instructions
are in flight at once. This leads to more efficient usage of resources and
increases the overall throughput.

[Figure: basic 5-stage pipeline for independent loads, stores and register-to-register instructions, with the stages IF, ID, EX, ME and WB separated by the pipeline registers IF/ID, ID/EX, EX/ME and ME/WB.]

Figure 2.1: Five-stage processor pipelining [7]

The typical stages of a 5-stage pipeline, as shown in figure 2.1, are:
1. Fetch: The instruction is fetched from the memory, and the program counter,
which holds the address of the next instruction, is incremented.
2. Decode: The opcode of the instruction is decoded and the function to be
performed is established.
3. Execute: The actual operation is carried out.
4. Memory Access: The memory is accessed to load or store data, if the
instruction is a memory operation.
5. Write Back: The result of the operation is written to the destination register.
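
The overlap between the five stages listed above can be illustrated with a short C
sketch. It assumes an ideal pipeline without stalls or hazards, and the function
names (stage_of, pipelined_cycles, unpipelined_cycles, print_pipeline_diagram) are
illustrative only; they are not taken from [7] or from any tool used in this work.

```c
#include <assert.h>
#include <stdio.h>

#define NUM_STAGES 5

static const char *const STAGE_NAMES[NUM_STAGES] = {"IF", "ID", "EX", "ME", "WB"};

/* Stage occupied by instruction `instr` (0-based) in clock cycle `cycle`
 * (0-based) on an ideal pipeline, or -1 if the instruction is not in the
 * pipeline during that cycle. Instruction i enters IF in cycle i. */
int stage_of(int instr, int cycle)
{
    int stage = cycle - instr;
    return (stage >= 0 && stage < NUM_STAGES) ? stage : -1;
}

/* Cycles needed for n instructions on the ideal pipeline: the first one
 * takes NUM_STAGES cycles, every further one finishes one cycle later. */
int pipelined_cycles(int n)
{
    assert(n >= 0);
    return n == 0 ? 0 : NUM_STAGES + (n - 1);
}

/* Cycles needed when each instruction completes before the next one starts. */
int unpipelined_cycles(int n)
{
    assert(n >= 0);
    return n * NUM_STAGES;
}

/* Print one row per instruction and one column per clock cycle. */
void print_pipeline_diagram(int n)
{
    int total = pipelined_cycles(n);
    for (int i = 0; i < n; i++) {
        printf("I%d:", i + 1);
        for (int c = 0; c < total; c++) {
            int s = stage_of(i, c);
            printf(" %-2s", s >= 0 ? STAGE_NAMES[s] : "--");
        }
        printf("\n");
    }
}
```

Under these assumptions, four instructions finish in 8 cycles on the pipeline,
compared with 20 cycles when each instruction must complete before the next one
starts. Calling print_pipeline_diagram(4) prints the corresponding staircase of
stage names.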

However, when multiple instructions are executed at the same time, data hazards
can occur due to data dependencies between the instructions. In addition,
instructions like jump and branch can introduce control hazards, since they
disturb the order in which the instructions are expected to be executed. Techniques
such as branch prediction and out-of-order execution are implemented in modern
processors to mitigate these challenges [7].

2.3 RISC-V

RISC-V is an open-source ISA based on the RISC model. It was developed by
Krste Asanovic, Andrew Waterman and Yunsup Lee at the University of California,
Berkeley, in 2010 [3]. RISC-V builds on the RISC design approach, whose simple and
regular instructions are well suited to pipelined, high-speed implementations. It
has become very popular due to its openness and flexibility and is used in designs
ranging from small embedded processors to high-end processor configurations. It is
designed as a modular ISA composed of a small set of standard instructions and a
set of extensions that can be added according to the needs of the developer. RISC-V
provides both 32-bit and 64-bit instruction sets, as well as various extensions
such as floating point, multiply and accumulate, vectors, etc. [8].

MicroBlaze-V

AMD makes use of the RISC-V architecture in its MicroBlaze-V soft-core processor.
The MicroBlaze-V processor offers a wide variety of customisable and
easy-to-integrate microprocessor configurations based on the RISC Harvard model.
Owing to its flexibility, it is used in many areas, including the medical industry,
the automotive industry and communication markets. The MicroBlaze-V core is fully
integrated into AMD’s Vivado tool, which makes it easier to use its functionality
in our designs.

The MicroBlaze-V soft-core processor is a 32-bit processor with thirty-two 32-bit
general-purpose registers, a 32-bit program counter and a 32-bit address bus [9].
It includes a standard set of instructions and CSRs, but its flexibility allows the
developer to customise it and add more instructions as required. The architecture
of the MicroBlaze-V core is shown in figure 2.2.

[Figure: MicroBlaze core block diagram from the MicroBlaze Processor Reference Guide (UG984), showing the instruction-side and data-side bus interfaces (ILMB, DLMB, M_AXI_IP, M_AXI_DP, M_AXI_IC, M_AXI_DC), the AXI4-Stream ports (M0_AXIS..M15_AXIS, S0_AXIS..S15_AXIS), the program counter, instruction buffer and instruction decode, the 32 x 32b register file, the special purpose registers, the optional caches, branch target cache and MMU, and the execution units (ALU, shift, barrel shift, multiplier, divider, FPU).]

Figure 2.2: MicroBlaze Architecture [10]

AMD provides a few Fast Simplex Link (FSL) custom instructions that move data in
and out of the processor through the AXI4-Stream interface, but there is still
demand for a CX extension. The focus of this work is to integrate a Composable
Extension (CX) into the current MicroBlaze-V soft-core.

2.4 CX Extension

A composable extension (CX) is a group of named custom functions that bridges
software and hardware: it is the common contract between the software libraries
that use the extension and the hardware cores that implement it. CX multiplexing
enables the composition of a system from individually authored and versioned
components [2]. The Custom Functions (CFs), identified by their CF_IDs, are
executed on Composable Extension Units (CXUs), which are hardware units identified
by unique CXU_IDs. Each CXU operates on operands taken from registers or immediate
values and writes the result back to the destination register, while the status is
updated in the CSRs. The CSRs also contain the state_id, which represents the state
context of the CXU. The CXU logic interface defined between the CXUs and the CPU
manages the control flow of each instruction and its corresponding result.
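
The selection of a CXU by its CXU_ID and of a function by its CF_ID can be sketched
as a software model in C. This is only an illustration of the multiplexing idea
under simplifying assumptions: the structure names, the status codes and the two
example CXUs (cxu_arith and cxu_logic) are hypothetical and are not taken from the
specification [2] or from the design in this thesis, and the state context is left
out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative status codes; not the encoding used by the specification. */
enum cx_status { CX_OK = 0, CX_ERR_BAD_CXU = 1, CX_ERR_BAD_CF = 2 };

typedef struct {
    uint32_t cxu_id;   /* selects the hardware unit */
    uint32_t cf_id;    /* selects the custom function inside that unit */
    uint32_t rs1_val;  /* first operand */
    uint32_t rs2_val;  /* second operand (register or immediate value) */
} cx_request;

typedef struct {
    enum cx_status status;  /* would be reflected in the CSRs */
    uint32_t data;          /* value for the destination register */
} cx_response;

typedef cx_response (*cxu_fn)(const cx_request *req);

/* Hypothetical CXU 0: CF 0 adds, CF 1 subtracts. */
static cx_response cxu_arith(const cx_request *req)
{
    cx_response rsp = { CX_OK, 0 };
    switch (req->cf_id) {
    case 0:  rsp.data = req->rs1_val + req->rs2_val; break;
    case 1:  rsp.data = req->rs1_val - req->rs2_val; break;
    default: rsp.status = CX_ERR_BAD_CF; break;
    }
    return rsp;
}

/* Hypothetical CXU 1: CF 0 returns the bitwise AND of the operands. */
static cx_response cxu_logic(const cx_request *req)
{
    cx_response rsp = { CX_OK, 0 };
    if (req->cf_id == 0)
        rsp.data = req->rs1_val & req->rs2_val;
    else
        rsp.status = CX_ERR_BAD_CF;
    return rsp;
}

static const cxu_fn CXU_TABLE[] = { cxu_arith, cxu_logic };

/* Route a request to the CXU named by cxu_id and return its response. */
cx_response cx_dispatch(const cx_request *req)
{
    assert(req != NULL);
    size_t n = sizeof CXU_TABLE / sizeof CXU_TABLE[0];
    if (req->cxu_id >= n) {
        cx_response rsp = { CX_ERR_BAD_CXU, 0 };
        return rsp;
    }
    return CXU_TABLE[req->cxu_id](req);
}
```

The same CF_ID value can mean different things in different CXUs (CF 0 is an
addition in one unit and a bitwise AND in the other), which is the point of
selecting the unit first and the function second.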

2.4.1 CX standard encoding

When a particular CXU is selected, the software issues custom functions to the
configured CXU using different types of instruction encodings: R-type, I-type, and
flex type. For each encoding type, the instruction specifies the CF_ID, source
operands (register or immediate) and possibly a destination register in the encoded
format.


Custom-0 R-type encoding

This type of instruction encoding has two source register operands, a destination
register and a 10-bit CF_ID, as shown in figure 2.3.
The assembly instruction can be written as: cx_reg cf_id, rd, rs1, rs2

[Figure: excerpt from the specification showing the R-type bit fields: bits 31:25 cf_id[9:3], bits 24:20 rs2, bits 19:15 rs1, bits 14:12 cf_id[2:0], bits 11:7 rd, bits 6:0 custom-0 opcode (0001011).]

Figure 2.3: CX R-type instruction encoding [2]
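
As a minimal sketch of how the fields in figure 2.3 fit into a 32-bit instruction
word, the following C functions pack and unpack an R-type CX instruction. The
function names cx_reg_encode and cx_reg_cf_id are illustrative and are not part of
the specification or of any toolchain; the opcode value is the standard RISC-V
custom-0 major opcode.

```c
#include <assert.h>
#include <stdint.h>

#define OPCODE_CUSTOM0 0x0Bu  /* 0001011, RISC-V custom-0 major opcode */

/* Pack a CX R-type instruction word:
 * [31:25] cf_id[9:3] | [24:20] rs2 | [19:15] rs1 |
 * [14:12] cf_id[2:0] | [11:7] rd   | [6:0] opcode */
uint32_t cx_reg_encode(uint32_t cf_id, uint32_t rd, uint32_t rs1, uint32_t rs2)
{
    assert(cf_id < 1024u && rd < 32u && rs1 < 32u && rs2 < 32u);
    return ((cf_id >> 3) << 25) | (rs2 << 20) | (rs1 << 15) |
           ((cf_id & 0x7u) << 12) | (rd << 7) | OPCODE_CUSTOM0;
}

/* Recover the 10-bit CF_ID from the two places it is stored in the word. */
uint32_t cx_reg_cf_id(uint32_t insn)
{
    return (((insn >> 25) & 0x7Fu) << 3) | ((insn >> 12) & 0x7u);
}
```

For example, cx_reg_encode(5, 10, 11, 12), that is CF_ID 5 with rd = x10,
rs1 = x11 and rs2 = x12, yields the word 0x00C5D50B. The split CF_ID occupies the
positions that the standard R-type format uses for funct7 and funct3.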

Custom-1 I-type encoding

This type of instruction issues a CXU request for a zero-extended 4-bit CF_ID with
one source register operand and one sign-extended 8-bit immediate operand, and the
result is written back to the destination register.
The assembly instruction can be written as: cx_imm cf_id, rd, rs1, imm

The instruction encoding is shown in figure 2.4.

[Figure: excerpt from the specification showing the I-type bit fields: bits 31:24 imm[7:0], bits 23:20 cf_id[3:0], bits 19:15 rs1, bits 14:12 fixed to 000, bits 11:7 rd, bits 6:0 custom-1 opcode (0101011). The other seven values of bits 14:12 are reserved for future custom function instruction encodings.]

Figure 2.4: CX I-type instruction encoding [2]
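
The I-type layout in figure 2.4 differs from the R-type mainly in its short
immediate field. The sketch below packs an I-type CX instruction and shows how the
8-bit immediate is sign-extended when it is read back. The function names
cx_imm_encode and cx_imm_immediate are illustrative assumptions; the opcode value
is the standard RISC-V custom-1 major opcode.

```c
#include <assert.h>
#include <stdint.h>

#define OPCODE_CUSTOM1 0x2Bu  /* 0101011, RISC-V custom-1 major opcode */

/* Pack a CX I-type instruction word:
 * [31:24] imm[7:0] | [23:20] cf_id[3:0] | [19:15] rs1 |
 * [14:12] 000      | [11:7] rd          | [6:0] opcode */
uint32_t cx_imm_encode(uint32_t cf_id, uint32_t rd, uint32_t rs1, int32_t imm)
{
    assert(cf_id < 16u && rd < 32u && rs1 < 32u);
    assert(imm >= -128 && imm <= 127);
    return (((uint32_t)imm & 0xFFu) << 24) | (cf_id << 20) | (rs1 << 15) |
           (rd << 7) | OPCODE_CUSTOM1;
}

/* Read the immediate from bits 31:24 and sign-extend it from 8 to 32 bits. */
int32_t cx_imm_immediate(uint32_t insn)
{
    uint32_t imm8 = insn >> 24;
    return (imm8 & 0x80u) ? (int32_t)imm8 - 256 : (int32_t)imm8;
}
```

For example, cx_imm_encode(3, 10, 11, -2) yields 0xFE35852B, and
cx_imm_immediate applied to that word returns -2 again.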

Custom-2 flex-type encoding

This instruction issues a CXU request for a zero-extended 10-bit CF_ID with two
source register operands. No destination register is involved in this operation and
the response data is discarded.
The assembly instruction can be written as: cx_flex cf_id, rs1, rs2

The instruction encoding is shown in figure 2.5.

067111214151920242531

1101101customcf_id[2:0]rs1rs2cf_id[9:3]

custom-2

Figure 11. CX flex-type instruction encoding

Alternatively, equivalently, the cx_flex25 form of instruction issues an arbitrary 25-bit custom instruction.

06731

1101101custom

custom-2

Figure 12. CX flex-type instruction alternate encoding


A flex-type CF instruction may be used with a CXU-L2 request’s raw instruction field req_insn (3.4.5)
to provide an arbitrary 32-7=25-bit custom request to a CXU. The absence of an (integer) destination
register field is a feature that provides added, CPU-uninterpreted, custom instruction bits to a CXU.



One disadvantage of this approach: when the selected CXU routinely discards the R[rs1] or R[rs2]
operands, use of the flex-type custom function instruction can create a useless false dependency on the
rs1 and rs2 registers, which may uselessly delay issue of the CF instruction in an out-of-order CPU
core.

2.4. Custom function instruction execution via composable
extension multiplexing

Figure 13 illustrates how a custom function instruction and the CXU CSRs implement composable extension / CXU
composition via composable extension multiplexing. When the CPU issues a custom function instruction, it
produces a CXU request from the fields of the instruction, two source operands from the register file and/or an
immediate field of the instruction, and the cxu_id and state_id fields of mcx_selector. The CXU request may
include the request ID cookie (defined by the CPU), the CXU_ID, STATE_ID, raw instruction, CF_ID, and
operands. The CXU_ID identifies which CXU must process the request. The CXU includes state context(s) and a
datapath. The STATE_ID selects the state context to use for this request. The CXU checks for errors in CXU_ID,
STATE_ID, and CF_ID per 2.2.2, processes the request, possibly updating this state context, and produces a CXU
response, which may include the same request ID cookie, a success/error status, and the response data. The CPU
commits the custom function instruction by updating cx_status (when response status is an error condition) and
writing the response data to the destination register.


Figure 2.5: CX flex-type instruction encoding [2]
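The field layout of the flex-type encoding can be illustrated with a small
encoder. This is only a sketch: the function names encode_cx_flex and
decode_cx_flex are hypothetical, and the value of the custom bits [11:7]
defaults to 0 as an assumption, since the text does not state how these bits
are used.

```python
CUSTOM2_OPCODE = 0b1011011  # RISC-V custom-2 major opcode, bits [6:0]


def encode_cx_flex(cf_id, rs1, rs2, custom=0):
    """Pack a flex-type CF instruction into a 32-bit word (hypothetical helper)."""
    if not 0 <= cf_id < (1 << 10):
        raise ValueError("cf_id must fit in 10 bits")
    for name, value in (("rs1", rs1), ("rs2", rs2), ("custom", custom)):
        if not 0 <= value < 32:
            raise ValueError(f"{name} must fit in 5 bits")
    word = CUSTOM2_OPCODE
    word |= custom << 7            # bits [11:7], assumed 0 by default
    word |= (cf_id & 0x7) << 12    # cf_id[2:0] in bits [14:12]
    word |= rs1 << 15              # bits [19:15]
    word |= rs2 << 20              # bits [24:20]
    word |= (cf_id >> 3) << 25     # cf_id[9:3] in bits [31:25]
    return word


def decode_cx_flex(word):
    """Recover (cf_id, rs1, rs2) from a flex-type instruction word."""
    cf_id = ((word >> 12) & 0x7) | (((word >> 25) & 0x7F) << 3)
    return cf_id, (word >> 15) & 0x1F, (word >> 20) & 0x1F
```

For instance, decode_cx_flex(encode_cx_flex(5, 10, 11)) returns the original
triple (5, 10, 11), which shows that the split CF_ID field is reassembled
correctly.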


2.5 Control and Status Registers (CSRs)

Control and Status Registers (CSRs) are special registers in a processor that
monitor and manage the operation of the system. They store the control and
status information of the different units of the processor. Through them,
software sets parameters and initiates and controls operations in the
processor. They also provide feedback about the current state of the processor
and its hardware components by means of flags or status bits. Examples include
interrupt mask registers, interrupt status registers and error status registers.

The CSRs are crucial for debugging and understanding the state of the system as
they provide information about what the processor is doing at the given moment.
The CSRs control access to critical resources by enforcing certain privilege levels,
thus ensuring the safety and integrity of the system. Fine-tuning these CSRs can
optimise the performance of the system by enabling or disabling certain hardware
features.

The CSRs are handled at specific privilege levels, mainly in machine mode,
which has full access to all the controls in the processor. The CSRs are
accessed through dedicated instructions provided by the ISA of the processor.
The MicroBlaze-V core already implements a number of standard CSRs, such as
mstatus (describing the machine status) and misa (describing the supported
ISA) [9].

2.6 AXI4-Stream Interface

The Advanced eXtensible Interface (AXI) Stream protocol can be used as a standard
interface for exchanging data between components connected to each other [11]. It
facilitates high speed and efficient communication of data and this feature makes it
useful for this thesis work.

The AXI4-Stream interface conforms to the ARM AMBA AXI4-Stream Protocol
Specification [11]. It works on the principle of handshaking between the
transmitter and the receiver. The transmitter is treated as the Master and the
receiver as the Slave. The AXI4-Stream interface uses three main signals:

1. TVALID: sent out by the Master indicating that the datastream is valid and
it wants to send the data.

2. TREADY: sent by the Slave device indicating it is ready to receive the data.

3. TDATA: The datastream sent out through AXI Stream from the source to the
receiver.

A data transfer happens only in a clock cycle where both TVALID and TREADY are
asserted, irrespective of the order in which they were asserted, as shown in
figure 2.6. TREADY can be omitted in cases where the receiver can always accept
a transfer. In such cases, TREADY is assumed to be always HIGH.
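The handshake rule can be illustrated with a minimal cycle-by-cycle model. The
function name axi_stream_transfers and the list-based signal traces are
assumptions made for this sketch; it does not model the clock or any other
AXI4-Stream signal.

```python
def axi_stream_transfers(tvalid, tready, tdata):
    """Return (cycle, data) for every cycle in which a transfer takes place.

    Each argument is a per-cycle trace; a transfer occurs only in cycles
    where TVALID and TREADY are both high.
    """
    return [
        (cycle, data)
        for cycle, (valid, ready, data) in enumerate(zip(tvalid, tready, tdata))
        if valid and ready
    ]


# TREADY asserted before TVALID: the transfer occurs once TVALID goes high (T3).
ready_first = axi_stream_transfers(
    tvalid=[0, 0, 0, 1], tready=[0, 1, 1, 1], tdata=[None, None, None, 0xAB]
)

# TVALID and TREADY asserted in the same cycle: the transfer occurs in T2.
simultaneous = axi_stream_transfers(
    tvalid=[0, 0, 1, 0], tready=[0, 0, 1, 0], tdata=[None, None, 0xCD, None]
)
```

A receiver without a TREADY port corresponds to a tready trace that is 1 in
every cycle, so every cycle with TVALID high becomes a transfer.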


[Figure: two timing diagrams from the AXI4-Stream specification [11], each
showing ACLK, INFORMATION, TVALID and TREADY over the cycles T0 to T3:
(a) handshake with TREADY asserted before TVALID, where the transfer occurs at
T3; (b) handshake with TVALID and TREADY asserted simultaneously, where the
transfer occurs at T2.]

Figure 2.6: Timing Diagram of AXI4-Stream Data Transfer

In addition to these, a signal called TLAST can be used in some cases to
indicate packet boundaries. Asserting TLAST marks the final transfer of a
packet, and no further data belonging to that packet follows it.

2.7 Accelerators

The accelerators are used to accelerate a given task, which otherwise would require a
considerable amount of time to execute. The accelerators function as the hardware
units in our project, executing the custom function and writing the result and error
status back to the processor.

A system composed of a processor and a hardware accelerator retains software
programmability for the code that runs on the processor, while improving power
and performance for the functions that run on the hardware accelerator. Since
hardware accelerators are designed specifically to handle a particular function,
their resources can be matched to the precision that the operation requires,
whereas typical processors offer resources in multiples of 8 bits. Hardware
accelerators also reduce the execution time by avoiding the overhead of branch
prediction, data caching and other complex mechanisms involved in modern
processors. Moreover, as hardware accelerators provide only the hardware
resources required to perform the given operation, they also reduce the power
consumed in executing a stream of instructions.

FPGA-based accelerators are convenient due to their flexibility and
adaptability, since they can easily be reprogrammed to implement a different
function. Moreover, using an FPGA can reduce latency and power consumption
compared to using a CPU [12]. In this project, the accelerators are used to
demonstrate the newly integrated extension.


2.8 Intellectual Property (IP) Core
An Intellectual Property (IP) core is a standalone, reusable functional logic
unit that performs a complex task and can be used in several digital designs.
IP cores are developed using hardware description languages such as Verilog and
VHDL. The Xilinx library provides several IP cores that can be used as
plug-and-play modules, and a few of them are used in our design.

2.8.1 CORDIC
CORDIC stands for Coordinate Rotational Digital Computer. The algorithm was
initially developed by Volder to solve trigonometric equations iteratively,
and was later generalized by Walther to include further functions such as
hyperbolic functions and square roots [13].
The architecture of the CORDIC core can be configured in two ways:

• Fully parallel configuration with single-cycle data throughput: In this case,
the CORDIC core implements the operations in parallel using an array of
shift-addsub stages.

• Word serial implementation with multiple-cycle throughput: In this case, the
shift-addsub operations are performed serially using a single shift-addsub
stage in a feedback loop.
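The iterative shift-addsub principle can be sketched in software. The
following is a floating-point model of rotation-mode CORDIC written for
illustration only; it is not the implementation of the Xilinx core, which
works on fixed-point data. The function name cordic_sin_cos, the default of 16
iterations and the restriction of the input to [−π/2, π/2] are assumptions of
this sketch. Each loop pass corresponds to one shift-addsub stage.

```python
import math


def cordic_sin_cos(theta, iterations=16):
    """Rotation-mode CORDIC model for theta in [-pi/2, pi/2] (illustrative)."""
    # Elementary rotation angles atan(2^-i), one per shift-addsub stage.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]
    # Accumulated gain of the pseudo-rotations, compensated up front.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = 1.0 / gain, 0.0, theta
    for i in range(iterations):
        d = 1.0 if z >= 0 else -1.0          # rotate towards z = 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * angles[i]
    return y, x                               # (sine, cosine)


sine, cosine = cordic_sin_cos(0.5)
```

In the word serial configuration the loop body would be executed once per clock
cycle by a single stage, whereas the fully parallel configuration would unroll
the loop into an array of stages.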

The CORDIC IP core can be used to implement many different mathematical
functions, including trigonometric functions, square root, hyperbolic functions
and rectangular-polar conversions. The respective function can be configured in
the IP core and the operands can be selected accordingly, depending on whether
a phase or a Cartesian operand is needed.
The pin diagram of the CORDIC IP core is shown in figure 2.7.

[Figure: CORDIC symbol and pinout. AXI4-Stream slave channels:
s_axis_cartesian (tdata, tvalid, tready, tuser, tlast) and s_axis_phase (tdata,
tvalid, tready, tuser, tlast). AXI4-Stream master channel: m_axis_dout (tdata,
tvalid, tready, tuser, tlast). Clock and control: aclk, aresetn, aclken.]

Figure 2.7: Pin Diagram of CORDIC IP core [13]


Trigonometric Functions

The trigonometric functions like sine, cosine, tangent, etc. have a wide range of
applications in our day-to-day lives. The applications range from astronomy where
it is used to find the distance of Earth from other planets and stars, to navigation,
construction sites, marine engineering etc.

The CORDIC IP core calculates the sine and cosine of the given phase value (in
radians). Since we need the phase value as input, only the phase input parameters
are enabled in this case. The coarse rotation module in the IP core limits the value
of input angle between −π and +π. The input angle is expressed as a fixed-point
two’s complement number with an integer width of 3 bits while the output vector
is expressed as a pair of fixed-point two’s complement numbers with integer width
of 2 bits [13]. The IP core gives both the sine and cosine values of the phase value
in a single output vector of 32 bits with the lower 16 bits representing sine and the
higher 16 bits representing cosine. The output format is given in figure 2.8.
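The packing of the 32-bit output vector can be illustrated with a small
unpacking routine. This is a sketch under stated assumptions: the names
to_signed16 and unpack_sin_cos are hypothetical, and the 14 fractional bits
are derived here from a 16-bit field with an integer width of 2 bits rather
than taken from the core documentation.

```python
def to_signed16(value):
    """Interpret a 16-bit field as a two's complement integer."""
    return value - 0x10000 if value & 0x8000 else value


def unpack_sin_cos(word, frac_bits=14):
    """Split a 32-bit output word into (sine, cosine) as floats.

    Assumes sine in bits 15:0 and cosine in bits 31:16, each a fixed-point
    two's complement number with frac_bits fractional bits.
    """
    scale = float(1 << frac_bits)
    sine = to_signed16(word & 0xFFFF) / scale
    cosine = to_signed16((word >> 16) & 0xFFFF) / scale
    return sine, cosine
```

With these assumptions, the field value 0x4000 represents +1.0 and 0xC000
represents −1.0.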

Bits    0 to 15   16 to 31
Field   SINE      COSINE

Figure 2.8: Output vector format of the sine and cosine function in CORDIC

Vector Translation

The vector translation function to convert rectangular to polar coordinates is used
in different areas which include navigation, robotics and signal processing.

The CORDIC IP core performs the vector translation operation by rotating
the input vector around the circle through an angle θ until the Y component
equals zero. The scaled magnitude and the phase of the input vector are
obtained as outputs. In this case, both the Cartesian and phase input
parameters are enabled. The vector translation shows linear behaviour with
respect to magnitude. The number of significant magnitude bits of the input
vector limits the accuracy of the phase output from CORDIC. The input and
output representation is similar to that of the trigonometric function: the
output magnitude is expressed as a fixed-point two's complement number with an
integer width of 2 bits and the phase angle as a fixed-point two's complement
number with an integer width of 3 bits. The output vector again has 32 bits,
with the magnitude occupying the lower 16 bits and the phase value the higher
16 bits. The output vector representation is shown in figure 2.9.
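The rotation towards Y = 0 can be sketched as vectoring-mode CORDIC. As
before, this is a floating-point illustration and not the Xilinx
implementation; the name cordic_translate is hypothetical, the input is
assumed to lie in the right half-plane (x > 0), and the CORDIC gain is divided
out at the end so that the returned magnitude is unscaled.

```python
import math


def cordic_translate(x, y, iterations=16):
    """Vectoring-mode CORDIC model: return (magnitude, phase) for x > 0."""
    z = 0.0
    gain = 1.0
    for i in range(iterations):
        d = -1.0 if y >= 0 else 1.0          # rotate so that y moves towards 0
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atan(2.0 ** -i)        # accumulate the applied rotation
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))
    return x / gain, z


magnitude, phase = cordic_translate(3.0, 4.0)
```

For the input (3, 4) the model converges towards a magnitude of 5 and a phase
of atan2(4, 3).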

Bits    0 to 15     16 to 31
Field   MAGNITUDE   PHASE

Figure 2.9: Output vector format of the Translate function in CORDIC


The CORDIC IP core uses the AXI4-Stream protocol for receiving inputs and
sending out results. It works by basic handshaking between the tvalid and
tready signals. The AXI4-Stream interface of the CORDIC core can be configured
in Non-blocking or Blocking mode. In Non-blocking mode there is no tready
signal, and tready is assumed to be always asserted. In Blocking mode the
tready signal is present, which makes it possible to control the data flow
through the core and ensures that the output buffer is not overloaded with
data.

2.8.2 Fast Fourier Transform (FFT)

FFT is a computationally efficient algorithm for computing the Discrete Fourier
Transform (DFT) of a signal [14]. The FFT IP core provided in the Xilinx library
uses the Cooley-Tukey FFT algorithm [15] for calculating the DFT. The Cooley-
Tukey FFT algorithm is one of the most widely used methods for calculating the
DFT.

The DFT X(k), k = 0, ...N − 1 of the sequence x(n), n = 0, ...N − 1 is defined as
(2.1), where N is the transform length. The Inverse Discrete Fourier Transform
(IDFT) is given by (2.2).

\[
X(k) = \sum_{n=0}^{N-1} x(n)\, e^{-jnk2\pi/N} \tag{2.1}
\]

\[
x(n) = \frac{1}{N} \sum_{k=0}^{N-1} X(k)\, e^{jnk2\pi/N} \tag{2.2}
\]
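The two definitions can be evaluated directly, with the sums running over
n, k = 0, ..., N − 1 as in the sequence definitions. The functions dft and idft
below are an illustrative reference model with O(N^2) cost, not part of the
thesis implementation.

```python
import cmath


def dft(x):
    """Direct evaluation of the DFT definition (2.1)."""
    N = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * cmath.pi * n * k / N) for n in range(N))
        for k in range(N)
    ]


def idft(X):
    """Direct evaluation of the IDFT definition (2.2)."""
    N = len(X)
    return [
        sum(X[k] * cmath.exp(2j * cmath.pi * n * k / N) for k in range(N)) / N
        for n in range(N)
    ]
```

Applying idft to the result of dft recovers the original sequence up to
floating-point rounding, which makes the pair useful as a reference when
checking a faster implementation.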

Cooley-Tukey FFT algorithm

The Cooley-Tukey algorithm decomposes a DFT of large size into smaller
sub-DFTs and combines their results. This divide-and-conquer strategy reduces
the computational complexity of the DFT from O(N^2) to O(N log N). The radix-2
algorithm, the most common variant of the Cooley-Tukey algorithm, recursively
decomposes each DFT into two DFTs of half the size until the computations can
be performed directly.
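The recursive halving can be sketched as follows. This is a textbook
decimation-in-time radix-2 formulation given for illustration; the function
name fft_radix2 is hypothetical and the sketch says nothing about how the
Xilinx core organises its computation internally.

```python
import cmath


def fft_radix2(x):
    """Recursive radix-2 Cooley-Tukey FFT for power-of-two lengths."""
    N = len(x)
    if N == 0 or N & (N - 1):
        raise ValueError("length must be a power of two")
    if N == 1:
        return [complex(x[0])]
    even = fft_radix2(x[0::2])   # DFT of the even-indexed samples
    odd = fft_radix2(x[1::2])    # DFT of the odd-indexed samples
    out = [0j] * N
    for k in range(N // 2):
        # Butterfly: combine the two half-size DFTs with a twiddle factor.
        t = cmath.exp(-2j * cmath.pi * k / N) * odd[k]
        out[k] = even[k] + t
        out[k + N // 2] = even[k] - t
    return out
```

Each level of recursion performs N/2 butterflies and there are log2(N) levels,
which is where the O(N log N) cost comes from.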

Xilinx FFT IP core

The FFT IP core has two input channels, one for the input data and one for the
configuration instructions, and one output channel that provides the output
data. The pin diagram of the FFT IP core is given in figure 2.10.


[Figure: FFT core schematic symbol. AXI4-Stream slave channels: s_axis_config
(tdata, tvalid, tready) and s_axis_data (tdata, tvalid, tready, tlast).
AXI4-Stream master channels: m_axis_data (tdata, tvalid, tready, tuser, tlast)
and m_axis_status (tdata, tvalid, tready). Clock and control: aclk, aresetn,
aclken. Event signals: event_frame_started, event_tlast_unexpected,
event_tlast_missing, event_fft_overflow, event_data_in_channel_halt,
event_data_out_channel_halt, event_status_channel_halt.]

Figure 2.10: Pin Diagram of FFT IP core. [16]

The Xilinx FFT IP core provides four architectures for implementing the FFT
computation: a pipelined streaming I/O architecture, which processes the data
continuously, and three burst I/O architectures (radix-4, radix-2 and radix-2
lite), which process independent data frames. In the pipelined streaming
solution, the Decimation in Frequency (DIF) method is used for the FFT
computation and several radix-2 butterfly processing engines implement
continuous data processing. Each processing engine has its own memory units to
store the input and the intermediate results. The burst I/O architectures use
the Decimation in Time (DIT) method for the FFT computation. In a burst I/O
architecture, the FFT core first loads the data, then computes the transform
and finally unloads the result. The radix-4 and radix-2 architectures use
radix-4 and radix-2 butterfly structures, respectively, for the computations.
The radix-2 lite architecture uses the same radix-2 butterfly processing
engine, but with one shared adder/subtractor, which reduces resources at the
expense of additional delay [16].
The Radix-2 Burst I/O architecture is illustrated in figure 2.11.

The transform length will determine the number of stages for the FFT algorithm.
The transform length can be configured during runtime and in the design stage of
the FFT IP core. The configuration port in the core enables the selection of FFT
and Inverse Fast Fourier Transform (IFFT) as well as the scaling performed in each
stage.

The streaming I/O architecture gives the highest throughput but uses the most
resources, while the radix-2 lite burst I/O architecture has the lowest
throughput. The trade-off between resources and throughput for each
architecture is shown in figure 2.12.


[Figure: Radix-2 Burst I/O architecture. The input data and output data are
connected through switches to two data memories (Data RAM 0 and Data RAM 1),
which feed a radix-2 butterfly together with a ROM for the twiddle factors.]

Figure 2.11: Radix-2 Burst I/O architecture [16].

[Figure: plot of resources against throughput. In order of increasing resources
and throughput: Radix-2 Lite Burst I/O, Radix-2 Burst I/O, Radix-4 Burst I/O,
Streaming architecture.]

Figure 2.12: Resource versus throughput for different architecture types.

The data formats of the input, output and configuration instructions are shown
in figure 2.13.


(a) Input and Output instructions

Bits    0 to 15   16 to 31
Field   XK_RE     XK_IM

(b) Config instruction: carries the FWD bit, with the remaining bits set to 0

Figure 2.13: Data format of instructions in FFT IP core

Here, XK_RE and XK_IM represent the real and imaginary parts of the data, while
the FWD bit selects whether an FFT or an IFFT is performed.

2.9 Evaluation Metrics
Performance is an important concern whenever something is integrated into an
existing processor design. It is of interest to assess the benefit of running a
certain function in hardware alongside the processor, rather than running the
same function in software. The performance of the processor and the hardware
usage can be evaluated using various metrics. Some of the commonly used metrics
are given below:

1. Cycles Per Instruction (CPI) - This denotes the average number of clock
cycles that each instruction takes to execute on the machine [7].

\[
\mathrm{CPI} = \frac{T_{exe}}{IC \cdot T_C} \tag{2.3}
\]

where Texe is the execution time, IC is the instruction count and TC is the
machine cycle time.

2. Speedup - The speedup of a machine over a reference machine is defined as
the ratio of the execution time that a program takes on the reference
machine to the execution time that the same program takes on the machine
being evaluated.

\[
\mathrm{Speedup} = \frac{T_{exe,Ref}}{T_{exe}} \tag{2.4}
\]

where Texe,Ref is the execution time of the program on the reference machine
and Texe is the execution time of the program on the machine that is being
evaluated [7].

3. Resource utilisation - This denotes the amount of hardware usage for exe-
cuting a program in the specified machine.

4. Power Consumption - This denotes the amount of power consumed (both
static and dynamic) for executing a program in the specified machine.
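The first two metrics are simple ratios and can be computed as shown below.
The function names and all numbers in the example are hypothetical and were
chosen only to illustrate the arithmetic; they are not measurements from this
work.

```python
def cpi(t_exe, instruction_count, cycle_time):
    """Cycles per instruction: execution time over (instruction count x cycle time)."""
    return t_exe / (instruction_count * cycle_time)


def speedup(t_exe_ref, t_exe):
    """Speedup of the evaluated machine relative to the reference machine."""
    return t_exe_ref / t_exe


# Hypothetical example: 100 000 instructions executed in 2 ms at a 10 ns
# cycle time (100 MHz) give a CPI of about 2.
example_cpi = cpi(t_exe=2e-3, instruction_count=100_000, cycle_time=10e-9)

# Hypothetical example: a program taking 8 ms on the reference machine and
# 2 ms on the evaluated machine gives a speedup of about 4.
example_speedup = speedup(t_exe_ref=8e-3, t_exe=2e-3)
```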



3
Methods

This chapter describes the overall process flow of the thesis work. This is followed
by an overview of the tools used for the design and implementation of the prototype
and to test the entire module on an FPGA.

3.1 Workflow
The first step in our thesis work was to understand the MicroBlaze-V core and
the relevance of integrating a CX extension into it. The initial few weeks of
the work were invested in a literature survey on custom instructions and the
proposed specification for the CX extension. Once we had a good understanding
of the work to be done, we started by designing a block diagram of the entire
system. We designed an interface for the CX extension on paper, to be
implemented in RTL and later integrated into the MicroBlaze core. In parallel,
we surveyed candidate accelerators that could be used for demonstrating the CX
extension.
After the design had been completed on paper, we proceeded to work on the RTL
implementation of the selected accelerators and tested that they functioned as
a unit using a testbench. We then moved on to developing the RTL for the design
and verifying its behaviour and timing using Vivado. After verification, the
design was implemented using Vivado, and the power and timing constraints were
checked to make sure that the requirements were met. Later, the newly developed
interface was integrated into the existing MicroBlaze-V core and the same
process of RTL design and verification was repeated. After the bugs and timing
issues were fixed, we implemented the entire core design on the KCU105 FPGA to
test the behaviour on hardware.
In parallel to the RTL design, we also worked on the software implementation of
the accelerators using C. C programs for computing sine and cosine, the
magnitude and phase of a vector, and the FFT were written and built in Vitis
and later simulated using QuestaSim. In order to record the output results, we
logged the results from the waveform to an output file via UART by using a TCL
script. The execution times of all three programs were noted so that they could
be compared with the execution times of the accelerated versions on the
CX-enabled MicroBlaze-V. The resource utilisation of the core with and without
the interface and accelerators was also reported for further comparison.
The entire plan of the project was prepared in an Excel sheet with all the
tasks and sub-tasks and the person assigned to work on each specific task. In
addition to this, we had a weekly sync meeting with our advisors at AMD every
Tuesday, where we reported the work done in the previous week and discussed our
plans for moving forward.

3.2 TestBench
A top-level test bench was designed for the functional verification of the CX interface
along with the interconnect and accelerator modules. The test bench was designed
to imitate the software libraries which supply the instructions. The test vectors are
generated using a Python script by concatenating the input of the IP core along
with the CXU_ID and CF_ID. The input to the IP core was extracted from the
inbuilt test bench of the IP core library. The test also features a self-checking system
where the computed result from the accelerators is compared against the expected
result obtained from the IP core library.
The test bench was written to check for errors by itself using the assert and report
statements in VHDL. The test bench design can be explained clearly using the
flowchart given in figure 3.1.
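The self-checking principle of the test bench can be sketched in software as
follows. This is not the VHDL test bench itself: the function name
run_self_checking_test is hypothetical, and the accelerator argument is a
stand-in callable for the design under test.

```python
def run_self_checking_test(test_vectors, expected_outputs, accelerator):
    """Feed each test vector to the accelerator and compare with the expectation.

    Raises AssertionError on the first mismatch, in the spirit of a VHDL
    assert/report statement; returns the number of vectors that passed.
    """
    passed = 0
    for i, (vector, expected) in enumerate(zip(test_vectors, expected_outputs)):
        actual = accelerator(vector)
        if actual != expected:
            raise AssertionError(
                f"Output mismatch at vector {i}: expected {expected!r}, got {actual!r}"
            )
        passed += 1
    return passed
```

A trivial stand-in such as accelerator=lambda v: 2 * v is sufficient to
exercise both the passing path and the mismatch path of the loop.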

3.3 Tools Used
The following tools given in table 3.1 were used during the thesis work.

Table 3.1: Software and hardware tools used in the thesis work

Tool                               Purpose
---------------------------------  -------------------------------------------------
Emacs                              Text editor for the VHDL code.
QuestaSim                          To perform functional verification of the design.
Vivado                             To implement the design on the FPGA.
Vitis                              Environment to test the functionality in software.
Python                             Used to generate the test vectors for the testbench.
Xilinx Kintex Ultrascale KCU105    FPGA board used to run the program.
Latex                              For documentation.
Inkscape                           For sketching.


[Figure: flowchart. Start; i = 0; feed test vector at i; accelerator process;
decision "Expected Output = Actual Output?". Yes: i = i + 1 and the next test
vector is fed. No: "Output mismatch. Error". End.]

Figure 3.1: Top-level TestBench Design Flow



4
Design

In this chapter, the design of each component facilitating the implementation
of the CX extension is described in detail, including the process flow. First,
an overview of the design is given in Section 4.1, followed in Section 4.2 by
the modifications to MicroBlaze-V, including the pipeline flow, the design of
the CX interface and the additional CSRs required for the implementation of CX.
In Section 4.3, the Interconnect that acts as a bridge between the CPU and the
accelerators is described in detail. Section 4.4 introduces the design of the
implemented accelerators and, last, the reference model used for the
performance evaluation is described in Section 4.5.

4.1 Overview
Implementing the CX extension in the RISC-V architecture involves modifications
to the MicroBlaze-V core and the integration of software libraries and hardware
modules. The modifications to the core include the definition of an additional
hardware module (the CX interface), the definition of new CSRs, and the
required changes in the decode and write-back pipeline stages to handle the
execution of custom instructions. The custom instructions required to execute a
custom function are provided with the help of software libraries, which
generate the instruction code and feed it to the IF stage of the RISC-V
architecture. When custom instructions are decoded, the CX interface handles
their execution with the help of external accelerators.
AMD’s support for the MicroBlaze RISC-V processor allows the hardware
specification defined in Vivado to be combined with the software code built in
the Vitis IDE. The Electronic Design Automation (EDA) support in Vitis allows
the current hardware design to be exported as a platform, on which applications
can be built. Such an application generates the required instruction codes,
written as inline assembly in C, in ELF format. By importing the generated ELF
file into the Vivado environment, MicroBlaze-V can be implemented on the FPGA
along with the custom instructions. The generated instruction codes are stored
in the external memory (LMB RAM) of the MicroBlaze-V. While the design runs on
hardware, one instruction is fetched in each clock cycle and executed by the
MicroBlaze-V processor.
To demonstrate the CX extension, the FFT and CORDIC function accelerators are
considered as CXUs. Unique CXU_IDs will be defined for the accelerators along
with function IDs (CF_ID) for each custom instruction. These defined IDs are
fetched during execution through the custom instructions and CSRs. As shown in
figure 4.1, the CXU_ID will be fetched from the CSR called mcx_selector while
the CF_ID will be fetched from the instruction itself.

Figure 4.1: Block diagram of the overall design.

As the accelerators considered do not have any state context, state ID is ignored.
This information along with the operand values will be sent to the CXU after veri-
fying the version compatibility. When the CXU completes its execution, the result
along with the status will be written back to the MicroBlaze core.

4.2 MicroBlaze-V

In this thesis work, the MicroBlaze-V core has been modified for the
implementation of the CX extension. In order to handle the execution of custom
instructions, a hardware module called the CX interface has been designed and
integrated into the processor core. To support this new module, some of the
other components also had to be modified. This includes integrating the new
CSRs and updating the decoder module to support the custom instructions. Figure
4.2 shows the architecture of the MicroBlaze-V core as modified to handle the
CX extension.


[Figure: block diagram of the MicroBlaze-V core. Instruction side:
instruction-side bus interface (M_AXI_IP, ILMB), I-Cache, program counter,
branch target cache, instruction buffer, instruction decode and compressed
instruction decode. Execution units: control and status registers, multiplier,
divider, ALU, barrel shifter, FPU, bit manipulation, CX_Interface, register
file (32 registers) and floating point register file (32 registers). Data side:
data-side bus interface (M_AXI_DC, M_ACE_DC, M_AXI_DP, DLMB) and D-Cache.
Legend: optional feature; modifications for CX extension.]

Figure 4.2: Architecture of the modified MicroBlaze-V core. The highlighted
modules in yellow show the blocks that are modified to integrate the CX interface.

4.2.1 CX interface
The CX interface designed in MicroBlaze-V handles the execution of custom
instructions through the series of operations shown in figure 4.3. When the ID
stage decodes a custom instruction, the operand values and the CF_ID from the
instruction code, along with the CXU_ID and version from the CSRs, are
forwarded to the CX interface. The CX interface takes this information and
compares the version in the CSR with the MicroBlaze version. If the versions
match, the operand values along with the identifiers (CXU_ID, CF_ID) are
encoded and transmitted via the AXI4-Stream interface. In case of an incorrect
version, the IV flag in the CSR is updated and the execution of the custom
instruction is dropped. After the transmission of the data, the pipeline is
stalled while the MicroBlaze-V waits for the result.

Figure 4.3: Workflow of the CX interface
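The decision flow described above can be summarised in a short behavioural
sketch. It is not the RTL of the CX interface: the class CxStatus, the function
issue_custom_instruction, the dictionary used as payload and the send callable
standing in for the AXI4-Stream master are all hypothetical names introduced
for illustration, and the real payload encoding is the one defined by the
design.

```python
from dataclasses import dataclass


@dataclass
class CxStatus:
    """Hypothetical stand-in for the status CSR; iv models the IV flag."""
    iv: bool = False


def issue_custom_instruction(csr_version, core_version, cxu_id, cf_id,
                             operand1, operand2, status, send):
    """Return True if the request was sent and the pipeline must wait for a result."""
    if csr_version != core_version:
        status.iv = True      # incorrect version: flag it and drop the instruction
        return False
    # Versions match: forward identifiers and operands to the accelerator side.
    send({"cxu_id": cxu_id, "cf_id": cf_id,
          "operand1": operand1, "operand2": operand2})
    return True
```

A caller would stall its model of the pipeline whenever the function returns
True and resume it once the response has been received.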

The functional block diagram of the CX interface is shown in figure 4.4. The
execution of a custom instruction is controlled by the EX_CX_Instr register,
which is asserted when a custom instruction is decoded. EX_CX_Firstcycle
denotes the first cycle in the EX stage and helps with the handshake signals of
the AXI4-Stream protocol. EX_CX_Write determines whether the custom instruction
expects a result or not. If this signal is low during the execution of a custom
instruction, the CX interface treats the instruction as a write instruction and
does not wait for a response. EX_Piperun indicates whether the pipeline is
currently in the EX stage. The EX_CX_Stall flag is used to stall the pipeline
and EX_CX_Result contains the result of the execution. EX_MCX_Sel and
EX_New_MCX_Status are the CSRs designed for the CX extension, and the
EX_Write_MCX_Status flag is used as an enable signal for writing to the CSRs.
The CX interface communicates with the accelerators via the AXI4-Stream
protocol. The master interface is used to write data and the slave interface to
read data. The control and data signals of the master and slave AXI interfaces
can also be seen in the block diagram.

[Figure: CX_Interface block with the signals Clk, Reset, EX_CX_Instr,
EX_CX_Write, EX_CX_Firstcycle, EX_Piperun, EX_MCX_Sel, EX_New_MCX_Status,
EX_Write_MCX_Status, EX_CX_Stall and EX_CX_Result, the AXI4-Stream master
signals CX_M_AXIS_Tdata(87 downto 0), CX_M_AXIS_Tvalid and CX_M_AXIS_Tready,
and the AXI4-Stream slave signals CX_S_AXIS_Tdata(39 downto 0),
CX_S_AXIS_Tvalid and CX_S_AXIS_Tready.]

Figure 4.4: Design of CX interface with signals

As per the custom instruction encoding format, the custom-2 flex type encoding has
no destination register address. Custom-2 flex type instructions are therefore
treated as write instructions, for which the MicroBlaze core does not expect a
result. For write instructions, the pipeline is stalled until the payload has been
transferred on the master interface of the AXI-4 stream. The execution of an
instruction of a CX extension might require more than one clock cycle. Stalling the
pipeline is therefore critical, since proceeding before the instruction has finished
can lead to various pipeline hazards. In this design, the pipeline is stalled in the
decode stage during the execution of a custom instruction and resumes when the
custom instruction finishes its execution.
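
To make the encoding discussion concrete, the sketch below packs the fields of a standard RISC-V R-type instruction into a 32-bit word and uses the major opcode to decide whether a write-back is expected. The opcode values are the standard RISC-V custom-0 to custom-3 opcodes. The helper names and the example field values are illustrative assumptions, and the mapping of the CF_ID onto the funct fields is defined by the CX specification and is not reproduced here.

```python
# Standard RISC-V major opcodes reserved for custom instructions.
CUSTOM_OPCODES = {"custom-0": 0x0B, "custom-1": 0x2B, "custom-2": 0x5B, "custom-3": 0x7B}


def encode_r_type(opcode, rd, funct3, rs1, rs2, funct7):
    """Pack the six fields of a RISC-V R-type instruction into a 32-bit word."""
    fields = [(opcode, 7, 0), (rd, 5, 7), (funct3, 3, 12),
              (rs1, 5, 15), (rs2, 5, 20), (funct7, 7, 25)]
    word = 0
    for value, width, shift in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"field value {value} does not fit in {width} bits")
        word |= value << shift
    return word


def expects_result(word):
    """Rule from the text: custom-2 (flex type) has no rd, so it is a write instruction."""
    return (word & 0x7F) != CUSTOM_OPCODES["custom-2"]


# Hypothetical example values: custom-0, rd=x10, rs1=x11, rs2=x12, funct7=0b10.
word = encode_r_type(CUSTOM_OPCODES["custom-0"], rd=10, funct3=0, rs1=11, rs2=12, funct7=0b10)
print(hex(word), expects_result(word))
```

In such a model, the result of expects_result would play the role of the EX_CX_Write signal described above.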

The AXI master and slave interfaces have been designed for the transfer of
information between the MicroBlaze-V and the accelerators. The parameters needed
for the execution of the custom instruction are encoded as shown in figure 4.5 and
transferred to the accelerators via the master interface. Correspondingly, the
result of the custom instruction, along with the status of execution, is read by the
MicroBlaze through the slave interface. The received information, shown in figure
4.6, is decoded and the destination register and the status CSR are updated.
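
The encoding and decoding steps can be sketched with a generic bit-field packer. Only the bus widths (88 and 40 bits) are taken from figure 4.4. The field names, widths and ordering in the two layouts below are purely hypothetical placeholders, since the actual layouts are those of figures 4.5 and 4.6.

```python
M_TDATA_BITS = 88  # CX_M_AXIS_Tdata(87 downto 0) in figure 4.4
S_TDATA_BITS = 40  # CX_S_AXIS_Tdata(39 downto 0) in figure 4.4

# HYPOTHETICAL layouts, listed from the least significant bit upwards.
# The actual field order is defined by figures 4.5 and 4.6.
REQUEST_LAYOUT = [("operand_a", 32), ("operand_b", 32),
                  ("cf_id", 8), ("state_id", 8), ("cxu_id", 8)]
RESPONSE_LAYOUT = [("result", 32), ("status", 8)]

# The assumed fields must exactly fill the two buses.
assert sum(width for _, width in REQUEST_LAYOUT) == M_TDATA_BITS
assert sum(width for _, width in RESPONSE_LAYOUT) == S_TDATA_BITS


def pack(layout, values):
    """Concatenate named fields into one integer according to the layout."""
    word, shift = 0, 0
    for name, width in layout:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << shift
        shift += width
    return word


def unpack(layout, word):
    """Split an integer back into its named fields."""
    fields, shift = {}, 0
    for name, width in layout:
        fields[name] = (word >> shift) & ((1 << width) - 1)
        shift += width
    return fields


request = pack(REQUEST_LAYOUT, {"operand_a": 0x1000, "operand_b": 0,
                                "cf_id": 0b10, "state_id": 0, "cxu_id": 0x20})
response = unpack(RESPONSE_LAYOUT, 0x0012345678)
print(f"{request:022x}", response)
```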

Figure 4.5: Encoded datastream written by the RISC-V core

Figure 4.6: Encoded datastream read by the RISC-V core

4.2.2 CSR
The implementation of CX requires defining additional CSRs for extension
multiplexing and custom instruction execution. For the collision-free execution of
an instruction, the compatibility of the MicroBlaze-V IP core and the accelerators
needs to be verified. As the execution of a custom instruction follows a specified
procedure, verifying the status of that execution is also important.

1. mcx_selector - The mcx_selector CSR enables CX multiplexing and allows
the developer to select the CXU required to run a particular instruction. It
can be read or written only at the machine level. The format of this CSR is
given in figure 4.7.


[Figure 4.7 reproduces the mcx_selector definition from section 2.2, New CX control /
status registers, of the Draft Proposed RISC-V Composable Custom Extensions
Specification. The content of the excerpt is listed below.]

Note: In a privileged architecture system, user level read access to mcx_selector
values could reveal goings-on in other software threads and thus facilitate side
channel attacks.

Note: In a privileged architecture with M/S/U levels, for example, what CSRs are
required and what access permissions should they have?

Figure 3. mcx_selector CSR 0xBC0 (version 0: legacy custom instructions)

Bits     Field
31:29    version = 0
28       cxe
27:0     reserved

Figure 4. mcx_selector CSR 0xBC0 (version 1: extension multiplexing)

Bits     Field
31:29    version = 1
28       cxe
27:24    reserved
23:16    state_id
15:8     reserved
7:0      cxu_id

The mcx_selector CSR has the following fields:

.version: extension multiplexing version

• When version=0, disable composable extension multiplexing. When cxe=0,
custom-[0123] instructions execute the CPU’s built-in custom instructions and
custom CSR addresses select the CPU’s built-in custom CSRs. When cxe=1,
custom-[0123] instructions and custom CSR accesses raise an illegal-instruction
exception.

• When version=1, enable version-1 composable extension multiplexing. The cxu_id
and state_id fields select the current CXU and state context. When cxe=0,
custom-[012] instructions issue CXU requests, and custom CSR accesses access CX
CSRs, of the CXU and state context identified by cxu_id and state_id. When cxe=1,
custom-[012] instructions and custom CSR accesses raise an illegal instruction
exception.

• version values 2-7 are reserved.

.cxe: custom operation exception enable

• When (version=0 or version=1) and cxe=1, a custom operation raises an
illegal-instruction exception.

.cxu_id: select the hart’s current CXU

• A valid cxu_id identifies a configured CXU.

• When enabled, when cxu_id does not identify a configured CXU, executing a custom
operation instruction causes an invalid CXU_ID error. The cx_status.IC error bit is
set and the instruction’s destination register, if any, is zeroed.

.state_id: select the hart’s current CXU’s current state context

• A valid state_id identifies a state context of a CXU.

• When enabled, when cxu_id is valid, but state_id does not identify a state
context of the current CXU, executing a custom operation instruction causes an
invalid STATE_ID error. The cx_status.IS error bit is set and the custom operation
instruction’s destination register, if any, is zeroed.

No error occurs when mcx_selector is CSR-written with an invalid CX selector, i.e.,
when .cxu_id or .state_id are invalid. Rather, subsequently executing a custom
operation instruction may cause a CXU_ID or STATE_ID ...

Figure 4.7: Definition of mcx selector CSR

Here, the cxu_id field identifies the corresponding CXU and the state_id field
identifies the corresponding state context.

2. cx_status - This CSR accumulates the CXU error flags and may be read and
written at all privilege levels. By default, the application software sets all the
fields of this CSR to 0 before an instruction is executed, and the CXU then
updates the values in case of any errors. The fields of cx_status are given in
figure 4.8.

[Figure 4.8 reproduces the cx_status definition from section 2.2, New CX control /
status registers, of the Draft Proposed RISC-V Composable Custom Extensions
Specification. The content of the excerpt is listed below.]

... instructions, and read cx_status to determine if there were any errors.

Figure 5. cx_status CSR 0x801 (accrued errors)

Bits     Field
31:7     reserved
6        CU
5        OP
4        IF
3        OF
2        IS
1        IC
0        IV

The cx_status CSR has the following fields:

.IV: invalid CX version error

• Set by a CSR-write to mcx_selector, or by a CF instruction, when
mcx_selector.version is invalid. (For example, when new software writes a new
selector type that old hardware does not implement.)

.IC: invalid CXU_ID error

• Set by a CF instruction when mcx_selector.cxu_id is invalid.

.IS: invalid STATE_ID error

• Set by a CF instruction when mcx_selector.cxu_id is valid but
mcx_selector.state_id is invalid.

.OF: state context is off error

• Set by a CF instruction when mcx_selector.cxu_id and mcx_selector.state_id are
valid but the selected state context is in the off state.

.IF: invalid CF_ID error

• Set by a CF instruction when mcx_selector.cxu_id and mcx_selector.state_id are
valid but the instruction’s CF_ID is invalid.

.OP: CXU operation error

• Set by a CF instruction when mcx_selector.cxu_id, mcx_selector.state_id, and its
CF_ID are valid but there is an error in the requested operation or its operands,
in lieu of custom error state.

.CU: custom CXU operation error

• Set by a CF instruction of a stateful extension when mcx_selector.cxu_id,
mcx_selector.state_id, and its CF_ID are valid but there is an error in the
requested operation or its operands, with custom (extension-defined) error state
available.

Note: The custom error state of a stateful extension may be obtained using custom
functions of the extension. In addition, the custom error state of a serializable
extension may also be obtained using IStateContext custom functions cf_read_status
and/or cf_read_state.

Note: Should writing mcx_selector automatically zero cx_status? This shortens the
code path to use an extension by one instruction but it precludes the use case of
clearing errors, issuing a series of custom function instructions across multiple
extensions, then checking for errors. For simplicity we do not adopt this option.

Figure 4.8: Definition of cx status CSR

The flags seen in figure 4.8 are described below.
• IV - Invalid CX version error: set when the mcx_selector version is invalid.
• IC - Invalid CXU_ID error: set when the cxu_id provided to the mcx_selector
is invalid.
• IS - Invalid STATE_ID error: set when cxu_id is valid but the state_id is
invalid.
• OF - State context is off error: set when cxu_id and state_id are valid
but the state context is in the off state.
• IF - Invalid CF_ID error: set when cxu_id and state_id are valid but
the CF_ID of the instruction is invalid.
• OP - CXU operation error: set when cxu_id, state_id and CF_ID are
valid but there is an error in the requested operation or operands.
• CU - Custom CXU operation error: set when cxu_id, state_id and
CF_ID are valid but there is an error in the requested operation or
operands, with its custom error state available.
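
As a minimal sketch of how software might build an mcx_selector value and interpret cx_status, the helpers below follow the bit positions read from figures 4.7 and 4.8 (cxu_id in bits 7:0, state_id in bits 23:16, cxe in bit 28, version in bits 31:29, and the seven error flags in bits 0 to 6). The function names are illustrative assumptions and are not part of the specification or of the MicroBlaze-V software.

```python
CX_STATUS_FLAGS = ["IV", "IC", "IS", "OF", "IF", "OP", "CU"]  # bits 0..6 of cx_status


def make_mcx_selector(cxu_id, state_id, cxe=0, version=1):
    """Assemble an mcx_selector value from its fields (version-1 layout)."""
    for name, value, width in (("cxu_id", cxu_id, 8), ("state_id", state_id, 8),
                               ("cxe", cxe, 1), ("version", version, 3)):
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
    return (version << 29) | (cxe << 28) | (state_id << 16) | cxu_id


def decode_cx_status(value):
    """Return the names of the error flags that are set in a cx_status value."""
    return [name for bit, name in enumerate(CX_STATUS_FLAGS) if value & (1 << bit)]


selector = make_mcx_selector(cxu_id=0x30, state_id=0)
print(hex(selector))                 # 0x20000030
print(decode_cx_status(0b0100010))   # ['IC', 'OP']
```

Because cx_status only accumulates errors, a typical sequence would clear it, issue one or more custom instructions, and then decode it once.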

4.3 Interconnect

The interconnect has been designed to act as a bridge between the MicroBlaze-V
core and the accelerators. It has independent AXI-4 stream interfaces for
communicating with the MicroBlaze-V processor and the accelerators, as shown in
figure 4.9. The interconnect handles the overall execution of a custom instruction
in two phases, the write phase and the read phase. The write phase handles the
communication from the MicroBlaze to the accelerators and the read phase handles
the opposite direction. A sideband channel, CXU hit, has also been designed to
indicate which accelerator is executing the custom instruction.

Because the information is broadcast to all the accelerators, the interconnect need
not know the header ID (CXU_ID) of the accelerators, which avoids the need for
lookup tables in the interconnect module. The accelerator with the correct
CXU_ID responds to the broadcast request and the interconnect waits for the
response. If none of the accelerators responds to the request, the custom
instruction fails with a CXU error.

Figure 4.9: AXI-4 stream interfaces in the design

4.3.1 Write Phase
The write phase of the interconnect, handled by the broadcaster, evaluator, and
error handling units in figure 4.10, establishes the communication from the
MicroBlaze to the accelerators. During the write phase, the interconnect broadcasts
the payload from the master interface of the CX interface to all the accelerators.
Broadcasting the information eliminates the need to store the header IDs, including
the CXU ID and CF ID of the accelerators, in the MicroBlaze or in the interconnect.
The accelerator with the matching CXU_ID responds to the request and notifies the
interconnect through the sideband channel (CXU Hit). The error handling in the
interconnect uses the CXU Hit flags from the accelerators to check for a CXU error.
If no accelerator reports a hit, the CXU error flag is set and accumulated.
Accumulating the CXU error retains any previous invalid CXU errors.
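
The write phase can be summarised in a small behavioural model. This is only a sketch of the broadcast-and-hit idea, not the RTL of the design: the class and method names are hypothetical, and mapping the CXU error onto the IC bit of cx_status is an assumption based on figure 4.8.

```python
IC_BIT = 1 << 1  # assumed position of the invalid CXU_ID flag (IC) in cx_status


class Accelerator:
    """Behavioural stand-in for an accelerator wrapper: it only compares CXU_IDs."""

    def __init__(self, cxu_id):
        self.cxu_id = cxu_id

    def offer(self, cxu_id):
        """Return True (CXU Hit) if the broadcast request is addressed to this CXU."""
        return cxu_id == self.cxu_id


class Interconnect:
    def __init__(self, accelerators):
        self.accelerators = accelerators
        self.accumulated_status = 0  # sticky until it is read back

    def write_phase(self, cxu_id):
        """Broadcast the request to every accelerator and collect the hit flags."""
        hits = [acc.offer(cxu_id) for acc in self.accelerators]
        if not any(hits):
            # Nobody claimed the request: record a CXU error.
            self.accumulated_status |= IC_BIT
        return hits


# CXU_IDs taken from table 4.1.
bus = Interconnect([Accelerator(0x20), Accelerator(0x25), Accelerator(0x30)])
print(bus.write_phase(0x25), bus.accumulated_status)
print(bus.write_phase(0x99), bus.accumulated_status)
```

Note that the interconnect never inspects the CXU_ID itself, which is why no lookup table is needed.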

Figure 4.10: Interconnect functionality in the write phase

4.3.2 Read Phase
During the read phase, the computed result from the accelerators, along with the
accumulated error flag, is read by the MicroBlaze-V processor. The functions
performed by the interconnect during the read phase are shown in figure 4.11. The
mixer waits for the response from the accelerators along with the status of the
operation, appends the accumulated CXU error flag and writes the information to the
CX interface. In the mixer, a bitwise OR operation is performed on the responses
from the accelerators. Therefore, the response of the non-triggered accelerators
should be set to void or 0. If none of the accelerators is triggered, the error
handling identifies the error and writes it back to the CX interface. An OR function
is used to combine the response from the mixer function with that of the error
handling unit. The evaluator unit generates the response based on the instruction
type (read or write) by analysing the ready flag in the slave interface of the CX
interface. When the ready flag is high, the instruction is identified as a read
instruction, for which the CX interface expects a result from the accelerator. When
the ready flag is low, the instruction is identified as a write instruction, for
which the CX interface does not expect a result.
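
The mixer behaviour described above can be sketched as follows. The function name, the tuple representation of a response and the use of None for "no response" are illustrative assumptions; only the bitwise OR of the responses and the read/write distinction are taken from the text.

```python
def read_phase(responses, accumulated_status, ready):
    """Merge accelerator responses for the CX interface.

    responses: one (result, status) pair per accelerator, (0, 0) if not triggered.
    accumulated_status: error flags collected by the interconnect's error handling.
    ready: state of the ready flag in the slave interface of the CX interface.
    """
    if not ready:
        return None  # write instruction: the CX interface does not expect a result
    result, status = 0, accumulated_status
    for acc_result, acc_status in responses:
        # Idle accelerators contribute zeros, so the OR keeps the active response intact.
        result |= acc_result
        status |= acc_status
    return result, status


print(read_phase([(0, 0), (0x1234, 0), (0, 0)], accumulated_status=0, ready=True))
print(read_phase([(0, 0), (0, 0), (0, 0)], accumulated_status=0b10, ready=True))
```

The OR-based merge only works if every non-triggered accelerator really drives zeros, which is the requirement stated in the paragraph above.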

Figure 4.11: Interconnect functionality in the read phase

4.4 Accelerators
The accelerator executes the custom instruction by performing the computations
based on the parameters provided and writes the result back to the MicroBlaze-V
processor through the interconnect module. The slave and master interfaces
defined in the accelerator handle the reading and writing of the data, respectively.
Additionally, the CXU Hit channel indicates whether the custom instruction carries
the correct CXU_ID. Each accelerator has a unique CXU_ID which acts as its
identifier. This thesis work considers the FFT and the CORDIC IP cores from the
Xilinx library for hardware acceleration. Table 4.1 lists the identifiers for each
accelerator and their corresponding function.

Table 4.1: Accelerator identifier

Accelerator          Instruction Type     CXU ID   CF ID
Trigonometry         Custom-0 R type      0x20     0b10
Vector Translation   Custom-0 R type      0x25     0b01
FFT                  Custom-2 flex type   0x30     0b01
FFT                  Custom-2 flex type   0x30     0b10
FFT                  Custom-0 R type      0x30     0b11

A wrapper module, shown in figure 4.12, is defined over the IP cores and manages the
execution of the custom instruction. Upon receiving the tvalid and tdata signals,
the wrapper module samples the incoming bit stream and checks its integrity. If
the CXU_ID is valid, the data stream is received by asserting the ready signal
of the AXI slave interface and the request is noted as a CXU hit. In case
of a CXU mismatch, the data stream is not received and the CXU Hit signal
is not asserted. The parameters required by the IP core are forwarded
after the CF_ID check and the operand check have been performed. If either check
fails, the error handling takes over and provides an error response back to the
MicroBlaze. The wrapper module waits until the IP core has computed the result,
which is then encoded with the status to generate the response back to the
MicroBlaze. This combinatorial logic helps the accelerators to respond faster.
The AXI interfaces of the wrapper and the IP cores are connected through the AXI
master and slave interfaces defined in the wrapper. These interfaces connect the
AXI handshake signals (valid and ready) and the data channel.

As mentioned before, the custom instructions are distinguished by whether they
write back a result. The accelerator module follows the normal process flow when
the custom instruction expects a result back. If the custom instruction does not
expect a result back, the accelerators do not write a result back to the
MicroBlaze. The AXI master interface defined in the wrapper handles this
functionality. When such custom instructions produce a CF error or an op error, the
error handling stores and accumulates the corresponding status. The accumulated
status is written back to the MicroBlaze-V later, when the accelerators run a read
instruction.
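
The process flow of the wrapper, including the deferred error reporting for write instructions, can be modelled behaviourally as below. This is a sketch under stated assumptions: the class, its arguments and the stand-in add function are hypothetical, the IF and OP bit positions follow figure 4.8, and returning a zero result on error follows the specification excerpt in figure 4.7.

```python
IF_BIT = 1 << 4  # invalid CF_ID flag in cx_status
OP_BIT = 1 << 5  # CXU operation error flag in cx_status


class Wrapper:
    """Behavioural sketch of the accelerator wrapper (not the RTL of the design)."""

    def __init__(self, cxu_id, functions, operands_ok=lambda a, b: True):
        self.cxu_id = cxu_id
        self.functions = functions      # cf_id -> callable(a, b), stands in for the IP core
        self.operands_ok = operands_ok  # stands in for the OP check
        self.pending_status = 0         # errors accumulated by write instructions

    def handle(self, cxu_id, cf_id, a, b, expects_result):
        """Return (cxu_hit, response); response is None when nothing is written back."""
        if cxu_id != self.cxu_id:
            return False, None                     # CXU mismatch: stream not accepted
        result = 0
        if cf_id not in self.functions:
            self.pending_status |= IF_BIT          # CF check failed
        elif not self.operands_ok(a, b):
            self.pending_status |= OP_BIT          # OP check failed
        else:
            result = self.functions[cf_id](a, b) & 0xFFFFFFFF
        if not expects_result:
            return True, None                      # write instruction: keep errors for later
        status, self.pending_status = self.pending_status, 0
        return True, (result, status)              # read instruction flushes the status


fft_like = Wrapper(0x30, {0b11: lambda a, b: a + b})
print(fft_like.handle(0x20, 0b11, 1, 2, True))    # addressed to another CXU
print(fft_like.handle(0x30, 0b00, 1, 2, False))   # invalid CF_ID on a write instruction
print(fft_like.handle(0x30, 0b11, 1, 2, True))    # read instruction returns the stored error
```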

[Figure 4.12: block diagram of the accelerator wrapper. Blocks shown: AXI Slave
Interface, Sampling, CXU Check, CF Check, OP Check, IP core, Error Handling,
Encoder, AXI Master Interface. Signals shown: Acc_s_axis_tdata, Acc_s_axis_tvalid,
Acc_s_axis_tready, CXU_hit, s_axis_tdata, s_axis_tvalid, s_axis_tready,
m_axis_tdata, m_axis_tvalid, m_axis_tready, cf error, op error, error_tvalid,
CX_Status, Acc_m_axis_tdata, Acc_m_axis_tvalid, Acc_m_axis_tready.]

Figure 4.12: Wrapper module for the accelerators

4.4.1 CORDIC
This thesis work implements the trigonometric and vector translation functionalities
of the CORDIC IP core from Vivado. These functionalities are implemented as
two independent CXUs with unique CXU_IDs, as listed in table 4.1. As each of the
CORDIC accelerators requires only one instruction to perform its computation,
the two least significant bits of the instruction code are sufficient to represent
the function ID (CF_ID). As mentioned before, the IP core is executed only after
the error checking has been performed. For the CORDIC functionalities, the state ID
is ignored, as the computation does not require any state context. Since both
CORDIC functions produce a result, the custom-0 R type instruction encoding is used.
Therefore, the accelerator wrapper follows the normal process flow and returns the
result to the MicroBlaze-V.

The trigonometric IP core has been configured to operate on a 16-bit fixed-point
input value and to produce a 32-bit fixed-point value, so that the computed result
can be written back directly. The 32-bit output value contains the 16-bit sine and
cosine values in its lower and upper halves, respectively. As the operand values
in the instruction code allow 32 bits, the input for the IP core is taken from
the least significant bits and the remaining bits are ignored. The vector
translation function of the CORDIC IP core is configured for a 32-bit Cartesian
input and produces a 32-bit result containing the scaled magnitude and the rotated
phase angle. Both CORDIC functions are configured for maximum performance by using
the maximum pipelining mode and the parallel architectural configuration.
Truncating the data limits the result to 32 bits in both CORDIC functions.

Similar wrapper modules have been designed for both CORDIC functions. They follow
the normal process flow but differ in input handling. This difference in the
sampling unit ensures that the correct input is driven to the IP cores. The overall
timing required for the wrapper to perform these computations is shown in figure
4.13. As mentioned, when a bit stream with a valid CXU_ID is received, the CXU Hit
flag is asserted, and it is de-asserted after the execution is completed. This flag
also informs the interconnect that a custom instruction is being executed. The
wrapper module waits for the valid signal, which indicates that the result is ready,
and returns the result along with the status on the accelerator master interface. By
enabling the blocking mode of the IP core, the tready signal can be controlled
manually.
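
On the software side, the packed 32-bit result of the trigonometric accelerator has to be split into its two halves. The sketch below assumes two's complement 16-bit values and, as a labelled assumption that is not stated in the text, 14 fractional bits per value; the number of fractional bits is therefore a parameter.

```python
def to_signed16(value):
    """Interpret the low 16 bits of value as a two's complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def unpack_sin_cos(word, frac_bits=14):
    """Split the 32-bit result into (sine, cosine) as floating-point values.

    Sine is taken from the lower half and cosine from the upper half, as
    described in the text. frac_bits=14 is an ASSUMED fixed-point format.
    """
    scale = 1 << frac_bits
    sine = to_signed16(word) / scale
    cosine = to_signed16(word >> 16) / scale
    return sine, cosine


# Hypothetical result word: cosine = -0.5 (0xE000), sine = 0.5 (0x2000).
print(unpack_sin_cos(0xE0002000))  # (0.5, -0.5)
```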

[Figure 4.13: timing diagram. Signals shown: Clk, Acc_s_axis_valid,
Acc_s_axis_data, Acc_s_axis_tready, CXU_HIT, IP_s_axis_valid, IP_s_axis_data,
IP_s_axis_ready, IP_m_axis_valid, IP_m_axis_ready, IP_m_axis_data,
Acc_m_axis_valid, Acc_m_axis_ready, Acc_m_axis_data.]

Figure 4.13: Timing diagram of accelerators using CORDIC IP core

4.4.2 FFT
The FFT IP core has been configured with the radix-2 burst I/O architecture and
a transform length of 256, which sets the number of elements in a data frame (N or
k in equations 2.1 and 2.2). The number of butterfly stages in the FFT algorithm is
therefore log2(256) = 8. The input to the FFT core has been designed to have 16 bits
for both the real and the imaginary component. The output therefore also has a
16-bit real and a 16-bit imaginary component. The IP core has been configured
for fixed-point computations.

The burst I/O architecture allows each data frame to be written and read in
separate phases. Therefore, three custom instructions are required to configure,
write, and read data from the FFT core. The design constraints set in the software
manage the order of execution of these phases. In the write phase, the
configuration and the inputs required for the computations are fed into the FFT
core. During the read phase, the MicroBlaze reads the computed result using each
instruction.

The wrapper module designed around the FFT core manages these independent
phases along with the normal process flow, including sampling and error checking.
When a bit stream with a valid CXU_ID is received, the phases of operation of the
FFT acceleration are differentiated by the CF_ID with the help of a demultiplexer
unit, as shown in figure 4.14. Since the IP core has independent AXI interfaces for
configuring, writing to, and reading from it, gating the tvalid signal makes it
possible to control each phase of operation. The tdata signal can therefore be
connected to the IP core directly, as tvalid prevents the IP core from reading the
data. Similarly, tready in the slave interface of the wrapper has been configured to
depend on the control signals in the AXI interfaces of the IP core, as shown in
the block diagram. For an output instruction, the acc_s_axis_tready flag depends on
the tvalid in the output interface of the IP core. Because of the delay in computing
the output, the wrapper module needs to receive the instruction and wait for the
response. Therefore, the valid signal in the output interface has been logically
OR’ed and forwarded to the demux. The FFT wrapper writes back to the MicroBlaze
only for the custom instruction that reads the output. For a write instruction, the
error handling unit stores and accumulates the error. This means that the cf error
and the operation error are accumulated in the error handling unit, as shown in
figure 4.15, and are flushed out when a read instruction is processed. The ready
flag in the master interface of the wrapper has been connected to the ready flag in
the output interface. When the output instruction returns the result to the
MicroBlaze, the ready flag is de-asserted so that the following data is not read.
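
The tvalid gating performed by the demultiplexer, together with the ordering of the three phases, can be sketched as follows. In table 4.1, CF_ID 0b11 belongs to the R type FFT instruction that returns a result, so it is taken as the output instruction. Which of 0b01 and 0b10 selects configuration and which selects input is an assumption made only for this illustration, as are the function names.

```python
# 0b11 -> "output" follows from table 4.1; the 0b01/0b10 assignment is ASSUMED.
PHASE_BY_CF_ID = {0b01: "config", 0b10: "input", 0b11: "output"}
PHASE_ORDER = ["config", "input", "output"]


def demux_tvalid(acc_tvalid, cxu_valid, cf_id):
    """Gate tvalid so that only the interface selected by CF_ID sees the transfer.

    Returns (tvalid per IP interface, cf_error).
    """
    phase = PHASE_BY_CF_ID.get(cf_id)
    enabled = bool(acc_tvalid and cxu_valid)
    tvalid = {name: enabled and phase == name for name in PHASE_ORDER}
    cf_error = enabled and phase is None
    return tvalid, cf_error


def order_is_valid(cf_ids):
    """Check that a sequence of CF_IDs never moves back to an earlier phase."""
    ranks = []
    for cf_id in cf_ids:
        phase = PHASE_BY_CF_ID.get(cf_id)
        if phase is None:
            return False
        ranks.append(PHASE_ORDER.index(phase))
    return ranks == sorted(ranks)


print(demux_tvalid(True, True, 0b10))
print(order_is_valid([0b01, 0b10, 0b10, 0b11]), order_is_valid([0b10, 0b01]))
```

Because tdata is shared by all three interfaces, only the gated tvalid decides which interface of the IP core actually consumes a transfer.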

[Figure 4.14: block diagram of the FFT accelerator wrapper. Blocks shown: Sampler,
demultiplexer (selected by CF_ID and CXU valid), OR gate, Encoder, and the IP core
with its Configuration, Input and Output interfaces. Signals shown:
acc_s_axis_tdata, acc_s_axis_tvalid, acc_s_axis_tready, acc_m_axis_tdata,
acc_m_axis_tready, CF_ID, CXU valid, CF Error, cx_status, s_axis_config_tdata,
s_axis_config_tvalid, s_axis_config_tready, s_axis_tdata, s_axis_tvalid,
s_axis_tready, m_axis_tdata, m_axis_tvalid, m_axis_tready.]

Figure 4.14: Block diagram of accelerator design using the FFT IP core

(a) cf error (b) op error

Figure 4.15: Accumulation of the errors by the error handling unit

When executing an input or configuration instruction, the FFT wrapper follows the
same methodology and timing. The timing of the FFT write process for a valid
instruction is shown in figure 4.16.
For an output instruction, the wrapper module needs to wait until the IP core
generates the output. The timing behaviour of an output instruction can be seen in
figure 4.17.

[Figure 4.16: timing diagram. Signals shown: Clk, Acc_s_axis_valid,
Acc_s_axis_data, Acc_s_axis_tready, CXU_HIT, IP_s_axis_valid, IP_s_axis_data,
IP_s_axis_ready.]

Figure 4.16: Timing diagram of input and config instructions in the FFT accelerator

[Figure 4.17: timing diagram. Signals shown: Clk, Acc_s_axis_valid,
Acc_s_axis_data, Acc_s_axis_tready, CXU_HIT, IP_m_axis_valid, IP_m_axis_ready,
IP_m_axis_data, Acc_m_axis_valid, Acc_m_axis_ready, Acc_m_axis_data.]

Figure 4.17: Timing diagram of output instruction in the FFT accelerator

4.5 Reference Model
It is necessary to compare the performance of the newly designed core with the
execution of the same functions in software and to analyze its advantage. For that
reason, the trigonometric, vector translation, and FFT functions were written in the
C programming language and run in the existing MicroBlaze-V environment. The
standard C math library was used for the trigonometric, square root and other
mathematical operations involved in the calculations. Moreover, care was taken that
the program does not grow so large that it no longer fits inside the core memory. A
few design decisions had to be taken to satisfy this requirement. One of them was to
use the xil_printf() function developed by Xilinx instead of the normal printf() in
order to reduce the code size. The size of the LMB RAM also had to be increased to
fit the nested loop operations involved in the FFT calculation. Once the codes were
written, each of them was compiled and built using Vitis, which generated an ELF
file. This ELF file was imported into Vivado, which in turn allows the MicroBlaze
design to execute the software code. Finally, this model was implemented and run
on the KCU105 FPGA to analyze the performance metrics, which are described in
Section 2.9.

5
Results

In this chapter, we show the results obtained in this work. Through this work, AMD
has gained a new interface in its MicroBlaze-V core for plugging in any accelerator
that executes custom instructions. However, the accelerators must implement AXI-4
stream communication and need a similar wrapper to interface with the MicroBlaze
core.

In this thesis work, we compared the results obtained by running the custom
instructions using accelerators with those obtained by running the same functions in
the core without the accelerators. It was noted that the output value of the
function written in C code does not match the output of the IP core. This is
expected, because the IP core implements these functions differently. For example,
the truncation parameters are specific to the FFT IP core and might not match the
conventional algorithm followed to program the same function in C. As a result, it
is not ideal to compare the outputs of the operations; it makes more sense to
compare the performance in terms of speedup, resource usage and the other
metrics described in Section 2.9. These metrics were recorded by running the design
on the KCU105 FPGA at a clock frequency of 100 MHz, and the obtained results
are presented below. As mentioned earlier, the metrics were recorded for both
the existing MicroBlaze-V design and the modified core with the CX extension and
accelerators. The results are then compared to show the benefits of having an
interface to execute custom instructions.

5.1 Speedup

The speedup can be calculated from the execution times of the two MicroBlaze-V
core designs. The execution time was recorded by counting clock cycles using the
cycle CSR available in MicroBlaze-V. The CSR is sampled at the beginning and at the
end of the execution, and the difference between the two samples gives the number of
cycles taken to execute the specific instruction, i.e. its CPI. The execution time
is obtained by multiplying the CPI by the period of the system clock (10 ns). The
reported execution times are given in table 5.1.
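
The conversion from cycle counts to execution time can be written as a short calculation. The helper below is a sketch: the function name is hypothetical, and the 32-bit counter width used to handle wrap-around between the two samples is an assumption.

```python
CLOCK_PERIOD_NS = 10  # period of the 100 MHz system clock


def execution_time_us(cycle_start, cycle_end, period_ns=CLOCK_PERIOD_NS, counter_bits=32):
    """Convert two samples of the cycle counter into an execution time in microseconds."""
    # The modulo keeps the difference correct if the counter wrapped between samples.
    elapsed_cycles = (cycle_end - cycle_start) % (1 << counter_bits)
    return elapsed_cycles * period_ns / 1000


# 31 elapsed cycles at 10 ns correspond to the 0.31 us reported in table 5.1.
print(execution_time_us(100, 131))
```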

Table 5.1: Execution time of the given functions in the modified design and current
design.

Function             CX-enabled MicroBlaze-V   MicroBlaze-V
Sin and Cos          0.31 µs                   16.59 µs
Vector Translation   0.31 µs                   43.68 µs
FFT                  57.69 µs                  17.85 ms

The speedup of each function is calculated from these execution times using
equation 2.4. The calculated speedup values are plotted for comparison in
figure 5.1.
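
As a worked example, the speedup can be recomputed from the values in table 5.1. The sketch assumes that equation 2.4 defines speedup as the execution time of the original design divided by that of the CX-enabled design; the dictionary and function names are illustrative.

```python
# Execution times from table 5.1 in microseconds: (CX-enabled, original MicroBlaze-V).
EXECUTION_TIME_US = {
    "Sin and Cos": (0.31, 16.59),
    "Vector Translation": (0.31, 43.68),
    "FFT": (57.69, 17850.0),  # 17.85 ms expressed in microseconds
}


def speedup(accelerated_us, baseline_us):
    """Speedup as the ratio of baseline time to accelerated time (assumed definition)."""
    return baseline_us / accelerated_us


for function, (accelerated, baseline) in EXECUTION_TIME_US.items():
    print(f"{function}: {speedup(accelerated, baseline):.1f}x")
```

With these inputs the ratios come out at roughly 53.5, 140.9 and 309.4 for the three functions.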

[Figure 5.1: bar chart of the speedup (y-axis, 0 to 350) for the functions Sin and
Cos, Vect Trans and FFT (x-axis).]

Figure 5.1: Speedup comparison of functions from the original version and by
using accelerators

From the figure, it is clear that the execution time has improved significantly for
all the functions. The use of accelerators has made the execution of the functions
faster by executing them in hardware, which otherwise would take multiple loops to
execute in software.

5.2 Hardware utilisation

The resource utilisation of the hardware after adding the CXUs and the interface
is an essential parameter for the system cost. The resource utilisation of the
system with and without accelerators on the KCU105 FPGA is shown in figure 5.2.

[Figure 5.2: bar chart of the utilisation in % (y-axis, 0 to 3) for the resources
LUT, LUTRAM, FF, BRAM and DSP (x-axis), without and with CX and accelerators.]

Figure 5.2: Resource utilisation percentage comparison

This resource utilisation comparison shows the additional resources required to
implement the CX extension for the selected computations. The extra usage of
resources is expected, as the newly designed interface and the accelerators require
more flip-flops, LUTs, DSPs and other resources. For instance, the accelerators,
including the FFT, have been configured to use DSPs for a faster response time.
Since we consider a single extension to connect all the accelerators, it is
sufficient to compare the resource utilisation.

5.3 Power Consumption

The extra hardware resources also consume more power. The power should therefore
be reported and checked to be well within the constraints, so that it does not raise
any warning in Vivado. The dynamic and static power of the two designs were
compared and the results are illustrated in figure 5.3.

[Figure 5.3: bar chart of the power in W (y-axis, 0 to 0.5) for dynamic power and
static power (x-axis), without and with CX and accelerators.]

Figure 5.3: Comparison of power consumption

The implementation report from Vivado, shown in figure 5.3, describes the power
dissipation in the FPGA and does not reflect the actual power consumption of the
core with acceleration during runtime. Energy consumption would have been the more
suitable parameter here, as it is expected to show the improvement from
acceleration. Our thesis work did not explore this area due to time constraints,
which is one of its limitations.

The above results show that the introduction of the CX interface to the MicroBlaze-V
core has enabled the execution of custom instructions with ease and at a higher
performance. As the amount of hardware used is still within satisfactory limits, the
design has been successful and is a possible candidate for AMD’s future
customer releases.

6
Conclusion

The introduction of the CX interface to the MicroBlaze-V enables the plug-and-play
use of different accelerator modules. The current state of the CX extension in
MicroBlaze-V supports any accelerator with an AXI4-stream interface. In the case
of multiple accelerators, the wrapper module of section 4.4 can be adapted
for the identification of the accelerator and for error handling. The instruction
codes that invoke the accelerators are fed into the MicroBlaze-V core using inline
assembly. When the module is run, the instructions and their operands are
taken in from the software and processed in the core and the accelerators, which
output the result. With the help of the reference model, the accelerated functions
have been compared with the same functions implemented on MicroBlaze-V using C
functions.

To implement complex functions such as FFT and trigonometry in software, a long
sequence of standard instructions has to be fed to MicroBlaze-V. The same function
can be executed with minimum CPU time by implementing a CX extension that uses
external accelerators. According to table 5.1, the execution time improves by a
factor of approximately 300 when using the FFT accelerator compared with the normal
C functions. Similarly, the CPU time improves by factors of approximately 50 and
140 for the trigonometric and vector translation functions, respectively. This
improvement requires additional hardware resources in the FPGA, including FFs, DSPs,
etc. Therefore, the FPGA device dissipates more total power than with the standard
MicroBlaze-V implementation.

With the CX extension, the MicroBlaze-V is nevertheless expected to consume less
energy for a custom function, although this was not measured in this work.

6.1 Challenges
Initially, investigating and proposing potential accelerators that are industrially
relevant was challenging. We were planning to design an accelerator function from
scratch. Due to the limited duration of this thesis work, and as the main
focus was on enabling MicroBlaze-V to support custom instructions, we decided
to use standalone IP cores from AMD’s Vivado library. Therefore, from an
industrial perspective, our thesis work demonstrates the ease of connecting a
standalone IP core to the MicroBlaze-V without extra effort.

Another challenge we faced was to get the software implementation of the accelerator
functions to work. As these implementations, especially the FFT, require multiple
loops, they require more memory space in BRAM, and the results were initially not
obtained as expected. However, we managed to unroll the loops and to increase the
BRAM memory to an extent that the expected output was obtained.

We also faced some difficulties when modifying the processor core to support the
custom instructions, especially during the write-back to the registers and the
updating of CSRs. Due to our inexperience and unfamiliarity with the MicroBlaze-V,
we had to rely on our advisors at AMD for the necessary modifications in the core.

6.2 Future Work
The current state of the CX extension in MicroBlaze-V supports the implementation
of FFT acceleration with a set of limitations defined in the software. That is, the
custom instructions for the FFT must be executed in a fixed order: configuration,
input, and then output. This ordering could instead be managed by the hardware,
so that the user need not track the execution order. In addition, following the
recent modification in the specification for enabling the CX extension, the
execution status of a custom instruction should be updated in the CSRs.
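One possible way to relieve the user of the ordering constraint is sketched below as a small state machine, written in C for readability. It is only a behavioural sketch: the type and function names (fft_phase_t, fft_op_t, fft_seq_step) are hypothetical, and the assumption that a transaction consists of one configuration followed by one or more inputs and then one or more outputs is made here for illustration and is not taken from the CX specification.

```c
#include <stdbool.h>

/* Hypothetical phases of one FFT transaction. */
typedef enum { FFT_IDLE, FFT_CONFIGURED, FFT_LOADING, FFT_DRAINING } fft_phase_t;

/* Hypothetical operation kinds, one per type of custom instruction. */
typedef enum { FFT_OP_CONFIG, FFT_OP_INPUT, FFT_OP_OUTPUT } fft_op_t;

typedef struct {
    fft_phase_t phase;
} fft_sequencer_t;

void fft_seq_init(fft_sequencer_t *s)
{
    s->phase = FFT_IDLE;
}

/* Returns true and advances the phase if `op` is legal in the current phase.
 * Returns false and leaves the phase unchanged otherwise. */
bool fft_seq_step(fft_sequencer_t *s, fft_op_t op)
{
    switch (op) {
    case FFT_OP_CONFIG:
        /* Assumed rule: configure only before a transaction or after draining. */
        if (s->phase == FFT_IDLE || s->phase == FFT_DRAINING) {
            s->phase = FFT_CONFIGURED;
            return true;
        }
        return false;
    case FFT_OP_INPUT:
        /* Assumed rule: inputs follow configuration and may repeat. */
        if (s->phase == FFT_CONFIGURED || s->phase == FFT_LOADING) {
            s->phase = FFT_LOADING;
            return true;
        }
        return false;
    case FFT_OP_OUTPUT:
        /* Assumed rule: outputs follow at least one input and may repeat. */
        if (s->phase == FFT_LOADING || s->phase == FFT_DRAINING) {
            s->phase = FFT_DRAINING;
            return true;
        }
        return false;
    }
    return false;
}
```

A hardware implementation could apply the same transition rules inside the accelerator wrapper and report a rejected instruction through the CSR status update mentioned above.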
The interconnect block designed to manage the accelerators could be released as a
standalone IP core with peripheral support for MicroBlaze-V, allowing the user to
interface multiple accelerators seamlessly. Furthermore, the wrapper module
functions could be predefined in Vivado, enabling users to attach independent
accelerators to MicroBlaze-V.



References

[1] E. Cui, T. Li, and Q. Wei, “RISC-V Instruction Set Architecture Extensions:
A Survey,” IEEE Access, vol. 11, pp. 24 696–24 711, 2023.

[2] “Draft proposed RISC-V composable custom extensions specification,”
https://github.com/grayresearch/CX.

[3] Z. Li, W. Hu, and S. Chen, “Design and Implementation of CNN Custom
Processor Based on RISC-V Architecture,” in 2019 IEEE 21st International
Conference on High Performance Computing and Communications; IEEE 17th
International Conference on Smart City; IEEE 5th International Conference
on Data Science and Systems (HPCC/SmartCity/DSS), 2019, pp. 1945–1950.

[4] K.-D. Nguyen, D. T. Kiet, T.-T. Hoang, N. Q. N. Quynh, and C.-K. Pham, “A
CORDIC-based Trigonometric Hardware Accelerator with Custom Instruction
in 32-bit RISC-V System-on-Chip,” in 2021 IEEE Hot Chips 33 Symposium
(HCS), 2021, pp. 1–13.

[5] “What is a field programmable gate array (FPGA)?”
https://www.ibm.com/think/topics/field-programmable-gate-arrays.

[6] Vivado Design Suite User Guide, Advanced Micro Devices, Inc, 10 2023.

[7] M. Dubois, M. Annavaram, and P. Stenström, Parallel Computer Organization and
Design, 1st ed. Cambridge University Press, 2012.

[8] J. L. Hennessy and D. A. Patterson, Computer Architecture - A Quantitative
Approach, 6th ed. Morgan Kaufmann, 2019.

[9] MicroBlaze Processor Reference Guide (UG984), Advanced Micro Devices, Inc,
6 2021.

[10] I. Mhadhbi, S. Ben Othman, and S. Ben Saoud, “Impact of MicroBlaze FPGAs
Design Methodologies of the Embedded Systems Performances,” in International
Conference on Control, Engineering Information Technology, 2014, pp.
146–151.

[11] AMBA AXI-Stream Protocol Specification, ARM Limited, 4 2021.

[12] G. Park, T. Taing, and H. Kim, “High-speed FPGA-to-FPGA Interface for a
Multi-Chip CNN Accelerator,” in 2023 20th International SoC Design Conference
(ISOCC), 2023, pp. 333–334.

[13] CORDIC v6.0 Product Guide (PG105), Advanced Micro Devices, Inc, 8 2021.

[14] P. Duhamel and M. Vetterli, “Fast Fourier transforms: a tutorial review and a
state of the art,” Signal processing, vol. 19, no. 4, pp. 259–299, 1990.

[15] J. W. Cooley and J. W. Tukey, “An algorithm for the machine calculation
of complex Fourier series,” Mathematics of computation, vol. 19, no. 90, pp.
297–301, 1965.

[16] Fast Fourier Transform v9.1 Product Guide (PG109), Advanced Micro Devices,
Inc, 5 2022.



A
Appendix 1

A.1 Reference Model C code

A.1.1 Trigonometric Function
1 # include <math.h>
2 # include <stdio.h>
3

4 int main ()
5 {
6 int angle = 0;
7

8 // printing the sine value of angle
9 xil_printf ("sin (%d) = %d", angle ,sin(angle));

10 xil_printf ("cos (%d) = %d", angle ,co