Vehicle Motion Control on SIMD: Traditional and AI based models on the edge
Master's thesis in Embedded Electronic System Design
MADHU SURESH
SAURUBH SUDARSHAN
Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2022

Master's thesis 2022

Vehicle Motion Control on SIMD: Traditional and AI based models on the edge
MADHU SURESH
SAURUBH SUDARSHAN
Department of Computer Science and Engineering
Chalmers University of Technology
University of Gothenburg
Gothenburg, Sweden 2022

Vehicle Motion Control on SIMD: Traditional and AI based models on the edge
MADHU SURESH, SAURUBH SUDARSHAN
© MADHU SURESH, SAURUBH SUDARSHAN, 2022.

Supervisor: Pedro Petersen Moura Trancoso, Department of Computer Science and Engineering
Company advisors: Henok Fessehatsion, Thyagaraja Naidu, CEVT AB
Examiner: Per Larsson-Edefors, Department of Computer Science and Engineering

Master's Thesis 2022
Department of Computer Science and Engineering
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX
Gothenburg, Sweden 2022

Vehicle Motion Control on SIMD: Traditional and AI based models on the edge
MADHU SURESH
SAURUBH SUDARSHAN
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg

Abstract

Recent advancements in technologies such as Artificial Intelligence (AI) and Non-Linear Model Predictive Control (NMPC) have led to their use in the field of vehicle motion control. The models behind these technologies are performance critical and are expected to execute within hard timing deadlines. Because of their high computational cost, coupled with the strict deadlines, they are usually deployed on accelerators such as the Graphics Processing Unit (GPU). However, resource-constrained embedded platforms cannot afford such accelerators. Considering these limitations, it is therefore crucial to thoroughly investigate the implementation of these models entirely on a CPU, without any dedicated accelerator, while still meeting the strict requirements. This thesis investigates this approach by analyzing two different models, an AI model and an NMPC model, in which the Single Instruction Multiple Data (SIMD) component of an Arm processor is exploited. SIMD units are commonly used for vector operations in modern CPUs. Using these models, various Arm SIMD implementation techniques, namely Arm Neon intrinsics, the Ne10 library and auto-vectorization, are investigated. Compared with the traditional approach of sequential computation, the proposed method implemented with Neon intrinsics was found to be more efficient and gave an execution time reduction of 61.9% for the AI model, while the NMPC model showed an increase in execution time of 8.3%.

Keywords: AI, NMPC, Graphical Processing Unit, Internet of Things, SIMD, Neon Intrinsics, Neon enabled library, CUDA.

Acknowledgements

We would like to express our sincere gratitude to our supervisors Henok Fessehatsion and Thyagaraja Naidu, and technical expert Stefan Carlsson at CEVT, for allowing us to do our Master's thesis and for their encouragement and willingness to assist in any situation. We would also like to thank our supervisor Pedro Petersen Moura Trancoso at Chalmers University, who has guided us on the right path throughout the project.
Finally, we would like to thank our respective families for the continuous support and encouraging words that have helped us in finishing this thesis.

Madhu Suresh, Saurubh Sudarshan, Gothenburg, December 2022

Contents

1 Introduction
  1.1 Problem Statement
  1.2 Related Work
  1.3 Aim and Objectives
    1.3.1 Limitation
    1.3.2 Thesis Outline
2 Theory
  2.1 Flynn's Classification
  2.2 Armv8-A Architecture
  2.3 Arm Neon
  2.4 Utilizing Arm Neon
    2.4.1 Hand-coded Neon assembler
    2.4.2 Auto-vectorization
    2.4.3 Neon Intrinsics
    2.4.4 Neon-enabled Libraries
  2.5 Performance metrics
3 Implementation Models
  3.1 NMPC model
    3.1.1 C2: Online NMPC Controller
      3.1.1.1 Elector
      3.1.1.2 Controller
    3.1.2 IPOPT Library
  3.2 AI model
    3.2.1 Neural network
4 Methods
  4.1 Working Environment
  4.2 NMPC Model Implementation, Optimization and Evaluation
    4.2.1 Hardware Implementation
    4.2.2 SIMD Optimization
  4.3 AI model implementation, Optimization, and Evaluation
    4.3.1 Hardware implementation
    4.3.2 SIMD Optimization
5 Results
  5.1 Execution Time
    5.1.1 SOC AI Model
    5.1.2 NMPC Model
  5.2 Memory Usage Benchmark
  5.3 Execution Time SIMD vs GPU
6 Conclusion
7 Future Work
Bibliography
A Appendix 1
  A.1 Build System: CMake
  A.2 Measuring Execution Time
    A.2.1 Chronos Library
    A.2.2 Perf Tool
  A.3 Profiling Tools
    A.3.1 Perf Profiling
    A.3.2 Nvprof Profiling
  A.4 Memory footprint measurement
    A.4.1 Valgrind: Massif tool

List of Acronyms

Below is the list of acronyms that have been used throughout this thesis:

AI Artificial Intelligence
MPC Model Predictive Control
GPU Graphics Processing Unit
TPU Tensor Processing Unit
SIMD Single Instruction, Multiple Data
GCC GNU Compiler Collection
CNN Convolutional Neural Network
FPGA Field Programmable Gate Array
ML Machine Learning
DSP Digital Signal Processing
SLP Superword-Level Parallelism
PERF Performance analyzing tool in Linux
SISD Single Instruction, Single Data
MIMD Multiple Instruction, Multiple Data
MISD Multiple Instruction, Single Data
LLVM Low Level Virtual Machine
NMPC Non-linear Model Predictive Control
DP Dynamic Programming
SOC State Of Charge
IPOPT Interior Point OPTimizer
FNN Feedforward Neural Network
RMSE Root Mean Squared Error
ReLU Rectified Linear Unit
CUDA Compute Unified Device Architecture
SSE Streaming SIMD Extensions

1 Introduction

The recent advancements in automobiles and information technology have transformed conventional vehicles into smart commuting machines such as hybrid vehicles and self-driving vehicles. These features are enabled by cutting-edge communication and computing technologies, which pose many challenges for the design of autonomous driving edge computing systems. Major technologies such as AI (Artificial Intelligence) and MPC (Model Predictive Control) are currently being used to improve the functioning and responsiveness of these vehicles [1] [2]. These applications are often complex and resource-intensive, and have therefore generally been hosted on cloud platforms [3]. In general, it is more advantageous, or even required, to have the inference close to the source of data or action requests [4], avoiding the need to send the data to a cloud service and wait for a response. In many scenarios, data transmission to the cloud is unreliable, if not impossible, or has a high latency with uncertainty about the communication's round-trip delay, which is unacceptable for latency-sensitive applications that require real-time decisions. Other considerations, such as data security and privacy, force data to remain on edge devices.

However, existing edge devices pose challenges in themselves, as some of them are not capable of parallel computation. There is a need to opt for multi-core, parallel computing edge devices [5]. Lately, to provide edge devices with the necessary computation capability to handle complex algorithms and AI, unconventional approaches such as the tensor processing unit (TPU) by Google and new graphics processing unit (GPU) architectures by Nvidia are being developed [6]. These devices have multi-core processors which can process many pieces of data simultaneously, making them useful for complex algorithms. Embedded platforms with limited resources, such as Internet of Things (IoT) devices, cannot afford such accelerators as they operate on batteries with limited capacity. For specific applications such as vehicle motion control in the automotive sector, Arm processors are widely adopted as control units, and shifting to other hardware is therefore inconvenient. It is thus important to investigate an efficient method for implementing AI and similar complex functions on an Arm processor by exploiting resources such as Arm Neon, thereby gaining the advantage of parallel computation with SIMD support.
SIMD is an architecture that was used as a base for early supercomputers [7] and that is capable of parallel computation. Recent developments in technology have made SIMD a generic feature in high-end processors. Many mobile applications, such as multimedia, graphics, and signal processing, have appeared that require better performance and output quality, and therefore need the parallel computation capabilities that SIMD provides. SIMD also delivers the performance speed-up needed by computationally heavy applications such as computer vision and digital signal processing, and in general by any application that performs the very same operation over large amounts of data and can therefore gain significant performance benefits from the use of SIMD extensions. Many commonly used microprocessors have improved their architectures with specific SIMD extensions, such as Neon for Arm microprocessors [8], AVX for Intel [9] and 3DNow! for AMD microprocessors [10]. These add-ons are available in specific co-processors that enable vectorization.

1.1 Problem Statement

Complex algorithms like the MPC model and the AI model are two of the major technologies used to improve the functionality of the vehicle. These algorithms are computationally expensive and resource intensive. As a result, their performance must be improved for them to meet the system's stringent deadlines. Hence, edge devices with multi-core and vectorized computing capabilities would be advantageous for these applications. The GCC compiler offers an auto-vectorization feature, but it does not always give optimal results; thus, it can be necessary to implement vectorized operations manually. Arm-compatible edge devices such as the Raspberry Pi Zero 2W have the necessary capabilities, such as Arm Neon registers, to vectorize operations. This thesis investigates Arm Neon's capabilities on these models by manually vectorizing the operations using Neon intrinsics, the Ne10 library, and auto-vectorization on the Raspberry Pi Zero 2W. The findings will be used to answer the following research question:
• How well can the AI and MPC models be implemented on an Arm processor using Neon technology?

1.2 Related Work

Recently, specifically in the autonomous driving field, extensive research is being carried out to deploy AI and similar complex algorithms efficiently on edge devices, due to the challenges and concerns of centralized cloud computing. The following articles help in understanding the issue. To start with, Lee et al. [11] proposed a method to accelerate a Convolutional Neural Network (CNN), the LeNet network, by using the SIMD architecture. Performance was evaluated on a Raspberry Pi 3 Model B by utilizing Arm Neon, the SIMD processing unit inside Arm CPUs. In comparison to the traditional implementation, the proposed implementation achieved a speed-up of up to 2.66 in execution time and a 3.55-fold decrease in energy consumption. Similarly, a study on the implementation of six different digital signal processing (DSP) algorithms on an A15 architecture used SIMD to optimize performance [12]. These implementations were then compared to those automatically produced by the compiler's auto-vectorization feature. The hand-written SIMD implementations achieved much lower execution times than the compiler-generated code, with speed-ups ranging from 2.47 to 5.11.
In another research project, to address the problem of slow inference on edge devices, researchers have focused on accelerating edge inference by both hardware and software means. An FPGA System-on-Chip based architecture to speed up ML computations in an edge environment was proposed in [13]. In [14], Gaurav Mitra et al. considered and compared different hardware with SIMD architecture, namely the Neon SIMD instruction set used on the Arm Cortex-A series of RISC processors and the SSE2 SIMD instruction set found on Intel platforms, within the context of the Open Computer Vision (OpenCV) library. The performance of compiler auto-vectorization was compared to that of hand-tuned code across five different benchmarks and ten different hardware platforms. Hand-tuned Neon benchmarks on Arm platforms were 1.05x to 13.88x faster than auto-vectorized code, while hand-tuned SSE benchmarks on Intel platforms were 1.34x to 5.54x faster. Also, Liu et al. [15] proposed a computer architecture for a self-driving vehicle that is based on heterogeneous hardware. The authors identified the bottlenecks in autonomous driving, which were found to be localization and perception, and matched them with suitable accelerators such as the CPU, GPU, and DSP, which enabled them to obtain high-performance and energy-efficient results. Hence, these papers suggest that SIMD is advantageous for workloads that consist of simple, repetitive arithmetic operations over enormous amounts of data.

1.3 Aim and Objectives:

The main research question addressed in this thesis work is to analyze how capable Arm Neon is, compared to other hardware, when employed for edge computing of vehicle motion control models. Based on this overall aim, the following objectives are formulated:
• Identify the best implementation method by deploying the existing MPC, i.e., the non-linear MPC (NMPC) [16], on the selected Arm processor by making use of its SIMD component. Evaluate the performance, considering computation time and the amount of memory used.
• Similarly, investigate and implement an AI model on Arm Neon using the SIMD architecture and compare it with the traditional approach of sequential computation using only the CPU. This will enable us to recommend the best way to implement such models on Arm devices with Neon technology. The model will further be implemented on other hardware, such as an Nvidia GPU, for comparison.

1.3.1 Limitation:

Since the main focus of the thesis is to optimize the models using the SIMD architecture, only limited machine-independent optimization is done, and the execution time measurement necessary for the benchmark is constrained to two main techniques, PERF [17] and Chronos [18]. Each model also has its own limitations and complications for SIMD optimization in this thesis.

Limitation of the NMPC model: In the case of the NMPC model, the optimization is limited to the evaluation functions used to calculate the cost functions, which can be vectorized; the remaining functions were not chosen due to limited time. Since these evaluation functions use double precision to store their values, the Neon-enabled library, which is developed for single precision, cannot be used. Hence, this limits the NMPC model to an investigation with Arm Neon intrinsics.

1.3.2 Thesis Outline:

For ease of understanding, the thesis report is divided into the following chapters:
2. Theory: This chapter starts with the necessary background literature.
Further, it aims to provide the reader with technical insight into important topics such as SIMD, Arm Neon technology, and the performance metrics used in the thesis.
3. Implementation models: This chapter describes the models used for implementation in the thesis.
4. Methods: This chapter describes the research methodology for all the implementation models, and how the implementations were developed and tested in the thesis.
5. Results: This chapter presents and discusses the project results and concludes with a possible solution to the benchmark.
6. Conclusion: This chapter concludes with the limitations of the thesis.
7. Future work: This chapter describes how the project can be expanded in the future.

2 Theory

This chapter aims to provide technical insight into topics that are salient to the thesis, which include SIMD, Arm Neon technology and its usage, and the performance metrics used in the thesis.

2.1 Flynn's Classification

Based on the number of instructions and data streams that can be processed concurrently, Michael J. Flynn, in his highly cited 1966 paper [19], classified computing systems into the following four major categories:
1. SISD: A single-instruction, single-data-stream (SISD) computing system is a single-processor machine that executes a single instruction to operate on a single data stream.
2. SIMD: A single-instruction, multiple-data-stream system is a multiprocessor machine that performs the same operation on multiple pieces of data concurrently.
3. MISD: Of these four categories, the multiple-instruction, single-data-stream system is a rarely used class. However, as we approach the limits of Moore's law, a few researchers are proposing models based on this parallelism architecture [20].
4. MIMD: A multiple-instruction, multiple-data-stream system involves multiple processors independently executing several instructions on several data sources.

Figure 2.1: Flynn's Classification (instruction stream vs. data stream, single vs. multiple).

As the main focus of the thesis is to investigate the SIMD architecture in an Arm processor, SIMD is emphasized in the subsequent discussions.

SIMD

Machines executing vectors of data have been present for a while now. Early supercomputers such as the CDC Star-100 and the Texas Instruments ASC could operate on a vector of data with a single instruction [21]. SIMD machines can only exploit data-level parallelism, not concurrency. Figure 2.2 helps to understand the difference between a conventional scalar operation and a SIMD operation. Using scalar operations, four add instructions have to be executed sequentially to acquire the four sums. A SIMD operation, on the other hand, achieves the same outcome with a single add instruction.

Figure 2.2: Scalar and SIMD operations illustration; (a) scalar operation, (b) SIMD operation.

It is also important to note in figure 2.2 that each rectangle represents a register. A regular register, as illustrated in figure 2.2, can accommodate only a single scalar value and aids in the fast retrieval of data for processing by the CPU. Architectures with SIMD capabilities, however, contain a separate set of wide registers, generally termed SIMD or vector registers. These registers consist of multiple lanes, which allows them to hold multiple scalar values. The width of the SIMD registers varies depending on the architecture.
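As a concrete sketch of the four-lane addition illustrated in figure 2.2 (b), the short example below uses Arm Neon intrinsics (introduced later in section 2.4.3) to add four 32-bit elements with a single instruction. It is an illustration only, assuming a 128-bit register holding four float lanes, and is not part of the thesis models.

#include <arm_neon.h>

/* Adds the four 32-bit lanes of A and B with one SIMD instruction,
   mirroring the SIMD operation in figure 2.2 (b). */
void add_four_lanes(const float *A, const float *B, float *C)
{
    float32x4_t va = vld1q_f32(A);      /* load A1..A4 into one vector register */
    float32x4_t vb = vld1q_f32(B);      /* load B1..B4 into a second register   */
    float32x4_t vc = vaddq_f32(va, vb); /* one add instruction produces C1..C4  */
    vst1q_f32(C, vc);                   /* store the four results back to memory */
}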
If we assume the width of the SIMD register to be 64 bits, it allows the user to operate on either eight 8-bit elements, four 16-bit elements, or two 32-bit elements. Likewise, in the example provided in figure 2.2 (b), if we assume the width of the SIMD register to be 128 bits, then we are adding four 32-bit elements to another register with a similar structure. The corresponding elements in the lanes of the two registers are added, and the result is placed in the respective lane of the destination register.

2.2 Armv8-A Architecture

The processor used in the thesis is the Cortex-A53 processor from Arm, which implements the Armv8-A architecture. The Armv8-A architecture includes two execution states, 32-bit and 64-bit, and each execution state has its own instruction set.
1. AArch32: In this execution state there are only 16 128-bit wide and 32 64-bit wide registers. The maximum width for load and store operations is 64 bits, and the register aliases are completely different from those in AArch64.
2. AArch64: There are in total 32 128-bit registers in the AArch64 execution state of the Armv8-A architecture.

2.3 Arm Neon

Arm's implementations of the Advanced SIMD architecture are referred to as Arm Neon. Armv8-A includes a separate register file, a set of 128-bit wide registers, for performing SIMD operations. The Neon unit is completely integrated into the CPU, sharing resources for integer operations, loop control, and caching. Compared to a hardware accelerator, this significantly saves space and power.

Figure 2.3: Armv8-A Neon register packing (views Vx.16B, Vx.8H, Vx.4S and Vx.2D of a 128-bit register; in register names, x = 0-31).

Figure 2.3 describes the packing of the 128-bit Neon registers in the Armv8-A architecture. Note that in the AArch32 execution state, where only a 64-bit Neon register width is allowed, only the 64 least significant bits are used. As illustrated in figure 2.3, a 128-bit wide register can be packed as follows:
• Sixteen 8-bit elements, represented by the operand suffix .16B, where B indicates byte.
• Eight 16-bit elements, represented by the operand suffix .8H, where H indicates halfword.
• Four 32-bit elements, represented by the operand suffix .4S, where S indicates word.
• Two 64-bit elements, represented by the operand suffix .2D, where D indicates double word.

2.4 Utilizing Arm Neon

Neon technology can be used in the following ways:
1. Hand-coded Neon assembler
2. Auto-vectorization
3. Neon intrinsics
4. Neon-enabled libraries

2.4.1 Hand-coded Neon assembler

Hand-coded Neon assembler uses Neon assembly instructions to vectorize the data. Carefully hand-written assembler code can maximize the performance obtained from Neon, especially for performance-critical applications. The disadvantage is that assembly code is difficult to write and maintain. Given the limited time, this way of optimizing is not investigated in the thesis.

2.4.2 Auto-vectorization

Auto-vectorization is the technique of allowing the compiler to automatically detect possibilities to use advanced SIMD instructions in the source code. Essentially, this compiler optimization includes two techniques: loop vectorization and Superword-Level Parallelism (SLP) optimization. The compiler targets loops and merges similar independent scalar instructions into vector instructions.
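As a minimal illustration (not taken from the thesis code), a simple element-wise loop of the following form is a typical candidate for auto-vectorization; whether the compiler actually emits Neon instructions for it depends on the compiler version and the optimization options used, for example gcc -O3.

/* Each iteration is independent, so the compiler may merge several
   iterations into one Neon vector instruction. */
void scale_and_add(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i++) {
        c[i] = 2.0f * a[i] + b[i];
    }
}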
Compilers that support auto-vectorization include Arm Compiler 6, Arm C/C++ Compiler, LLVM-clang, and GCC. The following are a few advantages of employing auto-vectorization:
• Source code written in high-level languages, such as C and C++, remains portable unless architecture-specific code elements such as inline assembly or intrinsics are included.
• Auto-vectorization requires significantly less design time compared to writing hand-optimized assembly code.
• Targeting a specific micro-architecture can be as simple as changing a single compiler option, whereas optimizing an assembly program demands a thorough understanding of the target hardware.
However, auto-vectorization is not always the best option, as there are cases where the compiler fails to identify opportunities to use Neon and thereby fails to produce completely optimized code.

2.4.3 Neon Intrinsics

Intrinsics are functions whose exact implementation is known to the compiler. Accordingly, when generating the assembly code, the compiler replaces these function calls with an appropriate instruction or sequence of instructions. Intrinsics provide almost the same control as writing assembly language, but the compiler handles register allocation and pipeline optimization. Neon intrinsics are a collection of C and C++ functions that are declared in the arm_neon.h header file. They are supported by the Arm compiler as well as GCC. The following piece of code is one example that describes the usage of intrinsics.

// Intrinsic function usage in C/C++ (A, B and C are of type float64x2_t)
float64x2_t C = vaddq_f64(A, B);

// Assembly instruction generated for the above intrinsic
fadd v0.2d, v1.2d, v0.2d

The vaddq_f64 function is an intrinsic that adds two vectors, each containing two 64-bit elements, and returns a vector of the same type. Comparing the intrinsic call with the assembly instruction generated by the compiler, it can be seen that the vectors A and B are placed in the two source SIMD registers v0 and v1, respectively, and .2d indicates that each register has two lanes.

2.4.4 Neon-enabled Libraries

There are also a few libraries that provide support for employing Neon technology. One such library, termed Ne10 [22], is the focus in this thesis. It is an open-source software library maintained by Arm, targeting Arm architectures, which provides optimized implementations of essential operations in general math, signal processing, image processing, and physics functions. The Ne10 project was designed to provide a collection of widely used functions that have been significantly optimized for the Arm architecture and offer dependable, well-tested behavior that is simple to include in programs. Both the assembler and Neon implementations come with C interfaces to the routines [23].

2.5 Performance metrics

Execution Time Measurement

Chronos Library

In this thesis, the std::chrono library is used to measure execution time. It is part of the C++14 Standard Library and includes three standard clocks that the user can interface with [18]. The system clock represents a system-wide real-time clock, which is the machine's best guess of the current time on the wall clock. The steady clock is a monotonic clock where each tick takes the same amount of time. The high-resolution clock is the clock with the shortest tick period. As the high-resolution clock provides high precision and a resolution of up to nanoseconds, it was selected for the timing measurements.
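A minimal sketch of the measurement pattern used with the high-resolution clock is shown below; the run_model() function is a placeholder workload and not the thesis code.

#include <chrono>
#include <iostream>

/* Placeholder workload standing in for one run of the model under test. */
static volatile float sink = 0.0f;
static void run_model()
{
    for (int i = 0; i < 1000000; i++) {
        sink = sink + 0.001f;
    }
}

int main()
{
    auto start = std::chrono::high_resolution_clock::now();
    run_model();
    auto stop = std::chrono::high_resolution_clock::now();

    /* Report the elapsed wall-clock time in nanoseconds. */
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
    std::cout << "Execution time: " << ns.count() << " ns" << std::endl;
    return 0;
}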
Furthermore, Chronos provides real-time measurements. Real time is also termed wall-clock time, which is the time counted from the start of the process to its end. This means that, since the tests are run on top of an OS, the time measurements include time slices of other processes that interrupted the process of interest. Consequently, the time measurements obtained would not be only the actual time spent on the process of interest, but would also contain the time spent on other high-priority tasks scheduled by the OS.

Perf tool

To obtain the CPU time, which is the actual time spent on the process by the CPU, another tool known as Perf is investigated and employed in this thesis. The Perf tool is a performance analyzing tool for Linux [17]. It contains various sub-commands which can be used to get in-depth details of an event or process. The stat sub-command can be used to get the CPU time of a particular task, and the measurement has a resolution of up to 9 decimal places (nanoseconds).

3 Implementation Models

As the title suggests, the investigation in this thesis covers both a traditional and an AI-based model. This chapter provides a comprehensive background study of these two models as used in this work.

3.1 NMPC model

As the traditional approach, the non-linear MPC (NMPC) implemented in the Chalmers MSc thesis [16] is used. In [16], Hultgren and Husmark proposed the implementation of an offline-online coupled powertrain control solution for a three-mode hybrid electric vehicle to minimize the total energy consumption over a route. The offline component uses a simplified version of the powertrain to generate optimized State Of Charge (SOC) and velocity references through dynamic programming, given speed limits and the topographic profile of a road segment. The online component acts on these references by generating torque setpoints for the power sources and selecting the most optimal gear, using the optimal control technique of nonlinear model predictive control on a detailed dynamical model of the powertrain. The basic block diagram of the controller is shown in figure 3.1.

Figure 3.1: Block diagram of the control concept design [16]

In this case, the offline optimization (C1 in figure 3.1) is only evaluated once per trip, whereas the online optimization (C2 in figure 3.1) is used several times at intervals. Also, C2 contains multiple vector operations, which suits the SIMD architecture. Thus, this thesis only considers the online optimization (C2) for further SIMD implementation.

3.1.1 C2: Online NMPC Controller

The online NMPC controller is split up into two parts, namely the Elector and the Controller.

3.1.1.1 Elector

The Elector's job is to orchestrate the Controller by determining which of the three hybrid modes (Electric, Serial, and Parallel) and which gear are best for the current driving situation. The desired mode and gear are elected by optimizing the objective function given by a specific NMPC, and remain in use until the next election is done.

Figure 3.2: Representation of the Elector [16]

Figure 3.2 shows the representation of the Elector, where ω1 is the ICE output shaft speed, ω2 is the gearbox input shaft speed, ζ is the SOC, d is the distance, vref is the velocity reference, ζref is the SOC reference from the C1 optimization, and igear is the gear from the previous run.

3.1.1.2 Controller

Once the election is complete, the chosen hybrid mode-gear combination is sent to the Controller.
The purpose of the Controller is to compute the input signals to the plant. Figure 3.3 represents the inputs and outputs of the Controller. Here, i_elected_mode and i_elected_gear are the hybrid mode-gear combination elected by the Elector, and x(k) are the states which will be optimized to obtain the desired output torques by using the IPOPT solver (see Section 3.1.2). ζref and vref are the SOC and velocity references computed by the offline optimization. Taking these inputs, the Controller runs at each time step to compute the various input torques by optimizing the previous states using the NMPC solver (IPOPT). The calculated torque is required for the vehicle to follow the SOC and velocity references. In both the Elector and the Controller, the solver used for the nonlinear programs is the "Interior Point OPTimizer", the IPOPT [24] library for large-scale nonlinear optimization. The model is implemented in MATLAB and SIMULINK, and the SIMULINK model is used to generate C scripts for use in the SIMD applications.

Figure 3.3: Representation of the Controller [16]

3.1.2 IPOPT Library

IPOPT is an open-source software package, written in C++, for large-scale nonlinear optimization. It is used to solve general nonlinear programming problems of the form

    min f(x)   over x ∈ R^n
    s.t. g_L ≤ g(x) ≤ g_U
         x_L ≤ x ≤ x_U

where x ∈ R^n are the optimization variables, f: R^n -> R is the objective function, and g: R^n -> R^m are the general nonlinear constraints. The functions f(x) and g(x) can be convex or non-convex, and they can be linear or nonlinear (but should be twice continuously differentiable). The nonlinear, non-convex constraints and objective functions in this work are solved by the IPOPT library. It can be used to solve optimization problems written in a programming language such as C, C++, Fortran, or MATLAB. To accomplish this, the following functions/methods provide the necessary information to IPOPT as input:
• Problem size [get_nlp_info] and bounds [get_bounds_info];
• Starting point [get_starting_point];
• Function values f(x_k) [eval_f] and g(x_k) [eval_g];
• First derivatives ∇f(x_k) [eval_grad_f] and ∇c(x_k) [eval_jac_g];
• Second derivatives σ_f ∇²f(x_k) + Σ_j λ_k^(j) ∇²c^(j)(x_k) [eval_h].

3.2 AI model

Since the introduction of AlexNet [25] in 2012, progress in the field of AI, particularly Deep Neural Networks (DNN), has accelerated exponentially. With their matrix multiplication strategies, these DNNs provide a significant opportunity to exploit parallelism, but in return they demand significant computational power for training and inference since they use large neural networks. One such neural network model is chosen in this thesis, namely xEV battery State-of-Charge estimation using a deep feedforward neural network (FNN) [26]. This model was chosen mainly considering its potential for exploitation in a SIMD implementation, as it is a DNN and the core operation of a DNN is the multiply-accumulate operation. Other considerations include the model being related to the vehicle motion and energy field, and having an open-source copyright. The objective of this model is to utilize a deep FNN approach to estimate the battery State-of-Charge. The project was mainly done in MATLAB and outlines data collection, preparation, development, tuning, and robust validation of the FNN against sensor noise. The model in [26] was subjected to datasets with errors purposefully added to the data during training to produce a robust estimator.
For example, a cell voltage variation of 4 mV, a cell current variation of 110 mA, and a temperature variation of 50 °C were introduced. The error values were chosen to be comparable to the noise and error found in the actual sensors used in commercially available electric vehicles. The robust FNN, trained on two Li-ion cell datasets, one for a nickel manganese cobalt oxide (NMC) cell and the second for a nickel cobalt aluminum oxide (NCA) chemistry cell, was shown to overcome the added errors and obtain an SOC estimation accuracy of 1% RMSE (Root Mean Squared Error) [26].

3.2.1 Neural network

Figure 3.5 shows the representation of the neural network of the model in [26]. The neural network contains a normalized input layer with 5 inputs, i.e., voltage V, current I, temperature T, average voltage V_avg, and average current I_avg. As illustrated in figure 3.5, the inputs are fed to three fully connected layers (the hidden layers in figure 3.5) containing 55, 55, and 1 neurons, respectively. The three hidden layers each have a non-linear activation function, i.e., hyperbolic tangent, leaky Rectified Linear Unit (ReLU), and clipped ReLU, respectively. The hyperbolic tangent activation function, simply known as the tanh function, maps any real number into the range -1 to 1 by applying the tanh function to it. The leaky ReLU activation function receives a value as input and, if the value is positive, returns the same value; if the value is negative, the function returns the input value scaled by a constant. The equation of the leaky ReLU is shown in equation 3.1. The clipped ReLU function, as shown in equation 3.2 [27], performs a threshold operation where the input value is compared to a constant termed the "ceiling". If the input value is less than 0, the function returns 0, and if the input value is greater than the ceiling constant, the output is set to the ceiling value. The input value is retained if it is between 0 and the ceiling value.

    f(x) = max(scale * x, x)                          (3.1)

    f(x) = 0        if x < 0
           x        if 0 ≤ x < ceiling                (3.2)
           ceiling  if x ≥ ceiling

In the neural network model under discussion, a scale of 0.3 is used for the leaky ReLU layer, whereas a ceiling of 1 is used for the clipped ReLU layer. The model is trained in MATLAB for 50 epochs (an epoch being one full pass over all the training data) and is then used for the optimization work. Figure 3.4 shows an overview of how the training and testing processes are carried out in a feed-forward neural network. The trained model is saved, and the trained parameters, i.e., the weights and biases from the last epoch, are extracted. These parameters are used for the development of the inference by exploiting SIMD in the thesis, as shown in figure 3.4.

Figure 3.4: Representation of the feed-forward training and testing overview (training: forward propagation for SOC estimation, loss calculation, and back propagation to update the network parameters; inference: forward propagation on the testing data only).
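To make equations 3.1 and 3.2 concrete, the following minimal scalar C sketch implements the two activation functions with the scale and ceiling passed as parameters. It is an illustration only, not the thesis implementation; the layer-wise SIMD versions are described in chapter 4.

#include <math.h>

/* Leaky ReLU, equation (3.1): f(x) = max(scale * x, x). */
float leaky_relu(float x, float scale)
{
    return fmaxf(scale * x, x);
}

/* Clipped ReLU, equation (3.2): clamps the input to the range [0, ceiling]. */
float clipped_relu(float x, float ceiling)
{
    if (x < 0.0f) {
        return 0.0f;
    }
    if (x >= ceiling) {
        return ceiling;
    }
    return x;
}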
Figure 3.5: Representation of the neural network (normalized input layer {V, I, T, V_avg, I_avg}; hidden layer 1 with 55 neurons and hyperbolic tangent activation; hidden layer 2 with 55 neurons and leaky ReLU activation; hidden layer 3 with 1 neuron and clipped ReLU activation; regression output: estimated SOC).

4 Methods

The purpose of this chapter is to describe how the various SIMD implementations were carried out, detailing the approach taken, the coding style used, etc.

4.1 Working Environment

The working environment illustrated in figure 4.1 is employed in this thesis. Programming and debugging are carried out in the development environment, which is an x86-64 machine running Windows. The program is then compiled into a static binary using a suitable cross-compiler; for the cross-compilation, GCC's aarch64-none-linux-gnu toolchain [28] is utilized. The GNU toolchain is chosen since it is open source and the most extensively used. Once the executable is created, it is transferred to the target environment, which is a Raspberry Pi Zero 2W running Raspberry Pi OS (64-bit), for performing the evaluations.

Figure 4.1: Setup environment (executable built in the development environment on a PC and transferred to the target environment, the Raspberry Pi Zero 2W).

4.2 NMPC Model Implementation, Optimization and Evaluation

In this thesis work, only the 'C2 Online NMPC Controller' has been used for evaluation. The controller is decoupled from the plant model, which consequently limits the model to running only on initial inputs and not on updated inputs, which are essentially the previous outputs of the controller.

4.2.1 Hardware Implementation

As mentioned previously in section 3.1.1, the controller was designed and developed using MATLAB/SIMULINK. For the hardware implementation of the controller, C code was generated using the SIMULINK Coder tool [29]. C/C++ code was then manually developed to interact with the controller, i.e., for providing inputs, forming the optimization problem, and receiving outputs. Since the Chronos library is used for the timing measurements, the top file is written in C++. The entire project is built with the CMake tool [30], which is further explained in the Appendix.

Figure 4.2: Flow chart for the hardware implementation of the NMPC model (interface with the generated C code of the NMPC model, compile and link with the IPOPT library built for the development environment, verify the C/C++ implementation against the MATLAB/SIMULINK implementation, build the IPOPT library for the target environment, cross-compile and link with the IPOPT library built for the Arm device, and verify the NMPC implementation on the Arm Cortex-A53; on verification failure, the C/C++ source files are changed and/or the IPOPT library is rebuilt).

4.2.2 SIMD Optimization

The NMPC model built in C is then optimized using the SIMD architecture. Since SIMD is useful for vectorizing sequentially computed scalar operations, it is necessary to find suitable operations in the model that can be replaced with vector operations. Through a thorough investigation, it was found that the evaluation functions, such as the objective function and constraint function, which are inputs to the IPOPT solver explained in section 3.1.2, contain several complex arithmetic operations that could be vectorized.
The profiling results shown in table 4.1 indicate that these evaluation functions contribute the least to the execution time of the model; however, they were the major part of the model that could be utilized for vectorization. Thus, only these functions were selected for SIMD implementation in this thesis.

Table 4.1: Profiling results of the NMPC model run 1000 times.

Serial No. | Name                                 | Percentage time (%) | Execution time (sec)
1          | dmumps_solve_node_fwd                | 4.32                | 0.15
2          | init_malloc                          | 3.17                | 0.11
3          | dmumps_fac_asm_master                | 3.17                | 0.11
4          | Ipopt::DenseVector::AddTwoVectorImpl | 2.88                | 0.10
...        | ...                                  | ...                 | ...
123        | eval_h_EM                            | 0.29                | 0.01
124        | eval_jac_g_EM                        | 0.29                | 0.01
125        | eval_jac_g_Serial                    | 0.29                | 0.01

Optimization through Auto-vectorization

As previously stated in chapter 2, auto-vectorization is used to optimize the NMPC model. With the auto-vectorization technique, the compiler attempts to vectorize any blocks of code that are automatically recognized as being vectorizable. The open-source GCC compiler is utilized in this thesis since it provides auto-vectorization. With the GCC compiler, auto-vectorization can be enabled by setting optimization options such as -Ofast, -O3, etc. However, the compiler failed to identify many SIMD optimization opportunities in the model. This is because the compiler cannot vectorize straight-line sequences of arithmetic operations, but mainly produces optimized code for loops.

Optimization through Arm Intrinsics

Further, Arm intrinsics are used to optimize the evaluation functions in the NMPC model. The following is one example of the usage of Arm intrinsics in one of the evaluation functions.

// Arithmetic scalar operations before the SIMD implementation
t2 = in1[63] * in1[63];
t3 = in1[64] * in1[64];
t4 = in1[65] * in1[65];
t5 = in1[66] * in1[66];

// Assigning the address of in1[63] to the in1_63 variable
const double* in1_63 = &in1[63];

// Arithmetic vector operations using Arm intrinsics
float64x2_t t25_1 = vmulq_f64(*((float64x2_t *)in1_63),
                              *((float64x2_t *)in1_63));
float64x2_t t25_2 = vmulq_f64(*((float64x2_t *)in1_63 + 1),
                              *((float64x2_t *)in1_63 + 1));

Here, the scalar operations used to calculate t2, t3, t4, and t5 are vectorized using vmulq_f64 by taking two inputs at a time. For example, *((float64x2_t *)in1_63), which is used as an input to the intrinsic function, contains two values, in1[63] and in1[64]; these are multiplied as a vector, and the result is assigned to the variable t25_1. The same procedure is applied to t25_2 to vectorize the operations for t4 and t5. This procedure is carried out to replace most of the scalar operations present in the evaluation functions in order to vectorize them. Further, the assembly code is inspected to ensure that the compiler indeed vectorizes these operations.

Figure 4.3: Flow chart for the Neon intrinsics implementation of the NMPC model (select an IPOPT input function, note the output and benchmark the implementation, develop the SIMD implementation using Arm intrinsics, verify the SIMD implementation, benchmark it, and repeat for the next function; failed verification leads to debugging, and the process is repeated if no performance increase is obtained).

4.3 AI model implementation, Optimization, and Evaluation

Figure 4.4 shows how the AI model is implemented, developed, and used in this thesis for the benchmark.
Figure 4.4: Flow chart for the SIMD implementation of the AI model (extract the parameters from the trained model, MATLAB inference implementation, C implementation on Arm, verification, profiling and benchmarking, then per-layer SIMD optimization through intrinsics, auto-vectorization and Ne10 implementations, each followed by verification and benchmarking and repeated for the other layers; failed verification leads to debugging).

4.3.1 Hardware implementation

The model is developed in C, which is usually done in one of two ways: using C code generated from MATLAB, or coding it manually. For maximum utilization of SIMD in the model, the second way is chosen. This means the model is built from scratch, layer by layer, by extracting all the parameters from the trained neural network. Verification is done by comparing the output with the MATLAB model, which integrates the trained network. The entire project is built using the CMake tool, which is described in the Appendix.

4.3.2 SIMD Optimization

The AI model built in C is then optimized using the SIMD architecture and benchmarked. Since SIMD is useful for vectorizing scalar operations, it is necessary to select a suitable layer that yields the most benefit from SIMD. To accomplish this, the model layers are profiled. Based on the profiling results shown in table 4.2, a suitable layer is selected for further implementation. As explained in chapter 2, the optimization is done using auto-vectorization, Arm Neon intrinsics, and the Ne10 library.

Table 4.2: Profiling results of the AI model run 1000 times.

Serial No. | Name                    | Percentage time (%) | Execution time (sec)
1          | Fully Connected Layer 2 | 76.19               | 0.16
2          | Fully Connected Layer 1 | 9.52                | 0.02
3          | Fully Connected Layer 3 | 9.52                | 0.02
4          | tanh Layer              | 4.76                | 0.01
5          | Clipped ReLU Layer      | 0.00                | 0.00
6          | Leaky ReLU Layer        | 0.00                | 0.00
7          | Sequence Input Layer    | 0.00                | 0.00

Optimization through Auto-vectorization

As explained in section 2.4.2, the model developed in C is optimized using auto-vectorization. Auto-vectorization is a technique where the compiler automatically identifies vectorizable blocks of code and tries to vectorize them. In this thesis, GCC, an open-source compiler that supports auto-vectorization, is used. Auto-vectorization can be activated in the GCC compiler by using optimization flags such as -Ofast, -O3, etc.

Optimization through Arm Intrinsics

The individual layers of the AI model are further optimized using Arm Neon intrinsics. The layers are selected for optimization based on the profiling results. As explained in section 2.4.3, suitable Arm Neon intrinsic functions are used for the optimization of all the layers in the model. Listing 4.1 is one example of Arm intrinsics usage in one of the fully connected layers (hidden layer 2 in figure 3.5). It optimizes the matrix multiplication using Arm Neon intrinsic functions. The function takes a 4×5 weight matrix (the vectors A0, A1, A2, A3, and A4, each with 4 elements) and a 5×1 input matrix (B) at a time, performs vector multiplications of the individual vectors with the individual elements of B, and provides a 4×1 matrix as output. The matrix multiplication is usually done as shown in equation 4.1:

    [ A0[0] A1[0] A2[0] A3[0] A4[0] ]   [ B[0] ]   [ C[0] ]
    [ A0[1] A1[1] A2[1] A3[1] A4[1] ]   [ B[1] ]   [ C[1] ]
    [ A0[2] A1[2] A2[2] A3[2] A4[2] ] × [ B[2] ] = [ C[2] ]      (4.1)
    [ A0[3] A1[3] A2[3] A3[3] A4[3] ]   [ B[3] ]   [ C[3] ]
                                        [ B[4] ]

As the intrinsic functions are developed specifically for vectors, the matrix multiplication is carried out by splitting the matrix into its column vectors.
Accordingly, the vectors A0, A1, A2, A3, and A4 are multiplied by the respective elements of B, as shown in equation 4.2:

    [C[0], C[1], C[2], C[3]]^T = [A0[0], A0[1], A0[2], A0[3]]^T × B[0]
                               + [A1[0], A1[1], A1[2], A1[3]]^T × B[1]
                               + [A2[0], A2[1], A2[2], A2[3]]^T × B[2]
                               + [A3[0], A3[1], A3[2], A3[3]]^T × B[3]
                               + [A4[0], A4[1], A4[2], A4[3]]^T × B[4]      (4.2)

/* Function to multiply a 4x5 matrix with a 5x1 matrix using Arm Neon functions */
void matrix_multiply_4x5_neon(float32x4_t A0, float32x4_t A1,
                              float32x4_t A2, float32x4_t A3,
                              float32x4_t A4, float32x4_t B,
                              float32_t B5, float32_t *C)
{
    /* Initializing vectors C0 and C1 */
    float32x4_t C0;
    float32x4_t C1;

    /* Using the 'vmovq_n_f32' function to initialize the vectors C0 and C1 to zero */
    C0 = vmovq_n_f32(0);
    C1 = vmovq_n_f32(0);

    /* Using the 'vfmaq_laneq_f32' function to multiply and accumulate
       in 4x1 blocks, i.e. each column in C */

    /* Multiply An with the nth element of B and accumulate it into all
       the elements of the vector C0 */
    C0 = vfmaq_laneq_f32(C0, A0, B, 0);
    C0 = vfmaq_laneq_f32(C0, A1, B, 1);
    C0 = vfmaq_laneq_f32(C0, A2, B, 2);
    C0 = vfmaq_laneq_f32(C0, A3, B, 3);

    /* Using the 'vmulq_n_f32' function to multiply a vector with a scalar */
    C1 = vmulq_n_f32(A4, B5);

    /* Using the 'vaddq_f32' function to add two vectors */
    C1 = vaddq_f32(C0, C1);

    /* Using the 'vst1q_f32' function to store a vector to the C variable */
    vst1q_f32(C, C1);
}

Listing 4.1: Intrinsics implementation of Fully Connected Layer 2

Optimization through Arm Neon library

Similar to the Neon intrinsics implementation, the Ne10 implementation is carried out. As discussed in section 2.4.4, the Ne10 library contains optimized implementations of various common math, physics, and other functions. Unlike the Neon intrinsics, the usage of the Ne10 library for utilizing the Neon technology is straightforward. Listing 4.2 shows a code snippet of the layer 2 implementation of the SOC AI model used in the thesis. Compared to listing 4.1, it can be observed that the usage of the Ne10 functions is straightforward. Ne10 functions, for instance the ne10_mul_float_neon function in listing 4.2, take the entire array as input, irrespective of the register width and individual data element size, and manage the packing of the data into the SIMD registers for computation, which makes the functions convenient for the user. In the intrinsics implementation, on the other hand, the user must have knowledge about the width and the lanes of the SIMD register and, based on that, has to choose a specific function to compute the result.
#include <../inc/NE10.h>

void fullyConnectedLayer1(float *in1, float in2[1][55]) {
    float tempMAT[5][55];
    float temp[5];
    for (int i = 0; i < 55; i++) {
        ne10_mul_float_neon(temp, in1, layer2Weight[i], 5);
        // Transpose the matrix
        tempMAT[0][i] = temp[0];
        tempMAT[1][i] = temp[1];
        tempMAT[2][i] = temp[2];
        tempMAT[3][i] = temp[3];
        tempMAT[4][i] = temp[4];
        in2[0][i] = 0;
    }
    for (int j = 0; j < 5; j++) {
        // Addition of the matrix dot products
        ne10_add_float_neon(in2[0], in2[0], tempMAT[j], 55);
    }
    // Addition of the bias
    ne10_add_float_neon(in2[0], in2[0], layer2Bias, 55);
}

Listing 4.2: Ne10 implementation of Fully Connected Layer 2

GPU implementation

AI inference is primarily performed on GPUs and TPUs, with SIMD being less sought after for the same use case. As a result, it is important to benchmark on a GPU to understand the performance differences compared to SIMD. The AI model inference on the GPU is accomplished using CUDA programming. CUDA is an extension of the C/C++ programming language and is largely similar to it. Essentially, the block of code that needs to run on the GPU has to be placed in a function with the specifier "__global__". A function with this specifier is referred to as a kernel, and the GPU can only operate on kernels; it cannot work with any other functions or pieces of code in the source files. Further, before execution of the kernel, the data related to or required for the computations in the kernel has to be copied to the GPU. This movement of data back and forth between the GPU and the host can be accomplished using the CUDA API cudaMemcpy. The cudaMemcpyHostToDevice argument is passed to the cudaMemcpy API to transfer data from the host to the device's memory (the GPU), and the cudaMemcpyDeviceToHost argument is used to copy data back from the device to the host.

/* Sequence Input Layer GPU Implementation */
__global__ void sequenceInputLayerKernel(const float *A, const float *B,
                                         float *C, int numElements) {

    int i = blockDim.x * blockIdx.x + threadIdx.x;

    if (i < numElements) {
        C[i] = A[i] - B[i];
    }
}

Listing 4.3: Sequence Input Layer GPU Implementation

The block of code in listing 4.3 demonstrates how the sequence input layer from the SOC AI model is implemented using CUDA programming. As described in the previous paragraph, a function named sequenceInputLayerKernel with the specifier __global__ is declared. This makes it a kernel, which during compilation signifies to the compiler, NVCC, that it should generate machine code to run this function on the GPU rather than the CPU. The global index, or thread index, is computed in listing 4.3 by the statement int i = blockDim.x * blockIdx.x + threadIdx.x;, which is used to select the elements of the array. The predefined variables gridDim, blockDim, blockIdx, and threadIdx, which provide the dimension of the grid, the dimension of the block, the index of the block, and the index of the thread, respectively, help in the computation of the global index.

5 Results

In this chapter, the results of the benchmarking of the various implementations in terms of execution time and memory consumption are presented. Furthermore, the comparison of the performance (execution time) of Neon with the GPU is detailed.
5.1 Execution Time

As explained in the previous chapter, to compare the performance changes of the different implementations, the models implemented in basic C code are considered the standard and are termed the base in this thesis work. Since the benchmarking is carried out on top of Raspberry Pi OS, dynamic clocking is originally activated in the OS. With the dynamic clock enabled, the OS manages the clock rate depending on the load in order to save power. Hence, the dynamic clock is disabled by adding the lines detailed in listing 5.1 to the /boot/config.txt boot file.

// Clocking at 1 GHz
arm_freq=1000
over_voltage=6
force_turbo=1

Listing 5.1: Environment variable settings to disable dynamic clocking

The arm_freq parameter can be used to set the maximum frequency of the CPU in the SOC. The over_voltage parameter sets the level of voltage consumed by the core, and the force_turbo parameter, if activated, forcefully runs the CPU at the maximum frequency even in the idle state. Hence, by overclocking the Raspberry Pi device, the clock rate was maintained at an almost stable rate when benchmarking the different workloads. Furthermore, inconsistencies in the measured execution times were observed when an implementation was benchmarked multiple times. Although the variances were negligible, an average of 10 repeated runs was captured for every executable timing measurement.

5.1.1 SOC AI Model

The optimization flag used when compiling plays a vital role in the executable's run time and code size. The optimization can target either the code size (Os) or the execution time (O0, O1, O2, O3 or Ofast), and the user has the freedom to set the type and level of the optimization. Initially, when generating the base executable, the compiler optimization was set to O0. However, benchmarking against the auto-vectorization implementation, which is compiled at Ofast to enable the usage of Neon technology, turned out to be fallacious. The reason is that the GCC compiler, when compiling at Ofast, not only optimizes using the co-processor under discussion but also employs several other techniques which are not of interest in the thesis. The various techniques involved at the different optimization levels can be found in the GCC documentation [31]. Since compiling at O3 and then switching off vectorization through an argument flag for a realistic measurement was not available, the base implementation is compiled at O2, where there are barely any techniques involved other than vectorization.

The bar graph in figure 5.1 and table 5.1 detail the results of the execution time benchmarking for the SOC AI model for each of the four implementations: base, auto-vectorization, Ne10, and intrinsics.

Figure 5.1: Comparison of execution time between the various implementations of the SOC AI model (base: 0.0575575 s, auto-vectorization: 0.054789 s, Ne10 library: 0.051223 s, intrinsics: 0.0214437 s).

The base implementation, which is considered the standard for comparison, had a dynamic instruction count of 31,025,571 and took 0.0576 sec to execute. Comparatively, the auto-vectorization implementation resulted in a performance improvement of 4.8% and a drop of 6% in the dynamic instruction count. When the assembly code was examined, it was found that only 2 layers out of the 7 layers in the AI model were vectorized.
Model were vectorized. These two layers were the sequence input layer and the leaky ReLU layer (see figure 3.5), and from the profiling results in table 4.2 it can be seen that these were the two smallest contributors to the execution time. Hence, only a modest performance increase of about 5% was obtained.

Table 5.1: SOC AI Model execution time benchmarking results

Implementation     | Mean execution time (sec) | Instruction count | Execution time change w.r.t. base | Instruction count change w.r.t. base
Base               | 0.0575575                 | 31,025,571        | -                                 | -
Auto-vectorization | 0.0547890                 | 29,132,224        | -4.8%                             | -6.10254%
Ne10               | 0.0512230                 | 34,653,925        | -11%                              | +11.7%
Intrinsics         | 0.0214437                 | 13,469,458        | -62.7%                            | -56.58%

As for the implementation using the intrinsics, a performance boost of 62.7% was achieved, as there was a significant (56.5%) reduction in the instruction count. Since the implementation processes four data elements at once, a performance gain of around 75% was expected, as a fourfold reduction of 0.0575 sec corresponds to 0.0144 sec. Reaching this theoretical value in practice is challenging. One explanation is the observed influence of the array length: the sequence input layer contains very few elements, and its execution time was found to be the same, 4.39 ms, whether it was executed on the CPU or on the SIMD unit using intrinsics. Although it is intriguing how the length of an array affects the execution time, this topic is not investigated further as it is outside the scope of the thesis.

For the Ne10 implementation, a decrease of about 11% in the execution time was obtained. Surprisingly, and in contrast to the other implementations, the instruction count increased. However, these instructions consumed on average 1.282 clock cycles each, whereas the instructions of the base implementation consumed 1.694 clock cycles, which explains why the Ne10 implementation still outperformed the base implementation. Furthermore, when scrutinizing the generated assembly code, it was observed that, although four data elements are processed simultaneously, the load and store operations used the D registers, which are 64 bits wide. Hence, before and after the computations, the data-transfer operations operated on only two elements instead of four.

5.1.2 NMPC Model

As described in the previous sections, the evaluation functions were optimized by utilizing the Arm Neon technology. Unlike the SOC AI Model benchmarking, the NMPC benchmarking was done on only three implementations, excluding the Ne10 implementation: since the evaluation functions in the NMPC model contain 64-bit data types and the math functions in the Ne10 library are only developed for 32-bit data types, the Ne10 implementation was dropped. Further, as the data types in the evaluation functions are 64 bits wide, only two data elements could be packed at a time in the Intrinsics implementation.
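To illustrate the two-lane packing mentioned above, the fragment below is a minimal sketch, not code from the thesis, of how double-precision data is processed with AArch64 Neon intrinsics: a 128-bit register holds only two 64-bit lanes, whereas the same register holds four 32-bit floats. The function name and the assumption that the array length is a multiple of two are illustrative only.

#include <arm_neon.h>

/* Illustrative sketch: element-wise addition of double-precision arrays,
 * two 64-bit lanes per 128-bit Neon register (float64x2_t).             */
void add_f64_neon(const double *a, const double *b, double *c, int n)
{
    for (int i = 0; i + 2 <= n; i += 2) {
        float64x2_t va = vld1q_f64(a + i);    /* load two doubles           */
        float64x2_t vb = vld1q_f64(b + i);
        float64x2_t vc = vaddq_f64(va, vb);   /* add both lanes in parallel */
        vst1q_f64(c + i, vc);                 /* store two doubles          */
    }
}

With 32-bit data, the corresponding types and intrinsics (float32x4_t, vld1q_f32, vaddq_f32, vst1q_f32) operate on four lanes per iteration, which is why the AI model, working on single-precision data, could pack four elements at a time.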
The bar graph in figure 5.2 details the obtained benchmarking results (execution time) for the NMPC model.

Figure 5.2: Comparison of execution time between the various implementations of the NMPC model (execution time in seconds: Base 3.384, Auto-vectorization 3.353, Intrinsics [All Functions] 3.661, Intrinsics [Selected Functions] 3.323).

The base implementation of the NMPC model took around 3.38 sec to execute, while the auto-vectorization implementation took around 3.35 sec. When examining the assembly code it was noticed that the GCC compiler had completely failed to make use of the Arm Neon SIMD registers. The 0.828% decrease in execution time is therefore not due to vectorization but due to other compiler optimization techniques, such as inlining functions to remove the function call overhead.

In the case of the intrinsics implementation, when all the Neon-optimized evaluation functions were used to run the NMPC model, the model took 8.28% more time to execute compared to the base implementation. Unlike the SOC AI Model, the NMPC model contains a very large number of small blocks of code that could be vectorized, and not everything in these blocks could be run on the SIMD co-processor; some parts still had to be computed on the CPU. Furthermore, when the assembly code was scrutinized, more data-transfer instructions between the Neon and CPU registers could be observed, which explains the deteriorated performance. Not all evaluation functions performed worse when optimized with intrinsics, however. Employing only the functions that did benefit, a mere 1.7% improvement was obtained, which is expected since these functions were insignificant contributors to the execution time, each accounting for at most 0.3% as obtained from the profiling results in table 4.1.

5.2 Memory Usage Benchmark

In order to perform the memory consumption benchmark for the various implementations in the thesis, a customized memory footprint metric is defined based on the needs of CEVT. The memory footprint is defined as the aggregate of the size of the binary on disk and the peak run-time memory consumption. The run-time memory consumption is in turn defined as the sum of the stack and heap memory allocated by a particular process. It is important to note that the peak memory allocation during the run of an implementation is considered, not the accumulation of memory allocations over the duration of the run. The Valgrind tool is used for measuring the memory allocation.

Table 5.2 provides the results obtained from the memory footprint benchmarking. The Ne10 implementation consumed considerably more disk space than any other implementation. The reason is that the Ne10 implementation was statically linked against the Ne10 library when generating the executable, meaning that the resulting binary contains the machine code for both the used and the unused functions of the library. The reason for not linking dynamically is that with dynamic linking the references to the library are resolved at run time, causing additional overhead that would affect the execution time, the primary performance metric considered in the thesis. It is interesting to note that, although the standard implementation, base, occupied the least disk space, there is no relationship between execution time and binary size, as the Intrinsics implementation has a slightly larger binary than the Auto-vectorization implementation. Furthermore, it can be observed that the peak run-time memory consumption is the same, 8,880 Bytes, for all four implementations. In all cases the CPU allocated a fixed-size stack of 8,880 Bytes, and the heap memory allocation was found to be zero as there was no object creation in any of the implementations.
Table 5.2: SOC AI Model memory footprint benchmarking results

Implementation     | Size of the executable on disk (B) | Run-time memory consumption (B) | Total (B)
Base               | 27,048                             | 8,880                           | 35,928
Auto-vectorization | 28,416                             | 8,880                           | 37,296
Ne10               | 437,884                            | 8,880                           | 446,764
Intrinsics         | 28,536                             | 8,880                           | 37,416

5.3 Execution Time SIMD vs GPU

Since installing the Perf tool on the Linux4Tegra OS that the Jetson Nano runs proved difficult, the Chronos library was used for the timing measurements.

The Jetson Nano GPU took 2.22 sec to execute the SOC AI model, while the Neon Intrinsics implementation on the CPU took a mere 0.0214 sec. It was assumed that copying data between host memory and device memory could be the cause of this bottleneck. Hence, profiling was carried out for further analysis, and the result is detailed in table 5.3.

Table 5.3: SOC AI Model profiling on the Jetson Nano GPU

Sl. No. | Name                       | Percentage time (%) | Execution time (ms)
1       | fullyConnectedLayer2Kernel | 28.11               | 34.538
2       | [CUDA memcpy HtoD]         | 21.19               | 26.037
3       | fullyConnectedLayer1Kernel | 13.89               | 17.070
4       | [CUDA memcpy DtoH]         | 12.44               | 15.284
5       | sequenceInputLayerKernel   | 6.42                | 7.8927
6       | tanhLayerKernel            | 6.32                | 7.7609
7       | leakyReLULayerKernel       | 5.95                | 7.3121
8       | fullyConnectedLayer3Kernel | 5.67                | 6.9720

As observed from the profiling results, although the memory copy operations (CUDA memcpy HtoD and CUDA memcpy DtoH) are among the significant contributors to the execution time, the fullyConnectedLayer2Kernel still tops the list with 34.5 ms, compared to the Neon Intrinsics implementation that barely took 15.4 ms.

6 Conclusion

According to the aim of this thesis, the two selected models were investigated and benchmarked on an Arm processor utilizing its SIMD component. The investigation covered the considered Arm Neon methods: Arm Neon intrinsics, auto-vectorization, and a Neon-enabled library (Ne10). Further, two performance metrics were considered in the benchmarking: execution time and memory footprint. Since execution time is the primary metric in the thesis, unless otherwise stated, the term "performance" in this section refers to execution time.

In general, when used for optimization, the fixed width of the SIMD registers can itself become a shortcoming. For instance, consider an application in which most of the data consists of double-precision (64-bit) values. If this application is to be optimized using SIMD registers that are only 64 bits wide, the developer has to compromise the accuracy of the model, as only data of 32-bit or smaller width can then be packed into the registers.

Irrespective of the target models for optimization, the following conclusions can be drawn from the results obtained in the thesis. Optimization using auto-vectorization does not provide satisfactory results, as the compiler fails to recognize most of the vectorizable blocks of code when dealing with complex models. Hence, an open-source toolchain like GCC is not recommended if one relies solely on the auto-vectorization method. Although it only offers a slight performance gain, auto-vectorization is nevertheless advantageous, as the compiler performs the optimization and the developer does not have to modify the code. Compared to the other methods, Neon-enabled libraries such as Ne10 do provide satisfactory optimization. Although these libraries are straightforward to use, they have their own flaws.
For instance, the Ne10 library used 64-bit registers to load and store the data even when 128-bit registers were available, hampering the performance gain. On the other hand, considering the performance boost obtained, the Arm Neon intrinsics method is recommended over the other methods compared in the thesis. In contrast to the Ne10 library, the Arm Neon intrinsics require the developer to manually pack the data into the SIMD registers for computation. This gives the user more flexibility to utilize the SIMD registers efficiently and to approach the theoretical performance gains.

The results show that the Arm Neon approach works better for the AI model than for the MPC model; the reason is discussed further when answering the primary research question at the end of this chapter. As for the findings regarding the memory footprint benchmarking, all methods except the Ne10 library resulted in almost the same binary size. However, as the Ne10 library had to be statically linked, the generated binary was found to be at least 10 times larger than for the other methods. One significant finding was that there is no relationship between memory footprint and execution time.

Regarding the performance of the selected AI model on Arm Neon and on the GPU, although the inference time was found to be lower with Arm Neon, additional research is required to draw a firm conclusion, as previously stated in this chapter. However, it was clear from the results (see section 5.3) that, for small models such as the selected AI model, which are not very resource intensive, the memory copy operations between the CPU and the GPU would bottleneck the execution time.

Hence, to conclude with the results obtained, the research question stated in chapter 1 is answered:

How well can both AI and MPC models be implemented on an Arm processor using Neon technology?

Arm Neon is best suited for applications whose core operations can be vectorized. This is evident from the results obtained in the thesis (chapter 5). As per the results, the Arm Neon intrinsics method decreases the inference time of the AI model by 62% compared to the unoptimized C implementation, whereas only a marginal performance improvement was observed when the MPC model was optimized using Arm Neon. The significant difference between the two models is the amount of code that could be vectorized in each of them. Since the chosen AI model is a DNN and its primary operations are multiplications and accumulations, the majority of the code could be optimized through vectorization. In contrast, the MPC model optimization targets consisted of small blocks of code, and these blocks were not entirely vectorizable, since a few operations required computations to be performed on the CPU. The movement of data between Arm Neon and the CPU proved costly [32], and therefore the simultaneous use of the CPU and Arm Neon for computation depleted the performance gain. Hence, Arm Neon is recommended when large blocks of code in the target application are purely vectorizable, and the selection of the application for SIMD optimization is therefore crucial for exploiting the performance gain in terms of execution time.

7 Future Work

Based on the results and discussion presented, the following aspects can be addressed to carry the work in this project forward:
• The results in section 5.3 are not sufficient to determine whether Arm Neon is superior to the GPU for inference.
Hence, further investigation has to be made, either by performing performance tuning or by experimenting with different implementation methods, before a conclusion can be drawn.
• In this thesis, the compiler used is GCC, which is an open-source compiler. Instead, the Arm Compiler 6, which is built specifically for Arm processors, could be used for further investigation.
• In this thesis, only the evaluation functions were optimized in the NMPC model. It would be of interest to carry out all possible optimizations in a further investigation, for instance optimizing the IPOPT Library, as it contributed the most to the execution time.
• It would be interesting to extend the study to a wider family of Arm cores, especially the Cortex-A77 and Cortex-A78 processors that implement the Armv8.2-A ISA. Since the Armv8.2-A ISA and later support half-precision floating-point arithmetic on Neon, it would be interesting to understand the resulting accuracy and performance trade-off.
• Further, benchmarking in a microcontroller environment, or at least on bare metal, should be prioritized over benchmarking on top of an Operating System (OS), since an OS introduces undesired noise when measuring execution time.

Bibliography

[1] S. Grigorescu, B. Trasnea, T. Cocias, and G. Macesanu, "A survey of deep learning techniques for autonomous driving," Journal of Field Robotics, vol. 37, no. 3, pp. 362–386, 2020.
[2] C. K. Law, D. Dalal, and S. Shearrow, "Robust model predictive control for autonomous vehicles/self driving cars," arXiv preprint arXiv:1805.08551, 2018.
[3] G. Plastiras, M. Terzi, C. Kyrkou, and T. Theocharides, "Edge intelligence: Challenges and opportunities of near-sensor machine learning applications," in 2018 IEEE 29th International Conference on Application-specific Systems, Architectures and Processors (ASAP). IEEE, 2018, pp. 1–7.
[4] M. Roopaei, P. Rad, and M. Jamshidi, "Deep learning control for complex and large scale cloud systems," Intelligent Automation & Soft Computing, vol. 23, no. 3, pp. 389–391, 2017.
[5] P. San Juan, A. Castelló, M. F. Dolz, P. Alonso-Jordá, and E. S. Quintana-Ortí, "High performance and portable convolution operators for multicore processors," in 2020 IEEE 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD). IEEE, 2020, pp. 91–98.
[6] Y.-L. Lee, P.-K. Tsung, and M. Wu, "Techology trend of edge AI," in 2018 International Symposium on VLSI Design, Automation and Test (VLSI-DAT). IEEE, 2018, pp. 1–2.
[7] M. Sung et al., "SIMD parallel processing," Architectures Anonymous, vol. 6, p. 11, 2000.
[8] D. Seal, ARM Architecture Reference Manual. Pearson Education, 2001.
[9] C. Lomont, "Introduction to Intel advanced vector extensions," Intel white paper, vol. 23, 2011.
[10] S. Oberman, G. Favor, and F. Weber, "AMD 3DNow! technology: Architecture and implementations," IEEE Micro, vol. 19, no. 2, pp. 37–48, 1999.
[11] S.-J. Lee, S.-S. Park, and K.-S. Chung, "Efficient SIMD implementation for accelerating convolutional neural network," in Proceedings of the 4th International Conference on Communication and Information Processing, 2018, pp. 174–179.
[12] S. Yagneswar, "Performance optimization of signal processing algorithms for SIMD architectures," Master's thesis, KTH Royal Institute of Technology, 2017.
[13] K. Karras, E. Pallis, G. Mastorakis, Y. Nikoloudakis, J. M. Batalla, C. X. Mavromoustakis, and E.
Markakis, "A hardware acceleration platform for AI-based inference at the edge," Circuits, Systems, and Signal Processing, vol. 39, no. 2, pp. 1059–1070, 2020.
[14] G. Mitra, B. Johnston, A. P. Rendell, E. McCreath, and J. Zhou, "Use of SIMD vector operations to accelerate application code performance on low-powered ARM and Intel platforms," in 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum. IEEE, 2013, pp. 1107–1116.
[15] S. Liu, J. Tang, Z. Zhang, and J.-L. Gaudiot, "Computer architectures for autonomous driving," Computer, vol. 50, no. 8, pp. 18–25, 2017.
[16] D. Hultgren and T. Husmark, "Three-mode hybrid powertrain optimal control to track offline optimized references," Master's thesis, Chalmers University of Technology, 2020.
[17] Perf: Linux profiling with performance counters. [Online]. Available: https://perf.wiki.kernel.org/index.php/Main_Page
[18] C++ chronos library. [Online]. Available: https://en.cppreference.com/w/cpp/chrono
[19] M. J. Flynn, "Very high-speed computing systems," Proceedings of the IEEE, vol. 54, no. 12, pp. 1901–1909, 1966.
[20] Y. Ngoko and D. Trystram, "Revisiting Flynn's classification: The portfolio approach," in Euro-Par 2017: Parallel Processing Workshops, D. B. Heras, L. Bougé, G. Mencagli, E. Jeannot, R. Sakellariou, R. M. Badia, J. G. Barbosa, L. Ricci, S. L. Scott, S. Lankes, and J. Weidendorfer, Eds. Cham: Springer International Publishing, 2018, pp. 227–239.
[21] A. F. Hernández, "Yet another survey on SIMD instructions," 2013.
[22] Project Ne10. [Online]. Available: https://projectne10.github.io/Ne10/
[23] Ne10. [Online]. Available: http://web.archive.org/web/20160706043552/http://projectne10.github.io/Ne10/
[24] A. Wächter and L. T. Biegler, "On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming," Mathematical Programming, vol. 106, no. 1, pp. 25–57, 2006.
[25] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, vol. 25, 2012.
[26] C. Vidal, P. Kollmeyer, M. Naguib, P. Malysz, O. Gross, and A. Emadi, "Robust xEV battery state-of-charge estimator design using a feedforward deep neural network," SAE International Journal of Advances and Current Practices in Mobility, vol. 2, no. 2020-01-1181, pp. 2872–2880, 2020.
[27] Clipped Rectified Linear Unit (ReLU) layer. [Online]. Available: https://www.mathworks.com/help/deeplearning/ref/nnet.cnn.layer.clippedrelulayer.html
[28] Arm GNU Toolchain. [Online]. Available: https://developer.arm.com/tools-and-software/open-source-software/developer-tools/gnu-toolchain/gnu-a/downloads
[29] SIMULINK Coder. [Online]. Available: https://www.mathworks.com/products/simulink-coder.html
[30] CMake. [Online]. Available: https://cmake.org/
[31] GCC - GNU project C and C++ compiler. [Online]. Available: https://linux.die.net/man/1/gcc
[32] Latency issue. [Online].
Available: https://developer.arm.com/documentation/ddi0344/k/ch16s05s02

A Appendix 1

A.1 Build System: CMake

In the thesis, a CMake-based build system is set up for both the NMPC Model and the SOC AI Model. CMake is an open-source build system that makes it easier to manage the build process of a project. Essentially, CMake builds the project by reading the set of instructions in a text file named CMakeLists.txt inside the project. There can be more than one CMakeLists.txt, as the developer has complete control over how many of these files to create and where to place them in the project. Further, by defining the set of instructions in these files, the developer can configure how the project is built and compiled.

Figure A.1: Folder structure for the SoC AI Model (the top-level directory 2RPi_Implementation contains a CMakeLists.txt together with the Build and Out directories and the source directories source_av, source_base, source_intrinsics and source_ne10; the Out directory holds the sub-directories av, base, intrinsics and ne10, each with its own CMakeLists.txt).

The diagram in figure A.1 illustrates the folder structure used when working on the SoC AI Model and shows the locations of the CMakeLists.txt files used by the CMake tool to build and generate the output. From figure A.1 it can be observed that one CMakeLists.txt file is placed in the top-most directory, 2RPi_Implementation, and four are placed inside the Out directory, which contains a separate sub-directory for each of the implementations: base, av, ne10, and intrinsics. The build process is mostly defined in the CMakeLists.txt files located in the Out sub-directories, while the top-most CMake file only specifies the locations of these four CMake text files. The instructions in these four files tell the CMake tool which compiler to use, which files to compile, where the source code and libraries are located, where to store the generated output, and so on. Since it is open source and straightforward, CMake was chosen to set up the build environment.

A.2 Measuring Execution Time

A.2.1 Chronos Library

As already discussed in section 2.5, the Chronos library is employed and investigated in this thesis project. Out of the three clocks available in Chronos for measurement, the high_resolution_clock was chosen since high accuracy was needed. In order to measure the execution time using Chronos, the source code needs to be modified: as can be observed from the example provided in listing A.1, the current time must be fetched before and after the relevant code section whose execution time is to be measured.

auto begin = std::chrono::high_resolution_clock::now();
/* Code Segment */
auto stop = std::chrono::high_resolution_clock::now();
auto elapsed = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - begin);
auto elapsed_nano = elapsed.count() * 1e-9;

Listing A.1: Execution time measurement by using Chronos
A.2.2 Perf Tool

Perf is a profiler tool for Linux-based systems that provides a command-line interface to the user and abstracts away hardware differences in Linux performance measurements. In the thesis, Perf is predominantly used, wherever possible, for the execution time measurements, as it makes use of hardware counters. Running perf list lists all the measurable events on a particular CPU. The -e argument specifies the desired events to be measured, and for repeated measurements the -r argument can be used, followed by the desired number of repetitions. An example of employing the Perf tool for a timing measurement is provided in listing A.2. The example measures the number of cycles consumed and the number of instructions executed when the executable is run. Further, 10 repeated measurements are taken before the results are reported together with the mean and standard deviation.

perf stat -e cycles,instructions -r 10 <executable>

Listing A.2: Execution time measurement with the Perf tool

A.3 Profiling Tools

A.3.1 Perf Profiling

Profiling with Perf is straightforward. The perf record command records the profiling information at the function or API level, and the perf report command outputs the recorded profiling information in the terminal.

perf record <executable>
perf report

Listing A.3: Profiling with the Perf tool

A.3.2 Nvprof Profiling

Nvprof, similar to Perf, is a profiling tool available in Linux for collecting profiling data, such as kernel execution and memory transfers, related to the CUDA activities happening on both the GPU and the CPU. It is included in the CUDA toolkit and is employed for profiling the GPU implementation. The command format for profiling with Nvprof is shown in listing A.4.

nvprof <executable>

Listing A.4: Profiling with the nvprof tool

A.4 Memory footprint measurement

A.4.1 Valgrind: Massif tool

Valgrind, originally built as a memory debugging tool for Linux, has developed into an instrumentation framework for building dynamic analysis tools. In the thesis, the Massif tool, a heap profiler included in Valgrind, is used to measure the run-time memory consumption of the different implementations during their execution.

The command-line format for profiling an executable with the Massif tool is provided in listing A.5. After running the command, a file named massif.out.xxxx is generated, where the ending 'xxxx' represents a random number. The generated file can then be parsed or converted into a readable text file (for instance with the ms_print tool shipped with Valgrind) containing all the memory profiling information. The profiling result is represented as a graph and is also detailed in a table, as seen in the reference snapshot provided in figure A.2.

valgrind --tool=massif --stacks=yes <executable>

Listing A.5: Profiling with the Massif tool
Figure A.2: Profiling result from the Massif tool.