Hardware Acceleration of Machine Learning
Evaluation and comparison of different hardware-aware optimization techniques
Master's thesis in Computer Science and Engineering

Fangzhou Chen
William Sköld

Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
Gothenburg, Sweden 2023

© Fangzhou Chen, William Sköld, 2023.

Supervisor: Pedro Petersen Moura Trancoso, Department of Computer Science and Engineering
Advisor: Evangelos Siminos, Volvo Group (SML)
Examiner: Pedro Petersen Moura Trancoso, Department of Computer Science and Engineering

Master's Thesis 2023
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: A running megatron, generated by DALLE2.
Typeset in LaTeX
Gothenburg, Sweden 2023

Abstract

The Transformer architecture has been widely used in various fields, as demonstrated by GPT-3, a large language model that shows impressive performance. However, achieving such excellent performance requires high computational capabilities. Therefore, improving the computational power of current machine learning systems is of great importance. This thesis aims to optimize and accelerate fine-tuning of Transformer-based models while taking into account several evaluation criteria, such as training time, energy consumption, cost, and hardware utilization. Additionally, a comparison is made between GPU training settings and specialized AI accelerators, such as TPU training settings.

In our study, a high-performance kernel for the Adan optimizer is introduced, and the LightSeq library is applied to accelerate the existing Transformer components. We also introduce mixed precision training into our workflow and compare all these optimization techniques step by step against the baseline performance. In addition, our analysis includes distributed training with multiple GPUs, and a backpropagation time estimation algorithm is introduced. Next, Google's TPU accelerator is used to run our task, and its performance is compared to a similar GPU setup used in our study.

Finally, the advantages and disadvantages of the different methods are systematically analyzed, while training on V100, A100, A10 and T4 GPUs with different configurations. Meanwhile, the workflow on GPUs and TPUs is analyzed, illustrating the pros and cons of the different accelerators. Various weights for scoring the optimization methods based on time, energy consumption, cost, and hardware utilization are proposed. Our analysis shows that optimal scores in all metrics can be achieved by implementing the optimized LightSeq model, kernel fusion for the Adan optimizer, and enabling mixed precision training.
While training with a TPU offers certain advantages, such as large batch sizes when loading training data, the ease of use, reliability, and software stability of GPU training surpass those of TPU training.

Keywords: Transformer, GPU, Distributed, Energy consumption, Fine-tuning, TPU.

Acknowledgements

We would like to express our sincere gratitude to our supervisors, Evangelos and Pedro, for their guidance, expertise and unwavering support throughout this thesis. Their mentorship, advice and constructive feedback have been instrumental in shaping the direction and results of this thesis. We would also like to thank Thomas Nordenskjöld at Volvo for his support and help with all the small things related to collaborating with Volvo for this thesis. Lastly, we would like to extend our thanks to Volvo Group SML for providing us with the necessary tools for this thesis, such as Microsoft Azure for benchmarking, expertise from employees, and a workspace.

William and Fangzhou, Gothenburg, 2023-06-13

Contents

List of Figures
List of Tables
1 Introduction
1.1 Machine Learning Hardware
1.2 Machine Learning Framework
1.2.1 Machine Learning Libraries
1.3 Transformer usage at Volvo
1.4 Problem statement
1.5 Related Work
1.6 Aim and Objectives
1.7 Delimitations
1.8 Thesis Outline
2 Theory
2.1 Transformer
2.1.1 Self Attention
2.1.2 Multi-headed Attention
2.1.3 Transformer Architecture
2.1.4 BERT
2.1.5 Pre-train and Fine-tuning
2.2 Deep Learning Optimizer
2.2.1 Stochastic Gradient Descent
2.2.2 Momentum
2.2.3 Adam
2.2.4 Adan
2.3 Hardware Architecture
2.3.1 GPU
2.3.2 TPU
2.4 Mixed Precision Training
2.5 LightSeq Library
2.6 Distributed Training
2.6.1 Topology and Communication
2.6.2 Data Parallel
2.6.3 Workflow
2.6.4 Collective Communication
3 Methods
3.1 Fused Adan
3.1.1 Profiling
3.1.2 Kernel fusion
3.1.3 Memory Hierarchy
3.1.4 Vector Instruction
3.1.5 Instruction Level Parallelism
3.2 Multi Tensor Access
3.2.1 Profiling
3.2.2 Multi Tensor Apply
3.3 Mixed Precision Training
3.3.1 Enable Mixed Precision Training
3.4 LightSeq Training
3.5 Distributed Training
3.5.1 Enabling Data Parallel
3.5.2 Backpropagation Time Estimation
3.6 Training on a TPU
3.6.1 Training Frameworks
3.7 Systematic Benchmark
3.7.1 Hardware examined
4 Results
4.1 Experimental Setup
4.1.1 Models
4.1.2 Platform
4.1.3 Dataset
4.1.4 Training Settings
4.1.5 Metrics
4.2 Fused Adan
4.2.1 Benchmark on Training process
4.2.2 Benchmark on Optimization process
4.2.3 Benchmark on Adan fused kernel
4.3 Mixed Precision Training
4.4 LightSeq Training
4.5 Distributed Training
4.5.1 Data Parallel
4.5.2 Time Consumption in One Training Step
4.5.3 Backpropagation Time Estimation and Measurement
4.6 TPU-based training
4.7 Overall Evaluation
4.7.1 TPU-GPU evaluation
5 Discussion
5.1 Comparison between different optimizations
5.1.1 Recommendation
5.2 Volvo Dataset
5.2.1 Benchmark results
5.2.2 Switch to New Datasets
5.3 Kernel Fusion
5.4 Software and Hardware Improvements for ML
5.4.1 Software Improvements
5.4.2 Different Hardware
5.4.3 Mixed Precision on GPUs
5.5 Randomness in Test Accuracy
5.6 TPU
5.6.1 Framework
5.6.2 TPU-GPU
5.6.3 Unfair Batch Sizes
6 Future work
6.1 Gradient Compression
6.2 Dcompute and Dcomm Estimation
6.3 Performance Improvement on Fused Adan
6.4 System Level Energy Measurement
6.5 TPU energy measurements
7 Conclusion
Bibliography
A Appendix 1
A.1 Fused Adan
A.2 Benchmark Framework
A.3 Overall Evaluation Original Data
A.4 LightSeq-Hugging Face Guide

List of Figures

2.1 Example of correlation between query vector and key vector.
2.2 Multi Attention with three attention heads; vector a1 is the word embedding and b1 head_i is the attention vector generated by the different attention heads; these vectors are transformed by the linear layer.
2.3 Transformer Architecture; on the left is the stack of encoders, on the right is the stack of decoders.
2.4 BERTbase for sequence classification.
2.5 GPU architecture example.
2.6 Principle of systolic array.
2.7 MXU architecture.
2.8 Mixed precision training weight update.
2.9 BF16, FP16 and FP32 formats. Different bit ranges for the above three floating-point representations.
2.10 Distributed training topology; on the left is the parameter server, on the right is the ring organization.
2.11 Data Parallel.
2.12 Broadcast communication.
2.13 Allreduce communication.
3.1 Unfused Adan kernel launch time graph with single tensor access.
3.2 Multi Tensor Access, with a chunk size of 3, meaning that the tensors (red, yellow and green) are split into chunks of 3 for more efficient memory accesses.
3.3 1-Step benchmark displaying the GPU's SM and memory throughput when using Single Tensor Access.
3.4 Profiling showing the most time consuming GPU kernels during one training step when using the Hugging Face model without mixed precision.
3.5 Backpropagation workflow in distributed training.
4.1 Benchmark 1, Loss and accuracy for unfused and fused Adan.
4.2 Benchmark 2, Fused Adan kernel with single tensor access launch time graph, showing that kernel launches are very loosely distributed over the timeline.
4.3 Benchmark 2, Fused Adan kernel with multi tensor access launch time graph, showing that kernel launches are compactly distributed over the timeline.
4.4 Benchmark 3, GPU pipeline utilization.
4.5 SM and memory throughput of the GPU when using Multi Tensor Access in the one-step benchmark.
4.6 Benchmark 4, Mixed precision training loss and accuracy.
4.7 Percentages of the top 10 GPU kernels with the highest time consumption during one training step using the Hugging Face model with mixed precision.
4.8 Hardware utilization when training the Hugging Face model using mixed precision in the one-step benchmark.
4.9 Hardware utilization in the one-step benchmark when training the LightSeq model using mixed precision.
4.10 Benchmark 5, FP16 training loss and accuracy.
4.11 Benchmark 6, Data Parallel training loss and accuracy.
4.12 Four plots (A, B, C and D) showing how the different hardware compares with the optimization techniques previously shown. Each point is one configuration. Because TPU energy consumption is unavailable in Google Cloud, the TPU is only included in plots that do not plot energy on any axis.
5.1 LightSeq, HuggingFace on V100 and HuggingFace on A100 with batch sizes of 16 and 32.
5.2 Best training time and cost on different devices. Training configuration: LightSeq Encoder (GPUs), Fused Adan (GPUs) and AdamW (TPU), mixed precision training. Batch size 32 on A10, A100 and V100; batch size 16 on T4; batch size 512 on TPUv2 and TPUv3.
5.3 Training time and energy when enabling and disabling mixed precision on GPUs.

List of Tables

3.1 Typical kernel size with their SM/Memory throughput.
3.2 Average execution time of sgemm kernels with different sizes when not using mixed precision.
3.3 Average execution time of common kernels within the Hugging Face model.
3.4 Notation used for backpropagation time estimation.
3.5 Parameters used for the different hardware in the systematic benchmark.
3.6 Hardware specifications of the selected hardware. Important to note is that Clock Frequency, when referring to GPUs, is the CUDA core clock frequency.
4.1 Benchmark settings; f_m means fused multi tensor, uf_s means unfused single tensor, etc.
4.2 Benchmark 1, Comparison between different optimizers.
4.3 Benchmark 2, One-step benchmark time results.
4.4 Benchmark 2, Different Adan kernel one-step benchmark results.
4.5 Active time ratio for different Adan kernel implementations.
4.6 Benchmark 3, Typical kernel size and throughput for different Adan kernel implementations.
4.7 Benchmark 4, Mixed precision training result.
4.8 Benchmark 5, Training using Hugging Face and LightSeq.
4.9 Benchmark 6, Data parallel training result.
4.10 One-step time consumption per GPU.
4.11 Layer time measurement with 2 GPUs.
4.12 Layer time measurement with 4 GPUs.
4.13 Benchmark 7, Baseline and mixed precision performance for the Python XLA and TensorFlow implementations. Fine-tuning for 3 epochs with a batch size of 16.
4.14 Benchmark 7, Mixed precision performance with different batch sizes for the two implementations. Out-of-memory occurred for the XLA implementation at a batch size of 32.
4.15 Parameters for the systematic benchmark.
4.16 Results based on fine-tuning the BERT model using the IMDb dataset. The first section shows the baseline performance for the different architectures with a batch size of 16 (16x8 for TPUv2/v3). The second section shows the mixed precision performance of the different hardware with a batch size of 16 (16x8 for TPUv2/v3). The third section features the most optimized implementation for each type of hardware.
5.1 Original data from the different optimizations.
5.2 Normalized data from the different optimizations.
5.3 Weight table.
5.4 Score table; the lower the better.
5.5 Performance on the Volvo dataset.

1 Introduction

Machine learning (ML) has become increasingly popular in recent years as a powerful tool for solving complex tasks in natural language processing (NLP), computer vision, automation, and other fields. The emergence of impressive ML technologies such as ChatGPT [1] and DALLE [2], which showcased remarkable capabilities in early 2023, is due to developments in deep learning, a branch of ML. These advancements are based on the Transformer architecture proposed by Google [3].

The Transformer-based BERT model [4] is extensively utilized in the field of NLP, and efficiency gains can be observed across a wide range of other industries and companies. As outlined in a blog post by Google, BERT is used to enhance the interpretation of user queries [5] and improve the user experience. Examples such as these show that even for domain-specific tasks, using Transformer-based models can improve efficiency and performance. Some AI-based translation tools, such as DeepL, also use Transformer-based models. By customizing the network architecture to a certain extent, translation accuracy and speed between different languages are greatly improved, while maintaining low usage costs.

The success of the Transformer architecture can be attributed to its ability to address the data dependency problem that arises when training the recurrent neural networks (RNNs) commonly used in traditional NLP, which accelerates the Transformer-based model's training through parallelization [6].
In the training of an RNN, the input of the next iteration depends on the output of the previous round, making it difficult to feed multiple tokens of a sequence into the network simultaneously for parallel training. The Transformer eliminates this dependency by replacing the recurrent association between tokens with the self-attention mechanism.

However, the Transformer architecture is not without its flaws, one of which is its lower computational efficiency. Specifically, if the input token sequence has a length of $n$ and each token embedding has dimension $d$, the overall complexity per layer of the self-attention mechanism is $O(n^2 d)$. In comparison, the complexity per layer of a CNN in the same case is $O(k n d^2)$ (where $k$ is the number of convolution kernels), and that of an RNN is $O(n d^2)$. This shows that the complexity of the Transformer increases quadratically as the sequence length grows linearly. Therefore, it is essential to improve the compute capabilities for the Transformer architecture from both the hardware side and the software side, especially when dealing with longer sequences.

1.1 Machine Learning Hardware

There are several companies offering various hardware prototypes that can accelerate the training of machine learning models. NVIDIA's GPUs are the most commonly used for accelerating CUDA-based applications like machine learning, thanks to their high performance and built-in CUDA cores. Other companies like Intel and AMD also offer their own selection of GPUs suitable for accelerating deep learning. For example, Intel's latest ARC series GPUs are expected to perform similarly to NVIDIA's RTX series. Intel has implemented an abstraction of its latest GPU offering through its XPU heterogeneous computing platform, enabling it to run on mainstream machine learning frameworks like PyTorch. Similarly, AMD offers the Radeon family of GPUs suitable for ML using its open-source software called ROCm.

In addition to GPUs, other types of hardware can accelerate machine learning, including FPGAs or specialized chips designed specifically for ML. For example, the Cerebras Wafer-Scale Engine (WSE) chip is a giant AI accelerator that utilizes a single wafer. It allows for reduced communication latency between cores and more shared on-chip memory, resulting in highly accelerated training and ultra-low latency inference [7]. Another example is Google's Tensor Processing Unit (TPU), which will be covered in more detail later on.

1.2 Machine Learning Framework

Data scientists usually use a comprehensive deep learning framework to aid them in creating deep learning models. The most common frameworks in academia and industry are PyTorch [8] and TensorFlow [9]. The primary component of these frameworks is a data structure known as a Tensor. Torch provided exceptional support for GPUs, but it only offered an upper-level interface in the Lua programming language, which is not widely used. As a result, Facebook's development team reimplemented the upper-level interface in Python to provide full Python support, and the result is now named PyTorch. PyTorch has gained popularity because of the widespread use of Python in the community.

TensorFlow is a deep learning framework introduced by Google in 2015. It is comparable to PyTorch since operations are performed on tensors. However, unlike PyTorch, TensorFlow is a static graph-based framework, whereas PyTorch is based on dynamic graphs. Both dynamic and static graphs have their advantages and disadvantages. Dynamic graphs are simple to use for model development and error detection; nonetheless, they often have lower performance and cannot meet specific needs. Static graphs are quicker since they allow a series of optimizations to be made before execution. However, the optimized graphs lead to errors that are harder to detect at runtime, making debugging and error solving more difficult. The difference is illustrated by the sketch below.
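As a minimal illustration (not taken from the thesis code), the following sketch contrasts the two execution models: PyTorch executes operations eagerly as ordinary Python, while TensorFlow's tf.function traces a Python function into a graph that can be optimized before execution. The toy tensors and functions are hypothetical.

```python
import torch
import tensorflow as tf

# Dynamic graph (PyTorch): operations run eagerly, so ordinary Python control
# flow, print statements and debuggers can inspect intermediate results.
def torch_step(x):
    y = x * 2
    if y.sum() > 0:        # data-dependent branch resolved immediately
        y = y + 1
    return y

print(torch_step(torch.tensor([1.0, -3.0])))

# Static graph (TensorFlow): tf.function traces the computation once into a
# graph that can be optimized (e.g. fused or pruned) before execution, but
# runtime errors surface inside the compiled graph and are harder to debug.
@tf.function
def tf_step(x):
    return tf.cond(tf.reduce_sum(x) > 0, lambda: x * 2 + 1, lambda: x * 2)

print(tf_step(tf.constant([1.0, -3.0])))
```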
1.2.1 Machine Learning Libraries

In addition to hardware, high-performance software libraries are also important in achieving maximum computational performance. Different hardware requires different libraries to fully exploit its potential performance. For example, NVIDIA provides the cuDNN high-performance machine learning software library, while Intel offers the oneDNN machine learning library. AMD provides the ROCm library for their hardware to maximize performance in machine learning tasks. These libraries use a variety of hardware optimization strategies to address performance bottlenecks that compilers cannot optimize. For instance, the double buffering technique proposed by Tian et al. [10] takes advantage of the fast warp switching feature of NVIDIA GPUs. This technique hides the memory access latency of one warp by prefetching data for another warp, allowing the scheduler to quickly switch the other warp onto the idle stream processor when a memory access occurs on the first warp. However, the increasing diversity of hardware in recent years has presented a significant challenge in the development of software libraries. While specialized hardware designs can significantly improve performance in a given domain, non-specialists may struggle to accelerate their tasks by writing high-performance kernels independently, as these efforts are usually limited to hardware vendors.

1.3 Transformer usage at Volvo

In international companies like Volvo Group, many teams are trying to use artificial intelligence technology to expedite their business and reduce costs. The implementation of Transformer-based neural network architectures is being pursued by many Volvo data scientists to build their own large language models and train customized models on the internal corpus. This has sparked Volvo's interest in speeding up the training of Transformer models, which would allow AI technology to be used in Volvo's various business activities by reducing the cost of training large language models.

At Volvo SML, BERT-based models are used in various tasks, such as predicting traffic codes by generating the corresponding codes based on parts' descriptions. This reduces the need for human labor and allows human effort to be focused on more critical steps. Our goal is to optimize the Transformer architecture to reduce the hardware usage and time costs associated with the model development workflow.

1.4 Problem statement

The push for training large models among academic researchers and industry engineers has increased, and pre-trained models based on the Transformer architecture, such as BERT, are already prevalent in NLP-based tasks. However, they still require fine-tuning on domain-specific datasets before they can effectively be used in research or products. The fine-tuning process incurs not only an initial cost but also a recurring cost in terms of money, time and energy, due to factors such as wanting to improve the model in the future with more data, or simply model degradation, in which the model's performance worsens with time.
This thesis therefore aims to investigate the effectiveness of various hardware-aware optimization techniques, such as kernel fusion, mixed precision, acceleration libraries and distributed training techniques, when fine-tuning a pre-trained BERT model. The results will address the following research questions:

1. How can the fine-tuning process of the BERT model benefit from hardware-aware optimization techniques in terms of time, energy consumption and monetary cost?
2. What are the benefits and drawbacks of using GPU-based training methods as opposed to a TPU-based training method?

1.5 Related Work

Sparsh et al. [11] present a survey focusing on architecture and system-level optimizations on GPUs for DL applications, reviewing techniques for inference and training as well as for using either a single GPU or a distributed system. They highlight that over the past nine years, GPU compute power has increased 32x whilst memory bandwidth has only increased 13x, while DNN memory requirements grow at a rapid pace. They thus conclude that substantial improvements in memory capacity and bandwidth are required to keep GPUs the main platform for DNNs. Most papers focused only on a single technique for increasing performance; future work should therefore use multiple techniques in conjunction and evaluate them.

Furthermore, they state that a critical challenge in GPU use is its large power consumption, and that nearly all surveyed papers ignored the energy efficiency assessment of their solutions, which future work should evaluate, along with newer GPU models and their low/mixed-precision techniques [12] for achieving higher efficiency. Moreover, they say it will be interesting to see how next-generation GPUs compare to custom-made AI accelerators such as TPUs. Finally, they focus on the lack of security when using GPUs, arguing that security needs to be a design parameter when designing GPUs and CNNs, not retrofitted.

Shi et al. [13] present a comparative study of SOTA GPU-accelerated DL software tools (Caffe, TensorFlow, Torch, etc.), benchmarking tool performance using three popular neural networks on two CPU and three GPU platforms. They also benchmark on a distributed system with multiple GPUs. Two contributions are made, one for end-users and one for software developers: first, their results can be used by end-users as a guide for selecting hardware and software; secondly, they show software developers of DL tools possible future directions for performance optimization. They conclude by showing that the tools tested make good use of GPUs and achieve significant speed-ups over their CPU counterparts. However, no single software tool evaluated can consistently perform better than the others, so there should be opportunities for optimization in this area. Finally, they state two directions for future work: one is benchmarking DL software tools and hardware platforms, and the other is evaluating tool scalability on a high-performance GPU cluster.

Sze et al. [14] discuss the challenges of energy consumption, throughput, and data movement and how to address them at various levels of hardware design. They state that data movement is the cause of high energy consumption, and that recent research focuses on reducing memory movement while maintaining accuracy, throughput, and cost. They therefore advocate selecting architectures with efficient memory hierarchies and data flows that increase data reuse.
They also mention that joint design between algorithms and hardware has been a driving force in reducing data movement requirements through reduced bit-width precision, compression, and increased sparsity. They conclude that the interactions between different levels of hardware design should be considered; one example is reducing the cost of memory access, which could lead to more energy-efficient data flows.

Wang et al. [15] conduct a study on the performance and energy efficiency of multiple hardware types (Intel CPU, NVIDIA GPU, AMD GPU, Google TPU) on a suite of representative DL workloads. They investigate the impact on performance and energy consumption of many factors, such as the hardware, the software library for the hardware, and the deep learning framework. Their contribution is twofold: providing end users with useful results when selecting hardware for their own projects, and exposing opportunities for hardware manufacturers to improve their software.

1.6 Aim and Objectives

The main aim is to analyze how hardware-aware optimization techniques such as kernel fusion, multi-tensor access, etc. impact the fine-tuning process of the BERT model. Secondarily, we also aim to investigate the performance differences between various hardware within two main architectures, GPUs and TPUs, as well as compare these architectures directly. Based on these aims, the following goals are formulated:

1. Accelerating the Adan optimizer using kernel fusion and multi-tensor access techniques. Different hardware-related optimization techniques will be explored to determine the best implementation. The advantages of the high-performance kernel in terms of computation time and hardware utilization will be evaluated.
2. Accelerating the existing training process (Hugging Face baseline) using techniques such as mixed precision and distributed training, and also using acceleration libraries such as LightSeq. The optimization process will be conducted in stages and compared with the existing workflow, mainly evaluating the training time, energy consumption, and cost.
3. The existing workflow will be migrated to a TPU-based training platform and compared with the GPU-based training platform, both in terms of baseline performance and optimized performance. The advantages and disadvantages of both methods will be analyzed.
4. Lastly, performing a systematic benchmark / grid search to evaluate a plethora of hardware, and how the hardware is impacted by the optimization techniques used in this thesis.

1.7 Delimitations

This thesis has one main limitation: access to physical hardware. Because of this, the thesis will be conducted in the cloud only, that is, on either Microsoft Azure or Google Cloud. The advantage of this is access to a wide selection of hardware that can be benchmarked, although at the disadvantage of having to use energy estimations rather than real energy measurements.

However, even though a wide choice of hardware is available in the commercial market, our thesis is limited to the use of the NVIDIA GPUs available in Volvo's subscription on Azure and Google TPUs (v2 and v3). The TPUv4 would perhaps be a more fitting candidate for some of the benchmarks, but is unavailable due to monetary reasons (Google Cloud offers TPUv2 and TPUv3 for free via their TPU Research Cloud initiative). Moreover, while pre-training would be an interesting area of research, the costs associated with repeating pre-training using cloud computing are considered too high.
1.8 Thesis Outline

Theory: In this chapter, we provide a brief overview of relevant background literature, including the Transformer and its key algorithms, optimization algorithms, GPGPU and TPU hardware architectures, the LightSeq library, mixed precision training, and distributed training.

Method: This chapter outlines the current problems in the training process and presents different optimization methods for each problem to improve training performance. We also discuss the potential benefits of these optimization methods.

Results: In this chapter, we present the results of our approach and briefly describe them.

Discussion: This chapter synthesizes the results from the previous chapter, summarizes our findings, and provides recommendations. We also delve deeper into the optimization methods and training processes used to provide insights useful for future work.

Future work: In this chapter, we suggest future work based on our findings as well as on the limitations of the thesis.

Conclusion: This chapter summarizes the thesis content and presents our conclusion based on the presented results.

2 Theory

This chapter serves as an introduction to the underlying theory that forms the basis of this thesis. It covers several key topics, including the Transformer architecture, deep learning optimizer theory, GPU and TPU architectures, the LightSeq software library, mixed precision training, and the concept of distributed training. By providing an overview of these theories, the reader will gain a better understanding of the subsequent chapters of the thesis and how these theories relate to the research presented here.

2.1 Transformer

The Transformer is a popular machine learning architecture used for various tasks in natural language processing. These tasks include sequence labeling, sequence classification, natural language generation, word sense disambiguation, etc. The Transformer architecture, with its superior semantic feature extraction and extended sequence processing capabilities, has enabled Transformer-based models to achieve state-of-the-art (SOTA) performance on numerous datasets. ERNIE-Doc-Large [16] achieved SOTA performance on the IMDB dataset text classification task with 97.1% accuracy. On the Penn Treebank dataset [17], GPT-3 (zero-shot) achieved the SOTA test perplexity on language modelling of 20.5 [18]. This thesis focuses on sequence classification tasks using the Transformer-based BERT model. The Transformer architecture introduces a range of novel features and capabilities, which will be illustrated in the following sub-sections.

2.1.1 Self Attention

Self-attention is a technique that allows the aggregation of all positions in an input sequence, weighing each position according to its importance. This approach significantly enhances a model's ability to capture long-range dependencies within a sequence. The Transformer architecture uses this technique to achieve better performance than traditional sequence models, such as recurrent neural networks (RNNs) [19] and long short-term memory (LSTM) [20] networks, when dealing with long sequences. Moreover, the Transformer architecture does not rely on sequential computations over time, as opposed to RNNs and LSTMs. This feature allows efficient parallel computation during training, resulting in faster processing times. This advantage over traditional sequence models can significantly reduce the time required for training.

Figure 2.1: Example of correlation between query vector and key vector.
There are three essential parameters for the attention mechanism: the Query, Key and Value matrices (Q, K, V), as depicted in the following equation,

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V. \tag{2.1}
\]

Notably, the Query, Key, and Value matrices are derived from the word embedding vectors through linear transformations, as illustrated by

\[
\mathrm{Embedding} \times W^{Q} = Q, \qquad
\mathrm{Embedding} \times W^{K} = K, \qquad
\mathrm{Embedding} \times W^{V} = V, \tag{2.2}
\]

where $W^{Q}$, $W^{K}$, $W^{V}$ are trainable parameters. The Q matrix multiplied by the transpose of the K matrix produces the correlation matrix between the Query and Key. A higher correlation between Query and Key results in a larger corresponding value. This procedure is important because it allows the strength of the relationship between Query and Key to be determined in a structured and quantitative manner, allowing the original word embedding to be enhanced.

The correlation computation is shown in Figure 2.1; to simplify the presentation, the original embeddings are used directly. This means that the Query matrix on the left and the Key matrix on the top have the same values, whereas both matrices undergo a linear transformation during the actual computation. The yellow correlation matrix reflects the correlation between each pair of tokens, with higher values indicating a stronger correlation with the current token, and therefore requiring a higher weight. These weights are then embedded as contextual information from the corpus into the original vector. Normalization by the dimension is necessary to obtain reasonable values for the relationships and to maintain stability in the gradients, which is crucial for neural networks. Subsequently, the correlation matrix is multiplied by the Value matrix to obtain the attention-weighted word embedding representation.

The attention mechanism aims to improve the learning process by identifying relevant tokens and strengthening their weights. This is accomplished by comparing associations between sequence contexts and amplifying weights, resulting in an enhanced representation of the original word embeddings in the current context.

2.1.2 Multi-headed Attention

The multi-head attention mechanism utilizes multiple sets of trainable matrices, namely $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$, to project the input data into different subspaces. By initializing these matrices randomly, each head captures distinct semantic information, leading to diverse feature representations. Subsequently, the output features from the different heads are concatenated and transformed by $W^{O}$, which is the linear layer, to obtain the final embedding representation, as seen here:

\[
\begin{aligned}
\mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_h)\, W^{O} \\
\mathrm{head}_i &= \mathrm{Attention}(Q W_i^{Q},\, K W_i^{K},\, V W_i^{V}).
\end{aligned} \tag{2.3}
\]

This approach facilitates the learning of distinct pieces of information in various representation spaces, which enhances the flexibility and efficiency of the model. The design is reminiscent of convolutional neural networks, which employ a large number of convolution kernels to capture features in feature space by randomly initializing the values of the different kernels. Increasing the number of heads in the Transformer model can enhance its ability to capture sequence features. Figure 2.2 shows that the multi-head attention mechanism exhibits a high degree of parallelism, because the attention heads have no interconnections and can therefore be computed in parallel. This property provides the model with increased flexibility and perceptual capabilities, while making efficient use of computational resources without encountering training obstacles.
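To make Equations 2.1 and 2.3 concrete, the following PyTorch sketch implements scaled dot-product attention and a small multi-head module. It is an illustration only; the class and dimension names are our own, and the models benchmarked in this thesis use the Hugging Face and LightSeq implementations rather than this code.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Equation 2.1: softmax(QK^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # correlation between queries and keys
    weights = F.softmax(scores, dim=-1)             # row-wise attention weights
    return weights @ V                              # attention-weighted value vectors

class MultiHeadAttention(torch.nn.Module):
    """Equation 2.3: h parallel heads followed by the output projection W^O."""
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_head = num_heads, d_model // num_heads
        # W^Q, W^K, W^V for all heads fused into single linear layers
        self.w_q = torch.nn.Linear(d_model, d_model)
        self.w_k = torch.nn.Linear(d_model, d_model)
        self.w_v = torch.nn.Linear(d_model, d_model)
        self.w_o = torch.nn.Linear(d_model, d_model)

    def forward(self, x):                           # x: (batch, seq_len, d_model)
        b, n, _ = x.shape
        split = lambda t: t.view(b, n, self.h, self.d_head).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        out = scaled_dot_product_attention(q, k, v) # all heads computed in parallel
        out = out.transpose(1, 2).reshape(b, n, self.h * self.d_head)
        return self.w_o(out)                        # concatenation followed by W^O

x = torch.randn(2, 16, 768)                         # (batch, tokens, embedding dim)
print(MultiHeadAttention(768, 12)(x).shape)         # torch.Size([2, 16, 768])
```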
Figure 2.2: Multi Attention with three attention heads; vector a1 is the word embedding and b1 head_i is the attention vector generated by the different attention heads; these vectors are transformed by the linear layer.

2.1.3 Transformer Architecture

The Transformer architecture uses a conventional Encoder-Decoder structure. In particular, the architecture consists of a stack of N encoders on the left-hand side of Figure 2.3, where the input elements are uncorrelated, allowing parallelization. On the right-hand side, a stack of N decoders is shown, where the inputs rely on the previous output prediction token. This creates a temporal dependence between inputs and prohibits simultaneous output of the prediction results.

The Transformer architecture discards the recurrent loop structure, which makes it impossible to directly capture position information in a sequence. To overcome this limitation, the Transformer uses positional encoding to append positional information to each input vector, allowing the model to better grasp the sequential information in the sequence. The basic architecture is shown in Figure 2.3.

Figure 2.3: Transformer Architecture; on the left is the stack of encoders, on the right is the stack of decoders.

Because of the absence of a loop structure, identical tokens within the input corpus would be assigned identical embeddings. To address this limitation and ensure that the relevant positional information of the tokens in the sequence is captured, Vaswani et al. introduced positional encoding after the embedding stage.

Positional encoding is a technique used to incorporate positional information into the sequence input embeddings. This technique involves adding an offset vector to the input embedding vector. The offset vector is generated by a specific function that produces a unique position encoding at different locations and time steps. The most commonly used functions for generating the position encoding are the sine and cosine functions. These functions are effective in capturing relative position information between different locations in a sequence and representing it as a continuous vector. Positional encoding with sine and cosine is done using the following equations

\[
PE_{(pos,\,2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right), \qquad
PE_{(pos,\,2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right). \tag{2.4}
\]

The Transformer uses the sine function for the even dimensions of the encoding and the cosine function for the odd dimensions.
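A short sketch of Equation 2.4, assuming a PyTorch environment; the function name and shapes are our own and only illustrate how the offset vector is built and added to the token embeddings.

```python
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    """Equation 2.4: sine on even dimensions, cosine on odd dimensions."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    two_i = torch.arange(0, d_model, 2, dtype=torch.float32)        # even dimension indices 2i
    angle = pos / torch.pow(10000.0, two_i / d_model)               # pos / 10000^(2i/d)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

# The encoding is simply added to the token embeddings before the first encoder layer.
embeddings = torch.randn(16, 768)                  # (tokens, d_model), hypothetical input
embeddings = embeddings + sinusoidal_positional_encoding(16, 768)
```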
2.1.4 BERT

BERT stacks the Transformer encoder, with Google offering two variants: BERTbase, featuring a 12-layer encoder, and BERTlarge, containing a 24-layer encoder. These models also differ in the number of attention heads and in the hidden dimension: BERTbase has 12 attention heads and a hidden size of 768, while BERTlarge has 16 attention heads and a hidden size of 1024.

Figure 2.4 shows the architecture of the BERTbase model used for sequence classification. It consists of three layers: a token embedding layer, a Transformer encoder layer, and a task-specific output layer. The token embedding layer is responsible for creating embeddings that contain both sentence and position information for the existing tokens. The Transformer encoder layer, which is built by stacking multiple encoders, extracts semantic information from the input and passes it to the task-specific output layer. The task-specific output layer generates the final output based on the requirements of the downstream task. By selecting different outputs, various downstream tasks can be performed. For a sequence classification task, a dense layer is randomly initialized as a classification head at the output of the encoder.

Figure 2.4: BERTbase for sequence classification.

2.1.5 Pre-train and Fine-tuning

The pre-training procedure for BERT involves two primary components. The first component is called Masked Language Modeling (MLM), which involves masking a segment of the words in the training corpus and training the model to predict the missing words based on the context of the surrounding words. This approach enhances the model's ability to understand the semantics of the masked words by leveraging the context of the words to the left and right of the masked word. This results in a bidirectional understanding of the language model, which allows it to comprehend language more effectively.

The second component of the BERT pre-training procedure is known as the Next Sentence Prediction (NSP) task. The NSP task involves training the model to predict whether sentence B follows sentence A in a given text, with a 50% probability that B is the next sentence. By training the model on this task, BERT is better able to understand the relationship between sentences in a document, which is essential for many natural language processing tasks, such as question answering and sentiment analysis. The combination of MLM and NSP enables BERT to learn rich representations of language that can be fine-tuned for a wide range of downstream tasks.

The term "fine-tuning" refers to the process of optimizing a pre-trained model on a specific dataset and, usually, a downstream task. This technique differs from pre-training, which involves training a model on a large dataset to learn general features that can be applied to various tasks. During this phase, the model architecture undergoes rapid changes. Once the model architecture is fixed, we can accelerate the training process by using a custom kernel. Therefore, determining the model architecture is critical to determining the optimization tools that can be applied.

The primary goal of fine-tuning is to retain the knowledge acquired by the pre-trained model while adapting it to the target dataset and downstream task. To achieve this, several techniques are used to prevent the model from forgetting previously learned features. For example, when fine-tuning on a computer vision task, e.g. the VGG network [21], it may be beneficial to freeze its feature extraction layers and train only the dense layer for classification on a new dataset. This approach reduces computational costs, but it also has the disadvantage of limiting the model's ability to achieve high accuracy.

The learning rate is a critical parameter in the model fine-tuning process, and common techniques include lowering the learning rate, adjusting the optimizer's β, using warm-up, and learning rate decay [22]. These methods aim to strike a balance between model training stability and learning speed during the fine-tuning process. In this thesis, the words "training" and "fine-tuning" are used interchangeably; whenever a section or paragraph refers to pre-training, "pre-training" is stated explicitly.
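As an illustration of fine-tuning for sequence classification, the sketch below loads a pre-trained BERT with a randomly initialized classification head through the Hugging Face transformers API and optionally freezes the encoder. The model name, hyperparameters and toy batch are placeholders, not the exact training setup used later in the thesis.

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

# Pre-trained encoder plus a randomly initialized classification head (cf. Figure 2.4).
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# Optional: freeze the pre-trained feature extractor and fine-tune only the head,
# trading some accuracy for a much cheaper fine-tuning run.
for param in model.bert.parameters():
    param.requires_grad = False

optimizer = torch.optim.AdamW(
    filter(lambda p: p.requires_grad, model.parameters()), lr=2e-5
)

batch = tokenizer(["a great movie", "a terrible movie"],
                  padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

outputs = model(**batch, labels=labels)   # forward pass returns the loss directly
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```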
2.2 Deep Learning Optimizer

A deep learning optimizer is a set of algorithms used to adjust the weights and biases of a deep neural network throughout the training phase, with the aim of reducing the error between the predicted and true values. Essentially, the main task of the optimizer is to determine the set of parameters that will produce the most accurate model predictions on the training data and to update the model parameters according to a selected algorithm.

The optimization process involves iteratively modifying the model parameters using the gradient of the loss function with respect to those parameters. The optimizer determines the direction and magnitude of each parameter update, while the learning rate controls the size of the update. Various optimization algorithms use different techniques to compute parameter updates, which may affect the speed and accuracy of model training.

2.2.1 Stochastic Gradient Descent

Stochastic gradient descent (SGD) [23] is a commonly used and straightforward method for optimizing model parameters in machine learning. Its popularity is mainly due to its intuitiveness and ease of implementation, making it a popular choice among researchers and engineers. Denoting the model parameter set at iteration $t$ by $\theta_t$, the fundamental concept behind stochastic gradient descent is illustrated by the equation

\[
\theta_{t+1} = \theta_t - \eta_t \nabla L(\theta_t), \tag{2.5}
\]

where $\eta_t$ is the learning rate and $\nabla L(\theta_t)$ denotes the gradient of the loss function with respect to the parameter vector $\theta_t$, evaluated for a sample randomly chosen from the training set.

As machine learning progressed, researchers discovered that the choice of learning rate can significantly affect the convergence and dispersion of the model. If the learning rate is set too low, the model is guaranteed to converge, but the learning speed is impractically slow. Conversely, if the learning rate is set too high, the model can learn features from the training data more quickly, but it may struggle to converge to a local optimum. To solve this problem, researchers proposed dynamic learning rate algorithms [24], which are an extension of the stochastic gradient descent algorithm. This approach adjusts the learning rate as a function of the number of iterations, resulting in a schedule of learning rates. During the first iterations, the algorithm induces large changes in the parameters of the model, while subsequent iterations incur smaller and smaller changes.

2.2.2 Momentum

Momentum is an optimization concept in machine learning that helps to accelerate gradient descent in the correct direction, thus speeding up the convergence rate. It is an extension of the gradient descent optimization algorithm that builds historical memory information into the search direction to overcome local minima and the oscillations caused by noisy gradients. Momentum is based on the same concept as momentum in physics, where a ball rolling down a hill builds enough momentum to jump out of a local minimum and reach the global minimum.

The main idea behind momentum is to calculate an exponentially weighted average of the historical gradients and use it to update the weights. By considering the past gradients, the gradient descent step becomes smooth, which reduces the amount of oscillation seen across the iterations.

\[
\begin{aligned}
g_t &\leftarrow \nabla L(\theta_{t-1}) \\
v_t &\leftarrow \gamma v_{t-1} + \eta_t g_t \\
\theta_t &\leftarrow \theta_{t-1} - v_t
\end{aligned} \tag{2.6}
\]

Equation 2.6 shows how the momentum is involved in the gradient update. The term $g_t$ is the gradient computed by comparing the prediction from the current parameters with the actual result, i.e. the descent direction that the sample suggests. The term $v_t$ represents the momentum accumulation: the current gradient is weighted and added to the gradients of the previous steps, so that historical gradient information is introduced to help the system maintain the correct descent direction. Here $\gamma$ represents the weight of the momentum from the previous steps and $\eta_t$ represents the learning rate applied to the current gradient. Finally, the parameters are updated using the $v_t$ term to obtain the new parameters, as sketched in the example below.
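A minimal sketch of the update rules in Equations 2.5 and 2.6 on a toy quadratic loss, assuming PyTorch tensors. The function and hyperparameter values are illustrative; in practice torch.optim.SGD with a momentum argument implements the same idea.

```python
import torch

def sgd_momentum_step(params, grads, velocities, lr=0.01, gamma=0.9):
    """One parameter update following Equations 2.5 and 2.6."""
    for p, g, v in zip(params, grads, velocities):
        v.mul_(gamma).add_(g, alpha=lr)   # v_t = gamma * v_{t-1} + eta * g_t
        p.sub_(v)                         # theta_t = theta_{t-1} - v_t

# Toy objective L(theta) = ||theta||^2 / 2, whose gradient is simply theta.
theta = torch.tensor([1.0, -2.0])
velocity = torch.zeros_like(theta)
for step in range(100):
    grad = theta.clone()                  # gradient of the toy loss
    sgd_momentum_step([theta], [grad], [velocity])
print(theta)                              # close to the optimum at zero
```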
2.2.3 Adam

The Adam optimizer combines the ideas of momentum and adaptive learning rates, which allows it to accelerate the training process of deep learning models and avoid some problems of the SGD algorithm, such as the learning rate being difficult to adjust and the tendency to fall into local optimal solutions.

The concept of an adaptive learning rate originated from the AdaGrad optimization algorithm, introduced by John Duchi et al. in 2011 to address the limitations of the stochastic gradient descent (SGD) and momentum algorithms. AdaGrad enables the adjustment of a different learning rate for each parameter: frequently changing parameters are updated with smaller steps, while sparse parameters are updated with larger steps,

\[
g_t = \nabla L(\theta_t), \qquad
\theta_t = \theta_{t-1} - \alpha \cdot \frac{g_t}{\sqrt{\sum_{i=1}^{t} g_i^2}}. \tag{2.7}
\]

The adaptive learning rate in AdaGrad is determined by the sum of the squares of the past gradients up to the current time step $t$. If many parameters are being updated in $g_t$ (non-sparse), the sum of the squares will be larger, leading to a decrease in the learning rate for the current $\theta_t$. Conversely, if fewer parameters are being updated in $g_t$ (sparse), the sum of the squares will be smaller, resulting in an acceleration of the update of $\theta_t$. This relationship is expressed in Equation 2.7.

\[
\begin{aligned}
g_t &= \nabla L(\theta_t) \\
m_t &= \beta_1 \cdot m_{t-1} + (1-\beta_1) \cdot g_t \\
v_t &= \beta_2 \cdot v_{t-1} + (1-\beta_2) \cdot g_t^2 \\
\hat{m}_t &= m_t / (1-\beta_1^t) \\
\hat{v}_t &= v_t / (1-\beta_2^t) \\
\theta_t &= \theta_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t
\end{aligned} \tag{2.8}
\]

The Adam algorithm workflow is presented in Equation 2.8. The algorithm incorporates momentum and adaptive learning rates to accelerate the machine learning model's training process and avoid local optima. The momentum term, $m_t$, controls the direction of the gradient descent, preventing the model from falling into a local optimum. The $v_t$ term adjusts the learning rate, $\eta$, based on the gradient's sparsity. By combining adaptive learning rates and momentum, Adam can optimize the model's parameters efficiently.

The weights of the first-order and second-order moment estimates are controlled by $\beta_1$ and $\beta_2$, respectively. $\beta_1$ is the exponential decay rate of the first-order moment estimate, and its value is usually set to 0.9 to balance the weight of the historical gradients. Similarly, $\beta_2$ is the exponential decay rate of the second-order moment estimate and is usually set to 0.999 to adjust the weight of the squared historical gradients.
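The following sketch implements one Adam step exactly as written in Equation 2.8, applied element-wise to a parameter tensor; it is for illustration only, and in practice torch.optim.Adam or AdamW would be used.

```python
import torch

def adam_step(theta, grad, m, v, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update following Equation 2.8 on a single parameter tensor."""
    m.mul_(beta1).add_(grad, alpha=1 - beta1)            # first-moment estimate m_t
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)  # second-moment estimate v_t
    m_hat = m / (1 - beta1 ** t)                         # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta.addcdiv_(m_hat, v_hat.sqrt().add_(eps), value=-lr)
    return theta

theta = torch.randn(10)
m, v = torch.zeros_like(theta), torch.zeros_like(theta)
for t in range(1, 201):                                  # minimize ||theta||^2 / 2
    adam_step(theta, theta.clone(), m, v, t)
print(theta.norm())                                      # close to the optimum at zero
```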
2.2.4 Adan

Xie et al. [25] recently proposed Adan, an optimization algorithm that employs Nesterov momentum [26]. By looking ahead with newer weights, Adan is able to anticipate the direction of the gradient descent and determine the final gradient update direction, which can potentially accelerate convergence compared to using momentum alone. However, a potential limitation of Nesterov momentum is the need to incorporate the yet-to-be-updated momentum into the network and propagate it to the loss value, resulting in multiple recomputations. As a consequence, the full potential of Nesterov momentum has not been realized, and further development of the method is required.

\[
\begin{aligned}
m_t &= \beta m_{t-1} + \nabla L(\theta_{t-1} - \alpha \beta m_{t-1}) \\
\theta_t &= \theta_{t-1} - \alpha m_t
\end{aligned} \tag{2.9}
\]

Equation 2.9 requires the gradient $\nabla L$ to be computed at the look-ahead point before the loss value is obtained. However, it is possible to modify the equation so that the stored historical gradients replace this extra calculation without altering the result:

\[
\begin{aligned}
m_t &= \beta m_{t-1} + \nabla L(\theta_{t-1}) + \beta\left[\nabla L(\theta_{t-1}) - \nabla L(\theta_{t-2})\right] \\
\theta_t &= \theta_{t-1} - \alpha m_t
\end{aligned} \tag{2.10}
\]

Equation 2.9 can thus be rewritten as Equation 2.10. This modification is equivalent and does not affect the convergence of the training. Furthermore, it becomes apparent that $\nabla L(\theta_{t-1}) - \nabla L(\theta_{t-2})$ approximates the second-order derivative of the objective function. By leveraging this information, the direction of the gradient change can be determined, thus achieving faster convergence compared to stochastic gradient descent.

\[
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1) g_t \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2)(g_t - g_{t-1}) \\
n_t &= \beta_3 n_{t-1} + (1-\beta_3)\left[g_t + \beta_2 (g_t - g_{t-1})\right]^2 \\
\eta_t &= \eta / \left(\sqrt{n_t} + \epsilon\right) \\
\theta_{t+1} &= (1 + \lambda_t \eta)^{-1}\left[\theta_t - \eta_t \circ (m_t + \beta_2 v_t)\right]
\end{aligned} \tag{2.11}
\]

The basic structure of Adan is presented in Equation 2.11. Notably, the authors have implemented first-order and second-order moment estimates similar to those employed in the Adam optimizer. To realize the Nesterov momentum, the previous gradient $g_{t-1}$ must be kept so that the difference $g_t - g_{t-1}$ can be computed once $g_t$ is obtained.
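The sketch below mirrors the update rule in Equation 2.11 for a single parameter tensor. It is only an illustration of the arithmetic: the hyperparameter values are placeholders, and the official Adan implementation (as well as the fused kernel developed in this thesis) differs in details such as hyperparameter conventions and multi-tensor handling.

```python
import torch

def adan_step(theta, grad, prev_grad, m, v, n,
              lr=0.1, beta1=0.98, beta2=0.92, beta3=0.99, eps=1e-8, weight_decay=0.0):
    """One Adan update following Equation 2.11 on a single parameter tensor."""
    diff = grad - prev_grad                           # g_t - g_{t-1}
    m.mul_(beta1).add_(grad, alpha=1 - beta1)         # m_t
    v.mul_(beta2).add_(diff, alpha=1 - beta2)         # v_t
    n.mul_(beta3).add_((grad + beta2 * diff) ** 2, alpha=1 - beta3)  # n_t
    eta_t = lr / (n.sqrt() + eps)                     # per-element learning rate
    theta.sub_(eta_t * (m + beta2 * v))
    theta.div_(1 + weight_decay * lr)                 # decoupled weight decay term
    return theta

theta = torch.randn(10)
m, v, n = (torch.zeros_like(theta) for _ in range(3))
prev_grad = torch.zeros_like(theta)
for t in range(1, 201):                               # minimize ||theta||^2 / 2
    grad = theta.clone()
    adan_step(theta, grad, prev_grad, m, v, n)
    prev_grad = grad
print(theta.norm())                                   # close to the optimum at zero
```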
To handle branches, modern GPUs re-insert the instructions of each branch into the instruction stream and issue them multiple times until all results have been calculated, using a mask to prevent some threads from writing results to registers, thereby completing the branching operation. As a result, an excessive number of branches in a kernel degrades the performance of the GPU. In contrast, a CPU has a more robust control unit that enhances its ability to handle instruction branches.

The high degree of parallelism and the hierarchical memory architecture of the GPU present possibilities for optimizing existing machine learning software. The pipelined structure of the GPU compute cores allows commonly used pipeline optimization techniques, such as loop unrolling and instruction rescheduling, to improve pipeline performance. Additionally, special vector instructions can be employed to enhance data-reading capability. The hierarchical cache structure enables caching of frequently used data, reducing unnecessary memory accesses and improving performance.

Despite advancements in computer systems, the CPU remains at the center and is referred to as the host, while GPUs and other accelerators are regarded as devices, functioning as subordinate peripherals. To use a GPU, engineers must manually specify the programs to run on it, which results in additional overhead. Therefore, reducing this overhead is a crucial topic for optimizing modern GPU programs.

2.3.2 TPU

The Tensor Processing Unit (TPU) [27] is a specialized accelerator for tensor operations, introduced by Google in 2017. Its primary function is to facilitate the efficient execution of the matrix operations that are pervasive in deep learning. The TPU is a custom-designed chip that consists of several dedicated components, such as a matrix multiplication unit, an accumulator, a nonlinear activation unit, and an on-chip buffer.

The Matrix-multiply unit (MXU) is the foundation of the TPU, and it presents a different architecture compared to a GPU. While a GPU is often described as a temporal architecture, the TPU is characterized as a spatial architecture; both employ similar computing units and use multiple units as computing components. The TPU's spatial design enables it to support a variety of datapaths, facilitating data exchange across multiple Processing Elements (PEs).

The systolic array is a classical spatial architecture that was first introduced by Kung et al. [28]. This architecture is characterized by its simplicity and low cost. The primary design goal of the systolic array is to reduce the overhead of data transfer between memories by extending the time data spends flowing through the computing units. The systolic array achieves this by reusing data as many times as possible in the processing pipeline, fetching data once, and cascading multiple PEs so that the results of the previous level of operations serve as input to the current PE.

As shown in Figure 2.6, the first data element enters the first PE and is processed, then passed to the next PE. Meanwhile, the second data element enters the first PE, and so on. By the time the first data element reaches the last PE, it has been processed multiple times. This pulsating architecture reuses input data multiple times, resulting in high computing throughput while consuming less memory bandwidth.
Figure 2.6: Principle of the systolic array

Upon expansion of the one-dimensional array shown in Figure 2.6 into a two-dimensional array, the resulting structure is recognized as the matrix multiplication unit (Figure 2.7) present in the TPU.

Figure 2.7: MXU architecture

In contrast to GPUs, TPUs utilize a systolic array, which is a data-flow-dependent computational architecture. This structure takes advantage of the potential parallelism in computationally intensive tasks, allowing for high throughput even at lower clock frequencies. Using the dynamic power formula $P = C \cdot f \cdot V^2$, it can be seen that operating at a lower circuit frequency at a given voltage leads to lower power consumption. Google's research [29] shows that a computing cluster of 16 TPUv2 chips can consume 36% less energy compared to a cluster of 16 V100 32GB GPUs.

The TPUv2 has a power consumption of 280W per chip and contains two MXU cores on each chip. The instance we have access to contains four TPUv2 chips, resulting in an available system power consumption of about 1120W. In comparison, the average power consumption of the NVIDIA V100 PCIe 16GB is about 250W per chip.

In summary, the goal is to compare TPUs with GPUs in terms of training time and cost of use, and furthermore to assess the level of complexity involved in transitioning from GPU training to TPU training, including aspects such as software reliability and usability. This comparison will give users more insight into choosing the proper configuration to accelerate their machine learning tasks.

2.4 Mixed Precision Training

Mixed precision training [12] is a popular technique that exploits the low precision of FP16 to reduce computation and memory overhead while maintaining model accuracy. The key idea is to store the network parameters in the low-precision FP16 format while maintaining master copies in the high-precision FP32 format to preserve accuracy. During training, the range of parameter values is usually kept within a limited range, which makes the use of FP16 parameters a viable option with little impact on accuracy. However, the gradient has the potential to suffer from numerical underflow, especially in the later stages of training, resulting in a loss of accuracy. The basic weight update process is illustrated in Figure 2.8.

Figure 2.8: Mixed precision training weight update

To alleviate this problem, a technique called gradient scaling is introduced: the loss is multiplied by a loss scale so that the resulting FP16 gradients are mapped into a range of values representable in FP16, avoiding underflow during backpropagation. At the end of backpropagation, the gradient is divided by the loss scale so that its true value is restored.

Figure 2.9: BF16, FP16 and FP32 formats, showing the different bit allocations of the three floating-point representations.

If the gradient is found to have underflowed during the update process, the update is skipped. In the parameter update step, all updates are performed on the FP32 copies of the model parameters to ensure numerical stability. This approach results in faster training and lower memory consumption due to the use of the less accurate FP16 parameters, while the use of the more accurate FP32 copies ensures accurate parameter updates.

FP16 and BF16

Mixed-precision training gains a significant advantage from low-precision floating-point representations such as FP16 and BF16.
FP32 and FP16 are commonly used in GPU training, while BF16 is preferred for Google's TPU training. Both FP16 and BF16 use 16 bits of data but differ in the lengths of their mantissa and exponent, as illustrated in Figure 2.9. Note that BF16 has the same number of exponent bits as FP32 and therefore the same representable range as FP32.

Converting generic FP32 data to BF16 is simple, as the exponent of BF16 is the same as that of FP32 and requires no special processing. However, converting from FP32 to FP16 requires special handling due to the difference in their representation capabilities. Typically, FP32 values need to be rescaled to adapt to the range of FP16.

2.5 LightSeq Library

LightSeq, originally created for high-performance inference with Transformers, has been extended to also accelerate the training and fine-tuning of Transformer-based models on GPUs [30], [31]. As previously stated, Transformer-based models are widely used due to their stellar performance in many ML tasks, with one of the most common approaches being fine-tuning large pre-trained models such as BERT, GPT, etc.

LightSeq comes with a series of GPU optimization techniques specifically tailored to the computation and memory access flows of Transformer-based models in order to improve performance. LightSeq focuses on the following optimization techniques: Computational Graph Optimization, Dependent Reduction Rewriting, Accelerated Mixed-Precision Update for Trainer, and Dangling-Tensor Aware Memory Manager.

Computational Graph Optimization - LightSeq aims to replace as many fine-grained GPU kernel operations implemented in PyTorch as possible with coarse-grained fused ones, which reduces the overhead caused by launching kernel functions. General matrix multiply (GEMM) operations such as linear transformations and scaled dot products use the implementations found in cuBLAS, whilst non-GEMM operations such as Dropout, LayerNorm and Softmax are re-implemented in LightSeq. Element-wise operations such as Dropout, ReLU, etc. are fused into coarse-grained kernels if they are adjacent in the computational graph of the Transformer. These operations also enable further optimization in the shape of parallelism. Layers such as the embedding layer and the criterion layer are also optimized with more parallelism, and Softmax is reworked so that the logarithmic Softmax is calculated instead of the Softmax directly.

Accelerated Mixed-Precision Update for Trainer - Parameters and gradients are FP16 during forward- and back-propagation when using mixed precision; however, the optimizer requires FP32 copies to ensure accuracy. Traditionally, each gradient or parameter is copied and stored as an FP32 copy, and then loaded when needed by the optimizer. This approach introduces two inefficiencies: mass kernel launches when copying and updating, which reduce GPU utilization, and redundant memory footprints (storing both FP16 and FP32 gradients/parameters), which reduce memory efficiency. LightSeq resolves these two inefficiencies by using symbolic tensor linking and on-the-fly conversion. When initializing the optimizer, parameters and gradients are copied into a separate workspace tensor, which is then linked to the optimizer. LightSeq executes the optimizer update only once per training step to update the workspace, instead of once per updated parameter/gradient, which reduces kernel launches.
When loading from the workspace, the parameters/gradients are loaded as FP16 and then converted on-the-fly to FP32 in registers to be updated. Once updated, they are re-converted to FP16 before being stored in the workspace. This reduces data movement by 50% and avoids redundant FP32 copies, since all memory accesses are done using FP16.

Dependent Reduction Rewriting - Batch reduction operations such as LayerNorm and Softmax traditionally take a lot of time during training and are therefore a focus of optimization in LightSeq. Changes to how the batch reduction operations of LayerNorm are performed, together with a rearrangement of the gradient calculation formula, allow the batch reduction operations to be executed in parallel. LightSeq stores parameters as FP16 floating-point numbers but converts them to FP32 during computation, because LayerNorm is sensitive to the low precision of floating-point numbers.

Dangling-Tensor Aware Memory Manager - In general, large-batch training leads to faster convergence and higher accuracy; however, due to memory limitations it requires either multiple GPUs, memory offloading or gradient accumulation. LightSeq divides GPU memory into permanent and temporary memory, used to store parameters and gradients as well as intermediate tensors, respectively. During the initialization phase, LightSeq scans the training set and estimates the capacity required; the temporary memory is then allocated once, reused by different batches, and finally released at the end of training. This reduces the memory footprint by compacting memory and reducing the number of allocations and releases.

2.6 Distributed Training

Distributed training is a technique used to address the problem of handling the large volumes of data needed to train machine learning models, especially neural networks. Given the limited computing power of individual machines, it has become necessary to distribute machine learning workloads across multiple machines, transforming a centralized system into a distributed one. Two primary parallel approaches, data parallelism and model parallelism, are used to facilitate distributed training. These methods can be used independently or in combination. In addition, when working with distributed training, it is essential to consider the fault tolerance and scalability of the system.

2.6.1 Topology and Communication

The Parameter Server [32] and Ring All Reduce [33] algorithms are widely used in industry, for example at Google and Microsoft. Parameter Server involves a master server and multiple worker nodes and is therefore an asymmetric system, where worker nodes obtain all parameters of the model from the master node through inter-node communication. Ring All Reduce utilizes a symmetric point-to-point system where all nodes are equal and form a ring structure. The transmission of node parameters in this system involves each node transmitting its parameters to the node on its left and receiving the remaining parameters from the node on its right. This process effectively reduces network communication, allowing all nodes to obtain updated values of the model parameters during the transmission process.

There are two modes of inter-node communication when obtaining model parameters: synchronous and asynchronous. In a symmetric system, where each node has comparable computational speed, synchronous communication is preferred to maintain the model's accuracy. However, in systems with nodes of varying performance, synchronous communication can limit system throughput.
In such scenarios, asynchronous communication can improve system throughput by reducing the waiting time of faster nodes. However, asynchronous communication introduces random noise (staleness), which represents an out-of-date direction of gradient descent. This noise can act similarly to the momentum used in the optimizer [34], facilitating the parameter update process and preventing the model from falling into a local optimum.

Figure 2.10: Distributed training topology; on the left is the parameter server, on the right is the ring organization.

Figure 2.11: Data Parallel

2.6.2 Data Parallel

The data parallel approach involves replicating the model parameters on each node in a distributed system. To facilitate training, the training data is divided into N parts, where N is equal to the number of nodes in the system. During the training process, each node updates the gradient information based on its current batch of data. An update mechanism, such as Parameter Server or Ring All Reduce, is used to ensure that these updates are propagated throughout the system and applied to all copies of the model parameters.

As shown in Figure 2.11, the training data is divided into four parts and the entire model is stored on each device; each device trains on a quarter of the dataset, and updated parameters are transferred between the devices to ensure that the model parameters descend in the correct direction.

Data parallelism is a popular technique in distributed training, especially when the model size is within the limits of a single GPU's capacity. However, when the model is too large to fit into a single GPU, alternative methods such as model parallelism or pipeline parallelism should be considered. These alternatives come with added communication and scheduling overhead.

2.6.3 Workflow

The data parallel execution model currently in use operates as follows. The distributed initiator broadcasts the training data from the node holding the data to the entire communication network via Broadcast. Subsequently, each GPU starts the forward pass, which does not require communication between GPUs because there are no data dependencies.

However, during the backward pass, the gradient information of each layer needs to be exchanged with the other devices so that the entire communication network obtains the accumulated gradient information of the whole mini-batch. This step is known as a reduce sum within collective communication. In practice, communication is initiated as soon as a layer in the network completes backpropagation. This provides the advantage of overlapping communication with computation and can mask the communication overhead.

In this parallel process of computation and communication, the slower communication process often determines the time overhead. A layer may finish its computation but have to wait for the communication channel to become available, resulting in a cumulative delay in the backpropagation process. Once all devices have obtained the complete gradient information, averaging the gradient within the batch gives the optimizer the gradient it requires, and executing optimizer.step() completes the update process. Therefore, the bottlenecks that affect the training speed throughout a training step are the communication time between GPUs and the computation time of the backpropagation process.
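To illustrate this workflow, the sketch below shows how such a data parallel loop is commonly expressed in PyTorch. It is a minimal example under stated assumptions, not the exact training code used in this thesis: it assumes a process group has already been initialized with torch.distributed.init_process_group, that model, optimizer and train_loader exist, and that local_rank identifies the GPU of the current process. Wrapping the model in DistributedDataParallel is what launches the per-layer gradient all-reduce during the backward pass, overlapping communication with the remaining computation.

from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes torch.distributed.init_process_group("nccl") has already been called,
# and that model, optimizer, train_loader and local_rank are defined elsewhere.
model = model.to(local_rank)
ddp_model = DDP(model, device_ids=[local_rank])

for batch in train_loader:
    batch = {k: v.to(local_rank) for k, v in batch.items()}
    optimizer.zero_grad()
    loss = ddp_model(**batch).loss
    loss.backward()   # gradient all-reduce is launched per layer/bucket during backward
    optimizer.step()  # every replica applies the same averaged gradient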
2.6.4 Collective Communication

Collective communication refers to a global communication process involving all processes in a particular process group. It consists of fundamental operations such as send, receive, copy, barrier synchronization, and the synchronization of processes between nodes (signal + wait). These fundamental operations are combined to form communication templates, also known as communication primitives. The challenge of collective communication lies in ensuring communication efficiency and making optimal use of the network hardware connection topology.

In the forward propagation of distributed training, Broadcast is mainly used to distribute the training data to each node. Figure 2.12 shows the process of Broadcast, where the node owning the data communicates with every other node once and only once. Thus, the communication overhead of Broadcast grows linearly with N, the number of nodes in the communication world.

Figure 2.12: Broadcast communication

Allreduce is a collective communication operation that aggregates data from multiple processing units and combines this data into a global result using a designated operator. The resulting data is then distributed to all processing units. Allreduce is performed in several stages. First, each participant disperses its vectors. Second, each participant gathers vectors from the other participants. Finally, each participant performs the reduction operation on all accumulated vectors. By using different Allreduce sequences for different participants, highly complex computations can be distributed across many computational units. Allreduce is widely used by parallel applications in high-performance computing (HPC) related to scientific simulation and data analysis, such as machine learning computations, as well as the training phase of deep learning neural networks. Given the substantial growth of deep learning models and the complexity of scientific simulation tasks employing networks, an effective implementation of Allreduce is critical to reducing communication time.

Figure 2.13: Allreduce communication

Assume there are $P$ processes and a total of $N$ data elements. In the naive approach, where a main process handles the data, the other processes send their data to the main process, resulting in $(P-1)N$ data received. After the main process finishes handling the data, it broadcasts the result to the other processes, resulting in $(P-1)N$ data sent. Therefore, the total amount of communication is $2(P-1)N$, which means that the communication overhead increases linearly with the number of processes $P$.

In the Ring-AllReduce algorithm, the amount of communication per process can be calculated as follows. Each process must receive $\frac{N}{P}(P-1)$ data from the process on its left and send the same amount of data to the process on its right. Thus, the total communication per process is $\frac{2N}{P}(P-1)$. Chunking the data into pieces of size $\frac{N}{P}$ and distributing the chunks among the processes reduces communication bottlenecks. As the number of processes $P$ increases, the total communication converges to $2N$, indicating that the communication does not grow as the number of nodes in the communication world increases.

Therefore, the Ring-AllReduce algorithm is more efficient than the naive algorithm because it distributes computation and communication evenly among all participating processes, eliminating bottlenecks. Ring-AllReduce is commonly used in many AllReduce implementations, especially for distributed deep learning workloads.
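In practice, these primitives are exposed directly by communication libraries such as torch.distributed (backed by NCCL on NVIDIA GPUs). The sketch below is a minimal, illustrative example under the assumption that a process group has already been initialized and that each process owns one GPU; it broadcasts a batch from rank 0 and then sums a stand-in gradient tensor across all ranks, which is the reduce-sum pattern described above. The tensor shapes are arbitrary placeholders.

import torch
import torch.distributed as dist

# Assumes dist.init_process_group("nccl") has already been called by a launcher.
rank = dist.get_rank()
device = torch.device("cuda", rank)

batch = torch.empty(32, 128, device=device)
if rank == 0:
    batch.normal_()                               # rank 0 owns the training data
dist.broadcast(batch, src=0)                      # Broadcast: data-owning node sends to every other node

grad = torch.ones(1024, device=device) * rank     # stand-in for a local gradient
dist.all_reduce(grad, op=dist.ReduceOp.SUM)       # Allreduce: every rank obtains the global sum
grad /= dist.get_world_size()                     # average over the mini-batch replicas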
3 Methods

This chapter explores various strategies for optimizing performance when fine-tuning BERT models. Specifically, we examine the Adan optimizer and implement a hardware-optimized version using kernel fusion, multi-tensor access and instruction-level parallelism (ILP). In addition, this chapter investigates the effects on fine-tuning performance of techniques such as mixed precision training, distributed training, and LightSeq's acceleration library. Finally, all explored optimization techniques are used in a combined search over various hardware (NVIDIA A100, V100, A10, T4, and Google TPUv2, TPUv3) to find optimal solutions.

3.1 Fused Adan

The Adan optimizer is a novel optimizer that diverges from the widely used Adam optimizer. It leverages Nesterov momentum to accelerate the gradient descent algorithm, resulting in faster model training. As the algorithm is still in its early release, there is currently no optimized kernel available for the GPU. The implementation of fused Adan is introduced in the following subsections.

3.1.1 Profiling

The current Adan kernel is not equipped with a high-performance fused kernel, and its original implementation is based on the tensor compute interface provided by PyTorch. Unfortunately, this implementation may result in potential performance losses. Therefore, the original Adan implementation is profiled to investigate potential issues, see Figure 3.1.

By measuring the time consumed by optimizer.step() during one optimization step, it is possible to examine how much time is spent on computation tasks and on data preparation tasks. Profiling a single optimizer step resulted in a total time of 232.866 ms. This result can be broken down into two parts, computation and data preparation, which take 169.331 ms and 63.535 ms, respectively.

The profiling reveals that there are 4422 kernel launches during one optimization step, with 804 kernels dedicated to data preparation and 3618 kernels used for computation. Additionally, a kernel launch time graph for this original implementation is provided in Figure 3.1. This figure illustrates that the kernels are arranged loosely on the timeline, which results in a significant amount of GPU time being wasted and shows potential opportunities for optimization.

Figure 3.1: Unfused Adan kernel launch time graph with single tensor access.

According to the profiling results, the Adan optimizer takes a total computation time of 169,331 µs. However, only 31,293.63 µs is actually spent on actively running kernels, which implies that the computing units of the device are active only during this time. As a result, only 18.48% of the total time is spent on actual computation. This suggests that there is a considerable amount of overhead that could be eliminated through optimization.

3.1.2 Kernel fusion

Kernel fusion [35] is a method used to eliminate the extra overhead caused by multiple kernel launches and may also improve power efficiency on GPUs. By reusing data that has already been loaded into registers or shared memory, redundant loads and stores can be reduced. A kernel is essentially a function call, such as torch.mul() or torch.add(). To illustrate, consider the example code below: calculating the element-wise expression R = A*B + C*D with the PyTorch implementation requires dividing it into three tensor-operation kernel launches.
import torch

# Unfused version
# Create variables, matrix size (512, 512)
A = torch.rand(512, 512).to("cuda")
B = torch.rand(512, 512).to("cuda")
C = torch.rand(512, 512).to("cuda")
D = torch.rand(512, 512).to("cuda")

# Calculate R = A*B + C*D, with intermediate variables M, N
M = torch.mul(A, B)
N = torch.mul(C, D)
R = torch.add(M, N)

# Calculate R = A*B + C*D, fused version
fused_kernel(A, B, C, D, R)

/* Following code is written in CUDA style */
__global__ void fused_kernel(const float *A, const float *B,
                             const float *C, const float *D, float *R) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < 512 && col < 512) {
        int idx = row * 512 + col;
        // Both element-wise multiplications and the addition are
        // performed in a single kernel launch.
        R[idx] = A[idx] * B[idx] + C[idx] * D[idx];
    }
}

In the code above, torch.mul() is used twice and torch.add() once, with each function call incurring a GPU kernel launch, leading to increased overhead from resource allocation, instruction scheduling and data transfer. The fused kernel combines the two torch.mul() calls and the torch.add() call into a customized kernel, so that only one kernel launch is needed to perform the same operations as before.

3.1.3 Memory Hierarchy

In modern GPUs, multiple levels of storage are available for general-purpose computing, typically four levels: DRAM, scratchpad memory (shared memory), local memory, and registers. To access data, a thread must first locate it in DRAM and then copy it to its own storage, such as registers or local memory. Local memory is physically located in DRAM, but since its addresses can be determined at compile time, it can be cached. Registers are preferred for frequently accessed data because they have the fastest access speed and are closest to the compute units. However, registers are a limited resource due to hardware constraints. Scratchpad memory can be used as an alternative and has access speeds similar to registers.

The fused Adan kernel implementation avoids excessive access to device DRAM, which can lead to a memory wall, by using registers to store frequently accessed variables. This technique can significantly improve the arithmetic intensity at the DRAM level and reduce GPU idle time.

3.1.4 Vector Instruction

Vector instructions are a type of instruction that operates on multiple data elements in a single operation, derived from the Single Instruction Multiple Data (SIMD) model. In addition to their use in multimedia processing, such as image and audio/video processing, scientific simulation, and machine learning, vector instructions are widely used in high-performance computing due to their ability to reduce the number of instructions needed to operate on large data sets, resulting in increased throughput.

NVIDIA supports vector instructions in its Parallel Thread Execution (PTX) instruction set, such as vectorized load and store instructions. Using vectorized load instructions reduces the total number of instructions, reduces latency, and improves bandwidth utilization. However, the corresponding instruction cycles are longer than those of non-vectorized instructions, making this trade-off beneficial for medium-performance workloads.
/*0088*/ IMAD R10.CC, R3, R5, c[0x0][0x140]
/*0090*/ IMAD.HI.X R11, R3, R5, c[0x0][0x144]
/*0098*/ IMAD R8.CC, R3, R5, c[0x0][0x148]
/*00a0*/ LD.E.64 R6, [R10]
/*00a8*/ IMAD.HI.X R9, R3, R5, c[0x0][0x14c]
/*00c8*/ ST.E.64 [R8], R6

As an example, the code above utilizes the IMAD instruction to compute the address of the vector load. It subsequently uses LD.E.64 to load two 32-bit floating-point numbers into the R6 register simultaneously. The IMAD instruction is then employed again to calculate the data storage address, and the ST.E.64 instruction is used to store the data at the address indicated by the R8 register in a single operation.

Vector loading can be utilized through the vector data types defined in the CUDA C/C++ standard header, such as int2, int4, or float4. In the fused Adan single tensor access kernel, the float4 type is used for processing FP32 data. As a result, the compiler is able to emit vectorized load instructions, such as LD.E.128, to retrieve the target data in one go instead of employing four LD.E instructions for data access. The following code provides further insight into the fused Adan kernel.

float4* p4_ptr = reinterpret_cast<float4*>(p);
/* Variable pointers ... */
float4* exp_avg_diff4_ptr = reinterpret_cast<float4*>(exp_avg_diff);

/* Vectorized load from global memory */
float4 p4 = p4_ptr[global_id];
float4 g4 = g4_ptr[global_id];
const float4 neg_grad4 = neg_grad4_ptr[global_id];
float4 exp_avg4 = exp_avg4_ptr[global_id];
float4 exp_avg_sq4 = exp_avg_sq4_ptr[global_id];
float4 exp_avg_diff4 = exp_avg_diff4_ptr[global_id];

3.1.5 Instruction Level Parallelism

The concept of instruction-level parallelism plays a critical role in computer architecture, enabling a higher number of instructions to be executed per cycle by leveraging the microprocessor pipeline architecture and minimizing pipeline bubbles through instruction scheduling. The fused Adan multi-tensor access kernel extends the instruction scheduling space by utilizing loop unrolling techniques, allowing the compiler to reorder instructions based on pipeline characteristics during the compilation stage for optimal hardware performance.

Loop unrolling has several benefits, including reducing the overhead caused by branch prediction failures and minimizing the number of instructions required to maintain the loop counter. For example, in a loop such as for (int i = 0; i < 100; ), incrementing with i += 2 instead of i += 1 reduces the number of ADD instructions required to maintain the value of i from 100 to 50.

In addition, unrolling the loop allows two iterations to be processed at a time, enlarging the instruction scheduling space so that the pipeline can be fully utilized to hide access latency and improve throughput, especially when dealing with time-consuming memory accesses such as data loads and stores inside the loop body.

The provided code demonstrates the implementation of loop unrolling in fused Adan multi-tensor access. Specifically, by setting ILP to 4, each thread processes four elements per loop iteration. The compiler is instructed to perform loop unrolling using the #pragma unroll directive, which enhances both code readability and efficiency.

#define ILP 4
#pragma unroll
for (int ii = 0; ii < ILP; ii++) {
    int i = i_start + threadIdx.x + ii * blockDim.x;
    if (i < n && i < chunk_size) {
        r_p[ii] = p[i];  // r_ denotes a register variable
        r_g[ii] = g[i];
        r_exp_avg[ii] = exp_avg[i];
        r_exp_avg_sq[ii] = exp_avg_sq[i];
        r_exp_avg_diff[ii] = exp_avg_diff[i];
        r_neg_grad_diff[ii] = neg_grad[i];
    } else {
        //...
    }
}

3.2 Multi Tensor Access

Multi tensor access is a grouping technique that improves the device's memory access efficiency and results in significant performance improvements. In situations where a multitude of small tensors exist, it is possible to improve the process by aggregating the individual tensor accesses into a larger one, see Figure 3.2. This greatly simplifies the operation, leading to fewer kernel launches and global memory accesses, ultimately improving performance.

Figure 3.2: Multi Tensor Access, with a chunk size of 3, meaning that the tensors (red, yellow and green) are split into chunks of 3 for more efficient memory accesses.

3.2.1 Profiling

Table 3.1 provides the SM and memory throughput for different kernel sizes when using the single tensor access fused Adan kernel.

Grid Size   Block Size   SM Throughput   Memory Throughput
21747       1024         39.45%          82.44%
2304        1024         37.34%          79.22%
576         1024         31.44%          62.96%
2           1024         0.78%           0.89%
1           1024         0.45%           0.50%

Table 3.1: Typical kernel sizes with their SM/memory throughput

Figure 3.3 presents the SM and memory throughput curves for running the Adan kernel. The results show that the single tensor access kernel leads to frequent kernel launches with small sizes, causing a significant reduction in both SM and memory throughput. The data in Table 3.1 show a sharp decline in both computational and memory throughput as the grid size decreases. Therefore, it is beneficial to implement multi-tensor access to mitigate the variations caused by the kernel size. This not only enhances the efficiency of the system but also ensures consistent performance.

Figure 3.3: 1-step benchmark displaying the GPU's SM and memory throughput when using single tensor access.

3.2.2 Multi Tensor Apply

In order to use the PyTorch optimizer interface, it is necessary to specify the foreach parameter, which determines the mode in which parameters are accessed for optimization. When foreach is False, the optimization process iterates through each parameter tensor individually, leading to the sequential updating of these individual tensors. However, when foreach is True, the optimizer groups the parameter tensors according to a defined chunk size before sending them to the kernel as a tensor list.

To enable multi tensor access, the tensors need to be packed before launching the kernel computations. To reduce the startup overhead caused by many small tensors, multiple tensors are bundled into fixed-size chunks for computation, as depicted in Figure 3.2.

for (int t = 0; t < ntensors; t++) {
    tl.sizes[loc_tensor_info] = tensor_lists[0][t].numel();
    for (int d = 0; d < depth; d++)
        ...
    for (int chunk = 0; chunk < chunks_this_tensor; chunk++) {
        ...
        if (tensors_full || blocks_full || last_chunk) {
            multi_tensor_apply_kernel<<<...>>>(
                chunk_size, noop_flag.DATA_PTR(), tl, callable, args...);
        }
    }
}

The first condition, tensors_full, requires the chunk sent to the kernel to be optimally sized for the kernel operation. The second condition, blocks_full, ensures that the kernel operates with the maximum allocated number of blocks, which prevents performance bottlenecks caused by too many or too few blocks. The last condition, last_chunk, handles tensors that cannot be processed under the first two conditions, ensuring that all tensors are computed and none are missed.
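To make the grouping idea more concrete, the sketch below shows a simplified Python analogue of multi tensor apply rather than the CUDA implementation used in this thesis. The names chunked_apply and fused_update are hypothetical placeholders: many small parameter tensors are flattened into one buffer, processed in fixed-size chunks by a single fused operation per chunk, and then scattered back.

import torch

def chunked_apply(tensors, chunk_size, fused_update):
    """Pack many small tensors into fixed-size chunks and apply one fused update per chunk."""
    flat = torch.cat([t.reshape(-1) for t in tensors])    # aggregate individual tensor accesses
    for start in range(0, flat.numel(), chunk_size):
        fused_update(flat[start:start + chunk_size])      # one launch covers a whole chunk (in-place)
    # Scatter the updated values back into the original tensors.
    offset = 0
    for t in tensors:
        n = t.numel()
        t.copy_(flat[offset:offset + n].view_as(t))
        offset += n

# Example: a trivial element-wise "update" standing in for the fused Adan kernel.
params = [torch.randn(s) for s in (10, 300, 7)]
chunked_apply(params, chunk_size=256, fused_update=lambda c: c.mul_(0.9))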
3.3 Mixed Precision Training

The current workflow does not utilize mixed precision training, resulting in a missed opportunity for potential performance and energy benefits. The most time-consuming GPU kernels during training are shown in Figure 3.4.

Figure 3.4: Profiling showing the most time-consuming GPU kernels during one training step when using the Hugging Face model without mixed precision.

In the realm of deep learning, matrix multiplication is acknowledged as a computation-intensive kernel. Therefore, each of the single-precision general matrix multiply (sgemm) kernels illustrated in Figure 3.4 can be regarded as an operator of high computational intensity. The sgemm kernels account for 93.6% of the overall time consumption in one training step. It follows that any acceleration of the sgemm kernels can lead to noteworthy enhancements in the overall training performance. Table 3.2 shows the execution time of some typical sgemm kernels with different matrix sizes.

Kernel                               Mean Duration (µs)
volta_sgemm_128x64_nt                1274
volta_sgemm_128x32_nn                1205
volta_sgemm_32x128_tn                743
volta_sgemm_128x32_tn                3113
volta_sgemm_128x32_sliced1x4_nt      3033

Table 3.2: Average execution time of sgemm kernels of different sizes when not using mixed precision.

3.3.1 Enable Mixed Precision Training

By introducing automatic mixed precision (AMP) training into the training flow, the process can be optimized in terms of time and energy consumption for better results. This thesis leverages Hugging Face's Accelerate library to automate the scaling of numerical ranges and the FP32-FP16 conversion. Only minor modifications to the training loop are required to enable automatic mixed precision training.

from accelerate import Accelerator

accelerator = Accelerator()
model, optimizer, train_loader, scheduler = accelerator.prepare(
    model, optimizer, train_loader, scheduler)
...
# Code in trainer
for step, batch in enumerate(self.train_dataloader):
    batch = {k: v.to(self.device) for k, v in batch.items()}
    self.model.zero_grad()
    outputs = self.model(**batch)
    loss = outputs.loss
    self.accelerator.backward(loss)  # original: loss.backward()
    # ...

The code above demonstrates the initialization required before using the accelerator. In this code, the loss is passed to the backward function provided by the accelerator. To ensure proper functioning, it is crucial to initialize the model beforehand. During initialization, the accelerator object establishes internal links to the original model parameters. Once the loss is passed in, the FP16 model optimized by the accelerator can be used after range scaling.

Mixed precision training offers two main advantages, briefly described below.

1. Reduced memory usage: FP16/BF16 is half the length of FP32, meaning that more data can be transferred during training, which may enable more data to be used per batch. However, depending on the implementation, mixed precision might yield higher memory requirements. For example, Hugging Face's implementation of mixed precision stores FP32 copies of the model parameters and stores the gradients in FP32, whilst also storing the FP16 values of the model parameters, meaning that more memory is actually required for storing model parameters and gradients. In particular, Hugging Face's implementation requires 4 bytes per model parameter and 4 bytes per gradient during FP32 training,
whilst requiring 6 bytes per model parameter and 4 bytes per gradient during mixed precision training. This is unlike the LightSeq implementation, which avoids storing FP32 copies and instead uses on-the-fly FP16 <-> FP32 conversion, meaning that LightSeq reduces the memory requirements to only 2 bytes per model parameter and 2 bytes per gradient. This yields both a speedup, since every memory access is FP16, and lower memory requirements.

2. More computationally efficient: Specialized AI acceleration chips, such as the NVIDIA V100 GPU and Google's TPU, can execute operations faster using FP16 than FP32. In NVIDIA's GPUs, FP32 data is converted to FP16 data in the SMs and then fed into the tensor cores for accelerated operations through a specific FP32-FP16 data workflow.

3.4 LightSeq Training

Apart from the computationally intensive operations in the Transformer encoder, other operations such as transposes, bias additions, softmax, and data transfer between the device and the host may also impact performance. To enhance the fine-tuning process, it is of interest to minimize the potential overhead caused by these operations.

The current implementation of the BERT model by Hugging Face prioritizes adhering to the original structure of the model rather than focusing on performance optimization. As a result, Hugging Face's implementation provides a stable and convenient API for data scientists to use; however, there are many optimizations that can be made to increase performance. Table 3.3 shows the average execution time of the Softmax and LayerNorm kernels in the original Hugging Face implementation. Furthermore, the time consumption of the Hugging Face encoder in both the forward and backward passes without mixed precision is also found in the table.

Kernel                            Average Duration
softmax_warp_forward              497 µs
softmax_warp_backward             712 µs
vectorized_layer_norm_kernel      80 µs
encoder forward                   14.8 ms
encoder backward                  25 ms

Table 3.3: Average execution time of common kernels within the Hugging Face model.

LightSeq offers an alternative implementation of the Hugging Face Transformer, which surpasses Hugging Face's BERT model implementation in terms of performance. This is accomplished by sacrific