Hardware Acceleration of Machine Learning

dc.contributor.author: Chen, Fangzhou
dc.contributor.author: Sköld, William
dc.contributor.department: Chalmers University of Technology / Department of Computer Science and Engineering
dc.contributor.examiner: Petersen Moura Trancoso, Pedro
dc.contributor.supervisor: Petersen Moura Trancoso, Pedro
dc.date.accessioned: 2023-11-23T13:47:17Z
dc.date.available: 2023-11-23T13:47:17Z
dc.date.issued: 2023
dc.date.submitted: 2023
dc.description.abstract: The Transformer architecture is widely used across many fields, as demonstrated by GPT-3, a large language model with impressive performance. Achieving such performance, however, demands substantial computational resources, so improving the computational capability of current machine learning systems is of great importance. This thesis aims to optimize and accelerate the fine-tuning of Transformer-based models with respect to several evaluation criteria: training time, energy consumption, cost, and hardware utilization. In addition, GPU training settings are compared with specialized AI accelerators, such as TPU training settings. In our study, a high-performance kernel for the Adan optimizer is introduced, and the LightSeq library is applied to accelerate existing Transformer components. We also introduce mixed precision training into our workflow and compare each of these optimization techniques step by step against baseline performance. Our analysis further covers distributed training on multiple GPUs, and a backpropagation time estimation algorithm is introduced. Next, Google's TPU accelerator is used to run our task, and its performance is compared with a similar GPU setup from our study. Finally, the advantages and disadvantages of the different methods are systematically analyzed while training on V100, A100, A10, and T4 GPUs with different configurations, and the GPU and TPU workflows are compared to illustrate the pros and cons of each accelerator. Various weights for scoring optimization methods based on time, energy consumption, cost, and hardware utilization are proposed. Our analysis shows that optimal scores on all metrics are achieved by combining the optimized LightSeq model, kernel fusion for the Adan optimizer, and mixed precision training. While TPU training offers certain advantages, such as large batch sizes when loading training data, the ease of use, reliability, and software stability of GPU training surpass those of TPU training.
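The weighted scoring of optimization methods mentioned in the abstract can be sketched as a simple weighted sum over normalized criteria. The weights and metric values below are illustrative placeholders, not figures from the thesis:

```python
def weighted_score(metrics, weights):
    """Combine normalized metrics into a single score.

    metrics: dict mapping criterion -> normalized value in [0, 1],
             where higher is better (e.g. inverse relative training time).
    weights: dict mapping criterion -> relative importance (summing to 1).
    """
    return sum(weights[k] * metrics[k] for k in weights)

# Hypothetical comparison of a baseline run against an optimized run
# (LightSeq + fused Adan kernel + mixed precision); all numbers invented.
weights = {"time": 0.4, "energy": 0.3, "cost": 0.2, "utilization": 0.1}

baseline  = {"time": 0.5, "energy": 0.5, "cost": 0.5, "utilization": 0.6}
optimized = {"time": 0.9, "energy": 0.8, "cost": 0.7, "utilization": 0.8}

print(weighted_score(baseline, weights))   # 0.51
print(weighted_score(optimized, weights))  # 0.82
```

Normalizing each criterion so that higher is better lets one set of weights rank configurations that trade time against energy, cost, or utilization.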
dc.identifier.coursecode: DATX05
dc.identifier.uri: http://hdl.handle.net/20.500.12380/307395
dc.language.iso: eng
dc.setspec.uppsok: Technology
dc.subject: Transformer
dc.subject: GPU
dc.subject: Distributed
dc.subject: Energy consumption
dc.subject: Fine-tuning
dc.subject: TPU
dc.title: Hardware Acceleration of Machine Learning
dc.type.degree: Master's Thesis
dc.type.uppsok: H
local.programme: Computer systems and networks (MPCSN), MSc
local.programme: High-performance computer systems (MPHPC), MSc

Download

Original bundle

Name: CSE 23-36 FC WS.pdf
Size: 4.32 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 2.35 KB
Format: Item-specific license agreed to upon submission