Hardware Acceleration of Machine Learning
| dc.contributor.author | Chen, Fangzhou | |
| dc.contributor.author | Sköld, William | |
| dc.contributor.department | Chalmers tekniska högskola / Institutionen för data och informationsteknik | sv | 
| dc.contributor.department | Chalmers University of Technology / Department of Computer Science and Engineering | en | 
| dc.contributor.examiner | Petersen Moura Trancoso, Pedro | |
| dc.contributor.supervisor | Petersen Moura Trancoso, Pedro | |
| dc.date.accessioned | 2023-11-23T13:47:17Z | |
| dc.date.available | 2023-11-23T13:47:17Z | |
| dc.date.issued | 2023 | |
| dc.date.submitted | 2023 | |
| dc.description.abstract | The Transformer architecture is widely used across many fields, as demonstrated by GPT-3, a large language model with impressive performance. Achieving such performance, however, requires substantial computational capability, so improving the computational power of current machine learning systems is of great importance. This thesis aims to optimize and accelerate the fine-tuning of Transformer-based models with respect to several evaluation criteria: training time, energy consumption, cost, and hardware utilization. It also compares GPU training setups with specialized AI accelerators such as TPUs. In our study, we introduce a high-performance kernel for the Adan optimizer and apply the LightSeq library to accelerate existing Transformer components. We also introduce mixed precision training into our workflow and compare these optimization techniques step by step against the baseline. In addition, our analysis covers distributed training with multiple GPUs and introduces a backpropagation time estimation algorithm. We then run our task on Google’s TPU accelerator and compare its performance with that of a similar GPU setup. Finally, we systematically analyze the advantages and disadvantages of the different methods when training on V100, A100, A10, and T4 GPUs with different configurations, and we analyze the GPU and TPU workflows to illustrate the pros and cons of each accelerator. We propose various weights for scoring optimization methods based on time, energy consumption, cost, and hardware utilization. Our analysis shows that the best scores on all metrics are achieved by combining the optimized LightSeq model, kernel fusion for the Adan optimizer, and mixed precision training. While TPU training offers certain advantages, such as large batch sizes when loading training data, GPU training surpasses it in ease of use, reliability, and software stability. | |
| dc.identifier.coursecode | DATX05 | |
| dc.identifier.uri | http://hdl.handle.net/20.500.12380/307395 | |
| dc.language.iso | eng | |
| dc.setspec.uppsok | Technology | |
| dc.subject | Transformer | |
| dc.subject | GPU | |
| dc.subject | Distributed | |
| dc.subject | Energy consumption | |
| dc.subject | Fine-tuning | |
| dc.subject | TPU | |
| dc.title | Hardware Acceleration of Machine Learning | |
| dc.type.degree | Examensarbete för masterexamen | sv | 
| dc.type.degree | Master's Thesis | en | 
| dc.type.uppsok | H | |
| local.programme | Computer systems and networks (MPCSN), MSc | |
| local.programme | High-performance computer systems (MPHPC), MSc | |
