Analyzing Factors Influencing Performance in LLM Inference Systems

Master's thesis in Computer Science and Engineering

Ziyu Fang
Mujtaba Al Neama

Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2025

Master's thesis 2025

Analyzing Factors Influencing Performance in LLM Inference Systems

Ziyu Fang
Mujtaba Al Neama

Department of Computer Science and Engineering
Chalmers University of Technology
University of Gothenburg
Gothenburg, Sweden 2025

Analyzing Factors Influencing Performance in LLM Inference Systems
Ziyu Fang, Mujtaba Al Neama

© Ziyu Fang, Mujtaba Al Neama, 2025.

Supervisor: Huaifeng Zhang, Department of Computer Science and Engineering
Examiner: Ahmed Ali-Eldin Hassan, Department of Computer Science and Engineering

Master's Thesis 2025
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX
Gothenburg, Sweden 2025

Analyzing Factors Influencing Performance in LLM Inference Systems
Ziyu Fang, Mujtaba Al Neama
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg

Abstract

Large Language Models (LLMs) have enabled numerous innovative applications, including virtual assistants, content generation, and recommendation systems, transforming daily life. LLM inference is crucial because it allows these models to process and generate human-like text in real time, making them adaptable to a wide range of practical applications. This capability has not only enhanced the functionality of AI-driven technologies but also expanded their accessibility and impact across various industries. Latency and throughput are essential metrics for evaluating the performance of LLMs, as they directly influence user experience and system efficiency, and they are critical for optimizing the deployment and operation of LLM-based solutions in real-world scenarios. In this thesis, we evaluate the performance of LLM inference by analyzing how different factors affect inference throughput and latency. We study the current performance and internal working mechanisms of Llama 3 and GPT-2 to explore potential methods for further improving LLM inference. The approach of this project consists of two steps: data collection and data analysis. Our results show that increasing the batch size significantly improves throughput but can also lead to higher latency, indicating a trade-off between speed and responsiveness in LLM inference. Additionally, a larger model size can provide more accurate outputs. Interestingly, allocating more GPU resources reduced overall GPU utilization for both Llama 3 and GPT-2. Such inefficiencies in resource allocation could negatively impact the cost-effectiveness of deploying LLM inference.

Keywords: LLM Inference, Latency, Throughput, Transformer Model, GPU, Parallel Computation, Empirical Analysis.

Acknowledgements

We would like to express our great appreciation to Huaifeng and Ahmed for giving us this project. In particular, we would like to thank Huaifeng for all the support and patient guidance throughout our thesis, and we thank Ahmed for his great advice on our project. Ziyu especially thanks Filip, Robin, Gabriel, Edvard, and Björn for accompanying her through all the difficulties in this thesis work.
In the end, we are thankful for everything that happened during our thesis work, for it equipped us with the bravery to face future challenges.

Ziyu Fang, Mujtaba Al Neama, Gothenburg, 2025-04-23

Contents

List of Figures
List of Tables
1 Introduction
  1.1 Goals and Challenges
  1.2 Scope of this Thesis
  1.3 Approach
  1.4 Outline
2 Background
  2.1 Machine Learning Primer
  2.2 Development of Large Language Models
    2.2.1 Statistical Methods
    2.2.2 Neural Networks
    2.2.3 Transformer Model Era
    2.2.4 Open-Source Large Language Models
  2.3 Transformer Model Basics
    2.3.1 Model Architecture
    2.3.2 Attention Mechanism
    2.3.3 Encoder & Decoder
  2.4 Inference
    2.4.1 Parallelism Techniques
    2.4.2 vLLM
    2.4.3 Tokenizer
3 Methods
  3.1 Approach Overview
  3.2 Data Collection
    3.2.1 Contributing Factors
    3.2.2 Metrics
  3.3 Analysis Method
    3.3.1 Correlation Analysis
    3.3.2 Feature Importance Analysis
  3.4 Experiment Setup
    3.4.1 Hardware Configuration: Berzelius
    3.4.2 GPU Utilization
  3.5 Models
    3.5.1 GPT-2
    3.5.2 Llama 3
4 Results
  4.1 Performance Metrics Analysis
    4.1.1 Latency
      4.1.1.1 Batch Size
      4.1.1.2 Higher Latency with Bigger Batch Size
      4.1.1.3 Input Length
      4.1.1.4 Output Length
      4.1.1.5 Number of GPUs with TTLT
      4.1.1.6 Number of GPUs with TTFT
    4.1.2 Throughput
      4.1.2.1 Batch Size
      4.1.2.2 Input Length
      4.1.2.3 Output Length
      4.1.2.4 Number of GPUs with Throughput
    4.1.3 GPT-2 vs. Llama 3
  4.2 GPU Scaling and Performance Impact
  4.3 Further Analysis on Llama 3
    4.3.1 Verifying Llama 3 Performance
    4.3.2 Llama 3 Output Accuracy
  4.4 Correlation Analysis
  4.5 Feature Importance Analysis
  4.6 Summary of Key Findings
5 Conclusion
  5.1 Future Work
    5.1.1 Inference vs. Training Analysis
    5.1.2 Increase GPU Utilization
    5.1.3 Extend LLM Numbers & GPU Numbers
Bibliography
A Appendix 1

List of Figures

2.1 Transformer Architecture [22]
2.2 Scaled Dot-Product Attention [22]
2.3 Multi-Head Attention [22]
2.4 Tensor Parallelism [36]
2.5 Pipeline Parallelism
3.1 An overview of the data collection workflow
3.2 An overview of the LLM working mechanism
4.1 Factors Comparison between Llama 3 and GPT-2
4.2 Llama 3: Average GPU utilization for each batch size
4.3 (a) GPU Utilization GPT-2 (b) GPU Utilization Llama 3 (c) Max GPU Utilization GPT-2 (d) Max GPU Utilization Llama 3
4.4 Performance Metrics with Different Batch Sizes and Number of GPUs
4.5 (a) Llama 3: TTLT latency with GPU numbers (b) Llama 3: Throughput with GPU numbers
4.6 Correlation Matrices for GPT-2 and Llama 3 Models
4.7 Feature Importance Analysis for GPT-2 and Llama 3

List of Tables

3.1 Correlation strength based on the magnitude of r [41]

1 Introduction

With the development of Artificial Intelligence (AI), many new applications such as virtual assistants, content generation, and recommendation systems have entered the market. One of the key drivers behind these innovations is the development of Large Language Models (LLMs). LLMs enable AI systems to understand and generate human-like text. LLM inference is the process of generating output based on input data. Unlike training, which is typically a one-time, resource-intensive process, inference happens every time a user interacts with the model. As a result, LLM inference can be computationally expensive and resource-intensive [1]. Efficient inference is therefore crucial for reducing cost and enhancing user experience. This thesis aims to gain deeper insight into the performance and internal working mechanisms of LLM inference systems, allowing us to explore potential methods for further improving LLM performance.

In this chapter, we briefly introduce our work, starting with the goals and challenges of the project.
We then describe the scope of the thesis and the approach we use during the project. Finally, we present the thesis outline.

1.1 Goals and Challenges

The main goal of our thesis is to evaluate the performance of LLM inference systems by analyzing how key factors affect LLM inference throughput and latency. Specifically, we base our analysis on GPT-2 [2] and Llama 3 [3], two of the most popular and widely used LLMs.

Latency and throughput are key metrics for evaluating LLMs because they directly affect user experience and system efficiency. Latency measures the time taken to process and respond to individual requests, which is crucial for real-time applications to ensure quick and responsive interactions [4]. Throughput gauges the number of requests the model can handle per unit of time, reflecting the system's ability to scale and manage high demand efficiently, thus influencing cost-effectiveness [5] [6]. By evaluating these metrics, we gain a better understanding of the performance requirements of LLM inference for practical deployments.

The list below summarizes the challenges we faced in this project:

• The first challenge is designing an efficient data collection pipeline that collects sufficient useful data for analysis.
• Designing the analysis framework is another challenge. It is important to correctly understand how each factor influences LLM inference performance.
• The deployment of LLM inference is often constrained by GPU resource limitations. Therefore, we need to carefully determine how many GPUs we can allocate to each instance of GPT-2 and Llama 3.
• Accurately determining input length using tokenizers is a challenge, as it involves understanding how different tokenization strategies affect sequence length, which is critical for optimizing model performance and managing computational resources.
• The maximum sequence length limitation in LLMs poses a challenge, as it restricts the model's ability to process and generate long sequences.

1.2 Scope of this Thesis

In our project, we focus on LLM inference workloads. We select two models with different parameter sizes to perform the analysis: GPT-2 and Llama 3, with approximately 124 million and 8 billion parameters, respectively (see Section 3.5).

The factors influencing inference performance are vast. To limit the scope of this thesis, we focus on five key parameters: batch size, input length, output length, model size, and number of GPUs. A detailed description of these five contributing factors follows.

Batch Size specifies the number of input and output sequences processed together by the LLM. Larger batch sizes may increase latency for individual requests because the model processes more input requests simultaneously, leading to longer wait times. However, large batch sizes should improve throughput, as they enable more GPU parallelism and allow more data to be processed per unit of time [4]. Therefore, batch size is one of the key factors for evaluating LLM inference scalability and efficiency.

Input Length is the number of tokens in each input sequence. Longer input sequences usually require more computation and memory, and may thus increase latency. Additionally, longer inputs can reduce throughput, as they consume more resources and limit the number of concurrent requests the system can handle effectively.

Output Length is the number of tokens generated for each output sequence. As with input length, generating longer outputs requires more time and computational power, leading to higher latency.
This increased demand may also reduce throughput, because each request takes longer to process. Although input length and output length clearly influence latency and throughput, the interesting question is to what degree they affect LLM inference.

Model Size is the number of parameters that an LLM contains. A larger model size generally means the LLM can handle more complex language and deal with nuanced prompts better [7]. Therefore, comparing model sizes provides a practical way to weigh LLM performance.

Number of GPUs is important for finding the best GPU utilization for LLM inference. Effective GPU utilization may reduce latency, while increasing the number of GPUs can improve throughput by distributing the computational load. However, a larger number of GPUs does not necessarily lead to effective GPU utilization. It is therefore meaningful to take the number of GPUs as one of the key factors and analyze how much GPU utilization influences latency and throughput.

1.3 Approach

Our project involves two main steps: data collection and data analysis. In the data collection step, we design a collection pipeline and collect two comprehensive datasets for GPT-2 and Llama 3 by varying batch sizes, input lengths, output lengths, and numbers of GPUs. In the data analysis step, we perform extensive analysis on the data collected in the previous step to explore how these factors relate to latency and throughput.

We conducted experiments on the Berzelius computer cluster [8], which provides computational power of up to 752 NVIDIA A100 GPUs. Furthermore, Berzelius enables non-blocking connections between GPUs featuring 200 GB/s bandwidth and microsecond latency. This is crucial for providing stable and sufficient computational resources.

1.4 Outline

The rest of the thesis is organized as follows. Chapter 2 describes the development of LLMs, an introduction to the transformer model, an overview of a state-of-the-art LLM inference framework, vLLM, and the concept of tokenizers. Chapter 3 presents the approaches used in our research. The analysis results are presented in Chapter 4. The research conclusion, discussion, optimization, and future work are given in Chapter 5.

2 Background

In this chapter, we provide an overview of the key concepts and components that are fundamental to understanding the performance of Large Language Model (LLM) inference systems. The chapter begins with preliminary knowledge giving an overview of AI as a whole. Then, the development of LLMs is introduced. This is followed by a discussion of transformer models, including the model architecture and attention mechanisms. Furthermore, we delve into the role of the tokenizer and parallelism techniques. We also explore vLLM, a high-throughput distributed LLM inference framework that integrates seamlessly with existing models. Together, these sections provide a comprehensive foundation for analyzing the factors that affect performance in LLM inference systems.

2.1 Machine Learning Primer

Artificial Intelligence (AI) is the simulation of human intelligence processes by machines, particularly computer systems. Today, AI has become an integral part of our daily lives, seamlessly embedded in various technologies and services. We hope that our project can help AI continue to evolve, driving cutting-edge research and innovation, and exploring new methodologies to push the boundaries of what is possible, including advances in Natural Language Processing (NLP), autonomous systems, and ethical practices.
Mitchell et al. [9] introduce machine learning (ML) as a subfield of AI that focuses on the development of algorithms and the use of data to enable computer systems to learn in a human-like way. ML can effectively improve systems by facilitating the ability to make predictions or decisions based on large amounts of data and new information [9]. Unlike traditional programming tasks, which are carefully coded by developers, machine learning leverages data to create models that can automatically generalize from past examples to handle new data with better accuracy.

Machine learning can be categorized into four main types: supervised learning [10], unsupervised learning [11], semi-supervised learning [12], and reinforcement learning [13]. Supervised learning trains models on a labeled dataset, so that they can make predictions by learning the mapping from inputs to outputs [10]. Unsupervised learning, on the other hand, uses unlabeled data to discover hidden patterns or structures within the data through techniques such as clustering, dimensionality reduction, and visualization [11]. Semi-supervised learning is a hybrid method that utilizes both labeled and unlabeled data: it learns from a small amount of labeled data and uses a larger amount of unlabeled data to improve the learning process [12]. Reinforcement learning trains an agent to make sequential decisions by interacting with an environment and receiving rewards or penalties [13].

Natural Language Processing (NLP) is a subfield of artificial intelligence that focuses on the interaction between computers and human (natural) languages. The goal of NLP is to enable machines to understand, interpret, and generate human language in a way that is both meaningful and useful [14]. An LLM is a type of artificial intelligence model designed to understand and generate human-like text by leveraging vast amounts of data and advanced neural network architectures. These models are trained on extensive corpora, enabling them to perform a wide range of language-related tasks, such as translation, summarization, and conversation.

2.2 Development of Large Language Models

Large Language Models (LLMs) are a branch of language models known for their size and power. They leverage large amounts of data and computational power to achieve state-of-the-art performance on a variety of NLP tasks. In this section, we introduce the development of LLMs.

2.2.1 Statistical Methods

Natural language processing (NLP) saw significant advancements through statistical methods like Hidden Markov Models (HMMs) and N-gram models [15] [16]. These approaches relied on mathematical models to understand and generate human language.

Hidden Markov Models (HMMs) became widely used for various NLP tasks such as part-of-speech tagging, speech recognition, and named entity recognition. HMMs provided a framework for dealing with sequential data by modeling the probabilities of sequences of observed and hidden states. For example, in the context of language, the observed states might be words or phonemes, and the hidden states could be part-of-speech tags or phonetic features. By capturing the probabilistic dependencies between these sequences, HMMs allowed for more accurate and efficient tagging and recognition systems [15].

N-gram models emerged as a staple in language modeling at the same time. They aim to predict the next word in a sequence based on the previous n−1 words, capturing the contextual dependencies between words. N-grams provided a simple but effective way to model language, enabling improvements in various applications such as text prediction and machine translation, as explained by Brown et al. [16].
Another crucial development among statistical methods is Latent Dirichlet Allocation (LDA). LDA is a generative probabilistic model used for topic modeling. It classifies documents into topics based on word distributions, assuming that documents are mixtures of topics and each topic is a distribution over words. LDA provided valuable insights into the thematic structure of large text corpora, enhancing information retrieval and content recommendation [17].

However, statistical methods have limitations: restricted context understanding, data sparsity, high computational costs, and insufficient semantic depth. They therefore struggle to capture long-range dependencies and handle nuanced meanings in language, as Beran et al. discuss in their paper [18]. Consequently, the need for more sophisticated approaches led to the development of neural networks.

2.2.2 Neural Networks

Neural networks capture more complex patterns and dependencies in text [15], [16], which effectively addresses the drawbacks of statistical methods. With the capacity to learn non-linear relationships, neural networks overcome the limitations of statistical methods by learning rich, distributed representations of words and contexts. They handle long-range dependencies more effectively, reduce data sparsity issues through embeddings, and capture complex semantic relationships, thus providing more robust and scalable solutions for modern NLP tasks.

Feedforward Neural Networks (FNNs) were among the earliest neural architectures employed for NLP tasks. FNNs, consisting of layers where each layer's output is fed into the next, were used for classification and regression tasks. However, FNNs struggled with sequential data due to their lack of temporal context [19].

The introduction of Recurrent Neural Networks (RNNs) marked a significant advancement. RNNs are designed to handle sequential data by incorporating feedback connections, allowing them to maintain a form of memory and capture temporal dependencies. This made them more suitable for tasks involving sequences of words, such as language modeling and sequence prediction. Despite their potential, traditional RNNs faced challenges with long-term dependencies due to issues like the vanishing gradient problem [20].

Long Short-Term Memory (LSTM) networks, published in 1997, are one of the most significant breakthrough developments in neural networks [21]. LSTMs address the limitations of traditional RNNs by incorporating gating mechanisms that control the flow of information and maintain long-term dependencies. This enhancement allowed LSTMs to remember information over longer sequences, significantly improving their performance in tasks such as machine translation and speech recognition.

2.2.3 Transformer Model Era

The Transformer model [22] addresses several shortcomings of previous models such as FNNs, RNNs, and LSTM networks. The Transformer relies on attention mechanisms and discards the use of recurrence entirely. It has had a significant impact on various NLP tasks.

In 2018, Devlin et al. [23] introduced Bidirectional Encoder Representations from Transformers (BERT), which is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers [23]. Around the same time, the Generative Pre-trained Transformer (GPT) series by OpenAI was published.
The series began with GPT-1 in 2018 and was followed by GPT-2 in 2019, demonstrating the effectiveness of large-scale unsupervised pre-training. GPT-3, in particular, highlighted the capabilities of very large models in generating coherent and contextually relevant text, showcasing advanced performance in numerous language tasks [24] [1].

2.2.4 Open-Source Large Language Models

Open-source Large Language Models have profoundly impacted the field of NLP by making advanced AI technologies available to a wide audience. Models like GPT-Neo [25], GPT-J [26], and GPT-NeoX [27] by EleutherAI made cutting-edge NLP capabilities available to researchers, developers, and organizations without high costs. These models foster collaboration within the AI community, enabling researchers to build upon existing architectures, share improvements, and drive innovation more rapidly. Additionally, open-source models such as Large Language Model Meta AI (LLaMA) [28] and BLOOM [29] provide transparency in terms of architecture and methodology, which is crucial for understanding model behavior, debugging, and analysis. Furthermore, Llama, published by Meta, supports a wide range of applications, including text generation, comprehension, and translation, while being more accessible and adaptable than previous models [28]. Therefore, we pick Llama 3 to study its performance.

Hugging Face [30] is a collaboration platform that hosts models, datasets, and applications. Hugging Face is best known for its Transformers library, which includes implementations of numerous transformer-based models for various NLP tasks [31]. This library, along with the Datasets and Tokenizers libraries, has become a cornerstone for researchers and developers working with state-of-the-art NLP models. Hugging Face's Model Hub facilitates the sharing of pre-trained models, fostering a collaborative environment.

2.3 Transformer Model Basics

The transformer model was first described in the paper "Attention Is All You Need" by Vaswani et al. [22], which can be seen as a landmark of modern artificial intelligence. In this section, we introduce the transformer model from the perspectives of model architecture, attention mechanism, and encoder and decoder.

2.3.1 Model Architecture

The transformer model is built upon a sequence-to-sequence architecture, a mechanism that takes an input sequence, processes it, and generates an output sequence. The transformer also follows an encoder-decoder structure based on this sequence-to-sequence design. The encoder-decoder structure consists of two parts: the encoder and the decoder. The encoder processes an input sequence into an intermediate representation, which the decoder then uses to generate the desired output sequence. On top of this overall architecture, the transformer model itself is built entirely on self-attention mechanisms and point-wise, fully connected layers for both the encoder and decoder (see Figure 2.1). This enables more efficient parallelization and improved performance on a range of tasks.

Figure 2.1: Transformer Architecture [22]

2.3.2 Attention Mechanism

Attention is a mechanism that allows the model to focus on different parts of the input sequence when producing each output element [22].
The attention mechanism computes each output as a weighted sum of values, with the weight for each value determined by a compatibility function applied to the query and the corresponding key. There are two key components of attention in the transformer model: Scaled Dot-Product Attention and Multi-Head Attention.

Scaled Dot-Product Attention is the core operation of the attention mechanism. Three matrices appear in its formula: the Query Matrix Q, the Key Matrix K, and the Value Matrix V.

• The Query Matrix Q represents the current token for which attention is being computed. It determines which parts of the input sequence should be focused on.
• The Key Matrix K represents all elements in the sequence. It is used to determine their relevance to the query.
• The Value Matrix V contains the values that the model attends to.

The mechanism works by first taking the dot product of a query vector (Q) with key vectors (K) to measure their similarity. The attention score for each query-key pair is this dot product, scaled by the inverse square root of the key dimension, \(\sqrt{d_k}\), to ensure stable gradients during training. The scaled scores are passed through a softmax function to convert them into a probability distribution, which represents the attention weights. Finally, the output is calculated as a weighted sum of the value vectors (V), using these attention weights. This process enables the model to selectively attend to important information in the input sequence, facilitating better handling of dependencies regardless of their distance. Figure 2.2 is a concrete representation of the formula below:

\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{2.1} \]

Figure 2.2: Scaled Dot-Product Attention [22]

Multi-Head Attention focuses on different parts of the input sequence simultaneously, instead of performing a single attention function with \(d_{model}\)-dimensional keys, values, and queries. The mechanism of Multi-Head Attention is visualized in Figure 2.3. By performing the attention operation multiple times in parallel with different sets of queries, keys, and values, known as "heads", multi-head attention significantly enhances the model's representational ability. Instead of computing a single set of attention scores, Multi-Head Attention runs several attention mechanisms (or heads) in parallel. Each head has its own set of learned linear projections for Q, K, and V, and each head computes scaled dot-product attention independently. This means each head can focus on different parts of the sequence or different aspects of the relationships between tokens. The outputs of all attention heads are concatenated, and the concatenated output is then passed through a final linear layer to produce the final output of the multi-head attention mechanism:

\[ \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \mathrm{head}_2, \ldots, \mathrm{head}_h)W^{O} \tag{2.2} \]
\[ \text{where } \mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}) \tag{2.3} \]

Figure 2.3: Multi-Head Attention [22]
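To make Equations 2.1-2.3 concrete, the following is a minimal NumPy sketch of scaled dot-product attention and a multi-head wrapper. It is an illustrative example only: the projection matrices are randomly initialized here, whereas in a real transformer they are learned parameters, and real implementations also add masking and batching.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the chosen axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Equation 2.1: softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    return softmax(scores) @ V

def multi_head_attention(X, num_heads, W_q, W_k, W_v, W_o):
    # Equations 2.2-2.3: run h attention heads in parallel and
    # concatenate their outputs before the final projection W_o.
    heads = []
    for i in range(num_heads):
        Q, K, V = X @ W_q[i], X @ W_k[i], X @ W_v[i]
        heads.append(scaled_dot_product_attention(Q, K, V))
    return np.concatenate(heads, axis=-1) @ W_o

# Toy example: sequence of 4 tokens, model dimension 8, 2 heads of size 4.
seq_len, d_model, h, d_head = 4, 8, 2, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))
W_q = rng.normal(size=(h, d_model, d_head))
W_k = rng.normal(size=(h, d_model, d_head))
W_v = rng.normal(size=(h, d_model, d_head))
W_o = rng.normal(size=(h * d_head, d_model))
print(multi_head_attention(X, h, W_q, W_k, W_v, W_o).shape)  # (4, 8)
```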
2.3.3 Encoder & Decoder

Encoder: As shown in Figure 2.1, the encoder processes the input sequence and is composed of multiple identical layers; Vaswani et al. [22] use six encoder layers. Each layer consists of two primary sub-layers: a multi-head self-attention mechanism and a feed-forward neural network. Each of the two sub-layers is followed by layer normalization and has a residual connection to increase training stability and efficiency. The multi-head self-attention mechanism allows the encoder to attend to different parts of the input sequence to capture dependencies between words, irrespective of their position in the sentence. The feed-forward sub-layer refines the representation of each word after self-attention; it consists of two linear transformations with a ReLU activation in between.

Decoder: As shown on the right side of the transformer architecture (see Figure 2.1), the decoder is composed of Masked Multi-Head Self-Attention, a Feed-Forward Neural Network, and Encoder-Decoder Attention. The decoder also begins with a self-attention mechanism, but with a crucial difference: it is masked. The Masked Multi-Head Self-Attention sub-layer ensures that each position in the output sequence can only attend to earlier positions, maintaining the auto-regressive nature of the decoder. This prevents the model from cheating by looking at future words during training, ensuring that the generation of the output sequence is auto-regressive. The Encoder-Decoder Attention sub-layer then enables the decoder to attend to the entire input sequence by incorporating the output of the encoder, allowing it to gather the context necessary for generating the next word in the output sequence.

2.4 Inference

In our project, we focus on analyzing the contributing factors (batch size, input length, output length, model size, and number of GPUs) and how they influence latency and throughput in GPT-2 and Llama 3 inference systems. The goal of inference is to use a trained model to make predictions or generate outputs based on a new prompt. A trained LLM takes input sequences, processes them, and produces output sequences or predictions based on the patterns and relationships the model has learned. Inference is widely used in image classification, face identification, trend prediction, and so on.

2.4.1 Parallelism Techniques

Given the large size of LLMs, several parallelism techniques can be used to accelerate training and inference. Existing methods to parallelize the computation of LLMs include data parallelism, tensor parallelism, and pipeline parallelism.

Data Parallelism: The input data of a model is usually arranged in batches. In data parallelism, the batched input data is split into several sub-batches, and each sub-batch is allocated to a device. Each device holds a full copy of the model and computes the output for its sub-batch. Data parallelism is widely used due to its simplicity [32].

Pipeline Parallelism is a computation technique that divides the layers of a Deep Neural Network (DNN) model into multiple consecutive stages [33], see Figure 2.5. Each stage is allocated to a single GPU performing the forward pass and backward pass. In the forward pass, each stage passes an output activation to the next stage asynchronously. Meanwhile, the last stage begins the backward pass as soon as its forward pass is finished. Similar to the forward pass, the backward pass transfers its input back to the previous stage. One of the significant advantages of pipeline parallelism is that GPU computation happens simultaneously, which can increase throughput.
Tensor Parallelism involves splitting model tensors, which include model weights (the model's parameters), gradients (the rate of change of the loss with respect to the weights), and optimizer states (additional variables maintained by the optimizer to enhance the training process), into multiple slices across different GPUs [34]. Instead of processing an entire set of weights on a single GPU as in pipeline parallelism, tensor parallelism divides the individual weights [35]. The methodology of tensor parallelism is shown in Figure 2.4. It involves distributed computation of specific operations, modules, or layers of the model, ensuring the correctness of the computation while enhancing computational efficiency and performance. Therefore, tensor parallelism is particularly useful for models whose parameters consume most of a single GPU's memory.

Figure 2.4: Tensor Parallelism [36]

Figure 2.5: Pipeline Parallelism

2.4.2 vLLM

vLLM is a state-of-the-art open-source library designed to enhance the efficiency of LLM inference and model serving, built by Kwon et al. [37]. This library is notable for its advanced serving throughput, achieved through innovative features like PagedAttention, an attention mechanism inspired by the classical virtual memory and paging techniques from operating systems [37]. PagedAttention allows vLLM to manage large-scale models and datasets more efficiently, optimizing both memory usage and computational performance.

A key strength of vLLM is its ability to handle continuous batching of incoming requests, significantly boosting throughput and reducing latency. The library includes optimized CUDA kernels that maximize performance on NVIDIA GPUs, ensuring that computations are both fast and efficient. Furthermore, vLLM is designed with flexibility in mind, supporting tensor parallelism and pipeline parallelism, which enables seamless distributed inference across multiple GPUs, including both NVIDIA and AMD hardware.

In addition to its technical prowess, vLLM offers seamless integration with popular models from Hugging Face, making it easy for developers to deploy and serve a wide range of pre-trained models. The library also supports various high-throughput decoding algorithms, such as parallel sampling and beam search, which are essential for delivering quick and accurate results. This combination of advanced features and ease of use makes vLLM an ideal choice for deploying large language models in diverse environments, from research to production.

2.4.3 Tokenizer

A tokenizer is a tool that breaks text down into smaller units, such as words, subwords, or characters. This process is known as tokenization. Tokenization is an important step in NLP for transforming raw text into a format that machine learning algorithms can easily process. It is essential for preparing textual data for computational processing by machine learning models, particularly in tasks like language modeling, translation, and sentiment analysis.

The SamplingParams structure is used to configure text generation. max_tokens controls the maximum number of tokens in the output sequence, and min_tokens limits the minimum number of tokens. Setting both max_tokens and min_tokens gives us the desired output sequence length. The RequestMetrics class is associated with each request; its first_token_time field records the time at which the first token was generated. By reading first_token_time, we obtain a timestamp that shows how fast the model responds.
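As an illustration of how these pieces fit together, the sketch below shows a minimal vLLM offline-inference call that pins the output length with min_tokens/max_tokens and reads first_token_time from the request metrics. The exact constructor arguments and metric attribute names depend on the installed vLLM version, so this should be read as an outline under those assumptions rather than a drop-in script; the model name and parallelism settings are placeholders.

```python
import time
from vllm import LLM, SamplingParams

# Placeholder configuration: model name and GPU count are examples only.
llm = LLM(model="gpt2", tensor_parallel_size=1)

# Force the generated sequence to be exactly 128 tokens long.
sampling = SamplingParams(min_tokens=128, max_tokens=128, ignore_eos=True)

prompts = ["The quick brown fox jumps over the lazy dog."] * 4  # batch size 4
start = time.time()
outputs = llm.generate(prompts, sampling)
end = time.time()

# RequestMetrics (if exposed by this vLLM version) carries per-request timing.
m = outputs[0].metrics
if m is not None and m.first_token_time is not None:
    print("TTFT of first request (s):", m.first_token_time - m.arrival_time)
print("TTLT for the whole batch (s):", end - start)
```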
3 Methods

This chapter describes the approach and the experimental environment we used in the project.

3.1 Approach Overview

Our approach involves two stages: data collection and data analysis. Data collection is the first stage. By varying batch sizes, input lengths, output lengths, and number of GPUs, we aim to collect comprehensive datasets that are critical for understanding how these factors influence model performance and resource utilization. The next step is data analysis based on these datasets, where we explore patterns and correlations to identify optimal settings that balance performance and resource consumption. The results gained from this analysis inform best practices for deploying language models in diverse applications and suggest potential optimizations.

3.2 Data Collection

The first stage of our project is data collection for evaluating the performance of the two language models, GPT-2 and Llama 3, using the vLLM library. The key steps in this workflow include measuring latency, calculating throughput, and monitoring GPU utilization. An overview of the data collection workflow is shown in Figure 3.1.

1. Define Parameters: Set up test values for each of the parameters (batch size, input length, model size, and output length) to determine the different configurations to be tested.
2. Initialize the LLM: Load the GPT-2 and Llama 3 models using vLLM, configuring multiple GPUs for tensor-parallel computation.
3. Batch Generation: Generate the different combinations of batch sizes, input lengths, and output lengths.
4. LLM Work Mechanism: Pass each batch configuration to the LLM; see Figure 3.2 for how LLM inference processes each batch of requests.
5. Metrics Computation:
   • Measure Latency: Measure the latency, including the Time to First Token (TTFT) and Time to Last Token (TTLT), for a given input sequence and batch size.
   • Measure Throughput: Measure throughput and normalized latency.
   • Measure GPU Utilization: Retrieve the GPU utilization details, including memory usage and GPU utilization.
6. Write Results to CSV: Open a CSV file to save the results. Iterate over the generated batches, measuring performance metrics for each configuration. Capture GPU usage details before and after processing each batch to determine GPU utilization. Write the collected data, including GPU details, TTFT, TTLT, throughput, and normalized latency, into the CSV file.

Figure 3.1: An overview of the data collection workflow

3.2.1 Contributing Factors

We focus on five factors: batch size, input length, output length, model size, and number of GPUs.

Batch Size refers to the number of samples processed together in a single pass through the model. In our project, as shown in Figure 3.2, the batch size determines how many sequences are in each batch. The LLM takes a batch of different input sequences, processes them simultaneously, and then produces the same number of output sequences. For each batch, we measure how long it takes to generate the output sequences (latency) and how much data is processed per second (throughput).

Input Length refers to the number of tokens in each input sequence fed into the model for generation.
We use a tokenizer object from the vLLM library to convert the input text into tokens, ensuring that each input matches its target input length in our data collection process. The tokenizer tokenizes the text and truncates it to a specific number of tokens, which are then fed into the large language model. The call tokenizer(input_text, return_tensors='pt', truncation=True, max_length=input_length) is used in our project to convert the input text into tokens and truncate it to the specified input length. It ensures that the resulting number of tokens matches the desired input length, either by truncating longer texts or leaving shorter texts as they are.

Figure 3.2: An overview of the LLM working mechanism

Output Length refers to the number of tokens in each output sequence that the model is expected to generate in response to a given input. It determines how long the generated text should be. We set up the generation parameters through the SamplingParams structure from the vLLM library, which includes min_tokens and max_tokens. In this way, the LLM generates output text of the desired output length, ensuring the generated text has exactly the desired number of tokens.

Model Size for an LLM is often measured by the number of parameters. A larger model size indicates more parameters, which require additional computational resources. This increased resource demand can lead to higher latency, as more time is needed to process each input, and decreased throughput, as fewer inferences can be performed in parallel. In our project, we selected GPT-2 with 124 million parameters and Llama 3 with 8.03 billion parameters as our research targets. Given the significant difference in model size between GPT-2 and Llama 3, we can explore how these variations affect LLM inference performance.

Number of GPUs is crucial for understanding how efficiently GPU resources are being utilized during model inference. In our code, specifying the tensor parallel size (e.g., tensor_parallel_size=6) configures the model to use multiple GPUs for tensor operations, allowing parallel computation of model layers. In addition, setting distributed_executor_backend to a specific backend (e.g., "mp") allows for distributed execution, where tasks are split and executed across multiple processes or GPUs. This parallelization enhances performance by increasing throughput and reducing latency.

3.2.2 Metrics

During our evaluation, we focus on four metrics: Time to First Token (TTFT) latency, Time to Last Token (TTLT) latency, normalized latency, and throughput.

Time to First Token (TTFT) is the time it takes for the model to produce the first token of the response after receiving the input. We record the start time before sending input to the model, then subtract this start time from the timestamp at which the first token is generated. For the first-token timestamp, we use the first_token_time attribute from RequestMetrics in the vLLM library, which provides the timestamp of when the first token was generated. This gives the TTFT, indicating how quickly the model produces the first token after receiving input, which is crucial for applications using streaming to give immediate feedback.

Time to Last Token (TTLT) measures the overall time taken by the model to process the input sequences and generate the response. We follow the latency computation method of Sheng et al. [38].
Considering an effective batch size b, an input sequence length s, and an output sequence length n, the latency t is defined as the total number of seconds spent processing the input sequences and generating all bn tokens.

Normalized latency is a key metric that assesses the efficiency of the model in generating output relative to the amount of data processed. In our data collection process, we follow the normalized latency definition of Kwon et al. [37], where normalized latency is calculated by dividing the total latency (TTLT) by the product of the batch size and the output length. Normalized latency is also an important parameter for evaluating throughput: Kwon et al. [37] point out that a high-throughput serving system should retain low normalized latency against high request rates.

\[ \text{Normalized Latency} = \frac{\text{Total Latency (TTLT)}}{\text{Batch Size} \times \text{Output Length}} \]

Throughput refers to the amount of data or the number of tasks that the LLM can process within a specific period. It is a measure of the model's efficiency and performance in handling and completing computational tasks, and is usually described in tokens per second, inferences per second, or queries per second. In our project, we use generation throughput as our metric. Generation throughput is defined as \(\frac{bn}{t}\), i.e., how many tokens the model generates per second on average over the total duration, as proposed by Sheng et al. [38], where b is the batch size, n is the output sequence length, and t is the TTLT.

3.3 Analysis Method

The second step of this project is analyzing the datasets collected in the previous stage. Starting with correlation analysis, we get an overview of how the models' performance is affected by the contributing factors. Furthermore, we identify the most important factors for latency and throughput through feature importance analysis. Both correlation and feature importance analysis provide a solid basis for the further evaluation of LLM inference performance. This section describes these two analysis methods.

3.3.1 Correlation Analysis

Correlation analysis is a statistical method used to measure the linear relationship between two variables. The Pearson correlation coefficient (PCC) is one of the most common measures of correlation [39]. It is a descriptive statistic that summarizes the characteristics of a dataset. Specifically, the PCC describes the strength and direction of the linear relationship between two quantitative variables [40]. The formula of the Pearson correlation coefficient is given below:

\[ r = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}} \tag{3.1} \]

where r is the Pearson correlation coefficient, \(X_i\) and \(Y_i\) are individual values of the variables, and \(\bar{X}\) and \(\bar{Y}\) are the mean values of X and Y [41]. Formula 3.1 ensures that the PCC is a scalar between −1 and 1. The interpretation of the PCC value is given in Table 3.1.

r                      Strength of correlation
−1.0 < r < 0.0         Negative correlation
0.0 < r < 0.1          No correlation
0.1 < r < 0.3          Low positive correlation
0.3 < r < 0.5          Medium positive correlation
0.5 < r < 0.7          High positive correlation
0.7 < r < 1.0          Very high positive correlation

Table 3.1: Correlation strength based on the magnitude of r [41]

In this project, we calculate the Pearson correlation coefficient to quantify the relationships between the key factors (batch size, input length, and output length) and the performance metrics (TTFT, TTLT, throughput, and normalized latency).
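As a sketch of how this computation can be carried out on the collected CSV data, the snippet below uses pandas to build a Pearson correlation matrix between the contributing factors and the performance metrics. The file name and column names are hypothetical and would need to match the headers actually written during data collection.

```python
import pandas as pd

# Hypothetical column names; adjust to the headers in the collected CSV.
factors = ["batch_size", "input_length", "output_length"]
metrics = ["ttft", "ttlt", "throughput", "normalized_latency"]

df = pd.read_csv("llama3_results.csv")  # placeholder file name

# Pearson correlation (Equation 3.1) between every factor/metric pair.
corr = df[factors + metrics].corr(method="pearson")
print(corr.loc[factors, metrics].round(2))
```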
3.3.2 Feature Importance Analysis

Feature importance analysis is a class of techniques for computing a score for each input feature, used to determine the relative importance of each feature in a dataset for the model's decisions [42]. Feature importance scores provide valuable insight into data relationships by revealing which features most strongly drive model results, and they guide future model development and system optimization. A feature with a higher score generally has a bigger influence on the model. In this project, we aim to identify which features (output length, input length, batch size) most strongly influence the performance metrics (see Section 3.2.2), providing insight into optimizing models and improving system performance.

During this evaluation, we use permutation importance to calculate feature importance. Permutation feature importance is one of the methods for computing feature importance [43]; it quantifies each feature's contribution on a given dataset. For non-linear or opaque estimators, permutation feature importance is particularly useful. By breaking the relationship between a feature and the target, we can determine how much the model depends on that specific feature. The permutation feature importance is defined as follows:

\[ i_j = s - \frac{1}{K}\sum_{k=1}^{K} s_{k,j} \tag{3.2} \]

where \(i_j\) is the importance score of feature j, s is the reference score (baseline performance) of the model (a machine learning algorithm, e.g., a decision tree, random forest, or linear model) on the original dataset, and \(s_{k,j}\) is the score of the model on the dataset in which feature j has been randomly shuffled in repetition k, for k = 1, ..., K. Formula 3.2 shows that features with higher scores are more critical to the model's performance.

3.4 Experiment Setup

In this section, we describe the Berzelius computer cluster that we used during the data collection process (see Section 3.2). We also introduce the methods and tools for obtaining GPU usage details and utilization.

3.4.1 Hardware Configuration: Berzelius

Our experiments use the Berzelius computer cluster, which can provide computational power of up to 752 NVIDIA A100 GPUs. For parallel utilization of GPU computation resources, we allocate a maximum of 8 GPUs for data collection. For each GPU requested, an additional 16 CPU cores and 128 GB of RAM are added. The maximum running time we allocated is 4 hours.

3.4.2 GPU Utilization

We use nvidia-smi as the tool to obtain GPU usage details, including utilization. The NVIDIA System Management Interface (nvidia-smi) is a command-line utility based on the NVIDIA Management Library (NVML). It is intended to aid in the management and monitoring of NVIDIA GPU devices and can display detailed information about the GPUs installed in a system, including utilization, temperature, memory usage, and more. By examining the list of GPU details, we identify which GPUs are currently in use by checking whether their utilization is greater than 0%. These active GPUs are filtered into a new list so that we can count the number of used GPUs and generate a comma-separated string of their names. Additionally, we calculate the total, used, and free memory across all GPUs. Specifically, we sum up the total memory capacity, the memory currently in use, and the available memory for all GPUs in the system. This provides a comprehensive overview of GPU resource utilization, helping to monitor and manage the system effectively.
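The following is a minimal sketch of how such GPU details can be queried programmatically with nvidia-smi. The exact fields recorded in our pipeline may differ, and the parsing assumes the standard CSV query output of nvidia-smi.

```python
import subprocess

def gpu_snapshot():
    """Query per-GPU name, utilization, and memory via nvidia-smi."""
    query = "name,utilization.gpu,memory.used,memory.total"
    out = subprocess.check_output(
        ["nvidia-smi", f"--query-gpu={query}", "--format=csv,noheader,nounits"],
        text=True,
    )
    gpus = []
    for line in out.strip().splitlines():
        name, util, mem_used, mem_total = [f.strip() for f in line.split(",")]
        gpus.append({
            "name": name,
            "util_pct": float(util),
            "mem_used_mib": float(mem_used),
            "mem_total_mib": float(mem_total),
        })
    return gpus

snapshot = gpu_snapshot()
active = [g for g in snapshot if g["util_pct"] > 0]   # GPUs currently in use
print(f"{len(active)} active GPU(s):", ", ".join(g["name"] for g in active))
print("Total memory used (MiB):", sum(g["mem_used_mib"] for g in snapshot))
```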
3.5 Models

In our project, we study the performance of two LLMs: GPT-2 and Llama 3.

3.5.1 GPT-2

We chose GPT-2, a pre-trained transformer model with 124M parameters. Its training objective is to predict the next word in a target input sequence, so that the model learns grammar, language patterns, and reasoning ability. It uses a masking mechanism internally, as mentioned in Chapter 2 (see Section 2.3.1), to ensure that the prediction for token i only uses the inputs from 1 to i and not the future tokens.

The default maximum sequence length of the GPT-2 model is 1024 tokens, which means the combination of the input context provided to the model and the text generated in response should not exceed 1024 tokens. Due to this limitation, we set the range of both the input and output sequence lengths to 16-512 tokens.

The tokenizer used in GPT-2 is based on a variant of Byte Pair Encoding (BPE) [6], an algorithm designed to split text into subword units, which become the tokens of the LLM. BPE tokenization starts by treating the vocabulary as individual characters. It then iteratively merges the most frequent pairs of characters or character sequences into single tokens. This process continues until a predefined vocabulary size is reached. By using subword tokenization, GPT-2 can handle out-of-vocabulary words more gracefully by breaking them down into known subword units. This approach helps balance capturing word-level and subword-level information, which is particularly useful for handling rare or complex words, and deals effectively with the vast diversity of text found in natural language.

GPT-2 remains a classic, standard NLP model because it is based on the transformer architecture and handles long-range dependencies in text efficiently. It played a critical role in the development of LLMs, which is also why we chose it as one of our research targets.

3.5.2 Llama 3

Llama 3 is an LLM developed and released by Meta. Similar to GPT-2, it is a collection of pre-trained and instruction-tuned generative text models. It is designed to handle instruction-following tasks effectively, making it suitable for applications where the model needs to understand and act upon specific user instructions. Unlike GPT-2, which only has 124M parameters, Llama 3 comes in two sizes: 8 billion and 70 billion parameters. In our project, we chose the Llama 3 model with 8.03 billion parameters.

Llama 3 is an auto-regressive language model that follows the transformer architecture. It supports a maximum sequence length that is four times that of the GPT-2 model. This means the model can handle up to 4096 tokens for input and output combined, ensuring that if the input text approaches this limit, the output will be correspondingly limited to fit within the total token count.

Llama 3 uses a subword tokenization method similar to the BPE described in the GPT-2 section. This approach allows the tokenizer to efficiently handle both common and rare words by breaking them down into manageable subword components, and it enables the model to work with a large vocabulary of 128,256 tokens, allowing for more detailed and nuanced representations of language compared to earlier models with smaller vocabularies.

Llama 3 is an up-to-date LLM with a bigger model size and maximum sequence length, which makes it an interesting model for studying how these factors affect performance compared with GPT-2.
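To illustrate the tokenizer difference between the two models, the sketch below counts tokens for the same sentence with both Hugging Face tokenizers. Access to the Llama 3 tokenizer requires accepting Meta's license on the Hugging Face Hub, and the repository name shown is the one commonly used at the time of writing; the vocabulary sizes in the comments are approximate.

```python
from transformers import AutoTokenizer

text = "Large language models generate text one token at a time."

gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
llama3_tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")

print("GPT-2 vocab size:", gpt2_tok.vocab_size)      # roughly 50k BPE tokens
print("Llama 3 vocab size:", len(llama3_tok))        # roughly 128k tokens
print("GPT-2 token count:", len(gpt2_tok(text)["input_ids"]))
print("Llama 3 token count:", len(llama3_tok(text)["input_ids"]))
```

Because the Llama 3 vocabulary is much larger, the same sentence typically maps to fewer tokens than with GPT-2, which is one reason token-based input-length targets must be computed with each model's own tokenizer.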
4 Results

This chapter presents the results of our experiments evaluating the performance of GPT-2 and Llama 3, using the vLLM library for inference. The data used for evaluation comes from the data collection stage, in which batch size was varied together with different input and output lengths for GPT-2 and Llama 3. The evaluation focuses on understanding how factors such as batch size, input length, output length, and number of GPUs affect key performance metrics, including latency and throughput. The results are organized into sections covering correlation analysis, feature importance analysis, and detailed examinations of latency and throughput under different conditions.

4.1 Performance Metrics Analysis

In this section, we dive deeper into analyzing the performance metrics, throughput and latency (both TTFT and TTLT), and the individual impact of the input variables on each metric. The results are shown in Figure 4.1.

Figure 4.1: Factors Comparison between Llama 3 and GPT-2

4.1.1 Latency

Latency measures the time taken to process and respond to individual requests. We focus on two key latency metrics: TTFT and TTLT (see Section 3.2.2). This section analyzes the impact of batch size, input length, and output length on latency.

4.1.1.1 Batch Size

In the Batch Size vs. TTLT plot, Figure 4.1 (a), both GPT-2 and Llama 3 show a positive linear relationship between TTLT and batch size. However, Llama 3's TTLT rises much more steeply, indicating that GPT-2 handles larger batch sizes more efficiently with respect to TTLT. Similarly, in the Batch Size vs. TTFT plot, Figure 4.1 (b), TTFT increases with batch size for both models, but the increase is much more significant for Llama 3. GPT-2 maintains a relatively low TTFT even as the batch size increases, showing that it is more efficient at generating the first token with larger batch sizes. The non-linear increase in TTFT for Llama 3 suggests that a fractional power model (y = x^n, where x is batch size, y is TTFT, and 0 < n < 1) is more appropriate, while GPT-2's stable performance could be modeled with a simple linear or constant function.

4.1.1.2 Higher Latency with Bigger Batch Size

During our analysis, we found that for the GPT-2 model, latency increases as the number of GPUs increases for all batch sizes, see Figure 4.4. In contrast, for Llama 3, when the number of GPUs increases from 1 to 4, latency decreases for all batch sizes; however, when the number of GPUs increases from 4 to 8, latency increases as well, similarly to GPT-2. Notably, the large batch sizes (32 and 64) show a larger latency increase.

A bigger batch size means more input and output sequences per batch. Combined with the factors discussed in Section 4.1.1.5, this situation can be explained in terms of increased computational load, memory requirements, and data transfer overheads, which also expose inefficiencies in parallel processing and cause longer wait times for batch completion. A larger batch size means that more data must be processed simultaneously during each inference step, which increases the computational load. Although this may improve overall throughput, it increases the amount of work that needs to be done in each individual step. This additional workload increases the time it takes to complete each step, leading to higher latency.
4.1.1.2 Higher Latency with Larger Batch Sizes
During our analysis we found that, for the GPT-2 model, latency increases as the number of GPUs increases for all batch sizes (see Figure 4.4). In contrast, for Llama 3, latency decreases for all batch sizes when the number of GPUs increases from 1 to 4; when it increases from 4 to 8, latency rises again, similarly to GPT-2. Notably, the larger batch sizes (32 and 64) show a steeper latency increase. A bigger batch size means more input and output sequences in each batch process. Combined with the factors discussed in Section 4.1.1.5, this behavior can be explained by an increase in computational load, memory requirements, and data transfer overheads, which also exposes inefficiencies in parallel processing and causes longer wait times for batch completion.

A larger batch size means that more data must be processed simultaneously during each inference step, which increases the computational load. Although a larger batch may improve overall throughput, it increases the amount of work that has to be done in each individual step, so each step takes longer and latency rises.

Data transfer overheads and memory constraints also contribute to higher latency with larger batch sizes. More data needs to be loaded from storage and transferred between the CPU and GPU, which increases the overhead associated with data movement. Larger batch sizes also demand more GPU memory for storing inputs and intermediate calculations. The increased demand on computational resources such as GPU cores can lead to resource contention, where different parts of the computation pipeline compete for the same resources, slowing down processing and increasing latency.

Last but not least, GPU utilization determines whether parallel processing is efficient. High GPU utilization is generally beneficial when dealing with large batch sizes because it indicates that the GPU's computational resources are being fully leveraged. However, in our project we observed low GPU utilization, especially as the number of allocated GPUs grew. Moreover, for Llama 3 the average GPU utilization decreased as the batch size increased, see Figure 4.2. This means the GPU is not used efficiently for each batch process, which can lead to longer processing times. It also causes GPU idling and underperformance, which increases latency and reduces throughput.

Figure 4.2: Llama 3: Average GPU utilization for each batch size

4.1.1.3 Input Length
In the Input Length vs. Latency (TTLT) plot, Figure 4.1 (d), latency also increases with input length for both models, but Llama 3 again shows a steeper rise than GPT-2. This means GPT-2 is more efficient at processing longer inputs, resulting in lower latency. The linear trend observed in both models supports using a linear function to describe this relationship. For Input Length vs. TTFT, Figure 4.1 (e), TTFT increases with input length for Llama 3 but stays almost constant for GPT-2, highlighting GPT-2's ability to maintain a stable time to first token regardless of input length. The non-linear trend for Llama 3 suggests a fractional power model (y = x^n, where x is the input length, y is TTFT, and 0 < n < 1), while GPT-2's consistency suggests a constant function might be best.

4.1.1.4 Output Length
Figure 4.1 (g) shows output length vs. the time to last token. Latency increases with output length for both models in a linear fashion, but Llama 3's latency rises more sharply than GPT-2's. This suggests that GPT-2 is better at handling longer outputs, maintaining lower latency. As shown in Figure 4.1 (h), TTFT remains relatively stable for GPT-2 across different output lengths, while it generally increases for Llama 3 with slight fluctuation. This indicates that GPT-2 provides more stable performance in generating the first token across varying output lengths. A linear model could describe Llama 3's slight fluctuation, while GPT-2's behavior might be best represented by a constant model.

Figure 4.3: (a) GPU Utilization GPT-2 (b) GPU Utilization Llama 3 (c) Max GPU Utilization GPT-2 (d) Max GPU Utilization Llama 3
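The utilization numbers summarized in Figures 4.2 and 4.3 come from the measurement setup described in Section 3.4.2. Purely as an illustration of how per-GPU utilization could be sampled alongside an inference run, the sketch below polls NVIDIA's NVML bindings in a background thread; the sampling interval and the average/maximum aggregation are assumptions of this sketch, not a description of our exact tooling.

```python
import threading
import time

import pynvml  # NVIDIA Management Library bindings (nvidia-ml-py)

def sample_gpu_utilization(stop_event, interval_s=0.1):
    """Collect per-GPU utilization (%) until stop_event is set."""
    pynvml.nvmlInit()
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
               for i in range(pynvml.nvmlDeviceGetCount())]
    samples = [[] for _ in handles]
    while not stop_event.is_set():
        for i, h in enumerate(handles):
            samples[i].append(pynvml.nvmlDeviceGetUtilizationRates(h).gpu)
        time.sleep(interval_s)
    pynvml.nvmlShutdown()
    for i, s in enumerate(samples):
        if s:
            print(f"GPU {i}: avg {sum(s) / len(s):.1f}%, max {max(s)}%")

# Usage: start sampling, run the inference workload, then stop sampling.
stop = threading.Event()
t = threading.Thread(target=sample_gpu_utilization, args=(stop,))
t.start()
# ... run llm.generate(...) here ...
stop.set()
t.join()
```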
4.1.1.5 Number of GPUs with TTLT
From Figure 4.4, we found that for the GPT-2 model, increasing the number of GPUs can lead to higher TTLT. The potential causes are synchronization and communication overheads, low GPU utilization, and memory and data transfer.

Synchronization and communication overheads could be one of the primary reasons for the increase in latency when adding more GPUs during GPT-2 inference. Each input sequence needs to be split and distributed across the GPUs, which requires inter-GPU communication via interconnects such as NVLink or PCIe. After the GPUs process their portions of the data, the results need to be gathered and merged, which introduces delays. Additionally, GPUs have to wait for each other to complete their tasks before moving on to the next step. This can leave some GPUs idle while waiting for others to catch up, increasing overall latency and creating synchronization bottlenecks.

Furthermore, GPU utilization was not as high as we expected (see Figure 4.3 (a) and (b)). We therefore analyzed the relationship between the number of GPUs and the maximum average GPU utilization. Figure 4.3 (c) and (d) show that the maximum GPU utilization decreased as the number of GPUs grew, which could be another main cause of the increased latency. This happens because the workload is not always perfectly divisible across all GPUs, leading to underutilization. When GPUs are not fully utilized, their computational power is wasted, resulting in slower processing and higher latency. Distributing the workload evenly across multiple GPUs is a further challenge, especially for the attention mechanisms in LLMs, which may not scale linearly.

Additionally, memory and data transfer can also contribute to increased latency. As more GPUs are used, moving data between them, or between the CPU and the GPUs, can become a problem. If the data transfer rate cannot keep up with the processing speed, GPUs may sit idle waiting for data, which also reduces utilization and increases latency.

4.1.1.6 Number of GPUs with TTFT
Similar to the TTLT analysis, we used the same experiment to evaluate TTFT when the models run on different numbers of GPUs and process inputs with different batch sizes. Figure 4.4 shows the results. Figure 4.4 (c) shows TTFT by number of GPUs and batch size for GPT-2: the model performed best in the single-GPU setup, as with TTLT, and TTFT remains relatively stable in the multi-GPU setups. Figure 4.4 (f) shows TTFT by number of GPUs and batch size for Llama 3: the model responds faster for smaller batch sizes (1, 2, and 4) than for larger ones, and it performs best on 4 GPUs, following the same behavior as TTLT. This analysis aligns with the correlation and feature importance findings presented later in this chapter, confirming that TTFT is significantly influenced by batch size and input length for both models.

4.1.2 Throughput
Throughput, measured in tokens per second, reflects how efficiently the models process data. This section examines how batch size, input length, and output length affect throughput (Figure 4.1). Understanding these factors is crucial for scaling models to handle high demand.

4.1.2.1 Batch Size
Figure 4.1 (c) illustrates the relationship between throughput and batch size. Throughput increases with batch size for both models, but GPT-2 consistently achieves higher throughput than Llama 3 across all batch sizes, showing greater efficiency. The non-linear increase in throughput, especially the saturation effect in GPT-2, suggests that a fractional power model (y = x^n, where x is the batch size, y is throughput, and 0 < n < 1) is suitable for this relationship.
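As an illustration of how such a fractional power relationship can be checked against measured points, the sketch below fits y = a·x^n to throughput-versus-batch-size data with SciPy. The numbers are made-up placeholders rather than our measurements, and the scale factor a is an addition of this sketch on top of the y = x^n form used in the text.

```python
import numpy as np
from scipy.optimize import curve_fit

def fractional_power(x, a, n):
    # y = a * x**n; the model used in the text corresponds to a = 1 and 0 < n < 1.
    return a * np.power(x, n)

# Placeholder measurements: throughput (tokens/s) per batch size.
batch_sizes = np.array([1, 2, 4, 8, 16, 32, 64], dtype=float)
throughput = np.array([55, 95, 160, 260, 400, 580, 790], dtype=float)

(a, n), _ = curve_fit(fractional_power, batch_sizes, throughput, p0=[50.0, 0.5])
print(f"fitted scale a = {a:.1f}, exponent n = {n:.2f}")
print("sub-linear (saturating) scaling" if 0 < n < 1 else "not sub-linear")
```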
4.1.2.2 Input Length
Figure 4.1 (f) shows throughput against input length: throughput decreases as input length increases for both models, though GPT-2 maintains higher overall throughput. This indicates that GPT-2 is more efficient at handling longer inputs, preserving better throughput. The non-linear decline supports using an inverse proportional function (y = k/x, with k > 0), in which throughput decreases as input length increases.

4.1.2.3 Output Length
Finally, Figure 4.1 (i) shows throughput against output length: throughput decreases as output length increases for both models, with GPT-2 consistently achieving higher throughput. This suggests GPT-2 is more efficient at generating longer outputs while maintaining throughput. The non-linear decrease suggests a decaying power model (y = x^n, where x is the output length, y is throughput, and n < 0) is appropriate for this relationship.

4.1.2.4 Number of GPUs with Throughput
The same causes that increase latency when more GPUs are added to LLM inference also drive throughput down, as shown in Figure 4.4. When increasing the number of GPUs for batch processing with models like GPT-2 or Llama 3, throughput can decrease due to several factors. These include increased communication and synchronization overheads between GPUs, which slow down processing; inefficiencies in load distribution, where some GPUs may be underutilized; and the overhead of managing a larger number of GPUs. Additionally, memory and data transfer bottlenecks can arise, and certain operations within the models may not parallelize efficiently, leading to diminishing returns as more GPUs are added and ultimately reducing throughput. These findings emphasize the importance of optimizing input and output lengths and managing batch size to maintain high throughput, particularly in large-scale applications.

4.1.3 GPT-2 vs. Llama 3
Figure 4.4 shows that GPT-2 inference generally exhibits about three times lower latency and higher throughput than Llama 3. Several factors explain the difference between the two models.

Firstly, the size and complexity of the models play a significant role. GPT-2 is a smaller and less complex model, with fewer parameters and layers than Llama 3. This reduced complexity means that GPT-2 requires less computational power and time for each inference step, resulting in lower latency and the ability to handle more inferences in a given time frame, and thus higher throughput. In contrast, Llama 3 is a larger and more advanced model with more parameters and layers, which naturally increases the computational load and processing time, leading to higher latency and lower throughput.

Additionally, architectural differences between the models contribute to these performance disparities. GPT-2's architecture is simpler and more streamlined, allowing for more efficient processing during inference. While Llama 3 may incorporate advanced mechanisms to improve accuracy or other metrics, these enhancements can introduce additional computational overhead, slowing down inference and reducing throughput.
Furthermore, GPT-2 has been extensively optimized for inference on various hardware platforms, benefiting from techniques such as pruning, quantization, and specialized inference engines that reduce computational demands. We therefore suspect that Llama 3 has not yet received the same level of optimization, leading to less efficient operation and, consequently, higher latency and lower throughput.

Moreover, the inference paths of the two models differ, contributing to the performance gap. GPT-2's inference involves a relatively straightforward computational path with fewer operations per layer and simpler mechanisms for generating outputs. This simplicity enables faster processing of each token, resulting in lower latency and higher throughput. In contrast, Llama 3's inference may involve more complex operations that require more computation per token. These additional demands significantly increase the time required for inference, leading to the observed higher latency and lower throughput.

4.2 GPU Scaling and Performance Impact
The number of GPUs plays a critical role in the performance of LLMs, particularly in large-scale deployments (Figure 4.4). This section examines the effects of varying batch sizes and numbers of GPUs on both latency and throughput.

Figure 4.4: Performance Metrics with Different Batch Sizes and Numbers of GPUs

We analyzed the datasets collected from the experiment described at the beginning of the chapter by comparing the two models' performance on different GPU setups while processing inputs with different batch sizes. In this experiment, batch sizes of 1, 2, 4, 8, 16, 32, and 64 were processed across different GPU configurations: 1, 2, 4, and 6 GPUs for GPT-2, and 1, 2, 4, and 8 GPUs for Llama 3.

For GPT-2, the highest performance was consistently observed with a single GPU. Upon transitioning to 2 GPUs, performance dropped significantly across TTLT, throughput, and TTFT (latencies increased and throughput fell), with smaller additional degradation as more GPUs were added, as shown in Figures 4.4(a), 4.4(b), and 4.4(c). This suggests that GPT-2 faces bottlenecks in parallel processing that limit its scalability when using multiple GPUs.

In contrast, Llama 3 demonstrated a different scaling behavior, as shown in Figure 4.4. It performed better on 4 GPUs than on 1 or 2 GPUs in this experiment when the batch size was smaller than 32. However, for larger batch sizes (32 and 64), the model achieved better performance when running on 2 GPUs. This indicates that Llama 3's performance is more sensitive to both batch size and GPU configuration, likely due to differences in its architecture and parallel processing capabilities. To further investigate the impact of batch size and number of GPUs on Llama 3's performance, we conducted a detailed analysis of Llama 3 in Section 4.3.1.

4.3 Further Analysis on Llama 3
Previously, we found that Llama 3's performance was not consistent across collections. Generally, it performs best for TTLT latency and throughput when the number of GPUs is 4 and the batch size is smaller than 32, see Figure 4.4. However, in one collection we also observed Llama 3 achieving its best performance for all batch sizes when using 4 GPUs. Therefore, in this section we set up two experiments to verify the consistency of Llama 3's performance and its output quality.

4.3.1 Verifying Llama 3 Performance
To validate the previous results (Figure 4.4), we repeated the data collection ten times and visualized the spread across runs with shaded areas.
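A minimal sketch of how repeated runs can be summarized with such a shaded band is shown below; the data are synthetic placeholders, and the choice of mean plus or minus one standard deviation as the band is an assumption of this sketch rather than the exact convention behind Figure 4.5.

```python
import numpy as np
import matplotlib.pyplot as plt

batch_sizes = np.array([1, 2, 4, 8, 16, 32, 64])
# Placeholder: TTLT in seconds for 10 repeated runs (rows) per batch size (columns).
rng = np.random.default_rng(0)
runs = 0.5 * batch_sizes ** 0.8 * (1 + 0.1 * rng.standard_normal((10, batch_sizes.size)))

mean = runs.mean(axis=0)
std = runs.std(axis=0)

plt.plot(batch_sizes, mean, marker="o", label="Llama 3, 4 GPUs (placeholder)")
plt.fill_between(batch_sizes, mean - std, mean + std, alpha=0.2)  # shaded area
plt.xlabel("Batch size")
plt.ylabel("TTLT (s)")
plt.legend()
plt.show()
```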
Comparing Figure 4.4 and Figure 4.5 shows that the earlier impression that Llama 3 reaches its overall best performance on 4 GPUs is more nuanced. For smaller batch sizes (1, 2, 4, 8, and 16), the model performed better on 4 GPUs than on the other configurations. For larger batch sizes (32 and 64), the model achieved its best performance when running on 2 GPUs.

4.3.2 Llama 3 Output Accuracy
In this section, we evaluate the quality of the output sequences generated by GPT-2 and Llama 3 and identify factors influencing this quality. We define output quality as how coherent and relevant the output is with respect to the input. Our comparison shows that Llama 3 consistently outperforms GPT-2 in output quality. Although both models exhibit better coherence with shorter outputs, Llama 3's architecture supports more stable handling of longer texts than GPT-2.

Figure 4.5: (a) Llama 3: TTLT latency with number of GPUs (b) Llama 3: Throughput with number of GPUs

Notably, the number of GPUs and the batch size do not significantly impact output quality for either model. Increasing the batch size or the number of GPUs does not lead to noticeable changes in the quality of the outputs, which remains stable. Conversely, input and output lengths have a significant effect. For a given output length, longer inputs generally result in more coherent and contextually relevant outputs. However, when the input length is fixed, longer outputs tend to lose coherence, often drifting from the initial prompt and becoming less focused.

4.4 Correlation Analysis
The correlation analysis conducted for the Llama 3 and GPT-2 models provides insight into the interactions between the input variables and the performance metrics. Figure 4.6 presents these correlation scores (refer to Table 3.1), reflecting how each model responds to input variables such as batch size, input length, and output length.

According to the classification in Table 3.1, both LLM inferences show very high positive correlations (0.7 < r < 1) between batch size and throughput, indicating that increasing the batch size can improve throughput. Similarly, latency has a strong positive correlation with output length, meaning larger outputs lead to higher latency. However, batch size correlates negatively with normalized latency for both models, suggesting that as batch size increases, the normalized latency tends to decrease relative to throughput. Input length and batch size also show high correlations (0.5 < r < 0.7) with TTFT, indicating that increasing the batch size or the input length delays the processing of the first token. We also find low to medium correlations, around 0.3 for both models, between batch size and latency (TTLT), suggesting that batch size influences TTLT but not as strongly as output length does.

Figure 4.6: Correlation Matrices for GPT-2 and Llama 3 Models

Figure 4.6 illustrates the discussed correlations, providing a visual representation of how each model's performance is influenced by the input variables and aiding the understanding of their operational dynamics.

4.5 Feature Importance Analysis
In this study, feature importance is calculated using permutation importance. This method evaluates the importance of each feature by observing the decrease in model accuracy after permuting the feature's values. The rationale is that if shuffling the values of a feature significantly decreases the model's accuracy, then the feature is important for making predictions.
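A minimal sketch of this procedure using scikit-learn's permutation_importance [43] is shown below. The random-forest surrogate model, the synthetic data, and the column names mirroring our factors are illustrative assumptions, not our exact analysis code.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

# Placeholder data: one row per measured run, columns mirror the studied factors.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "batch_size": rng.choice([1, 2, 4, 8, 16, 32, 64], size=200),
    "input_length": rng.choice([16, 64, 128, 256, 512], size=200),
    "output_length": rng.choice([16, 64, 128, 256, 512], size=200),
    "num_gpus": rng.choice([1, 2, 4, 8], size=200),
})
# Synthetic target standing in for measured throughput (tokens/s).
y = 60 * X["batch_size"] ** 0.7 / (1 + 0.002 * X["output_length"]) + rng.normal(0, 5, 200)

model = RandomForestRegressor(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for name, imp in sorted(zip(X.columns, result.importances_mean), key=lambda t: -t[1]):
    print(f"{name:>14}: {imp:.3f}")  # mean drop in R^2 when the column is shuffled
```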
Figure 4.7 shows the results of the feature importance analysis.

Throughput: As shown in Figure 4.7(a) and Figure 4.7(d), batch size is the most important factor in determining throughput for both GPT-2 and Llama 3. For GPT-2, the number of GPUs also plays a significant role, followed by input and output length, which have a minor impact. For Llama 3, however, input and output length are more important than the number of GPUs. This indicates that Llama 3 is more affected by variations in sequence length, whereas GPT-2 is affected more by the number of GPUs.

TTLT: Figure 4.7(b) and Figure 4.7(e) show the feature importance for TTLT. Output length is the main factor influencing the time required to produce the last token in both models, which is expected since longer outputs naturally take more time to generate. Batch size also affects TTLT, but to a lesser degree. The number of GPUs and the input length have little impact, indicating that these factors are less critical once the model starts processing the input.

TTFT: Feature importance for TTFT is shown in Figure 4.7(c) and Figure 4.7(f). Batch size is again the most important factor for TTFT in both models. Input length also matters, especially for Llama 3, where longer input sequences can delay the generation of the first token. The number of GPUs and the output length have minimal influence on TTFT, suggesting that they become relevant later in the processing rather than at the start.

Figure 4.7: Feature Importance Analysis for GPT-2 and Llama 3

In summary, while batch size is the most influential factor across all metrics, the importance of the other factors varies between GPT-2 and Llama 3. Both models are affected by adding GPUs, especially for throughput, while Llama 3's performance is more influenced by the input and output sequence lengths.

4.6 Summary of Key Findings
In this section, we summarize the critical observations from our performance evaluation of GPT-2 and Llama 3. The experiments explored how different input variables impact the models' latency and throughput. The findings highlight the strengths and limitations of each model, offering insight into their scalability and efficiency in various configurations. This summary presents the key performance trends and contrasts between GPT-2 and Llama 3.

Overall Performance: GPT-2 generally requires fewer resources than Llama 3 when processing the same input variables, resulting in higher throughput and lower latency, particularly with larger batch sizes. GPT-2 also demonstrates better scalability with increasing batch sizes and additional GPUs, a key strength for real-world applications. However, both models show scalability limitations, with diminishing returns in throughput and latency as more GPUs are added or batch sizes increase.

Scalability and Resource Management: GPT-2 scales more effectively than Llama 3, particularly in throughput. Llama 3 handles longer sequences with slightly better latency but struggles with throughput scaling. Effective resource management, including careful tuning of batch sizes and GPU usage, is essential to optimize performance and manage the trade-offs between latency, throughput, and GPU utilization.

5 Conclusion
This chapter presents a summary analysis of the results and the interesting findings of our project. The key findings can be summarized as follows.
From the correlation analysis, both the GPT-2 and Llama 3 inferences show a strong positive relationship between batch size and throughput. This means that each inference setup can process more sequences per unit of time (throughput) as the batch size increases. In practice, it indicates that both GPT-2 and Llama 3 inference can handle more sequences as the batch size rises, which shows good scalability and efficiency. However, latency also has a positive relationship with batch size; the correlation between them is approximately 0.3, a weak positive correlation, so latency also increases slightly. In summary, the strong positive correlation with throughput and the weak positive correlation with latency suggest that both GPT-2 and Llama 3 scale well with larger batch sizes: they significantly increase throughput with only a marginal increase in latency. This is advantageous in many scenarios, as it allows high processing efficiency without severely impacting response times.

In addition, we found that the feature importance of batch size for the throughput metric is over 1.90, by far the largest among the factors considered, meaning that permuting batch size degrades the throughput predictions the most. This confirms that throughput improves significantly as the batch size increases. At the same time, the feature importance of output length for TTLT is up to 1.59 higher than that of the second most important factor, batch size, for both models. This highlights that the length of the generated output sequences has a considerable impact on latency: generating longer outputs requires more computational steps, since the model predicts each token sequentially based on the previous context, so the longer the output, the longer it takes to compute and deliver the complete result. For TTFT, the feature importance of batch size is 0.66 higher than that of the second most important factor, input length. This suggests that TTFT is more sensitive to the number of sequences processed at once (batch size) than to the length of the individual sequences (input length). Compared with GPT-2, Llama 3 is more sensitive to changes in each factor, while GPT-2 shows better scalability as each factor changes.

Furthermore, high GPU utilization typically results in lower latency and higher throughput. In our project, however, we observed that as the number of GPUs increased, the average GPU utilization decreased, leading to higher latency and lower throughput. This is an interesting finding, which could be mitigated by reducing the previous-sequence token information stored in the key/value (KV) cache [44].

5.1 Future Work
In this section, we outline future work and optimizations that could extend this project.

5.1.1 Inference vs. Training Analysis
In our project, we focused only on analyzing the factors that influence the performance of LLM inference. Future work could involve a comparative analysis of how the same factors influence the performance of LLM training. Such a study could provide deeper insights into optimizing LLM performance across the different stages of model development, leading to more efficient resource use and better overall model performance.
Understanding these distinctions could help in designing more effective strategies for both inference and training, ultimately enhancing the application of LLMs in various domains.

5.1.2 Increase GPU Utilization
We observed that the average GPU utilization was lower than expected, which contributed to increased latency and decreased throughput during generative inference. As part of our future work, we aim to explore strategies for increasing GPU utilization to reduce latency. Jin et al. [44] designed a system that uses a priori knowledge of the output sequence length to reduce the large amount of memory reserved for previous-sequence token information in the key/value (KV) cache, which can increase GPU utilization and throughput during LLM inference. By finding ways to enhance GPU utilization, we anticipate not only improving throughput but also achieving lower latency, thereby optimizing the overall performance of our system.

5.1.3 Extend the Number of LLMs & GPUs
In the current project, we used two LLMs as research targets, Llama 3 and GPT-2, with a maximum allocation of 8 GPUs for parallel computing. For future work, we aim to expand the scope of our analysis by incorporating more LLMs. By increasing the number of models under consideration, we intend to conduct a more comprehensive analysis of how various factors (GPU utilization, batch size, input length, output length, and model architecture) affect the performance of LLM inference. Additionally, by increasing the number of GPUs available for parallel computation, we could further explore the models' scaling limits and the performance of different parallelism techniques. This expanded study will provide deeper insights into optimizing inference performance across different models, helping us identify best practices for scaling and efficiency in large-scale generative inference tasks.

Bibliography
[1] T. Brown, B. Mann, N. Ryder, et al., "Language models are few-shot learners," Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
[2] OpenAI Community, GPT-2 model on Hugging Face, Accessed: 2024-09-16, 2023. [Online]. Available: https://huggingface.co/openai-community/gpt2.
[3] Meta AI, Meta Llama 3 8B model on Hugging Face, Accessed: 2024-09-16, 2023. [Online]. Available: https://huggingface.co/meta-llama/Meta-Llama-3-8B.
[4] T. Wolf, L. Debut, V. Sanh, et al., HuggingFace's Transformers: State-of-the-art natural language processing, 2020. arXiv: 1910.03771 [cs.CL]. [Online]. Available: https://arxiv.org/abs/1910.03771.
[5] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[6] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., "Language models are unsupervised multitask learners," OpenAI blog, vol. 1, no. 8, p. 9, 2019.
[7] web.dev, Understanding large language model (LLM) sizes, https://web.dev/articles/llm-sizes, Accessed: Day-Month-Year, n.d.
[8] National Supercomputer Centre, Berzelius getting started, Linköping University, Accessed: Day-Month-Year. [Online]. Available: https://www.nsc.liu.se/support/systems/berzelius-getting-started/.
[9] T. M. Mitchell, Machine Learning. McGraw-Hill New York, 1997, vol. 1.
[10] E. Alpaydin, Introduction to Machine Learning. MIT Press, 2020.
[11] T. Hastie, R. Tibshirani, and J. H. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009, vol. 2.
[12] O. Chapelle, B. Scholkopf, and A. Zien, “Semi-supervised learning.
2006,” Cambridge, Massachusettes: The MIT Press View Article, vol. 2, p. 1, 2006. [13] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018. [14] D. Jurafsky and J. H. Martin, “Speech and language processing: An intro- duction to speech recognition, computational linguistics and natural language processing,” Upper Saddle River, NJ: Prentice Hall, 2008. [15] L. Rabiner and B. Juang, “An introduction to hidden markov models,” ieee assp magazine, vol. 3, no. 1, pp. 4–16, 1986. 37 https://huggingface.co/openai-community/gpt2 https://huggingface.co/meta-llama/Meta-Llama-3-8B https://huggingface.co/meta-llama/Meta-Llama-3-8B https://arxiv.org/abs/1910.03771 https://arxiv.org/abs/1910.03771 https://web.dev/articles/llm-sizes https://web.dev/articles/llm-sizes https://www.nsc.liu.se/support/systems/berzelius-getting-started/ https://www.nsc.liu.se/support/systems/berzelius-getting-started/ Bibliography [16] P. F. Brown, V. J. Della Pietra, P. V. Desouza, J. C. Lai, and R. L. Mercer, “Class-based n-gram models of natural language,” Computational linguistics, vol. 18, no. 4, pp. 467–480, 1992. [17] D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent dirichlet allocation,” Journal of machine Learning research, vol. 3, no. Jan, pp. 993–1022, 2003. [18] J. Beran, “Statistical methods for data with long-range dependence,” Statistical science, pp. 404–416, 1992. [19] Y. Bengio, R. Ducharme, and P. Vincent, “A neural probabilistic language model,” Advances in neural information processing systems, vol. 13, 2000. [20] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE transactions on neural networks, vol. 5, no. 2, pp. 157–166, 1994. [21] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural compu- tation, vol. 9, no. 8, pp. 1735–1780, 1997. [22] A. Vaswani, N. Shazeer, N. Parmar, et al., “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017. [23] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018. [24] A. Radford, K. Narasimhan, T. Salimans, I. Sutskever, et al., “Improving language understanding by generative pre-training,” 2018. [25] S. Black, L. Gao, P. Wang, C. Leahy, and S. Biderman, “Gpt-neo: Large scale autoregressive language modeling with mesh-tensorflow,” If you use this software, please cite it using these metadata, vol. 58, no. 2, 2021. [26] B. Wang and A. Komatsuzaki, Gpt-j-6b: A 6 billion parameter autoregressive language model, 2021. [27] S. Black, S. Biderman, E. Hallahan, et al., “Gpt-neox-20b: An open-source autoregressive language model,” arXiv preprint arXiv:2204.06745, 2022. [28] H. Touvron, T. Lavril, G. Izacard, et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023. [29] B. Workshop, T. L. Scao, A. Fan, et al., “Bloom: A 176b-parameter open-access multilingual language model,” arXiv preprint arXiv:2211.05100, 2022. [30] Hugging Face, Hugging face: The ai community building the future, https: //huggingface.co/, Accessed: Day-Month-Year. [31] T. Wolf, L. Debut, V. Sanh, et al., “Transformers: State-of-the-art natural language processing,” in Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations, 2020, pp. 38– 45. [32] Paradigms of Parallelism | Colossal-AI — colossalai.org, https://colossalai. 
org/docs/concepts/paradigms_of_parallelism/, [Accessed 30-08-2024]. [33] D. Narayanan, A. Harlap, A. Phanishayee, et al., “Pipedream: Generalized pipeline parallelism for dnn training,” in Proceedings of the 27th ACM sympo- sium on operating systems principles, 2019, pp. 1–15. [34] D. Narayanan, M. Shoeybi, J. Casper, et al., “Efficient large-scale language model training on gpu clusters using megatron-lm,” in Proceedings of the In- 38 https://huggingface.co/ https://huggingface.co/ https://colossalai.org/docs/concepts/paradigms_of_parallelism/ https://colossalai.org/docs/concepts/paradigms_of_parallelism/ Bibliography ternational Conference for High Performance Computing, Networking, Storage and Analysis, 2021, pp. 1–15. [35] Amazon Web Services, Inc., Pytorch tensor parallelism: How it works, Ama- zon SageMaker Documentation, https://docs.aws.amazon.com/sagemaker/ latest/dg/model-parallel-extended-features-pytorch-tensor-parallelism- how-it-works.html, n.d. [Online]. Available: https://docs.aws.amazon. com/sagemaker/latest/dg/model-parallel-extended-features-pytorch- tensor-parallelism-how-it-works.html. [36] H. Face, Tensor parallelism, Text Generation Inference Documentation, https: / / huggingface . co / docs / text - generation - inference / conceptual / tensor_parallelism, n.d. [Online]. Available: https://huggingface.co/ docs/text-generation-inference/conceptual/tensor_parallelism. [37] W. Kwon, Z. Li, S. Zhuang, et al., “Efficient memory management for large language model serving with pagedattention,” in Proceedings of the 29th Sym- posium on Operating Systems Principles, 2023, pp. 611–626. [38] Y. Sheng, L. Zheng, B. Yuan, et al., “Flexgen: High-throughput generative in- ference of large language models with a single gpu,” in International Conference on Machine Learning, PMLR, 2023, pp. 31 094–31 116. [39] J. Lee Rodgers and W. A. Nicewander, “Thirteen ways to look at the correlation coefficient,” The American Statistician, vol. 42, no. 1, pp. 59–66, 1988. [40] Scribbr, Pearson correlation coefficient: Definition, formula & calculation, Accessed: 2024-08-28, n.d. [Online]. Available: https://www.scribbr.com/ statistics/pearson-correlation-coefficient/. [41] DataTab, Pearson correlation, https://datatab.net/tutorial/pearson- correlation, Accessed: 2024-08-28. [42] Built In, Feature importance: What it is and how to measure it, Accessed: 2024-08-28, 2023. [Online]. Available: https://builtin.com/data-science/ feature-importance. [43] Scikit-learn, Permutation importance, Accessed: 2024-08-28, 2024. [Online]. Available: https://scikit- learn.org/stable/modules/permutation_ importance.html. [44] Y. Jin, C.-F. Wu, D. Brooks, and G.-Y. Wei, “SΘ3: Increasing gpu utiliza- tion during generative inference for higher throughput,” Advances in Neural Information Processing Systems, vol. 36, pp. 18 015–18 027, 2023. 
A Appendix 1