Analyzing Factors Influencing Performance in LLM Inference Systems
| dc.contributor.author | Fang, Ziyu | |
| dc.contributor.author | Al Neama, Mujtaba | |
| dc.contributor.department | Chalmers tekniska högskola / Institutionen för data och informationsteknik | sv |
| dc.contributor.department | Chalmers University of Technology / Department of Computer Science and Engineering | en |
| dc.contributor.examiner | Ali-Eldin Hassan, Ahmed | |
| dc.contributor.supervisor | Zhang, Huaifeng | |
| dc.date.accessioned | 2025-04-30T09:37:50Z | |
| dc.date.issued | 2025 | |
| dc.date.submitted | ||
| dc.description.abstract | Large Language Models (LLMs) have enabled numerous innovative applications, including virtual assistants, content generation, and recommendation systems, revolutionizing daily life. LLM inference is crucial because it allows these models to process and generate human-like text in real time, making them adaptable to a wide range of practical applications. This capability has not only enhanced the functionality of AI-driven technologies but also expanded their accessibility and impact across various industries. Latency and throughput are essential metrics for evaluating the performance of LLMs, as they directly influence user experience and system efficiency, and they are critical for optimizing the deployment and operation of LLM-based solutions in real-world scenarios. In this thesis, we evaluate the performance of LLM inference by analyzing how different factors affect inference throughput and latency. We study the current performance and internal working mechanisms of Llama 3 and GPT-2 to explore potential methods for further improving LLM inference. The project proceeds in two steps: data collection and data analysis. Our results show that increasing the batch size significantly improves throughput but can also lead to higher latency, indicating a trade-off between speed and responsiveness in LLM inference. Additionally, a larger model size can provide more accurate outputs. Interestingly, allocating more GPU resources reduced overall GPU utilization for both Llama 3 and GPT-2; such inefficiencies in resource allocation could negatively impact the cost-effectiveness of deploying LLM inference systems. | |
| dc.identifier.coursecode | DATX05 | |
| dc.identifier.uri | http://hdl.handle.net/20.500.12380/309294 | |
| dc.language.iso | eng | |
| dc.relation.ispartofseries | CSE 24-166 | |
| dc.setspec.uppsok | Technology | |
| dc.subject | LLM Inference, Latency, Throughput, Transformer Model, GPU, Parallel Computation, Empirical Analysis | |
| dc.title | Analyzing Factors Influencing Performance in LLM Inference Systems | |
| dc.type.degree | Examensarbete för masterexamen | sv |
| dc.type.degree | Master's Thesis | en |
| dc.type.uppsok | H | |
| local.programme | Computer systems and networks (MPCSN), MSc |
