Fine Tuning a Large Language Model for Tactical Decision Making in Level 3 Autonomous Trucks

Master's thesis in Computer Science and Engineering

Yifan Zhao
Mengyuan Wang

Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2025

© Yifan Zhao, Mengyuan Wang, 2025.

Supervisor: Deepthi Pathare, Department of Computer Science and Engineering, Chalmers & Volvo Group Trucks Technology
Examiner: Morteza Haghir Chehreghani, Department of Computer Science and Engineering, Chalmers

Master's Thesis 2025
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX
Gothenburg, Sweden 2025

Fine Tuning a Large Language Model for Tactical Decision Making in Level 3 Autonomous Trucks
Yifan Zhao, Mengyuan Wang
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg

Abstract

This thesis investigates whether a Large Language Model (LLM) can be adapted to serve as the tactical brain of a Level-3 autonomous truck through supervised fine-tuning (SFT). We first generated highway driving scenarios in the SUMO simulator, pairing each coded scenario with high-level maneuvering decisions (ACC set speed, time gap, and lane change intent) generated by a powerful LLM. The resulting scenario-decision pairs constitute a domain-specific dataset that captures a variety of safety-critical interactions between the ego truck and surrounding traffic.
Three open-source models, Meta-Llama-3.1-8B, Qwen2.5-14B, and DeepSeek-R1-Distill-Llama-8B, are then fine-tuned with Low-Rank Adaptation (LoRA). A modular control stack separates the LLM's high-level reasoning from a low-level Intelligent Driver Model (IDM) that executes longitudinal and lateral motion, mirroring real-world practice. Evaluation on SUMO episodes showed that fine-tuning improved the quality of decisions: all models achieve a high success rate. Despite this, we found that the fine-tuned LLMs do not learn a perfect set of driving strategies; in particular, they do not fully learn the truck's lane-changing strategy and consequently behaved somewhat clumsily in some scenarios. After fine-tuning, some unsafe decisions were eliminated, which confirms an improvement in logical consistency. The models also generate concise natural-language rationales, improving the interpretability and compliance of the system. This study shows that, when equipped with a tailored driving dataset and efficient LoRA fine-tuning, a modestly sized LLM can provide safe, efficient, and interpretable, though not perfect, tactical decisions for self-driving trucks.

Keywords: Large Language Models (LLMs), Autonomous Driving, Open-Source Models, Supervised Fine-Tuning, Prompt Engineering

Acknowledgements

This project was carried out at the Safe and Efficient Driving Division at Volvo Group Trucks Technology, and we extend our sincere gratitude for their invaluable support. We would like to especially thank our supervisor, Deepthi Pathare, for her exceptional guidance, insightful feedback, and continuous encouragement throughout the course of this project. Special thanks to our examiner, Morteza Haghir Chehreghani, at Chalmers University of Technology. His profound knowledge and academic rigor have greatly enhanced the quality of our work.
We are also grateful for the equipment and computational resources provided by Volvo, which made the experimental work possible. Special thanks go to Alexander Bergersen for his assistance with cluster support and infrastructure access. This work has greatly deepened our interest in the application of Large Language Models in real-world domains and has inspired us to further explore this exciting and rapidly evolving field.

Mengyuan Wang, Gothenburg, 2025-06-12
Yifan Zhao, Gothenburg, 2025-06-12

Contents

List of Figures
List of Tables
1 Introduction
  1.1 Background
  1.2 Purpose
  1.3 Limitations
  1.4 Thesis Outline
2 Related Work
  2.1 Classical Methods for Tactical Decision-Making
  2.2 Advanced Machine Learning Methods
    2.2.1 Imitation Learning
    2.2.2 Reinforcement Learning
  2.3 LLMs for Decision-Making
  2.4 Motivation
3 Theory
  3.1 Large Language Models
    3.1.1 Terminology
  3.2 Prompt Engineering
    3.2.1 Prompt Structure
    3.2.2 Prompting Techniques
      3.2.2.1 Zero-Shot
      3.2.2.2 Few-Shot
      3.2.2.3 Chain-of-Thought
      3.2.2.4 Tree-of-Thought
  3.3 Supervised Fine-Tuning
    3.3.1 Full Fine-Tuning
    3.3.2 Low-Rank Adaptation
4 Methods
  4.1 Overview
    4.1.1 Low Level Controller
  4.2 Dataset Generation
  4.3 Model Fine-Tuning
    4.3.1 Model Selection
    4.3.2 Hyper-Parameter Tuning
  4.4 Experimental Design and Evaluation
    4.4.1 Simulation Settings
    4.4.2 Observation Space (Input to the LLM)
    4.4.3 Action Space (LLM Decision Outputs)
    4.4.4 Episode Termination Conditions
    4.4.5 Performance Evaluation Metrics
5 Results
  5.1 Evaluation of Generalization Abilities
  5.2 Evaluation of Tactical Decision Making Abilities
    5.2.1 Metrics Evaluation
    5.2.2 Case Studies
    5.2.3 Case Study 1: Deepseek-distill-llama's Lane Change Behavior
      5.2.3.1 Pretrained Deepseek-distill-llama
      5.2.3.2 Fine-tuned Deepseek-distill-llama
    5.2.4 Case Study 2: Qwen2.5-14B's Safety Maintenance Behavior
      5.2.4.1 Pretrained Qwen-14B
      5.2.4.2 Fine-tuned Qwen-14B
6 Conclusion
  6.1 Future Work
Bibliography
A Appendix 1 - System Prompt
  A.1 Supervised Fine-Tuning Prompt
B Appendix 2 - Dataset
  B.1 Example Scenario-Decision Pair

List of Figures

2.1 A rule-based decision-making system. Source: Adapted from [5]
3.1 Comparison of different model sizes among popular LLMs. Source: Adapted from [20]
3.2 Example of chain-of-thought (CoT) prompting. Source: Adapted from [16]
3.3 LoRA matrix decomposition method. Adapted from [23]
4.1 Overview structure of the project
4.2 Schematic diagram of the highway simulation setup
4.3 Simulation control architecture
5.1 Accuracy on the MMLU and HellaSwag datasets for each LLM
5.2 Loss change comparison of Qwen2.5 14B
5.3 Loss change comparison of Llama 3.1 8B
5.4 Loss change comparison in training and validation

List of Tables

4.1 Variables in the Intelligent Driver Model
4.2 Illustrative input-output pair from the driving decision dataset
4.3 Comparison of three large language models
4.4 Parameters Used in TCOP Calculation
5.1 Performance metrics of pretrained models in three surrounding cars experiments
5.2 Performance metrics of fine-tuned models in three surrounding cars experiments
5.3 Performance metrics of pretrained models in seven surrounding cars experiments
5.4 Performance metrics of fine-tuned models in seven surrounding cars experiments
5.5 Encoded Current Driving Environment Observation
5.6 Pretrained Deepseek-Distill-Llama Decision Based on Observed Context
5.7 Encoded Current Driving Environment Observation
5.8 Fine-tuned Deepseek-Distill-Llama Decision Based on Observed Context
5.9 Encoded Current Driving Environment Observation
5.10 Pretrained Qwen-14B Decision Based on Observed Context
5.11 Fine-tuned Qwen-14B Decision Based on Observed Context

1 Introduction

1.1 Background

Recent advancements in autonomous driving have revolutionized the field of transportation, demonstrating promising potential for enhancing safety and efficiency, and the technology is gradually integrating into our daily lives. Modern autonomous vehicles integrate multiple subsystems, including perception, planning, decision-making, and control, to act as an intelligent agent on the road. Among these, the decision-making module serves as the "brain" of the autonomous vehicle, responsible for translating the surrounding environment and goals into safe and efficient driving decisions. Long-distance trucking carries a significant portion of global freight, which means it can benefit greatly from automation.

The Society of Automotive Engineers (SAE) defines Level 3 (L3) autonomy as a mode in which the vehicle can handle all aspects of the driving task under certain conditions, but a human driver must be available to take over control when the system requests.
In an L3 vehicle, the automated system performs dynamic driving tasks, including lane changes, acceleration, and braking, without continuous human supervision. For long-distance trucking, L3 autonomy is particularly compelling in highway scenarios. Highways offer a relatively structured environment that is well suited to automation: there are no traffic lights, lane markings are well defined, and traffic flow is consistent.

Tactical decision-making consists of adaptive cruise control (ACC) and lane change maneuvers. ACC is an advanced driver-assistance system for road vehicles that automatically adjusts the vehicle's speed to maintain a safe distance from vehicles ahead. A lane change maneuver involves changes in both longitudinal and lateral velocity as well as movement in the presence of other moving vehicles, which makes it challenging.

Recently, Large Language Models (LLMs) have advanced significantly. Because of their remarkable abilities in context understanding, reasoning, and generating coherent responses in natural language, they are promising candidates for tactical decision-making in highway scenarios for L3 autonomous vehicles. Generative AI can perform complex tasks by following instructions expressed in human language, exhibiting a so-called common-sense reasoning that was unexpected from previous AI systems. For example, GPT-o1 shows strong performance on tasks ranging from daily dialogue to mathematical computation and logical reasoning. This ability to reason over instructions has led researchers to explore LLMs in the decision-making domain. In this master's thesis, we conduct a feasibility study of how to leverage the capabilities of LLMs, including natural language understanding and complex task reasoning, for decision-making.
The primary focus of this study is a principled way to fine-tune an LLM so that the model learns the patterns of tactical decision-making for autonomous vehicles. Our methodology involves dataset generation, model fine-tuning, and evaluation. A powerful LLM is applied to a simulation environment to generate the dataset. Smaller LLMs are then fine-tuned on the generated dataset. Finally, we evaluate our models in a simulation environment using metrics designed by human evaluators.

1.2 Purpose

The primary objective of this project is to explore effective methods, for example fine-tuning techniques and prompt engineering, for an open-source LLM to make tactical decisions in an L3 autonomous vehicle. In addition, we assess and compare the performance of different LLMs. Specifically, in a simulation environment, we test the LLMs' success/collision rate, the average speed of the ego vehicle, and driving efficiency. To concretize the objective, we propose a research question:

How can an LLM be adapted into a decision maker for an L3 driving vehicle?

This research question can be divided into three specific sub-questions:
• 1. What dataset do we need to fine-tune LLMs, and how can we obtain it?
• 2. What supervised fine-tuning methods should we use?
• 3. How should we design the evaluation tests in a simulation environment?

1.3 Limitations

Although this project aims to explore effective methods for an open-source LLM to make tactical decisions in an L3 autonomous vehicle, it is important to acknowledge certain limitations:
• 1. Scope: The evaluation of the LLMs is limited to a simulation environment rather than real-world testing, limiting the immediate applicability of the findings. In a real-world setting, LLMs could instead be used to validate and monitor the correctness of tactical decisions to improve overall system robustness.
• 2.
Dataset: Due to the time constraints of this thesis, a synthesized dataset is used for fine-tuning. Only a limited number of data samples are manually validated due to workload constraints. The impact of data quality on model performance is comparable to the presence of noise in any driving dataset.

1.4 Thesis Outline

Chapter 1 introduces the background of the thesis. Chapter 2 explores previous studies on decision-making in autonomous vehicles. Chapter 3 covers the theory related to LLMs, prompt engineering, and supervised fine-tuning. Chapter 4 clarifies the methodology used in our study. Chapter 5 presents the experimental results and discusses the observations and insights. Finally, Chapter 6 concludes the project by summarizing the findings and discussing future work.

2 Related Work

The related work chapter investigates methods proposed in previous studies to tackle the tactical decision-making problem in autonomous vehicles. One common approach is to use a rule-based system, which is efficient in ensuring safety in autonomous vehicles. Other, more comprehensive strategies involve advanced machine learning algorithms, including large language models. Finally, we discuss how our work differs from previous research and the contribution it brings to this study area.

2.1 Classical Methods for Tactical Decision-Making

Early autonomous driving systems were based on knowledge-driven, rule-based strategies to make tactical decisions on highways [1]. These methods embed expert knowledge of traffic rules and driving heuristics into the vehicle's driving logic. For instance, a finite-state machine can be used to switch between driving behaviors, such as lane-keeping, overtaking, or following, based on preset conditions.
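Such a finite-state machine can be sketched in a few lines. The behavior names, thresholds, and inputs below are invented for illustration and are not taken from the cited systems:

```python
# Illustrative finite-state machine for tactical behavior selection.
# States and thresholds are placeholders; a production rule base would
# encode far more conditions and exceptions.

def next_behavior(state, gap_ahead_m, lead_speed_mps, ego_speed_mps, left_lane_free):
    """Return the next driving behavior given simple preset conditions."""
    if state == "lane_keeping":
        # A slower leader at close range triggers an overtake if space allows.
        if gap_ahead_m < 50 and lead_speed_mps < ego_speed_mps and left_lane_free:
            return "overtaking"
        if gap_ahead_m < 30:
            return "following"
    elif state == "following":
        if gap_ahead_m > 60:
            return "lane_keeping"
        if left_lane_free and lead_speed_mps < ego_speed_mps - 2:
            return "overtaking"
    elif state == "overtaking":
        if gap_ahead_m > 80:
            return "lane_keeping"
    return state  # no transition fires: keep the current behavior
```

Even this toy version hints at the state-explosion problem discussed below: every new traffic situation adds conditions to every state.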
In such rule-based systems, if-then rules and logical reasoning are applied to select maneuvers, which ensures decisions remain interpretable and compliant with traffic laws [2]. In addition, a well-tuned rule database can yield safe and predictable behaviors; Pellkofer and Dickmanns proposed a behavior decision module that executes task plans generated by task-planning experts' rule database [3]. Similarly, Noh and An proposed a decision-making framework for highway environments, capable of reliably and robustly assessing collision probabilities under current traffic conditions and automatically determining appropriate driving strategies [4].

These rule-based systems excel in safety and interpretability; however, they face limitations in complex, dynamic environments. For example, designing a finite-state machine to imitate human drivers' behaviors can be extremely challenging, because the number of states can explode as one enumerates more traffic situations and exceptions. Human driving behaviors often involve subtle judgments that are difficult to capture through strict rules. Therefore, rule-based systems struggle with scenarios not anticipated by their designers. Although highway scenarios are usually structured (e.g., well-defined lanes, fewer edge cases), rule-based systems are brittle when confronted with out-of-distribution events (e.g., an accident scene or an aggressive cut-in) that weren't encoded in the rules [6].

Figure 2.1: A rule-based decision-making system. Source: Adapted from [5]

In conclusion, these classical approaches form the early toolkit for L3 autonomous driving on highways. They provide transparency and some basic guarantees, but at the cost of manual tuning and extremely limited adaptability.

2.2 Advanced Machine Learning Methods

To overcome the inflexibility of manual rules, researchers turned their focus to data-driven methods.
Machine learning plays an important role in this field. It allows an autonomous agent to learn driving patterns from previous examples and experience, rather than relying solely on hand-crafted logic. Two major paradigms have been explored: imitation learning and reinforcement learning.

2.2.1 Imitation Learning

Imitation learning uses logs of human driving behavior to train a model that imitates human driving decisions. For example, models may learn when to change lanes, or how to respond to a slower vehicle ahead, by observing many human drivers in similar situations. NVIDIA proposed the landmark DAVE-2 system, an end-to-end neural network controller which showed that a convolutional neural network could map camera images directly to steering commands by learning from human demonstrations [7]. Such models can reproduce human driving behavior quite well. However, their disadvantage is also well known: they may lack a deep understanding of the scenario and fail when encountering situations that are absent or rare in the training data. Kim et al. [8] elaborated on the shortcomings of this approach, pointing out that it lacks a semantic understanding of the image content, which makes it difficult to deal with data outside the training distribution.

2.2.2 Reinforcement Learning

Reinforcement learning (RL) is one of the three basic machine learning paradigms. It concerns how an intelligent agent should take actions in a dynamic environment in order to maximize a reward signal. In the context of tactical decision-making on highways, an RL agent can be rewarded for safety, speed, and efficiency. Deepthi et al. [9] designed a total cost of operation (TCOP) reward for the RL agent; embedding trucking costs such as fuel and brake wear guides the agent toward commercially relevant behaviors.
Deep reinforcement learning (DRL) methods have achieved promising results in simulation environments. The agent can learn efficient overtaking and lane-changing behaviors that might be hard to code manually. It has been shown that DRL outperforms simple rule-based systems in collision avoidance scenarios [10]. Similarly, many works have used deep Q-networks (DQN), a variant of DRL, or the deep deterministic policy gradient (DDPG) to learn a highway policy for the RL agent. Peng et al. [10] used an upper DQN to handle lane change decisions and a lower DDPG to control the trajectory; the two layers were trained in coordination to complete smooth lane changes in traffic.

Although RL outperforms other methods, it has its own challenges as well. A good RL agent needs a well-designed reward function, which relies heavily on the researchers' design ability and experience. In addition, the RL agent requires extensive exploration in simulation.

2.3 LLMs for Decision-Making

Given the limitations of these machine learning methods, researchers began to explore the possibility of using LLMs' reasoning ability to improve autonomous driving decision-making. LLMs such as GPT-3 [11] are pre-trained on vast generic datasets, which gives them generalized reasoning capabilities. Although LLMs were originally developed for text generation tasks such as chatbots and text translation, they have the potential to be decision-makers. Recent works [12] implied that integrating an LLM into an autonomous vehicle's planning module could make the system more generalizable and interpretable.

A prominent line of work uses the LLM as a high-level planner that directs the vehicle's actions, while a low-level controller executes the plan. For instance, in the study [13], the authors propose an approach where an LLM serves as the core decision-making module in the driving stack.
The LLM is prompted with a description of the current scenario and outputs natural language reasoning about what the ego vehicle should do. These textual decisions are then translated into numerical driving commands by a Model Predictive Controller (MPC). This method enabled human-like handling of complex situations.

Similarly, the work [14] introduces DriveMLM, an LLM-based framework for autonomous driving behavior planning. It employs a multimodal LLM as the planning module: it takes as input structured observations from various sensors, such as camera and radar, high-level route commands, and even driving rules, and produces both a driving decision and an explanatory text justification.

Decisions for autonomous vehicles must not only be effective, but also legally and ethically acceptable. LLMs have been conceived as a tool to inject knowledge of transportation laws, regulations, and even ethical frameworks into the decision-making loop. In their work [15], the authors present a system that leverages a retrieval-augmented LLM to ensure traffic rule compliance. In their framework, an agent called Traffic Regulation Retrieval (TRR) fetches relevant rules from a database of driving laws and manuals based on the vehicle's current situation. A GPT-4-based reasoning module then interprets these textual rules and evaluates the vehicle's intended action against them. This approach can flag illegal and unsafe maneuvers using the knowledge in the database.

One of the most immediate advantages of LLMs is that they can explain why the model makes a given decision through prompt engineering techniques such as chain-of-thought (CoT) [16]. Unlike a typical neural network that maps inputs to outputs without human-readable reasons, an LLM can produce a CoT or a textual explanation of why a certain maneuver is recommended, which can greatly enhance user trust and ease validation.
CoT enables complex reasoning capabilities through intermediate reasoning steps; it not only increases the transparency of the model's internal reasoning but also improves the quality of the final output. Several systems have integrated a question-answering or commentary capability into driving control. For example, [17] proposes DriveGPT4, a multimodal LLM that can take video frames from the car's cameras as input and output both low-level driving controls and a textual commentary. In use, the LLM might receive a query like "Why does this vehicle behave in this way?". Such queries naturally trigger the LLM's reasoning ability.

In summary, large language models are opening up new avenues for autonomous vehicle decision-making at the cognitive level. They provide a natural language interface, making autonomous systems more transparent and interactive.

2.4 Motivation

Previous studies offer significant contributions to this field but also leave gaps that motivate our research. The rule-based approach is suitable for structured scenarios and gives clear logical explanations; however, it can be too rigid when applied to complex and dynamic driving scenarios. It lacks adaptability, makes it difficult to cope with emerging scenarios, and requires a large number of experts to write its logic. The machine learning approach generalizes by learning from a dataset, which makes it easier to adapt to new scenarios. However, it requires massive labeled datasets, and the black-box nature of neural networks makes it difficult to explain errors when they occur.

In our research, we develop a large language model framework for tactical decision-making, where the LLM serves as the "brain" of our system and decides high-level actions, including Adaptive Cruise Control (ACC) and lane change maneuvers, in a highway scenario, as well as providing human-like text-based reasons.
Specifically, we design and implement a controller that separates a high-level planner from a low-level executor: the LLM is responsible for high-level decision-making considering both safety and efficiency, while the low-level control is based on physical models for actual driving maneuvers, such as lane changes and speed adjustments.

3 Theory

The theory chapter lays the theoretical foundations supporting the research by summarizing the basic concepts and principles that guided it. It starts with an overview of LLMs. It then investigates techniques that can enhance LLM performance, such as prompt engineering and supervised fine-tuning. Finally, it discusses the SUMO simulation environment used to evaluate LLM performance.

3.1 Large Language Models

In this section, some essential background information about LLMs is provided. LLMs are advanced generative AI models that undergo extensive unsupervised pre-training on vast text datasets to understand and generate human-like text. Examples of LLMs include OpenAI's GPT series [18], Anthropic's Claude series [19], and Meta's Llama family.

3.1.1 Terminology

Model Size: Model size refers to how large the model is; specifically, how many parameters the model has. A typical large model can exceed 100B, meaning it has more than 100 billion parameters (e.g., Llama3.1-405B). A small model, on the other hand, may only have 1-10 billion parameters. Usually, a larger model has a better ability to learn complex patterns from the training data; however, it also needs more computational resources for training and more memory to store its parameters.

Base Model: A pre-trained language model that serves as a foundation for building more specific models for downstream tasks, such as text classification or translation.
Instruct Model: A model derived from a base model through additional fine-tuning on datasets of instructions and their corresponding outputs. This process imbues the model with the ability to follow specific directives and perform tasks more reliably [21].

Hallucinations: Model-generated content that is plausible-sounding but untrue; the model makes up content that doesn't exist [22].

Figure 3.1: Comparison of different model sizes among popular LLMs. Source: Adapted from [20]

3.2 Prompt Engineering

Prompt engineering is an increasingly important skill needed to converse effectively with LLMs. A typical well-designed prompt includes system instructions, questions, input data, and an example of the output.

3.2.1 Prompt Structure

System Instructions: Assign a role to the LLM to guide text generation by offering specific context. For example, begin the prompt with a role instruction like "You are a smart driving assistant."

Questions: The main content of the user query, on which the user expects the model to base its answer.

Input Data: Dynamic content that changes with the current state, for example, the cars' speeds, lane numbers, and relative distances.

3.2.2 Prompting Techniques

To retrieve the best response from an LLM, there are several prompting techniques users can choose.

3.2.2.1 Zero-Shot

Zero-shot prompting guides LLMs toward new tasks without providing examples; LLMs can answer queries they have never seen before. For example, an LLM can answer a prompt like "Please translate the following Chinese to English:"; such a prompt can trigger the LLM to find the most relevant representations in its latent space and generate a translation even if it was never explicitly trained on a translation task.

3.2.2.2 Few-Shot

Compared to zero-shot prompting, few-shot prompting offers a limited number of input-output examples to the LLM to better answer the query.
This method can significantly improve the format and quality of the answer, leading the LLM to respond more logically.

3.2.2.3 Chain-of-Thought

LLMs often struggle with complex reasoning tasks, which limits their ability and efficiency. To address this challenge, Chain-of-Thought (CoT) prompting was introduced to guide LLMs to think step by step [16]. CoT triggers the model's inherent reasoning ability, obtained during pre-training, by providing demonstrations that combine reasoning traces with inputs and outputs, which helps the model generate its output through systematic steps. Unlike few-shot prompting, which only provides input-output examples, CoT provides a detailed reasoning chain within the examples. This reasoning chain breaks the complex task down into small, easy subtasks. With this divide-and-conquer mindset, the model can exploit its latent reasoning ability, which in turn leads to the completion of complex tasks on which it previously performed poorly.

3.2.2.4 Tree-of-Thought

CoT improves LLMs' reasoning ability by teaching them to think aloud. Tree-of-Thought (ToT) is a framework that generalizes chain-of-thought prompting and encourages the exploration of thoughts as intermediate steps towards solving general problems with language models. This enables the LLM to self-evaluate the whole process via the intermediate thoughts made towards solving a problem through a deliberate reasoning process.

3.3 Supervised Fine-Tuning

Normally, pre-trained models are trained by unsupervised learning and may not be optimized for specific downstream tasks. Fine-tuning bridges this gap by taking advantage of the general language understanding that models gained from pre-training and adapting it to a specific task through supervised learning. During the fine-tuning process, the model's weights are adjusted based on gradients computed from the task-specific loss, which measures the difference between the model's predictions and the ground-truth targets.
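Written out, this task loss is typically the token-level cross-entropy (negative log-likelihood) of the target response given the prompt; a standard formulation (stated here as background, not quoted from a specific source) is

\[
\mathcal{L}_{\mathrm{SFT}}(\theta) \;=\; -\sum_{t=1}^{T} \log p_{\theta}\!\left(y_t \mid x,\; y_{<t}\right),
\]

where $x$ is the input prompt (e.g., the scenario description), $y = (y_1, \dots, y_T)$ is the target output (e.g., the tactical decision), and $\theta$ denotes the trainable weights. Minimizing $\mathcal{L}_{\mathrm{SFT}}$ by gradient descent adjusts the weights toward reproducing the target decisions.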
Figure 3.2: Example of chain-of-thought (CoT) prompting. Source: Adapted from [16]
3.3.1 Full Fine-Tuning
Full fine-tuning is the traditional approach: starting from a pre-trained model, all of the model's parameters are adjusted by continuing training on data from a specific downstream task. The advantage of this approach is that the downstream task data can be leveraged to adjust the model's behavior, making the model better adapted to the specific application scenario. However, full fine-tuning can be extremely computationally expensive because of the number of parameters involved.
3.3.2 Low-Rank Adaptation
Low-Rank Adaptation (LoRA) [23] is an efficient fine-tuning technique for large pre-trained language models compared to full fine-tuning. Its key idea is to fine-tune the model by introducing trainable low-rank matrices into the Transformer layers without changing the weights of the pre-trained model. This approach significantly reduces the number of trainable parameters and thus the demand for computational resources.
Figure 3.3: LoRA matrix decomposition method. Adapted from [23]
4 Methods
This chapter outlines the methodology used in this project, in three major parts. It begins with dataset generation, elaborating how the dataset was obtained. Next, it covers how supervised fine-tuning was applied. Finally, it describes the SUMO evaluation process and the evaluation metrics customized for the experiments.
4.1 Overview
We present a novel framework for tactical decision-making in autonomous truck systems, leveraging the capabilities of Large Language Models (LLMs). Our approach introduces a method of translating complex driving environments into structured textual representations, allowing an LLM to reason over the scene and determine optimal driving maneuvers.
Figure 4.1 illustrates a high-level overview of the proposed system architecture. The pipeline begins with the dataset generation phase, where we utilize high-fidelity simulations to construct a diverse and comprehensive set of highway driving scenarios. A large, high-performance LLM is employed to interpret these simulated environments, generating high-quality textual descriptions alongside corresponding tactical decisions (e.g., "set desired ACC speed to 24 m/s," "initiate lane change to the left," or "increase time gap to 3 s").
Following dataset creation, we proceed to the fine-tuning stage. In alignment with industrial constraints regarding computational efficiency and deployment feasibility, we select a significantly smaller pre-trained LLM for this purpose. This compact model is fine-tuned on the previously generated dataset, learning to replicate the decision-making logic and contextual understanding of its larger counterpart.
The final stage of the framework is evaluation. The fine-tuned model is deployed within a simulated environment, where its decision-making capabilities are rigorously tested across a wide array of highway scenarios. The evaluation focuses on the model's ability to produce safe, efficient, and contextually appropriate tactical decisions in real time. To support this evaluation, we utilize Simulation of Urban MObility (SUMO), an open-source, microscopic traffic simulator. SUMO enables fine-grained modeling of individual vehicle behaviors and interactions, allowing for detailed and realistic simulations of highway traffic. It supports large-scale networks, multi-modal transport systems, and time-discrete, space-continuous simulations.
Figure 4.1: Overview structure of the project
Due to its flexibility, extensibility, and broad adoption in both research and industry, SUMO serves as a suitable platform for testing autonomous driving strategies under a wide range of conditions.
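As an illustration of the scene-to-text step, a simulated scene can be serialized into the structured representation the LLM consumes. This is our own sketch; the field names follow the dataset example shown later in Table 4.2, and the helper name is hypothetical.

```python
import json

def encode_scene(ego_speed, ego_lane, time_gap, neighbours):
    """Serialize a simulated highway scene into a JSON string for the LLM.

    `neighbours` is a list of dicts with id, distance, rel_position,
    lane_relation, speed, and lane keys (schema assumed from Table 4.2).
    """
    scene = {
        "ego_speed": round(ego_speed, 1),
        "ego_lane": ego_lane,
        "current_time_gap": time_gap,
        "surrounding_vehicles": neighbours,
    }
    return json.dumps(scene)

text = encode_scene(24.8, 2, 2.0, [
    {"id": "veh1", "distance": 33.3, "rel_position": "front",
     "lane_relation": "right_lane", "speed": 21.7, "lane": 0},
])
```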
4.1.1 Low-Level Controller
To execute these decisions within the simulation, we implement a two-layer control architecture that separates high-level planning from low-level control. This modular separation mirrors real-world autonomous driving system design, where strategic reasoning is decoupled from the physical actuation of vehicle control.
We utilize LLMs to handle high-level tactical decision-making within our autonomous driving framework. This includes selecting maneuvers such as lane keeping, lane changing, or adjusting speed in response to dynamic traffic scenarios. In essence, the model determines what action the vehicle should take next, based on a textual understanding of the driving environment.
For low-level control, the component responsible for translating high-level decisions into executable physical actions, we implement the Intelligent Driver Model (IDM) [24] to govern longitudinal dynamics. Low-level control involves managing the vehicle's motion along and across lanes, including precise regulation of speed (acceleration and deceleration), maintaining safe following distances, and executing lane changes when commanded by the high-level planner.
The IDM is a time-continuous car-following model designed to realistically simulate human driving behavior, particularly in highway conditions. It computes the vehicle's acceleration based on the current speed, the desired speed, and the gap to the leading vehicle. Compared to earlier models such as Gipps' model, the IDM maintains more realistic dynamics even in deterministic limits, offering smoother transitions and safer spacing behavior. The longitudinal dynamics of the IDM are given by Eqs. 4.1. Equation 4.1a defines the acceleration v̇ as a function of the current speed v, the desired speed v0, and the spacing term in Eq. 4.1b.
\dot{v} = a \left[ 1 - \left( \frac{v}{v_0} \right)^{\delta} - \left( \frac{s^{*}(v, \Delta v)}{s} \right)^{2} \right], \tag{4.1a}
s^{*}(v, \Delta v) = s_0 + v T + \frac{v \, \Delta v}{2 \sqrt{a b}}, \tag{4.1b}
\Delta v = v - v_{\mathrm{lead}}. \tag{4.1c}
Table 4.1 defines the variables used in the IDM.
Table 4.1: Variables in the Intelligent Driver Model
Symbol  Description
v       Ego-vehicle speed (m/s)
v0      Desired (free-flow) speed (m/s)
v̇       Longitudinal acceleration (m/s²)
a       Comfortable acceleration limit (m/s²)
b       Comfortable deceleration limit (m/s²)
δ       Acceleration exponent (usually 4)
s       Actual gap to lead vehicle (m)
s0      Minimum stand-still gap (m)
T       Desired time headway (s)
∆v      Approach rate v − v_lead (m/s)
The lateral controller handles the lateral actions: stay in lane, change to the left lane, or change to the right lane. Lane changes are performed using the default LC2013 lane-change model [25] in SUMO [9].
4.2 Dataset Generation
Although the LLM is already trained on a vast generic dataset, it needs a domain-specific dataset to become a domain specialist. Hence, a specialized highway truck-driving dataset is critical for subsequent model training. To generate such a dataset, we created a simulation environment built with SUMO. This environment contains one truck and seven surrounding vehicles traveling on a three-lane highway. At the beginning of the simulation, the surrounding vehicles and the ego truck are randomly placed on the three lanes so that there is no collision between them. The entire simulation runs for 1000 steps, with each step set to 0.1 seconds, meaning that at each simulation step all vehicles advance by 0.1 seconds. The movement of the surrounding vehicles is controlled by SUMO, which in principle ensures that there will be no traffic accidents. The ego truck is driven by a complete system divided into a high-level decision-maker and a low-level controller.
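The IDM update of Eqs. 4.1 is only a few lines of code. The sketch below uses illustrative parameter defaults, not the exact values used in the thesis:

```python
import math

def idm_acceleration(v, v_lead, gap, v0=25.0, T=2.0, s0=2.0,
                     a=1.0, b=1.5, delta=4):
    """Intelligent Driver Model acceleration (Eqs. 4.1).

    v, v_lead: ego and lead speeds (m/s); gap: bumper-to-bumper gap (m).
    Parameter defaults are illustrative, not the thesis settings.
    """
    dv = v - v_lead                                          # Eq. 4.1c
    s_star = s0 + v * T + v * dv / (2 * math.sqrt(a * b))    # Eq. 4.1b
    return a * (1 - (v / v0) ** delta - (s_star / gap) ** 2) # Eq. 4.1a

# Free road (huge gap, matched speeds): acceleration tends to a*(1-(v/v0)^delta).
free = idm_acceleration(v=20.0, v_lead=20.0, gap=1e9)
# Closing fast on a slow leader: the interaction term dominates and brakes hard.
brake = idm_acceleration(v=25.0, v_lead=15.0, gap=20.0)
```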
We use DeepSeek [26] for the high-level decision-making. Equipped with DeepSeek and the IDM, the ego truck can run the simulation safely and efficiently. To avoid simulations in which the surrounding traffic density is too low to generate the amount of data needed for complex cases, we maintain a window around the ego truck and make sure that all surrounding vehicles stay within that window. As soon as a vehicle exits this window (because it is too slow or too fast), we regenerate it on the opposite side of the window. Meanwhile, the ego truck has a sensor capable of detecting vehicles within a 100-meter range, meaning that the truck only makes decisions based on the currently detected vehicles, in line with a realistic truck autopilot.
DeepSeek makes a decision every 10 steps, which means that a high-level instruction is sent to the low-level controller every second. We ran a total of 50 simulations, obtaining 100 scenario-decision pairs per simulation, and ended up with a total of 5,000 highway truck driving decisions in our dataset. The dataset contains only scenario-decision pairs, which are integrated into prompts for the supervised fine-tuning. The fine-tuning prompt is shown in Appendix A.1. Table 4.2 illustrates an example from the generated dataset. An entry consists of an input and an output, both described in JSON format. The input describes the truck's current environment, and the output is the decision made by the DeepSeek model, consisting mainly of ACC settings and a lane-change command.
4.3 Model Fine-Tuning
This section presents the model fine-tuning methods, which adapt the weights of pre-trained LLMs to highway scenario decision-making. We determine candidate LLMs, fine-tuning methods, and hyperparameters.
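Each scenario-decision pair can be wrapped into a supervised training example roughly as follows. This is our sketch: the actual fine-tuning template is given in Appendix A.1, and the instruction text here is a placeholder.

```python
import json

# Placeholder instruction header; the real template is in Appendix A.1.
INSTRUCTION = ("You are a smart driving assistant. Given the highway scene "
               "below, output an ACC/lane-change decision as JSON.")

def to_training_example(scenario: dict, decision: dict) -> dict:
    """Wrap one scenario-decision pair into an SFT prompt/completion record."""
    return {
        "prompt": INSTRUCTION + "\n" + json.dumps(scenario),
        "completion": json.dumps(decision),
    }

example = to_training_example(
    {"ego_speed": 24.8, "ego_lane": 2, "current_time_gap": 2.0,
     "surrounding_vehicles": []},
    {"acc_set_speed": 25, "time_gap": 2.0, "lane_change": "none",
     "reason": "No vehicles ahead in current lane."},
)
```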
4.3.1 Model Selection
Open-source models such as the Llama and Qwen families have demonstrated competitive performance against closed-source benchmarks such as GPT-3.5-turbo. Owing to their permissive commercial licenses and their capability, these open-source LLMs are preferable for this thesis.
Table 4.2: Illustrative input-output pair from the driving decision dataset
Input:
{
  "ego_speed": 24.8,
  "ego_lane": 2,
  "current_time_gap": 2.0,
  "surrounding_vehicles": [
    {"id": "veh1", "distance": 33.3, "rel_position": "front", "lane_relation": "right_lane", "speed": 21.7, "lane": 0},
    {"id": "veh3", "distance": 36.2, "rel_position": "front", "lane_relation": "right_lane", "speed": 21.6, "lane": 1},
    {"id": "veh4", "distance": 60.8, "rel_position": "front", "lane_relation": "right_lane", "speed": 21.7, "lane": 0}
  ]
}
Output:
{
  "acc_set_speed": 25,
  "time_gap": 2.0,
  "lane_change": "none",
  "reason": "No vehicles ahead in current lane, maintain max speed while keeping safe distance from adjacent-lane vehicles."
}
In this thesis, we conduct the study using three LLMs, Llama-3.1-8B, Qwen2.5-14B, and DeepSeek-R1-Distill-Llama-8B [27], and compare their performance. These models are much smaller than the LLM used for dataset generation, chosen in consideration of time efficiency and reasoning ability. It is worth noting that catastrophic forgetting can occur during fine-tuning, which can substantially affect a model's abilities: as new knowledge replaces prior learning, the model loses the ability to handle its original tasks. To monitor catastrophic forgetting, we use the MMLU [28] dataset to evaluate the models' reasoning ability. MMLU is a large-scale, multi-task language comprehension benchmark designed to assess and compare the abilities of language models on a variety of language comprehension tasks.
The benchmark covers a wide range of topics and domains, such as history, literature, science, and mathematics, and challenges the model's comprehension skills and breadth of knowledge through these diverse topics. We benchmark the score before fine-tuning and evaluate the models again after fine-tuning to check whether catastrophic forgetting occurred.
4.3.2 Hyper-Parameter Tuning
In this project, the LoRA method is applied for fine-tuning. Unlike full fine-tuning, LoRA freezes the pre-trained parameters, so in principle the model retains what it learned during pre-training. In addition, LoRA allows us to train some of the dense layers in the network indirectly by optimizing rank-decomposition matrices of the changes to those layers during adaptation. The three models we chose are all Transformer-based; we applied LoRA to the Q and V projections of the attention layers, chosen for a few main considerations:
• Parameter efficiency: fine-tuning focuses on the parts of the model that most affect its output and performance. The Q and V layers directly affect the computation of the attention weights and the final output representation, so tweaking these layers changes the model's behavior most directly.
Table 4.3: Comparison of three large language models
Model | Size | Description
Llama 3.1 8B | 7.62 billion | Advanced open-source language model developed by Meta, designed to excel in multilingual dialogue, reasoning, and text generation tasks.
Qwen2.5 14B | 14.8 billion | Significantly more knowledge and greatly improved capabilities in coding and mathematics, especially with JSON data.
Llama 8B distilled from DeepSeek R1 | 7.62 billion | Distilled from the larger DeepSeek-R1, which was trained using reinforcement learning to enhance reasoning capabilities. Strong performance on reasoning tasks, offering a balance between computational efficiency and capability.
• Information selection: adjusting the Q projection influences how the model selects information (i.e., what it pays attention to), while adjusting the V projection influences how the model makes use of information once it has been selected. The K projection primarily influences how information is matched, and in many cases adjusting Q and V alone is enough to direct the model's attention to more useful information.
• Computational efficiency: while LoRA improves parameter efficiency through low-rank updates, applying such updates to all layers still adds computational burden. Selecting the layers with the greatest impact on final performance therefore yields the largest gains at the smallest additional cost.
Additionally, r and α are two important LoRA hyperparameters. The r parameter is the LoRA rank, which determines the size of the low-rank matrices. Typical ranks range from 4 to 64; lower ranks (e.g., 4, 8) are more efficient, often with minimal performance drop, while higher ranks can improve performance but increase memory and compute. The α parameter is a scaling factor for the LoRA adaptation and affects the update rate of the low-rank matrices. It typically ranges from 8 to 64, set such that α/r ≈ 1 to 8 scales the LoRA output. To prevent catastrophic forgetting, we test different combinations of rank and alpha and evaluate the corresponding models' performance on the MMLU dataset.
\tilde{W} = W_0 + \Delta W, \tag{4.2a}
\Delta W = \frac{\alpha}{r} B A, \tag{4.2b}
y = \tilde{W} x = W_0 x + \frac{\alpha}{r} B A x, \tag{4.2c}
B \in \mathbb{R}^{d \times r}, \quad A \in \mathbb{R}^{r \times k}, \quad r \ll \min(d, k). \tag{4.2d}
After fine-tuning, we used two general-purpose datasets, MMLU [28] and HellaSwag [29], to evaluate the models' capabilities. MMLU focuses on verifying the logical reasoning ability of the model; it includes a series of multiple-choice questions on STEM, social science, the humanities, and more.
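The LoRA update of Eqs. 4.2 can be checked numerically, assuming NumPy is available. The dimensions and values below are arbitrary illustrations; note that with B initialized to zero (as in LoRA), the adapted model initially reproduces the pre-trained output exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 8, 6, 2, 4           # illustrative sizes, r << min(d, k)

W0 = rng.standard_normal((d, k))      # frozen pre-trained weight (Eq. 4.2a)
B = np.zeros((d, r))                  # B starts at zero, as in LoRA
A = rng.standard_normal((r, k))

delta_W = (alpha / r) * B @ A         # Eq. 4.2b
W_tilde = W0 + delta_W                # Eq. 4.2a

x = rng.standard_normal(k)
# Eq. 4.2c: the update can be applied without ever forming delta_W explicitly.
y = W0 @ x + (alpha / r) * B @ (A @ x)
```

Only A and B (d·r + r·k values) are trained, instead of the d·k values of a full weight update.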
We hope that our fine-tuned models achieve the highest possible scores, reflecting their ability to reason about complex logic. HellaSwag focuses on verifying the model's ability to understand context. It is also a multiple-choice dataset, and we hope that the model can correctly identify the appropriate continuation. The model needs to capture important information hidden in the problem, such as numbers; correctly understanding this information can have a significant impact on the final decision.
4.4 Experimental Design and Evaluation
This thesis employs a dynamic highway traffic simulation environment developed on the Simulation of Urban MObility (SUMO) platform, aimed at a comprehensive evaluation of the fine-tuned models in the domain of autonomous truck driving, covering foundational mathematical reasoning, logical inference, and the capacity for context-aware, real-time driving decisions. One objective is to assess the models' ability to operate safely and efficiently within complex and dynamic traffic scenarios. In addition, this research investigates the models' ability to produce clear and reasonable justifications for their decisions, with the aim of enhancing the transparency and trustworthiness of AI-driven autonomous vehicle technologies.
4.4.1 Simulation Settings
The road network models a straight highway segment consisting of three 3.2-meter-wide lanes in the primary direction of travel, spanning a total length of 5200 meters. The lanes are numbered 0, 1, 2, with 0 denoting the right-most lane, 1 the middle lane, and 2 the left-most lane. Traffic flow is governed by a density of 0.02 vehicles/meter, with 80% trucks and 20% cars, representing a challenging, truck-dominated scenario. Furthermore, the ego vehicle is modeled as a semi-trailer truck with a length of 16 meters and a width of 2.55 meters, and the driver's behavior is deterministic.
The surrounding vehicles, modeled as passenger cars, are governed by the Krauss car-following model [30] and the LC2013 [31] lane-changing model, with speeds sampled from a uniform distribution between 20 m/s and 30 m/s. In addition, the sensor range is set to 200 meters both in front of and behind the ego truck; every other vehicle currently driving within this range is detected as a surrounding vehicle. A visual illustration of the simulation setting is provided in Fig. 4.2. We design two sets of experiments, one with up to three surrounding vehicles, and another representing denser traffic with up to seven surrounding vehicles.
Figure 4.2: Schematic diagram of the highway simulation setup
The first 250 meters form the "warm-up" area of the experiment; each simulation episode therefore officially begins at a longitudinal distance of 250 meters from the start of the highway segment. To ensure a dynamic and non-deterministic traffic context, the initial positions and velocity profiles of surrounding vehicles are procedurally randomized at the onset of each episode.
A simulation episode is recorded as successful if the ego vehicle navigates the entire 5200-meter highway segment and reaches the designated terminal point. Conversely, an episode is prematurely terminated and classified as a failure if a critical safety violation occurs. Such events include, but are not limited to, collisions with other vehicles or road infrastructure, or any deviation of the ego vehicle from the defined drivable roadway (e.g., driving off-road). Each episode is limited to a maximum of 500 simulation steps. At each step, the ego vehicle's current driving state, along with information on the surrounding vehicles within the sensor range, is recorded, including position, speed, and lane assignment.
This information is used to construct an encoded textual representation of the driving environment, which is then provided as input to the Large Language Model (LLM), which serves as the high-level tactical planner.
The LLM analyzes the current environment context and determines an appropriate high-level tactical action (e.g., maintain speed, initiate a lane change to the left, increase the time gap to 3 seconds). This decision is then passed to a low-level executor responsible for carrying out the maneuver through concrete control inputs.
The low-level controller operates at a fine-grained temporal resolution of 0.1 seconds per simulation step. It translates high-level actions into precise longitudinal and lateral control commands:
• Longitudinal speed adjustments are executed smoothly over a period of 10 simulation steps (1 second).
• Lateral lane changes are completed over 40 simulation steps (4 seconds), ensuring stable and safe transitions.
Figure 4.3 illustrates the overall simulation framework and control architecture, adapted from [9].
Figure 4.3: Simulation control architecture
4.4.2 Observational Space (Input to the LLM)
The LLM-based driving agent perceives the environment through a structured set of observations, categorized into ego vehicle state and surrounding vehicle states. The intrinsic state of the ego vehicle (the autonomous semi-truck) is captured by the following parameters:
1. Current Longitudinal Speed: The forward velocity of the ego vehicle, measured in meters per second (m/s).
2. Current Lane Index: An integer representing the ego vehicle's current lane occupancy. For this three-lane highway segment, this value can be 0 (the rightmost lane), 1 (the middle lane), or 2 (the leftmost lane).
3. Current Set ACC Time Gap: The prevailing target time gap setting, in seconds (s), for the ego vehicle's Adaptive Cruise Control (ACC) system.
This represents the desired time headway the ACC is currently configured to maintain with a preceding vehicle, and serves as an input observation for the LLM's subsequent decision on potentially adjusting this setting.
Every vehicle within the ego truck's simulated sensor range is defined as a surrounding vehicle, and the following information is recorded:
1. Vehicle Identifier (ID): A unique identifier for each surrounding vehicle, enabling temporal tracking and consistent reference.
2. Current Longitudinal Speed: The forward velocity of the observed vehicle, measured in meters per second (m/s).
3. Current Lane Index: The current lane occupied by the observed vehicle, using the same indexing scheme as the ego vehicle (i.e., 0, 1, or 2).
4. Relative Lane Position: A categorical description of the observed vehicle's lane in relation to the ego vehicle (i.e., "same lane," "left lane," or "right lane").
5. Relative Longitudinal Distance: The forward or backward distance, in meters (m), from the ego vehicle to the observed vehicle. This is typically measured from the rear bumper of a leading vehicle to the front bumper of the ego vehicle, or vice versa for a trailing vehicle.
6. Relative Position: A categorical indication of whether the observed vehicle is positioned ahead of or behind the ego vehicle, defined as "front" or "rear" based on the longitudinal distance.
4.4.3 Action Space (LLM Decision Outputs)
Based on the processed observational input, the LLM determines the ego vehicle's tactical maneuvers. The decision outputs primarily cover three key aspects of driving behavior:
1. Adaptive Cruise Control (ACC) Target Speed: The desired cruising speed, in m/s, for the ACC system to maintain, subject to prevailing traffic conditions and safety constraints.
2. ACC Target Time Gap: The desired time gap, in seconds (s), for the ACC system to maintain with a preceding vehicle.
3.
Lane Change Maneuver: A categorical decision regarding lateral movement:
• Maintain the current lane.
• Initiate a lane change to the left.
• Initiate a lane change to the right.
4.4.4 Episode Termination Conditions
Each simulation run, or episode, concludes based on one of the following conditions:
1. Successful Target Achievement: The episode ends successfully if the ego vehicle safely reaches the "exit" segment, the pre-defined destination: the last 200 m of the highway segment are marked as the "exit" segment. Numerically, to reach the destination successfully, the traveled distance of the ego truck should be larger than 4700 m, as the whole highway segment is 5200 m long and the ego truck starts from 250 m.
2. Safety Violation (Hazardous Event): The episode is terminated prematurely if a critical safety event occurs, including:
• Collision: Any physical impact between the ego vehicle and another vehicle or a static road element.
• Road Departure: The ego vehicle deviates from the designated drivable roadway boundaries.
3. Failure to Reach Target (Timeout): The episode is terminated if the ego vehicle fails to reach the target destination within a predefined maximum number of simulation steps. This condition typically indicates overly cautious or inefficient driving behavior (e.g., an excessively low driving speed).
4.4.5 Performance Evaluation Metrics
The comprehensive abilities of the fine-tuned LLMs are assessed using a suite of quantitative metrics, averaged over a statistically significant number of simulation episodes:
1. Average Speed: The mean longitudinal speed (m/s) of the ego vehicle over the duration of an episode. This metric reflects travel efficiency.
2. Average Distance Traveled: The mean distance (m) traveled by the ego vehicle per episode. This is particularly relevant for episodes that terminate prematurely.
3.
Average Executed Steps: The mean number of simulation decision cycles completed by the ego vehicle per episode, indicating decision-making efficiency.
4. Success Rate: The proportion of episodes in which the ego vehicle successfully reaches the target destination, expressed as a percentage:
\text{Success Rate} = \frac{\text{Number of Successful Episodes}}{\text{Total Number of Episodes}} \tag{4.3}
5. Collision Rate: The proportion of episodes that end due to a collision, defined as:
\text{Collision Rate} = \frac{\text{Number of Collisions}}{\text{Total Number of Episodes}} \tag{4.4}
6. Off-Road Rate: The proportion of episodes that terminate due to the ego vehicle leaving the drivable area, calculated as:
\text{Off-Road Rate} = \frac{\text{Number of Off-Road Incidents}}{\text{Total Number of Episodes}} \tag{4.5}
7. Timeout Rate: The proportion of episodes that exceed the maximum allowed number of steps (set to 500), calculated as:
\text{Timeout Rate} = \frac{\text{Number of Timeouts}}{\text{Total Number of Episodes}} \tag{4.6}
8. Average TCOP (Total Cost of Operation): A composite metric comprising energy and driver costs, representing the operational expenditure in euros [9]. The cost at each step is computed as:
c(t) = C_{el} \, e_t + C_{dr} \, \Delta t \tag{4.7}
where C_{el} is the electricity cost rate, e_t the energy consumed at time step t, C_{dr} the driver cost rate, and \Delta t the duration of a time step. The energy consumed is given by:
e_t = f_t \, v_t \, \Delta t \tag{4.8}
with the traction force f_t at time step t defined as:
f_t = m a_t + \frac{1}{2} C_d A_f \rho_{air} v_t^2 + m g C_r + m g \sin\!\left(\arctan\!\left(\frac{\text{slope}}{100}\right)\right) \tag{4.9}
where:
• m: vehicle mass
• a_t: vehicle acceleration at time t
• C_d: aerodynamic drag coefficient
• A_f: frontal area
• ρ_air: air density
• C_r: rolling resistance coefficient
• g: gravitational acceleration
In this study, the slope is assumed to be zero. The detailed parameter values are provided in Table 4.4.
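Eqs. 4.7-4.9 translate directly into code. The sketch below uses the parameter values of Table 4.4 with zero slope; converting C_el and C_dr to per-joule and per-second rates is our own unit handling, not spelled out in the thesis.

```python
import math

# Parameter values from Table 4.4.
M, CD, AF = 40000.0, 0.36, 10.0   # mass (kg), drag coefficient, frontal area (m^2)
RHO, CR, G = 1.225, 0.005, 9.81   # air density, rolling resistance, gravity
C_EL = 0.5 / 3.6e6                # 0.5 euro/kWh converted to euro per joule
C_DR = 50.0 / 3600.0              # 50 euro/hour converted to euro per second
DT = 1.0                          # time-step duration (s)

def step_cost(v, a, slope=0.0):
    """Per-step operating cost c(t) of Eq. 4.7, in euros."""
    f = (M * a + 0.5 * CD * AF * RHO * v**2 + M * G * CR
         + M * G * math.sin(math.atan(slope / 100)))   # Eq. 4.9 (slope = 0 here)
    e = f * v * DT                                     # Eq. 4.8, in joules
    return C_EL * e + C_DR * DT                        # Eq. 4.7

cost = step_cost(v=25.0, a=0.0)   # cruising at 25 m/s on a flat road
```

Note that at standstill (v = 0) the energy term vanishes and only the driver cost remains, which is why a slow but "cheap" episode can still accumulate a high TCOP per kilometer.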
Table 4.4: Parameters used in the TCOP calculation
Parameter  Value
m          40000 kg
Cd         0.36
Af         10 m²
ρair       1.225 kg/m³
Cr         0.005
g          9.81 m/s²
Cel        0.5 euro per kWh
Cdr        50 euro per hour
∆t         1 s
5 Results
In the results chapter, we present the outcomes of the conducted experiments. Beginning with model performance, we compare the reasoning and generalization capabilities of the three models before and after fine-tuning to determine whether fine-tuning modified the models' capabilities. Afterwards, we evaluate our fine-tuned models in the SUMO simulation environment, as quantified by the custom metrics crafted for this study. Finally, we show actual experiment scenarios to demonstrate the models' output.
5.1 Evaluation of Generalization Abilities
This section shows how the generalization ability of the three models changes before and after fine-tuning. The models are evaluated on two generic datasets, MMLU and HellaSwag, which assess reasoning ability and common-sense contextual understanding, respectively. The evaluation accuracy scores for each LLM are presented in Figure 5.1. We let each of the three models answer the questions in the two datasets. What is clearly seen is that Qwen2.5 14B exceeds the other two models by a wide margin on both datasets, especially on HellaSwag. From this we tentatively infer that the Qwen series may be better suited to making correct and rational decisions in our SUMO environment, as driving an automated truck in a SUMO environment is a complex and continuous decision-making process. We can also see that the performance of almost all models on both datasets degraded to some extent after fine-tuning.
Figure 5.1: Accuracy on the MMLU and HellaSwag datasets for each LLM.
Figure 5.2: Loss curves of Qwen2.5 14B
Figure 5.3: Loss curves of Llama 3.1 8B
This is explainable: the two benchmarks are general-purpose datasets, whereas the fine-tuning dataset is synthetic, with strong specialized domain knowledge. While we froze most of the model parameters to mitigate catastrophic forgetting, the models' ability to generalize still suffers. Across the three models and two datasets, the worst relative drop occurs for Llama 3.1 8B evaluated on HellaSwag after fine-tuning with rank 128 and α = 256, where accuracy falls by approximately 16% compared with the pre-trained model.
Looking at the fine-tuning hyperparameters, the scores of most models on these two datasets tend to first increase and then decrease as the values of α and r grow. One explanation is that the additional parameters let the model learn the logic in the training set, but as the parameter count keeps increasing, the information contained in the dataset is all but exhausted; the model concentrates its capacity on the training set, leading to catastrophic forgetting of general knowledge.
Figure 5.4: Loss curves in training and validation
Figure 5.2 shows the loss during training and validation for Qwen2.5 14B. As the rank and alpha parameters increase, the model converges faster during training and eventually reaches lower loss values. This is easy to understand, because the model's fitting capacity grows with more trainable parameters; we simultaneously track the loss on the validation set to guard against overfitting. Figures 5.3 and 5.4 show the loss curves for the Llama and DeepSeek-R1-Distill-Llama models during training, demonstrating the same pattern as the Qwen2.5 14B model and further validating the observation above.
5.2 Evaluation of Tactical Decision-Making Abilities
This section analyzes the decision-making performance of the three large language models in the SUMO simulation environment, both before and after fine-tuning. To evaluate the models more comprehensively, covering both common driving situations (the three-surrounding-vehicles case that dominates the training set) and the denser seven-surrounding-vehicles case, we set up two separate sets of experiments and compute the corresponding metrics. In addition, this section analyzes the reasonableness of the models' decisions in specific driving scenarios through case studies.
5.2.1 Metrics Evaluation
The calculated metrics assess the models' performance in terms of safety, efficiency, and operational cost. The evaluation metrics in Tables 5.1 (pretrained models) and 5.2 (fine-tuned models) are from the experiments with three surrounding vehicles, and the results for the seven-surrounding-vehicles experiments are shown in Tables 5.3 (pretrained models) and 5.4 (fine-tuned models).
Table 5.1: Performance metrics of pretrained models in three-surrounding-cars experiments
Metric                  Llama3.1-8B  Qwen2.5-14B  Distill-Llama-8B
Success Rate            0.88         0.97         0.02
Failure Rate            0.12         0.03         0.98
Max Steps Rate          0.00         0.00         0.00
Invalid Decision Rate   0.09         0.00         0.00
Average Distance (m)    4538.82      4613.82      1100.65
Average Speed (m/s)     19.62        23.65        21.61
Average Steps           235.38       195.87       50.42
Energy Cost             2.87         3.56         1.63
Driver Cost             2.99         2.72         0.73
TCOP                    5.86         6.28         2.36
TCOP/km                 1.42         1.49         3.44
Table 5.2: Performance metrics of fine-tuned models in three-surrounding-cars experiments
Metric                  Llama3.1-8B  Qwen2.5-14B  Distill-Llama-8B
Success Rate            1.00         1.00         0.66
Failure Rate            0.00         0.00         0.34
Max Steps Rate          0.00         0.00         0.00
Invalid Decision Rate   0.00         0.00         0.00
Average Distance (m)    4759.40      4766.59      4001.87
Average Speed (m/s)     21.08        24.00        22.28
Average Steps           229.70       198.87       178.28
Energy Cost             3.06         3.71         2.88
Driver Cost             3.19         2.76         2.51
TCOP                    6.25         6.47         5.39
TCOP/km                 1.31         1.36         1.38
In general, fine-tuning has improved the success rate of all models. As the results show, the models are capable of generating correctly formatted responses through prompt engineering alone; although pretrained Llama3.1-8B produces some invalid decisions, these are eliminated by fine-tuning, after which the model generates only responses that align with the template in the prompt. The elimination of invalid decisions for Meta-Llama-3.1-8B in particular underscores improved logical consistency and decision safety, a key prerequisite for real-world deployment. We also investigated the abnormal responses: they all contain long, redundant text in the "reason" field, which exceeds the maximum token limit and renders the decision invalid. This is common in small models, which simply keep repeating meaningless words up to the maximum token limit.
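How a generation comes to be classified as an invalid decision can be illustrated with a simple validator. This sketch is ours, not the thesis code; the required keys follow Table 4.2, and the token limit is a placeholder value.

```python
import json

REQUIRED_KEYS = {"acc_set_speed", "time_gap", "lane_change", "reason"}
MAX_TOKENS = 200  # placeholder limit; the thesis value is not specified here

def is_valid_decision(text: str) -> bool:
    """Return True if the LLM output parses as a complete decision JSON."""
    if len(text.split()) > MAX_TOKENS:   # crude whitespace-token proxy
        return False
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    return REQUIRED_KEYS <= set(data)    # all required fields present

ok = is_valid_decision('{"acc_set_speed": 25, "time_gap": 2.0, '
                       '"lane_change": "none", "reason": "clear road"}')
bad = is_valid_decision('{"acc_set_speed": 25, "time_gap": ')  # truncated output
```

A response that rambles in the "reason" field until it hits the token budget is typically truncated mid-JSON, so it fails the parse step exactly as described above.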
In the three-surrounding-vehicle experiments, the success rate of the poorly performing Deepseek-Distill-Llama-8B in particular increased to 66%, compared to 2% for its pretrained version.

Metric                  Llama3.1-8B   Qwen2.5-14B   Distill-Llama-8B
Success Rate            0.66          0.90          0.48
Failure Rate            0.34          0.10          0.52
Max Steps Rate          0.00          0.00          0.00
Invalid Decision Rate   0.01          0.00          0.00
Average Distance (m)    3796.73       4585.20       3267.25
Average Speed (m/s)     22.27         21.95         23.75
Average Steps           172.93        212.18        137.09
Energy Cost             3.02          3.16          2.92
Driver Cost             2.36          2.95          1.92
TCOP                    5.39          6.11          4.85
TCOP/km                 1.86          1.34          2.76

Table 5.3: Performance metrics of pretrained models in the seven-surrounding-car experiments

Metric                  Llama3.1-8B   Qwen2.5-14B   Distill-Llama-8B
Success Rate            1.00          1.00          0.42
Failure Rate            0.00          0.00          0.58
Max Steps Rate          0.00          0.00          0.00
Invalid Decision Rate   0.00          0.00          0.00
Average Distance (m)    4762.96       4760.08       3344.73
Average Speed (m/s)     23.01         21.57         23.53
Average Steps           209.18        225.46        141.10
Energy Cost             3.48          3.14          2.91
Driver Cost             2.91          3.13          1.98
TCOP                    6.39          6.27          4.89
TCOP/km                 1.34          1.32          1.89

Table 5.4: Performance metrics of fine-tuned models in the seven-surrounding-car experiments

While fine-tuning has improved Llama3.1-8B's success rate in both the three- and seven-vehicle experiments, Qwen2.5-14B consistently maintained a high success rate, demonstrating high robustness both before and after fine-tuning in all experiments. This is also reflected in the average number of steps taken per experiment. Note that fewer steps are not necessarily better: if a model fails and the experiment terminates early, the step count is naturally lower than for a successfully completed run. As a standard benchmark, it takes 189 steps for the ego truck to reach the "exit" segment at a constant maximum speed of 25 m/s.
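The 189-step benchmark can be reproduced from the episode geometry. The numbers below are inferred rather than quoted from the thesis: assuming a SUMO step length of 1 s, a constant 25 m/s over 189 steps implies a route of roughly 4725 m to the "exit" segment.

```python
import math

STEP_LENGTH_S = 1.0      # assumed simulation step length (not stated explicitly)
MAX_SPEED_MPS = 25.0     # ego truck's maximum speed
ROUTE_LENGTH_M = 4725.0  # inferred from 189 steps at 25 m/s

def benchmark_steps(route_m: float, speed_mps: float,
                    step_s: float = STEP_LENGTH_S) -> int:
    """Number of simulation steps needed to cover the route at a constant speed."""
    return math.ceil(route_m / (speed_mps * step_s))

assert benchmark_steps(ROUTE_LENGTH_M, MAX_SPEED_MPS) == 189
```

Against this baseline, the successful models' averages of roughly 196 to 236 steps correspond to mean speeds somewhat below the 25 m/s cap, which matches the reported average speeds.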
Note that although the results show a relatively high overall success rate after fine-tuning, much of this performance gain can be attributed to the architectural design of the controller system. In this setup, the LLM functions primarily as a high-level tactical decision-maker, responsible for understanding the driving context and issuing appropriate commands, while the low-level executor handles the actual vehicle control tasks such as lane changes, speed adjustments, and maintaining safe following distances. The improved success rate thus reflects the benefits of this hierarchical control design. Still, it also indicates that the fine-tuned LLM has learned to understand a more complex and dynamic driving environment. It demonstrates enhanced decision-making capabilities, such as avoiding the collision risks associated with abrupt lane change maneuvers. Most importantly, the LLM has learned to avoid critical safety violations such as driving off the road, further underscoring its improved contextual awareness and planning ability.

To assess efficiency, we also record and calculate the average driving speed (m/s) and the Total Cost of Operation (TCOP) throughout all experiments. For the same reasons mentioned above, a lower TCOP does not directly indicate that the model's decisions make driving more efficient. Therefore, we also record the Total Cost of Operation per kilometer (TCOP/km), which offers a normalized measure for the efficiency assessment. For example, in the three-surrounding-vehicle experiments, Meta-Llama-3.1-8B's TCOP/km decreases from 1.42 to 1.31 and Qwen2.5-14B's from 1.49 to 1.36. Notably, Deepseek-Distill-Llama's TCOP/km declined dramatically from 3.44 to 1.38, indicating that although its total cost increased (from 2.36 to 5.39), its operational efficiency per kilometer greatly improved. The same patterns are observed in the seven-vehicle experiments.
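As a reading aid: in the tables, TCOP equals the sum of the Energy Cost and Driver Cost rows, and TCOP/km normalizes by distance. The sketch below assumes, without confirmation from the thesis, that the reported TCOP/km is the mean of per-episode ratios rather than mean TCOP divided by mean distance; under that assumption, short failed episodes inflate the average, which would explain the pretrained Distill-Llama pattern of a low TCOP alongside a high TCOP/km.

```python
def mean_tcop_per_km(episodes):
    """Mean of per-episode TCOP/km values.

    Each episode is (energy_cost, driver_cost, distance_m). A short failed
    episode yields a large ratio and pulls the mean upward.
    """
    ratios = [(energy + driver) / (dist / 1000.0)
              for energy, driver, dist in episodes]
    return sum(ratios) / len(ratios)

# Hypothetical episodes: one long successful run, one early failure.
episodes = [(3.0, 3.0, 4800.0), (0.8, 0.4, 600.0)]
# Per-episode ratios are 1.25 and 2.0, so the mean (1.625) exceeds
# total-TCOP / total-km (7.2 / 5.4 = 1.33...):
assert mean_tcop_per_km(episodes) > 7.2 / 5.4
```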
Energy and driver costs followed the same trend as TCOP. These increases are expected, given the models' greater success in completing longer and more complex driving tasks.

Average speed increased modestly across all models in both the three-surrounding-vehicle and seven-surrounding-vehicle experiments. These gains, considered alongside the changes in average distance and average steps, reflect improvements in navigation efficiency and suggest more streamlined paths. Notably, although Qwen2.5-14B shows a relatively higher average speed in the three-surrounding-vehicle experiments, it adopts a more conservative strategy when driving conditions are denser. In contrast, Llama3.1-8B demonstrates significant progress, with both average distance and average driving speed increasing in both sets of experiments, which is consistent with its improved success rate after fine-tuning.

In conclusion, fine-tuning has enhanced the performance of all three models to some extent, especially in terms of task efficiency and decision reliability. Although overall operational costs increase, largely due to higher task completion rates, the cost efficiency per kilometer improves or remains stable. Qwen2.5-14B shows the most robust performance, with a stable experiment success rate and a relatively low TCOP/km. These results indicate the potential of fine-tuning for adapting large language models to complex, real-world autonomous driving. Future work could explore further optimization strategies to improve the balance between task success and cost efficiency.

5.2.2 Case Studies

To further explore the reasoning capabilities and the explainability of the models' decisions, this section presents detailed case studies observed during the experiments. These examples highlight the models' responses to specific driving scenarios both before and after the fine-tuning process, offering a qualitative understanding of the changes in their decision-making behavior.
The cases indicate the models' potential to analyze text-based driving situations, their decision-making ability, and their ability to provide human-like language explanations.

5.2.3 Case Study 1: Deepseek-Distill-Llama's Lane Change Behavior

This case study focuses on the lane change decisions made by the Deepseek-Distill-Llama model in a specific traffic scenario before and after fine-tuning.

5.2.3.1 Pretrained Deepseek-Distill-Llama

The current driving scenario is shown in Table 5.5, and the corresponding decision made by the model is shown in Table 5.6.

Ego-truck Status
  Current Speed: 24.0 m/s    Current Lane Number: 0    Current Time Gap: 3.0 s

Surrounding Vehicles Information
  Car 1: speed 20.9 m/s, lane 2, relative distance 130.9 m, position Front, lane relation Left lane
  Car 2: speed 20.1 m/s, lane 2, relative distance 181.0 m, position Front, lane relation Left lane
  Car 3: speed 22.1 m/s, lane 0, relative distance -45.3 m, position Rear, lane relation Same lane

Table 5.5: Encoded Current Driving Environment Observation

Decision Component     Value
Desired Speed (m/s)    25
Desired Time Gap (s)   3
Lane Change Intent     Right
Reason                 Adjusting to maintain safety and efficiency.

Table 5.6: Pretrained Deepseek-Distill-Llama Decision Based on Observed Context

In the pretrained state, despite the ego vehicle traveling at 24.0 m/s in lane 0 with only a following vehicle ('car3') in the same lane, Deepseek-Distill-Llama decided to change lanes to the right. The stated reason, "Adjusting to maintain safety and efficiency," is quite general and lacks specific justification based on the observed environment. There is no immediate indication of a safety threat or an obvious efficiency gain from moving to the right, especially considering the presence of vehicles in the left lanes.
Since the ego vehicle is already driving in the right-most lane, this decision leads to an "out of road" event and fails the experiment. Coupled with the previously reported low success rate, it suggests a lack of nuanced understanding of the driving context in the pretrained model.

5.2.3.2 Fine-tuned Deepseek-Distill-Llama

The current driving scenario is shown in Table 5.7, and the corresponding decision made by the model is shown in Table 5.8.

Ego-truck Status
  Current Speed: 21.3 m/s    Current Lane Number: 1    Current Time Gap: 3.0 s

Surrounding Vehicles Information
  Car 1: speed 18.6 m/s, lane 2, relative distance -129.9 m, position Rear, lane relation Left lane
  Car 2: speed 19.2 m/s, lane 2, relative distance 185.0 m, position Front, lane relation Left lane
  Car 3: speed 21.7 m/s, lane 1, relative distance 99.9 m, position Front, lane relation Same lane

Table 5.7: Encoded Current Driving Environment Observation

Decision Component     Value
Desired Speed (m/s)    25
Desired Time Gap (s)   3
Lane Change Intent     Right
Reason                 Move to right lane to maintain faster speed in clear lane with no immediate obstructions ahead.

Table 5.8: Fine-tuned Deepseek-Distill-Llama Decision Based on Observed Context

In this case, the ego truck is in lane 1, driving at 21.3 m/s. There is a slower vehicle ('car1' at 18.6 m/s) behind in the left lane (lane 2) and a slightly faster vehicle ('car3' at 21.7 m/s) ahead in the same lane. The model identifies the right lane (implicitly, lane 0) as a "clear lane with no immediate obstructions ahead" and aims to "maintain faster speed." This reasoning aligns with typical driving strategies for efficiency and maintaining the desired speed.
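The pretrained model's out-of-road failure in this case study is exactly the kind of error a simple rule-based guard in the control stack could catch. The sketch below is a hypothetical post-hoc filter, not a component the thesis implements:

```python
NUM_LANES = 3  # 0 = right-most, 2 = left-most, as in the prompt encoding

def filter_lane_change(current_lane: int, intent: str) -> str:
    """Override any lane-change intent that would leave the 3-lane highway."""
    if intent == "right" and current_lane == 0:
        return "none"  # right from the right-most lane would be out of road
    if intent == "left" and current_lane == NUM_LANES - 1:
        return "none"  # left from the left-most lane would be out of road
    return intent

# The pretrained decision (lane 0, intent "right") is blocked,
# while the fine-tuned decision (lane 1, intent "right") passes:
assert filter_lane_change(0, "right") == "none"
assert filter_lane_change(1, "right") == "right"
```

Such a guard would not fix the underlying reasoning, but it would convert an episode-ending safety violation into a recoverable "none" decision.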
The improved explainability suggests that fine-tuning has enhanced the model's ability to perceive the environment and to provide more contextually relevant justifications for its decisions.

5.2.4 Case Study 2: Qwen2.5-14B's Safety Maintenance Behavior

This case study focuses on the lane change decisions made by the Qwen2.5-14B model in a specific traffic scenario before and after fine-tuning. The current driving state of the ego truck and the surrounding vehicles' information are shown in Table 5.9.

Ego-truck Status
  Current Speed: 25.0 m/s    Current Lane Number: 2    Current Time Gap: 2.0 s

Surrounding Vehicles Information
  Car 1: speed 20.2 m/s, lane 2, relative distance 20.0 m, position Front, lane relation Same lane
  Car 2: speed 24.8 m/s, lane 1, relative distance 57.1 m, position Front, lane relation Right lane
  Car 3: speed 25.5 m/s, lane 1, relative distance -5.7 m, position Rear, lane relation Right lane
  Car 4: speed 25.5 m/s, lane 0, relative distance -56.7 m, position Rear, lane relation Right lane
  Car 5: speed 23.5 m/s, lane 1, relative distance 84.7 m, position Front, lane relation Right lane
  Car 6: speed 25.5 m/s, lane 2, relative distance -40.5 m, position Rear, lane relation Same lane

Table 5.9: Encoded Current Driving Environment Observation

5.2.4.1 Pretrained Qwen-14B

The corresponding decision made by the pretrained model is shown in Table 5.10.

Decision Component     Value
Desired Speed (m/s)    20.2
Desired Time Gap (s)   2.0
Lane Change Intent     Right
Reason                 Reduce speed to maintain safe distance from front vehicle and change lane to avoid rear vehicle.

Table 5.10: Pretrained Qwen-14B Decision Based on Observed Context
The pretrained Qwen-14B model recognizes the speed difference with the lead vehicle ('car1' at 20.2 m/s), and it detects the potential threat from the rear vehicle ('car6' at 25.5 m/s). In response, it attempts to handle the situation by both slowing down and changing lanes to the right, presumably to escape the pressure from Car 6. However, this decision reflects an incomplete understanding of the surrounding traffic: the model fails to capture the threat posed by Car 3 (at 25.5 m/s) in the right lane (lane 1) at a very short distance, so the lane change could lead to a collision, which is very dangerous.

5.2.4.2 Fine-tuned Qwen-14B

The corresponding decision made by the fine-tuned model is shown in Table 5.11.

Decision Component     Value
Desired Speed (m/s)    20.2
Desired Time Gap (s)   2.0
Lane Change Intent     None
Reason                 Maintain speed to match slower front vehicle in same lane while avoiding rear collision risk from veh6.

Table 5.11: Fine-tuned Qwen-14B Decision Based on Observed Context

After fine-tuning, the model demonstrates more conservative and context-aware behavior. It identifies that the leading vehicle ('car1' at 20.2 m/s) is driving slowly, and it is aware of the potential danger from the fast rear car ('car6' at 25.5 m/s). More importantly, the model now captures the critical risk posed by Car 3 (at 25.5 m/s) on the right (lane 1). It takes the dynamic traffic in the right lane into account and chooses to maintain a safe, slower speed that matches the leading vehicle and to stay in its current lane instead of making a potentially risky lane change to the right. This suggests improved safety maintenance behavior and a better understanding of surrounding traffic dynamics.

6 Conclusion

This thesis investigates the application of LLMs to tactical decision-making for autonomous trucks.
Traditional decision-making approaches, typically rule-based, often fall short when faced with the complexity, uncertainty, and socially nuanced nature of real-world traffic scenarios. In contrast, LLMs offer a promising alternative thanks to their contextual reasoning and flexibility. This work focuses on small, pretrained LLMs, given practical considerations such as computational efficiency and the need for deployment on resource-limited systems. Through fine-tuning on domain-specific datasets, these models exhibit the ability to generate contextually appropriate tactical maneuvers, including lane changes and efficient driving speed adjustments, together with reasonable justifications based on an analysis of the current driving environment.

The fine-tuning process allows the LLM to learn from a diverse set of traffic contexts, encompassing varied interaction dynamics and traffic densities. One of the key contributions of this research is the demonstration of LLMs' ability to effectively process text-based representations of complex driving environments and to generate maneuver decisions that are both operationally feasible and safety-aligned, showing the potential of LLMs as high-level tactical planners in autonomous driving systems. In addition, this thesis proposes and implements a modular control architecture that separates high-level planning from low-level execution. Within this framework, the LLM operates as a strategic decision-maker, while a dedicated low-level controller handles the actual acceleration and deceleration calculations, so that the ego truck maintains a safe distance from the leading vehicle.

Beyond decision-making capabilities, another significant benefit is that LLMs can generate human-like natural language explanations. This opens up possibilities for improving the transparency and explainability of autonomous driving systems in the future.
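For concreteness, the low-level longitudinal behavior described above corresponds to the standard Intelligent Driver Model (IDM) acceleration law of Treiber et al.; the parameter values in the sketch below are illustrative defaults, not the thesis's tuned settings.

```python
import math

def idm_acceleration(v, v_des, gap, approach_rate,
                     a_max=1.0, b_comf=2.0, s0=2.0, time_gap=3.0, delta=4):
    """Standard IDM acceleration (Treiber et al.).

    v             -- ego speed (m/s)
    v_des         -- desired speed, e.g. the LLM's ACC set speed (m/s)
    gap           -- distance to the leading vehicle (m)
    approach_rate -- ego speed minus leader speed (m/s)
    """
    # Desired dynamic gap: minimum gap + time-gap term + braking term.
    s_star = s0 + v * time_gap + v * approach_rate / (2.0 * math.sqrt(a_max * b_comf))
    s_star = max(s_star, s0)  # never demand less than the minimum gap
    return a_max * (1.0 - (v / v_des) ** delta - (s_star / gap) ** 2)

# Closing fast on a nearby leader produces braking (negative acceleration):
assert idm_acceleration(v=24.0, v_des=25.0, gap=20.0, approach_rate=4.0) < 0
# With a huge free gap, the truck accelerates toward its set speed:
assert idm_acceleration(v=20.0, v_des=25.0, gap=1e6, approach_rate=0.0) > 0
```

Note how the LLM's two longitudinal outputs, the ACC set speed and the time gap, map directly onto `v_des` and `time_gap`, which is what separates tactical decisions from low-level control in this architecture.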
By articulating the reasoning behind decisions, LLMs support interpretability, build user trust, and help meet the increasing regulatory requirements for explainable AI in safety-critical applications.

6.1 Future Work

While this thesis demonstrates the potential of LLMs in tactical decision-making, several limitations and research opportunities remain.

A primary challenge is hallucination, where LLMs generate plausible but incorrect or unsafe outputs. Based on our experimental observations, the models are not always able to make intelligent and accurate decisions. Future research must prioritize strategies to detect and mitigate hallucinations, potentially through uncertainty estimation, rule-based filters, or external sensory validation.

Another observation concerns sequential driving scenarios. In highway environments with many similar consecutive scenarios, LLMs tend to make repeated and redundant decisions; in some specific cases, however, they produce consistent and smooth decisions with only small changes between time steps, showing an implicit capability to maintain short-term temporal coherence. Future work may explore mechanisms that let LLMs retain relevant information from context over time while avoiding meaningless redundancy in the generated content.

Another promising direction is to integrate LLMs with other methods, such as reinforcement learning (RL) or rule-based systems, to form a hybrid decision-making system. In such a framework, the LLM can serve multiple roles:

• High-Level Goal Generation: The LLM generates abstract sub-goals (e.g., prepare for merge or maintain safe following distance) that guide the RL agent's exploration and policy learning.
• Reward Shaping and Policy Guidance: The LLM provides structured language-based feedback or safety constraints, which can be translated into reward signals or training biases that guide the RL agent toward safe and efficient behaviors.

• Decision Filtering and Validation: Benefiting from its commonsense reasoning and its understanding of natural-language safety rules, the LLM can act as a reasoning-based filter, evaluating decisions proposed by itself or by other modules and rejecting those that are unsafe, unreasonable, or non-compliant with traffic norms.

This hybrid approach improves both safety and explainability while also making the decision-making process more efficient and trustworthy.
A Appendix 1 - System Prompt

A.1 Supervised Fine-Tuning Prompt

You are a tactical decision-making system for a semi-truck (ego truck) with the following parameters:
- Maximum speed: {} m/s
- Maximum acceleration: {} m/s²
- Maximum deceleration: {} m/s²
- Adaptive Cruise Control (ACC) time gap range: {} - {} seconds

Current situation:
- Current speed: {} m/s
- Current lane: {} (0=right, 1=middle, 2=left on a 3-lane highway)
- Current ACC time gap setting: {} seconds

Surrounding vehicles:
{{
  "Vehicle id": {}
  "Current speed": {}
  "Current lane": {}
  "Relative lane relation to the ego vehicle": {}
  "Relative distance to the ego vehicle": {}
  "Relative position to the ego vehicle": {}
}}

Based on the situation, make a tactical decision with ONLY these outputs:
1. ACC set speed (m/s)
2. ACC time gap (s)
3. Lane change request (none, left, right)
4. Brief reason for decision (one sentence)

Format your response as a JSON object:
{{
  "acc_set_speed": X,
  "time_gap": Y,
  "lane_change": "none/left/right",
  "reason": "brief explanation"
}}
Do not include any other text outside of this JSON object.
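The {} placeholders and doubled braces in the template above follow Python's str.format convention, in which {{ and }} render as the literal braces of the JSON skeleton. The sketch below fills a shortened, hypothetical excerpt of the template rather than the full prompt:

```python
# Shortened excerpt of the system-prompt template; the real prompt fills in
# many more fields (truck limits, ACC range, surrounding vehicles).
TEMPLATE = (
    "Current speed: {} m/s\n"
    "Current lane: {} (0=right, 1=middle, 2=left on a 3-lane highway)\n"
    "Format your response as a JSON object:\n"
    '{{"acc_set_speed": X, "time_gap": Y, '
    '"lane_change": "none/left/right", "reason": "brief explanation"}}'
)

prompt = TEMPLATE.format(22.1, 1)
assert prompt.startswith("Current speed: 22.1 m/s")
# The doubled braces survive formatting as a literal JSON skeleton:
assert '{"acc_set_speed": X' in prompt
```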
B Appendix 2 - Dataset

B.1 Example Scenario-Decision Pair

Scenario:
{
  "ego_speed": 22.1,
  "ego_lane": 1,
  "current_time_gap": 3.1,
  "surrounding_vehicles": [
    {
      "id": "veh2",
      "distance": 42.2,
      "rel_position": "rear",
      "lane_relation": "same_lane",
      "speed": 21.0,
      "lane": 1
    },
    {
      "id": "veh5",
      "distance": 104.6,
      "rel_position": "front",
      "lane_relation": "right_lane",
      "speed": 22.6,
      "lane": 0
    }
  ]
}

Decision:
{
  "acc_set_speed": 22.6,
  "time_gap": 3.8,
  "lane_change": "right",
  "reason": "Return to right lane to follow keep-right rule; gap ahead is clear and safe."
}