Designing and Developing DirectorAI: An AI Assistant for Generating Vehicle Simulation Scenarios

Creating an effective AI Assistant for complex, domain-specific tasks without domain-specific training data

Master's Thesis in Interaction Design & Technologies
Master's Thesis in Computer Science – Algorithms, Languages and Logic

Adam Telles and Hannes Raaholt Larsson

Department of Computer Science and Engineering
Chalmers University of Technology
University of Gothenburg
Gothenburg, Sweden 2025

© Adam Telles, Hannes Raaholt Larsson, 2025.

Supervisor: Morten Fjeld, Interaction Design and Software Engineering
Advisor: Anders Tell, Volvo Cars
Examiner: Staffan Björk, Interaction Design and Software Engineering

Master's Thesis 2025
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Stylized mockup of the DirectorAI chat interface, overlaying a scene generated with DirectorAI and shown in Volvo Cars' Product Simulator.

Typeset in LaTeX
Gothenburg, Sweden 2025

Abstract

This thesis investigates the use and potential benefits of AI in automating the generation of vehicle simulation scenarios. Focused on enhancing the usability of Director, a scripting tool for Volvo Cars' Product Simulator software, this project involved designing and developing DirectorAI, an AI assistant featuring a chatbot interface. Using large language models, the research explored how to create effective and reliable scenarios without domain-specific model training. Moreover, the project identified limitations that emerge when deploying general-purpose language models in complex, domain-specific environments, alongside design patterns that enable these models to function effectively as assistants. The primary research question addressed was: "What design choices or patterns enable general-purpose language models to function as effective assistants in complex, domain-specific software environments without domain-specific training?" The outcome and findings demonstrated that the success of general-purpose LLMs in domain-specific environments relies less on model modification and more on how the system is designed to supply the model with relevant information. By iteratively crafting system prompts that embed domain context, constraints, and examples, DirectorAI was able to perform effectively without the need for custom training or fine-tuning.
Through prototyping and user evaluation, several key design patterns were identified that enabled the assistant to support complex workflows within the existing simulation software. This research emphasized the importance of interaction design in shaping the utility and usability of AI-assisted systems. By identifying and analyzing the design choices and patterns that facilitate the effective use of general-purpose LLMs in domain-specific environments, this thesis contributes to the understanding of how AI-assisted tools can be developed for complex simulation scenarios, offering valuable insights for future applications. Ultimately, this study demonstrated the potential of AI to significantly improve the efficiency and reliability of vehicle simulation scenarios, with implications for the automotive industry and beyond.

Keywords: Large Language Models (LLMs), Chatbots, Interaction Design, Computer Science, Vehicle Simulation Scenarios, AI Assistant, AI Adaptation, Prompt Engineering.

Acknowledgements

First and foremost, we are grateful for the opportunity to conduct our Master's thesis at Volvo Cars and for the valuable industrial context provided.

We would particularly like to express our gratitude to our industry advisor, Anders Tell, whose support and guidance were instrumental in the completion of this thesis. He not only helped set up this project but also provided invaluable technical insights, practical advice, and encouragement throughout the process. His deep understanding of the complex systems underlying ProSim and Director greatly shaped our work.

We would also like to thank the ProSim team at Volvo Cars for their technical support and for sharing their expertise. Their insights into the practical workflows and use cases for Director and ProSim were crucial for grounding our work in real-world needs.

Additionally, we extend our thanks to the other employees at Volvo Cars who participated in our user studies. Their willingness to share their experiences and provide feedback was essential for our understanding of the challenges faced by users, significantly shaping the direction of our design work.

Finally, we would like to thank our academic supervisor, Morten Fjeld, for his guidance and support during the academic aspects of this thesis, providing us with valuable feedback and helping us navigate the research process.

Adam Telles and Hannes Raaholt Larsson, Gothenburg, 2025-06-23

Contents

List of Figures
List of Tables

1 Introduction
  1.1 Project Aim and Research Questions
  1.2 Project Context
  1.3 Simulation Software Environment: An Overview of Director, ProSim, and FSM

2 Background
  2.1 Chatbots
  2.2 AI in 3D Modeling, Video Editing and Content Generation
  2.3 Related Academic Work

3 Theory
  3.1 Limitations of Chatbots
    3.1.1 Implementation of Chatbots
  3.2 Large Language Models
    3.2.1 GPT Architecture and Transformer Mechanisms
  3.3 Natural Language Processing
  3.4 Transformers
  3.5 Software
  3.6 Interaction Design-Related Theory
    3.6.1 User-Centered Design
    3.6.2 Cognitive Load
    3.6.3 Affordances
    3.6.4 Mental Models
    3.6.5 Usability Heuristics

4 Methods
  4.1 Design Methods
    4.1.1 Discover Phase
    4.1.2 Define Phase
    4.1.3 Develop Phase
    4.1.4 Deliver Phase
  4.2 Technical Methods
    4.2.1 Fine-tuning
    4.2.2 Prompt Engineering
    4.2.3 Other Relevant Concepts
    4.2.4 Retrieval-Augmented Generation (RAG)
  4.3 Time Plan

5 Process
  5.1 Preliminary Scoping and Feasibility
  5.2 Problem Discovery
  5.3 Problem Definition
    5.3.1 Data Analysis
    5.3.2 How AI Could Help
  5.4 Design and Implementation
    5.4.1 Terminology
    5.4.2 Early Attempts: Simple Prompting
    5.4.3 Evolving the Design: Introducing AI Modes
    5.4.4 The Full Pipeline: A Structured AI Workflow
    5.4.5 Why a Pipeline?
    5.4.6 Technical Considerations and Trade-offs
    5.4.7 Why GPT-4?
    5.4.8 From Generation to Correction: Introducing the Edit Pipeline
    5.4.9 Design of the Edit Pipeline
    5.4.10 Trade-offs and Implementation Notes
    5.4.11 Enabling Safe Experimentation
    5.4.12 Switching Between Pipelines: Manual vs. Automatic
    5.4.13 The Dual Pipeline Architecture
    5.4.14 Improving Explainability
  5.5 Evaluation
    5.5.1 Participant Background and Preconceptions
    5.5.2 General Impressions
    5.5.3 Trust and Effectiveness
    5.5.4 Interaction Framework
  5.6 Post-Evaluation Adjustments

6 Results
  6.1 Overview of DirectorAI
  6.2 Findings on Supporting Research Question 1
    6.2.1 Limitations of LLMs in the Absence of Domain-Specific Context
    6.2.2 Literal Interpretations and Semantic Misalignment
  6.3 Findings on Supporting Research Question 2
  6.4 Findings on the Main Research Question
    6.4.1 Supporting Varied Prompting Strategies
    6.4.2 Support for Non-Destructive, Editable Workflows
    6.4.3 Assistants That Execute and Educate
    6.4.4 Feedback and System Visibility
    6.4.5 First Drafts as a Value Proposition
    6.4.6 Providing Domain Context Through System Prompts
    6.4.7 Summary

7 Discussion
  7.1 Interpreting Results
  7.2 Relating to Literature
  7.3 Project Framing and Evolution
  7.4 Reflections on the Process
  7.5 Design and Engineering Implications
  7.6 Implications for Creative Software
  7.7 Limitations
  7.8 Future Work
  7.9 Ethics

8 Conclusion

Bibliography

A User Study Protocol 1: Exploratory Study
  A.1 Participant Onboarding & Consent Process
  A.2 User Background
  A.3 Study Introduction: Context & Objectives
  A.4 Initial Training & Exploration Phase
  A.5 Introduce the Task
  A.6 Post-Test Interview
    A.6.1 General Impressions
    A.6.2 Task-Specific Feedback
    A.6.3 Interface and Usability
    A.6.4 Efficiency and Workflow

B User Study Protocol 2: Prototype Evaluation
  B.1 Participant Onboarding & Consent Process
  B.2 User Background
  B.3 Introduction to DirectorAI
  B.4 Scenario
  B.5 Interview
List of Abbreviations

AI      Artificial Intelligence
API     Application Programming Interface
BERT    Bidirectional Encoder Representations from Transformers
Director  Product Simulator Director
FSM     Function State Machine
GPT     Generative Pre-trained Transformer
JSON    JavaScript Object Notation
LLM     Large Language Model
MIDI    Musical Instrument Digital Interface
ML      Machine Learning
MLM     Masked Language Model
NLP     Natural Language Processing
ProSim  Product Simulator
RAG     Retrieval-Augmented Generation
TF-IDF  Term Frequency-Inverse Document Frequency

List of Figures

1.1 An example scenario showcasing a Volvo car driving in a city environment, rendered in Product Simulator.
1.2 Overview of the Volvo Cars scenario scripting tool Director.
4.1 A visualization of the Double Diamond design framework. Adapted from [70].
4.2 Time plan of the thesis as a Gantt chart.
5.1 Distribution of code categories in interview responses.
5.2 The structured pipeline showing how a user's prompt is processed by the system.
5.3 A visualized example instance of the camera generator, showcasing what information it receives.
5.4 A visualized example instance of the start time generator, showcasing what information it receives and what it must consider to generate appropriate timings.
6.1 A simplified overview of the various information sources DirectorAI has access to.
6.2 Section of the new Director interface, showcasing the DirectorAI chat.
6.3 DirectorAI used to generate a scenario that opens all car doors.

List of Tables

5.1 Selected Participant Quotes

1 Introduction

Recent advances in large language models (LLMs) such as OpenAI's GPT-4 have sparked growing interest in their potential to support complex digital workflows [1]. Certain AI-driven assistants have shown how LLMs can help with implementation details and suggestions. An example of this is GitHub Copilot, an AI-driven coding assistant. However, these systems often rely on vast amounts of domain-specific training data in order to achieve their efficiency. This thesis explores a different question: can general-purpose LLMs, without fine-tuning or additional domain-specific training, still provide effective assistance within a highly technical, domain-specific tool?

The case study of this project is Director, a proprietary scripting tool used at Volvo Cars for configuring and creating scenarios for a 3D vehicle simulator. While powerful, Director presents a range of usability challenges, particularly for new or less technical users. Through a user-centered design process, this project develops and evaluates an AI-driven chat interface for Director, using GPT-4 to help users find simulation commands, configure parameters, and generate scripts conversationally.

1.1 Project Aim and Research Questions

The broader aim of the project is to investigate how general-purpose LLMs can reduce the cognitive load and technical barriers inherent in domain-specific software.
We focus particularly on the design decisions that enable such assistants to feel helpful, intuitive, and efficient, despite the underlying AI model having no task-specific training. This leads to our main research question:

Main Research Question (MRQ): What design patterns enable general-purpose language models to function as effective assistants in complex, domain-specific software environments without domain-specific training?

To support the investigation of this main question, two supporting research questions are proposed. These help clarify aspects of the problem space and structure the evaluation of the results:

Supporting Research Question 1 (SRQ1): What limitations emerge when using general-purpose language models as assistants in complex domain-specific environments without domain-specific training data?

Supporting Research Question 2 (SRQ2): What constitutes "effective assistance" in the context of AI-supported domain-specific workflows?

Together, these questions guide the exploration of both the opportunities and the constraints involved in integrating general-purpose AI into specialized technical domains. Through this work, our aim is to contribute insights on how AI can be meaningfully integrated into specialized workflows, and what considerations are required to make such integration useful and usable in practice.

1.2 Project Context

Designing and implementing simulation scenarios for Volvo Cars in Unity currently involves a detailed, hands-on process where every aspect of a vehicle's operation, such as opening doors, adjusting the camera, or triggering animations, must be meticulously programmed. For instance, creating a simple replayable simulation showcase of a car requires repeatedly switching between the 3D simulation software and Director to manually retrieve and input position and rotation values for the camera. Timing animations, such as doors opening or headlights activating, is similarly tedious, as users must determine exact signal names, many of which are inconsistently labeled, and adjust timing sequences without clear visual feedback. This fragmented workflow makes even straightforward tasks unintuitive and time-consuming, acting as a barrier to creativity and efficiency in validation.

Recent advancements in AI, particularly in NLP and generative AI, have made it more feasible to dynamically generate accurate results from user input. Running such models through cloud computing rather than on-premise hardware also makes them broadly accessible. With the ever-growing complexity of vehicle software, the need for efficient and user-friendly simulation software is becoming more important. With access to Volvo Cars' experts and proprietary tools and data, we will work closely with an industry-grade testing environment that ensures our research is both practical and impactful.

This thesis addresses the challenge of automating the scenario creation workflow in a way that users find effective and helpful. The goal is to enable users to describe scenarios using natural language, either at a high, abstract level or with low-level, intricate detail. By offloading the cognitive load of learning and remembering system syntax and commands to the LLM and its supporting structure, users can instead focus on creativity and solution exploration. By embedding interaction design principles, the assistant aims to reduce cognitive load while preserving user agency and empowering new and non-technical users.
All of this allows the assistant to support more dynamic and fluid workflows, moving away from the current rigid ones.

1.3 Simulation Software Environment: An Overview of Director, ProSim, and FSM

The technical ecosystem for this project consists of three main components: Product Simulator (ProSim), Function State Machine (FSM), and Director. These systems work together to provide an interactive, simulation-driven environment for testing vehicle behaviors and features.

ProSim is the core simulation software and is developed in Unity [2]; an example of ProSim's use can be seen in Figure 1.1. It serves as the real-time simulation engine, responsible for rendering 3D environments and vehicles, simulating physical interactions, and allowing direct user control. ProSim includes a high-fidelity 3D car model, complete with functional elements such as doors, compartments, lights, and interior features. Users interact with the simulation through a first-person character, navigated via keyboard and mouse in a manner reminiscent of first-person video games. This setup allows users to explore the vehicle in a realistic and intuitive way, and interact with different components as if inside a physical prototype.

Figure 1.1: An example scenario showcasing a Volvo car driving in a city environment, rendered in Product Simulator.

FSM introduces a layer of logical control on top of the raw physics simulation provided by ProSim. While ProSim handles the physical effects of actions (such as opening a door), FSM manages system logic, rules, and interdependencies that regulate whether certain actions are allowed. For example, FSM enforces rules such as "a door cannot be opened if it is locked" or "the car cannot start if the key is not present." FSM is designed to reflect, as closely as possible, the real software logic that would run in the actual vehicle. This makes the simulation not only visually realistic, but also behaviorally accurate from a systems standpoint. Although the details of the FSM team's implementation are outside the scope of this project, it is important to note that FSM is tightly integrated with ProSim, and receives input from it to enforce these logical constraints.

The third and most relevant component to this project is Director. Director is a Flutter-based application and serves as an interface for scripting and controlling ProSim. Communication between Director and ProSim occurs via Message Queuing Telemetry Transport (MQTT), a lightweight publish-subscribe messaging protocol often used in IoT systems [3]. Upon startup, Director establishes a connection with ProSim and receives a list of available signals. These signals represent actions that can be triggered in the simulation, such as opening a door, toggling a light, or adjusting climate control. Each signal corresponds to an MQTT topic (e.g., ego/door/open/FL for opening the front left door) and is used to send specific commands to ProSim. In Figure 1.2, an overview of Director is presented.

Figure 1.2: Overview of the Volvo Cars scenario scripting tool Director with numbered annotations. (1) Signal search bar and different sections of signal types that are toggleable on the left-hand side. The list of signals appears on the right, with signal type (input, output, and bidirectional) as filters. Signals can be dragged onto the timeline (the section between signals and properties), creating a track. (2) Column of signal tracks, each track occupying one row.
The scenario can be played using the play button, paused by clicking it again, or stopped using the stop button. (3) Timeline section, where each track can contain an unlimited number of actions. Actions are added by right-clicking the timeline and selecting Add action. Actions can be edited by selecting and moving, hiding, or deleting them. (4) The Properties pane is shown when an action is selected. Property values can be manually edited. In this image, the signal action MoveCamera is selected, with eleven parameters including 3D coordinates and direction.

Director's user interface is divided into three main sections: the signal list, the timeline editor, and the properties pane. The signal list (displayed on the left-hand side) shows all available signals retrieved from ProSim. These are organized by category, where the two most important categories for this project are ego and user input. Signals in the ego category directly affect the simulation without regard to the FSM's logic. For instance, an ego door open signal will open the door regardless of whether the vehicle is locked. On the other hand, user input signals simulate user-initiated actions, such as pulling the door handle, which respect the constraints enforced by FSM and could thus fail if the door is locked. This distinction is essential when designing test cases that should reflect realistic interactions.

Signals from the list can be added as tracks in the timeline editor. Each track corresponds to a specific signal, and users can place actions along the timeline. When the timeline is played, a red playhead moves across the screen. As it passes over an action, the associated signal is sent to ProSim, triggering the desired effect in the simulation. This timeline-based system is conceptually similar to music production software, where multiple tracks are layered and events are scheduled in time.

On the right-hand side of Director is the properties pane, where users can configure signal-specific values for each action. For example, when triggering a light signal, parameters such as RGB values might be required to define the color or intensity. This allows for nuanced and dynamic scripting of simulation scenarios.

Although Director does not communicate directly with FSM, it plays a key role in invoking behaviors that may be subject to FSM rules. Director sends signals only to ProSim, which then relays those commands to FSM as needed. For the purposes of this project, understanding the structure and behavior of signals, particularly the distinction between ego and user input categories, is sufficient to design meaningful scenarios and scripts within Director. A minimal sketch of how such a signal command might be published over MQTT is shown below.
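The sketch uses the open-source mqtt_client Dart package, matching Director's Flutter/Dart stack. Only the topic name ego/door/open/FL comes from the description above; the broker address, client identifier, and JSON payload shape are illustrative assumptions, not Director's actual implementation.

```dart
import 'package:mqtt_client/mqtt_client.dart';
import 'package:mqtt_client/mqtt_server_client.dart';

/// Illustrative only: publishes a single "open front-left door" command
/// the way a Director-style tool might, using MQTT's publish-subscribe
/// model. Broker address, client id, and payload shape are assumptions.
Future<void> main() async {
  final client = MqttServerClient('localhost', 'director-sketch');
  await client.connect();

  // Build the payload. A real signal may carry parameters; a simple
  // JSON body is assumed here as a stand-in.
  final builder = MqttClientPayloadBuilder()..addString('{"value": 1}');

  // ego/* topics bypass FSM logic, so this opens the door even if the
  // simulated vehicle is locked (see the ego vs. user input distinction
  // above).
  client.publishMessage(
    'ego/door/open/FL',
    MqttQos.atLeastOnce,
    builder.payload!,
  );

  client.disconnect();
}
```

In Director itself, such commands are not sent in isolation: actions are scheduled along the timeline, and each action's signal is published when the playhead reaches it.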
2 Background

In order to better understand how DirectorAI fits into the broader technical landscape, this chapter briefly explores the history of chatbots, AI in creative and technical workflows, and related academic work.

2.1 Chatbots

Chatbots, also known as artificial conversation entities, interactive agents, and digital assistants, are AI programs designed to simulate conversation with human users. Chatbots use NLP and sentiment analysis to communicate in human language through text or oral speech with humans or other chatbots [4].

The foundational concept behind chatbots is often credited to Alan Turing, who in 1950 questioned whether a computer could converse with people without revealing its artificial nature, known as the Turing test. The first chatbot, ELIZA, was created in 1966 and simulated a psychotherapist by rephrasing user statements as questions. Its abilities were rather basic, relying on pattern matching and template responses, but it significantly inspired future chatbot development [5]. A major advancement came in 2001 with SmarterChild, available on platforms such as Microsoft's MSN, which was the first chatbot that could assist with practical daily tasks by retrieving information from databases on subjects such as movie schedules, sports results, and news [5]. The development continued with intelligent personal voice assistants such as Apple Siri, Amazon Alexa, and Google Assistant, which can understand voice commands, manage tasks via the internet, and generate relevant responses quickly. The use of chatbots increased particularly after 2016, which coincided with social media platforms allowing developers to create chatbots for their services [5].

Chatbots are deployed across multiple sectors, driven by their ability to automate tasks, enhance user engagement, and provide scalable solutions. The primary applications include:

• Customer Service and Support: Chatbots facilitate 24/7 customer support by addressing frequently asked questions (FAQs), tracking orders, and providing multilingual assistance. Their scalability reduces operational costs. Omnichannel integration enables seamless interactions across web, mobile, and social media platforms [6].
Synthesia, for instance, uses AI to generate videos directly from text scripts, com- plete with AI avatars and voiceovers, resulting in AI-generated content with a clear chronological sequence [15]. Similarly, Descript, does video editing through text manipulation, allowing users to intuitively cut or rearrange video segments by sim- ply editing the corresponding transcribed text [16]. While these pieces of software work great in their domain, the implementation of AI is specific to each application and may often not be generalizable to other applications. Academic research on LLM-Assisted Video Editing with Unified Language Representations is particularly relevant to the concept of scripting. This work explores using LLMs to facilitate natural language interaction for complex editing 8 2. Background tasks, such as generating shot lists or refining sequences based on textual prompts [17]. This research underscores the potential for linguistic commands to directly influence the temporal flow and content of a generated sequence. 2.3 Related Academic Work The 18 design guidelines for human-AI interaction from Amershi et al. [18] serve as a foundational reference for designing user-friendly AI systems. These guidelines, developed through iterations and expert review, aim to help practitioners build AI tools that are both effective and intuitive. For our work in Director, where users work with complex scripting logic and often unpredictable signals, several principles stood out, especially those related to clarifying AI capabilities, supporting user learning, and enabling corrections. We used these as a guide to design an assistant that emphasizes transparency, context-awareness, and trust. The guidelines also stress the importance of efficient interactions and graceful handling of errors. Drawing from these guidelines helped us align DirectorAI with current best practices in human-AI interaction, particularly in making the assistant more transparent and allowing for better recoverability. Certain studies explore how LLMs can improve usability in more specific contexts. For example, Varma et al. [19] developed an AI-powered librarian assistant that helps students interact with library systems using natural language. Their assistant allows users to ask about books, availability, and recommendations through con- versation, reducing the friction of navigating traditional interfaces. Although the domain differs from our work, the core idea is the same: using natural language to bridge the gap between user goals and complex backend systems. Their success in integrating LLMs into an existing workflow supports our approach of applying a general-purpose model to a domain-specific tool like Director without requiring custom retraining. DirectorAI is reactive in the sense that it mainly responds to user input, but there is also research into more proactive uses of LLMs in technical settings. For example, Chen et al. [20] built a conversational programming assistant that suggests improve- ments and anticipates user needs without being prompted. While DirectorAI does not go that far, their work is relevant in how it shows the potential of context-aware, conversational agents to ease work in syntax-heavy environments. Their focus on shared context and fluid collaboration also resonates with our use case, where users can gradually build or refine simulation scripts through ongoing dialogue with the assistant. 
Another important factor is how users perceive the AI, especially their mental model of what it can and cannot do. Rismani et al. [21] studied this in the context of AI writing assistants and found that user understanding of the AI's behavior significantly affects how effectively they use it. If users assume the assistant is smarter than it is, or interpret its suggestions too literally, it can lead to frustration or misuse. This applies directly to DirectorAI. If users assume the assistant is always right, they may not critically evaluate signal suggestions or generated scenarios. On the other hand, if they understand it as a helpful but imperfect assistant, users are more likely to engage critically, which ultimately could lead to more satisfactory results. This highlights the need for clear, transparent communication from the assistant, as well as features that help users build a realistic understanding of its capabilities.

Khurana et al. [22] offer a cautionary perspective. In their study, users interacted with an LLM tailored to a specific software system. Surprisingly, the results showed little improvement over a general baseline, partly because users had trouble understanding how their prompts related to the responses they received. More concerning was that many users followed incorrect AI-generated instructions without questioning them, especially users with less technical experience. This highlights the need to design AI systems that users can understand and trust, not just those that generate correct answers.

At the same time, other research shows that even imperfect AI systems can still be helpful. Schoenegger et al. [23], for instance, studied how LLMs influence human forecasting performance. They found that even when users were given advice from a deliberately flawed LLM, their forecasts still improved compared to those who had no AI support. This suggests that LLMs can provide meaningful cognitive support even when their output is not perfect. That idea aligns well with our goals. DirectorAI is not meant to be an expert system that replaces the user's judgment. Instead, it is a tool to help users get unstuck, reduce friction, and better understand how to interact with the system. As Schoenegger et al. put it, the LLM acts more like a 'decision aid', which we believe is especially valuable in the kind of complex, technical workflows users face in Director.

AI-based assistants have significantly enhanced the efficiency of existing software, particularly in software development. A review of AI-driven innovations [24] presents a case study on GitHub Copilot, a generative AI tool that provides real-time code suggestions and completions. The study reports a 55% reduction in task completion time and higher rates of passing test cases on the first attempt, demonstrating how AI assistants streamline coding processes. By automating repetitive tasks and offering context-aware suggestions, GitHub Copilot improves productivity and reduces errors, making it a valuable tool for developers working on enterprise applications. The review emphasizes the need for clean data and user preparation to maximize these efficiency gains, underscoring the design considerations for embedding AI assistants into development workflows.

3 Theory

In this chapter, we present the technical resources used in this thesis, along with the theoretical foundations and related works that informed our approach.
3.1 Limitations of Chatbots

Despite their versatility, chatbots face significant theoretical and practical constraints, rooted in the limitations of current NLP and ML frameworks. These include:

• Limited Contextual Understanding: Chatbots struggle with ambiguous or complex queries due to finite NLP capabilities. Rule-based systems rely on keyword matching, while AI-driven models may misinterpret nuanced inputs, leading to irrelevant responses [25].

• Lack of Emotional Intelligence: The inability to interpret emotions, humor, or sarcasm restricts chatbots' capacity to foster emotional connections, critical for customer loyalty [26]. This stems from their reliance on statistical language models rather than human-like emotional understanding.

• Inability to Handle Complex Queries: Chatbots excel in routine tasks but falter in scenarios requiring reasoning or creativity. For instance, providing personalized advice or resolving multifaceted technical issues remains challenging [27].

• Security and Privacy Risks: Handling sensitive data exposes chatbots to vulnerabilities like hacking or data breaches. Compliance with privacy regulations is critical, particularly in healthcare [28].

• Hallucinations and Inaccuracy: LLMs may generate false information, known as hallucinations, due to overfitting or biased training data. Examples include fabricated citations or nonsensical responses to ambiguous inputs [29].

• Ethical and Bias Concerns: Biases in training datasets can lead to discriminatory outputs, undermining fairness in applications such as education and healthcare. Ethical challenges also arise from potential emotional manipulation [30].

• Environmental Impact: The computational resources required for training and operating chatbots contribute to significant energy consumption and carbon emissions, raising sustainability concerns [31].

3.1.1 Implementation of Chatbots

Chatbot implementations typically rely on two core approaches: pattern matching and machine learning.

Pattern matching uses rule-based systems to compare user input against predefined templates, selecting fixed responses accordingly. This method, exemplified by ELIZA, is effective for predictable interactions but struggles with ambiguous or novel input due to its reliance on scripted responses [5]. A minimal sketch of this approach is shown at the end of this subsection.

Machine learning-based chatbots, by contrast, leverage natural language processing (NLP) to understand context and intent, allowing them to generate responses dynamically. This approach powers virtual assistants like Siri, Alexa, and Google Assistant, which continuously adapt to user behavior through models such as deep learning and LLMs [5].

A leading example is ChatGPT, which uses the GPT architecture to produce coherent and contextually relevant dialogue. It can handle nuanced prompts and is capable of more than casual conversation, such as generating scenario-based simulations for educational purposes by interpreting structured prompts and delivering dynamic multi-turn responses [32], [33].
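To make the pattern-matching approach concrete, the following is a minimal, illustrative Dart sketch in the spirit of ELIZA. The rules and phrasings are invented for illustration, not taken from ELIZA itself; real systems use far larger template libraries.

```dart
/// ELIZA-style pattern matching: user input is compared against
/// predefined regular-expression templates, and a fixed (templated)
/// response is returned. The rules below are invented examples.
final rules = <RegExp, String Function(Match)>{
  RegExp(r'I am (.*)', caseSensitive: false):
      (m) => 'Why do you say you are ${m[1]}?',
  RegExp(r'I need (.*)', caseSensitive: false):
      (m) => 'Why do you need ${m[1]}?',
};

String respond(String input) {
  for (final entry in rules.entries) {
    final match = entry.key.firstMatch(input);
    if (match != null) return entry.value(match);
  }
  // Fallback when no template matches: the classic weakness of
  // rule-based chatbots faced with novel or ambiguous input.
  return 'Please tell me more.';
}

void main() {
  print(respond('I am stuck on a scripting task'));
  // -> Why do you say you are stuck on a scripting task?
}
```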
3.2 Large Language Models

LLMs are neural networks, typically based on transformer architectures, trained on vast corpora of text data to predict and generate sequences of words [34]. The transformer model, introduced by Vaswani et al. [34], leverages self-attention mechanisms to capture long-range dependencies in text, allowing LLMs to process and generate coherent language. Prominent examples, such as BERT and GPT, demonstrate the power of pre-training on diverse datasets followed by task-specific adaptation [35], [36]. Pre-training equips LLMs with broad linguistic knowledge, which can be fine-tuned for tasks like text classification, translation, or question answering [35]. The scale of LLMs, often comprising billions of parameters, enables them to model complex linguistic patterns but introduces challenges, including high computational costs and ethical concerns related to bias and misinformation [37]. Despite these limitations, LLMs have transformed fields such as scientific research, healthcare, and education by facilitating advanced text analysis and generation [38]. Their ability to generalize across tasks underscores their potential as foundational tools in NLP.

3.2.1 GPT Architecture and Transformer Mechanisms

The GPT family is based on the Transformer architecture introduced by Vaswani et al. [34], which uses self-attention to model long-range dependencies in sequences and allows for efficient parallel computation.

GPT uses an autoregressive approach to generate text one token at a time, conditioning each token on all previous ones. The probability of a sequence is expressed as shown in Equation (3.1).

\[
P(x_1, x_2, \ldots, x_n) = \prod_{t=1}^{n} P(x_t \mid x_1, \ldots, x_{t-1}) \tag{3.1}
\]

These conditional probabilities are modeled using deep neural networks [39].
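As a concrete illustration of the factorization in Equation (3.1) (our own example, not drawn from the cited works), the probability a model assigns to the three-token sequence "the cat sat" decomposes as:

\[
P(\text{the},\ \text{cat},\ \text{sat}) = P(\text{the}) \cdot P(\text{cat} \mid \text{the}) \cdot P(\text{sat} \mid \text{the},\ \text{cat})
\]

Generation simply runs this factorization forward: at each step, the model evaluates P(x_t | x_1, ..., x_{t-1}) over the vocabulary, selects or samples the next token, appends it to the context, and repeats.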
GPT follows a decoder-only Transformer design, stacking layers of masked self-attention and feed-forward components optimized for generative tasks like dialogue and summarization.

In contrast, bidirectional models such as BERT [35] use an encoder-only architecture trained via masked language modeling (MLM), making them well-suited for classification and extraction tasks.

Retrieval-Augmented Generation (RAG) enhances transformer models by incorporating retrieved external documents into the context window, improving factual accuracy and domain-specific performance [40].

3.3 Natural Language Processing

Natural Language Processing (NLP) focuses on enabling machines to understand and generate human language in a contextually meaningful way [41]. Applications span translation, summarization, and sentiment analysis.

Modern NLP has evolved from rule-based and statistical methods to deep learning architectures, especially transformers, which outperform RNNs and LSTMs in modeling long-term dependencies.

A key task is text classification, used in this project to categorize user input. Traditional approaches like TF-IDF require manual feature engineering and struggle with generalization. In contrast, transformer-based models such as BERT leverage pretraining and transfer learning to improve accuracy on unseen inputs [35].

3.4 Transformers

Transformers, introduced by Vaswani et al. [42], replaced recurrence with an attention mechanism, improving parallelization and the ability to model long-range relationships. Their core innovation is self-attention, which computes new token embeddings as weighted combinations of all tokens in a sequence. The self-attention mechanism, defined in Equation (3.2), is:

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V \tag{3.2}
\]

where Q, K, and V are the query, key, and value matrices, obtained from the input via learned linear projections, and d_k is the dimension of the key vectors.

Encoder and Decoder Roles

The encoder processes input tokens into contextual embeddings using stacked layers of multi-head self-attention and feed-forward networks. The decoder, used in generative tasks, predicts output tokens iteratively, incorporating encoder outputs and prior predictions. GPT models use decoder-only stacks, while BERT employs encoder-only stacks [43].

BERT and Bidirectional Context

BERT (Bidirectional Encoder Representations from Transformers) was proposed by Devlin et al. [35] to pretrain models that leverage context from both directions. It uses the MLM objective to predict randomly masked tokens using surrounding context and is ideal for classification and question answering.

3.5 Software

This section introduces the main software applications and technologies used in this project.

Dart and Flutter

Dart is an open-source, general-purpose programming language developed by Google, optimized for building high-performance, client-side applications across web, mobile, desktop, and embedded platforms [44]. With an object-oriented syntax and features like sound null safety, Dart enhances code reliability by preventing null reference errors [45]. It supports ahead-of-time (AOT) compilation for efficient native code execution and just-in-time (JIT) compilation with hot reload, enabling rapid development cycles [45]. Dart's platform-independent virtual machine and standard library, extended by the Pub package repository, make it a versatile tool for modern application development [46].

Flutter, a UI software development kit (SDK) built on Dart, enables developers to create natively compiled, visually consistent applications from a single codebase for multiple platforms, including iOS, Android, web, and desktop [47]. Flutter's widget-based architecture allows hierarchical composition of UI components, known as widgets, to build responsive interfaces [48]. Its rendering engine, powered by Skia (or Impeller on iOS), bypasses platform-specific UI components to ensure uniform visuals and performance across devices [47]. Flutter leverages Dart's AOT compilation for fast execution and JIT compilation for hot reload, streamlining development workflows [46]. Supported by a rich ecosystem of Pub packages and pre-built widgets, Flutter reduces development time and is used in applications such as Google Pay, making it suitable for cross-platform development [46].

Azure OpenAI Service

Microsoft Azure OpenAI Service provides enterprise-grade access to advanced AI models developed by OpenAI, including the GPT-4 model family, integrated with Azure's secure cloud infrastructure [49]. Launched in 2021, this service enables organizations to leverage powerful language models for tasks such as content generation, summarization, code generation, and conversational interfaces, while ensuring compliance with enterprise requirements such as data privacy, security, and regional availability [49], [50]. The GPT-4 models, including GPT-4, GPT-4 Turbo, and GPT-4o, are multimodal, capable of processing text and images, and excel in complex reasoning, coding, and multilingual tasks [51].

To interact with GPT-4 models, Azure OpenAI Service provides a REST API, accessible via endpoints for chat completions, embeddings, and other capabilities [52]. Developers create an Azure OpenAI resource in the Azure portal, deploy a GPT-4 model, and authenticate API calls using either API keys or Microsoft Entra ID tokens [52], [53]. For example, a chat completion API call involves sending a POST request to an endpoint with a JSON payload containing a system message, user prompt, and parameters such as max_tokens to control output length [53]. The response includes a model-generated completion, token usage, and metadata, enabling multi-turn conversations or single-turn tasks [52].
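As a sketch of such a call, the following Dart function posts a chat completion request using the http package. The resource name, deployment name, API version, and key are placeholders, and the minimal response parsing assumes the standard chat completions response shape; this illustrates the request format described above rather than DirectorAI's actual code.

```dart
import 'dart:convert';
import 'package:http/http.dart' as http;

/// Illustrative sketch of an Azure OpenAI chat completion call.
/// Resource name, deployment name, api-version, and key are
/// placeholders; consult the Azure OpenAI docs for current values.
Future<String> chatCompletion(String userPrompt) async {
  final uri = Uri.parse(
    'https://my-resource.openai.azure.com/openai/deployments/'
    'my-gpt4-deployment/chat/completions?api-version=2024-02-01',
  );

  final response = await http.post(
    uri,
    headers: {
      'Content-Type': 'application/json',
      'api-key': 'MY_API_KEY', // or a Microsoft Entra ID bearer token
    },
    body: jsonEncode({
      // The system message is where an assistant like DirectorAI can
      // embed domain context, constraints, and examples.
      'messages': [
        {
          'role': 'system',
          'content': 'You are a scenario-scripting assistant.'
        },
        {'role': 'user', 'content': userPrompt},
      ],
      'max_tokens': 256, // caps the length of the completion
    }),
  );

  // The response carries the completion, token usage, and metadata;
  // here only the generated message text is extracted.
  final json = jsonDecode(response.body) as Map<String, dynamic>;
  return json['choices'][0]['message']['content'] as String;
}
```

Sending the full chat history in the messages array is what turns this single-turn call into a multi-turn conversation.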
Azure's enterprise features, such as private networking, role-based access control, and content filtering, ensure secure and compliant API usage, while global standard deployments dynamically route traffic for low-latency performance [49], [54]. Additionally, Azure OpenAI supports RAG for grounding responses in enterprise data, enhancing accuracy for domain-specific applications [55].

Unity

Unity is a real-time 3D development platform developed by Unity Technologies, widely used for creating interactive simulations, games, and extended reality (XR) applications across industries such as automotive, robotics, and manufacturing [56]. Launched in 2005, Unity supports the creation of both two-dimensional (2D) and three-dimensional (3D) environments, using the C# programming language for scripting and a visual editor for designing scenes, physics, and animations [56], [57]. Its component-based architecture allows developers to attach behaviors, such as physics simulations or artificial intelligence (AI), to game objects, enabling rapid prototyping and iteration [58]. Unity's cross-platform capabilities support over 19 platforms, including Windows, macOS, Linux, iOS, Android, and XR devices like HoloLens and Oculus, making it versatile for deploying simulation environments [56].

Graphical simulation tools are important for visualizing, testing, and validating different behaviors and functions in the automotive industry. For instance, Yang et al. [59] used Unity to create a virtual reality driving simulation platform to examine driver behavior and depth sensor accuracy in various scenarios, which the authors argue will reduce training costs and improve the efficiency of addressing emergency events.

3.6 Interaction Design-Related Theory

This section introduces relevant theory from the field of interaction design used in this thesis.

3.6.1 User-Centered Design

Within the field of interaction design there exist several different approaches to design, such as speculative design, participatory design, and somaesthetic design, to name a few [60]–[62]. Each type of design influences what questions we ask ourselves as we work: What is the goal? Who are we designing for? Why are we designing for them? Consequently, the selected type of design affects the process we follow and the intended outcomes. In the case of designing and developing DirectorAI, an assistant intended to address the usability issues in Volvo Cars' Director tool, we use a User-Centered Design (UCD) approach. This approach is based on the work by Norman [63], Norman & Draper [64], and Gould & Lewis [65].

UCD is not a strict methodology; rather, it is a guiding principle or design philosophy. Fundamentally, UCD emphasizes building an understanding of and empathy for users' issues, pain points, and workflows. It promotes a design and development process based on user needs, producing prototypes that are continuously evaluated by users themselves to ensure solutions stay aligned with their real-world issues.

3.6.2 Cognitive Load

Cognitive load, as introduced by Sweller et al. [66], refers to users' limited capacity to hold task-relevant information in memory. These limitations in how people learn and process information can often increase the difficulty of performing a task. In the case of software like Director, cognitive load can be alleviated through various methods, such as reducing the complexity of the task or simplifying the interface.
One of the main motivations behind DirectorAI is to reduce the cognitive load associated with using Director. The tool's interface, scripting syntax, and workflow all require users to hold a lot of information in mind while working. This is where LLMs offer an opportunity to reduce cognitive load. With their ability to understand users' natural language, we can shift some of that mental burden away from users and potentially make the system simpler to use.

3.6.3 Affordances

Affordance is a concept introduced by Norman [63] and refers to the properties of an object that suggest how it can be used. In terms of interface design, affordances are what guide user expectations. For example, if the interface contains a button that looks clickable, then it suggests an action will be performed when it is pressed. Cooper et al. [67] expand on the idea of affordances when it comes to digital interface design. They argue that the perceived functionality of an interface element may not always align with its actual functionality. Therefore, they suggest that what matters most is not the interface element's true functionality, but instead how users perceive it to work based on past experiences.

This issue becomes more complex in the case of DirectorAI, where affordances extend beyond interface elements such as buttons or icons to the functionality of the underlying LLM. While the interface contains familiar elements, such as a revert button that undoes actions or a chat log that records user and AI messages, the assistant itself carries a more complex affordance. DirectorAI is presented as a helpful and intelligent assistant, and if users perceive the assistant as an expert, or as something they can delegate all work to, breakdowns in usability and trust will begin to occur, since the assistant will inevitably fail at certain tasks.

3.6.4 Mental Models

Another important concept introduced by Norman [63] is that of mental models. Mental models refer to the internal understanding people form about how a system works. They can be shaped by prior experience, cues from the system, and guidance from other people. While these models are not always accurate, they guide how users interact with a system and what they expect it to do.

When it comes to DirectorAI, mental models become particularly important, as users may approach the assistant with preconceptions from previous experiences with other AI-based assistants or timeline-based scripting tools. When a user's mental model differs from how the assistant actually functions, especially regarding the capabilities and limitations of the underlying LLM, confusion and frustration may arise. As such, considering users' existing mental models, and supporting the formation of accurate new ones, is an important part of designing DirectorAI.

3.6.5 Usability Heuristics

Established interaction design heuristics include Nielsen's 10 Usability Heuristics for User Interface Design [68]. While we do not consider all of them in this project, it is worth briefly mentioning a few that are particularly relevant for DirectorAI's design. These include the heuristic about visibility of system status and providing timely feedback (H1). Nielsen also emphasizes the importance of user control, such as being able to undo actions (H3). Another relevant heuristic is the availability of help and documentation (H10). Perhaps the most central one, however, is the idea that systems should speak the user's language, avoiding overly technical jargon (H2).
This directly relates to the core of what we aim to address with DirectorAI, as many of Director's usability issues stem from the overuse of technical terminology.

4 Methods

This project used both design and technical methods in parallel, allowing insights from one to inform decisions in the other throughout the process. The design methods will follow an interaction design framework, ensuring that the development of AI-driven features is grounded in user needs, usability principles, and iterative refinement. Simultaneously, the technical methods will focus on implementing and integrating AI technologies, such as natural language processing, within the constraints of the existing 3D simulation software. By combining these two perspectives, the project aims to create a solution that is feasible within the system's technical limits while directly addressing user needs.
For this thesis, where clarity, traceability, and a well-defined structure for analysis and reporting were important, the Double Diamond offered a more appropriate balance between flexibility and methodological rigor.

Goal-Directed Design, introduced by Alan Cooper, focuses heavily on identifying user goals and creating solutions that align with these goals through persona-based design [73]. This approach offers deep behavioral insight and works particularly well when designing tools for specific, repeatable tasks. However, this project aimed to explore emergent creative behaviors, many of which were discovered through early exploratory studies or only appeared later in the prototyping phases. As such, a rigid goal-driven approach risked narrowing the design space too early.

Lean UX, on the other hand, emphasizes rapid experimentation, minimum viable products, and continuous iteration in fast-moving product teams [74]. While ideal in agile, startup-like settings with quick user feedback loops, its weak emphasis on discovery and exploration makes it less suitable for early-stage research into novel technologies, especially given that our primary goal was understanding user needs rather than optimizing conversion metrics or feature releases. Lean UX also lacks the Double Diamond's clear separation between the problem space and the solution space, a separation that enables a thorough understanding of users' issues before moving on to prototyping, something we considered valuable.

In contrast to these alternatives, the Double Diamond model was chosen because it supported a comprehensive, structured exploration of both the problem space and the solution space [70]. It allowed the project to remain grounded in user needs while still leaving room for creative and conceptual exploration, particularly around the emerging role of AI in design workflows. Its staged structure also enabled clear documentation of each phase, which was beneficial for the transparency and reflection expected in a thesis. The Double Diamond's four phases (Discover, Define, Develop, and Deliver, as seen in Figure 4.1) offered a methodical yet open-ended process that aligned well with the project's aim to investigate, understand, and meaningfully contribute to how users interact with AI assistants in creative but technically complex tools.

Figure 4.1: A visualization of the Double Diamond design framework. Adapted from [70].

4.1.1 Discover Phase

In the Discover phase, we planned to familiarize ourselves with Volvo Cars' 3D vehicle simulation software and Director in order to understand their functionality and limitations. This included informal observations of developers and users working with the tools to identify common workflows, challenges, and workarounds. While full ethnographic studies provide deep insights into user behavior [75], they are often time-intensive and require prolonged immersion in the setting. Given the constraints of this project, we used a more lightweight ethnographic approach [76], focusing on observations and contextual inquiry rather than extensive fieldwork.

To supplement these observations with broader user data, we conducted a survey targeting existing users of the software. Surveys are useful for reaching a larger participant pool and identifying common trends [77], though they lack the depth and follow-up potential of direct user interaction. The survey focused on understanding the software's primary use cases and the challenges users faced.
While interviews could have provided richer qualitative data, a survey allowed us to efficiently gather input from a larger and more diverse set of users.

We then conducted usability testing following the guidelines of Rubin and Chisnell [78]. The motivation for these studies was to identify the specific challenges users encounter while performing tasks in Director. During the studies, we collected qualitative data through a combination of think-aloud protocols to capture users' thought processes, direct observations of hesitations, confusions, and workarounds, and post-task interviews to explore their experiences in more depth. More advanced usability evaluation methods, such as controlled laboratory studies or cognitive workload assessments, could have provided deeper insights into user performance [79]. However, these methods often require specialized equipment, controlled environments, or long testing sessions, which were beyond the practical scope of this study. Instead, we prioritized methods that provided rich, in-depth, and actionable feedback within real-world constraints and use cases, ensuring that the findings could directly inform the design process.

Our usability tests involved both experienced and less experienced users of the software, as well as individuals with no prior exposure. This was important because interaction design should not only optimize experiences for existing users but also lower entry barriers for new ones [63]. While longitudinal studies could have offered insights into learning curves and long-term adoption, time constraints made this impractical, so our evaluation focused on immediate usability and emergent behavior rather than long-term user adaptation.

4.1.2 Define Phase

In the Define phase, we analyzed the collected data to refine our understanding of usability challenges and user needs. This involved thematic coding to identify recurring patterns in survey responses and usability test findings. This approach was chosen because thematic coding provides a structured but adaptable way to interpret qualitative data, allowing us to translate user feedback into clear and relevant insights [80]. These insights then informed our design and were important for ensuring that the resulting prototypes addressed actual pain points and aligned with user behavior and needs, rather than being based on technical possibilities or novelty [63].

While participatory design workshops could have helped refine our design further, we chose to rely on the identified codes and issues instead. This decision was made to keep the process focused and simple, especially given time and resource constraints. By basing our design work and prototypes on direct observations and user feedback, we were nonetheless able to capture a range of perspectives.

Personas are another method we considered during this phase. Introduced by Cooper [81] and later developed further by Pruitt and Grudin [82], personas are fictional characters based primarily on qualitative data gathered in the Discover phase. They are used to represent groups of users or stakeholders and typically include goals, workflows, needs, and pain points. We initially planned to use personas, or were at least open to the idea, but the Discover phase quickly showed how broad and diverse the user base was. It included everyone from software engineers and function architects to 3D artists and analysts.
In practice, most participants had differing goals and use cases for Director, making it difficult to generalize them into one or even a few personas. There was also the risk that designing for a few personas might lead to solutions optimized only for the workflows those personas represent. Instead, we chose to focus on the most prominent pain points and recurring issues that appeared across user groups.

4.1.3 Develop Phase

During the Develop phase, we began creating and refining prototypes to test and validate our design concepts. Prototyping is often essential in interaction design, as it enables iterative testing and adjustment before committing to full-scale implementation [83]. Rather than starting with wireframes or paper sketches, we chose to move directly to more functional prototypes, allowing us to quickly assess the feasibility of our ideas in the context of the actual Director software. This approach ensured that our early design work was grounded in real-world constraints and technical considerations, providing more immediate insights into user needs and possible solutions.

Among the prototyping methods we considered was the Wizard of Oz technique, which involves faking prototype functionality to gather user feedback without investing extensive development time [79]. However, we quickly encountered a core issue: what kind of AI behavior should we simulate? A best-case scenario? Worst-case? Something in between? At this stage, we had no clear idea of how the AI would actually perform or how it would respond to user prompts. Instead, we chose to build simple yet functional prototypes to test feasibility and gather user feedback using a real LLM, which was especially important given the unpredictable nature of LLMs in a niche, proprietary system like Director.

Additionally, while a more complete participatory co-design method can be useful for aligning solutions with real-world use cases [84], it was not the primary approach in this project. Mainly due to limited user availability, we adopted a lighter form of co-design, relying on short, targeted feedback sessions throughout development. User feedback remained crucial, and this approach allowed us to incorporate user input effectively, aligning the design with real-world workflows without overcomplicating the early stages or placing too much burden on participants.

4.1.4 Deliver Phase

Finally, in the Deliver phase, we evaluated our prototype to understand how DirectorAI impacted the user experience and in which ways it supported scenario creation. This evaluation focused on gathering qualitative insights through user studies, where methods such as think-aloud protocols and post-task interviews were used to capture the diverse ways in which participants interacted with the assistant. Rather than simply measuring performance improvements, our goal was to explore how DirectorAI influenced user workflows, reduced cognitive load, and supported creative exploration, providing an understanding of the design's practical impact.

By following this structured process, we ensured that the developed solution, DirectorAI, was grounded in user research and iterative refinement. At each stage, we tried to balance methodological depth with practical constraints, making deliberate trade-offs based on which methods best fit our time constraints, user access, and the goals of the project.
4.2 Technical Methods

This section explains technical methodologies relevant to adapting LLMs to specific contexts. Understanding these techniques provides important context for the design choices made and the specific methods used in this thesis, particularly the project's focus on prompt engineering for DirectorAI.

4.2.1 Fine-tuning

Fine-tuning refers to the process of adapting a pre-trained LLM to a specific task or domain by further training it on a targeted dataset. This approach adjusts the model's parameters to enhance performance for applications such as text classification, question answering, or scientific text generation [35]. Typically, fine-tuning employs supervised learning, where labeled data is used to minimize a task-specific loss function. For example, fine-tuning BERT on domain-specific corpora significantly improves its accuracy in tasks like natural language inference [35]. The process requires high-quality labeled datasets and substantial computational resources, which can be a limitation in certain contexts [85].

Instruction Fine-Tuning

GPT models can be fine-tuned using structured prompt-response datasets, aligning model outputs with desired behaviors. This process is often enhanced by reinforcement learning from human feedback to promote safety, usefulness, and coherence [86].

Parameter-Efficient Adaptation

Approaches like Low-Rank Adaptation (LoRA) update only a subset of model parameters, allowing resource-efficient customization for domain-specific deployments [87].

4.2.2 Prompt Engineering

Prompt engineering involves designing input prompts to guide a pre-trained LLM to produce desired outputs without modifying its parameters. This technique leverages the model's existing knowledge, making it resource-efficient for tasks such as text generation or sentiment analysis [36]. Strategies such as few-shot learning, where prompts include a few task examples, or zero-shot learning, where only task instructions are provided, enhance model performance [36]. Prompt engineering is particularly valuable in scenarios requiring rapid adaptation to new tasks with minimal data, though its effectiveness depends on the model's generalization capabilities [88].

Zero-Shot and Few-Shot Learning

Zero-shot learning (ZSL) enables LLMs to perform tasks without prior task-specific training or examples, relying solely on a descriptive prompt [36]. For instance, instructing a model to "Classify the sentiment of this review as positive or negative" without providing examples tests its ability to generalize from pre-trained knowledge. ZSL is particularly useful for rapid task adaptation, but it can fail in domains requiring specialized knowledge or nuanced reasoning [39].

Few-shot learning (FSL), by contrast, includes a small number of task examples within the prompt to guide the model [36]. For example, a prompt might include: "Example: 'Great product!' → Positive. 'Terrible service.' → Negative. Now classify: 'Amazing experience!'". FSL often outperforms ZSL by providing context that aligns the model's output with the desired format or style [88]. One-shot learning, a special case of FSL with a single example, serves as an intermediate approach. Both ZSL and FSL are forms of in-context learning, where the model learns from the prompt context without weight updates [89]. The effectiveness of ZSL and FSL depends on prompt clarity, the model's pre-training data, and task complexity.
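To make the difference concrete, the fragment below sketches how the zero-shot and few-shot sentiment prompts above could be assembled and sent to a chat-completion API. It is an illustration of the pattern rather than code from DirectorAI; it assumes the openai Python client (version 1.0 or later) with an API key available in the environment, and uses the GPT-4o-mini model that is also employed later in this project.

# A minimal sketch of zero-shot vs. few-shot prompting.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

zero_shot = (
    "Classify the sentiment of this review as positive or negative: "
    "'Amazing experience!'"
)

few_shot = (
    "Example: 'Great product!' -> Positive. "
    "Example: 'Terrible service.' -> Negative. "
    "Now classify: 'Amazing experience!'"
)

for prompt in (zero_shot, few_shot):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)

The only difference between the two calls is the prompt text: the few-shot variant prepends worked examples that anchor the output format, with no change to the model itself.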
Recent studies suggest that larger models, such as GPT-3 and its successors, exhibit stronger zero-shot and few-shot capabilities due to their extensive training corpora [90]. However, performance can degrade on tasks that require deep reasoning or domain-specific expertise, necessitating advanced techniques such as prompt chaining or fine-tuning [88].

Prompt Chaining

Prompt chaining is a technique that decomposes complex tasks into a sequence of smaller, interdependent prompts, where the output of one prompt serves as input for the next [90]. This approach mitigates the limitations of single-prompt interactions by structuring tasks into manageable steps, improving accuracy and coherence. For example, to generate a business plan, a chain might include:

1. "List the key sections of a business plan."
2. "Write an executive summary for a company with the following description: [input]."
3. "Draft a market analysis based on the executive summary: [output from step 2]."

Prompt chaining is particularly effective for multi-step reasoning tasks, such as planning, code debugging, or report generation [90]. Related to prompt chaining is the concept of chain-of-thought (CoT) prompting, which encourages the model to articulate intermediate reasoning steps within a single prompt [90]. For instance, a CoT prompt might state: "Solve this math problem and explain each step." While CoT focuses on reasoning within one prompt, prompt chaining distributes reasoning across multiple prompts, making it suitable for tasks requiring iterative refinement or modular outputs [90]. A more advanced variant, tree-of-thought (ToT) prompting, explores multiple reasoning paths in parallel and selects the optimal one, often implemented through chained prompts [91].

Prompt chaining requires careful design to ensure alignment between steps and to prevent error propagation. Automated workflows, where outputs are programmatically fed into subsequent prompts, can streamline this process, particularly in API-based applications [90].

Meta-Prompting

Meta-prompting involves crafting prompts that instruct the model to design, evaluate, or optimize prompts before addressing a task [92]. This higher-level approach leverages the model's self-reflective capabilities to improve prompt quality. For example, a meta-prompt might state: "Write an optimized prompt to elicit a detailed summary of a scientific article." Alternatively, it could ask: "Review this prompt for clarity and suggest improvements: [insert prompt]." Meta-prompting is particularly valuable for users with limited prompt engineering expertise or when tackling novel tasks [92].

A related concept is self-consistency, where the model generates multiple responses to a prompt and selects the most consistent or highest-quality output [Wang et al., 2023]. Meta-prompting can orchestrate self-consistency by instructing the model to compare outputs and refine its approach. Another related technique, reflexive prompting, asks the model to reflect on its reasoning process or prompt effectiveness, e.g., "Why did this prompt yield a vague response, and how can it be improved?" [92].

Meta-prompting can be combined with prompt chaining to create dynamic workflows. For instance, a meta-prompt might generate an initial prompt, which is then used in a chained sequence, with subsequent meta-prompts refining the process based on intermediate results [90].
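The sketch below makes this combination concrete. A hypothetical ask() helper wraps a single chat-completion call (again assuming the openai Python client and GPT-4o-mini); a meta-prompt first generates a task prompt, the chain then applies it to an input, and a reflexive prompt finally reviews it. The helper name and the placeholder article text are our own illustrative assumptions, not part of any published API or of DirectorAI.

from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    # Hypothetical helper: one prompt in, one text response out.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Step 1 (meta-prompt): have the model design a prompt for the actual task.
generated_prompt = ask(
    "Write an optimized prompt to elicit a detailed summary of a scientific article."
)

# Step 2 (chained use): apply the generated prompt to a concrete input.
article = "..."  # the article text would be supplied here
summary = ask(generated_prompt + "\n\nArticle:\n" + article)

# Step 3 (reflexive refinement): ask the model to critique its own prompt.
critique = ask(
    "Review this prompt for clarity and suggest improvements: " + generated_prompt
)

Because each output is passed on programmatically, intermediate results can be validated or edited before the next call, which helps contain the error propagation discussed above.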
Meta-prompting does, however, require precise phrasing to avoid confusion, as the model must balance the meta-task (e.g., prompt design) with the actual task.

4.2.3 Other Relevant Concepts

Several additional concepts improve prompt engineering, particularly for complex or reasoning-heavy tasks:

• Self-Consistency: As mentioned, self-consistency involves generating multiple outputs and selecting the best one, often improving performance on tasks such as mathematical reasoning or factual question answering [Wang et al., 2023]. This technique can be integrated into prompt chaining or meta-prompting workflows to ensure robust outputs.

• Automated Prompt Engineering: Frameworks like DSPy [93] programmatically optimize prompts by iterating over prompt variations and evaluating performance metrics. Unlike meta-prompting, which relies on the model's natural language capabilities, automated prompt engineering uses computational optimization, making it suitable for large-scale applications.

• Temperature and Top-k Sampling: Model parameters like temperature (controlling response randomness) and top-k sampling (limiting token selection to the k most likely options) indirectly influence prompt engineering outcomes [39]. For instance, lower temperature values (e.g., 0.5) produce more deterministic outputs, while higher values (e.g., 1.0) increase creativity. Prompt engineers must account for these parameters when designing prompts, especially for tasks requiring specific tones or styles.

• Agentic Workflows: Advanced prompt engineering can emulate agent-like behavior, where the model autonomously decides subsequent steps based on prior outputs [90]. For example, a meta-prompt might instruct the model to "Generate a plan, execute each step, and adjust based on results." Such workflows combine prompt chaining, meta-prompting, and self-consistency to create dynamic, goal-oriented interactions.

4.2.4 Retrieval-Augmented Generation (RAG)

RAG is a hybrid framework that combines information retrieval with natural language generation to enhance LLMs. Introduced by Lewis et al. [94], RAG integrates parametric knowledge (encoded in model weights) with non-parametric knowledge (retrieved from external sources) to improve factual accuracy and contextual relevance.

In the RAG pipeline, a query initiates the retrieval of relevant documents from an external knowledge base, typically a vector-indexed corpus. Retrieval uses dense embedding models such as BERT [35] or Dense Passage Retrieval (DPR) [95] to identify semantically similar content. The retrieved information is then appended to the query and passed to a generative model, enabling outputs grounded in both learned and external knowledge.

This architecture mitigates key limitations of standard LLMs, particularly hallucinations and static knowledge, by conditioning responses on verifiable sources [96]. It is especially effective for domain-specific or private queries, where fine-tuning would be impractical [40].

Nevertheless, RAG presents challenges such as retrieval noise, increased inference latency, and the need to balance contributions from parametric and non-parametric knowledge [97]. Active research explores solutions including domain-specific retriever tuning, structured data retrieval, and improved fusion mechanisms [98]. RAG's modular design also opens avenues for scalable updates and theoretical advances in memory, reasoning, and contextual language modeling.
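The sketch below illustrates the retrieve-then-generate pattern at its smallest scale: a handful of documents are embedded, the entry most similar to the query is retrieved by cosine similarity, and generation is conditioned on it. The embedding model name and the Director-style documents are illustrative assumptions; a production system would use a proper vector index (e.g., FAISS) rather than a linear scan.

import math
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    # Dense embedding of a text; the model name is illustrative.
    return client.embeddings.create(
        model="text-embedding-3-small", input=text
    ).data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Tiny illustrative knowledge base; real corpora are vector-indexed.
corpus = [
    "Door FL Open: directly opens the front-left (driver's side) door.",
    "hornSoundOn: activates the horn; a boolean parameter switches it on or off.",
]
index = [(doc, embed(doc)) for doc in corpus]

query = "How do I open the driver's door?"
query_vec = embed(query)
best_doc, _ = max(index, key=lambda pair: cosine(query_vec, pair[1]))

# Generation conditioned on retrieved, non-parametric knowledge.
answer = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": "Context: " + best_doc + "\n\nQuestion: " + query,
    }],
).choices[0].message.content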
4.3 Time Plan

Figure 4.2 shows the approximate time plan for the project.

Figure 4.2: Time plan of the thesis as a Gantt chart.

5 Process

This chapter outlines how we designed and built DirectorAI, using a user-centered approach to tackle the usability issues present in Director. It covers the initial scoping and problem discovery phases, the iterative design of the prototype, and the technical challenges involved in integrating AI into a complex, domain-specific system.

5.1 Preliminary Scoping and Feasibility

Before conducting formal user studies, we began with an initial exploration of the technical landscape surrounding Director. This phase took place early on, while we were still getting to know the tools and gathering participants for our studies. It played an important role in shaping the project's direction and helped ground the rest of our process in what was technically feasible and realistic within the time frame and scope of the project.

Our approach involved reviewing the available source code for Director, identifying which parts of the software we could work with and how, and mapping out the structure of the data we had access to. We also considered how Director connects to ProSim, the broader simulation platform it interfaces with. Early on, we determined that we did not have access to the full ProSim environment, which ruled out changes that would require modifying the underlying simulation software, such as camera control or environmental interactions. While such suggestions might arise later as potential improvements, it was important to understand that they would fall outside the bounds of what we could realistically address.

This early scoping also gave us a clearer understanding of the kind of AI functionality we might be able to implement. Initially, we considered possibilities like training a custom model to support scenario creation. However, we quickly realized that this would require access to a large, labeled dataset of simulation scenarios, which did not exist. Creating such a dataset would have been a major undertaking in itself and was not feasible given our resources. As a result, we began focusing on how we could apply existing LLMs to augment the scenario creation process without custom model training.

Ultimately, this stage helped us set clear boundaries for the project. It allowed us to better interpret user feedback later in the process, as we could distinguish between desirable but impractical features and problems that could actually be addressed in a meaningful way. It also informed the way we designed our user studies, allowing us to focus on pain points within the Director interface and scenario creation workflow that we knew we could feasibly explore and improve. In this sense, the preliminary scoping phase became a short but important part of the design process, shaping both the direction of our research and the tools we would later design and prototype.

5.2 Problem Discovery

To gain a preliminary understanding of how Director and ProSim are used, we conducted a short survey targeting Volvo Cars employees with either experience in these tools or an interest in their development. The goal was to identify usage patterns, common tasks, and any early frustrations or wishes users had. This was important to ensure that the design work would be grounded in real-world usage.
The survey consisted of open-ended questions, such as:

• "What do you use ProSim for?"
• "Which features do you use most?"
• "Have you used Director, and if so, for what?"
• "What features do you like about Director?"
• "What challenges or feature requests do you have?"

A final yes/no question asked whether the respondent would like to participate in future user studies on the matter.

While the response count was limited (nine respondents), and the answers were mixed and often vague, a few clear themes emerged:

• ProSim was used for a wide range of purposes, from visualizing scenarios and FSM behavior to running driver studies and showcasing car functionality.
• Of the few respondents who had used Director, several mentioned frustrations with configuring and timing actions, adjusting signals, and managing scenario sequences.

Although the survey did not reveal strong patterns, it hinted at a general sense of complexity and inconsistency in how the tools were used. This initial impression was further supported by early feedback and observational data from experienced users and developers, which indicated that users frequently encountered friction during scenario creation, particularly when trying to locate signals and understand their functionality.

To assess the scope and details of these issues, we conducted an initial structured user study with both experienced and potential users of ProSim and/or Director. Survey respondents who had answered yes to the question about participating in a user study were recruited to take part. The overarching goal was to investigate the potential for AI-powered enhancements based on users' pain points and overall experiences.

The user studies, described in more detail in Appendix A, were conducted with eight participants, and each session lasted approximately 30 minutes. First, participants received a brief tutorial on the functionality of both ProSim and Director to ensure a shared baseline of understanding, acknowledging that prior experience and proficiency with the software varied. Participants were then given time to explore Director independently to become more familiar with its interface and features. During the study, participants were encouraged to think aloud by verbalizing their thoughts and decision-making processes, helping us gather as much insight as possible into their behavior, challenges, and perceptions.

The main task required participants to create a short scenario involving:

• Changing camera angles and positions
• Opening and closing vehicle doors
• Activating the car horn for a specified duration

This scenario was chosen because it involved common, easy-to-understand signals that also varied in type and parameters. Minimal guidance was provided to ensure a fair and unbiased representation of each participant's experience.

The data documented during the user studies was primarily qualitative, including observations of hesitation, confusion, and general behavior, and it provided nuanced insights into usability issues. In addition, after the tasks were completed, participants were interviewed about their experiences, focusing on what aspects of the software they found satisfying or frustrating.
From these interviews and our observations, three major categories of user frustration emerged:

• Difficulty finding signals
• Lack of information on how signals and their parameters work
• Limited feedback and unclear information about how the program functions overall

Some of the identified issues stemmed from the underlying implementation of signals, elements that could not be directly addressed within Director alone. This prompted careful consideration of which problems could realistically be solved at the level of Director and which would require changes to the broader software ecosystem.

5.3 Problem Definition

Following our user studies, we analyzed both observational and interview data to identify the most pressing usability issues in Director. Our insights were primarily derived from qualitative methods, particularly think-aloud protocols and post-task interviews. Thematic coding was applied to the interview transcripts, allowing us to identify recurring patterns, frustrations, and pain points experienced by users.

5.3.1 Data Analysis

One of the most frequently observed issues was the difficulty participants faced in locating the correct signals when creating scenarios. This challenge was evident both in observed behavior and in participant feedback. Participants often attempted to search for signals using natural language terms such as driver's door or trunk, only to receive no results, as Director requires queries to match system-specific signal names like Door FL Open or tailgate. The absence of synonym support or semantic flexibility created friction, particularly for non-expert or occasional users unfamiliar with the tool's naming conventions. Even experienced users reported that the need to recall or uncover exact terminology interrupted their creative flow.

The full list of identified usability themes, along with the frequency of their occurrence across sessions, is presented in Figure 5.1. This visualization offers an overview of the most prominent barriers to effective interaction, helping to contextualize the relative impact of this and other issues.

Figure 5.1: Distribution of code categories in interview responses.

These issues were compounded by inconsistencies in how signals were structured. Some actions were represented by discrete signals (e.g., Door Open and Door Close), while others, like the horn, used a single signal (hornSoundOn) with parameters to control behavior. Users frequently searched for signals that did not exist, such as hornSoundOff, not realizing the action was handled differently. This lack of consistency increased cognitive load and hindered intuitive interaction.

A closely related concern was the configuration and interpretability of signal parameters. Although less commonly raised in interviews, this issue became apparent during observations and through our own experience using the tool. Parameters sometimes used technical labels (e.g., a parameter named boolean), which confused non-technical users. Others, such as color configuration for lights, used ambiguous input formats: RGB values required floats (0–1), but users often entered integers (0–255), leading to unexpected behavior. Without clearer information or feedback, even simple parameter adjustments became a point of friction.
Something worth noting is that codes like 'Frustration with signal parameters' appear in the lower half of the identified codes, yet parameter configuration became a major focus during the design and implementation of DirectorAI. There are a few reasons for this. One is that some user complaints or requests related to features that were simply not feasible for us to implement. For example, one 'Desired feature' was the ability to "fly" in ProSim, whereas the simulation currently only supports walking. Implementing these types of features was beyond our scope, especially since we did not have access to ProSim's source code. Another reason signal parameters became a focus is that codes like 'Lacking information' often referred to missing or inadequate documentation and metadata related to parameters, which made them difficult for users to understand or use effectively.

Table 5.1: Selected participant quotes

P1: "There is no guide for how the signals are different from one another. A lot of signals have unclear info."
    "The camera doesn't work as expected. And it's unclear what the difference between the camera and ego character is."

P2: "It lacks info for what signals do and what the parameters mean."
    "Some signals do just one thing but some do multiple things."

P3: "Improve the way to find signals, [it needs to] match with what I actually want to do."

P4: "You need that instant feedback. By the time I finally have a scenario ready, the requirements can have changed."
    "This type of work must be more intuitive. Like a car is."

P5: "[Other teams] always need to ask experts for help, because the tool is too technical for them."

P6: "There's so much potential here, but you need to make it faster."
    "It should be as simple as possible, so more people can try."
    "Getting started is hard, so many people skip it."

Beyond these specific issues, participants also expressed a broader sense of unfulfilled potential. Several interviewees envisioned using Director for fast-paced ideation or live scenario building during team discussions but found the program too rigid and slow to support such workflows. Others noted that while testing is valuable in Director and ProSim, it is often skipped due to the complexity and time investment required to build even simple scenarios. Selected participant quotes that influenced our understanding of Director's workflow issues and the subsequent work on DirectorAI can be seen in Table 5.1.

These findings led us to a clear problem framing: Director, while powerful, is limited by its reliance on system-specific syntax, technical terminology, and rigid workflows. These constraints make the tool difficult to access for new, occasional, or non-technical users, and slow even for experienced ones. This challenge is not unique to Director but indicative of a broader issue in many complex software tools: the gap between user intent and system operation.

5.3.2 How AI Could Help

At this point, we began exploring how AI, specifically natural language processing (NLP), might help bridge this gap. The idea was planted early in the project, as we reflected on how similar issues appear in other domains, such as software development, data analysis, and circuit design [99]–[102]. Users may know what they want to do, but not necessarily how to express it in system-specific terms, or they may benefit greatly from the automation AI provides.
AI interfaces have increasingly shown promise in translating natural human requests into structured commands, enabling smoother interaction with complex systems. This potential aligned closely with the challenges we uncovered in Director. AI could help users find signals even when they use imprecise or conversational terms, for example interpreting driver's door as Door FL Open, or mapping boot to tailgate. Language models can infer synonyms, correct misspellings, and interpret user intent, especially when trained or prompted with domain-specific context. Moreover, AI can act as a knowledge proxy: we can encode expert knowledge into prompts, descriptions, or examples, allowing the AI to answer questions and provide guidance without requiring users to rely on internal documentation or to interrupt their workflow by asking a colleague.

To further explore the feasibility of our approach and to better understand where the current issues stem from, we consulted Director's developers. These stakeholders provided valuable insights into why certain usability issues persist in the current software. When asked why documentation improvements had not been addressed, the developers pointed to the sheer scale of the system: Director has approximately 900 unique signals, and updating and standardizing all of them, along with their metadata and parameters, would be extremely time-consuming. Due to resource limitations and other focus areas, this has not been a priority.

Regarding the technical language used in Director, the developers explained that it aligns with their workflows and with how the software has historically been used. For example, a value like "HornSoundOn" with a boolean parameter controlled by true/false makes perfect sense to a software engineer. As a result, updating the structure or terminology of signals might improve clarity for some users but risks inconveniencing others who rely on, or have adapted to, the existing conventions.

These conversations reinforced our belief that AI could offer a viable alternative, not by replacing Director's structure, but by sitting between the user and Director's functionality. Rather than requiring extensive documentation rewrites, restructuring Director's backend, or disrupting existing workflows, AI could provide a semantic bridge that interprets the user's requests, selects relevant signals, and assists in parameter configuration.

In summary, our problem definition rests on three main points:

• Discoverability: Users struggle to find the right signals using natural language or intuitive phrasing.
• Interpretability: Signal parameters are difficult to understand or configure correctly without technical knowledge.
• Speed and accessibility: Scenario creation is slower and more effortful than it needs to be, especially for less technical users.

These challenges pointed us toward a solution that leverages AI to interpret user intent, embed expert knowledge, and reduce cognitive load, making Director more accessible without requiring a fundamental redesign. The sketch below illustrates the semantic-bridge idea in its simplest form.
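The following fragment is a deliberately simplified sketch, not DirectorAI itself: it embeds a small, illustrative catalogue of signal names and descriptions in a system prompt and asks a general-purpose model to select the signals matching a conversational request. It anticipates the early prototype described in Section 5.4.2 and again assumes the openai Python client with GPT-4o-mini; the signal descriptions are our own paraphrases for illustration.

from openai import OpenAI

client = OpenAI()

# Illustrative catalogue; the real Director exposes roughly 900 signals.
signals = {
    "Door FL Open": "Directly opens the front-left (driver's side) door.",
    "tailgate": "Opens or closes the tailgate (boot/trunk).",
    "hornSoundOn": "Activates the horn; a boolean parameter switches it on/off.",
}
catalogue = "\n".join(f"- {name}: {desc}" for name, desc in signals.items())

system_prompt = (
    "You map user requests for a vehicle simulation onto signals.\n"
    "Recommend only signals from the following list, and explain any parameters:\n"
    + catalogue
)

reply = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "open the driver's door and honk the horn"},
    ],
).choices[0].message.content
print(reply)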
5.4 Design and Implementation

This section presents the design process and implementation details of our AI assistant for Director. The goal of this tool is to improve the usability of scripting 3D simulation scenarios by allowing users to interact with Director via natural language. We begin by introducing terminology used throughout this section, followed by a breakdown of the assistant's architecture, interaction modes, and AI prompt strategies. Lastly, we outline the specific components developed during this project.

5.4.1 Terminology

To ensure clarity, we define several terms used throughout this section:

• Signal: A command issued from the Director program to ProSim, where the 3D simulation occurs. Each signal controls a specific aspect of the simulation, such as opening and closing car doors, adjusting the camera, or modifying weather conditions. Signals are added to Director's timeline as actions, each with a defined start time and associated parameters. When the timeline is played, these actions are triggered at the specified times, sending the corresponding signal and its values to the simulation.

• AI Mode: A selectable interaction context that defines the behavior of the AI assistant. Each mode corresponds to a distinct use case (e.g., camera positioning or signal editing) and determines how user prompts are interpreted and processed by the assistant and LLM.

• Generator: A modular processing unit responsible for two tasks: (1) composing and sending a task-specific system prompt to the LLM, and (2) handling the returned data in a structured way to produce meaningful changes in the simulation environment.

• Classifier: An AI module that dictates which AI generator is most appropriate for the given input.

• Pipeline: A chained sequence of generators and classifiers that transforms a user prompt into application-specific output. Pipelines allow for multi-step processing, such as refining AI outputs through further generators or user interactions.

5.4.2 Early Attempts: Simple Prompting

Our initial approach was simple: we added a chat window to the existing interface where the user could enter a prompt, which we then passed to the AI. The user's prompt, along with the entire list of available signals and their descriptions, was sent to a GPT-4o-mini model via an API request. The model was instructed to recommend the most appropriate signal based on the prompt.

Despite its simplicity, this worked surprisingly well in many cases. The model could often infer what the user wanted and suggest one or several relevant signals. However, several limitations quickly became apparent. First, signal descriptions alone were often not detailed enough to make an informed decision. Two signals might appear similar and have similar descriptions, for example 'Door Open FL' and 'Ext Door Handle R1 L Ui', but behave quite differently: the first directly opens the door, while the second simulates a user pulling the handle, which may fail if the door is locked. Secondly, even when the correct signal was found, users still had to manually configure its parameters, sometimes without knowing what the parameters meant or how they worked. Third, some signals required contextual or spatial understanding not conveyed in their metadata. For example,