Designing and Developing DirectorAI: An AI Assistant for Generating Vehicle Simulation Scenarios

Creating an effective AI Assistant for complex, domain-specific tasks without domain-specific training data

Master's Thesis in Interaction Design & Technologies
Master's Thesis in Computer Science – Algorithms, Languages and Logic

Adam Telles and Hannes Raaholt Larsson

Department of Computer Science and Engineering
Chalmers University of Technology
University of Gothenburg
Gothenburg, Sweden 2025

© Adam Telles, Hannes Raaholt Larsson, 2025.

Supervisor: Morten Fjeld, Interaction Design and Software Engineering
Advisor: Anders Tell, Volvo Cars
Examiner: Staffan Björk, Interaction Design and Software Engineering

Master's Thesis 2025
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Stylized mockup of the DirectorAI chat interface, overlaying a scene generated with DirectorAI and shown in Volvo Cars' Product Simulator.

Typeset in LaTeX
Gothenburg, Sweden 2025

Abstract

This thesis investigates the use and potential benefits of AI in automating the generation of vehicle simulation scenarios. Focused on enhancing the usability of Director, a scripting tool for Volvo Cars' Product Simulator software, this project involved designing and developing DirectorAI, an AI assistant featuring a chatbot interface. Using large language models, the research explored how to create effective and reliable scenarios without domain-specific model training. Moreover, the project identified limitations that emerge when deploying general-purpose language models in complex, domain-specific environments, alongside design patterns that enable these models to function effectively as assistants. The primary research question addressed was: "What design choices or patterns enable general-purpose language models to function as effective assistants in complex, domain-specific software environments without domain-specific training?" The outcome and findings demonstrated that the success of general-purpose LLMs in domain-specific environments relies less on model modification and more on how the system is designed to supply the model with relevant information. By iteratively crafting system prompts that embed domain context, constraints, and examples, DirectorAI was able to perform effectively without the need for custom training or fine-tuning.
Through prototyping and user evaluation, several key design patterns were identified that enabled the assistant to support complex workflows within the existing simulation software. This research emphasized the importance of interaction design in shaping the utility and usability of AI-assisted systems. By identifying and analyzing the design choices and patterns that facilitate the effective use of general-purpose LLMs in domain-specific environments, this thesis contributes to the understanding of how AI-assisted tools can be developed for complex simulation scenarios, offering valuable insights for future applications. Ultimately, this study demonstrated the potential of AI to significantly improve the efficiency and reliability of vehicle simulation scenarios, with implications for the automotive industry and beyond.

Keywords: Large Language Models (LLMs), Chatbots, Interaction Design, Computer Science, Vehicle Simulation Scenarios, AI Assistant, AI Adaptation, Prompt Engineering.

Acknowledgements

First and foremost, we are grateful for the opportunity to conduct our Master's thesis at Volvo Cars and for the valuable industrial context provided.

We would particularly like to express our gratitude to our industry advisor, Anders Tell, whose support and guidance were instrumental in the completion of this thesis. He not only helped set up this project but also provided invaluable technical insights, practical advice, and encouragement throughout the process. His deep understanding of the complex systems underlying ProSim and Director greatly shaped our work.

We would also like to thank the ProSim team at Volvo Cars for their technical support and for sharing their expertise. Their insights into the practical workflows and use cases for Director and ProSim were crucial for grounding our work in real-world needs.

Additionally, we extend our thanks to the other employees at Volvo Cars who participated in our user studies. Their willingness to share their experiences and provide feedback was essential for our understanding of the challenges faced by users, significantly shaping the direction of our design work.

Finally, we would like to thank our academic supervisor, Morten Fjeld, for his guidance and support during the academic aspects of this thesis, providing us with valuable feedback and helping us navigate the research process.

Adam Telles and Hannes Raaholt Larsson, Gothenburg, 2025-06-23

Contents

List of Figures
List of Tables

1 Introduction
  1.1 Project Aim and Research Questions
  1.2 Project Context
  1.3 Simulation Software Environment: An Overview of Director, ProSim, and FSM

2 Background
  2.1 Chatbots
  2.2 AI in 3D Modeling, Video Editing and Content Generation
  2.3 Related Academic Work

3 Theory
  3.1 Limitations of Chatbots
    3.1.1 Implementation of Chatbots
  3.2 Large Language Models
    3.2.1 GPT Architecture and Transformer Mechanisms
  3.3 Natural Language Processing
  3.4 Transformers
  3.5 Software
  3.6 Interaction Design-Related Theory
    3.6.1 User-Centered Design
    3.6.2 Cognitive Load
    3.6.3 Affordances
    3.6.4 Mental Models
    3.6.5 Usability Heuristics

4 Methods
  4.1 Design Methods
    4.1.1 Discover Phase
    4.1.2 Define Phase
    4.1.3 Develop Phase
    4.1.4 Deliver Phase
  4.2 Technical Methods
    4.2.1 Fine-tuning
    4.2.2 Prompt Engineering
    4.2.3 Other Relevant Concepts
    4.2.4 Retrieval-Augmented Generation (RAG)
  4.3 Time Plan

5 Process
  5.1 Preliminary Scoping and Feasibility
  5.2 Problem Discovery
  5.3 Problem Definition
    5.3.1 Data Analysis
    5.3.2 How AI Could Help
  5.4 Design and Implementation
    5.4.1 Terminology
    5.4.2 Early Attempts: Simple Prompting
    5.4.3 Evolving the Design: Introducing AI Modes
    5.4.4 The Full Pipeline: A Structured AI Workflow
    5.4.5 Why a Pipeline?
    5.4.6 Technical Considerations and Trade-offs
    5.4.7 Why GPT-4?
    5.4.8 From Generation to Correction: Introducing the Edit Pipeline
    5.4.9 Design of the Edit Pipeline
    5.4.10 Trade-offs and Implementation Notes
    5.4.11 Enabling Safe Experimentation
    5.4.12 Switching Between Pipelines: Manual vs. Automatic
    5.4.13 The Dual Pipeline Architecture
    5.4.14 Improving Explainability
  5.5 Evaluation
    5.5.1 Participant Background and Preconceptions
    5.5.2 General Impressions
    5.5.3 Trust and Effectiveness
    5.5.4 Interaction Framework
  5.6 Post-Evaluation Adjustments

6 Results
  6.1 Overview of DirectorAI
  6.2 Findings on Supporting Research Question 1
    6.2.1 Limitations of LLMs in the Absence of Domain-Specific Context
    6.2.2 Literal Interpretations and Semantic Misalignment
  6.3 Findings on Supporting Research Question 2
  6.4 Findings on the Main Research Question
    6.4.1 Supporting Varied Prompting Strategies
    6.4.2 Support for Non-Destructive, Editable Workflows
    6.4.3 Assistants That Execute and Educate
    6.4.4 Feedback and System Visibility
    6.4.5 First Drafts as a Value Proposition
    6.4.6 Providing Domain Context Through System Prompts
    6.4.7 Summary

7 Discussion
  7.1 Interpreting Results
  7.2 Relating to Literature
  7.3 Project Framing and Evolution
  7.4 Reflections on the Process
  7.5 Design and Engineering Implications
  7.6 Implications for Creative Software
  7.7 Limitations
  7.8 Future Work
  7.9 Ethics

8 Conclusion

Bibliography

A User Study Protocol 1: Exploratory Study
  A.1 Participant Onboarding & Consent Process
  A.2 User Background
  A.3 Study Introduction: Context & Objectives
  A.4 Initial Training & Exploration Phase
  A.5 Introduce the Task
  A.6 Post-Test Interview
    A.6.1 General Impressions
    A.6.2 Task-Specific Feedback
    A.6.3 Interface and Usability
    A.6.4 Efficiency and Workflow

B User Study Protocol 2: Prototype Evaluation
  B.1 Participant Onboarding & Consent Process
  B.2 User Background
  B.3 Introduction to DirectorAI
  B.4 Scenario
  B.5 Interview
List of Abbreviations

AI      Artificial Intelligence
API     Application Programming Interface
BERT    Bidirectional Encoder Representations from Transformers
Director  Product Simulator Director
FSM     Function State Machine
GPT     Generative Pre-trained Transformer
JSON    JavaScript Object Notation
LLM     Large Language Model
MIDI    Musical Instrument Digital Interface
ML      Machine Learning
MLM     Masked Language Model
NLP     Natural Language Processing
ProSim  Product Simulator
RAG     Retrieval-Augmented Generation
TF-IDF  Term Frequency-Inverse Document Frequency

List of Figures

1.1 An example scenario showcasing a Volvo car driving in a city environment, rendered in Product Simulator.
1.2 Overview of the Volvo Cars scenario scripting tool Director.
4.1 A visualization of the Double Diamond design framework. Adapted from [70].
4.2 Time plan of the thesis as a Gantt chart.
5.1 Distribution of code categories in interview responses.
5.2 The structured pipeline showing how a user's prompt is processed by the system.
5.3 A visualized example instance of the camera generator, showcasing what information it receives.
5.4 A visualized example instance of the start time generator, showcasing what information it receives and what it must consider to generate appropriate timings.
6.1 A simplified overview of the various information sources DirectorAI has access to.
6.2 Section of the new Director interface, showcasing the DirectorAI chat.
6.3 DirectorAI used to generate a scenario that opens all car doors.

List of Tables

5.1 Selected Participant Quotes

1 Introduction

Recent advances in large language models (LLMs) such as OpenAI's GPT-4 have sparked growing interest in their potential to support complex digital workflows [1]. Certain AI-driven assistants have shown how LLMs can help with implementation details and suggestions. An example of this is GitHub Copilot, an AI-driven coding assistant. However, these systems often rely on vast amounts of domain-specific training data in order to achieve their efficiency. This thesis explores a different question: can general-purpose LLMs, without fine-tuning or additional domain-specific training, still provide effective assistance within a highly technical, domain-specific tool?

The case study of this project is Director, a proprietary scripting tool used at Volvo Cars for configuring and creating scenarios for a 3D vehicle simulator. While powerful, Director presents a range of usability challenges, particularly for new or less technical users. Through a user-centered design process, this project develops and evaluates an AI-driven chat interface for Director, using GPT-4 to help users find simulation commands, configure parameters, and generate scripts conversationally.

1.1 Project Aim and Research Questions

The broader aim of the project is to investigate how general-purpose LLMs can reduce the cognitive load and technical barriers inherent in domain-specific software.
We focus particularly on the design decisions that enable such assistants to feel helpful, intuitive, and efficient, despite the underlying AI model having no task-specific training. This leads to our main research question:

Main Research Question (MRQ): What design patterns enable general-purpose language models to function as effective assistants in complex, domain-specific software environments without domain-specific training?

To support the investigation of this main question, two supporting research questions are proposed. These help clarify aspects of the problem space and structure the evaluation of the results:

Supporting Research Question 1 (SRQ1): What limitations emerge when using general-purpose language models as assistants in complex domain-specific environments without domain-specific training data?

Supporting Research Question 2 (SRQ2): What constitutes "effective assistance" in the context of AI-supported domain-specific workflows?

Together, these questions guide the exploration of both the opportunities and the constraints involved in integrating general-purpose AI into specialized technical domains. Through this work, our aim is to contribute insights on how AI can be meaningfully integrated into specialized workflows, and what considerations are required to make such integration useful and usable in practice.

1.2 Project Context

Designing and implementing simulation scenarios for Volvo Cars in Unity currently involves a detailed, hands-on process where every aspect of a vehicle's operation, such as opening doors, adjusting the camera, or triggering animations, must be meticulously programmed. For instance, creating a simple replayable simulation showcase of a car requires repeatedly switching between the 3D simulation software and Director to manually retrieve and input position and rotation values for the camera. Timing animations, such as doors opening or headlights activating, is similarly tedious, as users must determine exact signal names, many of which are inconsistently labeled, and adjust timing sequences without clear visual feedback. This fragmented workflow makes even straightforward tasks unintuitive and time-consuming, acting as a barrier to creativity and efficiency in validation.

Recent advancements in AI, particularly in NLP and generative AI, have made it more feasible to dynamically generate accurate results from user input. Running such models through cloud computing rather than on-premise hardware also makes them broadly accessible. With the ever-growing complexity of vehicle software, the need for efficient and user-friendly simulation software is becoming more important. With access to Volvo Cars' experts and proprietary tools and data, we will work closely with an industry-grade testing environment that ensures our research is both practical and impactful.

This thesis addresses the challenge of automating the scenario creation workflow in a way that users find effective and helpful. The goal is to enable users to describe scenarios using natural language, either at a high, abstract level or with low-level, intricate detail. By offloading the cognitive load of learning and remembering system syntax and commands to the LLM and its supporting structure, users can instead focus on creativity and solution exploration. By embedding interaction design principles, the assistant aims to reduce cognitive load while preserving user agency and empowering new and non-technical users.
All of this allows the assistant to support more dynamic and fluid workflows, moving away from the current rigid ones.

1.3 Simulation Software Environment: An Overview of Director, ProSim, and FSM

The technical ecosystem for this project consists of three main components: Product Simulator (ProSim), Function State Machine (FSM), and Director. These systems work together to provide an interactive, simulation-driven environment for testing vehicle behaviors and features.

ProSim is the core simulation software and is developed in Unity [2]; an example of ProSim's use can be seen in Figure 1.1. It serves as the real-time simulation engine, responsible for rendering 3D environments and vehicles, simulating physical interactions, and allowing direct user control. ProSim includes a high-fidelity 3D car model, complete with functional elements such as doors, compartments, lights, and interior features. Users interact with the simulation through a first-person character, navigated via keyboard and mouse in a manner reminiscent of first-person video games. This setup allows users to explore the vehicle in a realistic and intuitive way, and interact with different components as if inside a physical prototype.

Figure 1.1: An example scenario showcasing a Volvo car driving in a city environment, rendered in Product Simulator.

FSM introduces a layer of logical control on top of the raw physics simulation provided by ProSim. While ProSim handles the physical effects of actions (such as opening a door), FSM manages system logic, rules, and interdependencies that regulate whether certain actions are allowed. For example, FSM enforces rules such as "a door cannot be opened if it is locked" or "the car cannot start if the key is not present." FSM is designed to reflect, as closely as possible, the real software logic that would run in the actual vehicle. This makes the simulation not only visually realistic, but also behaviorally accurate from a systems standpoint. Although the details of the FSM team's implementation are outside the scope of this project, it is important to note that FSM is tightly integrated with ProSim, and receives input from it to enforce these logical constraints.

The third and most relevant component to this project is Director. Director is a Flutter-based application and serves as an interface for scripting and controlling ProSim. Communication between Director and ProSim occurs via Message Queuing Telemetry Transport (MQTT), a lightweight publish-subscribe messaging protocol often used in IoT systems [3]. Upon startup, Director establishes a connection with ProSim and receives a list of available signals. These signals represent actions that can be triggered in the simulation, such as opening a door, toggling a light, or adjusting climate control. Each signal corresponds to an MQTT topic (e.g., ego/door/open/FL for opening the front left door) and is used to send specific commands to ProSim. In Figure 1.2, an overview of Director is presented.

Figure 1.2: Overview of the Volvo Cars scenario scripting tool Director with numbered annotations. (1) Signal search bar and different sections of signal types that are toggleable on the left-hand side. The list of signals appears on the right, with signal type (input, output, and bidirectional) as filters. Signals can be dragged onto the timeline (the section between signals and properties), creating a track. (2) Column of signal tracks, each track occupying one row.
The scenario can be played using the play button, paused by clicking it again, or stopped using the stop button. (3) Timeline section, where each track can contain an unlimited number of actions. Actions are added by right-clicking the timeline and selecting Add action. Actions can be edited by selecting and moving, hiding, or deleting them. (4) The Properties pane is shown when an action is selected. Property values can be manually edited. In this image, the signal action MoveCamera is selected, with eleven parameters including 3D coordinates and direction.

Director's user interface is divided into three main sections: the signal list, the timeline editor, and the properties pane. The signal list (displayed on the left-hand side) shows all available signals retrieved from ProSim. These are organized by category, where the two most important categories for this project are ego and user input. Signals in the ego category directly affect the simulation without regard to the FSM's logic. For instance, an ego door open signal will open the door regardless of whether the vehicle is locked. On the other hand, user input signals simulate user-initiated actions, such as pulling the door handle, which respect the constraints enforced by FSM and could thus fail if the door is locked. This distinction is essential when designing test cases that should reflect realistic interactions.

Signals from the list can be added as tracks in the timeline editor. Each track corresponds to a specific signal, and users can place actions along the timeline. When the timeline is played, a red playhead moves across the screen. As it passes over an action, the associated signal is sent to ProSim, triggering the desired effect in the simulation. This timeline-based system is conceptually similar to music production software, where multiple tracks are layered and events are scheduled in time.

On the right-hand side of Director is the properties pane, where users can configure signal-specific values for each action. For example, when triggering a light signal, parameters such as RGB values might be required to define the color or intensity. This allows for nuanced and dynamic scripting of simulation scenarios.

Although Director does not communicate directly with FSM, it plays a key role in invoking behaviors that may be subject to FSM rules. Director sends signals only to ProSim, which then relays those commands to FSM as needed. For the purposes of this project, understanding the structure and behavior of signals, particularly the distinction between ego and user input categories, is sufficient to design meaningful scenarios and scripts within Director. A minimal sketch of how such a signal command might be published over MQTT is shown below.
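The sketch uses the open-source mqtt_client Dart package, matching Director's Flutter/Dart stack. Only the topic name ego/door/open/FL comes from the description above; the broker address, client identifier, and JSON payload shape are illustrative assumptions, not Director's actual implementation.

```dart
import 'package:mqtt_client/mqtt_client.dart';
import 'package:mqtt_client/mqtt_server_client.dart';

/// Illustrative only: publishes a single "open front-left door" command
/// the way a Director-style tool might, using MQTT's publish-subscribe
/// model. Broker address, client id, and payload shape are assumptions.
Future<void> main() async {
  final client = MqttServerClient('localhost', 'director-sketch');
  await client.connect();

  // Build the payload. A real signal may carry parameters; a simple
  // JSON body is assumed here as a stand-in.
  final builder = MqttClientPayloadBuilder()..addString('{"value": 1}');

  // ego/* topics bypass FSM logic, so this opens the door even if the
  // simulated vehicle is locked (see the ego vs. user input distinction
  // above).
  client.publishMessage(
    'ego/door/open/FL',
    MqttQos.atLeastOnce,
    builder.payload!,
  );

  client.disconnect();
}
```

In Director itself, such commands are not sent in isolation: actions are scheduled along the timeline, and each action's signal is published when the playhead reaches it.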
2 Background

In order to better understand how DirectorAI fits into the broader technical landscape, this chapter briefly explores the history of chatbots, AI in creative and technical workflows, and related academic work.

2.1 Chatbots

Chatbots, also known as artificial conversation entities, interactive agents, and digital assistants, are AI programs designed to simulate conversation with human users. Chatbots use NLP and sentiment analysis to communicate in human language through text or oral speech with humans or other chatbots [4].

The foundational concept behind chatbots is often credited to Alan Turing, who in 1950 questioned whether a computer could converse with people without revealing its artificial nature, known as the Turing test. The first chatbot, ELIZA, was created in 1966 and simulated a psychotherapist by rephrasing user statements as questions. Its abilities were rather basic, relying on pattern matching and template responses, but it significantly inspired future chatbot development [5]. A major advancement came in 2001 with SmarterChild, available on platforms such as Microsoft's MSN, which was the first chatbot that could assist with practical daily tasks by retrieving information from databases on subjects such as movie schedules, sports results, and news [5]. The development continued with intelligent personal voice assistants such as Apple Siri, Amazon Alexa, and Google Assistant, which can understand voice commands, manage tasks via the internet, and generate relevant responses quickly. The use of chatbots increased particularly after 2016, which coincided with social media platforms allowing developers to create chatbots for their services [5].

Chatbots are deployed across multiple sectors, driven by their ability to automate tasks, enhance user engagement, and provide scalable solutions. The primary applications include:

• Customer Service and Support: Chatbots facilitate 24/7 customer support by addressing frequently asked questions (FAQs), tracking orders, and providing multilingual assistance. Their scalability reduces operational costs. Omnichannel integration enables seamless interactions across web, mobile, and social media platforms [6].
Synthesia, for instance, uses AI to generate videos directly from text scripts, com- plete with AI avatars and voiceovers, resulting in AI-generated content with a clear chronological sequence [15]. Similarly, Descript, does video editing through text manipulation, allowing users to intuitively cut or rearrange video segments by sim- ply editing the corresponding transcribed text [16]. While these pieces of software work great in their domain, the implementation of AI is specific to each application and may often not be generalizable to other applications. Academic research on LLM-Assisted Video Editing with Unified Language Representations is particularly relevant to the concept of scripting. This work explores using LLMs to facilitate natural language interaction for complex editing 8 2. Background tasks, such as generating shot lists or refining sequences based on textual prompts [17]. This research underscores the potential for linguistic commands to directly influence the temporal flow and content of a generated sequence. 2.3 Related Academic Work The 18 design guidelines for human-AI interaction from Amershi et al. [18] serve as a foundational reference for designing user-friendly AI systems. These guidelines, developed through iterations and expert review, aim to help practitioners build AI tools that are both effective and intuitive. For our work in Director, where users work with complex scripting logic and often unpredictable signals, several principles stood out, especially those related to clarifying AI capabilities, supporting user learning, and enabling corrections. We used these as a guide to design an assistant that emphasizes transparency, context-awareness, and trust. The guidelines also stress the importance of efficient interactions and graceful handling of errors. Drawing from these guidelines helped us align DirectorAI with current best practices in human-AI interaction, particularly in making the assistant more transparent and allowing for better recoverability. Certain studies explore how LLMs can improve usability in more specific contexts. For example, Varma et al. [19] developed an AI-powered librarian assistant that helps students interact with library systems using natural language. Their assistant allows users to ask about books, availability, and recommendations through con- versation, reducing the friction of navigating traditional interfaces. Although the domain differs from our work, the core idea is the same: using natural language to bridge the gap between user goals and complex backend systems. Their success in integrating LLMs into an existing workflow supports our approach of applying a general-purpose model to a domain-specific tool like Director without requiring custom retraining. DirectorAI is reactive in the sense that it mainly responds to user input, but there is also research into more proactive uses of LLMs in technical settings. For example, Chen et al. [20] built a conversational programming assistant that suggests improve- ments and anticipates user needs without being prompted. While DirectorAI does not go that far, their work is relevant in how it shows the potential of context-aware, conversational agents to ease work in syntax-heavy environments. Their focus on shared context and fluid collaboration also resonates with our use case, where users can gradually build or refine simulation scripts through ongoing dialogue with the assistant. 
Another important factor is how users perceive the AI, especially their mental model of what it can and cannot do. Rismani et al. [21] studied this in the context of AI writing assistants and found that user understanding of the AI's behavior significantly affects how effectively they use it. If users assume the assistant is smarter than it is, or interpret its suggestions too literally, it can lead to frustration or misuse. This applies directly to DirectorAI. If users assume the assistant is always right, they may not critically evaluate signal suggestions or generated scenarios. On the other hand, if they understand it as a helpful but imperfect assistant, users are more likely to engage critically, which ultimately could lead to more satisfactory results. This highlights the need for clear, transparent communication from the assistant, as well as features that help users build a realistic understanding of its capabilities.

Khurana et al. [22] offer a cautionary perspective. In their study, users interacted with an LLM tailored to a specific software system. Surprisingly, the results showed little improvement over a general baseline, partly because users had trouble understanding how their prompts related to the responses they received. More concerning was that many users followed incorrect AI-generated instructions without questioning them, especially users with less technical experience. This highlights the need to design AI systems that users can understand and trust, not just those that generate correct answers.

At the same time, other research shows that even imperfect AI systems can still be helpful. Schoenegger et al. [23], for instance, studied how LLMs influence human forecasting performance. They found that even when users were given advice from a deliberately flawed LLM, their forecasts still improved compared to those who had no AI support. This suggests that LLMs can provide meaningful cognitive support even when their output is not perfect. That idea aligns well with our goals. DirectorAI is not meant to be an expert system that replaces the user's judgment. Instead, it is a tool to help users get unstuck, reduce friction, and better understand how to interact with the system. As Schoenegger et al. put it, the LLM acts more like a 'decision aid', which we believe is especially valuable in the kind of complex, technical workflows users face in Director.

AI-based assistants have significantly enhanced the efficiency of existing software, particularly in software development. A review of AI-driven innovations [24] presents a case study on GitHub Copilot, a generative AI tool that provides real-time code suggestions and completions. The study reports a 55% reduction in task completion time and higher rates of passing test cases on the first attempt, demonstrating how AI assistants streamline coding processes. By automating repetitive tasks and offering context-aware suggestions, GitHub Copilot improves productivity and reduces errors, making it a valuable tool for developers working on enterprise applications. The review emphasizes the need for clean data and user preparation to maximize these efficiency gains, underscoring the design considerations for embedding AI assistants into development workflows.

3 Theory

In this chapter, we present the technical resources used in this thesis, along with the theoretical foundations and related works that informed our approach.
3.1 Limitations of Chatbots

Despite their versatility, chatbots face significant theoretical and practical constraints, rooted in the limitations of current NLP and ML frameworks. These include:

• Limited Contextual Understanding: Chatbots struggle with ambiguous or complex queries due to finite NLP capabilities. Rule-based systems rely on keyword matching, while AI-driven models may misinterpret nuanced inputs, leading to irrelevant responses [25].

• Lack of Emotional Intelligence: The inability to interpret emotions, humor, or sarcasm restricts chatbots' capacity to foster emotional connections, critical for customer loyalty [26]. This stems from their reliance on statistical language models rather than human-like emotional understanding.

• Inability to Handle Complex Queries: Chatbots excel in routine tasks but falter in scenarios requiring reasoning or creativity. For instance, providing personalized advice or resolving multifaceted technical issues remains challenging [27].

• Security and Privacy Risks: Handling sensitive data exposes chatbots to vulnerabilities like hacking or data breaches. Compliance with privacy regulations is critical, particularly in healthcare [28].

• Hallucinations and Inaccuracy: LLMs may generate false information, known as hallucinations, due to overfitting or biased training data. Examples include fabricated citations or nonsensical responses to ambiguous inputs [29].

• Ethical and Bias Concerns: Biases in training datasets can lead to discriminatory outputs, undermining fairness in applications such as education and healthcare. Ethical challenges also arise from potential emotional manipulation [30].

• Environmental Impact: The computational resources required for training and operating chatbots contribute to significant energy consumption and carbon emissions, raising sustainability concerns [31].

3.1.1 Implementation of Chatbots

Chatbot implementations typically rely on two core approaches: pattern matching and machine learning.

Pattern matching uses rule-based systems to compare user input against predefined templates, selecting fixed responses accordingly. This method, exemplified by ELIZA, is effective for predictable interactions but struggles with ambiguous or novel input due to its reliance on scripted responses [5]. A minimal sketch of this approach is shown at the end of this subsection.

Machine learning-based chatbots, by contrast, leverage natural language processing (NLP) to understand context and intent, allowing them to generate responses dynamically. This approach powers virtual assistants like Siri, Alexa, and Google Assistant, which continuously adapt to user behavior through models such as deep learning and LLMs [5].

A leading example is ChatGPT, which uses the GPT architecture to produce coherent and contextually relevant dialogue. It can handle nuanced prompts and is capable of more than casual conversation, such as generating scenario-based simulations for educational purposes by interpreting structured prompts and delivering dynamic multi-turn responses [32], [33].
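To make the pattern-matching approach concrete, the following is a minimal, illustrative Dart sketch in the spirit of ELIZA. The rules and phrasings are invented for illustration, not taken from ELIZA itself; real systems use far larger template libraries.

```dart
/// ELIZA-style pattern matching: user input is compared against
/// predefined regular-expression templates, and a fixed (templated)
/// response is returned. The rules below are invented examples.
final rules = <RegExp, String Function(Match)>{
  RegExp(r'I am (.*)', caseSensitive: false):
      (m) => 'Why do you say you are ${m[1]}?',
  RegExp(r'I need (.*)', caseSensitive: false):
      (m) => 'Why do you need ${m[1]}?',
};

String respond(String input) {
  for (final entry in rules.entries) {
    final match = entry.key.firstMatch(input);
    if (match != null) return entry.value(match);
  }
  // Fallback when no template matches: the classic weakness of
  // rule-based chatbots faced with novel or ambiguous input.
  return 'Please tell me more.';
}

void main() {
  print(respond('I am stuck on a scripting task'));
  // -> Why do you say you are stuck on a scripting task?
}
```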
3.2 Large Language Models

LLMs are neural networks, typically based on transformer architectures, trained on vast corpora of text data to predict and generate sequences of words [34]. The transformer model, introduced by Vaswani et al. [34], leverages self-attention mechanisms to capture long-range dependencies in text, allowing LLMs to process and generate coherent language. Prominent examples, such as BERT and GPT, demonstrate the power of pre-training on diverse datasets followed by task-specific adaptation [35], [36]. Pre-training equips LLMs with broad linguistic knowledge, which can be fine-tuned for tasks like text classification, translation, or question answering [35]. The scale of LLMs, often comprising billions of parameters, enables them to model complex linguistic patterns but introduces challenges, including high computational costs and ethical concerns related to bias and misinformation [37]. Despite these limitations, LLMs have transformed fields such as scientific research, healthcare, and education by facilitating advanced text analysis and generation [38]. Their ability to generalize across tasks underscores their potential as foundational tools in NLP.

3.2.1 GPT Architecture and Transformer Mechanisms

The GPT family is based on the Transformer architecture introduced by Vaswani et al. [34], which uses self-attention to model long-range dependencies in sequences and allows for efficient parallel computation.

GPT uses an autoregressive approach to generate text one token at a time, conditioning each token on all previous ones. The probability of a sequence is expressed as shown in Equation (3.1).

\[
P(x_1, x_2, \ldots, x_n) = \prod_{t=1}^{n} P(x_t \mid x_1, \ldots, x_{t-1}) \tag{3.1}
\]

These conditional probabilities are modeled using deep neural networks [39].
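As a concrete illustration of the factorization in Equation (3.1) (our own example, not drawn from the cited works), the probability a model assigns to the three-token sequence "the cat sat" decomposes as:

\[
P(\text{the},\ \text{cat},\ \text{sat}) = P(\text{the}) \cdot P(\text{cat} \mid \text{the}) \cdot P(\text{sat} \mid \text{the},\ \text{cat})
\]

Generation simply runs this factorization forward: at each step, the model evaluates P(x_t | x_1, ..., x_{t-1}) over the vocabulary, selects or samples the next token, appends it to the context, and repeats.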
GPT follows a decoder-only Transformer design, stacking layers of masked self-attention and feed-forward components optimized for generative tasks like dialogue and summarization.

In contrast, bidirectional models such as BERT [35] use an encoder-only architecture trained via masked language modeling (MLM), making them well-suited for classification and extraction tasks.

Retrieval-Augmented Generation (RAG) enhances transformer models by incorporating retrieved external documents into the context window, improving factual accuracy and domain-specific performance [40].

3.3 Natural Language Processing

Natural Language Processing (NLP) focuses on enabling machines to understand and generate human language in a contextually meaningful way [41]. Applications span translation, summarization, and sentiment analysis.

Modern NLP has evolved from rule-based and statistical methods to deep learning architectures, especially transformers, which outperform RNNs and LSTMs in modeling long-term dependencies.

A key task is text classification, used in this project to categorize user input. Traditional approaches like TF-IDF require manual feature engineering and struggle with generalization. In contrast, transformer-based models such as BERT leverage pretraining and transfer learning to improve accuracy on unseen inputs [35].

3.4 Transformers

Transformers, introduced by Vaswani et al. [42], replaced recurrence with an attention mechanism, improving parallelization and the ability to model long-range relationships. Their core innovation is self-attention, which computes new token embeddings as weighted combinations of all tokens in a sequence. The self-attention mechanism, defined in Equation (3.2), is:

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right) V \tag{3.2}
\]

where Q, K, and V are the query, key, and value matrices, obtained from the input via learned linear projections, and d_k is the dimension of the key vectors.

Encoder and Decoder Roles

The encoder processes input tokens into contextual embeddings using stacked layers of multi-head self-attention and feed-forward networks. The decoder, used in generative tasks, predicts output tokens iteratively, incorporating encoder outputs and prior predictions. GPT models use decoder-only stacks, while BERT employs encoder-only stacks [43].

BERT and Bidirectional Context

BERT (Bidirectional Encoder Representations from Transformers) was proposed by Devlin et al. [35] to pretrain models that leverage context from both directions. It uses the MLM objective to predict randomly masked tokens using surrounding context and is ideal for classification and question answering.

3.5 Software

This section introduces the main software applications and technologies used in this project.

Dart and Flutter

Dart is an open-source, general-purpose programming language developed by Google, optimized for building high-performance, client-side applications across web, mobile, desktop, and embedded platforms [44]. With an object-oriented syntax and features like sound null safety, Dart enhances code reliability by preventing null reference errors [45]. It supports ahead-of-time (AOT) compilation for efficient native code execution and just-in-time (JIT) compilation with hot reload, enabling rapid development cycles [45]. Dart's platform-independent virtual machine and standard library, extended by the Pub package repository, make it a versatile tool for modern application development [46].

Flutter, a UI software development kit (SDK) built on Dart, enables developers to create natively compiled, visually consistent applications from a single codebase for multiple platforms, including iOS, Android, web, and desktop [47]. Flutter's widget-based architecture allows hierarchical composition of UI components, known as widgets, to build responsive interfaces [48]. Its rendering engine, powered by Skia (or Impeller on iOS), bypasses platform-specific UI components to ensure uniform visuals and performance across devices [47]. Flutter leverages Dart's AOT compilation for fast execution and JIT compilation for hot reload, streamlining development workflows [46]. Supported by a rich ecosystem of Pub packages and pre-built widgets, Flutter reduces development time and is used in applications such as Google Pay, making it suitable for cross-platform development [46].

Azure OpenAI Service

Microsoft Azure OpenAI Service provides enterprise-grade access to advanced AI models developed by OpenAI, including the GPT-4 model family, integrated with Azure's secure cloud infrastructure [49]. Launched in 2021, this service enables organizations to leverage powerful language models for tasks such as content generation, summarization, code generation, and conversational interfaces, while ensuring compliance with enterprise requirements such as data privacy, security, and regional availability [49], [50]. The GPT-4 models, including GPT-4, GPT-4 Turbo, and GPT-4o, are multimodal, capable of processing text and images, and excel in complex reasoning, coding, and multilingual tasks [51].

To interact with GPT-4 models, Azure OpenAI Service provides a REST API, accessible via endpoints for chat completions, embeddings, and other capabilities [52]. Developers create an Azure OpenAI resource in the Azure portal, deploy a GPT-4 model, and authenticate API calls using either API keys or Microsoft Entra ID tokens [52], [53]. For example, a chat completion API call involves sending a POST request to an endpoint with a JSON payload containing a system message, user prompt, and parameters such as max_tokens to control output length [53]. The response includes a model-generated completion, token usage, and metadata, enabling multi-turn conversations or single-turn tasks [52].
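As a sketch of such a call, the following Dart function posts a chat completion request using the http package. The resource name, deployment name, API version, and key are placeholders, and the minimal response parsing assumes the standard chat completions response shape; this illustrates the request format described above rather than DirectorAI's actual code.

```dart
import 'dart:convert';
import 'package:http/http.dart' as http;

/// Illustrative sketch of an Azure OpenAI chat completion call.
/// Resource name, deployment name, api-version, and key are
/// placeholders; consult the Azure OpenAI docs for current values.
Future<String> chatCompletion(String userPrompt) async {
  final uri = Uri.parse(
    'https://my-resource.openai.azure.com/openai/deployments/'
    'my-gpt4-deployment/chat/completions?api-version=2024-02-01',
  );

  final response = await http.post(
    uri,
    headers: {
      'Content-Type': 'application/json',
      'api-key': 'MY_API_KEY', // or a Microsoft Entra ID bearer token
    },
    body: jsonEncode({
      // The system message is where an assistant like DirectorAI can
      // embed domain context, constraints, and examples.
      'messages': [
        {
          'role': 'system',
          'content': 'You are a scenario-scripting assistant.'
        },
        {'role': 'user', 'content': userPrompt},
      ],
      'max_tokens': 256, // caps the length of the completion
    }),
  );

  // The response carries the completion, token usage, and metadata;
  // here only the generated message text is extracted.
  final json = jsonDecode(response.body) as Map<String, dynamic>;
  return json['choices'][0]['message']['content'] as String;
}
```

Sending the full chat history in the messages array is what turns this single-turn call into a multi-turn conversation.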
Azure's enterprise features, such as private networking, role-based access control, and content filtering, ensure secure and compliant API usage, while global standard deployments dynamically route traffic for low-latency performance [49], [54]. Additionally, Azure OpenAI supports RAG for grounding responses in enterprise data, enhancing accuracy for domain-specific applications [55].

Unity

Unity is a real-time 3D development platform developed by Unity Technologies, widely used for creating interactive simulations, games, and extended reality (XR) applications across industries such as automotive, robotics, and manufacturing [56]. Launched in 2005, Unity supports the creation of both two-dimensional (2D) and three-dimensional (3D) environments, using the C# programming language for scripting and a visual editor for designing scenes, physics, and animations [56], [57]. Its component-based architecture allows developers to attach behaviors, such as physics simulations or artificial intelligence (AI), to game objects, enabling rapid prototyping and iteration [58]. Unity's cross-platform capabilities support over 19 platforms, including Windows, macOS, Linux, iOS, Android, and XR devices like HoloLens and Oculus, making it versatile for deploying simulation environments [56].

Graphical simulation tools are important for visualizing, testing, and validating different behaviors and functions in the automotive industry. For instance, Yang et al. [59] used Unity to create a virtual reality driving simulation platform to examine driver behavior and depth sensor accuracy in various scenarios, which the authors argue will reduce training costs and improve the efficiency of addressing emergency events.

3.6 Interaction Design-Related Theory

This section introduces relevant theory from the field of interaction design used in this thesis.

3.6.1 User-Centered Design

Within the field of interaction design there exist several different approaches to design, such as speculative design, participatory design, and somaesthetic design, to name a few [60]–[62]. Each type of design influences what questions we ask ourselves as we work: What is the goal? Who are we designing for? Why are we designing for them? Consequently, the selected type of design affects the process we follow and the intended outcomes. In the case of designing and developing DirectorAI, an assistant intended to address the usability issues in Volvo Cars' Director tool, we use a User-Centered Design (UCD) approach. This approach is based on the work by Norman [63], Norman & Draper [64], and Gould & Lewis [65].

UCD is not a strict methodology; rather, it is a guiding principle or design philosophy. Fundamentally, UCD emphasizes building an understanding of and empathy for users' issues, pain points, and workflows. It promotes a design and development process based on user needs, producing prototypes that are continuously evaluated by users themselves to ensure solutions stay aligned with their real-world issues.

3.6.2 Cognitive Load

Cognitive load, as introduced by Sweller et al. [66], refers to users' limited capacity to hold task-relevant information in memory. These limitations in how people learn and process information can often increase the difficulty of performing a task. In the case of software like Director, cognitive load can be alleviated through various methods, such as reducing the complexity of the task or simplifying the interface.
One of the main motivations behind DirectorAI is to reduce the cognitive load associated with using Director. The tool's interface, scripting syntax, and workflow all require users to hold a lot of information in mind while working. This is where LLMs offer an opportunity to reduce cognitive load. With their ability to understand users' natural language, we can shift some of that mental burden away from users and potentially make the system simpler to use.

3.6.3 Affordances

Affordance is a concept introduced by Norman [63] and refers to the properties of an object that suggest how it can be used. In terms of interface design, affordances are what guide user expectations. For example, if the interface contains a button that looks clickable, then it suggests an action will be performed when it is pressed. Cooper et al. [67] expand on the idea of affordances when it comes to digital interface design. They argue that the perceived functionality of an interface element may not always align with its actual functionality. Therefore, they suggest that what matters most is not the interface element's true functionality, but instead how users perceive it to work based on past experiences.

This issue becomes more complex in the case of DirectorAI, where affordances extend beyond interface elements such as buttons or icons to the functionality of the underlying LLM. While the interface contains familiar elements, such as a revert button that undoes actions or a chat log that records user and AI messages, the assistant itself carries a more complex affordance. DirectorAI is presented as a helpful and intelligent assistant, and if users perceive the assistant as an expert, or as something they can delegate all work to, breakdowns in usability and trust will begin to occur, since the assistant will inevitably fail at certain tasks.

3.6.4 Mental Models

Another important concept introduced by Norman [63] is that of mental models. Mental models refer to the internal understanding people form about how a system works. They can be shaped by prior experience, cues from the system, and guidance from other people. While these models are not always accurate, they guide how users interact with a system and what they expect it to do.

When it comes to DirectorAI, mental models become particularly important, as users may approach the assistant with preconceptions from previous experiences with other AI-based assistants or timeline-based scripting tools. When a user's mental model differs from how the assistant actually functions, especially regarding the capabilities and limitations of the underlying LLM, confusion and frustration may arise. As such, considering users' existing mental models, and supporting the formation of accurate new ones, is an important part of designing DirectorAI.

3.6.5 Usability Heuristics

Established interaction design heuristics include Nielsen's 10 Usability Heuristics for User Interface Design [68]. While we do not consider all of them in this project, it is worth briefly mentioning a few that are particularly relevant for DirectorAI's design. These include the heuristic about visibility of system status and providing timely feedback (H1). Nielsen also emphasizes the importance of user control, such as being able to undo actions (H3). Another relevant heuristic is the availability of help and documentation (H10). Perhaps the most central one, however, is the idea that systems should speak the user's language, avoiding overly technical jargon (H2).
This directly relates to the core of what we aim to address with DirectorAI, as many of Director's usability issues stem from the overuse of technical terminology.

4 Methods

This project used both design and technical methods in parallel, allowing insights from one to inform decisions in the other throughout the process. The design methods will follow an interaction design framework, ensuring that the development of AI-driven features is grounded in user needs, usability principles, and iterative refinement. Simultaneously, the technical methods will focus on implementing and integrating AI technologies, such as natural language processing, within the constraints of the existing 3D simulation software. By combining these two perspectives, the project aims to create a solution that is feasible within the system's technical limits while directly addressing user needs.
For this thesis, where clarity, traceability, and a well-defined structure for analysis and reporting were important, the Double Diamond offered a more appropriate balance between flexibility and methodological rigor.

Goal-Directed Design, introduced by Alan Cooper, focuses heavily on identifying user goals and creating solutions that align with these goals through persona-based design [73]. This approach offers deep behavioral insight and works particularly well when designing tools for specific, repeatable tasks. However, this project aimed to explore emergent creative behaviors, many of which were discovered through early exploratory studies or only appeared later in the prototyping phases. As such, a rigid goal-driven approach risked narrowing the design space too early.

Lean UX, on the other hand, emphasizes rapid experimentation, minimum viable products, and continuous iteration in fast-moving product teams [74]. While ideal in agile, startup-like settings with quick user feedback loops, its weak emphasis on discovery and exploration makes it less suitable for early-stage research into novel technologies, especially given that our primary goal was understanding user needs rather than optimizing conversion metrics or feature releases. Lean UX also lacks the Double Diamond's clear separation between the problem space and the solution space, a separation that enables a thorough understanding of users' issues before moving on to prototyping, something we considered valuable.

In contrast to these alternatives, the Double Diamond model was chosen because it supported a comprehensive, structured exploration of both the problem space and the solution space [70]. It allowed the project to remain grounded in user needs while still leaving room for creative and conceptual exploration, particularly around the emerging role of AI in design workflows. Its staged structure also enabled clear documentation of each phase, which was beneficial for the transparency and reflection expected in a thesis. The Double Diamond's four phases (Discover, Define, Develop, and Deliver, as seen in Figure 4.1) offered a methodical yet open-ended process that aligned well with the project's aim to investigate, understand, and meaningfully contribute to how users interact with AI assistants in creative but technically complex tools.

Figure 4.1: A visualization of the Double Diamond design framework. Adapted from [70].

4.1.1 Discover Phase

In the Discover phase, we planned to familiarize ourselves with Volvo Cars' 3D vehicle simulation software and Director in order to understand their functionality and limitations. This included informal observations of developers and users working with the tools to identify common workflows, challenges, and workarounds. While full ethnographic studies provide deep insights into user behavior [75], they are often time-intensive and require prolonged immersion in the setting. Given the constraints of this project, we used a more lightweight ethnographic approach [76], focusing on observations and contextual inquiry rather than extensive fieldwork.

To supplement these observations with broader user data, we conducted a survey targeting existing users of the software. Surveys are useful for reaching a larger participant pool and identifying common trends [77], though they lack the depth and follow-up potential of direct user interaction. The survey focused on understanding the software's primary use cases and the challenges users faced.
While interviews could have provided richer qualitative data, a survey allowed us to efficiently gather input from a larger and more diverse set of users.

We then conducted usability testing following the guidelines of Rubin and Chisnell [78]. The motivation for these studies was to identify the specific challenges users encounter while performing tasks in Director. During the studies, we collected qualitative data through a combination of think-aloud protocols to capture users' thought processes, direct observations of hesitations, confusions, and workarounds, and post-task interviews to explore their experiences in more depth. More advanced usability evaluation methods, such as controlled laboratory studies or cognitive workload assessments, could have provided deeper insights into user performance [79]. However, these methods often require specialized equipment, controlled environments, or long testing sessions, which were beyond the practical scope of this study. Instead, we prioritized methods that provided rich, in-depth, and actionable feedback within real-world constraints and use cases, ensuring that the findings could directly inform the design process.

Our usability tests involved both experienced and less experienced users of the software, as well as individuals with no prior exposure. This was important because interaction design should not only optimize experiences for existing users but also lower entry barriers for new ones [63]. While longitudinal studies could have offered insights into learning curves and long-term adoption, time constraints made this impractical, so our evaluation focused on immediate usability and emergent behavior rather than long-term user adaptation.

4.1.2 Define Phase

In the Define phase, we analyzed the collected data to refine our understanding of usability challenges and user needs. This involved thematic coding to identify recurring patterns in survey responses and usability test findings. This approach was chosen because thematic coding provides a structured but adaptable way to interpret qualitative data, allowing us to translate user feedback into clear and relevant insights [80]. These insights then informed our design and were important for ensuring that the resulting prototypes addressed actual pain points and aligned with user behavior and needs, rather than being based on technical possibilities or novelty [63].

While participatory design workshops could have helped refine our design further, we chose to rely on the identified codes and issues instead. This decision was made to keep the process focused and simple, especially given time and resource constraints. By basing our design work and prototypes on direct observations and user feedback, we were nonetheless able to capture a range of perspectives.

Personas are another method we considered during this phase. Introduced by Cooper [81] and later developed further by Pruitt and Grudin [82], personas are fictional characters based primarily on qualitative data gathered in the Discover phase. They are used to represent groups of users or stakeholders and typically include goals, workflows, needs, and pain points. We initially planned to use personas, or were at least open to the idea, but the Discover phase quickly showed how broad and diverse the user base was. It included everyone from software engineers and function architects to 3D artists and analysts.
In practice, most participants had differing goals and use cases for Director, making it difficult to generalize them into one or even a few personas. There was also the risk that designing for a few personas might lead to solutions optimized only for the workflows those personas represent. Instead, we chose to focus on the most prominent pain points and recurring issues that appeared across user groups.

4.1.3 Develop Phase

During the Develop phase, we began creating and refining prototypes to test and validate our design concepts. Prototyping is often essential in interaction design, as it enables iterative testing and adjustment before committing to full-scale implementation [83]. Rather than starting with wireframes or paper sketches, we chose to move directly to more functional prototypes, allowing us to quickly assess the feasibility of our ideas in the context of the actual Director software. This approach ensured that our early design work was grounded in real-world constraints and technical considerations, providing more immediate insights into user needs and possible solutions.

Among the prototyping methods we considered was the Wizard of Oz technique, which involves faking prototype functionality to gather user feedback without investing extensive development time [79]. However, we quickly encountered a core issue: what kind of AI behavior should we simulate? A best-case scenario? Worst-case? Something in between? At this stage, we had no clear idea of how the AI would actually perform or how it would respond to user prompts. Instead, we chose to build simple yet functional prototypes to test feasibility and gather user feedback using a real LLM, which was especially important given the unpredictable nature of LLMs in a niche, proprietary system like Director.

Additionally, while a more complete participatory co-design method can be useful for aligning solutions with real-world use cases [84], it was not the primary approach in this project. Mainly due to limited user availability, we adopted a lighter form of co-design, relying on short, targeted feedback sessions throughout development. User feedback remained crucial, and this approach allowed us to incorporate user input effectively, aligning the design with real-world workflows without overcomplicating the early stages or placing too much burden on participants.

4.1.4 Deliver Phase

Finally, in the Deliver phase, we evaluated our prototype to understand how DirectorAI impacted the user experience and in which ways it supported scenario creation. This evaluation focused on gathering qualitative insights through user studies, where methods such as think-aloud protocols and post-task interviews were used to capture the diverse ways in which participants interacted with the assistant. Rather than simply measuring performance improvements, our goal was to explore how DirectorAI influenced user workflows, reduced cognitive load, and supported creative exploration, providing an understanding of the design's practical impact.

By following this structured process, we ensured that the developed solution, DirectorAI, was grounded in user research and iterative refinement. At each stage, we tried to balance methodological depth with practical constraints, making deliberate trade-offs based on which methods best fit our time constraints, user access, and the goals of the project.
4.2 Technical Methods

This section explains technical methodologies relevant to adapting LLMs to specific contexts. Understanding these techniques provides important context for the design choices made and the specific methods used in this thesis, particularly the project's focus on prompt engineering for DirectorAI.

4.2.1 Fine-tuning

Fine-tuning refers to the process of adapting a pre-trained LLM to a specific task or domain by further training it on a targeted dataset. This approach adjusts the model's parameters to enhance performance for applications such as text classification, question answering, or scientific text generation [35]. Typically, fine-tuning employs supervised learning, where labeled data is used to minimize a task-specific loss function. For example, fine-tuning BERT on domain-specific corpora significantly improves its accuracy in tasks like natural language inference [35]. The process requires high-quality labeled datasets and substantial computational resources, which can be a limitation in certain contexts [85].

Instruction Fine-Tuning

GPT models can be fine-tuned using structured prompt-response datasets, aligning model outputs with desired behaviors. This process is often enhanced by reinforcement learning from human feedback to promote safety, usefulness, and coherence [86].

Parameter-Efficient Adaptation

Approaches like Low-Rank Adaptation (LoRA) update only a subset of model parameters, allowing resource-efficient customization for domain-specific deployments [87].

4.2.2 Prompt Engineering

Prompt engineering involves designing input prompts to guide a pre-trained LLM to produce desired outputs without modifying its parameters. This technique leverages the model's existing knowledge, making it resource-efficient for tasks such as text generation or sentiment analysis [36]. Strategies such as few-shot learning, where prompts include a few task examples, or zero-shot learning, where only task instructions are provided, enhance model performance [36]. Prompt engineering is particularly valuable in scenarios requiring rapid adaptation to new tasks with minimal data, though its effectiveness depends on the model's generalization capabilities [88].

Zero-Shot and Few-Shot Learning

Zero-shot learning (ZSL) enables LLMs to perform tasks without prior task-specific training or examples, relying solely on a descriptive prompt [36]. For instance, instructing a model to "Classify the sentiment of this review as positive or negative" without providing examples tests its ability to generalize from pre-trained knowledge. ZSL is particularly useful for rapid task adaptation, but it can fail in domains requiring specialized knowledge or nuanced reasoning [39].

Few-shot learning (FSL), by contrast, includes a small number of task examples within the prompt to guide the model [36]. For example, a prompt might include: "Example: 'Great product!' → Positive. 'Terrible service.' → Negative. Now classify: 'Amazing experience!'". FSL often outperforms ZSL by providing context that aligns the model's output with the desired format or style [88]. One-shot learning, a special case of FSL with a single example, serves as an intermediate approach. Both ZSL and FSL are forms of in-context learning, where the model learns from the prompt context without weight updates [89]. The effectiveness of ZSL and FSL depends on prompt clarity, the model's pre-training data, and task complexity.
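To make the difference concrete, the fragment below sketches how the zero-shot and few-shot sentiment prompts above could be assembled and sent to a chat-completion API. It is an illustration of the pattern rather than code from DirectorAI; it assumes the openai Python client (version 1.0 or later) with an API key available in the environment, and uses the GPT-4o-mini model that is also employed later in this project.

# A minimal sketch of zero-shot vs. few-shot prompting.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

zero_shot = (
    "Classify the sentiment of this review as positive or negative: "
    "'Amazing experience!'"
)

few_shot = (
    "Example: 'Great product!' -> Positive. "
    "Example: 'Terrible service.' -> Negative. "
    "Now classify: 'Amazing experience!'"
)

for prompt in (zero_shot, few_shot):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)

The only difference between the two calls is the prompt text: the few-shot variant prepends worked examples that anchor the output format, with no change to the model itself.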
Recent studies suggest that larger models, such as GPT-3 and its successors, exhibit stronger zero-shot and few-shot capabilities due to their extensive training corpora [90]. However, performance can degrade on tasks that require deep reasoning or domain-specific expertise, necessitating advanced techniques such as prompt chaining or fine-tuning [88].

Prompt Chaining

Prompt chaining is a technique that decomposes complex tasks into a sequence of smaller, interdependent prompts, where the output of one prompt serves as input for the next [90]. This approach mitigates the limitations of single-prompt interactions by structuring tasks into manageable steps, improving accuracy and coherence. For example, to generate a business plan, a chain might include:

1. "List the key sections of a business plan."
2. "Write an executive summary for a company with the following description: [input]."
3. "Draft a market analysis based on the executive summary: [output from step 2]."

Prompt chaining is particularly effective for multi-step reasoning tasks, such as planning, code debugging, or report generation [90]. Related to prompt chaining is the concept of chain-of-thought (CoT) prompting, which encourages the model to articulate intermediate reasoning steps within a single prompt [90]. For instance, a CoT prompt might state: "Solve this math problem and explain each step." While CoT focuses on reasoning within one prompt, prompt chaining distributes reasoning across multiple prompts, making it suitable for tasks requiring iterative refinement or modular outputs [90]. A more advanced variant, tree-of-thought (ToT) prompting, explores multiple reasoning paths in parallel and selects the optimal one, often implemented through chained prompts [91].

Prompt chaining requires careful design to ensure alignment between steps and to prevent error propagation. Automated workflows, where outputs are programmatically fed into subsequent prompts, can streamline this process, particularly in API-based applications [90].

Meta-Prompting

Meta-prompting involves crafting prompts that instruct the model to design, evaluate, or optimize prompts before addressing a task [92]. This higher-level approach leverages the model's self-reflective capabilities to improve prompt quality. For example, a meta-prompt might state: "Write an optimized prompt to elicit a detailed summary of a scientific article." Alternatively, it could ask: "Review this prompt for clarity and suggest improvements: [insert prompt]." Meta-prompting is particularly valuable for users with limited prompt engineering expertise or when tackling novel tasks [92].

A related concept is self-consistency, where the model generates multiple responses to a prompt and selects the most consistent or highest-quality output [Wang et al., 2023]. Meta-prompting can orchestrate self-consistency by instructing the model to compare outputs and refine its approach. Another related technique, reflexive prompting, asks the model to reflect on its reasoning process or prompt effectiveness, e.g., "Why did this prompt yield a vague response, and how can it be improved?" [92].

Meta-prompting can be combined with prompt chaining to create dynamic workflows. For instance, a meta-prompt might generate an initial prompt, which is then used in a chained sequence, with subsequent meta-prompts refining the process based on intermediate results [90].
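The sketch below makes this combination concrete. A hypothetical ask() helper wraps a single chat-completion call (again assuming the openai Python client and GPT-4o-mini); a meta-prompt first generates a task prompt, the chain then applies it to an input, and a reflexive prompt finally reviews it. The helper name and the placeholder article text are our own illustrative assumptions, not part of any published API or of DirectorAI.

from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    # Hypothetical helper: one prompt in, one text response out.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Step 1 (meta-prompt): have the model design a prompt for the actual task.
generated_prompt = ask(
    "Write an optimized prompt to elicit a detailed summary of a scientific article."
)

# Step 2 (chained use): apply the generated prompt to a concrete input.
article = "..."  # the article text would be supplied here
summary = ask(generated_prompt + "\n\nArticle:\n" + article)

# Step 3 (reflexive refinement): ask the model to critique its own prompt.
critique = ask(
    "Review this prompt for clarity and suggest improvements: " + generated_prompt
)

Because each output is passed on programmatically, intermediate results can be validated or edited before the next call, which helps contain the error propagation discussed above.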
Meta-prompting does, however, require precise phrasing to avoid confusion, as the model must balance the meta-task (e.g., prompt design) with the actual task.

4.2.3 Other Relevant Concepts

Several additional concepts improve prompt engineering, particularly for complex or reasoning-heavy tasks:

• Self-Consistency: As mentioned, self-consistency involves generating multiple outputs and selecting the best one, often improving performance on tasks such as mathematical reasoning or factual question answering [Wang et al., 2023]. This technique can be integrated into prompt chaining or meta-prompting workflows to ensure robust outputs.

• Automated Prompt Engineering: Frameworks like DSPy [93] programmatically optimize prompts by iterating over prompt variations and evaluating performance metrics. Unlike meta-prompting, which relies on the model's natural language capabilities, automated prompt engineering uses computational optimization, making it suitable for large-scale applications.

• Temperature and Top-k Sampling: Model parameters like temperature (controlling response randomness) and top-k sampling (limiting token selection to the k most likely options) indirectly influence prompt engineering outcomes [39]. For instance, lower temperature values (e.g., 0.5) produce more deterministic outputs, while higher values (e.g., 1.0) increase creativity. Prompt engineers must account for these parameters when designing prompts, especially for tasks requiring specific tones or styles.

• Agentic Workflows: Advanced prompt engineering can emulate agent-like behavior, where the model autonomously decides subsequent steps based on prior outputs [90]. For example, a meta-prompt might instruct the model to "Generate a plan, execute each step, and adjust based on results." Such workflows combine prompt chaining, meta-prompting, and self-consistency to create dynamic, goal-oriented interactions.

4.2.4 Retrieval-Augmented Generation (RAG)

RAG is a hybrid framework that combines information retrieval with natural language generation to enhance LLMs. Introduced by Lewis et al. [94], RAG integrates parametric knowledge (encoded in model weights) with non-parametric knowledge (retrieved from external sources) to improve factual accuracy and contextual relevance.

In the RAG pipeline, a query initiates the retrieval of relevant documents from an external knowledge base, typically a vector-indexed corpus. Retrieval uses dense embedding models such as BERT [35] or Dense Passage Retrieval (DPR) [95] to identify semantically similar content. The retrieved information is then appended to the query and passed to a generative model, enabling outputs grounded in both learned and external knowledge.

This architecture mitigates key limitations of standard LLMs, particularly hallucinations and static knowledge, by conditioning responses on verifiable sources [96]. It is especially effective for domain-specific or private queries, where fine-tuning would be impractical [40].

Nevertheless, RAG presents challenges such as retrieval noise, increased inference latency, and the need to balance contributions from parametric and non-parametric knowledge [97]. Active research explores solutions including domain-specific retriever tuning, structured data retrieval, and improved fusion mechanisms [98]. RAG's modular design also opens avenues for scalable updates and theoretical advances in memory, reasoning, and contextual language modeling.
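The sketch below illustrates the retrieve-then-generate pattern at its smallest scale: a handful of documents are embedded, the entry most similar to the query is retrieved by cosine similarity, and generation is conditioned on it. The embedding model name and the Director-style documents are illustrative assumptions; a production system would use a proper vector index (e.g., FAISS) rather than a linear scan.

import math
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    # Dense embedding of a text; the model name is illustrative.
    return client.embeddings.create(
        model="text-embedding-3-small", input=text
    ).data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Tiny illustrative knowledge base; real corpora are vector-indexed.
corpus = [
    "Door FL Open: directly opens the front-left (driver's side) door.",
    "hornSoundOn: activates the horn; a boolean parameter switches it on or off.",
]
index = [(doc, embed(doc)) for doc in corpus]

query = "How do I open the driver's door?"
query_vec = embed(query)
best_doc, _ = max(index, key=lambda pair: cosine(query_vec, pair[1]))

# Generation conditioned on retrieved, non-parametric knowledge.
answer = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": "Context: " + best_doc + "\n\nQuestion: " + query,
    }],
).choices[0].message.content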
4.3 Time Plan

Figure 4.2 shows the approximate time plan for the project.

Figure 4.2: Time plan of the thesis as a Gantt chart.

5 Process

This chapter outlines how we designed and built DirectorAI, using a user-centered approach to tackle the usability issues present in Director. It covers the initial scoping and problem discovery phases, the iterative design of the prototype, and the technical challenges involved in integrating AI into a complex, domain-specific system.

5.1 Preliminary Scoping and Feasibility

Before conducting formal user studies, we began with an initial exploration of the technical landscape surrounding Director. This phase took place early on, while we were still getting to know the tools and gathering participants for our studies. It played an important role in shaping the project's direction and helped ground the rest of our process in what was technically feasible and realistic within the time frame and scope of the project.

Our approach involved reviewing the available source code for Director, identifying which parts of the software we could work with and how, and mapping out the structure of the data we had access to. We also considered how Director connects to ProSim, the broader simulation platform it interfaces with. Early on, we determined that we did not have access to the full ProSim environment, which ruled out changes that would require modifying the underlying simulation software, such as camera control or environmental interactions. While such suggestions might arise later as potential improvements, it was important to understand that they would fall outside the bounds of what we could realistically address.

This early scoping also gave us a clearer understanding of the kind of AI functionality we might be able to implement. Initially, we considered possibilities like training a custom model to support scenario creation. However, we quickly realized that this would require access to a large, labeled dataset of simulation scenarios, which did not exist. Creating such a dataset would have been a major undertaking in itself and was not feasible given our resources. As a result, we began focusing on how we could apply existing LLMs to augment the scenario creation process without custom model training.

Ultimately, this stage helped us set clear boundaries for the project. It allowed us to better interpret user feedback later in the process, as we could distinguish between desirable but impractical features and problems that could actually be addressed in a meaningful way. It also informed the way we designed our user studies, allowing us to focus on pain points within the Director interface and scenario creation workflow that we knew we could feasibly explore and improve. In this sense, the preliminary scoping phase became a short but important part of the design process, shaping both the direction of our research and the tools we would later design and prototype.

5.2 Problem Discovery

To gain a preliminary understanding of how Director and ProSim are used, we conducted a short survey targeting Volvo Cars employees with either experience in these tools or an interest in their development. The goal was to identify usage patterns, common tasks, and any early frustrations or wishes users had. This was important to ensure that the design work would be grounded in real-world usage.
The survey consisted of open-ended questions, such as:

• "What do you use ProSim for?"
• "Which features do you use most?"
• "Have you used Director, and if so, for what?"
• "What features do you like about Director?"
• "What challenges or feature requests do you have?"

A final yes/no question asked whether the respondent would like to participate in future user studies on the matter.

While the response count was limited (nine respondents), and the answers were mixed and often vague, a few clear themes emerged:

• ProSim was used for a wide range of purposes, from visualizing scenarios and FSM behavior to running driver studies and showcasing car functionality.
• Of the few respondents who had used Director, several mentioned frustrations with configuring and timing actions, adjusting signals, and managing scenario sequences.

Although the survey did not reveal strong patterns, it hinted at a general sense of complexity and inconsistency in how the tools were used. This initial impression was further supported by early feedback and observational data from experienced users and developers, which indicated that users frequently encountered friction during scenario creation, particularly when trying to locate signals and understand their functionality.

To assess the scope and details of these issues, we conducted an initial structured user study with both experienced and potential users of ProSim and/or Director. Survey respondents who had answered yes to the question about participating in a user study were recruited to take part. The overarching goal was to investigate the potential for AI-powered enhancements based on users' pain points and overall experiences.

The user studies, described in more detail in Appendix A, were conducted with eight participants, and each session lasted approximately 30 minutes. First, participants received a brief tutorial on the functionality of both ProSim and Director to ensure a shared baseline of understanding, acknowledging that prior experience and proficiency with the software varied. Participants were then given time to explore Director independently to become more familiar with its interface and features. During the study, participants were encouraged to think aloud by verbalizing their thoughts and decision-making processes, helping us gather as much insight as possible into their behavior, challenges, and perceptions.

The main task required participants to create a short scenario involving:

• Changing camera angles and positions
• Opening and closing vehicle doors
• Activating the car horn for a specified duration

This scenario was chosen because it involved common, easy-to-understand signals that also varied in type and parameters. Minimal guidance was provided to ensure a fair and unbiased representation of each participant's experience.

The data documented during the user studies was primarily qualitative, including observations of hesitation, confusion, and general behavior, and it provided nuanced insights into usability issues. In addition, after the tasks were completed, participants were interviewed about their experiences, focusing on what aspects of the software they found satisfying or frustrating.
From these interviews and our observations, three major categories of user frustration emerged:

• Difficulty finding signals
• Lack of information on how signals and their parameters work
• Limited feedback and unclear information about how the program functions overall

Some of the identified issues stemmed from the underlying implementation of signals, elements that could not be directly addressed within Director alone. This prompted careful consideration of which problems could realistically be solved at the level of Director and which would require changes to the broader software ecosystem.

5.3 Problem Definition

Following our user studies, we analyzed both observational and interview data to identify the most pressing usability issues in Director. Our insights were primarily derived from qualitative methods, particularly think-aloud protocols and post-task interviews. Thematic coding was applied to the interview transcripts, allowing us to identify recurring patterns, frustrations, and pain points experienced by users.

5.3.1 Data Analysis

One of the most frequently observed issues was the difficulty participants faced in locating the correct signals when creating scenarios. This challenge was evident both in observed behavior and in participant feedback. Participants often attempted to search for signals using natural language terms such as driver's door or trunk, only to receive no results, as Director requires queries to match system-specific signal names like Door FL Open or tailgate. The absence of synonym support or semantic flexibility created friction, particularly for non-expert or occasional users unfamiliar with the tool's naming conventions. Even experienced users reported that the need to recall or uncover exact terminology interrupted their creative flow.

The full list of identified usability themes, along with the frequency of their occurrence across sessions, is presented in Figure 5.1. This visualization offers an overview of the most prominent barriers to effective interaction, helping to contextualize the relative impact of this and other issues.

Figure 5.1: Distribution of code categories in interview responses.

These issues were compounded by inconsistencies in how signals were structured. Some actions were represented by discrete signals (e.g., Door Open and Door Close), while others, like the horn, used a single signal (hornSoundOn) with parameters to control behavior. Users frequently searched for signals that did not exist, such as hornSoundOff, not realizing the action was handled differently. This lack of consistency increased cognitive load and hindered intuitive interaction.

A closely related concern was the configuration and interpretability of signal parameters. Although less commonly raised in interviews, this issue became apparent during observations and through our own experience using the tool. Parameters sometimes used technical labels (e.g., a parameter named boolean), which confused non-technical users. Others, such as color configuration for lights, used ambiguous input formats: RGB values required floats (0–1), but users often entered integers (0–255), leading to unexpected behavior. Without clearer information or feedback, even simple parameter adjustments became a point of friction.
Something worth noting is that codes like 'Frustration with signal parameters' appear in the lower half of the identified codes, yet parameter configuration became a major focus during the design and implementation of DirectorAI. There are a few reasons for this. One is that some user complaints or requests related to features that were simply not feasible for us to implement. For example, one 'Desired feature' was the ability to "fly" in ProSim, whereas the simulation currently only supports walking. Implementing these types of features was beyond our scope, especially since we did not have access to ProSim's source code. Another reason signal parameters became a focus is that codes like 'Lacking information' often referred to missing or inadequate documentation and metadata related to parameters, which made them difficult for users to understand or use effectively.

Table 5.1: Selected participant quotes

P1: "There is no guide for how the signals are different from one another. A lot of signals have unclear info."
    "The camera doesn't work as expected. And it's unclear what the difference between the camera and ego character is."

P2: "It lacks info for what signals do and what the parameters mean."
    "Some signals do just one thing but some do multiple things."

P3: "Improve the way to find signals, [it needs to] match with what I actually want to do."

P4: "You need that instant feedback. By the time I finally have a scenario ready, the requirements can have changed."
    "This type of work must be more intuitive. Like a car is."

P5: "[Other teams] always need to ask experts for help, because the tool is too technical for them."

P6: "There's so much potential here, but you need to make it faster."
    "It should be as simple as possible, so more people can try."
    "Getting started is hard, so many people skip it."

Beyond these specific issues, participants also expressed a broader sense of unfulfilled potential. Several interviewees envisioned using Director for fast-paced ideation or live scenario building during team discussions but found the program too rigid and slow to support such workflows. Others noted that while testing is valuable in Director and ProSim, it is often skipped due to the complexity and time investment required to build even simple scenarios. Selected participant quotes that influenced our understanding of Director's workflow issues and the subsequent work on DirectorAI can be seen in Table 5.1.

These findings led us to a clear problem framing: Director, while powerful, is limited by its reliance on system-specific syntax, technical terminology, and rigid workflows. These constraints make the tool difficult to access for new, occasional, or non-technical users, and slow even for experienced ones. This challenge is not unique to Director but indicative of a broader issue in many complex software tools: the gap between user intent and system operation.

5.3.2 How AI Could Help

At this point, we began exploring how AI, specifically natural language processing (NLP), might help bridge this gap. The idea was planted early in the project, as we reflected on how similar issues appear in other domains, such as software development, data analysis, and circuit design [99]–[102]. Users may know what they want to do, but not necessarily how to express it in system-specific terms, or they may benefit greatly from the automation AI provides.
AI interfaces have increasingly shown promise in translating natural human requests into structured commands, enabling smoother interaction with complex systems. This potential aligned closely with the challenges we uncovered in Director. AI could help users find signals even when they use imprecise or conversational terms, for example interpreting driver's door as Door FL Open, or mapping boot to tailgate. Language models can infer synonyms, correct misspellings, and interpret user intent, especially when trained or prompted with domain-specific context. Moreover, AI can act as a knowledge proxy: we can encode expert knowledge into prompts, descriptions, or examples, allowing the AI to answer questions and provide guidance without requiring users to rely on internal documentation or to interrupt their workflow by asking a colleague.

To further explore the feasibility of our approach and to better understand where the current issues stem from, we consulted Director's developers. These stakeholders provided valuable insights into why certain usability issues persist in the current software. When asked why documentation improvements had not been addressed, the developers pointed to the sheer scale of the system: Director has approximately 900 unique signals, and updating and standardizing all of them, along with their metadata and parameters, would be extremely time-consuming. Due to resource limitations and other focus areas, this has not been a priority.

Regarding the technical language used in Director, the developers explained that it aligns with their workflows and with how the software has historically been used. For example, a value like "HornSoundOn" with a boolean parameter controlled by true/false makes perfect sense to a software engineer. As a result, updating the structure or terminology of signals might improve clarity for some users but risks inconveniencing others who rely on, or have adapted to, the existing conventions.

These conversations reinforced our belief that AI could offer a viable alternative, not by replacing Director's structure, but by sitting between the user and Director's functionality. Rather than requiring extensive documentation rewrites, restructuring Director's backend, or disrupting existing workflows, AI could provide a semantic bridge that interprets the user's requests, selects relevant signals, and assists in parameter configuration.

In summary, our problem definition rests on three main points:

• Discoverability: Users struggle to find the right signals using natural language or intuitive phrasing.
• Interpretability: Signal parameters are difficult to understand or configure correctly without technical knowledge.
• Speed and accessibility: Scenario creation is slower and more effortful than it needs to be, especially for less technical users.

These challenges pointed us toward a solution that leverages AI to interpret user intent, embed expert knowledge, and reduce cognitive load, making Director more accessible without requiring a fundamental redesign. The sketch below illustrates the semantic-bridge idea in its simplest form.
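The following fragment is a deliberately simplified sketch, not DirectorAI itself: it embeds a small, illustrative catalogue of signal names and descriptions in a system prompt and asks a general-purpose model to select the signals matching a conversational request. It anticipates the early prototype described in Section 5.4.2 and again assumes the openai Python client with GPT-4o-mini; the signal descriptions are our own paraphrases for illustration.

from openai import OpenAI

client = OpenAI()

# Illustrative catalogue; the real Director exposes roughly 900 signals.
signals = {
    "Door FL Open": "Directly opens the front-left (driver's side) door.",
    "tailgate": "Opens or closes the tailgate (boot/trunk).",
    "hornSoundOn": "Activates the horn; a boolean parameter switches it on/off.",
}
catalogue = "\n".join(f"- {name}: {desc}" for name, desc in signals.items())

system_prompt = (
    "You map user requests for a vehicle simulation onto signals.\n"
    "Recommend only signals from the following list, and explain any parameters:\n"
    + catalogue
)

reply = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "open the driver's door and honk the horn"},
    ],
).choices[0].message.content
print(reply)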
5.4 Design and Implementation

This section presents the design process and implementation details of our AI assistant for Director. The goal of this tool is to improve the usability of scripting 3D simulation scenarios by allowing users to interact with Director via natural language. We begin by introducing terminology used throughout this section, followed by a breakdown of the assistant's architecture, interaction modes, and AI prompt strategies. Lastly, we outline the specific components developed during this project.

5.4.1 Terminology

To ensure clarity, we define several terms used throughout this section:

• Signal: A command issued from the Director program to ProSim, where the 3D simulation occurs. Each signal controls a specific aspect of the simulation, such as opening and closing car doors, adjusting the camera, or modifying weather conditions. Signals are added to Director's timeline as actions, each with a defined start time and associated parameters. When the timeline is played, these actions are triggered at the specified times, sending the corresponding signal and its values to the simulation.

• AI Mode: A selectable interaction context that defines the behavior of the AI assistant. Each mode corresponds to a distinct use case (e.g., camera positioning or signal editing) and determines how user prompts are interpreted and processed by the assistant and LLM.

• Generator: A modular processing unit responsible for two tasks: (1) composing and sending a task-specific system prompt to the LLM, and (2) handling the returned data in a structured way to produce meaningful changes in the simulation environment.

• Classifier: An AI module that dictates which AI generator is most appropriate for the given input.

• Pipeline: A chained sequence of generators and classifiers that transforms a user prompt into application-specific output. Pipelines allow for multi-step processing, such as refining AI outputs through further generators or user interactions.

5.4.2 Early Attempts: Simple Prompting

Our initial approach was simple: we added a chat window to the existing interface where the user could enter a prompt, which we then passed to the AI. The user's prompt, along with the entire list of available signals and their descriptions, was sent to a GPT-4o-mini model via an API request. The model was instructed to recommend the most appropriate signal based on the prompt.

Despite its simplicity, this worked surprisingly well in many cases. The model could often infer what the user wanted and suggest one or several relevant signals. However, several limitations quickly became apparent. First, signal descriptions alone were often not detailed enough to make an informed decision. Two signals might appear similar and have similar descriptions, for example 'Door Open FL' and 'Ext Door Handle R1 L Ui', but behave quite differently: the first directly opens the door, while the second simulates a user pulling the handle, which may fail if the door is locked. Secondly, even when the correct signal was found, users still had to manually configure its parameters, sometimes without knowing what the parameters meant or how they worked. Third, some signals required contextual or spatial understanding not conveyed in their metadata. For example,