Automated Validation of Test Cases Using Generative AI

Development of a program for generating test cases utilizing generative AI for system requirements in the automotive industry.

Degree project report in Electrical Engineering, 2025

Jarl Dang, Daniel Göransson

Department of Electrical Engineering
Chalmers University of Technology
SE-412 96 Gothenburg, Sweden
Telephone +46 31 772 1000
www.chalmers.se

© Jarl Dang, 2025. © Daniel Göransson, 2025.

Supervisor: Jonatan Rydberg, Volvo Group
Examiner: Hazem Torfah, Computer Science and Engineering

Cover: Image illustrating the similarities between the neurons in an AI and the neurons of a brain. Photo by Growtika on Unsplash.

Typeset in LaTeX, template by Kyriaki Antoniadou-Plytaria

Gothenburg, Sweden 2025

Abstract

This project was provided by Volvo and was intended to help automate their current system of creating validation test cases using generative artificial intelligence (AI). Automation of this system reduces development time and costs while enhancing efficiency. The project covered theory regarding AI, Large Language Models (LLMs), prompt engineering, test cases and system requirements. Following an evaluation of available open-source LLMs, the model QWQ-32B was selected. By applying prompt engineering techniques, the model was able to generate not only validation test cases but also executable test code. The program was able to generate adequate results according to Volvo staff. However, the results need some adjustments in order to be viable for a variety of test cases. Potential improvements, such as the integration of Retrieval-Augmented Generation (RAG), are discussed in this report as future directions to address current limitations. As a result of this project, Volvo has been provided with a solid foundation for automating the generation of validation test cases and executable test code.

Keywords: automotive, system requirement, test cases, Large Language Models, LLM, AI, Gen AI, RAG, prompt engineering, HIL

Acknowledgements

This project was conducted in collaboration with Volvo Group and would not have been possible without the contributions of several skilled colleagues. We would like to thank Giorgos Georgiadis for his support with the AI component of the project. We are also deeply grateful to Martin Hill and Fabricio Lélis de Melo for their invaluable assistance with the Jama platform.
Special thanks go to our manager, Ivana Jern, for initiating and coordinating the project, and to our Volvo supervisor, Jonatan Rydberg, whose consistent guidance and weekly support significantly contributed to the end result of the project.

Jarl Dang, Daniel Göransson, Gothenburg, March 2025

List of Acronyms

Below is the list of acronyms used throughout this thesis, in alphabetical order:

AI - Artificial Intelligence
API - Application Programming Interface
CAN - Controller Area Network
DFS - Depth First Search
HIL - Hardware In the Loop
IDE - Integrated Development Environment
LLM - Large Language Model
NLP - Natural Language Processing
RAG - Retrieval-Augmented Generation
VAS - Volvo Autonomous Systems

Contents

1 Introduction
  1.1 Background
    1.1.1 AI
    1.1.2 HIL
    1.1.3 Requirements Traceability
  1.2 Purpose
  1.3 Goals
  1.4 Delimitations
  1.5 Contributions
2 Theory
  2.1 AI
    2.1.1 Neural Networks
    2.1.2 LLM
    2.1.3 Generative AI
  2.2 API
  2.3 Prompt Engineering
  2.4 Retrieval Augmented Generation (RAG)
  2.5 Depth First Search
  2.6 Testing
    2.6.1 Test cases
    2.6.2 System requirements
    2.6.3 Jama Connect
    2.6.4 PyTest
3 Methods
  3.1 Researching potential models
  3.2 Execution
    3.2.1 Approach
    3.2.2 Signal interfaces
    3.2.3 Validation Test Cases Template
    3.2.4 Generating Validation Test Cases
    3.2.5 Generating Code
    3.2.6 Logging the results
4 Results and Discussion
  4.1 Workflow
  4.2 Automated Validation of Tipping Bucket
    4.2.1 Generated Validation Test Case
    4.2.2 Generated Code
  4.3 Validate Trailer Tipper Bucket State Transitions
    4.3.1 Generated validation test case
    4.3.2 Generated Code
  4.4 Overall
5 Conclusion
A Appendix 1

List of Figures

2.1 Figure which illustrates the different AI concepts and subcategories.
2.2 Simple illustration of a general neural network with a limited number of hidden layers.
2.3 The neural network of a typical Large Language Model with only 2 hidden layers illustrated. The input context data and output data are also explicit. The predictive probabilities are taken from the GPT-2 model.
2.4 Diagram showing the process of using an API [8].
2.5 An example of how the architecture of a simple RAG implementation may look.
2.6 The order in which Depth First Search would explore this specific tree structure.
3.1 Jama structure visualised as a tree data structure.
4.1 Flowchart showing the program structure.
4.2 Generated Python code based on the given validation test case 4.1.
4.3 Generated Python code to validate Trailer Tipper Bucket State Transitions based on the given validation test case 4.2.
4.4 Figure illustrating the potential feedback loop of having two AIs work in combination to automate the principle of prompt engineering, either by improving the prompt to create a better output, or by improving the output directly.

List of Tables

3.1 Comparison of Selected Large Language Models

1 Introduction

1.1 Background

The usage of artificial intelligence (AI) models has grown significantly over the last couple of years since the introduction of ChatGPT. These AI models have helped people in their daily lives by automating boring and tedious tasks, but their impact goes far beyond that. AI is now driving innovation, making critical decisions, and even transforming entire industries, reshaping the way we live and work on a fundamental level. Our opportunities with AI provide us with vast potential for the development of society, enhancing efficiency and unlocking innovations we have yet to imagine.

A topic currently on everyone's lips is self-driving cars, or more broadly, the integration of AI in vehicles. This can be uncomfortable for some people, as the idea of machines making decisions raises concerns about safety, reliability, and control. However, we should view this more as an opportunity than a threat, as AI has the potential to refine the way we drive.

With cars becoming increasingly dependent on software, the complexity of their codebase continues to grow. Every new feature, whether safety-related or performance-enhancing, requires thousands of lines of code, and every feature requires system requirements to ensure the user's safety and the product's functionality. For Volvo, this means that managing and maintaining software has become just as important as designing the physical vehicle itself.
But with millions of lines of code, outdated and inefficient parts can slow development, making software management a growing challenge. Our hope with this thesis is to address these challenges by combining AI with code. By developing a tool which helps developers generate solid foundations for test cases, we can potentially increase their efficiency: instead of having to start from scratch with every system requirement, they only need to apply the finishing touches.

The background to this project can be divided into three main sections: the technology that makes up the foundation of the project (1.1.1), its main area of application (1.1.2), and the problem that it is trying to solve (1.1.3). The delimitations that were made by Volvo and why they were made are also presented (1.4).

1.1.1 AI

Artificial Intelligence (AI) is a technology which has received a lot of attention for its seemingly limitless applications. It enables computers and machines to simulate human learning in order to solve problems. One technology under this umbrella concept is the LLM, an AI that has been specialized in processing natural language. Large Language Models (LLMs) are the most common AI models for creating original content from user input, since they take input in natural language and are therefore relatable and easy for humans to instruct [1]. AI which is used to generate new content is categorized as generative AI and will be the main focus of this project. Generative AI differs from LLMs in that it is free to generate new content in any format, not necessarily just text.

1.1.2 HIL

The automotive industry has been investing enormous amounts of effort and resources in order to seize the advantages that AI could potentially bring, such as autonomous vehicles, and Volvo is no different in this regard. One area in which Volvo sees potential benefits is in combination with Hardware in the Loop (HIL) testing. HIL testing is the process of testing the software of a vehicle's hardware by simulating real-world environments and scenarios without needing the actual vehicle. This is useful since it can be costly, time intensive and dangerous to test the full physical system each time for every system requirement [2]. Instead of attaching the hardware to a real vehicle, data is simulated using computer models and sent via Controller Area Network (CAN) or other signals to the hardware system being tested. The CAN signals act as the bridge between the simulated data and the physical components and allow the simulated world to communicate with the real hardware.

1.1.3 Requirements Traceability

In order to ensure that products fulfil industry standards and regulations, several requirements have to be implemented and complied with. In order to trace all the system requirements, and their related test implementations and results, the platform named Jama Software is used at Volvo Autonomous Systems (VAS). Vehicles are large systems requiring thousands upon thousands of system requirements in order to ensure functionality and user safety. Each component needs to fulfil its system requirement, so the number of tests needed grows rapidly with the number of components. Writing these tests is therefore time consuming; even though each test may be simple, the sheer number of tests is a challenge. Utilizing generative AI to write these tests could potentially provide immense benefits regarding safety and efficiency.
1.2 Purpose

The purpose of this project is to enhance and automate the existing test framework to enable seamless execution of tests in a HIL environment. By leveraging open-source AI models, the project aims to transform human-readable system requirements and test specifications into executable Python code, ready to be tested with HIL equipment. This approach will significantly streamline the process from requirements to test results, enabling faster feedback and shorter development cycles. The hope is also for the AI to cover more areas than a human may think of, thereby testing scenarios that may not have been tested before and improving the quality of testing.

1.3 Goals

The goal of this project is to automate the process of creating test cases from system requirements in HIL environments. This is done via several steps:

• Utilize prompt engineering in order to tailor the LLM output until a satisfactory result is produced, meaning a valid template and functional test case code.
• Create a seamless integration between the Jama requirement, the AI model, and the hardware, using Python as our programming language and API endpoints for retrieving information.
• Provide a valuable tool for Volvo to further develop and apply to other areas of development.

1.4 Delimitations

These delimitations were set in order to focus only on the important aspects and accommodate the given time frame.

1. This project will only focus on LLMs that meet the requirements set by the project team at Volvo. The specific requirements are as follows:
   • Local deployment - The LLM needed the capability of being deployed on local hardware. A non-local model posed a risk of data leakage and was not assessed to be safe enough.
   • Large input capacity - In order for prompt engineering to be unrestricted and receive more context, the AI model had to be able to handle large sets of text strings as input data. The model size limit set by the available hardware capacity was 32B parameters.
   • Processing speed - The time it takes for the LLM to process a request cannot be outside a reasonable time frame if the promise of increased production speed is to be kept. A reasonable time frame in this regard was assessed to be within 10 minutes, which was decided after a dialogue with employees at Volvo. This is to optimize the time used for engineering the input data.
   • Pre-trained - To ensure accordance with the time plan and thereby a quick setup process, the LLM had to be pre-trained in natural language and code generation for the task to be fulfilled. It also had to be of reasonable size with regards to the hardware that was available at the Volvo office.
   • License - To use this program to develop a commercial product, a free-use license had to be used for the LLM. The specific license that was deemed most appropriate was the Apache License 2.0. It allows for free use and free distribution of the software, including for commercial use, with minimal restrictions [3].

2. The tuning of the LLM will only be done with different input prompts and by adjusting certain parameters such as temperature, since Volvo does not have any large dataset that can be used for fine-tuning the model for this task.

3. The evaluation of the output from the AI will only be based on previously written test cases and the functionality of the generated code.

4. The project will not cover a user interface, in order to focus on the internal functionality of the program.

1.5 Contributions
1. Prompts which are adjusted for QWQ-32B to generate validation test cases from system requirements.
2. Prompts to generate code from validation test cases.
3. A workflow map which illustrates the process of the project from start to finish.
4. Suggestions for potential improvements when creating and implementing an LLM with an existing database.

2 Theory

2.1 AI

AI is an umbrella concept which includes several other terms such as LLMs and generative AI. This is illustrated in figure 2.1, which indicates the technical terms that AI encompasses. As seen, AI is a broad and ubiquitous term within tech, capable of a wide range of diverse tasks. Consequently, it lacks a properly agreed-upon definition in industry and the scientific world. In this paper, AI will be defined as per NASA's definition, which in turn references the National Defense Authorization Act of 2019 [4]. As such, AI is defined as computer systems which can perform complex tasks normally done by humans, such as reasoning and decision making.

Figure 2.1: Figure which illustrates the different AI concepts and subcategories.

2.1.1 Neural Networks

Most modern AIs utilize a concept called neural networks in order to process data, although several other approaches to training and creating AIs exist. Neural networks are inspired by the human brain and consist of interconnected nodes, so-called neurons, that process information in layers, similar to humans. These neural networks often have large numbers of processors working in parallel and arranged in several layers. In figure 2.2 the structure of a typical neural network can be seen. It consists of an input layer, a number of hidden layers and an output layer. A comparison is made in [5] which relates the input layer to the optic nerves in human visual processing. The hidden layers would then represent the neurons in the brain which process the raw data in order to create an image (output) for us humans to see. The number of hidden layers a neural network contains is also referred to as the depth of the network or AI.

To further delve into the hidden layers and how they process the raw data, we can imagine them as a control panel consisting of several thousands of dials (nodes). Each dial is associated with an adjustable number (weight). Adjusting these dials modifies the weights, which determine how input signals are transformed and propagated through the network. Each node computes a weighted sum of its inputs and transmits the resulting continuous value to the next layer. Setting these dials to their optimal value for each input is an impossible challenge for humans to solve manually. There are several algorithms for adjusting the dials which yield different results; some popular alternatives are Deep Learning methods such as Backpropagation and Transformers [5]. All of them utilize large datasets as answer sheets in order to train the AI. A high quality dataset is often key in training an AI model to perform well.

Figure 2.2: Simple illustration of a general neural network with a limited number of hidden layers.
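To make the weighted-sum mechanics concrete, the sketch below propagates one input vector through a small fully connected network in Python with NumPy. The layer sizes, random weights and ReLU activation are arbitrary illustrations for this report, not taken from any specific model.

```python
import numpy as np

def relu(x):
    # Simple non-linear activation applied after each weighted sum
    return np.maximum(0, x)

def forward_pass(x, layers):
    """Propagate an input vector through a list of (weights, biases) layers."""
    for weights, biases in layers:
        # Each node computes a weighted sum of its inputs plus a bias;
        # the activation decides what is transmitted to the next layer.
        x = relu(weights @ x + biases)
    return x

rng = np.random.default_rng(seed=0)
# A toy network: 3 inputs -> 4 hidden nodes -> 2 outputs (arbitrary sizes)
layers = [
    (rng.normal(size=(4, 3)), np.zeros(4)),  # input -> hidden
    (rng.normal(size=(2, 4)), np.zeros(2)),  # hidden -> output
]
print(forward_pass(np.array([0.5, -1.0, 2.0]), layers))
```

Training a network amounts to adjusting the weight matrices in such layers until the outputs match the dataset's answer sheet.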
2.1.2 LLM

Today there are several popular LLMs, such as ChatGPT, Claude and DeepSeek. LLMs are AIs which are specialised in processing natural language, such as the languages humans use to communicate. They are statistical models which try to predict the next token given a context. Tokens refer to the basic building blocks of natural languages, such as punctuation marks, individual words or Chinese characters. The context includes the user input, but may also include previous chat history or stored memory. The neural network of an LLM can be seen in figure 2.3. It is clear that it follows the general structure of a neural network as previously illustrated, the difference being that the weights have values which result in it being an expert in processing natural languages.

Figure 2.3: The neural network of a typical Large Language Model with only 2 hidden layers illustrated. The input context data and output data are also explicit. The predictive probabilities are taken from the GPT-2 model.

2.1.3 Generative AI

Generative AI is a class of artificial intelligence systems designed to generate new, original content based on patterns learned from training data. It does not simply retrieve or remix existing data; instead, new content is generated that resembles its training data. The generative part comes from its ability to create, unlike traditional AI which typically focuses on classification or prediction. This includes popular LLMs such as ChatGPT, which generates text, but it also extends to AIs which focus on other forms of media, such as Midjourney, which generates images, or Jukebox, which generates raw audio. One major aspect of generative AI in industry is its ability to generate code, since code is simply plain text. This has already proven to be an invaluable tool for developers in increasing efficiency and improving the learning process [6].

2.2 API

API stands for Application Programming Interface and is a useful tool that provides a channel for two different software programs to communicate. This is done using a set of rules and protocols that are used to request and provide data between the two programs. The request is done using code and often includes an endpoint URL, a Hypertext Transfer Protocol (HTTP) method and certain parameters. This process of communication provides a standardized way for users to access other applications' data and is used in many businesses and organizations, saving time and money [7].

Figure 2.4: Diagram showing the process of using an API [8].
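As an illustration of such a request, the snippet below performs a typical HTTP GET call with the Python requests library. The base URL, token and response layout are hypothetical placeholders, not the actual Jama endpoints used in this project.

```python
import requests

# Hypothetical endpoint and credentials, for illustration only
BASE_URL = "https://example.com/rest/v1"
headers = {"Authorization": "Bearer <token>"}

# A request combines an endpoint URL, an HTTP method (GET) and parameters
response = requests.get(f"{BASE_URL}/items", headers=headers, params={"project": 42})
response.raise_for_status()      # Raise an error on a non-2xx status code
payload = response.json()        # The response body is typically JSON
items = payload["data"]          # Assuming the API wraps results in a "data" field
print(f"Retrieved {len(items)} items")
```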
2.3 Prompt Engineering

Prompt engineering is the technique of carefully crafting and refining input prompts to improve the quality and precision of the output generated by the LLM. It is a way of carefully instructing and guiding the LLM closer to the desired outcome. This can be a useful technique in scenarios where training data or time is limited and the complexity of the task is reasonable [9]. It also allows the user to have more control over the model and ensure that the model's output is aligned with user goals. Prompt engineering is not training the LLM; no weights are being adjusted, nor is the neural network of the LLM changing. It is simply a way to better utilise it and unleash more of its potential.

Although prompt engineering has many benefits, there are also challenges with implementing such a technique. Since LLMs are statistical models, the output will always vary even if the input remains identical. This may lead to inconsistent results and makes it hard to quantify the quality of the input. Another challenge is the limited context window. For complex tasks which may require a large context window, it may not be possible for the LLM to process all of it, thus limiting the LLM's understanding of the problem.

One fundamental and famous technique which a lot of programmers utilize is called rubber duck debugging (rubberducking). The principle is that a developer explains their code in natural language to an inanimate object, such as a rubber duck, in order to better conceptualize their approach and avoid tunnel vision [10]. The potential benefits of combining rubberducking and prompt engineering are considerable, allowing developers to walk through their code and receive instant feedback from the LLM. This leads to a key concept within prompt engineering: an iterative feedback cycle where the prompts continually improve to better serve the developer's goal.

2.4 Retrieval Augmented Generation (RAG)

RAG is a technique that enables LLMs to access more context before processing a request. A retrieval framework retrieves and incorporates new information by first searching a predetermined database. The retrieved data is used as additional context for the LLM to produce an improved response. The architecture of a simple RAG implementation can be seen in figure 2.5. The client's question is first sent to a framework which combines the question with contextual data. It is then prompted to the LLM, which generates a response; the response is then processed and sent back to the client.

Figure 2.5: An example of how the architecture of a simple RAG implementation may look.

This technique allows the LLM to access more information in order to process the question with higher confidence. Consequently, the result should be of higher quality and relevance to the client. A study found that RAG improved the results of their GPT-4 model by 4% and reduced hallucinations [11]. One important factor is that the database provides relevant high quality data, since this will serve as the foundation for the LLM to process the given request.

2.5 Depth First Search

DFS is a search algorithm commonly used to explore data structures such as graphs and trees. Since DFS prioritises depth first, it is generally the fastest way to search a tree when the depth of the tree is of interest. These data structures consist of vertices and edges. The algorithm starts at a root node which marks the start of the tree. It then explores adjacent vertices and prioritises depth before breadth, meaning it searches as deep as possible in a given branch before exploring other paths. After the deepest level has been reached, it backtracks until it finds an unexplored neighbour and starts over again. This is repeated until all vertices have been explored. The search order and depth priority can be seen in figure 2.6.

Figure 2.6: The order in which Depth First Search would explore this specific tree structure.
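A minimal recursive sketch of DFS over such a tree is shown below; the example tree is a generic illustration, not the Jama structure used later in the project.

```python
def dfs(node, visit, children_of):
    """Depth First Search: visit a node, then fully explore each child
    branch before moving on to the next sibling (depth before breadth)."""
    visit(node)
    for child in children_of(node):
        dfs(child, visit, children_of)

# A small example tree as an adjacency mapping; "A" is the root node
tree = {"A": ["B", "E"], "B": ["C", "D"], "C": [], "D": [], "E": []}
order = []
dfs("A", order.append, tree.get)
print(order)  # ['A', 'B', 'C', 'D', 'E'] -- each branch is exhausted first
```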
2.6 Testing

2.6.1 Test cases

Test cases are an integral part of the production chain in software and hardware development. A test case is used when testing a product and can be defined as a pair of inputs and outputs to assess the functionality of software or hardware components. The inputs and outputs are produced according to a test specification and then executed using the tested software, typically named software under test (SUT), and the result is then reported back and documented as part of an evaluation of the software [12]. However, designing a test case and its designated test data requires a lot of resources, such as cost and time, and can even account for up to half of the development costs. One attempt to reduce these costs is to automate the process, thereby reducing the labour costs and improving production efficiency [13].

2.6.2 System requirements

With the rapid development in the automotive industry come new innovations and increasingly complex systems. To ensure that these technologies are safe, functional and comply with automotive legislation, system requirements are created. A system requirement is commonly created in order to meet customer expectations, ensure quality and safety, or meet necessary legislation [14]. It is also used for design purposes by creating internal requirements that the company deems important.

In order to create a standardized set of regulations on an international level, Volvo uses the ISO standards. The ISO standards were created by the International Organization for Standardization in order for companies to follow a unified set of standards when creating a product. This ensures that the product meets a standardized set of regulatory requirements and in turn promotes high-quality and safe products [15].

2.6.3 Jama Connect

Jama Connect is a requirement management platform for companies within the automotive industry that is designed to support developers with documentation and traceability during the development process. It features compliance support for ISO standards, which is critical in order to create safe and secure products that comply with industry regulations. Another key functionality is providing a framework for managing testing, such as running test cases and logging test runs [16].

2.6.4 PyTest

When developing software in Python, the most common framework for testing Python code is PyTest. Its main purpose is to make it easy to write readable tests that can scale to more complex functional tests [17].
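As a brief illustration, the snippet below shows the PyTest conventions that appear later in this report: functions prefixed with test_ are discovered automatically and each failing assert is reported. The helper function is an invented example, not part of any Volvo code base.

```python
# test_angle.py -- run with: pytest test_angle.py
def bucket_below_threshold(angle, threshold=15):
    # Illustrative helper for the example, not actual project code
    return angle < threshold

def test_angle_below_threshold():
    assert bucket_below_threshold(10) is True

def test_angle_at_threshold_is_not_below():
    assert bucket_below_threshold(15) is False
```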
3 Methods

In this chapter the methodology of the project is described. The project began with researching different models and exploring the needs and demands of Volvo. There were several meetings with test developers and AI specialists working at Volvo in order to lay out a loose plan. A Gantt chart was created in order to plan and structure the progress of the project, although this early on in the project the tasks were not immediately obvious and were prone to change as the project developed. Twice a week, meetings were held with a supervisor at Volvo in order to receive feedback and guidance on the project.

3.1 Researching potential models

To identify the most suitable language model for the process description that was provided by Volvo, a survey of LLMs was conducted. For a particular model to meet the project requirements, it had to satisfy a defined set of criteria. These criteria were created together with the team at Volvo to ensure functional, technical and security constraints. The requirements, stated and described in the delimitations, were as follows:

• Local deployment
• Large input capacity
• Processing speed
• Pre-trained
• License

The LLMs that were surveyed were:

• DeepSeek V3
• DeepSeek-R1-Distill-Qwen-32B
• QWQ-32B
• Falcon LLM
• Claude Sonnet 3.5

| Model | Pros | Cons |
|---|---|---|
| DeepSeek V3 [18] | Solid benchmarks; processes large texts (64k tokens); open-source; cost-effective, fast; adjustable weights | No reasoning capability; low max output (8k); possible censorship concerns |
| DeepSeek-R1-Distill-Qwen-32B [19] | Smaller, faster version of R1; less hardware intensive; cheaper to run; open-source | Lower max context tokens; lower output tokens; slight performance drop (5-15%) compared to R1 |
| QWQ-32B [20] | Compact model; good reasoning; open-source; quick response time; solid benchmarks | Limited adoption; fewer tools |
| Falcon 3 LLM [21] | Open-source; Apache 2.0; strong benchmarks (Falcon 3) | Large models need compute |
| Claude Sonnet 3.5 [22] | Beats GPT-4o in benchmarks (math, language, coding) | Hard to download locally; license needs review; price plan |

Table 3.1: Comparison of Selected Large Language Models

After communication with an AI specialist at Volvo, the LLM that was deployed for testing on local hardware was an open-source model named DeepSeek-R1-Distill-Qwen-32B. It showed clear signs of high performance in coding tests as well as in English tests, with results that were comparable to more common models like GPT-o1 and Claude-3.5 [23]. The LLM was changed to QWQ-32B after three weeks, a decision that was made by Volvo as they deemed it to be more efficient. This was shown to be true when testing the model, as the LLM followed instructions more thoroughly than before.

3.2 Execution

Visual Studio Code was utilized as the IDE to write the program. The program itself was written in Python version 3.13, with pip as the package manager. Several dependencies were installed using pip in order to communicate with the AI and the Jama API, such as huggingface and py-jama-rest-client. A code library from Volvo was also supplied to make certain requests possible from Jama utilizing their API and endpoints. Access to their API was given by creating a username and password in accordance with Volvo policy. The program was then built with two main purposes: generating a validation test case or generating executable code.

3.2.1 Approach

In order to instruct the LLM and produce the best results, prompt engineering was leveraged. Several initial approaches were discussed with the engineers at Volvo, such as fine-tuning or RAG. Most were deemed too time and resource intensive, so it was settled that prompt engineering would yield sufficient results within the constraints of this project. First a request was sent to the LLM in natural language, in order to generate a validation test case. The context was simply the system requirement alone. This was then improved over several iterations of analysing the context and having test case experts, as well as ourselves, manually judge the quality of the output. Improvements were then made, such as giving the LLM high quality examples of what the output should resemble or clarifying instructions. After iterating several times, a structured way to input the request was created. The input consisted of an instruction file, a prompt file, and several high quality examples. At this point the user could ask the LLM to generate either a validation test case given a system requirement, or executable code given a validation test case. The different files varied depending on which step was to be fulfilled.

3.2.2 Signal interfaces

In order to retrieve all of the CAN signal interfaces and downstream system requirements, the Jama structure was constructed as a tree with nodes, as seen in figure 3.1. The given system requirement or function acted as the root node, and the other nodes were then explored via DFS. Since the signal interfaces at the lowest level are of interest, DFS is an efficient way to search for and retrieve them. This method also allowed the user to search for nodes other than signal interfaces, depending on what the Jama API allowed. By using this approach, all the downstream CAN signals were extracted and could be fed into the LLM in order for it to generate code signal names corresponding to the physical hardware. By simply providing a Jama ID for the relevant system requirement, the program could now extract all of the downstream CAN signals.

Figure 3.1: Jama structure visualised as a tree data structure.
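A sketch of this traversal is given below. The client object and its get_children/get_item methods are hypothetical stand-ins for the py-jama-rest-client calls used in the project; only the traversal logic reflects the approach described above.

```python
def collect_signal_interfaces(client, item_id, signals=None):
    """DFS over the Jama item tree rooted at item_id, collecting the CAN
    signal interfaces found at the leaves.

    client.get_children(item_id) and client.get_item(item_id) are
    hypothetical wrappers around the Jama REST API, for illustration."""
    if signals is None:
        signals = []
    children = client.get_children(item_id)
    if not children:
        # Leaf node: in this project, the lowest level holds the signal interfaces
        item = client.get_item(item_id)
        if item.get("itemType") == "signal_interface":  # assumed type label
            signals.append(item["name"])
        return signals
    for child in children:
        # Explore each branch fully before moving to the next (depth first)
        collect_signal_interfaces(client, child["id"], signals)
    return signals
```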
3.2.3 Validation Test Cases Template

A template was created to provide the LLM with clear instructions and an easy way to fill in the requested information. It was written in markdown in order to structure the information in sections with headers. The template took inspiration from high quality examples on Jama, given by test engineers at Volvo. The markdown block 3.1 shows a small snippet of the template and clearly shows the benefits of writing in markdown. Additionally, many LLMs are trained on data written in markdown, making them well suited to understanding that specific format. This template was constructed following high quality validation test cases already uploaded to Jama, with the goal of producing similar results.

```markdown
# TEST CASE TEMPLATE INSTRUCTIONS
You are to complete the template below by replacing fields marked with
`[FILL]` with appropriate content. The `[FIXED]` and `[FILL]` markers are
only guides and **must not appear in your final output**.

## FORMATTING REQUIREMENTS
- Remove all `[FIXED]` and `[FILL]` markers from your final output
- Preserve all formatting, including bold text, line breaks, numbering, and spacing
- Use precise technical language appropriate for validation test cases
- Default values in *italics* are suggestions to use when no specific value is indicated
- Leave any field empty with a dash (-) if no relevant information exists

## TEMPLATE STRUCTURE
[FIXED]# PROJECT ID: [FILL]
...
```

Markdown Code 3.1: A small snippet of the validation test case template written in markdown.

3.2.4 Generating Validation Test Cases

To generate a validation test case, several inputs were given to the LLM. Following the general structure, the LLM was fed instructions, examples and the entire template A.1 to fill in. The LLM then generated a response which included internal thinking, which made it necessary to extract the validation test case manually from the response. This process was repeated several times in order to improve the instructions using prompt engineering. When the output was satisfactory for one system requirement, several others were input in order to test the stability of the model and instructions.
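One plausible way to compose this structured input is sketched below; the file layout and helper names are illustrative, not the actual ones used in the program.

```python
from pathlib import Path

def build_prompt(instructions_file, template_file, example_files, system_requirement):
    """Compose the LLM input: instructions, high quality examples, the
    markdown template to fill in, and finally the system requirement."""
    parts = [Path(instructions_file).read_text()]
    for example in example_files:
        parts.append("## EXAMPLE\n" + Path(example).read_text())
    parts.append("## TEMPLATE\n" + Path(template_file).read_text())
    parts.append("## SYSTEM REQUIREMENT\n" + system_requirement)
    return "\n\n".join(parts)
```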
3.2.5 Generating Code

To generate code, a validation test case is input into the LLM and acts as the main guideline for shaping the code. The purpose of the code is to perform the necessary steps and tests in order to fulfill the requirements of the validation test case. Once again the LLM responded with internal thinking, making it necessary to extract the code manually. This method was then repeated in the same way as when generating validation test cases.

3.2.6 Logging the results

A folder and file structure was implemented in order to log the results from each iteration. This was done utilizing Python's ability to create and edit files. Depending on what is generated, the output differs and is logged in its own separate folder. Besides the generated output, the logs also include the entirety of the LLM input and context to better track and quantify which inputs yield good results.
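A minimal sketch of such a logging structure is shown below, assuming one folder per run grouped by output type; the directory and file names are illustrative, not the ones used in the project.

```python
import os
from datetime import datetime

def log_run(kind, llm_input, llm_output, root="logs"):
    """Write the full LLM input/context and the output of one run to its
    own folder, separated by what was generated (test case or code)."""
    assert kind in ("validation_test_case", "generated_code")
    run_dir = os.path.join(root, kind, datetime.now().strftime("%Y%m%d_%H%M%S"))
    os.makedirs(run_dir, exist_ok=True)
    with open(os.path.join(run_dir, "input.txt"), "w") as f:
        f.write(llm_input)
    with open(os.path.join(run_dir, "output.txt"), "w") as f:
        f.write(llm_output)
    return run_dir
```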
4 Results and Discussion

The output can be split into two sections: the generated validation test case and the generated code. In order to test the LLM, several validation test cases were generated. They all share the same instructions and examples which the LLM can learn from. The only difference is the user prompt, which is the system requirement description. The same principles apply for the generated code results, the difference being the validation test case from which the code is generated.

4.1 Workflow

The flowchart seen in figure 4.1 is designed to visualize and divide each task into smaller sub-tasks in accordance with the project description. This is done with the intention of reflecting each part of the system and how they are connected, but also the order of execution in the program.

Figure 4.1: Flowchart showing the program structure.

To provide additional clarity on the functionality of each major component of the program, the flowchart has been divided into larger distinct sections. Each section is subsequently described in detail.

1. Data extraction from Jama
The process begins with the user providing a prompt in the form of a requirement description or a specific Jama ID. Based on this input, the system accesses the corresponding Jama data and extracts both the description of the requirement and the CAN signal names. This is then compiled into a set of data which can be configured for each run to alter the output.

2. Test case generation
In the second part of the program, the compiled data along with a pre-defined template for documentation of test cases is fed into the LLM in order to generate the validation test case document.

3. Code generation
In the last step of the process, the previously generated validation test case is fed back into the LLM for the generation of executable code. The generated code is then executed while connected to the HIL rig and the result of the test execution is returned.

4.2 Automated Validation of Tipping Bucket

The following results have been generated in order to validate the shaking capability of a truck's bucket at varying angles. The results have been manually extracted from the LLM output and do not show the internal thinking, which was part of the original output.

4.2.1 Generated Validation Test Case

Besides the instructions and examples, the input given to the LLM in order to generate this result is the following system requirement description:

When TrailerBucketIMUStatus indicates that the tipping angle is below BucketShakerThreshold, the Shaker Control Function shall set Shaker Capability to ShakerNotAvailable.

The result is output in markdown, since the inserted template was written in markdown. As seen in the markdown block 4.1, the LLM fills the validation test case template (A.1) with the relevant information in order to satisfy the system requirement. Some fields are left blank, as the LLM deems that it lacks sufficient information to fill those headers, such as ASSIGNEE, RELEASE and TEST RUN RESULTS. Another thing to note is that the judgment of the LLM is far from flawless, as seen in the markdown block. The threshold values are arbitrarily chosen since this data was missing from the context. The goal of this system requirement was to validate that the bucket is incapable of shaking when below a certain tipping angle. This is clearly described in the template under Description. It can also be seen that the preconditions and configurations for the system are reasonable, as well as the steps needed to ensure that the goal is achieved. One engineer at Volvo stated that the angles are arbitrary and that it would be an improvement if the test did a continuous sweep across an interval of angles instead. Furthermore, since a lot of HIL-specific context is missing, the validation test case is written more generally and lacks detail, such as simply initiating the shaker via an operation command through the control interface. The LLM is missing the necessary context to be more specific.

Validation Test Case for Shaker Control Function

```markdown
# PROJECT ID: CAATS502-HIALS-1888
# GLOBAL ID: GID-587887
# OBJECT TYPE: Test
# NAME: Shaker Control Function - Tipping Angle Below Threshold
# COMMON DESCRIPTION:
## Description:
When TrailerBucketIMUStatus indicates that the tipping angle is below
BucketShakerThreshold, the Shaker Control Function shall set Shaker
Capability to ShakerNotAvailable.
## Module: Control System
## Test Type: System Functional
## Precondition and configurations:
1. The system is powered on and initialized.
2. TrailerBucketIMUStatus signal is operational.
3. BucketShakerThreshold value is pre-configured (e.g., 15 degrees).
## Testing environment(HIL/SIL/Machine/Site): HIL
## Input data:
TrailerBucketIMUStatus with tipping angle values below BucketShakerThreshold.
## Expected behavior:
The Shaker Capability output should transition to ShakerNotAvailable when
the tipping angle falls below the threshold.
# SPECIFIC DESCRIPTION:
## Specific equipment: HIL rig with IMU simulation module
## Specific configuration: BucketShakerThreshold = 15°
## Specific data collection: Record Shaker Capability state changes and
tipping angle readings.
## Specific environmental conditions: No external vibrations or disturbances.
# STEPS:
1. Configure HIL rig to simulate TrailerBucketIMUStatus with a tipping
   angle of 10° (below threshold).
2. Initiate shaker operation command via control interface.
3. Monitor Shaker Capability signal output.
4. Verify Shaker Capability transitions to ShakerNotAvailable within 2 seconds.
5. Repeat test with tipping angles at 14°, 0°, and -5°.
# ASSIGNEE:
*(Left blank)*
# RELEASE:
*(Left blank)*
# VALIDATION METHOD: Test
# STATUS: Ready for Review
# TEST CASE STATUS: Blocked
# EXTERNAL ID:
*(Left blank)*
# TEST RUN RESULTS:
*(Left blank)*
# TAGS: Shaker Control, Tipping Angle, Safety Threshold
```

Markdown Code 4.1: The validation test case which was generated in order to verify shaker capability at various bucket angles.

4.2.2 Generated Code

To generate the code, instructions, past code examples, the relevant CAN signals, and the validation test case of interest were input as context for the LLM. As seen in code 4.2, all of the steps required by the validation test case 4.1 have been covered in the test code. The system is initialized and all of the angles and transitions have their own tests. The code properly sets the signals and sends the appropriate commands according to past code examples. When compared to the existing code at Volvo, it has the same structure and way of testing.
Connecting to the physical hardware and actually running the code on the HIL rigs have not yet been carried out. It can also be seen that the LLM is missing context, since the functions and vehicle class are not imported properly; thus the program would not compile or run in its current state. Once again it would require human intervention in order to be a finished product and ready to execute. The code shows potential, but the LLM, once more, is lacking some of the required context in order to produce a fully functional result.

```python
import pytest
from time import sleep

class TestShakerControlFunction:

    @pytest.fixture(scope="function")
    def initialize_system(self, vehicle):
        # Configure BucketShakerThreshold to 15 degrees as per specific configuration
        vehicle.set_signal(['ControlSystem', 'BucketShakerThreshold'], 15)
        yield
        # Reset any signals post-test if needed

    @pytest.mark.dotest
    def test_shaker_not_available_below_10_degrees(self, vehicle, initialize_system):
        """Verify Shaker Capability transitions to NotAvailable when tipping angle is 10°."""
        vehicle.set_signal(['IMUSensor', 'TrailerBucketIMUStatus'], 10)
        # Initiate shaker operation command via control interface
        vehicle.send_command("InitiateShakerOperation")
        sleep(2)  # Wait for system response within specified timeout
        assert vehicle.get_signal(['ShakerOutput', 'ShakerCapability']) == "ShakerNotAvailable"

    @pytest.mark.dotest
    def test_shaker_not_available_at_14_degrees(self, vehicle, initialize_system):
        """Test tipping angle at 14° (just below threshold)."""
        vehicle.set_signal(['IMUSensor', 'TrailerBucketIMUStatus'], 14)
        vehicle.send_command("InitiateShakerOperation")
        sleep(2)
        assert vehicle.get_signal(['ShakerOutput', 'ShakerCapability']) == "ShakerNotAvailable"

    @pytest.mark.dotest
    def test_shaker_not_available_zero_degrees(self, vehicle, initialize_system):
        """Check zero degree input condition."""
        vehicle.set_signal(['IMUSensor', 'TrailerBucketIMUStatus'], 0)
        vehicle.send_command("InitiateShakerOperation")
        sleep(2)
        assert vehicle.get_signal(['ShakerOutput', 'ShakerCapability']) == "ShakerNotAvailable"

    @pytest.mark.dotest
    def test_negative_angle_input(self, vehicle, initialize_system):
        """Validate negative angles (-5°) trigger NotAvailable state."""
        vehicle.set_signal(['IMUSensor', 'TrailerBucketIMUStatus'], -5)
        vehicle.send_command("InitiateShakerOperation")
        sleep(2)
        assert vehicle.get_signal(['ShakerOutput', 'ShakerCapability']) == "ShakerNotAvailable"

    @pytest.mark.dotest
    def test_threshold_exact_value(self, vehicle, initialize_system):
        """Ensure threshold value (15°) does not trigger ShakerNotAvailable."""
        vehicle.set_signal(['IMUSensor', 'TrailerBucketIMUStatus'], 15)
        vehicle.send_command("InitiateShakerOperation")
        sleep(2)
        # Expected to remain available or other state except NotAvailable
        assert vehicle.get_signal(['ShakerOutput', 'ShakerCapability']) != "ShakerNotAvailable"

    @pytest.mark.dotest
    def test_above_threshold_condition(self, vehicle, initialize_system):
        """Test angle above threshold (e.g., 20°)."""
        vehicle.set_signal(['IMUSensor', 'TrailerBucketIMUStatus'], 20)
        vehicle.send_command("InitiateShakerOperation")
        sleep(2)
        assert vehicle.get_signal(['ShakerOutput', 'ShakerCapability']) != "ShakerNotAvailable"
```

Figure 4.2: Generated Python code based on the given validation test case 4.1.

4.3 Validate Trailer Tipper Bucket State Transitions

These results have been generated in order to validate the trailer tipper bucket state transitions, ensuring that the TrailerTipper correctly updates the BucketState based on the BucketDown signals and the current BucketAngle.

4.3.1 Generated validation test case

This is a combination of four simple system requirements in order to test the functionality of the bucket. The LLM thus had to combine the four different system requirements into a single validation test case, which is a common practice at Volvo. The results have once again been manually extracted. The user prompt was also slightly modified to clarify that the given system requirements were to be combined. These are the system requirements:

1. When Bucket angle transitions to Not available, the TrailerTipper State function shall set Bucket_state to BUCKET_STATE_NOT_AVAILABLE
2. When Bucket angle transitions to At or Above Threshold, the TrailerTipper State function shall set Bucket_state to BUCKET_STATE_UP
3. When BucketDown transitions to Inactive, the TrailerTipper State function shall set Bucket_state to BUCKET_STATE_MID
4. When BucketDown transitions to Active, the TrailerTipper State function shall set Bucket_state to BUCKET_STATE_DOWN

As seen in 4.2, the LLM took all four of the different system requirements into account; this is made clear under the expected behaviour section. The description is also accurate, and the thresholds are asked to be defined by the user under the preconditions and configurations header, since this information was missing from the system requirements alone. Once again the details are lacking, as in the first validation test case 4.1; it would be difficult for someone to follow these steps without any prior knowledge of a HIL setup or testing in general. It shares similar strengths and weaknesses with validation test case 4.1: the general steps are clear and easy to follow, but the missing small details may make it impossible to follow through.

Validation Test Case for Trailer Tipper Bucket State Transitions

```markdown
# PROJECT ID: CAATS502-TIPPER-001
# GLOBAL ID: GID-678901
# OBJECT TYPE: Test
# NAME: Validate Trailer Tipper Bucket State Transitions
# COMMON DESCRIPTION:
## Description:
This test verifies that the TrailerTipper State function correctly updates
the Bucket_state based on BucketDown signal transitions and BucketAngle
threshold conditions.
## Module: Control System
## Test Type: System Functional
## Precondition and configurations:
1. The system is in a stable operational state with all subsystems initialized.
2. HIL environment configured with simulated bucket sensors and actuator models.
3. Provide/Define value for parameter/threshold X (BucketDown transition delay)
4. Provide/Define value for parameter/threshold Y (Bucket angle activation threshold)
5. Ensure CANoe/Simulation tool is set up to monitor/record signals.
## Testing environment(HIL/SIL/Machine/Site): HIL
## Input data:
- BucketDown signal (Active/Inactive)
- BucketAngle signal (numeric value in degrees)
- TippingActuationAvailableStatus (TRUE/FALSE)
- TrailerChassiIMUStatus (roll angle, if applicable)
## Expected behavior:
The TrailerTipper State function shall set Bucket_state to:
- **BUCKET_STATE_DOWN** when BucketDown transitions to Active
- **BUCKET_STATE_MID** when BucketDown transitions to Inactive
- **BUCKET_STATE_UP** when BucketAngle ≥ Threshold Y
- **BUCKET_STATE_NOT_AVAILABLE** when BucketAngle is marked as Not Available
# SPECIFIC DESCRIPTION:
## Specific equipment: HIL rig with simulated bucket position sensors,
actuator controls, and CAN communication interface
## Specific configuration: Control system version P3/P4, Trailer model
TA15B, IMU simulation enabled
## Specific data collection: Record all input signals (BucketDown,
BucketAngle) and output states (Bucket_state) during test execution.
Capture timestamped logs for state transitions.
## Specific environmental conditions: Ambient temperature within
operational range (-20°C to +60°C), stable power supply, no external
vibrations.
# STEPS:
| # | Action | Expected Result | Notes |
|---|--------|-----------------|-------|
| 1 | Set BucketDown signal to **Active** | Bucket_state transitions to **BUCKET_STATE_DOWN** | Verify via HIL simulation and signal monitoring |
| 2 | Set BucketDown signal to **Inactive** | Bucket_state transitions to **BUCKET_STATE_MID** | Confirm via control system diagnostics |
| 3 | Set BucketAngle to a value **≥ Threshold Y** | Bucket_state transitions to **BUCKET_STATE_UP** | Ensure threshold Y is defined in preconditions |
| 4 | Set BucketAngle to **Not Available** | Bucket_state transitions to **BUCKET_STATE_NOT_AVAILABLE** | Validate error handling logic |
| 5 | Simulate simultaneous transitions: BucketDown=Active AND BucketAngle ≥ Threshold Y | Bucket_state prioritizes **BUCKET_STATE_DOWN** over UP condition | Confirm state machine priority rules |
| 6 | Revert all inputs to default/neutral states | System returns to initial state (if applicable) | Verify system stability after test |
# ASSIGNEE:
*(Left blank)*
# RELEASE:
*(Left blank)*
# VALIDATION METHOD: Test
# STATUS: Ready for Review
# TEST CASE STATUS: Not Scheduled
# EXTERNAL ID:
*(Left blank)*
# TEST RUN RESULTS:
*(Left blank)*
# TAGS: TrailerTipper, StateTransition, SensorValidation, ControlSystem,
BucketState, HILTesting
```
Markdown Code 4.2: The validation test case which was generated in order to verify that the transitions for the trailer's tipper bucket are working correctly.

4.3.2 Generated Code

The generated code follows the same principle as before, the only difference being that validation test case 4.2 is fed into the LLM instead. This is then used as the base to generate the code. The results share many of the same strengths and weaknesses as before, although the LLM did not opt to include fixtures for this code, in contrast to code block 4.2. Otherwise it asserts the signals in the same way and shares a similar architecture. As such, it shares the same flaws as previously and would require an improved prompt, more context, or manual improvement in order to make the code executable and fulfil its function. One thing to note is that the code does not seem to test the transitions correctly; instead it seems to test the current state. It would also be necessary for the code to retrieve the signal from BucketDown, in a similar fashion to vehicle.get_signal in code block 4.2.

```python
import pytest

class TestTrailerTipperBucketStateTransitions:
    def test_bucket_down_active(self):
        bucket_down = "Active"
        bucket_angle = 0
        tipping_available = True
        imu_status = 0
        assert self.trailer_tipper_state(bucket_down, bucket_angle, tipping_available, imu_status) == "BUCKET_STATE_DOWN"

    def test_bucket_down_inactive(self):
        bucket_down = "Inactive"
        bucket_angle = 0
        tipping_available = True
        imu_status = 0
        assert self.trailer_tipper_state(bucket_down, bucket_angle, tipping_available, imu_status) == "BUCKET_STATE_MID"

    def test_bucket_angle_above_threshold(self):
        bucket_down = "Inactive"
        bucket_angle = 50  # Assuming threshold Y is 40
        tipping_available = True
        imu_status = 0
        assert self.trailer_tipper_state(bucket_down, bucket_angle, tipping_available, imu_status) == "BUCKET_STATE_UP"

    def test_bucket_angle_not_available(self):
        bucket_down = "Inactive"
        bucket_angle = None  # Not Available
        tipping_available = True
        imu_status = 0
        assert self.trailer_tipper_state(bucket_down, bucket_angle, tipping_available, imu_status) == "BUCKET_STATE_NOT_AVAILABLE"

    def test_simultaneous_transitions(self):
        bucket_down = "Active"
        bucket_angle = 50  # Above threshold
        tipping_available = True
        imu_status = 0
        assert self.trailer_tipper_state(bucket_down, bucket_angle, tipping_available, imu_status) == "BUCKET_STATE_DOWN"

    def test_revert_to_default(self):
        bucket_down = "Inactive"
        bucket_angle = 30  # Below threshold
        tipping_available = False
        imu_status = 0
        result = self.trailer_tipper_state(bucket_down, bucket_angle, tipping_available, imu_status)
        assert result in ["BUCKET_STATE_MID", "BUCKET_STATE_NOT_AVAILABLE"]
Results and Discussion 47 def trailer_tipper_state(self, bucket_down, bucket_angle, tipping_available, imu_status):↪→ 48 # Placeholder for the actual implementation 49 if bucket_down == "Active": 50 return "BUCKET_STATE_DOWN" 51 elif bucket_down == "Inactive": 52 return "BUCKET_STATE_MID" 53 elif bucket_angle is not None and bucket_angle >= 40 and tipping_available:↪→ 54 return "BUCKET_STATE_UP" 55 elif bucket_angle is None: 56 return "BUCKET_STATE_NOT_AVAILABLE" 57 else: 58 return "BUCKET_STATE_MID" 59 Figure 4.3: Generated Python Code to validate Trailer Tipper Bucket State Tran- sitions based on the given validation test case 4.2. 4.4 Overall Overall the templates were filled and of decent quality. They would still need to be quality controlled by a human before being finished products. Therefore, the level of quality produced by the LLM is not yet sufficient enough to be autonomous and should instead be seen as a helpful tool to speed up progress. Since every demand may differ from the previous, it is hard to produce the perfect prompt for each instance using prompt engineering. The difference between getting a nearly complete result and an industry ready result is therefore enormous. The LLM is proficient at creating results which are general and cover most scenarios, but in return it lacks the specific context to fill in specific details and create complete and functional results. The AI-generated validation test cases and code are fairly similar to human created ones, especially the validation test cases are written in the same way as already existing ones. The code is also similar but lacks the functionality of the human created one, since it contains small bugs and issues, the structure however is the same. An improved version could be having the user input more context specific to the system requirement. Such inputs could be how to initialize, preconditions and what environment it will be tested in. The test environment could be fed as context for future runs. RAG is also a technique that is worth exploring in order to add more context for the LLM. The only concern being how to filter the good data from the bad, since not all of Jama can be trusted to be of high quality. Another issue that made itself apparent was the difficulty of quantifying what a good output is. Especially as two student which lack the work experience to judge the quality of the validation test cases and code which is produced. An improvement could be to create an AI whose sole purpose is to quantify the data and decide its’ quality. Using this in combination with a generating AI, such as the one we worked with, could lead to a powerful feedback loop continuously improving the prompts, 33 4. Results and Discussion and thus the results, based on the principle of prompt engineering 4.4. It could also be used directly to improve the results instead of the prompt. This may lead to superior results but could also decrease flexibility for future issues, since the results can’t be reused in the same way as a well constructed prompt. Figure 4.4: Figure illustrating the potential feedback loop of having two AIs work in combination to autotomize the principle of prompt engineering. Either by im- proving the prompt to create a better output, or by improving the output directly. Another potential improvement which would be worth exploring is to enable a di- alogue, similar to Chat-GPT and many other popular chat bots. 
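To make the flaw concrete, a transition-oriented test would first drive the system into a known state, then change the input and assert that the state actually changes, reading signals back through the vehicle interface. The sketch below is a minimal illustration of this idea, not the project's actual code: the FakeVehicle class and its set_signal/get_signal methods are hypothetical stand-ins modelled on vehicle.get_signal from code block 4.2.

import pytest

class FakeVehicle:
    """Minimal stand-in for the HIL signal interface (assumed API)."""
    def __init__(self):
        self.signals = {"BucketDown": "Inactive", "BucketAngle": 0,
                        "Bucket_state": "BUCKET_STATE_MID"}

    def set_signal(self, name, value):
        self.signals[name] = value
        self._update_state()

    def get_signal(self, name):
        return self.signals[name]

    def _update_state(self):
        # Stand-in for the control system's TrailerTipper State function.
        if self.signals["BucketDown"] == "Active":
            self.signals["Bucket_state"] = "BUCKET_STATE_DOWN"
        elif self.signals["BucketAngle"] is None:
            self.signals["Bucket_state"] = "BUCKET_STATE_NOT_AVAILABLE"
        elif self.signals["BucketAngle"] >= 40:  # assumed threshold Y
            self.signals["Bucket_state"] = "BUCKET_STATE_UP"
        else:
            self.signals["Bucket_state"] = "BUCKET_STATE_MID"

@pytest.fixture
def vehicle():
    return FakeVehicle()

def test_bucket_down_transition(vehicle):
    # First drive the system to a known starting state ...
    vehicle.set_signal("BucketDown", "Inactive")
    assert vehicle.get_signal("Bucket_state") == "BUCKET_STATE_MID"
    # ... then change the input and assert that the state transitions.
    vehicle.set_signal("BucketDown", "Active")
    assert vehicle.get_signal("Bucket_state") == "BUCKET_STATE_DOWN"

def test_bucket_up_transition(vehicle):
    vehicle.set_signal("BucketDown", "Inactive")
    assert vehicle.get_signal("Bucket_state") == "BUCKET_STATE_MID"
    vehicle.set_signal("BucketAngle", 45)  # cross the assumed threshold Y
    assert vehicle.get_signal("Bucket_state") == "BUCKET_STATE_UP"

The essential difference from Figure 4.3 is that each test exercises a change of state rather than evaluating a single static input, and that it reads the resulting state back through the signal interface instead of calling a local placeholder function.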
4.4 Overall

Overall, the templates were filled in and the results were of decent quality, but they would still need to be quality controlled by a human before being considered finished products. The level of quality produced by the LLM is therefore not yet sufficient for the process to be autonomous; instead, it should be seen as a helpful tool for speeding up progress.

Since every request may differ from the previous one, it is hard to produce the perfect prompt for each instance using prompt engineering. The gap between a nearly complete result and an industry-ready result is therefore enormous. The LLM is proficient at creating results that are general and cover most scenarios, but it lacks the specific context needed to fill in details and produce complete, functional results.

The AI-generated validation test cases and code are fairly similar to human-created ones; the validation test cases in particular are written in the same way as already existing ones. The code is also similar in structure but lacks the functionality of the human-created version, since it contains small bugs and issues.

An improved version could have the user input more context specific to the system requirement, such as how to initialize the system, its preconditions, and the environment it will be tested in. The test environment could then be fed back as context for future runs. RAG is also a technique worth exploring in order to add more context for the LLM; the main concern is how to filter good data from bad, since not all of Jama can be trusted to be of high quality. A minimal sketch of such a retrieval step is shown below.
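As an illustration of how RAG could add vetted context, the sketch below retrieves the previously approved test cases most similar to a requirement and prepends them to the prompt. This is a minimal sketch, not part of the implemented program: the embed function is a deliberately crude placeholder for a real embedding model, and the corpus of approved Jama items is assumed to exist.

from math import sqrt

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)) or 1.0)

def embed(text):
    # Placeholder embedding: a character-frequency vector. A real setup
    # would call an embedding model; this only keeps the sketch runnable.
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def retrieve(requirement, approved_test_cases, k=2):
    """Return the k approved test cases most similar to the requirement."""
    query = embed(requirement)
    ranked = sorted(approved_test_cases,
                    key=lambda tc: cosine(embed(tc), query), reverse=True)
    return ranked[:k]

def build_prompt(requirement, approved_test_cases):
    # Prepend retrieved, human-approved examples so the LLM sees vetted
    # context instead of unfiltered Jama content.
    examples = "\n---\n".join(retrieve(requirement, approved_test_cases))
    return f"Reference test cases:\n{examples}\n\nRequirement:\n{requirement}"

Filtering would then amount to restricting the corpus to reviewed and approved items before retrieval, which addresses the quality concern raised above.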
Another issue that became apparent was the difficulty of quantifying what a good output is, especially as two students who lack the work experience to judge the quality of the validation test cases and code that are produced. An improvement could be to create an AI whose sole purpose is to evaluate the output and decide its quality. Using this in combination with a generating AI, such as the one we worked with, could lead to a powerful feedback loop that continuously improves the prompts, and thus the results, based on the principle of prompt engineering (Figure 4.4). It could also be used directly to improve the results instead of the prompt. This may lead to superior results but could also decrease flexibility for future issues, since the results cannot be reused in the same way as a well-constructed prompt. A sketch of such a loop follows Figure 4.4.

Figure 4.4: Figure illustrating the potential feedback loop of having two AIs work in combination to automate the principle of prompt engineering, either by improving the prompt to create a better output, or by improving the output directly.
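The loop in Figure 4.4 can be made concrete in a few lines. The sketch below assumes three hypothetical callables, generate, evaluate and refine_prompt, standing in for the generating AI, the evaluating AI and a prompt-revision step; none of them correspond to functions in the program that was built.

def feedback_loop(requirement, prompt, generate, evaluate, refine_prompt,
                  threshold=0.8, max_rounds=5):
    """Generator/evaluator loop sketched after Figure 4.4 (hypothetical API)."""
    best_output, best_score = None, float("-inf")
    for _ in range(max_rounds):
        output = generate(prompt, requirement)   # generating AI
        score = evaluate(requirement, output)    # evaluating AI, scores 0..1
        if score > best_score:
            best_output, best_score = output, score
        if score >= threshold:                   # good enough: stop iterating
            break
        # Otherwise let the evaluator's verdict drive a prompt revision,
        # mirroring what manual prompt engineering does by hand.
        prompt = refine_prompt(prompt, output, score)
    return best_output, best_score

The alternative path in Figure 4.4, improving the output directly, would replace the refine_prompt step with a revision of the output itself; as noted above, that tends to yield less reusable artifacts than an improved prompt.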
Another potential improvement worth exploring is to enable a dialogue, similar to ChatGPT and many other popular chat bots. This would allow the user to adjust and tweak the output of the LLM directly in order to improve the results. This possibility was briefly explored during development but was not fully realized due to lack of time. An AI specialist at Volvo recommended a path to get started, but also hinted that it may turn out to be far more complicated than initially thought. A minimal sketch of the idea is shown below.
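At its core, such a dialogue only requires keeping the conversation history and resending it with each user adjustment. The sketch below illustrates the idea under that assumption; chat_completion is a hypothetical wrapper around the LLM, not the interface used in this project.

def dialogue(requirement, chat_completion):
    """Interactive refinement loop (chat_completion is a hypothetical wrapper)."""
    history = [
        {"role": "system", "content": "You generate validation test cases."},
        {"role": "user", "content": f"Create a test case for:\n{requirement}"},
    ]
    while True:
        reply = chat_completion(history)          # call the LLM with full history
        history.append({"role": "assistant", "content": reply})
        print(reply)
        adjustment = input("Adjustment (empty to accept): ").strip()
        if not adjustment:
            return reply                          # user accepts the output
        # The adjustment is appended to the history, so the model sees its
        # own previous answer in context when revising it.
        history.append({"role": "user", "content": adjustment})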
In summary, the results were satisfactory but not enough to make the process completely autonomous. The program should still serve as a valuable tool for increasing efficiency and cutting down testing time, and it supports existing system requirement testing by providing an easy way to create a strong foundation that can be built upon. The generated validation test case and code covered most of the system requirement, but they were quite general, so some details and context were missing from the output and would require additional work from a human to be fully complete. This issue could potentially be solved in several ways, such as improving the prompts and providing more context specific to the system requirement at hand, or creating a feedback loop with an AI that evaluates the results. A non-autonomous solution would be to enable a dialogue with the LLM after a request has been sent, potentially improving flexibility as well as results.

5 Conclusion

The project, provided by Volvo, aimed to create an autonomous solution for generating validation test cases using generative AI for HIL rigs. The result of the project is a tool that can assist in creating validation test cases. The goal of a fully automated process has not been fully achieved, but in its current state the program can assist with and speed up the process of creating the test cases. The final model selected for the task was QWQ-32B, and the program was written in Python. The validation test cases created so far are adequate but need some refinement before being uploaded and used in production. Additionally, the generated code needs to be tested in a HIL environment in order to determine its quality and functionality. After discussing the results with Volvo employees, the result was deemed reasonable, although the inputs need further adjustment before the program can be considered ready for production. The improvements discussed under Results and Discussion include providing tailored context, allowing for dialogue, RAG, and creating a feedback loop for the LLM. In general, the program provides Volvo with a base for further improvements towards automating the generation of validation test cases and related test code. As a concept, it has the potential to automate testing within several industries, extending beyond the automotive industry and Volvo.

A Appendix 1

# TEST CASE TEMPLATE INSTRUCTIONS
You are to complete the template below by replacing fields marked with `[FILL]` with appropriate content. The `[FIXED]` and `[FILL]` markers are only guides and **must not appear in your final output**.
## FORMATTING REQUIREMENTS
- Remove all `[FIXED]` and `[FILL]` markers from your final output
- Preserve all formatting, including bold text, line breaks, numbering, and spacing
- Use precise technical language appropriate for validation test cases
- Default values in italics are suggestions to use when no specific value is indicated
- Leave any field empty with a dash (-) if no relevant information exists

## TEMPLATE STRUCTURE
[FIXED]# PROJECT ID:
[FILL]

[FIXED]# GLOBAL ID:
[FILL]

[FIXED]# OBJECT TYPE:
[FILL]
*(Default value: **Test**)*

[FIXED]# NAME:
[FILL]

[FIXED]# COMMON DESCRIPTION:
[FIXED]## Description:
[FILL]

[FIXED]## Module:
[FILL]
*(Default value: **Environment**)*

[FIXED]## Test Type:
[FILL]
*(Default value: **System Diagnostic**)*

[FIXED]## Precondition and configurations:
[FILL]
1.
2.
3.

[FIXED]## Testing environment(HIL/SIL/Machine/Site):
[FILL]
*(Default value: **HIL**)*

[FIXED]## Input data:
[FILL]

[FIXED]## Expected behavior:
[FILL]

[FIXED]# SPECIFIC DESCRIPTION:
[FIXED]## Specific equipment:
[FILL]

[FIXED]## Specific configuration:
[FILL]

[FIXED]## Specific data collection:
[FILL]

[FIXED]## Specific environmental conditions:
[FILL]

[FIXED]# STEPS:
[FILL]

[FIXED]# ASSIGNEE:
[FILL]

[FIXED]# RELEASE:
[FILL]

[FIXED]# VALIDATION METHOD:
[FILL]

[FIXED]# STATUS:
[FILL]

[FIXED]# TEST CASE STATUS:
[FILL]

[FIXED]# EXTERNAL ID:
[FILL]

[FIXED]# TEST RUN RESULTS:
[FILL]

[FIXED]# TAGS:
[FILL]

---

## VERIFICATION STEPS
After filling the template:
1. Ensure all `[FIXED]` and `[FILL]` markers have been completely removed
2. Verify all fields have appropriate content based on the provided context
3. Use default values when appropriate, but customize based on context
4. Ensure all formatting is preserved exactly as shown in the template
5. Check that all required fields have content (or an explicit dash if empty)

## EXPECTED OUTPUT FORMAT
Your output should look like this (example snippet):

Markdown Code A.1: Validation Test Case template