Explainable AI for Decision Making
Applying Generative AI to Enhance Decision Making

Master's thesis in Data Science & AI

FABIAN KANEBY, JOHANNA NORELL

DEPARTMENT OF PHYSICS
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2025
www.chalmers.se

Master's Thesis 2025

© FABIAN KANEBY, JOHANNA NORELL, 2025.

Supervisor: Bettina Linder, Volvo Penta
Examiner: Mats Granath, Director - Complex Adaptive Systems M.Sc. Program

Department of Physics
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: AI-generated illustration of the human brain. (Created with Microsoft Designer using the prompt "An abstract line drawing of AI and neural network with a human brain on white background (HEX: #ffffff)".)

Typeset in LaTeX, template by Kyriaki Antoniadou-Plytaria
Printed by Chalmers Reproservice
Gothenburg, Sweden 2025

Abstract

This thesis examines the feasibility of using an AI system to support decision making processes in identifying potential root causes of quality issues in industrial and marine power systems. The AI system employs a Retrieval-Augmented Generation (RAG) architecture, utilizing Large Language Models (LLMs). The research investigates whether pre-trained LLMs, combined with a constructed database in the RAG framework, are sufficient to provide support in a highly specific domain context. It also explores the factors that influence user acceptance and trust in the AI system. The evaluation includes both quantitative metrics and qualitative user tests with domain experts. The project was conducted in collaboration with Volvo Penta, a power solution provider, and all data collection and user testing were performed at the company.

The findings suggest that the system can effectively retrieve and summarize historical data to aid in identifying the root causes of quality issues. Additionally, the study reveals that user satisfaction and trust in AI-driven insights are primarily influenced by the system's ability to explain its reasoning process for reaching conclusions.

Keywords: Generative AI, Large Language Models, Retrieval-Augmented Generation, AI System, Explainable AI, Decision Support, Root Cause Analysis, Quality Issues

Acknowledgements

We would like to express our gratitude to our examiner, Mats Granath, and our supervisor, Bettina Linder at Volvo Penta, for their invaluable support throughout this thesis project. It has been a great experience and an incredible learning process to combine academic research with real industrial impact.

Second, our appreciation goes to Volvo Penta for providing the opportunity to be a small part of the company's significant digitalization journey. The opportunities and challenges in this journey are tremendous, making this an extraordinarily interesting time to write this thesis.
Additionally, we would like to thank Adam Wengrud and Himanshu Sahni for their technical assistance and knowledge sharing throughout our thesis project, and Jonas Trolle for his domain knowledge, vision and enthusiasm, which guided us throughout the work.

Johanna Norell & Fabian Kaneby, Gothenburg, May 2025

List of Acronyms

Below is the list of acronyms that have been used throughout this thesis, listed in alphabetical order:

AI     Artificial Intelligence
ANNS   Approximate Nearest Neighbor Search
CBOW   Continuous Bag of Words
FAISS  Facebook AI Similarity Search
LLM    Large Language Model
ML     Machine Learning
NLP    Natural Language Processing
PLM    Pre-trained Language Model
RAG    Retrieval-Augmented Generation
UX     User Experience

Contents

List of Acronyms                                    ix
List of Figures                                   xiii
List of Tables                                      xv

1 Introduction                                       1
  1.1 Aim                                            2
  1.2 Research Questions                             2
  1.3 Limitations                                    3

2 Theory                                             5
  2.1 Large Language Models                          5
    2.1.1 OpenAI GPT 4o-mini                         6
    2.1.2 Llama 3 Instruct 70B                       6
    2.1.3 Prompting techniques                       6
  2.2 Retrieval-Augmented Generation                 7
    2.2.1 Indexing                                   7
      2.2.1.1 Data Segmentation                      7
      2.2.1.2 Embeddings                             8
      2.2.1.3 Vector Storage                         8
    2.2.2 Retrieval                                  8
      2.2.2.1 User Query                             9
      2.2.2.2 Searching                              9
    2.2.3 Generation                                 9
  2.3 Evaluation                                    10
  2.4 Human-Centered AI                             10
    2.4.1 Explainable AI                            10
    2.4.2 Useful & Usable AI                        11
    2.4.3 AI-assisted decision making               11

3 Methods                                           13
  3.1 Use case                                      13
  3.2 Data                                          14
    3.2.1 Data preprocessing                        15
    3.2.2 Size of data sets                         15
  3.3 AI System architecture                        15
    3.3.1 Indexing                                  16
      3.3.1.1 Data Segmentation                     16
      3.3.1.2 Embedding                             18
      3.3.1.3 Vector Storing                        19
    3.3.2 Retrieval                                 19
    3.3.3 Generation                                20
  3.4 Explainability factors                        20
  3.5 Evaluation                                    21
    3.5.1 Quantitative Evaluation                   21
      3.5.1.1 Retrieval Component                   21
      3.5.1.2 Generation Component                  22
    3.5.2 Qualitative Evaluation                    23

4 Results                                           25
  4.1 Quantitative Evaluation                       25
    4.1.1 Retrieval Component                       25
    4.1.2 Generation Component                      27
  4.2 Qualitative Results                           27
    4.2.1 Searching for similar Quality Reports     28
    4.2.2 Analyzing the related Root Cause Analyses 28
    4.2.3 Importance of explainability factors      29

5 Discussion                                        31
  5.1 Quantitative Evaluation                       31
    5.1.1 Retrieval Component                       31
    5.1.2 Generation Component                      32
  5.2 Qualitative Evaluation                        33
    5.2.1 Searching for similar Quality Reports     33
    5.2.2 Analyzing the related Root Cause Analyses 34
    5.2.3 Importance of explainability factors      35

6 Conclusion                                        37

Bibliography                                        39

A Appendix 1                                         I
  A.1 Prompt for feature engineering                 I
  A.2 Prompt for determining the most likely root cause  I

List of Figures

2.1 RAG Architecture                                                          7
3.1 Illustration of the current process of how Quality Reports are handled
    within the company.                                                      13
3.2 The desirable future state of how the AI System will enhance the
    process of handling Quality Reports within the company.                  14
3.3 Description of the two distinct data sources and their relation.         15
3.4 Overview of the system architecture. Green components illustrate
    where AI has been used.                                                  16
3.5 Overview of the feature engineering process.                             17
3.6 Overview of the embedding evaluation process. The same process was
    completed using all-mpnet-base-v2.                                       18
3.7 Illustration of the quantitative evaluation method.                      22
3.8 Illustration of the generative evaluation method.                        22
4.1 Accuracy of the quantitative evaluation on the retrieval component
    for all Quality Reports.                                                 25
4.2 Accuracy of the quantitative evaluation on the retrieval component
    for Quality Reports from Volvo Penta only.                               26
4.3 Accuracy of the retrieval component over different grouping sizes for
    all Quality Reports.                                                     26

List of Tables

3.1 Overview of Data Sources and Features                                    14
4.1 Accuracy results across evaluation rounds                                27
4.2 The users' responses when evaluating the first case.                     27
4.3 The users' responses when evaluating the second case.                    28
4.4 Responses from the participants when evaluating the components
    designed to enhance user satisfaction and trust.                         29
1 Introduction

At the time of writing this report, we find ourselves in the midst of what is often referred to as the AI revolution. Although some may argue it is inaccurate to frame it this way, since artificial intelligence (AI) and machine learning (ML) have existed since the early 1940s [1], several recent factors have nevertheless significantly accelerated their adoption by a broader audience. The increasing availability and accessibility of data and computing power certainly play a pivotal role in this accelerating adoption. Furthermore, it should be highlighted that developments and improvements in language models and generative AI have successfully showcased the power and elegance of AI to the general public.

As the AI revolution unfolds, several businesses have advanced their internal competencies and adoption in this domain. Data-driven decisions present a significant opportunity to make informed decisions while relying less on individual domain expertise. Ultimately, this opportunity includes scaling the number of decisions made without necessitating additional human resources.

Another key advantage of data-driven decision support systems is their potential to accelerate decision making. In large global companies, many long-standing decision processes remain highly complex and are influenced by countless factors, and they therefore take considerable time. For example, addressing faulty products in a large business context often involves prolonged lead times from issue detection to production changes. Potentially, AI systems can significantly reduce this delay by compiling and summarizing both historical and real-time data, enabling more efficient decision making.

However, there are obstacles and shortcomings in adopting AI techniques. Notably, the feasibility of the practical implementation of AI models and how to interact with their results are crucial considerations. Potential challenges in the process of developing these AI systems within industrial companies include data storage, data completeness, access to data and technical adoption, among other things. Lastly, human acceptance of AI systems is essential for the systems to be utilized once implemented. Thus, there is a strong need for case studies that address the feasibility of implementing AI systems in day-to-day business operations. This thesis aims to achieve precisely this, by implementing an AI model to support the decision making process at an industrial company. Data acquisition, model implementation and analysis have been conducted in coordination with Volvo Penta.

1.1 Aim

The aim of this thesis is to understand how emerging Large Language Model (LLM) techniques, specifically the Retrieval-Augmented Generation (RAG) architecture, can assist business users in decision making. Secondly, the aim is also to uncover and understand the key factors for users to accept and trust the RAG system. To achieve this, a proof-of-concept RAG system is to be developed which guides users in deciding the next course of action when handling reports of defective power systems. The RAG system should be able to provide possible root causes based on historically similar cases. The goal of the RAG system is to reduce the time spent in the root cause investigation related to a defect by enhancing this decision making process.

1.2 Research Questions

For manufacturing companies, analyzing claims and warranty cases is crucial for competitiveness, product quality perception and maintaining strong customer relationships.
At Volvo Penta, a power solution provider, understanding quality issues is particularly complex due to several factors. First, their vast product range leads to a wide variety of potential issues. Secondly, the power solutions are often integrated into larger systems, where the system design is beyond the company's control. In these cases, installation and setup are critical factors affecting performance and related issues. Thirdly, operating in a global market, Volvo Penta encounters somewhat siloed claims processes, which further complicates the analysis.

Although not all quality issues stem from production errors, investigating the root cause is sometimes essential to determine if a product change is necessary. Identifying the root cause is a time consuming process, often taking several months and requiring a thorough investigation. At Volvo Penta, this investigation is extensively documented, capturing extensive data and findings throughout the process. However, when faced with a new quality issue, navigating through this data has proven difficult. No systematic method for locating relevant data exists, often resulting in separate investigations for similar issues.

With the advancement of AI, particularly LLMs, Volvo Penta hypothesized that an LLM could enhance workflow efficiency. The ability of LLMs to manage large volumes of textual data and effectively summarize it has garnered attention as a potential tool for enhancing business processes.

However, traditional LLMs have shortcomings that must be addressed when used for critical business decisions. LLMs can be prone to hallucinations, generating fabricated information when lacking a direct answer instead of admitting uncertainty [2]. In highly specialized domains, they often lack the necessary domain-specific knowledge to provide accurate responses [2]. To combat these drawbacks, an evolution of the typical LLM has been developed, referred to as the Retrieval-Augmented Generation architecture [2].

In the RAG architecture, a retrieval component first retrieves information from a constructed database, providing relevant and accurate context to an LLM and minimizing the risk of hallucination in the output [2]. The database also acts as a knowledge base for the model, enhancing its utility in specific domains. Integrating AI architectures like RAG into daily operations presents challenges. Beyond design choices during development, user willingness to adopt and trust the system is crucial for its success [3]. As such, this thesis aims to answer the following questions.

• Can a RAG system effectively retrieve and summarize historical data to assist in identifying the root cause of engine quality issues?
• What factors influence user satisfaction and trust from AI-driven insights when identifying possible root causes?

1.3 Limitations

This thesis is written in collaboration with Volvo Penta; as such, all data is limited to the Volvo Group and no data from other actors is used. Generalizations from the model will therefore only be applicable within Volvo Group's domain and not outside it. The data used for the thesis is limited to the time period 2018-01-01 to 2025-01-01.

2 Theory

This chapter presents the theoretical foundation of our AI system, beginning with its core technical components and continuing with a literature review on user satisfaction and trust in human-AI interaction.
The first sections provide an overview of LLMs, including their underlying architecture, prompting techniques and limitations. We then introduce the Retrieval-Augmented Generation (RAG) architecture, which addresses the shortcomings of LLMs by integrating external knowledge retrieval. The chapter concludes with an exploration of human-centered AI, focusing on explainability, usability and decision support: factors critical to fostering user satisfaction and trust in AI-driven systems.

2.1 Large Language Models

Large Language Models are AI models designed to understand, process and generate text, making them a part of Natural Language Processing (NLP) [4]. Many recent breakthroughs in language models can be attributed to transformers, increased computational capabilities and the availability of large-scale training data [4].

In 2017, Google introduced the transformer architecture, which significantly advanced the way embeddings are generated and understood in NLP [5]. Unlike previous models that processed words sequentially, transformers use a mechanism called self-attention, which allows the model to weigh the importance of each word in a sequence relative to the others, regardless of their distance. This enables transformers to capture long-range dependencies in text more effectively [6].

Two prominent families of language models, Llama and the Generative Pre-trained Transformer (GPT), are both built on transformer architectures [7, 8]. GPT serves as the foundation for OpenAI's models, while Llama is the foundation for Meta's models. LLMs like OpenAI's GPT models and Meta's Llama models consist of billions of parameters and are trained on vast datasets, enabling them to perform a variety of tasks, from drafting emails to serving as customer support agents [4].

However, like most AI, LLMs are dependent on the data they have been trained on. Their knowledge is limited to the information captured in their training data [4]. This limitation makes it difficult for LLMs to perform well in highly specialized domains where expert-level accuracy is required. LLMs are also prone to what is referred to as hallucinations, providing plausible sounding but factually incorrect information [9].
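To make the self-attention mechanism described earlier in this section concrete, the following minimal NumPy sketch computes scaled dot-product attention for a toy sequence. The dimensions, random weights and example sizes are illustrative assumptions only and are not part of any model discussed in this thesis.

    import numpy as np

    def scaled_dot_product_attention(Q, K, V):
        """Each row of Q, K, V is one token's query/key/value vector."""
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)          # pairwise relevance between tokens
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        return weights @ V                       # weighted mix of value vectors

    # Toy example: a sequence of 4 tokens with embedding dimension 8.
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 8))                  # token embeddings
    W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
    out = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
    print(out.shape)  # (4, 8): every token attends to all others, near or far

Because every token is compared with every other token in one matrix product, the distance between two words in the sequence does not limit how strongly they can influence each other, which is what enables the long-range dependencies discussed above.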
2.1.1 OpenAI GPT 4o-mini

As mentioned, OpenAI's LLMs are built upon their GPT architecture. GPT-4o is an autoregressive omni model. An autoregressive AI model predicts the next component of a sequence from the preceding ones [6]. Omni refers to a model that can take any combination of text, audio, image and video as input and generate any combination of text, audio and image as output [10]. It is trained on a combination of publicly available data and private data from partnerships [11]. GPT-4o-mini is a smaller version of GPT-4o, developed using model distillation. Model distillation is a technique where a smaller student model learns to mimic a larger teacher model by training on its outputs rather than the original dataset [12]. This process significantly reduces training costs across model families while maintaining comparable performance [13]. Neither GPT-4o-mini nor GPT-4o is an open source model.

2.1.2 Llama 3 Instruct 70B

Llama is an open source LLM developed by Meta, based on the transformer architecture and trained on data available to the public [8]. Llama differs from the original transformer architecture in several areas, improving components including pre-normalization, the activation function and the embeddings [8]. Llama 3 Instruct 70B is an instruction tuned model, intended for assistant-like chat, unlike pre-trained models, which can be adapted for a variety of natural language tasks [14]. Instruction tuning refers to the fine-tuning of a language model on datasets consisting of input-output pairs framed as instructions [15]. The tuning has been done with supervised fine-tuning and reinforcement learning with human feedback [14].

2.1.3 Prompting techniques

Both OpenAI's GPT-4o-mini and Llama 3 Instruct are examples of pre-trained language models (PLMs). The use of PLMs has surged recently due to their exceptional performance, which stems from extensive training requiring substantial computational power and resources. Utilizing these models "out-of-the-box" is convenient, and training models from scratch is unlikely to achieve comparable results.

A method for enhancing task adaptation in pre-trained models is the application of fine-tuning. The pre-train, fine-tune approach leverages transfer learning, minimizing the need for labeled data, which is particularly beneficial in low-resource settings, such as domains or languages with limited annotated datasets [16]. However, this approach has downsides, including the need for computational resources to fine-tune models, though less than if a model is trained from scratch, and the necessity of understanding the architectures of the models being fine-tuned [16]. Kamath et al. [16] further explore the field of prompt-based learning, which involves formatting prompts to guide models in producing desired outputs for NLP tasks. Using prompt engineering to enhance user input with predefined contextual prompts has demonstrated significant improvements in the generated outputs [17, 18].

This shift represents a transition from pre-train, fine-tune to a new paradigm: pre-train, prompt and predict [19]. This approach reformulates tasks to align with the model's existing knowledge rather than adapting the model itself, and thus requires a new skill set in prompt engineering [19]. Since the chosen prompt significantly influences the output of the LLM, identifying the most effective prompt to achieve the desired results is of utmost importance.
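As a minimal sketch of prompt engineering, the example below wraps a user question in a predefined contextual template before it is sent to a model. The template wording and the example question are invented for illustration; they are not the prompts used in this thesis (redacted versions of those appear in Appendix A).

    # The template wording below is an invented example of a predefined
    # contextual prompt; it is not one of the prompts used in this thesis.
    TEMPLATE = """You are a technical assistant for power system diagnostics.
    Answer concisely, and reply "I don't know" when the context is insufficient.

    Question: {question}
    """

    def build_prompt(question: str) -> str:
        # Prompt-based learning: reformulate the task to fit the model's
        # existing knowledge instead of fine-tuning the model itself.
        return TEMPLATE.format(question=question)

    print(build_prompt("Why does the engine overheat at high load?"))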
2.2 Retrieval-Augmented Generation

To overcome the limited knowledge base of LLMs and the risk of hallucinations, the Retrieval-Augmented Generation architecture has been developed. RAG systems combine a pre-trained parametric memory, meaning that the knowledge is embedded in the parameters of the model itself, with a non-parametric memory, a database [20]. A typical RAG architecture, shown in Figure 2.1, consists of these parts: a vector database (non-parametric memory), an embedding model that encodes both the stored information and the user query, and a pre-trained LLM (parametric memory) that generates responses based on the retrieved data.

Figure 2.1: RAG Architecture

As illustrated in Figure 2.1, the RAG architecture consists of three main components: Indexing, Retrieval and Generation.

2.2.1 Indexing

Indexing involves creating a document database that serves as the information source for answering user queries. This process includes segmenting data into smaller parts to enhance search efficiency, embedding the information and storing it in vector indexes to facilitate database searches, as detailed by Kamath et al. in their work "Large Language Models: A Deep Dive" [16].

2.2.1.1 Data Segmentation

Segmenting data is crucial for optimizing the performance of the RAG system, as language models often experience significant degradation with long contexts [21]. Key steps include text pre-processing, chunking the data into smaller parts and augmenting it with metadata [16]. Further, Kamath et al. [16] highlight the importance of structuring information to leverage the RAG's capabilities and discuss several methods for incorporating information beyond simple document chunking. As part of this process, feature engineering can also be performed. Feature engineering aims to extract systematic features from raw data, transforming the data into a suitable format for an AI model [22]. It is further emphasized that a significant amount of time is often spent in this process, as it is essential to enable proper modeling of the features of the AI model.

2.2.1.2 Embeddings

Embeddings refer to a representation of data, for example text, images or sound, as numerical vectors in a continuous vector space. As the purpose of embeddings in the RAG architecture is to enable semantic similarity search, choosing a suitable embedding model is crucial for effective retrieval of information [16].

Representation of words as numerical vectors has long been the foundation of NLP. One of the early embedding models, Word2Vec, was introduced by Google in 2013. Word2Vec uses two main training approaches, Continuous Bag of Words (CBOW) and Skip-gram [23]. CBOW predicts the current word based on the context, while Skip-gram predicts the surrounding words given the current word. These techniques allow the model to encode meaningful word relationships, leading to the well-known algebraic operation on word vectors below [23].

v_King − v_Man + v_Woman ≈ v_Queen

Since then, word embeddings have evolved into sentence-level and contextual embeddings, enabling models to capture broader semantic meaning beyond individual words. Modern text embedding models like the ones available from Sentence Transformers and OpenAI create vector representations of sentences or paragraphs [24]. These models, built on transformer architectures, can encode entire paragraphs into dense, high-dimensional vector representations, preserving semantic meaning [24].

2.2.1.3 Vector Storage

Vector storage, handled by vector databases, is used for storing and retrieving embeddings based on vector similarity. Similarity between vectors is defined by their distance, commonly calculated by Euclidean distance (L2 distance), cosine similarity or the inner product [25]. Well-known vector storage libraries include Pinecone, Facebook AI Similarity Search (FAISS) and ChromaDB. Trade-offs between different choices often involve search speed, scalability and the dynamic nature of the database [16].
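The sketch below ties these indexing steps together: a document is chunked, the chunks are embedded with the open-source all-mpnet-base-v2 model mentioned above, and a query is compared to them using the two distance notions just discussed. The chunk size and the example texts are invented for illustration.

    from sentence_transformers import SentenceTransformer
    import numpy as np

    def chunk(text: str, size: int = 300) -> list[str]:
        # Naive fixed-size chunking; real pipelines often split on sentence
        # boundaries and attach metadata (source, date) to each chunk.
        return [text[i:i + size] for i in range(0, len(text), size)]

    model = SentenceTransformer("all-mpnet-base-v2")  # open-source model from Section 2.2.1.2

    document = "The engine overheated during sea trials and coolant loss was observed..."
    chunk_vectors = np.asarray(model.encode(chunk(document)))  # one dense vector per chunk
    query_vector = model.encode("coolant leakage at high load")

    # The two distance notions from Section 2.2.1.3:
    l2 = np.linalg.norm(chunk_vectors - query_vector, axis=1)
    cosine = (chunk_vectors @ query_vector) / (
        np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(query_vector))
    print(l2, cosine)  # smaller L2 distance / larger cosine means more similar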
2.2.2 Retrieval

The retrieval component of the RAG architecture aims to provide relevant knowledge to answer the user's query. The retrieval process involves a user querying the model, refining the query and searching the vector database for relevant documents.

2.2.2.1 User Query

The structure of a query, its clarity in expressing the user's intent and the semantic content it conveys all play a crucial role in retrieval performance [26]. Wang et al. [26] further outline methods for mitigating the risks of poor querying, such as:

• Query Rewriting: Refines the query by rewriting it to better match documents.
• Query Decomposition: Retrieves documents based on sub-questions derived from the query.
• Pseudo-document Generation: Generates a pseudo-document based on the query, embeds the answer retrieved from the pseudo-document and retrieves similar real documents based on the pseudo-answer.

These techniques aim to bridge the gap of semantic dissimilarity when the essence of the query aligns with the content in the stored vectors, as factors like grammatical dissimilarity can hinder searches in the RAG approach [16].

2.2.2.2 Searching

A fundamental assumption underlying the RAG architecture is that the information retrieved based on semantic similarity constitutes the relevant knowledge [16]. The RAG model retrieves the top ranked documents based on the similarity between the query and the documents, defined by the shortest distance between their vector representations. Searching for the smallest distance can be done in many ways. Vector databases like FAISS and ChromaDB enable both Brute Force Search and Approximate Nearest Neighbor Search (ANNS) [25]. Brute force search computes the exact distances between all vectors and returns the top ranked results based on the smallest distances between vectors, ensuring complete accuracy. For larger datasets, brute-force search becomes computationally inefficient, making ANNS a practical alternative. While ANNS accelerates the search process, it may sacrifice some precision in the results [25]. In most applications, the slight inaccuracy of ANNS is negligible compared to the significant speed advantage it offers [27].

2.2.3 Generation

After identifying relevant information through the RAG architecture's retrieval process, the final step is to generate output using the LLM. For successful generation, the retrieved information must be passed to the LLM in an appropriate format [16]. Research has examined how imperfect retrieval impacts the final generation. It can be noted that retrieval precision is generally low, leading the LLM to often convey this imperfect information to the end user [28]. To address this issue, the internal knowledge of LLMs can be utilized to assess the reliability and coherence of the retrieved information, with the final answer based solely on consistent data [28].

2.3 Evaluation

RAG systems are typically evaluated based on the specific tasks they assist. The metrics and evaluation methods are determined by the task at hand, and these metrics should effectively capture how well the system assists in achieving the task. While standard tasks such as question answering or summarization may use established metrics like Exact Match, F1 Score or ROUGE, evaluation metrics can also be tailored to specific components of a system, such as retrieval or generation, by defining appropriate accuracy measures for each. In addition to task-specific evaluations, there have been efforts to assess RAG models objectively using standardized evaluation methods.

A possible way to evaluate the different functions of a RAG model is to focus on the performance of its two distinct parts: the Retrieval Component and the Generation Component [2]. The retrieval performance can be evaluated with different metrics such as accuracy, precision, recall, mean reciprocal rank and mean average precision, with the goal of capturing the relevancy of the information retrieved [29]. A key aspect of using any of these metrics is the requirement for an established ground truth. In some cases, the ground truth is subjective and depends on human evaluation. When evaluating the generation performance, the purpose is to consider factual correctness, readability and user satisfaction, using metrics like BLEU, ROUGE and F1 Score [29].
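As a concrete illustration of retrieval metrics against an established ground truth, the sketch below computes a hit-rate style accuracy and the mean reciprocal rank over ranked retrieval results. The toy document ids are invented for illustration.

    def hit_rate(ranked_lists, ground_truths):
        """Fraction of queries where at least one relevant document is retrieved."""
        hits = sum(any(doc in truth for doc in ranked)
                   for ranked, truth in zip(ranked_lists, ground_truths))
        return hits / len(ranked_lists)

    def mean_reciprocal_rank(ranked_lists, ground_truths):
        """Average of 1/rank of the first relevant document (0 if none retrieved)."""
        total = 0.0
        for ranked, truth in zip(ranked_lists, ground_truths):
            total += next((1.0 / (i + 1) for i, doc in enumerate(ranked)
                           if doc in truth), 0.0)
        return total / len(ranked_lists)

    # Toy example: two queries with ranked top-3 results each.
    ranked = [["d3", "d7", "d1"], ["d2", "d9", "d4"]]
    truth = [{"d7"}, {"d5"}]
    print(hit_rate(ranked, truth))              # 0.5: only the first query hits
    print(mean_reciprocal_rank(ranked, truth))  # (1/2 + 0) / 2 = 0.25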
2.4 Human-Centered AI

Beyond the technical aspects of AI systems, the human interaction with the technology can be evaluated. Human-Centered AI specifically examines this interaction and can be divided into two areas: AI under human control and AI on the human condition [30]. The first area focuses on the relationship between human and system control, whereas the second area emphasizes designing AI systems with a priority on explainability and interpretability to enhance human understanding [30]. Delving deeper into the aspect of human control within the learning process, we refer to Human-in-the-loop machine learning, which integrates human expertise into the technical learning framework. The degree of this integration enables us to evaluate whether the system or the humans retain sufficient control [3].

2.4.1 Explainable AI

Beyond the topic of control in the learning process, the human aspect of interacting with and understanding the AI model during deployment is referred to as Explainable AI [3]. Mosqueira-Rey et al. further emphasize that as the use of "black-box" models increases, where users have little to no insight into how decisions are made, the demand for greater transparency in the model's logic has become more pronounced [3]. They also highlight that explainable AI moves beyond simply explaining the model and how it functions, and also includes inherent interpretability from the model design.

To understand the concept of explainable AI, five main factors should be considered: understandability, comprehensibility, interpretability, explainability and transparency [31]. Understandability, defined as the extent to which a human can grasp the model's decisions, is recognized as the most critical factor. Comprehensibility and interpretability relate to the model's ability to represent or explain knowledge and the learning process in a human-understandable manner. Transparency pertains to the model itself and its clarity; for instance, a linear regression model is generally more transparent than a neural network model. Transparency and explainability can be further analyzed through three components: simulability, decomposability and algorithmic transparency. Lastly, post-hoc explainability techniques can be employed to enhance interpretability. Examples of such techniques include text explanations, visual explanations and feature explanations, among others [3].

2.4.2 Useful & Usable AI

Exploring human interaction with AI reveals numerous factors that influence the adoption of AI models in business settings and their value in achieving specific goals. By considering both user needs and the ability to adopt solutions, we delve into the concepts of Useful AI and Usable AI. Useful AI emphasizes how effectively the AI model or system meets user needs and fulfills its intended purpose, while Usable AI focuses on the interface and the ease with which users can learn to interact with the tool [32]. Furthermore, advancements in understanding usage scenarios and user experience (UX) factors are discussed, emphasizing that these elements are key drivers of increased AI adoption [32]. Significant development is needed to understand how humans use these systems.
Additionally, it is suggested to move beyond mere interaction and explore how AI systems and humans can transition from interaction to integration and collaboration, highlighting the need for further research in this area.

2.4.3 AI-assisted decision making

Deploying AI models can be categorized into two distinct use cases: the AI model operates autonomously and makes final decisions, or the AI model offers recommendations for action, leaving the final decision to a human [33]. While the first option allows for complete automation of decision making, it may be infeasible for several reasons, and involving domain experts can help address these challenges by adding specialized knowledge [33]. For instance, determining medical diagnoses, evaluating creditworthiness or making legal judgments are scenarios in which legal requirements and ethical considerations may prevent the fully autonomous deployment of AI models. Elaborating on the second use case, where the AI model offers recommendations to humans, the area of AI-assisted decision making is defined. The final decision is then left to humans, who can choose to agree or disagree with the AI model's suggestion [34].

To enhance human decision making with AI models, it is essential to optimize joint decision outcomes. If users can form a mental model of their AI support, understanding when to trust or distrust predictions, they can effectively identify the model's error boundaries [33]. Calibrating user trust in the AI model's decisions can be achieved by displaying the confidence scores of the model's predictions [33]. However, this approach carries the risk of users becoming overly reliant on the model's predictions when the confidence score is very high. The subjective question of when to accept or reject assistance from AI in decision making has not yet been fully addressed in research [34]. Analyzing the relationship between algorithmic decision support and uncertainty, it has been found that, in the presence of irreducible uncertainty, uncertainty that cannot be resolved before an event occurs, human forecasters are preferred over algorithms [35]. Additionally, repeated observations of an algorithm making similar mistakes lead to increased aversion to using it as decision support [36].

3 Methods

Using the tools and techniques presented in the previous chapter, this chapter explores the practical use case and implementation at Volvo Penta. The subsections follow the chronological order of events, from data exploration to final evaluation. We will refer to this practical case as designing an AI System, as we did not train a single specific model but aimed to design a comprehensive system for decision making using various AI techniques.

3.1 Use case

At the outset of this thesis, we transitioned from a vague idea of the AI system's use case to a clearer understanding of the decision making process we aimed to improve. This involved discussions with several domain experts, particularly the main stakeholder, a product quality manager. The process focused on understanding the current state, how decisions are currently made, and the desired future state, the vision for how decisions could be assisted by AI. Through conversations with domain experts, we uncovered the critical decision points with the highest potential for improvement and the practical constraints for the system's deployment. The use case with the largest potential was identified as aggregating and processing large amounts of information from several sources.
This would then help the user to find a possible root cause for a defective power system. At Volvo Penta, this is done by starting a Root Cause Analysis of the defective power system, where the defect is described in a Quality Report. An overview of the current state and the vision of the future state can be found below in Figures 3.1 and 3.2.

Figure 3.1: Illustration of the current process of how Quality Reports are handled within the company.

Figure 3.2: The desirable future state of how the AI System will enhance the process of handling Quality Reports within the company.

3.2 Data

Given the limited timeframe of this thesis, we determined that collecting or defining new data would be impractical. As a result, we relied on existing data. The first stage involved assessing the available data to address the business needs. From all available data sources, we selected a subset that we deemed sufficient and valuable for enhancing the decision making process. This resulted in two distinct datasets for the thesis: one related to the Quality Reports and one focused on investigating the Root Cause Analysis of those Quality Reports. From each dataset, we extracted various features, as presented in Table 3.1. Figure 3.3 illustrates the two data sets and their relation.

Data source           Feature           Type
Quality Report        Feature 1         Textual
Quality Report        Feature 2         Textual
Quality Report        Feature 3         Textual
Quality Report        Feature 4         List
Root Cause Analysis   Feature 5         Textual
Root Cause Analysis   Feature 6         Textual
Root Cause Analysis   Feature 7         Textual
Root Cause Analysis   Feature 8         Textual
Root Cause Analysis   Feature 9         Categorical
Root Cause Analysis   Feature 10        Categorical
Root Cause Analysis   Target variable   Textual

Table 3.1: Overview of Data Sources and Features

Figure 3.3: Description of the two distinct data sources and their relation.

3.2.1 Data preprocessing

Since the datasets used originate from distinct yet related sources, it was essential to ensure we had a sufficient and complete dataset to demonstrate the relationships of interest. As a result, we cleaned the data to remove any empty values in the target variables. Additionally, we refined the Root Cause Analysis dataset by eliminating any instances with empty references to Quality Reports. Lastly, we ensured that only instances of Quality Reports present in the Root Cause Analysis dataset were retained.

3.2.2 Size of data sets

After selecting the initial dataset, we needed to ensure it was large enough to identify relationships that could demonstrate how a data-driven approach could enhance the current process. We quickly realized that data from Volvo Penta alone was insufficient to build a robust model. Following consultations with domain experts, we decided to incorporate data from a sister company, Volvo Group Trucks Technology, to obtain the necessary dataset size. Although this decision might impact the relevance of some information in the database, the products from both companies are considered similar enough to provide adequate data for the proof of concept of this AI system, as this thesis aims to demonstrate. The final data set used for building this system contained 6570 distinct instances of Quality Reports and 1699 distinct instances of Root Cause Analyses.
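A minimal sketch of the preprocessing steps in Section 3.2.1 is shown below, under the assumption of hypothetical file and column names (the real features are anonymized in Table 3.1 and the actual schemas are confidential).

    import pandas as pd

    # Hypothetical file and column names; the actual schemas are confidential.
    qr = pd.read_csv("quality_reports.csv")
    rca = pd.read_csv("root_cause_analyses.csv")

    rca = rca.dropna(subset=["target"])        # remove empty target variables
    rca = rca.dropna(subset=["qr_reference"])  # drop analyses lacking a Quality Report reference
    qr = qr[qr["qr_id"].isin(rca["qr_reference"])]  # keep only Quality Reports linked to an analysis

    # After cleaning, the final data set of this thesis contained
    # 6570 Quality Reports and 1699 Root Cause Analyses.
    print(len(qr), len(rca))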
The foundation of our solution is a RAG architecture, well-suited for handling advanced text similarity searches. After several iterations throughout the project, the final architecture was reached and is illustrated in Figure 3.4. 15 3. Methods Figure 3.4: Overview of the system architecture. Green components illustrate where AI have been used. As shown in Figure 3.4, the foundation of the system is a RAG architecture, although several modifications and adaptations have been made to transform the more tra- ditional system architecture to better suit the specific needs of this project. The components of the three distinct parts, indexing, retrievall and generation, will be described in the following sections. 3.3.1 Indexing The following section will explain all the steps in the indexing process: feature engineering, embedding and vector storing. 3.3.1.1 Data Segmentation The process of data segmentation was largely absent during the initial database design, as each Quality Report naturally served as a distinct data instance. How- ever, we identified a strong need for extensive feature engineering to structure and standardize the historical Quality Report data. We had a substantial amount of textual data to process and store for similarity searches. Since the majority of this data consisted of unstructured textual data, we recognized the need to extract more structured features. This was crucial for building our RAG architecture, where efficient search and retrieval are foundational components. As a result, we concluded that feature engineering was essential to extract the relevant information. 16 3. Methods To enhance the feature engineering process, we decided to employ an LLM to process the input fields of the data. Collaborating with domain experts, we developed a strict prompt to extract relevant features from the text. These desired output features were defined to integrate both business and technical insights. Firstly, we processed the data related to the Quality Reports. The data from the features, together with the prompt, was then sent to the LLM with a clear task to summarize the findings into a single text field. This process aimed to generate highly structured text with carefully selected information, which would later serve as the information database for our RAG system. The prompt template, slightly modified for privacy concerns from Volvo Penta, is found in appendix A.1. In addition to optimizing the text field for embeddings, we prompted the LLM to assess the quality of the user input and the confidence level of its analysis. This aimed to be the foundation to evaluate how these features could be used to enhance the explainability and trustworthiness of the model. Figure 3.5 illustrates the full process of feature engineering. Figure 3.5: Overview of the feature engineering process. Since these steps was critical for the model’s performance in retrieving similar his- torical Quality Reports, we evaluated two different LLMs: GPT-4o-mini and Meta- Llama-3-7B-Instruct. These models were chosen due to their performance, ease of use and cost. Our goal was to compare the benefits of maintaining greater control over the model, as with the Llama model, versus using GPT-4o-mini. We allowed the Llama model and GPT-4o-mini to perform feature engineering to compare their outputs across 40 Quality Reports. Additionally, we embedded the outputs generated by the two LLMs and analyzed their differences. 
Since these steps were critical for the model's performance in retrieving similar historical Quality Reports, we evaluated two different LLMs: GPT-4o-mini and Meta-Llama-3-70B-Instruct. These models were chosen for their performance, ease of use and cost. Our goal was to compare the benefits of maintaining greater control over the model, as with the Llama model, versus using GPT-4o-mini. We let both the Llama model and GPT-4o-mini perform feature engineering and compared their outputs across 40 Quality Reports. Additionally, we embedded the outputs generated by the two LLMs and analyzed their differences.

A qualitative review of the outputs revealed minimal variation between the two models, as they produced highly similar texts. To quantify this similarity, we calculated the Euclidean distance between the embedded texts from both LLMs, yielding an average similarity score of 0.948. The high score indicates a strong resemblance between the generated outputs. Given the minimal performance differences observed, we decided to use GPT-4o-mini moving forward, as it offered faster computation time.

Further, we processed the Root Cause Analysis data set in the same manner, by carefully crafting a prompt to extract relevant features from the six features available. In contrast to the processing of the Quality Report data, the goal of this process was rather to summarize and unify the data across different instances, to later enable the LLM to compare different possible Root Cause Analyses in a fair way.

3.3.1.2 Embedding

To efficiently store and search within our database of historical Quality Reports, the next step after feature engineering was to embed the processed text. Since the quality of the embeddings could significantly impact the model's ability to find similar Quality Reports, we evaluated two different embedding models: text-embedding-ada-002 from OpenAI and all-mpnet-base-v2 from the Sentence Transformers Python framework. This allowed us to compare a high-performing API-based solution, though offering minimal insight into or control over the embedding process, with an open-source alternative that provides greater flexibility and control.

To evaluate the embedding models, we sought a systematic approach to assess their ability to capture the essence of different textual inputs that convey the same semantics. We leveraged the previous LLM processing of text fields describing Quality Reports using two different models (GPT-4o-mini and Llama) for 40 Quality Reports. These models produced slightly varied texts from the original descriptive inputs, yet aimed to represent the same semantics. We embedded these variations using both text-embedding-ada-002 and all-mpnet-base-v2. Subsequently, we compared the top 5 matches each embedding model identified for a retrieval query.

Figure 3.6: Overview of the embedding evaluation process. The same process was completed using all-mpnet-base-v2.

The comparison of overlapping results measures how stable each embedding method is at handling minor variations in input. A high overlap in results indicates that the embedding model consistently captures key semantic features, whereas a lower overlap suggests greater sensitivity to small text differences. This helps determine which embedding method is more robust for retrieval tasks. Text-embedding-ada-002 outperformed all-mpnet-base-v2 with a mean score of 3.725/5 on the 40 Quality Reports, and it was decided that text-embedding-ada-002 would be the embedding model of choice for our system.

                  GPT-4o-mini & ada   GPT-4o-mini & mpnet
  Llama & ada         3.725/5                  −
  Llama & mpnet          −                  2.850/5

Matrix 1: Average overlap of top 5 retrieval results when embedding similar texts.

3.3.1.3 Vector Storing

To efficiently find similar embeddings produced in the previous step, proper storage was essential. This may become critical to the system as the amount of data increases. As such, the FAISS vector storage library was used, based on its performance and its ease of implementation. When creating the vector database, the IndexFlatL2 index was used, which allows for brute-force Euclidean distance search. As the size of our data set is relatively small, brute-force search could be used instead of the less compute intense ANNS.
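A minimal sketch of this vector store is shown below, assuming the 1536-dimensional output of text-embedding-ada-002; the vectors here are random placeholders standing in for the embedded Quality Reports.

    import faiss
    import numpy as np

    d = 1536                      # dimensionality of text-embedding-ada-002 vectors
    index = faiss.IndexFlatL2(d)  # exact (brute-force) Euclidean search

    embeddings = np.random.rand(6570, d).astype(np.float32)  # placeholder report embeddings
    index.add(embeddings)

    query = np.random.rand(1, d).astype(np.float32)          # placeholder query embedding
    distances, ids = index.search(query, 10)  # ten closest Quality Reports, as in Section 3.3.2

IndexFlatL2 stores the raw vectors and scans all of them at query time, which is exactly the accuracy-over-speed trade-off motivated above for a data set of this size.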
3.3.2 Retrieval

The retrieval process closely mirrored the indexing process, albeit in a focused, single-instance manner based on the user query. Rather than allowing the user to express themselves freely, we constrained their input to a format optimized for successful retrieval in the AI system. First, the user query was limited to filling out four fields, which corresponded to the four features used in the historical Quality Reports. Then, the same feature engineering steps were performed and the resulting text was embedded using the same embedding model as before. This can be seen as a form of query rewriting, where the aim is to refine the query to better match the stored documents.

We then employed a brute-force Euclidean distance search to retrieve the top ten matches. Each of the matches corresponded to a processed Quality Report. During the feature engineering process, each case had been assigned one of four text quality labels, ranging from Poor to Comprehensive, by the LLM. The assigned text quality label was then used as a weight on the distance, referred to as the similarity score, obtained from the brute-force Euclidean distance search, which resulted in a new weighted similarity score. This new score was subsequently used to narrow the results from ten matches down to four. The choice of retaining four potential matches was determined after iterative discussions with end users and domain experts. This number struck a balance between offering sufficient contextual matches, since some cases that might not seem to match well could still be relevant, and ensuring that the information provided remained manageable for the user. Additionally, we retrieved the associated Root Cause Analysis data linked to the Quality Report data, which provided further contextual information for the final analysis and text generation.

3.3.3 Generation

In the generative part of the system, the relevant context was passed on from the retrieval section. This contextual information was provided to an LLM along with a specific prompt detailing how the LLM was to analyze the given context. Once the analysis was complete, the LLM returned its response to the user, consisting of recommendations for the next course of action. The prompt template used for this analysis and recommendation, slightly modified for privacy concerns from Volvo Penta, is found in Appendix A.2.

3.4 Explainability factors

To explore how various factors influence user satisfaction and trust in the AI system, we maintained a continuous focus on incorporating explainability into the system. A key part of this approach, as previously mentioned, was having the LLM summarize the user input, express its confidence in the analysis during the feature engineering process and evaluate the quality of the textual input data. This strategy aimed to extract features from both the data and the analysis process, which could later be presented to the user to improve the model's transparency and emphasize factors that could influence trust in the system's recommendations.
A similar approach was applied during the final generative analysis phase, where the LLM was prompted to assess its confidence in both the analysis and the recommendation, along with providing two reasoning outputs explaining how it reached the final recommendation. When retrieving the matching Quality Reports, the LLM was also asked to justify how and why these cases were considered similar. Lastly, the similarity score was transformed into one of three categorical variables with predefined threshold values, which were also presented to the end user. In summary, the extracted features aimed at enhancing user satisfaction and trust were:

• Summary of user input: Output from the LLM during the feature engineering process.
• Text quality of user input: Output from the LLM during the feature engineering process.
• Confidence level of input data analysis: Output from the LLM during the feature engineering process.
• Similarity of matches: Combined score as described in Section 3.3.2.
• Explaining why the matching Quality Reports are similar: Output from the LLM after finding matching Quality Reports.
• Considerations in how the recommended Root Cause Analysis was reached: Output from the LLM in the final analysis and output generation.
• Reasoning about alternative possible Root Cause Analyses: Output from the LLM in the final analysis and output generation.
• Confidence level of recommended Root Cause Analysis: Extracted by the LLM in the final analysis and output generation.

3.5 Evaluation

The evaluation consisted of a quantitative evaluation as well as a qualitative evaluation of the whole AI system. Each of the evaluation methods is described in the following subsections.

3.5.1 Quantitative Evaluation

The quantitative evaluation, designed to provide an objective performance score for the system, was divided into two parts: one focused on the retrieval component and the other on the generation component.

3.5.1.1 Retrieval Component

For the quantitative retrieval evaluation, we aimed to assess how well the retrieval component of the system could find matching historical Quality Reports. To do this, we used historical Root Cause Analysis instances, each linked to several Quality Reports. All Root Cause Analyses which met the criterion of having between two and four related Quality Reports were selected. One of the linked Quality Reports was chosen as the input while the others acted as ground truth, as illustrated in Figure 3.7. If at least one of the ground truth Quality Reports was retrieved, the retrieval was counted as successful. If none of the retrieved Quality Reports were part of the ground truth, the retrieval was deemed unsuccessful. This was evaluated on the entire subset of data that met the specified criteria. The evaluation was conducted for retrieval of from one up to ten historical Quality Reports. By comparing successful and unsuccessful retrievals, an accuracy score could be calculated.

As the retrieval component depends on the text in the Quality Reports, the text quality might affect the results. Thus, the subset of Quality Reports with text quality deemed Poor was removed in a second run, to see how this compared in accuracy. Further, we looked at Root Cause Analyses with groups of two, four and eight linked Quality Reports individually, to see how the accuracy changed depending on the number of ground truths available.

Figure 3.7: Illustration of the quantitative evaluation method.
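The protocol above can be summarized in a short sketch; here, retrieve stands in for the system's retrieval component, and the grouped report ids are simplified assumptions about the data layout.

    def retrieval_accuracy(rca_groups, retrieve, k: int) -> float:
        """rca_groups: lists of Quality Report ids linked to one Root Cause Analysis.

        For each group, the first report is the query and the rest are the
        ground truth; a retrieval counts as successful if any ground-truth
        report appears among the top-k results.
        """
        successes = 0
        for group in rca_groups:
            query, truth = group[0], set(group[1:])
            retrieved = retrieve(query, k)       # top-k similar report ids
            successes += bool(truth & set(retrieved))
        return successes / len(rca_groups)

    # Sweep k from 1 to 10, mirroring the curves in Figures 4.1 and 4.2:
    # accuracies = [retrieval_accuracy(groups, retrieve, k) for k in range(1, 11)]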
3.5.1.2 Generation Component

For the quantitative generation evaluation, we aimed to assess how well the generative component of the system could choose the correct Root Cause Analysis, focusing on instances linked to exactly two Quality Reports, similar to the retrieval evaluation. Using cases with more than two linked reports would likely improve accuracy, as the LLM would have fewer Root Cause Analyses to consider. Therefore, focusing only on pairs represents a more restrictive scenario, and the resulting accuracy can be viewed as a lower bound. One Quality Report was used as input, while the other was one of the four retrieved Quality Reports, ensuring that the Root Cause Analysis used as ground truth was among the considered Root Cause Analyses, as illustrated in Figure 3.8. This filtering reduced the data set to 88 pairs of Quality Reports. A generation was considered successful if the system selected the correct linked Root Cause Analysis; otherwise, it was considered unsuccessful. By comparing the number of successful and unsuccessful generations, we calculated an overall accuracy score.

Figure 3.8: Illustration of the generative evaluation method.

3.5.2 Qualitative Evaluation

The qualitative approach consisted of user tests with domain experts, aiming to evaluate both the system's performance and its trustworthiness. Throughout the user tests, interview questions were asked about the domain experts' interaction with, and experience of, the AI system.

Five domain experts, also potential future users of the system, were selected to evaluate it. Since the purpose of these user tests was to assess the system's future value for these potential users, no additional criteria were required for participant selection. All participants had over two years of experience in the field, consisting of two women and three men, aged between 29 and 61. Each domain expert was given a brief explanation of the AI system and two Quality Reports used as input to the system. The Quality Reports were selected by our main stakeholder, who deemed them suitable for the test. During the user test, the following interview questions were asked:

• Does the system accurately summarize the user input?
• How many of the matching Quality Reports are relevant to the user input?
• Is the recommended Root Cause Analysis plausible?
• Is the recommended Root Cause Analysis likely?
• Are any of the other Root Cause Analyses presented plausible?
• Would the recommended Root Cause Analysis be the first thing you investigate?

Once the two Quality Reports had been evaluated, the domain experts were asked questions related to the system as a whole and their experience of it:

• Do you trust the system?
• What specific part(s) of the system makes you trust it?
• Would a deeper knowledge of how the system works increase your trust in it?
• Would you use the system as it is?

The aim was to evaluate the overall performance and usefulness of the system, as well as to investigate which aspects are important for user satisfaction and trust in AI systems.

4 Results

This chapter presents the results from each evaluation. First, the quantitative evaluation is presented, covering both the retrieval and generative components. Second, the qualitative evaluation is outlined, including insights on system performance, explainability and trust factors.
4.1 Quantitative Evaluation

The following section presents the results from the quantitative evaluation of the retrieval and generation components.

4.1.1 Retrieval Component

When filtering the data according to the quantitative evaluation criteria described in Section 3.5.1.1, we were left with 1141 Quality Reports. Additionally, a filter on poor text quality was applied, leaving 1041 Quality Reports. The accuracy of both can be seen in Figure 4.1.

Figure 4.1: Accuracy of the quantitative evaluation on the retrieval component for all Quality Reports.

The accuracy of only Volvo Penta’s Quality Reports was also calculated; in total, 30 Quality Reports from Volvo Penta met the filtering criteria described in Section 3.5.1.1. When applying the text quality filter, 25 Quality Reports were left. The results are illustrated in Figure 4.2.

Figure 4.2: Accuracy of the quantitative evaluation on the retrieval component for Quality Reports from Volvo Penta only.

Additionally, different grouping sizes were evaluated to see how the accuracy changed depending on how many ground-truth Quality Reports exist. The results for grouping sizes of 2, 4 and 8 are shown in Figure 4.3.

Figure 4.3: Accuracy of the retrieval component over different grouping sizes for all Quality Reports.

4.1.2 Generation Component

After filtering for the generative evaluation, described in Section 3.5.1.2, we were left with 88 Quality Report pairs. These were then evaluated, yielding an accuracy score for the generative component of the system, shown in Table 4.1. As the system is not completely deterministic, the evaluation was run three times on the same dataset. The system was consistent in 65 of the 88 cases, that is, it selected the same Root Cause Analysis in all three iterations for 65 of the Quality Report pairs.

Evaluation: Accuracy
Iteration 1: 76.14%
Iteration 2: 72.73%
Iteration 3: 73.86%
Mean: 74.24%

Table 4.1: Accuracy results across evaluation rounds.

By combining the results from the retrieval and generation components, we can calculate an overall accuracy for the system. The generative accuracy of 74.2% was measured under the assumption of 100% retrieval accuracy, meaning the relevant Quality Report was always retrieved. The retrieval accuracy with exactly two related Quality Reports was 49.2%, shown in Figure 4.3. Combining this retrieval accuracy with the generative accuracy (the product of the two, assuming the two stages succeed independently) results in an overall system accuracy of 36.4%.

4.2 Qualitative Results

A summary of the participants’ responses to the two cases evaluated using the system is found in Tables 4.2 and 4.3. Participants are labeled P1, P2, P3, P4 and P5.

Evaluation Question (P1 / P2 / P3 / P4 / P5)
Does the system accurately summarize the user input? Yes / Yes / Yes / Yes / Yes
How many of the four matching Quality Reports are relevant to the user input? 3 / 4 / 3 / 4 / 0
Is the recommended Root Cause Analysis plausible? Can’t tell / Yes / Yes / Yes / No
Is the recommended Root Cause Analysis likely? Can’t tell / Yes / Yes / Yes / No
Are any of the other Root Cause Analyses presented plausible? Yes / Yes / Yes / Yes / No
Would the recommended Root Cause Analysis be the first thing you investigate? No / No / No / No / No

Table 4.2: The users’ responses when evaluating the first case.

Evaluation Question (P1 / P2 / P3 / P4 / P5)
Does the system accurately summarize the user input? Yes / Yes / Yes / Yes / Yes
How many of the four matching Quality Reports are relevant to the user input? 1 / 4 / 4 / 2 / 2
Is the recommended Root Cause Analysis plausible? Yes / Yes / Yes / Yes / Yes
Is the recommended Root Cause Analysis likely? Can’t tell / Yes / Yes / Yes / Yes
Are any of the other Root Cause Analyses presented plausible? N/A / Yes / Yes / Yes / No
Would the recommended Root Cause Analysis be the first thing you investigate? N/A / Yes / No / Yes / No

Table 4.3: The users’ responses when evaluating the second case.

The following sections present the responses to all interview questions and highlight common themes that emerged in participants’ reasoning, divided into the distinct parts of the AI system: searching for similar Quality Reports, analyzing the related Root Cause Analyses, and the importance of explainability factors.

4.2.1 Searching for similar Quality Reports

All users agreed that the system accurately summarizes the user input. Four out of five participants specifically mentioned that this increased their confidence that the system shared their understanding of the problem description, while one participant did not comment on this aspect.

Continuing with the analysis of the matching Quality Reports, there was general agreement on the relevance of most matches in the first case. However, in the second case, opinions were more divided. One observation was that participants with less experience of the specific issue type tended to find more of the presented Quality Reports relevant, while those with deeper knowledge of the specific issue were more likely to consider some of the cases irrelevant.

4.2.2 Analyzing the related Root Cause Analyses

In the first case, three out of five participants found the recommended Root Cause Analysis plausible. Participant 1 expressed that they lacked sufficient information to assess its relevance to the problem description, while Participant 5 noted a mismatch between the contextual details in the problem description and those in the recommended Root Cause Analysis. In the second case, there was unanimous agreement that the recommended Root Cause Analysis was plausible.

None of the participants expressed the view that the recommended Root Cause Analysis was plausible but unlikely. However, all participants emphasized that they would have needed additional information about the specific Root Cause Analysis before making a final judgment.

When asked whether the recommended Root Cause Analysis should be prioritized over the others, participants gave varied responses. Most emphasized that they preferred to review all available Root Cause Analyses to make their own informed decision. However, Participant 1 noted that, if under time pressure, they would consider the recommended Root Cause Analysis first.

4.2.3 Importance of explainability factors

All participants agreed that they do not trust the system to make a final decision in any of the cases, but that they trust the information that the interface provides them with. To evaluate more specifically what made the participants trust the system, we asked targeted questions about the components designed to enhance user satisfaction and trust. Participants were asked whether each component increased their trust in the system. The results are presented in Table 4.4.
Explainability component (P1 / P2 / P3 / P4 / P5)
Summarizing the user input: Yes / Yes / Yes / Yes / Yes
Examining text quality of the user input: No / No / No / No / No
Examining confidence level in the input data analysis: No / No / No / No / No
Examining similarity score of matches: No / No / Yes / No / No
Explaining why the matching Quality Reports are similar: Yes / Yes / No / No / No
Explaining considerations in choosing the recommended Root Cause Analysis: No / No / No / No / No
Explaining alternatively considered Root Cause Analyses: Yes / Yes / No / No / No

Table 4.4: Responses from the participants when evaluating the components designed to enhance user satisfaction and trust.

All participants responded “yes” when asked whether the summary component contributed to their trust in the system. When elaborating, several participants valued the component as a confirmation of alignment between the system and their understanding.

Regarding the components “Explaining why the matching Quality Reports are similar” and “Explaining alternatively considered Root Cause Analyses”, participants who appreciated these features valued the ability to follow the model’s reasoning. In contrast, those who responded negatively felt they could make these assessments themselves based on the information provided.

All participants agreed that they do not want deeper knowledge of the logic behind the system; they did not believe it would enhance their trust in any way.

All participants except one agreed they would use the system as it is, if it became available to them. A noteworthy aspect is that the system’s interface consolidates information that is normally accessed through multiple systems when investigating these issues. This consolidation appears to enhance the system’s usefulness, especially in comparison to the current workflow.

5 Discussion

In this chapter, we discuss the quantitative and qualitative results obtained in this thesis. The aim is to interpret these findings within the context of the decision making processes they are designed to support. The discussion explores the performance of the system in the context of potential future users. We also explore the importance of explainability factors and their role in building trust in AI systems. This chapter serves as a critical reflection on the methodologies employed and offers insights into future directions for research and application.

5.1 Quantitative Evaluation

The following section discusses the results from the quantitative evaluation of the retrieval and generation components.

5.1.1 Retrieval Component

When examining the accuracy for all Quality Reports, as well as for the subset specific to Volvo Penta, a significant increase in accuracy can be seen up to the retrieval of four similar issues. Beyond this, the rate of improvement diminishes. This supports our choice to display the top four retrieved Quality Reports in the user interface, as displaying additional matches would not significantly improve accuracy.

While the accuracy metric aims to give an objective measurement of the retrieval part of the system’s performance, it is important to note that this metric is influenced by certain characteristics of the data:

1. As the number of ground-truth related Quality Reports increases, the accuracy should, in theory, increase for purely probabilistic reasons.
2. The related Quality Reports serving as the ground truth might not be the only relevant matching cases.
In fact, more relevant Quality Reports linked to other Root Cause Analyses might exist. In this evaluation, retrieving them would count as an unsuccessful retrieval even though they might be highly relevant to the input.

The first statement is partially confirmed in Figure 4.3, where it can be seen that the accuracy increases as the number of related Quality Reports increases. However, it is challenging to determine whether this improvement is proportional to the expected increase due to more ground truths, or whether other factors, such as data quality, influence the results.

The second statement is harder to interpret in our results. It can be argued that the achieved accuracy represents a lower bound on the true accuracy. As mentioned, a retrieval is deemed unsuccessful if none of the related Quality Reports appear, even if the retrieved Quality Reports are directly relevant to the user input. Consequently, the true accuracy might be higher, but without domain experts manually assessing these cases, it remains unknown.

To conduct the quantitative evaluation, establishing a ground truth was essential, necessitating the previously outlined data selection process. While this approach was critical for ensuring the accuracy and reliability of the evaluation, it inadvertently led to the exclusion of a substantial amount of data. This loss primarily affected instances where only a single Quality Report was linked to a Root Cause Analysis, making it impossible to evaluate them using our methodology. Despite the necessity of this process, the reduction in available data highlights the challenge of balancing evaluation rigor with data coverage in data-driven evaluations. The alternative, manually evaluating all instances with domain experts to obtain ground truths, was not feasible due to time constraints, leaving the performance on these reports undetermined.

While the accuracy score provides guidance on how well the system performs, a more suitable performance metric might be the time saved for the user. However, no baseline for the time currently spent in this process is available. The qualitative evaluation with potential future users indicates that the system would save them time and assist them in their work, but it has not been possible to quantify the actual time saved or performance increase. The accuracy metric may or may not mirror the true performance increase; the correlation between this accuracy score and actual time saved is not necessarily high. A long-term study in which domain experts work in parallel, one group using the system and another without it, would be necessary to establish a reliable baseline and accurately measure the system’s performance gains.

5.1.2 Generation Component

As previously noted, the generation component is not entirely deterministic, meaning the same input may produce different outputs each time. To address this, we conducted multiple iterations of the evaluation and calculated the mean accuracy. Similar to the retrieval accuracy, the generation accuracy of 74.24% can be considered a lower bound, since a more suitable Root Cause Analysis for the user input might exist beyond the perceived ground truth. Without comprehensive manual evaluation by domain experts, this remains uncertain.

Similar to the retrieval evaluation, the number of Quality Reports related to a Root Cause Analysis affects the generation accuracy, albeit differently. With more related Quality Reports, the likelihood of considering fewer distinct Root Cause Analyses increases.
For instance, if three out of four retrieved Quality Reports share the same Root Cause Analysis, the generation component examines only two distinct Root Cause Analyses, potentially increasing the chance of selecting the correct one. As we only tested Root Cause Analyses with exactly two related Quality Reports, this was not confirmed. The reasoning does, however, support the idea that the achieved accuracy serves as a lower bound on the true accuracy.

The same caveat regarding the correlation between the accuracy metric and a general performance increase, discussed in Section 5.1.1, applies here too: the correlation is not necessarily high.

5.2 Qualitative Evaluation

As is often the case in RAG applications, this evaluation has yielded noteworthy results. While there are numerous aspects that fall outside the scope and structure of our evaluation in this thesis, certain areas and patterns have emerged that deserve attention. Although the limited number of participants prevents us from drawing definitive general conclusions, the results can be regarded as indicative. This part of the discussion will focus on the following distinct parts of the AI system: searching for similar Quality Reports, analyzing the related Root Cause Analyses, and the importance of explainability factors, followed by a final reflection on the system as a whole and its drawbacks.

5.2.1 Searching for similar Quality Reports

The first component of this part of the system involves the AI summarizing the user’s input. Results indicate that providing a summary, rather than omitting it, helps build trust in the system. This is achieved by confirming the user’s problem description, allowing them to verify that the AI has accurately understood the issue without introducing hallucinations or irrelevant content. However, the perceived importance of this component appears to vary between individuals. It should also be noted that the problem descriptions used in both test cases were relatively brief, which means the full potential of this component may not have been fully explored.

Based on user input, the system searches for similar historical Quality Reports and identifies relevant matches with reasonable accuracy. Although we did not establish a clear benchmark for evaluating the relevance of these matches, we relied on subjective assessment guided by our understanding of the data and context. Given the unstructured nature of the textual data in this domain, where exact matches are rarely expected, it is encouraging that users perceived the suggested cases as relevant. However, we cannot determine with certainty whether these were the most relevant matches, as it would be impractical to manually evaluate all 6,500+ historical Quality Reports for comparison.

Two participants mentioned the idea of “garbage in, garbage out”, pointing out that the system’s effectiveness largely depends on the quality of the problem description provided by the user. In addition, throughout the development process, we considered the quality of the data within the underlying database. The system’s performance is inherently limited by the quality of the information available for analysis. Another challenging scenario arises when the system encounters cases with no suitable historical Quality Reports. Currently, it suggests four suboptimal matching Quality Reports and often recommends a Root Cause Analysis from an unrelated field.
Although establishing a minimum similarity threshold for displaying matching Quality Reports could mitigate this issue, a more effective strategy would involve deriving new conclusions and proposing novel solutions. By leveraging historical problem-solving patterns alongside technical specifications of components and power solutions, it would be intriguing to assess the system’s ability to generate innovative solutions.

Our reflections on data quality centered on two key aspects. The first pertains to the volume and comprehensiveness of the historical data, addressing the issue of occasionally presenting users with suboptimal matching cases, as discussed above. This consideration influenced an early decision to expand the data source from Volvo Penta data only to also include data from Volvo Group Trucks Technology. However, several evaluations later indicated that the data from Volvo Group Trucks Technology was not always perceived as relevant. The second aspect relates to the quality of individual data entries. Many of the historical Quality Reports in the database contained incomplete or vague information. Regardless of how well the user’s input is formulated, low-quality data provides a weak foundation for identifying meaningful matches.

Finally, it is important to acknowledge that, at the outset of this project, we lacked the experience and perspective to critically assess the quality and relevance of the data. Due to time constraints and unfamiliarity with the dataset and its context, we initially underestimated the significance of this aspect. Only toward the end of the thesis work did we begin to fully grasp how crucial data quality is to the performance and reliability of the system.

5.2.2 Analyzing the related Root Cause Analyses

In general, participants’ reasoning regarding the relevance of the presented Root Cause Analyses was closely linked to the relevance of the matching historical Quality Reports. This supports the underlying logic of using the similarity of Quality Reports as a basis for identifying possible Root Cause Analyses.

The evaluation of both the relevance of all suggested Root Cause Analyses and the system’s final recommendation suggests that multiple valid solutions to a problem often exist and that the system is capable of identifying several relevant possible solutions. However, when it comes to whether the recommended Root Cause Analysis should be prioritized over the others, responses indicate some level of skepticism. Participants appeared less inclined to fully trust the system’s recommendation, particularly when they had the expertise to critically assess the alternatives themselves. This suggests that the recommendation of a single Root Cause Analysis is considered less valuable than the broader ability to explore several plausible options. Trust in the recommendation seems highly dependent on the specific problem description and how well both the user and the system have managed to capture the nature and context of the issue. If the user is better at capturing the nature and context, they tend to rely less on the system’s recommendation.

5.2.3 Importance of explainability factors

We have explored Explainable AI from a human-centered perspective to assess the interpretability of the model. Although LLMs differ significantly from other “black-box” models, understanding their specific decision making processes remains challenging.
While we did not attempt to examine all components of Explainable AI, we aimed to incorporate understandability, explainability and transparency into certain aspects of the model. This focus is particularly relevant given that we did not train a model ourselves but utilized an existing one within our system.

As previously discussed, the provided summary of the user input serves its purpose as a form of verification. Our results indicate that participants view this as a necessity rather than a trust-building component. The expected behavior is for it to summarize accurately; it thus does not increase trust but meets expectations. However, it is important to note that this component may not have been fully evaluated in the pre-designed test cases used during the qualitative assessment. Participants were limited to the provided input, which was neither particularly long nor complex, making it relatively easy to summarize. A more targeted evaluation of this component, using more complex or user-generated input, could potentially reveal its full value and explore its purpose in building trust.

When examining the factors that helped build trust in the AI system, the results diverged to some degree. However, a common theme among several participants was that trust was more strongly influenced by the reasoning components we presented than by decisions made by the LLM, such as determining a specific confidence level. It appears that participants were hesitant to trust these specific decisions, instead placing more value on the reasoning behind the system’s decision making process than on the decisions themselves.

It was also evident that some users valued these factors less than others. This seems to be closely related to the users’ specific needs and expectations regarding what the system can and should provide. The results from evaluating the explainability components revealed patterns similar to those observed when evaluating the search for matching cases. Users who sought broader matches tended to value these components more than those looking for very specific matches and relationships. The findings also suggest that users with a narrower search strategy may have a clearer approach to analyzing and identifying historical Quality Reports. It is possible that they pay less attention to the reasoning components because they already have a structured method for identifying these aspects themselves. Additionally, they may be more accustomed to performing this process without the aid of an AI system, making them less likely to engage with the AI components and assistance. Another factor to consider is the system’s performance and how it may influence user behavior. If the system does not provide sufficiently relevant information, the perceived usefulness will naturally be lower.

An intriguing observation from the evaluation of trust-building factors is that none of the participants expressed a desire for deeper knowledge about the system’s back-end logic. The common justification was a preference to trust the system as it is, without needing further understanding of its inner workings. This raises important considerations about the choices made regarding the logic that significantly influences the model’s outcomes and performance, especially given that users are indifferent to these mechanisms. In our system, a crucial decision was how to prioritize which Quality Reports were presented to end users.
This was determined by a relevance score, which combined the FAISS similarity score with a weight derived from the text quality. Since this score fully dictated which Quality Reports were ultimately shown to users, the choice of logic is extremely important. Our reflection, underscored by the users’ lack of interest in this logic, is that involving the right people in making these decisions is vital. It requires substantial domain knowledge to ensure the decisions are made correctly.

While the usefulness of the AI system has been extensively discussed in previous sections, the concept of Usable AI can be analyzed from the questions addressing the overall perception of the system. Although this thesis did not focus on the front-end application and interface, it is evident that a significant value of this solution lies in integrating the gathered information into a single interface. This integration alone can be seen as a time saver and a potential improvement to the performance of the users’ analysis.

Finally, it is important to emphasize that all participants in the evaluation agreed they would not trust the system to make decisions without human involvement. This underscores that the system is designed to assist in decision making rather than fully automate the process.

6 Conclusion

The results indicate that a RAG system can effectively retrieve and summarize historical reports to assist in identifying the root cause of engine quality issues. With a retrieval accuracy of 49.2% and a generation accuracy of 74.2%, combined with insights from domain expert interviews, the system demonstrates its usefulness. However, the relationship between these accuracy metrics and the time saved for users remains unknown. A meaningful baseline for the current time expenditure has not been established, and an evaluation of the time spent using the system has yet to be conducted.

The key component of the RAG architecture used in this system is its database; however, this database also presents a significant drawback. The system is heavily reliant on it, and when relevant information is absent, the resulting output often becomes irrelevant.

The components that appear to influence user satisfaction and trust in AI-driven insights are primarily the reasoning components that explain the system’s process for reaching its conclusions. Providing an initial summary of the AI system’s understanding of the input allows users to verify that it aligns with their understanding of the problem. This seems to be a hygiene factor in AI systems rather than a means of building trust. Confidence statements from the AI system do not seem to affect user trust; instead, users tend to make that assessment themselves. User satisfaction with the AI system appears to depend on its performance compared to current methods of working.

Bibliography

[1] Russell, S. and Norvig, P., 2021. Artificial Intelligence: A Modern Approach, Global Edition. Pearson Education Limited.

[2] Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang, H. and Wang, H., 2023. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997.

[3] Mosqueira-Rey, E., Hernández-Pereira, E., Alonso-Ríos, D., Bobes-Bascarán, J. and Fernández-Leal, Á., 2023. Human-in-the-loop machine learning: a state of the art. Artificial Intelligence Review, 56(4), pp.3005-3054.

[4] Naveed, H., Khan, A.U., Qiu, S., Saqib, M., Anwar, S., Usman, M., Akhtar, N., Barnes, N. and Mian, A., 2023.
A comprehensive overview of large language models. arXiv preprint arXiv:2307.06435.

[5] Worth, P.J., 2023. Word embeddings and semantic spaces in natural language processing. International Journal of Intelligence Science, 13(1), pp.1-21.

[6] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I., 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.

[7] Radford, A., Narasimhan, K., Salimans, T. and Sutskever, I., 2018. Improving language understanding by generative pre-training.

[8] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F. and Rodriguez, A., 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

[9] Bang, Y., Cahyawijaya, S., Lee, N., Dai, W., Su, D., Wilie, B., Lovenia, H., Ji, Z., Yu, T., Chung, W. and Do, Q.V., 2023. A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity. arXiv preprint arXiv:2302.04023.

[10] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J. and Krueger, G., 2021, July. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (pp. 8748-8763). PMLR.

[11] Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A.J., Welihinda, A., Hayes, A., Radford, A. and Mądry, A., 2024. GPT-4o system card. arXiv preprint arXiv:2410.21276.

[12] Liu, C., Tao, C., Liang, J., Feng, J., Shen, T., Huang, Q. and Zhao, D., 2023, December. Length-adaptive distillation: Customizing small language model for dynamic token pruning. In Findings of the Association for Computational Linguistics: EMNLP 2023 (pp. 4452-4463).

[13] Muralidharan, S., Turuvekere Sreenivas, S., Joshi, R., Chochowski, M., Patwary, M., Shoeybi, M., Catanzaro, B., Kautz, J. and Molchanov, P., 2024. Compact language models via pruning and knowledge distillation. Advances in Neural Information Processing Systems, 37, pp.41076-41102.

[14] AI@Meta, 2024. Llama 3 Model Card. Available at: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md (Accessed: 9 April 2025).

[15] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A. and Schulman, J., 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, pp.27730-27744.

[16] Kamath, U., Keenan, K., Somers, G. and Sorenson, S., 2024. Large Language Models: A Deep Dive. Springer.

[17] Zhang, X., Talukdar, N., Vemulapalli, S., Ahn, S., Wang, J., Meng, H., Murtaza, S.M.B., Leshchiner, D., Dave, A.A., Joseph, D.F. and Witteveen-Lane, M., 2024. Comparison of prompt engineering and fine-tuning strategies in large language models in the classification of clinical notes. AMIA Summits on Translational Science Proceedings, 2024, p.478.

[18] Zhou, H., Li, M., Xiao, Y., Yang, H. and Zhang, R., 2024. LEAP: LLM instruction-example adaptive prompting framework for biomedical relation extraction. Journal of the American Medical Informatics Association, 31(9), pp.2010-2018.

[19] Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H. and Neubig, G., 2023. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9), pp.1-35.
[20] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.T., Rocktäschel, T. and Riedel, S., 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, pp.9459-9474.

[21] Liu, N.F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F. and Liang, P., 2024. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12, pp.157-173.

[22] Zheng, A. and Casari, A., 2018. Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists. O’Reilly Media, Inc.

[23] Mikolov, T., Chen, K., Corrado, G. and Dean, J., 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

[24] Günther, M., Ong, J., Mohr, I., Abdessalem, A., Abel, T., Akram, M.K., ... and Xiao, H., 2023. Jina embeddings 2: 8192-token general-purpose text embeddings for long documents. arXiv preprint arXiv:2310.19923.

[25] Douze, M., Guzhva, A., Deng, C., Johnson, J., Szilvasy, G., Mazaré, P.E., Lomeli, M., Hosseini, L. and Jégou, H., 2024. The Faiss library. arXiv preprint arXiv:2401.08281.

[26] Wang, X., Wang, Z., Gao, X., Zhang, F., Wu, Y., Xu, Z., Shi, T., Wang, Z., Li, S., Qian, Q. and Yin, R., 2024. Searching for best practices in retrieval-augmented generation. arXiv preprint arXiv:2407.01219, pp.17716-17736.

[27] Szilvasy, G., Mazaré, P.E. and Douze, M., 2024. Vector search with small radiuses. arXiv preprint arXiv:2403.10746.

[28] Anonymous, 2024. Astute RAG: Overcoming imperfect retrieval augmentation and knowledge conflicts for large language models. Submitted to ACL Rolling Review, December 2024. https://openreview.net/forum?id=WVDzLJMd7H

[29] Yu, H., Gan, A., Zhang, K., Tong, S., Liu, Q. and Liu, Z., 2024, August. Evaluation of retrieval-augmented generation: A survey. In CCF Conference on Big Data (pp. 102-120). Singapore: Springer Nature Singapore.

[30] Yang, S.J., Ogata, H., Matsui, T. and Chen, N.S., 2021. Human-centered artificial intelligence in education: Seeing the invisible through the visible. Computers and Education: Artificial Intelligence, 2, p.100008.

[31] Arrieta, A.B., Díaz-Rodríguez, N., Del Ser, J., Bennetot, A., Tabik, S., Barbado, A., García, S., Gil-López, S., Molina, D., Benjamins, R. and Chatila, R., 2020. Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI. Information Fusion, 58, pp.82-115.

[32] Xu, W., 2019. Toward human-centered AI: a perspective from human-computer interaction. Interactions, 26(4), pp.42-46.

[33] Zhang, Y., Liao, Q.V. and Bellamy, R.K., 2020, January. Effect of confidence and explanation on accuracy and trust calibration in AI-assisted decision making. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (pp. 295-305).

[34] Taudien, A., Fügener, A., Gupta, A. and Ketter, W., 2022. The effect of AI advice on human confidence in decision-making.

[35] Dietvorst, B.J. and Bharti, S., 2020. People reject algorithms in uncertain decision domains because they have diminishing sensitivity to forecasting error. Psychological Science, 31(10), pp.1302-1314.

[36] Dietvorst, B.J., Simmons, J.P. and Massey, C., 2015. Algorithm aversion: people erroneously avoid algorithms after seeing them err. Journal of Experimental Psychology: General, 144(1), p.114.
A Appendix 1

A.1 Prompt for feature engineering

You are an expert diagnostician specializing in marine and industrial parts analysis. Your analysis must be detailed, methodical, and highly structured. Always provide clear, concise responses that follow the exact format requested. Analyze the following problem with precise, structured reasoning:
Context:
Assessment Criteria:
Analysis Structure:
Response Format:
Important Considerations:

A.2 Prompt for determining the most likely root cause

You are an expert diagnostician specializing in marine and industrial parts analysis. Your analysis must be detailed, methodical, and highly structured. Always provide clear, concise responses that follow the exact format requested. Analyze the following problem with precise, structured reasoning:
Context:
Objective:
Analysis Structure:
Response Format:
Important Considerations: