Bridging Trust and Design of a Multi-Agent LLM-Based HR Chatbot: For the Times They Are A-Changin'

Master's Thesis 2025 in Computer Science and Engineering

Jonatan Axetorn
Felix Edholm

© Jonatan Axetorn 2025.
© Felix Edholm 2025.

Academic supervisor: Lucas Gren, Department of Computer Science and Engineering
Industry supervisor: Lucas Gren
Examiner in practice: Krishna Ronanki, Department of Computer Science and Engineering
Examiner: Christian Berger, Department of Computer Science and Engineering

Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX
Gothenburg, Sweden 2025

Abstract

Introduction: The integration of Large Language Models (LLMs) into workplace systems presents significant opportunities, particularly in the domain of human resources (HR), where repetitive tasks, such as providing information that employees could retrieve themselves, are common and could potentially be handled by an LLM-based solution. However, a lack of user trust remains a major barrier to the adoption of LLM-based systems.

Objective: This thesis investigates which trust factors exist in LLM-based systems and how they can be addressed through system design, with a specific focus on a multi-agent HR chatbot.

Method: Using a Design Science Research methodology, the study was conducted in two iterative cycles. Cycle I identified trust factors through a literature review and interviews with six employees at a multinational company. It also included a workshop with five AI experts to discuss and validate design choices. Cycle II involved implementing and evaluating an artifact: a multi-agent chatbot tailored to HR queries.

Findings: Thematic analysis revealed external trust factors (transparency, organisational measures, and external security) and internal trust factors (internal security, model differences, risk of bias, and reliability), of which reliability emerged as the most critical. The artifact was evaluated through interviews and metrics such as answer relevancy, faithfulness, and robustness, showing consistently strong performance and broad user acceptance.

Conclusion: The multi-agent HR chatbot effectively addressed key trust concerns and was positively received by most interviewees, demonstrating its potential for real-world application. These findings suggest that trust factors can be meaningfully addressed through thoughtful design and should be treated as a core consideration throughout the development process of LLM-based systems.
Keywords: autonomous agents, chatbot, design science research, human resources, HR, large language model, multi-agent architectures, system design, trust, trust factors

Acknowledgements

First and foremost, we would like to express our sincere gratitude to everyone who participated in the interviews and workshop conducted during this thesis. We would also like to thank Lucas Gren for his support and guidance as our academic and industry supervisor during this project.

Jonatan Axetorn, Felix Edholm
Gothenburg, June 2025

Contents

List of Figures
List of Tables
1 Introduction
  1.1 Problem description
  1.2 Purpose of the study
  1.3 Research questions
  1.4 Significance of the study
  1.5 Delimitations
  1.6 Thesis outline
2 Background
  2.1 Trust
  2.2 Large language models
    2.2.1 Hallucinations
    2.2.2 Prompt engineering
  2.3 Retrieval-augmented generation
  2.4 Autonomous agents
  2.5 LLM orchestration
    2.5.1 LangChain & LangGraph
  2.6 Guardrails
  2.7 LLM-as-a-judge
    2.7.1 DeepEval
3 Related Work
  3.1 Trust in LLMs
  3.2 Challenges with multi-agent LLM-based systems
  3.3 Collaboration in multi-agent systems
  3.4 Multi-agent retrieval-augmented generation filtering
4 Method
  4.1 Design science research
    4.1.1 Problem investigation
    4.1.2 Solution design
    4.1.3 Design validation
    4.1.4 Implementation
    4.1.5 Evaluation
  4.2 Overview of cycles
5 Cycle I
  5.1 Method - Qualitative data collection
    5.1.1 Interviews
      5.1.1.1 Problem investigation interview setup
      5.1.1.2 Thematic analysis
    5.1.2 Workshop
  5.2 Findings - Cycle I
    5.2.1 Trust factors in LLM-based systems (RQ1)
      5.2.1.1 External trust factors — Trust impacted by non-technical forces
      5.2.1.2 Internal trust factors — Trust impacted by technical details
    5.2.2 Findings from workshop (RQ2)
6 Cycle II
  6.1 The artifact - final solution candidate (RQ2)
    6.1.1 Overview
    6.1.2 Guidelines component
    6.1.3 Employment component
  6.2 Method - Quantitative data collection
    6.2.1 Metrics
    6.2.2 Dummy data
    6.2.3 Test runs
  6.3 Method - Qualitative evaluation interview
  6.4 Findings - Cycle II
    6.4.1 Findings from evaluation interviews (RQ3)
    6.4.2 Findings from quantitative evaluation (RQ3)
7 Discussion
  7.1 Implications for research
  7.2 Implications for practice
  7.3 Limitations
  7.4 Future work
8 Conclusion
References
A Appendix
  A.1 Problem investigation interview guide
  A.2 Evaluation interview guide
  A.3 Quantitative evaluation questions
  A.4 Artifact agent prompts

List of Figures

4.1 The regulative cycle of design science research.
4.2 Activities performed during the two cycles in this thesis.
5.1 Identified trust factors in LLM-based systems.
6.1 Example of choice for type of question in the chatbot.
6.2 Structure of the HR chatbot.
6.3 Example output from the chatbot to the question "How many vacation days do I get?" with corresponding HR guideline source.

List of Tables

4.1 Participant counts for qualitative data collection activities.
4.2 Number of evaluation runs per quantitative metric.
6.1 Baseline evaluation results of the guidelines component for simple category questions. Each question was asked and evaluated 20 times. All values are rounded to three decimal places.
6.2 Robustness evaluation results of the guidelines component for simple category questions, including percentage change relative to the baseline. Each baseline question was reformulated into 9 variations, and all 10 versions (including the original) were each evaluated 5 times. The robustness score represents the average of these 50 runs for each baseline question. Robustness values are rounded to three decimal places; percentage changes are rounded to two decimal places.
6.3 Evaluation results of the guidelines component for broader category questions. Each question was asked and evaluated 20 times. All values are rounded to three decimal places.
6.4 Evaluation results of the employment component. Each question was asked and evaluated 20 times. All values are rounded to three decimal places.
A.1 Questions for the employment component and expected outputs.
A.2 Other questions for the employment component and expected outputs.

1 Introduction

The application of Large Language Model (LLM) solutions across various business areas has never been more relevant than it is today. The opportunity to use natural language to address repetitive tasks is promising. Text-based interactions with LLMs are increasingly replacing traditional human-to-human interactions [1].

Despite their potential to improve organisational efficiency, the introduction of artificial intelligence (AI) solutions often encounters reluctance. Factors such as fear of job displacement, distrust of AI's perceived human qualities, and general scepticism contribute to delays in adopting these systems [2]. To overcome these challenges, it is crucial to design LLM-based systems that actively build user trust. Key performance-related factors, such as accuracy and the frequency of hallucinations, have been shown to positively influence this trust [3, 4]. Since these factors are directly shaped by system design, thoughtful design emerges as a vital strategy for fostering trust in LLM-based technologies.

Developing autonomous agent systems based on LLMs, where agents refer to AI-based entities that have capabilities such as planning, social interaction, and memory [5], holds significant potential for positively impacting trust factors such as the reliability of the system. Additionally, LLM-based autonomous agent systems have demonstrated significant versatility [6], highlighting their potential to address a wide range of organisational needs. In an ideal scenario, a general agent-based system could meet the needs of employees across different roles within a company. However, creating general-purpose LLM-based solutions has proven to be elusive [7, 8]. A possible alternative is tailoring LLM-based systems to specific purposes.

Furthermore, multi-agent architectures, which leverage the collaborative abilities of multiple LLM agents, have been shown to outperform single-agent systems when handling complex problems [9]. This suggests that designing a multi-agent architecture tailored to a specific role within a company could yield significant performance benefits. Although improved performance alone may not guarantee user trust, it remains an important factor influencing trust [3, 4], as previously stated. Consequently, a multi-agent architecture represents a promising approach for enhancing user trust. A well-designed multi-agent LLM-based system could also reduce the need for human-to-human interactions, thereby improving efficiency. This is especially relevant in the context of human resources (HR), where LLM-based systems can automate tasks that traditionally required direct communication with HR staff [2].

This thesis explores the factors that influence trust in LLM-based systems, considering both non-technical elements and those shaped by technical decisions. Through a combination of literature review and interviews, this study identifies key trust factors.
Employing a design science research approach, the thesis presents an artifact: a multi-agent LLM-based HR chatbot designed to answer questions related to HR guidelines and employment information, with trust factors integrated into its design. The artifact is then evaluated using both qualitative and quantitative methods.

1.1 Problem description

Most existing studies on trust in LLM-based systems focus primarily on user experience rather than system design. This reveals a critical gap: how can LLM-based systems be designed with trust-building factors in mind?

Both single-agent and multi-agent systems present unique challenges. Single-agent systems often face limitations such as shorter context windows [10] and a higher risk of hallucinations [11]. Multi-agent systems, on the other hand, must address complexities like task allocation and coordination among agents [12]. However, the benefits offered by multi-agent systems, such as improved performance and robustness [13, 8], tend to outweigh these coordination challenges. Despite this potential, current research primarily focuses on single-agent systems, leaving the potential of multi-agent solutions underexplored.

Another important consideration is whether the LLM-based system is general-purpose or domain-specific. Since different roles have different needs, general solutions often underperform compared to bespoke, domain-specific alternatives. This has been demonstrated in both legal [7] and HR contexts [14], where tailored systems have shown superior results.

Taken together, these findings highlight a key research gap: the design of a multi-agent LLM system tailored to specific roles and organisational needs, while incorporating trust-related factors, has yet to be thoroughly explored.

1.2 Purpose of the study

The purpose of this study is to explore the factors that influence user trust in LLM-based systems and to examine how these factors can be addressed through system design. Specifically, the study focuses on the development of a multi-agent chatbot for HR-related queries, aiming to identify design choices that enhance trust. By doing so, it seeks to bridge the gap between trust considerations and system design in the context of bespoke, domain-specific LLM applications.

1.3 Research questions

• RQ1: What are the main trust factors that exist in the usage of an LLM-based system?
• RQ2: What potential solutions can be integrated into the system design of an LLM-based HR chatbot to address the relevant trust factors identified in RQ1?
• RQ3: To what extent can the relevant trust factors identified in RQ1 be addressed through the design solutions implemented in an LLM-based HR chatbot?

1.4 Significance of the study

The significance of this study lies in its contribution to bridging the gap between system design and user trust in LLM-based applications. It offers practical knowledge for organisations seeking to implement multi-agent LLM systems that foster trust, and thereby encourage user adoption.

Additionally, developing a system built around an AI component, such as an LLM, is part of Software Engineering (SE) for AI. As highlighted by Uchitel et al. [15], this area is highly relevant to the broader software engineering community. This thesis seeks to make a meaningful contribution to SE for AI by addressing the lack of research on designing and constructing trust-fostering multi-agent systems.
1.5 Delimitations

This thesis focuses on the development of a chatbot designed to assist employees in querying HR guideline documents and employment-related information. It explicitly excludes other use cases, such as HR personnel interacting with the system or scenarios involving recruitment, onboarding, or employee management. The system is limited to handling informational queries only and does not perform transactional actions, such as applying for leave or managing tasks.

Although security and confidentiality are essential for systems that handle personal or sensitive data, these concerns fall outside the scope of the developed artifact.

The research does not involve a comparison between different large language models. Instead, the chatbot exclusively uses llama3-70b-8192 without any fine-tuning or modification of the underlying model.

The goal of developing the chatbot in this thesis is not to create a fully deployable system for real-world use. Instead, the purpose is to explore how specific design choices influence the trust factors identified. Consequently, no formal requirements elicitation is conducted with stakeholders.

The study is conducted in collaboration with a large multinational company, and all interviews are carried out with employees from within this organisation.

Finally, while user interface design and usability are known to influence trust in AI systems, these aspects are not a focus of this thesis.

1.6 Thesis outline

The thesis begins by presenting key concepts and background information in Chapter 2. Chapter 3 reviews related research relevant to the thesis, including studies on trust in AI and multi-agent architectures. The research methodology is described in Chapter 4, which outlines the overall Design Science Research approach used in the study.

The thesis follows two iterative design cycles. Chapter 5 details Cycle I, including the methodology for qualitative data collection and the corresponding findings. Chapter 6 covers Cycle II, beginning with a presentation of the completed artifact, a multi-agent LLM-based HR chatbot, followed by descriptions of the quantitative and qualitative evaluation methods. The chapter concludes with the findings from the artifact evaluation.

In Chapter 7, the discussion expands on the findings, explores their implications, and addresses threats to validity. It also outlines potential directions for future research. Finally, Chapter 8 provides a conclusion to the thesis.

2 Background

This chapter provides background on the key concepts relevant to this thesis. It begins with an overview of trust and then introduces LLMs more broadly, covering key challenges such as hallucinations and the role of prompt engineering. The chapter then shifts focus to the foundations of retrieval-augmented generation (RAG), orchestration frameworks, guardrails, and the concept of autonomous agents. Finally, it outlines relevant evaluation techniques, with a focus on LLM-as-a-judge and the DeepEval framework used in this study.

2.1 Trust

Trust is a complex and multi-dimensional concept that is challenging to define in a way that applies universally across different contexts. It has been explored in various fields, including psychology [16], economics [17], organisational theory [18], and sociology [19], leading to diverse and sometimes conflicting research [20, 21].
However, in a general sense, trust can be viewed as the relationship between a "trustor" (the one who trusts) and a "trustee" (the one who is trusted), according to Mayer et al. [18]. While researching trust in digital information, Kelton et al. [20] discuss four levels of trust that they have identified in the literature on trust:

• Individual trust: A person's inherent trust based on accumulated experiences.
• Interpersonal trust: A social connection between a trustor and a trustee.
• Relational trust: Trust that develops as an emergent property of the relationship over time.
• Societal trust: Trust that exists within a community or society as a whole.

For the purposes of this thesis, interpersonal trust is most relevant, as it pertains to the one-way trust relationship between a trustor and a trustee. Importantly, the trustee does not necessarily need to be a human; it could also be a technological system, such as an LLM-based chatbot.

Furthermore, Kelton et al. [20] argue that three key conditions must be met for trust to be relevant in a given situation:

• Uncertainty: A lack of information creates uncertainty.
• Vulnerability: The trustor is at risk of experiencing a loss if the trust is betrayed.
• Dependence: The trustor has a need that the trustee is capable of fulfilling.

In the context of an LLM-based HR chatbot such as the one in this thesis, uncertainty arises for the employee (the trustor) because they typically turn to the chatbot (the trustee) when they lack specific HR-related information, such as details regarding vacation days or company benefits. Regarding vulnerability, there is a potential risk that if the chatbot provides inaccurate information or discloses sensitive data inappropriately, the employee may experience negative consequences, such as making decisions based on faulty or incomplete information. Finally, the employee's dependence on the chatbot is evident, as the chatbot holds the necessary information and has the capability to address the employee's questions, thereby fulfilling their informational needs in the HR context.

2.2 Large language models

LLMs are a category of artificial intelligence designed to generate, interpret, and engage with natural human language. These models are trained on vast amounts of textual data, enabling them to learn the complexities of language, including syntax, semantics, and contextual relationships [22]. A significant advancement in this field was the introduction of BERT (Bidirectional Encoder Representations from Transformers), which enabled models to assess the importance of words in a sentence regardless of their position [23]. ChatGPT, which gained widespread public attention in 2022, further advanced these capabilities with a larger and more powerful model. The result is text generation that is both coherent and contextually relevant, based on the input it receives [24]. The practical applications of LLMs are broad, ranging from responding to simple queries to performing complex data analysis.

2.2.1 Hallucinations

LLMs can sometimes produce undesirable outcomes, resulting in outputs that are "bland, incoherent, or caught in repetitive loops" [25]. In such cases, the generated content may be nonsensical or unfaithful to the source input. This phenomenon is commonly referred to as hallucinations. Hallucinations present significant concerns regarding the reliability and performance of LLMs for several reasons.
One major issue is a reduction in accuracy, as hallucinated responses are, by definition, incorrect. Another concern is related to security, as hallucinations may lead the model to produce or infer sensitive information that it should not access or disclose. Addressing hallucinations remains an ongoing challenge in the field, and researchers are actively developing various techniques to mitigate their occurrence [25].

2.2.2 Prompt engineering

A prompt is an input provided to an LLM that guides the nature of the generated output. Prompts can consist of various types of media, including text, images, audio, or other formats. The process of designing and refining these inputs is referred to as prompt engineering [26]. Prompt engineering has emerged as an effective method for enhancing the performance of LLMs, as it does not require altering the underlying model itself, but instead involves crafting more effective instructions for the AI. Well-designed prompting techniques have been shown to significantly improve LLM performance, making prompt engineering a critical consideration when developing LLM-based systems [27, 28].

According to OpenAI, some effective strategies for prompt engineering include:

• Including specific details in the query to obtain more relevant answers
• Using delimiters to clearly separate distinct parts of the input
• Specifying the steps required to complete a task
• Providing examples to guide the model's response
• Indicating the desired length of the output

Relatively simple techniques such as these can lead to substantial improvements in the quality and relevance of LLM-generated outputs [29].

2.3 Retrieval-augmented generation

RAG was originally developed by Lewis et al. [30] for natural language processing tasks. This approach enhances LLMs by integrating domain-specific knowledge retrieved from external data sources, thereby mitigating the generation of inaccurate or outdated information. RAG enables text generation to be grounded in relevant, retrieved data rather than relying solely on the model's pre-trained knowledge [31]. The incorporation of external data sources is particularly critical in question-answering systems, where the factual accuracy of responses is a key requirement. As stated, one of the primary challenges in LLM-based systems is the occurrence of hallucinations. Research has demonstrated that RAG significantly reduces the frequency of hallucinations while maintaining the overall performance of the system [32].

At its simplest, the RAG process follows three steps: indexing, retrieval, and generation, as explained by Gao et al. [31].

1. Indexing extracts data from various formats, such as PDF, HTML, and Markdown, standardising it into plain text and segmenting it into smaller units. These segments are then encoded into vector representations using an embedding model and stored in a vector database, enabling efficient similarity searches [31].

2. Retrieval identifies and retrieves relevant information based on a query. The system encodes the query into a vector and compares it to stored document vectors, selecting the most relevant results. These retrieved segments expand the LLM's knowledge beyond its pre-trained dataset [31].

3. Generation synthesises a response using the retrieved context. The LLM processes the query and retrieved document segments to generate a factually grounded and contextually relevant response, integrating both external data and its pre-trained knowledge as needed [31].
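To make the three steps concrete, the sketch below strings them together in Python. It is a minimal illustration only: the bag-of-words `embed` function stands in for a real embedding model and vector database, and `call_llm` is a hypothetical placeholder for an LLM API call; neither is part of the thesis artifact. The prompt also applies two of the OpenAI strategies listed above (delimiters and specific instructions).

```python
import math
from collections import Counter

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM API call.
    return f"[LLM response to a prompt of {len(prompt)} characters]"

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use an embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# 1. Indexing: segment source documents and store their vector representations.
documents = [
    "Employees are entitled to 30 vacation days per year.",
    "Parental leave requests must be submitted 60 days in advance.",
]
index = [(doc, embed(doc)) for doc in documents]

def retrieve(query: str, k: int = 1) -> list[str]:
    # 2. Retrieval: rank stored segments by similarity to the query vector.
    ranked = sorted(index, key=lambda item: cosine(embed(query), item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def answer(query: str) -> str:
    # 3. Generation: ground the answer in the retrieved context.
    context = "\n".join(retrieve(query))
    prompt = (
        "Answer the question using ONLY the context between the ### delimiters.\n"
        f"###\n{context}\n###\n"
        f"Question: {query}"
    )
    return call_llm(prompt)

print(answer("How many vacation days do I get?"))
```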
2.4 Autonomous agents

Agents have been studied extensively within the AI community since long before the emergence of LLMs. Agents are defined as software systems that may exhibit characteristics including: autonomy, meaning they can operate without direct human intervention; social ability, enabling interaction with other agents; reactivity, allowing them to respond to environmental changes; and pro-activeness, giving them the ability to take initiative [33]. Additionally, AI agents are implemented using concepts typically associated with humans, such as knowledge, emotion, and intention.

The introduction of LLMs has positively impacted the development of autonomous agents by leveraging natural language capabilities [5]. Modern LLM-based agents integrate advanced features such as personalised profiles, memory retention, external tool usage, and advanced planning [5]. These agents can adopt specialised roles and collaborate with one another, enhancing their collective problem-solving capabilities. This collaboration enables multi-agent systems.

2.5 LLM orchestration

LLM orchestration refers to the process of coordinating multiple LLMs, for instance in the form of agents, to accomplish specific tasks. This involves managing activities such as linking prompts, handling API calls, retrieving data, and maintaining state across interactions. LLM orchestration is often done using an orchestration framework, which provides the structure and tools needed to effectively manage these tasks. These frameworks simplify the development process by offering standardised components and workflows, allowing developers to focus on the higher-level logic of their applications rather than the low-level details of coordinating different models [34].

2.5.1 LangChain & LangGraph

LangChain is a framework for developing applications based on LLMs. It provides an interface for interacting with LLMs in simple linear workflows, while also offering standardised components for AI application functionalities such as model interactions, retrieval mechanisms, and integrations with various data sources [35].

LangGraph is an orchestration framework designed for creating multi-agent systems [36]. While it integrates well with LangChain, and is created by the same company, it can also be used independently. Unlike LangChain's sequential workflow approach, LangGraph enables conditional workflows using directed graphs. It supports key features such as looping, conditional branching, and state management, allowing agents to dynamically adjust their behaviour based on evolving tasks.
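The sketch below shows how such a conditional workflow can be expressed with LangGraph, loosely echoing the two-component routing used later in the artifact (chapter 6). It is a sketch under assumptions, not the artifact's implementation: the node bodies are placeholders of our own (a real router would classify the question with an LLM rather than a keyword check), and the API shown (`StateGraph`, `add_conditional_edges`, `compile`) may differ between LangGraph versions.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class ChatState(TypedDict):
    question: str
    category: str
    answer: str

def classify(state: ChatState) -> dict:
    # Placeholder router; a real system would ask an LLM to classify the question.
    q = state["question"].lower()
    return {"category": "employment" if "salary" in q else "guidelines"}

def guidelines_agent(state: ChatState) -> dict:
    # Would perform RAG over HR guideline documents.
    return {"answer": f"Guideline answer to: {state['question']}"}

def employment_agent(state: ChatState) -> dict:
    # Would query structured employment data.
    return {"answer": f"Employment answer to: {state['question']}"}

graph = StateGraph(ChatState)
graph.add_node("classify", classify)
graph.add_node("guidelines", guidelines_agent)
graph.add_node("employment", employment_agent)
graph.add_edge(START, "classify")
# Conditional branching: route on the category written by the classifier node.
graph.add_conditional_edges(
    "classify",
    lambda state: state["category"],
    {"guidelines": "guidelines", "employment": "employment"},
)
graph.add_edge("guidelines", END)
graph.add_edge("employment", END)

app = graph.compile()
print(app.invoke({"question": "How many vacation days do I get?"})["answer"])
```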
2.6 Guardrails

The non-deterministic, black-box nature of LLMs introduces several risks. Bias in training data can, for example, lead to outputs that reflect societal prejudices. Another challenge is inconsistency: an LLM may produce different answers to the same prompt, which can be particularly problematic in applications requiring reliability, such as question-answering systems. This unpredictability can erode user trust and undermine confidence in LLM-based applications [37, 38].

To address these issues, the concept of guardrails has been introduced. Guardrails are mechanisms designed to monitor and filter the inputs and outputs of LLMs, helping to mitigate potential risks [38]. They analyse input prompts and generated responses to determine whether intervention is required to prevent harmful, biased, or incorrect outputs. Guardrails serve as a protective layer within LLM-based systems, reducing the likelihood of exposing sensitive data and limiting the sharing of misleading or inappropriate content [38].

Although guardrails enhance security and reliability, they do not necessarily improve robustness against hostile attacks. Research by Shen et al. [39] indicates that guardrails provide only limited resistance to jailbreak attacks, which are prompt manipulations designed to bypass safeguards and elicit harmful content. Their study found that while guardrails marginally reduce the success rate of such attacks, they do not fully prevent them. This highlights the ongoing need for further advancements in LLM safety mechanisms, even in systems that incorporate guardrails.
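A minimal rule-based sketch of the input/output filtering idea is shown below. The patterns and the `call_llm` helper are illustrative assumptions of our own; production guardrails typically combine such rules with classifier models or dedicated guardrail frameworks rather than relying on regular expressions alone.

```python
import re

# Assumed example patterns; not an exhaustive or recommended rule set.
BLOCKED_INPUT_PATTERNS = [
    r"ignore (all|previous) instructions",   # a simple prompt-injection tell
    r"reveal your system prompt",
]
SENSITIVE_OUTPUT_PATTERNS = [
    r"\b\d{6}[-+]?\d{4}\b",                  # e.g. Swedish personal identity numbers
]

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for the underlying model call.
    return f"[LLM response to: {prompt}]"

def guarded_call(prompt: str) -> str:
    # Input guardrail: intervene before the prompt reaches the model.
    if any(re.search(p, prompt, re.IGNORECASE) for p in BLOCKED_INPUT_PATTERNS):
        return "Your request could not be processed."
    response = call_llm(prompt)
    # Output guardrail: intervene before the response reaches the user.
    if any(re.search(p, response) for p in SENSITIVE_OUTPUT_PATTERNS):
        return "The response was withheld because it may contain sensitive data."
    return response

print(guarded_call("How many vacation days do I get?"))
```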
2.7 LLM-as-a-judge

Coined by Zheng et al. [40], the term LLM-as-a-judge refers to using LLMs as evaluators for tasks that typically require human judgment, such as assessing the quality of chatbot responses in open-ended dialogue. This approach addresses a key limitation of traditional benchmarks, which often fail to capture how well models align with human preferences. By contrast, LLM-based judges can offer a scalable and efficient alternative to human evaluation.

To test the viability of this approach, Zheng et al. [40] developed two benchmarks. Their findings show that the most commonly used LLM at the time, GPT-4, when used as a judge, agrees with human preferences over 80% of the time, comparable to the agreement rate between human annotators themselves.

While promising, the study also highlights several limitations, including susceptibility to biases (e.g. favouring the first-listed response or more verbose answers) and occasional failures in evaluating complex tasks requiring precise reasoning. Despite these issues, the results suggest that, when carefully applied, LLM-as-a-judge can serve as a practical and surprisingly reliable proxy for human evaluation in many settings.

2.7.1 DeepEval

DeepEval [41] is an open-source evaluation framework designed to assess the performance of LLM-based systems. By leveraging LLM-as-a-judge, DeepEval supports a variety of evaluation tasks across different types of LLM applications, including, but not limited to, RAG systems.

Among the evaluation metrics it offers for RAG scenarios are faithfulness, answer relevancy, and contextual relevancy. Originally introduced in the RAGAS framework by Es et al. [42], these metrics are defined as follows:

• Faithfulness measures how accurately the generated answer reflects the retrieved context, aiding in identifying hallucinations.
• Answer relevancy evaluates the degree to which the generated response directly addresses the user's question. The metric does not take factuality into account but instead focuses on completeness and focus, penalising responses that are irrelevant, incomplete, or verbose.
• Contextual relevancy assesses how relevant the retrieved context used to generate the answer is to the input question. The context should be focused and contain as little irrelevant information as possible.

DeepEval also provides the capability to create custom evaluation metrics through the use of G-Eval [43]. G-Eval is a framework that enables the evaluation of outputs based on user-defined criteria. For instance, it can be employed to assess the correctness of a given output. This is achieved by specifying both the evaluation criteria and the corresponding evaluation steps. An example of criteria and evaluation steps for a custom correctness metric is given below.

• Criteria: Determine whether the actual output is factually correct based on the expected output.
• Evaluation steps:
  – Check whether the facts in actual output contradict any facts in expected output.
  – Heavily penalise omission of detail.
  – Vague language, or contradicting opinions, are acceptable.

This approach enables the creation of metrics that are not predefined in the DeepEval framework, offering greater versatility when evaluating the outputs of the LLM [43].
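In code, the correctness metric above might look roughly as follows. This is a sketch against DeepEval's public API as we understand it (class and parameter names such as `GEval`, `LLMTestCase`, and `evaluation_params` may vary between versions), and running it requires a configured judge model, by default an OpenAI API key. The test-case contents here are invented for illustration.

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Custom correctness metric built with G-Eval, mirroring the steps above
# (a `criteria` string can be supplied instead of explicit steps).
correctness = GEval(
    name="Correctness",
    evaluation_steps=[
        "Check whether the facts in 'actual output' contradict any facts in 'expected output'.",
        "Heavily penalise omission of detail.",
        "Vague language, or contradicting opinions, are acceptable.",
    ],
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)

# A predefined RAG metric can be evaluated alongside the custom one.
relevancy = AnswerRelevancyMetric(threshold=0.7)

test_case = LLMTestCase(
    input="How many vacation days do I get?",
    actual_output="You are entitled to 30 vacation days per year.",
    expected_output="Employees receive 30 vacation days per year.",
    retrieval_context=["Employees are entitled to 30 vacation days per year."],
)

evaluate(test_cases=[test_case], metrics=[correctness, relevancy])
```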
3 Related Work

This chapter reviews existing research relevant to the thesis. It begins with an examination of literature on trust in LLM-based systems, followed by a presentation of key challenges in multi-agent systems. The chapter then explores research on collaboration within such systems and concludes with an overview of two approaches aimed at enhancing RAG.

3.1 Trust in LLMs

Trust in AI has been studied since long before the rise of LLMs, as evidenced by an empirical research review by Ella and Wooley [44]. However, as LLMs become more widely used, it is important to understand the key factors that influence trust in these systems. Liu et al. [4] and Huang et al. [45] conducted extensive literature reviews and developed taxonomies of trust factors while designing benchmarks to evaluate LLMs. Although their work focuses on assessing the models themselves, and not complete systems incorporating them, the same trust factors remain relevant, as they ultimately relate to how users perceive and trust LLM-generated content.

Liu et al. [4] categorise trust into several key areas, including reliability, safety, and explainability & reasoning. They state that reliability refers to the accuracy and consistency of outputs while minimising errors. Safety involves protecting sensitive information, while explainability & reasoning focuses on how well a system can justify its responses and provide clear explanations.

Huang et al. [45] propose a similar framework with some differences in classification. Their taxonomy includes truthfulness, which emphasises providing correct information; privacy, which is treated as a separate category rather than a subset of safety; and transparency, which relates to how openly a system communicates how it generates its outputs.

Schwartz et al. [46] add to this by identifying key factors that enhance trust in LLM-based systems: reliability, which they define as consistently delivering high-quality, accurate results; openness, ensuring transparency regarding system capabilities, limitations, and reliability; task characteristics, adapting responses based on task type and complexity; and trust trajectory, recognising the importance of first impressions while providing opportunities to rebuild trust through subsequent accurate outputs.

3.2 Challenges with multi-agent LLM-based systems

Han et al. [12] emphasise challenges with multi-agent LLM-based systems that remain inadequately addressed in the literature. The paper summarises these challenges into four categories, as follows:

• Optimising task allocation to leverage agents' unique skills and specialisations.
• Fostering robust reasoning through iterative debates or discussions among a subset of agents to enhance intermediate results.
• Managing complex and layered context information, such as context for overall tasks, for single agents, and for common knowledge shared between agents, while ensuring alignment with the general objective.
• Managing various types of memory that serve different objectives, consistent with the interactions in a multi-agent system.

3.3 Collaboration in multi-agent systems

There are numerous ways to facilitate collaboration among agents in a multi-agent system. To summarise the various approaches, Tran et al. [47] conducted a survey and proposed a framework for LLM-based multi-agent systems. In doing so, they identified three primary categories of multi-agent collaboration in the literature: collaboration types, collaboration strategies, and communication structures.

Tran et al. [47] classify collaboration types into three subcategories:

• Cooperation, where agents align their efforts towards a shared goal. Advantages of cooperation include the ability to assign sub-tasks based on individual agent strengths and its relatively straightforward design and execution, provided the goals are clear. However, misaligned goals may lead to inefficiencies, and failures in one agent can significantly impact the entire multi-agent structure. Example scenarios for cooperative collaboration include code generation, decision-making, game environments, question answering, and recommendations.

• Competition, where agents prioritise their own objectives, even if they conflict with those of other agents. This type of collaboration encourages agents to enhance their performance and promotes adaptive strategies. However, it is crucial to have a conflict resolution mechanism to ensure competition remains beneficial to the system as a whole. Example scenarios where competition may be advantageous include debate, game environments, and question answering.

• Coopetition, a hybrid of competition and cooperation, in which agents collaborate on some tasks while competing on others. This enables the system to balance trade-offs and reach mutual agreements. However, as an under-explored area, its effectiveness and ideal applications remain uncertain. Tran et al. [47] cite negotiation, such as in policymaking systems, as the primary example scenario for coopetition.

Tran et al. [47] identify three distinct collaboration strategies for multi-agent systems:

• Rule-based, where predefined rules strictly govern agent interactions. This ensures efficiency, high predictability, consistency, and fairness. However, it also results in low adaptability to uncertainty and scalability challenges for complex tasks. Rule-based strategies are best suited to applications such as question answering, consensus-seeking, navigation, or peer-review processes.

• Role-based, where each agent assumes a predefined role and operates on segmented objectives based on its domain knowledge to support the system's overarching goals. This strategy enhances modularity and reusability while leveraging agents' specialised expertise. However, poorly defined roles can lead to rigidity, disputes, or functional deficiencies. Role-based strategies are particularly applicable to simulations of real-world environments with well-defined jobs, such as decision-making or software development.

• Model-based, where agents perform probabilistic decision-making based on input (with uncertainty in perception potentially impacting agent actions), environmental factors, and shared goals.
This probabilistic approach allows adaptability to dynamic environments and robustness to uncertainties. However, it is complex to implement and computationally expensive. Due to its adaptability, this strategy is well suited to dynamic contexts such as game environments or robotics.

Tran et al. [47] categorise communication structures into three main types:

• Centralised structure, where each agent connects to a central agent responsible for all collaboration decisions. This structure is easy to design and implement and is efficient for resource allocation. However, its reliance on a single central node creates a single point of failure, making it less resilient to disruptions. According to Tran et al. [47], centralised structures are suitable for question answering and decision-making scenarios.

• Decentralised structure, where control and decision-making are distributed among agents that operate on local information. This structure enhances resilience, as the system can continue functioning even if individual agents fail, and it is highly scalable. However, it may suffer from inefficient resource allocation and significant communication overhead. A decentralised structure is applicable to decision-making, question answering, reasoning, and code generation.

• Hierarchical structure, where agents are organised in layers, with communication primarily occurring between adjacent layers. Each layer has distinct functions, roles, and levels of authority. This structure reduces bottlenecks and facilitates task distribution among layers. However, it is highly complex, leading to increased latency and implementation challenges. Hierarchical structures are used in scenarios such as code generation, question answering, and reasoning.

Additionally, Tran et al. [47] discuss coordination and orchestration architectures, which extend beyond individual collaboration channels to manage the relationships and interactions between multiple channels. These architectures define how collaboration channels are created, ordered, and characterised. Tran et al. [47] identify two major types:

• Static architectures, which rely on predefined rules and domain expertise to establish collaboration channels. By leveraging prior knowledge, these architectures ensure interactions adhere to domain-specific requirements while improving overall system efficiency and maintaining consistent task execution. However, their dependence on accurate domain knowledge and their fixed nature result in limited scalability and flexibility.

• Dynamic architectures, which adapt to changing environments and task requirements by employing management agents or other adaptive mechanisms to assign roles and define collaboration channels in real time. While suitable for complex and evolving tasks, dynamic architectures require higher resource allocation due to real-time adjustments and present a greater risk of failure due to their fluid nature.

3.4 Multi-agent retrieval-augmented generation filtering

As previously stated, RAG has become a key technique for improving the accuracy and reliability of LLM-generated responses by incorporating external knowledge retrieval. One approach, Self-RAG, introduced by Asai et al. [48], enhances factual accuracy by allowing the model to decide when to retrieve additional information and critically assess its own outputs. This method helps improve citation accuracy and reduces the inclusion of irrelevant or misleading information.
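The control flow of this retrieve-on-demand, self-critique pattern can be sketched as below. This illustrates the general idea only, not Asai et al.'s implementation (which relies on special reflection tokens learned during training); `call_llm` and `retrieve` are the hypothetical helpers from the earlier sketches.

```python
def self_reflective_answer(query: str) -> str:
    # 1. Let the model decide whether retrieval is needed at all.
    need = call_llm("Does answering this require looking up documents? "
                    f"Reply yes or no.\nQuestion: {query}")
    if not need.strip().lower().startswith("yes"):
        return call_llm(query)

    docs = "\n".join(retrieve(query))
    draft = call_llm("Answer the question using these documents.\n"
                     f"Documents:\n{docs}\nQuestion: {query}")

    # 2. Let the model critique its own draft against the retrieved evidence.
    verdict = call_llm("Is every claim in this answer supported by the documents? "
                       f"Reply yes or no.\nAnswer: {draft}\nDocuments:\n{docs}")
    if verdict.strip().lower().startswith("no"):
        draft = call_llm("Rewrite the answer so that every claim is supported.\n"
                         f"Answer: {draft}\nDocuments:\n{docs}")
    return draft
```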
A more recent development is MAIN-RAG, proposed by Chang et al. [49], which takes a multi-agent approach to further refine the retrieval process. Their paper shows that it outperforms Self-RAG across a number of datasets. MAIN-RAG is a training-free framework that introduces three specialised agents. A Predictor retrieves documents and generates an initial answer based on each document. The Predictor then sends the documents to a Judge agent, which evaluates whether the documents provide information relevant to the query and answer, and scores and orders them accordingly; any document deemed irrelevant is filtered out at this step. Finally, the documents are sent to a Final-Predictor agent, which generates the final response based on the sources provided by the Judge.
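The sketch below captures this three-agent division of labour. It is a loose approximation for illustration only, again reusing the hypothetical `call_llm` and `retrieve` helpers from the earlier sketches; the actual MAIN-RAG prompts differ, and its adaptive score threshold is replaced here by an assumed fixed cut-off of 0.5.

```python
def predictor(query: str) -> list[tuple[str, str]]:
    # Predictor: retrieve documents and draft one answer per document.
    docs = retrieve(query, k=5)
    return [(doc, call_llm("Answer using only this document.\n"
                           f"Document: {doc}\nQuestion: {query}")) for doc in docs]

def judge(query: str, candidates: list[tuple[str, str]]) -> list[str]:
    # Judge: score each document's support for the query/answer pair,
    # filter out irrelevant documents, and order the rest by score.
    scored = []
    for doc, draft in candidates:
        reply = call_llm("Rate from 0 to 1 how well the document supports the answer. "
                         "Reply with a number only.\n"
                         f"Question: {query}\nDocument: {doc}\nAnswer: {draft}")
        score = float(reply)  # assumes the judge complies with the format
        if score >= 0.5:      # fixed cut-off; MAIN-RAG adapts its threshold
            scored.append((score, doc))
    return [doc for _, doc in sorted(scored, reverse=True)]

def final_predictor(query: str) -> str:
    # Final-Predictor: answer from the filtered, ordered documents.
    kept = judge(query, predictor(query))
    return call_llm("Answer the question using these documents.\n"
                    "Documents:\n" + "\n".join(kept) + f"\nQuestion: {query}")
```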
4 Method

This thesis was conducted as design science research (DSR), mainly following the methodology of Wieringa [50] and the guidelines for applying DSR in the context of a master's thesis presented by Knauss [51]. DSR focuses on the creation of a design artifact to solve a concrete problem while also gathering data about knowledge questions. In this thesis, the artifact is a multi-agent LLM-based HR chatbot with two components: one for answering questions regarding employment data, and one for answering questions based on HR guideline documents. The chatbot is designed with the goal of understanding which design choices can address trust factors and in turn foster trust in such a system.

4.1 Design science research

Wieringa [50] describes design science research as an iterative problem-solving methodology structured around the "regulative cycle", illustrated in figure 4.1, which comprises five phases: problem investigation, solution design, design validation, solution implementation, and implementation evaluation.

[Figure 4.1: The regulative cycle of design science research.]

Knauss [51] groups these phases into three broader categories:

• Problem: Includes problem investigation, where the research problem is explored and analysed.
• Solution: Covers solution design and design validation, focusing on developing and validating possible solutions.
• Evaluation: Encompasses evaluation, where the effectiveness and usability of the proposed solution are assessed.

Additionally, the implementation phase represents the artifact. Throughout the iterative cycles, the artifact undergoes incremental work, continuously evolving based on the insights gathered during the other phases. This thesis was conducted through two cycles, described further in chapter 5 and chapter 6 respectively.

In alignment with Knauss's guideline 3 [51], the research questions in this thesis are formulated to correspond to the three main categories: RQ1 addresses problem understanding, RQ2 focuses on potential solutions, and RQ3 is connected to the evaluation of the proposed solution.

4.1.1 Problem investigation

The purpose of the problem investigation phase is to gather information in order to understand the given problem, as well as to describe and explain it. Wieringa [50] presents four non-exclusive reasons for investigating the problem:

• Problem-driven investigation, where there is a concrete problem that needs to be understood before trying to solve it.
• Goal-driven investigation, where the investigation is motivated not necessarily by a problem but by some ambition to achieve change.
• Solution-driven investigation, where a technology's potential to improve or solve a problem is analysed.
• Impact-driven investigation, also called evaluation research, where the focus is on evaluating the impact of past actions instead of preparing for future solutions.

In this thesis, two main problem investigation approaches were employed. Problem-driven investigation was primarily used to address RQ1, which focuses on understanding the issue of trust factors in an LLM-based system. In contrast, solution-driven investigation was mainly applied to RQ2, which explores potential solutions to address the trust factors identified in RQ1.

4.1.2 Solution design

The solution design phase, as described by Wieringa [50], involves formulating possible solutions to the identified problem. These designs, which he refers to as solution suggestions, serve as just that, suggestions, rather than definitive answers, as they have not yet been validated or implemented. Solution designs can take various forms, including natural language descriptions, sketches, blueprints, mathematical models, or prototypes.

Wieringa highlights that solution design is not a fixed plan from the beginning. Rather, it is a process that involves uncertainty, with the proposed solution developing further as it is evaluated and tested. A solution suggestion does not describe an existing reality, explain past events, or predict future outcomes. Instead, it outlines a possible course of action that helps stakeholders move from uncertainty ("we are uncertain about what to do") to confidence ("we are sufficiently certain about what to do").

4.1.3 Design validation

During the design validation phase, the design is investigated with the purpose of understanding whether it would indeed bring stakeholders closer to their goals. Wieringa [50] states that there are three important knowledge questions that need to be answered in this phase:

• Internal validity: If the design were to be implemented, would it satisfy the criteria identified in the problem investigation?
• Trade-offs: How do different designs compare to each other if implemented in this context?
• External validity: Does the design, if implemented in another context, satisfy the criteria?

The solution design and design validation in this thesis were primarily conducted through literature review, complemented by a two-day workshop with AI experts from the collaborating company. The setup and findings of this workshop are presented in detail in chapter 5.

4.1.4 Implementation

As stated by Wieringa [50], the implementation phase in DSR depends on the nature of the designed solution. If the goal of the research was to develop a method, framework, or process to address a practical problem, then implementation involves executing this process in a real-world setting. However, if the research focused on testing the viability of a proposed solution, implementation consists of conducting the planned evaluations or experiments.

The final implementation in this thesis resulted in a multi-agent LLM-based HR chatbot, the final artifact. This artifact is presented in detail in section 6.1.

4.1.5 Evaluation

As outlined by Hevner et al. [52], evaluation constitutes a fundamental component of the research process, ensuring the effective integration of the artifact within the technical infrastructure.
A rigorous evaluation requires the establishment of appropriate metrics to accurately assess the quality of the implementation. As Hevner et al. [52] emphasise, evaluation plays a critical role in the iterative research process, facilitating the identification of deficiencies and informing necessary improvements for subsequent development cycles.

Knauss [51] recommends adhering to Hevner et al.'s [52] established evaluation methodologies to align this phase with RQ3. These methodologies include observational, analytical, experimental, testing, and descriptive evaluations.

The final artifact developed in this thesis was evaluated using both quantitative and qualitative methods. The quantitative evaluation involved an experimental simulation, where the artifact was executed with artificial data, described in detail in section 6.2. In addition, a qualitative evaluation was conducted through interviews with potential users, as outlined in section 6.3.

4.2 Overview of cycles

The two completed iterations of the regulative cycle in this thesis are visualised in figure 4.2 and detailed in the coming chapters. Cycle I primarily focused on problem investigation and preliminary design activities, including a series of interviews and a collaborative workshop with domain experts at the partner company. Cycle II primarily focused on finalising the artifact and conducting both quantitative and qualitative evaluations.

[Figure 4.2: Activities performed during the two cycles in this thesis.]

Table 4.1 presents an overview of the qualitative data collection activities, detailing the number of participants and the total time spent on each activity. Table 4.2 summarises the number of evaluation runs conducted for each metric during the quantitative data collection phase in Cycle II, as described in section 6.2.

The next two chapters describe each research cycle in detail. The chapter on Cycle I begins by outlining the methodology used during this cycle, followed by the key findings. In contrast, the chapter on Cycle II opens with a presentation of the final artifact, which serves as a reference point for the evaluation approach and results that follow.

The structure of presenting the research cycles sequentially, detailing the method and findings of Cycle I followed by those of Cycle II, was chosen to enhance readability and comprehension. Since the research questions build upon one another, understanding the findings for RQ1 is essential for interpreting the final artifact, methodology, and results of Cycle II. As such, the findings for RQ1 are presented in full within the Cycle I chapter, even though the findings were fully finalised during Cycle II.

Table 4.1: Participant counts for qualitative data collection activities.

Activity                            No. of participants    No. of hours
Problem investigation interviews                      6               6
Workshop                                              5              16
Follow-up evaluation interviews                       5            3.75
Total                                                14           25.75

Table 4.2: Number of evaluation runs per quantitative metric.

Metric                               No. of evaluation runs
Answer relevancy                                        360
Faithfulness                                            360
Contextual relevancy                                    360
Robustness (answer relevancy)                           500
Robustness (faithfulness)                               500
Robustness (contextual relevancy)                       500
Correctness                                             280
Total                                                  2860

5 Cycle I

The first cycle focused primarily on understanding the problem space regarding trust in LLM-based systems and exploring potential solutions. As such, it placed greater emphasis on the first three phases of the regulative cycle, problem investigation, solution design, and design validation, aligning closely with RQ1 and RQ2.
The problem investigation followed a problem-driven approach, aiming to identify trust factors associated with an LLM-based HR chatbot. To achieve this, interviews were conducted with potential users, followed by a thematic analysis of the results to extract key insights into the factors that impact their trust. The thematic analysis was done separately by each author, and the results were then merged to mitigate bias.

Additionally, the investigation extended to exploring design choices and components of multi-agent chatbot systems, primarily through a review of existing research. These potential design solutions were then explored and validated through participation in a two-day workshop with experts in building LLM-based systems. The workshop facilitated discussions on various design strategies, and feedback from experts during the workshop served as an initial form of validation for these design choices.

Although the primary focus of this cycle was on problem investigation and solution exploration, a preliminary implementation was undertaken to test basic functionality. The purpose of this early implementation was to explore high-level considerations, such as which frameworks to use, how an agentic RAG system functions, and which LLMs are compatible and can be effectively utilised. The evaluation of this early prototype was basic and relied on human judgment by the authors, supported by insights from the literature review on what appears to be most suitable for an HR chatbot in practice.

This chapter outlines the methodology and findings from Cycle I of the thesis. It begins by presenting the qualitative data collection methods, including the approach used for the problem investigation interviews, the subsequent thematic analysis, and the setup of the expert workshop. The second part of the chapter focuses on the findings from this cycle, starting with the trust factors identified through the interviews and concluding with key insights from the workshop discussions.

5.1 Method - Qualitative data collection

To gain a deeper understanding of trust in LLM-based systems and to explore different solution designs, Cycle I employed two qualitative data collection methods: interviews conducted as part of the problem investigation, and a workshop with experts focused on the design and implementation of LLM-based systems.

5.1.1 Interviews

To obtain qualitative data about trust in our initial problem investigation, six interviews were conducted with employees at the partner company. These interviews were designed based on the guidelines provided by McNamara [53] and Patton [54]. The format used was the standardised, open-ended interview, where all interviewees were asked the same open-ended questions and could respond freely in their own words. In some instances, follow-up questions were posed to encourage further elaboration from the interviewees.

This interview format was chosen because, as McNamara states, it "facilitates faster interviews that can be more easily analysed and compared" [53]. As noted, the questions were constructed in accordance with McNamara's guidelines [53], which emphasise important principles such as question neutrality, the use of open-ended wording, and smooth transitions between major topics.

For the sampling method, snowball sampling, also called chain sampling [54], was used.
In this method, an industry supervisor with extensive knowledge of who would be information-rich key informants was tasked with reaching out to and recruiting such participants. The interviewees had varying levels of AI knowledge and roles within the company, ensuring diverse perspectives.

5.1.1.1 Problem investigation interview setup

The interviews were conducted with each of the six employees as part of the problem investigation. The questions touched on four overarching subjects:
• Background and demographic information,
• Knowledge and experience with AI & LLMs,
• Attitudes, opinions, and trust in AI,
• HR system specific questions.

All interviews were conducted remotely and lasted approximately 60 minutes. With the participants' consent, the interviews were recorded and automatically transcribed. The transcripts were then reviewed and corrected in phase one of the thematic analysis. The interview guide used for this round of interviews can be found in appendix A.1.

5.1.1.2 Thematic analysis

To analyse the interviews, we employed thematic analysis, following the guidelines established by Braun and Clarke [55], who define thematic analysis as "a method for identifying, analysing and reporting patterns (themes) within data." Braun and Clarke [55] outline a structured approach that consists of five key phases, each of which is detailed below. Importantly, they note that thematic analysis is not strictly a linear process but rather a recursive one, where movement between phases is necessary to refine and develop themes.

1. Familiarising yourself with your data: This phase involves immersing oneself in the collected data through repeated active reading. Initial notes and potential codes should be marked for later phases. Transcription plays a key role in deepening familiarity, and if transcription has been conducted by others or through automated tools, additional time should be dedicated to engaging with the material.

2. Generating initial codes: Following familiarisation, the data should be systematically coded to identify meaningful features of interest. Equal attention must be given to all data items, including those that challenge dominant narratives. Coding should be as comprehensive as possible within the available timeframe, preserving surrounding context and allowing for multiple codes per extract.

3. Searching for themes: This phase focuses on organising codes into broader themes by clustering related codes and exploring their interconnections. Visual tools such as tables or mind maps are recommended to facilitate the conceptual organisation. At this stage, all codes and potential themes should be retained for further consideration.

4. Reviewing themes: Candidate themes are reviewed and refined to ensure they accurately reflect patterns in the data. This process involves two levels: first, evaluating coherence within each theme's coded extracts; second, assessing the thematic structure in relation to the full dataset. Additional coding may be required if new relevant data is identified. By the end of this phase, key themes and their relationships should be clearly defined.

5. Defining and naming themes: With a satisfactory thematic map in place, themes are further refined and clearly defined. Each theme's core meaning should be articulated, ensuring alignment with the data and avoiding excessive breadth or overlap. Sub-themes may be identified to capture nested or hierarchical relationships within the data.
5.1.2 Workshop

The data collection process during Cycle I also included a workshop, conducted at the collaborating company over two days. The workshop focused on discussions around the design and implementation of LLM-based systems within three organisational domains: Sales, HR, and Cybersecurity. The participants consisted of AI experts in each respective area, totalling seven participants including the authors.

The workshop mainly explored two use cases for LLM-based solutions: the first was an LLM-based solution for managing large volumes of internal documentation, and the second an LLM-based chatbot designed to respond to employee queries, specifically those related to employment matters, HR guidelines, and organisational policies.

The workshop served both as a means of collecting empirical data on how domain experts approach the practical implementation of AI solutions within an organisational setting and as an evaluation of the feasibility of previously studied approaches in a real-world context. Furthermore, the workshop provided insights into the trade-offs associated with different design strategies and explored how a scalable solution could be developed to support future applications across other areas of the organisation.

5.2 Findings - Cycle I

This section presents the findings from the problem investigation interviews regarding RQ1 and the findings from the workshop related to RQ2. It begins with the results of the thematic analysis, which answer RQ1 by identifying key trust factors in LLM-based systems. It then focuses on RQ2 by presenting insights from the workshop, which informed the design of the artifact in Cycle II.

5.2.1 Trust factors in LLM-based systems (RQ1)

After an extensive thematic analysis of the conducted interviews, we identified five main themes. Two of these themes are directly related to trust factors and thus address RQ1. The other three themes are more concerned with attitudes, thoughts, and concerns surrounding LLM-based systems than with trust factors directly. These three themes (AI as a helping hand; Concerns - Human interactions could be replaced; and Critical thinking - Output should be challenged and revised) are considered outside the scope of this thesis, as they do not directly correspond to trust factors in an LLM-based system.

In the following two sections, we describe in detail the remaining two main themes that were identified: External trust factors — Trust impacted by non-technical forces and Internal trust factors — Trust impacted by technical details.

Figure 5.1: Identified trust factors in LLM-based systems.

5.2.1.1 External trust factors — Trust impacted by non-technical forces

This theme addresses factors that are not directly influenced by the system design, but instead concern areas outside the actual artifact. As seen in figure 5.1, these factors include transparency, organisational measures (such as education on LLMs and chatbots as well as change management), and external security, which refers to security concerns that cannot be addressed within the system design.

Transparency

Transparency encompasses both transparency in the development process and transparency about the system's limitations.

One interviewee stated that transparency in how an LLM-based system is built provides users with insight into its foundations, which in turn increases trust.
Further, they said that knowing which model is being used, what approach was taken during development, and what data is being used would increase their trust in the system. Another interviewee also stressed that it is important to understand how the data you provide to an LLM-based system is stored and used.

Transparency about an LLM-based system's limitations can also help users set realistic expectations, something brought up by three out of the six interviewees. If something is presented as always correct but proves otherwise, it may lose users' trust. However, being upfront about potential inaccuracies can foster understanding and make users more forgiving. As one interviewee shared:

"That's the thing that I'm saying, is that I don't trust 100%, but I still think that we can implement things acknowledging that probably 100% is impossible, but we can be close to 100% on the output. So we can gain trust and confidence of all the people that are gonna use it." (Interviewee 3)

Change management

Another factor that emerged from the data was the importance of actual system usage in building trust, as well as the company's role in encouraging that usage. Four out of the six interviewees stated that using the system will allow it to prove itself, eliminating possible preconceptions. One interviewee reflected on their initial scepticism:

"Can you trust what it's telling you? Will it make mistakes? And I had those preconceptions, like a year and a half ago, when I first was like, you know, how is this possible to use this? But as I've used [an LLM-based tool] and seen it evolve, seen it improve, seen actually how it can benefit my work, I really see the opportunities." (Interviewee 5)

This suggests that increased use of the system, along with witnessing its evolution, can serve as a catalyst for trust development. To further encourage this usage, organisations may need to actively guide employees toward adopting LLM-based tools such as an HR chatbot, giving these tools the opportunity to demonstrate their value. Five out of the six interviewees highlighted the importance of effective change management in increasing system adoption and engagement. This included both changing user behaviour and enabling easy access to the tools.

When discussing the integration of an HR chatbot, two interviewees stated that each employee currently has an assigned local HR business partner who is readily accessible. As a result, there would be little motivation to use such a chatbot, even if it provides equivalent support. Both interviewees stated that making direct contact with HR operations less convenient could promote the use of the chatbot. Crucially, they emphasised that the chatbot must offer clear and tangible value to employees. If it is to be adopted, it should convincingly demonstrate that it is a more efficient or beneficial alternative to traditional HR contact methods.

Education

To enable this change management, education emerges as a key organisational tool to help employees understand the value of the system, thereby increasing trust and fostering adoption. Education was discussed by all interviewees, with one stating that the organisation needs to make it possible for employees to get an introduction to how to use a chatbot if one is implemented.
Another interviewee discussed the benefit of education in improving their ability to optimise the use of LLM-based chatbots and get more value out of them:

"Yeah, I really think that I would benefit from educating myself more in optimising the usage [of LLM-based chatbots]." (Interviewee 4)

External security

One factor, brought up by four out of the six interviewees, was the potential impact of the system's origins or underlying models on trust. If the company behind the system or LLM is deemed untrustworthy, users' trust in the system itself can be compromised. Further, five out of six interviewees expressed caution regarding the information they input into LLM-based systems, especially when interacting with systems not hosted by their organisation. One interviewee noted:

"For instance, obviously I played around with Deepseek, and I knew that using Deepseek was basically sending information to China. I don't think China is way, way worse than the US, but still, it's like, OK, I'm sending to another country. That's why I use this Deepseek just to play around. It was just basically doing Q&A for bullshit stuff. So nothing [sensitive]—that's why I took that into consideration. The moment I'm able to actually get, get Deepseek working on a—let's say—open-source environment, or, let's say, download and install all the way into a server, I would probably use it in a different way, that's for sure." (Interviewee 3)

Finally, regarding LLM-based systems used and approved within the company, trust can be handed over to the IT department and their expertise, with one interviewee saying: "I have no real limitations as long as I know that these tools have been embedded by corporate IT from a security standpoint" (Interviewee 1)

In summary, external trust factors centre on non-technical influences. Participants emphasised the value of transparent communication about system development and limitations, as this sets realistic expectations and builds trust. Organisational efforts like education and guided adoption were seen as essential for encouraging usage and overcoming scepticism. Trust was also shaped by concerns about where data is sent and who controls the underlying technology, highlighting that trust is built not only on what the system does, but also on who is behind it and how it is introduced.

5.2.1.2 Internal trust factors — Trust impacted by technical details

The second major theme identified was internal trust factors. These factors stem from the LLM-based systems themselves: how they perform and behave. Thus, these are factors that may be impacted by system design. As illustrated in figure 5.1, the internal trust factors identified were internal security, risk of bias, model differences, and reliability.

Internal security

Internal security relates to the importance of protecting sensitive employee data, particularly in the context of an HR chatbot. A concern brought up by two out of six interviewees was the potential for unauthorised access, for example, if the system could be exploited to retrieve other employees' information. Another concern was the system's ability to comply with internal IT and legal frameworks. As one interviewee noted:

"So we have to be compliant with all the rules that exist. We have IT processes and legal processes that must be [followed]." (Interviewee 4)

Risk of bias

Another sub-theme that emerged was the risk of bias in LLMs and LLM-based tools.
One participant, for example, expressed concern about the potential impact on diversity when such tools are used in recruitment processes. They noted that the system might favour candidates with similar educational backgrounds or professional experiences, such as coming from the same types of companies, which could unintentionally limit diversity. This was viewed as a risk that could lead to less varied and inclusive hiring outcomes. Another participant expanded on biases, discussing how they may be embedded in the training data and reflected in outputs:

"And then also, like I mentioned before, the ethics around the information that people use from it, and how these models have been built, and who has built them, and the bias. And, you know, is information it gives representative of the wider population? ... And that does really concern me from a kind of diversity, equality point of view, because a lot of work has been done previously to promote different types of voices on different topics." (Interviewee 5)

Model differences

Five out of six participants also commented on perceived differences between various LLMs and LLM-based tools, which we have categorised as the theme model differences. Comparisons were for instance made between internal company tools and more widely available tools such as ChatGPT, with one interviewee stating that:

"Tools that we have in [Company] as of today—I mean, they are good. But I think they are not that good, obviously, like ChatGPT." (Interviewee 3)

Reliability

The factor of reliability refers to the system's ability to consistently produce accurate, high-quality responses. All participants emphasised reliability as a central trust factor when interacting with LLM-based systems. They highlighted that if the system frequently produces incorrect answers, trust is quickly diminished. When discussing what might deter them from using an LLM-based system, one interviewee stated:

"No, but recurring inaccuracies, I think. That would have made it so [I felt] 'But no, it's not worth the time. I'll have to look it up myself' or something like that. So yes, repeated inaccuracies would have caused my trust to decrease." (Interviewee 4)

Another participant echoed this sentiment, describing how even a single mistake in practical information could lead to reduced usage:

"I think it's the reliability of the information [that is important for usage]. Of course, if I use the chatbot and ask the chatbot how many remaining vacation days I have, I get an answer, and then the answer proves to be the wrong one. I might not use it again easily." (Interviewee 1)

These responses underline the importance of providing accurate responses from the outset. If a system fails to meet expectations early on, trust may be damaged and difficult to rebuild later. This concern was also reflected in discussions about implementation strategy. One interviewee suggested that a gradual rollout of an HR chatbot could help identify and resolve early issues before exposing it to a wider audience, thus avoiding the risk of discouraging users who may perceive the tool as unreliable.

"So it depends a bit [on] the purpose of it, but you can collect a lot of feedback by rolling out something real that you then improve as you go along, and then you roll it out a little wider, like, until you have something that really doesn't have a lot of teething problems and that can provide value. Because if it doesn't provide value, people won't use it." (Interviewee 6)
Another aspect of reliability that emerged was the importance of source citation in LLM-based systems. Two participants expressed greater trust in systems that provide sources for their outputs, as it allows users to verify the information and better understand where it comes from. When asked about their level of trust in LLM-based chatbots, one participant responded:

"I think it would be probably if—if I have the—the—the sources mentioned, like in Copilot, I would say probably 8 or 9 out of 10." (Interviewee 1)

When asked how their trust would be affected if sources were not provided, the same participant explained that they would feel the need to challenge the output more actively, by comparing it across different chatbots and questioning the origin of the information.

In summary, the internal trust factors identified by participants reveal how trust in LLM-based tools is closely tied to the system's technical performance. Issues such as hallucinations, lack of source transparency, and data privacy risks emerged as key concerns. Trust was also found to be fragile, easily lost through early errors and difficult to rebuild, underscoring the need for high initial system performance and thoughtful implementation.

5.2.2 Findings from workshop (RQ2)

This section outlines key findings from the workshop regarding design choices in developing an agent-based HR chatbot. A primary concern raised in the workshop was the importance of reliability, a quality also emphasised in the interview findings around trust factors. Consequently, many of the design discussions revolved around strategies for improving the reliability of the chatbot.

A core principle agreed upon was that the chatbot should avoid providing incorrect answers. If the system cannot provide a sufficiently accurate or complete response, it should explicitly state this to the user. The consensus was clear: it is better to give no answer than an incorrect one. This approach supports both the reliability and transparency of the system.

The HR chatbot use case was generally seen as relatively simple in nature. Its primary function is to retrieve relevant information from HR documentation and systems and present it in response to user queries. Unlike systems that require complex reasoning or computation, this task primarily involves information retrieval and summarisation. Accordingly, one participant advised against over-engineering the system architecture, saying "Don't over-engineer the agent structure for a simpler use case".

To enable effective document retrieval, the discussions concluded that a RAG approach was the most suitable. However, rather than using a basic RAG pipeline with a single LLM retrieving and generating responses, an enhanced RAG architecture was proposed. This would involve the inclusion of additional agents to improve answer quality and reliability. Specifically, a circular workflow was suggested, featuring a "checker agent" responsible for evaluating the quality of the generated answer in relation to the user query. If the answer is deemed insufficient, the system should loop back to revise and improve it based on this feedback.

Another important factor related to document retrieval was the format of the documents themselves. The discussions emphasised that, although company documents are often available as PDFs, LLMs perform more effectively when the files are provided in Markdown format instead.

Internal security was another important topic in the discussion.
Echoing concerns raised in the interviews, the inherent sensitivity of HR data was acknowledged as a potential risk. One mitigation strategy discussed was the integration of guardrails to limit inappropriate or insecure outputs. However, it was concluded that guardrails are not essential during the early stages of development and testing. This is due to the domain-specific nature of security requirements, which vary significantly across companies, departments, and jurisdictions. As such, defining meaningful security constraints requires detailed context that is often unavailable during early development. Thus, the primary focus at this stage should be on the performance of the system.

6 Cycle II

Building on the insights from the first cycle, the second cycle focused primarily on solution implementation and implementation evaluation. This phase therefore placed greater emphasis on the implementation stage of the regulative cycle, which resulted in the final artifact, and on the evaluation stage, aligning closely with RQ2 and RQ3.

The artifact was further developed based on the design suggestions identified during the first cycle. To streamline the development process, the system was implemented using the LLM orchestration framework LangGraph. Additional design considerations, such as defining the roles of the agents, crafting the prompts, and distinguishing between policy-related and employment-related questions, were also addressed.

To evaluate the artifact quantitatively, data was collected using the DeepEval framework in an experimental simulation [56]. Five metrics were employed: faithfulness, answer relevancy, contextual relevancy, robustness, and a custom G-Eval metric termed correctness. The results of this evaluation were analysed to determine how the artifact performed. In addition, five evaluation interviews were conducted during this iteration to collect qualitative data. These interviews aimed to assess how well the artifact addressed the trust factors identified in the first iteration.

To provide context for the remainder of the chapter, it begins by presenting the design of the final artifact. This is followed by a detailed description of the methods used for both quantitative and qualitative data collection. The quantitative section covers the evaluation metrics, the use of dummy data, and the setup of test runs, while the qualitative section outlines the evaluation interview approach. Finally, the chapter presents the findings from this cycle, including insights from the evaluation interviews and the results of the quantitative analysis.

6.1 The artifact - final solution candidate (RQ2)

This section presents the final artifact designed in the project: an HR chatbot capable of answering questions based on either HR guideline documents or specific employment data. First, a brief overview of the artifact is presented to give an understanding of how the chatbot works. Following this, the two components of the chatbot, the employment component and the guidelines component, are described in more detail, including the role of each agent within the components.

6.1.1 Overview

The HR chatbot is implemented as a Python application executed in the terminal. The chatbot is composed of two multi-agent components: the employment component and the guidelines component. Within each component, the flow used to answer a given question involves a set of agents, orchestrated through the LangGraph framework, each with a distinct role and responsibility.

When starting the chatbot, the user is asked to provide a question. After providing the question, the chatbot asks whether the question is about general HR policies or employment data, as shown in figure 6.1. The answer to this question determines which component will be used to answer the given question.

Figure 6.1: Example of choice for type of question in the chatbot.
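To make this top-level flow concrete, the following is a minimal sketch of how such a terminal entry point could route a question to the appropriate component. This is an illustration only: the two run_* functions are hypothetical placeholders for the component workflows described in the following sections, not the thesis's actual code.

def run_guidelines_component(question: str) -> str:
    # Placeholder for the enhanced RAG workflow (section 6.1.2).
    raise NotImplementedError

def run_employment_component(question: str) -> str:
    # Placeholder for the structured data workflow (section 6.1.3).
    raise NotImplementedError

def main() -> None:
    question = input("Please enter your question: ")
    # The chatbot asks which component should handle the question.
    choice = input("Is this about (1) general HR policies or (2) your employment data? ").strip()
    if choice == "1":
        print(run_guidelines_component(question))
    else:
        print(run_employment_component(question))

if __name__ == "__main__":
    main()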
6.1.2 Guidelines component

The guidelines component implements an enhanced RAG workflow consisting of four main parts, illustrated in figure 6.2:

• Vector Store: Contains indexed HR guideline document segments in Markdown format. These documents serve as the knowledge base for the chatbot.
• Judge agent: Retrieves document segments from the vector store, ranks them based on relevance, and filters out those deemed insufficiently relevant.
• Generator agent: Uses the relevant document segments identified by the judge to generate an answer to the user's question.
• Checker agent: Evaluates the generator's response against a predefined set of criteria. If the response is considered invalid, the checker provides feedback that the generator uses to produce a new answer. This feedback loop continues until one of the following conditions is met:
  – The checker accepts the generated answer as valid, in which case it is returned to the user.
  – The maximum number of three iterations is reached, in which case the system informs the user that it was unable to provide a satisfactory answer.

Figure 6.2: Structure of the HR chatbot.

Below, each part of the guidelines component is described in more detail.

Vector Store

The vector store is implemented using the FAISS library [57] and contains indexed HR guideline document segments. These document segments were embedded using the BAAI/bge-small-en-v1.5 model, developed specifically for retrieval-augmented LLM systems [58]. Document segment retrieval is performed using FAISS similarity search, which returns the top six most relevant segments to the judge agent.

Judge agent

The judge agent is responsible for filtering and ranking the retrieved document segments. It assigns each segment a relevance score between 0 and 1, where 0 indicates complete irrelevance and 1 indicates a direct and highly relevant answer to the query. The full prompt used by the judge agent is provided in appendix A.4.

Listing 6.1 shows an example of the judge's reasoning when assigning a relevance score of 1.0 in response to the question: "How many vacation days do I get?"

[JUDGE] Document relevance score: 1.0 (threshold: 0.6)
[JUDGE] Reasoning:
1. The question asks about the number of vacation days I get.
2. The document title is "Vacation Policy", which suggests that it might be relevant to the question.
4. The first section "Annual Vacation Entitlement" explicitly states that all employees are entitled to 25 paid vacation days per year, which directly answers the question.
5. The rest of the document provides additional information about vacation accrual, planning, and saving vacation days, but it is not directly related to the question.

Listing 6.1: Example of judge agent scoring a relevant document.

As illustrated in the example above, the judge uses a relevance threshold of 0.6. Segments scoring below this threshold are discarded, while the remaining segments are sorted by relevance and passed as context to the generator agent. If no document segments are deemed relevant enough, the workflow is stopped and the answer "I don't have enough information to answer this question based on the HR handbook." is returned to the user.
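To illustrate the retrieval and filtering steps just described, the sketch below retrieves the top six segments from a FAISS index built with the BAAI/bge-small-en-v1.5 embeddings (here via LangChain's community wrappers) and applies the 0.6 relevance threshold. The index path and the score_relevance stub, which stands in for the judge agent's LLM scoring call, are assumptions for illustration rather than the thesis's implementation.

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

RELEVANCE_THRESHOLD = 0.6  # segments scoring below this are discarded

# Embedding model used to index the HR guideline segments.
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")

# Assumes the Markdown guideline segments were indexed and saved beforehand;
# "hr_guidelines_index" is a placeholder path.
vector_store = FAISS.load_local(
    "hr_guidelines_index", embeddings, allow_dangerous_deserialization=True
)

def score_relevance(question: str, segment: str) -> float:
    # Stand-in for the judge agent's LLM call, which assigns each segment
    # a relevance score between 0 and 1 (prompt in appendix A.4).
    raise NotImplementedError

def retrieve_and_filter(question: str) -> list[str]:
    # FAISS similarity search returns the six most relevant segments.
    segments = vector_store.similarity_search(question, k=6)
    scored = [(score_relevance(question, d.page_content), d.page_content) for d in segments]
    # Keep only segments at or above the threshold, sorted by descending relevance.
    kept = sorted((pair for pair in scored if pair[0] >= RELEVANCE_THRESHOLD), reverse=True)
    return [content for _, content in kept]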
Generator agent

The generator agent produces an answer to the user's question using the ranked and filtered document segments provided by the judge agent. It is explicitly instructed to base its answer strictly on the provided segments, ensuring that the response is both accurate and comprehensive. Emphasis is placed on referencing the specific documents and sections from which the information is derived.

If a previously generated answer is deemed invalid by the checker agent, the feedback provided is incorporated into the generator's next attempt. This feedback-guided loop enables iterative refinement of the answer. The full prompt used by the generator agent is available in appendix A.4. Once an answer is generated, it is forwarded to the checker agent for validation. Listing 6.2 shows an example of a valid answer generated by the generator agent:

According to the Vacation document, in the Annual Vacation Entitlement section, all employees are entitled to a minimum of 25 paid vacation days per year.

Listing 6.2: Example of valid answer to the question "How many vacation days do I get?".

Checker agent

The checker agent evaluates the answer generated by the generator, using both the answer itself and the set of document segments that informed it, called the context. The evaluation is based on a predefined set of criteria, which include whether the answer: addresses the user's original question, is grounded in the provided context, and avoids introducing information not found in the document segments.

The checker performs its assessment by responding to five yes/no verification questions. A "yes" indicates that the criterion has been met, while a "no" indicates that it has not. Each response is accompanied by a rationale explaining the judgment. If the overall answer is deemed invalid, this assessment is passed back to the generator as feedback for the next iteration. If the answer is deemed valid, it is returned to the user. The full prompt used by the checker agent, including the verification questions, is provided in appendix A.4.

Listing 6.3 shows an example of the response from the checker agent for an answer to the question "Does the company handle chiro expenses?" that it deemed to be invalid:

[CHECKER] Feedback: Here is my verification response:
Q1: Yes - The answer directly answers the question about whether the company handles chiropractic expenses.
Q2: No - The answer claims that the company has a process for reimbursing chiropractor visits, but the document only mentions a reimbursement process for the healthcare allowance, not specifically for chiropractic expenses.
Q3: No - The document does not mention chiropractic expenses as an eligible or non-eligible expense for the healthcare allowance, and the answer adds information not present in the documents.
Q4: No - The answer does not cite specific document names.
Q5: No - The answer does not contain all relevant information for the question present in the documents, as the document does not mention chiropractic expenses.
ASSESSMENT: INVALID: The answer adds unsupported information and does not cite document names.

Listing 6.3: Example of checker agent assessing an answer as invalid.
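The thesis does not list the orchestration code, but the judge, generator, and checker flow described above maps naturally onto a LangGraph state graph. The sketch below is one possible wiring, assuming stand-in functions for the agents' LLM calls and a state shape invented for illustration; the three-iteration cap and the early exit when no segments pass the threshold follow the behaviour described in this section.

from typing import TypedDict
from langgraph.graph import StateGraph, END

MAX_ITERATIONS = 3  # after three invalid answers the user is informed

class GuidelinesState(TypedDict):
    question: str
    segments: list[str]  # filtered document segments from the judge
    answer: str
    feedback: str        # checker feedback fed into the next attempt
    valid: bool
    iterations: int

def generate_answer(question: str, segments: list[str], feedback: str) -> str:
    raise NotImplementedError  # generator agent's LLM call (prompt in appendix A.4)

def check_answer(question: str, answer: str, segments: list[str]) -> tuple[bool, str]:
    raise NotImplementedError  # checker agent's LLM call (prompt in appendix A.4)

def judge_node(state: GuidelinesState) -> dict:
    # retrieve_and_filter is the helper from the retrieval sketch in section 6.1.2.
    return {"segments": retrieve_and_filter(state["question"])}

def generator_node(state: GuidelinesState) -> dict:
    # Generate an answer grounded in the segments, using any checker feedback.
    answer = generate_answer(state["question"], state["segments"], state["feedback"])
    return {"answer": answer, "iterations": state["iterations"] + 1}

def checker_node(state: GuidelinesState) -> dict:
    # Answer the five yes/no verification questions and produce feedback.
    valid, feedback = check_answer(state["question"], state["answer"], state["segments"])
    return {"valid": valid, "feedback": feedback}

def after_judge(state: GuidelinesState) -> str:
    # Stop early if no segment passed the relevance threshold.
    return "generator" if state["segments"] else END

def after_checker(state: GuidelinesState) -> str:
    # Loop back to the generator until the answer is valid or the cap is hit.
    return END if state["valid"] or state["iterations"] >= MAX_ITERATIONS else "generator"

graph = StateGraph(GuidelinesState)
graph.add_node("judge", judge_node)
graph.add_node("generator", generator_node)
graph.add_node("checker", checker_node)
graph.set_entry_point("judge")
graph.add_conditional_edges("judge", after_judge)
graph.add_edge("generator", "checker")
graph.add_conditional_edges("checker", after_checker)
workflow = graph.compile()

A question would then be run through the loop with, for example, workflow.invoke({"question": q, "segments": [], "answer": "", "feedback": "", "valid": False, "iterations": 0}).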
6.1.3 Employment component

The employment component consists of three main parts, as illustrated in figure 6.2:

• Field identifier agent: Analyses the user's question and determines which data fields need to be retrieved from the dataset.
• Data retrieval node: Retrieves the specified fields, as identified by the field identifier agent, for the user's employment ID.
• Generator agent: Generates a response to the user's question using the data retrieved by the data retrieval node.

The employment component was considered a simpler use case than the guidelines component, primarily because it handles structured data fields rather than unstructured document segments. Based on insights from the workshop, this simplicity suggested that a less complex agent structure would be more appropriate. As a result, the employment component does not include a judge or checker agent. Each part of the employment component is described in more detail in the following sections.

Field identifier agent

The field identifier agent determines which available data fields are relevant to answering the user's question. The agent has access to all available fields from the dataset, along with explanations of key relationships between them. Based on this information, it produces a comma-separated list of field names, which is then passed to the data retrieval node. The full prompt for the field identifier agent is provided in appendix A.4.

Data retrieval node

The data retrieval node is not an agent but a method that takes the set of field names from the field identifier agent together with the employment ID of the user, extracts the relevant data from the dataset, and stores it in the component state. This state is accessed by the generator agent during answer generation. The dataset is in the form of a CSV file.

Generator agent

The generator agent produces an answer to the user's question based on the data retrieved by the data retrieval node. It is explicitly instructed not to perform any calculations or actions beyond what is supported by the provided data. If the required information is missing or unavailable, the generator should clearly state this. The response should be concise and professional, avoiding references to technical implementation details. The full prompt for the generator agent is provided in appendix A.4.

Listing 6.4 shows an example of an answer generated by the generator agent in response to the question: "What is my department and who is my manager?"

Your department is Production and your manager is Kelley Spirea.

Listing 6.4: Example of answer generated by the generator agent.
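As an illustration of the data retrieval node, the sketch below resolves the field identifier agent's comma-separated field list against the CSV dataset for a given employment ID, using pandas. The file name and the EmpID column are assumptions made for the example; the actual column names depend on the dataset [59].

import pandas as pd

# The employment dataset is a CSV file; the path is a placeholder.
employees = pd.read_csv("employee_data.csv")

def retrieve_fields(field_list: str, employment_id: int) -> dict:
    # field_list is the comma-separated output of the field identifier
    # agent, e.g. "Department,ManagerName".
    fields = [name.strip() for name in field_list.split(",")]
    # "EmpID" is an assumed name for the employment ID column.
    rows = employees.loc[employees["EmpID"] == employment_id]
    if rows.empty:
        return {}  # the generator agent should then state that data is missing
    record = rows.iloc[0]
    # Keep only requested fields that actually exist in the dataset, so a
    # hallucinated field name cannot crash the lookup.
    return {f: record[f] for f in fields if f in employees.columns}

The returned dictionary corresponds to what the node would store in the component state for the generator agent to draw on.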
6.2 Method - Quantitative data collection

This section outlines the metrics used during the quantitative evaluation of the artifact, the nature of the test data, and the procedure for running the evaluation.

6.2.1 Metrics

To evaluate the chatbot's performance quantitatively, the system was assessed using five distinct metrics. For the guidelines component, the DeepEval metrics used were answer relevancy, faithfulness, and contextual relevancy. These metrics are designed specifically for measuring performance in RAG-based systems, which this component is. In addition, a custom robustness metric was introduced to measure the system's consistency under input variation. Since the employment component is not a RAG-based component, a custom metric named correctness was developed. Below, the calculation of each metric is presented in more detail.

Answer relevancy is calculated as:

\[ \text{Answer relevancy} = \frac{\text{Number of relevant statements in the answer}}{\text{Total number of statements in the answer}} \]

The evaluation LLM used by DeepEval extracts all statements from the chatbot's output and classifies whether each statement is relevant to the input.

Faithfulness is calculated as:

\[ \text{Faithfulness} = \frac{\text{Number of truthful claims in the answer}}{\text{Total number of claims in the answer}} \]

All claims are extracted from the output by the evaluation LLM, which then determines whether each claim is truthful based on the context used to answer the question.

Contextual relevancy is calculated as:

\[ \text{Contextual relevancy} = \frac{\text{Number of relevant statements in the context}}{\text{Total number of statements in the context}} \]

In this case, the evaluation LLM extracts statements from the retrieved context and classifies whether each one is relevant to the specific question being answered.

Robustness is calculated as follows. For each question in the simple category (further explained in section 6.2.3), 9 reformulated versions were generated using ChatGPT. Each reformulated question (plus the original) was evaluated 5 times, resulting in 50 evaluation runs per baseline question per metric (answer relevancy, faithfulness, and contextual relevancy). The robustness score for one metric is the average across these 50 runs:

\[ \text{Robustness} = \frac{1}{50} \sum_{i=1}^{50} \text{Score}_i \]

where \( \text{Score}_i \) is the evaluation score of run \( i \) for the given metric.

Correctness is a custom metric developed using the G-Eval framework supplied by DeepEval. To evaluate correctness, the following criteria were provided to the evaluation LLM:

1. Check whether the answer from the HR API includes all relevant information from the employee data.
2. Determine if the answer directly addresses the employee's question.
3. Check if any information in the answer contradicts the available employee data.
4. Assess whether the answer clearly indicates when requested information is not available in the data.
5. The exact wording and phrasing in the answer is not important, but the response must convey the key information specified in the expected output. For example, if the expected output is "The answer should clearly state that you have not been late the last 30 days," the actual response could be "According to your records, you have 0 days late in the past month" or "You have perfect attendance with no late days in the last 30-day period." Focus on evaluating if the substance of the required information is present rather than exact word matching.

In addition to these steps, the evaluation LLM is provided with the question and an expected output formulated as an explanation of what the response should convey. An example from the evaluation runs is shown in listing 6.5:

"question": "Am I employed?",
"expected_output": "The answer should clearly state that you are currently employed"

Listing 6.5: Example of question and expected output for evaluation of employment component.
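To illustrate how such metrics can be computed, the sketch below uses DeepEval to score a single test case for the guidelines component with the answer relevancy metric, and defines a G-Eval correctness metric in the spirit of the criteria above. The specific strings, criteria wording, and model configuration are illustrative assumptions, not the thesis's evaluation code.

from deepeval.metrics import AnswerRelevancyMetric, GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# A guidelines-component test case: the chatbot's answer together with
# the document segments it was grounded in.
rag_case = LLMTestCase(
    input="How many vacation days do I get?",
    actual_output=(
        "According to the Vacation document, all employees are entitled "
        "to a minimum of 25 paid vacation days per year."
    ),
    retrieval_context=[
        "All employees are entitled to a minimum of 25 paid vacation days per year."
    ],
)

# GPT-4.1 serves as the LLM-as-a-judge, as in the thesis's test runs.
relevancy = AnswerRelevancyMetric(model="gpt-4.1")
relevancy.measure(rag_case)
print(relevancy.score, relevancy.reason)

# A G-Eval correctness metric for the employment component, judging the
# answer against an expected output describing what it should convey.
correctness = GEval(
    name="Correctness",
    criteria=(
        "Determine whether the actual output conveys the key information "
        "specified in the expected output without contradicting the data."
    ),
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    model="gpt-4.1",
)
employment_case = LLMTestCase(
    input="Am I employed?",
    actual_output="Yes, you are currently employed.",
    expected_output="The answer should clearly state that you are currently employed.",
)
correctness.measure(employment_case)
print(correctness.score)

Averaging the per-run scores over the 50 runs for a baseline question would then yield the robustness score defined above.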
6.2.2 Dummy data

The data used for evaluating the guidelines component consisted of 14 mock HR guideline documents, formatted in Markdown. This format was selected following the results from the workshop described in section 5.2.2, where it was determined to be the most easily interpreted by LLMs. These documents, generated using an LLM, do not reflect actual HR policies or legislation but were designed to simulate a set of guidelines that the chatbot could use to respond to typical HR-related queries. The factual accuracy of the guidelines was not considered relevant for the evaluation, as the documents were treated as the "ground truth" within the context of the simulated scenario.

For the employment component, a publicly available dataset containing HR information about fictitious employees at a fictitious company was used [59]. This dataset served as the basis for evaluating how the system responded to employment-related queries, with correctness as the only evaluation metric. The dataset provided structured employee data, such as job roles, salaries, and attendance records, which was used as the ground truth for evaluating the system's accuracy in handling factual queries.

6.2.3 Test runs

The system's performance was quantitatively evaluated by running it against a set of questions commonly posed to HR staff. These questions were formulated based on a combination of documents of commonly asked questions, provided by the HR department of the collaborating company, and the available system data. The full list of questions used in the evaluation is available in appendix A.3.

For all test runs, the artifact used the llama3-70b-8192 model [60] to answer each question. These responses were then evaluated using DeepEval, with GPT-4.1, the latest model from OpenAI at the time, serving as the LLM-as-a-judge [61].

Guidelines component

For the guidelines component of the chatbot, a total of 23 questions were used, divided into three categories: simple, broader, and questions with no answers.

• Simple questions: 10 questions, each with direct answers available in the HR documents used by the system.
• Broader questions: 7 questions where the answers were less straightforward or the questions were phrased more vaguely.
• Questions with no answers: 6 questions for which the correct answers were not provided in the documents. These questions were included to assess how the system handles questions without a direct answer. Since no answers exist for these questions, metrics were deemed not applicable. Instead, the system's responses to these questions were evaluated through qualitative assessment during t