Bridging Trust and Design of a Multi-Agent LLM-Based HR Chatbot: For the Times They Are A-Changin'

Master's Thesis 2025 in Computer Science and Engineering

Jonatan Axetorn
Felix Edholm

© Jonatan Axetorn 2025.
© Felix Edholm 2025.

Academic supervisor: Lucas Gren, Department of Computer Science and Engineering
Industry supervisor: Lucas Gren
Examiner in practice: Krishna Ronanki, Department of Computer Science and Engineering
Examiner: Christian Berger, Department of Computer Science and Engineering

Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX
Gothenburg, Sweden 2025

Abstract

Introduction: The integration of Large Language Models (LLMs) into workplace systems presents significant opportunities, particularly in the domain of human resources (HR), where repetitive tasks, such as providing information that employees could retrieve themselves, are common and could potentially be handled by an LLM-based solution. However, a lack of user trust remains a major barrier to the adoption of LLM-based systems.

Objective: This thesis investigates which trust factors exist in LLM-based systems and how they can be addressed through system design, with a specific focus on a multi-agent HR chatbot.

Method: Using a Design Science Research methodology, the study was conducted in two iterative cycles. Cycle I identified trust factors through a literature review and interviews with six employees at a multinational company. It also included a workshop with five AI experts to discuss and validate design choices. Cycle II involved implementing and evaluating an artifact: a multi-agent chatbot tailored to HR queries.

Findings: Thematic analysis revealed external trust factors (transparency, organisational measures, and external security) and internal trust factors (internal security, model differences, risk of bias, and reliability), of which reliability emerged as the most critical. The artifact was evaluated through interviews and metrics such as answer relevancy, faithfulness, and robustness, showing consistently strong performance and broad user acceptance.

Conclusion: The multi-agent HR chatbot effectively addressed key trust concerns and was positively received by most interviewees, demonstrating its potential for real-world application. These findings suggest that trust factors can be meaningfully addressed through thoughtful design and should be treated as a core consideration throughout the development process of LLM-based systems.
Keywords: autonomous agents, chatbot, design science research, human resources, HR, large language model, multi-agent architectures, system design, trust, trust factors

Acknowledgements

First and foremost, we would like to express our sincere gratitude to everyone who participated in the interviews and workshop conducted during this thesis. We would also like to thank Lucas Gren for his support and guidance as our academic and industry supervisor during this project.

Jonatan Axetorn, Felix Edholm
Gothenburg, June 2025

Contents

List of Figures
List of Tables
1 Introduction
  1.1 Problem description
  1.2 Purpose of the study
  1.3 Research questions
  1.4 Significance of the study
  1.5 Delimitations
  1.6 Thesis outline
2 Background
  2.1 Trust
  2.2 Large language models
    2.2.1 Hallucinations
    2.2.2 Prompt engineering
  2.3 Retrieval-augmented generation
  2.4 Autonomous agents
  2.5 LLM orchestration
    2.5.1 LangChain & LangGraph
  2.6 Guardrails
  2.7 LLM-as-a-judge
    2.7.1 DeepEval
3 Related Work
  3.1 Trust in LLMs
  3.2 Challenges with multi-agent LLM-based systems
  3.3 Collaboration in multi-agent systems
  3.4 Multi-agent retrieval-augmented generation filtering
4 Method
  4.1 Design science research
    4.1.1 Problem investigation
    4.1.2 Solution design
    4.1.3 Design validation
    4.1.4 Implementation
    4.1.5 Evaluation
  4.2 Overview of cycles
5 Cycle I
  5.1 Method - Qualitative data collection
    5.1.1 Interviews
      5.1.1.1 Problem investigation interview setup
      5.1.1.2 Thematic analysis
    5.1.2 Workshop
  5.2 Findings - Cycle I
    5.2.1 Trust factors in LLM-based systems (RQ1)
      5.2.1.1 External trust factors — Trust impacted by non-technical forces
      5.2.1.2 Internal trust factors — Trust impacted by technical details
    5.2.2 Findings from workshop (RQ2)
6 Cycle II
  6.1 The artifact - final solution candidate (RQ2)
    6.1.1 Overview
    6.1.2 Guidelines component
    6.1.3 Employment component
  6.2 Method - Quantitative data collection
    6.2.1 Metrics
    6.2.2 Dummy data
    6.2.3 Test runs
  6.3 Method - Qualitative evaluation interview
  6.4 Findings - Cycle II
    6.4.1 Findings from evaluation interviews (RQ3)
    6.4.2 Findings from quantitative evaluation (RQ3)
7 Discussion
  7.1 Implications for research
  7.2 Implications for practice
  7.3 Limitations
  7.4 Future work
8 Conclusion
References
A Appendix
  A.1 Problem investigation interview guide
  A.2 Evaluation interview guide
  A.3 Quantitative evaluation questions
  A.4 Artifact agent prompts

List of Figures

4.1 The regulative cycle of design science research.
4.2 Activities performed during the two cycles in this thesis.
5.1 Identified trust factors in LLM-based systems.
6.1 Example of choice for type of question in the chatbot.
6.2 Structure of the HR chatbot.
6.3 Example output from the chatbot to the question "How many vacation days do I get?" with corresponding HR guideline source.

List of Tables

4.1 Participant counts for qualitative data collection activities.
4.2 Number of evaluation runs per quantitative metric.
6.1 Baseline evaluation results of the guidelines component for simple category questions. Each question was asked and evaluated 20 times. All values are rounded to three decimal places.
6.2 Robustness evaluation results of the guidelines component for simple category questions, including percentage change relative to the baseline. Each baseline question was reformulated into 9 variations, and all 10 versions (including the original) were each evaluated 5 times. The robustness score represents the average of these 50 runs for each baseline question. Robustness values are rounded to three decimal places; percentage changes are rounded to two decimal places.
6.3 Evaluation results of the guidelines component for broader category questions. Each question was asked and evaluated 20 times. All values are rounded to three decimal places.
6.4 Evaluation results of the employment component. Each question was asked and evaluated 20 times. All values are rounded to three decimal places.
A.1 Questions for the employment component and expected outputs.
A.2 Other questions for the employment component and expected outputs.

1 Introduction

The application of Large Language Model (LLM) solutions across various business areas has never been more relevant than it is today. The opportunity to use natural language to address repetitive tasks is promising. Text-based interactions with LLMs are increasingly replacing traditional human-to-human interactions [1].

Despite their potential to improve organisational efficiency, the introduction of artificial intelligence (AI) solutions often encounters reluctance. Factors such as fear of job displacement, distrust of AI's perceived human qualities, and general scepticism contribute to delays in adopting these systems [2]. To overcome these challenges, it is crucial to design LLM-based systems that actively build user trust. Key performance-related factors, such as accuracy and the frequency of hallucinations, have been shown to positively influence this trust [3, 4]. Since these factors are directly shaped by system design, thoughtful design emerges as a vital strategy for fostering trust in LLM-based technologies.

Developing autonomous agent systems based on LLMs, where agents refer to AI-based entities that have capabilities such as planning, social interaction, and memory [5], holds significant potential for positively impacting trust factors such as the reliability of the system. Additionally, LLM-based autonomous agent systems have demonstrated significant versatility [6], highlighting their potential to address a wide range of organisational needs. In an ideal scenario, a general agent-based system could meet the needs of employees across different roles within a company. However, creating general-purpose LLM-based solutions has proven to be elusive [7, 8]. A possible alternative is tailoring LLM-based systems to specific purposes.

Furthermore, multi-agent architectures, which leverage the collaborative abilities of multiple LLM agents, have been shown to outperform single-agent systems when handling complex problems [9]. This suggests that designing a multi-agent architecture tailored to a specific role within a company could yield significant performance benefits. Although improved performance alone may not guarantee user trust, it remains an important factor influencing trust [3, 4], as previously stated. Consequently, a multi-agent architecture represents a promising approach for enhancing user trust. A well-designed multi-agent LLM-based system could also reduce the need for human-to-human interactions, thereby improving efficiency. This is especially relevant in the context of human resources (HR), where LLM-based systems can automate tasks that traditionally required direct communication with HR staff [2].

This thesis explores the factors that influence trust in LLM-based systems, considering both non-technical elements and those shaped by technical decisions. Through a combination of literature review and interviews, this study identifies key trust factors.
Employing a design science research approach, the thesis presents an artifact: a multi-agent LLM-based HR chatbot designed to answer questions related to HR guidelines and employment information, with trust factors integrated into its design. The artifact is then evaluated using both qualitative and quantitative methods.

1.1 Problem description

Most existing studies on trust in LLM-based systems focus primarily on user experience rather than system design. This reveals a critical gap: how can LLM-based systems be designed with trust-building factors in mind?

Both single-agent and multi-agent systems present unique challenges. Single-agent systems often face limitations such as shorter context windows [10] and a higher risk of hallucinations [11]. Multi-agent systems, on the other hand, must address complexities like task allocation and coordination among agents [12]. However, the benefits offered by multi-agent systems, such as improved performance and robustness [13, 8], tend to outweigh these coordination challenges. Despite this potential, current research primarily focuses on single-agent systems, leaving the potential of multi-agent solutions underexplored.

Another important consideration is whether the LLM-based system is general-purpose or domain-specific. Since different roles have different needs, general solutions often underperform compared to bespoke, domain-specific alternatives. This has been demonstrated in both legal [7] and HR contexts [14], where tailored systems have shown superior results.

Taken together, these findings highlight a key research gap: the design of a multi-agent LLM system tailored to specific roles and organisational needs, while incorporating trust-related factors, has yet to be thoroughly explored.

1.2 Purpose of the study

The purpose of this study is to explore the factors that influence user trust in LLM-based systems and to examine how these factors can be addressed through system design. Specifically, the study focuses on the development of a multi-agent chatbot for HR-related queries, aiming to identify design choices that enhance trust. By doing so, it seeks to bridge the gap between trust considerations and system design in the context of bespoke, domain-specific LLM applications.

1.3 Research questions

• RQ1: What are the main trust factors that exist in the usage of an LLM-based system?
• RQ2: What potential solutions can be integrated into the system design of an LLM-based HR chatbot to address the relevant trust factors identified in RQ1?
• RQ3: To what extent can the relevant trust factors identified in RQ1 be addressed through the design solutions implemented in an LLM-based HR chatbot?

1.4 Significance of the study

The significance of this study lies in its contribution to bridging the gap between system design and user trust in LLM-based applications. It offers practical knowledge for organisations seeking to implement multi-agent LLM systems that foster trust, and thereby encourage user adoption.

Additionally, developing a system built around an AI component, such as an LLM, is part of Software Engineering (SE) for AI. As highlighted by Uchitel et al. [15], this area is highly relevant to the broader software engineering community. This thesis seeks to make a meaningful contribution to SE for AI by addressing the lack of research on designing and constructing trust-fostering multi-agent systems.
1.5 Delimitations

This thesis focuses on the development of a chatbot designed to assist employees in querying HR guideline documents and employment-related information. It explicitly excludes other use cases, such as HR personnel interacting with the system or scenarios involving recruitment, onboarding, or employee management. The system is limited to handling informational queries only and does not perform transactional actions, such as applying for leave or managing tasks.

Although security and confidentiality are essential for systems that handle personal or sensitive data, these concerns fall outside the scope of the developed artifact.

The research does not involve a comparison between different large language models. Instead, the chatbot exclusively uses llama3-70b-8192 without any fine-tuning or modification of the underlying model.

The goal of developing the chatbot in this thesis is not to create a fully deployable system for real-world use. Instead, the purpose is to explore how specific design choices influence the trust factors identified. Consequently, no formal requirements elicitation is conducted with stakeholders.

The study is conducted in collaboration with a large multinational company, and all interviews are carried out with employees from within this organisation.

Finally, while user interface design and usability are known to influence trust in AI systems, these aspects are not a focus of this thesis.

1.6 Thesis outline

The thesis begins by presenting key concepts and background information in Chapter 2. Chapter 3 reviews related research relevant to the thesis, including studies on trust in AI and multi-agent architectures. The research methodology is described in Chapter 4, which outlines the overall Design Science Research approach used in the study.

The thesis follows two iterative design cycles. Chapter 5 details Cycle I, including the methodology for qualitative data collection and the corresponding findings. Chapter 6 covers Cycle II, beginning with a presentation of the completed artifact, a multi-agent LLM-based HR chatbot, followed by descriptions of the quantitative and qualitative evaluation methods. The chapter concludes with the findings from the artifact evaluation.

In Chapter 7, the discussion expands on the findings, explores their implications, and addresses threats to validity. It also outlines potential directions for future research. Finally, Chapter 8 provides a conclusion to the thesis.

2 Background

This chapter provides background on the key concepts relevant to this thesis. It begins with an overview of trust and then introduces LLMs more broadly, covering key challenges such as hallucinations and the role of prompt engineering. The chapter then shifts focus to the foundations of retrieval-augmented generation (RAG), orchestration frameworks, guardrails, and the concept of autonomous agents. Finally, it outlines relevant evaluation techniques, with a focus on LLM-as-a-judge and the DeepEval framework used in this study.

2.1 Trust

Trust is a complex and multi-dimensional concept that is challenging to define in a way that applies universally across different contexts. It has been explored in various fields, including psychology [16], economics [17], organisational theory [18], and sociology [19], leading to diverse and sometimes conflicting research [20, 21].
However, in a general sense, trust can be viewed as the relationship between a "trustor" (the one who trusts) and a "trustee" (the one who is trusted), according to Mayer et al. [18]. While researching trust in digital information, Kelton et al. [20] discuss four levels of trust that they have identified in the literature on trust:

• Individual trust: A person's inherent trust based on accumulated experiences.
• Interpersonal trust: A social connection between a trustor and a trustee.
• Relational trust: Trust that develops as an emergent property of the relationship over time.
• Societal trust: Trust that exists within a community or society as a whole.

For the purposes of this thesis, interpersonal trust is most relevant, as it pertains to the one-way trust relationship between a trustor and a trustee. Importantly, the trustee does not necessarily need to be a human; it could also be a technological system, such as an LLM-based chatbot.

Furthermore, Kelton et al. [20] argue that three key conditions must be met for trust to be relevant in a given situation:

• Uncertainty: A lack of information creates uncertainty.
• Vulnerability: The trustor is at risk of experiencing a loss if the trust is betrayed.
• Dependence: The trustor has a need that the trustee is capable of fulfilling.

In the context of an LLM-based HR chatbot such as the one in this thesis, uncertainty arises for the employee (the trustor) because they typically turn to the chatbot (the trustee) when they lack specific HR-related information, such as details regarding vacation days or company benefits. Regarding vulnerability, there is a potential risk that if the chatbot provides inaccurate information or discloses sensitive data inappropriately, the employee may experience negative consequences, such as making decisions based on faulty or incomplete information. Finally, the employee's dependence on the chatbot is evident, as the chatbot holds the necessary information and has the capability to address the employee's questions, thereby fulfilling their informational needs in the HR context.

2.2 Large language models

LLMs are a category of artificial intelligence designed to generate, interpret, and engage with natural human language. These models are trained on vast amounts of textual data, enabling them to learn the complexities of language, including syntax, semantics, and contextual relationships [22]. A significant advancement in this field was the introduction of BERT (Bidirectional Encoder Representations from Transformers), which enabled models to assess the importance of words in a sentence regardless of their position [23]. ChatGPT, which gained widespread public attention in 2022, further advanced these capabilities with a larger and more powerful model. The result is text generation that is both coherent and contextually relevant, based on the input it receives [24]. The practical applications of LLMs are broad, ranging from responding to simple queries to performing complex data analysis.

2.2.1 Hallucinations

LLMs can sometimes produce undesirable outcomes, resulting in outputs that are "bland, incoherent, or caught in repetitive loops" [25]. In such cases, the generated content may be nonsensical or unfaithful to the source input. This phenomenon is commonly referred to as hallucinations. Hallucinations present significant concerns regarding the reliability and performance of LLMs for several reasons.
One major issue is a reduction in accuracy, as hallucinated responses are, by definition, incorrect. Another concern is related to security, as hallucinations may lead the model to produce or infer sensitive information that it should not access or disclose. Addressing hallucinations remains an ongoing challenge in the field, and researchers are actively developing various techniques to mitigate their occurrence [25].

2.2.2 Prompt engineering

A prompt is an input provided to an LLM that guides the nature of the generated output. Prompts can consist of various types of media, including text, images, audio, or other formats. The process of designing and refining these inputs is referred to as prompt engineering [26]. Prompt engineering has emerged as an effective method for enhancing the performance of LLMs, as it does not require altering the underlying model itself, but instead involves crafting more effective instructions for the AI. Well-designed prompting techniques have been shown to significantly improve LLM performance, making prompt engineering a critical consideration when developing LLM-based systems [27, 28].

According to OpenAI, some effective strategies for prompt engineering include:

• Including specific details in the query to obtain more relevant answers
• Using delimiters to clearly separate distinct parts of the input
• Specifying the steps required to complete a task
• Providing examples to guide the model's response
• Indicating the desired length of the output

Relatively simple techniques such as these can lead to substantial improvements in the quality and relevance of LLM-generated outputs [29].

2.3 Retrieval-augmented generation

RAG was originally developed by Lewis et al. [30] for natural language processing tasks. This approach enhances LLMs by integrating domain-specific knowledge retrieved from external data sources, thereby mitigating the generation of inaccurate or outdated information. RAG enables text generation to be grounded in relevant, retrieved data rather than relying solely on the model's pre-trained knowledge [31]. The incorporation of external data sources is particularly critical in question-answering systems, where the factual accuracy of responses is a key requirement. As stated, one of the primary challenges in LLM-based systems is the occurrence of hallucinations. Research has demonstrated that RAG significantly reduces the frequency of hallucinations while maintaining the overall performance of the system [32].

At its simplest, the RAG process follows three steps: indexing, retrieval, and generation, as explained by Gao et al. [31].

1. Indexing extracts data from various formats, such as PDF, HTML, and Markdown, standardising it into plain text and segmenting it into smaller units. These segments are then encoded into vector representations using an embedding model and stored in a vector database, enabling efficient similarity searches [31].

2. Retrieval identifies and retrieves relevant information based on a query. The system encodes the query into a vector and compares it to stored document vectors, selecting the most relevant results. These retrieved segments expand the LLM's knowledge beyond its pre-trained dataset [31].

3. Generation synthesises a response using the retrieved context. The LLM processes the query and retrieved document segments to generate a factually grounded and contextually relevant response, integrating both external data and its pre-trained knowledge as needed [31].
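To make the three steps concrete, the sketch below strings them together in Python. It is a minimal illustration only: the bag-of-words `embed` function stands in for a real embedding model and vector database, and `call_llm` is a hypothetical placeholder for an LLM API call; neither is part of the thesis artifact. The prompt also applies two of the OpenAI strategies listed above (delimiters and specific instructions).

```python
import math
from collections import Counter

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM API call.
    return f"[LLM response to a prompt of {len(prompt)} characters]"

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system would use an embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# 1. Indexing: segment source documents and store their vector representations.
documents = [
    "Employees are entitled to 30 vacation days per year.",
    "Parental leave requests must be submitted 60 days in advance.",
]
index = [(doc, embed(doc)) for doc in documents]

def retrieve(query: str, k: int = 1) -> list[str]:
    # 2. Retrieval: rank stored segments by similarity to the query vector.
    ranked = sorted(index, key=lambda item: cosine(embed(query), item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def answer(query: str) -> str:
    # 3. Generation: ground the answer in the retrieved context.
    context = "\n".join(retrieve(query))
    prompt = (
        "Answer the question using ONLY the context between the ### delimiters.\n"
        f"###\n{context}\n###\n"
        f"Question: {query}"
    )
    return call_llm(prompt)

print(answer("How many vacation days do I get?"))
```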
2.4 Autonomous agents

Agents have been studied extensively within the AI community since long before the emergence of LLMs. Agents are defined as software systems that may exhibit characteristics including: autonomy, meaning they can operate without direct human intervention; social ability, enabling interaction with other agents; reactivity, allowing them to respond to environmental changes; and pro-activeness, giving them the ability to take initiative [33]. Additionally, AI agents are implemented using concepts typically associated with humans, such as knowledge, emotion, and intention.

The introduction of LLMs has positively impacted the development of autonomous agents by leveraging natural language capabilities [5]. Modern LLM-based agents integrate advanced features such as personalised profiles, memory retention, external tool usage, and advanced planning [5]. These agents can adopt specialised roles and collaborate with one another, enhancing their collective problem-solving capabilities. This collaboration enables multi-agent systems.

2.5 LLM orchestration

LLM orchestration refers to the process of coordinating multiple LLMs, for instance in the form of agents, to accomplish specific tasks. This involves managing activities such as linking prompts, handling API calls, retrieving data, and maintaining state across interactions. LLM orchestration is often done using an orchestration framework, which provides the structure and tools needed to effectively manage these tasks. These frameworks simplify the development process by offering standardised components and workflows, allowing developers to focus on the higher-level logic of their applications rather than the low-level details of coordinating different models [34].

2.5.1 LangChain & LangGraph

LangChain is a framework for developing applications based on LLMs. It provides an interface for interacting with LLMs in simple linear workflows, while also offering standardised components for AI application functionalities such as model interactions, retrieval mechanisms, and integrations with various data sources [35].

LangGraph is an orchestration framework designed for creating multi-agent systems [36]. While it integrates well with LangChain, and is created by the same company, it can also be used independently. Unlike LangChain's sequential workflow approach, LangGraph enables conditional workflows using directed graphs. It supports key features such as looping, conditional branching, and state management, allowing agents to dynamically adjust their behaviour based on evolving tasks.
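The sketch below shows how such a conditional workflow can be expressed with LangGraph, loosely echoing the two-component routing used later in the artifact (chapter 6). It is a sketch under assumptions, not the artifact's implementation: the node bodies are placeholders of our own (a real router would classify the question with an LLM rather than a keyword check), and the API shown (`StateGraph`, `add_conditional_edges`, `compile`) may differ between LangGraph versions.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class ChatState(TypedDict):
    question: str
    category: str
    answer: str

def classify(state: ChatState) -> dict:
    # Placeholder router; a real system would ask an LLM to classify the question.
    q = state["question"].lower()
    return {"category": "employment" if "salary" in q else "guidelines"}

def guidelines_agent(state: ChatState) -> dict:
    # Would perform RAG over HR guideline documents.
    return {"answer": f"Guideline answer to: {state['question']}"}

def employment_agent(state: ChatState) -> dict:
    # Would query structured employment data.
    return {"answer": f"Employment answer to: {state['question']}"}

graph = StateGraph(ChatState)
graph.add_node("classify", classify)
graph.add_node("guidelines", guidelines_agent)
graph.add_node("employment", employment_agent)
graph.add_edge(START, "classify")
# Conditional branching: route on the category written by the classifier node.
graph.add_conditional_edges(
    "classify",
    lambda state: state["category"],
    {"guidelines": "guidelines", "employment": "employment"},
)
graph.add_edge("guidelines", END)
graph.add_edge("employment", END)

app = graph.compile()
print(app.invoke({"question": "How many vacation days do I get?"})["answer"])
```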
2.6 Guardrails

The non-deterministic, black-box nature of LLMs introduces several risks. Bias in training data can, for example, lead to outputs that reflect societal prejudices. Another challenge is inconsistency: an LLM may produce different answers to the same prompt, which can be particularly problematic in applications requiring reliability, such as question-answering systems. This unpredictability can erode user trust and undermine confidence in LLM-based applications [37, 38].

To address these issues, the concept of guardrails has been introduced. Guardrails are mechanisms designed to monitor and filter the inputs and outputs of LLMs, helping to mitigate potential risks [38]. They analyse input prompts and generated responses to determine whether intervention is required to prevent harmful, biased, or incorrect outputs. Guardrails serve as a protective layer within LLM-based systems, reducing the likelihood of exposing sensitive data and limiting the sharing of misleading or inappropriate content [38].

Although guardrails enhance security and reliability, they do not necessarily improve robustness against hostile attacks. Research by Shen et al. [39] indicates that guardrails provide only limited resistance to jailbreak attacks, which are prompt manipulations designed to bypass safeguards and elicit harmful content. Their study found that while guardrails marginally reduce the success rate of such attacks, they do not fully prevent them. This highlights the ongoing need for further advancements in LLM safety mechanisms, even in systems that incorporate guardrails.
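A minimal rule-based sketch of the input/output filtering idea is shown below. The patterns and the `call_llm` helper are illustrative assumptions of our own; production guardrails typically combine such rules with classifier models or dedicated guardrail frameworks rather than relying on regular expressions alone.

```python
import re

# Assumed example patterns; not an exhaustive or recommended rule set.
BLOCKED_INPUT_PATTERNS = [
    r"ignore (all|previous) instructions",   # a simple prompt-injection tell
    r"reveal your system prompt",
]
SENSITIVE_OUTPUT_PATTERNS = [
    r"\b\d{6}[-+]?\d{4}\b",                  # e.g. Swedish personal identity numbers
]

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for the underlying model call.
    return f"[LLM response to: {prompt}]"

def guarded_call(prompt: str) -> str:
    # Input guardrail: intervene before the prompt reaches the model.
    if any(re.search(p, prompt, re.IGNORECASE) for p in BLOCKED_INPUT_PATTERNS):
        return "Your request could not be processed."
    response = call_llm(prompt)
    # Output guardrail: intervene before the response reaches the user.
    if any(re.search(p, response) for p in SENSITIVE_OUTPUT_PATTERNS):
        return "The response was withheld because it may contain sensitive data."
    return response

print(guarded_call("How many vacation days do I get?"))
```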
2.7 LLM-as-a-judge

Coined by Zheng et al. [40], the term LLM-as-a-judge refers to using LLMs as evaluators for tasks that typically require human judgment, such as assessing the quality of chatbot responses in open-ended dialogue. This approach addresses a key limitation of traditional benchmarks, which often fail to capture how well models align with human preferences. By contrast, LLM-based judges can offer a scalable and efficient alternative to human evaluation.

To test the viability of this approach, Zheng et al. [40] developed two benchmarks. Their findings show that the most commonly used LLM at the time, GPT-4, when used as a judge, agrees with human preferences over 80% of the time, comparable to the agreement rate between human annotators themselves.

While promising, the study also highlights several limitations, including susceptibility to biases (e.g. favouring the first-listed response or more verbose answers) and occasional failures in evaluating complex tasks requiring precise reasoning. Despite these issues, the results suggest that, when carefully applied, LLM-as-a-judge can serve as a practical and surprisingly reliable proxy for human evaluation in many settings.

2.7.1 DeepEval

DeepEval [41] is an open-source evaluation framework designed to assess the performance of LLM-based systems. By leveraging LLM-as-a-judge, DeepEval supports a variety of evaluation tasks across different types of LLM applications, including, but not limited to, RAG systems.

Among the evaluation metrics it offers for RAG scenarios are faithfulness, answer relevancy, and contextual relevancy. Originally introduced in the RAGAS framework by Es et al. [42], these metrics are defined as follows:

• Faithfulness measures how accurately the generated answer reflects the retrieved context, aiding in identifying hallucinations.
• Answer relevancy evaluates the degree to which the generated response directly addresses the user's question. The metric does not take factuality into account but instead focuses on completeness and focus, penalising responses that are irrelevant, incomplete, or verbose.
• Contextual relevancy assesses how relevant the retrieved context used to generate the answer is to the input question. The context should be focused and contain as little irrelevant information as possible.

DeepEval also provides the capability to create custom evaluation metrics through the use of G-Eval [43]. G-Eval is a framework that enables the evaluation of outputs based on user-defined criteria. For instance, it can be employed to assess the correctness of a given output. This is achieved by specifying both the evaluation criteria and the corresponding evaluation steps. An example of criteria and evaluation steps for a custom correctness metric is given below.

• Criteria: Determine whether the actual output is factually correct based on the expected output.
• Evaluation steps:
  – Check whether the facts in actual output contradict any facts in expected output.
  – Heavily penalise omission of detail.
  – Vague language, or contradicting opinions, are acceptable.

This approach enables the creation of metrics that are not predefined in the DeepEval framework, offering greater versatility when evaluating the outputs of the LLM [43].
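In code, the correctness metric above might look roughly as follows. This is a sketch against DeepEval's public API as we understand it (class and parameter names such as `GEval`, `LLMTestCase`, and `evaluation_params` may vary between versions), and running it requires a configured judge model, by default an OpenAI API key. The test-case contents here are invented for illustration.

```python
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Custom correctness metric built with G-Eval, mirroring the steps above
# (a `criteria` string can be supplied instead of explicit steps).
correctness = GEval(
    name="Correctness",
    evaluation_steps=[
        "Check whether the facts in 'actual output' contradict any facts in 'expected output'.",
        "Heavily penalise omission of detail.",
        "Vague language, or contradicting opinions, are acceptable.",
    ],
    evaluation_params=[
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
)

# A predefined RAG metric can be evaluated alongside the custom one.
relevancy = AnswerRelevancyMetric(threshold=0.7)

test_case = LLMTestCase(
    input="How many vacation days do I get?",
    actual_output="You are entitled to 30 vacation days per year.",
    expected_output="Employees receive 30 vacation days per year.",
    retrieval_context=["Employees are entitled to 30 vacation days per year."],
)

evaluate(test_cases=[test_case], metrics=[correctness, relevancy])
```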
3 Related Work

This chapter reviews existing research relevant to the thesis. It begins with an examination of literature on trust in LLM-based systems, followed by a presentation of key challenges in multi-agent systems. The chapter then explores research on collaboration within such systems and concludes with an overview of two approaches aimed at enhancing RAG.

3.1 Trust in LLMs

Trust in AI has been studied since long before the rise of LLMs, as evidenced by an empirical research review by Ella and Wooley [44]. However, as LLMs become more widely used, it is important to understand the key factors that influence trust in these systems. Liu et al. [4] and Huang et al. [45] conducted extensive literature reviews and developed taxonomies of trust factors while designing benchmarks to evaluate LLMs. Although their work focuses on assessing the models themselves, and not complete systems incorporating them, the same trust factors remain relevant, as they ultimately relate to how users perceive and trust LLM-generated content.

Liu et al. [4] categorise trust into several key areas, including reliability, safety, and explainability & reasoning. They state that reliability refers to the accuracy and consistency of outputs while minimising errors. Safety involves protecting sensitive information, while explainability & reasoning focuses on how well a system can justify its responses and provide clear explanations.

Huang et al. [45] propose a similar framework with some differences in classification. Their taxonomy includes truthfulness, which emphasises providing correct information; privacy, which is treated as a separate category rather than a subset of safety; and transparency, which relates to how openly a system communicates how it generates its outputs.

Schwartz et al. [46] add to this by identifying key factors that enhance trust in LLM-based systems: reliability, which they define as consistently delivering high-quality, accurate results; openness, ensuring transparency regarding system capabilities, limitations, and reliability; task characteristics, adapting responses based on task type and complexity; and trust trajectory, recognising the importance of first impressions while providing opportunities to rebuild trust through subsequent accurate outputs.

3.2 Challenges with multi-agent LLM-based systems

Han et al. [12] emphasise challenges with multi-agent LLM-based systems that remain inadequately addressed in the literature. The paper summarises these challenges into four categories, as follows:

• Optimising task allocation to leverage agents' unique skills and specialisations.
• Fostering robust reasoning through iterative debates or discussions among a subset of agents to enhance intermediate results.
• Managing complex and layered context information, such as context for overall tasks, for single agents, and for common knowledge shared between agents, while ensuring alignment with the general objective.
• Managing various types of memory that serve different objectives, consistent with the interactions in a multi-agent system.

3.3 Collaboration in multi-agent systems

There are numerous ways to facilitate collaboration among agents in a multi-agent system. To summarise the various approaches, Tran et al. [47] conducted a survey and proposed a framework for LLM-based multi-agent systems. In doing so, they identified three primary categories of multi-agent collaboration in the literature: collaboration types, collaboration strategies, and communication structures.

Tran et al. [47] classify collaboration types into three subcategories:

• Cooperation, where agents align their efforts towards a shared goal. Advantages of cooperation include the ability to assign sub-tasks based on individual agent strengths and its relatively straightforward design and execution, provided the goals are clear. However, misaligned goals may lead to inefficiencies, and failures in one agent can significantly impact the entire multi-agent structure. Example scenarios for cooperative collaboration include code generation, decision-making, game environments, question answering, and recommendations.

• Competition, where agents prioritise their own objectives, even if they conflict with those of other agents. This type of collaboration encourages agents to enhance their performance and promotes adaptive strategies. However, it is crucial to have a conflict resolution mechanism to ensure competition remains beneficial to the system as a whole. Example scenarios where competition may be advantageous include debate, game environments, and question answering.

• Coopetition, a hybrid of competition and cooperation, in which agents collaborate on some tasks while competing on others. This enables the system to balance trade-offs and reach mutual agreements. However, as an under-explored area, its effectiveness and ideal applications remain uncertain. Tran et al. [47] cite negotiation, such as in policymaking systems, as the primary example scenario for coopetition.

Tran et al. [47] identify three distinct collaboration strategies for multi-agent systems:

• Rule-based, where predefined rules strictly govern agent interactions. This ensures efficiency, high predictability, consistency, and fairness. However, it also results in low adaptability to uncertainty and scalability challenges for complex tasks. Rule-based strategies are best suited to applications such as question answering, consensus-seeking, navigation, or peer-review processes.

• Role-based, where each agent assumes a predefined role and operates on segmented objectives based on its domain knowledge to support the system's overarching goals. This strategy enhances modularity and reusability while leveraging agents' specialised expertise. However, poorly defined roles can lead to rigidity, disputes, or functional deficiencies. Role-based strategies are particularly applicable to simulations of real-world environments with well-defined jobs, such as decision-making or software development.

• Model-based, where agents perform probabilistic decision-making based on input (with uncertainty in perception potentially impacting agent actions), environmental factors, and shared goals.
This probabilistic approach allows adaptability to dynamic environments and robustness to uncertainties. However, it is complex to implement and computationally expensive. Due to its adaptability, this strategy is well suited to dynamic contexts such as game environments or robotics.

Tran et al. [47] categorise communication structures into three main types:

• Centralised structure, where each agent connects to a central agent responsible for all collaboration decisions. This structure is easy to design and implement and is efficient for resource allocation. However, its reliance on a single central node creates a single point of failure, making it less resilient to disruptions. According to Tran et al. [47], centralised structures are suitable for question answering and decision-making scenarios.

• Decentralised structure, where control and decision-making are distributed among agents that operate on local information. This structure enhances resilience, as the system can continue functioning even if individual agents fail, and it is highly scalable. However, it may suffer from inefficient resource allocation and significant communication overhead. A decentralised structure is applicable to decision-making, question answering, reasoning, and code generation.

• Hierarchical structure, where agents are organised in layers, with communication primarily occurring between adjacent layers. Each layer has distinct functions, roles, and levels of authority. This structure reduces bottlenecks and facilitates task distribution among layers. However, it is highly complex, leading to increased latency and implementation challenges. Hierarchical structures are used in scenarios such as code generation, question answering, and reasoning.

Additionally, Tran et al. [47] discuss coordination and orchestration architectures, which extend beyond individual collaboration channels to manage the relationships and interactions between multiple channels. These architectures define how collaboration channels are created, ordered, and characterised. Tran et al. [47] identify two major types:

• Static architectures, which rely on predefined rules and domain expertise to establish collaboration channels. By leveraging prior knowledge, these architectures ensure interactions adhere to domain-specific requirements while improving overall system efficiency and maintaining consistent task execution. However, their dependence on accurate domain knowledge and their fixed nature result in limited scalability and flexibility.

• Dynamic architectures, which adapt to changing environments and task requirements by employing management agents or other adaptive mechanisms to assign roles and define collaboration channels in real time. While suitable for complex and evolving tasks, dynamic architectures require higher resource allocation due to real-time adjustments and present a greater risk of failure due to their fluid nature.

3.4 Multi-agent retrieval-augmented generation filtering

As previously stated, RAG has become a key technique for improving the accuracy and reliability of LLM-generated responses by incorporating external knowledge retrieval. One approach, Self-RAG, introduced by Asai et al. [48], enhances factual accuracy by allowing the model to decide when to retrieve additional information and critically assess its own outputs. This method helps improve citation accuracy and reduces the inclusion of irrelevant or misleading information.
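The control flow of this retrieve-on-demand, self-critique pattern can be sketched as below. This illustrates the general idea only, not Asai et al.'s implementation (which relies on special reflection tokens learned during training); `call_llm` and `retrieve` are the hypothetical helpers from the earlier sketches.

```python
def self_reflective_answer(query: str) -> str:
    # 1. Let the model decide whether retrieval is needed at all.
    need = call_llm("Does answering this require looking up documents? "
                    f"Reply yes or no.\nQuestion: {query}")
    if not need.strip().lower().startswith("yes"):
        return call_llm(query)

    docs = "\n".join(retrieve(query))
    draft = call_llm("Answer the question using these documents.\n"
                     f"Documents:\n{docs}\nQuestion: {query}")

    # 2. Let the model critique its own draft against the retrieved evidence.
    verdict = call_llm("Is every claim in this answer supported by the documents? "
                       f"Reply yes or no.\nAnswer: {draft}\nDocuments:\n{docs}")
    if verdict.strip().lower().startswith("no"):
        draft = call_llm("Rewrite the answer so that every claim is supported.\n"
                         f"Answer: {draft}\nDocuments:\n{docs}")
    return draft
```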
A more recent development is MAIN-RAG, proposed by Chang et al. [49], which takes a multi-agent approach to further refine the retrieval process. Their paper shows that it outperforms Self-RAG across a number of datasets. MAIN-RAG is a training-free framework that introduces three specialised agents. A Predictor retrieves documents and generates an initial answer based on each document. The Predictor then sends the documents to a Judge agent, which evaluates whether the documents provide information relevant to the query and answer, and scores and orders them accordingly; any document deemed irrelevant is filtered out at this step. Finally, the documents are sent to a Final-Predictor agent, which generates the final response based on the sources provided by the Judge.
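The sketch below captures this three-agent division of labour. It is a loose approximation for illustration only, again reusing the hypothetical `call_llm` and `retrieve` helpers from the earlier sketches; the actual MAIN-RAG prompts differ, and its adaptive score threshold is replaced here by an assumed fixed cut-off of 0.5.

```python
def predictor(query: str) -> list[tuple[str, str]]:
    # Predictor: retrieve documents and draft one answer per document.
    docs = retrieve(query, k=5)
    return [(doc, call_llm("Answer using only this document.\n"
                           f"Document: {doc}\nQuestion: {query}")) for doc in docs]

def judge(query: str, candidates: list[tuple[str, str]]) -> list[str]:
    # Judge: score each document's support for the query/answer pair,
    # filter out irrelevant documents, and order the rest by score.
    scored = []
    for doc, draft in candidates:
        reply = call_llm("Rate from 0 to 1 how well the document supports the answer. "
                         "Reply with a number only.\n"
                         f"Question: {query}\nDocument: {doc}\nAnswer: {draft}")
        score = float(reply)  # assumes the judge complies with the format
        if score >= 0.5:      # fixed cut-off; MAIN-RAG adapts its threshold
            scored.append((score, doc))
    return [doc for _, doc in sorted(scored, reverse=True)]

def final_predictor(query: str) -> str:
    # Final-Predictor: answer from the filtered, ordered documents.
    kept = judge(query, predictor(query))
    return call_llm("Answer the question using these documents.\n"
                    "Documents:\n" + "\n".join(kept) + f"\nQuestion: {query}")
```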
4 Method

This thesis was conducted as design science research (DSR), mainly following the methodology of Wieringa [50] and the guidelines for applying DSR in the context of a master's thesis presented by Knauss [51]. DSR focuses on the creation of a design artifact to solve a concrete problem while also gathering data about knowledge questions. In this thesis, the artifact is a multi-agent LLM-based HR chatbot with two components: one for answering questions regarding employment data, and one for answering questions based on HR guideline documents. The chatbot is designed with the goal of understanding which design choices can address trust factors and in turn foster trust in such a system.

4.1 Design science research

Wieringa [50] describes design science research as an iterative problem-solving methodology structured around the "regulative cycle", illustrated in figure 4.1, which comprises five phases: problem investigation, solution design, design validation, solution implementation, and implementation evaluation.

[Figure 4.1: The regulative cycle of design science research.]

Knauss [51] groups these phases into three broader categories:

• Problem: Includes problem investigation, where the research problem is explored and analysed.
• Solution: Covers solution design and design validation, focusing on developing and validating possible solutions.
• Evaluation: Encompasses evaluation, where the effectiveness and usability of the proposed solution are assessed.

Additionally, the implementation phase represents the artifact. Throughout the iterative cycles, the artifact undergoes incremental work, continuously evolving based on the insights gathered during the other phases. This thesis was conducted through two cycles, described further in chapter 5 and chapter 6 respectively.

In alignment with Knauss's guideline 3 [51], the research questions in this thesis are formulated to correspond to the three main categories: RQ1 addresses problem understanding, RQ2 focuses on potential solutions, and RQ3 is connected to the evaluation of the proposed solution.

4.1.1 Problem investigation

The purpose of the problem investigation phase is to gather information in order to understand the given problem, as well as to describe and explain it. Wieringa [50] presents four non-exclusive reasons for investigating the problem:

• Problem-driven investigation, where there is a concrete problem that needs to be understood before trying to solve it.
• Goal-driven investigation, where the investigation is motivated not necessarily by a problem but by some ambition to achieve change.
• Solution-driven investigation, where a technology's potential to improve or solve a problem is analysed.
• Impact-driven investigation, also called evaluation research, where the focus is on evaluating the impact of past actions instead of preparing for future solutions.

In this thesis, two main problem investigation approaches were employed. Problem-driven investigation was primarily used to address RQ1, which focuses on understanding the issue of trust factors in an LLM-based system. In contrast, solution-driven investigation was mainly applied to RQ2, which explores potential solutions to address the trust factors identified in RQ1.

4.1.2 Solution design

The solution design phase, as described by Wieringa [50], involves formulating possible solutions to the identified problem. These designs, which he refers to as solution suggestions, serve as just that, suggestions, rather than definitive answers, as they have not yet been validated or implemented. Solution designs can take various forms, including natural language descriptions, sketches, blueprints, mathematical models, or prototypes.

Wieringa highlights that solution design is not a fixed plan from the beginning. Rather, it is a process that involves uncertainty, with the proposed solution developing further as it is evaluated and tested. A solution suggestion does not describe an existing reality, explain past events, or predict future outcomes. Instead, it outlines a possible course of action that helps stakeholders move from uncertainty ("we are uncertain about what to do") to confidence ("we are sufficiently certain about what to do").

4.1.3 Design validation

During the design validation phase, the design is investigated with the purpose of understanding whether it would indeed bring stakeholders closer to their goals. Wieringa [50] states that there are three important knowledge questions that need to be answered in this phase:

• Internal validity: If the design were to be implemented, would it satisfy the criteria identified in the problem investigation?
• Trade-offs: How do different designs compare to each other if implemented in this context?
• External validity: Does the design, if implemented in another context, satisfy the criteria?

The solution design and design validation in this thesis were primarily conducted through literature review, complemented by a two-day workshop with AI experts from the collaborating company. The setup and findings of this workshop are presented in detail in chapter 5.

4.1.4 Implementation

As stated by Wieringa [50], the implementation phase in DSR depends on the nature of the designed solution. If the goal of the research was to develop a method, framework, or process to address a practical problem, then implementation involves executing this process in a real-world setting. However, if the research focused on testing the viability of a proposed solution, implementation consists of conducting the planned evaluations or experiments.

The final implementation in this thesis resulted in a multi-agent LLM-based HR chatbot, the final artifact. This artifact is presented in detail in section 6.1.

4.1.5 Evaluation

As outlined by Hevner et al. [52], evaluation constitutes a fundamental component of the research process, ensuring the effective integration of the artifact within the technical infrastructure.
A rigorous evaluation requires the establishment of appropriate metrics to accurately assess the quality of the implementation. As Hevner et al. [52] emphasise, evaluation plays a critical role in the iterative research process, facilitating the identification of deficiencies and informing necessary improvements for subsequent development cycles.

Knauss [51] recommends adhering to Hevner et al.'s [52] established evaluation methodologies to align this phase with RQ3. These methodologies include observational, analytical, experimental, testing, and descriptive evaluations.

The final artifact developed in this thesis was evaluated using both quantitative and qualitative methods. The quantitative evaluation involved an experimental simulation, where the artifact was executed with artificial data, described in detail in section 6.2. In addition, a qualitative evaluation was conducted through interviews with potential users, as outlined in section 6.3.

4.2 Overview of cycles

The two completed iterations of the regulative cycle in this thesis are visualised in figure 4.2 and detailed in the coming chapters. Cycle I primarily focused on problem investigation and preliminary design activities, including a series of interviews and a collaborative workshop with domain experts at the partner company. Cycle II primarily focused on finalising the artifact and conducting both quantitative and qualitative evaluations.

[Figure 4.2: Activities performed during the two cycles in this thesis.]

Table 4.1 presents an overview of the qualitative data collection activities, detailing the number of participants and the total time spent on each activity. Table 4.2 summarises the number of evaluation runs conducted for each metric during the quantitative data collection phase in Cycle II, as described in section 6.2.

The next two chapters describe each research cycle in detail. The chapter on Cycle I begins by outlining the methodology used during this cycle, followed by the key findings. In contrast, the chapter on Cycle II opens with a presentation of the final artifact, which serves as a reference point for the evaluation approach and results that follow.

The structure of presenting the research cycles sequentially, detailing the method and findings of Cycle I followed by those of Cycle II, was chosen to enhance readability and comprehension. Since the research questions build upon one another, understanding the findings for RQ1 is essential for interpreting the final artifact, methodology, and results of Cycle II. As such, the findings for RQ1 are presented in full within the Cycle I chapter, even though the findings were fully finalised during Cycle II.

Table 4.1: Participant counts for qualitative data collection activities.

Activity                            No. of participants    No. of hours
Problem investigation interviews                      6               6
Workshop                                              5              16
Follow-up evaluation interviews                       5            3.75
Total                                                14           25.75

Table 4.2: Number of evaluation runs per quantitative metric.

Metric                               No. of evaluation runs
Answer relevancy                                        360
Faithfulness                                            360
Contextual relevancy                                    360
Robustness (answer relevancy)                           500
Robustness (faithfulness)                               500
Robustness (contextual relevancy)                       500
Correctness                                             280
Total                                                  2860

5 Cycle I

The first cycle focused primarily on understanding the problem space regarding trust in LLM-based systems and exploring potential solutions. As such, it placed greater emphasis on the first three phases of the regulative cycle, problem investigation, solution design, and design validation, aligning closely with RQ1 and RQ2.
The problem investigation followed a problem-driven approach, aiming to identify trust factors associated with an LLM-based HR chatbot. To achieve this, interviews were conducted with potential users, followed by a thematic analysis of the results to extract key insights into the factors that impact their trust. The thematic analysis was done separately by each author, and the results were then merged to mitigate bias.

Additionally, the investigation extended to exploring design choices and components of multi-agent chatbot systems, primarily through a review of existing research. These potential design solutions were then explored and validated through participation in a two-day workshop with experts in building LLM-based systems. The workshop facilitated discussions on various design strategies, and feedback from experts during the workshop served as an initial form of validation for these design choices.

Although the primary focus of this cycle was on problem investigation and solution exploration, a preliminary implementation was undertaken to test basic functionality. The purpose of this early implementation was to explore high-level considerations, such as which frameworks to use, how an agentic RAG system functions, and which LLMs are compatible and can be effectively utilised. The evaluation of this early prototype was basic and relied on human judgment by the authors, supported by insights from the literature review on what appears to be most suitable for an HR chatbot in practice.

This chapter outlines the methodology and findings from Cycle I of the thesis. It begins by presenting the qualitative data collection methods, including the approach used for the problem investigation interviews, the subsequent thematic analysis, and the setup of the expert workshop. The second part of the chapter focuses on the findings from this cycle, starting with the trust factors identified through the interviews and concluding with key insights from the workshop discussions.

5.1 Method - Qualitative data collection

To gain a deeper understanding of trust in LLM-based systems and to explore different solution designs, Cycle I employed two qualitative data collection methods: interviews conducted as part of the problem investigation, and a workshop with experts focused on the design and implementation of LLM-based systems.

5.1.1 Interviews

To obtain qualitative data about trust in our initial problem investigation, six interviews were conducted with employees at the partner company. These interviews were designed based on the guidelines provided by McNamara [53] and Patton [54]. The format used was the standardised, open-ended interview, where all interviewees were asked the same open-ended questions and could respond freely in their own words. In some instances, follow-up questions were posed to encourage further elaboration from the interviewees.

This interview format was chosen because, as McNamara states, it "facilitates faster interviews that can be more easily analysed and compared" [53]. As noted, the questions were constructed in accordance with McNamara's guidelines [53], which emphasise important principles such as question neutrality, the use of open-ended wording, and smooth transitions between major topics.

For the sampling method, snowball sampling, also called chain sampling [54], was used.
In this method, an industry supervisor with extensive knowledge of who would be information-rich key informants was tasked with reaching out to and recruiting such participants. The interviewees had varying levels of AI knowledge and roles within the company, ensuring diverse perspectives.

5.1.1.1 Problem investigation interview setup

The interviews were conducted with each of the six employees as part of the problem investigation. The questions touched on four overarching subjects:
• Background and demographic information,
• Knowledge and experience with AI & LLMs,
• Attitudes, opinions, and trust in AI,
• HR system specific questions.

All interviews were conducted remotely and lasted approximately 60 minutes. With the participants' consent, the interviews were recorded and automatically transcribed. The transcripts were then reviewed and corrected in phase one of the thematic analysis. The interview guide used for this round of interviews can be found in appendix A.1.

5.1.1.2 Thematic analysis

To analyse the interviews, we employed thematic analysis, following the guidelines established by Braun and Clarke [55], who define thematic analysis as "a method for identifying, analysing and reporting patterns (themes) within data." Braun and Clarke [55] outline a structured approach that consists of five key phases, each of which is detailed below. Importantly, they note that thematic analysis is not strictly a linear process but rather a recursive one, where movement between phases is necessary to refine and develop themes.

1. Familiarising yourself with your data: This phase involves immersing oneself in the collected data through repeated active reading. Initial notes and potential codes should be marked for later phases. Transcription plays a key role in deepening familiarity, and if transcription has been conducted by others or through automated tools, additional time should be dedicated to engaging with the material.

2. Generating initial codes: Following familiarisation, the data should be systematically coded to identify meaningful features of interest. Equal attention must be given to all data items, including those that challenge dominant narratives. Coding should be as comprehensive as possible within the available timeframe, preserving surrounding context and allowing for multiple codes per extract.

3. Searching for themes: This phase focuses on organising codes into broader themes by clustering related codes and exploring their interconnections. Visual tools such as tables or mind maps are recommended to facilitate the conceptual organisation. At this stage, all codes and potential themes should be retained for further consideration.

4. Reviewing themes: Candidate themes are reviewed and refined to ensure they accurately reflect patterns in the data. This process involves two levels: first, evaluating coherence within each theme's coded extracts; second, assessing the thematic structure in relation to the full dataset. Additional coding may be required if new relevant data is identified. By the end of this phase, key themes and their relationships should be clearly defined.

5. Defining and naming themes: With a satisfactory thematic map in place, themes are further refined and clearly defined. Each theme's core meaning should be articulated, ensuring alignment with the data and avoiding excessive breadth or overlap. Sub-themes may be identified to capture nested or hierarchical relationships within the data.
5.1.2 Workshop

The data collection process during Cycle I also included a workshop, conducted at the collaborating company over two days. The workshop focused on discussions around the design and implementation of LLM-based systems within three organisational domains: Sales, HR, and Cybersecurity. The participants consisted of AI experts in each respective area, totalling seven participants including the authors.

The workshop mainly explored two use cases for LLM-based solutions: the first was an LLM-based solution for managing large volumes of internal documentation, and the second an LLM-based chatbot designed to respond to employee queries, specifically those related to employment matters, HR guidelines, and organisational policies.

The workshop served both as a means of collecting empirical data on how domain experts approach the practical implementation of AI solutions within an organisational setting and as an evaluation of the feasibility of previously studied approaches in a real-world context. Furthermore, the workshop provided insights into the trade-offs associated with different design strategies and explored how a scalable solution could be developed to support future applications across other areas of the organisation.

5.2 Findings - Cycle I

This section presents the findings from the problem investigation interviews regarding RQ1 and the findings from the workshop related to RQ2. It begins with the results of the thematic analysis, which answer RQ1 by identifying key trust factors in LLM-based systems. It then focuses on RQ2 by presenting insights from the workshop, which informed the design of the artifact in Cycle II.

5.2.1 Trust factors in LLM-based systems (RQ1)

After an extensive thematic analysis of the conducted interviews, we identified five main themes. Two of these themes are directly related to trust factors and thus address RQ1. The other three themes are more concerned with attitudes, thoughts, and concerns surrounding LLM-based systems than with trust factors directly. These three themes (AI as a helping hand; Concerns - Human interactions could be replaced; and Critical thinking - Output should be challenged and revised) are considered outside the scope of this thesis, as they do not directly correspond to trust factors in an LLM-based system.

In the following two sections, we describe in detail the remaining two main themes that were identified: External trust factors — Trust impacted by non-technical forces and Internal trust factors — Trust impacted by technical details.

Figure 5.1: Identified trust factors in LLM-based systems.

5.2.1.1 External trust factors — Trust impacted by non-technical forces

This theme addresses factors that are not directly influenced by the system design, but instead concern areas outside the actual artifact. As seen in figure 5.1, these factors include transparency, organisational measures (such as education on LLMs and chatbots as well as change management), and external security, which refers to security concerns that cannot be addressed within the system design.

Transparency

Transparency encompasses both transparency in the development process and transparency about the system's limitations.

One interviewee stated that transparency in how an LLM-based system is built provides users with insight into its foundations, which in turn increases trust.
Further, they said that knowing which model is being used, what approach was taken during development, and what data is being used would increase their trust in the system. Another interviewee also stressed that it is important to understand how the data you provide to an LLM-based system is stored and used.

Transparency about an LLM-based system's limitations can also help users set realistic expectations, something brought up by three out of the six interviewees. If something is presented as always correct but proves otherwise, it may lose users' trust. However, being upfront about potential inaccuracies can foster understanding and make users more forgiving. As one interviewee shared:

"That's the thing that I'm saying, is that I don't trust 100%, but I still think that we can implement things acknowledging that probably 100% is impossible, but we can be close to 100% on the output. So we can gain trust and confidence of all the people that are gonna use it." (Interviewee 3)

Change management

Another factor that emerged from the data was the importance of actual system usage in building trust, as well as the company's role in encouraging that usage. Four out of the six interviewees stated that using the system will allow it to prove itself, eliminating possible preconceptions. One interviewee reflected on their initial scepticism:

"Can you trust what it's telling you? Will it make mistakes? And I had those preconceptions, like a year and a half ago, when I first was like, you know, how is this possible to use this? But as I've used [an LLM-based tool] and seen it evolve, seen it improve, seen actually how it can benefit my work, I really see the opportunities." (Interviewee 5)

This suggests that increased use of the system, along with witnessing its evolution, can serve as a catalyst for trust development. To further encourage this usage, organisations may need to actively guide employees toward adopting LLM-based tools such as an HR chatbot, giving these tools the opportunity to demonstrate their value. Five out of the six interviewees highlighted the importance of effective change management in increasing system adoption and engagement. This included both changing user behaviour and enabling easy access to the tools.

When discussing the integration of an HR chatbot, two interviewees stated that each employee currently has an assigned local HR business partner who is readily accessible. As a result, there would be little motivation to use such a chatbot, even if it provides equivalent support. Both interviewees stated that making direct contact with HR operations less convenient could promote the use of the chatbot. Crucially, they emphasised that the chatbot must offer clear and tangible value to employees. If it is to be adopted, it should convincingly demonstrate that it is a more efficient or beneficial alternative to traditional HR contact methods.

Education

To enable this change management, education emerges as a key organisational tool to help employees understand the value of the system, thereby increasing trust and fostering adoption. Education was discussed by all interviewees, with one stating that the organisation needs to make it possible for employees to get an introduction to how to use a chatbot if one is implemented.
Another interviewee discussed the benefit of education in improving their ability to optimise the use of LLM-based chatbots and get more value out of them:

"Yeah, I really think that I would benefit from educating myself more in optimising the usage [of LLM-based chatbots]." (Interviewee 4)

External security

One factor, brought up by four out of the six interviewees, was the potential impact of the system's origins or underlying models on trust. If the company behind the system or LLM is deemed untrustworthy, users' trust in the system itself can be compromised. Further, five out of six interviewees expressed caution regarding the information they input into LLM-based systems, especially when interacting with systems not hosted by their organisation. One interviewee noted:

"For instance, obviously I played around with Deepseek, and I knew that using Deepseek was basically sending information to China. I don't think China is way, way worse than the US, but still, it's like, OK, I'm sending to another country. That's why I use this Deepseek just to play around. It was just basically doing Q&A for bullshit stuff. So nothing [sensitive]—that's why I took that into consideration. The moment I'm able to actually get, get Deepseek working on a—let's say—open-source environment, or, let's say, download and install all the way into a server, I would probably use it in a different way, that's for sure." (Interviewee 3)

Finally, regarding LLM-based systems used and approved within the company, trust can be handed over to the IT department and their expertise, with one interviewee saying: "I have no real limitations as long as I know that these tools have been embedded by corporate IT from a security standpoint" (Interviewee 1)

In summary, external trust factors centre on non-technical influences. Participants emphasised the value of transparent communication about system development and limitations, as this sets realistic expectations and builds trust. Organisational efforts like education and guided adoption were seen as essential for encouraging usage and overcoming scepticism. Trust was also shaped by concerns about where data is sent and who controls the underlying technology, highlighting that trust is built not only on what the system does, but also on who is behind it and how it is introduced.

5.2.1.2 Internal trust factors — Trust impacted by technical details

The second major theme identified was internal trust factors. These factors stem from the LLM-based systems themselves: how they perform and behave. Thus, these are factors that may be impacted by system design. As illustrated in figure 5.1, the internal trust factors identified were internal security, risk of bias, model differences, and reliability.

Internal security

Internal security relates to the importance of protecting sensitive employee data, particularly in the context of an HR chatbot. A concern brought up by two out of six interviewees was the potential for unauthorised access, for example, if the system could be exploited to retrieve other employees' information. Another concern was the system's ability to comply with internal IT and legal frameworks. As one interviewee noted:

"So we have to be compliant with all the rules that exist. We have IT processes and legal processes that must be [followed]." (Interviewee 4)

Risk of bias

Another sub-theme that emerged was the risk of bias in LLMs and LLM-based tools.
One participant, for example, expressed concern about the potential impact on diversity when such tools are used in recruitment processes. They noted that the system might favour candidates with similar educational backgrounds or professional experiences, such as coming from the same types of companies, which could unintentionally limit diversity. This was viewed as a risk that could lead to less varied and inclusive hiring outcomes. Another participant expanded on biases, discussing how they may be embedded in the training data and reflected in outputs:

"And then also, like I mentioned before, the ethics around the information that people use from it, and how these models have been built, and who has built them, and the bias. And, you know, is information it gives representative of the wider population? ... And that does really concern me from a kind of diversity, equality point of view, because a lot of work has been done previously to promote different types of voices on different topics." (Interviewee 5)

Model differences

Five out of six participants also commented on perceived differences between various LLMs and LLM-based tools, which we have categorised as the theme model differences. Comparisons were for instance made between internal company tools and more widely available tools such as ChatGPT, with one interviewee stating that:

"Tools that we have in [Company] as of today—I mean, they are good. But I think they are not that good, obviously, like ChatGPT." (Interviewee 3)

Reliability

The factor of reliability refers to the system's ability to consistently produce accurate, high-quality responses. All participants emphasised reliability as a central trust factor when interacting with LLM-based systems. They highlighted that if the system frequently produces incorrect answers, trust is quickly diminished. When discussing what might deter them from using an LLM-based system, one interviewee stated:

"No, but recurring inaccuracies, I think. That would have made it so [I felt] 'But no, it's not worth the time. I'll have to look it up myself' or something like that. So yes, repeated inaccuracies would have caused my trust to decrease." (Interviewee 4)

Another participant echoed this sentiment, describing how even a single mistake in practical information could lead to reduced usage:

"I think it's the reliability of the information [that is important for usage]. Of course, if I use the chatbot and ask the chatbot how many remaining vacation days I have, I get an answer, and then the answer proves to be the wrong one. I might not use it again easily." (Interviewee 1)

These responses underline the importance of providing accurate responses from the outset. If a system fails to meet expectations early on, trust may be damaged and difficult to rebuild later. This concern was also reflected in discussions about implementation strategy. One interviewee suggested that a gradual rollout of an HR chatbot could help identify and resolve early issues before exposing it to a wider audience, thus avoiding the risk of discouraging users who may perceive the tool as unreliable.

"So it depends a bit [on] the purpose of it, but you can collect a lot of feedback by rolling out something real that you then improve as you go along, and then you roll it out a little wider, like, until you have something that really doesn't have a lot of teething problems and that can provide value. Because if it doesn't provide value, people won't use it." (Interviewee 6)
Another aspect of reliability that emerged was the importance of source citation in LLM-based systems. Two participants expressed greater trust in systems that provide sources for their outputs, as it allows users to verify the information and better understand where it comes from. When asked about their level of trust in LLM-based chatbots, one participant responded:

"I think it would be probably if—if I have the—the—the sources mentioned, like in Copilot, I would say probably 8 or 9 out of 10." (Interviewee 1)

When asked how their trust would be affected if sources were not provided, the same participant explained that they would feel the need to challenge the output more actively, by comparing it across different chatbots and questioning the origin of the information.

In summary, the internal trust factors identified by participants reveal how trust in LLM-based tools is closely tied to the system's technical performance. Issues such as hallucinations, lack of source transparency, and data privacy risks emerged as key concerns. Trust was also found to be fragile, easily lost through early errors and difficult to rebuild, underscoring the need for high initial system performance and thoughtful implementation.

5.2.2 Findings from workshop (RQ2)

This section outlines key findings from the workshop regarding design choices in developing an agent-based HR chatbot. A primary concern raised in the workshop was the importance of reliability, a quality also emphasised in the interview findings around trust factors. Consequently, many of the design discussions revolved around strategies for improving the reliability of the chatbot.

A core principle agreed upon was that the chatbot should avoid providing incorrect answers. If the system cannot provide a sufficiently accurate or complete response, it should explicitly state this to the user. The consensus was clear: it is better to give no answer than an incorrect one. This approach supports both the reliability and transparency of the system.

The HR chatbot use case was generally seen as relatively simple in nature. Its primary function is to retrieve relevant information from HR documentation and systems and present it in response to user queries. Unlike systems that require complex reasoning or computation, this task primarily involves information retrieval and summarisation. Accordingly, one participant advised against over-engineering the system architecture, saying "Don't over-engineer the agent structure for a simpler use case".

To enable effective document retrieval, the discussions concluded that a RAG approach was the most suitable. However, rather than using a basic RAG pipeline with a single LLM retrieving and generating responses, an enhanced RAG architecture was proposed. This would involve the inclusion of additional agents to improve answer quality and reliability. Specifically, a circular workflow was suggested, featuring a "checker agent" responsible for evaluating the quality of the generated answer in relation to the user query. If the answer is deemed insufficient, the system should loop back to revise and improve it based on this feedback.

Another important factor related to document retrieval was the format of the documents themselves. The discussions emphasised that, although company documents are often available as PDFs, LLMs perform more effectively when the files are provided in Markdown format instead.

Internal security was another important topic in the discussion.
Echoing concerns raised in the interviews, the inherent sensitivity of HR data was acknowledged as a potential risk. One mitigation strategy discussed was the integration of guardrails to limit inappropriate or insecure outputs. However, it was concluded that guardrails are not essential during the early stages of development and testing. This is due to the domain-specific nature of security requirements, which vary significantly across companies, departments, and jurisdictions. As such, defining meaningful security constraints requires detailed context that is often unavailable during early development. Thus, the primary focus at this stage should be on the performance of the system.

6 Cycle II

Building on the insights from the first cycle, the second cycle focused primarily on solution implementation and implementation evaluation. This phase therefore placed greater emphasis on the implementation stage of the regulative cycle, which resulted in the final artifact, and on the evaluation stage, aligning closely with RQ2 and RQ3.

The artifact was further developed based on the design suggestions identified during the first cycle. To streamline the development process, the system was implemented using the LLM orchestration framework LangGraph. Additional design considerations, such as defining the roles of the agents, crafting the prompts, and distinguishing between policy-related and employment-related questions, were also addressed.

To evaluate the artifact quantitatively, data was collected using the DeepEval framework in an experimental simulation [56]. Five metrics were employed: faithfulness, answer relevancy, contextual relevancy, robustness, and a custom G-Eval metric termed correctness. The results of this evaluation were analysed to determine how the artifact performed. In addition, five evaluation interviews were conducted during this iteration to collect qualitative data. These interviews aimed to assess how well the artifact addressed the trust factors identified in the first iteration.

To provide context for the remainder of the chapter, it begins by presenting the design of the final artifact. This is followed by a detailed description of the methods used for both quantitative and qualitative data collection. The quantitative section covers the evaluation metrics, the use of dummy data, and the setup of test runs, while the qualitative section outlines the evaluation interview approach. Finally, the chapter presents the findings from this cycle, including insights from the evaluation interviews and the results of the quantitative analysis.

6.1 The artifact - final solution candidate (RQ2)

This section presents the final artifact designed in the project: an HR chatbot capable of answering questions based on either HR guideline documents or specific employment data. First, a brief overview of the artifact is presented to give an understanding of how the chatbot works. Following this, the two components of the chatbot, the employment component and the guidelines component, are described in more detail, including the role of each agent within the components.

6.1.1 Overview

The HR chatbot is implemented as a Python application executed in the terminal. The chatbot is composed of two multi-agent components: the employment component and the guidelines component. Within each component, the flow used to answer a given question involves a set of agents, orchestrated through the LangGraph framework, each with a distinct role and responsibility.

When starting the chatbot, the user is asked to provide a question. After providing the question, the chatbot asks whether the question is about general HR policies or employment data, as shown in figure 6.1. The answer to this question determines which component will be used to answer the given question.

Figure 6.1: Example of choice for type of question in the chatbot.
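To make this top-level flow concrete, the following is a minimal sketch of how such a terminal entry point could route a question to the appropriate component. This is an illustration only: the two run_* functions are hypothetical placeholders for the component workflows described in the following sections, not the thesis's actual code.

def run_guidelines_component(question: str) -> str:
    # Placeholder for the enhanced RAG workflow (section 6.1.2).
    raise NotImplementedError

def run_employment_component(question: str) -> str:
    # Placeholder for the structured data workflow (section 6.1.3).
    raise NotImplementedError

def main() -> None:
    question = input("Please enter your question: ")
    # The chatbot asks which component should handle the question.
    choice = input("Is this about (1) general HR policies or (2) your employment data? ").strip()
    if choice == "1":
        print(run_guidelines_component(question))
    else:
        print(run_employment_component(question))

if __name__ == "__main__":
    main()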
6.1.2 Guidelines component

The guidelines component implements an enhanced RAG workflow consisting of four main parts, illustrated in figure 6.2:

• Vector Store: Contains indexed HR guideline document segments in Markdown format. These documents serve as the knowledge base for the chatbot.
• Judge agent: Retrieves document segments from the vector store, ranks them based on relevance, and filters out those deemed insufficiently relevant.
• Generator agent: Uses the relevant document segments identified by the judge to generate an answer to the user's question.
• Checker agent: Evaluates the generator's response against a predefined set of criteria. If the response is considered invalid, the checker provides feedback that the generator uses to produce a new answer. This feedback loop continues until one of the following conditions is met:
  – The checker accepts the generated answer as valid, in which case it is returned to the user.
  – The maximum number of three iterations is reached, in which case the system informs the user that it was unable to provide a satisfactory answer.

Figure 6.2: Structure of the HR chatbot.

Below, each part of the guidelines component is described in more detail.

Vector Store

The vector store is implemented using the FAISS library [57] and contains indexed HR guideline document segments. These document segments were embedded using the BAAI/bge-small-en-v1.5 model, developed specifically for retrieval-augmented LLM systems [58]. Document segment retrieval is performed using FAISS similarity search, which returns the top six most relevant segments to the judge agent.

Judge agent

The judge agent is responsible for filtering and ranking the retrieved document segments. It assigns each segment a relevance score between 0 and 1, where 0 indicates complete irrelevance and 1 indicates a direct and highly relevant answer to the query. The full prompt used by the judge agent is provided in appendix A.4.

Listing 6.1 shows an example of the judge's reasoning when assigning a relevance score of 1.0 in response to the question: "How many vacation days do I get?"

[JUDGE] Document relevance score: 1.0 (threshold: 0.6)
[JUDGE] Reasoning:
1. The question asks about the number of vacation days I get.
2. The document title is "Vacation Policy", which suggests that it might be relevant to the question.
4. The first section "Annual Vacation Entitlement" explicitly states that all employees are entitled to 25 paid vacation days per year, which directly answers the question.
5. The rest of the document provides additional information about vacation accrual, planning, and saving vacation days, but it is not directly related to the question.

Listing 6.1: Example of judge agent scoring a relevant document.

As illustrated in the example above, the judge uses a relevance threshold of 0.6. Segments scoring below this threshold are discarded, while the remaining segments are sorted by relevance and passed as context to the generator agent. If no document segments are deemed relevant enough, the workflow is stopped and the answer "I don't have enough information to answer this question based on the HR handbook." is returned to the user.
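To illustrate the retrieval and filtering steps just described, the sketch below retrieves the top six segments from a FAISS index built with the BAAI/bge-small-en-v1.5 embeddings (here via LangChain's community wrappers) and applies the 0.6 relevance threshold. The index path and the score_relevance stub, which stands in for the judge agent's LLM scoring call, are assumptions for illustration rather than the thesis's implementation.

from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS

RELEVANCE_THRESHOLD = 0.6  # segments scoring below this are discarded

# Embedding model used to index the HR guideline segments.
embeddings = HuggingFaceEmbeddings(model_name="BAAI/bge-small-en-v1.5")

# Assumes the Markdown guideline segments were indexed and saved beforehand;
# "hr_guidelines_index" is a placeholder path.
vector_store = FAISS.load_local(
    "hr_guidelines_index", embeddings, allow_dangerous_deserialization=True
)

def score_relevance(question: str, segment: str) -> float:
    # Stand-in for the judge agent's LLM call, which assigns each segment
    # a relevance score between 0 and 1 (prompt in appendix A.4).
    raise NotImplementedError

def retrieve_and_filter(question: str) -> list[str]:
    # FAISS similarity search returns the six most relevant segments.
    segments = vector_store.similarity_search(question, k=6)
    scored = [(score_relevance(question, d.page_content), d.page_content) for d in segments]
    # Keep only segments at or above the threshold, sorted by descending relevance.
    kept = sorted((pair for pair in scored if pair[0] >= RELEVANCE_THRESHOLD), reverse=True)
    return [content for _, content in kept]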
Generator agent

The generator agent produces an answer to the user's question using the ranked and filtered document segments provided by the judge agent. It is explicitly instructed to base its answer strictly on the provided segments, ensuring that the response is both accurate and comprehensive. Emphasis is placed on referencing the specific documents and sections from which the information is derived.

If a previously generated answer is deemed invalid by the checker agent, the feedback provided is incorporated into the generator's next attempt. This feedback-guided loop enables iterative refinement of the answer. The full prompt used by the generator agent is available in appendix A.4. Once an answer is generated, it is forwarded to the checker agent for validation. Listing 6.2 shows an example of a valid answer generated by the generator agent:

According to the Vacation document, in the Annual Vacation Entitlement section, all employees are entitled to a minimum of 25 paid vacation days per year.

Listing 6.2: Example of valid answer to the question "How many vacation days do I get?".

Checker agent

The checker agent evaluates the answer generated by the generator, using both the answer itself and the set of document segments that informed it, called the context. The evaluation is based on a predefined set of criteria, which include whether the answer: addresses the user's original question, is grounded in the provided context, and avoids introducing information not found in the document segments.

The checker performs its assessment by responding to five yes/no verification questions. A "yes" indicates that the criterion has been met, while a "no" indicates that it has not. Each response is accompanied by a rationale explaining the judgment. If the overall answer is deemed invalid, this assessment is passed back to the generator as feedback for the next iteration. If the answer is deemed valid, it is returned to the user. The full prompt used by the checker agent, including the verification questions, is provided in appendix A.4.

Listing 6.3 shows an example of the response from the checker agent for an answer to the question "Does the company handle chiro expenses?" that it deemed to be invalid:

[CHECKER] Feedback: Here is my verification response:
Q1: Yes - The answer directly answers the question about whether the company handles chiropractic expenses.
Q2: No - The answer claims that the company has a process for reimbursing chiropractor visits, but the document only mentions a reimbursement process for the healthcare allowance, not specifically for chiropractic expenses.
Q3: No - The document does not mention chiropractic expenses as an eligible or non-eligible expense for the healthcare allowance, and the answer adds information not present in the documents.
Q4: No - The answer does not cite specific document names.
Q5: No - The answer does not contain all relevant information for the question present in the documents, as the document does not mention chiropractic expenses.
ASSESSMENT: INVALID: The answer adds unsupported information and does not cite document names.

Listing 6.3: Example of checker agent assessing an answer as invalid.
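The thesis does not list the orchestration code, but the judge, generator, and checker flow described above maps naturally onto a LangGraph state graph. The sketch below is one possible wiring, assuming stand-in functions for the agents' LLM calls and a state shape invented for illustration; the three-iteration cap and the early exit when no segments pass the threshold follow the behaviour described in this section.

from typing import TypedDict
from langgraph.graph import StateGraph, END

MAX_ITERATIONS = 3  # after three invalid answers the user is informed

class GuidelinesState(TypedDict):
    question: str
    segments: list[str]  # filtered document segments from the judge
    answer: str
    feedback: str        # checker feedback fed into the next attempt
    valid: bool
    iterations: int

def generate_answer(question: str, segments: list[str], feedback: str) -> str:
    raise NotImplementedError  # generator agent's LLM call (prompt in appendix A.4)

def check_answer(question: str, answer: str, segments: list[str]) -> tuple[bool, str]:
    raise NotImplementedError  # checker agent's LLM call (prompt in appendix A.4)

def judge_node(state: GuidelinesState) -> dict:
    # retrieve_and_filter is the helper from the retrieval sketch in section 6.1.2.
    return {"segments": retrieve_and_filter(state["question"])}

def generator_node(state: GuidelinesState) -> dict:
    # Generate an answer grounded in the segments, using any checker feedback.
    answer = generate_answer(state["question"], state["segments"], state["feedback"])
    return {"answer": answer, "iterations": state["iterations"] + 1}

def checker_node(state: GuidelinesState) -> dict:
    # Answer the five yes/no verification questions and produce feedback.
    valid, feedback = check_answer(state["question"], state["answer"], state["segments"])
    return {"valid": valid, "feedback": feedback}

def after_judge(state: GuidelinesState) -> str:
    # Stop early if no segment passed the relevance threshold.
    return "generator" if state["segments"] else END

def after_checker(state: GuidelinesState) -> str:
    # Loop back to the generator until the answer is valid or the cap is hit.
    return END if state["valid"] or state["iterations"] >= MAX_ITERATIONS else "generator"

graph = StateGraph(GuidelinesState)
graph.add_node("judge", judge_node)
graph.add_node("generator", generator_node)
graph.add_node("checker", checker_node)
graph.set_entry_point("judge")
graph.add_conditional_edges("judge", after_judge)
graph.add_edge("generator", "checker")
graph.add_conditional_edges("checker", after_checker)
workflow = graph.compile()

A question would then be run through the loop with, for example, workflow.invoke({"question": q, "segments": [], "answer": "", "feedback": "", "valid": False, "iterations": 0}).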
6.1.3 Employment component

The employment component consists of three main parts, as illustrated in figure 6.2:

• Field identifier agent: Analyses the user's question and determines which data fields need to be retrieved from the dataset.
• Data retrieval node: Retrieves the specified fields, as identified by the field identifier agent, for the user's employment ID.
• Generator agent: Generates a response to the user's question using the data retrieved by the data retrieval node.

The employment component was considered a simpler use case than the guidelines component, primarily because it handles structured data fields rather than unstructured document segments. Based on insights from the workshop, this simplicity suggested that a less complex agent structure would be more appropriate. As a result, the employment component does not include a judge or checker agent. Each part of the employment component is described in more detail in the following sections.

Field identifier agent

The field identifier agent determines which available data fields are relevant to answering the user's question. The agent has access to all available fields from the dataset, along with explanations of key relationships between them. Based on this information, it produces a comma-separated list of field names, which is then passed to the data retrieval node. The full prompt for the field identifier agent is provided in appendix A.4.

Data retrieval node

The data retrieval node is not an agent but a method that takes the set of field names from the field identifier agent together with the employment ID of the user, extracts the relevant data from the dataset, and stores it in the component state. This state is accessed by the generator agent during answer generation. The dataset is in the form of a CSV file.

Generator agent

The generator agent produces an answer to the user's question based on the data retrieved by the data retrieval node. It is explicitly instructed not to perform any calculations or actions beyond what is supported by the provided data. If the required information is missing or unavailable, the generator should clearly state this. The response should be concise and professional, avoiding references to technical implementation details. The full prompt for the generator agent is provided in appendix A.4.

Listing 6.4 shows an example of an answer generated by the generator agent in response to the question: "What is my department and who is my manager?"

Your department is Production and your manager is Kelley Spirea.

Listing 6.4: Example of answer generated by the generator agent.
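As an illustration of the data retrieval node, the sketch below resolves the field identifier agent's comma-separated field list against the CSV dataset for a given employment ID, using pandas. The file name and the EmpID column are assumptions made for the example; the actual column names depend on the dataset [59].

import pandas as pd

# The employment dataset is a CSV file; the path is a placeholder.
employees = pd.read_csv("employee_data.csv")

def retrieve_fields(field_list: str, employment_id: int) -> dict:
    # field_list is the comma-separated output of the field identifier
    # agent, e.g. "Department,ManagerName".
    fields = [name.strip() for name in field_list.split(",")]
    # "EmpID" is an assumed name for the employment ID column.
    rows = employees.loc[employees["EmpID"] == employment_id]
    if rows.empty:
        return {}  # the generator agent should then state that data is missing
    record = rows.iloc[0]
    # Keep only requested fields that actually exist in the dataset, so a
    # hallucinated field name cannot crash the lookup.
    return {f: record[f] for f in fields if f in employees.columns}

The returned dictionary corresponds to what the node would store in the component state for the generator agent to draw on.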
6.2 Method - Quantitative data collection

This section outlines the metrics used during the quantitative evaluation of the artifact, the nature of the test data, and the procedure for running the evaluation.

6.2.1 Metrics

To evaluate the chatbot's performance quantitatively, the system was assessed using five distinct metrics. For the guidelines component, the DeepEval metrics used were answer relevancy, faithfulness, and contextual relevancy. These metrics are designed specifically for measuring performance in RAG-based systems, which this component is. In addition, a custom robustness metric was introduced to measure the system's consistency under input variation. Since the employment component is not a RAG-based component, a custom metric named correctness was developed. Below, the calculation of each metric is presented in more detail.

Answer relevancy is calculated as:

\[ \text{Answer relevancy} = \frac{\text{Number of relevant statements in the answer}}{\text{Total number of statements in the answer}} \]

The evaluation LLM used by DeepEval extracts all statements from the chatbot's output and classifies whether each statement is relevant to the input.

Faithfulness is calculated as:

\[ \text{Faithfulness} = \frac{\text{Number of truthful claims in the answer}}{\text{Total number of claims in the answer}} \]

All claims are extracted from the output by the evaluation LLM, which then determines whether each claim is truthful based on the context used to answer the question.

Contextual relevancy is calculated as:

\[ \text{Contextual relevancy} = \frac{\text{Number of relevant statements in the context}}{\text{Total number of statements in the context}} \]

In this case, the evaluation LLM extracts statements from the retrieved context and classifies whether each one is relevant to the specific question being answered.

Robustness is calculated as follows. For each question in the simple category (further explained in section 6.2.3), 9 reformulated versions were generated using ChatGPT. Each reformulated question (plus the original) was evaluated 5 times, resulting in 50 evaluation runs per baseline question per metric (answer relevancy, faithfulness, and contextual relevancy). The robustness score for one metric is the average across these 50 runs:

\[ \text{Robustness} = \frac{1}{50} \sum_{i=1}^{50} \text{Score}_i \]

where \( \text{Score}_i \) is the evaluation score of run \( i \) for the given metric.

Correctness is a custom metric developed using the G-Eval framework supplied by DeepEval. To evaluate correctness, the following criteria were provided to the evaluation LLM:

1. Check whether the answer from the HR API includes all relevant information from the employee data.
2. Determine if the answer directly addresses the employee's question.
3. Check if any information in the answer contradicts the available employee data.
4. Assess whether the answer clearly indicates when requested information is not available in the data.
5. The exact wording and phrasing in the answer is not important, but the response must convey the key information specified in the expected output. For example, if the expected output is "The answer should clearly state that you have not been late the last 30 days," the actual response could be "According to your records, you have 0 days late in the past month" or "You have perfect attendance with no late days in the last 30-day period." Focus on evaluating if the substance of the required information is present rather than exact word matching.

In addition to these steps, the evaluation LLM is provided with the question and an expected output formulated as an explanation of what the response should convey. An example from the evaluation runs is shown in listing 6.5:

"question": "Am I employed?",
"expected_output": "The answer should clearly state that you are currently employed"

Listing 6.5: Example of question and expected output for evaluation of employment component.
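To illustrate how such metrics can be computed, the sketch below uses DeepEval to score a single test case for the guidelines component with the answer relevancy metric, and defines a G-Eval correctness metric in the spirit of the criteria above. The specific strings, criteria wording, and model configuration are illustrative assumptions, not the thesis's evaluation code.

from deepeval.metrics import AnswerRelevancyMetric, GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# A guidelines-component test case: the chatbot's answer together with
# the document segments it was grounded in.
rag_case = LLMTestCase(
    input="How many vacation days do I get?",
    actual_output=(
        "According to the Vacation document, all employees are entitled "
        "to a minimum of 25 paid vacation days per year."
    ),
    retrieval_context=[
        "All employees are entitled to a minimum of 25 paid vacation days per year."
    ],
)

# GPT-4.1 serves as the LLM-as-a-judge, as in the thesis's test runs.
relevancy = AnswerRelevancyMetric(model="gpt-4.1")
relevancy.measure(rag_case)
print(relevancy.score, relevancy.reason)

# A G-Eval correctness metric for the employment component, judging the
# answer against an expected output describing what it should convey.
correctness = GEval(
    name="Correctness",
    criteria=(
        "Determine whether the actual output conveys the key information "
        "specified in the expected output without contradicting the data."
    ),
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
    ],
    model="gpt-4.1",
)
employment_case = LLMTestCase(
    input="Am I employed?",
    actual_output="Yes, you are currently employed.",
    expected_output="The answer should clearly state that you are currently employed.",
)
correctness.measure(employment_case)
print(correctness.score)

Averaging the per-run scores over the 50 runs for a baseline question would then yield the robustness score defined above.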
6.2.2 Dummy data

The data used for evaluating the guidelines component consisted of 14 mock HR guideline documents, formatted in Markdown. This format was selected following the results from the workshop described in section 5.2.2, where it was determined to be the most easily interpreted by LLMs. These documents, generated using an LLM, do not reflect actual HR policies or legislation but were designed to simulate a set of guidelines that the chatbot could use to respond to typical HR-related queries. The factual accuracy of the guidelines was not considered relevant for the evaluation, as the documents were treated as the "ground truth" within the context of the simulated scenario.

For the employment component, a publicly available dataset containing HR information about fictitious employees at a fictitious company was used [59]. This dataset served as the basis for evaluating how the system responded to employment-related queries, with correctness as the only evaluation metric. The dataset provided structured employee data, such as job roles, salaries, and attendance records, which was used as the ground truth for evaluating the system's accuracy in handling factual queries.

6.2.3 Test runs

The system's performance was quantitatively evaluated by running it against a set of questions commonly posed to HR staff. These questions were formulated based on a combination of documents of commonly asked questions, provided by the HR department of the collaborating company, and the available system data. The full list of questions used in the evaluation is available in appendix A.3.

For all test runs, the artifact used the llama3-70b-8192 model [60] to answer each question. These responses were then evaluated using DeepEval, with GPT-4.1, the latest model from OpenAI at the time, serving as the LLM-as-a-judge [61].

Guidelines component

For the guidelines component of the chatbot, a total of 23 questions were used, divided into three categories: simple, broader, and questions with no answers.

• Simple questions: 10 questions, each with direct answers available in the HR documents used by the system.
• Broader questions: 7 questions where the answers were less straightforward or the questions were phrased more vaguely.
• Questions with no answers: 6 questions for which the correct answers were not provided in the documents. These questions were included to assess how the system handles questions without a direct answer. Since no answers exist for these questions, metrics were deemed not applicable. Instead, the system's responses to these questions were evaluated through qualitative assessment during t