Enhancing the User Experience for AI-
driven Complex Knowledge Systems with
Natural Language Interfaces

A UX Design Approach for Diverse Professional Users in B2B

Masters thesis in Computer science and engineering

ANNIE CLAESSON, OLIVIA FRIBERG

Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2025


Master’s thesis 2025

Enhancing the User Experience for AI-driven
Complex Knowledge Systems with Natural

Language Interfaces

A UX Design Approach for Diverse Professional Users in B2B

ANNIE CLAESSON, OLIVIA FRIBERG

Department of Computer Science and Engineering
Chalmers University of Technology

University of Gothenburg
Gothenburg, Sweden 2025


Enhancing the User Experience for AI-driven Complex Knowledge Systems with
Natural Language Interfaces: A UX Design Approach for Diverse Professional Users
in B2B
A UX Design Approach for Diverse Professional Users in B2B
ANNIE CLAESSON, OLIVIA FRIBERG

© ANNIE CLAESSON, OLIVIA FRIBERG, 2025.

Supervisor: Beata Stahre Wästberg, Department of Computer Science and Engi-
neering
Advisor: Björn Berg Marklund, Recorded Future
Examiner: Staffan Björk, Department of Computer Science and Engineering

Master’s Thesis 2025
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LATEX
Gothenburg, Sweden 2025

iv


Enhancing the User Experience for AI-driven Complex Knowledge Systems with
Natural Language Interfaces: A UX Design Approach for Diverse Professional Users
in B2B
A UX Design Approach for Diverse Professional Users in B2B
ANNIE CLAESSON, OLIVIA FRIBERG
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg

Abstract
While user experience (UX) in AI systems and natural language interfaces has been
explored in previous research and frameworks, few studies specifically focus on the
business-to-business (B2B) context. Existing guidelines and design implications
have primarily been shaped by leisure focused applications where they have stemmed
from business-to-consumer (B2C) contexts. In the context of designing for profes-
sional users, there is currently a need for UX design guidelines specifically targeting
work related contexts.

This thesis aims to address this gap by exploring what factors should be consid-
ered when seeking to improve AI-driven complex knowledge systems with natural
language interfaces. It focuses on the needs of diverse professional users in B2B
contexts and examines how UX design can address these factors. The study is
guided by the following research questions: (1) What factors should be taken into
consideration when seeking to improve AI-driven complex knowledge systems with
natural language interfaces in B2B contexts? (2) What role can UX design play
in addressing factors influencing the improvement of AI-driven complex knowledge
systems with natural language interfaces in B2B contexts?

This study uses a mixed-method approach where qualitative data are collected
through data logs, interviews, and a survey to draw insights from users in real-world
B2B scenarios working with an AI- driven system provided by a threat intelligence
company.

As a result of this research, eight factors were identified to affect the user experience
of natural language interfaces for information retrieval in a B2B context. Eleven de-
sign guidelines are proposed to provide UX designers guidance in designing natural
language interfaces that support user control and customization, trust and trans-
parency, and communication of the AI’s abilities and limitations. The identified
factors and proposed guidelines offer a foundation for future research in UX design
for AI-driven systems in professional environments. This study invites for further
exploration through validation across different B2B sectors and the identification of
additional context specific design factors. This thesis contributes to bridge the UX
research gap for AI-driven natural language interfaces in professional contexts by
offering practical insights to support trust, effective information retrieval, and user
friendly interactions.

Keywords: UX-design, Usability, B2B, Human computer interaction, Design princi-
ples.

v


Acknowledgements
We would like to express our gratitude to the company Recorded Future for providing
us this exciting thesis opportunity. For continuous support, insightful discussions,
and guidance throughout the project, we would like to express a special thanks
to our supervisor Björn Berg Marklund. We would also like to thank the product
design team and the AI development team at Recorded Future for their support,
guidance, and contribution with their expert knowledge. Lastly, we would like to
acknowledge our academic supervisor Beata Stahre Wästberg for her commitment
to our project. Her valuable academic guidance and encouraging support have been
greatly appreciated throughout this thesis.

Annie Claesson, Olivia Friberg, Gothenburg, 2025-06-10

vi


Contents

List of Abbreviations xi

List of Figures xiii

List of Tables xv

1 Introduction 1
1.1 Aim and Goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Scope and Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Stakeholders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.4.1 Chalmers University of Technology . . . . . . . . . . . . . . . 2
1.4.2 Recorded Future . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.4.3 End Users . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Background 4
2.1 The Role of UX in B2B Contexts . . . . . . . . . . . . . . . . . . . . 4

2.1.1 UX and Usability . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.1.1 Definition of UX . . . . . . . . . . . . . . . . . . . . 5
2.1.1.2 Definition of Usability . . . . . . . . . . . . . . . . . 5

2.2 Conceptual Background of the Technical Scope . . . . . . . . . . . . 5
2.2.1 AI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.2 Natural Language Interfaces (NLI) . . . . . . . . . . . . . . . 6
2.2.3 Recorded Future AI . . . . . . . . . . . . . . . . . . . . . . . . 6

3 Theory 7
3.1 Information System Success and Technology Acceptance . . . . . . . 7
3.2 Affordance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.3 Explainable Artificial Intelligence . . . . . . . . . . . . . . . . . . . . 9

3.3.1 Human centered explainable AI . . . . . . . . . . . . . . . . . 9
3.4 Developing trustworthy systems . . . . . . . . . . . . . . . . . . . . . 10

3.4.1 HCAI Framework . . . . . . . . . . . . . . . . . . . . . . . . . 11
3.5 Effectiveness and Efficiency in AI-driven systems . . . . . . . . . . . . 12

3.5.1 Measuring efficiency and effectiveness . . . . . . . . . . . . . . 12
3.5.1.1 Measuring efficiency . . . . . . . . . . . . . . . . . . 13
3.5.1.2 Measuring effectiveness . . . . . . . . . . . . . . . . . 13

vii


Contents

3.6 Effective Workflows in Natural Language Interfaces . . . . . . . . . . 14

4 Methodology 15
4.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

4.1.1 Literature Study . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.1.2 Questionnaire . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4.1.3 Qualitative Interviews . . . . . . . . . . . . . . . . . . . . . . 16

4.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.2.1 Thematic Analysis . . . . . . . . . . . . . . . . . . . . . . . . 16
4.2.2 Log Data Analysis . . . . . . . . . . . . . . . . . . . . . . . . 17
4.2.3 Personas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.2.4 User Journey Mapping . . . . . . . . . . . . . . . . . . . . . . 18

4.3 Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.3.1 Delphi method . . . . . . . . . . . . . . . . . . . . . . . . . . 19

5 Process 22
5.1 Define Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.2 Literature Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.3 Questionnaire . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

5.3.1 Findings from Questionnaire . . . . . . . . . . . . . . . . . . . 24
5.4 Interviews . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

5.4.1 Findings from Interviews . . . . . . . . . . . . . . . . . . . . . 26
5.5 Analysis of User-AI Interaction . . . . . . . . . . . . . . . . . . . . . 26

5.5.1 Feedback Data Analysis . . . . . . . . . . . . . . . . . . . . . 27
5.5.1.1 Findnings from User Feedback Data Analysis . . . . 27

5.5.2 Session Data Analysis . . . . . . . . . . . . . . . . . . . . . . 28
5.5.2.1 Findings from Session Data Analysis . . . . . . . . . 28

5.6 Creation of Guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.6.1 Mapping Identified Factors under Affordance, Trust, and Ef-

fectiveness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
5.6.1.1 Iteration 1 - Presentation of Initial Guidelines . . . . 32

5.7 Expert Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.7.1 Results from Expert Evaluations . . . . . . . . . . . . . . . . 35

5.7.1.1 Iteration 2 - Expert Interview . . . . . . . . . . . . . 35
5.7.1.2 Iteration 3 - Expert Panel 1 and 2 . . . . . . . . . . 36
5.7.1.3 Iteration 4 - Expert Panel 3 . . . . . . . . . . . . . . 37

5.8 Thesis Writing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

6 Result 39
6.1 Overview of Factors Emerging from Thematic Analysis . . . . . . . . 39
6.2 Design Guidelines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

7 Discussion 43
7.1 Reflecting on the Result . . . . . . . . . . . . . . . . . . . . . . . . . 43

7.1.1 Designing for Trust . . . . . . . . . . . . . . . . . . . . . . . . 43
7.1.2 Miscommunication in Natural Language Interactions . . . . . 45
7.1.3 The Scope of the Guidelines . . . . . . . . . . . . . . . . . . . 47

viii


Contents

7.2 Reflection on Methodologies and Process . . . . . . . . . . . . . . . . 48
7.3 Reflecting on Validity and Generalizability . . . . . . . . . . . . . . . 50
7.4 Ethical concerns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
7.5 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
7.6 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

8 Conclusion 52

Bibliography 53

ix


Contents

x


List of Abbreviations

Abbreviation Definition

AI Artificial Intelligence

B2B Business-to-business

HCXAI Human Centered Explainable Artificial Intelligence

HCAI Human Centered Artificial Intelligence

KRNW Knowledge Resource Nomination Worksheet

NLI Natural Language Interface

NLP Natural Language Processing

UX User Experience

XAI Explainable Artificial Intelligence

xi


Contents

xii


List of Figures

3.1 HCAI two- dimensional framework after Shneiderman.[25, p. 60] . . . 11
3.2 Efficiency formula by Alabbas and Alomar. [26] . . . . . . . . . . . . 13
3.3 Scaled Effectiveness formula by Alabbas and Alomar. [26] . . . . . . 13
3.4 Normalised Effectiveness formula by Alabbas and Alomar. [26] . . . . 13

4.1 Visualization of the thematic analysis process described by Braun and
Clarke.[37] . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

4.2 Visualization of the Data log analysis process by Dumais et al. [38].
Illustration by the authors. . . . . . . . . . . . . . . . . . . . . . . . . 18

4.3 Delphi method process after Okoli and Pawlowski. [46] . . . . . . . . 20

5.1 Execution of the research process. . . . . . . . . . . . . . . . . . . . . 22
5.2 User experience rating of the AI Reporting, showing qualitative feed-

back related to a specific rating. The citations are not direct quotes
from survey responses, but rather anonymized and rewritten as state-
ments to exemplify the nature of feedback from each category. . . . . 25

5.3 Results showing underlying patterns connected to miscommunications. 29
5.4 Initial Guidelines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5.5 An adjusted approach to the Delphi method, based on the framework

by Okoli and Pawlowski [46] . . . . . . . . . . . . . . . . . . . . . . . 34

xiii


List of Figures

xiv


List of Tables

5.1 Factors mapped to focus areas; Affordance, Trust, and Effectiveness. . 31

xv


List of Tables

xvi


1
Introduction

AI-driven complex knowledge information systems play a central role in supporting
professionals within business-to-business (B2B) contexts, by facilitating knowledge
discovery and informing decision-making [1]. With the rapid growth of AI-driven
systems, the opportunity for effective information access and analysis has improved
significantly[2], [3]. However, despite these advances, many AI-driven systems face
challenges in meeting the diverse needs of professional users, particularly in terms
of usability, transparency, and trust [4].

Cybersecurity is an industry where AI can have a crucial role with its computa-
tional power and capabilities to process large amounts of data, providing intelligent
cybersecurity services and management [5]. This study explores how UX design
can bridge the gap between system complexity and usability in B2B settings. The
study will be conducted together with Recorded Future, a company that provides
AI-driven solutions to detect and address potential cybersecurity threats. With so-
lutions like automated threat detection, they aim to give organizations real-time
contextual intelligence to enhance their decision-making. A central feature of the
platform is a natural language interface (NLI), which enables analysts to interact
with threat intelligence using conversational input. [6]

UX research has traditionally centered on consumer and leisure applications, while
work related contexts have received less attention [7]. In the context of designing
for professional users, there is currently a need for UX design guidelines specifically
addressing work contexts and professional users [7].

1.1 Aim and Goal

The study focuses on collecting insights regarding the challenges and needs of diverse
professional users when interacting with AI-driven complex knowledge information
systems. The aim is to provide insights that align with both industry needs and
academic advancements, ensuring a practical and research-driven approach to chal-
lenges within AI-driven information systems. With the gained insights the goal is
to propose a set of UX guidelines adapted to support UX designers in enhancing
the design and usability of such systems within a B2B context.

1


1. Introduction

1.2 Research Questions
The study addresses the following research questions:

• RQ1: What factors should be taken into consideration when seeking to im-
prove AI-driven complex knowledge systems with natural language interfaces
in B2B contexts?

• RQ2: What role can UX design play in addressing factors influencing the
improvement of AI-driven complex knowledge systems with natural language
interfaces in B2B contexts?

1.3 Scope and Limitations
This research explores factors affecting the user experience of AI-driven knowledge
systems with natural language interfaces in B2B contexts, and how they could be
addressed to enhance the user experience.

To understand the complexity of a diverse user groups needs, challenges and per-
spectives, methods that enable a broad data collection within the thesis limited
timeframe will be prioritized. Given the limited time, the study focuses on explor-
ing how design solutions could impact user trust, system affordance and perceived
effectiveness in AI-driven systems.

This thesis specifically aims to develop guidelines for NLI’s designed for information
retrieval, recognizing that other NLI solutions could have different user requirements
and design considerations, and therefore will fall outside the scope for this thesis.

To account for the users experience of an AI system being closely intertwined with
the underlying AI model and architecture, this report distinguishes between factors
primarily focused on technical aspects and factors related to the interface design.
The study focuses on exploring the design aspects of AI systems, therefore factors
primarily related to technical aspects, such as natural language processing will fall
outside the scope of this research.

A limitation for the project is the constraints regarding data privacy, which influence
our study in the presentation of the result, where citations and example conversa-
tions are excluded.

1.4 Stakeholders
The following section details the stakeholders who have an interest in or are related
to this thesis.

1.4.1 Chalmers University of Technology
Chalmers University of Technology has a strong interest in this thesis, as it con-
tributes to research in user experience (UX) and usability. As a university that

2


1. Introduction

values innovation and technological advancement, Chalmers benefits from studies
that deepen the understanding of UX, especially in complex digital environments.
The insights from this research can support future academic work and practical ap-
plications, helping to refine usability principles. Moreover, this thesis aligns with
Chalmers goal of connecting research with real-world challenges, making the findings
valuable for both academia and industry.

1.4.2 Recorded Future
This thesis is being conducted in collaboration with Recorded Future, a leading
threat intelligence company that holds interest in its findings. As the research
progresses, we will gain deeper insights into the needs of B2B customers within the
cybersecurity domain. By building on previous research and integrating our findings,
this thesis could contribute with valuable knowledge, supporting Recorded Future
in the development of AI-driven solutions.

1.4.3 End Users
End users of the Recorded Future systems have the potential to gain value from the
results of this thesis. By integrating the developed UX guidelines into the system
design, the user experience could be improved.

3


2
Background

This chapter presents the relevant background and context for the study. The chap-
ter begins by presenting the role of UX in B2B contexts, including defining the terms
UX and Usability. Following this, the conceptual background of the technical scope
is presented including AI and NLI. The chapter concludes by describing Recorded
Futures AI.

2.1 The Role of UX in B2B Contexts

Previous research in the field of user experience (UX) has primarily focused on
leisure contexts [7]. According to Çalar et al. this focus have been criticized for
overlooking work related contexts as the two areas vary significantly. Adaptation
of design practices in leisure contexts are often applied to the development of work
related tools, creating a problem area where work related tools tend to focus on
hedonic experiences limiting the consideration of work environments having different
motivations and expectations than leisure contexts. The research within the field
UX specifically related to work context is still in an immature state, emphasizing
the need for further exploration and focus. More specifically, Çalar et al. highlights
the need to address the unique challenge of developing work tools in a B2B context
that meets the needs of multiple stakeholders. The problem derives from finding
balance between the purchaser’s often more business centric requirements and the
user centric needs of the end user. [7]

2.1.1 UX and Usability

Studies within UX research focusing on the B2B context rarely provide a clear
definition of UX, an issue highlighted by Çalar [7]. Furthermore, Çalar et al. states
that the lack of clear definitions of UX in work related research creates challenges
for both academic research and the development of practical guidelines, contributing
to the ongoing imbalance between UX research in leisure and work contexts. Since
many studies in the B2B context lack clear definitions of UX, Çalar et al. also
emphasize the importance of distinguishing related concepts, such as usability, from
UX and clarify their meanings. [7]

4


2. Background

2.1.1.1 Definition of UX

The term UX is defined by the International Organization for Standardization (ISO)
as the ”Users perceptions and responses that result from the use and/or anticipated
use of a system product or service.” The perceptions and responses of the user
includes ”emotions, beliefs, preferences, perceptions, comfort, behaviors, and accom-
plishments that occur before, during and after use.” Furthermore, user experience is
the result from the combination of ”brand image, presentation, functionality, system
performance, interactive behavior, and assistive capabilities of a system, product or
service.” It also results from ”the internal and physical state of the user based on
prior experiences, attitudes, skills, abilities and personality, as well as the context
of which the system is used.” [8]

2.1.1.2 Definition of Usability

The term usability is defined by the International Organization of Standardization
(ISO) as to what extent a system, product or service can be used by ”specified users
to achieve specified goals with effectiveness, efficiency and satisfaction in a specified
context of use.” The specific users, goals and context of use refers to the ”particular
combination of users, goals and context of use for which usability is being considered.”
[8]

2.2 Conceptual Background of the Technical Scope
AI-driven complex knowledge systems are advanced digital systems that embed ar-
tificial intelligence techniques, such as machine learning and natural language pro-
cessing, into knowledge management to support the creation, organization, sharing,
and use of information to enhance organizational outcomes. [1], [9] The following
sections offer a conceptual background of AI-driven systems, Natural Language in-
terfaces (NLI) and Recorded Futures AI.

2.2.1 AI
We have seen rapid growth of AI chatbots in recent years, with increasing integration
into various industries and services [2], [3]. According to Meshram et al. [2], we as
humans have learned to adapt to the quick and effortless way of achieving our goals
with the help of AI. Due to this, our expectation of a certain standard in terms of
requirements increases and becomes more difficult to meet. [2]

AI-powered chatbots, also known as conversational agents, are typically based on
machine learning models that use patterns and keywords to interpret user input [2].
With the use of Natural language Processing (NLP) the chatbots manage to un-
derstand and generate text that characteristics a human, making it less demanding
for the user to have a dialog with the AI as no technical knowledge about specific
prompts is necessary [2], [10]. These systems can be developed to adapt and improve
over time by learning from user interactions, allowing for increasingly accurate and
context aware responses [2]. As the use of AI continues to expand rapidly, questions

5


2. Background

regarding how to design user friendly, trustworthy and understandable AI interfaces
emerge simultaneously [3]. Conversational agents rely on users to formulate and
guide the interaction through prompts, enabling the user to steer the direction of
the conversation [2]. While this gives the user the power over how the conversation
might play out, it puts a certain expectation on the user’s abilities to ask the right
questions. This fosters new perspectives on existing challenges, where designers
have to ensure a good user experience and reliable outputs for the user to find it
trustworthy [3].

2.2.2 Natural Language Interfaces (NLI)
Natural Language Interfaces (NLI) facilitates human-computer interaction by en-
abling users to communicate with interfaces through text or speech using natural
language [11], [12]. NLI plays a crucial role for end users as it bridges the gap
between humans and computer processing which allows the user to interact with
the application through text or speech in a way that does not require specific com-
mands [10]. By merging the NLP and HCI, NLI improves the user experience to
be more accessible and efficient for the user, making the digital tools more intuitive
and reducing the learning curve for users [12].

A primary goal of NLI is to facilitate analytical conversations, where users need
to query, filter, and manipulate data effortlessly. This possibility enables various
domain experts to carry out tasks more effectively without deep technical expertise.
NLI has the potential to effectively streamline documentation and reporting using
report generation. Additionally, Conversational interfaces such as AI-driven assis-
tants have proven to have a leverage NLI for supporting customers and facilitating
interactive learning environments. Visualization creation is another growing area
where Text-to-visualization can generate visual content from natural language de-
scriptions. This enhances data exploration for the user while also facilitating more
effective storytelling. [12]

2.2.3 Recorded Future AI
Recorded future [6] offers a threat intelligence platform designed to help organiza-
tions identify threats, enabling proactive actions to prevent attacks. An AI-driven
intelligence graph is a central part of the platform, combining internal organizational
data with external intelligence to provide real-time actionable insights. Recorded
Future AI is provided as a natural language interface (NLI) through which analysts
interact with the intelligence graph, making analysis and production of intelligence
automated. AI conversations, AI reporting, and AI insights are the three compo-
nents of the Recorded Future AI. [6]

The provided background to Recorded Futures platform is essential in this thesis,
as the forthcoming analysis draws upon feedback from users of the platform. This
study specifically includes two components of Recorded Future AI: AI conversations
and AI reporting.

6


3
Theory

This chapter presents a relevant theoretical background to provide a solid foundation
of important aspects in the design of AI-driven complex knowledge systems and
natural language interfaces (NLI). Its purpose is to outline the study within existing
research by highlighting key concepts, frameworks, and challenges identified in the
field. As this study aims to explore important factors and challenges, the chapter
also serves as a review of previous research that highlights what has already been
considered and helps guide the focus of this work.

3.1 Information System Success and Technology
Acceptance

To understand the success and usage of information systems (IS), two theoretical
frameworks can be applied: Delone and McLeans IS success model [13] and the
Unified Theory of Acceptance and Use of Technology (UTAUT) [14]. Delone and
McLeans model is a widely used framework in IS research, evaluating IS success
through six factors; System quality, Information quality, Use, User satisfaction, In-
dividual impact, and Organizational impact. System quality refers to the system’s
technical performance and usability, while information quality focuses on the accu-
racy and relevance of the provided information. Use and user satisfaction measure
how the system is utilized and perceived by users. The two final factors capture the
system’s impact on individual performance and the organization’s overall efficiency.
[13]

At the same time, UTAUT highlights the factors that influence how users accept
and use technology. The framework identifies four factors that have a direct impact
on user acceptance and usage behavior; Performance expectancy, Effort expectancy,
Social influence, and Facilitating conditions [14].

Performance expectancy refers to which extent a user believes the system will im-
prove work performance, while effort expectancy relates to perceived ease of use.
Social influence describes the importance of others opinions regarding system usage,
and facilitating conditions refer to the availability of organizational and technical
support for system implementation [14].

These models provide an understanding of both IS success and user acceptance,
which is central in designing AI-driven complex information systems in B2B contexts.

7


3. Theory

Delone and McLeans model can be used to assess how well the system performs and
its impact on individuals and organizations, while UTAUT helps analyze the factors
influencing users’ willingness to adopt and integrate the system into their workflow.
By combining these perspectives, design decisions can be better tailored to achieve
both technical success and user acceptance in professional environments.

3.2 Affordance
Affordance theory provides a foundational perspective on users’ interactions with
technology by defining the possibilities for action that an environment or system
offers. The interpretation of affordance varies, and relevant literature is spread across
several disciplines. Affordance theory was originally introduced by the psychologist
James Gibson describing the relationship between an actor and its environment in
terms of possibilities for action [15]. This concept has been applied to technology and
interaction design to understand users’ perceptions and interactions with systems
in order to guide the design of user interfaces and interactions [16].

In Gavers technology extension on Gibsons theory of affordance, perceptible, hidden,
false, and sequential affordances are introduced. Perceptible affordances refers to
those clearly signaling the intended use, while hidden affordances exist but are not
instantly visible. False affordances provide false cues to the user by indicating an ac-
tion opportunity that does not exist. Sequential affordances describe one affordance
leading to another [16]. Norman developed the affordance concept by categorizing
them as real and perceived that have had a major impact on the field of human
computer interaction (HCI). Real affordances refers to the actual properties that
determine how an object can be used, while perceived affordances emphasize the
user’s beliefs about what the object can do based on the design [15], [17].

According to Gibsons perspective, affordances exist independently of perception and
are constant across various users and contexts. Norman states that affordances are
designed into artifacts, and emphasize the importance of affordances being intuitive
and visible to effectively lead the user behavior. Gaver recognizes that affordances
exist independently while arguing that they must be created perceptible for effective
interaction. [15], [16]

While Gaver and Norman particularly focus on affordances on an individual level,
[15] broaden the discussion by focusing on affordances in organizational contexts.
They discuss the concept of affordances from existence, perception, actualization
and effect perspectives. The importance of understanding how affordances are ac-
tualized and affect organizations are emphasized in order to understand the role of
IT artefacts in organizations. Affordance effects occur as the result of affordance
actualization. [15]

Affordance theory provides a framework for analyzing and improving the UX in
AI-driven complex knowledge information systems within B2B contexts. [16] con-
tributes with insights on perceptible affordances in digital interfaces, while [17]
provides a clear distinction between real and perceived affordances to emphasize
intuitive design. Pozzi et al. [15] broadens the perspective to include organiza-

8


3. Theory

tional adoption and impact. By integrating these perspectives, UX designers can
develop AI-driven information systems that are both intuitive for users and effective
in supporting business knowledge processes.

3.3 Explainable Artificial Intelligence
Although Artificial Intelligence (AI) has existed for several years, its rapid growth
and widespread adoption across various fields have increasingly complicated its com-
prehensibility [18]. Explainable Artificial Intelligence (XAI) is a rapidly growing
field that aims to bridge the gap between the technical complexity of AI and human
understanding. The goal of XAI is to create interpretable data models that bene-
fit problem-solving tasks without neglecting humans’ ability to trust the provided
responses and solutions [19]. By providing users insight into the otherwise complex-
ity of machine learning (ML) models, XAI aims to help users make more informed
choices. XAI emphasizes transparency, explainability, and interpretability, which
are essential for ensuring that AI systems can be understood and trusted by both
experts and non-experts [18].

Transparent models and decision trees serve as key tools in XAI due to their un-
derstandable mechanisms. In addition to improving interpretability, XAI provides
a theoretical foundation for promoting responsible AI. By integrating explainabil-
ity with principles such as fairness and reliability, the objective is to foster broader
acceptance and encourage more ethical applications of artificial intelligence [18]

3.3.1 Human centered explainable AI
While explainable AI has an important focus on explaining complex models for
technical solutions, Human centered explainable AI (HCXAI) has a greater focus on
the human factors in the interaction between the AI and the user [20]. The goal of
HCXAI is to ask the question of who needs the explanation, why they need it and
how they will use it in order to consider the human experience. By doing so, the
HCXAI moves beyond the focus of transparency and opening up the black box to
addressing the need for human centered usage of AI in everyday settings [20], [21].

Lee [21] argues that transparent models do not necessarily result in a user centered
AI solution. Instead, researching intuitive human behaviors before training AI mod-
els can lead to a greater understanding of natural human behaviors. To gain a more
profound understanding of how users interact and interpret explanations given by
the AI, Elicitation tests can be utilized. These tests facilitate the identification of
how users may employ personal inputs to describe concepts or tokens when defining
their questions or prompts. After the tests have been carried out, the paper empha-
sizes that the new insights will benefit in customizing pre-trained language models.
However, the remaining challenge lies in translating these insights into concrete im-
plementation steps within the existing language model. Lee therefore suggests that
by initializing new words and optimizing their representation, the language model
can evolve into intuitive human behaviours and give the user a more personalized
experience. [21]

9


3. Theory

3.4 Developing trustworthy systems
Various disciplines within human centered design emphasize the importance of prior-
itizing human centered objectives when developing AI-driven systems [22]. Several
disciplines agree that ethics and trustworthiness are essential elements for AI devel-
opment to ensure human trust in the system.

Liao and Sundar [3] argue that trustworthiness of an AI is not established solely
by system attributes, but is communicated through trustworthy cues. These cues
are embedded in the interface design, documentation and interactions with the user
and perceived through independent judgement by each person. Therefore, trust-
worthiness in AI needs to be thoughtfully incorporated through the quality of the
system to convey users that their trust is justified. Liao and Sundar emphasize the
risk that AI systems may communicate trust signals that can create a false sense of
trustworthiness. Whether it is unintentionally or by design, it can result in users
believing that the systems intentions are credible when they are not. To prevent
this, developers and designers must understand how users process and perceive in-
formation and account for a diverse user group where the level of technical expertise
and ability to process complex information might vary. [3]

Khan et al. raises important aspects regarding transparent decision making pro-
cesses as key features to inform the user about how the AI reached its answer [23],
[24]. By keeping the human in the loop of the decision making process, it reduces
ethical concerns for deceiving users in making decisions based on misleading infor-
mation [23]. Traceability mechanisms can help accomplish this, as well as explaining
the system’s capabilities, limitations, and decisions in an understandable way for the
stakeholder involved [24]. This connects maintaining human control and providing
transparency, for enhancing trustworthiness.

One framework that shares the importance of trustworthy systems is Human cen-
tered AI (HCAI), which focuses on the synergy of automated systems and human
control. HCAI is a two dimensional framework that emphasizes the need for a high
level of human control within applications of high levels of automation [25, p. 47].
The goal is to develop reliable, safe, and trustworthy applications that ensure hu-
mans remain in control by enabling them to make informed decisions based on
the complex information generated by automated systems. Shneiderman states that
”Machine and human autonomy are both valuable in certain contexts, but a combined
strategy uses automation when it is reliable and human control when it is necessary”
[25, p. 53]. With this in mind, HCAI focuses on guiding the design choices in how
to make automated systems reliable, safe and trustworthy with the human in mind.
Shniederman provides the following definitions of the attributes.

Reliable systems aim to deliver expert responses and expected outcomes under de-
fined conditions. The reliability is commonly ensured through verification and vali-
dation, tracking and tracing failures, and fairness and predictability. [25, p. 53]

Safe systems: Safety in automated systems is managed through proactive risk man-
agement and following industry standards. This involves commitment to safety,
extensive failure reporting, and structured validation processes to refine operations

10


3. Theory

and prevent risks.[25, p. 54]

Trustworthy systems: A trustworthy system goes beyond trust as it has proven itself
to be deserving of the users trust. However, it is commonly difficult for the user
to determine since users often lack the skillset of assessing the trustworthiness of a
complex system. [25, p. 54]

These attributes are frequently referenced in discussions surrounding AI, yet Shnei-
derman states that they remain challenging to measure and evaluate. Consequently,
the HCAI framework aims to guide designers and researchers in developing systems
that integrate these attributes while also accounting for broader design considera-
tions [25, p. 55]. From the viewpoint of this study, this perspective can be valuable
as it encourages critical reflection and facilitates the formulation of questions that
may uncover additional aspects relevant to the assessment of complex AI-driven
systems.

3.4.1 HCAI Framework
The HCAI framework focuses on redefining autonomy in AI driven systems by em-
phasizing human control and enhancing automation [25, p. 57]. Upon HCAI, AI and
automation were often viewed from a single axis perspective where the scale went
from high level of human control and low level of automation to low level of human
control and high level of automation[25, p. 49]. HCAI enables a two dimensional
framework where applications with high levels of automation also can have a high
level of human control and is shown in Fig. 3.1.

Figure 3.1: HCAI two- dimensional framework after Shneiderman.[25, p. 60]

11


3. Theory

With this perspective, the framework encourages designers and researchers to be
innovative, explore new questions and reconsider existing methods to deliver AI
applications that are reliable, safe and trustworthy. The framework presents two
axes where human control is ranked from high to low on the left vertical axis and
computer automation is ranked on the bottom horizontal axis. For systems with
high automation, the ideal position is in the upper right quadrant. Here, the user
has the control to override the decisions made by the AI when considered necessary.
[25, p. 60]

3.5 Effectiveness and Efficiency in AI-driven sys-
tems

AI driven chat solutions have an impact on many industries as their capability to
provide real time support enhances effectiveness and user engagement [26]. However
a visible need for continuous evaluation of AI tools like chatbots is needed to make
sure that the user experience and usability aspects are taken into consideration.

Effectiveness and efficiency are fundamental aspects of usability that influence how
well a system helps users accomplish their goals. The ISO 9241-11 is the latest
edition of an ISO standard addressing effectiveness and efficiency within the field of
UX and usability [27]. Alabbas and Alomar present a framework that builds upon
the ISO 9241-11 [26]. The framework defines effectiveness as the systems ability to
deliver accurate and satisfactory results and is often evaluated by task success rates
and error minimization. Efficiency concerns the system’s capacity to support users
in completing tasks with minimal effort and is often evaluated through response
times and task completion rates. Together, effectiveness and efficiency are crucial
in shaping the overall user experience (UX).

Systems that respond quickly while maintaining accuracy contribute to a seamless
user experience, reducing frustration and cognitive strain. AI-driven systems, such
as those utilizing machine learning and natural language processing, must carefully
balance speed and precision to optimize user outcomes. If an AI system responds
instantly but frequently makes errors, user trust decreases.

By advising professionals to measure completion time, error rates, cognitive effort,
user trust and satisfaction, the framework seeks to help extract valuable information
on how to build AI solutions that work seamlessly and engage the users. [26]

3.5.1 Measuring efficiency and effectiveness
As efficiency and effectiveness have previously been evaluated using quantitative
performance metrics in AI-driven systems[26], incorporating this perspective in the
theoretical framework offers valuable insight into how such it can be assessed from
a quantitative point of view. However, while these metrics provide a structured
way to compare outcomes, they do not ensure that they capture the complexity
of user experience, contextual factors, or the nuanced impact of design decisions.
From this perspective, we choose to present the previous assessment of efficiency

12


3. Theory

and effectiveness in the following sections 3.5.1.1 and 3.5.1.2 with an ambition of
how its contributions together with other theoretical viewpoints could capture the
complexity of the research questions of this study.

3.5.1.1 Measuring efficiency

Efficiency can be assessed by measuring the time it takes for a chatbot to process and
deliver a response [26]. By analyzing the interval between when the user’s prompt
is submitted and the system’s response is delivered the researcher can gain a better
understanding of the time it takes the user to get a response. Since the prompts can
vary in complexity, the data can be categorized into groups of simple, intermediate
and complex levels of prompt data to facilitate a greater oversight. Following this,
the response time can be converted into a standardized score to facilitate comparison.
The formula in Fig. 3.2 is used to calculate this score:

Figure 3.2: Efficiency formula by Alabbas and Alomar. [26]

3.5.1.2 Measuring effectiveness

By measuring the amount of incorrect responses provided by the AI, the designer
can gain insight on how effective the user perceives the chatbot[26]. By enabling
the user to rate the responses on a scale from very poor (1) to excellent (5), real
time assessment of the provided answers can be measured. However, this requires
well defined guidelines on how the scale should be interpreted by the user. The
value can be calculated by first measuring how accurately the chatbot performs as
shown in Fig. 3.3, and then normalizing the effectiveness score by converting it into
a standardized 0100 scale and is shown in Fig. 3.4.

Figure 3.3: Scaled Effectiveness formula by Alabbas and Alomar. [26]

Figure 3.4: Normalised Effectiveness formula by Alabbas and Alomar. [26]

13


3. Theory

3.6 Effective Workflows in Natural Language In-
terfaces

Workflows can be defined as an organized collection of tasks designed to complete
a business process. Tasks can be executed by software systems, individuals, teams,
or a combination of these [28].

Cognitive load (CL) refers to the cognitive resources required to complete a task.
The theory of CL was developed within the field of education but is applicable
to a variety of contexts, including user experience of graphical interfaces. CL can
be divided into three components, intrinsic load, germane load, and extraneous
load. Intrinsic load is connected to the complexity of the task or a system and
is determined by the individual’s cognitive resources. Germane load relates to the
process of identifying and learning patterns in a task, within the research of CL there
is a divided opinion if germane load should be considered as a part of intrinsic load or
as a separate component. Extraneous load depends on the presentation and design
of the interface. Extraneous cognitive load can be influenced by design choices and
serves as a key consideration in human-computer interaction design. [29]

In the context of completing tasks through a natural language interface (NLI) Do
et al. [30] highlights abstraction matching as a significant challenge, referring to the
difficulty of composing a prompt that matches the system’s capabilities. Continuous
failures of abstraction matching can contribute to user frustration and impact the
users technology acceptance and possibly lead to abandonment of the system. [30]

To approach this challenge, Do et al. proposes a set of conversational interfaces
supporting grounded abstraction matching by applying the principle of least collab-
orative effort in communication grounding to the design. The study compared three
variations of grounding interfaces with an ungrounded control interface. The set of
grounded interfaces consisted of one conversational grounding interface, a multiple
grounding interface, and a structured grounding interface. Do et al. recommends
designing natural language interfaces that support provisional inputs and enable
collaborative refinement between the user and the system, following the principle of
least collaborative effort. Their findings showed that grounding interfaces, especially
the ones offering structured input fields, can reduce users cognitive load, improve
task performance, and increase system acceptance, still avoiding users feeling con-
strained. Furthermore, the authors propose that, for goal oriented natural language
systems, using an structured guidance approach is more beneficial than aiming for
fully free form naturalness. The structured support showed to help users effectively
compose inputs while maintaining their sense of control. [30]

14


4
Methodology

This chapter introduces various methodologies for collection, analysis, and validation
of data. The chapter begins with presenting the methods used for data collection, in-
cluding literature study, questionnaire, and interviews. This is followed by a section
describing thematic analysis, log data analysis, personas and user journey mapping.
The chapter concludes with describing the Delphi method, used as a validation
method.

4.1 Data Collection

4.1.1 Literature Study
A literature study is conducted to identify the research gap, and gather insights from
previous research including existing theoretical models and frameworks, serving as
a basis for conducting new research [31, p. 131], [32]. The practical approach of a
literature study involves scanning documents to provide an understanding of exist-
ing literature in the field, including identifying key topics. By identifying relevant
themes and topics, sources and concepts can be organized accordingly to form the
structure of the literature study. The writing process can be initiated once a general
outline has been formed [32].

4.1.2 Questionnaire
Questionnaires are an effective method to gather a large amount of data and generate
statistical insights about the problem area. It is a flexible approach, and often more
efficient than interviews as it enables gathering quantitative data from a broader
group of respondents. Furthermore, the direct interaction between the researcher
and respondent is minimal, reducing the risk of bias and the researcher affecting the
answers. [33, p. 95] However, the design of the questionnaire is crucial to ensure
reliable and unbiased answers. Since the respondents don’t have the opportunity to
ask follow up questions or ask for clearer explanations of the questions it can lead to
misunderstandings and the miss interpretations of the questions. The structure of
the questionnaire can also affect the validity of the responses; if the respondent has
access to all questions from start, it could adjust the answers to align with certain
experience patterns, affecting the research validity and decrease the ability to study
variables independently. [33, p. 95] Aiming to gather a broad range of perspectives

15


4. Methodology

from the target group, to inform further data collection, a questionnaire is well
suited for this study to gather initial insights.

4.1.3 Qualitative Interviews
Conducting qualitative interviews is a data collection approach that allows for an
open and explorative approach to capture a detailed understanding of the prob-
lem and nuanced perspectives. In addition to the participants’ verbal response,
interviews can reveal valuable insights through the respondents tone of voice, as
well as facial expressions and pauses answering the questions unlike questionnaires
where there is limited interaction between the researcher and the respondent. [33,
p. 104105], [34, p. 189]

Testing the questions in preparation for the interviews can provide insights about
how the interview questions are working, and is a chance to improve the questions
before the actual interviews. The researchers also get a chance to practice asking
questions, and taking notes of the answers [34, p. 203]. Conducting interviews sup-
ports the collection of more detailed and nuanced information, complementing the
broader insights gained through questionnaires. This method enables a deeper explo-
ration of individual experiences and perspectives, which is essential for addressing
the research questions in greater depth. As an alternative approach to interviews,
observations can be utilized. By observing users in their natural settings, the ob-
servers can gain insights on actual user behaviours rather than how they describe
their actions through interviews [35]. This approach could benefit this research as
an alternative approach as well as a compliment if sufficient time is available.

4.2 Analysis

4.2.1 Thematic Analysis
Thematic analysis is a method used to identify, analyze, and interpret patterns
forming themes in qualitative data. The method is commonly used to explore and
capture respondents’ experiences, beliefs, and behaviors from the data [36]. The
practical approach for thematic analysis includes six phases as shown in Fig. 4.1. The
initial phase focuses on familiarization with the data, to gain a deeper understanding
about the content and identify potential themes. Inphase two initial codes are
generated about interesting elements in the data, working as foundational building
blocks for further analysis. Phase three revolves around searching for themes through
analyzing similarities and differences in the codes, and grouping them in broader
themes. In phase four the identified themes are evaluated by comparing them to
the original data on two levels. The first level aims to control that each theme is
well founded in the coded data. The second level ensures that the themes are well
founded in the full data material. Phase five includes defining and naming themes,
the aim is to clearly describe the meaning of each theme and what is of interest about
them, as well as how different themes are related to each other. In the final phase
the analysis is presented in a report, including citations from the data to illustrate

16


4. Methodology

the identified themes [37]. Aiming to identify key factors, thematic analysis serves as
a valuable approach for revealing such factors through the systematic interpretation
of themes emerging from the collected data.

Figure 4.1: Visualization of the thematic analysis process described by Braun and
Clarke.[37]

4.2.2 Log Data Analysis
Log data analysis is a method that can be utilized to analyse users’ behavioural
interaction with digital systems through logged data [38, pp. 349–369]. The method
is often used in HCI research and requires behavioral logs that capture a large variety
of events where users interact with the system. Dumais et al. describes large scale
logs as recordings of events that are captured in real-world environments. Since the
data is conducted in natural settings, it can be viewed as natural observations where
the user is not influenced by the experimenters or the observers. Log data analysis
allows researchers to form an abstract perspective of a substantial number of users
and their behaviours while also enabling identification of usage patterns .[38]

Following the approach outlined by Dumais et al., Log data analysis is conducted
over three phases and visualized in Fig. 4.2.

The initial phase is Data collection, where the researchers find useful
logs that capture what queries users tend to issue.

The second phase focuses on Data cleaning, where the researchers ini-
tially familiarize themselves with the data before removing duplicates,
filtering out irrelevant data, and anonymizing parts if necessary.

The third phase is dedicated to Using log data responsibly, meaning that
the researchers process and analyze the prepared data with careful con-
sideration of user privacy [38].

17


4. Methodology

Figure 4.2: Visualization of the Data log analysis process by Dumais et al. [38].
Illustration by the authors.

Although log data analysis is commonly applied to quantitative research [39], it
offers a strong potential as a part of a mixed methods approach alongside thematic
analysis in this study. When analyzing data sets provided by Recorded Future, log-
data analysis could serve as a way of observing behaviors for a diverse group of users
while investigating common patterns in successful and unsuccessful user experiences.

4.2.3 Personas

Personas are a representation of shared characteristics and properties among multi-
ple target groups and are commonly used for creating a better understanding of the
user within the design team [40]. This methodology allows for the analysis of both
quantitative and qualitative data, ensuring that the representation is well grounded
in real user behaviors and needs [41], [42].

Given the potential to gain valuable insights from existing user data, Quantitative
Persona Creation (QPC) presents an opportunity to enhance efficiency in persona
development [43]. While QPC allows for a large scale, data-driven approach, our
goal remains to integrate both qualitative and quantitative insights to achieve a
more comprehensive and in depth perspective on user needs.

4.2.4 User Journey Mapping

User journey mapping is widely used within the field of UX design to capture how a
user interacts with a product or service, visualizing the step-by-step experience. The
aim is to map the different phases of the users journey from planning to completion
using a specific product or service. A common approach is to visualize the phases
on a horizontal axis to clarify the time progression. On the vertical axis different
metrics of interest can be added.

User journey mapping can serve as a complement to personas. While personas focus
on a static view of a typical user, user journey maps cover a dynamic and time based
description of the users experience, adding a valuable 3rd dimension. [44], [45]

18


4. Methodology

4.3 Validation

4.3.1 Delphi method

The Delphi method is a well established approach for evaluating a subject through
the collection and analysis of expert opinions [46]. The method offers a potential
means for validating various outcomes, such as derived guidelines with the assess-
ment of experts within relevant fields. In this study, the Delphi method may con-
tribute by offering an additional means of triangulating the forthcoming findings.

The method consists of two main segments,selecting experts and, conducting expert
advice and can be viewed in Fig. 4.3. Selecting experts is considered a critical process
in the Delphi method where the qualified experts need to have a deep understanding
of the issue. To facilitate this, 5 additional steps are used to select relevant experts
using a Knowledge Resource Nomination Worksheet (KRNW).

Step one focuses on preparing the KRNW by categorizing appropriate
classes of experts before identifying them individually. It is essential to
ensure that all relevant experts are considered to have the opportunity of
diverse insights on the issue. This is done by preparing the KRNW with
the categories: organizations, disciplines or skills and related literature.

Step two involves populating each identified category with potential ex-
perts who possess significant knowledge of the field of study. This pro-
cess helps create a clearer overview of which individuals are likely to
contribute valuable insights and assessments.

Step three allows for the expansion of the expert list. At this stage, the
study and its objectives are defined more clearly to determine whether
additional expertise is needed. As researchers, we may also contact
already identified experts for further recommendations of individuals
who may be well suited to the studys requirements.

Step four focuses on prioritizing and ranking the experts based on their
qualifications. By organizing them into sub-groups, they can be ranked
according to criteria such as years of experience, publications, and geo-
graphical relevance. It is acceptable at this stage for experts to appear
in more than one list if they possess overlapping qualifications.

Step five involves inviting the highest-ranked experts to participate in
the study. It is recommended that 10 to 18 experts be included in the
next segment. Each invited participant should be provided with clear
information about the studys objectives, timeline, expected value, and
the estimated time commitment required for their involvement.

19


4. Methodology

Figure 4.3: Delphi method process after Okoli and Pawlowski. [46]

Conducting the experts advice is carried out through three phases (1) Brainstorming,
(2) Narrowing Down Factors and (3) Ranking factors.

The brainstorming phase consists of two short questionnaires. Based
on Okoli and Pawlowski, the first questionnaire commonly recommends
each expert to independently list and rank factors they consider impor-
tant together with a brief explanation for each decision [46]. In this
study, the goal is to transform our findings into practical guidelines that
can assist the development of AI-driven solutions for B2B consumers.
The first questionnaire would therefore act as an individual assessment
by each expert where they have the opportunity to list the guidelines
together with their first impressions regarding relevance, inadequacies
and remarks on any missing parts based on their expertise. After this,
we as researchers consolidate the responses into a refined list of factors.
In the second questionnaire, experts review and validate this summary
before moving on to the next phase.

The narrowing down phase focuses on identifying the most important

20


4. Methodology

guidelines and factors from the consolidated list. The experts are grouped
into panels based on their stakeholder roles to allow viewpoints from ex-
perts with similar priorities to align. In this phase each panel prioritizes
the top 10 factors and guidelines they consider significant.

The ranking phase focuses on having each panel establish a prioritized
order of the most important factors and guidelines. Each panel works
independently and rank them based on their perceived significance.[46]

The application of the Delphi method could serve as a valuable approach for val-
idating the findings that have emerged from this study. By incorporating expert
perspectives, valuable insights may provide an indication of confirmation or alterna-
tively, point to additional considerations that would be needed in further refinement.
Online Expert Panels could serve as an alternative approach if physical interviews
are not possible to achieve using the Delphi Method. Online Expert Panels enables
the inclusion of a large number of stakeholders across different geographic locations
and would therefore facilitate a bigger diversity of experts [47].

21


5
Process

This chapter describes the execution of the research. The chapter begins by pre-
senting the process of defining the research problem. This is followed by describing
how each of the methodologies used for data collection and analysis were utilized.
Additionally, the findings that informed the final results are presented under each
methodology. The chapter concludes with outlining the process of creation and
validation of design guidelines. The execution process of this research is described
below in Fig.5.1.

Figure 5.1: Execution of the research process.

5.1 Define Problem
The study was initiated with two informal meetings with the supervisor at Recorded
Future, the aim was to form a deeper understanding of the research field, current
challenges, and possible perspectives the challenges could be investigated through.
Simultaneously, exploratory research was conducted to further understand the cur-
rent research done in the field. A first definition of the research field was created,
focusing on the role of UX design to improve complex knowledge information sys-
tems. After more in depth discussions with our supervisor, the focus was narrowed

22


5. Process

down to specifically target the role of UX design in AI-driven complex knowledge
information systems.

5.2 Literature Study
Once the scope of the study was defined, a literature study was conducted to build
a strong theoretical foundation focusing on two main aspects; literature related to
domain knowledge and design related literature. The process involved searching in
academic databases. The main search engines used for our search were Chalmers
Library’s database and Google Scholar. In addition, we reviewed papers recom-
mended by our supervisor at Recorded Future, which helped introduce us to key
works within the field. Search terms such as UX design", usability, AI, natural lan-
guage interfaces (NLI) and related concepts were used to locate relevant literature.
These keywords reflect the central focus of our research and aims at exploring how
users interact with AI systems, particularly through natural language.

The selection process emphasized peer-reviewed publications to ensure the credibility
and academic quality of the literature. We primarily searched for research and
theoretical frameworks published in journals and conferences within the field of
human-computer interaction (HCI), user experience (UX), and artificial intelligence
(AI). When selecting which studies to include, we considered how well each source
related to our research question. We aimed for a combination of well established
studies and more recent published papers.

5.3 Questionnaire
In this study, the survey data used were collected as part of an internal research
project conducted by Recorded Future. A total of 39 participants responded to
the survey. All participants in the survey work at companies that are customers of
Recorded Future, and use the modules where the AI functionality is available. The
questionnaire was designed and distributed by the AI development team and the
selection of participants was also carried out independently of the authors. As part
of this masters thesis, we were granted access to the anonymized survey responses
for the purpose of answering the research questions. While we did not contribute
to the development of the questionnaire or the recruitment of respondents, our role
involved conducting a thematic analysis of the qualitative data in order to identify
patterns and themes related to the platform’s AI reporting functionality. The survey
was distributed with the intent of examining several aspects of users current use of
Recorded Futures AI. Several of the questions covered topics that fell outside the
scope for this thesis, which included users general product habits and work roles.
However three of the questions were included in the study as they captured the users
experience and perceptions of the AI functionality making them highly relevant for
answering the research questions.

The thematic analysis process was conducted according to the six steps outlined
by Clarke and Braun [37]. The process was initiated with a familiarization phase,

23


5. Process

involving reading all responses multiple times. Initial codes were then generated in-
ductively from the data using manual coding. These codes were iteratively reviewed
and organized into candidate themes, which were refined through multiple rounds of
analysis to ensure clear differentiation. Each theme was evaluated to ensure align-
ment with both the coded data and the original data material. Each theme was
clearly defined and named to capture the essence of the participants perspectives.
To complete the analysis process an overview was created presenting the themes
with descriptions and citations as support.

5.3.1 Findings from Questionnaire
This section presents the user experience survey results, focusing on the users general
experiences and perceptions of the first six months with the AI reporting function.
A total of 39 participants responded to the survey.

The result from the thematic analysis highlights several key themes that influence
how diverse professional users interact with and experience a natural language in-
terface for AI-driven report generation. The following five themes were identified in
the analysis: Customization and Flexibility, Efficiency and Effectiveness, Workflow
Considerations, Transparency and Trust, and Language Model Performance.

Customization and Flexibility concerns the systems ability to support
diverse user needs, including how well it can be adapted to support
individual use cases and preferences.

Effectiveness and Efficiency reflects the users perception of how well the
system supports meaningful value creation, as well as how quickly it
enables them to achieve their intended goals.

Workflow Considerations relate to how the system aligns with existing
organizational processes and desired outcomes.

Transparency and Trust addresses the users ability to understand how
outputs are generated and the sources of the information.

Language Model Performance refers to how well the systems language
based features align with the users expectations in terms of relevance,
accuracy, and the level of detail provided in the output.

The majority of the responses related to Customization and Flexibility focused on
the level of detail in the generated output. The ability to adapt the timeframe of
the output also emerged as a recurring feedback within this theme. Feedback within
the theme Effectiveness and Efficiency revolved around users being uncertain about
their prompt writing skills, as they couldnt achieve their intended goal. Furthermore,
some users highlighted efficiency challenges as the responses required finetuning and
that the system offered a limited contribution in relief of workload. Within the
theme Workflow Considerations the feedback stated that the current functionality
is suitable for specific use cases. Feedback regarding Transparency and Trust related
to users struggling to evaluate the reliability of the information or experienced lack of
citations. Several of the responses related to Language Model Performance indicated

24


5. Process

dissatisfaction with the level of detail in the response, and a few users experienced
that their instructions were not reflected in the output.

Figure 5.2: User experience rating of the AI Reporting, showing qualitative feed-
back related to a specific rating. The citations are not direct quotes from survey
responses, but rather anonymized and rewritten as statements to exemplify the na-
ture of feedback from each category.

The majority of the respondents rated the function neutral. A few of the users
expressed uncertainty if they were asking the correct questions to achieve their goal.
Recurring feedback in this section also highlighted the desire for more customization
of the output, along with uncertainty about the reliability. Users that have had a
good experience with the functionality reported that the system is very easy to use
and returns good information, effectively supporting the needs of their specific use
cases. Further, they highlighted that the system meets their needs well regarding
providing good summaries quickly. Users that reported a poor experience with the
system often experienced issues regarding achieving the correct level of detail in the
response, looking for greater depth in the response. They also expressed the need for
more customization options and flexibility regarding sources and timeframe settings.
Lastly, the need for easier validation of reliability and traceability of the sources was
communicated.

The survey also captured specific user needs regarding affordances within the system.
The majority of the reported needs was related to customization of the visual design
of the report, as well as allowing for a more flexible and iterative process regarding
selection of sources. Furthermore, support for formulating natural language requests
was reported in the responses.

25


5. Process

5.4 Interviews
In line with the approach for the survey data, the interview data used were collected
as part of an internal research project conducted by Recorded Future. The interview
guide was designed by the AI development team and the selection of participants was
limited to the respondents from the survey that had given consent to be contacted
for setting up an interview. The selection process was carried out independently
of the authors. As part of this masters thesis, we were granted access to observe
one of the interviews and watch a recorded video of a second interview. Twelve of
the survey respondents agreed to be contacted for setting up an interview, our aim
was to take part in 6-10 interviews. However, during the weeks working with data
collection only two interviews were scheduled. The interviews acted as a support
for the survey analysis results, in adding more depth and detail to the insights. As
an initial step of the analysis process, the interviews were transcribed. Thereafter
citations from the interviews were grouped under themes identified from the survey
analysis.

5.4.1 Findings from Interviews
Two of the respondents from the survey participated in interviews. The interviews
resulted in collected data capturing detailed insights about their user experience
with the AI reporting function.

The themes Transparency and Trust and Usability and User experience were central
topics of the interviews. To enhance Transparency and Trust users want to be
able to add their own files and sources to generate the report from. Furthermore,
users want to be able to trace if the source is the original. Regarding Effectiveness
and Efficiency, customization of the report design that allows for a more iterative
workflow is considered to be important. Generating reports with visualizations was
also a central feedback of the interviews.

5.5 Analysis of User-AI Interaction
To gain a deeper understanding of the interaction between users and the AI, an anal-
ysis was conducted on feedback data and session data derived from past sessions.
The data used in this analysis covers a six month period and was obtained from
existing logs provided by Recorded Future. All datasets had been anonymized by
the company prior to the analysis in order to ensure user privacy. Although we did
not collect the data ourselves, the dataset was extensive and required a procedure
of processing and extracting relevant information. Our work involved identifying,
filtering, and organizing the data in a way that enabled meaningful analysis aligning
with the goals of the study. The analysis was grounded in central topics identified
in the literature study; affordances, trust, and effectiveness in the AI system. The
feedback data was utilized as a delimitation, thereby enabling a distinction between
successful and unsuccessful sessions between the user and AI. The session data, con-
taining complete conversations between users and the AI enabled a deeper analysis

26


5. Process

of how different scenarios can unfold. The following sections will provide a more
detailed explanation of how the execution of these two analyses were carried out.

5.5.1 Feedback Data Analysis
The user feedback data was first organised to distinguish successful and unsuccess-
ful conversations between the user and AI. The dataset contained three different
categories of user feedback; thumbs up indicating positive feedback, thumbs down
indicating negative feedback and additionally a possibility to leave a comment to-
gether with the thumbs down feedback. Each user feedback was directly linked to
a specific response written by the AI, and with an association to the session the
message was retrieved from.

The initial step was to sort and divide all the collected data into the three categories;
Positive feedback, negative feedback and negative feedback with comment. Due to
the valuable depth of qualitative data, a thematic analysis was conducted on the
category negative feedback with comment to identify recurring themes that could
provide insights on lacking or missing elements of the NLI. The process followed
Clarke and Brauns six-step thematic analysis [37] and began with familiarization of
the dataset. Due to the large dataset of 192 rows, we chose to carefully skim through
all the data before moving on to individually code each feedback comment. Themes
were developed based on these codes and guided by patterns of similarity while also
ensuring that each theme accurately represented the data it was intended to capture.
After this, the themes were reviewed and refined to ensure clear formulations and
descriptive names. In the final phase, representative quotes were selected to support
and illustrate each theme. To summarize how prevalent each theme was within the
dataset, we calculated how many comments were linked to each theme.

5.5.1.1 Findnings from User Feedback Data Analysis

The following section presents the result of a thematic analysis based on user feed-
back from conversations with AI generated responses. The feedback is in the form
of a users negative reaction to a message in a conversation, with an additional com-
ment with qualitative feedback. The analysis includes 192 feedback comments. The
coded citations were initially categorized into 14 distinct themes. Five of these were
excluded from further analysis, as they were closely related to technical aspects of
the system and fell outside the scope of this thesis. The remaining nine themes were
then grouped into four main themes, as several of them were closely related and
shared overlapping content. The analysis resulted in four themes: Miscommunica-
tion, Level of Detail Issues, Trust Issues, and False Affordances.

Miscommunication refers to when the AI doesnt understand the users
intention. The user perceives that the AI does not follow their instruc-
tions, the user is confused by the response, or the user is frustrated as a
result of trying to achieve something through the conversation.

Level of Detail Issues is related to when the system fails to reflect the
users needs in terms of level of detail in the response.

27


5. Process

Trust Issues refers to the situations where the user can’t verify the source
of the information or the user questions the reliability of the information.

False Affordances captures the situations where the user is trying to get
the system to do something that is not possible.

5.5.2 Session Data Analysis
To further deepen the insight gained from the feedback data, the session data was
used to provide a more detailed understanding of the context behind the users feed-
back. This was achieved by analyzing entire sessions that include all user prompts
and corresponding AI responses in the same chat. A mixed method approach was
employed, combining log data analysis and thematic analysis to examine user inter-
actions.

The process was initiated using the log data analysis’s three phases [38] where the
first step was to collect a comprehensive dataset (Data collection). We obtained
six months of session data containing conversations between the user and the AI.
The second phase consisted of data cleaning, during which private and irrelevant
data were removed. For the purpose of this study, we selected the conversations
that had previously been identified in the feedback analysis as related to affordance,
effectiveness, and trust for further analysis. In the third phase we conducted the
analysis by Using log data responsibly and incorporating valuable steps from the
thematic analysis to detect common patterns. Each session was reviewed while key
codes were identified and subsequently analyzed for common patterns. Additionally,
each session was evaluated to determine whether the user had attempted to use
fine-tuning. This was to get an indication of how the user intended to guide the AI
in cases of potential misunderstanding. In regards to using the log data responsibly,
the analysis was conducted entirely on our local computers. The final results were
later rephrased to ensure that no sensitive data would be exposed or disclosed to
unintended recipients.

5.5.2.1 Findings from Session Data Analysis

To get a deeper understanding of the user’s experience, 88 full AI sessions connected
to the user feedback regarding each separate theme were analyzed with a mixed-
method approach, including log data analysis and thematic analysis. From this
point forward, AI sessions will also be referred to as conversations throughout the
report.

The analysis for the theme Miscommunication includes 50 conversations. The analy-
sis resulted in five patterns that characterize miscommunications in an AI conversa-
tion. The following patterns were identified: Incorrect Scope, Incorrect Timeframe,
Incorrect Level of Detail, Abstraction Mismatch, and Missing Contextual Carryover.

As represented in Fig. 5.3 the result shows that 45.5% of the use cases connected
to miscommunication were related to the AI giving a response with out of scope
information. Several use cases include examples where the AI only focuses on specific
parts of the prompt, not capturing the full scope. Furthermore, there are use cases

28


5. Process

showing that the AI misunderstood the users question, leading to an out of scope
answer as irrelevant information is presented. 27.3% of the use cases exemplifies
situations where the AI does not apply the requested time frame to the response.
Additionally, for the use cases related to creating a report the user tries to prompt a
change of timeframe outside of the default options provided in the report template
editor. 15.9% of the conversations related to miscommunication can be connected to
incorrect level of detail in the response. In most use cases the user expresses a need
for a more detailed response to be able to gain value from the response. 6.8% of the
use cases were related to the user framing the prompt at a different conceptual level
than the AI is designed to handle leading to an abstraction mismatch, where the AI
couldnt find data matching the query. Lastly, 4.5% of the use cases involved the AI
missing contextual carryover, where the AI failed to incorporate information from
earlier messages and indicated a lack of contextual memory or thread awareness.

Figure 5.3: Results showing underlying patterns connected to miscommunications.

The analysis for the theme Level of Detail Issues includes ten conversations. The
analysis resulted in three situations that can be traced to affect the user experience
negatively due to the level of detail in the AI response. The most common situation
is when the response is too generic to match the user’s intention with the prompt
request, such as missing proper explanations or the response reflects what the user
considers to be a summary when seeking more in depth information. Another sit-
uation is when the response is not considered to complete, lacking thoroughness.
Lastly, is a situation when the AI provides details in the response that the user
considers to be irrelevant. In one of the ten examples the user tried to fine tune the
answer to achieve a more detailed response.

The analysis for the theme Trust Issues includes 15 conversations. The analysis
resulted in four identified situations that can affect the trust negatively with the AI.
The following patterns were identified: Traceability Issues, Unreasonable Response,
Outdated Information, and Hallucinations.

In most of the situations, the lack of possibility for the user to trace the source of

29


5. Process

the information was causing trust issues. Another recurring situation was that the
user considered the response to be unreasonable, due to the intended meaning of the
source not being fully captured and therefore lacking crucial information. A third
situation causing trust issues was when the user perceived that outdated information
was included in the response, therefore user uncertainty about accuracy led to trust
issues. Lastly, use cases when the user suspects hallucinations led to trust issues,
meaning the AI providing made up information or details due to lack of data related
to the prompt request.

The analysis for the theme False Affordances includes 13 conversations. The analysis
resulted in two situations that reflect false affordances within the AI. In twelve of
the use cases, the user wanted to get information within a certain timeframe but
couldnt get the AI to apply the asked timeframe. In one use case the user wanted
the AI to collect the data from a certain module in the platform, which is not a
possible action.

5.6 Creation of Guidelines
The process of creating the guidelines began by listing the factors identified as
important for enhancing the user experience of a NLI. Each factor was then linked
to one or more of the reports three central theoretical topics: affordances, trust, and
effectiveness. These connections were further developed into one or more guidelines,
based on the findings from the results combined with the theoretical foundation
established through the literature study. The initial list of guidelines added up to a
number of 13 recommendations.

5.6.1 Mapping Identified Factors under Affordance, Trust,
and Effectiveness

This section presents the identified factors in a categorized table 5.1, following the
division of the three focus areas that were identified in the literature study: Affor-
dance, Trust, and Effectiveness. This division is made to ensure that the upcoming
design guidelines address the important factors in our findings and facilitate the
translation of user needs and existing challenges into actionable design guidelines.
The marked boxes next to each factor indicate a required improvement within the
area in order to address the associated challenges. By employing the insights from
the literature study and the patterns found in real world user experiences, this work
outlines a foundation for formulating design guidelines.

Customization and Flexibility reflect challenges related to both affordance and effec-
tiveness, as the findings show that NLIs lacking these qualities often fail to support
users in maintaining an effective workflow or task completion. This is frequently
linked to users not understanding the system’s capabilities and limitations which
can indicate False Affordances.

Effectiveness and Efficiency issues were commonly seen throughout the survey, con-
versation and feedback analysis where users struggled to carry out their tasks and

30


5. Process

calls for an improvement regarding effectiveness. This also includes how quickly it
enabled them to achieve their intended goals.

Workflow Considerations can be connected to affordance and effectiveness challenges
as feedback indicated that the system supports some of their use cases but not all
of them.

Transparency and trust was identified as one of the most recurring factors among
the conducted analysis. Users clearly state that they want to be able to trace the
sources used for generating the output, as a way to validate the reliability of the
response.

Language Model Performance is a factor related to both trust and effectiveness,
challenges concern relevance and accuracy of output where the user wants to both
be able to trust the response as well as achieving alignment between their intention
with an input and the related output.

Miscommunication reflects challenges related to all the focus areas. The reasons iden-
tified leading to miscommunications were users trying to achieve false affordances,
or struggling to achieve their goals reflecting effectivity challenges. These situa-
tions relate to trust issues when the system continuously fails to interpret inputs as
intended.

Level of Detail reflects effectiveness challenges as users experience issues with achiev-
ing their goal in terms of level of detail in the response.

False Affordance reflects affordance challenges, specifically regarding situations where
the user thinks an unsupported action is possible.

Table 5.1: Factors mapped to focus areas; Affordance, Trust, and Effectiveness.

31


5. Process

5.6.1.1 Iteration 1 - Presentation of Initial Guidelines

To initiate the development of the Design guidelines, a first draft was created based
on the key factors and challenges identified through data analysis and insights from
existing research. The result from the first iteration is presented in Fig. 5.4 below.

Figure 5.4: Initial Guidelines.

32


5. Process

An expert interview and expert panel evaluations were then conducted to validate
the guidelines over three additional iterations. The following section provides a
detailed description of the expert validation process.

5.7 Expert Validation
The expert validation was carried out to assess and refine the formulated guidelines,
following the process of the Delphi method. As stated in section 4.3.1, the method
consists of two segments, Selecting experts which involves five additional steps and
Conduction expert advice, consisting of three additional phases. The execution of
this process ultimately led to an alternative approach for the second segment, which
is presented in the following section.

The validation started with preparing the KRNW. In the first step, three appropriate
classes of experts were selected, Experts within AI development, UX research and
UX design. In the second step, these classes were populated with 12 potential
experts whom we considered relevant for the purpose of the study. In the third step,
the list was further reviewed and expanded to 16 potential experts in collaboration
with one of the experts. Step four focused on ranking the experts. This was done by
evaluating them based on their years of experience within the discipline and level of
involvement in this project. With the aim of involving at least 12 participants in the
study, 16 experts were invited during step five of the expert selection phase. This
approach was intended to ensure a sufficient number of participants, accounting for
the possibility that some experts might be unable to participate due to conflicting
tasks or time constraints.

The second segment, Conduction expert advice, was originally designed to allow
the experts to brainstorm potential improvements and rank the guidelines through
two iterations. Due to concerns regarding a potentially low response rate to the
questionnaire, an alternative approach was adopted and presented in Fig.5.5. This
allowed for a more flexible arrangement with the experts while still capturing a
diverse range of feedback. Moreover, the revised approach facilitated multiple itera-
tive cycles, which in turn provided additional time for the guidelines to evolve and
mature into a final version.

33


5. Process

Figure 5.5: An adjusted approach to the Delphi method, based on the framework
by Okoli and Pawlowski [46]

Phase 1 : Instead of utilizing two questionnaires as a second iteration of the guide-
lines, we asked our top ranked expert within the UX research discipline to sit down
to evaluate the formulated guidelines. The expert was provided with the list of
guidelines in advance to review prior to the first assessment. The following day, the
expert provided feedback during an interview, in which all proposed changes were
documented collaboratively to ensure accurate interpretation.

Phase 2 : In the second phase, the guidelines were adjusted based on the received in-
put. The initial list of guidelines consisted of 13 potential guidelines. After applying
the collected feedback onto the guidelines, the updated list had four rewritten guide-

34


5. Process

lines, one new added and one removed. Following that, the guidelines were placed
in a document and sent out together with an updated invitation to the selected 15
experts.

Phase 3 : In the third phase of the alternative approach, two expert panels were
invited to review and give feedback on the refined guidelines. The phase was carried
out during two separate meetings where each panel discussed and left comments
in a common document as we observed the discussions. The panels were divided
based on their disciplinary background to capture a broad range of perspectives and
domain specific insights. However, due to conflicting tasks within the UX Research
and UX Design team they were grouped together and viewed as a UX focused panel,
based on the number of individuals involved. This resulted in the AI developer panel
consisting of seven participants and the ux panel consisting of three participants.

Phase 4 : Following the collection of expert feedback the guidelines we once again
refined. In this phase, we carefully analyzed the input to identify recurring sug-
gestions and address valuable insights from the two panels. This process aimed to
enhance the clarity, applicability and quality of the guidelines based on the received
feedback.

Phase 5 : In the final evaluation, a new panel of five experts within the UX discipline
were invited to validate the refined guidelines. In this stage, the goal was to validate
that the guidelines were well suited for practical guidance and if any final refinement
were needed.

Phase 6 : The guidelines were refined one last time based on the last expert panel.

5.7.1 Results from Expert Evaluations
This section presents the result from the conducted Delphi method evaluations. The
following subsections are organized according to the individual expert assessments.

5.7.1.1 Iteration 2 - Expert Interview

The following section presents feedback from an expert working with UX research.

The expert highlights that the guidelines target different levels, some more detailed
and product specific while others are more open for interpretation. Regarding the
category trust, the expert states that the guidelines capture important challenges
addressed in literature. However, the expert pointed out that trust is an exten-
sive challenge within this area and additional guidelines for this category could be
developed.

On the category effectiveness, specifically for guideline 9 and 10, the expert elabo-
rated on the challenge with managing ambiguity in the relationship between input-
output size, emphasizing the problem becomes more severe the bigger discrepancy in
this relationship. AI report creation is more of an ”order” than an iterative process,
making it harder to refine and control the output.

35


5. Process

5.7.1.2 Iteration 3 - Expert Panel 1 and 2

The following section presents the feedback from expert panel 1, consisting of seven
experts, working with AI development.

Affordance Category Feedback - For guideline 1, experts emphasized that the sys-
tem should prioritize user inputs, especially in cases where the system affordances
are unclear or ineffective. If interpretation is not possible, the system should offer
customization options to guide the user. Regarding guideline 2, experts highlighted
the importance of clear system feedback. They suggested that the system should
recognize when a user attempts an unsupported expression and provide guidance,
such as explaining why the input is not supported and offering alternative phrasing
to achieve the desired outcome. Finally guideline 4, experts recommended using
more natural language and avoiding overly technical terms. For example, replacing
phrases like collaborative revision and ”prompt” with more approachable alterna-
tives like ”guide” and ”initial question”. Trust Category Feedback - For guideline 7,
experts recommended including the AIs reasoning process to enhance transparency,
allowing users to understand how the AI arrived at its responses. Additionally, they
suggested introducing a new guideline that acknowledges the varying language styles
and characteristics of different models, which can significantly impact user experi-
ence. This guideline should also address the importance of understanding that not
all models have the same level of contextual awareness. While a chat interface may
present the interaction as a continuous conversation, the underlying AI may process
each user input as a standalone request, which places different design requirements
on the system. Effectiveness Category Feedback - Experts noted that the term ”ef-
fectiveness” might be problematic as its definition varies significantly depending on
the context and specific user needs. To truly assess effectiveness, designers must
understand what their particular users consider effective in their workflows. Fur-
thermore, they observed that the guidelines within this theme are mostly product
specific, especially guideline 10, which may not apply universally across different AI
systems. For guideline 9, experts suggested clarifying the perspective from which
effectiveness is being measured. For guideline 11, they recommended using another
term for ”contextual carryover” and clearly defining this concept before enabling it,
as its meaning can vary significantly depending on the extent to which it is applied.

Additional Feedback - Regarding the earlier mentioned comment on effectiveness
being a challenging term for this category. They offered the perspective that if affor-
dance and trust are addressed in the right way, it will likely lead to an overall effective
system, as those factors form the foundation of intuitive and reliable systems. Ad-
ditionally, they recommended clearly defining the scope of the NLI guidelines to
ensure they align with the intended design goals and accurately capture the types
of user interaction being prioritized.

The following section presents the feedback from expert panel 2, consisting of three
experts, working with UX research or UX design.

Affordance Category Feedback - For guideline 1, the experts emphasized that design-
ers have limited control over how the model itself functions. Instead focusing on
what users are ”allowed” to do, the guideline should prioritize clearly communicat-

36


5. Process

ing the systems capabilities to users, aligning the interface with what the system
can realistically support.

Trust Category Feedback - The experts questioned the distinction between guideline
6 and 7. Furthermore for guideline 6, they pointed out that different users as-
sess source validity using different criteria, therefore recommending that the system
should clearly explain what validity means within specific contexts. Furthermore,
for guideline 7, they questioned whether the approach genuinely encourages critical
thinking.

For guideline 8, they emphasize the importance of expressing the AI decision making
in a human readable way to support users with limited technical knowledge and
streamline the validation process.

Effectiveness Category Feedback - For guideline 11 on contextual carryover, the ex-
perts strongly recommend clarifying the scope. This could include a single session
context, personal use history, or organizational context, each with different implica-
tions for data privacy and user expectations. Furthermore, they pointed out that
not all users value this feature, so it should be clearly defined and appropriately
targeted.

They also suggest elaborating on guideline 12, as its current phrasing was considered
too vague to effectively guide design decisions. Additionally, they recommended
adding a guideline focused on visualizations to increase effectiveness.

Additional Comments - From a designer perspective, the experts emphasized the
need for more concrete examples within each guideline to provide clearer direction
for designers, as the current phrasing is often too generic to be actionable.

Lastly, they highlighted that transparency alone may not always be beneficial, as
users need the appropriate skills and time to interpret this information. Instead,
guidelines should promote example based explanations that are easier for a wider
range of users to understand.

5.7.1.3 Iteration 4 - Expert Panel 3

The following section presents feedback from expert panel 3, consisting of five experts
working with UX design or UX research. This group was presented with two versions
of the guidelines. The guidelines presented to expert panel 1 and 2, as well as an
updated version of the guidelines based on the feedback from iteration 3.

Regarding clarity and readability of the guidelines, the experts think that the guide-
lines are written in an overly academic style, requiring multiple readings to fully un-
derstand. Many sentences are long and cover multiple guidelines, making it difficult
for designers to assess their application. Additionally the experts suggest simplifying
the language to ensure designers with varying levels of AI knowledge can follow the
guidelines. They encouraged us to reduce unnecessary wording. Consider how much
text is necessary to describe the core message without losing meaning. Furthermore,
they suggest to differentiate between guidelines covering model affordance and UI
affordance to add clarity. Comments regarding terminology and definitions revolve

37


5. Process

around that many terms used are too general and open to interpretation. They
suggest we clearly define key terms, like ”affordance” to avoid misunderstanding.
Referring