Effects of Cognitive Load in Human-AI Requirements Engineering

Master's Thesis in Software Engineering and Technology

Niharika Nandi Shivamurthy Praveen
Laxmi Prashantraddi Sasvihalli

Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2025

Master's Thesis 2025

Effects of Cognitive Load in Human-AI Requirements Engineering

Niharika Nandi Shivamurthy Praveen
Laxmi Prashantraddi Sasvihalli

Department of Computer Science and Engineering
Chalmers University of Technology
University of Gothenburg
Gothenburg, Sweden 2025

Effects of Cognitive Load in Human-AI Requirements Engineering
Niharika Nandi Shivamurthy Praveen
Laxmi Prashantraddi Sasvihalli

© Niharika Nandi Shivamurthy Praveen and Laxmi Prashantraddi Sasvihalli, 2025.

Supervisor: Richard Berntsson Svensson, Department of Computer Science and Engineering
Supervisor: Lekshmi Rani, Department of Computer Science and Engineering
Examiner: Gregory Gay, Department of Computer Science and Engineering

Master's Thesis 2025
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX
Gothenburg, Sweden 2025

Effects of Cognitive Load in Human-AI Requirements Engineering
Niharika Nandi Shivamurthy Praveen and Laxmi Prashantraddi Sasvihalli
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg

Abstract

As Artificial Intelligence becomes more integrated into software engineering, its role in decision-support systems within Requirements Engineering has grown. However, the cognitive demands placed on users interacting with these AI tools remain underexplored. This thesis investigates how explanation formats offered by Explainable AI affect mental effort, task difficulty, confidence, and correctness during requirements-engineering-inspired prioritization tasks. Through a controlled experiment with 61 participants, three XAI formats (bar charts, textual explanations, and confidence scores) were evaluated across two task pairs of differing complexity. The study examined the influence of task complexity and explanation format, the impact of explanation type on decision-making quality, and whether participant preferences for certain formats aligned with improved performance and lower cognitive strain. Statistical analyses, including Spearman correlation and independent t-tests, revealed that task complexity consistently influenced cognitive load, while explanation format had no clear effect. Additionally, although preferred formats did not universally enhance task performance, participants who favored confidence scores showed marginally higher correctness and confidence levels. These findings suggest that cognitive effort in AI-assisted requirements engineering tasks is shaped more by task characteristics than by explanation format alone, and that tailoring explanations to individual user preferences may offer subtle benefits.

Keywords: Requirements Engineering (RE), Cognitive Load (CL), Artificial Intelligence (AI), Explainable Artificial Intelligence (XAI), Weighted Shortest Job First (WSJF), Research Question (RQ), User Experience (UX).

Acknowledgements

We would like to sincerely thank our supervisors, Richard Berntsson Svensson and Lekshmi Rani, for their valuable guidance, feedback, and encouragement throughout the course of this thesis. Their support has been instrumental in shaping our research.
We would also like to thank our examiner, Gregory Gay, for his input and constructive advice. Additionally, we are grateful to all the participants who contributed their time and insights to our study. Finally, we would like to extend our appreciation to our families and friends for their continued support and motivation during this journey.

Niharika Nandi Shivamurthy Praveen and Laxmi Prashantraddi Sasvihalli, Gothenburg, September 2025

Contents

List of Figures
List of Tables
1 Introduction
  1.1 Thesis Outline
2 Background
  2.1 Background
    2.1.1 Cognitive Load Theory
    2.1.2 Requirements Engineering and Prioritization
    2.1.3 CLT and Its Relevance in Requirements Engineering
    2.1.4 Explainable AI (XAI) and Its Role in Requirements Engineering
3 Related Work
  3.1 Cognitive Load in General Domains
  3.2 Cognitive Load in Software Engineering
  3.3 Human-AI Collaboration and LLMs in Requirements Engineering
  3.4 Summary
4 Methodology
  4.1 Research Design
  4.2 Methodology Process Overview
  4.3 Survey Design and Questionnaire
    4.3.1 Survey Flow
    4.3.2 Demographics
    4.3.3 Prioritization Tasks
    4.3.4 XAI Explanation Formats
    4.3.5 Implementation of AI Support
    4.3.6 Measurement Approach
  4.4 Pilot Study
  4.5 Data Collection
  4.6 Data Analysis
    4.6.1 Data Cleaning
    4.6.2 Defining the Correct Prioritization Order
      4.6.2.1 WSJF Calculation Method
    4.6.3 Prioritization Accuracy Scoring
    4.6.4 Cognitive Load Analysis
    4.6.5 Descriptive Statistics
  4.7 Ethics
  4.8 Validity of the Study
5 Results
  5.1 Introduction
  5.2 Demographics of Survey Participants
  5.3 Results Aligned with Research Questions
    5.3.1 Overview of Key Task Metrics
    5.3.2 RQ1: How do different styles of XAI impact cognitive load during decision-making in requirements prioritization?
      5.3.2.1 Correlation Between Tasks: Evidence of XAI's Influence on Cognitive Load
      5.3.2.2 Statistical Differences in Cognitive Load Measures
      5.3.2.3 Impact of Different XAI Types on Cognitive Load
      5.3.2.4 Correlation Test for different XAI
      5.3.2.5 Statistical Differences in Cognitive Load by XAI Type
    5.3.3 RQ2: How do different styles of XAI impact the quality of decision-making in requirements prioritization tasks?
      5.3.3.1 Correlation Between Tasks: Evidence of XAI's Influence on Decision Quality
      5.3.3.2 Statistical Differences in Cognitive Load Measures
      5.3.3.3 Impact of Different XAI Types on Decision Quality
      5.3.3.4 Correlation Test for different XAI
      5.3.3.5 Statistical Differences in Decision Quality by XAI Type
    5.3.4 RQ3: How do users' preferences for different XAI formats relate to their task performance, perceived mental effort, and trust in AI-supported requirements prioritization?
      5.3.4.1 Participant Preferences for XAI Types
      5.3.4.2 Correlation Between XAI Preferences and Decision Quality
      5.3.4.3 Significance Between XAI Preferences and Decision Quality
      5.3.4.4 Significance between Correctness and perceived easiest to understand XAI
      5.3.4.5 Significance between mental effort and perceived overall preferred XAI
      5.3.4.6 Significance between Trust in XAI and Reported Confidence in Decisions
    5.3.5 Participant Perceptions of XAI Trust, Confidence, and Future Use
6 Discussion
  6.1 RQ1: How do different styles of XAI impact cognitive load during decision-making in requirements prioritization?
  6.2 RQ2: How do different styles of XAI impact the quality of decision-making in requirements prioritization tasks?
  6.3 RQ3: How do users' preferences for different XAI formats relate to their task performance, perceived mental effort, and trust in AI-supported requirements prioritization?
  6.4 Summary of Discussion
7 Conclusion
  7.1 Limitations
  7.2 Future Work
  7.3 Use of generative AI in this thesis
Bibliography
A Appendix
  A.1 Supplementary Tables
B Survey Instrument

List of Figures

2.1 Requirements Engineering (RE) process with feedback loops ([49])
4.1 Methodology process flow (numbered steps)
5.1 Distribution of participants' professional roles
5.2 Distribution of participants' experience
5.3 Distribution of participants' prioritization frequency
5.4 Box plots of all participant results
5.5 Key Task Metrics
5.6 Distribution of Participant Preferences for Each XAI Type by Category
5.7 Participant Ratings of Trustworthiness, Confidence, and Comfort with Future Use of XAI

List of Tables

3.1 Cognitive Load in General Domains
3.2 Cognitive Load in Software Engineering
4.1 Example of WSJF Grouping for Task 1.1 – Loan Management Task
5.1 Summary of average scores across key metrics by task and XAI type
5.2 Paired t-test results for mental effort and task difficulty across tasks
5.3 Spearman correlation between Tasks 1.2 and 2.2 across key metrics for each XAI type
5.4 Paired t-test comparison of Task 1.2 and Task 2.2 across XAI types
5.5 Paired t-test results for correctness and confidence across tasks
5.6 Spearman correlation between Tasks 1.2 and 2.2 across key metrics for each XAI type
5.7 Paired t-test comparison of Task 1.2 and Task 2.2 across XAI types
5.8 Comparison of correctness scores based on participants' preferred XAI type
A.1 Spearman correlations between task pairs for correctness, effort, difficulty, and confidence
A.2 Spearman correlation between perceived understandability of XAI types and performance metrics

1 Introduction

Artificial Intelligence (AI) is rapidly reshaping software engineering, changing the way core development tasks are carried out. In particular, recent studies show that AI is becoming increasingly embedded in Requirements Engineering (RE), where it is used to support activities such as eliciting requirements, prioritizing features, and analyzing trade-offs [24, 5]. These activities are central to project success because they require stakeholders to weigh feasibility, manage risks, and maximize value [62, 65]. As AI systems take on a greater role in these decisions, the challenge is no longer only whether their outputs are accurate, but also whether practitioners can understand and reason with them [6, 27].

A central concern in this interaction is cognitive load, the mental effort required to process and integrate information during task execution [60, 48]. In RE, practitioners already operate under high cognitive demands due to the complexity of requirements, the diversity of stakeholders, and the presence of competing constraints [30, 2]. When AI-generated recommendations are opaque, vague, or misaligned with user expectations, they increase this mental effort and can quickly lead to cognitive overload [46, 45]. Such overload does not simply make tasks harder; it reduces the quality of decisions and undermines trust in AI systems [15, 27].

Explainable AI (XAI) has emerged as a promising way to address the challenges posed by opaque AI outputs. Techniques such as confidence scores, bar chart visualizations, and plain-language text explanations are designed to improve transparency and build user trust by clarifying how AI systems generate their results [6, 17].
Evidence from domains such as healthcare and other safety-critical settings suggests that well-designed explanations can enhance decision-making by making AI predictions more interpretable and actionable [27, 32]. Despite these advances, the influence of explanation format on cognitive load and decision-making performance within requirements engineering (RE) tasks remains insufficiently explored [23, 5].

The broader literature also highlights the cognitive demands of RE tasks themselves. Studies show that multitasking, task complexity, and ambiguous criteria can substantially increase the mental effort required for requirements prioritization and analysis [30, 2]. Research in behavioral software engineering further underscores the need to understand both individual and team cognition when engaging with decision-support tools [21, 51]. At the same time, findings from XAI research confirm that explanation design directly shapes users' performance, trust, and overall satisfaction [6, 45]. Yet, few studies bring these perspectives together, leaving an important gap in how different forms of explanation influence cognitive load during RE prioritization tasks.

This thesis addresses that gap by empirically examining how three common explanation formats (text, bar charts, and confidence scores) shape cognitive load and decision-making performance in requirements prioritization tasks. Using a controlled survey experiment that varies task complexity (two criteria versus four criteria), the study provides systematic evidence on whether particular explanation designs can reduce mental effort and enhance decision quality [54, 63, 45, 15].

The significance of this research lies in bridging explainability studies with cognitive load theory within the specific context of requirements engineering. While much prior work has assessed explanations primarily in terms of technical accuracy or model interpretability, this thesis shifts attention to the human perspective, focusing on how individuals experience and manage cognitive demands when making critical project decisions [60, 5, 23, 27, 6]. In doing so, the study offers practical insights for designing AI tools that better align with human cognitive capacities, enabling practitioners and organizations to adopt AI in ways that actively support rather than complicate prioritization and collaboration in software projects.

1.1 Thesis Outline

This thesis report is organized into several key sections to provide a clear and structured overview of the study. It begins with an introduction to the topic, outlining the problem space and explaining why the study matters. The background section then sets the foundation by discussing Cognitive Load Theory and its relevance to requirements engineering, along with ideas around how humans and AI can work together in this space. The next part covers related work, summarizing what past research has found and pointing out the gaps this study aims to address. The methodology section walks through how the study was carried out, from the survey design and tasks to how the data was collected and analyzed. This is followed by the results chapter, which shares what was found in the responses and highlights the main patterns. The discussion then reflects on these findings, connecting them back to the research questions and existing studies, and considering what they mean for the use of AI in requirements engineering.
After that, the thesis looks at potential limitations and factors that could have influenced the results. It ends with a conclusion that wraps everything up, highlights the study's contributions, and suggests where future research could go.

2 Background

2.1 Background

This section presents background information on Cognitive Load Theory (CLT) and the Requirements Engineering (RE) process, focusing on the activity of requirements prioritization. It also explores the relevance of CLT in RE contexts, especially as human engineers increasingly collaborate with Artificial Intelligence (AI) tools in decision-making processes. The section thus provides foundational context for the research: it introduces Cognitive Load Theory and its theoretical underpinnings, explains the nature of Requirements Engineering in software development with an emphasis on the cognitively intensive task of prioritizing requirements, and then elaborates the connection between CLT and RE, establishing the rationale for applying cognitive principles to the challenges of AI-assisted RE.

2.1.1 Cognitive Load Theory

Cognitive Load Theory (CLT), originally developed by John Sweller in the late 1980s, is a psychological theory concerned with how people process and retain information while learning or performing tasks [59]. The theory is based on the premise that working memory, the mental space in which we process information, is limited in both capacity and duration. When individuals are asked to perform complex tasks, especially those involving new or unstructured information, they may experience cognitive overload, impairing learning, problem-solving, or decision-making.

According to CLT, cognitive processing is divided into three kinds of load. Intrinsic cognitive load depends on the built-in complexity of the task itself. For example, analyzing interdependent software requirements involves holding multiple interacting elements in mind, which inherently increases the mental effort required [61][48]. Extraneous cognitive load results from the way information is presented to the learner. Poorly structured documentation or confusing user interfaces can add unnecessary load without supporting learning or task completion [47]. Germane cognitive load is the beneficial mental effort used to build knowledge structures or "schemas" that improve problem-solving and understanding. For instance, when requirements engineers reflect on prioritization strategies and gradually develop heuristics for evaluating trade-offs, they are investing cognitive effort that strengthens their long-term expertise [61].

The key goal of CLT is to design information and tasks that minimize unnecessary load, manage complexity, and encourage productive learning. These principles are increasingly relevant in software development contexts where high cognitive demands can affect decision-making and productivity.

2.1.2 Requirements Engineering and Prioritization

Requirements Engineering (RE) is a structured process in software development focused on identifying, documenting, analyzing, and managing system requirements. The goal is to ensure that the final software product aligns with user needs, stakeholder goals, and system constraints. The RE process generally consists of several stages: elicitation, where requirements are gathered; specification, where they are documented; validation, where correctness is confirmed; and management, where changes are tracked throughout the lifecycle [49].
Figure 2.1: Requirements Engineering (RE) process with feedback loops ([49]). Stages shown: Elicitation (gather requirements), Specification (document & analyze), Validation (review & agree), and Management (trace & change); the feedback edges are labeled "issues found" and "change requests".

One of the most critical and cognitively demanding steps in RE is requirements prioritization. This is the process of determining the relative importance of various requirements to guide decision-making and resource allocation. Engineers must often prioritize based on multiple, and sometimes conflicting, criteria such as stakeholder value, technical feasibility, cost, and implementation risk [1]. In multi-stakeholder environments, prioritization becomes even more complex due to differing opinions and business objectives.

Prioritization becomes increasingly complex in large-scale or multi-stakeholder projects, where competing interests must be balanced. Traditional methods such as the Analytic Hierarchy Process (AHP) and Cost-Value Approaches are commonly used, but they often require engineers to process large volumes of information and make difficult trade-offs [36]. This leads to significant cognitive effort, especially when requirements are ambiguous or when there are many dependencies between them.

The emergence of Artificial Intelligence tools, including Large Language Models (LLMs), has added new capabilities to the prioritization process. These tools can analyze historical data, detect patterns, and propose ranked lists of requirements based on weighted factors. While AI can help reduce manual effort, it also introduces new challenges in managing cognitive load, particularly when AI outputs are poorly explained or misaligned with human expectations [56].

2.1.3 CLT and Its Relevance in Requirements Engineering

Cognitive Load Theory is highly relevant to Requirements Engineering, especially in activities such as elicitation, analysis, and prioritization, where engineers must process complex information and make judgments under uncertainty. As AI tools become more integrated into RE tasks, it is essential to ensure that these tools support human cognition rather than overwhelm it [3].

Studies have shown that engineers experience high levels of intrinsic and extraneous cognitive load when working with complex, unstructured requirements or when interpreting unclear AI-generated suggestions [5][30]. Poor management of these cognitive demands can result in decision fatigue, errors, and reduced stakeholder alignment. On the other hand, tools designed with CLT principles, such as those that present visual models, modularize information, or offer clear feedback, can reduce unnecessary load and improve task performance [44].

In particular, requirements prioritization benefits from CLT-informed AI tool design. Breaking down complex prioritization decisions into smaller, more manageable parts can help engineers focus better and reason more clearly. Similarly, AI tools that offer transparent, explainable recommendations rather than opaque outputs can reduce extraneous load and increase trust in the system. As such, CLT offers a theoretical lens through which the effectiveness of AI-assisted RE tools can be evaluated.

2.1.4 Explainable AI (XAI) and Its Role in Requirements Engineering

AI systems integrated into requirements engineering require humans to understand their outputs effectively [13].
The set of techniques known as Explainable AI (XAI) provides transparency into AI system behavior and decision-making processes in a form that human users can understand. The interpretability requirement in RE contexts becomes essential because engineers need to evaluate and validate AI-generated suggestions and potentially make changes to them [28].

The implementation of XAI techniques enhances trust, usability, and cognitive efficiency in human-AI collaboration by minimizing the unclear aspects of AI outputs. Engineers face difficulties in understanding AI recommendation rationales because of a lack of explainability, which results in cognitive overload and misuse [15]. Well-designed XAI methods enable engineers to verify AI outputs efficiently, which strengthens user confidence and facilitates better decision-making processes [45].

In the context of software and requirements engineering, several prominent XAI techniques have gained traction. Model-agnostic approaches like LIME (Local Interpretable Model-Agnostic Explanations) and SHAP (SHapley Additive exPlanations) are widely adopted to interpret complex machine learning models by highlighting feature contributions for individual predictions [7]. Visual tools such as saliency maps and attention heatmaps are often used in domains involving image or text data, offering intuitive cues about the system's focus during decision-making. In requirements engineering specifically, more structured explanations such as decision trees, rule-based outputs, and ranked lists of features or criteria are frequently integrated to support traceability and justify prioritization decisions [33]. These methods aim to make AI outputs not only transparent but also actionable for engineers and stakeholders who rely on such insights for validating requirements, allocating resources, or managing trade-offs.

This research implements three XAI methods: confidence scores, bar charts, and text-based explanations. These modalities represent different ways to achieve interpretability: confidence scores communicate the AI's quantitative certainty, bar charts display feature importance for fast comparison, and text explanations deliver natural language justifications for AI-driven prioritization. Research by [7] demonstrates that such XAI methods can both enhance understanding and minimize the mental work needed to interpret AI recommendations.

3 Related Work

This section presents an overview of research on cognitive load, focusing on various domains relevant to this study. First, we summarize findings from a broad range of domains where cognitive load has been studied, such as education, healthcare, navigation, and marketing. Second, we review research specific to software engineering, where cognitive demands are prominent due to task complexity and system interdependencies. Finally, we discuss emerging literature on human-AI collaboration in SE and Requirements Engineering, particularly the role of Large Language Models and Explainable AI in shaping cognitive experiences during prioritization tasks.

3.1 Cognitive Load in General Domains

Research on cognitive load extends far beyond software engineering, and insights from other domains provide useful analogies for understanding task complexity, measurement, and mitigation strategies.
We include studies from education, healthcare, navigation, marketing, and teamwork because they illustrate three points that are directly relevant to this thesis: (1) cognitive load consistently emerges as a barrier to effective performance across domains, (2) researchers have used diverse measurement techniques that can inform methodological choices in this work, and (3) strategies to mitigate cognitive load remain underdeveloped, motivating further investigation in software engineering contexts.

The studies summarized in Table 3.1 were selected because they are frequently cited, represent methodological diversity (self-reports, physiological monitoring, behavioral measures), and exemplify how different factors such as emotional arousal, task complexity, and collaboration shape cognitive effort. We describe them as "key" studies, not because they exhaustively cover the field, but because they are illustrative and transferable to the challenges of requirements engineering and human-AI collaboration.

Table 3.1 organizes prior work by the "Actor type" (individual, human–AI collaboration, or human–human teams). For each study, we report the "Cognitive load factor" being investigated (e.g., task complexity, distractions, working memory load), the "Task" participants performed, the "Main findings", and whether "Mitigation strategies" were proposed or tested. The "Domain" column specifies the application area, while the measurement columns indicate how cognitive load (CL Measure) and task performance (TP Measure) were assessed. The final column provides references. This organization allows comparison across domains and highlights recurring themes: task complexity and distractions consistently elevate cognitive load, measurement techniques vary widely, and mitigation remains more often theoretical than empirically validated.

To illustrate, Fraser [21] found that high emotional excitement in simulated clinical settings increased mental effort and hindered accurate task execution. Similarly, Skulmowski [58] showed that learners under high extraneous and intrinsic load experienced reduced focus and slower learning across complex online courses.

Another notable study by Žagar [68] examined navigation tasks and demonstrated how auditory distractions could increase error rates by elevating mental load. In marketing, Kakaria [35] observed that consumers who shopped without a plan showed higher EEG-based cognitive load, indicating more impulsive and less accurate decision-making. Furthermore, Whitney [64] showed how framing effects, combined with increased working memory load, shaped risky decisions in high-stress conditions.

These studies employed various cognitive load measurement techniques, including self-reports (e.g., 7-point rating scales), physiological monitoring (e.g., heart rate, EEG), and behavioral metrics (e.g., error rates, decision time). Despite the diversity of applications, task complexity consistently emerged as a primary cognitive load driver across domains [58][21]. However, few of these studies evaluated strategies to reduce load; the mitigation approaches that were proposed, such as training or adaptive system design, remained theoretical [68].
Actor Type | CL Factor | Task | Findings | Mitigation | Domain | CL Measure | TP Measure | Ref
Individual | Emotional state | Simulation training | More excitement increased CL; calmness reduced CL | Not addressed | Medical Education | 7-pt scale | Heart sound ID | [21]
Individual | Extraneous/intrinsic/germane load | Online learning | Extraneous load reduces focus, right load increases understanding | Theory: Constructive alignment. Empirical: No empirical evaluation | Education Psychology | Eye tracking, 7-pt scale, pupillometry | Correct answers, response rates | [58]
Individual | Distraction | Ship steering w/ alarm sounds | More errors, increased CL | Theory: Distraction training | Navigation | Heart rate and stress levels | Reaction time, errors | [68]
Individual | Purchase planning | Virtual shopping | Unplanned shopping increased cognitive load | Theory: Planning strategies | E-commerce | EEG (gamma band) | Time, planned/unplanned item count, total expenditure | [35]
Individual | Working memory load | Risky decisions | High load decreased risky choices | Theory: Increased load limits WM. Empirical: Higher load reduces risky decisions | Psychology | Dual-task | Decision bias | [64]
Individual | Memory load | Decision-making with uncertainty information | High load reduced optimal decisions | Theory: Cognitive load hinders decisions. Empirical: Cognitive load impairs optimal decisions | Cognitive Psychology | Dual-task | Accuracy, decisions | [3]
Individual | Task complexity, visual distraction | Surgical decision | High TC increased CL, worse decisions | Theory: TC increases CL. Empirical: TC had no impact | Medical | NASA-TLX, SURG-TLX, eye-tracking, EEG, EDA | Time, errors, mental effort, correct tasks | [32]
Individual | Morphological clarity | Image classification | Low MC increased CL; high MC reduced CL | Empirical: Adjacent visualizations and high MC reduced CL | XAI and its impact on human-AI collaboration | Pupil dilation, 7-pt scale | Accuracy, confidence, time | [33]
Human-AI | TC, cognitive resources | AI chatbot learning | Reduced CL, increased learning | Theory: AI-assisted learning (iLearnTech chatbot) | Education | 7-pt scale | Correct answers, time, accuracy, error | [46]
Human-AI | Task complexity | Robot-assisted gait | Real-time task adjustment maintained optimal CL | Theory: Adaptive difficulty. Empirical: System maintained cognitive load with 88% accuracy | Medical | HR, EEG | Time, correct answers | [38]
Human-AI | Time scarcity, technology availability | Creative problem-solving | Time scarcity increased AI use, increased CL | Empirical: Time management strategies | Creative Teamwork | Self-report, task ratings | Success, creativity | [57]
Human-AI | TC, decision flexibility | Dynamic team tasks | Flexible AI reduced CL, improving adaptability | Empirical: Adaptive AI | Workplace AI Integration | TC metrics, team adaptability measures | Success rate, goal achievement | [28]
Human-AI | AI explainability | COVID-19 decisions | Different XAI explanation types affected cognitive load and task performance; explanations focused on specific decisions led to reduced cognitive load and better performance | Theory: Clear local explanations improve cognitive efficiency. Empirical: Local XAI explanations reduced cognitive load and improved TP | AI/Healthcare | 7-pt scale | Accuracy, time | [29]
HH-Teams | Physiological synchronization, TC | Cardiac surgery | Increased synchronization increased performance | Empirical: Feedback on synchronization | Medical | HRV, entropy measures | Surgical errors, time | [16]
HH-Teams | Collaborative CL, transactive activities | Collaborative learning | Theoretical: collaboration can reduce CL if guided | Theory: Structured guidance, role distribution | Educational Psychology | Theoretical discussion, no direct empirical measurement | Theoretical | [37]
HH-Teams | Cognitive effort, fatigue, TC | Team sport | High CL impaired physical and tactical performance in sports | Empirical: Structured training | Sports | NASA-TLX, PANAS, HR | Physical performance, tactical decision-making | [23]
HH-Teams | Team efficiency, TC | Military decisions | Improved decision-making and team performance reduced cognitive load | Empirical: Decision-support systems improved team efficiency and reduced cognitive load | Military | TCE Score | Task performance assessed through the Air Defense Warfare Team Performance Index (ATPI) | [34]
HH-Teams | Cognitive processing load, collaboration technology | Simulated command and control | High CL increased errors, time | Empirical: Task simplification and real-time feedback improved performance | Military/Emergency | NASA-TLX, TC metrics | Error rates, time | [22]

Table 3.1: Cognitive Load in General Domains

3.2 Cognitive Load in Software Engineering

Cognitive load has also been studied in the context of software development, where complex problem-solving and information-intensive tasks are the norm. We include this body of work because it directly informs the challenges of requirements engineering and prioritization tasks addressed in this thesis. The studies summarized in Table 3.2 were selected because they represent diverse methodologies (EEG, eye-tracking, self-reports), focus on typical SE activities (e.g., coding, debugging, information sharing), and highlight both drivers of cognitive load and early attempts at mitigation. We describe them as "key" studies not because they exhaustively cover the field, but because they illustrate recurring patterns and gaps that are transferable to our problem space.

Table 3.2 organizes prior work by the "Actor type" (individual, human–AI collaboration, or human–human teams). Each row describes a study, reporting the "Cognitive load factor" under investigation (e.g., task complexity, distraction, trust), the "Task" participants performed, the "Main findings", and whether any "Mitigation strategies" were proposed or tested. The "Domain" column specifies the application area (e.g., software development, VR tasks, human–robot teaming). The final two measurement columns indicate how cognitive load (CL Measure) and task performance (TP Measure) were assessed, followed by the reference.

To illustrate, Goncales [26] used EEG sensors to show that higher task complexity increased cognitive load and reduced code accuracy. In human–computer interaction research, Ghulaxe [25] proposed AI-driven distraction reduction in development environments. While theoretically promising, these strategies were not empirically validated, reflecting a broader issue in SE research: a lack of rigorous evaluation of cognitive load interventions.

Across these studies, tools and methods to assess cognitive load vary: some use physiological measures (EEG, heart rate, pupil dilation), while others rely on behavioral performance or subjective ratings.
While tasks such as prioritization, elicitation, and debugging are widely acknowledged as cognitively intense, there is still insufficient empirical work on effective interventions to support engineers in these phases [44].

Actor Type | CL Factor | Task | Findings | Mitigation | Domain | CL Measure | TP Measure | Ref
Individual | Task complexity | Software development (coding) | Increased TC led to higher CL, affecting code quality and speed | Theory: TC leads to higher CL, but no empirical mitigation strategies | Software Engineering | EEG | Code accuracy, time | [26]
Individual | Task complexity | Cognitive tasks (varied) | Higher TC increased CL, shown via physiological signals | Theory: Accurate measurement of CL can help adaptive systems to reduce CL. Empirical: No specific mitigations | Human-Computer Interaction | Pupil, blinking rate, HR | None | [2]
Individual | Attention, distraction | Driving task | No empirical results; theoretical: AI reduces CL | Theory: AI gaze tracking | Automotive | The evaluation remained theoretical, based on proposed AI solutions like gaze tracking and blinking pattern detection | Theoretical only | [25]
Human-AI | Task difficulty, trust | VR search task | Higher CL reduced trust and performance | Theory: Biosignal assessment. Empirical: No significant correlation found between biosignals, trust, and cognitive load | AI | EEG, HR, 7-pt scale | Time, correct answers | [27]
Human-AI | Task complexity, cognitive teaming, mental modeling | Rescue / exploration | Increased CL led to poor teaming | Theory: Adaptive mental modeling | Human-Robot Teaming | Theoretical discussion; no empirical measurement | Theoretical discussion; no empirical measurement | [13]
Human-AI | Task complexity, packing difficulty | Collaborative packing task (virtual environment) | Higher CL reduced task efficiency | Empirical: AI-assisted packing guidance | Human-AI Collaboration | NASA-TLX | Time, efficiency, errors | [39]
Human-AI | Cognitive capacity limitations, task complexity | Information sharing | Increased CL decreased sharing | Empirical: HMM (Hidden Markov Model)-based cognitive load model improved information sharing and teamwork | Human-Agent Collaboration | Secondary task performance, information recall (HMM-based) | Information recall, accuracy | [19]
Human-AI | Task difficulty, agent reliability | N-back, shape selection tasks | Lower cognitive load improved task performance; agent reliability reduced cognitive strain | Empirical: Reliable agent guidance reduced cognitive load and improved task efficiency | VR-based Human-AI Interaction | EEG, GSR, HRV, self-reported cognitive load ratings | Time, accuracy | [27]
Human-AI | Decision style, AI identity | Word-guessing game | Autocratic decision-making increased CL and reduced team efficacy; democratic style improved collaboration and lowered CL | Empirical: Democratic decision-making improved team efficacy and user satisfaction, reducing cognitive load | Human-AI Collaboration | NASA-TLX | Game win rate, accuracy | [43]
HH-Teams | Task complexity | Emergency game | Higher TC increased CL | Theory: Eye-tracking interfaces | Gaming | Eye-tracking metrics (e.g., pupil diameter) | Accuracy | [4]

Table 3.2: Cognitive Load in Software Engineering

3.3 Human-AI Collaboration and LLMs in Requirements Engineering

With the increasing adoption of Large Language Models (LLMs) in software engineering, researchers have begun to explore their potential across the requirements engineering (RE) lifecycle.
Beyond traditional automation techniques, LLMs such as GPT-3.5 and GPT-4 are now being used to support elicitation, analysis, refinement, and prioritization tasks.

For elicitation, conversational LLMs have been studied as proxies for stakeholders during interviews. Lojo et al. [42] showed that students preferred LLM-based simulations over static transcripts when practicing elicitation, describing them as more realistic and engaging, though sometimes inconsistent. Similarly, Franch et al. [20] investigated how LLMs can generate stakeholder questions from software requirement patterns. While effective for broadening coverage, their approach sometimes produced redundant or out-of-scope requirements, requiring additional filtering effort from engineers.

Expanding on this idea, Ataei et al. [8] proposed "Elicitron", a multi-agent framework where LLMs simulate users, generate observations, and derive latent needs. This approach demonstrated improved coverage of design requirements but introduced interpretability challenges, underscoring the cognitive demands placed on engineers when reconciling multiple AI outputs.

Quattrocchi et al. [50] benchmarked several LLMs for generating and evaluating user stories. They found that while LLMs matched humans in terms of coverage and style, they performed less well in creativity and acceptance criteria, shifting the cognitive burden to human reviewers for quality assurance.

In the area of prioritization, Sami et al. [55, 56] introduced a multi-agent system employing LLMs to improve user story quality and ranking accuracy. While their approach showed productivity gains, it also revealed new issues: when AI-generated suggestions were unclear or overly numerous, engineers experienced extraneous cognitive load, often leading to confusion and delays.

These findings align with broader concerns in Explainable AI (XAI). Arrieta et al. [7] emphasize that for AI systems to be cognitively beneficial, they must provide explanations aligned with human reasoning. In RE, where decisions must be justified and traceable, explainability is essential. Techniques such as confidence scores, visualized importance weights, and natural language justifications have been proposed to increase interpretability, reduce mental effort, and improve trust.

Despite these advances, most LLM-based strategies remain underexplored in RE, particularly regarding their cognitive implications. The cost of interacting with opaque or overwhelming AI suggestions in sensitive tasks such as requirements elicitation and prioritization remains a significant research gap, motivating this thesis to investigate how human-AI collaboration can be designed to support, rather than hinder, engineers' cognitive processes.

3.4 Summary

From the reviewed literature, we observed three consistent themes.

First, task complexity is consistently identified as a major source of cognitive load across domains, including software engineering. For example, Goncales et al. [26] showed that higher task complexity increases cognitive load during programming, but they did not investigate how developers could be supported in managing this demand, particularly in decision-intensive activities such as requirements prioritization.

Second, while AI tools such as LLMs are increasingly applied across the requirements engineering (RE) lifecycle, research on their role in prioritization remains limited. Sami et al.
[55, 56] demonstrated that multi-agent LLM systems can improve user story quality and ranking accuracy, but their work did not consider the cognitive implications of interacting with such systems. Other studies have shown promising applications in elicitation and user story generation, such as Lojo et al. [42], Franch et al. [20], and Quattrocchi et al. [50], yet prioritization, despite being a cognitively demanding and decision-critical task, has received comparatively little attention.

Third, explainability has been widely discussed as a way to make AI more understandable and trustworthy, but its impact in RE tasks is still largely untested. Arrieta et al. [7] provide a broad taxonomy of XAI techniques, yet no study has empirically examined how explanation styles influence engineers' mental effort and decision quality in prioritization contexts. Poorly explained outputs or overwhelming recommendations are, therefore, likely sources of extraneous cognitive load, but they remain underexplored in RE research.

Taken together, these gaps highlight the need to study requirements prioritization as a cognitively demanding RE activity where AI can both support and burden engineers. While prior work has shown that LLMs can assist in prioritization, no study has systematically investigated how AI support with or without explainability shapes the cognitive experience of engineers. This thesis addresses that gap by empirically evaluating how AI-assisted prioritization affects cognitive effort and decision outcomes, contributing new insights at the intersection of prioritization, human cognition, and explainable AI.

4 Methodology

This study investigates the influence of XAI on cognitive load and decision-making performance during software requirements prioritization tasks. We focus on prioritization because it is one of the most cognitively demanding and decision-critical activities in requirements engineering. Engineers must weigh competing stakeholder needs, balance limited resources, and make trade-offs under uncertainty. While prior work has applied LLMs to elicitation and user story generation, research on prioritization has been comparatively scarce and has not examined the cognitive implications of AI support. Addressing this gap, our methodology followed a sequential process involving literature review, research design, and survey implementation. This approach ensured the work was grounded in theoretical understanding, refined through empirical testing, and systematically evaluated.

4.1 Research Design

A within-subject experimental design was adopted, where each participant performed prioritization tasks both without and with AI support. This design was chosen because it allowed participants to serve as their own control, enabling systematic comparisons between unassisted and assisted conditions. In particular, it supported analysis of:

• differences in cognitive load (RQ1),
• quality of decision-making (RQ2), and
• user preferences across explanation formats (RQ3).

The experiment was structured around two domains: banking loan management and doctor appointment scheduling. These domains were selected because they are widely understandable and reflect realistic decision-making contexts without requiring specialized knowledge. To introduce variation in complexity, the banking tasks involved two prioritization criteria, while the healthcare tasks involved four.
This staged setup enabled systematic analysis of how task complexity and explanation format interact to influence cognitive load and performance.

Details of the task flow, prioritization criteria, and measurement instruments are explained in the sections below. The research questions guiding this study are:

RQ1: How do different styles of XAI impact cognitive load during decision-making in requirements prioritization?

RQ2: How do different styles of XAI impact the quality of decision-making in requirements prioritization tasks?

RQ3: How do users' preferences for different XAI formats relate to their task performance, perceived mental effort, and trust in AI-supported requirements prioritization?

4.2 Methodology Process Overview

An overview of the methodological process is presented in Figure 4.1, showing the sequential steps from literature review through to data interpretation.

Figure 4.1: Methodology process flow (numbered steps): (1) Literature Review (Cognitive Load, XAI, RE Practices); (2) Research & Questionnaire Design (Criteria, Domains, Tasks); (3) Pilot Study (Refinements & Adjustments); (4) Data Collection (With & Without AI Assistance); (5) Data Analysis (Cognitive Load & Performance); (6) Results (Answer RQs, Discuss Implications).

4.3 Survey Design and Questionnaire

The survey instrument was developed using insights from the literature on cognitive load theory, requirements engineering, and XAI transparency, refined through supervisor feedback and a pilot study. The full instruments, including task descriptions, instructions, and the XAI prompts for the AI-generated explanations, are provided via the link in Appendix B.

Because Microsoft Forms does not support random assignment of alternative explanation types within a single survey, we created three separate versions of the survey. Each version contained a different combination of XAI explanation formats (e.g., text with confidence, bar chart with confidence, or text with bar chart). Participants were distributed across these versions to ensure that no participant was exposed to the same explanation format twice, while still allowing comparison between numeric, visual, and textual explanations. Participants were randomly assigned to one survey version, meaning that each individual was exposed to only one explanation format per domain, while across the full sample, all three formats were tested. A sketch of this assignment logic is given at the end of this subsection.

4.3.1 Survey Flow

The survey began with demographic questions, followed by requirements prioritization tasks in two domains. In each domain, participants first completed a baseline task without AI assistance, followed by a comparable task with AI-generated prioritization presented in one of three explanation formats. After each task, participants rated their perceived cognitive load. The survey concluded with questions on usability, trust, and preferences for the explanation format experienced, along with open-ended feedback.

4.3.2 Demographics

The demographics section collected participant details such as professional role, years of experience with requirements prioritization, and prior exposure to AI tools. This contextual information was important for interpreting variation in task performance and workload ratings. The target population was professionals and students with exposure to requirements engineering, as they regularly engage in prioritization decisions and are familiar with the challenges of balancing competing criteria. Their expertise provided both realism and validity to the evaluation.
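To make the version-based counterbalancing described in Section 4.3 concrete, the following is a minimal Python sketch of how participants could be assigned to one of the three survey versions. The version labels, format names, and the assign_version helper are illustrative assumptions for exposition, not part of the actual Microsoft Forms setup.

```python
import random

# Hypothetical mapping from survey version to the two XAI formats a
# participant sees (one per task domain: banking first, healthcare second).
# The labels are illustrative; the thesis used three Microsoft Forms
# versions built around the same pairing idea.
SURVEY_VERSIONS = {
    "A": ("text explanation", "confidence score"),
    "B": ("bar chart", "confidence score"),
    "C": ("text explanation", "bar chart"),
}

def assign_version() -> tuple[str, tuple[str, str]]:
    """Randomly pick one survey version for a new participant.

    Because each version pairs two different formats, a participant never
    sees the same explanation format twice, while all three formats are
    covered across the whole sample.
    """
    version = random.choice(list(SURVEY_VERSIONS))
    return version, SURVEY_VERSIONS[version]

if __name__ == "__main__":
    for participant in range(1, 7):
        version, formats = assign_version()
        print(f"Participant {participant}: version {version}, "
              f"banking task -> {formats[0]}, healthcare task -> {formats[1]}")
```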
4.3.3 Prioritization Tasks

The main body of the questionnaire contained two domains: a Banking Loan Management System and a Doctor Appointment System. Each task required participants to prioritize ten functional requirements. These requirements were created by the researchers, inspired by typical functionalities from publicly available system descriptions and prior RE literature, to ensure they were realistic yet domain-neutral. Using ten requirements balanced realism and feasibility: the number was large enough to represent the complexity of real-world decision-making but still manageable within the time constraints of an online survey.

To systematically vary complexity, the banking task (Task 1) required prioritization based on two criteria: development time and customer value. The healthcare task (Task 2) required prioritization based on four criteria: development time, customer value, risk, and time sensitivity. This staged design allowed us to examine how cognitive load changes when task complexity increases while holding other parameters constant. These four criteria are widely recognized in the requirements prioritization literature [36, 65, 9, 10, 11, 14, 40], providing theoretical grounding for their selection.

More specifically, in the banking domain, Task 1.1 involved prioritizing requirements for a loan management system without AI assistance, while Task 1.2 involved prioritizing requirements for an online banking system with AI assistance. In the healthcare domain, Task 2.1 focused on prioritizing requirements for an emergency doctor appointment booking system without AI support, while Task 2.2 addressed a general doctor appointment booking system with AI support. These domains and task variations were selected because they are familiar to most participants, involve multi-criteria trade-offs similar to software requirements decisions, and help reduce the risk of bias. By ensuring that the with-AI tasks (1.2 and 2.2) were not identical to the without-AI tasks (1.1 and 2.1), participants were less likely to find the second task easier simply due to prior exposure. This design, combined with the use of three survey versions, minimized potential carryover effects while still allowing comparison of cognitive load and decision quality with and without AI support.

4.3.4 XAI Explanation Formats

Three explanation formats were tested:

1. Confidence scores (numerical probabilities, e.g., "Requirement A is recommended with 72% confidence") were selected for their ability to communicate model certainty in a compact form.
2. Bar charts (visual ranking of requirements by importance) were selected for their ability to present comparisons quickly and clearly.
3. Textual explanations (natural language reasoning, e.g., "Requirement B is prioritized because it reduces waiting time, which users rated highly") were selected because natural language is intuitive and widely used in LLM interfaces.

These formats were chosen because they represent common explanation styles in XAI research [7, 45, 15], and together they allow us to compare numeric, visual, and textual communication of AI reasoning.

4.3.5 Implementation of AI Support

Participants did not interact directly with a live AI tool. Instead, all requirements, prioritizations, and explanations were pre-generated for consistency across participants and survey versions.
The explanations (confidence scores, bar charts, and textual justifications) were generated with ChatGPT-4 (OpenAI). To preserve authenticity, these outputs were presented as screenshots embedded directly into Microsoft Forms, so participants viewed them exactly as produced by ChatGPT-4. This ensured exposure to realistic AI-generated content without introducing variability from system interfaces or user interaction. In Appendix B, we provide the generated explanations as they appeared in Microsoft Forms, showing the different XAI formats.

4.3.6 Measurement Approach

Cognitive load was measured after each task using a 7-point Likert scale assessing mental demand, effort, complexity, and confidence [48, 41]. Decision-making performance was evaluated by accuracy, using the WSJF method. Usability, trust, and preference ratings for the explanation formats were also collected through Likert-scale items and open-ended responses. This mixed-methods approach provided both subjective (self-reported load, trust, satisfaction) and objective (accuracy, time) measures.

4.4 Pilot Study

A pilot study with a small participant group was conducted to evaluate the clarity, usability, and pacing of the questionnaire. The objectives were to test whether the task instructions were understandable, verify that the XAI explanations were interpretable, and measure the time needed to complete the survey. Findings showed that some participants misunderstood the meaning of the prioritization criteria, prompting revisions to the instructions and inclusion of illustrative examples. The demographic section was streamlined to reduce participant fatigue, and task ordering was adjusted so simpler tasks appeared first to improve engagement and minimize dropout. Descriptions of AI explanations were also refined to ensure consistency in how participants interpreted each format. These refinements increased the reliability and user-friendliness of the final instrument.

4.5 Data Collection

Participants were recruited using a convenience sampling approach [18], leveraging university mailing lists, LinkedIn, WhatsApp groups, Discord servers, and both personal and professional networks. Additional outreach was conducted via supervisors' industry contacts to enhance diversity. This non-probabilistic sampling method was chosen due to its practicality and ability to reach participants with relevant experience in software engineering and requirements prioritization. The survey was administered via Microsoft Forms and remained open for three weeks to allow sufficient time for responses. Participation was voluntary, and the survey was designed to take approximately 12 minutes to complete based on pilot testing. A total of 61 completed responses were collected, representing participants from diverse backgrounds, including software developers, testers, product owners, requirements engineers, and students in software engineering programs. This diversity helped ensure that the study captured a range of perspectives on cognitive load in AI-assisted requirements prioritization.

4.6 Data Analysis

Before analysis, all completed survey responses were consolidated into a single dataset. As described in Section 4.3, participants were distributed across three survey versions, each of which contained a different combination of XAI explanation formats (e.g., text with confidence, bar chart with confidence, or text with bar chart).
This ensured that each participant was exposed to only one explanation type per domain, while still allowing comparison of all three formats across the full sample.

For analysis, responses were grouped according to the specific XAI technique presented (bar chart, confidence score, or text explanation), and then subdivided into tasks performed with and without AI assistance. Task performance and cognitive load responses were aligned to their corresponding task identifiers. Prioritization accuracy scores were calculated using the WSJF-based gold standard described in Section 4.6.2.1. Cognitive load scores were computed as the average of the Likert-scale responses across four dimensions: mental demand, effort, complexity, and confidence.

4.6.1 Data Cleaning

During data cleaning, only the responses collected during the pilot study were removed, ensuring that the analysis was based solely on data from the final version of the survey. The remaining 61 valid responses included both task types completed by every participant: (1) baseline prioritization without AI assistance and (2) prioritization with AI-generated recommendations. For analysis, responses were sorted by these two task types, while also distinguishing between the three explanation formats used in the AI-assisted tasks.

4.6.2 Defining the Correct Prioritization Order

A reference prioritization was created for each task using the Weighted Shortest Job First (WSJF) method to evaluate participant performance objectively. Each task included ten functional requirements, with WSJF scores calculated according to the approach detailed in Section 4.6.2.1. The resulting scores were used to organize requirements into three priority levels: High, Medium, and Low. The number of requirements in each priority level varied from one task to another because the WSJF values were distributed differently across tasks. The grouping process used natural breaks in the WSJF scores, rather than fixed group sizes (e.g., 2-3-5), to determine the relative priority group boundaries [12, 55].

Requirement | Customer Value | Development Time | WSJF Score | Priority Group
Loan Payment Reminder Notifications | 5 | 2.5 | 2.00 | High
Loan Interest Rate Calculator | 4 | 2.0 | 2.00 | High
Loan Application Form | 5 | 3.5 | 1.43 | Medium
Automated Loan Status Updates | 4 | 3.0 | 1.33 | Medium
Loan Summary & Statement Generation | 4 | 3.5 | 1.14 | Medium
Loan Repayment Schedule Generator | 3 | 3.5 | 0.86 | Low
Loan Eligibility Checker | 3 | 4.0 | 0.75 | Low
Document Upload & Verification | 2 | 3.0 | 0.67 | Low
Loan Approval & Verification Process | 2 | 4.0 | 0.50 | Low
Personalized Loan Offers | 1 | 3.0 | 0.33 | Low

Table 4.1: Example of WSJF Grouping for Task 1.1 – Loan Management Task

In this example (Table 4.1), the WSJF score was calculated using only Customer Value and Development Time. The two highest-scoring requirements formed the High Priority group, the next three requirements formed the Medium Priority group, and the remaining five were classified as Low Priority. This method ensured that features delivering the highest value in the shortest time were addressed first.

In the remaining three tasks (Tasks 1.2, 2.1, and 2.2), the same WSJF-based grouping logic was applied, but the specific priority group sizes varied depending on the distribution of WSJF scores in each scenario. In Task 1.2 (AI-assisted banking), the grouping followed a similar pattern to Task 1.1 but with a different set of requirements and slightly different group sizes.
In Task 2.1 (emergency doctor booking), the WSJF formula incorporated four criteria (Customer Value, Risk Reduction, Time Sensitivity, and Development Time), with group sizes again determined by natural score gaps. Task 2.2 (AI-assisted doctor booking) also used the four-criterion WSJF calculation, producing a distinct distribution of High, Medium, and Low priority requirements. This consistent yet adaptive grouping method ensured that each task reflected the most valuable and time-efficient features for its domain while allowing fair comparison between AI-assisted and non-assisted conditions.

4.6.2.1 WSJF Calculation Method

The Weighted Shortest Job First (WSJF) method [52] was used to determine the reference prioritization order for each task. WSJF identifies which requirements deliver the most value in the shortest time and is widely used in agile prioritization. The calculation parameters varied between Task 1 and Task 2 due to differences in scenario complexity and available attribute data.

WSJF Formula for Task 1 (Loan Management and Online Banking System)

For Task 1.1 (without AI) and Task 1.2 (with AI), WSJF was calculated using only two factors:

WSJF = Customer Value / Development Time    (WSJF-1)

• Customer Value: rated between 1 and 5 based on the perceived importance of the requirement to users.
• Development Time: estimated effort or time required to implement the requirement.

WSJF Formula for Task 2 (Emergency and Doctor Appointment Systems)

For Task 2.1 (without AI) and Task 2.2 (with AI), a more detailed version of WSJF was used to reflect the higher complexity of healthcare-related decision-making:

WSJF = (Customer Value + (5 - Risk) + Time Sensitivity) / Development Time    (WSJF-2)

• Customer Value: scored from 1 to 5.
• Risk Reduction / Opportunity Enablement: scored from 1 to 5, capturing the potential to reduce failure or enable significant gains.
• Time Sensitivity: reflects how urgent the requirement was in terms of delivery impact.
• Development Time: estimated implementation time or effort.

This more comprehensive formula allowed for a richer prioritization context in tasks involving time-critical healthcare scenarios. By adapting the WSJF model to each task domain, the study ensured that prioritization benchmarks were realistic and contextually appropriate [55]. The calculated WSJF scores were used to rank the features and group them into High, Medium, and Low priority categories, as described in Section 4.6.2.

4.6.3 Prioritization Accuracy Scoring

Each participant’s prioritization output was compared to the reference grouping (High, Medium, Low). The accuracy score reflected the number of requirements correctly classified into the same group as the reference. For example, if 7 out of 10 requirements were placed in the correct group, the accuracy score was 0.70. These scores were calculated for both:

• Manual tasks (1.1, 2.1) – no AI support
• AI-assisted tasks (1.2, 2.2) – using different XAI formats

4.6.4 Cognitive Load Analysis

Perceived cognitive load was measured using 7-point Likert-scale questions. Each participant rated mental demand, task difficulty, effort, and confidence after each task. Scores were normalized and averaged into a composite measure. Separate mean load scores were computed for tasks without AI (baseline) and for tasks with XAI support, segmented by explanation type. This allowed direct comparison of mental effort under varying AI support conditions.
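To make the scoring pipeline of Sections 4.6.2.1 to 4.6.4 concrete, the sketch below shows how WSJF scores, natural-break priority groups, accuracy, and the composite cognitive load measure could be computed. The requirement names, ratings, and the specific two-largest-gaps heuristic are hypothetical illustrations only; the thesis reports that group boundaries followed natural gaps in the WSJF scores, and the rule below is merely one concrete reading of that description, not the instrument used in the study.

# Illustrative sketch (hypothetical values) of WSJF scoring, natural-break
# grouping, accuracy scoring, and the composite cognitive load measure.

def wsjf_task1(customer_value, dev_time):
    # WSJF-1: two-criteria formula used for the banking tasks.
    return customer_value / dev_time

def wsjf_task2(customer_value, risk, time_sensitivity, dev_time):
    # WSJF-2: four-criteria formula used for the healthcare tasks.
    return (customer_value + (5 - risk) + time_sensitivity) / dev_time

def group_by_natural_breaks(scores):
    # One way to operationalise "natural breaks": cut the descending WSJF
    # ranking at its two largest score gaps to form High/Medium/Low groups.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    gaps = [(ranked[i][1] - ranked[i + 1][1], i) for i in range(len(ranked) - 1)]
    cut1, cut2 = sorted(i for _, i in sorted(gaps, reverse=True)[:2])
    return {req: ("High" if idx <= cut1 else "Medium" if idx <= cut2 else "Low")
            for idx, (req, _) in enumerate(ranked)}

def accuracy(participant_groups, reference_groups):
    # Share of requirements placed in the same priority group as the reference.
    hits = sum(participant_groups[r] == reference_groups[r] for r in reference_groups)
    return hits / len(reference_groups)

def cognitive_load(mental_demand, effort, complexity, confidence):
    # Composite cognitive load: mean of the four 7-point Likert ratings.
    return (mental_demand + effort + complexity + confidence) / 4

# Hypothetical example with three requirements scored by the Task 1 formula.
scores = {"Req A": wsjf_task1(5, 2.5),   # 2.00
          "Req B": wsjf_task1(4, 3.5),   # 1.14
          "Req C": wsjf_task1(2, 4.0)}   # 0.50
reference = group_by_natural_breaks(scores)
participant = {"Req A": "High", "Req B": "Low", "Req C": "Low"}
print(reference)                          # {'Req A': 'High', 'Req B': 'Medium', 'Req C': 'Low'}
print(accuracy(participant, reference))   # 0.666... (2 of 3 in the correct group)
print(cognitive_load(5, 4, 4, 3))         # 4.0

With ten requirements per task, the same functions apply unchanged; only the score dictionary grows. Any other natural-break heuristic could be substituted without affecting the rest of the pipeline.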
4.6.5 Descriptive Statistics

Descriptive statistics were used to summarize the participants’ demographics through frequency distributions and to analyze task performance and self-reported cognitive load ratings using means and standard deviations. Cognitive load was assessed on a 7-point Likert scale, where 1 indicated very low demand/effort and 7 indicated very high demand/effort. Average prioritization accuracy scores and cognitive load ratings were computed for both AI-assisted and non-assisted conditions, enabling comparison across XAI techniques and task complexity levels.

Independent- and paired-samples t-tests were used to compare (a) performance and load between AI-assisted and non-assisted conditions, and (b) performance and load across the three explanation formats. While we did not formally test for normality, t-tests are widely used in studies with Likert-scale measures and moderate sample sizes, and we acknowledge this assumption as a limitation.

The null hypotheses stated that there would be no significant differences in (H0a) prioritization accuracy between AI-assisted and non-assisted tasks, (H0b) self-reported cognitive load between AI-assisted and non-assisted tasks, and (H0c) either accuracy or cognitive load across the three explanation formats. The corresponding alternative hypotheses proposed that AI assistance and explanation format would exert significant effects on accuracy, cognitive load, or both.

4.7 Ethics

The study was conducted in accordance with established ethical guidelines, specifically the Declaration of Helsinki [66]. Participation was entirely voluntary, and informed consent was obtained from all respondents before they began the survey. Participants were clearly informed about the study’s purpose, what their participation would involve, and their right to withdraw at any time without consequence. To protect anonymity, no personally identifiable information was collected. All responses were stored securely and were accessible only to the research team. The survey instructions also highlighted the confidentiality of the data and confirmed that it would be used solely for academic research purposes.

4.8 Validity of the Study

Construct, internal, and external validity were addressed through multiple strategies [67]. To support construct validity, the tasks and instructions were standardized, and established measurement tools were adapted, including Likert-scale items assessing perceived mental demand, task difficulty, effort, and confidence in performance. These items are widely used in cognitive load research and provide a reliable basis for capturing subjective workload. To enhance internal validity, task contexts were varied across related tasks to minimize learning effects. For example, in Task 1, participants prioritized features in a loan management system, whereas subsequent banking tasks involved general online banking activities. Similarly, in Task 2, the first task involved emergency doctor appointments, and later tasks involved routine bookings. External validity was strengthened through participant diversity, ensuring findings were relevant to both academic and industry contexts. Random assignment of participants to one of three XAI technique conditions (text explanations, bar charts, confidence scores) reduced potential bias from prior exposure [53]. Despite these controls, the study’s online administration meant participants completed tasks in uncontrolled environments, potentially introducing distractions.
The reliance on self-reported measures also means results may be subject to personal bias; however, validated scales and clear instructions were used to mitigate these risks [31].

5 Results

5.1 Introduction

The research questions guiding this thesis focused broadly on identifying cognitive load drivers and their effect on decision-making. They also address the role of XAI in shaping participants’ cognitive experiences and outcomes during requirements prioritization tasks.

The research questions reflect the nature of the data collected, which specifically evaluated how different forms of XAI, such as bar charts, textual explanations, and confidence scores, affect cognitive load and influence decision outcomes. The questions aim to capture these dynamics precisely and are as follows:

RQ1: How do different styles of XAI impact cognitive load during decision-making in requirements prioritization?
RQ2: How do different styles of XAI impact the quality of decision-making in requirements prioritization tasks?
RQ3: How do users’ preferences for different XAI formats relate to their task performance, perceived mental effort, and trust in AI-supported requirements prioritization?

The remainder of this chapter presents the findings aligned with these research questions, beginning with participant demographics, followed by an analysis of XAI’s influence on cognitive load and decision outcomes.

5.2 Demographics of Survey Participants

To better understand the context in which participants engaged with the decision-making tasks, the survey included three demographic questions: (1) participants’ primary professional role, (2) their years of experience in requirements engineering, and (3) how often they prioritize requirements in their current roles. In later sections, these data provide a foundation for interpreting participants’ interactions with XAI.

Participants represented a diverse range of roles across the software development lifecycle. The largest group identified as Software Developers (36.1%), followed by Software Testers (24.6%). Other notable roles included System/Software Architects, Quality Assurance Engineers, Project Managers, Business Analysts, UI/UX Designers, and Requirements Engineers. A small portion classified themselves under other roles, including hybrid or interdisciplinary functions. This spread indicates a broad participation base, ensuring the results are informed by varying perspectives across industry roles. See Figures 5.1a and 5.1b for a visual breakdown of participants’ roles.

Figure 5.1: Distribution of participants’ professional roles (panels a and b)

The participants’ experience in requirements engineering ranged from 0 to 18 years. The mean experience was approximately 7.4 years, with a median of 7 years, indicating a balanced representation of both early-career and seasoned professionals. A few participants reported no experience, while several others had more than a decade of involvement in RE tasks. This distribution reflects a suitable range for analyzing cognitive responses across experience levels. See Figures 5.2a and 5.2b for a visual breakdown of participants’ experience.

Figure 5.2: Distribution of participants’ experience (panels a and b)

When asked how often they prioritize requirements as part of their role, a majority of participants indicated that they engage in this activity either “Often” (57.4%) or “Very Often” (24.6%).
A smaller subset reported doing it only “Sometimes” (11.5%), while very few chose “Rarely” or “Never”. These findings confirm that requirements prioritization is a common and regular part of participants’ workflows, making them suitable evaluators of XAI support during such decision-making activities. See Figures 5.3a and 5.3b for a visualization of prioritization frequency.

Figure 5.3: Distribution of participants’ prioritization frequency (panels a and b)

5.3 Results Aligned with Research Questions

This section presents the results of the survey aligned with the research questions, focusing on how different forms of XAI influence cognitive load and decision quality in requirements prioritization tasks. The findings are organized by research question.

5.3.1 Overview of Key Task Metrics

Task / XAI Type | Correct Answers | Correct SD | Mental Effort | Effort SD | Task Complexity | Complexity SD | Confidence in Answers | Confidence SD
Task 1.1 | 3.90 | 2.17 | 4.79 | 1.40 | 4.43 | 1.51 | 4.89 | 1.59
Task 1.2 – Bar Chart | 8.47 | 3.03 | 4.53 | 1.43 | 4.68 | 1.63 | 4.58 | 1.64
Task 1.2 – Text Explanations | 8.52 | 2.87 | 4.45 | 1.36 | 4.75 | 1.29 | 4.40 | 1.39
Task 1.2 – Confidence Scores | 8.50 | 2.37 | 4.32 | 1.49 | 3.95 | 1.36 | 4.90 | 1.60
Task 2.1 | 3.47 | 1.72 | 4.77 | 1.56 | 4.77 | 1.44 | 4.88 | 1.63
Task 2.2 – Bar Chart | 4.32 | 0.23 | 4.31 | 1.54 | 4.27 | 1.32 | 4.90 | 1.51
Task 2.2 – Text Explanations | 4.10 | 0.45 | 4.37 | 1.45 | 4.21 | 1.39 | 4.94 | 1.46
Task 2.2 – Confidence Scores | 4.25 | 0.85 | 4.00 | 1.55 | 4.60 | 1.61 | 4.65 | 1.54

Table 5.1: Summary of average scores across key metrics by task and XAI type.

Before addressing the research questions individually, a high-level overview of participants’ performance and self-reported cognitive measures across all four tasks is presented in Table 5.1. This summary includes average scores for correctness, mental effort, task difficulty, and confidence, capturing the general effect of XAI on task experience. Correct answers are reported on a scale of 1–10, and all other columns on a scale of 1–7.

The box plots in Figure 5.4 illustrate the distribution of participant responses across the four tasks for all measured metrics. These plots highlight not only the central tendency but also the spread and outliers in the data. Metrics such as effort and confidence display wider distributions across tasks, indicating greater variability in participant perceptions. The individual data points further show how responses are spread within each task.

Figure 5.4: Box plots of all participant results

From Figure 5.5, it is evident that Task 1.2 (the first task involving XAI) was associated with the highest correctness and slightly reduced effort and difficulty, suggesting a positive impact of XAI support. Confidence remained relatively stable, with minor variations across tasks. These trends provide context for the more detailed analyses that follow.

Figure 5.5: Key Task Metrics

5.3.2 RQ1: How do different styles of XAI impact cognitive load during decision-making in requirements prioritization?

5.3.2.1 Correlation Between Tasks: Evidence of XAI’s Influence on Cognitive Load

To understand how different styles of XAI influence cognitive processing during decision-making in requirements prioritization, this section focuses on two key indicators of cognitive load: mental effort and task difficulty. These metrics reflect participants’ perceived cognitive burden while working through tasks with and without XAI support.
To examine how XAI influenced cognitive load, this study used both Spearman correlation and paired t-tests to capture different types of relationships within the data. Spearman correlation was chosen because mental effort and task difficulty were measured on ordinal Likert scales, making this non-parametric test more appropriate than the Pearson correlation. It allowed the analysis to detect trends in how cognitive load changed between tasks with and without XAI, without assuming a linear relationship or a normal distribution. The results indicate moderate to strong correlations between the metrics across tasks. Full details of the correlation coefficients are provided in Appendix A.1.

Mental Effort: Participants’ reported mental effort showed strong positive relationships between several tasks that included XAI. For instance, effort ratings between Task 1.2 and Task 2.2 were closely related (r = 0.582, p < 0.001), as were those between Task 1.2 and Task 2.1 (r = 0.495, p < 0.001), and between Task 2.1 and Task 2.2 (r = 0.492, p < 0.001). These findings suggest that participants tended to experience similar levels of mental workload when working with XAI, even though the tasks varied. This might reflect a consistent way of thinking about or approaching the tasks when support was available.

Task Difficulty: A similar pattern appeared in how participants rated task difficulty. There were strong positive correlations between Task 1.2 and Task 2.2 (r = 0.571, p < 0.001), Task 1.2 and Task 2.1 (r = 0.473, p < 0.001), and Task 2.1 and Task 2.2 (r = 0.475, p < 0.001). These findings suggest that XAI influenced how challenging the tasks felt, making perceptions of difficulty more consistent across the survey. However, when comparing Task 1.1, which did not include XAI, to the later tasks, the correlations were negative: for example, between Task 1.1 and Task 2.1 (r = -0.356, p = 0.005), and between Task 1.1 and Task 2.2 (r = -0.294, p = 0.020). These results may reflect a shift in how participants judged complexity, depending on whether they had XAI support.

Overall, the correlation results suggest that the presence of XAI influences cognitive load factors, especially effort and difficulty, across tasks. Participants reported more consistent cognitive responses in XAI-supported tasks, while performance remained task-dependent and did not consistently improve with XAI.

5.3.2.2 Statistical Differences in Cognitive Load Measures

To assess whether the presence of XAI meaningfully influenced cognitive load or decision performance, paired t-tests were conducted for effort and difficulty. Comparisons were made between tasks with and without XAI (Tasks 1.1/2.1 vs. 1.2/2.2), as well as between the two XAI-supported tasks themselves (1.2 vs. 2.2). Since the same participants completed both tasks, the paired design helped control for individual differences, focusing the analysis on the effect of the XAI intervention itself. Together, these tests provided a more complete view of whether and how XAI influenced users’ mental effort and perceived difficulty across varying task conditions. The results are summarized in Table 5.2.
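For reference, one such paired comparison could be computed as in the sketch below. The ratings are hypothetical placeholders rather than the collected responses, and scipy.stats.ttest_rel is used here only as one standard implementation of the paired-samples t-test.

from scipy.stats import ttest_rel

# Hypothetical 7-point Likert effort ratings from the same participants in a
# without-AI task (e.g., Task 2.1) and a with-AI task (e.g., Task 2.2).
effort_without_ai = [5, 6, 4, 5, 7, 4, 5, 6, 5, 4]
effort_with_ai    = [4, 5, 4, 4, 6, 3, 4, 5, 4, 4]

# Paired t-test: each participant serves as their own control.
t_stat, p_value = ttest_rel(effort_without_ai, effort_with_ai)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
significant = p_value < 0.05   # threshold used for the interpretations in Table 5.2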
Comparison | t-statistic | p-value | Interpretation
Effort_1_1 vs Effort_1_2 | 1.403 | 0.1658 | Not statistically significant
Effort_2_1 vs Effort_2_2 | 2.928 | 0.0048 | Statistically significant
Effort_1_1 vs Effort_2_1 | 0.058 | 0.9539 | Not statistically significant
Effort_1_2 vs Effort_2_2 | 1.177 | 0.2437 | Not statistically significant
Difficulty_1_1 vs Difficulty_1_2 | 0.060 | 0.9524 | Not statistically significant
Difficulty_2_1 vs Difficulty_2_2 | 2.313 | 0.0241 | Statistically significant
Difficulty_1_1 vs Difficulty_2_1 | -1.040 | 0.3023 | Not statistically significant
Difficulty_1_2 vs Difficulty_2_2 | 0.471 | 0.6394 | Not statistically significant

Table 5.2: Paired t-test results for mental effort and task difficulty across tasks.

Mental Effort: The analysis of perceived mental effort revealed some variation across conditions. A significant difference was found between Task 2.1 and Task 2.2 (t = 2.928, p = 0.0048), where the task supported by XAI appeared to require lower levels of cognitive engagement. This suggests that the presence of XAI may have affected how participants approached or processed the task.

Task Difficulty: The results for task difficulty also pointed to some variation linked to XAI. Task 2.2 was rated as significantly less difficult than Task 2.1 (t = 2.313, p = 0.0241), suggesting that the XAI intervention may have altered how challenging the task felt to participants. By contrast, no notable differences in difficulty were reported between Task 1.1 and Task 1.2 (p = 0.9524), nor between the two XAI tasks, Task 1.2 and Task 2.2 (p = 0.6394). These results imply that while XAI had some influence on how difficulty was experienced, its effect was not consistent across all task pairs.

5.3.2.3 Impact of Different XAI Types on Cognitive Load

To answer RQ1, we begin with the average values reported by participants for each XAI-supported task. Table 5.1 above shows the mean number of correct answers, self-reported mental effort, perceived task complexity, and confidence in answers across the three XAI types in Tasks 1.2 and 2.2. The descriptive statistics reveal subtle differences in how participants experienced each XAI format. For example, confidence scores were associated with lower effort ratings, while bar charts and text explanations tended to result in higher task complexity ratings depending on the task. These patterns are investigated further through correlation and significance testing below.

5.3.2.4 Correlation Test for Different XAI Types

Following the descriptive results in Table 5.1, which provided insight into the average performance and perceptions for each XAI type, we now examine the consistency of participant experience across the two XAI-supported tasks (1.2 and 2.2). Spearman correlation tests were used to evaluate whether participants showed similar patterns of correctness, effort, difficulty, and confidence across the different XAI types.

XAI Type | Metric Pair | Spearman Correlation | Interpretation
Bar Chart | Effort_1.2 vs Effort_2.2 | 0.656 | Positive
Bar Chart | Difficulty_1.2 vs Difficulty_2.2 | 0.615 | Positive
Confidence Scores | Effort_1.2 vs Effort_2.2 | 0.585 | Positive
Confidence Scores | Difficulty_1.2 vs Difficulty_2.2 | 0.694 | Positive
Text Explanations | Effort_1.2 vs Effort_2.2 | 0.493 | Positive
Text Explanations | Difficulty_1.2 vs Difficulty_2.2 | 0.499 | Positive

Table 5.3: Spearman correlation between Tasks 1.2 and 2.2 across key metrics for each XAI type.
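The per-format consistency figures in Table 5.3 follow this pattern of analysis; a minimal sketch, again with hypothetical ratings, is shown below using scipy.stats.spearmanr.

from scipy.stats import spearmanr

# Hypothetical effort ratings from the subset of participants who saw one
# XAI format (e.g., bar charts) in Task 1.2 and Task 2.2.
effort_task_1_2 = [4, 5, 3, 6, 4, 5, 2, 6]
effort_task_2_2 = [4, 6, 3, 5, 4, 4, 3, 6]

# Spearman's rank correlation: how consistently the same participants rated
# the two XAI-supported tasks.
rho, p_value = spearmanr(effort_task_1_2, effort_task_2_2)
print(f"Spearman r = {rho:.3f}, p = {p_value:.4f}")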
Bar Charts: Between Task 1.2 and Task 2.2, positive correlations were observed in both mental effort (r = 0.656) and task difficulty (r = 0.615). This suggests that participants experienced bar charts as similarly demanding in both decision-making scenarios.

Confidence Scores: Correlations between tasks were similarly positive when participants interacted with confidence scores. Effort and difficulty both yielded strong positive correlations (r = 0.585 and r = 0.694, respectively), pointing to a consistent cognitive experience across different contexts.

Text Explanations: Text-based explanations produced positive correlations in both effort (r = 0.493) and difficulty (r = 0.499), indicating that the perceived cognitive demand remained relatively stable across tasks.

5.3.2.5 Statistical Differences in Cognitive Load by XAI Type

To determine whether the same XAI type produced significantly different experiences across the two tasks, we conducted paired t-tests comparing participants’ ratings of mental effort and perceived difficulty between Task 1.2 and Task 2.2 for each form of XAI.

XAI Type | Metric | Task 1.2 Mean | Task 2.2 Mean | Mean Difference | t-statistic | p-value | Interpretation
Bar Chart | Effort | 4.53 | 4.37 | -0.16 | 0.567 | 0.5778 | Not statistically significant
Bar Chart | Difficulty | 4.68 | 4.21 | -0.47 | 1.531 | 0.1431 | Not statistically significant
Confidence Scores | Effort | 4.32 | 4.32 | 0.00 | 0.000 | 1.0000 | Not statistically significant
Confidence Scores | Difficulty | 3.95 | 4.27 | 0.32 | -1.322 | 0.2005 | Not statistically significant
Text Explanations | Effort | 4.45 | 4.00 | -0.45 | 1.308 | 0.2063 | Not statistically significant
Text Explanations | Difficulty | 4.75 | 4.60 | -0.15 | 0.420 | 0.6794 | Not statistically significant

Table 5.4: Paired t-test comparison of Task 1.2 and Task 2.2 across XAI types.

Bar Charts: No significant differences were seen in how much mental effort participants reported (p = 0.578) or in how difficult the tasks felt (p = 0.143). These stable scores suggest that bar charts offered a familiar and steady experience, even if the final outcomes did not always match that comfort.

Confidence Scores: Participants rated their mental effort and perceived difficulty at almost the same level in both tasks (all p-values above 0.20). This points to a kind of cognitive consistency: participants seemed to process and trust the confidence scores similarly across tasks, even though the actual results were not as consistent.

Text Explanations: The scores for effort and difficulty stayed relatively stable (all p-values above 0.20). This suggests that while participants felt just as engaged and assured when using text explanations, those explanations may not have always helped them make better decisions, depending on the task.

5.3.3 RQ2: How do different styles of XAI impact the quality of decision-making in requirements prioritization tasks?

5.3.3.1 Correlation Between Tasks: Evidence of XAI’s Influence on Decision Quality

To understand how different styles of XAI influence the quality of decision-making in requirements prioritization, this section focuses on two key indicators of decision quality: correctness and confidence. These metrics reflect how accurately participants prioritized the requirements and how certain they felt about their decisions while working through tasks with and without XAI support. The results indicate moderate to strong correlations between the metrics across tasks. Full details of the correlation coefficients are provided in Appendix A.1.
Correctness: The correlation between Task 1.2 and Task 2.2, both of which included XAI support, was negative (r = -0.347, p = 0.006). This suggests that participants who performed well in one XAI-supported task did not necessarily do well in the other; in some cases, doing well in one task actually aligned with performing less accurately in the next. On the other hand, a positive correlation was observed between Task 1.2 and Task 2.1 (r = 0.263, p = 0.039). This may indicate that task performance was shaped not only by the presence of XAI, but also by the nature of the tasks themselves. Other comparisons, such as between Task 1.1 and Task 1.2 or between Task 2.1 and Task 2.2, were not statistically significant. These results suggest that correctness did not consistently carry over across tasks, regardless of whether XAI was used.

Confidence: Confidence ratings showed the strongest and most consistent correlations, especially across tasks that included XAI. The connection between Task 1.2 and Task 2.2 was strong (r = 0.718, p < 0.001), followed by the link between Task 1.2 and Task 2.1 (r = 0.639, p < 0.001), and between Task 2.1 and Task 2.2 (r = 0.653, p < 0.001). These patterns suggest that XAI may have helped participants feel more certain in their choices, even as the task structure changed. By contrast, confidence in Task 1.1, which had no XAI, did not correlate significantly with the other tasks. This might mean that the presence of XAI made the experience of decision-making feel more stable and reliable overall.

5.3.3.2 Statistical Differences in Decision Quality Measures

To assess whether the presence of XAI meaningfully influenced decision quality, paired t-tests were conducted for correctness and confidence. Comparisons were made between tasks with and without XAI (Tasks 1.1/2.1 vs. 1.2/2.2), as well as between the two XAI-supported tasks themselves (1.2 vs. 2.2).

Comparison | t-statistic | p-value | Interpretation
Correct_1_1 vs Correct_1_2 | -10.624 | 0.0000 | Statistically significant
Correct_2_1 vs Correct_2_2 | -2.595 | 0.0118 | Statistically significant
Correct_1_1 vs Correct_2_1 | 1.491 | 0.1410 | Not statistically significant
Correct_1_2 vs Correct_2_2 | 10.905 | 0.0000 | Statistically significant
Confidence_1_1 vs Confidence_1_2 | 0.823 | 0.4135 | Not statistically significant
Confidence_2_1 vs Confidence_2_2 | 0.314 | 0.7547 | Not statistically significant
Confidence_1_1 vs Confidence_2_1 | 0.000 | 1.0000 | Not statistically significant
Confidence_1_2 vs Confidence_2_2 | -1.450 | 0.1523 | Not statistically significant

Table 5.5: Paired t-test results for correctness and confidence across tasks.

Correctness: When comparing correctness scores between tasks, several notable differences emerged. Task 1.2, which included support from an XAI system, showed a clear improvement over Task 1.1, which did not include any AI assistance (t = -10.624, p < 0.001). A similar result was observed in the second task pair, where Task 2.2, again supported by XAI, outperformed Task 2.1 (t = -2.595, p = 0.0118). These findings point to a consistent pattern in which tasks that incorporated XAI were associated with higher correctness scores than those without it. Interestingly, even when comparing the two tasks that both included XAI, Task 1.2 and Task 2.2, the difference in performance remained statistically significant (t = 10.905, p < 0.001).
This suggests that factors beyond the mere presence of XAI, such as the way information was presented or the specific nature of each task, may have contributed to the variation in performance. On the other hand, when comparing the two tasks that lacked XAI, namely Task 1.1 and Task 2.1, there was no significant difference in correctness (p = 0.1410), which further highlights the impact of XAI within these scenarios.

Confidence: Confidence ratings remained relatively steady across all task conditions. None of the comparisons produced statistically significant results, including those between Task 1.1 and Task 1.2 (p = 0.4135), Task 2.1 and Task 2.2 (p = 0.7547), or Task 1.2 and Task 2.2 (p = 0.1523). This suggests that participants’ level of self-assuredness in their responses was largely unaffected by whether XAI was present. Despite differences in correctness or effort, the introduction of AI support did not appear to influence how confident participants felt about their decisions.

5.3.3.3 Impact of Different XAI Types on Decision Quality

This subsection examines whether decision quality differed across the three explanation formats, first through correlation tests and then through paired t-tests.

5.3.3.4 Correlation Test for Different XAI Types

XAI Type | Metric Pair | Spearman Correlation | Interpretation
Bar Chart | Correct_1.2 vs Correct_2.2 | 0.500 | Positive
Bar Chart | Confidence_1.2 vs Confidence_2.2 | 0.820 | Positive
Confidence Scores | Correct_1.2 vs Correct_2.2 | -0.382 | Negative
Confidence Scores | Confidence_1.2 vs Confidence_2.2 | 0.758 | Positive
Text Explanations | Correct_1.2 vs Correct_2.2 | -0.239 | Negative
Text Explanations | Confidence_1.2 vs Confidence_2.2 | 0.507 | Positive

Table 5.6: Spearman correlation between Tasks 1.2 and 2.2 across key metrics for each XAI type.

Bar Chart: Confidence ratings showed a strong positive correlation (r = 0.820), indicating that bar charts consistently contributed to a sense of confidence across tasks. Correctness scores showed a positive correlation (r = 0.500), which may reflect a connection between perceived clarity and actual performance.

Confidence Scores: Confidence in answers remained positively correlated (r = 0.758), suggesting that participants trusted this form of explanation across tasks. However, correctness scores showed a negative correlation (r = -0.382), implying that high self-assurance may not always align with task accuracy.

Text Explanations: Confidence levels were positively correlated (r = 0.507), suggesting participants felt equally sure in both instances. However, correctness showed a negative correlation (r = -0.239), which may indicate variability in how effectively these explanations supported accurate decisions.

5.3.3.5 Statistical Differences in Decision Quality by XAI Type

Bar Chart: Participants performed noticeably better in Task 1.2 than in Task 2.2 when using bar charts, as shown by a significant difference in correctness scores (t = 5.678, p < 0.001). Even though the same type of explanation was used, something about the second task may have made it harder to apply the information as effectively. On the other hand, no significant difference was seen in how confident participants were in their decisions (p = 0.130).

Confidence Scores: The drop in correctness between Task 1.2 and Task 2.2 was again significant when confidence scores were used (t = 7.213, p < 0.001). Despite this, participants rated their confidence at almost exactly the same level in both tasks (all p-values above 0.20). This points to a kind of cognitive consistency: participants seemed to process and trust the confidence scores similarly across tasks, even though the actual results were not as consistent.
XAI Type | Metric | Task 1.2 Mean | Task 2.2 Mean | Mean Difference | t-statistic | p-value | Interpretation
Bar Chart | Correct | 8.47 | 4.11 | -4.37 | 5.678 | 0.0000 | Statistically significant
Bar Chart | Confidence | 4.58 | 4.95 | 0.37 | -1.587 | 0.1298 | Not statistically significant
Confidence Scores | Correct | 8.50 | 4.32 | -4.18 | 7.213 | 0.0000 | Statistically significant
Confidence Scores | Confidence | 4.91 | 4.91 | 0.00 | 0.000 | 1.0000 | Not statistically significant
Text Explanations | Correct | 8.45 | 4.25 | -4.20 | 5.581 | 0.0000 | Statistically significant
Text Explanations | Confidence | 4.40 | 4.65 | 0.25 | -0.839 | 0.4120 | Not statistically significant

Table 5.7: Paired t-test comparison of Task 1.2 and Task 2.2 across XAI types.

Text Explanations: With text-based explanations, correctness declined significantly from Task 1.2 to Task 2.2 (t = 5.581, p < 0.001). Again, however, the confidence ratings stayed relatively stable (all p-values above 0.20). This suggests that while participants felt just as engaged and assured when using text explanations, those explanations may not have always helped them make better decisions, depending on the task.

Across all three XAI types, the only consistently significant change between tasks was in correctness scores, with participants performing better in Task 1.2. Their self-reported effort, difficulty, and confidence remained statistically unchanged in most cases. This indicates that while participants perceived their cognitive load as stable, the effectiveness of the XAI types in supporting correct decisions varied with context, potentially due to differences in task structure or complexity.

5.3.4 RQ3: How do users’ preferences for different XAI formats relate to their task performance, perceived mental effort, and trust in AI-supported requirements prioritization?

5.3.4.1 Participant Preferences for XAI Types

Before analyzing how participant preferences relate to decision quality, effort, and confidence, we summarize the distribution of preferences for each XAI type in Figure 5.6. Participants were asked to indicate which explanation they found the easiest to understand, the most useful for decision-making, and which they preferred overall. As shown, bar charts were most often selected as the easiest to understand, while text explanations and confidence scores were more frequently rated as most useful and preferred overall. These patterns suggest that while participants found bar charts visually simple, they may have valued the depth or clarity offered by the other explanation formats during actual decision-making. The following subsections investigate whether these subjective preferences influenced task performance or cognitive load.

Figure 5.6: Distribution of Participant Preferences for Each XAI Type by Category

5.3.4.2 Correlation Between XAI Preferences and Decision Quality

To explore how participants’ subjective preferences and perceptions of the XAI types influenced the quality of their decision-making, a series of Spearman correlation tests was conducted. These correlations compare three preference-related survey variables (ease of understanding, perceived usefulness, and overall preference) against participants’ correctness, mental effort, perceived difficulty, and confidence in Tasks 1.2 and 2.2, where XAI was present. Most preference variables showed no correlation with performance; the one exception was a weak correlation between confidence scores and correctness in Task 2.2.
Full details are provided in Appendix A.2.

Ease of Understanding and Performance: Participants who rated a specific XAI type as easiest to understand showed notable relationships with performance in Task 2.2. Confidence scores were positively correlated with correctness in Task 2.2 (r = 0.331), suggesting that perceiving this format as easier to understand was linked with better outcomes in later tasks. In contrast, bar charts were negatively correlated with correctness in Task 2.2 (r = -0.210), possibly indicating that although some participants preferred them for clarity, they did not necessarily lead to improved decisions. Text explanations showed weak or no correlation with correctness, effort, or confidence. Overall, these results suggest that ease of understanding alone does not guarantee better performance, though confidence scores appear to have offered some benefit.

Perceived Usefulness and Cognitive Load: When asked which XAI format they found most useful for decision-making, bar charts showed a negative correlation with difficulty (r = -0.241) and with correctness in Task 2.2 (r = -0.156), suggesting that despite their appeal, they may have contributed to cognitive strain or confusion in actual performance. Confidence scores and text explanations showed no correlations across the metrics, indicating that perceived usefulness was not a strong predictor of how participants experienced the task cognitively.

Overall Preference and Task Performance: General preference showed only weak trends. A positive correlation between confidence scores and correctness in Task 2.2 (r = 0.277) again reinforced the idea that some formats may have helped performance modestly. Other correlations, including those for effort, difficulty, and confidence, were negligible across all three formats.

In sum, the data show that participants’ preferences and perceived ease of use do not strongly correlate with actual decision performance. Some patterns did emerge, however: confidence scores were both positively perceived and modestly linked with better correctness in Task 2.2, suggesting they may have supported clearer judgment. In contrast, bar charts, despite being widely seen as easy to understand, did not lead to stronger decisions and were in some cases associated with lower correctness or higher perceived difficulty.

5.3.4.3 Significance Between XAI Preferences and Decision Quality

To complement the correlation analysis, we conducted independent samples t-tests comparing participants who preferred