Predicting the Need for Test Maintenance
Using LLM Agents

Applying Test Maintenance Factors to Changes in Production
Code to Identify If and Where Test Cases Need to Be Updated

Master’s Thesis in Computer Science and Engineering

LUDVIG LEMNER
LINNEA WAHLGREN

Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2024


Predicting the Need for Test Maintenance Using LLM Agents
Applying Test Maintenance Factors to Changes in Production Code to Identify If
and Where Test Cases Need to Be Updated
LUDVIG LEMNER
LINNEA WAHLGREN

© LUDVIG LEMNER, LINNEA WAHLGREN, 2024.

Supervisor: Gregory Gay, Computer Science and Engineering
Advisors: Nasser Mohammadiha, Ericsson
Advisors: Roy Liu, Ericsson
Advisors: Joakim Wennerberg, Ericsson
Examiner: Robert Feldt, Computer Science and Engineering

Master’s Thesis 2024
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX
Gothenburg, Sweden 2024

Abstract
Test maintenance, the act of modifying and updating test cases to ensure they keep
up with the changes made in the production code, is a necessary but time-consuming
and effort-intensive activity. One way to reduce this effort is to automate parts
of the test maintenance process; however, setting up and maintaining automation
tools can be time-consuming as well. Generative AI and Large Language Models
(LLMs) offer new avenues for automation and for lessening the test maintenance
problem. One such avenue is LLM agents, sophisticated AI systems that reason,
plan, and use tools to achieve their goals.

This thesis was conducted as an exploratory case study at Ericsson and investigated
how generative AI can help ease test maintenance, specifically how LLM agents can
be used to predict the need for test maintenance. The thesis had three phases:
identifying factors that trigger test maintenance; exploring the capabilities of
generative AI and how it might be used to help with test maintenance; and, using
the results from the two previous phases, building a prototype to help predict
whether, and if so where, test maintenance is needed based on changes to the
production code. We identified 40
factors that when changed in production code cause a need for test maintenance,
and successfully demonstrated how they can be used as triggers in a setup with LLM
agents. Out of the four different setups that were evaluated, we found that using
multiple LLM agents coordinated by a planning agent, and giving these access to
both production code and natural language summaries of test cases, worked best.
We also identified, through a thorough literature review, test maintenance actions
that LLMs can perform or assist with. These demonstrate both the possibilities and
current limitations of LLMs when it comes to test maintenance, and the results
highlight that, although LLM studies within software engineering have largely
focused on code generation, the capabilities of LLMs are much broader. This study
provides examples of how LLM agents can be used more broadly.

Keywords: Software engineering (SE), test maintenance, large language model (LLM),
LLM agent.


Acknowledgements
We would like to thank our academic supervisor Gregory Gay and our industrial
supervisors at Ericsson: Nasser Mohammadiha, Roy Liu, and Joakim Wennerberg.
A big thank you for your guidance and patience, and for always taking the time to
support us. We would also like to thank all other Ericsson employees who kindly
helped with and participated in the study, without which this thesis would not have
been possible.

Ludvig Lemner & Linnea Wahlgren, Gothenburg, June 2024


Contents

List of Figures

List of Tables

1 Introduction
1.1 Problem Description
1.2 Purpose of the Study
1.3 Significance of the Study
1.4 Thesis Outline

2 Background
2.1 Software Testing
2.1.1 Test Management
2.1.2 Test Maintenance
2.2 Generative AI
2.2.1 Large Language Models
2.2.2 LLM Agents

3 Related Work
3.1 Test Management
3.2 Test Maintenance
3.2.1 Co-evolution of Production Code and Test Code
3.2.2 Co-evolution Factors
3.3 Generative Artificial Intelligence
3.3.1 LLM Capabilities in Software Engineering
3.3.2 LLMs for Code Interaction
3.3.3 LLMs for Testing
3.3.4 LLM Agents
3.4 Summary

4 Methods
4.1 Research Questions
4.2 Research Design
4.3 Literature Review to Identify Test Maintenance Factors
4.4 Interviews
4.4.1 Selection of Interviewees
4.4.2 Interview Instrument
4.4.3 Interview Analysis
4.5 Survey
4.5.1 Selection of Participants
4.5.2 Survey Instrument
4.5.3 Survey Analysis
4.6 Literature Review of Large Language Models
4.7 Matching Test Maintenance Problems and LLM Capabilities
4.8 Setup with LLM Agents Design
4.8.1 Overall Architecture
4.8.2 Notes on ReAct Framework
4.8.3 Tool Use
4.8.4 Models
4.8.5 Individual LLM and Agent Differences
4.9 Evaluation of Setup with LLM Agents

5 Results
5.1 Literature Review of Maintenance Factors
5.2 Thematic Analysis of Interviews
5.2.1 Reasons to Change Tests
5.2.2 Ways to Assure Quality
5.2.3 Issues Related to Test Maintenance
5.2.4 Wishlist for Tool Support
5.2.5 Attitudes Towards Generative AI
5.3 Analysis of Survey Responses
5.4 Literature Review of the Use of LLMs for Test Maintenance
5.4.1 Test Maintenance Actions
5.4.2 Considerations for LLMs in Corporate Environments
5.5 Evaluation of Prototypes

6 Discussion
6.1 RQ1
6.1.1 Low-level factors
6.1.2 High-level factors
6.2 RQ2
6.2.1 RQ2.1
6.2.2 RQ2.2
6.2.3 RQ2.3
6.3 RQ3
6.4 Evaluation of Usefulness of the Prototypes
6.5 Future Work
6.5.1 Possible Improvements of Prototypes
6.5.2 Future Uses for High-level Triggers and Agents
6.5.3 Future Uses for Low-level Triggers and LLMs
6.6 A Note on AI Ethics
6.6.1 Pillars for Ethical AI
6.6.2 Environmental Impact
6.6.3 AI Laws and Regulations
6.7 Threats to Validity
6.7.1 External Validity
6.7.2 Internal Validity
6.7.3 Construct Validity
6.7.4 Writing Tools

7 Conclusion

Bibliography

A Search Strings for Test Maintenance Factors Literature Review

B Search Strings for LLM Capabilities Literature Review

C Interview Consent Form

D Interview Questions

E Survey Instrument Questions
E.1 Demographic Questions
E.2 Test Maintenance Questions
E.3 Generative AI and LLM Questions

F Evaluation Results of the Four Prototypes for Each Commit

G Prompts Used For Agents
G.1 React Agent Base Prompt
G.2 Planning agent with summaries
G.2.1 Code Summariser Agent
G.2.2 Planning Agent
G.2.3 Test Localisation Agent
G.2.4 Test Maintenance Trigger LLM Instance
G.3 Planning agent without summaries
G.3.1 Code Summariser Agent
G.3.2 Planning Agent
G.3.3 Test Localisation Agent
G.3.4 Test Maintenance Trigger LLM Instance
G.4 LLM Chain With Summaries
G.4.1 Code Summariser LLM Instance
G.4.2 Code Summariser LLM Instance
G.4.3 Test Localisation Agent
G.4.4 Test Maintenance Trigger LLM Instance
G.5 LLM Chain Without Summaries
G.5.1 Code Summariser LLM Instance
G.5.2 Test Localisation Agent
G.5.3 Test Maintenance Trigger LLM Instance
G.5.4 Prompt For Summarising A Test Case

List of Figures

2.1 Example of one-shot and chain-of-thought prompting

4.1 Overview of case study
4.2 Survey distribution timeline
4.3 Demographics of survey respondents
4.4 Difference in the work role and type of testing performed for employees with less and more than five years of experience
4.5 LLM multi-agent architecture
4.6 LLM chain architecture
4.7 ReAct Cycle
4.8 ReAct Example

5.1 Groupings of test maintenance factors from literature review
5.2 Descriptive statistics of themes from thematic analysis
5.3 Survey results concerning test maintenance
5.4 Survey results concerning LLMs
5.5 Difference in opinion about test maintenance based on experience
5.6 Overview of actions an LLM can take that relate to test maintenance
5.7 Considerations for LLMs in Corporate Environments

6.1 Overview of LLM uses for high-level factors
6.2 Detailed uses of LLMs for high-level factors
6.3 Further research connected to low-level triggers


List of Tables

4.1 Demographics of Interviewees

5.1 Description of test maintenance factors that affect the general functionality of the system
5.2 Description of test maintenance factors that are changes made to a class
5.3 Description of test maintenance factors that are changes made to a method
5.4 Description of groupings of test maintenance factors that are changes made only to a single line of the production code
5.5 Overview of themes
5.6 Overview of Reasons to Change Tests’ sub-themes
5.7 Overview of Ways to Assure Quality’s sub-themes
5.8 Overview of Issues Related to Test Maintenance’s sub-themes
5.9 Overview of Wishlist for Tool Support’s sub-themes
5.10 Overview of Attitudes Towards Generative AI’s sub-themes
5.11 Results of prototype evaluation
5.12 Results of prototype evaluations comparing iteration limit

6.1 Description of high-level factors

F.1 Result of evaluation of LLM chain with summaries
F.2 Result of evaluation of LLM chain without summaries
F.3 Result of evaluation of planning agent with summaries
F.4 Result of evaluation of planning agent without summaries


1
Introduction

Software testing is a necessary but expensive activity that can account for up to
half of the total development cost of a system [1]. As the system lives on and
evolves, test maintenance is needed to ensure the relevant tests are updated and
new tests are created. This too requires substantial effort and resources. One
solution is automation, which can be an important tool in reducing these
costs and could furthermore improve the resulting quality of the production code.
Developers can spend the time saved through automation on other critical
challenges instead. However, automating test maintenance activities requires a
proper understanding of these activities, and even then automation can be trickier
than expected and effort-intensive in its own right; for example, test automation
scripts may require a lot of upkeep. Because of this,
many test maintenance activities are still performed manually [2].

The last few years have seen a rise in the use of generative AI, that is, AI capable
of generating content, often text or images, based on its input data [3]. Large
Language Models (LLMs) are a form of generative AI trained on massive datasets
that output text and/or code (depending on the language). The most famous
examples at the time of writing are perhaps GPT-3.5 and GPT-4, which
are used in ChatGPT [4]. LLMs have been applied to various aspects of software
development (e.g. test case generation [5], documentation [6], refactoring [7], and
automated program repair (APR) [8]) with varying degrees of success, often with
the purpose of helping developers by automating these activities or parts of
them [7, 9]. LLMs have further been used as general assistants to developers, helping
not only by automating activities but also by giving advice and helping developers
work through problems through continuous conversation [10].

This thesis seeks to understand how LLMs can be used to help with test maintenance.
However, there are some known shortcomings when using LLMs. Due to
data privacy concerns, organisations may not wish to use commercial LLMs, and
may instead adopt open-source ones. Due to the high power consumption of LLMs, they
also tend to use medium-sized models, which generally do not perform as well as the
larger models [11]. LLMs may also suffer from hallucinations, i.e. generating output
that is untrue while presenting it as correct, when put in unfamiliar situations not
covered by their training data, such as when asked about an organisation’s internal
data which will naturally be unknown to the LLM [12]. Another problem is that
LLMs may struggle with large contexts, or with breaking down tasks into sub-tasks
to handle them more efficiently [13]. One solution to these problems is LLM agents.
An LLM agent is an advanced AI system with an LLM at its core, further equipped
with tools or frameworks that allow it to reason, retain memory, plan, and perceive
or interact with chosen parts of its environment [14, 15]. For example, an LLM
given access to an organisation’s code repository may more accurately answer how
one change in the production code affects the remaining code base. In addition,
different agents taking on different roles can be made to work together as a
multi-agent system, similarly to how a software development team works in reality [16].
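
To make the idea of an LLM equipped with tools more concrete, the following Python sketch wires a stand-in "model" to a single repository-search tool. It is an illustration only and not the setup used in this thesis: `fake_llm` is a scripted placeholder rather than a real LLM, and `REPO`, `search_repo`, and the `tool:argument` action format are hypothetical names chosen for the example.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical in-memory "repository": file name -> source text.
REPO: Dict[str, str] = {
    "billing.py": "def total(prices):\n    return sum(prices)\n",
    "report.py": "from billing import total\n\ndef render(prices):\n    return total(prices)\n",
    "README.md": "Internal billing utilities.\n",
}


def search_repo(term: str) -> List[str]:
    """Tool: return the names of files whose text mentions `term`."""
    return sorted(name for name, text in REPO.items() if term in text)


TOOLS: Dict[str, Callable[[str], List[str]]] = {"search_repo": search_repo}


def fake_llm(question: str, observations: List[str]) -> Tuple[str, str]:
    """Scripted stand-in for an LLM: first request a tool call, then answer."""
    if not observations:
        return ("call", "search_repo:total")
    return ("answer", f"Files that mention 'total': {observations[-1]}")


def run_agent(question: str, max_steps: int = 3) -> str:
    """Alternate between asking the model and executing the tool it requests."""
    observations: List[str] = []
    for _ in range(max_steps):
        kind, payload = fake_llm(question, observations)
        if kind == "answer":
            return payload
        tool_name, argument = payload.split(":", 1)
        result = TOOLS[tool_name](argument)
        observations.append(", ".join(result))  # feed the observation back
    return "No answer within the step limit."
```

Calling `run_agent("Which files are affected if total() changes?")` first triggers the search tool and then returns an answer grounded in the tool's observation, which is the property that lets an agent reason about code it was never trained on.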

This thesis was conducted as a case study of test maintenance at Ericsson, a Swedish
telecommunications company. Based on the test maintenance problems found, we
investigated how LLM agents could help alleviate them, specifically by predicting if
and where test maintenance would be needed. Multiple steps were taken to analyse
how LLMs can assist with test maintenance. First, a literature review was conducted
to see which changes in production code lead to a need for test maintenance. This
was accompanied by a survey and interviews with Ericsson employees to better
understand the test maintenance problem, as well as how a solution might best fit with
their practices. To complement this, a literature review of the current capabilities
and uses of LLMs within software testing was conducted as well, to see what could
apply to test maintenance. Based on the combination of these results, a setup with
LLM agents was built to help predict whether test maintenance was needed because
of changes to the production code, and if so, to which test cases.

1.1 Problem Description
Testing in software development is generally an expensive process. This is true for
both the monetary cost and the effort required in the process itself [1]. This
thesis project was conducted in collaboration with Ericsson. The company has many
interests and products, but this study focuses on its test suites
and the maintenance performed on them.

There is an initial cost associated with creating the tests and test suites [17], but a
large part of the cost of testing comes from the maintenance aspect. Keeping test
suites updated through changes in the production code or dependencies over time is
challenging and expensive. It generally takes longer to modify existing methods and
classes, whether to rectify faults or change their functionality, compared to adding
new ones [18]. Additionally, maintenance includes further activities beyond main-
taining the test suites. Maintenance is also affected by the process, environment,
personnel, and the tools available.

Test maintenance encompasses a multitude of different potential activities that are
both complex and time-consuming. These activities include, but are not limited
to: error correction, reverse engineering, program comprehension, re-engineering,
impact analysis, repository construction, functional enhancements, renovation,
migration, integration, optimization, and adaptation. During test maintenance,
developers have to keep in mind the requirement specifications, the design
documents, the test cases, and the database schemas [18].

Automation is an effective tool for reducing the expense of testing, and it is
being used on an increasingly wide scale [19]. By automating large areas of
the testing process, developers can focus on critical problems and spend less
attention on other tasks.

This study aims to tackle the high expense of test maintenance by utilising generative
AI to reduce the costs involved in the maintenance process, specifically in evolving
test cases. Having LLM agents suggest to the developers at Ericsson if and where
test maintenance is needed could help make the maintenance process more
manageable and cost-effective.

LLMs and generative AI are especially adept at analysing large amounts of data
and, based on that data, giving suggestions for best practices [20]. We believe this
ability will be especially helpful in the areas of test maintenance that are
hard to automate and that at the moment require human involvement. Partially
automating such tasks could ease the burden on developers and reduce the effort
needed to complete them. This especially goes for LLM agents, which are
well equipped to understand a changing code base because of their access to
tools that let them process new and changing information about the surrounding
environment.

1.2 Purpose of the Study
The purpose of the study is to explore how LLM agents can be used to make test
maintenance activities faster by allowing developers to more easily understand if
and where changes are needed, and to apply these changes in a way that
preserves the tests’ relevance and readability. In particular, we want to investigate
how LLM agents can help developers by predicting the need for test maintenance
while ensuring the quality of the testing process is maintained.

First, we mapped out the existing research on test maintenance, particularly factors
in the source code-under-test that indicate that a test needs maintenance. This
literature review was then supplemented with corresponding data collected from
developers and employees at Ericsson, that is, examples of when tests had been updated
and explanations of why. From the analysis of these results, the study maps out which
problems in test maintenance can be, and are suitable to be, partially addressed with
LLMs. Factors concerning test maintenance have mostly been explored in general
terms [1, 21], and less in terms of which specific parts of the source code-under-test
indicate a need for maintenance. We expect this part of the study to benefit both
researchers and practitioners, since a better understanding of the problem
allows time and effort to be spent more effectively.

The second phase of the study built upon these findings and
explored how a setup with LLM agents may help with test
maintenance. It investigated how an agent with knowledge of both the production
code and test code might help with keeping the test code up-to-date when the
production code has been changed. Four different setups were investigated: two
variants where the output of one LLM or LLM agent was provided as input to the
next, creating a chain of LLMs and LLM agents, and two multi-agent setups with
a planning agent to coordinate the work. The purpose of the second phase was to
evaluate the viability of the LLM agent setup as a solution. This solution aims to
help practitioners save time and effort during test maintenance activities, and
thereby increase the overall quality of software projects.
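
The difference between a chain and a planner-coordinated setup can be sketched in a few lines of Python. The sketch below is a simplified illustration under stated assumptions, not the thesis prototypes: the three step functions are stubs standing in for LLMs or LLM agents, and their order, the `plan_next` decision rule, the state keys, and the `max_iterations` parameter are all hypothetical.

```python
from typing import Callable, Dict, List

State = Dict[str, str]
Step = Callable[[State], State]


# Placeholder steps; in a real setup each would wrap an LLM or LLM agent.
def summarise_change(state: State) -> State:
    return {**state, "summary": f"summary of: {state['diff']}"}


def check_triggers(state: State) -> State:
    triggered = "yes" if "signature" in state["summary"] else "no"
    return {**state, "triggered": triggered}


def localise_tests(state: State) -> State:
    tests = "test_login" if state.get("triggered") == "yes" else ""
    return {**state, "tests": tests}


def run_chain(steps: List[Step], state: State) -> State:
    """Chain: a fixed order in which each step's output feeds the next step."""
    for step in steps:
        state = step(state)
    return state


def plan_next(state: State) -> str:
    """Hypothetical planner: choose the next worker based on the current state."""
    if "summary" not in state:
        return "summarise"
    if "triggered" not in state:
        return "triggers"
    if state["triggered"] == "yes" and "tests" not in state:
        return "localise"
    return "done"


def run_planner(workers: Dict[str, Step], state: State, max_iterations: int = 10) -> State:
    """Planner-coordinated setup: the planner decides which worker runs next."""
    for _ in range(max_iterations):
        choice = plan_next(state)
        if choice == "done":
            break
        state = workers[choice](state)
    return state
```

In the chain every step always runs, whereas the planner can skip work: for a change that triggers nothing, `run_planner` stops before test localisation is ever invoked.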

It should be noted that not all results from the first part of the study went towards
building the test maintenance setup with LLM agents. Some instead serve
as a basis for discussion: based on the identified test maintenance problems, we
speculate about how agents could be incorporated into more parts of the
development process to help more continuously and comprehensively with test
maintenance. For example, we considered what higher-level use cases, e.g. a
change in requirements, might look like, to see how an agent might assist with and
support effective test maintenance throughout the whole production chain.

1.3 Significance of the Study
This thesis makes both scientific and practical contributions. First, it contributes
both high-level and low-level factors that act as triggers for test maintenance. The
low-level triggers, specific changes in production code, were collected and organised
from existing literature, while the high-level triggers, reasons to change production
code that also lead to a need to change test code, were gathered from interviews and a
survey. Previous studies have focused more on general factors regarding test
maintenance. They have mapped out factors that complicate maintenance or indicate a
need for maintenance, such as the size of the test suite and the understandability
of the tests. This study focuses more on which specific changes in
the production code indicate a need for maintenance, which makes it scientifically
significant, as it explores and contributes to a less explored area.
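
As a purely illustrative example of what a low-level trigger could look like, the Python sketch below flags functions whose parameter list differs between two versions of a source file. A changed method signature is used here only as an assumed example of a specific production code change; the check is a simple syntactic comparison using Python's standard `ast` module and is not how the thesis detects triggers, which relies on LLM agents. The function names are hypothetical.

```python
import ast
from typing import Dict, List


def signatures(source: str) -> Dict[str, List[str]]:
    """Map each function name in `source` to its positional parameter names."""
    tree = ast.parse(source)
    return {
        node.name: [arg.arg for arg in node.args.args]
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }


def changed_signatures(old: str, new: str) -> List[str]:
    """Names of functions present in both versions whose parameters differ."""
    before, after = signatures(old), signatures(new)
    return sorted(
        name for name in before
        if name in after and before[name] != after[name]
    )


OLD = "def pay(amount):\n    return amount\n\ndef log(msg):\n    print(msg)\n"
NEW = "def pay(amount, currency):\n    return amount\n\ndef log(msg):\n    print(msg)\n"
```

Here `changed_signatures(OLD, NEW)` reports `pay` but not `log`, so only tests that call `pay` would be candidates for maintenance under this one assumed trigger.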

Second, the study contributes to an understanding of how and where generative
AI can be used to help simplify the test maintenance problem. This is significant to
practitioners as test maintenance is a time-consuming and labour-intensive activity,
and a further understanding of the activity can help alleviate this. Most research on
LLMs within software testing focuses on test generation; however, LLMs have broad
uses, and this study helps both practitioners and researchers better understand how
LLMs can be used for predicting test maintenance, which is a less studied area. It
is also significant because LLMs may already be used informally by practitioners for
this purpose, and this study helps formalise and extend that knowledge.

Lastly, this study sets up a proof of concept of how LLM agents can use the previously
mentioned test maintenance triggers to help make test maintenance
more efficient. This matters to practitioners, as a considerable amount
of time can be saved by providing automation to help with test maintenance activities.
It is significant to research since, to our knowledge, there exists limited or no
research on how LLM agents can help with test maintenance.

1.4 Thesis Outline
This thesis is organised as follows:

Chapter 2: Background introduces relevant concepts for software testing and
generative AI and introduces necessary terminology.

Chapter 3: Related Work explores existing research on software testing, generative
AI, and LLM agents and positions the thesis’ work in relation to it.

Chapter 4: Methods lists the research questions, and describes the steps taken
to answer them. In particular, it describes the design of the literature reviews,
the interviews, the survey, and the architecture of the setup with LLM agents
as well as the steps taken to evaluate the agent.

Chapter 5: Results describes the results of the sub-tasks.

Chapter 6: Discussion provides answers to the research questions, examines how
the results fit with and relate to the research discussed in related work, pro-
poses future research paths and discusses threats to validity.

Chapter 7: Conclusion summarises the study and its results.


2
Background

The background chapter will lay the foundation for the chapters to come by ex-
plaining the necessary knowledge to the reader and introducing essential concepts.
It will first look at software testing, including test management and test mainte-
nance. Secondly, it will look at generative AI and LLMs, before finishing with LLM
agents.

2.1 Software Testing

Software testing is the act of verifying and validating a software system, i.e. ensuring
both that it fulfils its stated requirements and its high-level purpose. It is a necessary
activity to ensure the quality of the product and is generally considered one of the
most important parts of software development [22].

In general, testing can be split into functional and non-functional testing.
Non-functional testing tests how well the system performs, such as its responsiveness,
usability, and stability [23]. In contrast, functional testing tests the way the system
operates by comparing the result of input and execution conditions to an expected
output [24]. It can be split into three levels: unit testing (sometimes also called
module testing), which generally tests a single unit of code or piece of functionality,
such as a class or a method [25]; integration testing, which tests the interaction
between two or more separate software components [26]; and system testing, where
the entire software system working together is tested [25].

A test case is a specification of all the relevant information used to see that the
program fulfils an intended objective [24], and a collection of test cases in turn makes
up a test suite [27]. A test case comprises many different parts, some of the
most important of which are:

Test oracle: A mechanism for determining if the output of the test is correct [28].
Some examples include oracles derived from external information, such as re-
quirements documentation, and human knowledge on how the program should
behave.

Test steps: Describes and outlines the steps that should be taken to run the test.

Initialization: The step that sets everything up in preparation for the test case,
such as a separate environment to ensure the test can be run in isolation
without affecting or being affected by the rest of the system. It also includes
defining and declaring variables, and similar preparations.

Teardown: The step that, after the test has been run, removes all the temporary
equipment used by the test, such as data structures.

Input: The data provided at the beginning of the test, i.e. what is fed into the test
case. The input will determine which path is taken through the program, and
it is therefore important to test with different inputs to ensure many different
scenarios are tested.
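
The parts listed above can be illustrated with a minimal sketch using Python's
standard unittest module. The function count_lines and the test class are
hypothetical and exist only for illustration; they are not taken from the systems
studied in this thesis.

```python
import os
import tempfile
import unittest


def count_lines(path):
    """Hypothetical unit under test: count the lines in a text file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


class CountLinesTest(unittest.TestCase):
    def setUp(self):
        # Initialization: create an isolated temporary file for this test.
        handle = tempfile.NamedTemporaryFile(
            "w", suffix=".txt", delete=False, encoding="utf-8"
        )
        self.path = handle.name
        handle.close()

    def tearDown(self):
        # Teardown: remove the temporary file created during initialization.
        os.remove(self.path)

    def test_three_lines(self):
        # Input: the data fed into the unit under test.
        with open(self.path, "w", encoding="utf-8") as handle:
            handle.write("a\nb\nc\n")
        # Test step: execute the unit under test.
        result = count_lines(self.path)
        # Test oracle: the expected value comes from human knowledge of the
        # intended behaviour.
        self.assertEqual(result, 3)


def run_suite():
    """Collect the test case into a suite, run it, and return the result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(CountLinesTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    outcome = run_suite()
    print("tests run:", outcome.testsRun, "successful:", outcome.wasSuccessful())
```

Keeping initialization and teardown in separate methods lets every test start from
the same isolated state, which is one common way to keep a test independent of the
rest of the system.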

Software testing is also an expensive process that can take up to 50% of both time
and cost of the development of a system [25], which makes tools that can assist with
the process or automate parts of it desirable. The need for these tools is widely
recognised [29], and it is known that automation can help reduce time spent on
testing [30]. Many types of automation tools (e.g. test generation tools and test
code coverage tools) do exist [30]; however, the tools themselves sometimes require
significant time and effort to maintain [1].

2.1.1 Test Management
Test management refers to actively working to ensure the quality of the test suite
by updating and evaluating it, to in turn be able to ensure the quality of the pro-
duction code. Management includes several subareas, such as test maintenance, test
automation, and test generation. As test maintenance is the area of this thesis it
will be further explained in Section 2.1.2, while test automation and generation will
be briefly explored here.

Test management is an expensive, time-consuming activity, and the need to alleviate
the process through automation is widely accepted [29]. However, test management
activities are still generally performed manually [2]. Best practices for test manage-
ment are not widely established [1], but it is known that automation can play a vital
role in reducing the time spent on testing [30]. Test automation refers to expressing
tests as executable code, then using an automated system (e.g. a CI/CD pipeline)
to execute the tests and process the results of the test execution [30]. Even though
automation decreases time and effort, maintaining automation scripts still requires
significant effort [1]. Automation also requires significant upfront investment and
needs maintenance throughout the program’s life-cycle [31].

In addition to automating test execution, automation can also be used to generate
test cases, especially test input. Since manually creating test cases can be among the
most labour-intensive parts of software testing, automatic test case generation is one
of the more well-researched areas within software testing. Many different techniques
to generate test cases exist: model-based testing, combinatorial testing, and
search-based testing are a few of them [32]. These have recently been joined by using LLMs
to generate test cases [3]. Despite the many different approaches, difficulties still
exist in ensuring the generated tests’ maintainability and readability [33, 34]. Other
than test cases and input, there have also been efforts made to generate test oracles,
though they remain particularly difficult to generate automatically [28].

2.1.2 Test Maintenance
Test maintenance refers to the act of updating the test suite as the production code
changes and evolves [1]. Systems evolve as new features are added, requirements are
changed, and faults are discovered. As the production code is modified to accommo-
date this, the test code may need to be changed as well, to ensure that test results
are accurate for the current system behaviour. This includes adding new test cases,
repairing existing ones, and removing those that are no longer relevant.

Test maintenance is understood to be an important part of quality assurance, but
has not received as much attention as the assurance of the quality of production
code [35]. An example of this is the study of test smells, which despite their known
effect on the maintainability of the test suite have received significantly less attention
than their counterpart code smells [36]. This is despite reports of test maintenance
accounting for up to 60% of the total time spent on testing in a week [1]. In other
words, test maintenance is an important research area to further explore.

2.2 Generative AI
Generative AI is a form of AI that can understand the intent of a given instruction,
and based on this intent generate output in the form of media such as text, images,
or music [20]. The form of the instruction and output differs between different
types of generative AI. Some generative AIs respond to user prompts, while others
analyse a piece of received media. Two examples of generative AI are large language
models (LLMs), which take text in the form of a prompt as input and output text in
the form of a response, and text-to-image AIs, which take a text prompt as input and
output an image based on the prompt. A prompt is the input the user gives the model
when interacting with it, and can for example consist of questions or instructions.

Most generative AI uses a transformer-based architecture, which is based entirely
on attention mechanisms. These imitate the cognitive attention (i.e. the ability to
focus on select stimuli) seen in humans [37]. The development of the transformer
architecture was successfully combined with pre-trained systems, pre-training re-
ferring to training the model on a diverse data set to learn general patterns and
features [38]. The model can then be fine-tuned to a specific domain or task, which
is called transfer learning, as the model can transfer the general knowledge it learned
during pre-training to another domain [39]. These powerful pre-trained models are
called foundation models and can often be adapted to a wide variety of areas, such
as software development, education, and healthcare [40].

2.2.1 Large Language Models
One form of generative AI is large language models (LLMs), which include well-known
examples such as GPT-3.5 and GPT-4, both used within ChatGPT [4, 41]. LLMs are
made for natural language processing (NLP) tasks, such as text generation. They
take text as input and generate text as output by iteratively predicting the next
token or word in a sequence to form a cohesive text [42].
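
The iterative next-token loop can be sketched with a toy model. The sketch below
swaps the neural network of a real LLM for simple bigram counts, which is an
assumption made purely for illustration; only the loop of repeatedly predicting
and appending the next token mirrors the description above.

```python
from collections import Counter, defaultdict


def train_bigram_model(corpus):
    """Count, for each token, which tokens follow it in the corpus."""
    followers = defaultdict(Counter)
    tokens = corpus.split()
    for current, following in zip(tokens, tokens[1:]):
        followers[current][following] += 1
    return followers


def generate(followers, prompt, max_new_tokens=5):
    """Greedily append the most frequent next token, one token at a time."""
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        candidates = followers.get(tokens[-1])
        if not candidates:
            break  # no known continuation for the last token
        next_token, _count = candidates.most_common(1)[0]
        tokens.append(next_token)
    return " ".join(tokens)


# Toy training text (hypothetical); a real LLM is trained on far larger corpora.
model = train_bigram_model(
    "the test fails . the test fails . the build passes ."
)
print(generate(model, "the", max_new_tokens=3))
```

A real LLM conditions on the whole preceding sequence rather than only the last
token, and usually samples from a probability distribution instead of always
choosing the most frequent continuation.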

One form of text-based language is code. Because of their ability to understand and
output both natural language and code, LLMs are well-suited for software develop-
ment. Within test maintenance, the most interesting applications are perhaps the
generation of test cases and input, but LLMs have also been used for automated
program repair, code review, and as a conversational programmer’s assistant, to
name a few [43, 44, 10]. Their strength lies in their versatility: they can generate
code, reason about that code, and let the user ask questions about it. They can also
be used the other way around, helping the user reason about a problem or set up a
plan to tackle it, and then helping the user generate or review code based on that
plan. However, even with LLMs’ advanced reasoning capabilities, problems still
exist.

LLMs are known to struggle with hallucinations. A hallucination, in the context
of LLMs, is the LLM generating text that appears to be correct and fluent, but in
reality, is nonsensical and unfaithful to the data the LLM was trained on. Even
when overlooking how this affects the LLM’s performance, hallucinations also cause
trust issues for the user and can pose safety risks if the user acts on the hallucinated
output [12].

Another problem is that it is currently difficult to rigorously and extensively
evaluate LLMs. Firstly, there is a lack of benchmark datasets to use, especially
when testing more niche applications such as program repair. The datasets that do
exist may not have been designed for testing LLMs either. Secondly, there is the
problem of data leakage, where the LLMs may have seen the benchmarks during
training, which means that reports of LLM performance may be misleading [11].

Applying LLMs in real-world applications is also not without its share of problems.
Organisations may shy away from using commercial LLMs because of data privacy
concerns and would prefer to use open-source models instead. These models can
then be fine-tuned with the organisation’s internal data. However, building these
datasets for fine-tuning can take a lot of effort, both time and labour-wise. Organi-
sations may also pay attention to computational power or energy consumption, and
therefore choose a medium-sized model, with which it is harder to get state-of-the-art
performance, even with fine-tuning [11].

An LLM and a human may not understand a prompt the same way, which
has given rise to the field of prompt engineering, i.e. how to best formulate the
prompt sent to the LLM to get the user’s desired output [45]. This can be likened
to a kind of natural language programming, a way to steer the generated output of
the LLM [46]. Prompt engineering is possible because of in-context learning, which
is defined by Brown et al. as “a paradigm that allows language models to learn tasks
given only a few examples in the form of demonstration” [47]. Examples include
chain-of-thought and few-shot prompting. Chain-of-thought prompting is a way to
help the model with multi-step reasoning, something that is often challenging for
LLMs. This is done by providing the model with intermediate reasoning steps. Few-
shot prompting is done by including a few input-output examples into the model’s
input [48]. Following the same naming convention, one-shot prompts have exactly
one example, and zero-shot have none. Chain-of-thought and few-shot prompting
can be combined, as can several other techniques. These prompt engineering
techniques are illustrated in Figure 2.1.

Unmodified Prompt

Model input:
The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how
many apples do they have?

Model output:
The answer is 27.

One-Shot Prompt

Model input:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3
tennis balls. How many tennis balls does he have now?
A: The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more,
how many apples do they have?

Model output:
A: The answer is 27.

One-Shot Prompt with Chain-of-Thought

Model input:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3
tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls.
5 + 6 = 11. The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more,
how many apples do they have?

Model output:
A: The cafeteria had 23 apples originally. They used 20 to make lunch. So they had
23 - 20 = 3. They bought 6 more apples, so they have 3 + 6 = 9. The answer is 9.

Figure 2.1: Example showing how one-shot prompting and chain-of-thought
prompting work. One-shot text is highlighted in yellow, and chain-of-thought
reasoning is in pink. Note that in this example the model only arrives at the right
answer in the final example. Example text and design are partially taken from Wei
et al. [48].
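
As a minimal sketch, the three prompt styles in Figure 2.1 can be assembled
programmatically. The helper build_prompt and its parameters are hypothetical
names introduced for illustration; the Q/A layout follows the figure.

```python
def build_prompt(question, examples=(), chain_of_thought=False):
    """Assemble a prompt from optional worked examples and a final question.

    Each example is a (question, reasoning, answer) tuple. The reasoning is
    only included when chain_of_thought is True.
    """
    parts = []
    for example_question, reasoning, answer in examples:
        shown_answer = f"{reasoning} {answer}" if chain_of_thought else answer
        parts.append(f"Q: {example_question}\nA: {shown_answer}")
    # The final question is left unanswered for the model to complete.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


EXAMPLE = (
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?",
    "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis "
    "balls. 5 + 6 = 11.",
    "The answer is 11.",
)
QUESTION = (
    "The cafeteria had 23 apples. If they used 20 to make lunch and bought "
    "6 more, how many apples do they have?"
)

zero_shot = build_prompt(QUESTION)
one_shot = build_prompt(QUESTION, [EXAMPLE])
one_shot_cot = build_prompt(QUESTION, [EXAMPLE], chain_of_thought=True)
print(one_shot_cot)
```

Passing more than one tuple in examples would give a few-shot prompt in the same
layout.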

2.2.2 LLM Agents
An LLM agent is an advanced AI system with an LLM as its core which has ac-
cess to tools to help it solve problems. These tools or frameworks may allow it to
reason, retain memory, plan, and perceive or interact with chosen parts of its en-
vironment [14, 15]. For example, a hypothetical LLM agent may have access to an
organisation’s code base and a version control tool. This would allow it to reason
about the code, make changes to it, and push those changes to a remote repository.
To the best of our understanding, there exists no widely established definition of
what an LLM agent is; however, there is some consensus [14, 15]. We have drawn
on this consensus when crafting the definition above, which is the definition that
will be used throughout the thesis.
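
To make the definition concrete, the sketch below shows one possible shape of an
agent loop: the LLM either requests a tool or gives a final answer, and tool
results are fed back as observations. The function scripted_llm is a stand-in for
a real model call, and the tool list_changed_files is hypothetical; neither is
the implementation used in this thesis.

```python
def scripted_llm(history):
    """Stand-in for a real LLM call (an assumption for this sketch).

    It requests a tool until it has seen an observation, then answers.
    """
    if not any(line.startswith("observation: ") for line in history):
        return "tool: list_changed_files"
    changed_file = history[-1].partition(": ")[2]
    return "answer: review tests for " + changed_file


TOOLS = {
    # Hypothetical tool: a real agent might query a version control system.
    "list_changed_files": lambda: "src/parser.py",
}


def run_agent(task, llm=scripted_llm, max_steps=5):
    """Alternate between LLM replies and tool calls until an answer appears."""
    history = [f"task: {task}"]
    for _ in range(max_steps):
        reply = llm(history)
        history.append(reply)
        if reply.startswith("answer: "):
            return reply.partition(": ")[2], history
        tool_name = reply.partition(": ")[2]
        tool = TOOLS.get(tool_name)
        observation = tool() if tool else f"unknown tool {tool_name}"
        history.append(f"observation: {observation}")
    return None, history  # step budget exhausted without an answer


answer, trace = run_agent("Which tests may need maintenance?")
print(answer)
```

The history list plays the role of a simple memory, and the step limit keeps a
misbehaving model from looping indefinitely.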

One tool to help an agent understand a code base is a RAG pipeline, RAG standing
for Retrieval-Augmented Generation [49]. RAG allows the LLM to access infor-
mation from external sources, and use that information to better answer related
questions. For the code base example, a retrieval tool could be set up either for
locally stored files or for files on a remote repository. LLMs may struggle to use
the knowledge stored in their parameters for information-intensive tasks, and there
is also the problem of having to fine-tune a model again to be able to update its
knowledge about an organisational resource. RAG can help solve both of these
problems [49], by giving the agent easier access to updated information.
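
The retrieve-then-generate idea can be sketched as follows. For brevity the sketch
ranks documents by keyword overlap, whereas RAG systems commonly use learned
embeddings for retrieval; the document store DOCS and all function names are
hypothetical.

```python
import re


def tokenize(text):
    """Lower-case the text and split it into a set of word-like tokens."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


def retrieve(question, documents, top_k=1):
    """Rank documents by how many tokens they share with the question."""
    question_tokens = tokenize(question)
    ranked = sorted(
        documents.items(),
        key=lambda item: len(question_tokens & tokenize(item[1])),
        reverse=True,
    )
    return ranked[:top_k]


def build_augmented_prompt(question, documents, top_k=1):
    """Place the retrieved text in front of the question for the LLM."""
    context = "\n".join(
        f"[{name}] {text}" for name, text in retrieve(question, documents, top_k)
    )
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Hypothetical code base summaries standing in for an external source.
DOCS = {
    "parser.py": "def parse_config(path): reads the yaml config file and returns a dict",
    "billing.py": "def total(invoice): sums the invoice line items including tax",
}

prompt = build_augmented_prompt("Which function reads the config file?", DOCS)
print(prompt)
```

Because the documents are looked up at question time, updating the store is
enough to give the model newer information, without fine-tuning it again.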

Multiple LLM agents can be made to work together, as multi-agents, to further
help with task complexity. There are many ways to set up how the agents work
together. For example, the agents can be given different roles, opinions, and tasks,
to help simulate the way that, for example, a software development team would work
in real life [50]. Other approaches can also be used to make the agents work
together, such as combining task decomposition and task allocation [51].

3 Related Work

This chapter discusses related works relevant to the research topics. It first explores
test management, especially test maintenance, before moving on to generative AI.
There the sections examine related research on how LLMs can help with software
engineering and test maintenance, as well as how LLMs can be used as agents.

3.1 Test Management

Many tools exist to help developers automate test suites, both by automatically
executing tests [19] and by generating tests [32]. Code and test generation with
LLMs has been explored, but the use of LLMs to help with other test management
activities has not, to our knowledge, been widely explored yet; the few studies that
have tackled this topic are discussed in Section 3.3.3. There is reason to expand
upon this topic, as White et al. note that LLMs hold immense potential for
automating software engineering tasks and activities, of which test management is a
part [7]. In other words, although automated solutions are used in many cases,
there is room for these solutions to improve, and reason to further explore how
LLMs may help.

Test generation is a way to automate the creation of new test cases, which is one
of the various test management activities. Anand et al. describe test creation as
having a strong effect on the efficiency of software testing, but also note that test
creation is among the most labour-intensive of software testing activities [32] and
an intellectually demanding task. Because of this, much work has been done on
different fronts, and common methods for generating test cases include symbolic
execution, model-based, combinatorial, adaptive random, and search-based testing.
Despite the work done, the reliability of test generation has yet to be proven, with
Gay et al. finding that automatically generated test suites in some cases would
perform worse than randomly generated ones [52]. This is further supported by
Palomba et al., who show that automatically generated test cases often suffer from
poor test code quality [53]. They further establish that test cohesion and coupling
are good metrics for test code quality. Xu et al. examine, through an empirical
study, which factors influence the effectiveness and cost of test suite augmentation
techniques [54]. They find that the primary factor is the test case generation
algorithm, followed by how new and existing test cases are utilised.

3.2 Test Maintenance
Test maintenance is a topic that has been explored to a certain extent in previous
literature. Pinto et al. examine why test suites evolve [27]. Their findings show
that tests that appear to be added or deleted are often simply old tests that have
been moved or renamed and that when tests are truly deleted it is most often
because they are obsolete, not because they are hard to repair. They further find
the main reasons for adding new tests are to cover new functionality, validate bug
fixes, and validate refactored code. There are, however, further areas within the
topic of test maintenance that are yet to be explored. Imtiaz et al., for instance,
noted that there is much room for future studies on methods for repairing tests
and that few of the existing studies are done in an industrial context [55]. Even when
research has been conducted on test maintenance, it is not always implemented.
Gonzalez et al. looked at the usage of testing patterns in open-source projects and
found that only a quarter of the projects that had tests used patterns related to
maintainability [56]. Nevertheless, there is previous research on this topic. Kochar
et al. provide an overview of users’ perspectives on important aspects of software
testing by performing a survey and a series of interviews on test cases [57]. The
questions focus on characteristics of good test cases, which are divided into six
dimensions, one of which is maintainability.

Factors that affect test maintenance form another area of research within the topic
of test maintenance that is relevant to this thesis. Previous research has identified
various such factors. Alégroth et al. investigated maintenance for visual GUI testing
and found 13 factors that affect automated test suites, and also identified how long
the maintenance for tests affected by these factors was estimated to take [1], and
how much effort was required. An example is test case length, which had a high
impact and increased maintenance by more than an hour. Berglund et al. similarly
looked at test maintenance for machine learning systems and found 9 factors which
affect test maintenance for all systems, machine learning and traditional systems
included [21]. An example is oracle precision, where tests with sensitive oracles
require more updates as the test suite is updated while simultaneously being harder
to update. Sensitive, in this regard, refers to an oracle that is highly adapted to
the precision of the program’s output, where minor changes can cause the oracle to
need to be updated. It should be noted that in both these studies the factors are
related to how complicated the maintenance will be. They are not looking for specific
factors in the code or program that indicate the test needs maintenance, nor are they
looking at how changes in the production code might indicate a corresponding need
for test maintenance. Factors looking directly at changes in the production code
are presented in Section 5.1 and the relevant research investigated is summarised in
Section 3.2.2.
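
The oracle precision factor can be illustrated with a small sketch. The production
function and its two versions are hypothetical; the point is only that an oracle
tied to the exact output must be updated after a minor change, while a less
sensitive oracle is unaffected.

```python
def format_pass_ratio_v1(passed, total):
    """Hypothetical production function, original version: two decimals."""
    return f"{passed / total:.2f}"


def format_pass_ratio_v2(passed, total):
    """The same function after a minor production change: three decimals."""
    return f"{passed / total:.3f}"


def sensitive_oracle(output):
    # Tied to the exact textual output for 2 passed tests out of 3.
    return output == "0.67"


def tolerant_oracle(output, tolerance=0.01):
    # Only requires the reported ratio to be close to 2/3.
    return abs(float(output) - 2 / 3) <= tolerance


for version in (format_pass_ratio_v1, format_pass_ratio_v2):
    output = version(2, 3)
    print(version.__name__, output, sensitive_oracle(output), tolerant_oracle(output))
```

The sensitive oracle accepts only the first version, so the test would need
maintenance after the change even though the reported ratio is still correct.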

3.2.1 Co-evolution of Production Code and Test Code
Co-evolution is the effect of test code and production code being modified in parallel.
This concerns test maintenance in how test code is being modified accordingly
to changes in production code to make sure the test suite is relevant and useful.
There have been a number of tools, metrics, and methods to help detect changes
in production code that indicate test code will need to be changed, as well as to
help with co-evolution. Huang et al. propose the tool Jtup, a machine learning ap-
proach using random forest [58], which analyses changes made to production code
to see if the matching test code needs to be changed as well [59]. The decision
on whether co-evolution is needed is based on code change features, semantic fea-
tures, as well as complexity features of the code. This distinguishes it from other
approaches, which mainly look at semantic changes as well as change features. Kita
et al. propose the Tconf metric for evaluating how well a production method has
co-evolved with its corresponding tests [60]. They make use of the evaluation of
logical couplings between production and test code instead of code analysis. Ens et
al. present a visualisation tool for co-evolution and co-change between production
code and test code [61]. The tool is called ChronoTwigger and is interactive and
shows co-change over time. Gall et al. present ChangeDistiller, a tool for examining
fine-grained code changes, which makes use of how source code can be represented
as abstract syntax trees [62]. ChangeDistiller has been used by several other studies
on co-evolution [63, 64, 65]. Sohn and Papadakis present CEMENT, a tool mak-
ing probable links between production and test code that has been updated in a
short time frame, under the assumption that the fact they have been updated, or
co-evolved, means they are related [66].

Beyond the tools, metrics, and methods themselves, there is further research on the
usefulness and applicability of co-evolution. Sun et al. investigate the assumption
that if a production class and its corresponding test class are updated within the
same commit or within a short time frame they are an example of linked co-evolution
between test and production code [67]. They found that the longer the time frame
between the change in the production code and the change in the test code, the
less likely they are to be a true example of co-evolution. Updates within the same
commit contained 11.34% false positives, while pairs with more than 24 and 48
hours had 85.71% and 89.19% false positives respectively. Based on this they claim
the co-evolution samples used by Wang et al. [2] include noise. Klammer and Kern
show that visualisations can be used to understand and keep up with how a system’s
production and test code co-evolve [68]. This was tried when analysing co-evolution
in industrial projects.
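
The time-frame assumption examined above can be sketched as a simple linking rule:
a production change and a test change are treated as co-evolving if they occur
within a chosen window. The function, file names, and timestamps are hypothetical,
and the sketch is not the algorithm of any of the cited tools.

```python
from datetime import datetime, timedelta


def link_co_evolution(production_changes, test_changes, window=timedelta(hours=24)):
    """Pair each production change with test changes made within `window`.

    Both arguments are lists of (file_name, timestamp) tuples.
    """
    links = []
    for prod_file, prod_time in production_changes:
        for test_file, test_time in test_changes:
            if abs(test_time - prod_time) <= window:
                links.append((prod_file, test_file))
    return links


# Hypothetical change history.
production = [("Parser.java", datetime(2024, 1, 1, 10, 0))]
tests = [
    ("ParserTest.java", datetime(2024, 1, 1, 12, 0)),
    ("BillingTest.java", datetime(2024, 1, 4, 9, 0)),
]

print(link_co_evolution(production, tests))
print(link_co_evolution(production, tests, window=timedelta(hours=96)))
```

Widening the window links the unrelated BillingTest.java as well, which mirrors
the observation by Sun et al. that longer time frames yield more false
positives [67].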

3.2.2 Co-evolution Factors
The following section will present the factors found in existing literature for which
modifications in the production code lead to a need to modify the test code. These
were used as part of answering the first research question RQ1, which is defined in
Section 4.1 and the results of which can be found in Section 5.1. Previous studies
have looked at the co-evolution between production code and test code, and ex-
tracted patterns of what triggers this co-evolution. This has been done at different
scales, from certain types of changes (e.g. a change in the method body, or addition
of a conditional statement), to looking at specific syntax and keywords.

Shimmi and Rahimi extracted and documented higher-level patterns on co-evolution
between production code and test code [69]. The patterns were classified under
addition, deletion, and modification. An example is addition: added functionality,
where the corresponding test cases are created when the functionality is added
to the production code. Reich and Maalej similarly extracted patterns, but focused
on refactorings and co-evolution to increase the testability of the production
code [70]. They identified both high and low-level changes in the production code.
They define a low-level change in the production code as a local change in a pro-
duction file, such as changing an attribute type. A high-level change in the pro-
duction code would have wider effects beyond the local area around the change,
such as merging a package. These changes were used to find testability patterns,
such as the extract_method_for_invocation pattern. Levin and Yehudai also use
semantic changes and investigate the relationship between test code maintenance
and production code maintenance [63]. They identify both high-level relationships
(e.g. REMOVED_CLASS, where removing a production class will lead to the re-
moval of a test class) and low-level relationships (e.g. RETURN_TYPE_CHANGE
where changing a return type in the production code will lead to test maintenance).
Marsavina et al. extracted patterns for fine-grained co-evolution between produc-
tion and test code [65]. They extracted fine-grained changes in production code and
linked them with the related test code, from which they identified six co-evolution
patterns. Vidács and Pinzger later found support for five of the six patterns found
by Marsavina et al. [64].

Factors were also extracted from literature where tools to predict or help with test
maintenance had been developed, and they reported which factors the tools acted
on. DRIFT [71] is a further development of SITAR [2], which identifies outdated test
cases based on changes in the production code at the method level. Within this work,
they identified fine-grained changes in the production code that may be related to
co-evolution. Some examples of these fine-grained changes include: Try, Break, and
If. These fine-grained changes would require changes in the test code, hence their
relation to co-evolution. TestCareAssistant, originally proposed by Mirzaaghaei et
al. [72], was further developed by Mirzaaghaei et al. and can repair and generate
new test cases as the production code changes [73]. TestCareAssistant looks at
parameters and return values to identify when a test case becomes outdated, which
were identified as indicators that co-evolution was needed. As part of the work
to develop TestCareAssistant, Mirzaaghaei worked to formalise test maintenance
activities into test adaptation patterns [74]. CEPROT identifies outdated test cases
and also updates them [22]. The main factors looked at in the production code to
detect the need for co-evolution are API invocation and changes to identifiers and
modifiers.
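
Fine-grained changes such as an added If or Try statement can be detected by
comparing syntax trees of the old and new code. The sketch below uses Python's
standard ast module on Python source as a stand-in; it is an illustration of the
general idea and not the implementation of any of the cited tools.

```python
import ast
from collections import Counter

# Statement types to track, loosely following the examples named above.
TRACKED = (ast.If, ast.Try, ast.Break, ast.Return)


def count_constructs(source):
    """Count the tracked statement types in a piece of source code."""
    tree = ast.parse(source)
    return Counter(
        type(node).__name__ for node in ast.walk(tree) if isinstance(node, TRACKED)
    )


def fine_grained_changes(old_source, new_source):
    """Return how the count of each tracked construct changed."""
    old, new = count_constructs(old_source), count_constructs(new_source)
    return {
        name: new[name] - old[name]
        for name in sorted(set(old) | set(new))
        if new[name] != old[name]
    }


# Hypothetical production method before and after a change.
OLD = "def ratio(a, b):\n    return a / b\n"
NEW = "def ratio(a, b):\n    if b == 0:\n        return 0.0\n    return a / b\n"

print(fine_grained_changes(OLD, NEW))
```

Counting constructs ignores where in the method a change occurs; tree differencing
approaches track the position of each edit as well.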

3.3 Generative Artificial Intelligence
The following section will present some of the relevant research on generative AI,
its use cases and its performance. Generative AI has been extensively researched
in the last few years with many different approaches. Gozalo-Brizuela and
Garrido-Merchán investigated various generative AIs and classified them into a total of 9
categories [75]. The most relevant topic areas for this thesis lie in research where
generative AI and LLMs are used for coding and other software engineering purposes.
Thus the most relevant categories are the text-to-text models as well as the
text-to-code models. The most frequently used, and studied, LLMs at the time of
writing are those in OpenAI’s GPT series, most commonly GPT-3.5 and more recently
GPT-4. These LLMs are often interacted with through ChatGPT, a chatbot built on the
GPT series. These are text-to-text LLMs that have gained immense popularity since
ChatGPT’s original introduction. In addition, there are many different
text-to-code LLMs [20, 75].

Beyond ChatGPT and its use cases, there are further factors to consider regarding
how an LLM can perform. Mandvikar compares LLMs to each other and presents
several factors that describe how LLMs can differ [76]. These factors include, but
are not limited to, the kind of pre-training data, the size of the model, and the
API capabilities. According to Mandvikar, these factors are useful to consider when
selecting an LLM for a specific task. Beyond these factors, Döderlein
et al. investigate two LLMs and how they can be improved based on their input
parameters [77]. Their findings indicate that the temperature, which is a parameter
affecting how varied a response will be, and the initial prompt can have a significant
effect on the performance of the LLM. To get a further understanding of the
performance of LLMs, Chang et al. perform a survey focusing on the evaluation of
LLMs [78]. They focus on three aspects, namely what to evaluate, how to evaluate,
as well as where to evaluate. Their findings include limitations in LLMs and their
reasoning ability, as well as their robustness.
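
The effect of the temperature parameter can be sketched numerically. A common
formulation divides the model's raw scores (logits) by the temperature before
applying softmax; whether the LLMs studied in [77] use exactly this formulation is
an assumption, and the logits below are made up.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities, scaled by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


LOGITS = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens

sharp = softmax_with_temperature(LOGITS, temperature=0.5)
flat = softmax_with_temperature(LOGITS, temperature=2.0)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in flat])
```

A low temperature concentrates probability on the highest-scoring token, giving
less varied responses, while a high temperature spreads probability more evenly.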

3.3.1 LLM Capabilities in Software Engineering
There is some previous literature aiming to collect and present various research pa-
pers that have investigated LLMs and specifically their use for software engineering.
Zhang et al. investigate existing LLM-based software engineering (SE) studies, both
studies focusing on LLMs as well as studies focusing on SE [79]. They discuss ar-
chitectures, benchmarks, optimisation and application, as well as some challenges of
LLM research. Their findings indicate how LLMs are being trained for more code-
aware objectives compared to earlier natural language processing-derived objectives.
Further findings include a consideration for variables, and structural features, as well
as utilising cross-modal learning. This signifies advancements towards LLMs that
consider the semantics and functional aspects of code beyond processing the code as
a sequence of tokens. Hou et al. present a systematic literature review on how LLMs
are utilised for software engineering [43]. Their findings provide a comprehensive list
of different utilisation areas for LLMs including, but not limited to, code generation,
code completion, code understanding, program repair, code review, bug prediction,
vulnerability detection, and verification.

There are additional reviews and surveys of previous works for various software engi-
neering activities. Zhang et al. present a review of the history of code processing and
generating code from a natural language description, from natural language
processing models to few-shot prompting applications of LLMs [44]. Wang and Chen present
a review of previous work on how LLMs can be utilised for code generation [80].
They focus on the application of LLMs for this topic as well as the evaluation of the
generated code. They find several limitations with an LLM’s application for code
generation including, but not limited to, compatibility, maintainability, portability,
correctness, and privacy. Despite the limitations presented Wang and Chen con-
clude that code generation with LLMs has progressed and can handle increasingly
complex tasks. Their findings show how there is a lack of research on the evaluation
of LLM-generated code. Zheng et al. provide a comprehensive review of the current
stage of code LLMs through their survey [81]. Code LLMs are LLMs that have
been trained mostly on code repositories instead of natural text, though some are
trained on both. They list several code LLMs and their applications, as well as how
the code LLMs relate to each other and to general LLMs.
The performance of the code LLMs is investigated and compared to benchmarks
for multiple software engineering tasks. They summarise their findings with code
LLMs having a focus on code generation with some lesser emphasis on other tasks,
e.g. vulnerability repair or evaluation.

Other research directions include more specific work on how LLMs can be utilised
for various software engineering activities. Uusnäkki investigates the
applications of generative AI in software development [82]. As part of this
work, Uusnäkki performed an empirical study on the use of prompt engineering for
enhancing software system maintenance. One of the results is the PESD
framework, a framework for systematic prompt engineering. Fan et al.
investigate how hybridising, i.e. combining LLMs with existing software
engineering techniques such as API search techniques or search-based test
generation, can reduce hallucinations and improve performance [83]. Their
findings indicate that this is a promising topic with several successful
examples. Pei et al. investigate how LLMs can work with program invariants,
including predicting them [84]. They present a method for predicting invariants
through fine-tuning LLMs and find that LLMs are effective on this task, with
86% recall and 86% precision. The invariants considered include object, class,
function-entry, function-exit, and loop invariants. Liu et al. propose
CodeExecutor, a model focused on enhancing code execution through LLMs [85].
They utilise pre-training and curriculum learning to improve the model on code
execution tasks specifically. Liang et al. investigate the qualitative
experience of using LLMs as coding assistants [86]. Their findings show that
LLMs are mostly used for code completion and faster keystrokes. On the other
hand, most users in the study find that generated code does not meet quality
requirements and that creativity and ideas are underutilised.

As discussed in Section 3.3, ChatGPT is a popular general-purpose LLM at the
time of writing. Its applicability to software development has also been
investigated. White et al. investigated the use of ChatGPT in software
development and identified 14 prompt patterns, focused on software development,
that make the answers from ChatGPT more helpful [7]. Benefits of using ChatGPT
with these patterns include rapid experimentation at different abstraction
levels and identification of assumptions in the code of a project.

Rahmaniar discusses potential applications of ChatGPT in software development
but also brings up several challenges that may arise when attempting to
integrate ChatGPT or other generative AI into a development process [87].
Rahmaniar mentions tasks that ChatGPT would be adept at handling or assisting
with, such as documentation, onboarding, reviewing, and code writing
assistance. It is worth noting that each of these tasks, among others not
mentioned here, has its limitations and, according to Rahmaniar, will require
some human involvement for best results.

3.3.2 LLMs for Code Interaction
This section focuses on LLMs for software engineering activities that interact
directly with code. This includes various frameworks, techniques, and methods
that can view or modify the production code. Sghaier and Sahraoui present a
framework for utilising LLMs for code review [88]. They believe, and their
findings indicate, that fully automating code reviews does not lead to the best
results. Their framework therefore aims to lessen the workload of a code
reviewer and provide assistance instead of automating the whole review. Zhang
et al. present a survey of automated program repair (APR) solutions in current
literature, many of which utilise LLMs for APR [89]. They describe the typical
framework and design strategies, as well as metrics and empirical studies.
Similarly, Xia et al. investigate the use of LLMs for APR [90]. Ibrahimzada et
al. present BUGFARM, a technique utilising LLMs to generate bugs [91]. They
utilise attention analysis to attempt to find the weak spots of LLMs and then
improve their performance through training on the generated bugs.

Further implementations using LLMs include InCoder by Fried et al., a model
that can perform both program synthesis and editing [92]. They use causal
modelling to improve the performance of their model, particularly its infilling
capabilities. Dou et al. investigate the capabilities of LLMs when it comes to
code clone detection [93]. Their findings indicate that LLMs have the potential
to outperform other automatic clone detection methods, especially regarding
complex semantic clones. Geng et al. investigate how well LLMs can generate
comments and summaries of code [94]. Their findings indicate that through
few-shot learning an LLM can perform better than existing supervised learning
approaches. Chen et al. present SELF-DEBUGGING, a framework where LLMs can
iterate over their own generated code to find and rectify errors, both semantic
and syntactic [95]. Their findings suggest that LLMs can improve their
performance by going over the code they have previously generated. Ren et al.
describe several limitations of exception handling by LLMs and present KPC, a
code generation approach that helps LLMs handle exceptions better [96].
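
The idea of a model iterating over its own output can be illustrated with a
minimal sketch. This is not the SELF-DEBUGGING implementation by Chen et al.;
it only shows a generic generate, check, and retry loop. The names
`self_debug`, `fake_llm`, and `check_add` are hypothetical, and `fake_llm` is a
hard-coded stand-in for a real LLM call.

```python
from typing import Callable, Optional


def self_debug(
    generate: Callable[[str, Optional[str]], str],
    check: Callable[[str], Optional[str]],
    task: str,
    max_rounds: int = 3,
) -> tuple[str, bool]:
    """Regenerate code until the check passes or the round budget is used up."""
    feedback: Optional[str] = None
    code = ""
    for _ in range(max_rounds):
        code = generate(task, feedback)  # feedback is None in the first round
        feedback = check(code)           # None means that no error was found
        if feedback is None:
            return code, True
    return code, False


def fake_llm(task: str, feedback: Optional[str]) -> str:
    """Hypothetical stand-in for an LLM: fixes its bug once it gets feedback."""
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"  # deliberately wrong
    return "def add(a, b):\n    return a + b\n"


def check_add(code: str) -> Optional[str]:
    """Run the generated code on one example and describe any failure."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # acceptable for a toy example only
        result = namespace["add"](2, 3)
    except Exception as exc:
        return f"error: {exc!r}"
    return None if result == 5 else f"add(2, 3) returned {result}, expected 5"


final_code, succeeded = self_debug(fake_llm, check_add, "write add(a, b)")
print(succeeded)
```

In this toy run the first attempt fails the check, the failure message is fed
back, and the second attempt passes.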

3.3.3 LLMs for Testing
This section focuses on how LLMs have been used for testing and testing-related
activities. Wang et al. have investigated which testing activities have already
been performed using LLMs [97]. They reviewed and analysed the results of 50
different utilisations of LLMs in software testing. Wang et al. state that LLMs
are more useful for automation in testing than for automation in source code.
Most of the work concerns unit testing. Some system testing has been done, but
Wang et al. found no integration or acceptance testing. The majority of the
testing was functional testing, with a small amount of security testing. No
performance testing using LLMs was found. Wang et al. summarise the current
state of the research on this topic as follows: “There is currently no clear
consensus on the extent to which LLMs can solve software testing problems.”
As found by Wang et al., there have been multiple studies on test code
generation. For instance, Yuan et al. evaluated ChatGPT’s ability to generate
unit tests [3]. They found that about a quarter of the generated tests pass,
but the rest suffer from issues with compilation, correctness, and execution.
Notably, the tests that pass resemble manually written tests in quality. They
describe ChatGPT’s ability as promising if the correctness were to be improved.
Another study on test code generation is by Schäfer et al., who present
TESTPILOT, an approach for unit test generation by LLMs [5]. Their findings
indicate that LLMs provide higher coverage and a larger number of non-trivial
assertions compared to previous test generation techniques. They conclude that
LLMs can lessen the work required for unit testing but not replace the need to
write unit tests entirely, especially when it comes to more complex tests.

Further research on test generation is by Siddiq et al., who investigate the
unit test generation capabilities of three code generation LLMs [98]. They
compare strongly typed languages, e.g. Java, to weakly typed languages, e.g.
Python, to see if the generation by LLMs differs. Their findings suggest that
LLMs have more difficulties with more strongly typed languages because syntax
has an increased importance compared to semantics. They also investigate the
applicability of utilising LLMs for Test Driven Development (TDD). Their
findings indicate that LLMs can work well with TDD. Kang et al. introduce
LIBRO, a technique to use LLMs for generating tests based on bug reports [99].
The generated tests are intended to reproduce the bugs, which succeeds in 33%
of cases. Lemieux et al. present CODAMOSA, an algorithm for using LLMs to
enhance search-based software testing [100].

An important aspect for this study is the understandability of the code and the
suggestions produced by an LLM. Gay investigated the readability of tests
modified by LLMs [101]. Gay worked with GPT-4 through ChatGPT and a code
interpreter plug-in and found that test readability improved significantly in
over 90% of the investigated case examples after GPT-4 transformed the tests.
Some challenges remain, which include, but are not limited to: non-determinism,
text and prompt limits, code interpreter limitations, and transformation order.
Nevertheless, Gay concludes that LLMs seem promising in the task of improving
readability in tests.

3.3.4 LLM Agents
This section presents previous related work on LLM agents. This is a very
recent topic at the time of writing, and research on it is therefore limited.
Jiang et al. investigate the use of planning with LLMs, where the LLM first
makes a plan for its actions before proceeding with those actions [102]. Their
findings indicate that a planning phase can improve performance, despite
planning being an emergent ability of LLMs. Although this work is not
specifically about LLM agents, it supports one of the ideas that LLM agents
build on. Zhao et al. present a method of choosing between chain-of-thought and
program-aided language models [103]. Their findings indicate that choosing the
better-suited model for each problem benefits performance. This work is not
specifically about LLM agents either, but it serves as a base idea for
multi-agent frameworks, where multiple LLM agents cooperate and use their
various specialised skills to collectively produce an improved result.

For work done specifically with LLM agents, Feldt et al. work towards SocraTest,
an autonomous LLM agent that can invoke tools [15]. They present a taxonomy of
agents as well as a concrete example. Hong et al. present MetaGPT, a multi-agent
framework designed to solve various problems by simulating a software company
structure [104]. Their findings indicate that taking inspiration from humans
can improve how LLM agents work, both individually and together. Rasheed et al.
present CodePori, a code generation model based on multiple LLM agents [105].
Their findings show that LLM agents working together can outperform existing
single-LLM usage. Shen et al. investigate the limitations of small LLMs when it
comes to tool usage in LLM agents [106]. Their findings suggest that
simplifying and dividing tasks into different instances can improve the
performance of LLMs, especially smaller LLMs. Yoon et al. implement
DROIDAGENT, an LLM agent that performs Android app GUI testing automatically
[107]. Their findings indicate that LLM agents can contribute to autonomous GUI
testing through more meaningful exploration choices and greater depth of
search.

3.4 Summary
This section presents a summary of the related work and positions this
master’s thesis against the open challenges of previous literature. It
discusses the gaps in current research and the steps we take to fill them, and
finishes with a paragraph highlighting the research that has been utilised in
this thesis.

Automating test management and maintenance are topics that have been somewhat
explored in previous research. However, the majority of automation comes in the
form of test case generation and focuses less on modifying existing test cases
or identifying affected test cases when production code has been changed.
Research on the co-evolution of test and production code likewise offers little
automation beyond the generation of entirely new test cases. We address this
gap by building a prototype that identifies test cases that might need to be
modified based on a production code change.

Generative AI and LLMs are more recent topics that have seen extensive research
over the last few years. Most research has been performed with larger models
that are not available as open source and instead require access through APIs,
which means that researchers lack control over the deployment of the model. In
addition, most research includes pre-training or fine-tuning a model to better
fit a specific use case. There is also a lack of research on multi-agent
architectures that focus on topics beyond feature development. Previous
research has explored the topic of code understanding but has not further
applied this understanding to topics such as code traceability. Applying LLMs
and LLM agents to test maintenance is also an unexplored area from what we have
found. To our knowledge, we are the first to explore a multi-agent setup of
open-source LLMs without pre-training or fine-tuning, and our focus on using
LLMs to help automate test maintenance is likewise novel.

Previous research on triggers for test maintenance has been utilised to great
effect in this thesis. For details, see Sections 4.3 and 5.1. Previous research
on actions that LLMs and LLM agents can take has also been useful. For details,
see Sections 3.3.1 and 5.4. By building on previous research, this thesis aims
to fill the gaps in the research areas highlighted in this section. The aim is
to expand the areas of application of LLMs while also providing more options
for automating test maintenance activities. The inclusion of test maintenance
triggers stems from the desire to have criteria that an LLM agent can utilise
to know when to act. The inclusion of LLM actions and LLM agent actions stems
from the desire to understand which applications of LLMs and LLM agents are
suitable for this thesis’ use case.

4 Methods

The methods chapter will present the different steps taken throughout the case
study to answer the research questions. This chapter first presents the research
questions (Section 4.1) before giving an overview of the case study (Section 4.2).
The remaining sections provide more detailed explanations of each case study step.

4.1 Research Questions
This section first presents the research questions. It then motivates them,
explains their purpose, and connects them to the scope of the thesis.

RQ1 Which factors suggest that test maintenance needs to occur due to changes
in the production code?

RQ2 What applications could current LLMs or LLM agents have within the area
of test maintenance?

RQ2.1 Which factors from RQ1 can act as triggers for test maintenance in
an LLM or LLM agent?

RQ2.2 What are potentially viable test maintenance actions that an LLM or
LLM agent could take based on these triggers?

RQ2.3 Based on the present-day landscape, what are some considerations for
building an LLM agent for test maintenance within a corporate setting?

RQ3 What is the precision, recall, and F1 score of our setup with LLM agents in
predicting if and where test maintenance is necessary using the factors found
in RQ1?

Identifying the need to evolve some part of a test suite is the first step of
test maintenance. The purpose of RQ1 is therefore to identify and categorise
the issues and changes that lead to the need to perform test maintenance. This
was answered through a literature review to identify changes in the production
code that lead to a need for test maintenance (described in Section 4.3), a
thematic analysis of interviews with practitioners (described in Section 4.4),
and a survey of practitioners (described in Section 4.5).

RQ2 builds upon RQ1 and aims to explore how LLM agents might fit within the
problem space, as well as what should be taken into consideration when building
them. The decision to use LLM agents stems from the results of the literature
review of LLMs combined with the results of RQ1. RQ2.1 identifies which of the
results from RQ1 are suitable to move forward with by reasoning about the
triggers and the planned architecture of the agent. RQ2.2 is more exploratory.
It uses existing literature on software testing and LLMs to give ideas about
areas of application for a test maintenance setup with LLM agents, as well as
suggestions regarding how particular triggers might suggest particular
applications. The question is answered by presenting examples of how LLMs have
been used within software testing. RQ2.3 draws upon current literature as well
as the interviews and the survey to identify surrounding factors and
limitations within Ericsson that must be considered when deploying an agent.

RQ3 assesses the performance of an initial prototype solution. It builds upon
the results of RQ2, but does not seek to confirm all of RQ2’s findings. This RQ
only seeks to try out one of the possible use cases found in RQ1 and RQ2. As a
proof of concept, a setup with LLM agents was designed and evaluated on its
precision, recall, and F1-score. The design process is described in Section 4.8
and the evaluation process in Section 4.9.
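
The three metrics named in RQ3 can be illustrated with a minimal sketch. It
assumes that both the prediction and the ground truth are represented as sets
of test case names. The function name, the set-based representation, and the
example test names are assumptions made for illustration, not the evaluation
code used in this thesis.

```python
def precision_recall_f1(
    predicted: set[str], actual: set[str]
) -> tuple[float, float, float]:
    """Compare predicted tests needing maintenance with the tests actually changed."""
    true_pos = len(predicted & actual)   # correctly flagged tests
    false_pos = len(predicted - actual)  # flagged, but no maintenance was needed
    false_neg = len(actual - predicted)  # maintenance was needed, but not flagged

    precision = true_pos / (true_pos + false_pos) if predicted else 0.0
    recall = true_pos / (true_pos + false_neg) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: three tests flagged, four tests actually changed.
predicted = {"test_login", "test_logout", "test_parse"}
actual = {"test_login", "test_parse", "test_export", "test_import"}

p, r, f1 = precision_recall_f1(predicted, actual)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

In this example two of the three flagged tests are correct (precision 2/3), two
of the four changed tests are found (recall 1/2), and the F1-score is their
harmonic mean, 4/7.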

4.2 Research Design
This study is a case study investigating the applicability of LLMs within the test
maintenance domain at Ericsson, especially LLM agents. The case study follows
the guidelines laid out by Runeson and Höst [108]. This section will present an
overview of the case study methods, and how the different elements relate to each
other and the research questions. The overall structure of the methods of the case
study is also displayed in Figure 4.1. The case study had the following steps:

(1) Literature review of test maintenance factors: To find factors that when
changed in production code lead to test maintenance, meaning a need to make
changes to test cases, a literature review was conducted. The process is described
in detail in Section 4.3, and the results contributed to RQ1.

(2) Interviews: The interviews were conducted with Ericsson employees to get a
better understanding of test maintenance problems at Ericsson. A thematic analysis
was done to analyse the results, which contributed to RQ1 and RQ2. The process
is described in Section 4.4.

(3) Survey: Similarly to the interviews, a survey was sent out to Ericsson
employees to better understand their views of test maintenance, as well as
their opinions about generative AI. The results contributed to RQ1 and RQ2, and
the process is described in Section 4.5.

[Flowchart of the case study: activities and artefacts of steps 1 to 7 and the
research questions RQ1 to RQ3 that they contribute to.]

Figure 4.1: Overview of case study. The numbers correspond to the case study’s
different steps. 1 = literature review of test maintenance factors, 2 = interviews
about test maintenance and LLMs, 3 = survey about test maintenance and LLMs,
4 = literature review of LLM capabilities, 5 = synthesise results, 6 = build prototype
of setup with LLM agents, 7 = evaluate prototype of setup with LLM agents. Grey
= activity, yellow = artefact, green = research question.

The results of these three steps were used to find the final list of factors indicating
a need for test maintenance, which is the answer to RQ1. The decision to use three
different methods was taken to get data and method triangulation, to help increase
the precision and validity of the results.

(4) Literature review of LLM capabilities: To understand how LLMs are
currently used for test maintenance, as well as understand their suitability and
limitations within the current use case, a literature review was conducted. The
results were used to answer RQ2, and the review protocol is described in Section
4.6.

(5) Analysis of which test maintenance problems and triggers to implement
in agent: This step used the results from steps 2, 3, and 4 to decide which of
the identified test maintenance problems would fit with which triggers. This
was done ad hoc, taking the time and resource limitations into account. The
result was partly used to answer RQ2 and to provide a basis for larger use
cases in the discussion. For a further description, see Section 4.7.

(6) Design setup with LLM agents: Four proof-of-concept LLM setups were
designed and implemented to explore the viability of using LLMs to predict test
maintenance. This step contributed to the result of RQ3 and is described in Section
4.8.

(7) Evaluation of setup with LLM agents: The setup with LLM agents was
evaluated on its precision, recall, and F1-score, the results of which were used to
answer RQ3. For a further description, see Section 4.9.

4.3 Literature Review to Identify Test Maintenance Factors

A literature review of factors in the source code that indicate the need for
changes in the test suite was performed in the early stages of the thesis to
help answer RQ1. Although time constraints meant that the literature review was
not a Systematic Literature Review, it took inspiration from the strict
protocol described by Keele [109], and each step was based on the guidelines
provided there. The majority of the steps described by Keele were followed, if
less meticulously than in the original, with two exceptions: the data
collection and the dissemination. The data collection was not as rigorous due
to the short time frame. The dissemination, i.e. report writing, received less
focus because the literature review was not an isolated study but instead led
into the next step of the thesis.

The databases that were utilised to search for primary sources were: IEEE,
ScienceDirect, ACM, Scopus, and Google Scholar. For the search strings used for
each respective database, see Appendix A. The search was limited to papers
released within the last 15 years, i.e. in the period 2009-2024. This range was
chosen based on the desire for relevancy to current-day software engineering
and testing standards and practices.

Relevancy was judged first through the title, followed by the abstract and the
conclusion of the papers. The initial scan ended once it was judged that no
more relevant papers were being found. This step found 48 papers. These papers
were then examined in more detail, and any relevant factors they contained were
recorded in a separate document. This step yielded 8 papers that named test
maintenance factors in the source code.

These 8 papers were then used as a starting point for both backward and forward
snowball sampling. The relevancy of the new papers was examined with the same
method as in the first step of the literature review. This step yielded an
additional 64 papers. Of these, four papers were found to describe relevant
factors in production code that when changed would lead to a need for test
maintenance. This led to a total of 12 papers with relevant factors.
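
The snowballing step can be sketched as follows. This is an illustration under
assumptions, not a tool that was used in the thesis: it assumes a single round
of snowballing, and the function name `snowball`, the dictionary-based citation
data, and the paper identifiers are all hypothetical. In the thesis the
relevancy check was a manual judgement, which is represented here by a
predicate.

```python
from typing import Callable, Iterable, Mapping


def snowball(
    seeds: Iterable[str],
    references: Mapping[str, Iterable[str]],
    cited_by: Mapping[str, Iterable[str]],
    is_relevant: Callable[[str], bool],
) -> tuple[set[str], set[str]]:
    """Run one round of backward and forward snowballing from the seed papers."""
    seed_set = set(seeds)
    candidates: set[str] = set()
    for paper in seed_set:
        candidates |= set(references.get(paper, ()))  # backward: papers it cites
        candidates |= set(cited_by.get(paper, ()))    # forward: papers citing it
    candidates -= seed_set  # the seeds themselves have already been reviewed
    accepted = {paper for paper in candidates if is_relevant(paper)}
    return candidates, accepted


# Hypothetical toy citation data with two seed papers.
references = {"A": ["C", "D"], "B": ["D"]}
cited_by = {"A": ["E"], "B": ["A", "F"]}
relevant_papers = {"D", "F"}

candidates, accepted = snowball(
    ["A", "B"], references, cited_by, lambda paper: paper in relevant_papers
)
print(sorted(candidates), sorted(accepted))

# Tally reported in the text: 8 papers from the database search plus
# 4 papers from snowballing.
total_papers_with_factors = 8 + 4
```

Removing the seeds from the candidate set avoids reviewing the same paper
twice when seed papers cite each other.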

To sort and organise the factors, a Miro [110] board was used. See Section 5.1 for
the result of the literature review.

4.4 Interviews
Interviews were held with Ericsson employees to get a better understanding of the
current state of the test maintenance problem and where improvements can be made
as part of RQ1. The interviews also included questions on LLMs and generative AI,
both current use and opinions, as part of RQ2.

4.4.1 Selection of Interviewees

Table 4.1: Demographics of Interviewees. Experience refers to years of experience
with testing, type refers to testing they are currently performing. IDs that were
interviewed together have been grouped.

ID   Experience (Years)   Role                  Type
P1   15                   Software Developer    Unit
P2   3.5                  Data Scientist        Unit
P3   6                    Developer             Unit
P4   5                    Developer             Unit
P5   3                    Software Developer    Unit
P6   2                    Test Manager          Integration, System
P7   2                    Test Manager          Integration, System
P8   25                   Principal Developer   Overseeing Process

Convenience sampling was utilised to select the interviewees. Based on the
supervisors’ knowledge of the organisation, emails were sent out to relevant
teams and developers to explain the master’s thesis and inquire about
participation in interviews. Some interviews were held in groups, for the
convenience of both the interviewers and the interviewees. Table 4.1 presents
the demographics of the interviewees who agreed to be interviewed.

4.4.2 Interview Instrument
The first step for the interviews was to write questions relevant to the topics
of RQ1 and RQ2. Questions were left open-ended to avoid leading the
interviewees. After the questions were written, an expert review was performed
by the academic and industrial supervisors of the thesis. A pilot interview was
performed, and small changes were made based on it. One question was removed,
as it was deemed irrelevant to the research questions. Some questions received
minor clarifications. Because the changes were relatively small, the data from
the pilot interview was used in the final analysis. Before each interview
started, a consent form was presented to each interviewee and permission to
record the interview was obtained. The interviews lasted roughly 40 minutes on
average. See Appendix C for the interview consent form and Appendix D for the
interview questions.

4.4.3 Interview Analysis
All the interviews were transcribed through Microsoft Teams [111], and the
transcripts were then manually corrected while listening through the
recordings of the interviews. After transcription, a thematic analysis was
performed to identify themes and common concepts and thoughts among the
interviewees. The thematic analysis followed the steps and guidelines described
by Braun and Clarke [112]. A mainly inductive and semantic approach was used.
The results of the thematic analysis can be found in Section 5.2.

4.5 Survey
A survey was sent out to employees at Ericsson whose work was related to
software engineering, to learn how they perform test maintenance as part of RQ1
and to gauge their opinions on LLMs as part of RQ2. A protocol for the survey
was designed based on Ghazi et al. [113] and Kasunic [114]. The nature of the
survey was exploratory, and the motivation behind the survey was to find out
how the test maintenance process is managed, what kind of help practitioners
want from LLMs, and what in the source code triggers an update to a test case.

4.5.1 Selection of Participants
The desired population was Ericsson developers with testing experience, as well
as other Ericsson employees who worked with testing. Convenience sampling was
used to distribute the survey. Based on the supervisors’ knowledge of Ericsson
and the various organisations within, emails were sent out to relevant
communities and teams. For demographics of the respondents, see Section 4.5.3
and Figure 4.3.

Figure 4.2: Timeline of the distribution of the test maintenance and LLM usage
survey through February 2024.

The survey was initially planned to be distributed to a single developer
community, consisting of over 100 developers across different countries and
sections within Ericsson who work within the same area, and to be available for
two weeks. However, because of a low response rate, the survey was sent out to
several additional communities while it was available. This necessitated an
extension of the availability of the survey to make sure that respondents had
time to answer. A timeline for the survey distribution can be seen in
Figure 4.2.

4.5.2 Survey Instrument
Next, the survey instrument was designed. The survey was determined to be an
unsupervised cross-sectional survey. It was designed to take 5-10 minutes, to
increase the chance of respondents completing it. Care was taken to avoid
open-ended questions and to keep the number of questions low. The total number
of questions was 10, all of them multiple-choice. Some questions allowed the
respondent to choose multiple answers. These questions limited the number of
answers that could be chosen, to force the respondent to prioritise the most
relevant choices. The wording of all questions was evaluated using a checklist
based on understandability criteria set by Kasunic [114]. Understandability
criteria are rules for the phrasing and structure of questions that minimise
confusion and misunderstandings.

The instrument started with an information page that included the identity of
the surveyors as well as the purpose of the survey and how the respondents’
answers would be treated. The information page was followed by some initial
attribute questions regarding the demographics of the respondents.

After the attribute questions, the next section focused on test maintenance
activities. This section contained questions of the behaviour and belief types,
as described by Kasunic [114]. The final section consisted of two questions
about the respondent’s use of and attitude towards LLMs and generative AI.
These two questions were of the behaviour and belief type respectively, as
described by Kasunic. All questions in the survey can be found in Appendix E.

The survey instrument was evaluated by a Data Analytics Expert at Ericsson. It was
also pilot-tested by three Ericsson developers to ensure there were no ambiguities in
the questions, as well as to ensure it could be completed in less than ten minutes.

4.5.3 Survey Analysis

The total number of people who received the survey is unknown, because it
cannot be confirmed whether recipients of the original emails shared the survey
with additional colleagues. What can be confirmed is that no respondent
answered the survey more than once, as each respondent had to log in with their
Ericsson account and the survey was set to accept only one response per
account. At least 300 people received the survey. The total number of responses
was 29, so while the exact number of recipients is unknown, the response rate
is less than 10%.
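
The bound on the response rate follows from dividing the number of responses
by the smallest possible number of recipients. A minimal sketch of this
arithmetic is given below. The variable names are chosen for illustration, and
the figures are the ones stated above.

```python
responses = 29
min_recipients = 300  # lower bound on recipients; the true number may be higher

# More recipients can only lower the rate, so this value is an upper bound.
max_response_rate = responses / min_recipients
print(f"response rate <= {max_response_rate:.1%}")
```

With 29 responses and at least 300 recipients, the rate is at most about 9.7%,
which is consistent with the stated bound of less than 10%.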

The demographics of the respondents are presented in Figure 4.3. The most
common role among respondents was developer, the most common type of testing
performed was unit testing, and the median testing experience was 3-5 years.
The main programming language was Python. The results differ if respondents are
separated into a group with up to five years of testing experience and a group
with more than five years. While respondents with up to five years of
experience mostly work as developers with unit testing, the roles of
respondents with more than five years of experience were less uniform, and
integration testing was more common than unit testing (see Figure 4.4). One
explanation for these differences could be that experienced professionals are
more suited to testing at levels where a broader and deeper understanding of
the product and requirements is needed.

The results of the survey were analysed using descriptive statistics, to get an overview
of the respondents’ thoughts about test maintenance and to identify trends in the
answers.
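
As an example of the kind of descriptive statistics involved, the sketch below
recomputes the shares of the most commonly used programming languages from the
counts shown in Figure 4.3(d). The code and its variable names are an
illustration only and not the analysis script used for the thesis.

```python
from collections import Counter

# Counts of the most commonly used programming language, from Figure 4.3(d).
language_counts = Counter(
    {"Python": 15, "Java": 7, "C/C++": 3, "Erlang": 2, "None": 2}
)

total_responses = sum(language_counts.values())
most_common_language, _ = language_counts.most_common(1)[0]

# Share of each answer as a whole percentage of all responses.
shares = {
    language: round(100 * count / total_responses)
    for language, count in language_counts.most_common()
}

print(total_responses, most_common_language, shares)
```

The counts sum to the 29 responses, and the rounded shares match the
percentages shown in the figure.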

(a) Work role of respondents. Axes show frequency and responses. Roles shown:
Developer, Tester, Architect, DevOps, Data Scientist, Other Testing Related
Position, Product Owner, Tech Lead, Management Position.

(b) Years of experience with software testing. Axes show frequency and
responses.

(c) Most common type of testing performed. Axes show frequency and responses.

(d) Most commonly used programming language. Labels show the number of
occurrences; percentage of total occurrences. Python 15; 52%, Java 7; 24%,
C/C++ 3; 10%, Erlang 2; 7%, None 2; 7%.

Figure 4.3: Demographics of survey respondents. All questions and answer alter-
natives can be seen in Appendix E.

4.6 Literature Review of Large Language Models
A literature review of LLMs, their capabilities, and their applicability was performed
to answer RQ2. The process closely follows that of the previous literature review
(see Section 4.3). As in that review, a protocol was created with inspiration from
Keele [109].

The following databases were searched for primary sources: IEEE, Science Direct,
ACM, SCOPUS, and Google Scholar. The search strings used for each database are
listed in Appendix B. The search was limited to papers published between 2017 and
2024. This range was chosen based on the first significant developments of LLMs,
as defined after consultation with the supervisors at Ericsson.
The starting point was based on the release of papers such as Vaswani et al. [37]
and the end date is up to the point where the literature review was conducted.

(a) Work role of respondents with less than 5 years of experience. Axes show
frequency and responses. Values: Developer 10, Tester 1, Architect 2, DevOps 1,
Data Scientist 1, Other Testing Related Position 2, Product Owner 0, Tech Lead 1,
Management Position 1.

(b) Work role of respondents with more than 5 years of experience. Axes show
frequency and responses. Values: Developer 3, Tester 3, Architect 0, DevOps 0,
Data Scientist 1, Other Testing Related Position 1, Product Owner 2, Tech Lead 0,
Management Position 0.

(c) Most common type of testing performed for employees with less than 5 years
of experience. Axes show frequency and responses. Bar values, in plotted order:
14, 1, 2, 0, 0, 0, 1.

(d) Most common type of testing performed for employees with more than 5 years
of experience. Axes show frequency and responses. Bar values, in plotted order:
2, 4, 2, 1, 1, 1, 0.

Figure 4.4: Difference in the work role and type of testing performed for
employees with less and more than five years of experience. All questions and answer
alternatives can be seen in Appendix E.

The papers were first checked for relevance through their titles, abstracts, and
conclusions. Relevant papers were added to a worksheet for later perusal. For each
search string, the inspection of results continued until three consecutive pages
of irrelevant papers had been encountered. In total, 67 papers were chosen for
further examination.
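
The stopping rule for each search string can be sketched as follows. This is a minimal illustration under stated assumptions: a page counts as irrelevant when none of its papers is judged relevant, and the relevance judgement, which was made manually in this study, is represented by a hypothetical `is_relevant` predicate. The function name, parameters, and example titles are likewise invented for the illustration.

```python
from typing import Callable, Iterable, List


def screen_results(
    pages: Iterable[List[str]],
    is_relevant: Callable[[str], bool],
    max_irrelevant_pages: int = 3,
) -> List[str]:
    """Collect relevant papers page by page and stop after
    `max_irrelevant_pages` consecutive pages without any relevant paper."""
    selected: List[str] = []
    irrelevant_streak = 0
    for page in pages:
        hits = [paper for paper in page if is_relevant(paper)]
        if hits:
            selected.extend(hits)
            irrelevant_streak = 0  # a relevant page resets the streak
        else:
            irrelevant_streak += 1
            if irrelevant_streak >= max_irrelevant_pages:
                break  # stop inspecting this search string
    return selected


# Invented result pages for one search string (titles are placeholders).
example_pages = [
    ["LLM test repair", "Cooking basics"],
    ["Gardening tips"],
    ["LLM code review"],
    ["Unrelated A"],
    ["Unrelated B"],
    ["Unrelated C"],
    ["LLM late hit"],  # never inspected: the search has already stopped
]
selected_papers = screen_results(example_pages, lambda title: "LLM" in title)
```

A consequence of this rule is that relevant papers ranked after a run of three irrelevant pages are not found by the initial scan.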

After the initial scan, the papers were examined in more detail. They were checked
for information on how LLMs can interact with code and other artefacts, and on how
suitable LLMs are for different software engineering tasks, especially test
maintenance tasks.


Papers deemed to provide relevant information were separated from those that did
not. The ten most useful papers were then used as starting points for both
backward and forward snowball sampling. The second selection of
papers found through the snowball sampling was treated the same as the papers
found through the initial scan. An additional 92 papers were selected for further
examination through snowball sampling. Two papers of the second selection were
deemed more useful than the rest, for a total of twelve papers that had the most
relevant information for this literature review