Predicting the Need for Test Maintenance Using LLM Agents
Applying Test Maintenance Factors to Changes in Production Code to Identify If and Where Test Cases Need to Be Updated

Master’s Thesis in Computer Science and Engineering

LUDVIG LEMNER
LINNEA WAHLGREN

Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2024

© LUDVIG LEMNER, LINNEA WAHLGREN, 2024.

Supervisor: Gregory Gay, Computer Science and Engineering
Advisors: Nasser Mohammadiha, Roy Liu, and Joakim Wennerberg, Ericsson
Examiner: Robert Feldt, Computer Science and Engineering

Master’s Thesis 2024
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX

Abstract

Test maintenance, the act of modifying and updating test cases to ensure they keep up with the changes made in the production code, is a necessary but time-consuming and effort-intensive activity.
One way to alleviate these efforts is to automate parts of the test maintenance process; however, setting up and maintaining automation tools can be time-consuming as well. Generative AI and Large Language Models (LLMs) offer new avenues for automation and for lessening the test maintenance problem. One of these is LLM agents, sophisticated AI systems that reason, plan, and use tools to help them achieve their goals. This thesis was conducted as an exploratory case study at Ericsson and investigated how generative AI can help ease test maintenance, specifically how LLM agents can be used to predict the need for test maintenance. The thesis had three phases: identifying factors that trigger test maintenance; exploring the capabilities of generative AI and how it might be used to help with test maintenance; and, using the results from the two previous phases, building a prototype to help predict if, and if so where, test maintenance is needed based on changes to the production code. We identified 40 factors that, when changed in production code, cause a need for test maintenance, and successfully demonstrated how they can be used as triggers in a setup with LLM agents. Out of the four different setups that were evaluated, we found that using multiple LLM agents coordinated by a planning agent, and giving these access to both production code and natural language summaries of test cases, worked best. We also, through a thorough literature review, identified test maintenance actions that LLMs can take and help with. These demonstrate both the possibilities and current limitations of LLMs when it comes to test maintenance, and the results highlight that, although LLM studies within software engineering have largely focused on code generation, the capabilities of LLMs are much broader. This study provides examples of how LLM agents can be used more broadly and all-encompassingly.

Keywords: Software engineering (SE), test maintenance, large language model (LLM), LLM agent.
Acknowledgements

We would like to thank our academic supervisor Gregory Gay and our industrial supervisors at Ericsson: Nasser Mohammadiha, Roy Liu, and Joakim Wennerberg. A big thank you for your guidance and patience, and for always taking the time to support us. We would also like to thank all other Ericsson employees who kindly helped with and participated in the study, without which this thesis would not have been possible.

Ludvig Lemner & Linnea Wahlgren, Gothenburg, June 2024

Contents

List of Figures
List of Tables
1 Introduction
  1.1 Problem Description
  1.2 Purpose of the Study
  1.3 Significance of the Study
  1.4 Thesis Outline
2 Background
  2.1 Software Testing
    2.1.1 Test Management
    2.1.2 Test Maintenance
  2.2 Generative AI
    2.2.1 Large Language Models
    2.2.2 LLM Agents
3 Related Work
  3.1 Test Management
  3.2 Test Maintenance
    3.2.1 Co-evolution of Production Code and Test Code
    3.2.2 Co-evolution Factors
  3.3 Generative Artificial Intelligence
    3.3.1 LLM Capabilities in Software Engineering
    3.3.2 LLMs for Code Interaction
    3.3.3 LLMs for Testing
    3.3.4 LLM Agents
  3.4 Summary
4 Methods
  4.1 Research Questions
  4.2 Research Design
  4.3 Literature Review to Identify Test Maintenance Factors
  4.4 Interviews
    4.4.1 Selection of Interviewees
    4.4.2 Interview Instrument
    4.4.3 Interview Analysis
  4.5 Survey
    4.5.1 Selection of Participants
    4.5.2 Survey Instrument
    4.5.3 Survey Analysis
  4.6 Literature Review of Large Language Models
  4.7 Matching Test Maintenance Problems and LLM Capabilities
  4.8 Setup with LLM Agents Design
    4.8.1 Overall Architecture
    4.8.2 Notes on ReAct Framework
    4.8.3 Tool Use
    4.8.4 Models
    4.8.5 Individual LLM and Agent Differences
  4.9 Evaluation of Setup with LLM Agents
5 Results
  5.1 Literature Review of Maintenance Factors
  5.2 Thematic Analysis of Interviews
    5.2.1 Reasons to Change Tests
    5.2.2 Ways to Assure Quality
    5.2.3 Issues Related to Test Maintenance
    5.2.4 Wishlist for Tool Support
    5.2.5 Attitudes Towards Generative AI
  5.3 Analysis of Survey Responses
  5.4 Literature Review of the Use of LLMs for Test Maintenance
    5.4.1 Test Maintenance Actions
    5.4.2 Considerations for LLMs in Corporate Environments
  5.5 Evaluation of Prototypes
6 Discussion
  6.1 RQ1
    6.1.1 Low-level factors
    6.1.2 High-level factors
  6.2 RQ2
    6.2.1 RQ2.1
    6.2.2 RQ2.2
    6.2.3 RQ2.3
  6.3 RQ3
  6.4 Evaluation of Usefulness of the Prototypes
  6.5 Future Work
    6.5.1 Possible Improvements of Prototypes
    6.5.2 Future Uses for High-level Triggers and Agents
    6.5.3 Future Uses for Low-level Triggers and LLMs
  6.6 A Note on AI Ethics
    6.6.1 Pillars for Ethical AI
    6.6.2 Environmental Impact
    6.6.3 AI Laws and Regulations
  6.7 Threats to Validity
    6.7.1 External Validity
    6.7.2 Internal Validity
    6.7.3 Construct Validity
    6.7.4 Writing Tools
7 Conclusion
Bibliography
A Search Strings for Test Maintenance Factors Literature Review
B Search Strings for LLM Capabilities Literature Review
C Interview Consent Form
D Interview Questions
E Survey Instrument Questions
  E.1 Demographic Questions
  E.2 Test Maintenance Questions
  E.3 Generative AI and LLM Questions
F Evaluation Results of the Four Prototypes for Each Commit
G Prompts Used For Agents
  G.1 React Agent Base Prompt
  G.2 Planning agent with summaries
    G.2.1 Code Summariser Agent
    G.2.2 Planning Agent
    G.2.3 Test Localisation Agent
    G.2.4 Test Maintenance Trigger LLM Instance
  G.3 Planning agent without summaries
    G.3.1 Code Summariser Agent
    G.3.2 Planning Agent
    G.3.3 Test Localisation Agent
    G.3.4 Test Maintenance Trigger LLM Instance
  G.4 LLM Chain With Summaries
    G.4.1 Code Summariser LLM Instance
    G.4.2 Code Summariser LLM Instance
    G.4.3 Test Localisation Agent
    G.4.4 Test Maintenance Trigger LLM Instance
  G.5 LLM Chain Without Summaries
    G.5.1 Code Summariser LLM Instance
    G.5.2 Test Localisation Agent
    G.5.3 Test Maintenance Trigger LLM Instance
    G.5.4 Prompt For Summarising A Test Case

List of Figures

2.1 Example of one-shot and chain-of-thought prompting
4.1 Overview of case study
4.2 Survey distribution timeline
4.3 Demographics of survey respondents
4.4 Difference in the work role and type of testing performed for employees with less and more than five years of experience
4.5 LLM multi-agent architecture
4.6 LLM chain architecture
4.7 ReAct Cycle
4.8 ReAct Example
5.1 Groupings of test maintenance factors from literature review
5.2 Descriptive statistics of themes from thematic analysis
5.3 Survey results concerning test maintenance
5.4 Survey results concerning LLMs
5.5 Difference in opinion about test maintenance based on experience
5.6 Overview of actions an LLM can take that relate to test maintenance.
5.7 Considerations for LLMs in Corporate Environments.
6.1 Overview of LLM uses for high-level factors
6.2 Detailed uses of LLMs for high-level factors
6.3 Further research connected to low-level triggers

List of Tables

4.1 Demographics of Interviewees
5.1 Description of test maintenance factors that affect the general functionality of the system
5.2 Description of test maintenance factors that are changes made to a class
5.3 Description of test maintenance factors that are changes made to a method
5.4 Description of groupings of test maintenance factors that are changes made only to a single line of the production code
5.5 Overview of themes
5.6 Overview of Reasons to Change Tests’ sub-themes
5.7 Overview of Ways to Assure Quality’s sub-themes
5.8 Overview of Issues Related to Test Maintenance’s sub-themes
5.9 Overview of Wishlist for Tool Support’s sub-themes
5.10 Overview of Attitudes Towards Generative AI’s sub-themes.
5.11 Results of prototype evaluation
5.12 Results of prototype evaluations comparing iteration limit
6.1 Description of high-level factors
F.1 Result of evaluation of LLM chain with summaries
F.2 Result of evaluation of LLM chain without summaries
F.3 Result of evaluation of planning agent with summaries
F.4 Result of evaluation of planning agent without summaries

1 Introduction

Software testing is a necessary but expensive activity that can account for up to half of the total development cost of a system [1]. As the system lives on and evolves, test maintenance is needed to ensure the relevant tests are updated and new tests are created. This too requires substantial effort and resources. One solution is automation, which can be an important tool for reducing these costs and could furthermore improve the resulting quality of the production code. The time developers save through automation can instead be spent on other critical challenges, thereby reducing costs. However, automating test maintenance activities requires a proper understanding of these activities, and even then automation can be trickier than expected and an effort-intensive process in its own right; for example, test automation scripts may require a lot of upkeep. Because of this, many test maintenance activities are still performed manually [2].

These last few years have seen a rise in the use of generative AI, that is, AI capable of generating content, often text or images, based on its input data [3]. Large Language Models (LLMs) are a form of generative AI that has been trained on massive datasets and outputs text and/or code (depending on the language), the most famous examples at the time of writing perhaps being GPT-3.5 and GPT-4, which are used in ChatGPT [4].
LLMs have been applied to various aspects of software development (e.g. test case generation [5], documentation [6], refactoring [7], and automated program repair (APR) [8]) with varying degrees of success, often with the purpose of helping developers by automating these activities or parts of them [7, 9]. LLMs have further been used as general assistants to developers, helping not only by automating activities but also by giving advice and helping developers work through problems [10] through continuous conversation.

This thesis seeks to understand how LLMs can be used to help with test maintenance; however, there are some known shortcomings when using LLMs. Due to data privacy concerns, organisations may not wish to use commercial LLMs, and may instead adopt open-source ones. Due to the high power consumption of large models, they also tend to use medium-sized models, which generally do not perform as well as the larger models [11]. LLMs may also suffer from hallucinations, i.e. generating output that is untrue while presenting it as correct, when put in unfamiliar situations not covered by their training data, such as when asked about an organisation’s internal data, which will naturally be unknown to the LLM [12]. Another problem is that LLMs may struggle with large contexts, or with breaking down tasks into sub-tasks to handle them more efficiently [13].

One solution to these problems is LLM agents. An LLM agent is an advanced AI system with an LLM at its core, further equipped with tools or frameworks that allow it to reason, retain memory, plan, and perceive or interact with chosen parts of its environment [14, 15]. For example, an LLM that is given access to an organisation’s code repository may more accurately answer how one change in the production code affects the remaining code base.
In addition, different agents taking on different roles can be made to work together as multi-agents, similarly to how a software development team works in reality [16].

This thesis was conducted as a case study of test maintenance at Ericsson, a Swedish telecommunications company. Based on the test maintenance problems found, it was investigated how LLM agents could help alleviate them, specifically by predicting if and where test maintenance would be needed. Multiple steps were taken to analyse how LLMs can assist with test maintenance. First, a literature review was conducted to see which changes in production code lead to a need for test maintenance. This was accompanied by a survey and interviews with Ericsson employees to better understand the test maintenance problem, as well as how a solution might best fit with their practices. To complement this, a literature review of the current capabilities and uses of LLMs within software testing was also conducted, to see what could apply to test maintenance. Based on the combination of these results, a setup with LLM agents was built to help predict whether test maintenance was needed because of changes to the production code, and if so, for which test cases.

1.1 Problem Description

Testing in software development is generally an expensive process. This is true for both the monetary cost and the effort required in the process itself [1]. This thesis project was conducted in collaboration with Ericsson. They have many interests and products, but this study focuses on their test suites and the maintenance performed on those test suites. There is an initial cost associated with creating the tests and test suites [17], but a large part of the cost of testing comes from maintenance. Keeping test suites updated through changes in the production code or dependencies over time is challenging and expensive.
It generally takes longer to modify existing methods and classes, whether to rectify faults or change their functionality, than to add new ones [18]. Additionally, maintenance includes further activities beyond maintaining the test suites. Maintenance is also affected by the process, environment, personnel, and the tools available.

Test maintenance encompasses a multitude of different potential activities that are both complex and time-consuming. These activities include, but are not limited to: error correction, reverse engineering, program comprehension, re-engineering, impact analysis, repository construction, functional enhancements, renovation, migration, integration, optimization, and adaptation. During test maintenance, the developers have to keep in mind the requirement specifications, the design documents, the test cases and the database schemas [18].

To reduce the expenses that testing can require, automation is an excellent tool that is being used more and more widely [19]. By automating large areas of the testing process, developers can devote their attention to critical problems and spend less of it on other tasks. This study aims to tackle the high expense of test maintenance by utilising generative AI to reduce the costs involved in the maintenance process, specifically in evolving test cases. Having LLM agents suggest to the developers at Ericsson if and where test maintenance is needed could help make the maintenance process more manageable and cost-effective. LLMs and generative AIs are especially adept at analysing large amounts of data and, based on that data, giving suggestions for best practices [20]. We expect this ability to be especially helpful in the areas of test maintenance that are hard to automate and that currently require human involvement. Partially automating tasks could ease the burden on the developers and reduce the effort needed to complete their tasks.
This especially goes for LLM agents, which are uniquely equipped to understand a changing code base because of their access to tools that let them process new and changing information about the surrounding environment.

1.2 Purpose of the Study

The purpose of the study is to explore how LLM agents can be used to make test maintenance activities faster, by allowing developers to more easily understand if and where changes are needed and to apply these changes in a way that ensures the tests’ relevance and readability. In particular, we want to investigate how LLM agents can help developers by predicting the need for test maintenance while ensuring the quality of the testing process is maintained.

First, we mapped out the existing research on test maintenance, particularly factors in the source code under test that indicate that a test needs maintenance. This literature review was then supplemented with corresponding data collected from developers and employees at Ericsson, that is, examples of when tests have been updated and explanations of why. From the analysis of these results, the study maps out which problems in test maintenance can, and are suitable to, be partially addressed with LLMs. Factors concerning test maintenance have mostly been explored in general terms [1, 21], and less in terms of which specific parts of the source code under test indicate a need for maintenance. We expect this part of the study to benefit both researchers and practitioners, since a better understanding of the problem allows time and effort to be spent more effectively.

The second phase of the study built upon what had previously been found, and explored how a setup with LLM agents may help with test maintenance. It investigated how an agent with knowledge of both the production code and test code might help with keeping the test code up-to-date when the production code has been changed.
Four different setups were investigated: two variants where the output of one LLM or LLM agent was provided as input to the next, creating a chain of LLMs and LLM agents, and two multi-agent setups with a planning agent to coordinate the work. The purpose of the second phase was to evaluate the viability of the setup with LLM agents as a solution. This solution aims to help practitioners save time and effort during test maintenance activities, and thereby increase the overall quality of software projects.

It should be noted that not all results from the first part of the study went towards building the test maintenance setup with LLM agents. Some are instead used as a basis for discussion: based on the identified test maintenance problems, we speculate about how agents could be incorporated into more parts of the development process to help more continuously and all-encompassingly with test maintenance. For example, it was considered what higher-level use cases, e.g. a change in requirements, might look like, to see how an agent might assist with and support effective test maintenance throughout the whole production chain.

1.3 Significance of the Study

This thesis makes both scientific and practical contributions. Firstly, it contributes both high-level and low-level factors that act as triggers for test maintenance. The low-level triggers, specific changes in production code, were collected and organised from existing literature, while the high-level triggers, reasons to change production code that also lead to a need to change test code, were gathered from interviews and a survey. Previous studies have focused more on general factors regarding test maintenance. They have mapped out factors that complicate maintenance or indicate a need for maintenance, such as the size of the test suite and the understandability of the test.
This study has a larger focus on which specific changes in the production code indicate a need for maintenance, which makes it scientifically significant, as it explores and contributes to a less explored area.

Secondly, the study contributes to an understanding of how and where generative AI can be used to help simplify the test maintenance problem. This is significant to practitioners, as test maintenance is a time-consuming and labour-intensive activity, and a further understanding of the activity can help alleviate this. Most research on LLMs within software testing focuses on test generation; however, LLMs have broad uses, and this study helps both practitioners and researchers better understand how LLMs can be used for predicting test maintenance, which is a less studied area. It is also significant because LLMs may already be used informally by practitioners for this purpose, and this study helps formalise and further that knowledge.

Lastly, this study sets up a proof of concept of how LLM agents can use the previously mentioned test maintenance triggers to help make test maintenance more efficient. This is significant to practitioners, as a significant amount of time can be saved by providing automation to help with test maintenance activities. It is significant to research since, to our knowledge, there exists limited or no research on how LLM agents can help with test maintenance.

1.4 Thesis Outline

This thesis is organised as follows:

Chapter 2: Background introduces relevant concepts and necessary terminology for software testing and generative AI.
Chapter 3: Related Work explores existing research on software testing, generative AI, and LLM agents and positions the thesis’ work in relation to it.
Chapter 4: Methods lists the research questions, and describes the steps taken to answer them.
In particular, it describes the design of the literature reviews, the interviews, the survey, and the architecture of the setup with LLM agents, as well as the steps taken to evaluate the agent.
Chapter 5: Results describes the results of the sub-tasks.
Chapter 6: Discussion provides answers to the research questions, examines how the results fit with and relate to the research discussed in related work, proposes future research paths and discusses threats to validity.
Chapter 7: Conclusion summarises the study and its results.

2 Background

The background chapter lays the foundation for the chapters to come by introducing the essential concepts and knowledge needed by the reader. It first looks at software testing, including test management and test maintenance. Secondly, it looks at generative AI and LLMs, before finishing with LLM agents.

2.1 Software Testing

Software testing is the act of verifying and validating a software system, i.e. ensuring both that it fulfils its stated requirements and its high-level purpose. It is a necessary activity to ensure the quality of the product and is generally considered one of the most important parts of software development [22].

In general, testing can be split into functional and non-functional testing. Non-functional testing tests how well the system performs, such as its responsiveness, usability, and stability [23]. In contrast, functional testing tests the way the system operates by comparing the result of given input and execution conditions to an expected output [24]. It can be split into three levels: unit testing (sometimes also called module testing), which generally tests a single unit of code or piece of functionality, such as a class or a method [25]; integration testing, which tests the interaction between two or more separate software components [26]; and system testing, where the entire software system working together is tested [25].
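To make the difference between the unit and integration levels concrete, the following is a minimal sketch in Python. The functions parse_price and total_price are hypothetical and were invented purely for illustration; they are not taken from the thesis or from any Ericsson code. Both checks are functional tests in the sense described above, since each compares the result of a given input to an expected output.

```python
def parse_price(text: str) -> float:
    """Hypothetical unit under test: read the number from '12.50 SEK'."""
    return float(text.split()[0])


def total_price(lines: list[str]) -> float:
    """Hypothetical second component that depends on parse_price."""
    return round(sum(parse_price(line) for line in lines), 2)


def test_parse_price_unit() -> None:
    # Unit level: a single function is exercised in isolation.
    assert parse_price("12.50 SEK") == 12.5


def test_total_price_integration() -> None:
    # Integration level: the check only passes if total_price and
    # parse_price work correctly together.
    assert total_price(["12.50 SEK", "7.25 SEK"]) == 19.75


if __name__ == "__main__":
    test_parse_price_unit()
    test_total_price_integration()
    print("all checks passed")
```

A system-level test would instead exercise the complete, deployed application through its external interface, which is not practical to show in a short sketch.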
A test case is a specification of all the relevant information used to see that the program fulfils an intended objective [24], and a collection of test cases in turn makes up a test suite [27]. A test case comprises many different parts, some of the most important of which are:

Test oracle: A mechanism for determining if the output of the test is correct [28]. Some examples include oracles derived from external information, such as requirements documentation, and human knowledge of how the program should behave.
Test steps: Describe and outline the steps that should be taken to run the test.
Initialization: The step that sets everything up in preparation for the test case, such as a separate environment to ensure the test can be run in isolation without affecting or being affected by the rest of the system. It also includes defining and declaring variables, and similar preparations.
Teardown: The step that, after the test has been run, removes all the temporary resources used by the test, such as data structures.
Input: The data provided at the beginning of the test, i.e. what is fed into the test case. The input determines which path is taken through the program, and it is therefore important to test with different inputs to ensure many different scenarios are tested.

Software testing is also an expensive process that can take up to 50% of both the time and cost of developing a system [25], which makes tools that can assist with the process or automate parts of it desirable. The need for these tools is widely recognised [29], and it is known that automation can help reduce time spent on testing [30]. Many types of automation tools (e.g. test generation tools and test code coverage tools) exist [30]; however, the tools themselves sometimes require significant time and effort to maintain [1].
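As an illustration of the test case parts listed in this section, the sketch below marks where initialization, input, test steps, oracle, and teardown typically appear in a test written with Python's standard unittest module. The function count_lines and the file name sample.txt are hypothetical and chosen only for this example; the thesis itself does not prescribe a test framework.

```python
import shutil
import tempfile
import unittest
from pathlib import Path


def count_lines(path: Path) -> int:
    """Hypothetical unit under test: number of lines in a text file."""
    return len(path.read_text().splitlines())


class CountLinesTest(unittest.TestCase):
    def setUp(self) -> None:
        # Initialization: an isolated temporary directory, so the test
        # neither affects nor is affected by the rest of the system.
        self.workdir = Path(tempfile.mkdtemp())

    def tearDown(self) -> None:
        # Teardown: remove the temporary resources once the test has run.
        shutil.rmtree(self.workdir)

    def test_counts_three_lines(self) -> None:
        # Input: the data fed into the test case.
        sample = self.workdir / "sample.txt"
        sample.write_text("a\nb\nc\n")

        # Test step: execute the code under test.
        result = count_lines(sample)

        # Test oracle: the expected value decides whether the output
        # is correct.
        self.assertEqual(result, 3)
```

Saved as a module, the test can be run with `python -m unittest`. Here the oracle is a hard-coded expected value; in practice it is often derived from requirements documentation or from human knowledge of the intended behaviour.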

2.1.1 Test Management

Test management refers to actively working to ensure the quality of the test suite by updating and evaluating it, to in turn be able to ensure the quality of the production code. Management includes several subareas, such as test maintenance, test automation, and test generation. As test maintenance is the focus of this thesis, it is explained further in Section 2.1.2, while test automation and generation are briefly explored here.

Test management is an expensive, time-consuming activity, and the need to alleviate the process through automation is widely accepted [29]. However, test management activities are still generally performed manually [2]. Best practices for test management are not widely established [1], but it is known that automation can play a vital role in reducing the time spent on testing [30]. Test automation refers to expressing tests as executable code, then using an automated system (e.g. a CI/CD pipeline) to execute the tests and process the results of the test execution [30]. Even though automation decreases time and effort, maintaining automation scripts still requires significant effort [1]. Automation also requires significant upfront investment and needs maintenance throughout the program's life-cycle [31].

In addition to automating test execution, automation can also be used to generate test cases, especially test input. Since manually creating test cases can be among the most labour-intensive parts of software testing, automatic test case generation is one of the more well-researched areas within software testing. Many different techniques to generate test cases exist: model-based testing, combinatorial testing, and search-based testing are a few of them [32]. These have recently been joined by the use of LLMs to generate test cases [3]. Despite the many different approaches, difficulties still exist in ensuring the generated tests' maintainability and readability [33, 34].
Other than test cases and input, there have also been efforts made to generate test oracles, though they remain particularly difficult to generate automatically [28].

2.1.2 Test Maintenance

Test maintenance refers to the act of updating the test suite as the production code changes and evolves [1]. Systems evolve as new features are added, requirements are changed, and faults are discovered. As the production code is modified to accommodate this, the test code may need to be changed as well, to ensure that test results are accurate for the current system behaviour. This includes adding new test cases, repairing existing ones, and removing those that are no longer relevant.

Test maintenance is understood to be an important part of quality assurance, but has not received as much attention as the assurance of the quality of production code [35]. An example of this is the study of test smells, which despite their known effect on the maintainability of the test suite have received significantly less attention than their counterpart, code smells [36]. This is despite reports of test maintenance accounting for up to 60% of the total time spent on testing in a week [1]. In other words, test maintenance is an important research area to explore further.

2.2 Generative AI

Generative AI is a form of AI that can understand the intent of a given instruction and, based on this intent, generate output in the form of media such as text, images, or music [20]. The form of the instruction and output differs between different types of generative AI. Some generative AIs respond to user prompts, for example, while others analyse a piece of received media. Two examples of generative AI are large language models (LLMs), which take text in the form of a prompt as input and output text in the form of a response, and text-to-image AIs, which take a text prompt as input and output an image based on the prompt.
A prompt is an input the user gives the LLM when interacting with it, and can for example be a question or an instruction.

Most generative AI uses a transformer-based architecture, which is based entirely on attention mechanisms. These imitate the cognitive attention (i.e. the ability to focus on select stimuli) seen in humans [37]. The development of the transformer architecture was successfully combined with pre-trained systems, where pre-training refers to training the model on a diverse data set to learn general patterns and features [38]. The model can then be fine-tuned to a specific domain or task, which is called transfer learning, as the model can transfer the general knowledge it learned during pre-training to another domain [39]. These powerful pre-trained models are called foundation models and can often be adapted to a wide variety of areas, such as software development, education, and healthcare [40].

2.2.1 Large Language Models

One form of generative AI is large language models (LLMs), well-known examples of which are GPT-3.5 and GPT-4, the models used within ChatGPT [4, 41]. LLMs are made for natural language processing (NLP) tasks, such as text generation. They take text as input and generate text as output by iteratively predicting the next token or word in a sequence to form a cohesive text [42].

One form of text-based language is code. Because of their ability to understand and output both natural language and code, LLMs are well-suited for software development. Within test maintenance, the most interesting applications are perhaps the generation of test cases and input, but LLMs have also been used for automated program repair, code review, and as a conversational programmer's assistant, to name a few [43, 44, 10]. Their strength lies in their versatile nature: their ability to generate code, then reason about said code and allow the user to ask questions about it.
They can be used the other way around as well: to help their user reason about a problem or set up a plan to tackle it, then based on that help the user generate or look over code.

However, even with LLMs' advanced reasoning capabilities, problems remain. LLMs are known to struggle with hallucinations. A hallucination, in the context of LLMs, is the LLM generating text that appears to be correct and fluent but is, in reality, nonsensical and unfaithful to the data the LLM was trained on. Beyond their effect on the LLM's performance, hallucinations also cause trust issues for the user and can pose safety risks if the user acts on the hallucinated output [12].

Another problem is that it is currently difficult to rigorously and extensively evaluate and judge LLMs. Firstly, there is a lack of benchmark datasets to use, especially when testing more niche applications such as program repair. These datasets may not have been designed for testing LLMs either. Secondly, there is the problem of data leakage, where the LLMs may have seen the benchmarks during training, which means reports of LLMs' performance may be misleading [11].

Applying LLMs in real-world applications is also not without its share of problems. Organisations may shy away from using commercial LLMs because of data privacy concerns and would prefer to use open-source models instead. These models can then be fine-tuned with the organisation's internal data. However, building these datasets for fine-tuning can take a lot of effort, both time and labour-wise. Organisations may also pay attention to computational power or energy consumption, and therefore choose a medium-sized model, with which it is harder to get state-of-the-art performance, even with fine-tuning [11].

An LLM and a human may not understand a prompt the same way, which has given rise to the field of prompt engineering, i.e.
how to best formulate the prompt sent to the LLM to get the user's desired output [45]. This can be likened to a kind of natural language programming, a way to steer the generated output of the LLM [46]. Prompt engineering is possible because of in-context learning, which is defined by Brown et al. as "a paradigm that allows language models to learn tasks given only a few examples in the form of demonstration" [47]. Examples include chain-of-thought and few-shot prompting. Chain-of-thought prompting is a way to help the model with multi-step reasoning, something that is often challenging for LLMs. This is done by providing the model with intermediate reasoning steps. Few-shot prompting is done by including a few input-output examples in the model's input [48]. Following the same naming convention, one-shot prompts have exactly one example, and zero-shot prompts have none. Chain-of-thought and few-shot prompting can be combined, as can several other techniques. These prompt engineering techniques are illustrated in Figure 2.1.

Unmodified Prompt
Model input:
The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
Model output:
The answer is 27.

One-Shot Prompt
Model input:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
Model output:
A: The answer is 27.

One-Shot Prompt with Chain-of-Thought
Model input:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
Model output:
A: The cafeteria had 23 apples originally. They used 20 to make lunch. So they had 23 - 20 = 3. They bought 6 more apples, so they have 3 + 6 = 9. The answer is 9.

Figure 2.1: Example showing how one-shot prompting and chain-of-thought prompting work. One-shot text is highlighted in yellow, and chain-of-thought reasoning is in pink. Note that in this example the model only arrives at the right answer in the final example. Example text and design are partially taken from Wei et al. [48].

2.2.2 LLM Agents

An LLM agent is an advanced AI system with an LLM at its core which has access to tools to help it solve problems. These tools or frameworks may allow it to reason, retain memory, plan, and perceive or interact with chosen parts of its environment [14, 15]. For example, a hypothetical LLM agent may have access to an organisation's code base and a version control tool. This would allow it to reason about the code, make changes to it, and push those changes to a remote repository. To the best of our understanding, there exists no widely established definition of what an LLM agent is. However, there is some consensus [14, 15]. We have drawn on this consensus when crafting the definition above, which is the definition that will be used throughout the thesis.

One tool to help an agent understand a code base is a RAG pipeline, RAG standing for Retrieval-Augmented Generation [49]. RAG allows the LLM to access information from external sources, and use that information to better answer related questions. For the code base example, a retrieval tool could be set up either for locally stored files or for files on a remote repository. LLMs may struggle to use the knowledge stored in their parameters for information-intensive tasks, and there is also the problem of having to fine-tune a model again to be able to update its knowledge about an organisational resource.
RAG can help solve both of these problems [49], by giving the agent easier access to updated information.

Multiple LLM agents can be made to work together, as multi-agents, to further help with task complexity. There are many ways to set up how the agents work together. For example, the agents can be given different roles, opinions, and tasks, to help simulate the way that, for example, a software development team would work in real life [50]. Other approaches can also be used to make the agents work together, such as combining task decomposition and task allocation [51].

3 Related Work

This chapter discusses related works relevant to the research topics. It first explores test management, especially test maintenance, before moving on to generative AI. There the sections examine related research on how LLMs can help with software engineering and test maintenance, as well as how LLMs can be used as agents.

3.1 Test Management

Many tools exist to help developers automate test suites, both by automatically executing tests [19] and by generating tests [32]. Code and test generation with LLMs has been explored, but the use of LLMs to help with other test management activities has not, to our knowledge, been widely explored yet. Only a few studies have tackled this topic, see Section 3.3.3. There is reason to expand upon this topic, as White et al. note that LLMs hold immense potential for automating software engineering tasks and activities, of which test management is a part [7]. In other words, though automated solutions are in many cases used, there is room for these solutions to improve, and reason to further explore how LLMs may help.

Test generation is a way to automate the creation of new test cases, which is one of the test management activities. Anand et al.
describe test creation as having a strong effect on the efficiency of software testing, but also note that test creation is among the most labour-intensive of software testing activities [32], and an intellectually demanding task. Because of this, much work has been done on different fronts, and common methods for generating test cases include symbolic execution, model-based, combinatorial, adaptive random, and search-based testing. Despite the work done, the reliability of test generation has yet to be proven, with Gay et al. finding that automatically generated test suites in some cases would perform worse than randomly generated ones [52]. This is further supported by Palomba et al., who show that automatically generated test cases often suffer from poor test code quality [53]. They further establish that test cohesion and coupling are good metrics for test code quality. Xu et al. examine, through an empirical study, how different factors influence the effectiveness and cost of test suite augmentation techniques [54]. They find that the primary factor is the test case generation algorithm, followed by how new and existing test cases are utilised.

3.2 Test Maintenance

Test maintenance is a topic that has been explored to a certain extent in previous literature. Pinto et al. examine why test suites evolve [27]. Their findings show that tests that appear to be added or deleted are often simply old tests that have been moved or renamed, and that when tests are truly deleted it is most often because they are obsolete, not because they are hard to repair. They further find that the main reasons for adding new tests are to cover new functionality, validate bug fixes, and validate refactored code.

There are, however, further areas within the topic of test maintenance that are yet to be explored. Imtiaz et al.
for instance, noted that there is much room for future studies on methods for repairing tests and that few of the existing studies were done in an industrial context [55]. Even when research has been conducted on test maintenance, it is not always put into practice. Gonzalez et al. looked at the usage of testing patterns in open-source projects and found that only a quarter of the projects that had tests used patterns related to maintainability [56]. Nevertheless, there is previous research on this topic. Kochar et al. provide an overview of users' perspectives on important aspects of software testing by performing a survey and a series of interviews on test cases [57]. The questions focus on characteristics of good test cases, which are divided into six dimensions, one of which is maintainability.

Factors that affect test maintenance are another area of research within the topic of test maintenance that is relevant to this thesis. Previous research has identified various such factors. Alégroth et al. investigated maintenance for visual GUI testing and found 13 factors that affect automated test suites. They also identified how long the maintenance for tests affected by these factors was estimated to take, and how much effort was required [1]. An example is test case length, which had a high impact and increased maintenance by more than an hour. Berglund et al. similarly looked at test maintenance for machine learning systems and found 9 factors which affect test maintenance for all systems, machine learning and traditional systems included [21]. An example is oracle precision, where tests with sensitive oracles require more updates as the test suite is updated while simultaneously being harder to update. Sensitive, in this regard, refers to an oracle that is highly adapted to the precision of the program's output, where minor changes can cause the oracle to need to be updated.
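The notion of a sensitive oracle can be made concrete with a small sketch. The two `mean` functions and both oracle helpers below are hypothetical and invented for this illustration; they are not taken from Berglund et al. The sketch assumes a refactoring that keeps the intended behaviour but changes floating-point rounding in the last digits.

```python
import math


def mean_v1(values):
    """Hypothetical original production code: naive running sum."""
    total = 0.0
    for value in values:
        total += value
    return total / len(values)


def mean_v2(values):
    """Hypothetical refactoring: numerically more careful summation."""
    return math.fsum(values) / len(values)


def sensitive_oracle(actual, expected):
    # Tied to the exact output of the version it was written against.
    return actual == expected


def tolerant_oracle(actual, expected, rel_tol=1e-9):
    # Accepts any output within a relative tolerance of the expectation.
    return math.isclose(actual, expected, rel_tol=rel_tol)


samples = [0.1] * 10
expected = mean_v1(samples)  # expectation recorded against version 1

print("sensitive oracle accepts v2:", sensitive_oracle(mean_v2(samples), expected))
print("tolerant oracle accepts v2:", tolerant_oracle(mean_v2(samples), expected))
```

Under these assumptions, the exact-equality oracle would need maintenance after the refactoring even though the function still computes a mean, while the tolerant oracle would not. The choice of tolerance is itself a design decision, since a tolerance that is too wide can hide real faults.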
It should be noted that in both these studies the factors are related to how complicated the maintenance will be. They are not looking for specific factors in the code or program that indicate the test needs maintenance, nor are they looking at how changes in the production code might indicate a corresponding need for test maintenance. Factors looking directly at changes in the production code are presented in Section 5.1 and the relevant research investigated is summarised in Section 3.2.2.

3.2.1 Co-evolution of Production Code and Test Code

Co-evolution is the effect of test code and production code being modified in parallel. It concerns test maintenance in that test code is modified in response to changes in production code to make sure the test suite stays relevant and useful.

A number of tools, metrics, and methods have been proposed to help detect changes in production code that indicate test code will need to be changed, as well as to help with co-evolution. Huang et al. propose the tool Jtup, a machine learning approach using random forest [58], which analyses changes made to production code to see if the matching test code needs to be changed as well [59]. The decision on whether co-evolution is needed is based on code change features, semantic features, as well as complexity features of the code. This distinguishes it from other approaches, which mainly look at semantic changes as well as change features. Kita et al. propose the Tconf metric for evaluating how well a production method has co-evolved with its corresponding tests [60]. They make use of the evaluation of logical couplings between production and test code instead of code analysis. Ens et al. present a visualisation tool for co-evolution and co-change between production code and test code [61]. The tool, called ChronoTwigger, is interactive and shows co-change over time. Gall et al.
present ChangeDistiller, a tool for examining fine-grained code changes, which makes use of how source code can be represented as abstract syntax trees [62]. ChangeDistiller has been used by several other studies on co-evolution [63, 64, 65]. Sohn and Papadakis present CEMENT, a tool that makes probable links between production and test code that have been updated within a short time frame, under the assumption that the fact they have been updated together, or co-evolved, means they are related [66].

Beyond the tools, metrics, and methods themselves, there is further research on the usefulness and applicability of co-evolution. Sun et al. investigate the assumption that if a production class and its corresponding test class are updated within the same commit or within a short time frame, they are an example of linked co-evolution between test and production code [67]. They found that the longer the time frame between the change in the production code and the change in the test code, the less likely they are to be a true example of co-evolution. Updates within the same commit contained 11.34% false positives, while pairs updated more than 24 and 48 hours apart had 85.71% and 89.19% false positives respectively. Based on this they claim the co-evolution samples used by Wang et al. [2] include noise. Klammer and Kern show that visualisations can be used to understand and keep up with how a system's production and test code co-evolve [68]. This was tried when analysing co-evolution in industrial projects.

3.2.2 Co-evolution Factors

The following section presents the factors found in existing literature for which modifications in the production code lead to a need to modify the test code. These were used as part of answering the first research question, RQ1, which is defined in Section 4.1 and the results of which can be found in Section 5.1. Previous studies have looked at the co-evolution between production code and test code, and extracted patterns of what triggers this co-evolution.
This has been done at different scales, from certain types of changes (e.g. a change in the method body, or the addition of a conditional statement), to looking at specific syntax and keywords.

Shimmi and Rahimi extracted and documented higher-level patterns on co-evolution between production code and test code [69]. The patterns were classified under addition, deletion, and modification. An example is addition: added functionality, where the corresponding test cases are created when the functionality is added to the production code. Reich and Maalej similarly extracted patterns, but focused on refactorings and co-evolution to increase the testability of the production code [70]. They identified both high and low-level changes in the production code. They define a low-level change in the production code as a local change in a production file, such as changing an attribute type. A high-level change in the production code would have wider effects beyond the local area around the change, such as merging a package. These changes were used to find testability patterns, such as the extract_method_for_invocation pattern. Levin and Yehudai also use semantic changes and investigate the relationship between test code maintenance and production code maintenance [63]. They identify both high-level relationships (e.g. REMOVED_CLASS, where removing a production class will lead to the removal of a test class) and low-level relationships (e.g. RETURN_TYPE_CHANGE, where changing a return type in the production code will lead to test maintenance). Marsavina et al. extracted patterns for fine-grained co-evolution between production and test code [65]. They extracted fine-grained changes in production code and linked them with the related test code, from which they identified six co-evolution patterns. Vidács and Pinzger later found support for five of the six patterns found by Marsavina et al. [64].
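To illustrate what a low-level change factor can look like in practice, the sketch below compares two versions of a source file and reports signature-level changes. It is a simplified illustration written for Python code using the standard `ast` module (Python 3.9 or later for `ast.unparse`), not the method of any of the cited studies. The label RETURN_TYPE_CHANGE is borrowed from the text above, while REMOVED_FUNCTION, ADDED_FUNCTION, and PARAMETER_CHANGE, as well as both function names, are hypothetical names chosen for this example.

```python
import ast


def signatures(source):
    """Map each function name to (parameter names, return annotation)."""
    result = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            params = tuple(arg.arg for arg in node.args.args)
            returns = ast.unparse(node.returns) if node.returns else None
            result[node.name] = (params, returns)
    return result


def changed_factors(old_source, new_source):
    """List (function name, factor) pairs that may call for test maintenance."""
    old, new = signatures(old_source), signatures(new_source)
    findings = []
    for name in sorted(old.keys() | new.keys()):
        if name not in new:
            findings.append((name, "REMOVED_FUNCTION"))
        elif name not in old:
            findings.append((name, "ADDED_FUNCTION"))
        else:
            if old[name][0] != new[name][0]:
                findings.append((name, "PARAMETER_CHANGE"))
            if old[name][1] != new[name][1]:
                findings.append((name, "RETURN_TYPE_CHANGE"))
    return findings


OLD = "def price(item: str) -> int:\n    return 1\n\ndef legacy():\n    pass\n"
NEW = "def price(item: str, currency: str) -> float:\n    return 1.0\n"

for name, factor in changed_factors(OLD, NEW):
    print(name, factor)
```

A rule-based check like this only covers factors that are visible in the syntax tree. Changes in behaviour inside a method body would need either finer-grained differencing or, as explored in this thesis, reasoning by an LLM agent.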
Factors were also extracted from literature where tools to predict or help with test maintenance had been developed and where the authors reported which factors the tools acted on. DRIFT [71] is a further development of SITAR [2], which identifies outdated test cases based on changes in the production code at the method level. Within this work, they identified fine-grained changes in the production code that may be related to co-evolution. Some examples of these fine-grained changes include Try, Break, and If. These fine-grained changes would require changes in the test code, hence their relation to co-evolution. TestCareAssistant, originally proposed by Mirzaaghaei et al. [72], was further developed by Mirzaaghaei et al. and can repair and generate new test cases as the production code changes [73]. TestCareAssistant looks at parameters and return values to identify when a test case becomes outdated, which were identified as indicators that co-evolution was needed. As part of the work to develop TestCareAssistant, Mirzaaghaei worked to formalise test maintenance activities into test adaptation patterns [74]. CEPROT identifies outdated test cases and also updates them [22]. The main factors looked at in the production code to detect the need for co-evolution are API invocation and changes to identifiers and modifiers.

3.3 Generative Artificial Intelligence

The following section presents some of the relevant research on generative AI, its use cases, and its performance. Generative AI has been extensively researched in the last few years with many different approaches. Gozalo-Brizuela and Garrido-Merchán investigated various generative AIs and classified them into a total of 9 categories [75]. The most relevant topic areas for this thesis lie in research where generative AI and LLMs are used for coding and other software engineering purposes. Thus the most relevant categories are the text-to-text models as well as the text-to-code models.
The most frequently used, and studied, LLMs at the time of writing are OpenAI's GPT series, most commonly GPT-3.5 and more recently GPT-4. These LLMs are often interacted with through ChatGPT, a chatbot built on the GPT series. These are text-to-text LLMs that have gained immense popularity since ChatGPT's original introduction. Conversely, there are many different text-to-code LLMs [20, 75].

Beyond ChatGPT and its use cases, there are further factors to consider regarding how an LLM can perform. Mandvikar compares LLMs to each other and presents several factors that describe how LLMs can differ [76]. These factors include, but are not limited to, the kind of pre-training data, the size of the model, and the API capabilities. According to Mandvikar, these factors are useful to consider when selecting an LLM for a specific task. Beyond these factors, Döderlein et al. investigate two LLMs and how they can be improved based on their input parameters [77]. Their findings indicate that the temperature, which is a parameter affecting how varied a response will be, and the initial prompt can have a significant effect on the performance of the LLM. To get a further understanding of the performance of LLMs, Chang et al. perform a survey focusing on the evaluation of LLMs [78]. They focus on three aspects, namely what to evaluate, how to evaluate, and where to evaluate. Their findings include limitations in LLMs and their reasoning ability, as well as their robustness.

3.3.1 LLM Capabilities in Software Engineering

There is some previous literature aiming to collect and present various research papers that have investigated LLMs and specifically their use for software engineering. Zhang et al. investigate existing LLM-based software engineering (SE) studies, both studies focusing on LLMs as well as studies focusing on SE [79].
They discuss architectures, benchmarks, optimisation and application, as well as some challenges of LLM research. Their findings indicate that LLMs are being trained for more code-aware objectives compared to earlier natural language processing-derived objectives. Further findings include a consideration for variables and structural features, as well as utilising cross-modal learning. This signifies advancements towards LLMs that consider the semantics and functional aspects of code beyond processing the code as a sequence of tokens. Hou et al. present a systematic literature review on how LLMs are utilised for software engineering [43]. Their findings provide a comprehensive list of different utilisation areas for LLMs including, but not limited to, code generation, code completion, code understanding, program repair, code review, bug prediction, vulnerability detection, and verification.

There are additional reviews and surveys of previous works for various software engineering activities. Zhang et al. present a review of the history of code processing and generating code from a natural language description, from natural language processing models to few-shot prompting applications of LLMs [44]. Wang and Chen present a review of previous work on how LLMs can be utilised for code generation [80]. They focus on the application of LLMs for this topic as well as the evaluation of the generated code. They find several limitations with applying LLMs to code generation including, but not limited to, compatibility, maintainability, portability, correctness, and privacy. Despite the limitations presented, Wang and Chen conclude that code generation with LLMs has progressed and can handle increasingly complex tasks. Their findings also show that there is a lack of research on the evaluation of LLM-generated code. Zheng et al. provide a comprehensive review of the current stage of code LLMs through their survey [81].
Code LLMs are LLMs that have been trained mostly on code repositories instead of natural text, though some are trained on both. They list several code LLMs, their applications, as well as the relationships between them, both between themselves and compared to general LLMs. The performance of the code LLMs is investigated and compared to benchmarks for multiple software engineering tasks. They summarise their findings with code LLMs having a focus on code generation with some lesser emphasis on other tasks, e.g. vulnerability repair or evaluation.

Other research directions include more specific work on how LLMs can be utilised for various software engineering activities. Uusnäkki investigates the applications of generative AI on software development [82]. As part of this Uusnäkki performed an empirical study on the use of prompt engineering for enhancing software system maintenance. As part of the results, the PESD framework is presented, which is a framework for systematic prompt engineering. Fan et al. investigate how hybridising, i.e. using LLMs along with existing software engineering techniques, such as API search techniques or search-based test generation, can reduce hallucinations and improve performance [83]. Their findings indicate that this is a promising topic with several successful examples. Pei et al. investigate how LLMs can work with program invariants, including predicting them [84]. They present a method for predicting invariants through fine-tuning LLMs and find that LLMs are effective on this task, with 86% recall and 86% precision. The different invariants include object, class, function-entry, function-exit, and loop invariants. Liu et al. propose CodeExecutor, a model focused on enhancing code execution through LLMs [85]. They utilise pre-training and curriculum learning to improve the model on code execution tasks specifically. Liang et al. investigate the qualitative experience of LLMs as coding assistants [86].
Their findings show that LLMs are mostly used for code completion and to reduce keystrokes. On the other hand, most users in the study find that code generation does not reach quality requirements, and that creativity and ideas are underutilised.

As discussed in Section 3.3, ChatGPT is a popular LLM in a general sense at the time of writing. Its applicability to software development has also been investigated. White et al. investigated the use of ChatGPT in software development and identified 14 prompt patterns that would make the answers from ChatGPT more helpful [7]. These patterns were focused on software development. Based on these patterns, some identified benefits of using ChatGPT were rapid experimentation at different abstraction levels and identification of assumptions in the code of a project. Rahmaniar discusses potential applications of ChatGPT in software development but also brings up several challenges that may arise when attempting to integrate ChatGPT or other generative AI into a development process [87]. Rahmaniar mentions topics that ChatGPT would be adept at handling or assisting with, such as documentation, onboarding, reviewing, and of course code writing assistance. Worth noting is that each of these topics, among others not mentioned here, has its limitations and will, according to Rahmaniar, require some sort of human component for best results.

3.3.2 LLMs for Code Interaction

This section focuses on LLMs for software engineering activities that specifically interact with code. This includes various frameworks, techniques, and methods that can view or make modifications to the production code. Sghaier and Sahraoui present a framework for utilising LLMs for code review [88].
They believe, and their findings indicate, that fully automating code reviews does not lead to the best results. Their framework therefore aims to lessen the workload of a code reviewer and provide assistance instead of automating the whole review. Zhang et al. present a survey of automated program repair (APR) solutions in current literature, many of them focusing on utilising LLMs for APR [89]. They describe the typical framework and design strategies, as well as metrics and empirical studies. Similarly, Xia et al. investigate the use of LLMs for APR [90]. Ibrahimzada et al. present BUGFARM, a technique utilising LLMs to generate bugs [91]. They utilise attention analysis to attempt to find the weak spots of LLM models and then improve their performance through training on the generated bugs.

Further implementations using LLMs include Fried et al., who present InCoder, a model that can perform both program synthesis and editing through LLMs [92]. Additionally, they use causal modelling to improve the performance of their model, particularly its infilling capabilities. Dou et al. investigate the capabilities of LLMs when it comes to code clone detection [93]. Their findings indicate that LLMs have the potential to outperform other automatic clone detection methods, especially regarding complex semantic clones. Geng et al. investigate how well LLMs can generate comments and summaries of code [94]. Their findings indicate that through few-shot learning an LLM can perform better than existing supervised learning approaches. Chen et al. present SELF-DEBUGGING, a framework where LLMs can iterate over their own generated code to find and rectify errors, both semantic and syntactic [95]. Their findings suggest that LLMs can improve their performance by going over the code they have previously generated. Ren et al.
describe several limitations of exception handling by LLMs and, to mitigate them, present KPC, a code generation approach for using LLMs so that they handle exceptions better [96].

3.3.3 LLMs for Testing

This section will focus on how LLMs have been used for testing and testing-related activities. Wang et al. have investigated what testing activities have already been performed using LLMs [97]. They have investigated 50 different utilisations of LLMs in software testing and have then reviewed and analysed the results from them. Wang et al. state that LLMs are more useful for automation in testing in comparison to automation in source code. Unit testing has been the main kind of testing performed with LLMs; while some system testing has been done, Wang et al. found no integration or acceptance testing. The majority of the testing was functional testing with a small amount of security testing. No performance or acceptance testing using LLMs was found by Wang et al. “There is currently no clear consensus on the extent to which LLMs can solve software testing problems.” state Wang et al. in an overall view of the current state of the research on this topic.

As found by Wang et al., there have been multiple studies on test code generation. For instance, Yuan et al. evaluated ChatGPT’s ability to generate unit tests [3]. They found that about a quarter of the generated tests pass, but the rest suffer from issues with compilation, correctness, and execution. What is notable is that the tests that pass resemble manually written tests in quality. They describe ChatGPT’s ability as promising if the correctness were to be improved. Another study on test code generation is by Schäfer et al., who present TESTPILOT, an approach for unit test generation by LLMs [5]. Their findings indicate that LLMs provide higher coverage and a larger amount of non-trivial assertions compared to previous test generation techniques.
They conclude that LLMs can lessen the work required for unit testing but not replace the need to write unit tests entirely, especially when it comes to more complex tests. Further research on test generation is by Siddiq et al., who investigate the unit test generation capabilities of three code generation LLMs [98]. They compare strongly typed languages, e.g. Java, to weakly typed languages, e.g. Python, to see if the generation by LLMs differs. Their findings suggest that LLMs have more difficulties with more strongly typed languages because syntax has an increased importance compared to semantics. They also investigate the applicability of utilising LLMs for Test Driven Development (TDD). Their findings indicate that LLMs can work well with TDD. Kang et al. introduce LIBRO, a technique to use LLMs for generating tests based on bug reports [99]. The generated tests are intended to reproduce the bugs, with a reported success rate of 33%. Lemieux et al. present CODAMOSA, an algorithm for using LLMs to enhance search-based software testing [100].

An important aspect of an LLM for this study is the understandability of the code and the suggestions made by an LLM. Gay investigated the readability of tests modified by LLMs [101]. Gay worked with GPT-4 through ChatGPT and a code interpreter plug-in and identified that over 90% of the investigated case examples had significant improvement in the test readability after GPT transformed the tests. There are some challenges present which include, but are not limited to: non-determinism, text and prompt limits, code interpreter limitations, and transformation order. Nevertheless, Gay concludes that LLMs seem promising in the task of improving readability in tests.

3.3.4 LLM Agents

This section will present previous related work done on LLM agents. This is a very recent topic at the time of writing and therefore there is limited research done on the topic. Jiang et al.
investigate the use of planning with LLMs, where the LLM first makes a plan for its actions before proceeding with those actions [102]. Their findings indicate that a planning phase can improve performance, despite planning being an emergent ability of LLMs. Although this is not specifically about LLM agents it serves as support for one of the bases of LLM agents. Zhao et al. present a method of choosing between chain-of-thought and program-aided language models [103]. Their findings indicate a benefit to performance for choosing the better-suited model for each problem. This is not specifically about LLM agents but also serves as a base idea for multi-agent frameworks, where multiple LLM agents cooperate and use their various specialised skills to collectively produce an improved result.

For work done specifically with LLM agents, Feldt et al. work towards SocraTest, an autonomous LLM agent that can invoke tools [15]. They present a taxonomy of agents as well as a concrete example. Hong et al. present MetaGPT, a multi-agent framework designed to solve various problems by simulating a software company structure [104]. Their findings indicate that taking inspiration from humans can improve the workings of LLM agents and how they work together. Rasheed et al. present CodiPori, a code generation model based on multiple LLM agents [105]. Their findings show that the performance of LLM agents working together can outperform existing single LLM usage. Shen et al. investigate the limitations of small LLMs when it comes to tool usage in LLM agents [106]. Their findings suggest that simplifying and dividing tasks into different instances can improve the performance of LLMs, especially LLMs of smaller sizes. Yoon et al. implement DROIDAGENT, an LLM agent that performs Android app GUI testing automatically [107]. Their findings indicate that LLM agents can contribute to autonomous GUI testing based on more meaningful exploration choices and depth of search.
3.4 Summary

This section will present a summary of the related works and position this master’s thesis against the open challenges of previous literature. The gaps in current research and the steps we take to fill them are discussed in this section with a finishing paragraph highlighting the research that has been utilised in this thesis.

Automating test management and maintenance are topics that have been somewhat explored in previous research. However, the majority of automation comes in the form of test case generation and focuses less on modifying existing test cases or identifying affected test cases when production code has been changed. Research on the co-evolution of test and production code likewise offers little on automation beyond the generation of entirely new test cases. We address this gap by building a prototype that identifies test cases that might need to be modified based on a production code change.

Generative AI and LLMs are more recent topics that have seen extensive research over the last few years. Most research has been performed with larger models that are not available through open source and instead require access to APIs, and therefore lack control over the deployment of the model. In addition, most research includes pre-training or fine-tuning a model to fit a specific use case better. There is also a lack of research done in the area of multi-agent architectures that focus on topics beyond feature development. Previous research has explored the topic of code understanding but has not further applied this understanding to topics such as code traceability. Applying LLMs and LLM agents to test maintenance is also an unexplored area from what we have found. To our knowledge, we are the first to explore a multi-agent setup of open-source LLMs without pre-training or fine-tuning. In addition, our focus on using LLMs to help automate test maintenance is novel to our knowledge.
Some research that has been utilised to great effect in this thesis is previous research on triggers for test maintenance. For details on this see Sections 4.3 and 5.1. Additionally useful for this thesis is previous research done on actions that LLMs and LLM agents can take. For details see Sections 3.3.1 and 5.4.

By building on previous research this thesis aimed to fill gaps in the research areas highlighted in this section. The aim is to expand the areas of application of LLMs while also providing more options for automating test maintenance activities. The inclusion of test maintenance triggers stems from the desire to have criteria that an LLM agent can utilise to know when to act. The inclusion of LLM actions and LLM agent actions stems from the desire to understand what applications of LLMs and LLM agents can be applied for this thesis’ use case.

4 Methods

The methods chapter will present the different steps taken throughout the case study to answer the research questions. This chapter first presents the research questions (Section 4.1) before giving an overview of the case study (Section 4.2). The remaining sections provide more detailed explanations of each case study step.

4.1 Research Questions

This section will first present the research questions. It will then motivate them and explain their purpose, as well as connect them to the scope of the thesis.

RQ1 Which factors suggest that test maintenance needs to occur due to changes in the production code?

RQ2 What applications could current LLMs or LLM agents have within the area of test maintenance?

RQ2.1 Which factors from RQ1 can act as triggers for test maintenance in an LLM or LLM agent?

RQ2.2 What are potentially viable test maintenance actions that an LLM or LLM agent could take based on these triggers?

RQ2.3 Based on the present-day landscape, what are some considerations for building an LLM agent for test maintenance within a corporate setting?
RQ3 What is the precision, recall, and F1 score of our setup with LLM agents in predicting if and where test maintenance is necessary using the factors found in RQ1?

Identifying the need to evolve some part of a test suite is the first step of test maintenance. The purpose of RQ1 is therefore to identify and categorise the issues and changes that lead to the need to perform test maintenance. This was answered by a literature review to identify changes in the production code that lead to a need for test maintenance (described in Section 4.3), a thematic analysis of interviews with practitioners (described in Section 4.4), and a survey with practitioners (described in Section 4.5).

RQ2 builds upon RQ1 and aims to explore how LLM agents might fit within the problem space, as well as what should be taken into consideration when building them. The decision to use LLM agents stems from the literature review of LLMs and the results from that combined with the results of RQ1. RQ2.1 identifies which of the results from RQ1 are suitable to move forward with by reasoning about the triggers and the planned architecture of the agent. RQ2.2 is more exploratory. It uses existing literature on software testing and LLMs to give ideas about areas of application for a test maintenance setup with LLM agents, as well as suggestions regarding how particular triggers might suggest particular applications. The question is answered by presenting examples of how LLMs have been used within software testing. RQ2.3 draws upon current literature as well as the interviews and the survey to identify surrounding factors and limitations within Ericsson that must be considered when deploying an agent.

RQ3 assesses the performance of an initial prototype solution. It builds upon the results of RQ2, but does not seek to confirm all of RQ2’s findings. This RQ only seeks to try out one of the possible use cases found in RQ1 and RQ2.
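To make the three metrics named in RQ3 concrete, the following is a minimal Python sketch. It assumes that both the prediction and the ground truth are represented as sets of test case identifiers; the function name, the set-based representation, and the example test names are illustrative assumptions and are not taken from the prototype.

```python
from __future__ import annotations


def precision_recall_f1(predicted: set[str], actual: set[str]) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for a set of predicted test cases.

    Hypothetical helper: `predicted` holds the test cases flagged as needing
    maintenance, `actual` holds the test cases that really needed it.
    """
    true_positives = len(predicted & actual)
    false_positives = len(predicted - actual)
    false_negatives = len(actual - predicted)

    # Guard against division by zero when nothing was predicted or expected.
    precision = true_positives / (true_positives + false_positives) if predicted else 0.0
    recall = true_positives / (true_positives + false_negatives) if actual else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Illustrative example with made-up test case names.
predicted = {"test_login", "test_logout", "test_signup"}
actual = {"test_login", "test_signup", "test_reset"}
precision, recall, f1 = precision_recall_f1(predicted, actual)
```

In this example two of the three predicted test cases are correct and one affected test case is missed, so precision and recall are both 2/3. Returning 0.0 for empty inputs is only one possible convention; an evaluation could equally treat that case as undefined.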
As a proof of concept, a setup with LLM agents was designed and evaluated on its precision, recall, and F1-score. The design process is described in Section 4.8 and the evaluation process in Section 4.9.

4.2 Research Design

This study is a case study investigating the applicability of LLMs within the test maintenance domain at Ericsson, especially LLM agents. The case study follows the guidelines laid out by Runeson and Höst [108]. This section will present an overview of the case study methods, and how the different elements relate to each other and the research questions. The overall structure of the methods of the case study is also displayed in Figure 4.1. The case study had the following steps:

(1) Literature review of test maintenance factors: To find factors that when changed in production code lead to test maintenance, meaning a need to make changes to test cases, a literature review was conducted. The process is described in detail in Section 4.3, and the results contributed to RQ1.

(2) Interviews: The interviews were conducted with Ericsson employees to get a better understanding of test maintenance problems at Ericsson. A thematic analysis was done to analyse the results, which contributed to RQ1 and RQ2. The process is described in Section 4.4.

(3) Survey: Similarly to the interviews, a survey was sent out to Ericsson employees to better understand Ericsson employees’ views of test maintenance, as well as opinions about generative AI. The results contributed to RQ1 and RQ2, and it is described in Section 4.5.
Figure 4.1: Overview of case study. The numbers correspond to the case study’s different steps. 1 = literature review of test maintenance factors, 2 = interviews about test maintenance and LLMs, 3 = survey about test maintenance and LLMs, 4 = literature review of LLM capabilities, 5 = synthesise results, 6 = build prototype of setup with LLM agents, 7 = evaluate prototype of setup with LLM agents. Grey = activity, yellow = artefact, green = research question.

The results of these three steps were used to find the final list of factors indicating a need for test maintenance, which is the answer to RQ1. The decision to use three different methods was taken to get data and method triangulation, to help increase the precision and validity of the results.
(4) Literature review of LLM capabilities: To understand how LLMs are currently used for test maintenance, as well as understand their suitability and limitations within the current use case, a literature review was conducted. The results were used to answer RQ2, and the review protocol is described in Section 4.6.

(5) Analysis of which test maintenance problems and triggers to implement in agent: This step used the results from steps two, three, and four to decide which of the identified test maintenance problems would fit with which triggers. This was done ad-hoc, taking the time and resource limitations into account. The result was partly used to answer RQ2 and provide a basis for larger use cases in the discussion. For a further description, see Section 4.7.

(6) Design setup with LLM agents: Four proof-of-concept LLM setups were designed and implemented to explore the viability of using LLMs to predict test maintenance. This step contributed to the result of RQ3 and is described in Section 4.8.

(7) Evaluation of setup with LLM agents: The setup with LLM agents was evaluated on its precision, recall, and F1-score, the results of which were used to answer RQ3. For a further description, see Section 4.9.

4.3 Literature Review to Identify Test Maintenance Factors

A literature review of factors in the source code that indicate the need for changes in the test suite was performed in the early stages of the thesis to help answer RQ1. Though the literature review was not a Systematic Literature Review due to the time constraints, it did take inspiration from its strict protocol, as described by Keele [109]. Each step of the literature review was based on the guidelines provided by Keele. The majority of the steps described by Keele were followed, if less meticulously than in the original, with two exceptions: the data collection, and the dissemination. The data collection was not as rigorous due to the short time frame, and the dissemination, i.e.
report writing, received less focus because the literature review was not an isolated aim but instead led into the next step of the thesis.

The databases that were utilised to search for primary sources were: IEEE, Science Direct, ACM, SCOPUS, and Google Scholar. For the search strings used for each respective database, see Appendix A. The search was limited to papers released within the last 15 years, i.e. in the period 2009-2024. This range was chosen based on the desire for relevancy to current-day software engineering and testing standards and practices.

Relevancy was judged first through the title, followed by the abstract and the conclusion of the papers. Once it was judged that no more relevant papers were being found, the initial scan ended. This step yielded 48 papers. These papers were then examined in more detail, and any relevant factors they contained were recorded in a separate document. This step yielded 8 different papers that named test maintenance factors in the source code.

These 8 papers were then used as a staging point for both backwards and forward snowball sampling. The relevancy of the new research papers was examined with the same method as in the initial scan. This step yielded an additional 64 papers. From these papers, an additional four papers were found to describe relevant factors in production code that when changed would lead to a need for test maintenance. This led to a total of 12 found papers with relevant factors. To sort and organise the factors, a Miro [110] board was used. See Section 5.1 for the result of the literature review.

4.4 Interviews

Interviews were held with Ericsson employees to get a better understanding of the current state of the test maintenance problem and where improvements can be made as part of RQ1.
The interviews also included questions on LLMs and generative AI, both current use and opinions, as part of RQ2.

4.4.1 Selection of Interviewees

Convenience sampling was utilised for the sampling of the interviewees. Based on the supervisors’ knowledge of the organisation, emails were sent out to relevant teams and developers to explain the master’s thesis and inquire about participation in interviews. Some interviews were held in groups, for the convenience of both the interviewers and the interviewees. Table 4.1 presents the demographics of the interviewees who agreed to be interviewed.

Table 4.1: Demographics of Interviewees. Experience refers to years of experience with testing, type refers to testing they are currently performing. IDs that were interviewed together have been grouped.

ID   Experience (Years)   Role                  Type
P1   15                   Software Developer    Unit
P2   3.5                  Data Scientist        Unit
P3   6                    Developer             Unit
P4   5                    Developer             Unit
P5   3                    Software Developer    Unit
P6   2                    Test Manager          Integration, System
P7   2                    Test Manager          Integration, System
P8   25                   Principal Developer   Overseeing Process

4.4.2 Interview Instrument

The first step was to write out questions relevant to the topics of RQ1 and RQ2. Questions were left open to avoid leading questions. After the questions were written, an expert review was performed by the academic and industrial supervisors of the thesis. A pilot interview was performed, and based on it small changes were made. One question was removed, as it was deemed irrelevant to the research questions. Some questions received minor clarifications. Because the changes remained relatively small, the data from the pilot interview was used in the final analysis.

Before each interview started, a consent form was presented to each interviewee and permission to record the interview was obtained from the interviewee(s). The interviews lasted roughly 40 minutes on average. See Appendix C for the interview consent form.
See Appendix D for the interview questions.

4.4.3 Interview Analysis

All the interviews were transcribed through Microsoft Teams [111] and were later manually corrected after a process of listening through the recordings of the interviews. After the interviews were transcribed, a thematic analysis was performed to identify themes and common concepts and thoughts of the interviewees. The thematic analysis followed the steps and guidelines described by Braun and Clarke [112]. An inductive and semantic approach was mainly used. The results of the thematic analysis can be found in Section 5.2.

4.5 Survey

A survey was sent out to employees at Ericsson, whose work was related to software engineering, to gauge their way of performing test maintenance as part of RQ1 as well as their opinions on LLMs as part of RQ2. A protocol for the survey was designed based on Ghazi et al. [113] and Kasunic [114]. The nature of the survey was exploratory, and the motivation behind the survey was to find out how the test maintenance process is managed, what kind of help practitioners want from LLMs, and what in the source code triggers an update to a test case.

4.5.1 Selection of Participants

The desired population was Ericsson developers with testing experience, as well as other Ericsson employees who worked with testing. Convenience sampling was used to distribute the survey. Based on the supervisors’ knowledge of Ericsson and the various organisations within, emails were sent out to relevant communities and teams. For demographics of the respondents, see Section 4.5.3 and Figure 4.3.

Figure 4.2: Timeline of the distribution of the test maintenance and LLM usage survey through February 2024.

The survey was initially planned to be distributed to a single developer community, consisting of over 100 developers across different countries and sections within Ericsson that work within the same area, and then be available for two weeks.
However, based on a low response rate the survey was sent out to several different communities during the time the survey was available, necessitating an extension of the availability of the survey to make sure that respondents had time to answer. A timeline for the survey distribution can be seen in Figure 4.2.

4.5.2 Survey Instrument

Next, the survey instrument was designed. The survey was determined to be an unsupervised cross-sectional survey. The survey was designed to take 5-10 minutes to increase the chance of the respondents answering the survey. Care was taken to avoid open-ended questions and to keep the number of questions low. The total number of questions was 10, with all of them being multiple-choice questions. Some questions allowed the respondent to choose multiple answers. These questions had a limit to the number of answers that could be chosen, to force the respondent to prioritise the most relevant choices. The wording of all questions was evaluated using a checklist based on understandability criteria set by Kasunic [114]. Understandability criteria are rules for the phrasing and structure of questions such that minimal confusion and misunderstandings occur.

The instrument started with an information page that included the identity of the surveyors as well as the purpose of the survey and how the respondents’ answers would be treated. Following the information page were some initial attribute questions regarding the demographics of the respondents. After the attribute questions about demographics, the next section focused on test maintenance activities. This section contained questions of the behaviour and belief types, as described by Kasunic [114]. Thereafter the final section consisted of two questions about the respondent’s use of and attitude towards LLMs and generative AI. These two questions were of the behaviour and belief type respectively, as described by Kasunic.
All questions in the survey can be found in Appendix E.

The survey instrument was evaluated by a Data Analytics Expert at Ericsson. It was also pilot-tested by three Ericsson developers to ensure there were no ambiguities in the questions, as well as to ensure it could be completed in less than ten minutes.

4.5.3 Survey Analysis

The total number of people who received the survey is unknown, because it cannot be confirmed whether respondents shared the survey with additional colleagues beyond the recipients of the emails that were originally sent out. What can be confirmed is that no respondent answered the survey more than once, as each respondent had to log in with their Ericsson account and the survey was set to accept only one response per account. At least 300 people received the survey. The total number of responses to the survey was 29 and thus, while the exact number of recipients is unknown, the response rate is less than 10%.

The demographics of the respondents are presented in Figure 4.3. The most common role among respondents was developer, the median testing experience was 3-5 years, and the most common type of testing performed was unit testing. The main programming language was Python. If respondents are separated into those with up to five years of testing experience and those with more than five years, the results differ. While respondents with up to five years of experience mostly work as developers with unit testing, the role of respondents with more than five years of experience was less uniform, and integration testing was more common than unit testing (see Figure 4.4). One explanation for these differences in results could be that experienced professionals are more suited to testing at levels where a broader and deeper understanding of the product and requirements is needed.
The results of the survey were analysed using descriptive statistics, to get an overview of the respondents’ thoughts about test maintenance and to identify trends in the answers.

Figure 4.3: Demographics of survey respondents. All questions and answer alternatives can be seen in Appendix E. Panels: (a) Work role of respondents (categories: Developer, Tester, Architect, DevOps, Data Scientist, Other Testing Related Position, Product Owner, Tech Lead, Management Position). (b) Years of experience with software testing. (c) Most common type of testing performed. (d) Most commonly used programming language (Python, Java, C/C++, Erlang, None); labels show the number of occurrences and the percentage of total occurrences. In panels (a) to (c), the axes show frequency and responses.

4.6 Literature Review of Large Language Models

A literature review of LLMs and their capabilities and applicability was performed to find answers to RQ2. The process of performing this literature review closely mimics the process for the previous literature review, see Section 4.3. A protocol was created with inspiration from Keele [109] in the same fashion as the previous literature review.

The databases that were utilised to search for primary sources were: IEEE, Science Direct, ACM, SCOPUS, and Google Scholar. For the search strings used for each respective database see Appendix B. The search was limited to papers released in the time frame of 2017-2024. This range was chosen based on the first significant developments of LLMs, as defined after consultation with the supervisors at Ericsson. The starting point was based on the release of papers such as Vaswani et al. [37]
and the end date is up to the point where the literature review was conducted.

Figure 4.4: Difference in the work role and type of testing performed for employees with less and more than five years of experience. All questions and answer alternatives can be seen in Appendix E. Panels: (a) Work role of respondents with less than 5 years of experience. (b) Work role of respondents with more than 5 years of experience. (c) Most common type of testing performed for employees with less than 5 years of experience. (d) Most common type of testing performed for employees with more than 5 years of experience. In all panels, the axes show frequency and responses.

The papers were first checked for relevancy through their titles, abstracts, and conclusions. Relevant papers were added to a worksheet for later perusal. The inspection of papers for each search string continued until three consecutive pages of irrelevant papers were identified. In total, 67 papers were chosen for further examination.

After the initial scan ended the papers were examined in more detail. The papers were checked for information regarding how LLMs can interact with code and other artefacts, as well as how suitable LLMs are to help with different software engineering tasks, especially test maintenance tasks.
Papers that were deemed to provide relevant information were distinguished from papers that did not provide relevant information. These most useful papers were then used as staging points for both backwards and forward snowball sampling. Ten papers were used for the snowball sampling. The second selection of papers found through the snowball sampling was treated the same as the papers found through the initial scan. An additional 92 papers were selected for further examination through snowball sampling. Two papers of the second selection were deemed more useful than the rest, for a total of twelve papers that had the most relevant information for this literature review