Systematic Design and Integration of Large Language Model Tools for Engineering Analysis

An investigation, and the development, of large language model tools in analysis engineering

Gabriel Krüger
Johannes Lundahl

Department of Industrial and Material Science
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2025
www.chalmers.se

Master's thesis 2025

© Gabriel Krüger, Johannes Lundahl 2025.

Supervisors: Alejandro Pradas Gómez, Department of Industrial and Material Science; Najeem Muhammed, GKN Aerospace.
Examiner: Ola Isaksson, Department of Industrial and Material Science.

Master's Thesis 2025
Department of Industrial and Material Science
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: (Insert relevant cover image)

Typeset in LaTeX, template by Kyriaki Antoniadou-Plytaria
Printed by Chalmers Reproservice
Gothenburg, Sweden 2025

Developing and evaluating large language models to align with engineering tasks and increase their effectiveness
An investigation, and the development, of large language model tools in analysis engineering
Gabriel Krüger, Johannes Lundahl
Department of Industrial and Material Science
Chalmers University of Technology

Abstract

Generative AI, and more specifically large language models (LLMs), show great promise in further facilitating and streamlining analysis engineering work. In this thesis, the current limitations of such technologies regarding their incorporation in engineering tasks are investigated.
This investigation consists of a comprehensive literature study, in addition to nine interviews with engineers at GKN Aerospace in Trollhättan, Sweden. The results indicate a number of challenges. One is the importance of embedding internal company knowledge when leveraging the capacities of LLMs, while the use of non-local LLMs is linked to considerable issues when it comes to handling sensitive data. With the results of this investigative study as a foundation, an LLM-based tool meant to mitigate these identified issues is developed. The development is done using LangChain, such that the OpenAI API can be used in a Python environment. The developed software is focused on the manipulation and extraction of data in CNS files, a file format containing results from finite element simulations. Different agentic systems are investigated, leveraging different methods for knowledge embedding, such as retrieval augmented generation (RAG), and post-training, such as fine-tuning. The various architectures are evaluated based on their efficiency and accuracy in solving tasks. The results indicate that LLM-based tools have great potential in the field. The top-performing architecture from this testing is incorporated into a sub-graph architecture, for which usability and validation are examined. The results for efficiency, accuracy, usability testing and validation imply considerable potential in leveraging LLMs in the domain. Nevertheless, performance is not perfect, and a number of considerations must be taken into account in such a development. The methods for knowledge embedding and post-training seemingly have a great impact on performance, and more sophisticated approaches within RAG and fine-tuning have the potential to improve it further.
Keywords: AI, Large Language Model (LLM), LangChain, Agent, Multi-Agent, RAG, Fine-Tuning, Knowledge-Based Engineering (KBE)

Acknowledgements

First of all, we would like to express our great appreciation for the help we have received throughout this thesis from our supervisors Alejandro Pradas Gómez and Najeem Muhammed. Their knowledge in fields like academic writing and software development has been essential. Additionally, we would like to thank GKN Aerospace Trollhättan and manager Rikard Nedar for having us at the office during this thesis. A special thanks to all engineers at GKN who participated in our interview and usability testing studies.

Gabriel Krüger & Johannes Lundahl, Gothenburg, April 2025

List of Acronyms

Below, the acronyms used throughout this thesis are listed in alphabetical order:

AI    Artificial Intelligence
API   Application Programming Interface
CAD   Computer Aided Design
CFD   Computational Fluid Dynamics
FEM   Finite Element Method
HITL  Human In The Loop
LLM   Large Language Model
QA    Question & Answer
RAG   Retrieval Augmented Generation
RQ    Research Question

Contents

List of Acronyms
List of Figures
List of Tables
1 Introduction
  1.1 Background
  1.2 Scope
  1.3 Limitations
  1.4 Ethical and Environmental Considerations
  1.5 Prerequisites
    1.5.1 AI agents
      1.5.1.1 Human in the loop (HITL)
    1.5.2 LangChain
    1.5.3 LangSmith
    1.5.4 LangGraph
    1.5.5 Knowledge embedding methods
      1.5.5.1 Entire Context in System Prompt
      1.5.5.2 RAG (Retrieval Augmented Generation)
      1.5.5.3 Fine-tuning
    1.5.6 GKN Dummy Data
    1.5.7 PyCNS Knowledge
    1.5.8 Computational Effort Parameters
2 Method
  2.1 Data Gathering
    2.1.1 Literature Review
      2.1.1.1 Execution
      2.1.1.2 Data Analysis
    2.1.2 Interviews
      2.1.2.1 Execution
      2.1.2.2 Data analysis
  2.2 Software Development Method
    2.2.1 Development Objective
    2.2.2 Development Strategy
    2.2.3 Development Method
  2.3 Evaluation
    2.3.1 Agent 1-4 Evaluation
      2.3.1.1 Accuracy Evaluation
      2.3.1.2 Efficiency Evaluation
    2.3.2 Main Agent Evaluation
  2.4 Usability Testing
3 Results
  3.1 Data Gathering
    3.1.1 Literature Review
    3.1.2 Interviews
    3.1.3 Current Risks and Challenges
    3.1.4 Cumbersome Tasks Pre-Processing
    3.1.5 Cumbersome Tasks Post-Processing
    3.1.6 Factors Hindering Full Automation
    3.1.7 AI Implementation Possibilities
    3.1.8 Risks and Challenges with AI
  3.2 Software Development
    3.2.1 Agent 1: Complete Documentation
    3.2.2 Agent 2: RAG
    3.2.3 Agent 3: Fine-Tuning
    3.2.4 Agent 4: Fine-Tuning & Complete Documentation
    3.2.5 Main Agent
  3.3 Software Evaluation
    3.3.1 Comparison Between Agent 1-4
    3.3.2 Agent 1
    3.3.3 Agent 2
    3.3.4 Agent 3
    3.3.5 Agent 4
    3.3.6 Main Graph Evaluation
  3.4 Usability Testing
4 Validation
5 Discussion
  5.1 Literature Review
  5.2 Interviews
  5.3 The Role of Knowledge Embedding
  5.4 The Role of Post-Training
  5.5 Knowledge Limitations
  5.6 The Rapidly Improving Field of AI
  5.7 Reliability of Results and Sources of Error
  5.8 GKN Usability
  5.9 Answers to the RQs
    5.9.1 RQ1
    5.9.2 RQ2
  5.10 Further Research
6 Conclusion
Bibliography
A Data Gathering
  A.1 Semi-Structured Interview Questions
    A.1.1 Demographic questions
    A.1.2 Current process workflow
    A.1.3 Interviewee experience regarding AI
    A.1.4 Risks/challenges in incorporation of AI
    A.1.5 Potential incorporation of AI in the process
    A.1.6 General finishing questions
    A.1.7 Wrap it up
  A.2 Interview Contract
B Software Development
C Usability Testing
  C.1 Usability Test Interview Questions

List of Figures

1.1 General flowchart for solid mechanics analysis using a tool
1.2 Visualization of nodes and edges in LangGraph
1.3 Simple architecture with all the external knowledge accessed by the LLM
1.4 RAG architecture
1.5 Fine-tuning with a data set given internal knowledge
1.6 P50 and P99 latencies
2.1 Components of the methodology for answering the RQs
2.2 Screening procedure in Scopus for the literature review. Highly inspired by Page et al. (2021)
2.3 Stages for the interviews
2.4 Number of interviews in relation to identified needs (Griffin and Hauser 1993)
2.5 Initial graph structure
2.6 Main agent architecture
2.7 QA evaluation for accuracy
2.8 The branches of system acceptability (Nielsen 1993, p. 25)
2.9 The structure of the usability testing
3.1 Example of a human interrupt block in LangGraph studio
3.2 Graph structure system prompt context
3.3 LangGraph representing agent with RAG
3.4 Main agent with sub-graph
3.5 Comparison of mean accuracy, latency & token usage between agents
3.6 Accuracy and token usage results for the four agents
3.7 Comparison of accuracy and latency performance for Agent 1
3.8 Agent correctness across all runs for Agent 1
3.9 Comparison of accuracy and latency performance for Agent 2a
3.10 Agent correctness heatmap for Agent 2a
3.11 Comparison of accuracy and latency performance for Agent 2b
3.12 Agent correctness heatmap for Agent 2b
3.13 Comparison of accuracy and latency performance for Agent 3 with GPT-4o and GPT-4o mini models
3.14 Agent correctness heatmap for Agent 3 for GPT-4o and GPT-4o mini
3.15 Comparison of accuracy and latency performance using fine-tuning and complete context in system prompt
3.16 Agent correctness heatmap for Agent 4
3.17 Comparison of accuracy and latency performance for Agent 4
3.18 Agent correctness heatmap across multiple experiment runs
4.1 Validation architecture (Sargent 2010)

List of Tables

2.1 Keywords for literature search
2.2 Respondent details, roles, and AI experience
2.3 Development Criteria
2.4 Comparison of Agents across Knowledge Embedding and Post-Training
3.1 Parameters and accuracy results for RAG strategy 1
3.2 Parameters and accuracy results for RAG strategy 2
4.1 The three pillars of a simulation model defined by Sargent (2010) and their corresponding parts in the thesis
4.2 Summary of questions and methodology for the software validation

1 Introduction

The use of AI has, over the course of the last years, shown great promise in further optimizing work procedures in a number of engineering domains. AI is used during R&D phases within fields like the automotive and aerospace industry, for example in the design stage, where certain CAD software integrates AI features (Financial Times 2024). Classic simulation methods, like finite element method (FEM) and computational fluid dynamics (CFD) tools, are still significantly more common in engineering than AI and machine learning (ML) techniques (Ragani et al. 2023).
Nevertheless, it is considered that the incorporation of AI methods can further streamline these "classic" simulation workflows within the pre- and post-processing stages of data. Consequently, this study investigates how specific analysis engineering procedures can be further streamlined with the help of AI tools, and what the corresponding challenges are.

1.1 Background

GKN Aerospace is a leading company in the aviation industry, collaborating with the world's top manufacturers of aircraft and engine components. The Aero Engines division in Trollhättan, Sweden, specializes in high-performance engine components and has facilities and engineering teams in several places all over the world. Here, advanced parts for aircraft and rocket engines are developed and manufactured, along with engine maintenance services (GKN-Aerospace 2024).

In recent years, AI, and particularly generative AI, has advanced significantly. Due to extensive research in this field, generative AI now supports various applications within the tech sector, including code generation, product design and content marketing (Fauscette 2023). Central to these applications are large language models (LLMs), a key component of generative AI. LLMs are advanced deep neural networks designed to process and generate content such as text, images and videos. These models are built on a transformer architecture, which is particularly effective at understanding the context within sequences of data (IBM 2023). This architecture enables LLMs to capture meaning more effectively than previous types of neural networks like CNNs or RNNs (NVIDIA 2023). LLMs have thus become essential tools for many tech companies and their employees, enabling things like automation of processes and improved product development through advanced analytics (Kietzmann and Park 2024).

To facilitate and optimize the work of employees at GKN, the potential of LLMs is currently being explored as a support tool in their engineering tasks, particularly for engineers in the solid mechanics department. Previously, at this specific department, LLMs have been tested as tools for assisting in report writing. This was done in a study titled "Supporting the Generation of Engineering Analysis Reports with Large Language Models", written by D. Söderqvist and F. Mare (Söderqvist and Mare 2024). Additionally, the use of LLMs has been investigated at GKN for the preparation of CAD geometries for FE analysis (Naik 2024).

Consequently, there is an interest at GKN in how LLMs can be further deployed within analysis engineering for the streamlining of work procedures. In this thesis, it is investigated how LLMs can aid over the course of engineering analysis linked to simulation processes. An engineering simulation involving a geometry generally requires pre-processing of the geometry with corresponding loads and material properties, such that a run script can be set up. After this, the simulation tool can be run, producing a results file containing simulation data. Subsequently, this data must be analyzed and manipulated. This workflow is illustrated in Figure 1.1.

Figure 1.1: General flowchart for solid mechanics analysis using a tool

At the solid mechanics department, this can be a simulation process through an FE analysis tool. Over the course of this procedure, data must be processed before the simulation can be run as well as after its completion. Therefore, the task of an analysis engineer is to both pre- and post-process data. The incorporation of AI, and more specifically LLMs, in this workflow can be achieved at various stages, as indicated by previous studies at GKN. The foundation of the thesis is therefore to investigate the further incorporation of LLMs in the analysis flow.
This is done with a number of research questions, acting as guidelines for the methodology, presented later in this chapter.

1.2 Scope

Given the desire to further investigate the incorporation of LLMs in engineering tasks at GKN, two research questions (RQs) linked to this are defined. These act as a guide for the research methodology, such that a well-structured investigation can be carried out. The two RQs are:

1. What are the practical limitations for analysis engineers regarding adaptation of generative AI technologies in their engineering tasks?
2. How can an LLM-based tool be systematically designed and evaluated to mitigate the identified limitations?

1.3 Limitations

For this thesis, a number of limitations are considered. These limitations are of a time, data security and financial nature. Accordingly, these limitations are as follows:

• Time: The project is limited to a period of about 20 weeks.
• Data Security: No confidential GKN data is considered during the research. This means that development is based on generic data, together with non-confidential insights from GKN engineers.
• Financial: A budget cap of approximately $1300 is given for utilizing cloud-based LLMs.

1.4 Ethical and Environmental Considerations

A number of ethical and environmental considerations must be taken into account during the development, and use of, generative AI tools. For example, the current development of LLMs has been criticized. In March 2023, the Future of Life Institute published the open letter "Pause Giant AI Experiments", signed by a number of well-known persons within the domain of AI research and governance. It pointed to potential LLMs more capable than OpenAI's GPT-4 model and called for a minimum six-month pause in the development and training of such models, such that AI safety and governance measures could be implemented (Mollen 2025).
The signatories expressed their concern, and questions treating topics like "nonhuman minds that might eventually outnumber, outsmart, obsolete and replace us?" and "let machines flood our information channels with propaganda and untruth?" were brought forward (Future of Life Institute 2023). While the scope of this thesis does not concern the development of an LLM, but rather the development of an LLM-based tool using existing models, it is important to keep the general ethical and existential questions with respect to LLMs in mind.

There are also environmental considerations that must be taken into account during the development, but also the use, of LLMs. LLMs have a significant carbon footprint, one reason being the heavy GPU usage during training (Faiz et al. 2023). For the data centers associated with LLMs, there are numerous factors that determine the environmental impact. Data center energy usage, with the corresponding share of carbon-free energy, as well as the embodied carbon footprint of the hardware, must be taken into account (Faiz et al. 2023). Over its lifecycle, the carbon footprint of an LLM caused by inference is larger than that caused by its initial training (Fu et al. 2024). The carbon emissions for GPT-3 caused by training are estimated at 502 tCO2 (Patterson et al. 2021). Furthermore, for GPT-3 the energy demand has been estimated to be between 0.002 and 0.005 kWh/query, and with an estimated US carbon intensity of 367 gCO2/kWh the emissions per request have been approximated to 1.5 g CO2 (Vanderbauwhede 2024). With millions of requests every day, this highlights the energy needs, and therefore the emissions, of LLMs not only for training but also for daily use. Consequently, for large-scale LLM use the environmental aspect must be taken into account.

The question of ethics is not only important with respect to the subjects the thesis treats, but also with respect to the applied research methodology.
For example, a number of interviews are conducted with engineers at GKN. Ethical considerations within fields such as privacy, sensitive information and correct citation must be taken into account. Storage of recorded interviews must be in line with the GDPR, which states that data enabling personal identification may only be stored for as long as the specific purpose dictates (GDPR.eu 2025). Furthermore, it should not be possible to determine an interviewee's identity based on the content of the report. Additionally, it is of importance that interviews are conducted such that subjects are not led to disclose irrelevant personal information or sensitive company information.

1.5 Prerequisites

Before presenting the thesis methodology, it is necessary to present a number of prerequisites. These act as the foundation for how the identified issues could be mitigated, and how the desired functionality could be implemented.

1.5.1 AI agents

For tools with AI functionality incorporated, agentic AI systems can ensure that systems or programs within the tool can perform specific tasks through, for example, calling external tools (IBM 2024a). The task of an agent may vary, and it can range from analyzing written texts for grammatical errors to writing computer code. Therefore, there are AI agents of different natures. For example, a so-called "simple reflex agent" does not interact with other AI agents and is pre-programmed such that it performs a specific action given a specific pre-condition (IBM 2024a). There are more sophisticated types of AI agents, and some can determine a specific order in which to implement certain actions given a specific goal and an optimization algorithm (IBM 2024a). The use of AI agents therefore has the goal of optimizing LLM processes, by creating more complex and autonomous systems that themselves can make independent decisions and leverage external tools.
Additionally, an AI agent can reflect on the user input and plan what to do next based on its available tools (IBM 2024a). These tools can be, for instance, code-execution tools, web-search tools or external API-calling tools. The agent can also reflect on its own mistakes and correct those without any human interference. However, in order to make agents more reliable, it is sometimes useful to make use of active human interaction through the incorporation of human in the loop (HITL) aspects (IBM 2024a).

1.5.1.1 Human in the loop (HITL)

Due to the fact that AI agents are partly or fully autonomous systems, there is a risk that they might make unwanted or even harmful decisions. To address this, it is often useful to implement HITL mechanisms that allow a human to intervene, that is, to halt or modify the agent's actions before it proceeds. This approach not only helps to maintain the reliability and accuracy of the responses, but also prevents issues such as the LLM entering infinite loops (IBM 2024a).

1.5.2 LangChain

LangChain is an open-source framework designed to simplify the development of applications that leverage LLMs. It provides a standardized and modular approach to integrating LLMs with external data sources, APIs, and computational tools, enabling the creation of more advanced AI-driven workflows beyond basic API calls (Balasubramaniam et al. 2024). By offering a flexible architecture, LangChain supports RAG, memory management, and tool integration, making it particularly valuable for tailored applications requiring contextual awareness and reasoning. LangChain enables developers to design applications that integrate LLM capabilities with real-world operations, and has gained significant adoption in the AI community, establishing itself as a popular tool for building generative AI applications (Balasubramaniam et al. 2024).
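The agent behavior and HITL gating described in section 1.5.1 can be illustrated with a minimal, framework-free sketch. All names below (the planning stub, the tool table, the approval callback) are invented for illustration and stand in for a real LLM call and real tools; this is not the API of any particular framework.

```python
# Minimal sketch of an agent loop with a human-in-the-loop (HITL) gate.
# stub_llm_plan stands in for an LLM deciding which tool to call next.

def stub_llm_plan(task: str) -> dict:
    """Stand-in for an LLM planning step; always proposes a file deletion."""
    return {"tool": "delete_file", "args": {"path": "results.cns"}}

TOOLS = {
    "delete_file": lambda path: f"deleted {path}",
}

def run_agent(task: str, approve) -> str:
    action = stub_llm_plan(task)
    # HITL gate: a human (here, a callback) must approve before execution.
    if not approve(action):
        return "action rejected by human reviewer"
    tool = TOOLS[action["tool"]]
    return tool(**action["args"])

# A cautious reviewer policy that rejects destructive actions:
result = run_agent("clean up workspace",
                   approve=lambda a: a["tool"] != "delete_file")
print(result)  # action rejected by human reviewer
```

The point of the sketch is that the approval step sits between planning and execution, so harmful or unwanted actions can be halted before they take effect, which is exactly the role HITL mechanisms play in the agent architectures discussed above.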
1.5.3 LangSmith

LangSmith is an observability platform designed to enhance the development and maintenance of LLM-based applications. Created by the developers of LangChain, it provides essential tools for debugging, monitoring, testing, and optimizing AI systems, improving both reliability and interpretability (Balasubramaniam et al. 2024). The debugging and tracing capabilities enable developers to track input-output interactions and analyze intermediate steps. Additionally, the available testing and evaluation tools facilitate benchmarking across various use cases, ensuring robust and well-optimized LLM applications (Balasubramaniam et al. 2024).

1.5.4 LangGraph

LangGraph is an open-source library designed for building stateful, multi-actor applications with LLMs, allowing for the creation of agent and multi-agent workflows. It offers control over both application flow and state while integrating seamlessly with LangChain and LangSmith (LangChain Inc. 2024a). LangGraph uses nodes and edges to model functionality and behavior, effectively linking the various structural components of an agent. The nodes are simply Python functions describing the behavior of each part of the agent, while edges enable the routing logic between the nodes (LangChain Inc. 2024c). There are two types of edges. The normal edge, where the routing is defined in one direction only, is represented by a solid line. The conditional edge, on the other hand, can take more than one path depending on the defined condition, and is represented by dotted lines. These are illustrated in Figure 1.2. Important features of LangGraph that have been relevant and widely used in this project are memory persistence, HITL and centralized state management, which keeps track of global state updates (LangChain Inc. 2024a). Another key feature of LangGraph is LangGraph Studio.
LangGraph Studio allows users to visually deploy graphs using the LangGraph API and also supports quick debugging (LangChain Inc. 2024b). In Figure 1.2, a simple visualization of a graph in LangGraph is shown.

Figure 1.2: Visualization of nodes and edges in LangGraph.

1.5.5 Knowledge embedding methods

The use of LLMs is associated with hallucination, for example because the model has not been trained on certain specific data. Consequently, when using LLMs in fields like engineering, there is often a need to embed internal company knowledge: knowledge that is external to the LLM, since the model has not been trained on it. There are different ways to embed such external knowledge within an LLM architecture. Three different ways are encountered in the thesis methodology, and they are presented here.

1.5.5.1 Entire Context in System Prompt

When using an LLM through an API, one way to embed external knowledge is to state it explicitly in the system prompt of the LLM. How much can be incorporated in the system prompt depends on the context length of the model; if the inserted information exceeds the context length, not all of it will be considered. For example, GPT-4o has a context length of 128 000 tokens (OpenAI 2024a). In such an architecture, the model has access to the same specific external knowledge regardless of the prompt defined by the user. This is illustrated in Figure 1.3.

Figure 1.3: Simple architecture with all the external knowledge accessed by the LLM

1.5.5.2 RAG (Retrieval Augmented Generation)

As opposed to providing the entire context in the system prompt, the RAG architecture is somewhat more sophisticated in nature. Its foundation is the splitting of documents and, given a user question, retrieval from certain relevant parts rather than the entire documentation.
The information from the knowledge base can be split into several smaller segments through chunking, where the chunk size sets the size of the data splits; while a large chunk size may ensure that the data points are coherent, the risk is that they become too general, whereas a small chunk size risks the loss of coherence within the data points (IBM 2024c). There is no exact answer for what a good chunk size is, and it depends on the type of documentation at hand. If a chunk contains an excessive amount of data, there is a risk that specific knowledge is overshadowed by other topics, whereas a risk of context loss is associated with a too small chunk size (Stack Overflow 2024). Additionally, when chunking a document, it may be split at points where relevant specific knowledge is divided into two different chunks instead of being kept together in one chunk. To mitigate this, documents can be split into chunks with a certain overlap, ensuring no abrupt borders between chunks but rather an overlap between chunk data (MongoDB 2024).

After chunking the knowledge base, the text data can be transformed through an embedding model such that it is vectorized. Embedding is a method for representing, for example, text in a numerical way that allows the data to be used as input to machine learning algorithms, since these in general require low-dimensional numerical input (IBM 2025b). As a result, this can improve the performance of the LLM. There are different embedding models; one example is the OpenAI method used in ChatGPT, which allows the model to comprehend how words and categories are linked, as opposed to evaluating each word separately, allowing for improved responses (IBM 2025b). Accordingly, the knowledge base can be transformed into a vector database where all the text is represented through vector embeddings.
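The chunking-with-overlap step described above, together with the similarity-based retrieval it feeds into, can be sketched in plain Python. The word-count "embedding" below is a deliberately naive stand-in for a real embedding model, and the chunk sizes and document text are arbitrary illustrative values, not the parameters used in the thesis.

```python
from collections import Counter
import math

def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size character chunks with a given overlap,
    so content near a chunk boundary appears in both neighbouring chunks."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector. A real RAG pipeline
    would call an embedding model here instead."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """kNN-style retrieval: return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

doc = "combine_cns merges CNS objects. extract_node reads node results. " * 3
chunks = chunk_text(doc, chunk_size=60, overlap=15)
top = retrieve("how do I merge cns objects", chunks, k=1)
```

Only the retrieved chunks, not the whole document, would then be inserted into the LLM prompt, which is the core difference between RAG and the entire-context approach of the previous subsection.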
As opposed to traditional search, where results are explicitly retrieved based on the input word, vector databases enable search based on similarity (IBM 2025a). This means that for an input prompt asking a question regarding the word "smartphone", the traditional search retrieves passages explicitly containing the word "smartphone", whereas a retriever from a vector database would also yield results containing similar words such as "cellphone" (IBM 2025a). There are different algorithms for the actual retrieval. Some algorithms are based on k Nearest Neighbours (kNN) methods, such that the retriever searches for the k vectors that are deemed to be mathematically closest to the defined query vector (Sawarkar, Mangal, and Solanki 2024). A RAG architecture with text splitting, embedding and retrieval is shown in Figure 1.4.

Figure 1.4: RAG architecture

1.5.5.3 Fine-tuning

Fine-tuning, in comparison to the previously mentioned knowledge embedding methods, is a technique for adapting tailored knowledge through additional model training on an already pre-trained neural network. However, a pre-trained model might contain millions or even billions of trained parameters. Changing the weights of all parameters is time-consuming and expensive, and also runs a high risk of overfitting the network (IBM 2024b). Instead, fine-tuning makes use of the already pre-trained parameters and changes are only made to a few parameter weights in the model. This retains the robustness and complexity of the pre-trained model while leveraging customization to a specific use case. Fine-tuning is commonly used when customizing neural networks with a large number of parameters, such as LLMs and computer vision models.
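The kNN-style retrieval described above can be sketched in a few lines. The hand-picked two-dimensional vectors are stand-ins for real embeddings (which typically have hundreds or thousands of dimensions), and cosine similarity is one common, but not the only, closeness measure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def knn_retrieve(query_vec, database, k):
    """Return the k chunk ids whose embeddings are closest to the query,
    measured by cosine similarity (higher = closer)."""
    ranked = sorted(database.items(),
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]

# Toy 2-d "embeddings": "smartphone" and "cellphone" point the same way,
# mirroring the similarity-based retrieval example in the text.
database = {
    "smartphone_chunk": [0.9, 0.1],
    "cellphone_chunk":  [0.8, 0.2],
    "engine_chunk":     [0.1, 0.9],
}
top = knn_retrieve([1.0, 0.0], database, k=2)
assert top == ["smartphone_chunk", "cellphone_chunk"]
```

The point of the toy example is that the "cellphone" chunk is retrieved even though it never contains the query word, which is precisely what distinguishes vector search from traditional keyword search.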
Examples of fine-tuning use cases include developing specialized LLMs, such as those designed for code generation or for replicating specific tonalities and writing styles (IBM 2024b). The foundation of fine-tuning, with a base model yielded through an enormous amount of general pre-training data and a fine-tuned model yielded through specifically curated fine-tuning data, is shown in Figure 1.5.

Figure 1.5: Fine-tuning with a data set given internal knowledge

1.5.6 GKN Dummy Data

As presented in section 1.3, no confidential data is used in this project due to GKN's security policies. Consequently, GKN provided a dummy structural analysis case. The structure of this dummy data was created from a real case, ensuring that it aligns with real-case analysis while being significantly more compact, with a highly simplified geometry. Although the dataset contains less information and only non-sensitive data, its structure and file formats closely emulate real-world data and structure. The file structure of the dummy analysis case is presented in ??.

1.5.7 PyCNS Knowledge

PyCNS is an internal Python module used at GKN for extracting and manipulating data in CNS files, which are the result of an FE simulation. The module is used in the thesis, and both the source code and an existing user manual were provided. The available documentation was therefore both of a more descriptive nature and of a pure source code nature. An example of how the source code knowledge was given is displayed below. This is meant to show the structure, since some of it has been redacted in order to not display the entire source code.

def combine_cns(inputs) -> CNS:
    """Combine multiple CNS objects into a single CNS object.

    Short description of functionality and use case.

    Example:
        new_cns = combine_cns([cns1, cns2, cns3])

    Args:
        inputs: A list of CNS objects to be combined.

    Returns:
        CNS: A combined CNS object.
    """
    # ...

    new_cns = ...
    # Redacted combination logic

    return new_cns

Listing 1.1: Redacted combine_cns function

In contrast, the internal PyCNS manual did not contain any source code for the functions, but rather short descriptions and examples in a PDF. A somewhat redacted extract from it, for combine_cns, is shown below.

Extract from Internal Manual 1.1

pycns.combine_cns

combine_cns(inputs) → CNS

Function to combine multiple CNS objects into a single CNS object. Short description of functionality and use case.

Example:
>>> large_cns = combine_cns([cns_list])

Parameters:
• Information about inputs

Returns:
• A pycns.CNS object.

Return type: CNS

1.5.8 Computational Effort Parameters

When working with an LLM-based tool, it is important to measure the software's efficiency. This can be evaluated based on both the computational effort and the computational time required to complete specific tasks. Computational time can be measured based on the latency associated with a given task, while computational effort can be evaluated by analyzing the number of tokens required to generate the solution. When evaluating the performance on a set of questions, two key latency metrics can be used. The P50 latency, also known as the median latency, measures the response time at the 50th percentile, meaning that half of the responses are faster and half are slower than this value. This metric provides an estimate of a typical response time, giving insight into how quickly an agent generally responds under normal conditions. The P99 latency, on the other hand, represents the 99th percentile latency. This value indicates that 99 percent of all responses were completed in less time, while the slowest 1 percent exceed this threshold. The P99 latency is particularly useful for assessing worst-case scenarios, ensuring that the system does not experience excessive delays in response generation. This concept is displayed in Figure 1.6.
Figure 1.6: P50 and P99 latencies

As mentioned above, the computational effort of an LLM can be measured through the number of tokens that are needed to satisfactorily answer a specific request. In natural language processing, text is divided into a certain number of tokens, such that the question prompt defined by a user to an LLM is split into different sections (Microsoft Research 2023). In other words, the input is tokenized. This means that the input text specified by the user, as well as the output text generated by the LLM, corresponds to a number of tokens. For OpenAI's models, a general rule of thumb is that one token can be approximated to four characters of text (OpenAI 2024c). Consequently, an interaction with an LLM where the user has to correct the LLM and provide more context before receiving an acceptable answer will require more tokens, both as input and as output, and will consequently be more computationally demanding than a correct answer yielded directly from a well-written prompt. For a user, there is also a financial aspect to the number of tokens used. OpenAI API pricing is based on the number of tokens, such that GPT-4o has a cost of $2.50 / 1 M input tokens and $10 / 1 M output tokens (OpenAI 2024b).

2 Method

In this chapter, the methodology for answering the two RQs is presented. RQ1, namely what the practical limitations for analysis engineers regarding the adaptation of generative AI are, is answered through data gathering with the help of a literature review and an interview process conducted with engineers at GKN. The result for RQ1 serves as a base for the methodology when it comes to answering RQ2, namely how an LLM-based tool can be systematically designed given the limitations and challenges identified in the RQ1 result. This consists of the software development of an LLM-based tool and a corresponding evaluation of it. The methodology workflow is displayed in Figure 2.1.
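The efficiency metrics of section 1.5.8 can be sketched in code. Both the nearest-rank percentile and the four-characters-per-token approximation are simplifications; real token counts come from the model's tokenizer, and the dollar figures are the GPT-4o prices quoted above.

```python
import math

def latency_percentile(latencies_ms, p):
    """Nearest-rank percentile: the value below which roughly p percent of
    measured latencies fall (P50 = median, P99 = worst-case indicator)."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Rough cost estimate for a GPT-4o call, using the figures quoted above:
# ~4 characters per token, $2.50 / 1M input tokens, $10 / 1M output tokens.
CHARS_PER_TOKEN = 4
INPUT_USD_PER_TOKEN = 2.50 / 1_000_000
OUTPUT_USD_PER_TOKEN = 10.00 / 1_000_000

def estimate_cost(input_text: str, output_text: str) -> float:
    """Approximate the dollar cost of one LLM interaction."""
    in_tokens = len(input_text) / CHARS_PER_TOKEN
    out_tokens = len(output_text) / CHARS_PER_TOKEN
    return in_tokens * INPUT_USD_PER_TOKEN + out_tokens * OUTPUT_USD_PER_TOKEN

# A follow-up correction repeats the conversation, so tokens (and cost) grow:
first_try = estimate_cost("question " * 100, "answer " * 100)
with_correction = estimate_cost("question " * 250, "answer " * 200)
assert with_correction > first_try
```

The last two lines mirror the point made in the text: an interaction that needs corrective follow-ups consumes more input and output tokens, and is therefore both slower and more expensive than a well-written prompt answered correctly on the first attempt.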
Figure 2.1: Components of the methodology for answering the RQs

2.1 Data Gathering

This section presents the Data Gathering process, which primarily consists of a literature review and interviews. For completeness, the previously provided dummy data is also incorporated into the flowchart in Figure 2.1.

2.1.1 Literature Review

A literature review was performed in order to gain a greater understanding of the use of generative AI in engineering, and of what the potential challenges in its use may be. To ensure a fruitful and well-structured literature review, the methodology was divided into three main stages: Identification, Screening and Review. This methodology is modeled on Page et al. (2021), where an initial number of screened records is successively reduced such that a number of relevant and high-quality papers are selected for the final analysis. The methodology for this process is further described in this section.

2.1.1.1 Execution

The basis for the literature review was RQ1, namely what the practical limitations are for the use of generative AI in an engineering context. In order to perform a structured literature review that is as repeatable as possible, a number of keywords were considered. These keywords were classified into three major categories: AI Concept, Application and Embedded Knowledge. The motivation for this was that it facilitated the generation of keywords during the thought process, but also structured the later database searching using Boolean operators. The idea was that articles containing at least one keyword from each class were desired. This means that the OR operator could be used within a class, and the AND operator between classes. The corresponding keywords are presented in Table 2.1.
AI Concept: Large Language Model (LLM), Generative AI
Application: Design Process, Knowledge Management, Systems Engineering, Mechanics, Complex Systems, Requirements Engineering, Finite Element Method
Embedded Knowledge: Fine-tuning, Coupling, Role-specific, Knowledge-based, In-house

Table 2.1: Keywords for literature search

An initial search in Scopus with these keywords yielded a total of 97 results, and while some of these were deemed to be relevant, another search was performed such that more records could be screened. The decision was made to only search for the AI Concept and Application keywords. The screening procedure for the literature review is shown in Figure 2.2. This method is highly influenced by The PRISMA 2020 statement: an updated guideline for reporting systematic reviews (Page et al. 2021). This meant that the literature review was divided into three stages: Identification, where keywords were chosen and further refined after an initial search in Scopus; Screening, where records were excluded based on their titles, abstracts, judged overall quality and number of citations; and lastly Review, during which the records included for the literature review were analyzed.

Figure 2.2: Screening procedure in Scopus for the literature review. Highly inspired by Page et al. (2021)

The first keyword search in Scopus led to 1602 results. This was further refined in the Scopus filter by selecting Engineering and Computer Science as subject areas, English as language, and selecting the keywords language model, large language model, knowledge management, requirement engineering and knowledge engineering. This led to 784 results in Scopus. An initial screening was performed by simply reading the titles. Titles that were deemed relevant with respect to the RQs were of interest. Consequently, titles indicating studies in fields like medicine and linguistics were ignored. This left 39 records for further screening.
For each of these 39 records, the abstract was studied. Once again, records deemed relevant with respect to the RQs were selected for further screening. This meant a total of 21 records. These records were subject to the final screening. Here, the entire text was analyzed such that its quality and relevance could be judged. The number of citations was also taken into account, which means that articles ideally should have been deemed relevant and of high quality after an analysis of the text, and be well cited. A total of 7 records were chosen from Scopus for the literature review. In order to increase this number, snowballing together with a limited keyword search was performed on Arxiv.org. This led to 11 articles in total for the literature review.

2.1.1.2 Data Analysis

For the data analysis stage of the literature review, a content analysis was performed for each chosen article. The articles were first read such that an initial understanding of research questions, methods and results could be obtained. After this, the articles were read through again and sections that were deemed relevant with respect to the defined RQs were highlighted. After having performed this for all the selected articles, these highlighted sections were divided into different themes. The thematic analysis was performed as presented by Säfsten and Gustavsson (2020). The themes obtained from the thematic analysis included, for example, Lack of in-house knowledge. For each theme, the findings from the articles were summarized in a coherent text. These are presented in section ??.

2.1.2 Interviews

To establish a further understanding of the potential and challenges of generative AI in engineering, a number of interviews were performed. The advantages of interviews include flexibility and the ability to tailor questions to specific cases (Säfsten and Gustavsson 2020).
In this case, the focus was to collect relevant data in order to answer RQ1. However, there are also disadvantages that must be taken into consideration. One such disadvantage is the fact that a poorly chosen respondent could lead to misleading results (Säfsten and Gustavsson 2020). Nevertheless, since interviews enable direct contact with respondents who may hold key answers regarding the research questions, the method was deemed appropriate. The interview process contained two main stages: Execution and Data Analysis, illustrated in Figure 2.3. The methodology for these is presented in the following sections.

Figure 2.3: Stages for the interviews

2.1.2.1 Execution

The respondents of interest were engineers at GKN Aerospace who have, or have previously had, experience in simulation-based working processes. Ideally, the respondents should have diversified experience in the topic of AI and LLMs, and in incorporating generative AI in their work. Nevertheless, a variation in the type of respondents was desired, since this counteracts biased results and generally widens the perspective in the answers. The interviewees were chosen with help from the supervisor in accordance with the stated requirements. A list of the respondents can be seen in Table 2.2. Some interviews were performed in person at GKN Aerospace in Trollhättan and some were performed digitally.

Semi-structured interviews were performed. This format is a mix between the completely structured interview and the unstructured interview: whereas the structured interview is governed by fixed questions, and the unstructured interview is very open with more of an overall theme guiding the discussion, the semi-structured interview is an in-between form (Säfsten and Gustavsson 2020).
The primary reason for conducting semi-structured interviews was the variability in the respondents' fields of expertise, allowing for dynamic adaptation of the questions during the interview. Furthermore, the limited prior knowledge of certain processes was also a strong reason to perform semi-structured interviews. Consequently, the interview could be adapted to the prior knowledge of the topic that the interviewee possesses. The questions for these interviews are presented in section A.1.

In a study by Griffin and Hauser (1993), the number of interviews needed to identify a given number of customer needs was investigated. The data was collected and analyzed by professionals in the field. The study revealed that the added value plateaued after approximately five to six interviews, see Figure 2.4. Therefore, a number of nine interviews, each lasting approximately 25 to 30 minutes, was deemed appropriate for this project.

Figure 2.4: Number of interviews in relation to identified needs (Griffin and Hauser 1993)

Nevertheless, the question remained of how these nine interviewees were to be selected. Therefore, Sampling in design research: eight key considerations by Cash et al. (2022) was used as a foundation for the choice of interviewees. Since the LLM-based tool was going to be developed for a GKN-specific case, the sampling was performed within a set of GKN engineers. The consideration Design framing: what type of impact on practice do you hope to achieve? presented by Cash et al. (2022) emphasizes the link between the conducted design research and practice, and highlights its importance. Another consideration presented by Cash et al. (2022) is Theoretical framing: where in the theory-building/theory testing research cycle is current knowledge?, where relevant domain knowledge within a population must be considered. Interviewees were therefore to be selected based on their knowledge
and experience within analysis engineering, but also knowledge and experience within AI. Knowledge connected to both of these domains was highly desired with respect to the RQs. These two factors were therefore deemed to provide a solid foundation for gathering relevant existing knowledge within the sample, and for gaining an understanding of the possible impact on current analysis practices that the thesis could yield.

The profiles of the nine interviewees are shown in Table 2.2. The duration of the interviews was deemed sufficient to allow for the collection of comprehensive information while ensuring the project remained manageable within the given time-frame. The interviews were recorded using a SONY ICD recorder, and the recordings were then transcribed using an offline, locally stored transcription software. This approach ensured that the content remained secure and was not publicly accessible under any circumstances, in accordance with the aspect of personal information discussed in section 1.4. Furthermore, the interviews were conducted in English to streamline the transcription process, as the transcription tool used was better optimized for English speech recognition. After the automatic transcription was done, the content was manually revised to entirely comply with the interview recordings.

With the aim of working in accordance with the policies and statutes of GKN regarding data security, an interview contract was made. Before the interview, the contract was provided to the respondent for signature. The contract contained information about how the recordings were to be managed and served as an additional safeguard between the interviewer and the respondent. This was deemed to be in line with one of the considerations presented by Cash et al. (2022), namely Good scientific conduct and ethical appropriateness. The contract can be seen in section A.2.

ID | Title | Role | Site | AI Exp.
1 | Research engineer | Research in digitalization and automation. Working with AI dev. | Sweden | Very High
2 | Eng. method specialist | Assessing fatigue life in manufacturing; specialist in crackprop. | Sweden | Low
3 | Analysis engineer | Analysis lead, supporting hardware design in INC and Pratt & Whitney projects. | Sweden | High
4 | Analysis engineer | Solid mechanics analysis and simulations in ANSYS. | Sweden | Low/Medium
5 | Analysis engineer | Solid mechanics analysis and simulations in ANSYS. | Sweden | Low/Medium
6 | Structural mech. engineer | Simulation and analysis consultant in finite element analysis. | Sweden | Medium
7 | Analysis lead | Leads structural engineering team in India and senior engineers at GKN. | Sweden | Low
8 | Design engineer | Automating design processes and methods used in the company. | Netherlands | Low
9 | Eng. team leader | Working within design principles. | Sweden | Low

Table 2.2: Respondent details, roles, and AI experience.

2.1.2.2 Data analysis

Having transcribed the interviews, a thematic analysis of the interviews was performed. A thematic analysis aims to identify recurring themes within the interview data. These themes represent patterns or insights that may be relevant answers to the RQs (Säfsten and Gustavsson 2020). In order to find themes, it is useful to assign codes to the data. These codes were extracted using a combination of inductive and deductive coding (Fereday and Muir-Cochrane 2006). Hence, a code was assigned to parts of the data deemed to be valuable in answering the RQs, and the codes were then paired and divided into labeled themes. These themes, in contrast to a pure inductive coding strategy, might be defined in advance in accordance with a deductive coding strategy. The themes are in turn linked to a final theory (Säfsten and Gustavsson 2020). Therefore, the transcribed interviews were read through and relevant comments and responses were highlighted.
These highlighted sections represented codes, and from these codes a number of themes were defined. Each code was connected to a corresponding theme, and some codes were connected to more than one theme. These themes are presented in section 3.1.2.

2.2 Software Development Method

After having finished the interview and literature review stages, intended to answer RQ1, the development of an LLM-based tool could be started. This development was intended to answer RQ2, with the results from RQ1 as a foundation. In this section, the development stages are presented: more precisely, how the desired functionality of the LLM-based tool was decided, how an initial strategy was outlined and how the tool was actually developed.

2.2.1 Development Objective

RQ2 considered how an LLM-based tool could be developed and evaluated, given the obtained results from RQ1. These results, presented in chapter 3, highlighted various issues connected to the use of generative AI tools, both from a broad industry perspective and with respect to analysis engineering practices at GKN. While it would have been preferable to develop an LLM-based tool that could mitigate all these issues and be implemented in the entire analysis workflow displayed in Figure 1.1, this was deemed to be too comprehensive given the thesis timeline. Therefore, the key takeaways from the data gathering were used as a foundation for deciding the development objective. The interview study indicated that the post-processing stage of the analysis was of a more cumbersome nature than the pre-processing stage. For example, interviewee 7 emphasized the time-consuming aspect of the post-processing of data in subsection 3.1.5. This meant, among other things, identifying relevant data and extracting the relevant parameters to present. Furthermore, the potential of an LLM when it comes to dealing with extensive result data and summarizing key parts was mentioned by interviewee 8 in subsection 3.1.7.
At GKN, there are a number of internal Python modules for facilitating the analysis and manipulation of certain file formats that are the result of an FE simulation. These modules thus already aim to simplify the cumbersome post-processing stage. Nevertheless, some engineers may be less experienced when it comes to Python. This Python learning curve was outlined by interviewee 6 in subsection 3.1.7. As a consequence, the combination of existing Python scripts for automation and an LLM-based tool to make these more accessible was considered. Interviewee 8 discussed this possible integration in subsection 3.1.7. Such Python modules exist for the analysis and manipulation of CNS, CDB and UNV files, among others. Together with the GKN supervisors, a discussion regarding an appropriate integration between these internal Python modules and an LLM-based tool was held. Based on their knowledge within the area, it was decided to focus initially on the PyCNS module, a module for the analysis and manipulation of CNS files. The PyCNS module was chosen over other existing internal modules since the overall structure of an LLM-based tool would remain largely consistent regardless of the module used. Given that the CNS module is the most comprehensive, it was selected as the starting point. This choice ensured that any future transitions or integrations with other modules could be managed smoothly without significant structural changes. An LLM-based tool tailored for this module was deemed to facilitate the post-processing stage of an analysis and to lower the threshold for the use of available internal Python modules. A number of criteria for such an LLM-based tool, based on the results from RQ1, were defined. This was deemed to further structure and facilitate the development phase. These are shown below.

Criteria
1. Able to manipulate or analyze CNS files using the pycns module.
2. Avoidance of a black-box architecture through incorporation of human knowledge in the process.
3. Prevention of hallucination and erroneous output.
4. Output of well-structured and well-formatted answers that are easy to follow.
5. Robust and user-friendly.

Table 2.3: Development Criteria

The first criterion concerned the successful integration of the PyCNS module into the LLM-based tool. The second criterion was meant to ensure the consideration of several findings from the interviews and the literature review. Interviewee 7 commented, in subsection 3.1.7, on the need for transparency in a potential LLM-based tool, such that the engineer could easily follow computations. Additionally, this was meant to mitigate the theme Over-reliance on AI tools, which was identified in the literature study and presented in subsection 3.1.1. The third criterion was stated based on hallucination, a risk associated with LLMs. This risk was outlined from the theme Hallucination/Low quality output identified in the literature review, presented in subsection 3.1.1. Additionally, this was linked to the interview process, where interviewee 4 in subsection 3.1.3 commented on the poor-quality answers that available models like Copilot and ChatGPT yield when asked theoretical questions connected to analysis engineering. The fourth criterion was meant to reduce confusion and the time-consuming aspects of trying to identify parts of the output. The fifth, and final, criterion was meant to secure the development of a tool that was easy to use and reliable enough to justify its use.

2.2.2 Development Strategy

With the development objective known, an initial development strategy was defined. While it was known that an LLM-based tool focused on the PyCNS module was going to be developed, it was not yet decided how this was going to be done.
Consequently, with the help of the results obtained from the interview stage and the literature study, the development strategy with respect to RQ2 was defined. Since the LLM-based tool was to be focused on PyCNS, an initial logical flow of work with this module was defined. This flow consisted of a specific question related to the data in a CNS file, an initial problem-solving strategy and code writing, successful code execution, and subsequently the presentation of the desired results in a clear and well-formatted way. Furthermore, this workflow was translated into a graph with respect to the desired behaviour of an LLM-based tool. This initial graph contained a number of nodes, corresponding to specific tasks. These nodes were meant to mitigate the challenges connected to the adaptation of generative AI in engineering, challenges that were outlined both in a broad sense and in a more GKN-specific sense. These challenges are presented in section 3.1. For example, a human in the loop (HITL) was incorporated in this graph to avoid black-box behaviour and consequently keep the engineer in charge. With a question related to the pycns module as input, the graph contained the nodes Generation, Human feedback, Code execution, and Present result. In the Generation node, a problem-solving description and corresponding Python code were to be generated. The Human feedback node was to display this reasoning and code to the human engineer, asking for approval to proceed to the Code execution node. This key feature was desired due to the need to avoid a black-box architecture and to avoid running code that the user has not inspected. Thus, if the proposed code was disapproved, the human user was to provide more context or comment on why it was declined, such that the flow returned to the Generation node where a revised description and code could be generated. In the Code execution node, the proposed code was to be run.
In order for the model to present a feasible solution, a check of whether the code execution failed or not was necessary. If the code was executed successfully, the results were to be presented in a clear and well-formatted way in the Present result node. If the execution failed, the flow was to be routed back to the Generation node for the generation of revised, working code. This graph, with its corresponding logic and nodes, was deemed to model the workflow of an engineer performing analysis on CNS objects. This initial graph structure is shown in Figure 2.5.

Figure 2.5: Initial graph structure

The nodes were imagined with the criteria in Table 2.3 in mind. For example, the Human feedback node was deemed to ensure the second criterion, whereas the Present result node was deemed to ensure the fourth criterion. This graph acted as the basis for the actual software development when such an LLM-based tool was programmed.

2.2.3 Development Method

After the initial development strategy phase, it was deemed that some sort of agentic behavior system was required for the development of the sophisticated graph presented in Figure 2.5. The reason for this is that an implementation using manual systems, like chains, without internal decision making and the ability to use external tools, would be impossible with respect to the desired graph structure. After thorough research, it was determined that LangGraph offered the most convenient and effective approach for the design of agentic systems compatible with both open-source and state-of-the-art models. The introductory courses provided by LangChain Academy (LangChain Academy 2025) for LangGraph, LangChain, and LangSmith offered the foundational knowledge necessary for assessing complex graph structures in LangGraph and integrating functionalities from both LangChain and LangSmith.
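The routing logic of the initial graph in Figure 2.5 can be sketched in plain Python. This sketch only mimics the idea of nodes and conditional edges; it does not use the actual LangGraph API, and the stub node functions are illustrative stand-ins.

```python
# Plain-Python sketch of the routing in Figure 2.5: Generation ->
# Human feedback (HITL) -> Code execution -> Present result, looping
# back to Generation on rejection or execution failure.

def run_graph(generate, ask_human, execute, present, question, max_rounds=5):
    """Walk the graph until a result is presented or max_rounds is hit."""
    state = {"question": question, "feedback": None}
    for _ in range(max_rounds):
        code = generate(state)                 # Generation node
        approved, comment = ask_human(code)    # Human feedback node
        if not approved:
            state["feedback"] = comment        # back to Generation
            continue
        ok, result = execute(code)             # Code execution node
        if not ok:
            state["feedback"] = result         # back to Generation
            continue
        return present(result)                 # Present result node
    return None

# Stub nodes: the first proposal is rejected, the revised one succeeds.
def generate(state):
    return "print(1)" if state["feedback"] else "print(0)"
def ask_human(code):
    return (code == "print(1)", "please revise")
def execute(code):
    return True, "executed " + code
def present(result):
    return "RESULT: " + result

assert run_graph(generate, ask_human, execute, present, "q") == "RESULT: executed print(1)"
```

The two `continue` branches correspond to the graph's two loop-back edges: human disapproval and failed execution both return the flow to the Generation node with feedback attached to the state.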
Python was selected as the implementation language because of its familiarity and widespread use. Additionally, after gaining a comprehensive understanding of the PyCNS module, the development of programmable software based on the structure shown in Figure 2.5 could begin.

Nevertheless, such a graph requires the incorporation of internal knowledge, in this case knowledge about the PyCNS module. The results from the literature study, presented in subsection 3.1.1, highlighted the challenges of poor-quality output through hallucination as well as the need for the embedding of internal engineering knowledge. Consequently, four distinct agents were developed based on the knowledge embedding methods described in subsection 1.5.5, namely Entire Context in System Prompt, RAG, and Fine-Tuning. Furthermore, these agents were designed according to the criteria presented in Table 2.3, with the ultimate goal of closely emulating the graph presented in Figure 2.5. A concise overview of these agents is provided in Table 2.4. Finally, the resulting agents are presented in section 3.2.

                    | Agent 1                                  | Agent 2 | Agent 3     | Agent 4
Knowledge Embedding | Complete documentation in system prompt  | RAG     | None        | Complete documentation in system prompt
Post-Training       | None                                     | None    | Fine-tuning | Fine-tuning

Table 2.4: Comparison of Agents across Knowledge Embedding and Post-Training

After developing the four initial agents, an additional main agent was introduced to further improve the robustness of the architecture and to fulfill all the criteria outlined in Table 2.3. Specifically, the initial four agents were primarily designed with the goal of executing code, which is a task that may not always align with user needs. Therefore, to design an agent capable of providing varied responses, and especially of meeting criterion 5, it became essential to implement functionality that could effectively guide the user.
The guidance functionality was achieved through the integration of a routing system capable of distinguishing irrelevant questions and providing explanatory responses without the necessity of executing code. Consequently, the main agent was developed in a modular way, enabling the four initial agents to be integrated into it, hence keeping the code-execution functionality. Ultimately, the agent that performed best in terms of accuracy and efficiency in the evaluation was chosen for integration into the main graph. The idea of the main agent architecture is presented in Figure 2.6, while the final implementation result is presented later in section 3.2.

Figure 2.6: Main agent architecture

2.3 Evaluation

After designing the four different agents, it was necessary to evaluate them against each other, such that one of the four agents could be incorporated into the main agent. The evaluation stage could therefore be divided into two main parts: the initial evaluation of each of the four agents, and the evaluation of the main agent. Both parts were evaluated on their accuracy and efficiency performance, according to subsection 2.2.3.

2.3.1 Agent 1-4 Evaluation

To assess the performance of the developed agentic systems, a structured QA pair evaluation was conducted. The evaluation focused on two key aspects. One was accuracy, which measured how well the agents responded to predefined questions. The other was efficiency, which measured how effective the computational process for answering the questions was, analyzed through latency and token data. Each agent was tested using a set of 30 QA pairs that were carefully designed based on a thorough examination of the PyCNS documentation. The questions were selected to ensure diversity in style and execution while maintaining correctness.
All the answers in the QA pairs included the computed result, which was pre-calculated manually in Python using the PyCNS module. To enhance the reliability of the evaluation, the experiment was repeated across five separate runs. This allowed for a more robust assessment of the agents' performance and reduced the impact of any variability in individual runs. The entire evaluation process was conducted using LangSmith, which enabled a fully automated testing pipeline. By automating this process, the evaluation became both efficient and reproducible. Finally, all grades were manually inspected by a human to further review the results.

For Agent 2, with its RAG architecture, an initial evaluation was needed to choose the relevant retriever parameters. RAG performance depends on a number of factors, such as the choice of embedding method and vector database. A comprehensive investigation of which RAG implementation was best for this agent was not performed, since it is such an extensive domain. Nevertheless, the chosen chunk size and k value were investigated to a certain extent. The choice of chunk size has been shown to have a considerable effect on the performance of RAG configurations, where a larger chunk size might provide more context while also increasing the time to process the provided information (Wang et al. 2024). The choice of chunk size is also dependent on the density of information in the documentation from which data is retrieved (Zhong et al. 2024). Furthermore, when using a k nearest neighbours (kNN) algorithm for document retrieval, the performance is linked to the selected k value (Leto et al. 2024). Therefore, a somewhat naive investigation of the effect of chunk size and k value was performed for Agent 2. Since the information density varied in the documentation, two different strategies were outlined.
One strategy consisted of setting a smaller k value of 5 while increasing the chunk size between values in the range of 250 to 1750, while the other strategy consisted of setting a smaller chunk size of 200 but increasing the k value in the range of 8 to 20. Initial QA evaluations, consisting of only one repetition, were performed for these different configurations such that one could be selected for the comprehensive QA evaluation of Agent 2.

For fine-tuning, the number of pre-trained parameters of a model can affect the performance (Lu, Luu, and Buehler 2024). Therefore, two different models were evaluated for Agent 3. Since the OpenAI fine-tuning platform was used, these were GPT-4o and GPT-4o-mini. This enabled an initial investigation of how the performance varies between a smaller and a larger model in terms of parameters.

2.3.1.1 Accuracy Evaluation

To measure accuracy, the response of each agent to the 30 questions was compared against a predefined reference answer using an LLM. The grader assessed correctness by assigning a binary value to each response: TRUE if the response was sufficiently close to the reference answer, and FALSE if it contained incorrect or conflicting information. The grading process followed structured evaluation criteria to ensure consistency and objectivity. Specifically, the grader focused solely on factual correctness, ensuring that responses did not contain contradictions. Additionally, responses that included more information than the reference answer were still considered correct, provided they remained factually accurate. The structured evaluation prompt used by the grader was inspired by the LangChain GitHub repository (David 2024), ensuring a well-defined and systematic approach to evaluation. To ensure that the grader was trustworthy in its TRUE/FALSE output, a multitude of predicted answers were compared to the provided reference answers with respect to the grading.
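The binary grading and its aggregation can be sketched as follows. In the thesis the verdict comes from an LLM grader run through LangSmith; the parsing heuristic and the example verdict strings below are assumptions made for illustration:

```python
# Sketch of the TRUE/FALSE grading step (the verdict strings would come
# from the LLM grader; parsing and aggregation here are illustrative).

def parse_grade(grader_output: str) -> bool:
    """Normalize the grader's textual verdict to a boolean."""
    return grader_output.strip().upper().startswith("TRUE")

def accuracy(verdicts: list[str]) -> float:
    """Fraction of QA pairs graded TRUE."""
    grades = [parse_grade(v) for v in verdicts]
    return sum(grades) / len(grades) if grades else 0.0

score = accuracy(["TRUE", "FALSE", "TRUE", "TRUE"])  # 3 of 4 graded correct
```

Aggregating over the 30 questions and 5 repetitions then gives the per-agent accuracy figures reported later.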
The QA evaluation framework used in the assessment is illustrated in Figure 2.7.

Figure 2.7: QA evaluation for accuracy

2.3.1.2 Efficiency Evaluation

The efficiency evaluation is divided into latency performance and token usage. These were measured to understand how quickly the agents generated responses and how much computational power they demanded, respectively. Latency was recorded in seconds, representing the time taken for an agent to provide an answer after receiving a query. By analyzing both P50 and P99 latencies, the evaluation captured both the expected response times under normal conditions and the potential worst-case scenarios, ensuring a balanced evaluation of the performance. Token usage, on the other hand, was tracked as two numerical values provided by LangSmith, aggregated over all the LLM API calls made inside the graph. One represents the input or prompt tokens, while the other represents the output or completion tokens. Since OpenAI charges based on the number of tokens used, this measurement gave an indication of how the utility costs of the different agents differed.

2.3.2 Main Agent Evaluation

The main agent evaluation was performed such that an approach for the main graph, displayed in Figure 3.4 and consisting of two subgraphs, could be decided. The agent deemed most suitable in terms of accuracy and efficiency, given the results yielded from the QA evaluations, was selected for inclusion in the main graph. Similar to the testing of agents 1-4, the evaluation of accuracy and efficiency through QA assessment was conducted with reference to the main graph. Nevertheless, this was done on a different set of questions than in the agent evaluation. Whereas the agent evaluation consisted of 30 questions based on code generation, the questions for the main graph evaluation were written based on three pillars to capture the variety in style mentioned in subsection 2.2.3.
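The latency summaries can be reproduced from per-question timings with a nearest-rank percentile. This is a sketch: LangSmith computes these statistics internally, and the timing values below are made up for illustration:

```python
import math

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile, p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical per-question latencies in seconds.
latencies = [1.2, 0.9, 1.1, 5.4, 1.0, 1.3, 0.8, 1.1, 1.2, 6.0]
p50 = percentile(latencies, 50)  # typical response time
p99 = percentile(latencies, 99)  # worst-case tail
```

The contrast between the two values is the point of reporting both: the P50 describes the normal case, while the P99 is dominated by the slow outliers.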
These pillars were code generation, description generation, and detection of relevance. Therefore, 31 questions were written such that the reference answer contained either a result from Python code, a description connected to PyCNS functionality, or the standard output "You asked something that was not related to the PyCNS documentation!" given an irrelevant question. None of the 31 questions used in this QA evaluation were identical to any of the 30 questions used in the previous QA evaluation.

2.4 Usability Testing

In order to verify whether the developed software could be used in an industrial context, its usability was examined through user testing. This user testing, in which two analysis engineers at GKN tested the software, used the Usability branch of the system acceptability architecture presented by Nielsen (1993) as a foundation. These branches are displayed in Figure 2.8. The engineers were chosen such that they were familiar with the PyCNS module and had used it in their work previously. User testing is of particular importance here since the software was not developed by its intended users.

Figure 2.8: The branches of system acceptability (Nielsen 1993, p. 25)

While the method for investigating factors like Cost (through token utilization) was presented in section 2.3, important usability factors like Easy to learn and Subjectively pleasing are more difficult for a developer to get an understanding of. Therefore, these required concrete testing with intended users. The testing methodology was inspired by Farzaneh and Neuner (2019), where a usability evaluation was performed on a software tool in the domain of bio-inspired design. The user testing procedure consisted of a number of steps. Through a provided LangGraph user interface, a brief tutorial of the tool functionality was demonstrated to the engineer.
After this, the engineers were asked to use the tool for assignments similar to those for which they would normally use the PyCNS module directly in a programming interface. This part of the user testing was conducted as an observational technique: the engineers were provided with the dummy case presented in section 1.5, their behaviour was observed, and no interference with the users took place. The advantage of observational techniques is that they are deemed to yield an essentially authentic user environment, in which problems connected to usability can be detected and further addressed (Farzaneh and Neuner 2019). Furthermore, observing the users rather than actively instructing them in what to do was deemed to reduce the risk of biased results. Afterwards, a brief semi-structured interview was conducted with the engineers to collect their ideas regarding the usability of the tool. This semi-structured interview consisted of six set questions, taken from Srikanth, Hasanuzzaman, and Meem (2024), in which the usability of existing state-of-the-art LLMs within the domain of threat intelligence enrichment is investigated. These questions are shown in section C.1. Lastly, the results from the observations and the interviews were analyzed such that an indication of the usability could be obtained. The usability testing method is summarized in Figure 2.9.

Figure 2.9: The structure of the usability testing

3 Results

3.1 Data Gathering

After finishing the literature review and the interview process, various results relevant to answering RQ1 were obtained. In the following chapter, these results are divided between those explicitly obtained from the literature review in subsection 3.1.1 and those explicitly obtained from the interview process in subsection 3.1.2.
While results from the interview process are generally more applicable to the specific GKN use case, and those obtained from the literature review tend to be of a broader nature, there is an intersection between the results obtained in both stages.

3.1.1 Literature Review

From the literature review, a number of challenges connected to the use of generative AI in engineering were identified. These are presented in the following section. The different challenges are often connected to each other, meaning, for example, that a poor-quality output yielded through hallucination may be connected to a lack of in-house knowledge and poor prompting. Generative AI in engineering has its set of limitations, and some of these challenges may be chronic, but often it seems that they can be mitigated with the correct strategy.

Applications of LLMs in engineering

The use of generative AI, and more specifically LLMs, shows promise in a number of engineering fields. Among these fields are computational mechanics, engineering design and requirements engineering. Here, a brief introduction to LLM applications in engineering, based on the reviewed literature, is presented.

The use of LLMs could facilitate the identification and analysis of data, whether it is related to requirements or to information connected to standards. Norheim et al. (2024) investigate the use of LLMs with respect to engineering requirements in complex systems. Arora, Grundy, and Abdelrazek (2024) also consider the use of LLMs in RE, such that a SWOT analysis is performed for the different stages of the RE process. Furthermore, LLMs have the potential to reduce the amount of manual work in product development. Ehring et al. (2024) investigate how LLMs can be used for information classification, such that role-specific identification of information can be streamlined. LLMs also show potential when it comes to computational aspects of engineering.
A more mechanical-engineering-focused study on the use of LLMs has been performed by Ni and Buehler (2023), who with the help of LLMs create AI agents, so-called MechAgents, capable of solving elasticity problems through multi-agent collaboration. Additionally, an article by Alexiadis and Ghiassi (2024) brings forward the use of LLMs integrated into physics-based simulation software. Alexiadis and Ghiassi (2024) investigate how an AI model could be used for geometry generation, mesh set-up and material property definition in the simulation software, such that the simulation can be run and the AI model can additionally give thorough results to the user. The use of LLMs in computational mechanics is likewise investigated by Brodnik et al. (2023), where the potential and challenges of the use of LLMs in applied mechanics are discussed. One of the possibilities in the field of applied mechanics is assisted programming, where the creation of computational algorithms can be facilitated through code generation and translation (Brodnik et al. 2023).

Engineering design is an area subject to investigation on how LLMs can further streamline engineering processes. Chiarello et al. (2024) analyze 15 355 research papers within engineering design and discuss the potential of LLMs, treating the use of LLMs in the four engineering design phases Problem Definition, Conceptual Design, Embodiment Design and Detailed Design. Pradas Gomez et al. (2024) investigate the use of LLMs as support for designers in complex systems engineering, where two aerospace cases are investigated: one being the generation of Python code for an aircraft actuation system UML diagram, and the other being code for geometry generation through interaction with a CAD kernel. Additionally, the use of LLMs in the automotive industry, combining LLMs with RAG to further automate design workflows and improve software development, is investigated by Zolfaghari et al.
(2024) through a comparative study of currently available LLMs.

There are various approaches to using LLMs within engineering. For instance, Ni and Buehler (2023) propose a multi-agent approach where multiple agents collaborate to solve tasks. Similarly, Du et al. (2023) introduce a method for addressing problems, such as arithmetic tasks, using a multi-agent system. In their approach, agents debate the solutions generated by each participant and collaboratively determine a final answer.

The use of LLMs also shows potential regarding integrated use with external engineering software, as shown by, for example, Alexiadis and Ghiassi (2024) with the combination of LLMs and physics-based simulation software. Jiang et al. (2024) investigate facilitating the creation of building energy models (BEMs) through LLMs, such that an "auto-building modeling platform" can transform natural language from a user input prompt into a particular generated building model, via the integration between an LLM architecture and a physics-based simulation (Jiang et al. 2024).

Lack of in-house knowledge

A prominent challenge regarding the use of generative AI, and more precisely LLMs, in an engineering context seems to be the contrast between their general nature and the need for internal competence in tasks. Ehring et al. (2024) highlight the current practice in product development of manually identifying and analyzing suitable information from internal documentation, where the inquiry is whether LLMs can make the process more effective. However, the data LLMs have been pre-trained on may differ substantially from internal documentation. Consequently, when using generative AI tools for engineering purposes, embedding in-house knowledge is of importance (Ehring et al. 2024). Indeed, as Ehring et al. (2024) write: "without fine-tuning, today's model are unable to classify information in a role-specific way. The models lack too much industry-specific knowledge".
A need for fine-tuning could therefore be considered. Nevertheless, fine-tuning is demanding with respect to the necessary data (Ehring et al. 2024). Chiarello et al. (2024) discuss the possibility of using LLMs for facilitating engineering design by combining them with 3D models such that geometry models could be created, although stressing the importance of properly defined material constraints and manufacturing processes. Such constraints could be guided by internal knowledge, and LLMs that have been subject to fine-tuning could therefore be considered for this (Chiarello et al. 2024). The general nature of LLMs is brought up by Pradas Gomez et al. (2024), as available LLMs are not trained on internal company and project methods. In addition to fine-tuning, RAG can be used to mitigate this challenge (Pradas Gomez et al. 2024).

Issues with sensitive information

Using sensitive information as input to an LLM that is not run locally is problematic, and hence one of the main issues with using currently available LLMs in an engineering context. Zolfaghari et al. (2024) mention the current challenge for the use of LLMs in certain engineering fields due to the need for internally stored private data, trade secrets and technical data. Among the currently available LLMs, the highest-performing ones are accessed through external APIs, which is a hindrance to the use of LLMs in sensitive domains like defense projects (Pradas Gomez et al. 2024). However, there are LLMs that can be run locally, thereby eliminating this issue. Examples of such models are LLAMA3 and Mistral, and the ability to run LLMs locally is associated with significant benefits in engineering fields where sensitive data is processed (Zolfaghari et al. 2024).

Drawbacks in prompting

The use of LLMs often presents challenges related to prompting. Creating an effective prompt is needed to achieve the desired output. For their use cases, Pradas Gomez et al.
(2024) had to experiment with multiple versions and iterations of prompts to obtain outputs that were considered valid responses. Arora, Grundy, and Abdelrazek (2024) mention the importance of prompt engineering, as the output behaviour of the LLM is strongly reliant on the prompt design. Prompting is in itself a technique which must be mastered for efficient use of LLMs. Prompts can be specified in a pre-defined manner like "Context, Task and Expected Output" (Arora, Grundy, and Abdelrazek 2024). Prompt engineering makes use of the context of the specific task, the language and the known abilities of the LLM (Arora, Grundy, and Abdelrazek 2024). As Arora, Grundy, and Abdelrazek (2024) found in their study: "slightly different prompts can produce very different outputs". Jiang et al. (2024) also emphasize the importance of prompting: given restricted computational resources, the performance of the model can be improved through well-defined prompts (Jiang et al. 2024).

The issues related to prompting can manifest themselves in flawed output, for example a result obtained through hallucination. To reduce the risk of hallucination and of erroneous output caused by a multi-step process with faulty reasoning by the LLM, there are methods in prompt engineering like "chain of thought prompting", consisting of clear step-by-step instructions (Brodnik et al. 2023). Mitigation of these errors can also be achieved through RAG, where retrieved parts of documents are added to the input prompt (Brodnik et al. 2023). There are two dimensions to prompt engineering with respect to LLMs: one approach is the use of APIs, such as the OpenAI API, meaning the LLM is used in a programming environment, as opposed to the less technical use in a pre-available user interface (Alexiadis and Ghiassi 2024).
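The pre-defined "Context, Task and Expected Output" structure mentioned by Arora, Grundy, and Abdelrazek (2024) can be captured with a small template helper. The helper and the example field contents below are illustrative assumptions, not part of any of the cited works:

```python
def structured_prompt(context: str, task: str, expected_output: str) -> str:
    """Assemble a prompt following the Context / Task / Expected Output pattern."""
    return (
        f"Context: {context}\n"
        f"Task: {task}\n"
        f"Expected Output: {expected_output}"
    )

prompt = structured_prompt(
    context="Results from a finite element simulation stored in a CNS file.",
    task="Summarize the maximum stress found in the result data.",
    expected_output="A short paragraph stating the maximum stress and its location.",
)
```

Making the three fields explicit forces the prompt author to state the task and the desired output format separately, which is one way of reducing the output variability that the cited studies attribute to loosely worded prompts.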
Over-reliance on AI tools/Loss of human engineering competence

As the use of LLMs is associated with a risk of hallucinated or poor-quality output, it is important to avoid over-reliance on their use. LLMs should not be incorporated in such a way that engineers use them as a "black-box" solution with blind faith in their output; rather, a "best of both worlds" practice should be strived for. As Pradas Gomez et al. (2024) note, the current nature of LLMs as "helpful assistants" is contrary to the nature of a competent designer, who is not inclined to always respond to a question given the provided information. The competent designer should question flawed information, and in certain cases express the need for further additions (Pradas Gomez et al. 2024). As Chiarello et al. (2024) highlight in their example of translating functional models into natural language, the concept of functional analysis, described as a specialized language mastered by certain designers, carries a risk of information loss during the translation process. Specific domain insight possessed by a number of engineers could therefore be lost in translation as the LLM is invoked. And while LLMs seem to have significant potential in facilitating tasks, the indication is that yielded output should be inspected by a competent human actor to mitigate this. The reliance, and more specifically the need for verification of outputs, may differ between engineering fields. Norheim et al. (2024) investigate the challenges of applying LLMs to requirements engineering (RE) tasks, and discuss this with respect to RE tasks like requirement translation and requirement analysis. Furthermore, Norheim et al. (2024) note that a certain level of error is currently inevitable, and that there is a need for organizations to determine domains where a certain level of error may be acceptable and domains where it is not.
The risk of imperfect LLM performance should especially be considered in environments where human safety is a major aspect (Norheim et al. 2024).

Hallucination/Low quality output

A common problem with LLMs is their lack of reasoning ability. Hallucination, meaning that the LLM gives an output that is factually incorrect but probable from a linguistic perspective, is a challenge with LLMs (Brodnik et al. 2023). Chiarello et al. (2024) mention the danger of hallucinations from an engineering design perspective, where they may cause issues especially when such tools are used by designers with a lack of experience. However, there are ways the challenge of hallucinations can be mitigated. One such way is through a multi-agent reasoning model. The multi-agent reasoning model proposed by Ni and Buehler (2023) demonstrated the ability to self-correct hallucinations. For the multi-agent system proposed by Du et al. (2023), the authors found that having agents "debate" with each other led to a higher accuracy in output. In addition to multi-agent systems, there are other methods to reduce the risk of hallucination, one of which is the use of RAG (Brodnik et al. 2023).

3.1.2 Interviews

The results from the thematic analysis are presented through six different themes. These are Current risks and challenges, Cumbersome tasks pre-processing, Cumbersome tasks post-processing, Factors hindering full automation, AI implementation possibilities and Risks and challenges with AI. The themes Cumbersome tasks pre-processing and Cumbersome tasks post-processing refer to the stages before and after a simulation process. To substantiate the results for a given theme, several quotes are presented in its context.

3.1.3 Current Risks and Challenges

During the interviews, a number of points related to risks and challenges of current workflows were identified. These were, for instance, connected to how knowledge is stored, retrieved and used. Interviewees were asked whether they used AI in their work.
While existing AI tools are often unsuitable for specific tasks, interviewee 8 noted the limitations of using non-local LLMs as tools from a data secrecy point of view:

"You cannot just put anything in there." – Interviewee 8

Furthermore, non-local LLMs like ChatGPT perform most effectively on generic questions. However, this generic nature often clashes with analysis procedures, where one may need answers to theoretical questions, such as on the topic of solid mechanics. As interviewee 4 responded to the question of whether he used existing LLMs in his work:

"I've tried to use both Copilot and ChatGPT to ask like theoretical questions, but none these models have been trained on the manuals" – Interviewee 4

Consequently, there is the consideration of how to transfer existing knowledge to engineers. This transfer of knowledge seems to be especially important when it comes to less experienced engineers. On the topic of current challenges in the workflow, interviewee 7 commented on this, speaking as an experienced analysis lead:

"It's a very changing team. So we always have new people and they need to understand the codes, which is not always easy" – Interviewee 7

The issue of knowledge embedding is not only relevant to engineers, but also to other parts of the workforce. Interviewee 1 investigated the use of LLMs and augmented reality in assembly. Indeed, the potential use of such tools in an assembly context was justified by interviewee 1 with an example from an assembly site:

"They don't keep their inspectors for a long amount of time, they tend to rotate people who they hire quite frequently, so they have to be retrained constantly." – Interviewee 1

Thus, such a challenge indicated the potential benefit of LLMs as a helping tool. The current risks and challenges in the analysis engineering workflow, as well as in other fields, highlighted the theme of challenges connected to the storage and retrieval of knowledge.
3.1.4 Cumbersome Tasks Pre-Processing

The interviewees were asked if there were any tasks in their workflow they found extra cumbersome. For analysis engineers, the answers could be divided into cumbersome tasks in the pre-processing stage and in the post-processing stage, that is, during the preparation of a simulation and during the analysis and interpretation of the output. When using Ansys as simulation software, run scripts are written in APDL. For the pre-processing, interviewee 7 commented on the issue of setting up certain run scripts:

"And then some of the problems come from scripts that are not properly set up and then cause problems." – Interviewee 7

This means time is consumed inspecting and fixing code so that the simulation can run properly. These run scripts contain numerous input variables, and interviewee 5 commented on the process of changing these for a specific simulation:

"Normally, in some sort of run script, at the start of each run, you have a lot of input variables and today you change these manually." – Interviewee 5

Furthermore, the input to the run scripts needs to be identified and extracted at an earlier stage. This data can, for example, be given in Excel, where it must be analyzed and retrieved. Interviewee 5 remarked on this:

"Before you can actually give that as an input to the analysis, you need to do a few steps. One of them could be removing duplicates. Others could be formatting [...] So before the run script we could have another script basically prepared to convert that Excel sheet into a format that works for Ansys." – Interviewee 5

The indications were that the inconvenient aspects of the pre-processing stage were due to the current procedure for certain tasks related to the analysis, retrieval and formatting of input data. However, some interviewees mentioned perceiving the pre-processing stage as rather elementary and straightforward.
Interviewee 4 expressed the following:

"And I would say that I don't experience that much difficulty and it's actually quite fast to do it" – Interviewee 4

3.1.5 Cumbersome Tasks Post-Processing

For the post-processing stage, a number of tasks must be performed such that the output script can be transformed into presented data. Interviewee 3 commented on the current nature of the post-processing:

"I believe these days we spend most of the time in post-processing. Especially working with data, moving data, moving information from one system to another. From Ansys to text to PowerPoint or Word, or transforming the data in the process and then writing conclusions about the data that we got or the images that we see." – Interviewee 3

Data must be transferred between different programs, like Ansys and PowerPoint or Excel, which creates a need for the engineer to act as an intermediary, as these systems are not integrated. Interviewee 7 also remarked on the time-consuming nature of post-processing with respect to their engineering team:

"And then they also spend a lot of time post-processing the data [...] They need to extract a lot of graphs, time points, load the result files, pick the nodes... All these kind of things." – Interviewee 7

These repetitive tasks in the post-processing stage, which do not always demand a lot of thinking, were further commented on by interviewee 7:

"So right now, I think we are spending a lot of times on things that the computer could do. And there's not a lot of engineering thinking in the current work." – Interviewee 7

Therefore, it could be considered that the engineer could be relieved of certain of these cumbersome and repetitive tasks.

3.1.6 Factors Hindering Full Automation

Although AI has been present for a while, with its applications expanding and improving significantly, there still remains a degree of reluctance toward it. The reasons behind this reluctance can differ between engineers.
One key challenge mentioned in the interviews was getting the engineers themselves to trust the transformation towards incorporating AI tools. For instance, interviewee 8 remarked on this:

"But most of the time comes from getting the engineers that are doing it by hand now to trust the tools that you develop." – Interviewee 8

Furthermore, a recurrent opinion was that the complexity of the processes makes it difficult to automate them using AI. As discussed in subsection 3.1.8, LLMs can be considered non-deterministic. Given that many workflow processes require individual assessment and the application of common sense, interviewee 2 expressed his scepticism:

"I think a lot of analysis depends on judgement and how do you automate that judgment part?" – Interviewee 2

It could also be because some of the existing software currently used for simulation or calculation is not compatible with automation, as noted by interviewee 3:

"If you think about ANSYS only, then ANSYS has its limitations on how much you can actually automate it. Of course, every software will have its own limitation and ANSYS has its set of limitations." – Interviewee 3

3.1.7 AI Implementation Possibilities

A recurring theme from the interviews highlighted the importance of preserving the knowledge of the engineer and keeping them as the central decision-maker in the workflow process. Hence, maintaining oversight and not fully relying on the LLM was considered a crucial element, as interviewees 4 and 8 commented:

"Because everyone can kind of do an analysis, but you need an engineer to actually understand what you are putting in and what the output is" – Interviewee 4

"I see it more as like an assistant or like a helper, that the engineer is still in charge." – Interviewee 8

Several respondents mentioned the idea of letting the tool give feedback to the engineer before proceeding and initializing a simulation or any other action of a more significant nature.
This ensures that the control and responsibility remain with the engineer, as previously discussed. Interviewee 7 suggested the following:

"The engineer should go, should be able to go and see what the program does and try to understand." – Interviewee 7

Another suggestion to ensure that the engineer retains control of the process was proposed by interviewee 1. The idea focuses on establishing clear boundaries for the LLM through a set of predefined actions. This not only delegates more control to the engineer, but also makes the LLM less sensitive to hallucinations.

"So instead of having it more freeform, it has to choose from a list of actions. So you have more control of what the LLM is doing rather than directly controlling a software." – Interviewee 1

A further recurring point was the idea of using the LLM for analyzing result data from the simulations and using it for visualization. Interviewee 8, among others, remarked on this:

"So indeed having lots of results, I think a large language model is really good in summarizing that, creating pictures for you, plots, things like that." – Interviewee 8

Current analysis processes are streamlined with the help of, for example, Python scripts for computations. However, not all engineers