Systematic Design and Integration of Large Language Model Tools for Engineering Analysis

An investigation, and the development, of large language model tools in analysis engineering

Gabriel Krüger
Johannes Lundahl

Department of Industrial and Material Science
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2025
www.chalmers.se

Master's thesis 2025

© Gabriel Krüger, Johannes Lundahl 2025.

Supervisors: Alejandro Pradas Gómez, Department of Industrial and Material Science; Najeem Muhammed, GKN Aerospace.
Examiner: Ola Isaksson, Department of Industrial and Material Science.

Master's Thesis 2025
Department of Industrial and Material Science
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: (Insert relevant cover image)

Typeset in LaTeX, template by Kyriaki Antoniadou-Plytaria
Printed by Chalmers Reproservice
Gothenburg, Sweden 2025

Developing and evaluating large language models to align with engineering tasks and increase their effectiveness
An investigation, and the development, of large language model tools in analysis engineering
Gabriel Krüger, Johannes Lundahl
Department of Industrial and Material Science
Chalmers University of Technology

Abstract

Generative AI, and more specifically large language models (LLMs), show great promise in further facilitating and streamlining analysis engineering work. In this thesis, the current limitations of such technologies regarding their incorporation in engineering tasks are investigated.
This investigation consists of a comprehensive literature study, in addition to nine interviews with engineers at GKN Aerospace in Trollhättan, Sweden. The results indicate a number of challenges. One is the importance of embedding internal company knowledge when leveraging the capacities of LLMs, while the use of non-local LLMs is linked to considerable issues when it comes to handling sensitive data. With the results of this investigative study as a foundation, an LLM-based tool meant to mitigate these identified issues is developed. The development is done using LangChain, such that the OpenAI API can be used in a Python environment. The developed software is focused on the manipulation and extraction of data in CNS files, a file format containing results from finite element simulations. Different agentic systems are investigated, leveraging different methods for knowledge embedding, such as retrieval augmented generation (RAG), and post-training, such as fine-tuning. The various architectures are evaluated based on their efficiency and accuracy in solving tasks. The results indicate that LLM-based tools have great potential in the field. The top-performing architecture from this testing is incorporated into a sub-graph architecture, for which usability and validation are examined. The results for efficiency, accuracy, usability testing and validation imply considerable potential in leveraging LLMs in the domain. Nevertheless, performance is not perfect, and a number of considerations must be taken into account in such a development. The methods for knowledge embedding and post-training seemingly have a great impact on performance, and more sophisticated approaches within RAG and fine-tuning have the potential to improve it further.
Keywords: AI, Large Language Model (LLM), LangChain, Agent, Multi-Agent, RAG, Fine-Tuning, Knowledge-Based Engineering (KBE)

Acknowledgements

First of all, we would like to express our great appreciation for the help we have received throughout this thesis from our supervisors Alejandro Pradas Gómez and Najeem Muhammed. Their knowledge in fields like academic writing and software development has been essential. Additionally, we would like to thank GKN Aerospace Trollhättan and manager Rikard Nedar for having us at the office during this thesis. A special thanks to all engineers at GKN who participated in our interview and usability testing studies.

Gabriel Krüger & Johannes Lundahl, Gothenburg, April 2025

List of Acronyms

Below, the acronyms used throughout this thesis are listed in alphabetical order:

AI    Artificial Intelligence
API   Application Programming Interface
CAD   Computer Aided Design
CFD   Computational Fluid Dynamics
FEM   Finite Element Method
HITL  Human In The Loop
LLM   Large Language Model
QA    Question & Answer
RAG   Retrieval Augmented Generation
RQ    Research Question

Contents

List of Acronyms
List of Figures
List of Tables
1 Introduction
  1.1 Background
  1.2 Scope
  1.3 Limitations
  1.4 Ethical and Environmental Considerations
  1.5 Prerequisites
    1.5.1 AI agents
      1.5.1.1 Human in the loop (HITL)
    1.5.2 LangChain
    1.5.3 LangSmith
    1.5.4 LangGraph
    1.5.5 Knowledge embedding methods
      1.5.5.1 Entire Context in System Prompt
      1.5.5.2 RAG (Retrieval Augmented Generation)
      1.5.5.3 Fine-tuning
    1.5.6 GKN Dummy Data
    1.5.7 PyCNS Knowledge
    1.5.8 Computational Effort Parameters
2 Method
  2.1 Data Gathering
    2.1.1 Literature Review
      2.1.1.1 Execution
      2.1.1.2 Data Analysis
    2.1.2 Interviews
      2.1.2.1 Execution
      2.1.2.2 Data analysis
  2.2 Software Development Method
    2.2.1 Development Objective
    2.2.2 Development Strategy
    2.2.3 Development Method
  2.3 Evaluation
    2.3.1 Agent 1-4 Evaluation
      2.3.1.1 Accuracy Evaluation
      2.3.1.2 Efficiency Evaluation
    2.3.2 Main Agent Evaluation
  2.4 Usability Testing
3 Results
  3.1 Data Gathering
    3.1.1 Literature Review
    3.1.2 Interviews
    3.1.3 Current Risks and Challenges
    3.1.4 Cumbersome Tasks Pre-Processing
    3.1.5 Cumbersome Tasks Post-Processing
    3.1.6 Factors Hindering Full Automation
    3.1.7 AI Implementation Possibilities
    3.1.8 Risks and Challenges with AI
  3.2 Software Development
    3.2.1 Agent 1: Complete Documentation
    3.2.2 Agent 2: RAG
    3.2.3 Agent 3: Fine-Tuning
    3.2.4 Agent 4: Fine-Tuning & Complete Documentation
    3.2.5 Main Agent
  3.3 Software Evaluation
    3.3.1 Comparison Between Agent 1-4
    3.3.2 Agent 1
    3.3.3 Agent 2
    3.3.4 Agent 3
    3.3.5 Agent 4
    3.3.6 Main Graph Evaluation
  3.4 Usability Testing
4 Validation
5 Discussion
  5.1 Literature Review
  5.2 Interviews
  5.3 The Role of Knowledge Embedding
  5.4 The Role of Post-Training
  5.5 Knowledge Limitations
  5.6 The Rapidly Improving Field of AI
  5.7 Reliability of Results and Sources of Error
  5.8 GKN Usability
  5.9 Answers to the RQs
    5.9.1 RQ1
    5.9.2 RQ2
  5.10 Further Research
6 Conclusion
Bibliography
A Data Gathering
  A.1 Semi-Structured Interview Questions
    A.1.1 Demographic questions
    A.1.2 Current process workflow
    A.1.3 Interviewee experience regarding AI
    A.1.4 Risks/challenges in incorporation of AI
    A.1.5 Potential incorporation of AI in the process
    A.1.6 General finishing questions
    A.1.7 Wrap it up
  A.2 Interview Contract
B Software Development
C Usability Testing
  C.1 Usability Test Interview Questions

List of Figures

1.1 General flowchart for solid mechanics analysis using a tool
1.2 Visualization of nodes and edges in LangGraph
1.3 Simple architecture with all the external knowledge accessed by the LLM
1.4 RAG architecture
1.5 Fine-tuning with a data set given internal knowledge
1.6 P50 and P99 latencies
2.1 Components of the methodology for answering the RQs
2.2 Screening procedure in Scopus for the literature review. Highly inspired by Page et al. (2021)
2.3 Stages for the interviews
2.4 Number of interviews in relation to identified needs (Griffin and Hauser 1993)
2.5 Initial graph structure
2.6 Main agent architecture
2.7 QA evaluation for accuracy
2.8 The branches of system acceptability (Nielsen 1993, p. 25)
2.9 The structure of the usability testing
3.1 Example of a human interrupt block in LangGraph studio
3.2 Graph structure system prompt context
3.3 LangGraph representing agent with RAG
3.4 Main agent with sub-graph
3.5 Comparison of mean accuracy, latency & token usage between agents
3.6 Accuracy and token usage results for the four agents
3.7 Comparison of accuracy and latency performance for Agent 1
3.8 Agent correctness across all runs for Agent 1
3.9 Comparison of accuracy and latency performance for Agent 2a
3.10 Agent correctness heatmap for Agent 2a
3.11 Comparison of accuracy and latency performance for Agent 2b
3.12 Agent correctness heatmap for Agent 2b
3.13 Comparison of accuracy and latency performance for Agent 3 with GPT-4o and GPT-4o mini models
3.14 Agent correctness heatmap for Agent 3 for GPT-4o and GPT-4o mini
3.15 Comparison of accuracy and latency performance using fine-tuning and complete context in system prompt
3.16 Agent correctness heatmap for Agent 4
3.17 Comparison of accuracy and latency performance for Agent 4
3.18 Agent correctness heatmap across multiple experiment runs
4.1 Validation architecture (Sargent 2010)

List of Tables

2.1 Keywords for literature search
2.2 Respondent details, roles, and AI experience
2.3 Development Criteria
2.4 Comparison of Agents across Knowledge Embedding and Post-Training
3.1 Parameters and accuracy results for RAG strategy 1
3.2 Parameters and accuracy results for RAG strategy 2
4.1 The three pillars of a simulation model defined by Sargent (2010) and their corresponding parts in the thesis
4.2 Summary of questions and methodology for the software validation

1 Introduction

The use of AI has, over the course of the last years, shown great promise in further optimizing work procedures in a number of engineering domains. AI is used during R&D phases within fields like the automotive and aerospace industry, for example in the design stage, where certain CAD software integrates AI features (Financial Times 2024). Classic simulation methods, like finite element method (FEM) and computational fluid dynamics (CFD) tools, are still significantly more common in engineering than AI and machine learning (ML) techniques (Ragani et al. 2023).
Nevertheless, it is considered that the incorporation of AI methods can further streamline these "classic" simulation workflows within the pre- and post-processing stages of data. Consequently, this study investigates how specific analysis engineering procedures can be further streamlined with the help of AI tools, and what the corresponding challenges are.

1.1 Background

GKN Aerospace is a leading company in the aviation industry, collaborating with the world's top manufacturers of aircraft and engine components. The Aero Engines division in Trollhättan, Sweden, specializes in high-performance engine components and has facilities and engineering teams in several places all over the world. Here, advanced parts for aircraft and rocket engines are developed and manufactured, along with engine maintenance services (GKN-Aerospace 2024).

In recent years, AI, and particularly generative AI, has advanced significantly. Due to extensive research in this field, generative AI now supports various applications within the tech sector, including code generation, product design and content marketing (Fauscette 2023). Central to these applications are large language models (LLMs), a key component of generative AI. LLMs are advanced deep neural networks designed to process and generate content such as text, images and videos. These models are built on a transformer architecture, which is particularly effective at understanding the context within sequences of data (IBM 2023). This architecture enables LLMs to capture meaning more effectively than previous types of neural networks like CNNs or RNNs (NVIDIA 2023). LLMs have thus become essential tools for many tech companies and their employees, enabling things like automation of processes and improved product development through advanced analytics (Kietzmann and Park 2024).

To facilitate and optimize the work of employees at GKN, the potential of LLMs is currently being explored as a support tool in their engineering tasks, particularly for engineers in the solid mechanics department. Previously, at this specific department, LLMs have been tested as tools for assisting in report writing. This was done in a study titled "Supporting the Generation of Engineering Analysis Reports with Large Language Models", written by D. Söderqvist and F. Mare (Söderqvist and Mare 2024). Additionally, the use of LLMs has been investigated at GKN for the preparation of CAD geometries for FE analysis (Naik 2024).

Consequently, there is an interest at GKN in how LLMs can be further deployed within analysis engineering for the streamlining of work procedures. In this thesis, it is investigated how LLMs can aid over the course of engineering analysis linked to simulation processes. An engineering simulation involving a geometry generally requires pre-processing of the geometry with corresponding loads and material properties, such that a run script can be set up. After this, the simulation tool can be run, producing a results file containing simulation data. Subsequently, this data must be analyzed and manipulated. This workflow is illustrated in Figure 1.1.

Figure 1.1: General flowchart for solid mechanics analysis using a tool

At the solid mechanics department, this can be a simulation process through an FE analysis tool. Over the course of this procedure, data must be processed before the simulation can be run as well as after its completion. Therefore, the task of an analysis engineer is to both pre- and post-process data. The incorporation of AI, and more specifically LLMs, in this workflow can be achieved at various stages, as indicated by previous studies at GKN. The foundation of the thesis is therefore to investigate the further incorporation of LLMs in the analysis flow.
This is done with a number of research questions, acting as guidelines for the methodology, presented later in this chapter.

1.2 Scope

Given the desire to further investigate the incorporation of LLMs in engineering tasks at GKN, two research questions (RQs) linked to this are defined. These act as a guide for the research methodology, such that a well-structured investigation can be carried out. The two RQs are:

1. What are the practical limitations for analysis engineers regarding adaptation of generative AI technologies in their engineering tasks?
2. How can an LLM-based tool be systematically designed and evaluated to mitigate the identified limitations?

1.3 Limitations

For this thesis, a number of limitations are considered. These limitations are of a time, data security and financial nature. Accordingly, these limitations are as follows:

• Time: The project is limited to a period of about 20 weeks.
• Data Security: No confidential GKN data is considered during the research. This means that development is based on generic data, together with non-confidential insights from GKN engineers.
• Financial: A budget cap of approximately $1300 is given for utilizing cloud-based LLMs.

1.4 Ethical and Environmental Considerations

A number of ethical and environmental considerations must be taken into account during the development, and use of, generative AI tools. For example, the current development of LLMs has been criticized. In March 2023, the Future of Life Institute published the open letter "Pause Giant AI Experiments", signed by a number of well-known persons within the domain of AI research and governance. It pointed to potential LLMs more capable than OpenAI's GPT-4 model and called for a minimum six-month pause in the development and training of such models, such that AI safety and governance measures could be implemented (Mollen 2025).
The signatories expressed their concern, and questions treating topics like "nonhuman minds that might eventually outnumber, outsmart, obsolete and replace us?" and "let machines flood our information channels with propaganda and untruth?" were brought forward (Future of Life Institute 2023). While the scope of this thesis does not concern the development of an LLM, but rather the development of an LLM-based tool using existing models, it is important to keep the general ethical and existential questions with respect to LLMs in mind.

There are also environmental considerations that must be taken into account during the development, but also the use, of LLMs. LLMs have a significant carbon footprint, one reason being the heavy GPU usage during training (Faiz et al. 2023). For the data centers associated with LLMs, there are numerous factors that determine the environmental impact. Data center energy usage, with the corresponding share of carbon-free energy, as well as the embodied carbon footprint of the hardware, must be taken into account (Faiz et al. 2023). Over its lifecycle, the carbon footprint of an LLM caused by inference is larger than that caused by its initial training (Fu et al. 2024). The carbon emissions for GPT-3 caused by training are estimated at 502 tCO2 (Patterson et al. 2021). Furthermore, for GPT-3 the energy demand has been estimated to be between 0.002 and 0.005 kWh/query, and with an estimated US carbon intensity of 367 gCO2/kWh the emissions per request have been approximated to 1.5 g CO2 (Vanderbauwhede 2024). With millions of requests every day, this highlights the energy needs, and therefore the emissions, of LLMs not only for training but also for daily use. Consequently, for large-scale LLM use the environmental aspect must be taken into account.

The question of ethics is not only important with respect to the subjects the thesis treats, but also with respect to the applied research methodology.
For example, a number of interviews are conducted with engineers at GKN. Ethical considerations within fields such as privacy, sensitive information and correct citation must be taken into account. Storage of recorded interviews must be in line with the GDPR, which states that data enabling personal identification may only be stored for as long as the specific purpose dictates (GDPR.eu 2025). Furthermore, it should not be possible to determine an interviewee's identity based on the content of the report. Additionally, it is of importance that interviews are conducted such that subjects are not led to disclose irrelevant personal information or sensitive company information.

1.5 Prerequisites

Before presenting the thesis methodology, it is necessary to present a number of prerequisites. These act as the foundation for how the identified issues could be mitigated, and how the desired functionality could be implemented.

1.5.1 AI agents

For tools with AI functionality incorporated, agentic AI systems can ensure that systems or programs within the tool can perform specific tasks through, for example, calling external tools (IBM 2024a). The task of an agent may vary, and it can range from analyzing written texts for grammatical errors to writing computer code. Therefore, there are AI agents of different natures. For example, a so-called "simple reflex agent" does not interact with other AI agents and is pre-programmed such that it performs a specific action given a specific pre-condition (IBM 2024a). There are more sophisticated types of AI agents, and some can determine a specific order in which to implement certain actions given a specific goal and an optimization algorithm (IBM 2024a). The use of AI agents therefore has the goal of optimizing LLM processes, by creating more complex and autonomous systems that themselves can make independent decisions and leverage external tools.
Additionally, an AI agent can reflect on the user input and plan what to do next based on its available tools (IBM 2024a). These tools can be, for instance, code-execution tools, web-search tools or external API-calling tools. The agent can also reflect on its own mistakes and correct those without any human interference. However, in order to make agents more reliable, it is sometimes useful to make use of active human interaction through the incorporation of human in the loop (HITL) aspects (IBM 2024a).

1.5.1.1 Human in the loop (HITL)

Due to the fact that AI agents are partly or fully autonomous systems, there is a risk that they might make unwanted or even harmful decisions. To address this, it is often useful to implement HITL mechanisms that allow a human to intervene, that is, to halt or modify the agent's actions before it proceeds. This approach not only helps to maintain the reliability and accuracy of the responses, but also prevents issues such as the LLM entering infinite loops (IBM 2024a).

1.5.2 LangChain

LangChain is an open-source framework designed to simplify the development of applications that leverage LLMs. It provides a standardized and modular approach to integrating LLMs with external data sources, APIs, and computational tools, enabling the creation of more advanced AI-driven workflows beyond basic API calls (Balasubramaniam et al. 2024). By offering a flexible architecture, LangChain supports RAG, memory management, and tool integration, making it particularly valuable for tailored applications requiring contextual awareness and reasoning. LangChain enables developers to design applications that integrate LLM capabilities with real-world operations, and has gained significant adoption in the AI community, establishing itself as a popular tool for building generative AI applications (Balasubramaniam et al. 2024).
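The agent behavior and HITL gating described in section 1.5.1 can be illustrated with a minimal, framework-free sketch. All names below (the planning stub, the tool table, the approval callback) are invented for illustration and stand in for a real LLM call and real tools; this is not the API of any particular framework.

```python
# Minimal sketch of an agent loop with a human-in-the-loop (HITL) gate.
# stub_llm_plan stands in for an LLM deciding which tool to call next.

def stub_llm_plan(task: str) -> dict:
    """Stand-in for an LLM planning step; always proposes a file deletion."""
    return {"tool": "delete_file", "args": {"path": "results.cns"}}

TOOLS = {
    "delete_file": lambda path: f"deleted {path}",
}

def run_agent(task: str, approve) -> str:
    action = stub_llm_plan(task)
    # HITL gate: a human (here, a callback) must approve before execution.
    if not approve(action):
        return "action rejected by human reviewer"
    tool = TOOLS[action["tool"]]
    return tool(**action["args"])

# A cautious reviewer policy that rejects destructive actions:
result = run_agent("clean up workspace",
                   approve=lambda a: a["tool"] != "delete_file")
print(result)  # action rejected by human reviewer
```

The point of the sketch is that the approval step sits between planning and execution, so harmful or unwanted actions can be halted before they take effect, which is exactly the role HITL mechanisms play in the agent architectures discussed above.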
1.5.3 LangSmith

LangSmith is an observability platform designed to enhance the development and maintenance of LLM-based applications. Created by the developers of LangChain, it provides essential tools for debugging, monitoring, testing, and optimizing AI systems, improving both reliability and interpretability (Balasubramaniam et al. 2024). The debugging and tracing capabilities enable developers to track input-output interactions and analyze intermediate steps. Additionally, the available testing and evaluation tools facilitate benchmarking across various use cases, ensuring robust and well-optimized LLM applications (Balasubramaniam et al. 2024).

1.5.4 LangGraph

LangGraph is an open-source library designed for building stateful, multi-actor applications with LLMs, allowing for the creation of agent and multi-agent workflows. It offers control over both application flow and state while integrating seamlessly with LangChain and LangSmith (LangChain Inc. 2024a). LangGraph uses nodes and edges to model functionality and behavior, effectively linking the various structural components of an agent. The nodes are simply Python functions describing the behavior of each part of the agent, while edges enable the routing logic between the nodes (LangChain Inc. 2024c). There are two types of edges. The normal edge, where the routing is defined in one direction only, is represented by a solid line. The conditional edge, on the other hand, can take more than one path depending on the defined condition, and is represented by dotted lines. These are illustrated in Figure 1.2. Important features of LangGraph that have been relevant and widely used in this project are memory persistence, HITL and centralized state management, which keeps track of global state updates (LangChain Inc. 2024a). Another key feature of LangGraph is LangGraph Studio.
LangGraph Studio allows users to visually deploy graphs using the LangGraph API and also supports quick debugging (LangChain Inc. 2024b). In Figure 1.2, a simple visualization of a graph in LangGraph is shown.

Figure 1.2: Visualization of nodes and edges in LangGraph.

1.5.5 Knowledge embedding methods

The use of LLMs is associated with hallucination, for example because the model has not been trained on certain specific data. Consequently, when using LLMs in fields like engineering, there is often a need to embed internal company knowledge: knowledge that is external to the LLM, since the model has not been trained on it. There are different ways to embed such external knowledge within an LLM architecture. Three different ways are encountered in the thesis methodology, and they are presented here.

1.5.5.1 Entire Context in System Prompt

When using an LLM through an API, one way to embed external knowledge is to state it explicitly in the system prompt of the LLM. How much can be incorporated in the system prompt depends on the context length of the model; if the inserted information exceeds the context length, not all of it will be considered. For example, GPT-4o has a context length of 128 000 tokens (OpenAI 2024a). In such an architecture, the model has access to the same specific external knowledge regardless of the prompt defined by the user. This is illustrated in Figure 1.3.

Figure 1.3: Simple architecture with all the external knowledge accessed by the LLM

1.5.5.2 RAG (Retrieval Augmented Generation)

As opposed to providing the entire context in the system prompt, the RAG architecture is somewhat more sophisticated in nature. Its foundation is the splitting of documents and, given a user question, retrieval from certain relevant parts rather than the entire documentation.
The information from the knowledge base can be split into several smaller segments through chunking, where the chunk size sets the size of the data splits; while a large chunk size may ensure that the data points are coherent, the risk is that they become too general, whereas a small chunk size risks the loss of coherence within the data points (IBM 2024c). There is no exact answer for what a good chunk size is, and it depends on the type of documentation at hand. If a chunk contains an excessive amount of data, there is a risk that specific knowledge is overshadowed by other topics, whereas a risk of context loss is associated with a too small chunk size (Stack Overflow 2024). Additionally, when chunking a document, it may be split at points where relevant specific knowledge is divided into two different chunks instead of being kept together in one chunk. To mitigate this, documents can be split into chunks with a certain overlap, ensuring no abrupt borders between chunks but rather an overlap between chunk data (MongoDB 2024).

After chunking the knowledge base, the text data can be transformed through an embedding model such that it is vectorized. Embedding is a method for representing, for example, text in a numerical way that allows the data to be used as input to machine learning algorithms, since these in general require low-dimensional numerical input (IBM 2025b). As a result, this can improve the performance of the LLM. There are different embedding models; one example is the OpenAI method used in ChatGPT, which allows the model to comprehend how words and categories are linked, as opposed to evaluating each word separately, allowing for improved responses (IBM 2025b). Accordingly, the knowledge base can be transformed into a vector database where all the text is represented through vector embeddings.
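The chunking-with-overlap step described above, together with the similarity-based retrieval it feeds into, can be sketched in plain Python. The word-count "embedding" below is a deliberately naive stand-in for a real embedding model, and the chunk sizes and document text are arbitrary illustrative values, not the parameters used in the thesis.

```python
from collections import Counter
import math

def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size character chunks with a given overlap,
    so content near a chunk boundary appears in both neighbouring chunks."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector. A real RAG pipeline
    would call an embedding model here instead."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """kNN-style retrieval: return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

doc = "combine_cns merges CNS objects. extract_node reads node results. " * 3
chunks = chunk_text(doc, chunk_size=60, overlap=15)
top = retrieve("how do I merge cns objects", chunks, k=1)
```

Only the retrieved chunks, not the whole document, would then be inserted into the LLM prompt, which is the core difference between RAG and the entire-context approach of the previous subsection.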
As opposed to traditional search, where results are explicitly retrieved based on the input word, vector databases enable search based on similarity (IBM 2025a). This means that for an input prompt asking a question regarding the word "smartphone", the traditional search retrieves passages explicitly containing the word "smartphone", whereas a retriever from a vector database would also yield results containing similar words such as "cellphone" (IBM 2025a). There are different algorithms for the actual retrieval. Some algorithms are based on k Nearest Neighbours (kNN) methods, such that the retriever searches for the k vectors that are deemed to be mathematically closest to the defined query vector (Sawarkar, Mangal, and Solanki 2024). A RAG architecture with text splitting, embedding and retrieval is shown in Figure 1.4.

Figure 1.4: RAG architecture

1.5.5.3 Fine-tuning

Fine-tuning, in comparison to the previously mentioned knowledge embedding methods, is a technique for adapting tailored knowledge through additional model training on an already pre-trained neural network. However, a pre-trained model might contain millions or even billions of trained parameters. Changing the weights of all parameters is time-consuming and expensive, and also runs a high risk of overfitting the network (IBM 2024b). Instead, fine-tuning makes use of the already pre-trained parameters and changes are only made to a few parameter weights in the model. This retains the robustness and complexity of the pre-trained model while leveraging customization to a specific use case. Fine-tuning is commonly used when customizing neural networks with a large number of parameters, such as LLMs and computer vision models.
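The kNN-style retrieval described above can be sketched in a few lines. The hand-picked two-dimensional vectors are stand-ins for real embeddings (which typically have hundreds or thousands of dimensions), and cosine similarity is one common, but not the only, closeness measure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def knn_retrieve(query_vec, database, k):
    """Return the k chunk ids whose embeddings are closest to the query,
    measured by cosine similarity (higher = closer)."""
    ranked = sorted(database.items(),
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [chunk_id for chunk_id, _ in ranked[:k]]

# Toy 2-d "embeddings": "smartphone" and "cellphone" point the same way,
# mirroring the similarity-based retrieval example in the text.
database = {
    "smartphone_chunk": [0.9, 0.1],
    "cellphone_chunk":  [0.8, 0.2],
    "engine_chunk":     [0.1, 0.9],
}
top = knn_retrieve([1.0, 0.0], database, k=2)
assert top == ["smartphone_chunk", "cellphone_chunk"]
```

The point of the toy example is that the "cellphone" chunk is retrieved even though it never contains the query word, which is precisely what distinguishes vector search from traditional keyword search.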
Examples of fine-tuning use cases include developing specialized LLMs, such as those designed for code generation or for replicating specific tonalities and writing styles (IBM 2024b). The foundation of fine-tuning, with a base model yielded through an enormous amount of general pre-training data and a fine-tuned model yielded through specifically curated fine-tuning data, is shown in Figure 1.5.

Figure 1.5: Fine-tuning with a data set given internal knowledge

1.5.6 GKN Dummy Data

As presented in section 1.3, no confidential data is used in this project due to GKN's security policies. Consequently, GKN provided a dummy structural analysis case. The structure of this dummy data was created from a real case, ensuring that it aligns with real-case analysis while being significantly more compact, with a highly simplified geometry. Although the dataset contains less information and only non-sensitive data, its structure and file formats closely emulate real-world data and structure. The file structure of the dummy analysis case is presented in ??.

1.5.7 PyCNS Knowledge

PyCNS is an internal Python module used at GKN for extracting and manipulating data in CNS files, which are the result of an FE simulation. The module is used in the thesis, and both the source code and an existing user manual were provided. The available documentation was therefore both of a more descriptive nature and of a pure source code nature. An example of how the source code knowledge was given is displayed below. This is meant to show the structure, since some of it has been redacted in order to not display the entire source code.

def combine_cns(inputs) -> CNS:
    """Combine multiple CNS objects into a single CNS object.

    Short description of functionality and use case.

    Example:
        new_cns = combine_cns([cns1, cns2, cns3])

    Args:
        inputs: A list of CNS objects to be combined.

    Returns:
        CNS: A combined CNS object.
    """
    # ...

    new_cns = ...
    # Redacted combination logic

    return new_cns

Listing 1.1: Redacted combine_cns function

In contrast, the internal PyCNS manual did not contain any source code for the functions, but rather short descriptions and examples in a PDF. A somewhat redacted extract from it, for combine_cns, is shown below.

Extract from Internal Manual 1.1

pycns.combine_cns

combine_cns(inputs) → CNS

Function to combine multiple CNS objects into a single CNS object. Short description of functionality and use case.

Example:
>>> large_cns = combine_cns([cns_list])

Parameters:
• Information about inputs

Returns:
• A pycns.CNS object.

Return type: CNS

1.5.8 Computational Effort Parameters

When working with an LLM-based tool, it is important to measure the software's efficiency. This can be evaluated based on both the computational effort and the computational time required to complete specific tasks. Computational time can be measured based on the latency associated with a given task, while computational effort can be evaluated by analyzing the number of tokens required to generate the solution. When evaluating the performance on a set of questions, two key latency metrics can be used. The P50 latency, also known as the median latency, measures the response time at the 50th percentile, meaning that half of the responses are faster and half are slower than this value. This metric provides an estimate of a typical response time, giving insight into how quickly an agent generally responds under normal conditions. The P99 latency, on the other hand, represents the 99th percentile latency. This value indicates that 99 percent of all responses were completed in less time, while the slowest 1 percent exceed this threshold. The P99 latency is particularly useful for assessing worst-case scenarios, ensuring that the system does not experience excessive delays in response generation. This concept is displayed in Figure 1.6.
Figure 1.6: P50 and P99 latencies

As mentioned above, the computational effort of an LLM can be measured through the number of tokens that are needed to satisfactorily answer a specific request. In natural language processing, text is divided into a certain number of tokens, such that the question prompt defined by a user to an LLM is split into different sections (Microsoft Research 2023). In other words, the input is tokenized. This means that the input text specified by the user, as well as the output text generated by the LLM, corresponds to a number of tokens. For OpenAI's models, a general rule of thumb is that one token can be approximated to four characters of text (OpenAI 2024c). Consequently, an interaction with an LLM where the user has to correct the LLM and provide more context before receiving an acceptable answer will require more tokens, both as input and as output, and will consequently be more computationally demanding than a correct answer yielded directly from a well-written prompt. For a user, there is also a financial aspect to the number of tokens used. OpenAI API pricing is based on the number of tokens, such that GPT-4o has a cost of $2.50 / 1 M input tokens and $10 / 1 M output tokens (OpenAI 2024b).

2 Method

In this chapter, the methodology for answering the two RQs is presented. RQ1, namely what the practical limitations for analysis engineers regarding the adaptation of generative AI are, is answered through data gathering with the help of a literature review and an interview process conducted with engineers at GKN. The result for RQ1 serves as a base for the methodology when it comes to answering RQ2, namely how an LLM-based tool can be systematically designed given the limitations and challenges identified in the RQ1 result. This consists of the software development of an LLM-based tool and a corresponding evaluation of it. The methodology workflow is displayed in Figure 2.1.
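The efficiency metrics of section 1.5.8 can be sketched in code. Both the nearest-rank percentile and the four-characters-per-token approximation are simplifications; real token counts come from the model's tokenizer, and the dollar figures are the GPT-4o prices quoted above.

```python
import math

def latency_percentile(latencies_ms, p):
    """Nearest-rank percentile: the value below which roughly p percent of
    measured latencies fall (P50 = median, P99 = worst-case indicator)."""
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Rough cost estimate for a GPT-4o call, using the figures quoted above:
# ~4 characters per token, $2.50 / 1M input tokens, $10 / 1M output tokens.
CHARS_PER_TOKEN = 4
INPUT_USD_PER_TOKEN = 2.50 / 1_000_000
OUTPUT_USD_PER_TOKEN = 10.00 / 1_000_000

def estimate_cost(input_text: str, output_text: str) -> float:
    """Approximate the dollar cost of one LLM interaction."""
    in_tokens = len(input_text) / CHARS_PER_TOKEN
    out_tokens = len(output_text) / CHARS_PER_TOKEN
    return in_tokens * INPUT_USD_PER_TOKEN + out_tokens * OUTPUT_USD_PER_TOKEN

# A follow-up correction repeats the conversation, so tokens (and cost) grow:
first_try = estimate_cost("question " * 100, "answer " * 100)
with_correction = estimate_cost("question " * 250, "answer " * 200)
assert with_correction > first_try
```

The last two lines mirror the point made in the text: an interaction that needs corrective follow-ups consumes more input and output tokens, and is therefore both slower and more expensive than a well-written prompt answered correctly on the first attempt.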
Figure 2.1: Components of the methodology for answering the RQs

2.1 Data Gathering

This section presents the Data Gathering process, which primarily consists of a literature review and interviews. For completeness, the previously provided dummy data is also incorporated into the flowchart in Figure 2.1.

2.1.1 Literature Review

A literature review was performed in order to gain a greater understanding of the use of generative AI in engineering, and of what the potential challenges in its use may be. To ensure a fruitful and well-structured literature review, the methodology was divided into three main stages: Identification, Screening and Review. This methodology is modeled on Page et al. (2021), where an initial number of screened records is successively reduced such that a number of relevant and high-quality papers are selected for the final analysis. The methodology for this process is further described in this section.

2.1.1.1 Execution

The basis for the literature review was RQ1, namely what the practical limitations are for the use of generative AI in an engineering context. In order to perform a structured literature review that is as repeatable as possible, a number of keywords were considered. These keywords were classified into three major categories: AI Concept, Application and Embedded Knowledge. The motivation for this was that it facilitated the generation of keywords during the thought process, but also structured the later database searching using Boolean operators. The idea was that articles containing at least one keyword from each class were desired. This means that the OR operator could be used within a class, and the AND operator between classes. The corresponding keywords are presented in Table 2.1.
AI Concept: Large Language Model (LLM), Generative AI
Application: Design Process, Knowledge Management, Systems Engineering, Mechanics, Complex Systems, Requirements Engineering, Finite Element Method
Embedded Knowledge: Fine-tuning, Coupling, Role-specific, Knowledge-based, In-house

Table 2.1: Keywords for literature search

An initial search in Scopus with these keywords yielded a total of 97 results, and while some of these were deemed to be relevant, another search was performed such that more records could be screened. The decision was made to only search for the AI Concept and Application keywords. The screening procedure for the literature review is shown in Figure 2.2. This method is highly influenced by The PRISMA 2020 statement: an updated guideline for reporting systematic reviews (Page et al. 2021). This meant that the literature review was divided into three stages: Identification, where keywords were chosen and further refined after an initial search in Scopus; Screening, where records were excluded based on their titles, abstracts, judged overall quality and number of citations; and lastly Review, during which the records included for the literature review were analyzed.

Figure 2.2: Screening procedure in Scopus for the literature review. Highly inspired by Page et al. (2021)

The first keyword search in Scopus led to 1602 results. This was further refined in the Scopus filter by selecting Engineering and Computer Science as subject areas, English as language, and selecting the keywords language model, large language model, knowledge management, requirement engineering and knowledge engineering. This led to 784 results in Scopus. An initial screening was performed by simply reading the titles. Titles that were deemed relevant with respect to the RQs were of interest. Consequently, titles indicating studies in fields like medicine and linguistics were ignored. This left 39 records for further screening.
For each of these 39 records, the abstract was studied. Once again, records deemed relevant with respect to the RQs were selected for further screening. This meant a total of 21 records. These records were subject to the final screening. Here, the entire text was analyzed such that its quality and relevance could be judged. The number of citations was also taken into account, which means that articles ideally should have been deemed relevant and of high quality after an analysis of the text, and be well cited. A total of 7 records were chosen from Scopus for the literature review. In order to increase this number, snowballing together with a limited keyword search was performed on Arxiv.org. This led to 11 articles in total for the literature review.

2.1.1.2 Data Analysis

For the data analysis stage of the literature review, a content analysis was performed for each chosen article. The articles were first read such that an initial understanding of research questions, methods and results could be obtained. After this, the articles were read through again and sections that were deemed relevant with respect to the defined RQs were highlighted. After having performed this for all the selected articles, these highlighted sections were divided into different themes. The thematic analysis was performed as presented by Säfsten and Gustavsson (2020). The themes obtained from the thematic analysis included, for example, Lack of in-house knowledge. For each theme, the findings from the articles were summarized in a coherent text. These are presented in section ??.

2.1.2 Interviews

To establish a further understanding of the potential and challenges of generative AI in engineering, a number of interviews were performed. The advantages of interviews include flexibility and the ability to tailor questions to specific cases (Säfsten and Gustavsson 2020).
In this case, the focus was to collect relevant data in order to answer RQ1. However, there are also disadvantages that must be taken into consideration. One such disadvantage is the fact that a poorly chosen respondent could lead to misleading results (Säfsten and Gustavsson 2020). Nevertheless, since interviews enable direct contact with respondents who may hold key answers regarding the research questions, the method was deemed appropriate. The interview process contained two main stages: Execution and Data Analysis, illustrated in Figure 2.3. The methodology for these is presented in the following sections.

Figure 2.3: Stages for the interviews

2.1.2.1 Execution

The respondents of interest were engineers at GKN Aerospace who have, or have previously had, experience in simulation-based working processes. Ideally, the respondents should have diversified experience in the topic of AI and LLMs, and in incorporating generative AI in their work. Nevertheless, a variation in the type of respondents was desired, since this counteracts biased results and generally widens the perspective in the answers. The interviewees were chosen with help from the supervisor in accordance with the stated requirements. A list of the respondents can be seen in Table 2.2. Some interviews were performed in person at GKN Aerospace in Trollhättan and some were performed digitally.

Semi-structured interviews were performed. This format is a mix between the completely structured interview and the unstructured interview: whereas the structured interview is governed by fixed questions, and the unstructured interview is very open with more of an overall theme guiding the discussion, the semi-structured interview is an in-between form (Säfsten and Gustavsson 2020).
The primary reason for conducting semi-structured interviews was the variability in the respondents' fields of expertise, allowing for dynamic adaptation of the questions during the interview. Furthermore, the limited prior knowledge of certain processes was also a strong reason to perform semi-structured interviews. Consequently, the interview could be adapted to the prior knowledge of the topic that the interviewee possesses. The questions for these interviews are presented in section A.1.

In a study by Griffin and Hauser (1993), the number of interviews needed to identify a given number of customer needs was investigated. The data was collected and analyzed by professionals in the field. The study revealed that the added value plateaued after approximately five to six interviews, see Figure 2.4. Therefore, a number of nine interviews, each lasting approximately 25 to 30 minutes, was deemed appropriate for this project.

Figure 2.4: Number of interviews in relation to identified needs (Griffin and Hauser 1993)

Nevertheless, the question remained of how these nine interviewees were to be selected. Therefore, Sampling in design research: eight key considerations by Cash et al. (2022) was used as a foundation for the choice of interviewees. Since the LLM-based tool was going to be developed for a GKN-specific case, the sampling was performed within a set of GKN engineers. The consideration Design framing: what type of impact on practice do you hope to achieve? presented by Cash et al. (2022) emphasizes the link between the conducted design research and practice, and highlights its importance. Another consideration presented by Cash et al. (2022) is Theoretical framing: where in the theory-building/theory testing research cycle is current knowledge?, where relevant domain knowledge within a population must be considered. Interviewees were therefore to be selected based on their knowledge
and experience within analysis engineering, but also knowledge and experience within AI. Knowledge connected to both of these domains was highly desired with respect to the RQs. These two factors were therefore deemed to provide a solid foundation for gathering relevant existing knowledge within the sample, and for gaining an understanding of the possible impact on current analysis practices that the thesis could yield.

The profiles of the nine interviewees are shown in Table 2.2. The duration of the interviews was deemed sufficient to allow for the collection of comprehensive information while ensuring the project remained manageable within the given time-frame. The interviews were recorded using a SONY ICD recorder, and the recordings were then transcribed using an offline, locally stored transcription software. This approach ensured that the content remained secure and was not publicly accessible under any circumstances, in accordance with the aspect of personal information discussed in section 1.4. Furthermore, the interviews were conducted in English to streamline the transcription process, as the transcription tool used was better optimized for English speech recognition. After the automatic transcription was done, the content was manually revised to entirely comply with the interview recordings.

With the aim of working in accordance with the policies and statutes of GKN regarding data security, an interview contract was made. Before the interview, the contract was provided to the respondent for signature. The contract contained information about how the recordings were to be managed and served as an additional safeguard between the interviewer and the respondent. This was deemed to be in line with one of the considerations presented by Cash et al. (2022), namely Good scientific conduct and ethical appropriateness. The contract can be seen in section A.2.

ID | Title | Role | Site | AI Exp.
1 | Research engineer | Research in digitalization and automation. Working with AI dev. | Sweden | Very High
2 | Eng. method specialist | Assessing fatigue life in manufacturing; specialist in crackprop. | Sweden | Low
3 | Analysis engineer | Analysis lead, supporting hardware design in INC and Pratt & Whitney projects. | Sweden | High
4 | Analysis engineer | Solid mechanics analysis and simulations in ANSYS. | Sweden | Low/Medium
5 | Analysis engineer | Solid mechanics analysis and simulations in ANSYS. | Sweden | Low/Medium
6 | Structural mech. engineer | Simulation and analysis consultant in finite element analysis. | Sweden | Medium
7 | Analysis lead | Leads structural engineering team in India and senior engineers at GKN. | Sweden | Low
8 | Design engineer | Automating design processes and methods used in the company. | Netherlands | Low
9 | Eng. team leader | Working within design principles. | Sweden | Low

Table 2.2: Respondent details, roles, and AI experience.

2.1.2.2 Data analysis

Having transcribed the interviews, a thematic analysis of the interviews was performed. A thematic analysis aims to identify recurring themes within the interview data. These themes represent patterns or insights that may be relevant answers to the RQs (Säfsten and Gustavsson 2020). In order to find themes, it is useful to assign codes to the data. These codes were extracted using a combination of inductive and deductive coding (Fereday and Muir-Cochrane 2006). Hence, a code was assigned to parts of the data deemed to be valuable in answering the RQs, and the codes were then paired and divided into labeled themes. These themes, in contrast to a pure inductive coding strategy, might be defined in advance in accordance with a deductive coding strategy. The themes are in turn linked to a final theory (Säfsten and Gustavsson 2020). Therefore, the transcribed interviews were read through and relevant comments and responses were highlighted.
These highlighted sections represented codes, and from these codes a number of themes were defined. Each code was connected to a corresponding theme, and some codes were connected to more than one theme. These themes are presented in section 3.1.2.

2.2 Software Development Method

After having finished the interview and literature review stages, intended to answer RQ1, the development of an LLM-based tool could be started. This development was intended to answer RQ2, with the results from RQ1 as a foundation. In this section, the development stages are presented: more precisely, how the desired functionality of the LLM-based tool was decided, how an initial strategy was outlined and how the tool was actually developed.

2.2.1 Development Objective

RQ2 considered how an LLM-based tool could be developed and evaluated, given the obtained results from RQ1. These results, presented in chapter 3, highlighted various issues connected to the use of generative AI tools, both from a broad industry perspective and with respect to analysis engineering practices at GKN. While it would have been preferable to develop an LLM-based tool that could mitigate all these issues and be implemented in the entire analysis workflow displayed in Figure 1.1, this was deemed to be too comprehensive given the thesis timeline. Therefore, the key takeaways from the data gathering were used as a foundation for deciding the development objective. The interview study indicated that the post-processing stage of the analysis was of a more cumbersome nature than the pre-processing stage. For example, interviewee 7 emphasized the time-consuming aspect of the post-processing of data in subsection 3.1.5. This meant, among other things, identifying relevant data and extracting the relevant parameters to present. Furthermore, the potential of an LLM when it comes to dealing with extensive result data and summarizing key parts was mentioned by interviewee 8 in subsection 3.1.7.
At GKN, there are a number of internal Python modules for facilitating the analysis and manipulation of certain file formats that are the result of an FE simulation. These modules thus already aim to simplify the cumbersome post-processing stage. Nevertheless, some engineers may be less experienced when it comes to Python. This Python learning curve was outlined by interviewee 6 in subsection 3.1.7. As a consequence, the combination of existing Python scripts for automation and an LLM-based tool to make these more accessible was considered. Interviewee 8 discussed this possible integration in subsection 3.1.7. Such Python modules exist for the analysis and manipulation of CNS, CDB and UNV files, among others. Together with the GKN supervisors, a discussion regarding an appropriate integration between these internal Python modules and an LLM-based tool was held. Based on their knowledge within the area, it was decided to focus initially on the PyCNS module, a module for the analysis and manipulation of CNS files. The PyCNS module was chosen over other existing internal modules since the overall structure of an LLM-based tool would remain largely consistent regardless of the module used. Given that the CNS module is the most comprehensive, it was selected as the starting point. This choice ensured that any future transitions or integrations with other modules could be managed smoothly without significant structural changes. An LLM-based tool tailored for this module was deemed to facilitate the post-processing stage of an analysis and to lower the threshold for the use of available internal Python modules. A number of criteria for such an LLM-based tool, based on the results from RQ1, were defined. This was deemed to further structure and facilitate the development phase. These are shown below.

Criteria
1. Able to manipulate or analyze CNS files using the pycns module.
2. Avoidance of a black-box architecture through incorporation of human knowledge in the process.
3. Prevention of hallucination and erroneous output.
4. Output of well-structured and well-formatted answers that are easy to follow.
5. Robust and user-friendly.

Table 2.3: Development Criteria

The first criterion concerned the successful integration of the PyCNS module into the LLM-based tool. The second criterion was meant to ensure the consideration of several findings from the interviews and the literature review. Interviewee 7 commented, in subsection 3.1.7, on the need for transparency in a potential LLM-based tool, such that the engineer could easily follow computations. Additionally, this was meant to mitigate the theme Over-reliance on AI tools, which was identified in the literature study and presented in subsection 3.1.1. The third criterion was stated based on hallucination, a risk associated with LLMs. This risk was outlined from the theme Hallucination/Low quality output identified in the literature review, presented in subsection 3.1.1. Additionally, this was linked to the interview process, where interviewee 4 in subsection 3.1.3 commented on the poor-quality answers that available models like Copilot and ChatGPT yield when asked theoretical questions connected to analysis engineering. The fourth criterion was meant to reduce confusion and the time-consuming aspects of trying to identify parts of the output. The fifth, and final, criterion was meant to secure the development of a tool that was easy to use and reliable enough to justify its use.

2.2.2 Development Strategy

With the development objective known, an initial development strategy was defined. While it was known that an LLM-based tool focused on the PyCNS module was going to be developed, it was not yet decided how this was going to be done.
Consequently, with the help of the results obtained from the interview stage and the literature study, the development strategy with respect to RQ2 was defined. Since the LLM-based tool was to be focused on PyCNS, an initial logical flow of work with this module was defined. This flow consisted of a specific question related to the data in a CNS file, an initial problem-solving strategy and code writing, successful code execution, and subsequently the presentation of the desired results in a clear and well-formatted way. Furthermore, this workflow was translated into a graph with respect to the desired behaviour of an LLM-based tool. This initial graph contained a number of nodes, corresponding to specific tasks. These nodes were meant to mitigate the challenges connected to the adaptation of generative AI in engineering, challenges that were outlined both in a broad sense and in a more GKN-specific sense. These challenges are presented in section 3.1. For example, a human in the loop (HITL) was incorporated in this graph to avoid black-box behaviour and consequently keep the engineer in charge. With a question related to the pycns module as input, the graph contained the nodes Generation, Human feedback, Code execution, and Present result. In the Generation node, a problem-solving description and corresponding Python code were to be generated. The Human feedback node was to display this reasoning and code to the human engineer, asking for approval to proceed to the Code execution node. This key feature was desired due to the need to avoid a black-box architecture and to avoid running code that the user has not inspected. Thus, if the proposed code was disapproved, the human user was to provide more context or comment on why it was declined, such that the flow returned to the Generation node where a revised description and code could be generated. In the Code execution node, the proposed code was to be run.
In order for the model to present a feasible solution, a check of whether the code execution failed or not was necessary. If the code was executed successfully, the results were to be presented in a clear and well-formatted way in the Present result node. If the execution failed, the flow was to be routed back to the Generation node for the generation of revised, working code. This graph, with its corresponding logic and nodes, was deemed to model the workflow of an engineer performing analysis on CNS objects. This initial graph structure is shown in Figure 2.5.

Figure 2.5: Initial graph structure

The nodes were imagined with the criteria in Table 2.3 in mind. For example, the Human feedback node was deemed to ensure the second criterion, whereas the Present result node was deemed to ensure the fourth criterion. This graph acted as the basis for the actual software development when such an LLM-based tool was programmed.

2.2.3 Development Method

After the initial development strategy phase, it was deemed that some sort of agentic behavior system was required for the development of the sophisticated graph presented in Figure 2.5. The reason for this is that an implementation using manual systems, like chains, without internal decision making and the ability to use external tools, would be impossible with respect to the desired graph structure. After thorough research, it was determined that LangGraph offered the most convenient and effective approach for the design of agentic systems compatible with both open-source and state-of-the-art models. The introductory courses provided by LangChain Academy (LangChain Academy 2025) for LangGraph, LangChain, and LangSmith offered the foundational knowledge necessary for assessing complex graph structures in LangGraph and integrating functionalities from both LangChain and LangSmith.
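The routing logic of the initial graph in Figure 2.5 can be sketched in plain Python. This sketch only mimics the idea of nodes and conditional edges; it does not use the actual LangGraph API, and the stub node functions are illustrative stand-ins.

```python
# Plain-Python sketch of the routing in Figure 2.5: Generation ->
# Human feedback (HITL) -> Code execution -> Present result, looping
# back to Generation on rejection or execution failure.

def run_graph(generate, ask_human, execute, present, question, max_rounds=5):
    """Walk the graph until a result is presented or max_rounds is hit."""
    state = {"question": question, "feedback": None}
    for _ in range(max_rounds):
        code = generate(state)                 # Generation node
        approved, comment = ask_human(code)    # Human feedback node
        if not approved:
            state["feedback"] = comment        # back to Generation
            continue
        ok, result = execute(code)             # Code execution node
        if not ok:
            state["feedback"] = result         # back to Generation
            continue
        return present(result)                 # Present result node
    return None

# Stub nodes: the first proposal is rejected, the revised one succeeds.
def generate(state):
    return "print(1)" if state["feedback"] else "print(0)"
def ask_human(code):
    return (code == "print(1)", "please revise")
def execute(code):
    return True, "executed " + code
def present(result):
    return "RESULT: " + result

assert run_graph(generate, ask_human, execute, present, "q") == "RESULT: executed print(1)"
```

The two `continue` branches correspond to the graph's two loop-back edges: human disapproval and failed execution both return the flow to the Generation node with feedback attached to the state.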
Python was selected as the implementation language because of its familiarity and widespread use. Additionally, after gaining a comprehensive understanding of the PyCNS module, the development of programmable software based on the structure shown in Figure 2.5 could begin.

Nevertheless, such a graph requires the incorporation of internal knowledge, in this case knowledge about the PyCNS module. The results from the literature study, presented in subsection 3.1.1, highlighted the challenges of poor-quality output through hallucination as well as the need for the embedding of internal engineering knowledge. Consequently, four distinct agents were developed based on the knowledge embedding methods described in subsection 1.5.5, namely Entire Context in System Prompt, RAG, and Fine-Tuning. Furthermore, these agents were designed according to the criteria presented in Table 2.3, with the ultimate goal of closely emulating the graph presented in Figure 2.5. A concise overview of these agents is provided in Table 2.4. Finally, the resulting agents are presented in section 3.2.

                    | Agent 1                                  | Agent 2 | Agent 3     | Agent 4
Knowledge Embedding | Complete documentation in system prompt  | RAG     | None        | Complete documentation in system prompt
Post-Training       | None                                     | None    | Fine-tuning | Fine-tuning

Table 2.4: Comparison of Agents across Knowledge Embedding and Post-Training

After developing the four initial agents, an additional main agent was introduced to further improve the robustness of the architecture and to fulfill all the criteria outlined in Table 2.3. Specifically, the initial four agents were primarily designed with the goal of executing code, which is a task that may not always align with user needs. Therefore, to design an agent capable of providing varied responses, and especially of meeting criterion 5, it became essential to implement functionality that could effectively guide the user.
The guidance functionality was achieved through the integration of a routing system capable of distinguishing irrelevant questions and providing explanatory responses without the necessity of executing code. Consequently, the main agent was developed in a modular way, enabling the four initial agents to be integrated into it, hence keeping the code-execution functionality. Ultimately, the agent that performed best in terms of accuracy and efficiency in the evaluation was chosen for integration into the main graph. The idea of the main agent architecture is presented in Figure 2.6, while the final implementation result is presented later in section 3.2.

Figure 2.6: Main agent architecture

2.3 Evaluation

After designing the four different agents, it was necessary to evaluate them against each other, such that one of the four agents could be incorporated into the main agent. The evaluation stage could therefore be divided into two main parts: the initial evaluation of each of the four agents, and the evaluation of the main agent. Both parts were evaluated on their accuracy and efficiency performance, according to subsection 2.2.3.

2.3.1 Agent 1-4 Evaluation

To assess the performance of the developed agentic systems, a structured QA pair evaluation was conducted. The evaluation focused on two key aspects. One was accuracy, which measured how well the agents responded to predefined questions. The other was efficiency, which measured how effective the computational process for answering the questions was, analyzed through latency and token data. Each agent was tested using a set of 30 QA pairs that were carefully designed based on a thorough examination of the PyCNS documentation. The questions were selected to ensure diversity in style and execution while maintaining correctness.
All the answers in the QA pairs included the computed result, which was pre-calculated manually in Python using the PyCNS module. To enhance the reliability of the evaluation, the experiment was repeated across five separate runs. This allowed for a more robust assessment of the agents' performance and reduced the impact of any variability in individual runs. The entire evaluation process was conducted using LangSmith, which enabled a fully automated testing pipeline. By automating this process, the evaluation became both efficient and reproducible. Finally, all grades were manually inspected by a human to further review the results.

For Agent 2, with its RAG architecture, an initial evaluation was needed to choose the relevant retriever parameters. RAG performance depends on a number of factors, such as the choice of embedding method and vector database. A comprehensive investigation of which RAG implementation was best for this agent was not performed, since it is such an extensive domain. Nevertheless, the chosen chunk size and k value were investigated to a certain extent. The choice of chunk size has been shown to have a considerable effect on the performance of RAG configurations, where a larger chunk size might provide more context while also increasing the time to process the provided information (Wang et al. 2024). The choice of chunk size is also dependent on the density of information in the documentation from which data is retrieved (Zhong et al. 2024). Furthermore, when using a k nearest neighbours (kNN) algorithm for document retrieval, the performance is linked to the selected k value (Leto et al. 2024). Therefore, a somewhat naive investigation of the effect of chunk size and k value was performed for Agent 2. Since the information density varied in the documentation, two different strategies were outlined.
One strategy consisted of setting a smaller k value of 5 while increasing the chunk size between values in the range of 250 to 1750, while the other strategy consisted of setting a smaller chunk size of 200 but increasing the k value in the range of 8 to 20. Initial QA evaluations, consisting of only one repetition, were performed for these different configurations such that one could be selected for the comprehensive QA evaluation of Agent 2.

For fine-tuning, the number of pre-trained parameters of a model can affect the performance (Lu, Luu, and Buehler 2024). Therefore, two different models were evaluated for Agent 3. Since the OpenAI fine-tuning platform was used, these were GPT-4o and GPT-4o-mini. This enabled an initial investigation of how the performance varies between a smaller and a larger model in terms of parameters.

2.3.1.1 Accuracy Evaluation

To measure accuracy, the response of each agent to the 30 questions was compared against a predefined reference answer using an LLM. The grader assessed correctness by assigning a binary value to each response: TRUE if the response was sufficiently close to the reference answer, and FALSE if it contained incorrect or conflicting information. The grading process followed structured evaluation criteria to ensure consistency and objectivity. Specifically, the grader focused solely on factual correctness, ensuring that responses did not contain contradictions. Additionally, responses that included more information than the reference answer were still considered correct, provided they remained factually accurate. The structured evaluation prompt used by the grader was inspired by the LangChain GitHub repository (David 2024), ensuring a well-defined and systematic approach to evaluation. To ensure that the grader was trustworthy in its TRUE/FALSE output, a multitude of predicted answers were compared to the provided reference answers with respect to the grading.
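The binary grading and its aggregation can be sketched as follows. In the thesis the verdict comes from an LLM grader run through LangSmith; the parsing heuristic and the example verdict strings below are assumptions made for illustration:

```python
# Sketch of the TRUE/FALSE grading step (the verdict strings would come
# from the LLM grader; parsing and aggregation here are illustrative).

def parse_grade(grader_output: str) -> bool:
    """Normalize the grader's textual verdict to a boolean."""
    return grader_output.strip().upper().startswith("TRUE")

def accuracy(verdicts: list[str]) -> float:
    """Fraction of QA pairs graded TRUE."""
    grades = [parse_grade(v) for v in verdicts]
    return sum(grades) / len(grades) if grades else 0.0

score = accuracy(["TRUE", "FALSE", "TRUE", "TRUE"])  # 3 of 4 graded correct
```

Aggregating over the 30 questions and 5 repetitions then gives the per-agent accuracy figures reported later.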
The QA evaluation framework used in the assessment is illustrated in Figure 2.7.

Figure 2.7: QA evaluation for accuracy

2.3.1.2 Efficiency Evaluation

The efficiency evaluation is divided into latency performance and token usage. These were measured to understand how quickly the agents generated responses and how much computational power they demanded, respectively. Latency was recorded in seconds, representing the time taken for an agent to provide an answer after receiving a query. By analyzing both P50 and P99 latencies, the evaluation captured both the expected response times under normal conditions and the potential worst-case scenarios, ensuring a balanced evaluation of the performance. Token usage, on the other hand, was tracked as two numerical values provided by LangSmith, aggregated over all the LLM API calls made inside the graph. One represents the input or prompt tokens, while the other represents the output or completion tokens. Since OpenAI charges based on the number of tokens used, this measurement gave an indication of how the utility costs of the different agents differed.

2.3.2 Main Agent Evaluation

The main agent evaluation was performed such that an approach for the main graph, displayed in Figure 3.4 and consisting of two subgraphs, could be decided. The agent deemed most suitable in terms of accuracy and efficiency, given the results yielded from the QA evaluations, was selected for inclusion in the main graph. Similar to the testing of agents 1-4, the evaluation of accuracy and efficiency through QA assessment was conducted with reference to the main graph. Nevertheless, this was done on a different set of questions than in the agent evaluation. Whereas the agent evaluation consisted of 30 questions based on code generation, the questions for the main graph evaluation were written based on three pillars to capture the variety in style mentioned in subsection 2.2.3.
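The latency summaries can be reproduced from per-question timings with a nearest-rank percentile. This is a sketch: LangSmith computes these statistics internally, and the timing values below are made up for illustration:

```python
import math

def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile, p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical per-question latencies in seconds.
latencies = [1.2, 0.9, 1.1, 5.4, 1.0, 1.3, 0.8, 1.1, 1.2, 6.0]
p50 = percentile(latencies, 50)  # typical response time
p99 = percentile(latencies, 99)  # worst-case tail
```

The contrast between the two values is the point of reporting both: the P50 describes the normal case, while the P99 is dominated by the slow outliers.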
These pillars were code generation, description generation, and detection of relevance. Therefore, 31 questions were written such that the reference answer contained either a result from Python code, a description connected to PyCNS functionality, or the standard output "You asked something that was not related to the PyCNS documentation!" given an irrelevant question. None of the 31 questions used in this QA evaluation were identical to any of the 30 questions used in the previous QA evaluation.

2.4 Usability Testing

In order to verify whether the developed software could be used in an industrial context, its usability was examined through user testing. This user testing, in which two analysis engineers at GKN tested the software, used the Usability branch of the system acceptability architecture presented by Nielsen (1993) as a foundation. These branches are displayed in Figure 2.8. The engineers were chosen such that they were familiar with the PyCNS module and had used it in their work previously. User testing is of particular importance here since the software was not developed by its intended users.

Figure 2.8: The branches of system acceptability (Nielsen 1993, p. 25)

While the method for investigating factors like Cost (through token utilization) was presented in section 2.3, important usability factors like Easy to learn and Subjectively pleasing are more difficult for a developer to get an understanding of. Therefore, these required concrete testing with intended users. The testing methodology was inspired by Farzaneh and Neuner (2019), where a usability evaluation was performed on a software tool in the domain of bio-inspired design. The user testing procedure consisted of a number of steps. Through a provided LangGraph user interface, a brief tutorial of the tool functionality was demonstrated to the engineer.
After this, the engineers were asked to use the tool for assignments similar to those for which they would normally use the PyCNS module directly in a programming interface. This part of the user testing was conducted as an observational technique: the engineers were provided with the dummy case presented in section 1.5, their behaviour was observed, and no interference with the users took place. The advantage of observational techniques is that they are deemed to yield an essentially authentic user environment, in which problems connected to usability can be detected and further addressed (Farzaneh and Neuner 2019). Furthermore, observing the users rather than actively instructing them in what to do was deemed to reduce the risk of biased results. Afterwards, a brief semi-structured interview was conducted with the engineers to collect their ideas regarding the usability of the tool. This semi-structured interview consisted of six set questions, taken from Srikanth, Hasanuzzaman, and Meem (2024), in which the usability of existing state-of-the-art LLMs within the domain of threat intelligence enrichment is investigated. These questions are shown in section C.1. Lastly, the results from the observations and the interviews were analyzed such that an indication of the usability could be obtained. The usability testing method is summarized in Figure 2.9.

Figure 2.9: The structure of the usability testing

3 Results

3.1 Data Gathering

After finishing the literature review and the interview process, various results relevant to answering RQ1 were obtained. In the following chapter, these results are divided between those explicitly obtained from the literature review in subsection 3.1.1 and those explicitly obtained from the interview process in subsection 3.1.2.
While results from the interview process are generally more applicable to the specific GKN use case, and those obtained from the literature review tend to be of a broader nature, there is an intersection between the results obtained in both stages.

3.1.1 Literature Review

From the literature review, a number of challenges connected to the use of generative AI in engineering were identified. These are presented in the following section. The different challenges are often connected to each other, meaning, for example, that a poor-quality output yielded through hallucination may be connected to a lack of in-house knowledge and poor prompting. Generative AI in engineering has its set of limitations, and some of these challenges may be chronic, but often it seems that they can be mitigated with the correct strategy.

Applications of LLMs in engineering

The use of generative AI, and more specifically LLMs, shows promise in a number of engineering fields. Among these fields are computational mechanics, engineering design and requirements engineering. Here, a brief introduction to LLM applications in engineering, based on the reviewed literature, is presented.

The use of LLMs could facilitate the identification and analysis of data, whether it is related to requirements or to information connected to standards. Norheim et al. (2024) investigate the use of LLMs with respect to engineering requirements in complex systems. Arora, Grundy, and Abdelrazek (2024) also consider the use of LLMs in RE, such that a SWOT analysis is performed for the different stages of the RE process. Furthermore, LLMs have the potential to reduce the amount of manual work in product development. Ehring et al. (2024) investigate how LLMs can be used for information classification, such that role-specific identification of information can be streamlined. LLMs also show potential when it comes to computational aspects of engineering.
A more mechanical-engineering-focused study on the use of LLMs has been performed by Ni and Buehler (2023), who with the help of LLMs create AI agents, so-called MechAgents, capable of solving elasticity problems through multi-agent collaboration. Additionally, an article by Alexiadis and Ghiassi (2024) brings forward the use of LLMs integrated into physics-based simulation software. Alexiadis and Ghiassi (2024) investigate how an AI model could be used for geometry generation, mesh set-up and material property definition in the simulation software, such that the simulation can be run and the AI model can additionally give thorough results to the user. The use of LLMs in computational mechanics is likewise investigated by Brodnik et al. (2023), where the potential and challenges of the use of LLMs in applied mechanics are discussed. One of the possibilities in the field of applied mechanics is assisted programming, where the creation of computational algorithms can be facilitated through code generation and translation (Brodnik et al. 2023).

Engineering design is an area subject to investigation on how LLMs can further streamline engineering processes. Chiarello et al. (2024) analyze 15 355 research papers within engineering design and discuss the potential of LLMs, treating the use of LLMs in the four engineering design phases Problem Definition, Conceptual Design, Embodiment Design and Detailed Design. Pradas Gomez et al. (2024) investigate the use of LLMs as support for designers in complex systems engineering, where two aerospace cases are investigated: one being the generation of Python code for an aircraft actuation system UML diagram, and the other being code for geometry generation through interaction with a CAD kernel. Additionally, the use of LLMs in the automotive industry, combining LLMs with RAG to further automate design workflows and improve software development, is investigated by Zolfaghari et al.
(2024) through a comparative study of currently available LLMs.

There are various approaches to using LLMs within engineering. For instance, Ni and Buehler (2023) propose a multi-agent approach where multiple agents collaborate to solve tasks. Similarly, Du et al. (2023) introduce a method for addressing problems, such as arithmetic tasks, using a multi-agent system. In their approach, agents debate the solutions generated by each participant and collaboratively determine a final answer.

The use of LLMs also shows potential regarding integrated use with external engineering software, as shown by, for example, Alexiadis and Ghiassi (2024) with the combination of LLMs and physics-based simulation software. Jiang et al. (2024) investigate facilitating the creation of building energy models (BEMs) through LLMs, such that an "auto-building modeling platform" can transform natural language from a user input prompt into a particular generated building model, via the integration between an LLM architecture and a physics-based simulation (Jiang et al. 2024).

Lack of in-house knowledge

A prominent challenge regarding the use of generative AI, and more precisely LLMs, in an engineering context seems to be the contrast between their general nature and the need for internal competence in tasks. Ehring et al. (2024) highlight the current practice in product development of manually identifying and analyzing suitable information from internal documentation, where the inquiry is whether LLMs can make the process more effective. However, the data LLMs have been pre-trained on may differ substantially from internal documentation. Consequently, when using generative AI tools for engineering purposes, embedding in-house knowledge is of importance (Ehring et al. 2024). Indeed, as Ehring et al. (2024) write: "without fine-tuning, today's model are unable to classify information in a role-specific way. The models lack too much industry-specific knowledge".
A need for fine-tuning could therefore be considered. Nevertheless, fine-tuning is demanding with respect to the necessary data (Ehring et al. 2024). Chiarello et al. (2024) discuss the possibility of using LLMs for facilitating engineering design by combining them with 3D models such that geometry models could be created, although stressing the importance of properly defined material constraints and manufacturing processes. Such constraints could be guided by internal knowledge, and LLMs that have been subject to fine-tuning could therefore be considered for this (Chiarello et al. 2024). The general nature of LLMs is brought up by Pradas Gomez et al. (2024), as available LLMs are not trained on internal company and project methods. In addition to fine-tuning, RAG can be used to mitigate this challenge (Pradas Gomez et al. 2024).

Issues with sensitive information

Using sensitive information as input to an LLM that is not run locally is problematic, and hence one of the main issues with using currently available LLMs in an engineering context. Zolfaghari et al. (2024) mention the current challenge for the use of LLMs in certain engineering fields due to the need for internally stored private data, trade secrets and technical data. Among the currently available LLMs, the highest-performing ones are accessed through external APIs, which is a hindrance to the use of LLMs in sensitive domains like defense projects (Pradas Gomez et al. 2024). However, there are LLMs that can be run locally, thereby eliminating this issue. Examples of such models are LLAMA3 and Mistral, and the ability to run LLMs locally is associated with significant benefits in engineering fields where sensitive data is processed (Zolfaghari et al. 2024).

Drawbacks in prompting

The use of LLMs often presents challenges related to prompting. Creating an effective prompt is needed to achieve the desired output. For their use cases, Pradas Gomez et al.
(2024) had to experiment with multiple versions and iterations of prompts to obtain outputs that were considered valid responses. Arora, Grundy, and Abdelrazek (2024) mention the importance of prompt engineering, as the output behaviour of the LLM is strongly reliant on the prompt design. Prompting is in itself a technique which must be mastered for efficient use of LLMs. Prompts can be specified in a pre-defined manner like "Context, Task and Expected Output" (Arora, Grundy, and Abdelrazek 2024). Prompt engineering makes use of the context of the specific task, the language and the known abilities of the LLM (Arora, Grundy, and Abdelrazek 2024). As Arora, Grundy, and Abdelrazek (2024) found in their study: "slightly different prompts can produce very different outputs". Jiang et al. (2024) also emphasize the importance of prompting: given restricted computational resources, the performance of the model can be improved through well-defined prompts (Jiang et al. 2024).

The issues related to prompting can manifest themselves in flawed output, for example a result obtained through hallucination. To reduce the risk of hallucination and of erroneous output caused by a multi-step process with faulty reasoning by the LLM, there are methods in prompt engineering like "chain of thought prompting", consisting of clear step-by-step instructions (Brodnik et al. 2023). Mitigation of these errors can also be achieved through RAG, where retrieved parts of documents are added to the input prompt (Brodnik et al. 2023). There are two dimensions to prompt engineering with respect to LLMs: one approach is the use of APIs, such as the OpenAI API, meaning the LLM is used in a programming environment, as opposed to the less technical use in a pre-available user interface (Alexiadis and Ghiassi 2024).
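The pre-defined "Context, Task and Expected Output" structure mentioned by Arora, Grundy, and Abdelrazek (2024) can be captured with a small template helper. The helper and the example field contents below are illustrative assumptions, not part of any of the cited works:

```python
def structured_prompt(context: str, task: str, expected_output: str) -> str:
    """Assemble a prompt following the Context / Task / Expected Output pattern."""
    return (
        f"Context: {context}\n"
        f"Task: {task}\n"
        f"Expected Output: {expected_output}"
    )

prompt = structured_prompt(
    context="Results from a finite element simulation stored in a CNS file.",
    task="Summarize the maximum stress found in the result data.",
    expected_output="A short paragraph stating the maximum stress and its location.",
)
```

Making the three fields explicit forces the prompt author to state the task and the desired output format separately, which is one way of reducing the output variability that the cited studies attribute to loosely worded prompts.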
Over-reliance on AI tools/Loss of human engineering competence

As the use of LLMs is associated with a risk of hallucinated or poor-quality output, it is important to avoid over-reliance on their use. LLMs should not be incorporated in such a way that engineers use them as a "black-box" solution with blind faith in their output; rather, a "best of both worlds" practice should be strived for. As Pradas Gomez et al. (2024) note, the current nature of LLMs as "helpful assistants" is contrary to the nature of a competent designer, who is not inclined to always respond to a question given the provided information. The competent designer should question flawed information, and in certain cases express the need for further additions (Pradas Gomez et al. 2024). As Chiarello et al. (2024) highlight in their example of translating functional models into natural language, the concept of functional analysis, described as a specialized language mastered by certain designers, carries a risk of information loss during the translation process. Specific domain insight possessed by a number of engineers could therefore be lost in translation as the LLM is invoked. And while LLMs seem to have significant potential in facilitating tasks, the indication is that yielded output should be inspected by a competent human actor to mitigate this. The reliance, and more specifically the need for verification of outputs, may differ between engineering fields. Norheim et al. (2024) investigate the challenges of applying LLMs to requirements engineering (RE) tasks, and discuss this with respect to RE tasks like requirement translation and requirement analysis. Furthermore, Norheim et al. (2024) note that a certain level of error is currently inevitable, and that there is a need for organizations to determine domains where a certain level of error may be acceptable and domains where it is not.
The risk of imperfect LLM performance should especially be considered in environments where human safety is a major aspect (Norheim et al. 2024).

Hallucination/Low quality output

A common problem with LLMs is their lack of reasoning ability. Hallucination, meaning that the LLM gives an output that is factually incorrect but probable from a linguistic perspective, is a challenge with LLMs (Brodnik et al. 2023). Chiarello et al. (2024) mention the danger of hallucinations from an engineering design perspective, where they may cause issues especially when such tools are used by designers with a lack of experience. However, there are ways the challenge of hallucinations can be mitigated. One such way is through a multi-agent reasoning model. The multi-agent reasoning model proposed by Ni and Buehler (2023) demonstrated the ability to self-correct hallucinations. For the multi-agent system proposed by Du et al. (2023), the authors found that having agents "debate" with each other led to a higher accuracy in output. In addition to multi-agent systems, there are other methods to reduce the risk of hallucination, one of which is the use of RAG (Brodnik et al. 2023).

3.1.2 Interviews

The results from the thematic analysis are presented through six different themes. These are Current risks and challenges, Cumbersome tasks pre-processing, Cumbersome tasks post-processing, Factors hindering full automation, AI implementation possibilities and Risks and challenges with AI. The themes Cumbersome tasks pre-processing and Cumbersome tasks post-processing refer to the stages before and after a simulation process. To substantiate the results for a given theme, several quotes are presented in its context.

3.1.3 Current Risks and Challenges

During the interviews, a number of points related to risks and challenges of current workflows were identified. These were, for instance, connected to how knowledge is stored, retrieved and used. Interviewees were asked whether they used AI in their work.
While existing AI tools are often unsuitable for specific tasks, interviewee 8 noted the limitations of using non-local LLMs as tools from a data secrecy point of view:

"You cannot just put anything in there." – Interviewee 8

Furthermore, non-local LLMs like ChatGPT perform most effectively on generic questions. However, this generic nature often clashes with analysis procedures, where one may need answers to theoretical questions, such as on the topic of solid mechanics. As interviewee 4 responded to the question of whether he used existing LLMs in his work:

"I've tried to use both Copilot and ChatGPT to ask like theoretical questions, but none these models have been trained on the manuals" – Interviewee 4

Consequently, there is the consideration of how to transfer existing knowledge to engineers. This transfer of knowledge seems to be especially important when it comes to less experienced engineers. On the topic of current challenges in the workflow, interviewee 7 commented on this, speaking as an experienced analysis lead:

"It's a very changing team. So we always have new people and they need to understand the codes, which is not always easy" – Interviewee 7

The issue of knowledge embedding is not only relevant to engineers, but also to other parts of the workforce. Interviewee 1 investigated the use of LLMs and augmented reality in assembly. Indeed, the potential use of such tools in an assembly context was justified by interviewee 1 with an example from an assembly site:

"They don't keep their inspectors for a long amount of time, they tend to rotate people who they hire quite frequently, so they have to be retrained constantly." – Interviewee 1

Thus, such a challenge indicated the potential benefit of LLMs as a helping tool. The current risks and challenges in the analysis engineering workflow, as well as in other fields, highlighted the theme of challenges connected to the storage and retrieval of knowledge.
3.1.4 Cumbersome Tasks Pre-Processing

The interviewees were asked if there were any tasks in their workflow they found extra cumbersome. For analysis engineers, the answers could be divided into cumbersome tasks in the pre-processing stage and in the post-processing stage, that is, during the preparation of a simulation and during the analysis and interpretation of the output. When using Ansys as simulation software, run scripts are written in APDL. For the pre-processing, interviewee 7 commented on the issue of setting up certain run scripts:

"And then some of the problems come from scripts that are not properly set up and then cause problems." – Interviewee 7

This means time is consumed inspecting and fixing code so that the simulation can run properly. These run scripts contain numerous input variables, and interviewee 5 commented on the process of changing these for a specific simulation:

"Normally, in some sort of run script, at the start of each run, you have a lot of input variables and today you change these manually." – Interviewee 5

Furthermore, the input to the run scripts needs to be identified and extracted at an earlier stage. This data can, for example, be given in Excel, where it must be analyzed and retrieved. Interviewee 5 remarked on this:

"Before you can actually give that as an input to the analysis, you need to do a few steps. One of them could be removing duplicates. Others could be formatting [...] So before the run script we could have another script basically prepared to convert that Excel sheet into a format that works for Ansys." – Interviewee 5

The indications were that the inconvenient aspects of the pre-processing stage were due to the current procedure for certain tasks related to the analysis, retrieval and formatting of input data. However, some interviewees mentioned perceiving the pre-processing stage as rather elementary and straightforward.
Interviewee 4 expressed the following:

"And I would say that I don't experience that much difficulty and it's actually quite fast to do it" – Interviewee 4

3.1.5 Cumbersome Tasks Post-Processing

For the post-processing stage, a number of tasks must be performed such that the output script can be transformed into presented data. Interviewee 3 commented on the current nature of the post-processing:

"I believe these days we spend most of the time in post-processing. Especially working with data, moving data, moving information from one system to another. From Ansys to text to PowerPoint or Word, or transforming the data in the process and then writing conclusions about the data that we got or the images that we see." – Interviewee 3

Data must be transferred between different programs, like Ansys and PowerPoint or Excel, which creates a need for the engineer to act as an intermediary, as these systems are not integrated. Interviewee 7 also remarked on the time-consuming nature of post-processing with respect to their engineering team:

"And then they also spend a lot of time post-processing the data [...] They need to extract a lot of graphs, time points, load the result files, pick the nodes... All these kind of things." – Interviewee 7

These repetitive tasks in the post-processing stage, which do not always demand a lot of thinking, were further commented on by interviewee 7:

"So right now, I think we are spending a lot of times on things that the computer could do. And there's not a lot of engineering thinking in the current work." – Interviewee 7

Therefore, it could be considered that the engineer could be relieved of certain of these cumbersome and repetitive tasks.

3.1.6 Factors Hindering Full Automation

Although AI has been present for a while, with its applications expanding and improving significantly, there still remains a degree of reluctance toward it. The reasons behind this reluctance can differ between engineers.
One key challenge mentioned in the interviews was getting the engineers themselves to trust the transformation towards incorporating AI tools. For instance, interviewee 8 remarked on this:

"But most of the time comes from getting the engineers that are doing it by hand now to trust the tools that you develop." – Interviewee 8

Furthermore, a recurrent opinion was that the complexity of the processes makes it difficult to automate them using AI. As discussed in subsection 3.1.8, LLMs can be considered non-deterministic. Given that many workflow processes require individual assessment and the application of common sense, interviewee 2 expressed his scepticism:

"I think a lot of analysis depends on judgement and how do you automate that judgment part?" – Interviewee 2

It could also be because some of the existing software currently used for simulation or calculation is not compatible with automation, as noted by interviewee 3:

"If you think about ANSYS only, then ANSYS has its limitations on how much you can actually automate it. Of course, every software will have its own limitation and ANSYS has its set of limitations." – Interviewee 3

3.1.7 AI Implementation Possibilities

A recurring theme from the interviews highlighted the importance of preserving the knowledge of the engineer and keeping them as the central decision-maker in the workflow process. Hence, maintaining oversight and not fully relying on the LLM was considered a crucial element, as interviewees 4 and 8 commented:

"Because everyone can kind of do an analysis, but you need an engineer to actually understand what you are putting in and what the output is" – Interviewee 4

"I see it more as like an assistant or like a helper, that the engineer is still in charge." – Interviewee 8

Several respondents mentioned the idea of letting the tool give feedback to the engineer before proceeding and initializing a simulation or any other action of a more significant nature.
This ensures that the control and responsibility remain with the engineer, as previously discussed. Interviewee 7 suggested the following:

"The engineer should go, should be able to go and see what the program does and try to understand." – Interviewee 7

Another suggestion to ensure that the engineer retains control of the process was proposed by interviewee 1. The idea focuses on establishing clear boundaries for the LLM through a set of predefined actions. This not only delegates more control to the engineer, but also makes the LLM less sensitive to hallucinations.

"So instead of having it more freeform, it has to choose from a list of actions. So you have more control of what the LLM is doing rather than directly controlling a software." – Interviewee 1

A further recurring point was the idea of using the LLM for analyzing result data from the simulations and using it for visualization. Interviewee 8, among others, remarked on this:

"So indeed having lots of results, I think a large language model is really good in summarizing that, creating pictures for you, plots, things like that." – Interviewee 8

Current analysis processes are streamlined with the help of, for example, Python scripts for computations. However, not all engineers