Data-Driven Automated Reporting
Solution for External Collaborations
- LLM-driven KPI Definition
A Proof-of-Concept at AstraZeneca

Master’s Thesis

Jakob Juul and Marcus Lorentzon

DEPARTMENT OF PHYSICS

CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2026
www.chalmers.se


Master’s Thesis 2026

Data-Driven Automated Reporting Solution for
External Collaborations - LLM-driven KPI
Definition

A Proof-of-Concept at AstraZeneca

Jakob Juul: jakobjuu@chalmers.se
Marcus Lorentzon: marcuslo@chalmers.se

Department of Physics
Chalmers University of Technology

Gothenburg, Sweden 2026


Data-Driven Automated Reporting Solution for External Collaborations - LLM-
driven KPI Definition
A Proof-of-Concept at AstraZeneca
Jakob Juul
Marcus Lorentzon

© Jakob Juul & Marcus Lorentzon, 2026.

Supervisor: Jesús Pineda, Department of Physics at Gothenburg University
Examiner: Giovanni Volpe, Department of Physics at Gothenburg University

Degree project report 2026
Department of Physics
Chalmers University of Technology
SE-412 96 Gothenburg
Sweden
Telephone +46 31 772 1000

Cover: A cartoon sketch of a robot AI processing text information from the phar-
maceutical industry. Generated by Google Gemini’s Nano Banana Pro model.

Typeset in LaTeX, template by Kyriaki Antoniadou-Plytaria
Gothenburg, Sweden 2026



Data-Driven Automated Reporting Solution for External Collaborations - LLM-
driven KPI Definition
A Proof-of-Concept at AstraZeneca
JAKOB JUUL and MARCUS LORENTZON
Department of Physics
Chalmers University of Technology

Abstract
This thesis presents a proof-of-concept, developed with AstraZeneca (AZ), that ex-
plores automating progress reporting for external collaborations by testing whether
a large language model (LLM)-driven system can extract objectives from contracts
and translate them into tailor-made key performance indicators (KPIs). Objective
extraction is fairly reliable, reaching accuracies of around 85% in several cases,
but converting objectives into KPIs that stakeholders judge as relevant, clear,
actionable, and measurable is substantially less reliable. Fewer than half of the KPIs
met each quality criterion on average, and 39% met none. Survey responses noted
that KPIs were often unclear, overly generic, or poorly timed, and skewed toward
simple counts (e.g., “number of models”) that miss quality and impact.

From interviews conducted at AZ, a set of general KPIs that were deemed meaningful
to measure in a collaboration project could be derived. The final evaluation
suggests that these KPIs (e.g., external engagement and budget coherence)
outperform collaboration-specific KPIs generated directly from objectives. This un-
derscores the difficulty of creating bespoke target measures in diverse contexts.

Despite these issues, the approach offers practical value. In principle, the pipeline
should be better suited for agreements with explicit milestones (e.g., business or
commercialisation contracts), where more clearly defined expected outcomes sup-
port better-formed KPIs. However, this cannot be conclusively established by the
implementation in this thesis, due to limited data.

Ultimately, translating qualitative objectives into quantitative, decision-grade KPIs
remains inherently difficult. Contemporary LLMs are capable across many aspects of
automation, but evidently less reliable for high-judgement and context-specific KPI
design that balances relevance, clarity, actionability, and measurability, at least
when following the approach outlined in this thesis. Therefore, the most defensible
near-term use is in metadata extraction and recommendation, while still requiring
a human-in-the-loop as a safeguard. In turn, this can improve customer relation-
ship management (CRM) metadata completeness and enable collaboration health
insights and automated reporting.

Keywords: AstraZeneca, KPI, LLM, GPT-4o, contracts, collaboration, automated,
reporting.



Acknowledgements
Firstly, we would like to express our deepest gratitude to our AstraZeneca supervi-
sors, Gaurav Gupta and Per Hillertz, for their guidance, trust, and day-to-day sup-
port throughout this project. Your insight, availability, and encouragement shaped
both the direction and the quality of this work, and we are sincerely thankful for the
opportunities to learn from you. Many thanks also go to the wider M&A IT team
at AstraZeneca for your warm welcome, ongoing support, and for integrating us so
well into the organisation. We are also thankful to the interviewees at AstraZeneca
for taking the time to walk us through their processes and thoughts.

Moreover, we want to especially show our gratitude to Jesús Pineda, our university
supervisor, whose commitment extended well beyond formal obligations. Even af-
ter his official appointment with the university had ended, he continued to provide
thoughtful feedback and steady mentorship. His dedication and support were in-
strumental in the completion of this thesis.

We are also grateful to our examiner, Giovanni Volpe, for his role in the assessment
process and for his perspective on the work. While less involved in the day-to-day
development of the project, his input has been an important part of the overall
evaluation and refinement of this thesis.

Finally, we would like to thank everyone who, in ways large or small, contributed to
this project’s progress and to our growth during this period.

Jakob Juul and Marcus Lorentzon, Gothenburg, January 2026



List of Acronyms

Below is the list of acronyms used throughout this thesis, in alphabetical order:

AI Artificial Intelligence
AZ AstraZeneca
BD Business Development
CRM Customer Relationship Management
JSON JavaScript Object Notation
KPI Key Performance Indicator
LLM Large Language Model
ML Machine Learning
NLP Natural Language Processing
PoC Proof-of-concept
R&D Research and Development



Contents

List of Acronyms
List of Figures
List of Tables

1 Introduction
1.1 Background
1.1.1 Problem Description
1.1.2 Scope of Implementation
1.2 Purpose
1.3 Limitations
1.3.1 Scope-related
1.3.2 Organisational
1.3.3 Data access

2 Theory
2.1 Deep Learning
2.2 Large Language Models
2.2.1 GPT-4o

3 Interviews
3.1 Different Kinds of Agreements
3.2 Reporting Data Workflow
3.3 Relevant KPIs
3.4 Pain-points and Needs

4 Implementation
4.1 Input
4.1.1 Data Preprocessing
4.1.2 Data Annotation
4.2 System
4.3 Output

5 Evaluation Criteria
5.1 Evaluations of Objectives with ML Metrics
5.2 Evaluation of KPIs with Human Feedback

6 Results
6.1 Performance of Objective Extraction
6.2 Performance of KPI Definition
6.2.1 Quantitative Survey Results
6.2.2 Qualitative Survey Results

7 Discussion
7.1 Objective Extraction
7.2 KPI Definition
7.3 Value of System for AZ
7.3.1 Analysing Different Kinds of Agreements
7.3.2 Alternative Utilities of LLM Text Extraction
7.4 Sources of Errors and Problems
7.4.1 Lack of Data
7.4.2 Subjectivity of KPIs
7.5 Future Research
7.6 Recommendations for AZ

8 Conclusion

Bibliography

A Interview Method
A.1 Sampling and Participant Recruitment
A.2 Thematic Analysis

B Backbone Performance Distributions


List of Figures

1.1 Schematic overview of how to address the problem. The scope of the
project is defined within the gray box and is an important step of
the bigger pipeline to evaluate collaborations. Blue boxes correspond
to processes, such as tasks performed by an LLM. Red boxes are
data that already exist within AZ in some shape or form, while the
green boxes are data that are generated as a product of the developed
system. The steps outside the gray box are left out of scope.

2.1 The image shows an example of a typical transformer architecture
[18]. The architecture includes the encoder and decoder parts, both
with processes such as Multi-Head Attention, Positional Encoding,
and Feed Forward layers, among others.

4.1 The figure visualises the evaluation pipeline, including the first step
of data preparation and annotation. The objectives are extracted by
the LLM and are subsequently evaluated. In the last step, the LLM
defines KPIs from the objectives. These KPIs are evaluated by asking
experts working on each respective project for feedback.

B.1 Distribution of academic agreement scores across performance metrics
for different embedding models.

B.2 Distribution of business agreement scores across performance metrics
for different embedding models.

B.3 Distribution of business agreement scores across performance metrics
for different embedding models. These business agreements, however,
refer to the long versions (with complete objective extraction).

B.4 Distribution of resource agreement scores across performance metrics
for different embedding models.

List of Tables

2.1 Text evaluation accuracy (%) across benchmarks and models from a
study by OpenAI [24]. Blank fields indicate that there is no data for
that measurement.

3.1 The KPIs presented in the table are deduced from the interviewees’
suggestions on measuring the value stemming from collaborations.

5.1 The table shows a comparison of embedding models across metrics
tested on dummy data. The highest score per metric is marked in
bold. The scores are significantly lower than one because some
predictions are not in the golden labels, and some have a different
semantic meaning.

5.2 The table shows the meaningfulness criteria and their respective
definitions as described in the evaluation surveys.

5.3 The table shows the general questions that were asked at the end of
every survey. The questions do not pertain to any single KPI but rather
to the composition of all proposed KPIs for a given collaboration.

6.1 Comparison between text segments identified as objectives, from the
resource agreement [29] between the EU and Argentina. In the left
column, the text corresponds to predictions by the LLM, while the
right column displays what was deemed to be the most correct text
when annotating the dataset.

6.2 The table lists a set of annotated objectives that have not been paired
with an LLM output, for the EU-Argentina resource agreement [29].

6.3 The average performance in terms of the evaluation metrics is
showcased by document type. The underlying backbone that is utilised
for this is the all-MiniLM-L6-v2. The best scores per metric are
highlighted in bold.

6.4 Averages over all KPIs within agreements by meaningfulness criteria.
The numbers correspond to different agreements that survey
respondents gave feedback on.

6.5 The table presents which traits of the meaningfulness criteria the
evaluators found the KPIs to have. The scores are ratios of how many
KPIs were selected as having a given trait. KPIs are grouped
by categories described in Chapter 3. The best value per criterion is
highlighted in bold.

6.6 This table presents answers to the general broad questions of the survey.

6.7 The table presents the written answers corresponding to the different
kinds of KPIs. Not every KPI category is present, as not all types of KPIs
received qualitative feedback. Specific information relating to
the projects has been redacted. If a KPI category is not displayed
in the left column, the previous category still applies.


1
Introduction

Businesses across a multitude of industries depend on external partnerships to co-
ordinate complex value chains, accelerate innovation, and meet predefined objec-
tives. Increasingly, organisations operate within multi-party ecosystems where suc-
cess relies on effective collaboration management [1]. In the pharmaceutical and
life sciences sector, external partnerships and collaborations play a critical role in
driving innovation, accelerating research, and gaining a competitive advantage [2].
However, the complexity and volume of such collaborations make it challenging to
capture their true value and communicate their impact effectively [3].

Advancements in the field of machine learning (ML) have made it possible to auto-
mate not only tedious but also difficult tasks. Large language models in particular
are one such domain of ML that has introduced novel capabilities beyond those
previously thought to be attainable [4][5]. With such potential and broad
applicability, it is of great interest for companies, especially industry
leaders, to leverage these tools to gain or retain a strategic market position.
For example, Martín-Domingo, Fernandez Roblero, Efthymiou, et al. [6] used
OpenAI’s ChatGPT to extract KPIs for airline emissions reporting with an
accuracy of 71%. Models such as ChatGPT have achieved broad public visibility
due to their demonstrated capacity to solve complex problems in text analysis
and synthesis. This context creates a unique opportunity for
AZ to leverage this technology to generate, process, and present data in support of
decision-making.

1.1 Background
AstraZeneca is a leading global biopharmaceutical company originating from the
UK and Sweden. AZ focuses on innovation through research, development, and
marketing of pharmaceuticals [7]. The Gothenburg plant is one of the company’s
main centres of research and development (R&D), with 3100 employees ranging from
formulation scientists to software developers [8]. The plant works on holistic drug
development processes, from molecular biology to clinical trials of new drugs.

Chesbrough [2] introduced the concept of open innovation and describes it as pur-
posefully allowing ideas, inventions, and knowledge to exit and enter an organisation
in hopes of benefitting the innovative process of the organisation. He argues that
open innovation makes the process of innovating more efficient, reduces time to
market, and improves the potential for innovative breakthroughs compared to
traditional closed innovation, which relies on internal R&D. Open innovation is
presented as having the potential to innovate and advance the pharmaceutical
industry, which generally has slow-progressing R&D [9].

Chesbrough [2] also highlights spin-offs as a form of open innovation. A spin-off
arises when a company creates a legally separate business entity while possibly
retaining some ownership. In this way, the separate business can commercialise
innovations that are not part of the core business of the founding company. Some
benefits of spin-offs are that they encourage entrepreneurial strategy, reduce risk
for the founding organisation, and increase market adoption [2]. Wikhamn and
Styhre [10] explain how AZ collaborates
with spin-offs by transferring internal projects to start-up companies backed by ex-
ternal venture capital. The study highlights several challenges facing AZ in this
process, including internal decision-making difficulties, cultural barriers such as the
‘not-invented-elsewhere’ syndrome, and challenges in making internal projects ap-
pealing to external investors.

An example of how AZ works with collaborations is shown through the companies
within the BioVentureHub. BioVentureHub was launched in 2014 and has since
hosted 57 companies at the heart of AstraZeneca’s R&D plant in Gothenburg
[11]. The purpose of BioVentureHub is to create an open innovation environment
that increases the competitiveness and energy of the life science sector using a public-
private partnership model [11]. It allows start-ups and academic teams to work
alongside AstraZeneca experts and access advanced laboratories and facilities, pro-
moting collaboration and knowledge sharing.

In addition to the research taking place at the facility, AZ also actively participates in
collaborations with organisations, such as research institutions and research-driven
start-ups. AZ’s relationships with these collaborators range from business partners,
e.g. where AZ and a partner organisation develop a product together, to situations
where AZ is investing in a company, to collective research with an academic
institution. Managing the relationships with the collaborators in this ecosystem is
crucial for their success in driving innovation, and therefore for keeping AZ
successful in the research- and innovation-oriented pharmaceutical market.
Collaboration is at the
heart of everything AZ strives to do.

1.1.1 Problem Description
Decision-makers at AZ need to constantly stay updated on the numerous different
partners and the state of their collaborations. This includes analysing financials, as
well as following scientific progression and product development, such as patents.
This creates an enormous landscape of manual analysis that decision-makers need
to consider. This analysis is time-consuming and prone to human error, especially
considering that the data that needs compiling is heavily dispersed within AZ. These
data locations include CRM systems, publication databases, patent repositories, and
investment trackers, as well as slide decks and emails. Decision-makers also lack an
accessible way to stay continuously up to date on the status of any given
collaboration, including its health and progress.

In general, there is a lack of an intelligent system that can consolidate datasets
into a coherent and evaluative framework. This makes it difficult for stakeholders
to quantify both financial and non-financial outcomes of collaborations, identify
opportunities, and benchmark against external activities. Therefore, the need to
develop a system with automated capabilities, measuring the health and value of
collaboration projects, has become increasingly prominent within the AZ business
development team.

1.1.2 Scope of Implementation
To automate the reporting on collaborations, a multi-agent system (MAS) is one
potential solution that could be implemented [12]. A MAS divides tasks among
multiple separate agents in order to manage a
more complex goal [13]. This is the method deployed by Choi, Lopez-Lira, Lee, et
al. [14] to retrieve financial insights, such as KPIs, from various reports. They de-
veloped a MAS, consisting of an extraction agent and a text-to-SQL agent, able to
transform financial filings into structured data with an accuracy of 95%, showcasing
the effectiveness of this approach.

KPIs are especially important for the solution proposed by this thesis, as they are
vessels to coherently and concisely report the current state of a given endeavour.
They are quantifiable measures that translate goals into trackable signals of progress
and results. In collaborations, well-defined KPIs create a shared language for rele-
vance, clarity, and accountability, linking objectives to data, timelines, and owners.
They enable evidence-based decisions that ideally focus teams on what is truly im-
pactful. Considering that as the end goal, developing an automated system similar
to the one previously mentioned would, in the case of AZ’s collaboration ecosystem,
require several steps.

The first stage is the consolidation of all existing relevant data, which captures the
nature of the collaboration. For a single collaborator with AZ, a multitude of
documents needs to be retrieved. Depending on the type of collaboration, these
could include drafts of agreements, final contracts, non-disclosure agreements,
and potentially significant metadata. To get a full picture of the collaboration, all
information relevant to that relationship needs to be located, extracted, and then
compiled, in order to eventually give a comprehensive report. This process could
potentially be achieved with a specific agent that uses crawling techniques to search
and gather relevant documents.

To establish what exactly the KPIs should be measuring, a separate agent could
extract expected outcomes and objectives of a collaboration, based on what is
stipulated in the corresponding contracts of that partnership. Subsequently, another
agent could define which KPIs best capture advances related to the identified ob-
jectives. In that way, performance measurements could be tailored to each collabo-
ration.

Other agents could be tasked with quantifying the defined KPIs of a collaboration
and retrieving relevant information to enable proper quantification of the KPIs. By
contextualising the proposed KPIs with the continuous progress data on the cor-
responding collaboration, the end result would finally amount to bespoke valuable
insights that could be reported to decision-makers.

A general overview of the problem and how it could be subdivided into separate tasks
is illustrated in Figure 1.1. Within the scope of this project, not all of these
procedures are taken into account. Only the extraction of objectives and the
definition of KPIs are considered. A further motivation for this is given in Section 1.3.

Figure 1.1: Schematic overview of how to address the problem. The scope of
the project is defined within the gray box and is an important step of the bigger
pipeline to evaluate collaborations. Blue boxes correspond to processes, such as
tasks performed by an LLM. Red boxes are data that already exist within AZ in
some shape or form, while the green boxes are data that are generated as a product
of the developed system. The steps outside the gray box are left out of scope.
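
As a rough illustration of the two in-scope steps (objective extraction followed by KPI definition), the following minimal Python sketch chains two interchangeable "agents". All names here (`KPI`, `CollaborationReport`, `run_pipeline`, `toy_extract`, `toy_define`) are hypothetical and chosen for this example only. The stand-in agents use simple string matching instead of an LLM, so the sketch should not be read as the implementation described later in this thesis.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class KPI:
    name: str
    objective: str  # the objective this KPI is meant to track


@dataclass
class CollaborationReport:
    objectives: List[str] = field(default_factory=list)
    kpis: List[KPI] = field(default_factory=list)


# Hypothetical agent signatures: each agent is modelled as a plain function.
ObjectiveAgent = Callable[[str], List[str]]
KpiAgent = Callable[[str], List[KPI]]


def run_pipeline(contract_text: str,
                 extract_objectives: ObjectiveAgent,
                 define_kpis: KpiAgent) -> CollaborationReport:
    """Chain the two in-scope steps: objectives first, then KPIs per objective."""
    report = CollaborationReport()
    report.objectives = extract_objectives(contract_text)
    for objective in report.objectives:
        report.kpis.extend(define_kpis(objective))
    return report


# Stand-in agents (assumptions) so the sketch runs without any LLM access.
def toy_extract(contract_text: str) -> List[str]:
    # Treat every line starting with "Objective:" as an objective.
    return [line.split(":", 1)[1].strip()
            for line in contract_text.splitlines()
            if line.strip().startswith("Objective:")]


def toy_define(objective: str) -> List[KPI]:
    return [KPI(name=f"Progress towards: {objective}", objective=objective)]


contract = "Preamble\nObjective: Develop a joint assay\nObjective: Publish results\n"
report = run_pipeline(contract, toy_extract, toy_define)
```

In a real system, the two stand-in functions would be replaced by LLM calls, while the surrounding structure (one list of objectives, one or more KPIs per objective) could stay the same.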

1.2 Purpose
This project strives to design and evaluate an agent that, based on contract agree-
ments, can extract relevant KPIs for measuring the performance of a collaboration.
This is one of the crucial steps of the overarching problem: taking agreements and
collaboration result data to automatically evaluate the health of a collaboration in
terms of KPIs. The flowchart of the problem is presented in Figure 1.1.

When the LLM-driven system is developed, the performance will be evaluated based
on real examples of contract agreements of different kinds, as well as dummy agree-
ments, to see how well the agent is able to extract concrete objectives. Then the
objectives will be used to define KPIs, which should reflect the KPIs that AZ ac-
tually wants to evaluate, and those that are feasible to extract from the existing data.

In doing so, the hope is that the system can introduce a way to increase transparency
and business insight to meaningfully contribute to AZ and their partners, allowing
them to enhance their collaborations and consequently advance their contributions
to research and patient care. This project begins the development of the full system
as a proof-of-concept (PoC) through the building of a key subsystem.

The project also aims to put the solution into the context of AZ and examine what
value it may give the organisation. To do this, data management, collaboration
project workflow, reporting process, and needs must be examined to accurately un-
derstand how such a model can be used effectively given the ecosystem, the needs
of AZ, and what value can be created. An example is understanding what kind of
KPIs capture value from collaborations.

This project seeks to investigate the following research questions:
• How effectively can an LLM-based system extract and identify relevant KPIs
from contract agreements of varying complexity to accurately report the health
and performance of collaborations at AstraZeneca?

• Can this system be used efficiently in the pharmaceutical context of AstraZeneca?

1.3 Limitations
As with all research projects, some limitations are inevitable. Conducting research
in partnership with a large company comes with great benefits, such as experienced
guidance and other resources, but also constraints related to organisational
bureaucracy, which have shaped the project’s scope. These limitations are presented here.

1.3.1 Scope-related
Due to limitations in time and data, the scope is to develop a system that solves one
part of the general problem that the solution shown in Figure 1.1 strives to address.
The project will not include implementations of every agent and data analysis
method in the larger MAS. It is also assumed that the documents have already been
processed into a suitable format. Hence, the scraping and
crawling of the relevant systems to extract documents will be left out of the project’s
scope.


1.3.2 Organisational
Due to strict requirements on data security, the proposed solution is restricted to
tools that are approved by AZ. This means that only software, e.g. programs and
models, that has been explicitly verified by AZ can be employed in our solution,
thus ensuring that patient safety and company intellectual property are protected.

1.3.3 Data access
Because the collaboration agreements analysed in this PoC are confidential, only a
handful of agreements were supplied for the project. This limits the possibility of
extracting more general insights. For instance, it confined the project to using
pretrained LLMs rather than training or even fine-tuning any models. It
also limits the ability to conclusively compare different kinds of documents. Finally,
the proprietary nature of the few agreements that were provided restricts this report
from presenting any specific examples.


2
Theory

This chapter gives a brief overview of the foundational concepts that are needed to
properly digest this report. The theories are described but not explained in detail,
so for a deeper understanding, it is recommended to explore the cited sources.

2.1 Deep Learning
Deep learning is a subset of ML and powers advanced artificial intelligence (AI)
methods, such as computer vision, classification algorithms, and generative AI, in-
cluding LLMs [15][16]. The concept leverages neural networks consisting of multiple
layers of artificial neurons to learn non-linear representations of data [15]. The
neurons are connected through weights and biases, which together with a non-linear
activation function define the behaviour of the neural network model [15][16]. The
model is trained on data samples by minimising a loss function that measures
prediction error based on the outputs of the forward pass [15][16]. Backpropagation
computes the gradient of the loss function with respect to the weights and biases
[15][16]. Gradient descent then gives the learning rule, which updates the parameters
in the direction that reduces the loss function most steeply [15]. By applying this
concept to a network with sufficiently wide and sufficiently many layers, it is
possible to model complex problems.
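
To make the training loop concrete, the sketch below applies gradient descent to the simplest possible case: a single linear neuron with one weight and one bias, fitted with a mean squared error loss. This is an illustrative assumption rather than a model used in this thesis. The function name `train_linear_neuron`, the learning rate, and the toy data are chosen for this example, and the gradients are written out by hand instead of being obtained through backpropagation over several layers.

```python
def train_linear_neuron(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y ≈ w*x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Forward pass: predictions with the current parameters.
        preds = [w * x + b for x in xs]
        # Gradients of the MSE loss with respect to w and b.
        grad_w = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / n
        grad_b = sum(2 * (p - y) for p, y in zip(preds, ys)) / n
        # Update rule: step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1, so w and b should approach 2 and 1.
w, b = train_linear_neuron([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

A deep network follows the same pattern, except that the forward pass runs through many layers with non-linear activations and the gradients are computed automatically by backpropagation.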

A drawback of deep learning is the so-called black-box behaviour, which implies
low intuitive interpretability of the model output [15]. Complex neural networks are
also computationally heavy, which means that a huge amount of energy is required
for the increasingly sophisticated hardware to perform the calculations on a massive
scale [15]. Often, a massive amount of labelled training data is needed for these
larger networks to achieve optimal performance [15].

2.2 Large Language Models
In recent years, natural language processing (NLP), and more specifically LLMs,
have fundamentally improved their ability to process and generate text to perform
various tasks at human level [4]. LLMs have gained abilities such as decision-making,
reasoning, planning, and in-context learning due to the gigantic scale of the models
[5]. The LLMs, which are probabilistic sequence models, achieve this by predicting
the next token [5]. All tokens are subword units that have a unique index. Each
index is then mapped to a learned vector embedding. These embeddings are in turn
numerically processed in the deep learning model, such as an LLM, to distinguish
between contexts and generate outputs.
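
The mapping from tokens to indices to embedding vectors can be sketched in a few lines of Python. The vocabulary, the `##` marker for word-continuation pieces, and the random embedding values below are assumptions made purely for illustration. Production tokenizers learn their vocabulary from data, and the embedding values are learned during training rather than drawn at random.

```python
import random

# Hypothetical toy vocabulary of subword units, each with a unique index.
vocab = {"contract": 0, "##s": 1, "object": 2, "##ive": 3, "<unk>": 4}

random.seed(0)
embedding_dim = 4
# One vector per token index; in a trained model these values are learned.
embedding_table = [[random.uniform(-1, 1) for _ in range(embedding_dim)]
                   for _ in vocab]


def embed(tokens):
    """Map subword tokens to indices, then indices to embedding vectors."""
    indices = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]
    return indices, [embedding_table[i] for i in indices]


# "objectives" split into three subword pieces.
indices, vectors = embed(["object", "##ive", "##s"])
```

The resulting list of vectors, one per token, is what the subsequent layers of the model operate on.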

At the core of the LLM architecture is the transformer. The transformer replaced
older methods, such as Recurrent Neural Networks, and enabled the models to
process tokens in parallel instead of sequentially [17]. The transformer consists of
multiple layers creating encoders and decoders, which can be seen in Figure 2.1. The
encoder layers embed the input into a higher-dimensional space, leveraging
feed-forward network layers. The decoder layers utilise these higher-dimensional
embeddings to create output sequences [17]. Because tokens are processed in parallel,
the transformer architecture needs to encode the position of each token explicitly and
in that way account for the order of the words when processing them [17]. Whereas
Recurrent Neural Networks process data sequentially, transformers use multi-head
self-attention [17][18]. Self-attention
relates tokens to each other by weighting the relevance of other tokens to a specific
one [19]. The model can then dynamically process information from multiple input
positions [19]. This allows the model to consider positional data from all positions
and provide contextual awareness [19]. The multi-head self-attention runs several
attention "heads" in parallel, each with its own learnt projections of the input [18].
Every one of these heads computes attention weights over the sequence to produce
an output focused on different aspects of the data [18]. The head outputs are then
concatenated and passed through a final projection to form the layer’s result. This
design permits the model to process multiple representation subspaces and positions
at once, capturing patterns that a single head would blur together, while keeping
the computation efficient by using reduced dimensions per head [18].
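
As a rough illustration of how one attention head weights tokens against each other, the sketch below implements scaled dot-product attention, softmax(QK^T / sqrt(d_k))V, as defined in [18]. It is a simplified sketch in plain Python: the learned projections that produce queries, keys, and values are omitted (the toy input is used directly for all three, which is an assumption), and only a single head is shown.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to one."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention for one head.

    Each argument is a list of vectors, one vector per token.
    Returns the output vectors and the attention weights per token.
    """
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Relevance of every token to the current one.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy input: three tokens with two-dimensional representations.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, w = self_attention(x, x, x)
```

In multi-head attention, several such heads run in parallel on differently projected inputs, and their outputs are concatenated and projected once more.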

The pretraining step of LLM training requires a large amount of data and compute.
For full-scale models, the weights of the layers are optimised by training on
billions of tokens. After pretraining, the model can be fine-tuned for
specific tasks, e.g. to process a certain kind of document or to follow instructions
and answer questions by training it on question-answer pairs [20]. To perform the
correct tasks with better accuracy, it is possible to phrase input prompts in ways that
guide how the output should be created [21]. For example, few-shot prompting
involves providing some examples of how the model should respond when given the
instruction [21]. In this way, structured output can be generated.
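
A few-shot prompt can be assembled as in the sketch below. The function name, the `Text:`/`Objective:` layout, and the example pairs are hypothetical and chosen only to illustrate the idea; this is not the prompt format used in this thesis.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Combine an instruction, worked examples, and the new input in one prompt."""
    parts = [instruction, ""]
    for source_text, expected in examples:
        parts += [f"Text: {source_text}", f"Objective: {expected}", ""]
    # The prompt ends where the model is expected to continue.
    parts += [f"Text: {query}", "Objective:"]
    return "\n".join(parts)


# Hypothetical example pairs showing the desired response style.
examples = [
    (
        "The goal of the project is to increase the research capabilities.",
        "increase the research capabilities",
    ),
    (
        "The aim is to measure the response in the samples.",
        "measure the response in the samples",
    ),
]

prompt = build_few_shot_prompt(
    "Extract the objective, starting at the first action verb.",
    examples,
    "The purpose of the collaboration is to publish joint findings.",
)
print(prompt)
```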

2.2.1 GPT-4o
Researchers and industry are racing to develop the most powerful and intelligent
LLM systems [22]. In recent years, Google has developed Gemini, while
Anthropic has developed Claude. OpenAI made the first mass breakthrough beyond the
AI community with ChatGPT. GPT stands for Generative Pretrained Transformer
and is based on an architecture similar to the one shown in Figure 2.1.
OpenAI has developed a range of models, including GPT-3, GPT-3.5, GPT-4, and most
recently GPT-5. GPT-4o, in particular, was released in 2024 to surpass the
performance of the competitors [22] and is estimated to consist of over one trillion
parameters. This is well above the number of parameters that the competing models
had at the time. The "o" in GPT-4o stands for omni, which highlights the model’s
ability to accept prompts consisting of audio, images, and text [23].

Figure 2.1: An example of a typical transformer architecture
[18]. The architecture includes an encoder and a decoder part, both with components
such as Multi-Head Attention, Positional Encoding, and Feed Forward layers,
among others.

In the study Putting GPT-4o to the Sword: A Comprehensive Evaluation of Language,
Vision, Speech, and Multimodal Proficiency [22], GPT-4o is tested on multiple
exams to evaluate its language ability and its ability to understand and solve
complex problems. On the United States Medical Licensing Examination Step
1 (USMLE), the model achieved an accuracy of 83.5%, considerably higher
than GPT-3.5, which attained 51.67%. However, this was lower than GPT-4, which
scored 90.00%. On the Chartered Financial Analyst Level 1 (CFA) exam, GPT-4o
achieved an accuracy of 85.6% and beat the other two models by more than
12 percentage points. When assessed on the Scholastic Assessment Test (SAT), the
model obtained 90.91% accuracy on reading and writing questions and 87.48% on
mathematical questions.

In a benchmark study conducted by OpenAI after the release of GPT-4o, the model
outperformed previous OpenAI models and competitors on most of the benchmarks
considered [24]. The models were evaluated on several benchmark
sets, such as HumanEval and MMLU. The results of the comparison can be seen in
Table 2.1.

Table 2.1: Text evaluation accuracy (%) across benchmarks and models from a study
by OpenAI [24]. A dash indicates that there is no data for that measurement.

Benchmark    GPT-4o   GPT-4 (23-03-14)   Claude 3 Opus   Gemini Pro 1.5   Gemini Ultra 1.0   Llama 3 400B
MMLU         88.7     86.4               86.8            81.9             83.7               86.1
GPQA         53.6     35.7               50.4            –                –                  48.0
MATH         76.6     42.5               60.1            58.5             53.2               57.8
HumanEval    90.2     67.0               84.9            74.9             84.1               84.1
MGSM         90.5     74.5               90.7            88.7             79.0               –
DROP (F1)    83.4     80.9               83.1            78.9             82.4               83.5



3
Interviews

Interviews were conducted to answer the research question regarding the organisa-
tional workflow of collaborations at AZ to understand how an automatic reporting
system can be used. To answer this question, a qualitative approach is needed.
The purpose of the interviews was to understand how people in the organisation
working with collaborations report the progress of the collaboration, and what their
stakeholders are expecting from them in terms of reporting. This is a way to put
the automatic reporting system into the organisational context of AZ, and to un-
derstand strengths and weaknesses of the system in practice. The manner in which
the interviews were conducted and analysed is described in Appendix A.

The answers from five interviews are presented in this chapter, thematically
grouped into sections. To preserve the anonymity of the participants, they are
referenced by their title and a number, e.g. Alliance Manager 1.

3.1 Different Kinds of Agreements
According to Principal Scientist 1, collaborations take multiple forms: "It can range
from collaborations around pharmacological equity to technical development of in-
strumentation, techniques, or scientific collaborations aiming to publish new scien-
tific discoveries." The interviewee continues to explain how the collaborations differ.
In terms of follow-up for pharmacological equity, there are usually agreed milestones
that trigger payments or continuation of the project. Technical development has to
do with providing feedback on new equipment and evaluating the usefulness of the
data in the pipeline. Scientific collaborations have a more free form of structure, and
the nature of the reporting is up to the principal investigator and their academic
counterpart.

Alliance Manager 1 explains in the interview that academic agreements start out on
shaky ground but become more solid as the project progresses and AZ continues
working with the partner. The participant continues to say that, by contrast, business
agreements are "cast in stone from the very beginning and you need to deliver this
to get money", meaning that business collaborations are more dependent on defined
milestones, which are required for payments to take place. Academic collaborations
do not have to stop if specified milestones are not reached.

As explained by Vice President 1, differences arise because commercialisation agreements
require significant effort to ensure that all parties are aligned. As they explain,
"tremendous amounts of work that goes into that, because companies also come at
things with different points of views, different cultures, different sizes, different stages
of the life cycle of the company itself, different priorities. So, it is quite complex to
navigate."

Principal Scientist 2 works in an academic collaboration and mentions that mile-
stones are also set up and agreed upon in advance for this type of collaboration.
Another person working within an academic collaboration is Science Director 1,
who says "it’s research, we don’t know the value beforehand." Thus, even if the
work process is structured in their case, the value and goals might not be clearly
defined. When asked about clearly defined milestones the participant answered, "I
don’t know how it would work, right, but it’s a bit maybe complicated. I don’t
know because it’s research right? We have no idea the direction. So we can start
with something and we can discover something else. We will change totally the
milestone on what we saw." The statement paints a picture of the research as an
exploratory effort. Additionally, the interviewee describes business collaborations as
having much clearer timelines.

3.2 Reporting Data Workflow
Multiple interview participants testify that reporting to senior management and
stakeholders is usually done solely through meetings and presentations on progress,
and at vastly different intervals. Principal Scientist 2 explains their process as
follows: "As part of the post-doc proposal, both parties agreed to a set of goals (6-
month goals, 1-year goals, 2-years goals) that are monitored throughout the project
via recurring meetings (fortnightly meeting frequency), reports (quarterly status
updates, 6-month progress report, annual written report)." The participant contin-
ues to describe that progress is reported to the head of the department. During
these recurring meetings, progress, which often takes the form of collected data or
new insights, is discussed in relation to the goals. In the same participant’s current
project, results from modelling analyses and literature studies constitute
the progress that is being reported. The outcome is written up in the aforementioned report,
which is compiled by the external collaborator. This report is stored in a common
SharePoint or Teams channel.

Moreover, Alliance Manager 1 states that the frequency of reporting depends on
how fast the project is moving and how often there is something to report. If there
is nothing to report in quarterly meetings they will report bi-annually. The respon-
dent also mentions that SharePoint is used for storing report data if not stored in
folders of the alliance manager.

Science Director 1 reports in a similar fashion. Once a year, a one slide update
is presented, including goals and key deliverables. Apart from that, there is one
oral presentation per year with all data and progress. This reporting is directed at
alliance management. Reporting data does not seem to be stored in any centralised
database, according to Science Director 1.

For commercialisation collaborations, Vice President 1 states that there is often a
joint steering committee consisting of senior leaders that handle the governance of
the collaboration. Any reporting is ultimately targeted towards this committee. It
is clarified that the reporting to the committee is qualitative and takes place as
presentations in quarterly meetings.

Vice President 1 continues by saying that a system for structured and recurring
reporting would be of high value to both parties of a commercialisation collaboration.
However, the system would have to bridge the structural gap between the
organisations to be efficient and actually counteract duplication of work. One difficulty
related to this, especially if the system relies on AI, is getting the partner
organisation to trust the system with their sensitive data.

Taken together, these interview responses show that progress reporting
on collaborations is performed dynamically and in an unstructured manner,
depending on the needs of the people involved. Instead of continuously and systematically
reporting milestones so that stakeholders can keep up to date, progress
is reported in a free format and at irregular intervals. This is a natural way of
working on these projects, since research is often fundamentally dynamic.
However, it creates difficulties when considering an automated approach,
as automation requires access to systematic and structured data.

3.3 Relevant KPIs
How the value and health of a collaboration can be measured in terms of
KPIs is described in detail by Alliance Manager 1. The participant mentions how
conducting scientific research can create opportunities for follow-up research, which
is a form of value. Conducting successful research also improves the scientific assets
of AZ, which in turn creates opportunities for recruiting talented post-docs or
PhD students, since they may be keen to join future projects. A person
who has worked in an AZ collaboration has already acquired relevant experience to
continue working effectively at AZ. The interview participant says that relationships
and academic contacts related to previous research are important values, especially
for future and follow-up research. Some of these thoughts from Alliance Manager 1
are captured in the following statement.

You just don’t randomly pick up a lab or a scientist to work with in an
academic collaboration or even anywhere. It’s not like you know, you
just scan through [names and say], this name looks fantastic. No, you
would have probably worked with him. That person would have presented
somewhere. That person would be having an asset which is being referred
to in multiple publications. [...] That creates a value for that person,
and when you get into that, that becomes your KPI. [...] Sometimes you
would find it very well written in the agreement, but sometimes not.

Naturally, other performance indicators of research-oriented collaborations are key
deliverables and scientific publications, as exemplified by Science Director 1 and
Principal Scientist 1. Alliance Manager 1 expresses the strong benefits of published
scientific work in the following way.

It’s a huge qualification for academic collaborations, because if we are to
do a publication, we need to generate a lot of data ourselves, and that
data may not be towards our goal of bringing new medicines to the mar-
ket. That is the primary goal for all of us. Now, when you are working
with an academic partner, the academic partner does 60% of the job, we
put in the 40%, but you still become part of the publication. So that also
adds value, and that also adds value to the people who work here. As you
know, I’ve got a publication in say impact Factor 15, which is a huge
thing. Anything around [impact] factor 10 is pretty good, so that adds
the value again.

Ultimately, Alliance Manager 1 provided a list of KPIs deemed to be relevant for
measuring the performance of collaborations in general. The list looks as follows:

• The deliveries of the collaboration within the time frame and budget
• External funding secured
• Number of post-docs and PhDs working on the collaboration
• External presentations
• Publications or joint publications
• No escalations to senior leadership
• Positive feedback during health check-ups

According to Vice President 1, who works on commercialisation agreements with
companies, early in the collaboration process, the joint commercialisation commit-
tee will establish the goals, how to measure them, and the value that the partnership
will bring. These criteria are critical to the success and efficiency of the collaboration,
as is establishing principles of cooperation. The participant continues by pointing
out that these collaboration criteria have to be explicitly determined, particularly in
the case of financial KPIs: "Although the business performance is ultimately going
to be the main goal for a commercial collaboration because the collaboration has to
have a positive financial impact or it will not continue very long."

However, complexities and differences in the partnership with each commercial collaborator
make the measurement of KPIs inconsistent. Vice President 1 says, "for every
company there are similarities, but sometimes companies want to see the same kind
of information in different ways and it would be good to try to be more consistent
and try to harmonise. If there was some kind of tool to try to help with that, that
would be helpful."

Finally, based on what has been expressed by the interviewees, a set of generally
applicable KPIs could be formulated. These are presented in Table 3.1 and become
important for the prompting of the system developed in this project.

Table 3.1: The KPIs presented in the table are deduced from the interviewees’
suggestions on measuring the value stemming from collaborations.

# KPIs
1 Patents filed
2 Follow-up research
3 Impact Factor of scientific publication
4 Reputation of the journal of scientific publication
5 External engagement (e.g., event presentation)
6 External funding for projects
7 Alignment with company goals
8 Budget coherence
9 No escalations to senior leadership
10 Recruitment of Post-Doc from University
11 Collaboration-specific KPIs based on objectives

3.4 Pain-points and Needs
When the interviewees are asked about pain-points in the current workflow, potential
improvements, and the role of new technologies, such as AI, most participants (Principal
Scientist 1, Principal Scientist 2, and Alliance Manager 1) answer that they
see no problem with the current reporting. However, Science Director 1 replies that
there is not much transparency in the system, making it difficult to find information
on ongoing collaborations. The participant mentions that a common collaboration
platform could alleviate these issues.

The responses regarding the role of new technologies included AI for note taking,
a system for identifying new partners, a database of past and ongoing projects,
and AI for streamlining the processing from multiple collaborations.

In sum, the interview participants are satisfied with the workflow of reporting today,
but some of them see opportunities to use new technologies to improve the trans-
parency and accessibility of data related to collaborations. This could be a database
that connects the set milestones of a collaboration to the most recent progress in
terms of those milestones. This finding aligns well with the scope of this thesis and
the identified problem to be addressed.


4
Implementation

This part of the report covers everything from how information was collected to how the
end product was created. All these steps are based on the premise that
the purpose of a collaboration can be read from the contract of that project. They
further build on the assumption that this purpose can be condensed
into quantifiable performance indicators. The implementation of the technical LLM
pipeline is presented in the following sections. Figure 4.1 shows a visualisation
of the different stages in the implementation
of this thesis, including the evaluation steps, which are explained in Chapter 5.

Figure 4.1: The figure visualises the evaluation pipeline, including the first step of
data preparation and annotation. The objectives are extracted by the LLM and are
subsequently evaluated. In the last step, the LLM defines KPIs from the objectives.
These KPIs are evaluated by asking experts working on each respective project for
feedback.

4.1 Input
As defined by the scope of the project, the relevant data needed to address the
specified problem consist of legal contracts between AZ and their corresponding
collaborator. These documents are gathered from AZ’s CRM system for business
development (BD).

4.1.1 Data Preprocessing
The agreements are stored in PDF format and therefore require suitable
preprocessing to be used as input for an ML model. With the use of the PyPDF
library in Python, the text can be extracted from the PDF files. A separate method
was applied to documents that had been scanned from a physical copy. In a scanned
PDF file, the entire content of a page is stored as an image. Therefore, the words are
not discernible by the PDF-reading program, which is why those documents have to be
processed with optical character recognition (OCR) software. There is a multitude of
solutions that do this, but to ensure the confidentiality of the data, only AZ-approved tools
could be used. To avoid data leakage, Microsoft OneNote was used to prepare the
text in these cases, even though the program’s primary function is not to extract
text from images. There were discussions about granting the project access to approved
OCR tools from Amazon Web Services, but the access was never finalised.
Adopting such a solution would likely have resulted in much higher quality transcripts.
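
The distinction between text-based and scanned PDF files can be sketched as follows. This is a minimal sketch and not the code used in the project: the function names and the `min_chars` threshold are assumptions, and the check simply flags pages from which almost no text could be extracted, since such pages are likely stored as images and would need OCR.

```python
def extract_page_texts(pdf_path):
    """Return the extracted text of each page in a PDF file using pypdf."""
    from pypdf import PdfReader  # third-party library: pip install pypdf

    reader = PdfReader(pdf_path)
    # extract_text() can return an empty result for image-only pages.
    return [page.extract_text() or "" for page in reader.pages]


def pages_needing_ocr(page_texts, min_chars=20):
    """Indices of pages with almost no extractable text (likely scanned).

    min_chars is an assumed threshold and would need tuning on real data.
    """
    return [
        index
        for index, text in enumerate(page_texts)
        if len(text.strip()) < min_chars
    ]


# Usage with already extracted page texts (no PDF file needed here).
sample_pages = ["", "   ", "This agreement is made between the parties " * 3]
print(pages_needing_ocr(sample_pages))  # [0, 1]
```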

Moreover, the length of the documents had to be adjusted in some instances, due
to token restrictions. Those files had to be shortened to fit the given requirements.
Therefore, such contracts were cut to include only a certain piece of the whole text
that was deemed to be relevant. This may be considered a form of selection bias;
however, given the scarcity of AZ data, it was considered preferable to use as much
accessible data as possible. An alternative that was evaluated was to process the
entire document by passing segments of it through the system, which ensured that
all the text would be processed. Nevertheless, this approach has its own separate
drawbacks.
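
The segment-based alternative can be sketched as below. The sketch approximates tokens by whitespace-separated words and lets consecutive segments overlap; both choices, as well as the default sizes, are assumptions for illustration and do not describe the exact procedure that was evaluated.

```python
def split_into_segments(text, max_tokens=1000, overlap=100):
    """Split text into overlapping segments of at most max_tokens words.

    Words stand in for model tokens here, which is only an approximation.
    """
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    words = text.split()
    step = max_tokens - overlap
    segments = []
    for start in range(0, len(words), step):
        segments.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # the last segment already reaches the end of the text
    return segments


toy_text = " ".join(str(i) for i in range(10))
print(split_into_segments(toy_text, max_tokens=4, overlap=1))
# ['0 1 2 3', '3 4 5 6', '6 7 8 9']
```

The overlap reduces the risk that an objective is cut in half at a segment boundary, at the cost of possibly extracting the same objective twice.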

Before security clearance was obtained, part of the solution was evaluated on legal
documents from a public database. Hence, the development of the system could proceed
without having to wait for the proprietary data. Once permission to access
the agreements was granted, a limited dataset containing a handful of confidential
contracts was made available.

4.1.2 Data Annotation
One of the most important and time-consuming tasks of the project was the annotation
of the data. Since the intention was to evaluate the performance of the system,
structured information, i.e. a labelled dataset, was required. This allows the output
of the LLM to be directly compared to a reference that we, as annotators,
consider to be the ground truth. In NLP, this ground truth is often
referred to as the gold standard. Since annotation is pivotal for obtaining meaningful
outcomes, it is usually performed by domain experts [25]. Even though we, as annotators, are
neither legal nor pharmaceutical experts, we annotated the contracts based on our
experience and understanding of the topics. Due to the time required to complete this task,
it was not feasible to let other people in the organisation perform the annotation.
The labelling process involved going through each of the documents and identifying
which statements in the contract could define the overarching purpose of the
collaboration. Often, there were segments containing sections such as a research plan,
which were particularly helpful in defining the project objectives. If a document did
not outline any motives for the collaboration, it was disregarded.

Once an objective was identified, it was copied over, word for word, to a JavaScript
Object Notation (JSON) file. JSON is one of the data standards in the field and
was very useful for the purposes of structuring the annotations. The objectives were
not subdivided into sub-objectives, although in some documents a sentence or bullet
point could consist of what can be considered multiple objectives. Only the part of
the text sequence that related to what could be considered an objective was placed
in the gold standard JSON-document. In order to keep consistency, it was decided
to start the annotated segment with an action-verb, when it was possible and log-
ical. For instance, if the text in the agreement was "the goal of the project is to
increase the research capabilities within the given field", then the objective that was
annotated in the gold standard would be "increase the research capabilities within
the given field".
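
A gold standard entry of this kind could be stored as in the following sketch. The helper name and the report identifier are hypothetical, and the field names are assumed to mirror those of the system output (`report_id`, `objectives`, `obj_id`, `text`) so that predictions and gold labels can be compared directly.

```python
import json


def make_gold_entry(report_id, objective_texts):
    """Build one gold standard record with sequentially numbered objectives."""
    return {
        "report_id": report_id,
        "objectives": [
            {"obj_id": f"obj_{number}", "text": text}
            for number, text in enumerate(objective_texts, start=1)
        ],
    }


# The objective is quoted word for word, starting at the action verb.
gold = [
    make_gold_entry(
        "contract_A",  # hypothetical identifier
        ["increase the research capabilities within the given field"],
    )
]

gold_json = json.dumps(gold, indent=2, ensure_ascii=False)
print(gold_json)
```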

Furthermore, no information beyond the objectives that we had identified was
annotated, even if that additional information could have been valuable. An example
of such information is data that could contribute to specific KPIs
for a given collaboration, based on its agreement, such as
time duration, budget, reporting, research method, or number
of samples. This was left out of scope due to time constraints and challenges in
annotation and evaluation. It would have been too cumbersome to determine exactly
which data points to target and then identify all of them in each document.
There are simply too many relevant data points to extract in that case, and doing
so would have taken the thesis down a quite different path. Restricting the work to
the extraction of objectives was therefore seen as the most sensible option.

Also, it was not clear to what extent the LLM could perform multiple tasks in one
prompt, so to be sure that it would not have a negative impact on the extraction of
the objectives only, it was omitted completely. An alternative would have been to
dedicate a separate agent (instance of the LLM) to perform this task and compare
its results. However, doing so would have expanded the thesis outside of the origi-
nally defined scope, which is why it was not investigated further.

4.2 System
The subsequent passages involve the implementation of the AI models that enable
the extraction of objectives and definition of KPIs from legal contracts. The idea is
to automatically identify important quantifiable metrics for any given collaboration.

To achieve this, a range of software libraries were employed in the development
of the Python-based solution. In particular, Azure OpenAI was used to access the
GPT-4o model. Although there are many alternatives through Microsoft Azure [26],
it was the only one provided by the AZ organisation. It is, however, still a highly
functional model [22] that has been frequently investigated in research, making it
suitable for what is sought to be achieved in this thesis. Pretrained LLMs, such as
GPT-4o, serve as sufficient means for a PoC, but for a full-fledged future system it
could be wise, depending on available resources, to train it, or at least fine-tune it,
to the specific needs of AZ.

Another important library was LangChain, which is a common open-source frame-
work for NLP. By implementing it, the system could be designed to allow future
swapping of LLMs without difficulty [27]. With the LangGraph library, the con-
struction of a larger system could be facilitated by connecting and coordinating
specialised subtasks through a graph network approach. Thus, the shell of the
pipeline described in the flowchart of Figure 1.1 could be implemented with greater
ease and with options for extensibility.

The described system outlines one part of a proposed MAS. Thus, it strives to be a
piece of a larger puzzle. This puzzle piece can, in terms of the graph representation,
be broken down into the following constituents. The first node is defined as the ob-
jective extraction. It takes the content of an agreement and adds it to the prompt.
The result is the collection of identified objectives. The second node corresponds to
the KPI generation. This takes the output from the previous node, includes it in
its prompt, and returns the generated KPI definition related to the identified objectives.
Prompt engineering is a substantial part of the system’s development process,
as it greatly affects the results from each invocation of the LLM. For that reason, the
prompting changed during the project to adjust the results and accommodate
new needs. The prompts used to achieve the final results are presented below. The
first one instructs the LLM to extract objectives from the provided text.

You are an expert at identifying big-picture goals, objectives, and
targets. Your task is to provide insights from a document in the
{state.get('domain')} domain.

Document:
{state.get('raw_text')}

Based on the provided text, you should answer what the purpose of the
given collaboration/partnership is.

Do this by extracting the overarching objectives of the collaboration
between the company and its counterpart, from the provided text. If
none are present, then do not output any.

It is extremely important that each objective should be a separate
point. Do not answer with super long sentences, but rather keep the
extracted objectives concise and to the point. If the objective is
longer than a sentence, it can most likely be subdivided into
separate objectives.

It is also crucial that you make sure to quote the text directly, i.e.
do not alter any of the excerpts!

If there are multiple objectives provided in a sentence, split them
into different objectives. Never include more than one objective
within an objective.

When possible, start the quotation of the identified objective by
beginning with the first action-verb of that text sequence, e.g.,
"The aim is to measure..." becomes "measure...".

The output should follow the format below and thus be returned as a
JSON array:

[
  {{
    "obj_id": "obj_1",
    "text": "quote of the full objective statement"
  }},
  {{
    "obj_id": "obj_2",
    "text": "quote of the next full objective statement"
  }}
]

Return ONLY the JSON array! It is vital that you do not respond with
any other text.

For the generation of the project specific KPIs, the prompt was instead formulated
as follows.

You are an expert at defining quantifiable key performance indicators
(KPIs) from identified objectives for collaboration projects in the
{state.get('domain')} domain.

Objectives:
{state.get('objectives')}

Instructions:
Define KPIs that can be measured grounded strictly in the objectives
above and useful for decision-makers. There can be multiple KPIs
per objective.

Return ONLY a valid JSON array. Do not include any explanations,
comments, markdown code fences, or a leading JSON label. The first
character must be "[" and the last must be "]".

Each array element must be a JSON object with keys: "kpi_id", "kpi",
"relating_objectives".

"relating_objectives" must be an array of objects with keys "obj_id"
and "text". Use the exact objective text for "text".
Assign sequential KPI IDs: kpi_1, kpi_2, kpi_3, kpi_4, kpi_5, kpi_6,
...
Preserve objective IDs if provided.
Use plain UTF-8 characters and standard JSON escaping. Do not use
trailing commas.
If no relevant KPIs can be created from the objectives, return a
single-element array with one object where "kpi" is "insufficient
objective statements".

Example format (shape only): [ {{ "kpi_id": "kpi_1", "kpi": "a
measurable KPI", "relating_objectives": [ {{ "obj_id": "obj_1",
"text": "full objective text" }} ] }} ]

There are some typical KPIs that should always be included in the KPIs
but formulated based on the objective. These KPIs are:

{KPI_constraint}

Apart from these, project specific KPIs based on objectives should
also be included.
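
The two-node flow described in this section can be sketched in plain Python as follows. This is a simplified sketch, not the project code: the templates are shortened placeholders for the prompts above, the function names are hypothetical, and `fake_llm` is a stand-in that returns canned JSON instead of calling GPT-4o. Each node takes a state dictionary and returns an updated one, which is also the shape in which node functions are registered in a LangGraph graph.

```python
import json

# Shortened placeholder templates; the full prompts are listed above.
EXTRACT_TEMPLATE = "Domain: {domain}\nDocument:\n{raw_text}\nReturn ONLY a JSON array."
KPI_TEMPLATE = "Domain: {domain}\nObjectives:\n{objectives}\nReturn ONLY a JSON array."


def extract_objectives(state, llm):
    """First node: put the contract text in the prompt, collect objectives."""
    prompt = EXTRACT_TEMPLATE.format(
        domain=state["domain"], raw_text=state["raw_text"]
    )
    return {**state, "objectives": json.loads(llm(prompt))}


def generate_kpis(state, llm):
    """Second node: put the extracted objectives in the prompt, collect KPIs."""
    prompt = KPI_TEMPLATE.format(
        domain=state["domain"], objectives=json.dumps(state["objectives"])
    )
    return {**state, "kpis": json.loads(llm(prompt))}


def run_pipeline(raw_text, domain, llm):
    """Run both nodes in sequence on one contract."""
    state = {"raw_text": raw_text, "domain": domain}
    state = extract_objectives(state, llm)
    return generate_kpis(state, llm)


def fake_llm(prompt):
    """Stand-in for the model call (assumption): returns canned JSON."""
    if "Objectives:" in prompt:
        return json.dumps([{
            "kpi_id": "kpi_1",
            "kpi": "number of joint publications",
            "relating_objectives": [
                {"obj_id": "obj_1", "text": "publish joint findings"}
            ],
        }])
    return json.dumps([{"obj_id": "obj_1", "text": "publish joint findings"}])


result = run_pipeline(
    "The aim is to publish joint findings.", "pharmaceutical", fake_llm
)
print(result["kpis"][0]["kpi"])  # number of joint publications
```

Passing the model call in as a function keeps the nodes independent of a specific provider, in line with the goal of allowing the LLM to be swapped later.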

4.3 Output
The output of the first node is a string meant to be formatted as JSON. To ensure
that this truly is the case, it is processed by a Python function that extracts
valid JSON objects from the string. Sometimes, the string may
contain incomplete objects, such as when the context window is exceeded. In that
case, this safety measure ensures that the output passed to the next step is in the
correct format. Specifically, all complete objects are passed on, while the incomplete
remainder is disregarded. Moreover, the output is also saved in a JSON-file con-
taining all the processed contracts. This is so that the results of a given run can be
evaluated at a later stage. The reason for keeping the results from each document
in the same JSON-file is mainly because of convenience. By doing so, there are
substantially fewer files to keep track of, and they can be grouped together based on
which prompt they were initiated from and at what time. This also facilitates the
annotation and evaluation process as all the gold labels can be collected similarly
in one place.
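
One way to implement such a safety measure is sketched below. It is an illustrative sketch under the assumption that the model returns a JSON array that may be cut off or surrounded by stray text; the function name is hypothetical and the actual project function may differ. It relies on `json.JSONDecoder.raw_decode` from the Python standard library, which parses one value at a given position and reports where it ended.

```python
import json


def salvage_json_objects(raw):
    """Return all complete elements of a possibly truncated JSON array string."""
    decoder = json.JSONDecoder()
    start = raw.find("[")
    if start == -1:
        return []  # no array found at all
    position = start + 1
    objects = []
    while position < len(raw):
        # Skip whitespace and commas between array elements.
        while position < len(raw) and raw[position] in " \t\r\n,":
            position += 1
        if position >= len(raw) or raw[position] == "]":
            break
        try:
            element, position = decoder.raw_decode(raw, position)
        except json.JSONDecodeError:
            break  # the incomplete remainder is disregarded
        objects.append(element)
    return objects


truncated = '[{"obj_id": "obj_1", "text": "a"}, {"obj_id": "obj_2", "te'
print(salvage_json_objects(truncated))
# [{'obj_id': 'obj_1', 'text': 'a'}]
```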

In the example below, it is possible to examine how a JSON format of extracted
objectives might look. The report ID refers to which contract the objectives belong
to. The three consecutive dots mark a hypothetical continuation at that level in the
data structure.

[
  {
    "report_id": "*first contract*",
    "objectives": [
      {
        "obj_id": "obj_1",
        "text": "*The text of the first objective in the first contract*"
      },
      {
        "obj_id": "obj_2",
        "text": "*The text of the second objective in the first contract*"
      },
      ...
    ]
  },
  ...
]

If the model is not able to define a KPI based on the presented objectives for some
reason, e.g. due to the absence of identified objectives, it is instructed to answer
with "insufficient objective statements".

The output of the second node is fairly similar to the previous case, except that
the JSON-object contains the proposed KPIs instead. Linked to each KPI is the set
of objectives that the KPI aims to quantify. The example below illustrates how this
structure might look. The output is also stored in a JSON-file, to save the progress.
The KPIs are later retrieved and incorporated into surveys that are sent out to
people with the right expertise in order to collect human feedback.

[
  {
    "report_id": "*first contract*",
    "kpis": [
      {
        "kpi_id": "kpi_1",
        "kpi": "*The first KPI defined from the first contract*",
        "relating_objectives": [
          {
            "obj_id": "obj_1",
            "text": "*The text of the objective, in the first contract, that this KPI corresponds to*"
          },
          ...
        ]
      },
      {
        "kpi_id": "kpi_2",
        "kpi": "*The second KPI defined from the first contract*",
        "relating_objectives": [
          {
            "obj_id": "obj_1",
            "text": "*The text of the objective, in the first contract, that this KPI corresponds to*"
          },
          ...
        ]
      },
      ...
    ]
  },
  ...
]
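
As an illustration of how this structure could be consumed when the KPIs are retrieved for the surveys, the sketch below flattens it into one row per KPI. The function name flatten_kpis and all sample values are hypothetical and only mirror the field names of the example above.

```python
import json

# Hypothetical second-node output; every value is invented for illustration.
raw = """
[
  {
    "report_id": "contract_A",
    "kpis": [
      {
        "kpi_id": "kpi_1",
        "kpi": "Number of joint publications",
        "relating_objectives": [
          {"obj_id": "obj_1", "text": "publish the findings"},
          {"obj_id": "obj_2", "text": "share results externally"}
        ]
      }
    ]
  }
]
"""


def flatten_kpis(reports):
    """Return one (report_id, kpi_id, kpi, [obj_id, ...]) row per KPI."""
    rows = []
    for report in reports:
        for kpi in report["kpis"]:
            obj_ids = [obj["obj_id"] for obj in kpi["relating_objectives"]]
            rows.append((report["report_id"], kpi["kpi_id"], kpi["kpi"], obj_ids))
    return rows


rows = flatten_kpis(json.loads(raw))
for row in rows:
    print(row)
```

Keeping the objective IDs next to each KPI preserves the link back to the text that the KPI was derived from.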


5
Evaluation Criteria

It is crucial that the results from the system can be trusted. For that reason, the
outputs must be validated at each step of the pipeline by evaluating the performance
of that step. Validation is essential because LLMs have a tendency to hallucinate,
i.e. to fabricate information that is unsupported by their training data and reality.
It is therefore also important to ensure that the produced results are traceable back
to the source from which they were retrieved. These concepts are described in further
detail in this chapter.

5.1 Evaluations of Objectives with ML Metrics
The performance of the objective extraction is evaluated using multiple metrics.
Many of these metrics are standard in the field of ML and are described in greater
detail below. For the output of the LLMs to be evaluated, there needs to be a
reference. In this case, the reference is a golden label, i.e. the ground truth, which
represents the text snippet in the input text that should ideally be extracted and
output by the model. The process of labelling the objectives was described in
Section 4.1.2.

To consistently compare the set of predicted objectives to the set of golden
objectives in terms of the ML metrics, an evaluation program was written. It
compares the golden labels of one document to the output prediction from the LLM.
Both are presented as sets of objectives, with no indication of which objective in
one set corresponds to which objective in the other. To pair up the corresponding
objectives, a general-purpose sentence embedding model is used, commonly called
SBERT [28]. These models are LLMs fine-tuned for comparing sentences. Both
sentences are run through the backbone of the LLM to obtain their embedding
vectors, which are then compared using cosine similarity, a measure of how similar
two embedding vectors are. A high cosine similarity means that the texts are
semantically similar, even if the exact words differ. Every objective extracted by
the LLM is compared to every gold label objective to see which ones match best.
The objectives were matched with a greedy one-to-one matching algorithm, which
takes the highest value in the cosine similarity matrix, matches the corresponding
objectives and then removes them from further consideration by setting the
corresponding row and column to −∞. This process is repeated until one of the
lists has no objectives left. If the lists contain an unequal number of objectives,
some objectives will not be paired up and are therefore left unmatched. Objectives
that are very different might still get paired up, but the pair will then receive low
evaluation scores, reflecting that the LLM performed poorly.
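
The greedy one-to-one matching can be sketched as below. This is a minimal pure-Python illustration rather than the evaluation program itself: the function name greedy_match and the similarity values are invented, with rows standing for predictions and columns for gold labels.

```python
NEG_INF = float("-inf")


def greedy_match(sim):
    """Greedy one-to-one matching on a similarity matrix.

    `sim[i][j]` is the cosine similarity between prediction i and gold
    label j. Returns a list of (prediction index, gold index, similarity).
    """
    work = [row[:] for row in sim]  # copy, so the input stays untouched
    n_rows = len(work)
    n_cols = len(work[0]) if work else 0
    pairs = []
    for _ in range(min(n_rows, n_cols)):
        # Pick the cell with the highest remaining similarity.
        i, j = max(
            ((r, c) for r in range(n_rows) for c in range(n_cols)),
            key=lambda rc: work[rc[0]][rc[1]],
        )
        pairs.append((i, j, sim[i][j]))
        # Remove the matched row and column from further consideration.
        for c in range(n_cols):
            work[i][c] = NEG_INF
        for r in range(n_rows):
            work[r][j] = NEG_INF
    return pairs


# Invented similarities: two predictions (rows), three gold labels (columns).
sim = [
    [0.20, 0.95, 0.10],
    [0.30, 0.40, 0.90],
]
pairs = greedy_match(sim)
print(pairs)
```

In this example the first gold label is never selected, which corresponds to the case where one list is exhausted before the other.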

Several models were available for creating the sentence embeddings used to calculate
the cosine similarity. To find the models that best embed the context of the
objectives, a test was conducted. Using public agreements from ResourceContracts.org
[29] as dummy data, the full evaluation pipeline was run to examine the average
cosine similarity. The tested models and the corresponding metrics can be seen in
Table 5.1, where gte-large has the highest cosine similarity. However, examining
specific examples made it clear that the models e5-base-v2, e5-large-v2, and
gte-large had a positive bias: they seemed to output too high a cosine similarity,
particularly for sentence pairs with clearly differing semantic meaning. For that
reason, all-mpnet-base-v2 and all-MiniLM-L6-v2 remained as the two embedding models
to be explored. Ultimately, all-MiniLM-L6-v2 was selected for the cosine similarity
functions of the evaluation pipeline, as it performed marginally better. More
information on why this choice was made is presented in Appendix B.

Metric        all-mpnet-base-v2   all-MiniLM-L6-v2   e5-base-v2   e5-large-v2   gte-large
Precision     0.3948              0.3999             0.3885       0.3995        0.3951
Cosine Sim.   0.4875              0.4879             0.5671       0.5709        0.5807
Recall        0.4875              0.4961             0.4834       0.4970        0.4841
F1            0.4263              0.4344             0.4211       0.4341        0.4265

Table 5.1: Comparison of embedding models across metrics, tested on dummy data.
The highest score per metric is marked in bold. The scores are significantly lower
than one because some predictions are not in the golden labels, and some have a
different semantic meaning.

Four different evaluation metrics are used. The first is cosine similarity, which is
also used to match the objectives in the greedy matching algorithm. For two vectors,
a and b, the cosine similarity is calculated as follows:

\[
\text{Cosine Similarity}(a, b) = \frac{a \cdot b}{\|a\| \, \|b\|}
\]

Cosine similarity shows how similarly the texts are embedded by comparing the
direction of the vectors in the latent space, i.e. the angle between them rather
than their magnitudes. If the model works well for the text, two texts with similar
semantic meaning should be embedded similarly.
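
As a small worked example of the formula, the sketch below computes the cosine similarity for two toy vector pairs. The function name cosine_similarity and the vectors are illustrative assumptions; real sentence embeddings have several hundred dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A zero vector would cause a division by zero; embeddings are
    # assumed to be non-zero in this sketch.
    return dot / (norm_a * norm_b)


# Same direction, different length: similarity close to 1.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# Perpendicular vectors: similarity 0.
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
print(same_direction, orthogonal)
```

The first pair shows that scaling a vector does not change the score, since only the direction of the vectors enters the calculation.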

The other metrics are precision and recall, which together make up the F1-score.
Given a pair of sets of tokens, which in this case is one output prediction and one
golden label, the token-level precision is calculated as:

\[
\text{Precision} = \frac{|\text{Predicted Output} \cap \text{Golden Label}|}{|\text{Predicted Output}|} \tag{5.1}
\]

This quantifies the proportion of the predicted output tokens that also appear in
the corresponding golden label tokens. Similarly, the recall metric calculates the
proportion of the golden label tokens that also appear in the predicted output tokens.
It is instead calculated as:

\[
\text{Recall} = \frac{|\text{Predicted Output} \cap \text{Golden Label}|}{|\text{Golden Label}|} \tag{5.2}
\]

The F1-score is the harmonic mean of these two proportions and is defined as:

\[
\text{F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{5.3}
\]

These metrics cover different aspects of the result. Cosine similarity assesses con-
text and paraphrasing, whereas the F1-score evaluates lexical overlap by measuring
whether the exact tokens in the reference set are present in the prediction. The F1-
score combines recall and precision; predictions containing many tokens not present
in the gold standard reference are penalised through lower precision, while pre-
dictions that omit many reference tokens are penalised through lower recall. By
balancing these two components, the F1-score accounts for both overinclusive and
underinclusive predictions.
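
Equations (5.1) to (5.3) can be illustrated with the sketch below. The tokenisation is an assumption made for this example (lower-casing and splitting on whitespace), as are the function name token_scores and the two sample strings.

```python
def token_scores(prediction: str, gold: str) -> tuple[float, float, float]:
    """Token-level precision, recall and F1 on sets of tokens.

    Assumed tokenisation: lower-case the text and split on whitespace.
    """
    pred_tokens = set(prediction.lower().split())
    gold_tokens = set(gold.lower().split())
    overlap = len(pred_tokens & gold_tokens)
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(gold_tokens) if gold_tokens else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example: the prediction omits four tokens of the gold label.
p, r, f1 = token_scores(
    "increase resilience of value chains",
    "cooperation to increase resilience of raw materials value chains",
)
print(round(p, 3), round(r, 3), round(f1, 3))
```

All five predicted tokens occur in the gold label, so the precision is 1, while the recall is 5/9 because four gold tokens are missing from the prediction.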

At the end of the evaluation, when all the objectives in one agreement have been
matched and evaluated, the average score is calculated for that document, giving an
overall score for each agreement. The average is calculated in two ways. The first
is the average over all the matched objectives and ignores the unmatched ones. The
second takes the unmatched objectives into consideration by applying zero-padding:
the unmatched objectives that remain in either the golden objectives set or the
predicted objectives set are given a score of zero for all four evaluation metrics.
For instance, if only half of the predicted objectives are matched with a golden
objective, none of the evaluation scores can exceed 0.5, since the unmatched half
is considered incorrect and penalises the total score.
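
The two averages can be sketched as follows for a single metric. The function name average_scores and the input values are assumptions chosen to reproduce the example in the text, where half of the objectives remain unmatched.

```python
def average_scores(matched_scores, n_unmatched):
    """Return (matched average, zero-padded average) for one metric.

    `matched_scores` holds the metric value of every matched pair and
    `n_unmatched` counts the objectives left over in either set.
    """
    n_matched = len(matched_scores)
    total = sum(matched_scores)
    matched_avg = total / n_matched if n_matched else 0.0
    # Zero-padding: every unmatched objective contributes a score of 0,
    # so only the denominator grows.
    padded_n = n_matched + n_unmatched
    penalised_avg = total / padded_n if padded_n else 0.0
    return matched_avg, penalised_avg


# Two perfectly matched objectives and two unmatched ones.
matched_avg, penalised_avg = average_scores([1.0, 1.0], n_unmatched=2)
print(matched_avg, penalised_avg)
```

Even with perfect scores on the matched pairs, the zero-padded average cannot exceed 0.5 in this case.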

The reason for keeping both methods is that they show different sides of the result.
The average matched score, without zero-padding, shows how well the objectives
that are matched actually perform in terms of the evaluation metrics. In plain
words, it reveals how accurately the identified objectives reproduce their reference
objectives. In contrast, the zero-padded and penalised average considers the whole
output and can therefore present a bad score if the gold standard set and the
prediction set differ considerably in size. Thus, it also captures the extent to
which the LLM has identified more or fewer objectives than the annotators have. The
penalised scores therefore do not solely reflect how well the model managed to
extract the specific wordings of each objective. In the end, an average of all the
agreement averages is calculated for both the matched and penalised metrics.

5.2 Evaluation of KPIs with Human Feedback
When evaluating the KPIs generated by the system of LLMs, a human evaluation
method was implemented in which the KPIs are manually validated against a defined
set of criteria. The reason for choosing this method is that the KPIs do not have
a predetermined golden label, unlike the objectives, which can be retrieved directly
from the agreements. The KPIs must instead be logically derived from the objectives:
they indicate the performance of something, and that something is determined by
the objective of a given endeavour. In addition, there is an arbitrary number of
ways in which the same KPI can be defined and expressed, and multiple different
KPIs can measure the same objective. These complexities make it infeasible to create
labels in the same way as when evaluating objectives. The golden labels need to be
an objective truth that reflects the ideal answer. A KPI generated by the system
could be completely valid, but because it does not exactly match an entry in a
potential ground truth set, it would be considered incorrect by the scoring system.

Another recurring issue would be annotator bias, which would be especially
prominent given the subjective nature of producing a suitable KPI for a specific
objective. There is no way of determining a ground truth, and therefore a
case-by-case approach to validation is the only reliable option. The evaluation
should ultimately address the usefulness of the system, which most automatic metrics
are typically incapable of capturing, and for which human evaluation still remains
the undisputed dominant approach [30]. Therefore, evaluation by means of human
feedback from relevant people was considered the best alternative. Specifically,
this was achieved by creating surveys, one for each collaboration described by an
agreement, which could then be sent out to the appropriate parties. To ensure
relevance, the evaluators were selected by searching through the agreements to find
lead scientists, scientists, or principal investigators related to the project. The
CRM system was also utilised in the search for appropriate assessors.

One survey was created for each agreement. The evaluators received the
questionnaire by email based on their involvement in the project to which the
agreement corresponded. Since these people had a direct link to the collaborations,
their judgement regarding the model’s proposed KPIs for the same project could be
deemed more trustworthy. To help the evaluators understand the purpose of the
request, background information about the PoC was also provided.

Each survey consisted of all the KPIs of a project, generated by the LLM-based
system. For each KPI, the evaluators had to reflect on it in terms of four criteria.
They were asked to answer whether the KPI was relevant, clear, actionable,
measurable, or none of these. The criteria were explained to the respondents as
shown in Table 5.2.

In addition, evaluators were also asked to comment on a KPI if they felt the need
to express some other reflections on the KPI. Furthermore, general questions were
posed at the very end, which can be viewed in Table 5.3.

Table 5.2: The meaningfulness criteria and their respective definitions, as
described in the evaluation surveys.

Criteria     Definition
Relevant     Do the KPIs reflect what matters for this collaboration?
Clear        Are the KPIs clearly defined and unambiguous?
Actionable   Would these KPIs support decision-making and tracking?
Measurable   Are the KPIs possible to measure given existing data today, or data that could exist?

Do you consider that the KPIs cover the whole problem and measure
everything that should be measured when evaluating the health of the
collaboration?

Are there any KPIs you would add which are not included in the
stated KPIs?

Do you have any general comments and/or recommendations?

Table 5.3: The general questions that were asked at the end of every survey. The
questions do not pertain to any single KPI but rather to the composition of all
proposed KPIs for a given collaboration.

The final general questions assess how well the predictions address the problem over-
all and whether the agreement’s KPIs collectively provide complete coverage of all
relevant aspects of what can and should be measured by KPIs.


6
Results

This chapter presents the results of each step in the developed system. Accordingly,
the outcomes of the evaluation are divided into the following sections: Objective
Extraction and KPI Definition. These findings stem from the methodological process
described in the previous chapters. Due to the confidential nature of the data used
to generate the results, no concrete examples of objectives or KPIs derived from
proprietary contracts can be presented. Instead, the illustrations shown are based
on publicly available contracts. The KPIs generated from these public sources are
unvalidated and are provided solely to demonstrate how the results would appear if
the proprietary data could be disclosed. Because the survey responses refer to
confidential AZ contracts, portions of the responses have been redacted.

6.1 Performance of Objective Extraction
The ability to identify and extract objectives from the documents has been evaluated
in terms of the metrics specified earlier (see Section 5.1). To ensure that the
results were not skewed in any direction, several embedding models were tried for
the evaluation of the objective extraction. When comparing the performance metrics
across backbones, the alternative yielding the highest score with a limited standard
deviation was ultimately picked, resulting in the use of all-MiniLM-L6-v2. Details
regarding backbone selection can be found in Section 5.1 and Appendix B.

This objective extraction procedure was performed on both public data and
proprietary data from AZ. The AZ data could also be subdivided based on whether the
project was conducted with an academic or a business collaborator. The performance
on each of these document types can be seen in Table 6.3. The row marked Business
[long] refers to the data where, for extremely long contracts, the whole document
has been processed, in contrast to processing only the segments that were deemed
more relevant. The reasoning behind these approaches is given in Section 4.1.1.

As in the annotation process, the outcome is a set of identified text snippets
considered to reflect the overall objective of a given collaboration project. The
difference is that the LLM is now tasked with identifying them. To illustrate the
results of the objective extraction process, an example based on a publicly
available contract is displayed in Table 6.1, which makes it possible to compare
the two sets.

The indices show which segments from the two sets are mapped to one another.
Their numbering follows the order in which the snippets were identified in the text.
The subsequent Table 6.2 presents the remaining snippets. The matching algorithm
is exhaustive, meaning that it maps elements from each of the sets until one is
empty. In this case, one can observe that more objectives were identified by the
annotators than by the LLM. The fact that the unmatched gold objective is the one
with index 0 also explains why the matched indices are staggered by one in Table 6.1.

Indices  Prediction                          Gold Standard                       Metrics

(1, 2)   closer economic and industrial      closer economic and industrial      P=1.000
         integration of the Participants     integration of the Participants     R=1.000
         in sustainable value chain of       in sustainable value chain of       F1=1.000
         raw materials                       raw materials                       cos=1.000

(6, 7)   cooperation on skills, capacity     cooperation on skills, capacity     P=1.000
         building and competences            building and competences            R=1.000
         necessary for the development       necessary for the development       F1=1.000
         of sustainable raw materials        of sustainable raw materials        cos=1.000
         value chains, including the         value chains, including the
         promotion of the most               promotion of the most
         sustainable extraction and          sustainable extraction and
         transformation practices, and       transformation practices, and
         circular economy                    circular economy

(2, 3)   cooperation to increase             cooperation to increase             P=1.000
         resilience of raw materials         resilience of raw materials         R=1.000
         value chains                        value chains                        F1=1.000
                                                                                 cos=1.000

(4, 5)   the development of open,            the development of open,            P=1.000
         resilient and competitive           resilient and competitive           R=1.000
         markets for raw, processed and      markets for raw, processed and      F1=1.000
         recycled materials, allowing the    recycled materials, allowing the    cos=1.000
         EU to diversify its suppliers for   EU to diversify its suppliers for
         materials necessary in              materials necessary in
         particular to achieve the clean     particular to achieve the clean
         and digital transition and its      and digital transition and its
         open strategic autonomy             open strategic autonomy

(5, 6)   promoting the alignment of          promoting the alignment of          P=1.000
         sustainable raw materials value     sustainable raw materials value     R=1.000
         chains developed between the        chains developed between the        F1=1.000
         EU and the Argentine Republic       EU and the Argentine Republic       cos=1.000
         with internationally agreed         with internationally agreed
         principles and guidelines for       principles and guidelines for
         environmental, social and           environmental, social and
         governance (ESG) standards          governance (ESG) standards

(7, 8)   facilitate closer cooperation on    facilitate closer cooperation on    P=1.000
         research and innovation along       research and innovation along       R=1.000
         the raw materials value chain,      the raw materials value chain,      F1=1.000
         including advanced                  including advanced                  cos=1.000
         exploration, earth observation,     exploration, earth observation,
         innovative extractive,              innovative extractive,
         processing, refining and            processing, refining and
         recycling technologies              recycling technologies

(0, 1)   identifying and jointly             identifying and jointly             P=1.000
         developing innovative and           developing innovative and           R=1.000
         sustainable and responsible raw     sustainable and responsible raw     F1=1.000
         materials value chain projects      materials value chain projects      cos=1.000
         by facilitating business            by facilitating business
         opportunities, deploying            opportunities, deploying
         financial support, investment       financial support, investment
         de-risking instruments              de-risking instruments

(3, 4)   developing the Argentine            developing the Argentine            P=1.000
         Republic’s sustainable raw          Republic’s sustainable raw          R=0.895
         materials value chains in its       materials value chains in its       F1=0.944
         environmental, social and           environmental, social and           cos=0.988
         economic dimensions as a lever      economic dimensions as a lever
         for a sustainable and inclusive     for a sustainable and inclusive
         economic growth, the creation       economic growth, the creation
         of local added value, quality       of local added value, quality
         employment, the development         employment, the development
         of local industrialization and      of local industrialization and
         domestic revenue mobilisation       domestic revenue mobilisation;
                                             thereby increasing the
                                             competitiveness of the
                                             Argentine economy

Table 6.1: Comparison between text segments identified as objectives, from the
resource agreement [29] between the EU and Argentina. In the left column, the text
corresponds to predictions by the LLM, while the right column displays what was
deemed to be the most correct text when annotating the dataset.

Index   Gold Standard
(0)     deepen cooperation in the field of sustainable raw materials value chains
        that support the clean energy and digital transition

Table 6.2: The table lists a set of annotated objectives that have not been paired
with an LLM output, for the EU-Argentina resource agreement [29].

The performance of the first stage is summarised in Table 6.3 for all types of
agreements. The scoring is done with the same metrics as previously. Each metric is
separated into two distinct scores: one in which only the paired objectives are
considered, and another in which the remaining unpaired objectives are considered
as well. The latter can be regarded as a penalised version of the matched case (see
Section 5.1), because taking the unpaired objectives into account increases the
denominator, which pushes the quotient down. This explains why the penalised result
is consistently lower than its matched counterpart for all metrics. Notably, the
academic documents consistently score best in the matched case. For the penalised
scores, the resource agreements fare better. When it comes to Business compared to
Business [long], the former performs better than the latter, significantly so in the
penalised cases. The only time the roles are reversed is for matched cosine
similarity, where Business [long] scores slightly higher than Business.
Nevertheless, both achieve results that are generally lower than those of Academic
and Resource.

Document type          No. of files   Precision        Recall           F1               Cosine Similarity
                                      Match.  Pen.     Match.  Pen.     Match.  Pen.     Match.  Pen.
Academic (AZ)          15             0.867   0.569    0.830   0.548    0.839   0.553    0.891   0.580
Business (AZ)          3              0.728   0.568    0.709   0.545    0.708   0.547    0.699   0.532
Business [long] (AZ)   3              0.708   0.318    0.690   0.288    0.676   0.293    0.723   0.275
Resource (Dummy)       7              0.807   0.642    0.810   0.633    0.793   0.626    0.881   0.688

Table 6.3: The average performance in terms of the evaluation metrics is showcased
by document type. The underlying backbone that is utilised for this is the all-
MiniLM-L6-v2. The best scores per metric are highlighted in bold.

6.2 Performance of KPI Definition
The results are based on surveys consisting of quantitative yes-or-no questions,
which rate each KPI in terms of predefined criteria, as well as qualitative comments.
Hence, this section is divided into one part on quantitative results and one on
qualitative results. In total, 101 KPIs were evaluated by the seven evaluators who
responded to the surveys.

6.2.1 Quantitative Survey Results
In total, there were eight answers to the surveys. One person was involved in two
separate projects, thus responding to two different surveys. For both these question-
naires, the respondent answered by marking all proposed KPIs as irrelevant, except
for two and three instances, respectively. The other surveys did not reflect responses
as one-sided as this.

In Table 6.4, the KPIs of each agreement have been analysed by calculating the
average score over all its KPIs in terms of the meaningfulness criteria (see
Section 5.2). For instance, the KPIs of agreement number eight were 8% relevant on
average. The Average column shows the mean of these per-agreement scores for each
criterion; the average relevance over all agreements was calculated to be 41%. This
is performed for all meaningfulness criteria. Not Applicable indicates whether that
box was checked in the survey. If selected, it signified that none of the other four
criteria were satisfied for a proposed KPI. Applicable is calculated as
Applicable = 1 − Not Applicable, assuming that all KPIs not marked as Not Applicable
can be considered Applicable.

Table 6.4: Averages over all KPIs within agreements by meaningfulness criteria.
The numbers correspond to different agreements that survey respondents gave feed-
back on.

Criteria         1      2      3      4      5      6      7      8      Average
Relevant         0.30   0.67   0.54   0.46   0.17   0.53   0.54   0.08   0.41
Clear            0.10   0.67   0.62   0.38   0.00   0.40   0.54   0.31   0.38
Actionable       0.10   0.67   0.23   0.46   0.00   0.67   0.46   0.08   0.33
Measurable       0.10   0.67   0.69   0.46   0.00   0.67   0.69   0.62   0.49
Not Applicable   0.70   0.17   0.23   0.54   0.83   0.20   0.23   0.23   0.39
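
As a consistency check, the Average column of Table 6.4 can be recomputed from the per-agreement values. The sketch below assumes that the column is the unweighted mean over the eight agreements, which agrees with the reported numbers to within rounding; the variable names are chosen for this illustration.

```python
# Per-agreement values and the reported Average column, copied from Table 6.4.
table_6_4 = {
    "Relevant": ([0.30, 0.67, 0.54, 0.46, 0.17, 0.53, 0.54, 0.08], 0.41),
    "Clear": ([0.10, 0.67, 0.62, 0.38, 0.00, 0.40, 0.54, 0.31], 0.38),
    "Actionable": ([0.10, 0.67, 0.23, 0.46, 0.00, 0.67, 0.46, 0.08], 0.33),
    "Measurable": ([0.10, 0.67, 0.69, 0.46, 0.00, 0.67, 0.69, 0.62], 0.49),
    "Not Applicable": ([0.70, 0.17, 0.23, 0.54, 0.83, 0.20, 0.23, 0.23], 0.39),
}


def mean(values):
    """Unweighted arithmetic mean."""
    return sum(values) / len(values)


# Assumption: Average = unweighted mean of the eight agreement scores.
recomputed = {name: mean(values) for name, (values, _) in table_6_4.items()}

# Applicable = 1 - Not Applicable, applied to the averaged value.
applicable_average = 1 - recomputed["Not Applicable"]

for name, (_, reported) in table_6_4.items():
    print(f"{name}: recomputed {recomputed[name]:.4f}, reported {reported:.2f}")
print(f"Applicable: {applicable_average:.4f}")
```

Under this assumption, roughly 61% of the proposed KPIs were, on average per agreement, not marked as Not Applicable.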

In Table 6.5, the evaluation has instead been divided into the different categories
of KPIs deduced from the interview process (see Table 3.1). This is done to better
examine which of the general and specific KPIs the respondents consider relevant,
clear, actionable, measurable, or not applicable. Collaboration-specific KPIs are
grouped into one category as they differ between each project.

Table 6.5: The traits of the meaningfulness criteria that the evaluators found the
KPIs to have. Each score is the fraction of KPIs that were selected as having a
given trait. KPIs are grouped by the categories described in Chapter 3. The best
value per criterion is highlighted in bold.

Category                            Relevant   Clear   Actionable   Measurable   Not Applicable
No. of patents filed                0.50       0.38    0.38         0.38         0.50
Follow-up research                  0.50       0.50    0.50         0.75         0.25
No. of scientific publications      0.50       0.50    0.50         0.50         0.50
Impact factor of publication        0.63       0.63    0.50         0.75         0.25
Reputation of publication journal   0.33       0.33    0.17         0.33         0.67
External engagements                0.75       0.38    0.38         0.50         0.13
External funding                    0.29       0.57    0.43         0.57         0.43
Alignment with company goals        0.50       0.00    0.50         0.00         0.50
Budget coherence                    0.88       0.75    0.63         0.63         0.00
No escalations to leadership        0.00       0.38    0.13         0.25         0.63
Recruitment of Post-Doc             0.25       0.38    0.38         0.63         0.38
Collaboration-specific KPIs         0.30       0.20    0.23         0.50         0.40

6.2.2 Qualitative Survey Results
As part of the human evaluation, the participants were asked to write qualitative
comments. Table 6.6 shows the answers to the general questions about the KPIs
collectively. The responses reflect mixed views regarding the quality of the KPIs.
They also provide some insight into each project’s relation to KPIs and the
possibilities of measuring them. Table 6.7 instead presents the comments that the
respondents made on individual KPIs while quantitatively evaluating them. It works
as a complement to the scoring, as it specifies in some cases why a given KPI is
considered good or bad.

Table 6.7: The written answers corresponding to the different kinds of KPIs. Not
every KPI category is present, as not all types of KPIs received qualitative
feedback. Specific information relating to the projects has been redacted. If a KPI
category is not displayed in the left column, the previous category still applies.

KPI Category Answers

Number of patents

– We don’t expect any patents to be filed
from our collaboration, thus not checking
the ’relevant’ box. A more appropriate
measure along the same lines would be sci-
entific publications.
– Not a goal but at least clear, here num-
ber has more value as it goes though an
evaluation process
– It´s not likely that any patents will be
filed, but the KPI is valid.

Follow-up projects

– It would be an indirect measure of the
success of the first project, but wouldn’t
necessarily tell the whole story. The
project could have been great but run its
course. Or funding might have run out
even though there’s a want from the peo-
ple involved to continue.

Impact factor of publication
– Number of publications, which would
strike me as the first obvious KPI, seems
to be absent as a KPI alltogether.

Reputation of publication journal
– We always use impact factor as the met-
ric for journals we publish in. I don´t
know what a reputation score is.

Alignment with company (%)
– Don’t think it either relevant or in any
way quantifuable/measurable.

No escalation to senior leadership
– not sure what this means/refers to
– Define escalation - I assume concerns or
issues escalated?
– It´s a rather odd KPI; does it mean you
are successful in performing your research
without ethical or data integrity miscon-
duct? Or what?

Number of publications
– Good evaluation of academic collabora-
tion, often takes time though and may not
be completed within contract time.
– "Number of research results" isn´t really
a well defined metric. What constitutes "a
result"? If the KPI is supposed to capture
"Number of publications in peer reviewed
journals" it´s a relevant KPI.

External engagement

– "External engagements" is rather vague.
"Number of posters and presentations at
external conferences/workshops" would be
a better KPI defined KPI, in my opinion

Collaboration-specific KPI
– The time scale for getting on this would
be years/decades, so not really measurable
in practice I would say.
– Too wide of a scope.
– Not sure if this can be actionable since
we cannot influence what [REDACTED]
– It’s unclear what [REDACTED] means
– we included this as a QC check and con-
ditional element to the collaboration

– i think there is something here but how
do you define "accuracy and consistency"
– Number of algorithms does not necessar-
ily capture impact completely. Number of
houses: three sheds and a palace. . .
– Same as in KPI 1
– Number not good measure, perhaps
functional integration of new QC/analysis
modules from collaboration into internal
pipeline.
– Again, would focus on integration or
maybe. Again, this is relevant but easy
to check box without impact.
– did the collaboration result in the devel-
opment of a useful FM for feature extrac-
tion. Number has little meaning
– I presume you could count the number
of models tested, but that by itself has no
meaning. The important thing is if the
models are relevant in the context.
– Measurable, but like the "number of
mathematical models" KPI the number it-
self has limited relevance, it´s the impact
that is important.
– I am uncertain about what data this KPI
is supposed to be calculated from. Is it
peer reviewer feedback from journals the
work is submitted to? Scoring on these
aspects on research grant requests submit-
ted based on the data? Or what?

Table 6.6: This table presents answers to the general broad questions of the survey.

Question: Do you consider that the KPIs cover the whole problem and measure everything
that should be measured when evaluating the health of the collaboration?
Answers:
– No
– To some extent
– Largely covers it
– Some happen sooner (data generation) and some later (publications, etc) so KPIs might
not be time bound
– No

Question: Are there any KPIs you would add which are not included in the stated KPIs?
Answers:
– Number of publications, or total impact/citations etc.
– Not sure
– This is a collaboration in R&D, specifically early development - KPIs in relation to
new target identification, positive governance interactions would seem salient
– Again, focus on integration of methods vs PoC or number of algorithms. One can develop
100 algorithms in an afternoon but none of them are useful...
– Yes, several. E.g. "Number of AZ projects supported by the new models developed"

Question: Do you have any general comments and/or recommendations?
Answers:
– The suggested KPIs seems to have a more late stage product focus than basic, early
science.
– You are on the right track, explore how better define quality and impact of new
algorithms or digital deliveries
– Identifying measurable KPIs that show relevant impact on how we do our business is
challenging in general. In my experience we tend to prioritize "measurable" over
"relevant"...

7
Discussion

This chapter interprets the results and formulates recommendations on how to maximise
the utility of the implemented system and how to apply it in a larger setting. The
analysis of the outputs may, however, be limited to general discussion, and the
presented examples may have been altered so as not to reveal confidential information.

7.1 Objective Extraction
To begin with, the objective extraction capabilities are quite good. Although
processing large quantities of text is what an LLM is designed to do, it is still
impressive that the same task can be accomplished repeatedly with a fairly high level
of accuracy, given the variation in the data the model is exposed to.

For the largest set of documents, namely those with academic collaborators, the
results were the best in the matched case. In the unmatched case, the metrics for the
academic agreements were not the best, but still very similar to those of the business
agreements. Here, the resource agreements (dummy data) performed best instead.

What is not presented in the results section of this report is the ratio of unmatched
objectives to the exhausted set of objectives. The example shown in Table 6.2 implies
that the exhausted set is the one with the predictions. N.B., this is not always the
case, as it can be the other way around. Nevertheless, looking directly at the data
shows that, in the vast majority of instances, the unmatched objectives that remain
come from the prediction set, i.e. there is a surplus of candidates, which leads to
these low scores. In some cases, however, the objectives remaining after matching come
from the gold standard set. This means that, at times, objectives are simply not
identified to the extent that they ideally would be. Regardless, the objectives that
have been identified are most often correct, as indicated by the matched metrics. This
reasoning applies to academic, business, and resource collaborations alike.
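
The reasoning above can be illustrated with a minimal sketch. It is not the evaluation code used in this thesis: the function name, the dictionary keys, and in particular the rule that every unmatched objective contributes a score of zero to the penalised mean are assumptions made for illustration only. The sketch shows how a surplus of predicted candidates leaves the matched mean untouched while pulling the penalised mean down.

```python
def summarise_matching(pair_scores, n_predictions, n_gold):
    """Summarise a one-to-one matching of predicted and gold objectives.

    pair_scores: one similarity score in [0, 1] per matched pair.
    n_predictions, n_gold: sizes of the prediction and gold standard sets.
    All names here are hypothetical and chosen for illustration.
    """
    n_matched = len(pair_scores)
    if n_matched > min(n_predictions, n_gold):
        raise ValueError("more matched pairs than objectives in the smaller set")

    # Objectives left over on each side after matching.
    unmatched_predictions = n_predictions - n_matched
    unmatched_gold = n_gold - n_matched

    if unmatched_predictions > unmatched_gold:
        surplus_side = "prediction"   # too many candidates were extracted
    elif unmatched_gold > unmatched_predictions:
        surplus_side = "gold"         # some objectives were never identified
    else:
        surplus_side = "none"

    total = sum(pair_scores)
    # Matched mean: only the pairs that were actually matched.
    matched_mean = total / n_matched if n_matched else 0.0
    # Penalised mean (assumed rule): each unmatched objective scores zero.
    denominator = n_matched + unmatched_predictions + unmatched_gold
    penalised_mean = total / denominator if denominator else 0.0

    return {
        "surplus_side": surplus_side,
        "unmatched_predictions": unmatched_predictions,
        "unmatched_gold": unmatched_gold,
        "matched_mean": matched_mean,
        "penalised_mean": penalised_mean,
    }


# Three gold objectives, six predicted candidates, three matched pairs.
summary = summarise_matching([0.9, 0.8, 0.7], n_predictions=6, n_gold=3)
print(summary)
```

In this made-up example, the three matched pairs give a matched mean of roughly 0.8, whereas the three surplus predictions halve the penalised mean to roughly 0.4, even though every identified objective that could be matched was matched well.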

The performance on the Business [long] agreements was substantially worse across all
metrics than on the Business set, in which only a selection of relevant excerpts was
treated as the business agreement. As expected, the preprocessing procedure has a
massive effect on the results. The view that the discrepancy stems from a selection
bias may be correct, although it seems even more likely that simply too many candidate
objectives were identified when the extremely long documents were processed in full.
Piece-wise processing of the long contracts has without question produced more
objectives than the gold standard contains. The length of the documents likely affects
the LLM’s ability to sift out the correct objectives. The most likely cause is that
the system forces itself to extract objectives from each provided excerpt, even when
that piece of text does not necessarily contain the relevant information. The LLM is
prompted to find the overarching objective of the collaboration, a task that is
naturally skewed when the frame is narrowed down to a single excerpt.

That more candidate objectives were found explains why the penalised metrics become
much worse than otherwise. By the same logic, one would guess that the matched metrics
would increase, or at least stay the same, since the chance of finding correct
objectives rises with more candidates available. This turns out to be
quite the opposite. One e