AI Assisted matching in
Mergers And Acquisitions
A Data-Driven Approach to Identifying Potential Acquirers

Master’s thesis in Data Science and AI

WILHELM JOHNSON SWEGMARK
DIDRIK TVEDT

DEPARTMENT OF MATHEMATICAL SCIENCES

CHALMERS UNIVERSITY OF TECHNOLOGY

Gothenburg, Sweden 2026

www.chalmers.se

www.chalmers.se


Master’s thesis 2026

AI Assisted Matching in
Mergers and Acquisitions

A Data-Driven Approach to Identifying Potential Acquirers

WILHELM JOHNSON SWEGMARK

DIDRIK TVEDT

Department of Mathematical Sciences

Chalmers University of Technology

Gothenburg, Sweden 2026


AI Assisted Matching in Mergers and Acquisitions

A Data-Driven Approach to Identifying Potential Acquirers

WILHELM JOHNSON SWEGMARK

DIDRIK TVEDT

© WILHELM JOHNSON SWEGMARK, DIDRIK TVEDT 2026.

Supervisor: Johan Jonasson, Department of Mathematical Sciences

Examiner: Johan Jonasson, Department of Mathematical Sciences

Master’s Thesis 2026

Mathematical Sciences

Chalmers University of Technology

SE-412 96 Gothenburg

Telephone +46 31 772 1000

Cover:

Typeset in LATEX, template by Kyriaki Antoniadou-Plytaria

Printed by Chalmers Reproservice

Gothenburg, Sweden 2026

iv


AI Assisted Matching in Mergers and Acquisitions

A Data-Driven Approach to Identifying Potential Acquirers

WILHELM JOHNSON SWEGMARK

DIDRIK TVEDT

Department of Mathematical Sciences

Chalmers University of Technology

Abstract

Traditional buyer identification in M&A relies on manual screening and professional

networks, making it resource-intensive and naturally limiting the buyer pool. This

thesis investigates whether textual embedding models can support the identification

of relevant potential buyers in mergers and acquisitions. The study examines how

different representation methods, including TF-IDF, Doc2Vec with smooth inverse

frequency weighting, and Transformer based models, capture similarity between

companies when applied to standardized summaries of portfolio company descrip-

tions. The summaries are created using a large language model with information

provided on the portfolio companies websites. The performance of the embedding

models is evaluated through visualization of the embedding spaces, cosine simi-

larity search experiments, and an expert review of buyer recommendations. The

results indicate that TF-IDF and the Transformer model produced relevant recom-

mendations, with the Transformer model demonstrating the best performance in

embedding space separation and alignment with expert judgment, while Doc2Vec

models showed weaker differentiation between company types. Overall, the study

shows that embedding based similarity search can serve as a useful first step in buyer

discovery by expanding the range of potential buyers considered and improving effi-

ciency. The work also highlights that further validation across a larger set of targets

and with a more complete dataset would strengthen confidence in these results.

Keywords: M&A, NLP, LLM, Embeddings, Semantic Similarity

v


Acknowledgements

We would like to express our sincere gratitude to our Chalmers supervisor, Johan

Jonasson, whose expertise, guidance, and feedback have been invaluable throughout

the course of this thesis. Your support has helped us navigate challenges, refine our

ideas, and ultimately shape the direction of our work.

We would also like to extend our appreciation to Merge, with which this thesis

was carried out. The opportunity to work closely with the company, along with

the access to industry knowledge, data and practical perspectives, has contributed

greatly to the development and relevance of this project.

Wilhelm Johnson Swegmark and Didrik Tvedt, Gothenburg, December 2025

vii


List of Acronyms

Below is the list of acronyms that have been used throughout this thesis listed in

alphabetical order:

BERT Bidirectional Encoder Representations from Transformers

BoW Bag of Words

LLM Large Language Model

M&A Mergers and Acquisitions

NLP Natural Language Processing

PCA Principal Component Analysis

SIF Smooth Inverse Frequency

TF-IDF Term Frequency-Inverse Document Frequency

UMAP Uniform Manifold Approximation and Projection for Dimension

Reduction

ix


Contents

List of Acronyms ix

List of Figures xv

List of Tables xvii

1 Introduction 1

1.1 Aim . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

1.2 Research Questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.3 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.4 Ethical aspects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.5 AI Declaration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2 Theory 7

2.1 Term Frequency-Inverse Document Frequency . . . . . . . . . . . . . 7

2.2 Word2Vec . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.3 Doc2Vec . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.4 SIF embeddings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.5 Transformers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

2.5.1 Sentence Embeddings . . . . . . . . . . . . . . . . . . . . . . . 15

2.6 Dimensionality Reduction Methods . . . . . . . . . . . . . . . . . . . 17

2.6.1 PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

2.6.2 UMAP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2.7 Similarity search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

2.8 Large Language Models for Summarization . . . . . . . . . . . . . . . 20

xi


Contents

3 Method 23

3.1 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.1.1 Scraping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.1.2 Summarization via LLM . . . . . . . . . . . . . . . . . . . . . 25

3.1.3 Preprocessing and cleaning . . . . . . . . . . . . . . . . . . . . 27

3.2 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.2.1 TF-IDF . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.2.2 Doc2Vec . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.2.3 Transformer Models . . . . . . . . . . . . . . . . . . . . . . . 30

3.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.4.1 Visualizations . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

3.4.2 Similarity Search . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.4.3 Expert Evaluation . . . . . . . . . . . . . . . . . . . . . . . . 33

4 Results 37

4.1 Dataset statistics and quality . . . . . . . . . . . . . . . . . . . . . . 37

4.1.1 Quality of LLM Summaries . . . . . . . . . . . . . . . . . . . 38

4.2 Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

4.3 Embedding Visualization . . . . . . . . . . . . . . . . . . . . . . . . . 41

4.3.1 All Portfolio Companies . . . . . . . . . . . . . . . . . . . . . 41

4.3.2 Subset visualization . . . . . . . . . . . . . . . . . . . . . . . 41

4.4 Similarity Search Experiments . . . . . . . . . . . . . . . . . . . . . . 43

4.4.1 Similarity Score Distributions . . . . . . . . . . . . . . . . . . 44

4.4.2 Example of target similarity suggestions . . . . . . . . . . . . 45

4.5 Expert Review Results . . . . . . . . . . . . . . . . . . . . . . . . . . 50

5 Discussion 51

5.1 Discussion of Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

5.1.1 Embedding space and Visualizations . . . . . . . . . . . . . . 52

5.1.2 Similarity Search and Buyer Recommendation . . . . . . . . . 54

5.2 Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

xii


Contents

5.2.1 Impact of LLM Summarization . . . . . . . . . . . . . . . . . 56

5.2.2 Limitations in Scraped Data . . . . . . . . . . . . . . . . . . . 57

5.2.3 Limitations in Method . . . . . . . . . . . . . . . . . . . . . . 58

5.3 Recommendations for Future Work . . . . . . . . . . . . . . . . . . . 58

5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

Bibliography 61

A Appendix A: Competitor Similarity Results I

B Appendix B: Expert Evaluation Summary XVII

xiii


Contents

xiv


List of Figures

2.1 Overview of the Transformer architecture (Vaswani et al., 2017), with

the encoder part on the left and the decoder part on the right . . . . 15

4.1 UMAP visualization of all companies for the different models . . . . . 42

4.2 UMAP visualization of a subset of companies within 5 industries for

the different models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

4.3 UMAP visualization of all companies in the embedding space, with

selected companies from five industries highlighted . . . . . . . . . . . 43

4.4 Distribution of similarity scores . . . . . . . . . . . . . . . . . . . . . 44

xv


List of Figures

xvi


List of Tables

3.1 Model specifications for Transformer embedding models . . . . . . . . 30

4.1 Statistics for the company summaries . . . . . . . . . . . . . . . . . . 37

4.2 Example of a generated company summary for BSI Software . . . . . 38

4.3 Example of a generated company summary for Apotea. . . . . . . . . 39

4.4 Runtime statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

4.5 Statistics of cosine similarity distributions for each embedding model 44

4.6 TFIDF Model Matches for Klarna . . . . . . . . . . . . . . . . . . . . 46

4.7 Doc2Vec PCA Model Matches for Klarna . . . . . . . . . . . . . . . . 47

4.8 Doc2Vec SIF Model Matches for Klarna . . . . . . . . . . . . . . . . 48

4.9 Transformer Model Matches for Klarna . . . . . . . . . . . . . . . . . 49

4.10 Expert ratings for each model. Scores range from 1 (not relevant) to

3 (highly relevant) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

A.1 Top-3 Similar Companies for http://www.partnerre.com (Insurance) . I

A.2 Top-3 Similar Companies for http://www.maxm.se (Insurance) . . . . II

A.3 Top-3 Similar Companies for https://www.hedvig.com (Insurance) . . II

A.4 Top-3 Similar Companies for https://www.epicbrokers.com (Insurance) III

A.5 Top-3 Similar Companies for https://www.brookfield.com (Asset Man-

agement) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . III

A.6 Top-3 Similar Companies for http://cworldwide.com (Asset Manage-

ment) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . IV

A.7 Top-3 Similar Companies for https://www.oaktreesicav.com (Asset

Management) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . V

xvii


List of Tables

A.8 Top-3 Similar Companies for https://www.spiltanfonder.se (Asset Man-

agement) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . V

A.9 Top-3 Similar Companies for https://www.mandatum.fi (Asset Man-

agement) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . VI

A.10 Top-3 Similar Companies for https://www.soderbergpartners.se (As-

set Management) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . VI

A.11 Top-3 Similar Companies for https://sjukhus.sophiahemmet.se (Health-

care) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . VII

A.12 Top-3 Similar Companies for https://www.landmarkhealth.org (Health-

care) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . VIII

A.13 Top-3 Similar Companies for http://www.vamed-care.com (Healthcare)VIII

A.14 Top-3 Similar Companies for https://www.highridgemedical.com (Health-

care) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . IX

A.15 Top-3 Similar Companies for http://www.reliant-rehab.com (Health-

care) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . X

A.16 Top-3 Similar Companies for https://www.cloverhealth.com (Health-

care) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . X

A.17 Top-3 Similar Companies for https://www.valimmobilier.ch (Real Es-

tate Brokers) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . XI

A.18 Top-3 Similar Companies for https://bskimmobilier.com (Real Estate

Brokers) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . XI

A.19 Top-3 Similar Companies for https://www.renson.fr (Industrial) . . . XII

A.20 Top-3 Similar Companies for https://azekco.com (Industrial) . . . . . XIII

A.21 Top-3 Similar Companies for https://www.globalppi.com (Industrial) XIII

A.22 Top-3 Similar Companies for https://www.hubs.com (Industrial) . . . XIV

A.23 Top-3 Similar Companies for https://www.rotomon.fi (Industrial) . . XIV

B.1 Expert evaluation scores (1/3) . . . . . . . . . . . . . . . . . . . . . . XVII

B.2 Expert evaluation scores (2/3) . . . . . . . . . . . . . . . . . . . . . . XVIII

B.3 Expert evaluation scores (3/3) . . . . . . . . . . . . . . . . . . . . . . XIX

xviii


1
Introduction

Identifying potential buyers is a critical first step in the mergers and acquisitions

(M&A) process, directly influencing deal success and transaction outcomes (Merge,

2025). Brokers, investment banks, and advisory firms traditionally rely on their net-

works, relationships, curated databases of past transactions, and manual screening

to build buyer lists (Merge, 2025). Although this human expertise is invaluable, it is

also resource intensive and prone to biases that limit both scalability and diversity in

the buyer lists produced. Consequently, buyer lists may become concentrated around

well-known or easily identifiable investors, such as major private equity funds. This

can inadvertently limit the consideration of smaller or less conventional buyers, re-

ducing the breadth of potential opportunities. This creates a clear business need

for systems that can complement human expertise by discovering non-trivial, high-

potential buyers, reducing manual effort, and broadening the coverage of the buyer

universe.

A further challenge lies in the fact that potential buyers differ fundamentally in

their motivations and acquisition strategies. An important distinction is between

financial and strategic buyers (Merge, 2025). Financial buyers, such as private eq-

uity firms, venture capital funds, and investment funds, are typically motivated

by return on investment, exit strategies, and growth potential. Their acquisition

considerations often emphasise cash flows, profitability, and other financial metrics,

and they usually hold assets for a limited period before seeking a profitable exit.

Strategic buyers, on the contrary, are operating companies that acquire for reasons

tied to long-term positioning, such as vertical or horizontal integration, geographic

1


1. Introduction

expansion, or access to new technology and capabilities. These buyers may be will-

ing to pay a premium because synergies raise the potential value of the acquisition,

and they often place greater emphasis on non-financial considerations like cultural

fit, strategic alignment, and competitive positioning. They also tend to integrate

acquisitions more directly into their operations, whereas financial buyers may leave

management largely in place. Understanding these distinctions is crucial for de-

signing a recommendation system, since the relevance of a buyer depends on its

incentives, constraints, and historical behaviour.

Artificial intelligence and machine learning open up opportunities to address these

challenges by enabling more data-driven and scalable approaches to buyer iden-

tification. By capturing richer representations of companies, modelling patterns of

similarity, and integrating diverse types of information, such methods can reduce the

dependence on manual processes. Importantly, these technologies can be designed

to support, rather than replace, human expertise and thus help brokers generate

more diverse and high-quality recommendations while maintaining interpretability.

This motivates the present study, which explores how AI-driven methods can con-

tribute to a more efficient, data-driven, and insightful process for buyer–seller match-

ing in M&A.

1.1 Aim

The general aim of this thesis is to investigate how AI can be applied to improve

the process of buyer identification in mergers and acquisitions. Specifically, the goal

is to design a method that generates prioritised and interpretable buyer recommen-

dations for a given target company, thus reducing reliance on manual processes and

increasing efficiency in generating leads.

2


1. Introduction

1.2 Research Questions

The research questions for the thesis are stated below:

1. How can we embed companies to best represent the business in operative terms

given publicly available data?

(a) How can dimensionality reduction techniques be used to visualize these

embeddings and provide insight into the structure of embedding spaces?

(b) Which dimensions of a firm’s business model (e.g. product offerings and

operational domain) are encoded in these embeddings?

2. Can we use these embeddings to find companies with a similar business model?

(a) How can we use these embeddings to generate a buyer recommendation

of a given target company?

(b) How does performance differ between different embedding models?

1.3 Limitations

Several limitations influence both the scope and the technical approach taken. First,

the study is based on a list of financial buyers provided by Merge with a strong con-

centration of Nordic buyers but also including several major global buyers. This

means that the portfolio companies of these buyers may extend internationally.

Strategic buyers are not included in this analysis. The primary reason is that no

comprehensive or standardized list of potential strategic acquirers exists. Identify-

ing them would require considering virtually every corporation that might acquire a

company for strategic reasons. Because of the lack of definable boundaries and the

impractical scale of data collection, we chose not to pursue the inclusion of strategic

buyers in this study.

On the technical side, the study emphasizes AI methods that can be executed on

CPU compute, ensuring that any resulting application can run efficiently in Merge’s

environment without the need for GPU resources.

3


1. Introduction

Another limitation concerns the lack of labelled data that could be used to directly

supervise and benchmark machine learning models. Instead, evaluation must rely

on indirect methods, such as conducting manual assessments together with domain

experts. While this allows for valuable qualitative insights, it also introduces sub-

jectivity and reduces the extent to which model performance can be measured using

standard quantitative metrics.

1.4 Ethical aspects

Training state of the art Large Language Models (LLMs) can carry a large en-

vironmental cost. Strubell et al. (2019) estimate that the total carbon emissions

associated with training and tuning a single LLM including hyperparameter tuning,

architecture exploration, and repeated runs, can exceed the equivalent of 284 tons

of CO2. This amount is comparable to or greater than the lifetime emissions of

multiple passenger vehicles and has likely exploded further in recent years. These

emissions reflect the electricity consumed by GPUs and data centre infrastructure

during extensive experimentation, rather than a single training run alone. Even

though using readily available models doesn’t infer a full training regime model se-

lection should still consider not only accuracy but also computational and energy

demands. Even though using readily available pretrained models does not require

repeating the full training regime, model selection should still consider not only KPIs

such as accuracy but also computational and energy demands. While this does not

alter how foundation models are trained broadly, but a default preference for effi-

cient models helps ensure that environmental cost remains an explicit consideration

rather than an afterthought.

A further ethical consideration concerns how differences in data availability across

companies may influence the results. Companies with extensive publicly available

information, and well-structured websites are more likely to be represented by infor-

mative summaries. In contrast, smaller firms or companies operating in less digitally

4


1. Introduction

mature contexts may have limited online information. As the work is based on pub-

lic information, the models may systematically favour companies with more data,

not necessarily because they are more relevant, but because their representations

are more complete. This introduces a form of representation bias, where visibility

and data richness influence buyer recommendations. Consequently, some potentially

relevant buyers or portfolio companies may be overlooked. While this limitation is

largely driven by data availability rather than model design, it shows the importance

of interpreting model outputs with caution and complementing automated recom-

mendations with human judgement, particularly in high-stakes decision-making sit-

uations such as mergers and acquisitions.

1.5 AI Declaration

AI tools (e.g., ChatGPT, Grammarly) were used for language support such as gram-

mar correction, and rephrasing. All content, analysis, and conclusions are the au-

thors own.

5


1. Introduction

6


2
Theory

This chapter establishes the theoretical foundation for the embedding and similar-

ity methods employed in this study. It covers semantic textual similarity and the

algorithms used to measure it, ranging from classical approaches like TF-IDF to

modern Transformer-based models, concluding with cosine similarity as the metric

for comparing vectorized representations.

2.1 Term Frequency-Inverse Document Frequency

A challenge in representing textual data for machine learning tasks is how to Trans-

form the unstructured text into features that capture the importance of the words.

One of the most widely used methods in information retrieval is TF-IDF, origi-

nally formalized Salton and Buckley (1988). The method addresses the challenge of

quantifying the importance of words in a document relative to a larger collection of

documents (a corpus). By combining local and global weighting, TF-IDF captures

not only how often a term appears within a document, but also how distinctive it

is across the corpus.

The term frequency component measures how often a term t occurs in a given

document as shown in equation 2.1. To avoid bias toward longer documents, the

frequency is normalized by the total number of terms in that document.

TF (t, d) = ft,d∑
t′∈d ft,d

(2.1)

7


2. Theory

Here ft,d is the count of the term t in document d. This ensures that TF represents

the relative importance of a word within the document itself. The inverse document

frequency component adjusts for the fact that certain words are common across the

entire corpus, and therefore carry limited discriminative power. This is described in

equation 2.2.

IDF (t, D) = log

(
N

1 + |{d ∈ D : t ∈ d}|

)
(2.2)

Here N is the total number of documents in the corpus D, and the denominator

counts how many documents contain the term t. The logarithm serves to dampen

the effect of very frequent words, while the addition of 1 prevents division by zero.

The TD-IDF score is obtained as the product of these two components TF and

IDF. This weighting scheme highlights terms that are frequent within a document

but rare across the corpus, making them useful for distinguishing that document

from others. Common words such as “the”, “or”, and “will” therefore receive low

weights, while domain-specific or distinctive terms receive higher values.

Although TF-IDF is defined at the term level the implementation of Salton and

Buckley (1988) is primarily used to construct vector representations of entire docu-

ments. After computing the TF-IDF weight for every term t in the vocabulary, each

document d is represented as a vector as shown in Equation 2.3.

vd = (w1,d, w2,d, . . . , wV,d), (2.3)

In Equation 2.3 where V is the size of the vocabulary and wi,d is the TF-IDF value

of term i in document d. Terms that do not appear in a document receive a weight

of zero, resulting in a high-dimensional but sparse vector. These vectors provide a

simple yet effective representation of documents and form the basis for tasks such

as similarity measurement, clustering, and document ranking.

8


2. Theory

2.2 Word2Vec

An important contribution to natural language processing (NLP) was made by

Mikolov et al. (2013) in their work ´´Efficient Estimation of Word Representations

in Vector Space”, which introduced the Word2Vec framework. It enables words

to be encoded as dense, continuous-valued vectors in a way that they capture se-

mantic meaning. The core idea is to learn these vector representations through a

simple neural network trained on a large corpus of text. Instead of representing

words as discrete symbols, Word2Vec encodes them in a continuous vector space

such that words occurring in similar contexts have similar embeddings (Mikolov

et al., 2013). To train the embeddings, the network predicts either a target word

given its surrounding context (the Continuous Bag-of-Words, or CBOW, model) or

the surrounding context words given a target word (the Skip-gram model) (Mikolov

et al., 2013). Both variants share the same underlying architecture of a single hidden

layer neural network with a linear transformation from the one-hot encoded input

to a dense embedding space. Finally a softmax output layer produces a probability

distribution over the vocabulary.

For a skip-gram model, we predict context words N steps away from a given target

word in the sequence. To achieve robust training, context words are sampled with

probability correlated to their distance from the target word Mikolov et al. (2013).

In the standard Word2Vec implementation by Mikolov et al. (2013), each word is

represented by two vectors: a target vector vw and a context vector vc. The likelihood

of observing actual context words given a target word and model parameters θ

across the training corpus is modelled using the dot product vc · vw as shown in

Equation 2.4 where C denotes the set of all context words. During training, the

model alternates between fixing one set of vectors and optimizing the other, iterating

until convergence (Mikolov et al., 2013).

p(c|w; θ) = evc·vw∑
c′∈C evc′ ·vw

(2.4)

9


2. Theory

By maximizing the corpus probability of the context c given the target word w and

by taking the log of that expression the sum as shown 2.5 is attained. Let T denote

the set of all (word, context) pairs extracted from the corpus.

arg max
θ

∑
(w,c)∈T

log p(c|w; θ) =
∑

(w,c)∈T

vc · vw − log
∑
c′∈C

evc′ ·vw

 (2.5)

As shown in 2.5 this optimization requires summarizing over all context words c′

requiring lots of compute for training especially for large context windows. By

introducing negative samples denoted T ′ the task can be modelled as a binary clas-

sification task with a likelihood shown in 2.6.

p(D = 1|w, c; θ) = 1
1 + e−vc·vw

(2.6)

This removes the worst summation step resulting in a simpler training routine. The

objective function to optimize now becomes 2.7

arg max
θ

∑
(w,c)∈T

log 1
1 + e−vc·vw

+
∑

(w,c)∈T ′

log 1
1 + evc·vw

(2.7)

For the Bag of Words model the context is made up of all terms contained in a

symmetric window around the target word where each context word is encoded as

a bag of words vector and the neural net outputs a probability vector via softmax

activation. The model is trained in the same manner as for skip grams using a cross

entropy loss except the context is now the sum of word vectors.

Word2Vec models such as CBOW and Skip-gram provide effective representations

for individual words, whereas plenty of applications require vector representations of

longer textual sequences, such as sentences or documents. An intuitive approach is

to achieve this by averaging the word vectors contained within a text possibly using

a weighted average. Although this method produces a fixed-length representation,

it misses information about word order and context.

10


2. Theory

2.3 Doc2Vec

To address limitations of Word2Vec for sequence embeddings, Doc2Vec was intro-

duced by Le and Mikolov (2014) as an extension of Word2Vec. Each paragraph is

associated with a unique vector pj that is shared across all contexts and sampled

from the same paragraph (Le and Mikolov, 2014). The word vector matrix on the

other hand is shared globally making words retain meaning across different para-

graphs (Le and Mikolov, 2014). There are two common implementations Doc2Vec

corresponding conceptually to Continous Bag Of Words (CBOW) and Skip-gram.

The first implementation called Distributed Memory Model of Paragraph Vectors

(PV-DM) starts by mapping every paragraph to a unique vector, corresponding

to a column in matrix P and each word is represented as a vector, corresponding

to a column in matrix W (Le and Mikolov, 2014). The combination of the word

vectors vW with the paragraph vector pj is achieved through concatenation or av-

eraging and allows the model to capture semantic and contextual dependencies (Le

and Mikolov, 2014). After training, the learned paragraph vectors can be directly

used as features for downstream machine learning tasks. The PV-DM model offers

several advantages over bag of words representation as it captures semantic relation-

ships between words and accounts for local word order, similar to high-order n-gram

models but without resulting in high-dimensional, sparse representations (Le and

Mikolov, 2014).

As opposed to the PV-DM model, that combines the paragraph vector with word

vectors as input to predict the next word, the Distributed Bag of Words model

(DBOW) simplifies the problem by using only the paragraph vector as input to

predict randomly sampled context words (Le and Mikolov, 2014). This approach

is conceptually similar to the skip-gram model, where the task is to predict words

appearing within a given context window. During training, a text window is sampled

from the paragraph, and a random word within that window is used as the target

word. The model then performs a classification task to predict this word based only

11


2. Theory

on the paragraph vector (Le and Mikolov, 2014). This design not only simplifies

training but also reduces storage requirements as only the softmax weights has to

be stored rather than both the softmax weights and word vectors as in PV-DM (Le

and Mikolov, 2014).

2.4 SIF embeddings

To address the limitations inherent in standard averaging of word vectors, Arora

et al. (2017) proposed the Smooth Inverse Frequency (SIF) method. The method

first uses a Word2Vec model to build embeddings vw for each word w. The word

level embeddings are combined for each sentence s ∈ S using a weighted average

as described in Equation 2.8. sentence is defined as any sequence of tokens and

can also represent longer documents. Note that p(w) is the empirical probability of

observing word w in our corpus and a is a hyper-parameter to be set.

vs = 1
|s|

∑
w∈s

a

a + p(w)vw (2.8)

Recall that in the training objective of Word2Vec frequent words are down-weighted

because they appear in many contexts and are assumed to represent limited infor-

mation. The SIF method achieves a similar effect through its weighting function.

When p(w) is large, corresponding to highly frequent words, the weigh as described

in Equation 2.8 becomes smaller (Arora et al., 2017). Conversely, when p(w) is

small, representing rare words, the weight approaches 1 (Arora et al., 2017). The

name “smooth inverse frequency” describes this functional behaviour as the weight-

ing is approximately proportional to 1
p(w) for frequent words but converges smoothly

to 1 for rare words instead of exploding.

The final stage of the SIF embedding method is a normalization that involves the

removal of the dominant component shared across sentence embeddings (Arora et al.,

2017). After constructing the sentence embedding matrix X from the set of sentence

embeddings vs : s ∈ S, the first singular vector u is computed via singular value

12


2. Theory

decomposition and normalized to unit length. Each normalized sentence vector ṽs

is then obtained by subtracting from vs its projection onto the unit-normalized u,

as shown in Equation 2.9.

ṽs = vs − (u · vs) u (2.9)

This final operation serves as a normalization by filtering out the most common vari-

ance amongst the sentence embeddings. Arora et al. (2017) demonstrate through

empirical analysis that the leading singular vector u mainly captures common func-

tion words, rather than meaningful semantic information. Words exhibiting the

highest cosine similarity to u in their study include typical stop words such as “but”,

“when”, and “even”. This adjustment ensures that the final representations better

reflect the actual semantic content of sentences rather than generic grammatical

structures.

2.5 Transformers

The Transformer architecture, introduced by Vaswani et al. (2017) in the paper

“Attention Is All You Need”, marked a shift in natural language processing (NLP)

by replacing recurrent and convolutional structures with a fully attention-based

mechanism. The key innovation of the Transformer is the self-attention mechanism,

which computes contextual relationships between all tokens in parallel. It allows

the model to weigh each token’s importance relative to others, capturing long-range

dependencies regardless of their position in the sequence. As shown in the multi-

head attention block of Figure 2.1, this enables the Transformer to attend to the

most relevant parts of the input and build richer contextual representations. Self-

attention is computed according to Equation 2.10 (Vaswani et al., 2017).

Attention(Q, K, V ) = softmax
(

QKT

√
dk

)
V (2.10)

Within each head of the multi-head attention block of Figure 2.1, the embedding

13


2. Theory

of each input token including its positional encoding, is projected into three vector

spaces using the weight matrices WQ, WK and WV . These weight matrices are pa-

rameters learned during training, specific to each attention head. These projected

vectors will be called Query (Q), Value (V) and Key (K). The Q and K vectors are

used to calculate how strongly each token should attend to every other token in the

sequence. This is done by taking the dot product QKT , which produces a matrix

of attention scores and essentially measures the contextual relationship between all

pairs of tokens. To prevent these scores from becoming too large when the dimen-

sionality of the K vectors dk is high, the result is scaled by
√

(dk) (Vaswani et al.,

2017). The softmax function is then applied to each row of this matrix to convert

the raw scores into normalized attention weights that sum to one. Finally, these

weights are used to compute a weighted sum over the corresponding V vector, giv-

ing a new representation that integrates information from the entire sequence. By

combining multiple attention mechanisms in parallel, the Transformer can learn to

focus on different aspects of the sentence structure simultaneously (Vaswani et al.,

2017). Additional residual connections and layer normalization are applied around

each attention feed-forward block to stabilize optimization and maintain gradient

flow during training.

In the full Transformer architecture, the decoder (shown on the right-hand side of

Figure 2.1) mirrors the encoder’s structure but introduces two key modifications.

First, it includes a masked multi-head self-attention mechanism that ensures the

model can only attend to previous positions in the output sequence (Vaswani et al.,

2017), preserving the autoregressive nature of generation. Second, a cross-attention

layer is inserted between the self-attention and feed-forward sublayers. This layer

allows the decoder to attend to the encoder’s outputs, effectively connecting the

encoded source representations with the tokens being generated. The combination

of masked self-attention and cross-attention enables the decoder to generate output

sequences while conditioning on the full encoded representation of the input.

14


2. Theory

Figure 2.1: Overview of the Transformer architecture (Vaswani et al., 2017), with

the encoder part on the left and the decoder part on the right

In the standard training setup, Transformer models are typically first pretrained

using self-supervised learning on large unlabeled text corpora, where the objective

is autoregressive next-token prediction (Kalyan et al., 2021) . During this phase,

the decoder learns to predict the next token given all previously observed tokens,

enforced through masked self-attention. The model is then commonly fine-tuned in

a supervised manner on smaller, task-specific datasets. In the original Transformer

architecture for machine translation, next-token prediction in the decoder is addi-

tionally conditioned on the encoder’s representation of the source sequence (Vaswani

et al., 2017).

2.5.1 Sentence Embeddings

The contextual nature of Transformer representations provides the foundation for

obtaining meaningful sentence-level embeddings. Note that a “sentence” in this con-

text can be an arbitrarily long sequence of text, rather than an actual grammatical

15


2. Theory

sentence. Devlin et al. (2018) proposed Bidirectional Encoder Representations from

Transformers (BERT), a model that is pre-trained by masking tokens from an un-

labelled input and finally fine-tuned on labelled data for downstream tasks. The

Bi-directionality comes from the fact that unlike the sequential pre-training of GPT

and RNNs, BERT may use both left and right context of the masked tokens for

prediction (Devlin et al., 2018). The input sequence expects a [CLS] token, that

represents the entire sequence at the start and a [SEP] token at the end of each

sequence. The last hidden state for this token can be used as a sequence represen-

tation for classification tasks. Further BERT expects 512 tokens as input so if the

sequence exceeds this it must be truncated and conversely in the case it falls of that

the sequence is filled with [SEP] tokens to ensure a constant sequence length (Devlin

et al., 2018).

While BERT was a big improvement in NLP, it was designed and trained for clas-

sification tasks rather than semantic similarity. To address this gap, Reimers and

Gurevych (2019) introduced Sentence-BERT (SBERT), modifying the BERT ar-

chitecture using siamese and triplet network structures to generate fixed-length

sentence embeddings suited for similarity comparisons. SBERT can be used in

three ways to get sentence embeddings from the Transformer output; using the

CLS token representation, computing the mean across all output vectors, or apply-

ing max-pooling across the output vectors (Reimers and Gurevych, 2019). Among

these approaches, mean pooling has emerged as the most widely adopted strategy

in practice (Reimers and Gurevych, 2019). The improvement of SBERT is not only

in its architecture and pooling, but more so in its fine-tuning methodology. The

SBERT fine-tuning trains the model on sentence pairs with known similarity rela-

tionships, enabling it to learn representations where semantically similar sentences

are positioned close together in the embedding space while dissimilar sentences re-

main distant (Reimers and Gurevych, 2019).

16


2. Theory

2.6 Dimensionality Reduction Methods

High-dimensional data often contain redundant or noisy information that can ob-

scure underlying patterns and relationships. Dimensionality reduction techniques

aim to project data from a high-dimensional space into a lower-dimensional repre-

sentation while preserving as much of the original structure as possible.

Dimensionality reduction approaches can be categorized into linear and non-linear

methods. Linear techniques, such as Principal Component Analysis (PCA), assume

that the data lie approximately on a linear subspace of the original feature space and

identify directions that capture the most variance. Non-linear methods, including

Uniform Manifold Approximation and Projection (UMAP), relax this assumption

and instead focus on preserving local and global relationships between data points on

a curved manifold. In the following subsections, both PCA and UMAP are described

in greater detail, including their mathematical formulations and key intuitions.

2.6.1 PCA

Principal Component Analysis (PCA) is a foundational method for dimensional-

ity reduction that seeks to represent high dimensional data in a more compact form

while preserving as much of its original structure as possible. It works by identifying

orthogonal directions in the data, known as principal components, that successively

capture the greatest possible variance revealing the directions of maximal informa-

tion content in the dataset (Shlens, 2014). This reduction is achieved by computing

the covariance matrix of the data and extracting its eigenvectors and eigenvalues,

where the eigenvectors form the orthogonal axes of the new feature space and the

eigenvalues quantify the amount of variance each axis accounts for in the dataset

(Shlens, 2014).

When applied to vector embeddings, PCA acts as a projection method that com-

presses high dimensional representations by removing less informative components.

Each embedding vector xi is projected onto the subspace spanned by the top k

17


2. Theory

principal components, producing a reduced representation zi = W T
k xi, where the

columns of Wk consists of the eigenvectors associated with the k largest eigenvalues

of the covariance matrix (Ringnér, 2008). The fraction of the dataset’s total variance

preserved through this transformation, the explained variance ratio, is calculated as

the sum of these top k eigenvalues divided by the total sum of all eigenvalues.

This measure guides the choice of dimensionality k, providing a balance between

maintaining the essential informational structure of the embeddings and improving

computational efficiency (Ringnér, 2008).

2.6.2 UMAP

Uniform Manifold Approximation and Projection (UMAP) is a non-linear dimen-

sionality reduction method introduced by McInnes et al. (2020). The algorithm aims

to produce a low-dimensional representation that preserves both local and broader

structural relationships present in the original high-dimensional space.

UMAP begins by constructing a weighted k-nearest neighbour graph. For each data

point xi, it defines a local connectivity radius ρi as the smallest non-zero distance

to any of its neighbours, see Equation 2.11.

ρi = min{d(xi, xj) | 1 ≤ j ≤ k, d(xi, xj) > 0}. (2.11)

Distances are then converted into membership strengths that quantify how strongly

two points are connected as shown in Equation 2.12.

w((xi, xj)) = exp
(

− max(0, d(xi, xj) − ρi)
σi

)
, (2.12)

σi normalizes the local neighbourhood. These weights form a fuzzy simplicial set

representing the connectivity structure of the data. The intuition is that the closer

two points are in the original space, the higher the probability that they belong to

the same local neighbourhood.

18


2. Theory

UMAP then optimizes a low-dimensional embedding Y = (y1, . . . , yn) that preserves

these relationships. In the embedding space, connectivity is modelled using a smooth

kernel function. Let dY (yi, yj) denote the Euclidean distance between two embedded

points. The low-dimensional relationship is modelled as Equation 2.13.

w̃((yi, yj)) = 1
1 + a dY (yi, yj)2b

(2.13)

with parameters a and b chosen to match the decay of connectivity observed in the

high-dimensional graph. This ensures that nearby embedded points receive high

membership weights, while distant points contribute minimally. The final embed-

ding is obtained by minimizing the cross-entropy between high- and low-dimensional

membership strengths, as seen in Equation 2.14

L =
∑
i ̸=j

[
w((xi, xj)) log w̃((yi, yj)) + (1 − w((xi, xj))) log(1 − w̃((yi, yj)))

]
. (2.14)

Through this optimization, strongly connected points remain close in the embedding,

while weakly connected points are pushed apart. This makes UMAP particularly

suitable for visualizing complex datasets such as text embeddings, where both global

clustering patterns and fine-grained neighbourhood relationships provide insights.

2.7 Similarity search

Similarity search refers to the process of identifying objects that are the most alike

according to a defined measure of proximity in a vector space. In the context of tex-

tual data embeddings are usually expressed as vectors in a high-dimensional space.

The task of similarity search in this domain is thus to determine which vectors are

closest to each other, which in turn indicates semantic or contextual resemblance.

An important part of similarity search is choosing how to measure the closeness

19


2. Theory

between two vectors. One of the most common measures used for comparing text

embeddings is cosine similarity. It measures the cosine of the angle between two

vectors xi and xj as shown in 2.15, showing how similar their directions are in the

vector space. In Equation 2.15, xi · xj is the dot product of the two vectors, and

∥xi∥ and ∥xj∥ are their Euclidean norms.

sim(xi, xj) = xi · xj

∥xi∥ ∥xj∥
, (2.15)

The similarity value ranges from −1 to 1, where 1 indicates perfect alignment (iden-

tical direction in the embedding space), 0 orthogonality (no similarity), and −1

opposite orientation.

Cosine similarity is particularly well suited for textual representations because it

focuses on the orientation of the vectors rather than their magnitude. This property

ensures that two documents or descriptions with similar patterns of term importance

or semantic meaning are considered close, even if they differ in length or scale.

2.8 Large Language Models for Summarization

Large Language Models (LLMs) are a class of large scale models that builds on

the Transformer architecture (Vaswani et al., 2017). While models such as BERT

(Devlin et al., 2018) use an encoder-only architecture for deep contextual under-

standing, decoder-only models are optimized for next-token prediction in an autore-

gressive manner. This design, introduced in models like GPT (Radford et al., 2018),

generates text sequentially by conditioning each token on all preceding ones (Rad-

ford et al., 2018). Through extensive pre-training on large unlabelled text corpora

where the model learns to predict the next token, these models gain deep linguistic

knowledge. Decoder based LLMs thus form the foundation of chatbots, capable of

maintaining context and producing fluent responses to prompts.

When generating text, a decoder only LLM computes a probability distribution over

20


2. Theory

its vocabulary for each token position. The next output token is selected using a

sampling strategy, such as greedy sampling (choosing the most probable token) or

stochastic sampling, which introduces controlled randomness. The randomness of

sampling is typically regulated using a temperature parameter before applying the

softmax function. Formally, for a token i with output logit zi , the probability Pi

under temperature T ∈ [0, 1] is given by the softmax Equation 2.16.

Pi = exp(zi/T )∑
j exp(zj/T ) (2.16)

Lower temperatures (approaching zero) produce more deterministic and focused out-

puts, preferring tokens with higher predicted probabilities, whereas higher temper-

atures increase the likelihood of sampling less probable tokens, increasing diversity,

and creative text generation (Holtzman et al., 2019).

Because LLM text generation involves stochastic sampling, even identical prompts

can produce slightly different outputs. This variability can affect reproducibility,

which is important to consider in the embedding analysis. Controlling decoding pa-

rameters and standardizing prompts helps reduce this effect. The temperature pa-

rameter represents a trade-off where lower values improve reproducibility by favoring

high probability tokens, while higher values allow more diverse outputs. Choosing

an appropriate temperature balances reliable results with maintaining the model’s

generative performance.

21


2. Theory

22


3
Method

This chapter describes the methodology for developing and evaluating the buyer

target recommendation system. It covers the data collection and preprocessing

steps, followed by the implementation and analysis of embedding models. Building

upon these embeddings, the similarity based retrieval system was then constructed

and evaluated through sampling of target companies.

3.1 Data

The primary data used in this study consisted of textual descriptions of compa-

nies rather than numerical or financial data. All data were collected from company

websites to ensure high relevance and consistency. Websites generally contained

comprehensive and current descriptions of a company’s activities giving clues about

the business model. The tricky part was to find the useful information in the web-

page without a too complex logic. Using only textual data was motivated by the

study’s focus on identifying semantic similarities between businesses, where the tex-

tual representation of their activities provides richer descriptive information than

financial indicators. Further, the financial data available through public datasets

did not cover all markets where portfolio companies were present.

The data collection process began with a curated list of financial buyers, from which

all portfolio companies were identified. Because raw website text often was inconsis-

tently formatted with redundant information, a summarization step was introduced

before embedding. An intermediate large language model (LLM) was used to trans-

23


3. Method

form the scraped content into concise and standardized company descriptions. The

standardized summaries form the textual representations used in later stages of the

pipeline, including embedding generation and similarity search.

3.1.1 Scraping

Since no centralized or publicly available dataset of portfolio companies existed, a

web scraping protocol was developed in collaboration with Merge to collect the nec-

essary data. The process began with a curated list of financial buyers provided by

Merge, each associated with a verified company website URL. These websites typ-

ically contain sections that describe the buyer’s portfolio or list of holdings, which

served as the starting point for the data retrieval routine.

The first step of the scraping involved identifying the specific webpage that listed

the buyer’s portfolio companies. This was achieved by searching the homepage of

each buyer for links containing relevant keywords such as “Portfolio or “Holdings.”

When such a link was found, the crawler followed it to access the portfolio section

of the website.

Then, a crawler systematically traversed these portfolio pages to locate subpages

that contained information about individual portfolio companies. From these sub-

pages, all external hyperlinks (href attributes) were extracted. To ensure relevance,

only links pointing to external company domains were retained, while links asso-

ciated with social media platforms (e.g., LinkedIn, Twitter, Facebook) or general

navigation elements were filtered out. The output of this stage was a structured

mapping between each financial buyer and the corresponding list of portfolio com-

pany URLs. When analysing how often each portfolio company link appeared across

portfolios, certain domains reoccurred at unusually high frequencies. These were not

actual portfolio companies but external sources such as news sites or financial data

pages. Because these invalid records usually were among the most frequently oc-

curring domains, the dataset could be improved by systematically inspecting and

removing entries with the highest counts.

24


3. Method

Once the list of portfolio company websites had been established, the next step was

to extract descriptive text for each company. For every company, the raw HTML

content was retrieved from text-bearing elements which contain the main written

material on a webpage. The extraction focused on two key sources of information:

the company’s landing page and its “About us” page, when available. The landing

page was easily identified as the root domain and provided a concise overview of

the company’s offering and positioning. Locating the “About us” page required an

additional search step, as its structure and URL varied between companies. To iden-

tify it, the crawler searched for links within the site containing indicative keywords

such as “about”, “who we are”, or “what we do”. When such a link was found, the

corresponding page was scraped and its textual content extracted.

This approach ensured that the collected text captured both the general presentation

of the company and its self-described purpose and activities. The resulting HTML

texts from the landing page and the identified “About us” page were later parsed and

processed into clean textual representations for the summarization and embedding

steps described in subsequent sections.

3.1.2 Summarization via LLM

After having retrieved the full HTML content from the selected webpages using

BeautifulSoup, the textual material was extracted through the text attribute of

the parsed HTML object. This unstructured text served as input to a large language

model (LLM), which was used to generate standardized and coherent summaries

suitable for embeddings. The purpose of this step was to convert noisy website text

into concise, comparable descriptions that consistently capture the key characteris-

tics of each company.

The summarization was performed using the GPT-4o-mini model, which provided

a 128,000-token context window (OpenAI, 2025). This capacity ensured that both

the landing-page text and, when available, the “About us” section could be included

in their entirety. The API was called using the OpenAI library for python with the

25


3. Method

provided API key to first create a client. Then a request could be sent by providing

the model, prompt, and desired temperature as arguments. Several temperature val-

ues were tested during experimentation, and the temperature of 0.5 was ultimately

selected over the default value of 0.7. After testing a range of temperatures from

0-1 the selected temperature built coherent sentences and kept more of the provided

information, reducing the risk of hallucinations.

The summarization prompt used in this study consisted of two components: a struc-

tured instruction block outlining the required tone, content, and formatting, followed

by the raw text extracted from each company’s landing page and, when available,

its “About us” section. The instruction block ensured that all summaries adhered

to a consistent structure and level of detail, while the inclusion of the full extracted

text allowed the model to base its output solely on information explicitly provided

in the source material. The complete prompt is shown below.

The Summarization Prompt

You are an M&A analyst. Your task is to create a company description from

the information given into a concise, neutral and standardized summary.

The style should be factual and objective, write it in free text and not as a list.

Instructions:

- Length: 300 words.

- Tone: neutral, objective and professional.

- Content focus: Industry and business model, Core products or services,

Geographic focus and main markets, Customer segments or end-users.

- Avoid marketing laguage, exaggerations or subjective adjectives.

Do not infer or invent information not explicitly mentioned in the text.

If information is missing, omit it rather then guessing.

Information below:

26


3. Method

{WEB PAGE CONTENT}

3.1.3 Preprocessing and cleaning

As the LLM generated summaries form the primary input to all embedding models,

it was important to ensure that they were both accurate and consistent. Although

the web scraping pipeline retrieved text from the landing pages and the “About us”

sections for most companies, several issues appeared in the raw data that required

additional cleaning.

Some webpages contained faulty or redirected links, which led to empty or unusable

HTML content. In these cases the LLM could not produce a meaningful summary

and instead returned placeholder phrases such as “information not found” or “not

specified.” To identify such cases in a systematic way, the summaries were examined

using a combination of keyword searches and length based filters. An inspection of

the distribution of summary lengths showed that nearly all summaries shorter than

about 86 words were failed generations. These entries typically corresponded to

missing webpage content, non-English source text, or very limited material that the

LLM could not expand into a proper summary. Based on this observation, all sum-

maries below the threshold of 86 were removed. A keyword filter was also applied

to detect low quality summaries even if they were slightly longer. These keywords

or phrases included phrasing like “Not found”, “Not specified”, and words in foreign

languages.

Language inconsistencies created another source of noise. Although the summa-

rization prompt was written in English, pages written entirely in other languages

sometimes resulted in short or partially untranslated summaries. These were iden-

tified through manual language detection checks and removed in order to maintain

a coherent dataset.

After having applied these cleaning steps, including the removal of invalid outputs,

27


3. Method

filtering by length, keyword detection, and exclusion of non English summaries, the

resulting dataset consisted of high quality and comparable company descriptions.

This cleaned corpus serves as the basis for all embeddings. Further preprocessing

of the summaries was done individually for the models. This included tokenization,

lemmatization, and punctuation removal. As the summaries were generated using a

LLM the textual quality was generally very high with few misspellings and special

characters.

3.2 Models

After obtaining the textual descriptions of each portfolio company, the next step

was to generate numerical vector representations that could be used for similarity

analysis. This was done by applying several embedding models through a Python-

based pipeline. A dedicated script was developed to load the summarized company

texts and organize them into a pandas DataFrame. From there, each model was

applied to the corpus to produce a corresponding set of embedding vectors.

Three main types of models were applied: TF-IDF, Doc2Vec models, and several

Transformer-based models. All models followed the same data pipeline and storage

setup to ensure comparability.

3.2.1 TF-IDF

TF-IDF was used to generate vector representations of the standardized company

summaries. The summarized texts were read into Python and processed using the

TfidfVectorizer from the Scikit-learn library. For the TF-IDF representation,

preprocessing was handled directly within the TfidfVectorizer. Instead of using

an external tokenizer, a custom regular-expression-based token pattern was applied.

This pattern ensures that only alphabetic tokens with a minimum length of two

characters was included, effectively filtering out numbers, isolated letters, and other

non-informative fragments.

28


3. Method

The vectorizer was fitted on the full corpus of documents, with the minimum

document-frequency parameter min_df tuned prior to finalizing the configuration.

We evaluated values in the range 4 ≤ min_df ≤ 7, and selected 5 as it provided

an effective balance between vocabulary coverage and noise reduction. Terms ap-

pearing in fewer than five documents were therefore excluded from the vocabulary,

mitigating the impact of extremely rare words that offer limited discriminative value.

The resulting transformation produced a sparse TF–IDF matrix in which each doc-

ument was represented as a weighted vector reflecting both term frequency and

inverse document frequency.

3.2.2 Doc2Vec

Document embeddings were generated using the Doc2Vec implementation from the

Gensim library. Each business summary was first preprocessed through lemmatiza-

tion, where punctuation were removed using spaCy, after which the cleaned tokens

were wrapped into Gensim TaggedDocument objects with unique integer identi-

fiers. The Doc2Vec model implemented a Distributed Bag of Words (dm=0), as

described in 2.3. The model was set to build 100 dimension vectors, a context

window size of 4, a minimum token frequency threshold of 5, and a negative sam-

pling rate of 10. As no numerical target such as an accuracy could be used to set

optimal hyper-parameters we started with recommended values and tweaked them

based on the resulting distribution. The model object was trained for 5 epochs using

train(), a reasonable number for a smaller corpus. After vocabulary construction

with build_vocab(), the model was trained using stochastic gradient descent with

Gensim’s optimized routines. Final document representations were produced using

the model’s infer_vector method, which applies several gradient descent steps to

derive stable embeddings that align with the learned semantic space.

Building on the trained Doc2Vec embeddings, two additional procedures were im-

plemented to enhance and further analyze the resulting document representations.

First, Smooth Inverse Frequency (SIF) embeddings were computed to reduce the in-

fluence of high-frequency, non-informative terms. This was achieved by re-weighting

29


3. Method

token contributions based on inverse frequency and then applying NumPy’s SVD

function to remove the first principal component, which captures the most most

generic information across documents. Second, to investigate the effects of dimen-

sionality reduction on representation quality, Scikit-learn’s PCA function was ap-

plied to the embedding matrix. The number of components retained was set to

preserve 90% of the explained variance, enabling the construction of more compact

vectors while maintaining the core informational content of the original Doc2Vec

representations.

3.2.3 Transformer Models

Transformer-based embeddings were generated using pretrained models from the

SentenceTransformer framework, which includes architectures derived from SBERT

as well as variants influenced by GPT-style embedding designs and more. Three

prospective models were considered to illustrate the range of Transformer-based

embedding approaches: the lightweight “all-MiniLM-L6-v2”, the intermediate “all-

mpnet-base-v2”, and the larger “Qwen/Qwen3-Embedding-0.6B”. All models op-

Model Name Parameters Output Dim

MiniLM-L6-v2 22M 384

all-mpnet-base-v2 110M 768

Qwen3-Embedding-0.6B 600M 1024

Table 3.1: Model specifications for Transformer embedding models

erated directly on raw text inputs and incorporated their own tokenization and

normalization procedures and therefore no additional preprocessing was applied.

Embeddings were obtained using the framework’s encode() method. Among these

considered models, all-mpnet-base-v2 was selected for the primary analysis as it

provided a balance between a good representation and computational cost, and was

suitable for environments without access to GPU resources. This model took about

15 minutes to embed all 10 000 samples.

30


3. Method

3.3 Implementation

The aim of this study was to design a system that generates suggested buyers for a

given target company that was to be sold. When evaluating this system it seemed

reasonable to sample target companies from the dataset of portfolio companies and

then do a similarity search disregarding the target company from the set of portfo-

lio companies. Given a target company the embeddings could be analysed both in

terms of direct similarity search and through buyer suggestion. To perform similarity

search the embedding vectors were queried using cosine similarity against the target

embedding to find top K most similar companies. The system identified companies

to add to the portfolio of the buyer so that they shared similar business models

with companies already in their existing portfolio. This assumes that buyers are

interested in what they have experience with and know works. To implement this

idea for a given target company the system began by computing the pairwise cosine

similarity for the target to every other portfolio company. Then the portfolios were

grouped by their financial buyer (owner) and only the top three portfolio companies

were kept for each buyer. The score for each buyer were then equal to the geometric

average of the cosine similarities over these top three holdings. Using these buyer

scores the system could then suggest buyers for the given target company based

relevant parts of the buyers portfolios.

For a practical implementation the target company was not sampled randomly but

provided outside of the dataset. This meant that the target company needed to

be scraped, summarized, and embedded separately from the previously embedded

portfolio companies. Further, the vector embeddings could be stored in a vector

database so that each run only needed to embed a single summary. This put a

requirement on the models used as the new target summary needed to be embedded

on the same terms as the other companies. For TF-IDF that used corpus word

counts in the IDF term this posed a problem. As this was a very fast model it was

feasible to simply re-embed all summaries including the new one given that all the

summaries can be stored instead of the vectors. Doc2Vec used a neural network to

31


3. Method

build the embeddings and was pre-trained for our entire corpus of summaries so it

was necessary to store the neural network in a pickle file for example to be used for

new summaries. This did however assume that the vocabulary in the summaries

were consistent with new ones as the training did not consider these newly added

summaries. The Transformer model was not fine-tuned and the weights were pre-

trained and imported so this model could be used to embed new target summaries

on the same terms as the old ones.

3.4 Evaluation

The models were evaluated both visually and through similarity-based retrieval.

Visual inspection allowed for studying how companies were positioned relative to one

another in the embedding space, while retrieval tests measured how well each model

surfaced similar companies based on cosine similarity. Together, these methods give

an indication of whether the embeddings capture business-level similarity.

3.4.1 Visualizations

For evaluation by visualization, dimensionality reduction techniques were used to

plot the vectors in 2D. Using UMAP we were able to represent the embedding vec-

tors in 2D to enable scatter plotting. As UMAP works with stochastic initialization

it was given a seed for comparability. The UMAP was used via the UMAP-learn

library taking parameters of the target dimensionality (=2), the random seed, and

some parameters to tweak the resulting distribution.

Taking the full set of companies and plotting them resulted in a very large number

of samples giving an overview of clusters. There was no industry codes associated

with the data so all that could be done at this level was to analyse the distribution

by hovering over samples to check if the sample has reasonable neighbours.

The final visualization examined subsets of companies manually labelled by indus-

try based on their summaries. Five industries were selected to test both similar-

32


3. Method

ity and dissimilarity: ’Insurance’, ’Asset Management’, ’Industrial’, ’Realtors’, and

’Healthcare’. Insurance and Asset Management were chosen for their financial sector

similarities, while Healthcare, Realtors, and Industrial represent materially different

operational domains. This selection allowed investigation of whether embeddings

capture both clear distinctions between dissimilar sectors and the relationships be-

tween related ones.

3.4.2 Similarity Search

To compare companies, similarity scores were computed directly on the embedding

vectors generated by the models. For each company that was evaluated, its embed-

ding was first retrieved and then compared with all embeddings in the buyer dataset.

The comparison was carried out using cosine similarity, which offers a normalized

measure of closeness between vectors and allows for a consistent interpretation of

similarity across different embedding types.

The implementation followed a straightforward procedure. Once the target company

had been embedded using the chosen model such as TF-IDF or a Transformer-based

encoder, its vector representation was compared against every portfolio-company

embedding using cosine similarity. Cosine similarity was calculated pairwise, and

the resulting scores form the basis for ranking the potential portfolio companies.

The companies in the buyer set was sorted according to their similarity scores in

descending order, after which the top-k most similar candidates were returned as

the system’s recommendation.

3.4.3 Expert Evaluation

To evaluate the practical relevance of the buyer recommendations produced by the

models, a structured expert review was conducted. The goal of this step was to

assess whether the buyer recommendations generated by the system align with the

expectations of experienced analysts and to identify failures not captured by numer-

ical metrics comparing the different models presented in the thesis.

33


3. Method

First a sample of portfolio companies were selected to cover a mix of industries and

business models. For each of these companies, the three different models generated

two suggested buyers each. These buyers were identified by first locating the three

most similar portfolio companies within each buyer’s portfolio and then averaging

their cosine similarity scores to form a buyer-level relevance measure. In some cases,

the three underlying portfolio companies contributed evenly to the similarity score,

while in others a single highly similar portfolio company had a disproportionate

influence on the buyer’s ranking.

The motivation behind using the three most similar portfolio companies, instead

of relying only on the single closest match, was to obtain a more stable and rep-

resentative measure of buyer relevance. By considering the three closest portfolio

companies and taking the average similarity score, the measure better reflects the

overall investment profile of the buyer. This method gives a more reliable picture

of what the buyer normally invests in, reduces the effect of outliers, and avoids

placing too much weight on one unusually strong match. There are several possi-

ble approaches to constructing a buyer relevance score, such as using larger sample

groups or applying weighted similarity. In dialogue with Merge this setup was cho-

sen because it offered a clear and balanced way of evaluating buyer interest while

still being practical to work with.

A sample of 71 target companies was included in the evaluation. To ensure a man-

ageable and consistent review process, only the first paragraph of each target com-

pany’s summary was provided, as this was deemed sufficient for forming a clear

understanding of the business. It is important to note that reviewers were aware

that the system gives suggestions purely on textual similarity.

34


3. Method

Experts assigned a relevance score to each suggested buyer using a three-point scale:

• 3 – Highly relevant

• 2 – Relevant

• 1 – Not relevant

The expert rankings offer a qualitative assessment of how well each model identifies

strategically meaningful buyers. By focusing directly on buyer-level suggestions,

rather than individual portfolio companies, the evaluation captures experts’ judge-

ment of sector alignment and business model fit. These results serve as the primary

qualitative benchmark for comparing the models’ performance.

35


3. Method

36


4
Results

This chapter presents the empirical results of the implemented buyer-matching

pipeline. We report findings from the summarization stage, embedding performance,

dimensionality-reduction visualizations, similarity-search experiments, and the ex-

pert evaluation. The results follow the methodological structure outlined in the

Method chapter.

4.1 Dataset statistics and quality

The final dataset contains a total of 9,492 company summaries after all preprocessing

steps. Each observation corresponds to a cleaned textual summary. Basic descriptive

statistics of token lengths for these summaries are presented in Table 4.1.

Statistic Value

Number of documents 9,492

Average number of tokens 199.36

Standard deviation 22.98

Minimum number of tokens 86

Median number of tokens 202

Maximum number of tokens 287

Table 4.1: Statistics for the company summaries

The summaries vary between 86 and 287 tokens, with most values concentrated

around 200 tokens. These statistics describe the textual characteristics of the dataset

used in the embedding and similarity-search stages.

37


4. Results

4.1.1 Quality of LLM Summaries

The summarization step produced one structured summary per company, resulting

in 9,492 text outputs. The outputs follow a consistent format extracted from the

scraped website content and the examples in Tables 4.2 and 4.3 illustrates the typ-

ical structure of the generated summaries in the dataset.

Example Summary - BSI Software

BSI Software is a European company specializing in customer relationship

management (CRM) and customer experience (CX) solutions. Founded in

1996 in Switzerland, BSI Software offers the BSI Customer Suite, a modular

platform that integrates artificial intelligence to enhance customer engage-

ment, data insights, and relationship management. The company focuses on

industries such as banking, insurance, retail, and energy, providing tailored so-

lutions that comply with regulatory standards and meet specific market needs.

The BSI Customer Suite is designed to facilitate seamless customer interac-

tions across sales, marketing, and service channels, ensuring data protection

and digital sovereignty for its users. The platform includes features such as

customer data management, enterprise integration, and AI-driven analytics,

allowing businesses to derive actionable insights from their customer data.

BSI Software emphasizes flexibility and scalability, enabling clients to adapt

the platform to their evolving requirements. Geographically, BSI Software

operates primarily in the DACH region (Germany, Austria, and Switzerland)

and Italy, serving a diverse range of customer segments from large corpo-

rations to smaller enterprises. The company prioritizes customer-centricity

and collaboration, fostering a networked approach to project management

without traditional hierarchies. BSI Software’s commitment to quality and

precision is reflected in its extensive industry expertise and its focus on long-

term partnerships with clients.

Table 4.2: Example of a generated company summary for BSI Software

38


4. Results

Example Summary - Apotea

Apotea is an online pharmacy operating in Sweden, specializing in the sale of

pharmaceutical products, health and beauty items, and various wellness so-

lutions. The company offers a wide range of products, including prescription

medications, over-the-counter drugs, dietary supplements, and personal care

items. Apotea’s business model is centered around e-commerce, providing

customers with the convenience of shopping for health-related products from

home, with options for fast delivery and free shipping. The geographic focus

of Apotea is primarily within Sweden, serving customers nationwide. The

company caters to diverse customer segments, including individuals seeking

health and wellness products for themselves and their families, as well as pet

owners looking for veterinary medications and supplies. Apotea also provides

professional advice through its licensed pharmacists, ensuring customers re-

ceive guidance on their purchases and health inquiries. With an extensive

inventory that includes over 50,000 quality-checked products, Apotea posi-

tions itself as one of the largest online pharmacies in Sweden. The product

categories range from allergy relief and skincare to nutritional supplements

and household items. The company emphasizes customer service, offering

support via email, phone, and chat, and aims to meet the needs of various

consumer demographics, including those with chronic health conditions and

specific wellness requirements.

Table 4.3: Example of a generated company summary for Apotea.

The summary in Table 4.2 provides a clear overview of BSI Software’s core business

areas by identifying CRM and CX solutions as its primary focus with the main

offering being the BSI Customer Suite. It also states the company’s focus toward

regulated service industries which helps situate its target markets. Geographically,

the summary specifies that the firm was founded in Switzerland and operates in

Europe. In terms of business model, the description implies a modular, AI-enhanced

software platform that allows businesses to derive actionable insights from their

39


4. Results

customer data. As a result everything desired seems to be mentioned but not at a

detailed level.

The summary in Table 4.3 provides a clear overview of Apotea’s core operations

by identifying its role as a Swedish online pharmacy with an extensive assortment

of pharmaceutical, health, and wellness products. It highlights e-commerce as the

central business model, emphasizing convenience, fast delivery options, and broad

product availability as key value propositions. The summary also situates Apotea

geographically by noting its exclusive focus on the Swedish market and its nation-

wide customer base. In terms of customer segments, the description covers both

general consumers seeking health and personal care items as well as pet owners

requiring veterinary products. The mention of licensed pharmacists adds context

to the company’s service offering, suggesting a model that combines digital retail

with professional guidance. Overall, the summary captures the main business areas,

customer focus, and operational model, though it remains high-level rather than

detailing specific logistics or competitive differentiators.

4.2 Models

Table 4.4 reports the approximate runtime for each embedding model when gen-

erating representations for the dataset and half of the dataset to compare scaling.

TF-IDF runs in a under a second in both cases, while the Doc2Vec variants com-

plete in around two minutes for half the dataset and three minutes for the whole.

The MPNet Transformer however requires substantially more time due to its higher

model complexity taking ca 16 minutes to train over the full dataset. The runtime

for TF-IDF and MPNet Transformrer more than doubles for the full dataset, while

for doc2vec the runtime is slightly less than twice the amount. Overall, the results

confirm the expected trade-off between model expressiveness and runtime.

40


4. Results

Runtime (seconds)

Model 5,000 documents 10,000 documents

TF-IDF 0.3 s 0.8 s

Doc2Vec 102.7 s 192.3 s

Doc2Vec PCA 111.4 199.4 s

Doc2Vec SIF 110.3 198.3 s

MPNet Transformer 410.6 990.2 s

Table 4.4: Runtime statistics

4.3 Embedding Visualization

Using UMAP dimensionality reduction the distribution of companies in the embed-

ding space can be plotted in two dimensions.

4.3.1 All Portfolio Companies

The UMAP plots of all the portfolio companies are shown in Figure 4.1. As the

dataset includes no industry codes it is difficult to draw any real conclusions from

this data other than some partial cluster formation. It appears that Transformer

MPNet and TF-IDF embeds more dense clusters whilst the Doc2Vec models appear

more dispersed. TF-IDF seems to have the most outliers and MPNet have some

outliers also.

4.3.2 Subset visualization

In order to show how the models embed information about the business and its in-

dustry a subset of companies within different sectors where selected. The selection

was done manually to find subsets of companies with different and similar business

models. The selection was made as ’Insurance’, ’Asset Management’, ’Industrial’,

’Realtors’, ’Healthcare’. Some of the companies where consciously selected with

vague industries such as Healthcare Insurance, Insurance Brokers, and Asset Man-

41


4. Results

Figure 4.1: UMAP visualization of all companies for the different models

agers with pension insurance operation. The full list of these companies are attached

in Appendix A.

Figure 4.2: UMAP visualization of a subset of companies within 5 industries for

the different models

Figure 4.2 shows the UMAP plots for each embedding method, and a few pat-

terns stand out. The industrial companies are somewhat clustered in all models

except TF-IDF, where the groups blend more horizontally. The industry names

were densely clustered only for Transformer model while in the other appeared more

dispersed. Healthcare names cluster tightly across all embeddings, but for Doc2Vec

one name stands out as it is more focused on products (Spinal Discs) rather than

providing healthcare services or pharmaceuticals. Also both Doc2Vec models embed

42


4. Results

a consumer insurance company close to the healthcare names. The asset manage-

ment cluster is also interesting because most of these companies also do pension

insurance or insurance brokerage, and both the Doc2Vec and Transformer models

pick up on that by placing them near the insurance cluster, whereas TF-IDF does

not capture this relationship as well, placing them closer to the real estate brokers.

Overall the Transformer model gives the most separated clusters in terms of the

given subset. In Figure 4.3 the subset overlays the plot of all companies to show

Figure 4.3: UMAP visualization of all companies in the embedding space, with

selected companies from five industries highlighted

how the labelled data points conforms to the total structure of the dataset. This

gives some context about the cluster formation. The TF-IDF appear to have some

clustering but also plenty of outliers where the insurance names appear together in

one of these outlier clusters. Both Doc2Vec models seems to place the industrial

names in a large dispersed cluster whilst the healthcare names appear in two more

dense clusters. The two are very similar where the Real Estate brokers are slightly

more separated for SIF and also there seems to be slightly more cluster formation.

4.4 Similarity Search Experiments

This section presents the results of the similarity search experiments, evaluating how

effectively each embedding model retrieves relevant companies for a given set of test

queries.

43


4. Results

4.4.1 Similarity Score Distributions

To better understand how each embedding model represents companies in the vector

space, the distribution of cosine similarity scores between all pairs of companies are

shown in Table 4.5. These distributions provide insights into how densely or sparsely

the models cluster the representations, which in turn influences how sensitive each

model is when identifying relevant buyers.

Model Min Max Mean Median

TF–IDF 0.000 0.940 0.127 0.120

Doc2Vec PCA -0.76685 0.94113 -0.00016 -0.00926

Doc2Vec SIF -0.652 0.911 -0.00012 -0.00766

MPNet -0.139 0.994 0.251 0.239

Table 4.5: Statistics of cosine similarity distributions for each embedding model

Figure 4.4: Distribution of similarity scores

For the TF-IDF model, the similarity scores are concentrated around low values

44


4. Results

with a long tail toward higher similarities. This behaviour is expected, as TF-IDF

produces high-dimensional and sparse vectors where most company pairs share few

terms. As a result, only companies with strongly overlapping vocabulary achieve

high similarity scores, while the majority remain close to zero.

The SIF-Doc2Vec model shows a distribution centered around zero. Unlike TF-IDF,

which only contains non-negative values and therefore produces mostly positive sim-

ilarities, Doc2Vec vectors contain both positive and negative components. After

applying SIF (removal of the first singular vector), the embeddings become more

isotropic, further pushing cosine similarities toward a normal distribution centered

around 0. This results in high contrast between similar and dissimilar companies,

but also means that random pairs will have similarity close to zero.

In contrast, the Transformer-based embeddings (SBERT) show a distribution skewed

toward higher similarity values compared to TF-IDF. These models place seman-

tically related companies closer in the embedding space, even when textual de-

scriptions do not share explicit vocabulary. The transformer embeddings therefore

produce higher baseline similarity scores.

4.4.2 Example of target similarity suggestions

The tables below illustrate an example of the top-k retrieved companies for a selected

target, note that the target in this case is a sampled portfolio company. For the

chosen target, cosine similarity scores are computed against all portfolio companies

in the dataset, and each model returns its highest-ranked matches based on these

scores. A larger set of top k results for target companies can be found in Appendix A.

The purpose of the example is to provide a qualitative impression of how the models

behave in practice. By examining one representative case, it becomes possible to

observe the types of semantic or textual patterns that lead to high similarity scores

under TF-IDF, Doc2Vec SIF, and the Transformer-based embeddings. Below is a

short description of the sampled target company, Klarna, followed by the top-2 sug-

45


4. Results

gestions produced by each model.

Klarna summary

Klarna is a financial technology company specializing in payment solutions for both

consumers and businesses. Operating primarily within the e-commerce sector, Klarna

provides a variety of payment options designed to enhance the online shopping expe-

rience. Its core offerings include immediate payments, deferred payment solutions,

and installment plans, enabling consumers to manage purchases according to their

financial preferences.

TFIDF Summary Similarity

Paysafe is a global payment solutions provider that offers a range of ser-

vices designed to facilitate online transactions for businesses and con-

sumers. The company operates within the financial technology industry,

focusing on payment processing, digital wallets, and online cash solu-

tions. Its core offerings include card processing, eCommerce solutions,

local payment methods, and various digital wallet services such as Skrill,

Neteller, and PaysafeCard.

0.518

Qliro is a financial technology company that operates in the payments

and savings industry, providing a platform designed to facilitate both

online and in-store transactions. The company offers a range of pay-

ment solutions, allowing customers to choose their preferred payment

methods, including options for immediate payment or flexible payment

plans. The Qliro app serves as a comprehensive tool for users to manage

their payments and finances, enabling them to track invoices, schedule

payments, and communicate with customer support.

0.495

Table 4.6: TFIDF Model Matches for Klarna

46


4. Results

The results of TF-IDF top two similarity search is presented in table 4.6. TF-IDF

appears to focus on the transaction side of the Klarna business picking two com-

panies providing payment solutions. Both offer a checkout solution for e-commerce

which is a big part of the Klarna business. Only Qliro also offers the deferred pay-

ment service’s making it a more direct competitor to Klarna.

Doc2Vec PCA Summary Similarity

Curve Pay is a digital wallet service that consolidates multiple pay-

ment cards into a single, secure platform, enabling users to manage

their finances more effectively. The service allows customers to switch

between cards even after a purchase, thereby eliminating hidden foreign

exchange fees and enhancing the rewards associated with existing bank

cards. Curve Pay is designed for both online and in-store transactions,

as well as for international spending, offering features such as cashback

on purchases and flexible payment options.

0.784

Freecharge is a financial services and payment technology company

based in India, operating as a subsidiary of Axis Bank. The company

primarily focuses on providing a wide range of payment solutions, includ-

ing mobile and DTH recharges, utility bill payments, and UPI transac-

tions, catering to over 100 million users across the country. Freecharge’s

business model integrates various payment methods, allowing users to

transact using wallets, UPI, net banking, debit cards, and credit cards,

thereby facilitating seamless payments for both online and offline mer-

chants.

0.678

Table 4.7: Doc2Vec PCA Model Matches for Klarna

47


4. Results

The PCA reduced Doc2Vec model produces 2 more diverse financial payment com-

panies presented in table 4.7. Curve pay seems in line with Klarna’s offering of

e-commerce payment solutions but Freeharge appears more focused on general per-

sonal finance.

Doc2Vec SIF Summary Similarity

Curve Pay is a digital wallet service that consolidates multiple pay-

ment cards into a single, secure platform, enabling users to manage

their finances more effectively. The service allows customers to switch

between cards even after a purchase, thereby eliminating hidden foreign

exchange fees and enhancing the rewards associated with existing bank

cards. Curve Pay is designed for both online and in-store transactions,

as well as for international spending, offering features such as cashback

on purchases and flexible payment options.

0.629

Splitit USA Inc. operates in the financial technology sector, specializing

in a buy now, pay later (BNPL) service that allows consumers to split

their purchases into smaller monthly payments using their existing credit

cards. The company’s business model is designed to facilitate flexible

payment options without the need for new loans or credit checks, thereby

avoiding additional interest or fees.

0.612

Table 4.8: Doc2Vec SIF Model Matches for Klarna

The SIF Doc2Vec model results presented in table 4.8 gain presents Curve Pay but

includes a new suggestion. Splitit USA seems to offer buy now pay later services

which is a big part of Klarna’s offering. The focus on providing consumer credit

makes Splitit an interesting suggestion.

48


4. Results

Transformer Summary Similarity

Klar Technologies (Klar) is a regulated financial entity in Mexico, au-

thorized by the Comisión Nacional Bancaria y de Valores (CNBV) and

operating under the Law of Savings and Popular Credit. The company

primarily focuses on providing financial services, including credit cards,

personal loans, and investment accounts. Klar offers a variety of credit

card options, such as the Klar Plus and Klar Platino, which feature ben-

efits like no annual fees, cashback rewards, and flexible payment terms.

0.739

Karbon is a fintech company based in Bengaluru, India, specializing

in foreign remittance services and corporate expense management solu-

tions. The company primarily focuses on facilitating international pay-

ments for businesses, including exporters, importers, freelancers, and

direct-to-consumer (D2C) e-commerce enterprises. Karbon’s service of-

ferings include a prepaid corporate card designed for expense tracking,

an AI Accountant to streamline accounting tasks, and competitive for-

eign exchange remittance solutions.

0.667

Table 4.9: Transformer Model Matches for Klarna

The MPNet Transformer model results presented in table 4.9 shows two new sug-

gestions. The first suggestion appears to have more of a traditional bank offering

of credit cards, loans, and investment accounts. The second option seems to have a

broader financial offering more targeted toward B2B.

Although nearly all suggested peers to Klarna operate in the financial sector, their

business models show notable variation. Klarna’s diversified operations span e-

commerce checkout solutions, consumer payment cards, and micro-lending services,

creating a complex profile for similarity matching. Notably, multiple models re-

turned pairs of suggestions with comparably high cosine similarity scores, indicat-

ing substantial representation of fintech companies within the dataset. However,

Klarna’s major international competitor Affirm does not appear among the retrieved

49


4. Results

peers, as verification confirmed its absence from the portfolio dataset.

4.5 Expert Review Results

In designing the expert evaluation, it was necessary to balance the number of models

assessed with the practical constraints of the reviewers’ time. As mentioned in the

method, each target gave one suggested buyer per model and each of these included

three portfolio companies to be considered. Increasing the number of models would

therefore have reduced the number of target companies possible to review meaning-

fully, and a trade-off was required. The SIF Doc2Vec and PCA Doc2Vec showed

very similar results in the visualization and cosine similarity search. Given the minor

benefits of SIF Doc2Vec model variant this was selected as the third model together

with MPNet Transformer and TF-IDF for the expert evaluation.

Table 4.10 shows the results of the expert evaluation with a total of 71 samples

evaluated by professionals from Merge. Note that each model outputs two examples

and so the total count for each model is therefore 142. The full evaluation set with

scores per target can be found in Appendix B.

Model Count 1 Count 2 Count 3 Average Median

Doc2Vec (SIF) 36 52 54 2.13 2

TF–IDF 16 43 83 2.47 3

Transformer (MPNet) 3 43 96 2.65 3

Table 4.10: Expert ratings for each model. Scores range from 1 (not relevant) to

3 (highly relevant)

50


5
Discussion

The purpose of this thesis was to investigate whether textual embedding models

can support the identification of relevant potential buyers in M&A processes. The

central question was whether different embedding approaches create vector repre-

sentations that meaningfully reflect business similarities between companies, and

whether those representations can be used to recommend buyers whose portfolio

profiles align with a given target. To address this, the study relied on generating a

large dataset of scraped and LLM-generated summaries describing the operations,

value propositions, and geographic footprints of portfolio companies. Across both

quantitative analyses and expert evaluation, the results indicate that the choice of

embedding representation significantly influences the quality and relevance of the

recommendations. In particular, the Transformer model demonstrated the most co-

herent clustering and achieved the strongest alignment with expert judgement.

5.1 Discussion of Results

This chapter interprets and reflects on the results presented in the previous section.

The aim is to evaluate how well the proposed methodology addresses the problem

of buyer-seller matching in M&A, assess the relative performance of the three em-

bedding models, and relate the findings back to the overarching research questions

and practical context at Merge.

51


5. Discussion

5.1.1 Embedding space and Visualizations

A central objective of this thesis was to explore how companies can be embedded in

a vector space in a way that reflects their underlying business characteristics. Since

all embeddings were genereted from LLM-produced summaries of publicly available

website content, the quality and content of this data directly shaped the geometry

of each embedding space. The UMAP visualisations presented in Figure 4.1-4.3

provide an interpretable view of these high dimensional structures and give an in-

dication of how well the different models captures the semantics and content of the

summaries.

Although the models differ in how their embeddings are constructed, they also differ

substantially in dimensionality, which influences the structure of the resulting em-

bedding spaces. TF-IDF produces very high-dimensional sparse vectors (in our case

8000) and Doc2Vec creates dense vectors with substantially lower dimensionality

with a standard length of 100. The MPNet Transformer model lies between these

extremes, generating 768-dimensional dense semantic embeddings. These dimen-

sional differences affect how much variation each model can encode and how tightly

companies can be positioned relative to one another. TF-IDF representations tend

to create separations based on vocabulary patterns, whereas lower-dimensional dense

vectors may compress information and produce smoother boundaries. To visualise

these high-dimensional structures, UMAP was applied to project each embedding

space into two dimensions. UMAP preserves local neighbourhoods while maintain-

ing aspects of global organisation. Although some information is inevitably lost,

meaningful and coherent clusters in the two-dimensional projections indicate that

the original embeddings capture relevant business similarities.

Figure 4.1 shows how the different models embed all the data in the embedding

space, showing clear differences in the structure between the models. Both the TF-

IDF and MPNet transfomer embeddings show more distincitve cluster seperation,

suggesting that these models capture more consistent patterns in how companies

52


5. Discussion

relate to one another. In contrast, both Doc2Vec variants generate more evenly

spread shapes, indicating that their representations are less sharply defined and may

struggle to separate companies clearly. The stronger cluster formation observed for

TF-IDF and especially for the Transformer model suggests that these embeddings

retain more distinctive information.

The subset visualizations in Figure 4.2 provides an example of how well the models

seperate companies within different sectors. In this example, the differences be-

tween the models become clearer. TF-IDF manages to separate the selected sectors

relatively well, with insurance, healthcare, real estate, and asset management form-

ing distinct clusters, while the industrial companies appear spread out on a line.

This pattern can be expected, as industrial firms often span a wide range of activi-

ties and therefore have a wide range of textual information. Both Doc2Vec models

show a similar overall structure, producing clear clusters for real estate, industrial,

and healthcare companies, but asset management overlaps with insurance. Addi-

tionally one insurance company is positioned closer to healthcare. In this example

the MPNet Transformer mo