AI Assisted Matching in Mergers and Acquisitions
A Data-Driven Approach to Identifying Potential Acquirers
Master’s thesis in Data Science and AI
WILHELM JOHNSON SWEGMARK
DIDRIK TVEDT
DEPARTMENT OF MATHEMATICAL SCIENCES
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2026
www.chalmers.se

© WILHELM JOHNSON SWEGMARK, DIDRIK TVEDT 2026.
Supervisor: Johan Jonasson, Department of Mathematical Sciences
Examiner: Johan Jonasson, Department of Mathematical Sciences
Master’s Thesis 2026
Mathematical Sciences
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000
Cover: Typeset in LaTeX, template by Kyriaki Antoniadou-Plytaria
Printed by Chalmers Reproservice
Gothenburg, Sweden 2026

Abstract

Traditional buyer identification in M&A relies on manual screening and professional networks, making it resource-intensive and naturally limiting the buyer pool. This thesis investigates whether textual embedding models can support the identification of relevant potential buyers in mergers and acquisitions. The study examines how different representation methods, including TF-IDF, Doc2Vec with smooth inverse frequency weighting, and Transformer-based models, capture similarity between companies when applied to standardized summaries of portfolio company descriptions.
The summaries are created using a large language model with information provided on the portfolio companies’ websites. The performance of the embedding models is evaluated through visualization of the embedding spaces, cosine similarity search experiments, and an expert review of buyer recommendations. The results indicate that TF-IDF and the Transformer model produced relevant recommendations, with the Transformer model demonstrating the best performance in embedding space separation and alignment with expert judgment, while Doc2Vec models showed weaker differentiation between company types. Overall, the study shows that embedding-based similarity search can serve as a useful first step in buyer discovery by expanding the range of potential buyers considered and improving efficiency. The work also highlights that further validation across a larger set of targets and with a more complete dataset would strengthen confidence in these results.

Keywords: M&A, NLP, LLM, Embeddings, Semantic Similarity

Acknowledgements

We would like to express our sincere gratitude to our Chalmers supervisor, Johan Jonasson, whose expertise, guidance, and feedback have been invaluable throughout the course of this thesis. Your support has helped us navigate challenges, refine our ideas, and ultimately shape the direction of our work. We would also like to extend our appreciation to Merge, the company with which this thesis was carried out. The opportunity to work closely with the company, along with the access to industry knowledge, data and practical perspectives, has contributed greatly to the development and relevance of this project.
Wilhelm Johnson Swegmark and Didrik Tvedt, Gothenburg, December 2025

List of Acronyms

Below is the list of acronyms used throughout this thesis, in alphabetical order:

BERT: Bidirectional Encoder Representations from Transformers
BoW: Bag of Words
LLM: Large Language Model
M&A: Mergers and Acquisitions
NLP: Natural Language Processing
PCA: Principal Component Analysis
SIF: Smooth Inverse Frequency
TF-IDF: Term Frequency-Inverse Document Frequency
UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

Contents

List of Acronyms
List of Figures
List of Tables
1 Introduction
1.1 Aim
1.2 Research Questions
1.3 Limitations
1.4 Ethical aspects
1.5 AI Declaration
2 Theory
2.1 Term Frequency-Inverse Document Frequency
2.2 Word2Vec
2.3 Doc2Vec
2.4 SIF embeddings
2.5 Transformers
2.5.1 Sentence Embeddings
2.6 Dimensionality Reduction Methods
2.6.1 PCA
2.6.2 UMAP
2.7 Similarity search
2.8 Large Language Models for Summarization
3 Method
3.1 Data
3.1.1 Scraping
3.1.2 Summarization via LLM
3.1.3 Preprocessing and cleaning
3.2 Models
3.2.1 TF-IDF
3.2.2 Doc2Vec
3.2.3 Transformer Models
3.3 Implementation
3.4 Evaluation
3.4.1 Visualizations
3.4.2 Similarity Search
3.4.3 Expert Evaluation
4 Results
4.1 Dataset statistics and quality
4.1.1 Quality of LLM Summaries
4.2 Models
4.3 Embedding Visualization
4.3.1 All Portfolio Companies
4.3.2 Subset visualization
4.4 Similarity Search Experiments
4.4.1 Similarity Score Distributions
4.4.2 Example of target similarity suggestions
4.5 Expert Review Results
5 Discussion
5.1 Discussion of Results
5.1.1 Embedding space and Visualizations
5.1.2 Similarity Search and Buyer Recommendation
5.2 Limitations
5.2.1 Impact of LLM Summarization
5.2.2 Limitations in Scraped Data
5.2.3 Limitations in Method
5.3 Recommendations for Future Work
5.4 Conclusion
Bibliography
A Appendix A: Competitor Similarity Results
B Appendix B: Expert Evaluation Summary

List of Figures

2.1 Overview of the Transformer architecture (Vaswani et al., 2017), with the encoder part on the left and the decoder part on the right
4.1 UMAP visualization of all companies for the different models
4.2 UMAP visualization of a subset of companies within 5 industries for the different models
4.3 UMAP visualization of all companies in the embedding space, with selected companies from five industries highlighted
4.4 Distribution of similarity scores

List of Tables

3.1 Model specifications for Transformer embedding models
4.1 Statistics for the company summaries
4.2 Example of a generated company summary for BSI Software
4.3 Example of a generated company summary for Apotea
4.4 Runtime statistics
4.5 Statistics of cosine similarity distributions for each embedding model
4.6 TF-IDF Model Matches for Klarna
4.7 Doc2Vec PCA Model Matches for Klarna
4.8 Doc2Vec SIF Model Matches for Klarna
4.9 Transformer Model Matches for Klarna
4.10 Expert ratings for each model. Scores range from 1 (not relevant) to 3 (highly relevant)
A.1 Top-3 Similar Companies for http://www.partnerre.com (Insurance)
A.2 Top-3 Similar Companies for http://www.maxm.se (Insurance)
A.3 Top-3 Similar Companies for https://www.hedvig.com (Insurance)
A.4 Top-3 Similar Companies for https://www.epicbrokers.com (Insurance)
A.5 Top-3 Similar Companies for https://www.brookfield.com (Asset Management)
A.6 Top-3 Similar Companies for http://cworldwide.com (Asset Management)
A.7 Top-3 Similar Companies for https://www.oaktreesicav.com (Asset Management)
A.8 Top-3 Similar Companies for https://www.spiltanfonder.se (Asset Management)
A.9 Top-3 Similar Companies for https://www.mandatum.fi (Asset Management)
A.10 Top-3 Similar Companies for https://www.soderbergpartners.se (Asset Management)
A.11 Top-3 Similar Companies for https://sjukhus.sophiahemmet.se (Healthcare)
A.12 Top-3 Similar Companies for https://www.landmarkhealth.org (Healthcare)
A.13 Top-3 Similar Companies for http://www.vamed-care.com (Healthcare)
A.14 Top-3 Similar Companies for https://www.highridgemedical.com (Healthcare)
A.15 Top-3 Similar Companies for http://www.reliant-rehab.com (Healthcare)
A.16 Top-3 Similar Companies for https://www.cloverhealth.com (Healthcare)
A.17 Top-3 Similar Companies for https://www.valimmobilier.ch (Real Estate Brokers)
A.18 Top-3 Similar Companies for https://bskimmobilier.com (Real Estate Brokers)
A.19 Top-3 Similar Companies for https://www.renson.fr (Industrial)
A.20 Top-3 Similar Companies for https://azekco.com (Industrial)
A.21 Top-3 Similar Companies for https://www.globalppi.com (Industrial)
A.22 Top-3 Similar Companies for https://www.hubs.com (Industrial)
A.23 Top-3 Similar Companies for https://www.rotomon.fi (Industrial)
B.1 Expert evaluation scores (1/3)
B.2 Expert evaluation scores (2/3)
B.3 Expert evaluation scores (3/3)

1 Introduction

Identifying potential buyers is a critical first step in the mergers and acquisitions (M&A) process, directly influencing deal success and transaction outcomes (Merge, 2025). Brokers, investment banks, and advisory firms traditionally rely on their networks, relationships, curated databases of past transactions, and manual screening to build buyer lists (Merge, 2025). Although this human expertise is invaluable, it is also resource-intensive and prone to biases that limit both scalability and diversity in the buyer lists produced. Consequently, buyer lists may become concentrated around well-known or easily identifiable investors, such as major private equity funds.
This can inadvertently limit the consideration of smaller or less conventional buyers, reducing the breadth of potential opportunities. This creates a clear business need for systems that can complement human expertise by discovering non-trivial, high-potential buyers, reducing manual effort, and broadening the coverage of the buyer universe.

A further challenge lies in the fact that potential buyers differ fundamentally in their motivations and acquisition strategies. An important distinction is between financial and strategic buyers (Merge, 2025). Financial buyers, such as private equity firms, venture capital funds, and investment funds, are typically motivated by return on investment, exit strategies, and growth potential. Their acquisition considerations often emphasise cash flows, profitability, and other financial metrics, and they usually hold assets for a limited period before seeking a profitable exit. Strategic buyers, by contrast, are operating companies that acquire for reasons tied to long-term positioning, such as vertical or horizontal integration, geographic expansion, or access to new technology and capabilities. These buyers may be willing to pay a premium because synergies raise the potential value of the acquisition, and they often place greater emphasis on non-financial considerations like cultural fit, strategic alignment, and competitive positioning. They also tend to integrate acquisitions more directly into their operations, whereas financial buyers may leave management largely in place. Understanding these distinctions is crucial for designing a recommendation system, since the relevance of a buyer depends on its incentives, constraints, and historical behaviour.

Artificial intelligence and machine learning open up opportunities to address these challenges by enabling more data-driven and scalable approaches to buyer identification.
By capturing richer representations of companies, modelling patterns of similarity, and integrating diverse types of information, such methods can reduce the dependence on manual processes. Importantly, these technologies can be designed to support, rather than replace, human expertise and thus help brokers generate more diverse and high-quality recommendations while maintaining interpretability.

This motivates the present study, which explores how AI-driven methods can contribute to a more efficient, data-driven, and insightful process for buyer–seller matching in M&A.

1.1 Aim

The general aim of this thesis is to investigate how AI can be applied to improve the process of buyer identification in mergers and acquisitions. Specifically, the goal is to design a method that generates prioritised and interpretable buyer recommendations for a given target company, thus reducing reliance on manual processes and increasing efficiency in generating leads.

1.2 Research Questions

The research questions for the thesis are stated below:

1. How can we embed companies to best represent the business in operative terms given publicly available data?
   (a) How can dimensionality reduction techniques be used to visualize these embeddings and provide insight into the structure of embedding spaces?
   (b) Which dimensions of a firm’s business model (e.g. product offerings and operational domain) are encoded in these embeddings?
2. Can we use these embeddings to find companies with a similar business model?
   (a) How can we use these embeddings to generate a buyer recommendation for a given target company?
   (b) How does performance differ between different embedding models?

1.3 Limitations

Several limitations influence both the scope and the technical approach taken. First, the study is based on a list of financial buyers provided by Merge, with a strong concentration of Nordic buyers but also including several major global buyers.
This means that the portfolio companies of these buyers may extend internationally. Strategic buyers are not included in this analysis. The primary reason is that no comprehensive or standardized list of potential strategic acquirers exists. Identifying them would require considering virtually every corporation that might acquire a company for strategic reasons. Because of the lack of definable boundaries and the impractical scale of data collection, we chose not to pursue the inclusion of strategic buyers in this study.

On the technical side, the study emphasizes AI methods that can be executed on CPU compute, ensuring that any resulting application can run efficiently in Merge’s environment without the need for GPU resources.

Another limitation concerns the lack of labelled data that could be used to directly supervise and benchmark machine learning models. Instead, evaluation must rely on indirect methods, such as conducting manual assessments together with domain experts. While this allows for valuable qualitative insights, it also introduces subjectivity and reduces the extent to which model performance can be measured using standard quantitative metrics.

1.4 Ethical aspects

Training state-of-the-art Large Language Models (LLMs) can carry a large environmental cost. Strubell et al. (2019) estimate that the total carbon emissions associated with training and tuning a single LLM, including hyperparameter tuning, architecture exploration, and repeated runs, can exceed the equivalent of 284 tons of CO2. This amount is comparable to or greater than the lifetime emissions of multiple passenger vehicles and has likely grown further in recent years. These emissions reflect the electricity consumed by GPUs and data centre infrastructure during extensive experimentation, rather than a single training run alone.
Even though using readily available pretrained models does not require repeating the full training regime, model selection should still consider not only KPIs such as accuracy but also computational and energy demands. While this does not alter how foundation models are trained broadly, a default preference for efficient models helps ensure that environmental cost remains an explicit consideration rather than an afterthought.

A further ethical consideration concerns how differences in data availability across companies may influence the results. Companies with extensive publicly available information and well-structured websites are more likely to be represented by informative summaries. In contrast, smaller firms or companies operating in less digitally mature contexts may have limited online information. As the work is based on public information, the models may systematically favour companies with more data, not necessarily because they are more relevant, but because their representations are more complete. This introduces a form of representation bias, where visibility and data richness influence buyer recommendations. Consequently, some potentially relevant buyers or portfolio companies may be overlooked. While this limitation is largely driven by data availability rather than model design, it shows the importance of interpreting model outputs with caution and complementing automated recommendations with human judgement, particularly in high-stakes decision-making situations such as mergers and acquisitions.

1.5 AI Declaration

AI tools (e.g., ChatGPT, Grammarly) were used for language support such as grammar correction and rephrasing. All content, analysis, and conclusions are the authors’ own.

2 Theory

This chapter establishes the theoretical foundation for the embedding and similarity methods employed in this study. It covers semantic textual similarity and the algorithms used to measure it, ranging from classical approaches like TF-IDF to modern Transformer-based models, concluding with cosine similarity as the metric for comparing vectorized representations.

2.1 Term Frequency-Inverse Document Frequency

A challenge in representing textual data for machine learning tasks is how to transform the unstructured text into features that capture the importance of the words. One of the most widely used methods in information retrieval is TF-IDF, originally formalized by Salton and Buckley (1988). The method addresses the challenge of quantifying the importance of words in a document relative to a larger collection of documents (a corpus). By combining local and global weighting, TF-IDF captures not only how often a term appears within a document, but also how distinctive it is across the corpus.

The term frequency component measures how often a term $t$ occurs in a given document, as shown in Equation 2.1. To avoid bias toward longer documents, the frequency is normalized by the total number of terms in that document.

\[ \mathrm{TF}(t, d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}} \tag{2.1} \]

Here $f_{t,d}$ is the count of the term $t$ in document $d$. This ensures that TF represents the relative importance of a word within the document itself.

The inverse document frequency component adjusts for the fact that certain words are common across the entire corpus, and therefore carry limited discriminative power. This is described in Equation 2.2.

\[ \mathrm{IDF}(t, D) = \log\left( \frac{N}{1 + |\{d \in D : t \in d\}|} \right) \tag{2.2} \]

Here $N$ is the total number of documents in the corpus $D$, and the cardinality in the denominator counts how many documents contain the term $t$. The logarithm serves to dampen the effect of very frequent words, while the addition of 1 prevents division by zero.
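As a minimal sketch of Equations 2.1 and 2.2, the following standard-library Python computes both components on a toy corpus and multiplies them into a per-term weight. The corpus, the whitespace tokenization, and the function names are illustrative assumptions and are not taken from the thesis implementation.

```python
import math
from collections import Counter


def term_frequency(tokens):
    """Equation 2.1: count of each term divided by the document length."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def inverse_document_frequency(corpus):
    """Equation 2.2: log(N / (1 + number of documents containing the term))."""
    n_docs = len(corpus)
    # Count each term once per document, regardless of how often it occurs there.
    doc_freq = Counter(term for tokens in corpus for term in set(tokens))
    return {term: math.log(n_docs / (1 + df)) for term, df in doc_freq.items()}


def tfidf_vectors(corpus):
    """Return the sorted vocabulary and one TF * IDF vector per document."""
    idf = inverse_document_frequency(corpus)
    vocab = sorted(idf)
    vectors = []
    for tokens in corpus:
        tf = term_frequency(tokens)
        # Terms absent from the document get a weight of zero.
        vectors.append([tf.get(term, 0.0) * idf[term] for term in vocab])
    return vocab, vectors


# Hypothetical toy corpus of tokenized company descriptions.
corpus = [
    "payment software for online merchants".split(),
    "insurance software for households".split(),
    "online pharmacy for households".split(),
    "industrial pumps and valves".split(),
]
vocab, vectors = tfidf_vectors(corpus)
for term, weight in zip(vocab, vectors[0]):
    if weight != 0.0:
        print(f"{term}: {weight:.3f}")
```

In this toy corpus the word "for" occurs in three of four documents and receives zero weight, while "payment" occurs in only one and receives the highest weight in the first document. Note that, with Equation 2.2 exactly as written, a term occurring in every document would receive a slightly negative IDF; library implementations often use a differently smoothed variant.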
The TF-IDF score is obtained as the product of these two components, TF and IDF. This weighting scheme highlights terms that are frequent within a document but rare across the corpus, making them useful for distinguishing that document from others. Common words such as “the”, “or”, and “will” therefore receive low weights, while domain-specific or distinctive terms receive higher values.

Although TF-IDF is defined at the term level, it is primarily used, as in Salton and Buckley (1988), to construct vector representations of entire documents. After computing the TF-IDF weight for every term $t$ in the vocabulary, each document $d$ is represented as a vector as shown in Equation 2.3.

\[ v_d = (w_{1,d}, w_{2,d}, \ldots, w_{V,d}) \tag{2.3} \]

In Equation 2.3, $V$ is the size of the vocabulary and $w_{i,d}$ is the TF-IDF value of term $i$ in document $d$. Terms that do not appear in a document receive a weight of zero, resulting in a high-dimensional but sparse vector. These vectors provide a simple yet effective representation of documents and form the basis for tasks such as similarity measurement, clustering, and document ranking.

2.2 Word2Vec

An important contribution to natural language processing (NLP) was made by Mikolov et al. (2013) in their work “Efficient Estimation of Word Representations in Vector Space”, which introduced the Word2Vec framework. It enables words to be encoded as dense, continuous-valued vectors in a way that captures semantic meaning. The core idea is to learn these vector representations through a simple neural network trained on a large corpus of text. Instead of representing words as discrete symbols, Word2Vec encodes them in a continuous vector space such that words occurring in similar contexts have similar embeddings (Mikolov et al., 2013).
To train the embeddings, the network predicts either a target word given its surrounding context (the Continuous Bag-of-Words, or CBOW, model) or the surrounding context words given a target word (the Skip-gram model) (Mikolov et al., 2013). Both variants share the same underlying architecture: a neural network with a single hidden layer that applies a linear transformation from the one-hot encoded input to a dense embedding space. Finally, a softmax output layer produces a probability distribution over the vocabulary.

For a Skip-gram model, we predict context words up to $N$ steps away from a given target word in the sequence. To achieve robust training, context words are sampled with a probability that decreases with their distance from the target word (Mikolov et al., 2013). In the standard Word2Vec implementation by Mikolov et al. (2013), each word is represented by two vectors: a target vector $v_w$ and a context vector $v_c$. The likelihood of observing the actual context words given a target word and model parameters $\theta$ across the training corpus is modelled using the dot product $v_c \cdot v_w$, as shown in Equation 2.4, where $C$ denotes the set of all context words. During training, the model alternates between fixing one set of vectors and optimizing the other, iterating until convergence (Mikolov et al., 2013).

\[ p(c \mid w; \theta) = \frac{e^{v_c \cdot v_w}}{\sum_{c' \in C} e^{v_{c'} \cdot v_w}} \tag{2.4} \]

Maximizing the corpus probability of the contexts $c$ given the target words $w$, and taking the logarithm of that expression, yields the sum shown in Equation 2.5. Let $T$ denote the set of all (word, context) pairs extracted from the corpus.

\[ \arg\max_{\theta} \sum_{(w,c) \in T} \log p(c \mid w; \theta) = \arg\max_{\theta} \sum_{(w,c) \in T} \left( v_c \cdot v_w - \log \sum_{c' \in C} e^{v_{c'} \cdot v_w} \right) \tag{2.5} \]

As shown in Equation 2.5, this optimization requires summing over all context words $c'$, which is computationally expensive, especially when the set of context words $C$ is large. By introducing negative samples, denoted $T'$, the task can be modelled as a binary classification task with the likelihood shown in Equation 2.6.
\[ p(D = 1 \mid w, c; \theta) = \frac{1}{1 + e^{-v_c \cdot v_w}} \tag{2.6} \]

This removes the costly summation over all context words, resulting in a simpler training routine. The objective function to optimize now becomes Equation 2.7.

\[ \arg\max_{\theta} \sum_{(w,c) \in T} \log \frac{1}{1 + e^{-v_c \cdot v_w}} + \sum_{(w,c) \in T'} \log \frac{1}{1 + e^{v_c \cdot v_w}} \tag{2.7} \]

For the CBOW model, the context is made up of all terms contained in a symmetric window around the target word. The context words are encoded jointly as a bag-of-words vector, and the network outputs a probability vector over the vocabulary via a softmax activation. The model is trained in the same manner as Skip-gram, using a cross-entropy loss, except that the context is now represented by the sum of its word vectors.

Word2Vec models such as CBOW and Skip-gram provide effective representations for individual words, but many applications require vector representations of longer textual sequences, such as sentences or documents. An intuitive approach is to average the word vectors contained within a text, possibly using a weighted average. Although this method produces a fixed-length representation, it misses information about word order and context.

2.3 Doc2Vec

To address the limitations of Word2Vec for sequence embeddings, Doc2Vec was introduced by Le and Mikolov (2014) as an extension of Word2Vec. Each paragraph is associated with a unique vector $p_j$ that is shared across all contexts sampled from the same paragraph (Le and Mikolov, 2014). The word vector matrix, on the other hand, is shared globally, so that words retain their meaning across different paragraphs (Le and Mikolov, 2014). There are two common implementations of Doc2Vec, corresponding conceptually to Continuous Bag-of-Words (CBOW) and Skip-gram.

The first implementation, called the Distributed Memory Model of Paragraph Vectors (PV-DM), starts by mapping every paragraph to a unique vector, corresponding to a column in matrix $P$, while each word is represented as a vector, corresponding to a column in matrix $W$ (Le and Mikolov, 2014).
The combination of the word vectors $v_w$ with the paragraph vector $p_j$ is achieved through concatenation or averaging and allows the model to capture semantic and contextual dependencies (Le and Mikolov, 2014). After training, the learned paragraph vectors can be directly used as features for downstream machine learning tasks. The PV-DM model offers several advantages over bag-of-words representations, as it captures semantic relationships between words and accounts for local word order, similar to high-order n-gram models but without resulting in high-dimensional, sparse representations (Le and Mikolov, 2014).

As opposed to the PV-DM model, which combines the paragraph vector with word vectors as input to predict the next word, the Distributed Bag of Words model (DBOW) simplifies the problem by using only the paragraph vector as input to predict randomly sampled context words (Le and Mikolov, 2014). This approach is conceptually similar to the Skip-gram model, where the task is to predict words appearing within a given context window. During training, a text window is sampled from the paragraph, and a random word within that window is used as the target word. The model then performs a classification task to predict this word based only on the paragraph vector (Le and Mikolov, 2014). This design not only simplifies training but also reduces storage requirements, as only the softmax weights have to be stored rather than both the softmax weights and word vectors as in PV-DM (Le and Mikolov, 2014).

2.4 SIF embeddings

To address the limitations inherent in standard averaging of word vectors, Arora et al. (2017) proposed the Smooth Inverse Frequency (SIF) method. The method first uses a Word2Vec model to build embeddings $v_w$ for each word $w$. The word-level embeddings are combined for each sentence $s \in S$ using a weighted average, as described in Equation 2.8. Here, a sentence is defined as any sequence of tokens and can also represent longer documents.
Note that $p(w)$ is the empirical probability of observing word $w$ in our corpus and $a$ is a hyper-parameter to be set.

\[ v_s = \frac{1}{|s|} \sum_{w \in s} \frac{a}{a + p(w)} v_w \tag{2.8} \]

Recall that in the training objective of Word2Vec, frequent words are down-weighted because they appear in many contexts and are assumed to carry limited information. The SIF method achieves a similar effect through its weighting function. When $p(w)$ is large, corresponding to highly frequent words, the weight in Equation 2.8 becomes smaller (Arora et al., 2017). Conversely, when $p(w)$ is small, representing rare words, the weight approaches 1 (Arora et al., 2017). The name “smooth inverse frequency” describes this functional behaviour, as the weighting is approximately proportional to $1/p(w)$ for frequent words but converges smoothly to 1 for rare words instead of exploding.

The final stage of the SIF embedding method is a normalization that involves the removal of the dominant component shared across sentence embeddings (Arora et al., 2017). After constructing the sentence embedding matrix $X$ from the set of sentence embeddings $\{v_s : s \in S\}$, the first singular vector $u$ is computed via singular value decomposition and normalized to unit length. Each normalized sentence vector $\tilde{v}_s$ is then obtained by subtracting from $v_s$ its projection onto the unit-normalized $u$, as shown in Equation 2.9.

\[ \tilde{v}_s = v_s - (u \cdot v_s)\, u \tag{2.9} \]

This final operation serves as a normalization by filtering out the dominant component shared among the sentence embeddings. Arora et al. (2017) demonstrate through empirical analysis that the leading singular vector $u$ mainly captures common function words, rather than meaningful semantic information. Words exhibiting the highest cosine similarity to $u$ in their study include typical stop words such as “but”, “when”, and “even”. This adjustment ensures that the final representations better reflect the actual semantic content of sentences rather than generic grammatical structures.
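The two SIF steps, the weighted average of Equation 2.8 and the common-component removal of Equation 2.9, can be sketched in standard-library Python as follows. The three-dimensional word vectors, the word probabilities, the value of `a`, and all function names are made-up illustrative assumptions; the first singular vector is approximated here by power iteration rather than a full singular value decomposition.

```python
import math


def sif_weight(p_w, a=1e-3):
    """Weight a / (a + p(w)) from Equation 2.8."""
    return a / (a + p_w)


def sif_average(sentence, word_vectors, word_prob, a=1e-3):
    """Equation 2.8: SIF-weighted average of the word vectors in a sentence."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word in sentence:
        weight = sif_weight(word_prob[word], a)
        for i, value in enumerate(word_vectors[word]):
            total[i] += weight * value
    return [value / len(sentence) for value in total]


def dominant_direction(rows, iterations=200):
    """Approximate the first right singular vector of the matrix whose rows
    are sentence embeddings, using power iteration on X^T X."""
    dim = len(rows[0])
    u = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        projections = [sum(r[i] * u[i] for i in range(dim)) for r in rows]
        new = [sum(p * r[i] for p, r in zip(projections, rows)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in new))
        u = [x / norm for x in new]
    return u


def remove_common_component(rows):
    """Equation 2.9: subtract from each row its projection onto u."""
    u = dominant_direction(rows)
    cleaned = []
    for row in rows:
        dot = sum(x * y for x, y in zip(row, u))
        cleaned.append([x - dot * y for x, y in zip(row, u)])
    return cleaned, u


# Hypothetical word vectors and corpus probabilities, for illustration only.
word_vectors = {
    "the": [1.0, 0.1, 0.0], "offers": [0.8, 0.2, 0.1],
    "bank": [0.1, 1.0, 0.0], "payments": [0.0, 0.9, 0.3],
    "clinic": [0.1, 0.0, 1.0], "care": [0.0, 0.2, 0.9],
}
word_prob = {"the": 0.05, "offers": 0.01, "bank": 0.001,
             "payments": 0.0005, "clinic": 0.0008, "care": 0.001}
sentences = [
    ["the", "bank", "offers", "payments"],
    ["the", "clinic", "offers", "care"],
    ["the", "bank", "offers", "care"],
]
embeddings = [sif_average(s, word_vectors, word_prob) for s in sentences]
cleaned, u = remove_common_component(embeddings)
print(cleaned)
```

After the projection step, every cleaned sentence vector is orthogonal to `u`, which is the property Equation 2.9 is meant to enforce.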
2.5 Transformers

The Transformer architecture, introduced by Vaswani et al. (2017) in the paper "Attention Is All You Need", marked a shift in natural language processing (NLP) by replacing recurrent and convolutional structures with a fully attention-based mechanism. The key innovation of the Transformer is the self-attention mechanism, which computes contextual relationships between all tokens in parallel. It allows the model to weigh each token's importance relative to the others, capturing long-range dependencies regardless of their position in the sequence. As shown in the multi-head attention block of Figure 2.1, this enables the Transformer to attend to the most relevant parts of the input and build richer contextual representations. Self-attention is computed according to Equation 2.10 (Vaswani et al., 2017).

\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left( \frac{Q K^{T}}{\sqrt{d_k}} \right) V \tag{2.10} \]

Within each head of the multi-head attention block of Figure 2.1, the embedding of each input token, including its positional encoding, is projected into three vector spaces using the weight matrices $W^Q$, $W^K$ and $W^V$. These weight matrices are parameters learned during training and are specific to each attention head. The projected vectors are called Query (Q), Key (K) and Value (V). The Q and K vectors are used to calculate how strongly each token should attend to every other token in the sequence. This is done by taking the product $Q K^{T}$, which produces a matrix of attention scores and essentially measures the contextual relationship between all pairs of tokens. To prevent these scores from becoming too large when the dimensionality $d_k$ of the K vectors is high, the result is divided by $\sqrt{d_k}$ (Vaswani et al., 2017). The softmax function is then applied to each row of this matrix to convert the raw scores into normalized attention weights that sum to one.
Finally, these weights are used to compute a weighted sum over the corresponding V vectors, giving a new representation that integrates information from the entire sequence. By combining multiple attention mechanisms in parallel, the Transformer can learn to focus on different aspects of the sentence structure simultaneously (Vaswani et al., 2017). In addition, residual connections and layer normalization are applied around each attention and feed-forward sublayer to stabilize optimization and maintain gradient flow during training.

In the full Transformer architecture, the decoder (shown on the right-hand side of Figure 2.1) mirrors the encoder's structure but introduces two key modifications. First, it includes a masked multi-head self-attention mechanism that ensures the model can only attend to previous positions in the output sequence (Vaswani et al., 2017), preserving the autoregressive nature of generation. Second, a cross-attention layer is inserted between the self-attention and feed-forward sublayers. This layer allows the decoder to attend to the encoder's outputs, effectively connecting the encoded source representations with the tokens being generated. The combination of masked self-attention and cross-attention enables the decoder to generate output sequences while conditioning on the full encoded representation of the input.

Figure 2.1: Overview of the Transformer architecture (Vaswani et al., 2017), with the encoder part on the left and the decoder part on the right

In the standard training setup, Transformer models are typically first pretrained using self-supervised learning on large unlabeled text corpora, where a common objective for decoder-based models is autoregressive next-token prediction (Kalyan et al., 2021). During this phase, the decoder learns to predict the next token given all previously observed tokens, enforced through masked self-attention. The model is then commonly fine-tuned in a supervised manner on smaller, task-specific datasets.
In the original Transformer architecture for machine translation, next-token prediction in the decoder is additionally conditioned on the encoder's representation of the source sequence (Vaswani et al., 2017).

2.5.1 Sentence Embeddings

The contextual nature of Transformer representations provides the foundation for obtaining meaningful sentence-level embeddings. Note that a "sentence" in this context can be an arbitrarily long sequence of text rather than an actual grammatical sentence. Devlin et al. (2018) proposed Bidirectional Encoder Representations from Transformers (BERT), a model that is pre-trained by predicting masked tokens in unlabelled input and then fine-tuned on labelled data for downstream tasks. The bidirectionality comes from the fact that, unlike the sequential left-to-right pre-training of GPT and RNNs, BERT can use both the left and the right context of the masked tokens for prediction (Devlin et al., 2018). The input sequence starts with a [CLS] token, which represents the entire sequence, and each sequence ends with a [SEP] token. The last hidden state of the [CLS] token can be used as a sequence representation for classification tasks. Furthermore, BERT accepts at most 512 tokens as input, so longer sequences must be truncated (Devlin et al., 2018). Shorter sequences are filled with padding ([PAD]) tokens when a constant sequence length is required.

While BERT was a major improvement in NLP, it was designed and trained for classification tasks rather than semantic similarity. To address this gap, Reimers and Gurevych (2019) introduced Sentence-BERT (SBERT), which modifies the BERT architecture using siamese and triplet network structures to generate fixed-length sentence embeddings suited for similarity comparisons.
SBERT can be used in three ways to obtain sentence embeddings from the Transformer output: using the [CLS] token representation, computing the mean across all output vectors, or applying max-pooling across the output vectors (Reimers and Gurevych, 2019). Among these approaches, mean pooling has emerged as the most widely adopted strategy in practice (Reimers and Gurevych, 2019). The improvement of SBERT lies not only in its architecture and pooling, but even more in its fine-tuning methodology. The SBERT fine-tuning trains the model on sentence pairs with known similarity relationships, enabling it to learn representations where semantically similar sentences are positioned close together in the embedding space while dissimilar sentences remain distant (Reimers and Gurevych, 2019).

2.6 Dimensionality Reduction Methods

High-dimensional data often contain redundant or noisy information that can obscure underlying patterns and relationships. Dimensionality reduction techniques aim to project data from a high-dimensional space into a lower-dimensional representation while preserving as much of the original structure as possible.

Dimensionality reduction approaches can be categorized into linear and non-linear methods. Linear techniques, such as Principal Component Analysis (PCA), assume that the data lie approximately on a linear subspace of the original feature space and identify directions that capture the most variance. Non-linear methods, including Uniform Manifold Approximation and Projection (UMAP), relax this assumption and instead focus on preserving local and global relationships between data points on a curved manifold. In the following subsections, both PCA and UMAP are described in greater detail, including their mathematical formulations and key intuitions.
2.6.1 PCA

Principal Component Analysis (PCA) is a foundational method for dimensionality reduction that seeks to represent high-dimensional data in a more compact form while preserving as much of its original structure as possible. It works by identifying orthogonal directions in the data, known as principal components, that successively capture the greatest possible variance, revealing the directions of maximal information content in the dataset (Shlens, 2014). This reduction is achieved by computing the covariance matrix of the data and extracting its eigenvectors and eigenvalues, where the eigenvectors form the orthogonal axes of the new feature space and the eigenvalues quantify the amount of variance each axis accounts for in the dataset (Shlens, 2014).

When applied to vector embeddings, PCA acts as a projection method that compresses high-dimensional representations by removing less informative components. Each embedding vector $x_i$ is projected onto the subspace spanned by the top $k$ principal components, producing a reduced representation $z_i = W_k^{T} x_i$, where the columns of $W_k$ consist of the eigenvectors associated with the $k$ largest eigenvalues of the covariance matrix (Ringnér, 2008). The fraction of the dataset's total variance preserved through this transformation, the explained variance ratio, is calculated as the sum of these top $k$ eigenvalues divided by the sum of all eigenvalues. This measure guides the choice of dimensionality $k$, providing a balance between maintaining the essential informational structure of the embeddings and improving computational efficiency (Ringnér, 2008).

2.6.2 UMAP

Uniform Manifold Approximation and Projection (UMAP) is a non-linear dimensionality reduction method introduced by McInnes et al. (2020). The algorithm aims to produce a low-dimensional representation that preserves both local and broader structural relationships present in the original high-dimensional space.
UMAP begins by constructing a weighted k-nearest neighbour graph. For each data point $x_i$, it defines a local connectivity radius $\rho_i$ as the smallest non-zero distance to any of its $k$ nearest neighbours, see Equation 2.11.

\[ \rho_i = \min \{\, d(x_i, x_j) \mid 1 \le j \le k,\ d(x_i, x_j) > 0 \,\} \tag{2.11} \]

Distances are then converted into membership strengths that quantify how strongly two points are connected, as shown in Equation 2.12,

\[ w((x_i, x_j)) = \exp\!\left( - \frac{\max(0,\ d(x_i, x_j) - \rho_i)}{\sigma_i} \right) \tag{2.12} \]

where $\sigma_i$ normalizes the local neighbourhood. These weights form a fuzzy simplicial set representing the connectivity structure of the data. The intuition is that the closer two points are in the original space, the higher the probability that they belong to the same local neighbourhood.

UMAP then optimizes a low-dimensional embedding $Y = (y_1, \dots, y_n)$ that preserves these relationships. In the embedding space, connectivity is modelled using a smooth kernel function. Let $d_Y(y_i, y_j)$ denote the Euclidean distance between two embedded points. The low-dimensional relationship is modelled as in Equation 2.13,

\[ \tilde{w}((y_i, y_j)) = \frac{1}{1 + a \, d_Y(y_i, y_j)^{2b}} \tag{2.13} \]

with parameters $a$ and $b$ chosen to match the decay of connectivity observed in the high-dimensional graph. This ensures that nearby embedded points receive high membership weights, while distant points contribute minimally. The final embedding is obtained by minimizing the cross-entropy between the high- and low-dimensional membership strengths, as seen in Equation 2.14.

\[ L = - \sum_{i \neq j} \Big[ w((x_i, x_j)) \log \tilde{w}((y_i, y_j)) + \big(1 - w((x_i, x_j))\big) \log\big(1 - \tilde{w}((y_i, y_j))\big) \Big] \tag{2.14} \]

Through this optimization, strongly connected points remain close in the embedding, while weakly connected points are pushed apart. This makes UMAP particularly suitable for visualizing complex datasets such as text embeddings, where both global clustering patterns and fine-grained neighbourhood relationships provide insights.
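The quantities in Equations 2.11 to 2.14 can be illustrated with a small NumPy sketch. This is a simplified illustration and not the umap-learn implementation: a single fixed `sigma` is assumed for all points (UMAP itself calibrates $\sigma_i$ per point), the data are assumed to contain no duplicate points, the function names and toy data are hypothetical, and the loss is written with the leading minus sign of a cross-entropy so that lower values indicate a better match.

```python
import numpy as np

def pairwise_distances(P):
    """Euclidean distance between every pair of rows in P."""
    return np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)

def high_dim_weights(X, k=2, sigma=1.0):
    """Directed membership strengths to the k nearest neighbours of each point."""
    dist = pairwise_distances(X)
    W = np.zeros_like(dist)
    for i in range(X.shape[0]):
        neighbours = np.argsort(dist[i])[1:k + 1]  # index 0 is the point itself
        d = dist[i, neighbours]
        rho = d[d > 0].min()                        # Equation 2.11
        W[i, neighbours] = np.exp(-np.maximum(0.0, d - rho) / sigma)  # Equation 2.12
    return W

def low_dim_weights(Y, a=1.0, b=1.0):
    """Smooth kernel on the embedded points (Equation 2.13)."""
    return 1.0 / (1.0 + a * pairwise_distances(Y) ** (2 * b))

def cross_entropy(W, W_low, eps=1e-12):
    """Cross-entropy over all pairs i != j (Equation 2.14)."""
    off_diagonal = ~np.eye(W.shape[0], dtype=bool)
    w = W[off_diagonal]
    v = np.clip(W_low[off_diagonal], eps, 1.0 - eps)
    return float(-np.sum(w * np.log(v) + (1.0 - w) * np.log(1.0 - v)))

# Two small, well-separated groups in 3D (toy data) and a naive 2D layout
X = np.array([[0, 0, 0], [0, 1, 0], [1, 0, 0.5],
              [5, 5, 0], [5, 6, 0], [6, 5, 0.5]], dtype=float)
Y = X[:, :2]
W = high_dim_weights(X, k=2)
loss = cross_entropy(W, low_dim_weights(Y))
```

Because $\rho_i$ is subtracted before the exponential, the nearest neighbour of every point always receives a membership strength of exactly 1, which is what keeps each point connected to at least one other point.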
2.7 Similarity search

Similarity search refers to the process of identifying objects that are the most alike according to a defined measure of proximity in a vector space. In the context of textual data, embeddings are usually expressed as vectors in a high-dimensional space. The task of similarity search in this domain is thus to determine which vectors are closest to each other, which in turn indicates semantic or contextual resemblance.

An important part of similarity search is choosing how to measure the closeness between two vectors. One of the most common measures used for comparing text embeddings is cosine similarity. It measures the cosine of the angle between two vectors $x_i$ and $x_j$, as shown in Equation 2.15, indicating how similar their directions are in the vector space. In Equation 2.15, $x_i \cdot x_j$ is the dot product of the two vectors, and $\lVert x_i \rVert$ and $\lVert x_j \rVert$ are their Euclidean norms.

\[ \mathrm{sim}(x_i, x_j) = \frac{x_i \cdot x_j}{\lVert x_i \rVert \, \lVert x_j \rVert} \tag{2.15} \]

The similarity value ranges from $-1$ to $1$, where $1$ indicates perfect alignment (identical direction in the embedding space), $0$ orthogonality (no similarity), and $-1$ opposite orientation.

Cosine similarity is particularly well suited for textual representations because it focuses on the orientation of the vectors rather than their magnitude. This property ensures that two documents or descriptions with similar patterns of term importance or semantic meaning are considered close, even if they differ in length or scale.

2.8 Large Language Models for Summarization

Large Language Models (LLMs) are a class of large-scale models that build on the Transformer architecture (Vaswani et al., 2017). While models such as BERT (Devlin et al., 2018) use an encoder-only architecture for deep contextual understanding, decoder-only models are optimized for next-token prediction in an autoregressive manner.
This design, introduced in models like GPT, generates text sequentially by conditioning each token on all preceding ones (Radford et al., 2018). Through extensive pre-training on large unlabelled text corpora, where the model learns to predict the next token, these models gain deep linguistic knowledge. Decoder-based LLMs thus form the foundation of chatbots, capable of maintaining context and producing fluent responses to prompts.

When generating text, a decoder-only LLM computes a probability distribution over its vocabulary for each token position. The next output token is selected using a decoding strategy, such as greedy decoding (choosing the most probable token) or stochastic sampling, which introduces controlled randomness. The randomness of sampling is typically regulated by a temperature parameter that rescales the logits before the softmax function is applied. Formally, for a token $i$ with output logit $z_i$, the probability $P_i$ under temperature $T > 0$ is given by the softmax in Equation 2.16. Values of $T$ between 0 and 1 are considered in this work, and the limit $T \to 0$ corresponds to greedy decoding.

\[ P_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} \tag{2.16} \]

Lower temperatures (approaching zero) produce more deterministic and focused outputs, preferring tokens with higher predicted probabilities, whereas higher temperatures increase the likelihood of sampling less probable tokens, which increases diversity and leads to more creative text generation (Holtzman et al., 2019).

Because LLM text generation involves stochastic sampling, even identical prompts can produce slightly different outputs. This variability can affect reproducibility, which is important to consider in the embedding analysis. Controlling decoding parameters and standardizing prompts helps reduce this effect. The temperature parameter represents a trade-off: lower values improve reproducibility by favoring high-probability tokens, while higher values allow more diverse outputs. Choosing an appropriate temperature balances reliable results with maintaining the model's generative performance.
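The effect of the temperature in the softmax can be checked numerically with a short NumPy sketch. The function name and the example logits below are hypothetical and chosen only for illustration; they do not come from any model used in this thesis.

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Token probabilities from logits, rescaled by a temperature T > 0."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()  # shift for numerical stability; the result is unchanged
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 1.0, 0.0]  # hypothetical logits for three candidate tokens
p_low = softmax_with_temperature(logits, temperature=0.2)
p_high = softmax_with_temperature(logits, temperature=1.0)
```

At the lower temperature almost all probability mass moves to the token with the largest logit, while at $T = 1$ the two less likely tokens keep a noticeable share, which is the trade-off between reproducibility and diversity described above.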
3 Method

This chapter describes the methodology for developing and evaluating the buyer-target recommendation system. It covers the data collection and preprocessing steps, followed by the implementation and analysis of the embedding models. Building upon these embeddings, the similarity-based retrieval system was then constructed and evaluated by sampling target companies.

3.1 Data

The primary data used in this study consisted of textual descriptions of companies rather than numerical or financial data. All data were collected from company websites to ensure high relevance and consistency. Websites generally contained comprehensive and current descriptions of a company's activities, giving clues about its business model. The main challenge was to locate the useful information on each webpage without overly complex logic. Using only textual data was motivated by the study's focus on identifying semantic similarities between businesses, where the textual representation of their activities provides richer descriptive information than financial indicators. Further, the financial data available through public datasets did not cover all markets where portfolio companies were present.

The data collection process began with a curated list of financial buyers, from which all portfolio companies were identified. Because raw website text was often inconsistently formatted and contained redundant information, a summarization step was introduced before embedding. An intermediate large language model (LLM) was used to transform the scraped content into concise and standardized company descriptions. The standardized summaries form the textual representations used in later stages of the pipeline, including embedding generation and similarity search.

3.1.1 Scraping

Since no centralized or publicly available dataset of portfolio companies existed, a web scraping protocol was developed in collaboration with Merge to collect the necessary data.
The process began with a curated list of financial buyers provided by Merge, each associated with a verified company website URL. These websites typically contain sections that describe the buyer's portfolio or list of holdings, which served as the starting point for the data retrieval routine.

The first step of the scraping involved identifying the specific webpage that listed the buyer's portfolio companies. This was achieved by searching the homepage of each buyer for links containing relevant keywords such as "Portfolio" or "Holdings". When such a link was found, the crawler followed it to access the portfolio section of the website.

Then, a crawler systematically traversed these portfolio pages to locate subpages that contained information about individual portfolio companies. From these subpages, all external hyperlinks (href attributes) were extracted. To ensure relevance, only links pointing to external company domains were retained, while links associated with social media platforms (e.g., LinkedIn, Twitter, Facebook) or general navigation elements were filtered out. The output of this stage was a structured mapping between each financial buyer and the corresponding list of portfolio company URLs.

When analysing how often each portfolio company link appeared across portfolios, certain domains reoccurred at unusually high frequencies. These were not actual portfolio companies but external sources such as news sites or financial data pages. Because these invalid records were usually among the most frequently occurring domains, the dataset could be improved by systematically inspecting and removing the entries with the highest counts.

Once the list of portfolio company websites had been established, the next step was to extract descriptive text for each company. For every company, the raw HTML content was retrieved from text-bearing elements, which contain the main written material on a webpage.
The extraction focused on two key sources of information: the company's landing page and its "About us" page, when available. The landing page was easily identified as the root domain and provided a concise overview of the company's offering and positioning. Locating the "About us" page required an additional search step, as its structure and URL varied between companies. To identify it, the crawler searched for links within the site containing indicative keywords such as "about", "who we are", or "what we do". When such a link was found, the corresponding page was scraped and its textual content extracted. This approach ensured that the collected text captured both the general presentation of the company and its self-described purpose and activities. The resulting HTML texts from the landing page and the identified "About us" page were later parsed and processed into clean textual representations for the summarization and embedding steps described in subsequent sections.

3.1.2 Summarization via LLM

After the full HTML content of the selected webpages had been retrieved and parsed with BeautifulSoup, the textual material was extracted through the text attribute of the parsed HTML object. This unstructured text served as input to a large language model (LLM), which was used to generate standardized and coherent summaries suitable for embedding. The purpose of this step was to convert noisy website text into concise, comparable descriptions that consistently capture the key characteristics of each company.

The summarization was performed using the GPT-4o-mini model, which provided a 128,000-token context window (OpenAI, 2025). This capacity ensured that both the landing-page text and, when available, the "About us" section could be included in their entirety. The API was accessed through the OpenAI library for Python, where a client was first created using the provided API key.
Then a request could be sent by providing the model, prompt, and desired temperature as arguments. Several temperature values in the range 0 to 1 were tested during experimentation, and a temperature of 0.5 was ultimately selected over the default value of 0.7. At this setting the model produced coherent sentences while retaining more of the provided information, reducing the risk of hallucinations.

The summarization prompt used in this study consisted of two components: a structured instruction block outlining the required tone, content, and formatting, followed by the raw text extracted from each company's landing page and, when available, its "About us" section. The instruction block ensured that all summaries adhered to a consistent structure and level of detail, while the inclusion of the full extracted text allowed the model to base its output solely on information explicitly provided in the source material. The complete prompt is shown below.

The Summarization Prompt

You are an M&A analyst. Your task is to create a company description from the information given into a concise, neutral and standardized summary. The style should be factual and objective, write it in free text and not as a list.

Instructions:
- Length: 300 words.
- Tone: neutral, objective and professional.
- Content focus: Industry and business model, Core products or services, Geographic focus and main markets, Customer segments or end-users.
- Avoid marketing laguage, exaggerations or subjective adjectives.

Do not infer or invent information not explicitly mentioned in the text. If information is missing, omit it rather then guessing.

Information below:
{WEB PAGE CONTENT}

3.1.3 Preprocessing and cleaning

As the LLM-generated summaries form the primary input to all embedding models, it was important to ensure that they were both accurate and consistent.
Although the web scraping pipeline retrieved text from the landing pages and the "About us" sections for most companies, several issues appeared in the raw data that required additional cleaning. Some webpages contained faulty or redirected links, which led to empty or unusable HTML content. In these cases the LLM could not produce a meaningful summary and instead returned placeholder phrases such as "information not found" or "not specified".

To identify such cases in a systematic way, the summaries were examined using a combination of keyword searches and length-based filters. An inspection of the distribution of summary lengths showed that nearly all summaries shorter than about 86 words were failed generations. These entries typically corresponded to missing webpage content, non-English source text, or very limited material that the LLM could not expand into a proper summary. Based on this observation, all summaries below the threshold of 86 words were removed. A keyword filter was also applied to detect low-quality summaries even if they were slightly longer. The keywords included phrases such as "Not found" and "Not specified", as well as words in foreign languages.

Language inconsistencies created another source of noise. Although the summarization prompt was written in English, pages written entirely in other languages sometimes resulted in short or partially untranslated summaries. These were identified through manual language detection checks and removed in order to maintain a coherent dataset.

After these cleaning steps had been applied, including the removal of invalid outputs, filtering by length, keyword detection, and exclusion of non-English summaries, the resulting dataset consisted of high-quality and comparable company descriptions. This cleaned corpus serves as the basis for all embeddings. Further preprocessing of the summaries was done individually for the models. This included tokenization, lemmatization, and punctuation removal.
As the summaries were generated using an LLM, the textual quality was generally very high, with few misspellings and special characters.

3.2 Models

After obtaining the textual descriptions of each portfolio company, the next step was to generate numerical vector representations that could be used for similarity analysis. This was done by applying several embedding models through a Python-based pipeline. A dedicated script was developed to load the summarized company texts and organize them into a pandas DataFrame. From there, each model was applied to the corpus to produce a corresponding set of embedding vectors. Three main types of models were applied: TF-IDF, Doc2Vec models, and several Transformer-based models. All models followed the same data pipeline and storage setup to ensure comparability.

3.2.1 TF-IDF

TF-IDF was used to generate vector representations of the standardized company summaries. The summarized texts were read into Python and processed using the TfidfVectorizer from the Scikit-learn library. For the TF-IDF representation, preprocessing was handled directly within the TfidfVectorizer. Instead of using an external tokenizer, a custom regular-expression-based token pattern was applied. This pattern ensures that only alphabetic tokens with a minimum length of two characters were included, effectively filtering out numbers, isolated letters, and other non-informative fragments.

The vectorizer was fitted on the full corpus of documents, with the minimum document-frequency parameter min_df tuned prior to finalizing the configuration. We evaluated values in the range 4 ≤ min_df ≤ 7 and selected 5, as it provided an effective balance between vocabulary coverage and noise reduction. Terms appearing in fewer than five documents were therefore excluded from the vocabulary, mitigating the impact of extremely rare words that offer limited discriminative value.
The resulting transformation produced a sparse TF-IDF matrix in which each document was represented as a weighted vector reflecting both term frequency and inverse document frequency.

3.2.2 Doc2Vec

Document embeddings were generated using the Doc2Vec implementation from the Gensim library. Each business summary was first preprocessed using spaCy, which involved lemmatization and punctuation removal, after which the cleaned tokens were wrapped into Gensim TaggedDocument objects with unique integer identifiers. The Doc2Vec model implemented the Distributed Bag of Words variant (dm=0), as described in Section 2.3. The model was configured with 100-dimensional vectors, a context window size of 4, a minimum token frequency threshold of 5, and 10 negative samples. As no numerical target such as an accuracy could be used to set optimal hyperparameters, we started with recommended values and adjusted them based on the resulting distribution. After vocabulary construction with build_vocab(), the model was trained for 5 epochs using train(), which relies on stochastic gradient descent through Gensim's optimized routines; this was a reasonable number of epochs for a smaller corpus. Final document representations were produced using the model's infer_vector method, which applies several gradient descent steps to derive stable embeddings that align with the learned semantic space.

Building on the trained Doc2Vec embeddings, two additional procedures were implemented to enhance and further analyze the resulting document representations. First, Smooth Inverse Frequency (SIF) embeddings were computed to reduce the influence of high-frequency, non-informative terms. This was achieved by re-weighting token contributions based on inverse frequency and then applying NumPy's SVD function to remove the first principal component, which captures the most generic information across documents.
Second, to investigate the effects of dimensionality reduction on representation quality, Scikit-learn's PCA function was applied to the embedding matrix. The number of components retained was set to preserve 90% of the explained variance, enabling the construction of more compact vectors while maintaining the core informational content of the original Doc2Vec representations.

3.2.3 Transformer Models

Transformer-based embeddings were generated using pretrained models from the SentenceTransformer framework, which includes architectures derived from SBERT as well as variants influenced by GPT-style embedding designs, among others. Three prospective models were considered to illustrate the range of Transformer-based embedding approaches: the lightweight "all-MiniLM-L6-v2", the intermediate "all-mpnet-base-v2", and the larger "Qwen/Qwen3-Embedding-0.6B".

Model name              Parameters   Output dim
all-MiniLM-L6-v2        22M          384
all-mpnet-base-v2       110M         768
Qwen3-Embedding-0.6B    600M         1024

Table 3.1: Model specifications for Transformer embedding models

All models operated directly on raw text inputs and incorporated their own tokenization and normalization procedures, and therefore no additional preprocessing was applied. Embeddings were obtained using the framework's encode() method. Among the considered models, all-mpnet-base-v2 was selected for the primary analysis as it provided a balance between representation quality and computational cost, and was suitable for environments without access to GPU resources. This model took about 15 minutes to embed all 10 000 samples.

3.3 Implementation

The aim of this study was to design a system that generates suggested buyers for a given target company that is to be sold. When evaluating this system, it seemed reasonable to sample target companies from the dataset of portfolio companies and then perform a similarity search in which the target company itself was excluded from the set of portfolio companies.
Given a target company the embeddings could be analysed both in terms of direct similarity search and through buyer suggestion. To perform similarity search the embedding vectors were queried using cosine similarity against the target embedding to find top K most similar companies. The system identified companies to add to the portfolio of the buyer so that they shared similar business models with companies already in their existing portfolio. This assumes that buyers are interested in what they have experience with and know works. To implement this idea for a given target company the system began by computing the pairwise cosine similarity for the target to every other portfolio company. Then the portfolios were grouped by their financial buyer (owner) and only the top three portfolio companies were kept for each buyer. The score for each buyer were then equal to the geometric average of the cosine similarities over these top three holdings. Using these buyer scores the system could then suggest buyers for the given target company based relevant parts of the buyers portfolios. For a practical implementation the target company was not sampled randomly but provided outside of the dataset. This meant that the target company needed to be scraped, summarized, and embedded separately from the previously embedded portfolio companies. Further, the vector embeddings could be stored in a vector database so that each run only needed to embed a single summary. This put a requirement on the models used as the new target summary needed to be embedded on the same terms as the other companies. For TF-IDF that used corpus word counts in the IDF term this posed a problem. As this was a very fast model it was feasible to simply re-embed all summaries including the new one given that all the summaries can be stored instead of the vectors. Doc2Vec used a neural network to 31 3. 
build the embeddings and was trained on our entire corpus of summaries, so it was necessary to store the trained model (for example in a pickle file) in order to use it for new summaries. This did, however, assume that the vocabulary of new summaries was consistent with that of the training corpus, as the training did not consider newly added summaries. The Transformer model was not fine-tuned; its pre-trained weights were imported as-is, so this model could be used to embed new target summaries on the same terms as the old ones.

3.4 Evaluation

The models were evaluated both visually and through similarity-based retrieval. Visual inspection allowed for studying how companies were positioned relative to one another in the embedding space, while retrieval tests measured how well each model surfaced similar companies based on cosine similarity. Together, these methods give an indication of whether the embeddings capture business-level similarity.

3.4.1 Visualizations

For evaluation by visualization, dimensionality reduction was used to plot the vectors in 2D. Using UMAP, the embedding vectors could be represented in two dimensions to enable scatter plotting. As UMAP relies on stochastic initialization, it was given a fixed seed for comparability. UMAP was used via the umap-learn library, taking as parameters the target dimensionality (=2), the random seed, and some parameters to tweak the resulting distribution.

Plotting the full set of companies resulted in a very large number of samples, giving an overview of clusters. There were no industry codes associated with the data, so all that could be done at this level was to analyse the distribution by hovering over samples to check whether each sample had reasonable neighbours.

The final visualization examined subsets of companies manually labelled by industry based on their summaries. Five industries were selected to test both similarity and dissimilarity: ‘Insurance’, ‘Asset Management’, ‘Industrial’, ‘Realtors’, and ‘Healthcare’. Insurance and Asset Management were chosen for their financial sector similarities, while Healthcare, Realtors, and Industrial represent materially different operational domains. This selection allowed investigation of whether embeddings capture both clear distinctions between dissimilar sectors and the relationships between related ones.

3.4.2 Similarity Search

To compare companies, similarity scores were computed directly on the embedding vectors generated by the models. For each company that was evaluated, its embedding was first retrieved and then compared with all embeddings in the buyer dataset. The comparison was carried out using cosine similarity, which offers a normalized measure of closeness between vectors and allows for a consistent interpretation of similarity across different embedding types.

The implementation followed a straightforward procedure. Once the target company had been embedded using the chosen model, such as TF-IDF or a Transformer-based encoder, its vector representation was compared against every portfolio-company embedding using cosine similarity. Cosine similarity was calculated pairwise, and the resulting scores formed the basis for ranking the potential portfolio companies. The companies in the buyer set were sorted according to their similarity scores in descending order, after which the top-k most similar candidates were returned as the system’s recommendation.

3.4.3 Expert Evaluation

To evaluate the practical relevance of the buyer recommendations produced by the models, a structured expert review was conducted. The goal of this step was to assess whether the buyer recommendations generated by the system align with the expectations of experienced analysts and to identify failures not captured by the numerical metrics used to compare the models presented in the thesis.
First, a sample of portfolio companies was selected to cover a mix of industries and business models. For each of these companies, the three different models generated two suggested buyers each. These buyers were identified by first locating the three most similar portfolio companies within each buyer’s portfolio and then averaging their cosine similarity scores to form a buyer-level relevance measure. In some cases, the three underlying portfolio companies contributed evenly to the similarity score, while in others a single highly similar portfolio company had a disproportionate influence on the buyer’s ranking.

The motivation behind using the three most similar portfolio companies, instead of relying only on the single closest match, was to obtain a more stable and representative measure of buyer relevance. By considering the three closest portfolio companies and taking the average similarity score, the measure better reflects the overall investment profile of the buyer. This method gives a more reliable picture of what the buyer normally invests in, reduces the effect of outliers, and avoids placing too much weight on one unusually strong match. There are several possible approaches to constructing a buyer relevance score, such as using larger sample groups or applying weighted similarity. In dialogue with Merge, this setup was chosen because it offered a clear and balanced way of evaluating buyer interest while still being practical to work with.

A sample of 71 target companies was included in the evaluation. To ensure a manageable and consistent review process, only the first paragraph of each target company’s summary was provided, as this was deemed sufficient for forming a clear understanding of the business. It is important to note that reviewers were aware that the system gives suggestions based purely on textual similarity.
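The buyer-level relevance measure described earlier in this subsection can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the thesis implementation: the function name, the company and buyer names, and the similarity values are hypothetical. Because Section 3.3 speaks of a geometric average while this subsection speaks of averaging, the sketch supports both; clamping non-positive similarities before taking logarithms is an additional assumption made here so that the geometric mean is defined.

```python
import math
from collections import defaultdict


def buyer_scores(similarities, owners, top_n=3, mean="arithmetic"):
    """Score each buyer by the mean of its top_n target similarities.

    similarities: portfolio company -> cosine similarity to the target
    owners:       portfolio company -> buyer (owner) name
    """
    by_buyer = defaultdict(list)
    for company, sim in similarities.items():
        by_buyer[owners[company]].append(sim)

    scores = {}
    for buyer, sims in by_buyer.items():
        top = sorted(sims, reverse=True)[:top_n]
        if mean == "geometric":
            # Assumption: clamp to a small positive value so log() is defined.
            clamped = [max(s, 1e-9) for s in top]
            scores[buyer] = math.exp(sum(math.log(s) for s in clamped) / len(clamped))
        else:
            scores[buyer] = sum(top) / len(top)
    # Highest-scoring buyers first.
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical similarities of seven portfolio companies to one target.
sims = {"a1": 0.9, "a2": 0.8, "a3": 0.7, "a4": 0.1,
        "b1": 0.95, "b2": 0.2, "b3": 0.2}
owners = {name: ("Buyer A" if name.startswith("a") else "Buyer B") for name in sims}

arithmetic = buyer_scores(sims, owners, mean="arithmetic")
geometric = buyer_scores(sims, owners, mean="geometric")
print(arithmetic)
print(geometric)
```

In this toy example Buyer B holds the single closest company, yet Buyer A ranks first under both means because its three closest holdings are consistently similar. The geometric mean penalizes Buyer B's uneven profile more strongly than the arithmetic mean does.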
Experts assigned a relevance score to each suggested buyer using a three-point scale:

• 3 – Highly relevant
• 2 – Relevant
• 1 – Not relevant

The expert ratings offer a qualitative assessment of how well each model identifies strategically meaningful buyers. By focusing directly on buyer-level suggestions, rather than individual portfolio companies, the evaluation captures experts’ judgement of sector alignment and business model fit. These results serve as the primary qualitative benchmark for comparing the models’ performance.

4 Results

This chapter presents the empirical results of the implemented buyer-matching pipeline. We report findings from the summarization stage, embedding performance, dimensionality-reduction visualizations, similarity-search experiments, and the expert evaluation. The results follow the methodological structure outlined in the Method chapter.

4.1 Dataset statistics and quality

The final dataset contains a total of 9,492 company summaries after all preprocessing steps. Each observation corresponds to a cleaned textual summary. Basic descriptive statistics of token lengths for these summaries are presented in Table 4.1.

Statistic                    Value
Number of documents          9,492
Average number of tokens     199.36
Standard deviation           22.98
Minimum number of tokens     86
Median number of tokens      202
Maximum number of tokens     287

Table 4.1: Statistics for the company summaries

The summaries vary between 86 and 287 tokens, with most values concentrated around 200 tokens. These statistics describe the textual characteristics of the dataset used in the embedding and similarity-search stages.

4.1.1 Quality of LLM Summaries

The summarization step produced one structured summary per company, resulting in 9,492 text outputs. The outputs follow a consistent format extracted from the scraped website content, and the examples in Tables 4.2 and 4.3 illustrate the typical structure of the generated summaries in the dataset.
Example Summary - BSI Software

BSI Software is a European company specializing in customer relationship management (CRM) and customer experience (CX) solutions. Founded in 1996 in Switzerland, BSI Software offers the BSI Customer Suite, a modular platform that integrates artificial intelligence to enhance customer engagement, data insights, and relationship management. The company focuses on industries such as banking, insurance, retail, and energy, providing tailored solutions that comply with regulatory standards and meet specific market needs. The BSI Customer Suite is designed to facilitate seamless customer interactions across sales, marketing, and service channels, ensuring data protection and digital sovereignty for its users. The platform includes features such as customer data management, enterprise integration, and AI-driven analytics, allowing businesses to derive actionable insights from their customer data. BSI Software emphasizes flexibility and scalability, enabling clients to adapt the platform to their evolving requirements. Geographically, BSI Software operates primarily in the DACH region (Germany, Austria, and Switzerland) and Italy, serving a diverse range of customer segments from large corporations to smaller enterprises. The company prioritizes customer-centricity and collaboration, fostering a networked approach to project management without traditional hierarchies. BSI Software’s commitment to quality and precision is reflected in its extensive industry expertise and its focus on long-term partnerships with clients.

Table 4.2: Example of a generated company summary for BSI Software

Example Summary - Apotea

Apotea is an online pharmacy operating in Sweden, specializing in the sale of pharmaceutical products, health and beauty items, and various wellness solutions. The company offers a wide range of products, including prescription medications, over-the-counter drugs, dietary supplements, and personal care items.
Apotea’s business model is centered around e-commerce, providing customers with the convenience of shopping for health-related products from home, with options for fast delivery and free shipping. The geographic focus of Apotea is primarily within Sweden, serving customers nationwide. The company caters to diverse customer segments, including individuals seeking health and wellness products for themselves and their families, as well as pet owners looking for veterinary medications and supplies. Apotea also provides professional advice through its licensed pharmacists, ensuring customers receive guidance on their purchases and health inquiries. With an extensive inventory that includes over 50,000 quality-checked products, Apotea positions itself as one of the largest online pharmacies in Sweden. The product categories range from allergy relief and skincare to nutritional supplements and household items. The company emphasizes customer service, offering support via email, phone, and chat, and aims to meet the needs of various consumer demographics, including those with chronic health conditions and specific wellness requirements.

Table 4.3: Example of a generated company summary for Apotea.

The summary in Table 4.2 provides a clear overview of BSI Software’s core business areas by identifying CRM and CX solutions as its primary focus, with the BSI Customer Suite as the main offering. It also states the company’s focus on regulated service industries, which helps situate its target markets. Geographically, the summary specifies that the firm was founded in Switzerland and operates in Europe. In terms of business model, the description implies a modular, AI-enhanced software platform that allows businesses to derive actionable insights from their customer data. As a result, all the desired aspects are mentioned, though not at a detailed level.
The summary in Table 4.3 provides a clear overview of Apotea’s core operations by identifying its role as a Swedish online pharmacy with an extensive assortment of pharmaceutical, health, and wellness products. It highlights e-commerce as the central business model, emphasizing convenience, fast delivery options, and broad product availability as key value propositions. The summary also situates Apotea geographically by noting its primary focus on the Swedish market and its nationwide customer base. In terms of customer segments, the description covers both general consumers seeking health and personal care items and pet owners requiring veterinary products. The mention of licensed pharmacists adds context to the company’s service offering, suggesting a model that combines digital retail with professional guidance. Overall, the summary captures the main business areas, customer focus, and operational model, though it remains high-level rather than detailing specific logistics or competitive differentiators.

4.2 Models

Table 4.4 reports the approximate runtime for each embedding model when generating representations for the full dataset and for half of it, in order to compare scaling. TF-IDF runs in under a second in both cases, while the Doc2Vec variants complete in around two minutes for half the dataset and three minutes for the whole. The MPNet Transformer, however, requires substantially more time due to its higher model complexity, taking about 16 minutes to embed the full dataset. The runtime for TF-IDF and the MPNet Transformer more than doubles for the full dataset, while for Doc2Vec the runtime is slightly less than twice as long. Overall, the results confirm the expected trade-off between model expressiveness and runtime.
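The scaling statements above can be checked directly from the runtimes reported in Table 4.4. The short sketch below divides the 10,000-document runtime by the 5,000-document runtime for each model; the dictionary layout and variable names are illustrative choices, while the numbers are copied from the table.

```python
# Runtimes in seconds from Table 4.4: (5,000 documents, 10,000 documents).
runtimes = {
    "TF-IDF": (0.3, 0.8),
    "Doc2Vec": (102.7, 192.3),
    "Doc2Vec PCA": (111.4, 199.4),
    "Doc2Vec SIF": (110.3, 198.3),
    "MPNet Transformer": (410.6, 990.2),
}

# A ratio of 2.0 corresponds to runtime growing linearly with dataset size.
ratios = {model: full / half for model, (half, full) in runtimes.items()}

for model, ratio in ratios.items():
    trend = "more than doubles" if ratio > 2.0 else "less than doubles"
    print(f"{model}: x{ratio:.2f} ({trend})")
```

Note that the TF-IDF ratio is based on sub-second measurements and is therefore the most sensitive to timing noise.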
                      Runtime (seconds)
Model                 5,000 documents   10,000 documents
TF-IDF                0.3               0.8
Doc2Vec               102.7             192.3
Doc2Vec PCA           111.4             199.4
Doc2Vec SIF           110.3             198.3
MPNet Transformer     410.6             990.2

Table 4.4: Runtime statistics

4.3 Embedding Visualization

Using UMAP dimensionality reduction, the distribution of companies in the embedding space can be plotted in two dimensions.

4.3.1 All Portfolio Companies

The UMAP plots of all the portfolio companies are shown in Figure 4.1. As the dataset includes no industry codes, it is difficult to draw any firm conclusions from these plots other than that some partial cluster formation is visible. The MPNet Transformer and TF-IDF appear to form denser clusters, whilst the Doc2Vec models appear more dispersed. TF-IDF seems to have the most outliers, and MPNet also has some.

Figure 4.1: UMAP visualization of all companies for the different models

4.3.2 Subset visualization

In order to show how the models embed information about a business and its industry, a subset of companies within different sectors was selected. The selection was done manually to find subsets of companies with both different and similar business models. The selected sectors were ‘Insurance’, ‘Asset Management’, ‘Industrial’, ‘Realtors’, and ‘Healthcare’. Some of the companies were deliberately selected for having ambiguous industries, such as Healthcare Insurance, Insurance Brokers, and Asset Managers with pension insurance operations. The full list of these companies is attached in Appendix A.

Figure 4.2: UMAP visualization of a subset of companies within 5 industries for the different models

Figure 4.2 shows the UMAP plots for each embedding method, and a few patterns stand out. The industrial companies are somewhat clustered in all models except TF-IDF, where the groups blend more horizontally. The industry names were densely clustered only for the Transformer model, while in the other models they appeared more dispersed.
Healthcare names cluster tightly across all embeddings, but for Doc2Vec one name stands out, as it is more focused on products (Spinal Discs) than on providing healthcare services or pharmaceuticals. Both Doc2Vec models also embed a consumer insurance company close to the healthcare names. The asset management cluster is also interesting because most of these companies also offer pension insurance or insurance brokerage. Both the Doc2Vec and Transformer models pick up on this by placing them near the insurance cluster, whereas TF-IDF does not capture this relationship as well, placing them closer to the real estate brokers. Overall, the Transformer model gives the most clearly separated clusters for the given subset.

In Figure 4.3 the subset is overlaid on the plot of all companies to show how the labelled data points conform to the overall structure of the dataset. This gives some context about the cluster formation. TF-IDF appears to have some clustering but also plenty of outliers, with the insurance names appearing together in one of these outlier clusters. Both Doc2Vec models seem to place the industrial names in a large dispersed cluster, whilst the healthcare names appear in two denser clusters. The two Doc2Vec variants are very similar, although for SIF the real estate brokers are slightly more separated and there seems to be slightly more cluster formation.

Figure 4.3: UMAP visualization of all companies in the embedding space, with selected companies from five industries highlighted

4.4 Similarity Search Experiments

This section presents the results of the similarity search experiments, evaluating how effectively each embedding model retrieves relevant companies for a given set of test queries.

4.4.1 Similarity Score Distributions

To better understand how each embedding model represents companies in the vector space, summary statistics of the cosine similarity scores between all pairs of companies are shown in Table 4.5.
These distributions provide insights into how densely or sparsely the models cluster the representations, which in turn influences how sensitive each model is when identifying relevant buyers.

Model          Min        Max       Mean       Median
TF-IDF         0.000      0.940     0.127      0.120
Doc2Vec PCA    -0.76685   0.94113   -0.00016   -0.00926
Doc2Vec SIF    -0.652     0.911     -0.00012   -0.00766
MPNet          -0.139     0.994     0.251      0.239

Table 4.5: Statistics of cosine similarity distributions for each embedding model

Figure 4.4: Distribution of similarity scores

For the TF-IDF model, the similarity scores are concentrated around low values with a long tail toward higher similarities. This behaviour is expected, as TF-IDF produces high-dimensional and sparse vectors where most company pairs share few terms. As a result, only companies with strongly overlapping vocabulary achieve high similarity scores, while the majority remain close to zero.

The SIF-Doc2Vec model shows a distribution centered around zero. Unlike TF-IDF vectors, which contain only non-negative values and therefore can only produce non-negative cosine similarities, Doc2Vec vectors contain both positive and negative components. After applying SIF (removal of the first singular vector), the embeddings become more isotropic, further pushing cosine similarities toward a normal distribution centered around 0. This results in high contrast between similar and dissimilar companies, but also means that random pairs will have similarity close to zero.

In contrast, the Transformer-based embeddings (SBERT) show a distribution shifted toward higher similarity values compared to TF-IDF. These models place semantically related companies closer in the embedding space, even when the textual descriptions do not share explicit vocabulary. The Transformer embeddings therefore produce higher baseline similarity scores.
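The kind of statistics reported in Table 4.5 can be reproduced in principle with a few lines of code. The following is a minimal pure-Python sketch, assuming the embeddings are available as plain lists of floats; the function names and toy vectors are hypothetical. It also illustrates the point made above: vectors with only non-negative components cannot yield a negative cosine similarity, whereas dense vectors with mixed signs can.

```python
import math
from itertools import combinations
from statistics import mean, median


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def pairwise_similarity_stats(vectors):
    """Min, max, mean and median cosine similarity over all unordered pairs."""
    sims = [cosine_similarity(u, v) for u, v in combinations(vectors, 2)]
    return {"min": min(sims), "max": max(sims), "mean": mean(sims), "median": median(sims)}


# Toy stand-ins: sparse non-negative vectors (TF-IDF-like) and
# dense vectors with mixed signs (Doc2Vec-like). Values are made up.
sparse_like = [[1, 0, 2, 0], [0, 3, 0, 0], [1, 1, 0, 0], [0, 0, 0, 4]]
dense_like = [[1, -1, 0.5, 0], [-1, 1, 0, 0.5], [0.5, 0.5, -1, 1], [0, -0.5, 1, -1]]

sparse_stats = pairwise_similarity_stats(sparse_like)
dense_stats = pairwise_similarity_stats(dense_like)
print(sparse_stats)
print(dense_stats)
```

For the full dataset of 9,492 summaries there are roughly 45 million unordered pairs, so a vectorized matrix computation would be far more practical than this loop; the sketch is only meant to make the definition explicit.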
4.4.2 Example of target similarity suggestions

The tables below illustrate an example of the top-k retrieved companies for a selected target. Note that the target in this case is a sampled portfolio company. For the chosen target, cosine similarity scores are computed against all portfolio companies in the dataset, and each model returns its highest-ranked matches based on these scores. A larger set of top-k results for target companies can be found in Appendix A.

The purpose of the example is to provide a qualitative impression of how the models behave in practice. By examining one representative case, it becomes possible to observe the types of semantic or textual patterns that lead to high similarity scores under TF-IDF, Doc2Vec SIF, and the Transformer-based embeddings. Below is a short description of the sampled target company, Klarna, followed by the top-2 suggestions produced by each model.

Klarna summary

Klarna is a financial technology company specializing in payment solutions for both consumers and businesses. Operating primarily within the e-commerce sector, Klarna provides a variety of payment options designed to enhance the online shopping experience. Its core offerings include immediate payments, deferred payment solutions, and installment plans, enabling consumers to manage purchases according to their financial preferences.

TFIDF

Summary: Paysafe is a global payment solutions provider that offers a range of services designed to facilitate online transactions for businesses and consumers. The company operates within the financial technology industry, focusing on payment processing, digital wallets, and online cash solutions. Its core offerings include card processing, eCommerce solutions, local payment methods, and various digital wallet services such as Skrill, Neteller, and PaysafeCard.
Similarity: 0.518

Summary: Qliro is a financial technology company that operates in the payments and savings industry, providing a platform designed to facilitate both online and in-store transactions. The company offers a range of payment solutions, allowing customers to choose their preferred payment methods, including options for immediate payment or flexible payment plans. The Qliro app serves as a comprehensive tool for users to manage their payments and finances, enabling them to track invoices, schedule payments, and communicate with customer support.
Similarity: 0.495

Table 4.6: TFIDF Model Matches for Klarna

The results of the TF-IDF top-two similarity search are presented in Table 4.6. TF-IDF appears to focus on the transaction side of the Klarna business, picking two companies that provide payment solutions. Both offer a checkout solution for e-commerce, which is a big part of the Klarna business. Only Qliro also offers deferred payment services, making it a more direct competitor to Klarna.

Doc2Vec PCA

Summary: Curve Pay is a digital wallet service that consolidates multiple payment cards into a single, secure platform, enabling users to manage their finances more effectively. The service allows customers to switch between cards even after a purchase, thereby eliminating hidden foreign exchange fees and enhancing the rewards associated with existing bank cards. Curve Pay is designed for both online and in-store transactions, as well as for international spending, offering features such as cashback on purchases and flexible payment options.
Similarity: 0.784

Summary: Freecharge is a financial services and payment technology company based in India, operating as a subsidiary of Axis Bank. The company primarily focuses on providing a wide range of payment solutions, including mobile and DTH recharges, utility bill payments, and UPI transactions, catering to over 100 million users across the country. Freecharge’s business model integrates various payment methods, allowing users to transact using wallets, UPI, net banking, debit cards, and credit cards, thereby facilitating seamless payments for both online and offline merchants.
Similarity: 0.678

Table 4.7: Doc2Vec PCA Model Matches for Klarna

The PCA-reduced Doc2Vec model produces two more diverse financial payment companies, presented in Table 4.7. Curve Pay seems in line with Klarna’s offering of e-commerce payment solutions, but Freecharge appears more focused on general personal finance.

Doc2Vec SIF

Summary: Curve Pay is a digital wallet service that consolidates multiple payment cards into a single, secure platform, enabling users to manage their finances more effectively. The service allows customers to switch between cards even after a purchase, thereby eliminating hidden foreign exchange fees and enhancing the rewards associated with existing bank cards. Curve Pay is designed for both online and in-store transactions, as well as for international spending, offering features such as cashback on purchases and flexible payment options.
Similarity: 0.629

Summary: Splitit USA Inc. operates in the financial technology sector, specializing in a buy now, pay later (BNPL) service that allows consumers to split their purchases into smaller monthly payments using their existing credit cards. The company’s business model is designed to facilitate flexible payment options without the need for new loans or credit checks, thereby avoiding additional interest or fees.
Similarity: 0.612

Table 4.8: Doc2Vec SIF Model Matches for Klarna

The SIF Doc2Vec model results, presented in Table 4.8, again include Curve Pay but also a new suggestion. Splitit USA offers buy now, pay later services, which are a big part of Klarna’s offering. The focus on providing consumer credit makes Splitit an interesting suggestion.
Transformer

Summary: Klar Technologies (Klar) is a regulated financial entity in Mexico, authorized by the Comisión Nacional Bancaria y de Valores (CNBV) and operating under the Law of Savings and Popular Credit. The company primarily focuses on providing financial services, including credit cards, personal loans, and investment accounts. Klar offers a variety of credit card options, such as the Klar Plus and Klar Platino, which feature benefits like no annual fees, cashback rewards, and flexible payment terms.
Similarity: 0.739

Summary: Karbon is a fintech company based in Bengaluru, India, specializing in foreign remittance services and corporate expense management solutions. The company primarily focuses on facilitating international payments for businesses, including exporters, importers, freelancers, and direct-to-consumer (D2C) e-commerce enterprises. Karbon’s service offerings include a prepaid corporate card designed for expense tracking, an AI Accountant to streamline accounting tasks, and competitive foreign exchange remittance solutions.
Similarity: 0.667

Table 4.9: Transformer Model Matches for Klarna

The MPNet Transformer model results presented in Table 4.9 show two new suggestions. The first suggestion appears to have more of a traditional bank offering of credit cards, loans, and investment accounts. The second option seems to have a broader financial offering more targeted toward B2B.

Although nearly all suggested peers to Klarna operate in the financial sector, their business models show notable variation. Klarna’s diversified operations span e-commerce checkout solutions, consumer payment cards, and micro-lending services, creating a complex profile for similarity matching. Notably, multiple models returned pairs of suggestions with comparably high cosine similarity scores, indicating substantial representation of fintech companies within the dataset.
However, Klarna’s major international competitor Affirm does not appear among the retrieved peers; a check confirmed that it is absent from the portfolio dataset.

4.5 Expert Review Results

In designing the expert evaluation, it was necessary to balance the number of models assessed against the practical constraints of the reviewers’ time. As described in the Method chapter, each model produced two suggested buyers per target, and each suggested buyer came with three portfolio companies to be considered. Increasing the number of models would therefore have reduced the number of target companies that could be reviewed meaningfully, so a trade-off was required. The SIF Doc2Vec and PCA Doc2Vec models showed very similar results in the visualization and cosine similarity search. Given the minor advantages of the SIF Doc2Vec variant, it was selected as the third model, together with the MPNet Transformer and TF-IDF, for the expert evaluation.

Table 4.10 shows the results of the expert evaluation, with a total of 71 samples evaluated by professionals from Merge. Note that each model outputs two suggestions per target, so the total count for each model is 142. The full evaluation set with scores per target can be found in Appendix B.

Model                 Count 1   Count 2   Count 3   Average   Median
Doc2Vec (SIF)         36        52        54        2.13      2
TF-IDF                16        43        83        2.47      3
Transformer (MPNet)   3         43        96        2.65      3

Table 4.10: Expert ratings for each model. Scores range from 1 (not relevant) to 3 (highly relevant)

5 Discussion

The purpose of this thesis was to investigate whether textual embedding models can support the identification of relevant potential buyers in M&A processes. The central question was whether different embedding approaches create vector representations that meaningfully reflect business similarities between companies, and whether those representations can be used to recommend buyers whose portfolio profiles align with a given target.
To address this, the study relied on generating a large dataset of scraped and LLM-generated summaries describing the operations, value propositions, and geographic footprints of portfolio companies. Across both quantitative analyses and expert evaluation, the results indicate that the choice of embedding representation significantly influences the quality and relevance of the recommendations. In particular, the Transformer model demonstrated the most coherent clustering and achieved the strongest alignment with expert judgement.

5.1 Discussion of Results

This chapter interprets and reflects on the results presented in the previous chapter. The aim is to evaluate how well the proposed methodology addresses the problem of buyer-seller matching in M&A, assess the relative performance of the three embedding models, and relate the findings back to the overarching research questions and practical context at Merge.

5.1.1 Embedding space and Visualizations

A central objective of this thesis was to explore how companies can be embedded in a vector space in a way that reflects their underlying business characteristics. Since all embeddings were generated from LLM-produced summaries of publicly available website content, the quality and content of this data directly shaped the geometry of each embedding space. The UMAP visualisations presented in Figures 4.1-4.3 provide an interpretable view of these high-dimensional structures and give an indication of how well the different models capture the semantics and content of the summaries.

In addition to differing in how their embeddings are constructed, the models also differ substantially in dimensionality, which influences the structure of the resulting embedding spaces. TF-IDF produces very high-dimensional sparse vectors (8,000 dimensions in our case), and Doc2Vec creates dense vectors with substantially lower dimensionality, with a standard length of 100.
The MPNet Transformer model lies between these extremes, generating 768-dimensional dense semantic embeddings. These dimensional differences affect how much variation each model can encode and how tightly companies can be positioned relative to one another. TF-IDF representations tend to create separations based on vocabulary patterns, whereas lower-dimensional dense vectors may compress information and produce smoother boundaries. To visualise these high-dimensional structures, UMAP was applied to project each embedding space into two dimensions. UMAP preserves local neighbourhoods while maintaining aspects of global organisation. Although some information is inevitably lost, meaningful and coherent clusters in the two-dimensional projections indicate that the original embeddings capture relevant business similarities.

Figure 4.1 shows how the different models embed all the data in the embedding space, revealing clear differences in structure between the models. Both the TF-IDF and MPNet Transformer embeddings show more distinctive cluster separation, suggesting that these models capture more consistent patterns in how companies relate to one another. In contrast, both Doc2Vec variants generate more evenly spread shapes, indicating that their representations are less sharply defined and may struggle to separate companies clearly. The stronger cluster formation observed for TF-IDF and especially for the Transformer model suggests that these embeddings retain more distinctive information.

The subset visualizations in Figure 4.2 provide an example of how well the models separate companies within different sectors. In this example, the differences between the models become clearer. TF-IDF manages to separate the selected sectors relatively well, with insurance, healthcare, real estate, and asset management forming distinct clusters, while the industrial companies appear spread out on a line.
This pattern can be expected, as industrial firms often span a wide range of activities and therefore have a wide range of textual information. Both Doc2Vec models show a similar overall structure, producing clear clusters for real estate, industrial, and healthcare companies, but asset management overlaps with insurance. Additionally, one insurance company is positioned closer to healthcare. In this example the MPNet Transformer mo