Using Machine Learning for Predicting Interchanges
Leveraging Machine Learning Methods for Finding Equivalent Products Based on Competitor Products

Master's thesis in Engineering Mathematics and Computational Science

Jesper Wiström

Department of Mathematics
CHALMERS UNIVERSITY OF TECHNOLOGY
Gothenburg, Sweden 2025
www.chalmers.se

Master's Thesis 2025
© Jesper Wiström, 2025.

Supervisor: Hugo Carlen, Company Supervisor
Supervisor: Johan Jonasson, Department of Mathematics
Examiner: Johan Jonasson, Department of Mathematics

Department of Mathematics
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: A picture of a bearing.

Typeset in LaTeX, template by Kyriaki Antoniadou-Plytaria
Printed by Chalmers Reproservice
Gothenburg, Sweden 2025

Abstract

In the bearing industry, identifying equivalent products across different manufacturers is a complex and time-consuming task, traditionally reliant on expert knowledge and manual cross-referencing. This thesis explores the feasibility of automating this process using machine learning techniques. The core challenge lies in predicting an equivalent bearing designation from a competitor's product code, which typically lacks structured attribute data and consists of a single, information-dense string.

To address this, a comprehensive machine learning pipeline was developed, encompassing feature extraction, encoding, model training, and decoding. The problem was framed as a multi-label classification task, where suffixes representing product features were predicted and then used to reconstruct a valid target designation. Several machine learning models were evaluated, including Random Forests, Support Vector Machines, k-Nearest Neighbors, and Neural Networks, using both problem transformation and method adaptation strategies. Random Forests with label powerset transformation emerged as the most effective approach, offering a strong balance between accuracy and computational efficiency. The pipeline was further optimized through domain-specific feature engineering, such as extracting product size and type indicators, which significantly improved model performance.

Despite inherent limitations in the data, such as inconsistent labeling and sparse representation of certain suffixes, the results demonstrate that machine learning can reliably assist in generating bearing interchanges. This approach has the potential to significantly reduce manual effort and scale the interchange process, making it a valuable tool for industry practitioners.

Keywords: Bearing Interchange, Machine Learning, Multi-label Classification, Feature Extraction, Random Forest, Product Designation, Suffix Prediction, Cross-reference Automation, Domain-specific Encoding, Industrial AI Applications
Contents

List of Figures
List of Tables

1 Introduction
  1.1 Background
  1.2 Previous Attempts
    1.2.1 Internal Project
    1.2.2 Previous Master Thesis
  1.3 Aim
  1.4 Limitations
  1.5 Problem Specification
  1.6 Methodology

2 Theory
  2.1 Designation system
    2.1.1 Basic Designation System
    2.1.2 Affixes
  2.2 Evaluation Metric
  2.3 Classification
    2.3.1 Single label classification
    2.3.2 Multi-label Classification
      2.3.2.1 Problem Transformation
      2.3.2.2 Method Adaptation
  2.4 RAKEL: Random k-Labelsets
    2.4.1 RAKELd
    2.4.2 RAKELo
  2.5 kNN - k-Nearest Neighbors
  2.6 MLkNN - Multi-Label k-Nearest Neighbors
    2.6.1 Technical Details
  2.7 Random Forests
  2.8 Neural Networks
  2.9 Support Vector Machines (SVM)
  2.10 One-Hot Encoding and Multi-Label Binarizer

3 Implementation
  3.1 Overview of the Solution
    3.1.1 Mathematical Description of the Problem
  3.2 Pipeline Implementation
    3.2.1 Preprocessing, Feature Extraction, and Encoding
    3.2.2 Evaluation
      3.2.2.1 Affix Conversion Accuracy
      3.2.2.2 Designation Conversion Accuracy
      3.2.2.3 Incorrect Prediction Rate
    3.2.3 Initial Result
    3.2.4 Decoding
      3.2.4.1 Initial Search and Filtering
      3.2.4.2 Evaluation and Matching Algorithm
      3.2.4.3 Algorithm Optimization
      3.2.4.4 Speed Optimization
    3.2.5 Pipeline Improvements
      3.2.5.1 Cleaning of OEM Numbers
      3.2.5.2 Adding Features Extracted from the Designation
      3.2.5.3 Refining the List of Suffixes
  3.3 Model Exploration and Comparison
    3.3.1 Model Implementation
      3.3.1.1 Neural Network Classifier
      3.3.1.2 Random Forest Classifier
      3.3.1.3 SVM
      3.3.1.4 RAKEL
      3.3.1.5 kNN Variations
      3.3.1.6 Notes on Computational Complexity

4 Results
  4.1 Pipeline Optimization
    4.1.1 Improved Affix List
    4.1.2 Adding Features for Size and Type
    4.1.3 Training without OEM References
  4.2 Model Evaluation
  4.3 Data Limitations

5 Conclusion
  5.1 Pipeline
  5.2 Machine Learning Model
  5.3 Cross Reference Prediction Feasibility
  5.4 Recommendations for Future Work

Bibliography

List of Figures

2.1 A simple illustration showing the basics of how random forest works.
3.1 Schematic describing the larger building blocks of the pipeline
3.2 Illustrative drawing showing the input and output nodes with hidden layers in between.
4.1 Results of suffix and designation accuracy as well as false suggestion rate based on changes to the target affix list
4.2 Results of adding features for size and first 3 characters (feature is called f3)
4.3 Results of training without OEM references
4.4 Comparison of the accuracy for different models and strategies. Random forest seems to perform the best.
4.5 Comparison of the runtime for different models and strategies. Random forest has notable differences between label powerset and binary relevance
4.6 Proportion of data for each value of f3
4.7 Proportion of data for each size value
4.8 Histogram of competitor affixes
4.9 Histogram of output affixes

List of Tables

2.1 Fruit, Categorical Value, and Price
2.2 One-Hot Encoded Fruits and Prices
2.3 Genres and Ratings
2.4 One-Hot Encoded Genres and Ratings
3.1 Neural Network Architecture

1 Introduction

This report investigates the application of machine learning methods in the bearing industry. A common problem is replacing a bearing from one manufacturer with a bearing from another manufacturer. This task requires deep industry knowledge, which few people possess, and is extremely time-consuming. To aid in this task and reduce the time spent, machine learning will be applied.
The goal is to train a model on references from one manufacturer to another. The aim is that, given a product designation from the first manufacturer, the model can predict the equivalent bearing from the other manufacturer. The difficulty lies in the lack of attributes in this data; it consists of only one string that needs to be converted to another string. The core of the problem is to extract enough features from the product designation and train the model on these relationships.

1.1 Background

In the bearing industry, product designations (or product names) carry extensive information about the dimensions, type, and features of the product. Some parts of the designation follow an ISO standard, while other parts, which describe the product's features, are created by each manufacturer. The naming system uses many codes and abbreviations, and the designation system provides the information needed to decode these codes into meaningful descriptions. This system allows you to compare products between brands and identify an equivalent product from another manufacturer. However, this requires knowledge of the designation systems of different manufacturers and an understanding of their relationships.

To assist users who are not familiar with the different designation systems, interchange tools are available. These tools allow you to enter the designation from one manufacturer and then show the equivalent product designation from another manufacturer. Such a reference between a product from one manufacturer and an equivalent product from another manufacturer is called an interchange.

The project originated from such an interchange tool. This tool allows distributors to find an interchange from a competitor's product designation. With approximately 10,000 searches each day, it is a valuable selling tool for the company and an important tool for distributors in their daily work.

The application is based on a large, manually created database of interchanges from competitor products. This database contains information spanning most product categories and many brands, with the majority of the data focused on the most important competitors. The company now recognizes the need to expand this database to cover more of the assortment and to find references for frequently searched product designations that are not currently included.

The current process is manual and extremely time-consuming, taking 10-20 minutes to identify each individual interchange. First, you need to find the competitor product and understand its type and features. Then, you need to determine if you have a similar offering and, if so, identify the exact product. This requires extensive knowledge of the bearing domain and the products. To free up time and scale the service, the company now aims to leverage machine learning methods to create new interchanges.

1.2 Previous Attempts

Over the years, there have been several attempts to solve this problem, highlighting a clear business need for a faster and more efficient way to identify interchanges. Ideally, an AI model would be created that requires less expertise and can be used on a larger scale. Notably, there have been two significant attempts using different methods: one based on traditional machine learning and another utilizing Large Language Models to make these predictions. The main aspects of this work and their outcomes will be described below.
1.2.1 Internal Project

This internal project, driven by the company's research and development branch, was undertaken by an experienced AI & ML expert. The goal was to find a way to match competitor products to the company's products by exploring various machine learning methods to determine what works best and what is possible.

The project concluded that this task is impossible. It was determined that converting product designations requires rich attribute data from both manufacturers' products. Obtaining such data is challenging for the company's own products and even more difficult for competitor products. The conclusion was that a model needs this data to learn the relationships, and without it, the task is not feasible.

Another suggested solution was to manually build a decision tree along with a lookup table to convert the affixes from the competitor's products to the company's equivalent affixes and create rules based on this. However, this approach would result in an extremely large and complex decision tree, as suffixes have different meanings for different product types. Building such a tree would be extremely time-consuming, difficult, and expensive, making it infeasible, even though it was considered the best possible approach.

Overall, the project's conclusion suggests that the task we are attempting in this master's thesis is impossible, or at least infeasible. However, the hope is that by transforming the problem in the right way and performing effective feature extraction, success can be achieved.

1.2.2 Previous Master Thesis

There has also been a previous attempt to complete a master's thesis on this topic. The premise of the thesis was the same as this one; however, the students who took on this challenge came from a very different educational background.

Initially, they conducted extensive literature reviews to find out whether anyone had tried to solve a similar problem before. The only slightly similar case they found was the conversion of e-commerce descriptions, which are much longer and often contain product attributes. The next step was exploring methods such as mapping the products into a vector space. However, an encoder was never successfully developed, and one would have needed to be hand-crafted for each specific competitor.

Due to these challenges, it was decided to utilize Large Language Models (LLMs) to solve the task of identifying equivalent products. Initially, standard models were employed with little success. Consequently, they tried to fine-tune a model based on existing interchanges. This approach yielded better results but was very expensive, with costs both for training and for running the model when needed.

Although the fine-tuned model produced better results, they were still not good enough. An accuracy of 70% was achieved, which is decent, but this was on a filtered dataset that removed many OEM interchanges. Such interchanges map an arbitrary OEM number to an industrial product, with no logical relationship or pattern, so removing them gives a much higher accuracy. Another challenge was the lack of explainability; it was impossible to explain why a particular suggestion was made.

One major issue with this method was that the model always provided an answer, even if it had no idea about the relationship. This made it difficult to separate good predictions from bad ones, making validation time-consuming.
Another significant issue was that the predictions had no connection to actual product designations, so a prediction could look correct only for it to turn out that the product does not exist or belongs to a series that the company does not make. The final issue was the lack of consistency and explainability. Running the model several times on the same product could yield different results each time, making it non-deterministic. This is because of the nature of LLMs, which can produce different output given the same input. The predictions were also completely unexplainable, and you could not follow the reasoning. This inconsistency caused low trust in the model and led to doubts about all its predictions.

The conclusion from this project was that, at the time of evaluation, LLMs were not good enough and not suitable for the task. It could be that in the future, when they are easier to train and adapt to domain-specific areas and better at learning patterns, this methodology might be feasible. However, currently, this is not the case. A main takeaway is the importance of consistency and reliability, as trust in the model is crucial for its actual implementation. For these reasons, the model was never implemented or utilized again.

1.3 Aim

The aim of the project is to identify and leverage an appropriate machine learning method that can be trained on the existing database of interchanges. The trained model should be able to identify a correct interchange for a given competitor's product designation.

1.4 Limitations

Due to the large scale, varying designation systems, and number of interchanges in the database, the project will be limited to the most important competitor. However, the aim is to develop a method that can be replicated for other manufacturers, with minor differences in data preprocessing.

1.5 Problem Specification

The input to the model will consist of only a single data point: the competitor's product designation. Because of this, the project will need to be split into four dependent parts: feature extraction, encoding, model fitting, and decoding.

The main difficulties in this project will be feature splitting, encoding, and decoding. With perfect data, fitting a model would be relatively easy. However, for a machine learning model to be effective, it needs features from which it can learn patterns. Therefore, feature extraction will be a core part of the project. If you were to split the designations by their delimiters, the resulting features would contain different types of information for different designations, since suffixes sometimes appear in different positions. This happens when a product is of the standard specification of an affix group, and that affix is therefore not specified. This makes it difficult to split the affixes into good features (a small illustration is given at the end of this section).

To enable the model to understand the features and bypass these difficulties, encoding will be central. If the affixes are encoded during training and when applying the model, they need to be reinstated during the decoding step to produce a valid product designation. The difficulty here lies in making sure the affixes appear in the right order, which can be hard to achieve if their permutation has been removed in the encoding. Consequently, the core challenges will be those occurring before and after the actual machine learning model, as is often the case when applying such methods to domain-specific problems rather than theoretical ones.
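As a toy illustration of this positional misalignment (the second designation is hypothetical, constructed for the example):

```python
# Splitting two competitor designations on the "-" delimiter. In the second,
# hypothetical product the variant suffix is the standard one and is omitted,
# so the remaining suffixes shift position: "K" sits in column 3 for the
# first product but in column 2 for the second.
d1 = "23030-E1A-XL-K-M-C3".split("-")[1:]   # ['E1A', 'XL', 'K', 'M', 'C3']
d2 = "23034-XL-K-C3".split("-")[1:]         # ['XL', 'K', 'C3']
```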
1.6 Methodology

First, a broad analysis will be conducted to identify potential machine learning models and methods to apply.

Methods for feature extraction will be investigated first, to understand what the possible inputs could look like for the models that will be applied later. This will be the most critical part of the project since, if the features are well extracted, most models would likely perform well. It becomes more a question of which feature extraction limitations and trade-offs can be accepted, and of later finding a machine learning model that supports these limitations.

After identifying several different methods with various trade-offs, corresponding models will be investigated and evaluated for feasibility. Each of them has requirements for success, and it needs to be verified whether these can be fulfilled. After this study, a method will be selected, and a period of implementation will start. The first step will be to implement the feature extraction according to the selected method, as well as the encoding and decoding to and from the model. The selected model will then be trained and evaluated on a training/test split of the data.

Evaluation is more complex. While a simple accuracy or F1 score provides a score from one perspective, quality is more subjective from a business perspective. This is because you want to make references that have business relevance, which might often point to a different product than the current reference, due to reasons such as availability, profitability, or strategy. For simplicity, a traditional score will be the primary metric, applied to a validation set that the model has not been trained on.

Finally, the whole process will be refined with modifications to the feature extraction and model to achieve better results. The cases where the model fails will be studied to identify the root cause, which will be mitigated if possible.

2 Theory

2.1 Designation system

The bearing industry has many domain-specific terms. To describe the products and be specific in our language, some of these need to be understood. The name of a product is known as the product Designation. This is more than just a name; it contains a lot of dense information about the dimensions and features of the bearing. The basics of this nomenclature are referred to as the Basic Designation System. [2]

2.1.1 Basic Designation System

The basic bearing designation system describes how to interpret the Main Designation, the part of the designation describing the product type and the main dimensions of the product. In a standard bearing the principal dimensions are the inner diameter (also referred to as the bore diameter), the outer diameter, and the width. This naming convention comes from an ISO standard which is commonly used for bearings; the standard clearly defines how to name products. It also ensures that if you buy a product with the same main designation, the principal dimensions are guaranteed to be the same. This provides a high degree of interchangeability of bearings for many of the simple use cases.

According to the standard, the first number describes the product type or product family. This is the type of bearing, which could be one of many different options; each type has specific use cases and features. Some accommodate radial load only, while others support axial load, and some types even support loads in both directions. The next two numbers describe the width and outer diameter.
These are not specified directly; instead there are different dimensional series of products, where the ratio between the width and outer diameter is described relative to the bore diameter. You often talk about series such as the 222 series of bearings, which are spherical roller bearings, indicated by the first number, with a specific outer diameter and width ratio. Such a bearing series is then available in many different sizes, where size refers to the inner diameter (commonly referred to as the bore diameter). The bore diameter is specified by the last two numbers of the main designation. If you multiply this number by 5 you get the bore diameter in millimetres.

Having a system like this allows you to communicate a lot of information in a very short product name. It does however require a lot of knowledge from the customer, which is not easily acquired. What also complicates things are the many exceptions, which deviate from this standard. For example, the product type Deep Groove Ball Bearing is represented by the number 6, but in one specific dimension series you instead write 16. Many of the deep groove ball bearing dimension series are so common that numbers are excluded from the dimension series to make the designation shorter. For example, the common designation 6205 should really be 60205, but the first 0 is omitted. This is typically the same between manufacturers, but not all opt to do this omission, while other manufacturers choose to make further omissions. There are also product families that are represented by one or more letters instead of a number. This is typical for the Cylindrical Roller Bearings, where for example "N", "NU", or "NJ" is used depending on the design. [2]

2.1.2 Affixes

A suffix consists of blocks of letters and numbers that come after the main designation. Each product family has a different designation system where the suffixes and their meaning are described. The suffixes provide information about the features of the product; a certain type in a certain size has many different variants distinguished by the suffixes. Sometimes a product has prefixes; a prefix is similar to a suffix but comes before the main designation. Such prefixes could indicate the product type but also attributes of the product. A prefix is therefore similar to a suffix in many ways, but not used as often. Suffixes and prefixes can collectively be called affixes.

The affixes do not follow any form of standard, and different manufacturers use different affixes to describe the same thing. For example, sealed on both sides can be described as 2HRS by one manufacturer and as 2RSH by another. To replace a product from one manufacturer with a bearing from another you therefore need to understand what the suffixes mean, and then also what the other manufacturer calls the same thing. Suffixes describe features such as sealing, clearance, cage variants, etc. Many times these features are vital to the application and cannot be exchanged for another variant. If possible, you always opt for standard variants as they are cheaper; therefore, if you have purchased a bearing with certain features, these must be matched when finding an equivalent product. [2]

2.2 Evaluation Metric

The accuracy score is a common metric within machine learning. It is a simple but crude measure of classification accuracy, defined as:

$$\text{Accuracy} = \frac{\text{No. Correct Classifications}}{\text{No. Correct Classifications} + \text{No. Incorrect Classifications}} \tag{2.1}$$
This formula shows that the output is simply the percentage of correct classifications from the model. In a multi-label classification model, a prediction needs to contain exactly the same labels as the ground truth to be considered a correct classification; if one label is missing, incorrect, or additionally included, the classification is considered incorrect.

2.3 Classification

Classification is the task of learning relationships to classify an instance based on its input data. There are several forms of classification which could be relevant, such as binary classification, multi-class classification, or multi-label classification. From a set of data, you want to learn to predict one or more labels $\lambda$ from a set of disjoint (unique) labels $L$. We define $M$ as the number of labels in $L$. [7]

2.3.1 Single label classification

In binary classification you want to classify an instance into one of two possible classes; in other words, $M = 2$ and you want to assign a single label $\lambda$ to each instance. Similarly, in multi-class classification you want to assign each instance to a single class out of multiple options. This means that in multi-class classification $M > 2$, while you still aim to predict a single $\lambda$ for each instance. The difference between binary classification and multi-class classification is thus the number of possible classes: in binary classification it is one of two classes, while in multi-class classification it is one out of many possible classes.

Multi-class classification can be achieved by problem transformation into several binary classification problems. This means that a binary classifier can be used for multi-class classification tasks. There are many ways to do this, but the most common and simple one is one-vs-rest. Here you train one binary classifier per class, each deciding whether the instance belongs to that class or not, and the final prediction is the class whose classifier gives the most confident positive answer. If you have three classes A, B, and C, you train three binary classifiers: A or not A, B or not B, and C or not C.
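A minimal sketch of the one-vs-rest strategy, using scikit-learn's OneVsRestClassifier wrapper on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic 3-class problem; OneVsRestClassifier trains one binary SVM
# per class and predicts the class with the most confident positive vote.
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
ovr = OneVsRestClassifier(SVC())
ovr.fit(X, y)
print(ovr.predict(X[:5]))
```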
2.3.2 Multi-label Classification

Multi-label classification is a subset of classification in the machine learning domain. Classic classification tries to assign each datapoint to a single class. In multi-label classification you instead want to assign a set of labels $\lambda_j \in L$, of an unknown number $j \geq 0$, to an instance. Zero or more labels may be assigned, and it is a fully acceptable state that an instance has the empty set as its label. Most commonly the number of labels $M > 2$; if $M = 1$ the problem reduces to binary classification, since it becomes a decision between no label or one label. In simpler terms: based on the input features, you want to assign the correct tags, out of multiple options, to an instance. [7]

The difficulty partly comes from not knowing how many labels should be assigned, as this can vary between instances, but also from the very large output space.

In Neural Networks, multi-label classification is quite natural. You have an output node for each of the labels, and an activation function decides whether each label should be included in the prediction or not. The prediction is made by a single classifier, which would not be possible with a method such as SVM. This is a form of method adaptation.

If you are using classifiers such as SVM or Random Forest, which are capable of multi-class classification, you instead need a different approach for multi-label classification. Instead of adapting the method, you transform the problem into a multi-class classification task. [8]

2.3.2.1 Problem Transformation

Binary Relevance is a method that treats each label as a binary classification problem. You train $M$ separate binary classifiers, one for each label $\lambda_j$. This method does not take interdependencies between the labels into account, as you train a separate classifier for each label which predicts whether that label should be present or not. The approach is relatively simple, but the consequence is that you lose the inter-label dependencies, which can hurt the performance of the model if the labels have relationships between each other.

Classifier Chains is similar to the binary relevance method, except that here you build a chain of binary classifiers where each builds upon the decisions of the previous binary classifiers. This method takes interdependencies between labels into account to some extent; however, it will still fail to capture dependencies that are order dependent: if a label depends on the existence of another label that has not yet been classified in the chain, the dependent classifier will not have the information it needs.

Label Powerset solves the issue in a slightly different way. You keep the problem a single-label classification problem by creating a unique class for each combination of labels present in the training data. This can create a huge number of possible classes, some of which could have very little support. It also makes it impossible to predict label combinations that are not present in the training data. The method can work well when you have a limited number of labels, so that the number of classes does not become huge; otherwise it suffers from performance issues and computational complexity. (A code sketch of these strategies is given at the end of this section.)

As the problem in most cases gets broken down into binary classification problems when dealing with multi-label classification, you should consider this when choosing a classifier: something well adapted for binary classification is likely to perform well. [8]

2.3.2.2 Method Adaptation

An alternative to transforming the problem into a domain suitable for existing implementations of classification methods is to instead adapt the methods to suit multi-label classification. This can be done in many ways and will be specific to the method being adapted. The aim is to customize the method to natively handle multi-label classification with a multi-label output. Many gains can be achieved here, especially compared to the binary relevance problem transformation method, which does not take the label interdependencies into account. Examples of methods that have been adapted are Multi-label k-Nearest Neighbours, Multi-label Decision Trees, and Neural Networks with a multi-label activation layer. [8]
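As a sketch of how the problem transformation strategies can be applied in practice, assuming the scikit-multilearn library (which provides these wrappers), the same base classifier can be plugged into each strategy:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from skmultilearn.problem_transform import (BinaryRelevance, ClassifierChain,
                                            LabelPowerset)

# Toy multi-label data: 4 instances, 2 features, 3 labels encoded as a
# binary indicator matrix (one column per label).
X = np.array([[0, 1], [1, 1], [1, 0], [0, 0]])
y = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 0], [0, 0, 0]])

for Strategy in (BinaryRelevance, ClassifierChain, LabelPowerset):
    clf = Strategy(classifier=RandomForestClassifier(n_estimators=10,
                                                     random_state=0))
    clf.fit(X, y)
    # predictions come back as a sparse binary indicator matrix
    print(Strategy.__name__, clf.predict(X).toarray())
```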
2.4 RAKEL: Random k-Labelsets

RAKEL (Random k-Labelsets) is a machine learning method specifically designed for multi-label classification tasks as described in Section 2.3.2. Since it is not a separate method but instead a way to apply a normal multi-class classifier, it is a form of problem transformation.

RAKEL can in simple terms be explained as dividing all the labels into smaller sets, either overlapping or disjoint, then applying label powerset on these smaller sets and training a classifier for each. The results are then aggregated, for example through majority voting, into a final prediction, making it an ensemble method.

RAKEL is an application of label powerset, which is described in Section 2.3.2.1. Compared to regular label powerset, RAKEL does not create classes out of all the possible combinations of labels. Instead, it divides the initial set of labels $L$ into random subsets, called labelsets. Each labelset is used to create classes out of all possible combinations of its labels, which turns the problem into a multi-class classification task. If a labelset contains all of $L$, you recover the regular label powerset method. In RAKEL, the parameter $k$ specifies the size of the labelsets, meaning how many labels are included in each labelset.

With these labelsets created, the regular Label Powerset (LP) method is used on each of the subsets. This means training a given multi-class classifier on the transformed task of predicting one of the classes that represent a combination of labels. The difference from regular LP is that you do this on smaller labelsets and train multiple classifiers, instead of one classifier on a huge set of classes. RAKEL is an ensemble method, meaning that it trains several models and aggregates their output into a final prediction. There are two ways of dividing the labels into labelsets of size $k$: disjoint (RAKELd) or overlapping (RAKELo).

2.4.1 RAKELd

Using disjoint labelsets, there is no overlap in labels between the classifiers. Given the labelset size $k$, you divide the set of labels $L$ randomly into $m = \lceil M/k \rceil$ disjoint labelsets. When $M/k$ is an integer you get evenly sized labelsets; otherwise the final labelset contains the remaining $M \bmod k$ labels, i.e., a smaller labelset with the remainder. Using LP on each of the labelsets, you transform each labelset into classes of all combinations of its labels. You then train $m$ multi-class classifiers, one for each labelset. Each of the classifiers then predicts a class from its respective labelset. These predictions are aggregated by taking the union of the disjoint sets of predicted labels, returning the list of labels that makes up the final prediction. (A sketch of the partitioning step follows below.)
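A minimal sketch of the RAKELd partitioning step, with generic label names:

```python
import random

def disjoint_labelsets(labels, k, seed=0):
    """Randomly partition `labels` into ceil(M/k) disjoint labelsets of
    size k; if k does not divide M, the last set holds the M mod k
    remaining labels."""
    shuffled = labels[:]
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i:i + k] for i in range(0, len(shuffled), k)]

# M = 7 labels, k = 3 -> two labelsets of size 3 and one remainder of size 1.
print(disjoint_labelsets(["L1", "L2", "L3", "L4", "L5", "L6", "L7"], k=3))
```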
2.4.2 RAKELo

When using overlapping labelsets, the same label may occur in several labelsets. The number of possible labelsets of size $k$ is $\binom{M}{k}$. With overlapping labelsets, $m$ is an input parameter signifying the number of labelsets, with the restriction $m < \binom{M}{k}$. The $m$ labelsets are drawn from the $\binom{M}{k}$ possible ones by random sampling without replacement. This means the sets can overlap, and when $mk > M$ overlap is certain. It is not guaranteed that all labels will be included in some labelset.

A multi-class classifier is trained on each of the $m$ labelsets. As the labelsets may overlap, aggregation is not as simple as in RAKELd; the final prediction for each label is instead obtained by majority voting. The decision is taken for each label separately. First, you take all the labelsets where the label is present. For each corresponding classifier you record its decision to either include the label (value 1) or exclude it (value 0). The mean of all these decisions is then calculated to give an average vote; if this is greater than 0.5, the majority voting decision is to predict the label, otherwise the prediction is that the label is not present.

RAKEL reduces the computational complexity dramatically when the number of labels $M$ becomes large. The recommended value of $m$ is $2M$, and even though this gives many labelsets, due to combinatorics it still results in a lower computational complexity than full label powerset. RAKEL thus has several advantages over using the regular label powerset method.

The reduction in computational complexity holds when the underlying classifier is only capable of binary classification and relies on strategies such as one-vs-rest, as described in Section 2.3.1. However, decision tree classifiers are sublinear in the number of classes, meaning that their computational complexity scales better than linearly with the number of classes. If this is the case, the computational cost of using RAKEL is likely higher compared to label powerset. This means that RAKEL should preferably be combined with a classifier such as SVM, and is not suitable for random forest.

Another big advantage of RAKEL compared to label powerset is that it can predict label combinations that are not present in the training data, since new combinations of labels can be created in the ensemble stage of the method. If you have many labels, it is very unlikely that all combinations are represented in the training data, while it is likely that unseen data should be tagged using new label combinations. RAKEL also mitigates the problem of low support in many of the classes: with a huge number of classes, each representing one observed label combination, many classes will have very few occurrences. When using RAKEL, you instead get smaller labelsets which have a chance of a much higher representation in the training data, and therefore provide more support. [7]
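Assuming the scikit-multilearn library again, which ships both variants, a usage sketch (in that library, labelset_size corresponds to k and model_count to m):

```python
from sklearn.svm import SVC
from skmultilearn.ensemble import RakelD, RakelO

# RAKELd: disjoint labelsets of size k = 3, one label-powerset classifier
# per labelset, aggregated by taking the union of the predictions.
rakel_d = RakelD(base_classifier=SVC(), labelset_size=3)

# RAKELo: m = 12 overlapping labelsets of size k = 3, aggregated by
# per-label majority voting.
rakel_o = RakelO(base_classifier=SVC(), labelset_size=3, model_count=12)

# Both expose the usual interface on a binary label indicator matrix y:
#   rakel_d.fit(X_train, y_train)
#   y_pred = rakel_d.predict(X_test)
```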
A small value means that the classifier will be sensitive to noise in the data as very few points will be used for the decision. It also makes the model quite prone to overfitting as it will closely follow the pattern of the training data, on the other hand it also makes the classifier more sensitive to local patterns. On the other hand, a large k smooths out the decision boundary between classes making the classifier more robust. Because of this the hyperparameter k needs to be tuned against validation data to prevent overfitting and find a balanced k. This is often done through cross validation. A big advantage of kNN is its simplicity, it is a method which is easy to understand and reason about, while also being very simple to implement yourself. It also does not require any training phase, however it does need to calculate distances to all points given an instance. This makes it computationally expensive in datasets with a large amount of data. kNN can and has been applied in many domains, the performance if often good with regards to the simplicity of the method. Successful implementations have been seen in various domains such as: image recognition, text categorization and recommendation systems. [3] 2.6 MLkNN - Multi-Label k-Nearest Neighbors MLkNN (Multi-Label k-Nearest Neighbours) is an adaptation of the traditional k- Nearest Neighbours (kNN) algorithm tailored for multi-label classification tasks. This method leverages the principles of kNN while incorporating mechanisms to handle multiple labels simultaneously. It is the first multi-label lazy learning algo- rithm, which means that the generalization of the training data is delayed until a request of a prediction is made, unlike eager learning methods where the model is trained on the training data first, and then predictions are made. This does however mean that the computational cost will be incurred at runtime instead of training time. Meaning that the user of the model will have a higher query time, compared to other models where most of the time is spent in training, but later querying of the model will be fast. The MLkNN algorithm works similarly to regular kNN, it finds the k nearest neigh- bours to the instance in the same way, but since multiple labels needs to be predicted instead of assigning the instance to a single class, more needs to be done. The num- ber of neighbours belonging to each label is used with the maximum a posteriori principle to determine the labelset of the instance. 13 2. Theory 2.6.1 Technical Details To understand more precisely how the method works we need to dive into the more technical details of the method. We will define an instance as x ∈ X where X is the domain of all instances, and Y ⊂ Y where Y is a labelset and Y is the set of all labels. We then define yx as the category vector of x with the length equal to the number of labels |Y|, where each of the labels is represented as a 1 if the label is present and 0 if its not. We let N(x) denote the set of k nearest neighbours of x in the training set. With this we can define a membership counting vector based on the labelsets of the neighbours as: Cx,l = ∑ a∈N(x) ya,l This counts the number of neighbours of the instance x which has each label l. We pick a test instance t ∈ X . Then let H l 1 be the event that t has label l and H l 0 be the opposite that t does not have label l. Then El j be the event that, among the kNN of t, there are j instances which have the label l. 
2.6 MLkNN - Multi-Label k-Nearest Neighbors

MLkNN (Multi-Label k-Nearest Neighbours) is an adaptation of the traditional k-Nearest Neighbours (kNN) algorithm tailored for multi-label classification tasks. The method leverages the principles of kNN while incorporating mechanisms to handle multiple labels simultaneously. It is the first multi-label lazy learning algorithm, which means that the generalization of the training data is delayed until a prediction is requested, unlike eager learning methods where the model is trained on the training data first and predictions are made afterwards. This does, however, mean that the computational cost is incurred at query time instead of training time: the user of the model will experience a higher query time, compared to other models where most of the time is spent in training and later querying is fast.

The MLkNN algorithm works similarly to regular kNN: it finds the k nearest neighbours of the instance in the same way, but since multiple labels need to be predicted instead of assigning the instance to a single class, more needs to be done. The number of neighbours carrying each label is used with the maximum a posteriori principle to determine the labelset of the instance.

2.6.1 Technical Details

To understand more precisely how the method works, we need to dive into the technical details. We define an instance as $x \in \mathcal{X}$, where $\mathcal{X}$ is the domain of all instances, and a labelset as $Y \subseteq \mathcal{Y}$, where $\mathcal{Y}$ is the set of all labels. We then define $\vec{y}_x$ as the category vector of $x$, with length equal to the number of labels $|\mathcal{Y}|$, where each label is represented as a 1 if the label is present and 0 if it is not. We let $N(x)$ denote the set of $k$ nearest neighbours of $x$ in the training set. With this we can define a membership counting vector based on the labelsets of the neighbours:

$$C_{x,l} = \sum_{a \in N(x)} y_{a,l}$$

This counts the number of neighbours of the instance $x$ that have each label $l$. We pick a test instance $t \in \mathcal{X}$. Let $H_1^l$ be the event that $t$ has label $l$, and $H_0^l$ the opposite, that $t$ does not have label $l$. Let $E_j^l$ be the event that, among the k nearest neighbours of $t$, there are exactly $j$ instances which have the label $l$. Using the maximum a posteriori principle, the category vector $\vec{y}_t$ of the test instance can be determined as:

$$y_{t,l} = \arg\max_{b \in \{0,1\}} P\!\left(H_b^l \mid E_{C_{t,l}}^l\right)$$

$E_{C_{t,l}}^l$ is the event that the observed count of neighbours carrying label $l$ is exactly $C_{t,l}$. Given this, we obtain the conditional probability of the label indicator $b$ being either 1 or 0, and we determine whether $b = 1$ or $b = 0$ maximizes the probability, i.e., whether it is more probable that the label is present or not. With the goal of the calculation clarified and the notation explained, Bayes' rule can now be used to rewrite the equation into something we can calculate:

$$y_{t,l} = \arg\max_{b \in \{0,1\}} \frac{P\!\left(H_b^l\right) P\!\left(E_{C_{t,l}}^l \mid H_b^l\right)}{P\!\left(E_{C_{t,l}}^l\right)} = \arg\max_{b \in \{0,1\}} P\!\left(H_b^l\right) P\!\left(E_{C_{t,l}}^l \mid H_b^l\right)$$

where the denominator can be dropped since it does not depend on $b$. This shows that all that is needed are the prior probabilities $P(H_b^l)$ and the posterior probabilities $P(E_{C_{t,l}}^l \mid H_b^l)$, which can be directly estimated from the training set by frequency counting. For the exact details of how this is done and the algorithm used, refer to the paper introducing the method.

MLkNN has two main hyperparameters: the number of neighbours k and the distance metric. Exactly as in regular kNN, k is the number of neighbours of an instance to take into consideration when making a prediction. Similarly to kNN, several distance metrics can be chosen, but the most common is the regular Euclidean distance. Here, too, it is common practice to use cross validation to determine the most suitable k for the dataset.

In the paper introducing the method, it was compared to several other multi-label classifiers and shown to outperform them on three different classification tasks. It was also seen that the model was relatively insensitive to the parameter k. [12]
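A usage sketch, assuming the MLkNN implementation in the scikit-multilearn library; it follows the usual fit/predict interface on a binary label matrix, and k can be tuned by cross validation as noted above:

```python
import numpy as np
from skmultilearn.adapt import MLkNN

# Toy multi-label data: 4 instances, 2 features, 3 labels.
X = np.array([[0.0, 1.0], [1.0, 1.0], [1.0, 0.0], [0.0, 0.0]])
y = np.array([[1, 0, 1], [1, 1, 0], [0, 1, 0], [0, 0, 0]])

clf = MLkNN(k=2)   # k nearest neighbours; Euclidean distance by default
clf.fit(X, y)
print(clf.predict(X).toarray())         # hard 0/1 label predictions
print(clf.predict_proba(X).toarray())   # per-label probabilities
```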
2.7 Random Forests

Random forest is a machine learning method that is commonly used for both classification and regression. It is an ensemble method which constructs multiple decision trees during training and combines their outputs to make a final prediction, which improves accuracy and reduces overfitting.

Random forest is based around binary decision trees. Such a tree is built up by several binary decisions, splitting each node into two branches. At each node the model is given a random subset of features which it can use for determining the split. The reason for splitting on a random subset of features is that if one feature is a very strong predictor for the target, that feature would be selected in many of the trees, causing them to become correlated. By randomly selecting a subset of features you avoid this and thereby get higher accuracy. The number of features considered in the subset for each split is a parameter of the model; if there are $p$ input features, then for a classification task you usually use $\sqrt{p}$ randomly selected features in each subset. This hyperparameter can be tuned to become optimal for the given problem. The split itself at each node, based on the random subset of features, uses either information gain or Gini impurity to determine the optimal cut point. The whole idea of the random forest model is to build several of these binary decision trees, where each split is determined based on a random subset of features; this is what creates a forest of many trees.

Random forest is an ensemble method using bagging (bootstrap aggregating), which means that bootstrapping of samples is used to generate several training sample sets for the different decision trees, whose outputs are then aggregated into a final decision. Bootstrapping works by generating a new sample set from the existing dataset by sampling with replacement. This means that you draw a given number of random samples, where it is possible to draw the same row multiple times. These samples taken from the original dataset are called bootstrap samples. The idea is to artificially generate more sample data than is originally available. Each of the samples generated using bootstrapping is then used to train a decision tree classifier which can classify an instance. The aggregation part comes from aggregating each of the classifiers' decisions into a final decision. For classification this is done by taking the mode of the outputs, meaning performing majority voting and taking the most commonly predicted class. For regression, you typically take the average of all the outputs from the trees in the forest. Figure 2.1 describes the bagging process visually.

Figure 2.1: A simple illustration showing the basics of how random forest works.

Random forest offers several advantages, such as being resilient to overfitting compared to using a single decision tree; it is also relatively insensitive to its hyperparameters. Additional benefits are that it can handle both classification and regression tasks, as well as providing direct measures of feature importance. There are some drawbacks to the method, such as the computational demand of training several trees instead of only one. You also lose some interpretability of the model when using an ensemble of many trees, but this can instead be provided by looking at the feature importances, which give some explanation.

In summary, Random Forest is a powerful and flexible machine learning method that leverages the strengths of multiple decision trees, each using random subsets of features, where each tree's prediction is aggregated to make the final prediction. This in general provides robust and accurate predictions which are resilient against overfitting and not sensitive to the hyperparameters. [10]
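Notably, scikit-learn's RandomForestClassifier accepts a binary label indicator matrix directly, so it can act as a native multi-label classifier as well as serve as the base classifier inside the problem transformation strategies of Section 2.3.2.1. A minimal sketch on synthetic data:

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic multi-label data: y is a binary indicator matrix (n_samples x n_labels).
X, y = make_multilabel_classification(n_samples=500, n_features=20,
                                      n_classes=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# max_features="sqrt" gives the sqrt(p) random feature subset per split.
rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
rf.fit(X_tr, y_tr)

# On indicator matrices, accuracy_score is the exact-match accuracy of
# equation (2.1): all labels must be correct for a row to count as correct.
print(accuracy_score(y_te, rf.predict(X_te)))
print(rf.feature_importances_[:5])   # built-in feature importance measure
```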
By adjusting the weights, the aim is to reduce the error as much as possible to increase the networks performance. If the weights are adjusted too much to the training data you can get issues with 16 2. Theory overfitting, which is why you control this against unseen validation data. There are several different kinds of neural networks which vary in architecture. The most common and standard one is feed forward neural networks which only has nodes that connect to nodes in the next layer, a variation of this is convolutional neural networks, which are common in analysing grid-based data such as images. Here a convolution using a kernel is utilized to combine each cell in the grid with the ones around it and summaries this into one output, where each convolution takes the dimensionality down. There are several other variations, but these are the most basic and commonly used. Neural networks are very powerful and can fit to complex data, but they have certain limitations. They require very large amounts of data and are very computationally demanding to train. They can also be very prone to overfitting such that the model performs very well on the training data but not so much on new unseen data, but there are several techniques that can be utilized to mitigate this. In summary, Neural networks are a common technique used in machine learning which can be applied to many different problems and often generate great results. However, it requires a lot of data to be successful and is very computationally demanding to train. [9] 2.9 Support Vector Machines (SVM) Support Vector Machines (SVM) is a machine learning method that is widely used for both classification and regression tasks. SVM is built around finding the optimal hyperplane that separates data points of 2 classes with the maximum margin. The main idea in SVM is to find a linear separator between two binary classes using a hyperplane that separates these classes. To do this, SVM maximizes the margin between the hyperplane and support vectors. The closest datapoints of the two classes to the decision boundary are what is known as support vectors, and the distance between these and the hyperplane is called margin. The method tries to place the hyperplane in a position so that it has the most margin to both classes support vectors In a lot of datasets, the data is not possible to linearly separate with a hyperplane, therefore SVM uses kernels to transform the feature space into a space of higher dimensionality where the classes are hopefully linearly separable. This use of kernels allows the method to handle non-linear relationships since the hyperplane in the higher dimensionality space can be non-linear in the original feature space. SVM does not handle multi-class classification natively, the hyperplane naturally will separate binary classes. This means that some strategy for multi-class classification needs to be used. The most common one is to use one-vs-rest as described in 2.3.1 The choice of kernel function is an important parameter in SVM. Some kernels that are common in SVM are: linear kernel, polynomial kernel, and radial basis function (RBF). When using a linear kernel, the feature space is not transformed into higher dimensionality and a hyperplane is found in the original feature space. This generally only works well when the data is linearly separable. However, the non-linear kernels allow SVM to learn non-linear relationships which makes it useful for a wide range of classification tasks 17 2. 
2.10 One-Hot Encoding and Multi-Label Binarizer

One-hot encoding is a method for preprocessing data to prepare it for a machine learning model to be trained on. The method converts categorical variables into a binary format by splitting the categorical value into several binary features. Each category is converted into an individual feature; this feature then takes a binary value of either 0, if it is not the class of the data record, or 1 if the class is present. Table 2.1 shows some example data, and in Table 2.2 the same data is one-hot encoded to demonstrate the method.

Table 2.1: Fruit, Categorical Value, and Price

  Fruit    Categorical value of fruit   Price
  apple    1                            5
  mango    2                            10
  apple    1                            15
  orange   3                            20

Table 2.2: One-Hot Encoded Fruits and Prices

  Fruit_apple   Fruit_mango   Fruit_orange   Price
  1             0             0              5
  0             1             0              10
  1             0             0              15
  0             0             1              20

The benefit of one-hot encoding is that the method eliminates ordinality, meaning the order of categories. Many categories, such as fruit, have no inherent order, yet in the categorical encoding seen in Table 2.1 the categorical value of fruit is an increasing series of numbers. A model does not know whether a column represents numbers that are ordered or not, so this could be misinterpreted during training and bias the model towards trends that do not exist. By one-hot encoding this into binary variables we eliminate the ordinality, at the sacrifice of higher dimensionality. [1]

One-hot encoding can be applied when you have a multi-class categorical variable; however, if you have a multi-label categorical feature, another method is required. A very similar method, based on one-hot encoding, is multi-label binarization. In the same way, it finds all the unique classes and creates binary features from them; the only difference is that each record can have several labels or classes assigned. The result is that several of the columns can have a value of 1, whereas in one-hot encoding they are mutually exclusive, meaning that if one column has a value of 1 you can deduce that all the others have the value 0. An example of data that is preprocessed using this method can be seen in Table 2.3, with the encoded result in Table 2.4. [5]

Table 2.3: Genres and Ratings

  Genre                 Rating
  {Sci-fi, Thriller}    6
  Thriller              8
  Sci-fi                7
  {Thriller, Comedy}    6

Table 2.4: One-Hot Encoded Genres and Ratings

  Genre_scifi   Genre_thriller   Genre_comedy   Rating
  1             1                0              6
  0             1                0              8
  1             0                0              7
  0             1                1              6

What is important to note when using multi-label binarization is that the order of the categories is lost.
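Both encodings are available off the shelf; a sketch reproducing the tables above with pandas and scikit-learn's MultiLabelBinarizer:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# One-hot encoding of the fruit column (Tables 2.1 -> 2.2).
fruits = pd.DataFrame({"Fruit": ["apple", "mango", "apple", "orange"],
                       "Price": [5, 10, 15, 20]})
print(pd.get_dummies(fruits, columns=["Fruit"]))

# Multi-label binarization of the genre sets (Tables 2.3 -> 2.4). Several
# columns may be 1 in the same row, and the classes come out in sorted
# order -- the original order of the categories is lost.
genres = [{"Sci-fi", "Thriller"}, {"Thriller"}, {"Sci-fi"}, {"Thriller", "Comedy"}]
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(mlb.fit_transform(genres), columns=mlb.classes_)
encoded["Rating"] = [6, 8, 7, 6]
print(encoded)
```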
3 Implementation

The implementation consists of two phases: first, a phase where the pipeline is built up. This includes developing a good feature extractor as well as an encoding of the data to suit a machine learning model. It also includes implementing a basic model which can make affix predictions, as well as a decoder which makes the final prediction of an equivalent designation to the input product. The second phase involves the exploration and implementation of several machine learning models and methods, which will be evaluated and compared to each other. The first phase enables this, since all the surrounding building blocks will be done; different models and methods can then be swapped in place, since the rest of the infrastructure surrounding the model is set.

3.1 Overview of the Solution

An overview of the larger building blocks of the solution can be seen in Figure 3.1. The solution consists of several steps. First, a preprocessing step where features are extracted from the input string, which is the competitor designation. A similar feature extraction is performed to extract the labels of the target. This data is then encoded to make the format suitable for a machine learning model. The features are used to predict the affixes (labels) of the equivalent product, which is the task the machine learning model is trained to perform through multi-label classification.

Figure 3.1: Schematic describing the larger building blocks of the pipeline

After the model has been trained and can predict affixes, these need to be decoded into a real designation of the equivalent product. This is done by parsing the main designation from the competitor and assuming it will be the same in the equivalent product. The predicted affixes are then utilized to find the right bearing, of the right size, with the right features. Through this decoding step, a final prediction of an equivalent product is determined, which is then validated against the true value in a test set to determine the performance of the overall process.

By studying the test cases and identifying patterns where the process fails, areas of improvement in the pre- or postprocessing of the data are then identified. These will be investigated in the hope of improving the outcome of the pipeline.

After this initial phase of development of the pipeline, the model evaluation will begin. Initially, the only model used will be random forest with the binary relevance method to handle the multi-label classification. This is done for simplicity and to keep the results easy to compare when making pipeline alterations. As seen in Figure 3.1, the model is its own block and is therefore replaceable, meaning that different multi-label classifiers can be swapped in its place. In this stage no alterations will be made to the pipeline.

3.1.1 Mathematical Description of the Problem

To bring clarity and to clearly define the problem, a more mathematical definition is presented here. We define a competitor product as c ∈ C, where C is the set of all competitor products, and a target product as t ∈ T, where T is the set of all target products. Important to note here is that these sets will not be equal in size: some competitor products have no target equivalent, and vice versa. We define P : S → T, where S ⊆ C is the subset of competitor products for which a target equivalent exists. P is therefore the map between a competitor product and the equivalent target product, representing the pipeline for finding interchanges.
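Typeset, the definition above reads:

```latex
P \colon S \longrightarrow T , \qquad S \subseteq C ,
\qquad P(c) = t \iff t \text{ is the target equivalent of } c .
```

Learning an approximation of P from the known (c, t) reference pairs is the task of the rest of this chapter.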
3.2 Pipeline Implementation

In this section, the implementation and development of the pipeline for learning the function P (predicting interchanges) is described. The focus is on the pipeline itself and not the machine learning model; therefore, the model will not be tweaked. The model used during this development is Random Forest with Binary Relevance; initial testing has shown that the model is quite capable and can be used with confidence.

3.2.1 Preprocessing, Feature Extraction, and Encoding

The structure of the given data is simple: as we have decided to handle a single brand, we are left with two columns, the competitor's designation and the corresponding target designation. While this is enough for a user, it is not possible to train a machine learning model on these direct relationships. Therefore, feature extraction needs to be performed as a first step of the process.

                   Competitor             Target
Designation        23030-E1A-XL-K-M-C3    23030 CCK/C3W33

As described in Section 2.1, the product designation contains a lot of detail about the specification of the product and its features. These features need to be extracted so that the model can be trained on them. As the competitor and target manufacturer have different designation systems for the suffixes, the feature extraction differs between them and needs to be treated separately.

The feature extraction of the competitor designation is easy due to the nature of their designation system. Each prefix or suffix is separated by a delimiter; this used to be "." but has since changed to "-". Both are used as delimiters, splitting the designation into parts. The main designation also needs to be extracted; this is done by extracting terms of more than four digits in a row. With this logic the main designation can be extracted, which leaves a list of suffixes delimited by "." or "-" that is split into an array.

The feature extraction of the target designation is more complex. Since this designation system does not delimit each suffix by a character, the suffixes come as one compressed string, from which each suffix needs to be extracted for maximal detail. To achieve this, an already existing list of suffixes is utilized, and the designation is matched against this list of phrases. To ensure the right match, a longest-match search is performed: from left to right, we search for the longest phrase in the list that matches at the current position. Once a match has been found, the index is moved past it and the search continues along the rest of the string until the end is reached. Provided that the suffix phrase list is complete, this leaves an array of the suffixes contained in the designation. The reason this approach works is that if taking the longest matching phrase were not valid, the designations could be interpreted in several ways, which would cause issues and confusion. Because of the underlying designation system, this behaviour is guaranteed.

                   Competitor             Target
Designation        23030-E1A-XL-K-M-C3    23030 CCK/C3W33
Suffix List        [E1A, XL, K, M, C3]    [CC, K, C3, W33]
Main Designation   23030                  23030

With this feature extraction we can now begin the encoding of the data. A big problem is the permutation of the features: the position of suffixes meaning the same thing can vary. If we were to split the suffix list into columns, we could get suffixes meaning the same thing in different columns, and matching this is extremely difficult, as it would require each column to specify each of the options that it could contain. This requires a huge amount of effort and is not desired.
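A minimal sketch of the two extraction steps might look as follows; the helper names and the small suffix phrase list are hypothetical, not the thesis implementation:

```python
# A minimal sketch of the two feature extractors, under stated assumptions.
import re

def extract_competitor(designation: str):
    """Split on '.' or '-'; the main designation is a run of >4 digits."""
    parts = re.split(r"[.\-]", designation)
    main = next(p for p in parts if re.fullmatch(r"\d{5,}", p))
    suffixes = [p for p in parts if p != main]
    return main, suffixes

def extract_target(designation: str, known_suffixes: list[str]):
    """Left-to-right longest-match parse of the compressed suffix string."""
    main, _, rest = designation.partition(" ")
    rest = rest.replace("/", "")  # assumption: '/' is only a visual separator
    phrases = sorted(known_suffixes, key=len, reverse=True)  # longest first
    suffixes, i = [], 0
    while i < len(rest):
        match = next((p for p in phrases if rest.startswith(p, i)), None)
        if match is None:          # unknown character: skip it
            i += 1
            continue
        suffixes.append(match)
        i += len(match)            # reset the index past the matched phrase
    return main, suffixes

print(extract_competitor("23030-E1A-XL-K-M-C3"))
# -> ('23030', ['E1A', 'XL', 'K', 'M', 'C3'])
print(extract_target("23030 CCK/C3W33", ["CC", "K", "C3", "W33", "C", "W"]))
# -> ('23030', ['CC', 'K', 'C3', 'W33'])
```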
The main complexity of this project lies in how to handle the features in a way which allows a model to be successfully trained on the data. The key to solving this is to encode each suffix as its own binary categorical feature. The net list of suffix features can be extracted as all the suffixes occurring in the dataset; this can be done for both the competitor and the target. Practically, this is done by adding all suffixes from each of the suffix lists to a set; since the set guarantees uniqueness, we are left with a list of all suffixes present. This can be seen as a form of one-hot encoding. The main gain of this is making the model permutation invariant, thereby ensuring that, regardless of order and position, the model will be able to separate the features. The drawback is that you lose the permutation of the prediction as well; making a simplification in one place almost always means a trade-off in another. In Section 3.2.4 the process for resolving this loss of permutation is described.

3.2.2 Evaluation

To evaluate the models, the data needs to be split into a training set and a validation set. The chosen split is 70% training data and 30% validation data, which is a slightly larger validation set than what is common practice in the machine learning domain (80/20). The training data is used to fit the model, and, to reduce overfitting, the validation data is utilized to determine performance. This way, the model is tested on how well it generalizes to unseen data; otherwise, it would be rewarded for overfitting, giving great accuracy but sacrificing performance on new, unseen data. The validation data is therefore used for testing the model and evaluating its performance. The model is first fitted to the training data, and then the validation input is used to get predictions from the model. These predictions are compared to the true value of the target.

3.2.2.1 Affix Conversion Accuracy

The machine learning models output the set of affixes predicted to be equivalent to the affixes from the competitor. Since this is the only conversion the model itself does, it can only be evaluated based on the accuracy of the affix predictions. A prediction is considered correct only if the predicted set and the true set are identical; if an affix is missing or an additional affix is present, the prediction is considered incorrect. This gives us a metric of how well the model performs.
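This exact-match criterion is what scikit-learn computes when accuracy_score is applied to multi-label indicator matrices; a minimal sketch (not the thesis code):

```python
# A minimal sketch of the exact-match affix accuracy, assuming predictions
# and truths are binary indicator matrices as produced by MultiLabelBinarizer.
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0]])

# For multi-label targets, accuracy_score is the subset accuracy: a row
# counts only if every label matches exactly.
print(accuracy_score(y_true, y_pred))  # 2 of 3 rows match -> 0.666...
```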
3.2.2.2 Designation Conversion Accuracy

The true desired output is the designation equivalent to the competitor product. The prerequisite for this is a good affix conversion; if the affix conversion is not correct, you cannot expect a correct designation conversion. The affix accuracy therefore places an upper limit on how good the designation accuracy can become. The designation accuracy metric defines a correct conversion as one that predicts exactly the designation listed as the true value in the validation data; any deviation from this is considered a false prediction. There are certainly cases where the existing reference is not optimal and a better interchange exists that provides a closer or equally good equivalent. However, this is very hard to measure, so accuracy will be the metric of choice.

3.2.2.3 Incorrect Prediction Rate

When utilizing the model for generating interchanges, you will always need to validate the predictions manually. When validating predictions, you would rather see an empty output when the model is not certain than have all lines populated and have to discern the correct from the incorrect. If most of the predictions are correct, the validation becomes much easier, and you build more trust towards the model. Because of this, we want to minimize the rate of incorrect predictions and rather accept more empty outputs. To measure this, we introduce the incorrect prediction rate, which is the rate of non-blank predictions that are incorrect. There is of course a balance between having high accuracy and a low incorrect prediction rate, since you do not want a model which never makes a prediction.

3.2.3 Initial Result

Now that the data is extracted into well-defined features, the model can be trained. Since we, at this stage, are not interested in comparing models, Random Forest using binary relevance is implemented. This gives a base measure of performance that can be used to tweak the pipeline. A random forest model was set up without any additional tuning or work; the default parameters were used except for setting a random state to ensure consistency between runs. Since random forest is robust to changes in its hyperparameters and configuration, and since the implementation in scikit-learn supports binary relevance by default, it is very easy to implement. The result from a random forest model without any tuning, on the first try, was an affix prediction accuracy of 0.69.

3.2.4 Decoding

The model predicts which affixes should be present in the conversion, but this needs to be transformed into an actual product designation. First, the encoded data is decoded back into a set of affixes, which are then used to identify a product. This is done by searching a database of all existing products to find the best match.

3.2.4.1 Initial Search and Filtering

Given that the number of target products exceeds 700,000, the search must be conducted efficiently to reduce processing time. This ensures feasibility when processing and decoding many predictions. The final accuracy is based on identifying the correct product, not just the suffix list, even though the model itself is evaluated on suffix list conversion. To address this computational complexity concern, a search is performed to find designations containing all the suffixes and the main designation of the competitor product. This introduces a limitation: only cases where the main designation is consistent between the competitor and the target manufacturer will yield good results. While this generally holds true, there are exceptions. This trade-off enhances consistency and reliability but limits the ability to predict certain product types.

3.2.4.2 Evaluation and Matching Algorithm

After filtering, the resulting list of possible target designations varies in length. For uncommon types, it is usually fewer than 10, while for common product types and sizes it can be hundreds, especially when the affix list is short. These need to be evaluated to determine the final prediction. An algorithm is used to assess how much of each candidate designation is covered by the predicted phrases. The algorithm finds all matching designations and evaluates each to compute a match score: the sum of the lengths of the affixes and the main designation, divided by the length of the candidate designation.
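Under that definition the score can be computed as follows; a minimal sketch with a hypothetical helper, assuming separators such as spaces and "/" are stripped from the candidate before measuring its length:

```python
# A minimal sketch of the match score; hypothetical helper, not the thesis code.
def match_score(candidate: str, main: str, affixes: list[str]) -> float:
    """Fraction of the candidate designation covered by main + affixes."""
    covered = len(main) + sum(len(a) for a in affixes)
    # Assumption: spaces and '/' are visual separators and do not count.
    return covered / len(candidate.replace(" ", "").replace("/", ""))

# '23030' plus ['CC', 'K', 'C3', 'W33'] covers all of '23030 CCK/C3W33'.
print(match_score("23030 CCK/C3W33", "23030", ["CC", "K", "C3", "W33"]))  # 1.0
```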
The designation with the best match score is chosen. Initially, this process was slow, taking hours to complete, as the search was performed against the entire database for each prediction.

3.2.4.3 Algorithm Optimization

By investigating cases where the suffixes were correctly identified but the correct designation was not found, issues with the algorithm were identified. Many super precision bearings had problems due to having many short suffixes which are sometimes contained in each other. The problem was that the same suffix could match multiple times. For example, if the suffixes are P4A and A, then A could match twice, theoretically resulting in a match score exceeding 100%. To fix this, affixes are sorted in descending order of length to avoid matching a substring of another affix, and once a match is identified, the matched substring is removed from the designation being searched to prevent multiple matches. This resolved the errors and significantly improved accuracy, approaching the upper limit set by the affix conversion accuracy.

3.2.4.4 Speed Optimization

While effective, the method was slow. Validating the test set took hours, which was impractical for constant iteration and too slow for users converting long lists of competitor designations. Initial speed optimizations reduced the runtime to 15 minutes, but this was still too slow. To resolve this, the list of all designations was pre-processed to extract the main designation and a list of suffixes, using the same preprocessing as for the training data, and the affix list was converted to a set. The search for matching target designations was then rewritten to run against this pre-processed data. The input is a main designation from the competitor and a set of predicted affixes. First, the main designation is compared to all target main designations, retaining only the matching lines. If the affix list is empty, only a designation without suffixes but with a matching main designation is returned, which can match at most one product. If there are affixes, a set comparison checks whether the predicted affix set is a subset of the target set, retaining only those matches. This results in a list of designations where the main designation matches and the predicted suffixes are a subset of the extracted suffixes. Extracting the affixes into a set eliminates the problem of multiple matches on the same part of the string. Since all remaining designations contain the same affixes, the shortest remaining designation will always have the best match ratio according to the verified logic. This preprocessing and filtering dramatically reduce processing time and improve the logic's explainability and interpretability. This is the final logic used for finding a product designation from a competitor's main designation and a list of predicted affixes.
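Put together, the final lookup reduces to a filter and a subset test; a minimal sketch with hypothetical names, assuming the catalogue has been pre-processed into (designation, main designation, suffix set) triples:

```python
# A minimal sketch of the optimized lookup; not the thesis implementation.
def find_interchange(main: str, predicted: set[str], catalogue):
    """catalogue: iterable of (designation, main_designation, suffix_set)."""
    # Keep only designations whose main designation matches.
    candidates = [(d, s) for d, m, s in catalogue if m == main]
    if not predicted:
        # No predicted affixes: only a suffix-free designation qualifies.
        matches = [d for d, s in candidates if not s]
    else:
        # Set comparison: the predicted affixes must be a subset.
        matches = [d for d, s in candidates if predicted <= s]
    # Among the remaining candidates, the shortest designation has the best
    # match ratio, since they all contain the same predicted affixes.
    return min(matches, key=len) if matches else None

catalogue = [("23030 CCK/C3W33", "23030", {"CC", "K", "C3", "W33"}),
             ("23030 CC/W33",    "23030", {"CC", "W33"})]
print(find_interchange("23030", {"CC", "W33"}, catalogue))  # 23030 CC/W33
```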
3.2.5 Pipeline Improvements

The initial performance of the random forest model was quite good, even without tuning and using the default hyperparameters. However, this is not the model's full potential, which is why fine-tuning was performed. Studying the cases where the model failed to make the correct prediction provided insights into areas for improvement. From this, several fine-tuning attempts were made.

3.2.5.1 Cleaning of OEM Numbers

Original Equipment Manufacturer (OEM) interchanges are references from a part number which does not carry any meaning or feature descriptions; it is simply an arbitrary number that specifies a product. Therefore, there is no pattern that can be used to predict the correct target designation, and the OEM references are impossible to predict. The hypothesis is that their presence in the training data confuses the model and causes incorrect patterns to be learned. To explore their impact, they were removed from the training set to prevent negative effects on the model. The OEM number references were filtered out by removing all references that do not have the same main designation between the competitor and target product. This consequently also removed some other references, such as housings, since the main designation does not always match there either. This somewhat negatively affects these types of products, but they are unlikely to match either way, since our decoding of suffixes into a designation relies on the main designation matching as a prerequisite.

The result was surprising. The training and validation sets remained the same, except that the training set was filtered to remove non-matching main designations; this kept the results as comparable as possible. The accuracy dropped slightly, and false predictions increased slightly. The idea was therefore dropped, as it was shown that OEM numbers were not negatively affecting the model. The impact on validation remains: it will never be possible to reach 100% accuracy, given that OEM numbers in the validation set are impossible to predict.

3.2.5.2 Adding Features Extracted from the Designation

The first improvement involved including the first three characters of the competitor designation as a new feature. In many cases, this provides valuable input to the model, as the first three characters usually specify the product family. This determines which suffixes to choose from and how they should be interpreted; for different product families, the same suffix can mean different things. Therefore, if the model knows the product family, it will have an easier time predicting the correct suffixes. Sometimes a product has prefixes, or is of a non-standard type, and the first three characters will then not indicate the family; however, they still provide additional information. The first three characters strike a good balance between having many occurrences of each value and still providing enough detail.

In some product families, suffixes change depending on size. For example, in the Spherical Roller Bearings, a design by the target manufacturer indicated by the suffix CC/W33 was replaced by a new product design labelled E. This means that the old design is no longer produced or sold and the newer design is preferred instead. However, this change was only made for smaller sizes, with an inner diameter under a certain threshold. Among the failed cases, several instances were identified where the suggestion was CC/W33 but the true suffix was E, or vice versa. The model should not be explicitly told this rule, but it can be given the information needed to learn it. Therefore, a new feature corresponding to the size of the bearing was added. As described in Section 2.1, the last two digits of the main designation describe the inner diameter of the bearing. By extracting these digits and making them a new feature, the model gains additional useful data. This splits the main designation into semantically meaningful parts and decodes the domain semantics into features useful for the model. This is not valid for all references but provides enough information to be useful.
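Both features are cheap to compute; a minimal sketch with a hypothetical helper, not the thesis code:

```python
# A minimal sketch of the two added features.
import re

def extra_features(designation: str) -> dict:
    """Two extra inputs: the first three characters (f3) and the last two
    digits of the main designation (a size indicator)."""
    main = re.search(r"\d{5,}", designation).group()  # main designation
    return {"f3": designation[:3], "size": main[-2:]}

print(extra_features("23030-E1A-XL-K-M-C3"))  # {'f3': '230', 'size': '30'}
```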
The outcome was a major leap in performance. Accuracy increased significantly, first from adding the first three characters and then further from adding the size. Since the increase was significant, these two new features were added to the model. Features do not need to carry humanly meaningful information or be perfect and exact; they simply need to carry data that can separate the classes. This explains why the added features can still improve performance.

3.2.5.3 Refining the List of Suffixes

It was also observed that some target suffixes were not properly extracted from the designation. This indicated that either the parser was incorrect, or the proper term was missing from the list of suffixes. In all investigated cases, the root cause was the term missing from the list. Resolving this and finding a more complete list of prefixes and suffixes is likely to improve the model, as it provides a more accurate representation of the features. What resulted, however, was a significant decrease in the model's affix conversion accuracy together with a significant increase in designation conversion accuracy. This is not entirely unexpected, as the old table of affixes did not contain all affixes, leading to incomplete extraction and fewer possible target labels. For example, the suffixes VA904 and VA905 were not in the original list and were extracted as simply V. A model finds it easier to predict that the affix contains V than to predict the exact suffix. With several such cases, the suffix conversion accuracy decreases. However, the prediction of designations improves with a more complete list: if searching for something containing just V, either VA904 or VA905 could be a possible match, but only one is correct.

To summarize, with greater detail in the affixes it is easier to find the right designation given a correct affix conversion, but harder to make the right affix conversion. Predicting affixes is not enough if they are incomplete, which is why it is important to feed the model good input data in the form of target affixes.

3.3 Model Exploration and Comparison

Now that the pipeline is in place, the various options for machine learning models and methods will be explored. As explained in Section 2.3.2, there are a variety of options: either adapting the machine learning model to handle multi-label classification, or, as has been done up until now, transforming the problem into one or more single-label classification problems. When evaluating models and methods, both of these will be explored. The methods and models that will be explored are the following:

Problem Transformation
• Random Forest with Binary Relevance (what has been used up until now)
• Random Forest with Label Powerset
• SVM with Binary Relevance
• Linear SVM with RAKELo
• Linear SVM with RAKELd
• Radial Basis Function (RBF) kernel SVM with RAKELo
• Radial Basis Function (RBF) kernel SVM with RAKELd
• kNN with Binary Relevance

Model Adaptation
• Multi-Label k-Nearest Neighbors (ML-kNN)
• Neural Network with Sigmoid Activation Function

Each of the different models will need some basic hyperparameter tuning. A limitation in this exploration of methods is the limited hyperparameter tuning for each of the models. This might result in not all models getting a fair chance, but performing a grid search for each method would be too computationally expensive without proper compute resources.
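Constructing these combinations is mechanical once the base classifiers are chosen; a minimal sketch using scikit-learn and scikit-multilearn, with illustrative hyperparameter values rather than the settings used in this thesis:

```python
# A minimal sketch of how the strategies wrap a base classifier; labelset_size
# and k are illustrative values, not the thesis settings. Note that
# scikit-multilearn is no longer maintained, so dependency versions may matter.
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from skmultilearn.problem_transform import BinaryRelevance, LabelPowerset
from skmultilearn.ensemble import RakelD
from skmultilearn.adapt import MLkNN

models = {
    "RF + Binary Relevance": BinaryRelevance(classifier=RandomForestClassifier()),
    "RF + Label Powerset":   LabelPowerset(classifier=RandomForestClassifier()),
    "Linear SVM + RAKELd":   RakelD(base_classifier=LinearSVC(), labelset_size=3),
    "ML-kNN":                MLkNN(k=10),
}
# Each wrapper exposes the same fit/predict interface on (X, Y), with Y a
# binary label-indicator matrix, so the models can be swapped into the pipeline.
```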
The basic hyperparameter tuning will be performed on the affix conversion accuracy only, but the final models will be compared based on both affix conversion accuracy and final designation accuracy. The number of incorrect predictions will also be presented, but it might not be as relevant to compare between the models, since it is mostly an effect of the decoding. The training and validation split is kept the same between the models to make the output as comparable as possible.

For the models themselves, both sk-learn [4] and sk-multilearn [6] will be used, which are Python libraries supplying implementations of a wide variety of machine learning models and methods. While sk-learn is a very well developed and maintained library used by a lot of people daily, the same cannot be said for sk-multilearn. It was developed as a tool to support research being done in the field and was then released to the public for people to contribute to and use. However, it is no longer maintained, and in my limited usage I found a lot of bugs and incompatibilities with the latest versions of other libraries. This should be noted when judging the outcome of some of the methods: it could be that the method itself is good but that the implementation of it is poor or contains errors.

3.3.1 Model Implementation

The different models need to be configured and implemented in the pipeline. Since libraries are being used, the implementation of the models themselves can be left out, saving a lot of time and removing chances of mistakes. The challenges and discoveries in the implementation will be described here, but most of the details will be left out, as the majority of it is repetitive work.

3.3.1.1 Neural Network Classifier

The first idea was to use a neural network, partly because of the buzz around this machine learning method, but also due to its ability to learn complex relationships. However, this approach was tried with some doubt due to the relatively small dataset, containing only 35,000 data points. The initial solution proposal was based on a neural network having each competitor suffix as a binary input node and each target suffix as a binary output node, as seen in Figure 3.2. The hidden layers should then learn the relationships between these by tuning the weights and using non-linear functions between the layers. By including only the affixes present in the dataset for the competitor and the corresponding target affixes, we end up with 1,150 input nodes and 250 output nodes. This shows how varied the inputs and outputs are. Many different network configurations were tried. The smaller ones failed to pick up the underlying pattern, while more success was seen with the larger models. Through trial and error, the model that performed best was the following, summarized in Table 3.1:

Layer (type)             Output Shape    Param #
input_1 (InputLayer)     (None, 1150)    0
dense (Dense)            (None, 1500)    1,726,500
dropout (Dropout)        (None, 1500)    0
dense_1 (Dense)          (None, 1000)    1,501,000
dropout_1 (Dropout)      (None, 1000)    0
dense_2 (Dense)          (None, 500)     500,500
dropout_2 (Dropout)      (None, 500)     0
dense_3 (Dense)          (None, 247)     123,747

Table 3.1: Neural Network Architecture
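In Keras, the architecture in Table 3.1 can be reproduced roughly as follows; the hidden-layer activations and dropout rate are assumptions, since only the layer sizes and the sigmoid output are given:

```python
# A minimal Keras sketch of the architecture in Table 3.1; the ReLU
# activations and 0.3 dropout rate are assumptions, not the thesis values.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(1150,)),               # one node per competitor affix
    layers.Dense(1500, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(1000, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(500, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(247, activation="sigmoid"),  # one probability per target affix
])
# Sigmoid outputs with binary cross-entropy treat each label independently,
# which is what multi-label classification requires.
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```

An EarlyStopping callback monitoring the validation loss would reproduce the early termination described next.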
The result of this network was still not satisfactory, as it did not generalize well to unseen data. The training was terminated early due to a lack of improvement in the validation loss over many iterations. The final accuracy score was 0.47, which will be compared to the results achievable with a random forest model.

3.3.1.2 Random Forest Classifier

A random forest model was set up without any additional tuning or work. The default parameters were used, except for setting a random state to ensure consistency between runs. Since random forest is robust to changes in its hyperparameters and configuration, and since the implementation in scikit-learn can handle all kinds of input and output formats, it is extremely easy to implement. Some experiments were done with the hyperparameters of the model, but it was quickly realized that further hyperparameter search would not lead to improvement. So, the working assumption is that the default parameters are close to optimal.

The default solution for multi-label classification when using random forest and scikit-learn is binary relevance. To me this is a bit surprising: while random forest is great at binary classification, it is also a method which naturally handles multi-class classification. The computational complexity of the model is sublinear in the number of classes, meaning that it scales more favourably than linearly with the number of classes. Rather than training n models, where n is the number of output labels, you could instead use another method such as Label Powerset, where you have many classes and leverage the strength of random forest, as described in Section 2.4. Because of this, Label Powerset will also be explored. Here we instead make classes out of every combination of labels present in the training data and train a single model on this multi-class classification problem. The performance of this will then be compared to random forest with binary relevance and the rest of the models.

3.3.1.3 SVM

SVM is a very well-established model which can be used in many applications. It naturally handles binary classification, which is why it comes to mind as a good classifier to try with the Binary Relevance method. It is also interesting to try in combination with RAKEL, as SVM was used in the paper where the method was suggested, especially since it is described there how the computational complexity decreases dramatically when using RAKEL with SVM compared to SVM with Label Powerset. The reason RAKEL is not tried with Random Forest is that this would give higher computational complexity than just using Random Forest with Label Powerset, since the computational complexity of Random Forest is sublinear in the number of classes.

A big obstacle in using SVM is the computational performance. When using binary relevance, it takes an extremely long time to train classifiers for each of the output labels. The big blocker here is the number of input features: SVM does not scale well with an increasing number of inputs. Initially, the generic implementation, which can handle both linear and kernel-based SVM, was used. Its performance was so poor that the training could not be completed. It was then discovered that an alternative implementation, optimized for linear SVM only, was available and that it scales better. This made the computational complexity feasible, but still not good, as will be seen in the results later on. Using kernels with the generic implementation, on the other hand, did not have the same performance issues and managed much better. However, in all cases where SVM was used, the performance of the model itself was very disappointing.
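In scikit-learn terms, the two implementations referred to are presumably SVC and LinearSVC; a minimal sketch of how either can be wrapped for binary relevance (applying OneVsRestClassifier to a multi-label indicator target implements exactly that strategy):

```python
# A minimal sketch of the two SVM variants, wrapped for binary relevance;
# illustrative, not the thesis code.
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC, LinearSVC

# Generic implementation: supports any kernel but trains slowly at scale.
rbf_svm = OneVsRestClassifier(SVC(kernel="rbf", C=1.0))

# Specialized implementation: linear kernel only, but scales far better
# with many input features.
linear_svm = OneVsRestClassifier(LinearSVC(C=1.0))
```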
3.3.1.4 RAKEL

The implementation from sk-multilearn had several bugs and issues with the latest versions of its dependencies, which caused runtime errors. The issues could be overcome in the case of RAKELd, but not for RAKELo. Therefore, the performance of RAKELo could not be investigated.

3.3.1.5 kNN Variations

The kNN variation implementations also contained issues. These could, however, be overcome, and the methods could be applied.

3.3.1.6 Notes on Computational Complexity

The remarks on computational complexity are based on the single-threaded performance of the algorithms. Some methods, such as Binary Relevance, have no interdependencies between the classifiers. This allows for parallel computation, which would greatly speed up training. Comparing the single-threaded performance (running on a single computational thread) gives an equal base for the different classifiers. However, timings for the models using multi-threading (running on multiple computational threads at the same time) will also be presented where such performance increases could be made. These are not to be taken as the absolute optimal performance, as they are only an attempt at getting speed gains.

Figure 3.2: Illustrative drawing showing the input and output nodes with hidden layers in between.

4 Results

With the pipeline developed and several machine learning models evaluated, the results will now be presented. First, the results from the first phase, consisting of pipeline development, are presented; here, alterations to the feature extraction and input data are evaluated. The version determined to be the best is then used when evaluating the different machine learning models and strategies against each other to determine the best one. Lastly, the input data is studied to get a deeper understanding of why the performance looks the way it does, and to get a sense of what accuracy is reasonable to aim for.

4.1 Pipeline Optimisation

With random forest as the preliminary model, the pipeline was evaluated. Several different tests were run to determine the best version of the pipeline; the results of these tests are presented here.

4.1.1 Improved Affix List

When looking at cases where the affix prediction was correct but the designation prediction was incorrect, it was seen that many of these cases were caused by an incomplete target affix list. For example, W513 was extracted as just W from the true target designation, and this was successfully predicted by the model as W. However, this is only part of the affix, which is much easier to predict than the full affix. The consequence is that the designation search is not performed based on the affixes actually present, leading to an incorrect prediction. The hypothesis was that fixing this would decrease the affix conversion accuracy while increasing the designation conversion accuracy: since the affixes become more specific, it becomes harder to predict the right ones, but the more specific affix predictions should lead to more accurate designation conversion. The results of this modification can be seen in Figure 4.1. The results are somewhat surprising: the affix accuracy drops as expected; however, the designation accuracy remains almost the same throughout the results.
The false suggestion rate does seem to drop slightly with the added prefixes, which makes sense, since with more specific affixes in the affix list we are less likely to find something matching all of them. Even though the improvement is questionable, the change will be kept, as it more accurately represents the affixes in the training data as well as in the predictions. It also makes the affix conversion accuracy more representative, since the comparison is against the true affix and not part of it. Since the affix accuracy can be seen as the upper bound on the designation accuracy, it is better for the affix accuracy to be more correct. This way, it is easier to see how close to the upper bound we get and how much improvement is still possible.

Figure 4.1: Results of suffix and designation accuracy, as well as false suggestion rate, based on changes to the target affix list

4.1.2 Adding Features for Size and Type

If you are familiar with the product ranges, you know that in many cases the suffixes change at a certain size threshold. In many cases, the larger bearings have another design, or another seal, or something else. This threshold can differ between manufacturers, and some might have the same design throughout the size range. What this means is that the size of the bearing gives valuable information about which affixes to predict.

The same can be said about the range and type of product. Usually, the first three numbers tell you the range of bearing. For some ranges you might have one design, while for other ranges it could be another due to technical reasons. In other cases, where the designation has a prefix, the first three characters tell you which product type it is, or some other information. The conclusion is that the first three characters of the designation carry a lot of valuable information, regardless of exactly what they contain. By adding this as a feature (called f3), we get something that carries a lot of information, even though the feature does not represent a single cleanly extracted property of the bearing.

The hypothesis is that an improvement in accuracy will be seen by adding these two features, which carry a lot of dense information not shown in the affix list. The results can be seen in Figure 4.2. There is a significant improvement in accuracy when adding the first three characters and when adding the size independently. You get even further improvement when adding both features to the model, indicating that they are not especially correlated.

Figure 4.2: Results of adding features for size and the first three characters (feature called f3)

The outcome here is clear: the features bring great value, greatly improved accuracy, and will therefore be kept in the model.

4.1.3 Training without OEM References

The data contains OEM references that the model, with its limitations, will never be able to predict. The hypothesis is that these references contribute negatively to the training of the model and that filtering them out would give better accuracy. The idea is to filter such references out of the training set but keep the validation set as it is, without filtering. With this, we get results that are fully comparable and that operate on the same data. If the references were filtered out of the validation set as well, you would, of course, get higher accuracy by default.
The results of this trial can be seen in Figure 4.3, where the results of training on a filtered set can be compared to those of a non-filtered training set. The results show no improvement for the model trained on the filtered set of references; in fact, a small decrease in performance can be seen. The differences are so minor that they are likely not significant.

Figure 4.3: Results of training without OEM references

4.2 Model Evaluation

To determine the most suitable machine learning model, the models need to be evaluated against each other and compared. To compare them, we measure both affix conversion performance and runtime. In Figure 4.4 the comparison of affix accuracy can be seen. Each bar represents a different model and strategy combination; random forest, for example, occurs several times, but with different problem transformation strategies. From the results it is clear that random forest outperforms the other methods, using either problem transformation strategy.

In Figure 4.5 the resulting runtime of each of the models and strategies can be seen. The results here vary considerably. Most notable is the difference in runtime for random forest using label powerset compared to binary relevance, especially when referencing Figure 4.4, where the performance difference is negligible. It is also notable how the multi-label methods seem to have much higher runtimes, likely due to a poor implementation rather than the methods themselves, which makes the comparison difficult.

4.3 Data Limitations

To understand what constitutes a good accuracy level, it is essential to understand the underlying data, its trends, and its patterns. If there are consistent patterns throughout the data, a very high accuracy level can be expected. On the other hand, if the data is very spread out, with little support for many of the classes, great results cannot be expected. Ideally, the model should learn the well-represented relationships, but expectations should be low for those without sufficient support. Several plots have been created to analyse this. In Figure 4.6,