Evaluating Machine Learning Algorithms
in Design Pattern Recognition

Exploring the Performance of Classification and Clustering
Algorithms in Design Pattern Recognition Utilising Large
Language Models

Master’s thesis in Computer science and engineering

SIMON ANDERSSON
VIKTOR BERGGREN

Department of Computer Science and Engineering
CHALMERS UNIVERSITY OF TECHNOLOGY
UNIVERSITY OF GOTHENBURG
Gothenburg, Sweden 2024


Master’s thesis 2024

Evaluating Machine Learning Algorithms in
Design Pattern Recognition

Exploring the Performance of Classification and Clustering
Algorithms in Design Pattern Recognition Utilising Large

Language Models

SIMON ANDERSSON
VIKTOR BERGGREN

Department of Computer Science and Engineering
Chalmers University of Technology

University of Gothenburg
Gothenburg, Sweden 2024


Evaluating Machine Learning Algorithms in Design Pattern Recognition
Exploring the Performance of Classification and Clustering
Algorithms in Design Pattern Recognition Utilising Large
Language Models
SIMON ANDERSSON
VIKTOR BERGGREN

© SIMON ANDERSSON, VIKTOR BERGGREN, 2024.

Supervisor: Jennifer Horkoff, Computer Science and Engineering
Examiner: Hans-Martin Heyn, Computer Science and Engineering

Master’s Thesis 2024
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Gothenburg, Sweden 2024

iv


Evaluating Machine Learning Algorithms in Design Pattern Recognition
Exploring the Performance of Classification and Clustering
Algorithms in Design Pattern Recognition Utilising Large
Language Models
SIMON ANDERSSON
VIKTOR BERGGREN
Department of Computer Science and Engineering
Chalmers University of Technology and University of Gothenburg

Abstract
Design Pattern Recognition (DPR) is an ongoing research challenge in the field of
software engineering for increasing software maintainability in code. Recent work
has utilised Large Language Models (LLMs) for extracting semantic information
from code. This study follows up on previous research and investigates, explores,
and evaluates the performance of multiple classification and clustering algorithms
when applied to embeddings extracted from LLMs. Performance is explored be-
tween contexts using different LLMs, design patterns, and programming languages.
Data for design pattern implementations was gathered for Java, Python, and C#
via GitHub and the P-MARt repository. Each algorithm was run with tuned hyper-
parameters, and their average performance across multiple runs was compared. The
results indicate variance for the individual performance of the algorithms, but the
overall performance order between the algorithms remains the same. Classification
algorithms outperformed clustering algorithms, and clustering algorithms had low
performance in the measured metrics across all tests. The results also showed a
difference in performance between behavioral, creational, and structural design pat-
terns. This study shows further promise for the use of LLMs for DPR and recognises
the need for larger studies utilising LLMs for DPR.

Keywords: Computer science, design patterns, machine learning, large language
models, software engineering, design pattern recognition, DPR, LLM.

v


Acknowledgements
We want to express our gratitude to our supervisor, Jennifer Horkoff, and our ex-
aminer, Hans-Martin Heyn, for their support and guidance throughout the study.
They provided us with thoughtful feedback and advice that proved helpful with the
study. We also want to thank Sushant Kumar Pandey for his support of the method
and the technical aspects of the study.

Simon Andersson, Viktor Berggren, Gothenburg, June 2024

vii


Contents

List of Figures xiii

List of Tables xv

1 Introduction 1
1.1 Purpose of the study . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Research questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Scope and Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . 3

2 Background 5
2.1 Design Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Large Language Models . . . . . . . . . . . . . . . . . . . . . . . . . 6

2.2.1 CodeBERT . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.2 Instructor-xl . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.3 GPT-2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

2.3 Classifying Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.1 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.2 Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.3 Gaussian Process . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.4 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . 10
2.3.5 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . 10
2.3.6 Nearest Neighbour . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.7 Gradient Boosting . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3.8 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . 11

2.4 Clustering Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4.1 K-Means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4.2 DBSCAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4.3 Mean Shift . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4.4 Gaussian Mixture . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4.5 BIRCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4.6 Affinity Propagation . . . . . . . . . . . . . . . . . . . . . . . 14

3 Related Work 15
3.1 Existing Approaches for DPR . . . . . . . . . . . . . . . . . . . . . . 15

3.1.1 Metric-based Approaches . . . . . . . . . . . . . . . . . . . . . 16
3.1.2 LLM-based Approaches . . . . . . . . . . . . . . . . . . . . . . 17

3.2 Existing evaluation of classification algorithms . . . . . . . . . . . . . 18

ix


Contents

4 Method 21
4.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4.1.1 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . 23
4.2 Google Colab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.3 Language Models Used . . . . . . . . . . . . . . . . . . . . . . . . . . 23
4.4 Design Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.5 Classifying and Clustering Algorithms . . . . . . . . . . . . . . . . . 25
4.6 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
4.7 Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.7.1 Tokenization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
4.7.2 Extracting Embeddings . . . . . . . . . . . . . . . . . . . . . . 28
4.7.3 Training of Classifiers and Clusterers . . . . . . . . . . . . . . 28
4.7.4 LLM Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

5 Results 31
5.1 RQ1: Classifier Performance on Design Pattern Identification . . . . . 31

5.1.1 Baseline Performance . . . . . . . . . . . . . . . . . . . . . . . 31
5.1.2 Performance with Tuned Hyperparameters . . . . . . . . . . . 35

5.2 RQ2: Influence of Context on Classifier Performance . . . . . . . . . 40
5.2.1 RQ2.1: Performance Variance Between Design Patterns . . . . 40

5.2.1.1 Baseline Performance . . . . . . . . . . . . . . . . . . 40
5.2.1.2 Performance with Tuned Hyperparameters . . . . . . 44

5.2.2 RQ2.2: Performance Variance of Language Models . . . . . . . 47
5.2.2.1 GPT-2 . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.2.2.2 Instructor-xl . . . . . . . . . . . . . . . . . . . . . . 51
5.2.2.3 Comparison Between Language Models . . . . . . . . 55

5.2.3 RQ2.3: Performance Variance of Programming Languages . . 57
5.2.3.1 Python . . . . . . . . . . . . . . . . . . . . . . . . . 57
5.2.3.2 C# . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.2.3.3 Comparison Between Programming Languages . . . . 65

6 Discussion 67
6.1 Algorithm Performance on Design Pattern Identification . . . . . . . 67

6.1.1 Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
6.1.2 Clusterers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

6.2 Influence of Context on Algorithm Performance . . . . . . . . . . . . 69
6.2.1 Performance Variance Between Design Patterns . . . . . . . . 69
6.2.2 Performance Variance Between Language Models . . . . . . . 71
6.2.3 Performance Variance Between Programming Languages . . . 72

6.3 Threats to Validity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.3.1 Internal Validity . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.3.2 External Validity . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.3.3 Construct Validity . . . . . . . . . . . . . . . . . . . . . . . . 74

7 Conclusion 77
7.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

x


Contents

Bibliography 79

A Appendix 1 I

B Appendix 2 VII

xi


Contents

xii


List of Figures

2.1 Document converted to embeddings [31]. . . . . . . . . . . . . . . . . 7
2.2 Decision boundaries for a linearly separable problem [49]. . . . . . . . 10
2.3 A simple model of a Neural Network. . . . . . . . . . . . . . . . . . . 12

4.1 Visualisation of process for RQ1. . . . . . . . . . . . . . . . . . . . . 21
4.2 Visualisation of process for RQ2. . . . . . . . . . . . . . . . . . . . . 21
4.3 Visualisation of the pipeline. . . . . . . . . . . . . . . . . . . . . . . . 28

5.1 Box plot showing the F1-scores of the classifiers across 200 runs with
default hyperparameters on CodeBERT embeddings, averaged across
all patterns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

5.2 Confusion matrices showing how the baseline classifiers predicted each
of the design patterns, averaged across 200 runs on CodeBERT em-
beddings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

5.3 t-SNE visualisation of the CodeBERT embeddings. . . . . . . . . . . 35
5.4 Box plot showing the F1-scores of the classifiers across 200 runs with

tuned hyperparameters on CodeBERT embeddings, averaged across
all patterns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

5.5 Confusion matrices showing how the tuned classifiers predicted each
of the design patterns, averaged across 200 runs on CodeBERT em-
beddings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

5.6 Box plot showing F1-scores of the classifiers across 200 runs with
default hyperparameters on CodeBERT embeddings, averaged across
all patterns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

5.7 t-SNE visualisation of the CodeBERT embeddings of the seven design
patterns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43

5.8 Box plot showing F1-scores of the classifiers across 200 runs with
hyperparameter tuning on CodeBERT embeddings, averaged across
all patterns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

5.9 Box plot showing F1-scores of the classifiers across 200 runs with
hyperparameter tuning on GPT-2 embeddings, averaged across all
patterns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

5.10 Confusion matrices showing how the tuned classifiers predicted each
of the design patterns, averaged across 200 runs on GPT-2 embeddings. 49

5.11 t-SNE visualisation of the GPT-2 embeddings. . . . . . . . . . . . . . 50

xiii


List of Figures

5.12 Box plot showing the F1-scores of the classifiers across 200 runs with
hyperparameter tuning on Instructor-xl embeddings, averaged across
all patterns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

5.13 Confusion matrices showing how the tuned classifiers predicted each
of the design patterns from the Instructor-xl embeddings, results av-
eraged across 200 runs. . . . . . . . . . . . . . . . . . . . . . . . . . . 53

5.14 t-SNE visualisation of the Instructor-xl embeddings. . . . . . . . . . . 54
5.15 Box plot the F1-scores of the classifiers across 200 runs with hyper-

parameter tuning of the Python code on CodeBERT embeddings,
averaged across all patterns. . . . . . . . . . . . . . . . . . . . . . . . 57

5.16 Confusion matrices showing how the tuned classifiers predicted each
of the design patterns on the CodeBERT embeddings of the Python
code, results averaged across 200 runs. . . . . . . . . . . . . . . . . . 59

5.17 t-SNE visualisation of the CodeBERT embeddings of the Python data. 60
5.18 Box plot showing the F1-scores of the classifiers across 200 runs with

hyperparameter tuning on CodeBERT embeddings of the C# data,
averaged across all patterns. . . . . . . . . . . . . . . . . . . . . . . . 61

5.19 Confusion matrices showing how the tuned classifiers predicted each
of the design patterns on CodeBERT embeddings of the C# data,
results averaged across 200 runs. . . . . . . . . . . . . . . . . . . . . . 63

5.20 t-SNE visualisation of the CodeBERT embeddings of the C# data. . 64

xiv


List of Tables

4.1 Table showing the amount of files and their source for the Singleton,
Prototype, and Builder design patterns written in Java. . . . . . . . . 23

5.1 Table depicting mean, variance, highest, and lowest F1-scores ob-
tained from running each classifier 200 times with different training
and test sets and with default hyperparameters on CodeBERT em-
beddings, results averaged across all design patterns. . . . . . . . . . 32

5.2 Table depicting mean F1-score, precision, and recall obtained from
running each classifier 200 times with different training and test sets
across the Singleton, Prototype, and Builder design patterns. . . . . . 33

5.3 ARI, Silhouette, F1, precision, and recall scores of the clustering al-
gorithms with default hyperparameters on CodeBERT embeddings,
averaged across all design patterns over 200 runs. . . . . . . . . . . . 34

5.4 ARI, Silhoutte, F1, precision, and recall scores of the clustering al-
gorithms with default hyperparameters on CodeBERT embeddings,
without the Singleton pattern, averaged over 200 runs. . . . . . . . . 35

5.5 Table depicting mean, variance, highest, and lowest F1-scores ob-
tained from running each classifier with tuned hyperparameters on
CodeBERT embeddings, averaged across all design patterns over 200
runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

5.6 Table depicting F1-score, precision, and recall obtained from running
each tuned classifier 200 times with different training and test sets
across the Singleton, Prototype, and Builder design patterns. . . . . . 37

5.7 ARI, Silhouette, F1, precision, and recall scores of the clustering al-
gorithms with hyperparameter tuning on CodeBERT embeddings,
averaged across all design patterns over 200 runs. . . . . . . . . . . . 39

5.8 Table depicting mean, variance, highest, and lowest F1-scores ob-
tained from running each classifier with default hyperparameters on
CodeBERT embeddings, results averaged across all seven design pat-
terns over 200 runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

5.9 Average F1-score, precision, and recall for all seven design patterns
across 200 runs and nine classifiers, with default hyperparameters on
CodeBERT embeddings. . . . . . . . . . . . . . . . . . . . . . . . . . 42

5.10 ARI and Silhouette score of clustering algorithms with default hy-
perparameters on CodeBERT embeddings, averaged across all seven
design patterns over 200 runs. . . . . . . . . . . . . . . . . . . . . . . 42

xv


List of Tables

5.11 Table depicting mean, variance, highest, and lowest F1-scores ob-
tained from running each classifier 200 times with tuned hyperpa-
rameters on CodeBERT embeddings, results averaged across all de-
sign patterns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

5.12 Average F1-score, precision, and recall for all seven design patterns
across 200 runs and nine classifiers, with tuned hyperparameters on
CodeBERT embeddings. . . . . . . . . . . . . . . . . . . . . . . . . . 45

5.13 ARI and Silhouette score of the clustering algorithms with hyperpa-
rameter tuning on CodeBERT embeddings, averaged across all seven
design patterns over 200 runs. . . . . . . . . . . . . . . . . . . . . . . 46

5.14 Table depicting mean, variance, highest, and lowest F1-scores ob-
tained from running each classifier with tuned hyperparameters on
GPT-2 embeddings, results averaged across all design patterns over
200 runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

5.15 Table depicting F1-score, precision, and recall obtained from running
each tuned classifier 200 times with different training and test sets
across the Singleton, Prototype, and Builder embeddings from GPT-2. 48

5.16 ARI and Silhouette score of the clustering algorithms with hyper-
parameter tuning on GPT-2 embeddings, averaged across all design
patterns over 200 runs. . . . . . . . . . . . . . . . . . . . . . . . . . . 50

5.17 Table depicting mean, variance, highest, and lowest F1-scores ob-
tained from running each classifier with tuned hyperparameters on
Instructor-xl embeddings, results averaged across all design patterns
over 200 runs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

5.18 Table depicting F1-score, precision, and recall obtained from run-
ning each tuned classifier 200 times with different training and test
sets across the Singleton, Prototype, and Builder embeddings from
Instructor-xl. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

5.19 ARI and Silhouette score of the clustering algorithms with hyper-
parameter tuning on Instructor-xl embeddings, averaged across all
design patterns over 200 runs. . . . . . . . . . . . . . . . . . . . . . . 54

5.20 Comparison of mean F1-scores achieved by the classifiers with tuned
hyperparameters between CodeBERT, GPT-2, and Instructor-xl. . . . 55

5.21 Comparison of mean F1-scores for Singleton, Prototype, and Builder. 55
5.22 Comparison of mean ARI and Silhouette scores achieved by the tuned

clusterers between CodeBERT, GPT-2, and Instructor-xl. . . . . . . . 56
5.23 Table depicting mean, variance, highest, and lowest F1-scores ob-

tained from running each classifier with tuned hyperparameters 200
times on the CodeBERT embeddings of the Python code, results av-
eraged across all design patterns. . . . . . . . . . . . . . . . . . . . . 58

5.24 F1-score, precision, and recall obtained from running each tuned clas-
sifier 200 times across the Singleton, Prototype, and Builder embed-
dings from CodeBERT on Python. . . . . . . . . . . . . . . . . . . . 58

5.25 ARI and Silhouette score of the tuned clustering algorithms running
on CodeBERT embeddings of the Python data, averaged over 200 runs. 60

xvi


List of Tables

5.26 Table depicting mean, variance, highest, and lowest F1-scores ob-
tained from running each classifier with tuned hyperparameters 200
times on CodeBERT embeddings of the C# data, results averaged
across all design patterns. . . . . . . . . . . . . . . . . . . . . . . . . 61

5.27 Table depicting F1-score, precision, and recall obtained from running
each tuned classifier 200 times across the Singleton, Prototype, and
Builder embeddings from CodeBERT on C#. . . . . . . . . . . . . . 62

5.28 ARI and Silhouette score of the tuned clustering algorithms running
on CodeBERT embeddings of the C# data, averaged over 200 runs. . 64

5.29 Comparison of mean F1-scores achieved by the classifiers with tuned
hyperparameters between Java, Python, and C#. . . . . . . . . . . . 65

5.30 Comparison of mean F1-scores for Singleton, Prototype, and Builder. 65
5.31 Comparison of mean ARI and Silhouette scores achieved by the tuned

clusterers between CodeBERT, GPT-2, and Instructor-xl. . . . . . . . 66

A.1 Baseline Performance Metrics for Each Design Pattern and Classifier
for RQ2.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . II

A.2 Baseline Performance Metrics for Each Design Pattern and Classifier
for RQ2.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . III

A.3 Performance metrics for each design pattern and classifier for RQ2.1,
with tuned hyperparameters. . . . . . . . . . . . . . . . . . . . . . . . IV

A.4 Performance metrics for each design pattern and classifier for RQ2.1,
with tuned hyperparameters. . . . . . . . . . . . . . . . . . . . . . . . V

B.1 The hyperparameters tested for each algorithm and how many com-
binations of each were tested. . . . . . . . . . . . . . . . . . . . . . . VIII

B.2 Hyperparameters for Classifiers in RQ1. . . . . . . . . . . . . . . . . IX
B.3 Hyperparameters for classifiers and clusterers in RQ2.1. . . . . . . . . X
B.4 Hyperparameters for classifiers and clusterers for GPT-2 in RQ2.2. . . XI
B.5 Hyperparameters for classifiers and clusterers for Instructor-xl in RQ2.2.XII
B.6 Hyperparameters for classifiers and clusterers for Python in RQ2.3. . XIII
B.7 Hyperparameters for classifiers and clusterers for C# in RQ2.3. . . . XIV

xvii


List of Tables

xviii


1
Introduction

Software maintainability is an important quality attribute to consider when devel-
oping code, and it accounts for the majority of the cost of software development
[1]. Therefore, keeping the software maintainable reduces the cost of software de-
velopment. Design patterns provide developers with tools for addressing recurring
problems in code [2]. The impact of these design patterns on software maintain-
ability has been demonstrated to be positively correlated [3], [4]. Consequently,
identifying the use of design patterns in large-scale codebases is essential for soft-
ware maintainability.

Manual inspection of code to identify design patterns is challenging and time-
consuming [5]. Design patterns can also have different variations, known as variants
[6], in order to fit the context of the code, which further complicates the problem.
Incorrect identification of a design pattern can lead to incorrect implementations,
which can affect the functionality of the system [7].

There are, however, other methods for automatically identifying design patterns
in code using design pattern recognition (DPR) [8]. There are several different
areas explored in DPR, from metric-based approaches like [9], [10], to machine
learning-based approaches like [7], [11]. These methods have evaluated different
classification and clustering techniques while focusing only on specific programming
languages and design patterns. Research has shown that ML-based approaches show
promising results by being able to understand variants of design patterns better than
traditional metric-based approaches [12]. A consensus on which detection algorithms
to use has yet to be reached [6]. Furthermore, capturing a wider range of design
patterns is difficult because of the different methods for capturing features depending
on the design pattern.

The most recent work on this topic has been using Large Language Models (LLMs)
[7], [11]. The LLMs are used to extract an embedding from the code that is then
used with another machine-learning algorithm to classify if and which design pattern
a code is using. As there are currently no language models designed for the purpose
of DPR, utilising LLMs will also require the use of a separate classifier or clusterer.
The embeddings extracted from LLMs capture the semantic context of the source
code, which can provide further useful information than previously metric-based
approaches have used. This method of design pattern recognition using LLMs is a

1


1. Introduction

very recent topic of research with many unexplored areas.

The previous research that has been done within LLM-based DPR has primarily
focused on testing its feasibility. The most recent studies are exploring different
areas of LLM-based DPR. They are testing different language models, the effects
of pre-training the language models, the performance of different design patterns,
and different combinations of these [11]. There has also been research on evaluating
different supervised learning methods in non-DPR contexts [13]. But there is still
a lack of research evaluating how different classification and clustering algorithms
perform in the context of design pattern recognition when using other large language
models, such as CodeBERT. This thesis will focus on exploring the performance of
classification and clustering algorithms that are trained on embeddings generated by
large language models. It is still unclear how the performance of learning algorithms
varies when using them to classify embeddings generated by the LLMs.

Some possible methods for classification include Decision Trees, Support Vector
Machine (SVM), and K-Nearest Neighbour (KNN) [14]. For clustering, K-Means,
Hierarchical Clustering, and Density-Based Spatial Clustering of Applications with
Noise (DBSCAN) can be utilised [15]. However, it remains unclear which method
provides the best performance [6]. Different programming languages have varying
syntax and coding styles, making it important to evaluate which algorithm should
be used. Furthermore, the choice of language model used for extracting embed-
ded feature vectors will influence how the classification and clustering algorithms
are trained. Therefore, a study is required to explore the feasibility of a range of
classification and clustering algorithms.

1.1 Purpose of the study
The purpose of this study is to advance the field of software development by explor-
ing and evaluating different classification and clustering algorithms in the context
of identifying the use of design patterns in code. Specifically, this research aims
to delve into the efficiency and accuracy of these algorithms when applied to em-
beddings extracted from code via LLMs. The study will also explore how different
design patterns, language models, and programming languages affect the perfor-
mance of the algorithms. The algorithms will be quantitatively analysed to measure
their performance. Thereby determining effective algorithmic approaches for accu-
rately classifying and clustering software design patterns. As a result, enhancing
the ability of developers to maintain, understand, and improve large and complex
codebases.

1.2 Research questions
The research questions are divided into two parts: the performance of the different
algorithms (RQ1), and whether and how the choice of classification algorithm may
change depending on the choice of design patterns, programming language, and

2


1. Introduction

language model that is used (RQ2).

RQ1: What are the performances of different classifying and clustering algorithms
in the context of identifying design patterns in code?

This is performed to find the differences in performances between different
classifiers and clusterers and get a baseline of their performance in DPR.

RQ2: Does the performance of different classifying and clustering algorithms vary
depending on the context in which they are used?

RQ2.1: Does the performance vary between design patterns?

RQ2.2: Does the performance vary between language models?

RQ2.3: Does the performance vary between programming languages?

This is performed to explore how the different parts, such as what language
the code is written in and what type of LLM is being used, of a whole system
affect the performance of the classifier. The goal of RQ2 is to find out if there
are different classifiers with optimal performance depending on the context
in which they are applied. It is done in order to study the effects on the
performance of the classifiers when applied in different contexts.

1.3 Scope and Limitations
The study is limited to the assessment of classification and clustering algorithms in
the context of DPR. The study will not examine all the design patterns identified
by Gamma et al., nor will it examine their industrial adaptations [2]. This is due
to the fact that there is not a lot of data available for this type of research. The
main programming language that will be used in this study and used for RQ1 will
be Java. Other programming languages will only be included for RQ2. This is due
to the publicly available data for Java being much higher than other programming
languages. The findings will, however, still be applicable to the patterns excluded
from the study. The study will not use LLMs that require pre-training, and it will
not look into how pre-training the LLMs affects the performance of the classifiers.
This is due to the size of the models, and pre-training them requires a lot of time and
hardware that is not readily available. The study will not explore the fine-tuning
of the LLMs either, as it is outside the scope of the research questions and not
feasible to be performed within the time schedule. Very large language models, such
as models with multiple billions of parameters, will not be used due to hardware
limitations.

3


1. Introduction

4


2
Background

The focus of this study lies in the evaluation of a range of classification and cluster-
ing algorithms for DPR. The study will use implementations of design patterns as
data, then use LLMs to extract important features, which will then be fed into clas-
sification and clustering algorithms that will predict which design pattern a given
file consists of. This section provides the background and theoretical information
necessary to understand the method and the findings of the study and will bring up
the following topics: design patterns, large language models, and classification and
clustering algorithms.

2.1 Design Patterns

A large part of software development is writing understandable and maintainable
code. When codebases grow and developers add new functionality, the source code’s
complexity grow along with it. New developers learn about the importance of main-
taining code and study various tools for this. One of the tools for keeping a struc-
tured and maintainable code base is design patterns. Design patterns are used in
software development as a guide or description for how to solve commonly occurring
problems [2].

Design patterns can be divided into three different types: creational, structural, and
behavioral patterns [2]. Creational patterns are patterns that encapsulate knowl-
edge of a class that should be instantiated, as well as hide the underlying structure
of the class. Some examples of creational patterns are Singleton [16], Builder [17],
and Prototype [18]. Structural patterns are patterns that focus on the structure of
objects and classes while keeping their structure flexible. Some examples of struc-
tural patterns are Adapter [19], Bridge [20], and Composite [21]. Lastly, behavioral
patterns are responsible for the communication between objects and classes and
characterise complex control flows in programs. Some examples of behavioral pat-
terns are Strategy [22], Observer [23], and State [24].

5


2. Background

2.2 Large Language Models

Language Models (LMs) are models that can understand and generate human lan-
guage [25]. LMs have the ability to predict a sequence of words and can generate
new text based on a given input. The most common type of LM estimates what
words may come next based on the prior words and their context. Large Language
Models (LLMs) are more advanced language models that have a large number of pa-
rameters, often reaching multiple billions of parameters [26]. One key part of LLMs
is their transformer architecture [27] and the underlying self-attention mechanism
[28] it uses to determine the relevance of different parts of the input.

The transformer architecture presented by Vaswani et al. consists of an encoder
and a decoder [28]. The encoder takes an input sequence of symbol representations
and creates a sequence of continuous representations. The encoder creates a set
of vectors for each word that describe the words’ relationship with other words in
the sequence. The decoder uses the output of the encoder to create a contextually
relevant output of the model. Both the encoder and decoder contain self-attention
mechanisms that help with extracting the contextual information of the input. This
is done one-by-one by taking each word of the input and calculating and assigning
weights to all the other words of the input based on their vector similarities that
the model learns during training. Other important parts of LLMs are the ability to
pre-train them on datasets and fine-tune them to perform specific tasks [29], as this
has proven to improve their performance on multiple different tasks [30].

To represent values and inputs, LLMs use something called embeddings [31]. An
embedding is a mathematical representation of objects like images, text, and audio.
They represent these objects in the form of vectors. Figure 2.1 illustrates how em-
beddings are extracted from a document. Embeddings enable language models to
find similarities and relationships between objects. As inputs to language models are
objects like texts and images, the models will convert the inputs into embeddings.
How a model generates the embeddings is different for each model, and the same
input will generate different embeddings for different models. The generated em-
bedding is then compared to the data the model was trained on to find similarities
between the new input and what the model already knows. It will then use these
similarities to perform its task.

This study will utilise existing LLMs to extract embeddings from source code, which
in turn can be used to train classification and clustering algorithms. Since LLMs can
capture the structure and semantic information from a text, classifiers can hopefully
learn which semantic information is part of which design pattern and therefore get
a higher accuracy when predicting.

Section 2.2.1 through 2.2.3 presents the different LLMs used in this study.

6


2. Background

Figure 2.1: Document converted to embeddings [31].

2.2.1 CodeBERT
CodeBERT is a transformer-based bimodal pre-trained model that achieves state-of-
the-art performance in natural language code search and code documentation gen-
eration [32]. Bimodal signifies that it is trained on Natural Language-Programming
Language (NL-PL) pairs, which capture the semantic connection between natural
language and programming language. Results show that it performs better than
previous pre-trained models on NL-PL probing. The BERT in CodeBERT stands
for Bidirectional Encoder Representations from Transformers, which is a framework
that is used for natural language processing (NLP). This model is similar to other
models using this framework, such as RoBERTa [33] and BERT [34]. The idea be-
hind these models is to learn the contextual representations of huge natural language,
which can, for example, be used to predict masked words in a sentence.

The novelty that CodeBERT presents is the fact that it is trained on several pro-
gramming languages and is the first large NL-PL pre-trained model for multiple
programming languages; those languages are Go, Java, JavaScript, PHP, Python,
and Ruby [32]. Moreover, CodeBERT is not only trained on NL-PL pairs, that is,
code paired with documentation, but also on a large amount of unimodal data, which
is code that is not paired with documentation. For pre-training the model, a special
separation token was used to divide the NL-PL pairs, where the natural language
was represented as a sequence of words and the code was represented as a sequence
of tokens. The output is represented as both the context vector representation of
each token and as the aggregated sequence representation [32].

2.2.2 Instructor-xl
INSTRUCTOR is a sentence transformer-based model that is based on the single-
encoder architecture that follows prior work [35]. INSTRUCTOR is designed to be
applicable to many different tasks across different domains, not only within pro-
gramming. The model takes text and task instructions as input. This makes it
so INSTRUCTOR is aware of the domain and the task and generates an embed-
ding that better fits the given task and domain. This was achieved by training
the model on a large collection of datasets across multiple categories of tasks and
domains.

Results show that INSTRUCTOR performs significantly better than other state-of-
the-art models across all different types of tasks [35]. It outperforms the previous
Sent-T5-XXL [36] model with a fraction of the parameters, with INSTRUCTOR
having 335 million parameters compared to 4.8 billion parameters for Sent-T5-XXL

7


2. Background

[35]. The model used in this study is Instructor-xl, which has 1.5 billion parame-
ters.

2.2.3 GPT-2
GPT-2 is also a transformer-based model that follows the structure and details of
the original GPT model [26]. When the GPT model was introduced, it was difficult
for natural language classification models to perform well due to a scarce amount
of task-specific labelled data [37]. The GPT model introduced a way for natural
language models to realise gains on tasks that are unlabeled by performing generative
pre-training on the data, followed by discriminative fine-tuning on the specific task.
The GPT model also made use of task-aware transformations of the input, which
were not used in previous work. The pre-training made the model gain significant
knowledge and the ability to process long-range context and dependencies [37]. The
training data for GPT-2 was obtained by scraping web pages from outbound links on
Reddit [38]. GPT-2 made improvements to the original model by making detailed
changes to the model architecture, increasing the number of parameters, increasing
the size of the vocabulary, and increasing the context and batch size [26].

The GPT model proved to be effective and achieved state-of-the-art performance on
the majority of the tested datasets [37]. The use of generative pre-training followed
by discriminative fine-tuning proved to be a big step in machine learning and is now
a method that is employed by a majority of LLMs [39].

2.3 Classifying Algorithms
Classification algorithms represent a category of machine learning methods where a
model is trained to correctly identify the label of a given input [40]. This approach
falls under supervised learning, necessitating labelled training data for the model to
learn the pattern of the data. Classification problems involve two types of target
variables: categorical and continuous. When the desired output is discrete, the task
is referred to as classification. In contrast, when the output is continuous, the task
is categorised as a type of regression.

There are two primary types of classification tasks: binary classification and multi-
class classification. In binary classification, the target variable is binary, taking on
values such as 0 or 1, true or false, etc. Meanwhile, in multiclass classification, the
target variable can belong to one of multiple classes [41].

Multiple classification algorithms can be combined for a single classification task.
This is known as an ensemble. Ensembles are models that use two or more algo-
rithms to make their classifications. This is done to improve the performance, gen-
eralisability, and robustness of the model. There are three main types of ensembles:
bagging, stacking, and boosting. Bagging fits many classifiers on different samples
of the dataset and then averages all the predictions made. Stacking fits many dif-
ferent classifiers on the same data and then uses a separate classifier to combine the

8


2. Background

predictions. Boosting is when multiple classifiers are fitted in sequence, where each
algorithm fitted aims to minimise the errors made by the prior algorithms. The
output of boosting is a weighted average of all the classifiers [42].

A variety of classification algorithms will be trained and evaluated in this study
to compare their effectiveness when predicting design patterns. This paper will
use multiclass classification as there are many different design patterns that will be
predicted. Multiclass classification is often handled using a technique called One-
Vs-Rest (OVR). The OVR technique means that a classifier will fit one classifier
for each label [43]. Each classifier will then handle it as a binary classification
problem, with its assigned label being the positive examples and everything else
being negative. The classifiers will then produce a confidence score for the sample
based on its assigned label, and the label with the highest confidence score is what
the sample will be predicted to be.

The data used for training classification algorithms is structured as entities in a
list, each with features, where each feature represents a property of the class. It
is common to encounter an imbalanced dataset, which can negatively impact the
performance of many classifiers. The next sections describe a subset of the available
classification algorithms.

2.3.1 Random Forest
Random Forest is an ensemble learning technique where multiple decision trees are
trained with a subset of the training data [44]. The classifier then predicts the data
by aggregating the predictions of all the decision trees. This diversity of decision
trees makes the model less prone to overfitting [45]. Furthermore, Random Forest
is well-suited for handling high-dimensional feature spaces.

2.3.2 Naive Bayes
Naive Bayes is a supervised learning method that uses the principles of probability to
determine the outcome of a class, specifically Bayes’ theorem [46]. This algorithm
assumes that each feature is independent and is often used in test classification
problems. Naive Bayes is seen as a very simple and naive algorithm, as it does
not care if features are correlated. Bayes’ Theorem is represented by the following
formula:

P (A|B) = P (B|A)P (A)
P (B)

Here P (A|B) is the probability of event A occurring given that event B has already
occurred.

2.3.3 Gaussian Process
Gaussian Process (GP) is a supervised learning method that can be used for prob-
abilistic classification problems [47]. This means that the model gives a measure of

9


2. Background

how confident it is in its prediction. In short, GP works by assuming some underly-
ing relationship in the data known as the prior and then using a link function (often
logistic) to get the probabilities that the data is of a certain class. One disadvantage
of the Gaussian Process is that they use the whole dataset when predicting, and
they are less effective in high-dimensional data, i.e., when the feature vector is more
than a few dozen dimensions.

2.3.4 Logistic Regression
Logistic Regression is a probabilistic discriminative model that is used for classifica-
tion problems [48, Chapter 4.3]. The Logistic Regression model is at the core used
for binary classification problems but can be used to classify multiclass problems by
using multinomial logistic regression or by using the One-vs-Rest approach. This
classifier uses a logistic function to describe the outcome of a class. The following
formula describes the Logistic Regression, where P (y = 1|x) stands for probability
that y=1 given the input x.

P (y = 1|x) = 1
1 + e−(β0+β1x1+...+βpxp)

2.3.5 Support Vector Machine
Support Vector Machines (SVMs) are supervised learning methods that can be used
for both regression and classification tasks [48, Chapter 7]. They are highly effective
in high-dimensional spaces. SVMs work by constructing a set of hyperplanes that
maximise the margin between the training data. Figure 2.2 illustrates how an SVM
can separate the training data into two distinct classes. The data points closest to
the separating hyperplane are called support vectors, as they critically define the
margin boundaries.

Figure 2.2: Decision boundaries for a linearly separable problem [49].

10


2. Background

2.3.6 Nearest Neighbour
Nearest Neighbour (NN) is a simple algorithm that stores instances of the training
data [50]. Classification of new data points is done by a majority vote of their nearest
neighbours. There are several variations of the Nearest Neighbour algorithm, but
the most common one is KNN. In KNN, the value of K determines the number of
nearest data points considered when assigning a class to a new instance. Larger
values of K reduce the impact of noise but can also blur boundaries between classes.
KNN algorithms can become less effective on large datasets since every new point
must be compared against all other points to determine its classification.

2.3.7 Gradient Boosting
Gradient Boosting is a powerful classification algorithm and one of the most popular
ensemble methods [51]. This algorithm, a type of boosting, excels at capturing non-
linear patterns in data. The model works by sequentially fitting simpler models,
usually decision trees, that attempt to correct the errors made by previous mod-
els. It uses the negative gradient of the loss function to determine the direction of
improvement for each subsequent model, repeating this process for the number of
estimators used in the model.

2.3.8 Artificial Neural Networks
Artificial Neural Networks, or just Neural Networks, also known as Multi-Layer
Perceptrons (MLPs), are a type of machine-learning algorithm that can be used for
multiple machine-learning tasks, including classification [52]. The building blocks
of a Neural Network are the neurons. A neuron has weighted input signals and an
output signal that transforms the value using an activation function. Neurons are
arranged into networks that consist of an input layer, one or more hidden layers,
and one output layer. The input layer is what receives the input and passes it onto
the first hidden layer. The hidden layers contain the neurons that transform the
value they receive using the weights and activation functions of the neurons. The
final layer is the output layer, which generates an output of the model according
to the task. A simple model of a Neural Network with input, hidden, and output
layers can be seen in Figure 2.3 below.

2.4 Clustering Algorithms
Clustering algorithms are tasks in machine learning that group similar objects to-
gether into clusters. This method is a form of unsupervised learning, which means
it does not require labelled data for training [53]. Clustering methods are com-
monly used when the underlying structure of the data is unknown and to discover
new patterns within the data. Unlike classification algorithms, clustering does not
usually measure performance using accuracy, precision, or recall. If labelled data is
available, it can be used to test the accuracy of the algorithm, but it can be tricky
as the clustering method doesn’t explicitly say which cluster belongs to a specific

11


2. Background

Figure 2.3: A simple model of a Neural Network.

class. Furthermore, clustering algorithms use a graphical aid called the Silhouette
[54] to measure performance. Silhouettes are created by representing each cluster
in a Silhouette diagram based on its tightness and separation. A mean Silhouette
coefficient is then calculated by comparing the sample points to the generated Sil-
houette, providing a value between -1 and 1. Here, -1 indicates that a point has been
incorrectly assigned to a cluster, while 1 represents the optimal assignment [55]. The
following sections describe a subset of the available clustering algorithms.

For this study, clustering algorithms will be tested to see if they can find any clusters
from the extracted embeddings.

2.4.1 K-Means
K-Means is a widely used clustering algorithm that tries to group data into a spec-
ified number of clusters based on their variance [56]. The main objective of the
algorithm is to choose a centroid that minimises the within-cluster sum-of-squares,
also known as inertia. The centroids selected are the means of the clusters. After a
centroid has been chosen, the samples are assigned to their nearest centroid, which
is subsequently updated in that cluster based on the mean of the assigned samples.
This process repeats until the centroid stabilises and does not move significantly
when adding new samples. Since K-means uses inertia to construct the cluster, it
assumes that the clusters are more convex and isotropic-shaped and thus perform
worse on elongated and irregular shapes.

12


2. Background

2.4.2 DBSCAN
The DBSCAN algorithm clusters data by separating high-density areas from low-
density areas [57]. The clusters are generated by finding a set of core samples given
by a distance measure between data points. The neighbouring core samples are then
measured to find all core samples of those; this is recursively done until no more core
samples are found. These core samples are then surrounded by non-core samples,
which are samples that are further away from the core samples given a threshold.
The advantage of DBSCAN compared to K-means is that it can find clusters of any
shape.

2.4.3 Mean Shift
Mean Shift clustering is a centroid-based algorithm that works by updating which
centroid data points are considered the mean for a given region of data points [58].
The different candidates are then filtered to eliminate candidates that generate the
same or very similar clusters. Centroid positions are calculated using hill climbing
in order to find local maxima. The algorithm automatically calculates the number
of clusters, but it can also be set manually. Mean Shift requires multiple nearest
neighbour searches during the execution of the algorithm. This creates a complexity
of O(N2) for the algorithm and also makes it not highly scalable. Application
domains for Mean Shift include computer vision and image processing.

2.4.4 Gaussian Mixture
Gaussian Mixtures are a special type of clustering model called Gaussian Mixture
Models [59]. Their probabilistic model assumes that all the data points are generated
from a finite number of Gaussian distributions. There are different types of Gaussian
mixture models, depending on the chosen estimation strategy. One estimation strat-
egy uses the expectation-maximisation algorithm. The main difficulty of Gaussian
mixture models is dealing with unlabeled data and knowing what data comes from
which cluster. The expectation-maximisation algorithm uses an iterative process to
get around this issue. It starts by assuming that clusters are random by center-
ing them on randomly chosen data points or by using K-Means. The initialization
method can be random or done by K-Means. It then calculates the probability of a
data point being generated by each cluster and tweaks the parameters to maximise
the likelihood of the data points, given their cluster assignments. This process is
repeated and will be guaranteed to converge to a local optimum. Gaussian mixture
models are often found in biometric systems, most notably in speech recognition
[60], [61].

2.4.5 BIRCH
BIRCH is another clustering algorithm that works by building a tree based on
the given data called a Clustering Feature Tree (CFT) [62]. The tree consists of
Clustering Feature Nodes (CF Nodes) and the CF Nodes contain subclusters, and
the subclusters can contain CF Nodes as children. The subclusters are what contain

13


2. Background

the necessary information used to cluster the data. The algorithm works by inserting
new samples into the root of the CFT, which is a CF Node. The new sample is then
merged with the subcluster in the root that has the smallest radius after merging.
If the subcluster it merged with contains a child CF Node, the process is repeated
until it reaches a leaf. Splitting of nodes into branches occurs when the radius of
a subcluster after merging is above a defined threshold value and the number of
subclusters is greater than the branching factor. The number of clusters can be
manually set; if it is not set manually, the clusters will be the subclusters at a
leaf. Applications for BIRCH include image classification and image compression
[63].

2.4.6 Affinity Propagation
Affinity Propagation works by passing messages between pairs of samples until the
algorithm converges [64]. A dataset is described using exemplars, which are a small
number of data points that are supposed to represent other data points. The mes-
sages that data points pass to each other represent how suitable a data point is to
be the exemplar of another data point. This gets updated in response to values from
other pairs. The updating continues iteratively until it converges. At convergence,
the final exemplars are picked and the clusters are created. Like Mean Shift, Affinity
Propagation is not highly scalable due to its complexity of O(N2T ) where N is the
number of data points and T is the number of iterations until convergence.

The next section will present some of the related work done in this field of design
pattern recognition, where concepts and terms from this section will be widely used.
Information from this section will also be used when discussing the results of the
study.

14


3
Related Work

This section presents previous studies that utilised various methods for the purpose
of DPR as well as the evaluation of classification algorithms in non-DPR contexts.
It outlines the main components of these studies, including their evaluation methods
and results.

3.1 Existing Approaches for DPR

Yarahmadi and Hasheminejad conducted an extensive literature review of papers
about design pattern detection that were published between 2008 and 2019 [12].
In their study, they found that the majority of the approaches to DPR worked on
creational, structural, and behavioral patterns. They did, however, note that some
patterns were harder to identify or could not be identified at all due to their similar
structure. Yarahmadi and Hasheminejad also found that a large majority of DPR
approaches use source code as their type of input and found that there were many
different ways of representing the data, including metrics and feature vectors. They
also found that there were multiple different DPR approaches, but the most common
were learning-based approaches that made use of artificial intelligence and machine
learning. Furthermore, they found that the P-MARt repository [65] and the projects
within it were used by a large majority of the papers for training and evaluation of
their methods.

Yarahmadi and Hasheminejad also looked at the results of different DPR approaches
in their study and presented advantages and disadvantages of the approaches [12].
They found that learning-based approaches performed well. It had a high recall and
precision, was evaluated on large test cases, and could detect design patterns such as
Singleton. But it had difficulties detecting patterns with similar structures. Metric-
based approaches performed worse than learning-based approaches but could detect
patterns with similar structure. They found that many of the approaches had small
test cases, which questioned their ability to function in large systems. They also
found that many approaches cannot detect different variants of design patterns, but
that learning-based approaches are capable of this. They also state that many of the
methods have weak performances and evaluations. Yarahmadi and Hasheminejad
believe that many benefits can be gained by combining different methods.

15


3. Related Work

As stated by Yarahmadi and Hasheminejad, there are many approaches that have
been extensively researched when it comes to design pattern recognition [12]. Dif-
ferent approaches have different methods for how they represent their data. Some
of these include metrics extracted from the source code, the Abstract Syntax Tree
(AST), which is a tree that represents the structure of the code, and the Abstract
Semantic Graph (ASG), which captures a higher level of abstraction compared to
the AST. For ML-based approaches, the data representation may instead come from
using language models. These language models are trained on a large set of data
and can extract the semantic meaning of the source code in a more generalisable
way. The next section outlines two different DPR approaches researched.

3.1.1 Metric-based Approaches
Tsantalis et al. explain a method for detecting Java design patterns from similarity
scoring between graph vertices [9]. They evaluated their algorithm on three open
source projects: JHotDraw, JRefactory, and JUnit. Each project was parsed, and
various metrics were extracted to get a hierarchical tree structure of the classes,
methods, and their invocations. These directed graphs were then mapped to matri-
ces for easier manipulation. Known design patterns were also hard-coded as matri-
ces, each with descriptive attributes. The graphs and matrices from the extracted
code could then be used to measure the similarity score to find the best match. The
authors also note that some design patterns have identical structures and cannot
be individually identified without further context; these are Adapter/Command,
and State/Strategy. Their results showed that their method could correctly iden-
tify every design pattern in their scope with only two false negatives and no false
positives.

Dwivedi et al. utilise a metric-based approach where they extracted 67 metrics
using the JBuilder software [10]. Some of these metrics are familiar, such as lines
of code or the number of parameters. But it also includes more advanced metrics
such as the polymorphism factor and coupling between objects. They evaluated
the performance of three different supervised learning methods: Artificial Neural
Networks, Support Vector Machine, and Random Forest. Five design patterns were
used: Abstract Factory, Adapter, Bridge, Composite, and Template Method. For
each pattern, all the related classes were used to extract metrics from the pattern.
The design patterns used were from the P-MARt repository [65]. Validation was
done using cross-validation, and accuracy was used to measure the performance of
the models. Their study showed that ANN performed the poorest, while random
forest was the best, with an F1-measure of 100% on QuickUML and JUnit and
97.6% on JHotDraw, averaged over all design patterns focused on.

Zanoni et al. describe an approach for detecting design patterns in code [6]. Their
approach makes use of graph matching and machine learning techniques. Their
study builds on previous work by implementing enhancements in their analysis
method. They introduce a tool called MAPRLE-DPD. This tool is divided into
three main modules. The first module is the information detector engine. This
module is responsible for building a model of the programme and collecting and

16


3. Related Work

storing microstructures and metrics in the model. The second module is the joiner.
The joiner works on the microstructures that the information detector engine has ex-
tracted and extracts all possible design patterns. The third module is the classifier.
The classifier computes similarities between the design patterns extracted by the
joiner by comparing them to known implementations of the design patterns.

In their study, Zanoni et al. also define two different categories of design patterns
[6]. These categories include single-level patterns and multi-level patterns. They
define patterns as trees of levels where each level contains at least one role. A role
is a class that is needed to implement a design pattern. A single-level pattern is
a design pattern that can be implemented using only one level. Multiple roles are
allowed in single-level patterns as long as they are at the same level. Multi-level
patterns are design patterns that consist of multiple levels. A pattern can consist
of multiple levels if its implementation uses multiple classes and one of the classes
inherits from another class. An example of a multi-level design pattern is Template
Method. Template Method creates an abstract template class that is then inherited
and used by concrete classes.

Zanoni et al. evaluated their model in the form of experiments [6]. The dataset
of design patterns they used consisted of 10 different projects, and the training set
totaled 2794 different pattern instances. One project was gathered from different
design pattern examples found on the web. The nine other projects were anal-
ysed in the P-MARt repository [65]. They used multiple different classification and
clustering algorithms to evaluate the performance of the model, including Decision
Trees, Random Forest, SVMs with different kernel functions, Naive Bayes, Sim-
pleKMeans, and CLOPE [6]. Their model and the algorithms were evaluated on
single-level patterns and multi-level patterns separately, and they only used cluster-
ing algorithms for multi-level patterns. The single-level patterns experimented on
were Singleton and Adapter. The multi-level patterns were Composite, Decorator
and Factory Method. The algorithms were compared on three different metrics: ac-
curacy, F1-measure, and Area Under Curve (AUC). For single-level patterns, Zanoni
et al. concluded that all algorithms have similar performances, with the exception of
Naive Bayes and one variation of SVM that performed significantly worse than the
other algorithms. For multi-level patterns, they found the best performance with an
SVM. It is worth noting, however, that the goal of the study was to evaluate their
approach to DPR, and there was no extensive testing or comparison of different
algorithms.

3.1.2 LLM-based Approaches
Pandey et al. proposed an approach for DPR that utilised language models that have
been trained on programming languages [11]. The study introduced a DPR tech-
nique called TransDPR. The language model used in their study was TransCoder.
TransCoder is a language model developed by Meta [66], and the purpose of the
model is to translate source code from and to Java, C++, and Python. Pandey et
al. use TransCoder to extract the semantic information of a program [11]. This was
done by extracting the embedding vector produced by TransCoder at the encoder

17


3. Related Work

block of the TransCoder model.

Their model was evaluated over two patterns: Singleton and Prototype. The pattern
instances were found in open-source projects on GitHub and were written in C++.
For each instance of a pattern, they also created an artificial counter instance by
multiplying the embedding vector of the instance with -1. The counterexamples were
annotated as non-singleton and non-prototype. This was done to reduce overfitting
and to make sure the model is not biassed towards any design pattern. Their
training set was composed of 52 embedding vectors, of which half were artificial
counterexamples. The model was assessed by splitting data into a training and
test set, with 70% of the data being part of the training set. They only used one
classifier, Logistic Regression, and the model gave 50% accuracy for Singleton and
90% accuracy for Prototype. Pandey et al. also performed an overall test where
Singleton and Prototype were combined into positive examples and non-singleton
and non-prototype were combined into negative examples. The accuracy for the
overall test was 90%.

A study performed by Chand et al. tested the feasibility of using language models
for DPR within industrial contexts [7]. Their approach is similar to the approach
used by Pandey et al. in their study [11]. Chand et al. use two language models,
Word2Vec and Code2Vec [7]. These models are used to generate an embedding of
the source code. Both of the models were pre-trained on two open-source systems
often utilised within the automotive industry: Genivi [67] and Android Auto [68].
The pre-training dataset consisted of 43,397 code examples, which were split up into
test, validation, and training datasets. Both models had similar performance with
the pre-training dataset, with Word2Vec performing slightly better.

Chand et al. evaluated the ability of the models to detect design patterns with
27 code examples across three design patterns: Singleton, Prototype, and Builder
[7]. The algorithm they used was a K-means cluster algorithm. For the purpose of
detecting design patterns, Code2Vec outperformed Word2Vec with a mean F1-score
of 0.781 compared to 0.690 achieved by Word2Vec. The results showed that the
models could store architectural information about the source code.

3.2 Existing evaluation of classification algorithms

There is also work in the field of evaluating classification techniques in a non-DPR
context. The studies are mainly performed using different datasets, and the results
show that classifiers are often task-dependent; in other words, some classifiers are
good on some datasets while others are worse.

Caruana et al. conducted a large-scale study in which they compared ten supervised
learning methods to determine which methods had the best performance across 11
different binary classification problems [13]. Their study included supervised learn-
ing algorithms such as logistic regression, SVMs, naive Bayes, KNN, random forest,
decision trees, bagged trees, boosted trees, boosted stumps, and neural networks.

18


3. Related Work

They evaluated these learning algorithms using eight performance metrics, includ-
ing accuracy, F-score, and Lift Curves. Additionally, calibration methods such as
Platt Scaling and Isotonic Regression were applied to compare the algorithms prob-
abilistically. For datasets, they used 11 binary classification problems, including
ADULT, COV_TYPE, and LETTER. Their results showed that boosted trees per-
formed best when considering the average across all eight metrics, with random
forests coming in second. They also noted that there is no universally best learning
algorithm, as some algorithms that performed poorly on some problems excelled on
others. For example, random forest, neural networks, and logistic regression were
the best models on the MEDIS dataset. Overall, the worst models were naive Bayes
and memory-based learning, such as KNN.

Soofi and Awan investigated several machine learning classification techniques and
the applications for which they were used in [14]. Due to their effectiveness and
simplicity, Decision Trees were the most commonly used algorithm for classification
tasks [69]. One problem that Decision Trees excelled at was dealing with multi-
valued attributes. Bayesian Networks were another classification technique inves-
tigated. This method graphs variables and their conditional dependencies using a
Directed Acyclic Graph (DAG). One disadvantage of Bayesian Networks is that they
require continuous variables to be discretized, which can cause classification issues
[70], [71]. The benefits of Bayesian Networks include their ability to solve both
regression and classification problems, as well as their ability to handle missing val-
ues. K-nearest neighbour (KNN) is another classification technique that is good
at scaling over multimedia datasets and was used to address issues with space and
time constraints. KNN is also effective when dealing with large amounts of train-
ing data and is resistant to noisy training data [72]. Some disadvantages include
computation complexity, memory limitation, and poor run-time performance [73],
[74]. Lastly, Support Vector Machines (SVM) is a technique considered to be one of
the more convenient methods for solving classification problems [75]. SVMs excel at
handling high-dimensional and non-linearly separable problems. However, a notable
disadvantage is that achieving accurate results often requires setting key parameters
correctly. SVM can be applied to issues involving multi-label classification and is
effective at controlling the false-positive rate.

19


3. Related Work

20


4
Method

The main method used in this project was an Mining Software Repository (MSR)
experiment method [76]. The study used publicly available software repositories in
order to understand and improve software maintainability. The study was also a
knowledge-seeking study that aimed to evaluate and validate different methods and
their performance [77]. The independent variables in this study were the classifica-
tion and clustering algorithms. An overview of the process for RQ1 can be seen in
Figure 4.1, and an overview of the process for RQ2 can be seen in Figure 4.2. The
pipeline referenced in those figures is explained in Section 4.7.

Figure 4.1: Visualisation of process for RQ1.

Figure 4.2: Visualisation of process for RQ2.

4.1 Data Collection
Java data for Singleton, Prototype, and Builder were gathered mainly from the
P-MARt repository [65]. The P-MARt repository consists of labelled usage design
patterns that have been extracted from different open source software that is written

21


4. Method

in Java. The data gathered from P-MARt consisted of 32 examples of Prototype,
25 of Singleton, and 9 of Builder. All the examples were written in Java.

Data was also systematically gathered from public GitHub repositories. This is due
to the P-MARt data gathered being imbalanced and having an unequal amount of
data for the different patterns. An imbalanced dataset can lead to the machine-
learning algorithms overfitting on the design pattern(s) that are most common in
the dataset. This happens because the algorithm believes that the most common
design patterns have a higher chance of appearing than the other design patterns.
The P-MARt repository also only have Java data for Singleton, Prototype, and
Builder and data had to be gathered for Adapter, Bridge, Strategy, and Observer as
well as data for Python and C#. Examples of design pattern implementations were
found by searching with specific search terms on GitHub. The search terms used to
find data for all three programming languages were the following:

• "design patterns language:Java"

• "design patterns language:Python"

• "design patterns language:C#"

The results were then sorted by "Most stars". It was sorted by "Most stars", as it
usually indicates a repository being popular and of high quality. The repositories
were then manually inspected one-by-one for design pattern implementations. For
data to be included in a repository, it had to satisfy a list of requirements. The
requirements can be seen below.

• The repository should be in English.

• The repository should have at least one example of the design patterns part
of this study (Singleton, Prototype, Builder, Observer, Strategy, Bridge, and
Adapter).

• The design patterns should be explicitly annotated by the author(s) of the
repository.

The annotation of a design pattern can be: the file(s) are named after the design
pattern, the folder the file(s) are in is named after the design pattern, or the com-
ments in the code explicitly state the design pattern implemented. All the gathered
design pattern examples were then manually reviewed to see if the implementation
of the pattern looked correct when compared to the implementation presented by
Gamma et al. [2]. The number of files and their source for the Singleton, Proto-
type, and Builder patterns written in Java are shown in Table 4.1 below. All other
patterns in Java had the same number of files and were all from GitHub. Python
and C# had the same number of files, all from GitHub.

22


4. Method

Table 4.1: Table showing the amount of files and their source for the Singleton,
Prototype, and Builder design patterns written in Java.

Design Pattern P-MARt GitHub Total
Singleton 25 7 32
Prototype 32 0 32
Builder 9 23 32

4.1.1 Data Preprocessing
The gathered data went through a manual preprocessing stage. All files used for
a single pattern implementation were merged into one file, so each file in the final
data set represented a full implementation and usage of a design pattern. Parts of
the code that were deemed to be irrelevant to the implementation of the pattern
were also removed. Examples of these were: package names, comments explaining
the license the code was part of, the author of the code, and educational comments
explaining how the pattern works. Comments related to the functionality of the code
were not removed as they are relevant for the language model’s understanding of
the code. It was especially relevant for CodeBERT, as that model has been trained
on pairs of natural languages and programming languages. Meaning that comments
in the code can help the model understand the code better.

4.2 Google Colab
Google Colab is a hosted Jupyter Notebook service that allows anybody to write
and execute Python code through the browser [78]. The pipeline used in this study
was written in Google Colab. Colab was opted for as it is free to use and it provides
access to computing resources on which the code is executed. Colab gives access to
execute code on Graphical Processing Units (GPUs), which are required for running
machine learning and AI models. This sped up the process of generating embeddings
and helped us train classifiers and clusterers. Colab also supports sharing and
collaborating on projects.

4.3 Language Models Used
The study utilised language models in order to generate embeddings from the given
code that the classifying and clustering algorithms then used as input to detect
the presence of a design pattern. Each model was also responsible for tokenizing
the input if the model required it to generate an embedding. The model used for
RQ1 was CodeBERT. As it is trained on Java data, is well-known, and outperforms
other well-known models, CodeBERT was chosen as the language model to be used
in RQ1. For RQ2, multiple language models were used to study the effects of the
choice of language model on the performance of the classifiers and clusterers. In
RQ2, CodeBERT, GPT-2, and Instructor-xl were used.

23


4. Method

CodeBERT is a variant of the BERT language model that has been pre-trained
on programming languages. CodeBERT was chosen as it is a variant of a well-
known language model and its purpose is to aid with coding, making it suitable
for this study. As CodeBERT is trained on programming languages, it is known
as a Programming Language Model (PLM). It is trained on Java and Python data,
which are two of the programming languages used in this study, and achieves state-
of-the-art performance. CodeBERT is also the only PLM included in this study,
and PLMs are what previous studies utilising language models for DPR have used
[7], [11], [79]. CodeBERT also outperformed other language models on C# even
though it was not part of its training data [32]. As its performance translates over
to C# as well, it was also chosen to be used for RQ2.

GPT-2 was included as a model to use in RQ2 to investigate how and if the choice
of language model impacts the performance of the machine learning algorithms.
Later versions of the GPT LLM were not used as they were not available to down-
load locally for free. The model is also available for free, well-known, and showed
state-of-the-art performance when it was released [26]. GPT-2 and CodeBERT are
similar in their functionality but different in what they are designed for. CodeBERT
is designed and focused around code and aiding with coding [32]. GPT-2 is a more
general model that aims to aid with many different tasks in many different fields,
such as translation, summarization, and question answering [26]. GPT-2 also gen-
erates an embedding for each token instead of one embedding for all tokens. The
models are similar and both relevant for the study, but still different enough to
discuss results and performance differences.

Instructor-xl was included as a third model in RQ2 due to it being more recent (late
2022), and a lot of advancements have been made in the area of machine learning
and AI between the release of CodeBERT and GPT-2 and the release of Instructor-
xl [35]. GPT-2 and CodeBERT are also not designed for the purpose of finding the
presence of design patterns in code. Instructor-xl is also not designed for the purpose
of DPR, but the model can be given instructions for the task and therefore generate
an embedding that takes the task of DPR into account. It is therefore also of interest
how a more recent model that can generate an embedding more appropriate for the
task changes the performance of the classifiers or clusterers.

4.4 Design Patterns
The design patterns used in RQ1 were Singleton, Prototype, and Builder. These
patterns were selected for their widespread use in object-oriented programming and
their frequent appearance in previous DPR work [12]. Singleton, Prototype, and
Builder are all categorised as creational design patterns. As the focus of RQ1 was
to explore the performance of different classifiers and clusterers, we chose to select
patterns that are of the same type. The difference in performance between the indi-
vidual design patterns can then also be studied without having to make inferences
about design pattern types. Differences between pattern types will be explored in
RQ2.

24


4. Method

For RQ2, we expanded the number of patterns from three to seven, including at least
two from each design pattern category: creational, behavioral, and structural. The
patterns included for RQ2 were Singleton, Prototype, Builder, Observer, Strategy,
Adapter, and Bridge. The patterns were chosen as they were often part of previous
work in DPR and are often used within object-oriented programming [12]. This
also meant that it was not too difficult to find examples of their implementations to
use as training and testing data. At least two patterns for each classification were
chosen in order to study how machine-learning algorithms perform when classifying
or clustering patterns both within and across design pattern classifications and how
their performance varies between them.

4.5 Classifying and Clustering Algorithms
The classification algorithms used in this study were Logistic Regression, Random
Forest, K-Nearest Neighbours, Support Vector Machine, Decision Tree, Gradient
Boosting, Naive Bayes, Gaussian Process, and a Neural Network. The Neural Net-
work will be a Multilayer Perceptron classifier, where the mapping between input
and output is non-linear by using a non-linear activation function. The clustering
algorithms used in the study were K-Means, DBSCAN, Mean Shift, Gaussian Mix-
ture, BIRCH, and Affinity Propagation. K-Means, Gaussian Mixture and BIRCH
allow you to specify the number of classes present, and this will be done for those
algorithms. This is done as this is a classification task and this information is avail-
able. The embeddings are of dimension 768, and reducing this size can lead to a loss
of information. Specifics about training and testing the classifiers and clusterers are
presented in Section 4.7.3.

A wide range of classifiers and clusterers was chosen to explore the performance
of different algorithms and methods of classification. The chosen algorithms were
opted for as they are well-known and have been used in previous work within DPR
[6], [7], [11].

4.6 Metrics
The metrics calculated for each classifier were accuracy, precision, recall, and F1-
score. Accuracy is the simplest metric here, and it shows how many of the predictions
made were correct without considering any other information. It is calculated by
dividing the number of correct predictions by the number of total predictions made.
It is a value between zero and one, and the higher it is, the better the model is at
making correct predictions.

TP: True Positives

TN : True Negatives

FP: False Positives

25


4. Method

FN : False Negatives

accuracy = TP + TN

TP + TN + FP + FN

Precision is a metric that shows the proportion of how many of the predictions
it makes for a certain class are correct. For example, it will show how many of
the programmes it predicted as being Singleton are correct compared to how many
Singleton programmes there are. It is a value between 0 and 1, and a higher value
is better.

precision = TP

TP + FP

Recall shows how good the model is at finding all positive examples. It shows how
many of the data points that should be classified as true were classified as true.
In this case, it shows how many of the programmes that should be predicted as
Singleton were predicted as Singleton. It is a value between 0 and 1, and a higher
value is better.

recall = TP

TP + FN

The F1-score metric shows the accuracy of a model, and it takes precision and
recall into account. Compared to just calculating the normal accuracy, the F1-score
considers class imbalance and better reflects the true performance of a model. It is
a value between 0 and 1, and a higher value means a better-performing model.

F1 = 2 · precision · recall

precision + recall

The data used in the study is labelled. Thus, F1, precision, and recall can be
calculated for the clusterers as well. This does, however, require testing different
mappings between patterns and clusters. Mapping patterns to clusters was only
performed for RQ1 to be able to compare the clusterers and classifiers with the
same metrics. It was not done for the other RQs as their performance in RQ1 did
not show promise of the clusterers being able to compete with or outperform the
classifiers. Mapping the patterns to clusters also takes time and was not feasible at
all for RQ2.1 due the large amount of unique mappings. F1-score, precision, and
recall were calculated by testing all different mappings of pattern to cluster and
taking the results from the mapping that gave the highest F1-score.

The metrics used for RQ2 were the Adjusted Rand Index (ARI) and Silhouette
score. These metrics were included in RQ1 as well for comparison purposes. The

26


4. Method

Silhouette and ARI scores will tell whether the clusters are good in terms of how
similar the points in a cluster are to each other. If a clusterer performs well in terms
of these metrics, it can be used to classify code, but it would require the extra effort
of mapping the clusters to the patterns. The clusterers will also be tuned around
the Adjusted Rand Index.

The Silhouette score for a whole data set is calculated by calculating the Silhouette
score for each individual data point, adding them, and dividing by the total number
of data points. The Silhouette score ranges from -1 to 1. A negative score means
that the data point is likely in the wrong cluster. A positive score indicates that the
data point is assigned to the correct cluster. If clustering performs well, it should
have a Silhouette score near 1.

ai: Average distance of point i to all other points within the same cluster

bi: Average distance of point i to all points in the nearest cluster

Silhouette score(i) = max(ai, bi)
bi − ai

Silhouette score =
∑n

i=1 Sillhouette Score(i)
n

The Rand Index (RI) is a metric that measures the similarity between two different
clusterings of the same data. RI takes into account all different pairs of data and
checks if the pairs have been assigned the same cluster or a different cluster when
comparing to the true labels. It checks whether a pair within the same cluster has
the same true label or whether a pair with the same true label is in different clusters.
The Rand Index is calculated by the following formula:

Rand Index = number of agreeing pairs
number of pairs

The Adjusted Rand Index has the same value, but it adjusts the value according
to the chance of making a correct clustering at random. It calculates an expected
RI value that would have been achieved by random clustering and then adjusts the
RI score according to the expected RI. This adjustment also gives ARI a baseline
where scores near 0.0 have the same performance as random clustering. The ARI
score ranges from -0.5 to 1, where a higher score is better.

4.7 Pipeline
For testing various classifiers and clusterers, a pipeline was created in Google Colab.
An overview of the pipeline can be seen in Figure 4.3. This pipeline takes a directory
of source code files as input.

27


4. Method

Figure 4.3: Visualisation of the pipeline.

4.7.1 Tokenization
The first step in the pipeline was to tokenize the code files. Much of the data used
in this study would generate more tokens than the maximum input sequence length
of the models. This issue was handled by splitting the code files into chunks, where
each chunk was as big as the maximum input sequence length of the model. If a
chunk or a file was smaller than the maximum input sequence length, it was padded
to make the size equal to the maximum input size. This was done to make all inputs
to the model have the same size. The padding does not add any semantic value and
does not affect the embedding generated by the models. The chunks also had an
overlap of 10% where the final 10% of a chunk would be the first 10% of the next
chunk. This was done to help the models retain the context between chunks.

4.7.2 Extracting Embeddings
The tokenized chunks were then individually sent into the models, and their em-
beddings were extracted. All files that had multiple chunks would receive multiple
embeddings. These embeddings were put into one final embedding by averaging all
of their values. After this, all files would have one singular embedding as their rep-
resentation. These embeddings were then saved as a comma-separated file, allowing
them to be used later for training various classifiers and clusterers without the need
to rerun the source code through the models.

The saved embeddings are then split into training and test data, where 80% was
used for training the algorithms and 20% was used for testing. The data is split in
this way in order to have sufficient data for the classifiers to train and learn from, as
well as sufficient data to test and evaluate them. It is also done to reduce overfitting
of the algorithms [80]. All patterns were split separately to ensure that there was an
equal amount of data for each pattern in the training and test sets. It also removes
any bias that can be introduced with imbalanced data sets. The clusterers did not
use the separated data and instead created clusters from the full data set.

4.7.3 Training of Classifiers and Clusterers
The next step in the pipeline was to train and test all the classifiers and clusterers
on the data. All algorithms used the same data, and they were trained and tested
once per model and per programming language. Different hyperparameters were also
tested for each algorithm by performing a grid search with selected hyperparameters.
The grid search was also done per model and per programming language. This was
done to ensure that the best possible result was achieved for all algorithms. What

28


4. Method

hyperparameters were tested and how many combinations of each algorithm were
tested can be seen in Table B.1 in Appendix B. The classifiers were optimised based
on their F1-score, and the clusterers were optimised based on their ARI score. The
values for the hyperparameters used in the different tests can also be found in the
Appendix.

The classifiers and clusterers ran 200 times each with different training and test sets.
This was done to reduce the effect that randomness has on the performance of the
classifiers and clusterers and to see their average performance across multiple runs
and how much it varies between runs [81]. Results before and after hyperparameter
tuning were generated separately to see how much it affected the performance of the
algorithms.

4.7.4 LLM Settings
For RQ1, we utilised CodeBERT [32] to extract the embeddings. RQ1 focused ex-
clusively on the Singleton, Prototype, and Builder design patterns. All the design
patterns analysed were code written in Java. Java was chosen as the language for
RQ1, as the P-MARt data is all written in Java, and P-MARt has also been used
in multiple previous studies within DPR [12]. The data used in RQ1 consisted of
examples found in P-MARt and public GitHub repositories. The P-MARt dataset
and the combination of P-MARt and public GitHub repositories were tested sepa-
rately.

CodeBERT has a maximum input sequence length of 512 and also uses special
classification and end-of-sequence tokens. So, the data for CodeBERT was split into
chunks of size 510, and a classification token and end-of-sequence token were added
as the first and last elements, respectively, making each chunk have a size of 512.
As the maximum length is 512, the overlap for each chunk was 51 tokens.

RQ2 used two more models for the extraction of the embeddings: Instructor-xl
[35], and GPT-2 [26]. The number of patterns was extended from three to seven
and now included multiple patterns from all three types of design patterns. Two
other programming languages were also tested: Python and C#. The pipeline was
modified and extended to include GPT-2 and Instructor-xl. GPT-2 has a maximum
input sequence length of 1024. The chunks created for GPT-2 were of size 1024
with an overlap size of 102. Instructor-xl has a maximum input sequence length of
512 and uses an overlap of 51 tokens. None of them use any special classification
or end-of-sequence tokens. Instructor-xl, however, can be given instructions on the
task, and the instructions given to the model were: "Identify the software design
pattern (Singleton, Prototype, Builder, Observer, Adapter, Bridge, Strategy) in the
following code:".

29


4. Method

30


5
Results

Quantitative data was obtained from the predictions the classifiers made on the
test set and the clusters generated by the clusterers. Results relevant for RQ1 are
presented in Section 5.1. Results relevant for RQ2 are presented in Section 5.2.

5.1 RQ1: Classifier Performance on Design Pat-
tern Identification

This section presents results related to answering RQ1. The results are divided
into baseline performance, which is the performance with default hyperparameters,
presented in Section 5.1.1 and performance with tuned hyperparameters presented
in Section 5.1.2. The design patterns used for RQ1 were Singleton, Prototype, and
Builder. All design patterns were written in Java. All results presented in this
section use embeddings generated by CodeBERT. The default hyperparameters for
the algorithms can be found at scikit-learn [82].

5.1.1 Baseline Performance
The results of training and testing each classifier across 200 different runs with
different training and test splits can be seen in the box plot in Figure 5.1 and in
Table 5.1. Multiple classifiers were able to reach an F1-score of 1 in at least one
run, even without any hyperparameter tuning. The classifiers that did not achieve
a F1-score of 1 in any of the runs were Decision Tree, KNN, Naive Bayes, and
Gaussian Process. The best lowest score was 0.651, which was achieved by Logistic
Regression. Gaussian Process is the standout classifier here, with significantly worse
performance than any other classifier. The mean F1-score it got was 0.219, with
its highest being 0.328. It did have the lowest variance of all classifiers at 0.003,
with the next lowest being 0.005, which was achieved by Logistic Regression. The
second worst performance was attained by Decision Tree with a mean F1-score of
0.664.

Furthermore, Table 5.2 shows the F1-score (F1), Precision (P), and Recall (R) for
each design pattern, Singleton, Prototype, and Builder averaged over 200 runs.
These results are also visualised in confusion matrices, as can be seen in Figure
5.2.

31


5. Results

Figure 5.1: Box plot showing the F1-scores of the classifiers across 200 runs with
default hyperparameters on CodeBERT embeddings, averaged across all patterns.

Table 5.1: Table depicting mean, variance, highest, and lowest F1-scores obtained
from running each classifier 200 times with different training and test sets and with
default hyperparameters on CodeBERT embeddings, results averaged across all de-
sign patterns.

Classifier Mean Variance Highest Lowest
MLPClassifier 0.830 0.005 1.000 0.593
Gradient Boosting 0.826 0.006 1.000 0.613
Random Forest 0.815 0.006 1.000 0.573
Decision Tree 0.664 0.011 0.905 0.356
KNN 0.763 0.008 0.952 0.498
Naive Bayes 0.707 0.005 0.860 0.544
Gaussian Process 0.219 0.003 0.328 0.167
Logistic Regression 0.883 0.005 1.000 0.651
SVM 0.832 0.005 1.000 0.563

The confusion matrices show which patterns were often predicted correctly and
which ones were predicted wrong. The y-axis are the true labels, which indicate
which design pattern the file contains, and the x-axis are the predicted labels, which
indicate what the classifier predicted for that file. The diagonal in the matrices
shows the percentage where the classifiers predicted the class correct, and the other
cells show how the algorithm misclassified which design pattern the file consisted of.
The matrices show that Singleton is the design pattern that gets predicted wrong
most often. Singleton is also the design pattern that Prototype and Builder get
wrongly predicted as being. Gaussian process is the classifier that stands out here,
as it almost always predicts everything as being Singleton.

32


5. Results

Table 5.2: Table depicting mean F1-score, precision, and recall obtained from
running each classifier 200 times with different training and test sets across the
Singleton, Prototype, and Builder design patterns.

Classifier Singleton Prototype Builder
F1, P, R F1, P, R F1, P, R

MLPClassifier 0.74, 0.78, 0.73 0.89, 0.92, 0.89 0.86, 0.85, 0.89
Gradient Boosting 0.75, 0.76, 0.76 0.88, 0.92, 0.85 0.85, 0.85, 0.88
Random Forest 0.76, 0.76, 0.78 0.84, 0.89, 0.82 0.85, 0.86, 0.86
Decision Tree 0.58, 0.62, 0.57 0.69, 0.71, 0.71 0.72, 0.73, 0.73
KNN 0.67, 0.77, 0.62 0.81, 0.78, 0.85 0.81, 0.80, 0.84
Naive Bayes 0.67, 0.71, 0.67 0.67, 0.67, 0.70 0.78, 0.81, 0.77
Gaussian Process 0.51, 0.34, 1.00 0.00, 0.00, 0.00 0.15, 0.51, 0.09
Logistic Regression 0.82, 0.85, 0.83 0.93, 0.97, 0.90 0.90, 0.88, 0.93
SVM 0.77, 0.75, 0.83 0.87, 0.97, 0.81 0.85, 0.86, 0.87

Figure 5.2: Confusion matrices showing how the baseline classifiers predicted each
of the design patterns, averaged across 200 runs on CodeBERT embeddings.

33


5. Results

Table 5.3 shows the baseline performance for different clustering algorithms. These
algorithms were evaluated with 200 iterations each, and the mean was recorded for
ARI, Silhouette score, F1, precision, and recall. ARI has a baseline score of 0 which
is equal to clustering at random. Silhouette score measures the similarity of data
points in the same cluster, and a positive score means that most data points are
likely in the correct cluster. Both ARI and Silhouette score want to achieve a score
close to 1 as 1 is the highest possible score for both metrics. From this baseline
performance, K-Means got the highest ARI score at 0.283, and Mean Shift achieved
the highest Silhouette score at 0.326. Mean Shift did, however, also get an ARI score
of 0.000, meaning it is no better than random clustering. Mean Shift also got the
next lowest F1-score at 0.167. BIRCH achieved the highest F1-score with a score of
0.638. The next highest clusterer, in terms of F1-score, was K-Means with an F1 of
0.335. The lowest mean F1-score was achieved by Affinity Propagation with a score
of 0.069.

Table 5.3: ARI, Silhouette, F1, precision, and recall scores of the clustering algo-
rithms with default hyperparameters on CodeBERT embeddings, averaged across
all design patterns over 200 runs.

Clusterer ARI Silhouette F1 Precision Recall
K-Means 0.283 0.221 0.335 0.349 0.339
Mean Shift 0.000 0.326 0.163 0.109 0.323
Gaussian Mixture 0.290 0.222 0.313 0.331 0.331
BIRCH 0.263 0.167 0.638 0.668 0.644
Affinity Propagation 0.130 0.137 0.069 0.330 0.040
DBSCAN NaN NaN NaN NaN NaN

Figure 5.3 below shows a t-SNE visualisation of the data, where the data has been
standardised and reduced from 768 dimensions to 2. It is important to remember
that, due to the high dimensionality of the original data, the t-SNE visualisation
has a loss of information. The figure does show two clusters in the top right and
bottom left, which are the Builder and Prototype patterns. The Singleton pattern is
spread out in the left and middle and not in one singular cluster. Singleton (green)
is the pattern that is most intertwined between the other patterns, and it is also
the pattern that most often gets predicted wrong by the classifiers. Running the
clusterers only on Builder (blue) and Prototype (orange) does yield better results for
all clusterers in terms of the clustering metrics. All clusterers, except for BIRCH,
also saw an improvement in the classification metrics (F1, precision, and recall)
with BIRCH instead achieving a significantly worse score in those metrics. The full
results of running without Singleton can be seen in Table 5.4 below.

34


5. Results

Figure 5.3: t-SNE visualisation of the CodeBERT embeddings.

Table 5.4: ARI, Silhoutte, F1, precision, and recall scores of the clustering al-
gorithms with default hyperparameters on CodeBERT embeddings, without the
Singleton pattern, averaged over 200 runs.

Clusterer ARI Silhouette F1 Precision Recall
K-Means 0.490 0.278 0.502 0.505 0.505
Mean Shift 0.008 0.164 0.451 0.762 0.547
Gaussian Mixture 0.484 0.278 0.475 0.479 0.479
BIRCH 0.421 0.274 0.167 0.164 0.172
Affinity Propagation 0.199 0.143 0.160 0.499 0.101
DBSCAN NaN NaN NaN NaN NaN

5.1.2 Performance with Tuned Hyperparameters
This section presents the results of classifiers and clusterers after tuning their hy-
perparameters. The hyperparameters used in this section can be found in Appendix
B Table B.2.

The results for 200 runs on tuned hyperparameters for the classifiers can be seen in
Figure 5.4 and Table 5.5. They show that almost all classifiers increase their mean
value slightly. Even with tuned hyperparameters, the performance is very similar
to the baseline result, except for Gaussian Process which increased its mean from

35


5. Results

0.219 to 0.811. It was also able to get a high F1-score of 1. Although it did achieve
a much better mean score, it still had bad performances with many outliers and
a low score of 0.167. This gave it the highest variance of all classifiers at 0.024,
with the next highest being 0.009 from Decision Tree. The reason for the outliers is
due to the algorithm overfitting. This is further explained in Section 6.1.1. Logistic
Regression is still the best-performing classifier and increased its mean from 0.883 to
0.884. Moreover, Decision Tree is now the worst-performing classifier with a mean
of 0.683 due to the high increase in performance of Gaussian Process. SVM and
Naive Bayes had the same performance as with default hyperparameters.

Figure 5.4: Box plot showing the F1-scores of the classifiers across 200 runs with
tuned hyperparameters on CodeBERT embeddings, averaged across all patterns.

Table 5.5: Table depicting mean, variance, highest, and lowest F1-scores obtained
from running each classifier with tuned hyperparameters on CodeBERT embeddings,
averaged across all design patterns over 200 runs.

Classifier Mean Variance Highest Lowest
MLPClassifier 0.835 0.007 1.000 0.595
Gradient Boosting 0.851 0.007 1.000 0.535
Random Forest 0.818 0.006 1.000 0.592
Decision Tree 0.683 0.009 0.904 0.426
KNN 0.795 0.007 1.000 0.500
Naive Bayes 0.707 0.006 0.860 0.544
Gaussian Process 0.811 0.024 1.000 0.167
Logistic Regression 0.884 0.005 1.000 0.651
SVM 0.832 0.005 1.000 0.563

36


5. Results

Table 5.6 and Figure 5.5 show the performance of the tuned classifiers on individual
design patterns. The confusion matrices mostly look the same as baseline except
for Gaussian Process which now correctly can identify Builder and Prototype to a
much higher degree. Singleton is still the pattern that most often gets predicted
wrong, and the other patterns get most often wrongly predicted as being.

Table 5.6: Table depicting F1-score, precision, and recall obtained from running
each tuned classifier 200 times with different training and test sets across the Sin-