Evaluating Machine Learning Algorithms in Design Pattern Recognition Exploring the Performance of Classification and Clustering Algorithms in Design Pattern Recognition Utilising Large Language Models Master’s thesis in Computer science and engineering SIMON ANDERSSON VIKTOR BERGGREN Department of Computer Science and Engineering CHALMERS UNIVERSITY OF TECHNOLOGY UNIVERSITY OF GOTHENBURG Gothenburg, Sweden 2024 Master’s thesis 2024 Evaluating Machine Learning Algorithms in Design Pattern Recognition Exploring the Performance of Classification and Clustering Algorithms in Design Pattern Recognition Utilising Large Language Models SIMON ANDERSSON VIKTOR BERGGREN Department of Computer Science and Engineering Chalmers University of Technology University of Gothenburg Gothenburg, Sweden 2024 Evaluating Machine Learning Algorithms in Design Pattern Recognition Exploring the Performance of Classification and Clustering Algorithms in Design Pattern Recognition Utilising Large Language Models SIMON ANDERSSON VIKTOR BERGGREN © SIMON ANDERSSON, VIKTOR BERGGREN, 2024. Supervisor: Jennifer Horkoff, Computer Science and Engineering Examiner: Hans-Martin Heyn, Computer Science and Engineering Master’s Thesis 2024 Department of Computer Science and Engineering Chalmers University of Technology and University of Gothenburg SE-412 96 Gothenburg Telephone +46 31 772 1000 Gothenburg, Sweden 2024 iv Evaluating Machine Learning Algorithms in Design Pattern Recognition Exploring the Performance of Classification and Clustering Algorithms in Design Pattern Recognition Utilising Large Language Models SIMON ANDERSSON VIKTOR BERGGREN Department of Computer Science and Engineering Chalmers University of Technology and University of Gothenburg Abstract Design Pattern Recognition (DPR) is an ongoing research challenge in the field of software engineering for increasing software maintainability in code. Recent work has utilised Large Language Models (LLMs) for extracting semantic information from code. This study follows up on previous research and investigates, explores, and evaluates the performance of multiple classification and clustering algorithms when applied to embeddings extracted from LLMs. Performance is explored be- tween contexts using different LLMs, design patterns, and programming languages. Data for design pattern implementations was gathered for Java, Python, and C# via GitHub and the P-MARt repository. Each algorithm was run with tuned hyper- parameters, and their average performance across multiple runs was compared. The results indicate variance for the individual performance of the algorithms, but the overall performance order between the algorithms remains the same. Classification algorithms outperformed clustering algorithms, and clustering algorithms had low performance in the measured metrics across all tests. The results also showed a difference in performance between behavioral, creational, and structural design pat- terns. This study shows further promise for the use of LLMs for DPR and recognises the need for larger studies utilising LLMs for DPR. Keywords: Computer science, design patterns, machine learning, large language models, software engineering, design pattern recognition, DPR, LLM. v Acknowledgements We want to express our gratitude to our supervisor, Jennifer Horkoff, and our ex- aminer, Hans-Martin Heyn, for their support and guidance throughout the study. They provided us with thoughtful feedback and advice that proved helpful with the study. We also want to thank Sushant Kumar Pandey for his support of the method and the technical aspects of the study. Simon Andersson, Viktor Berggren, Gothenburg, June 2024 vii Contents List of Figures xiii List of Tables xv 1 Introduction 1 1.1 Purpose of the study . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Research questions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.3 Scope and Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2 Background 5 2.1 Design Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5 2.2 Large Language Models . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2.1 CodeBERT . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2.2 Instructor-xl . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.2.3 GPT-2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.3 Classifying Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.3.1 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3.2 Naive Bayes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3.3 Gaussian Process . . . . . . . . . . . . . . . . . . . . . . . . . 9 2.3.4 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . 10 2.3.5 Support Vector Machine . . . . . . . . . . . . . . . . . . . . . 10 2.3.6 Nearest Neighbour . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3.7 Gradient Boosting . . . . . . . . . . . . . . . . . . . . . . . . 11 2.3.8 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . 11 2.4 Clustering Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 11 2.4.1 K-Means . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12 2.4.2 DBSCAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.4.3 Mean Shift . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.4.4 Gaussian Mixture . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.4.5 BIRCH . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13 2.4.6 Affinity Propagation . . . . . . . . . . . . . . . . . . . . . . . 14 3 Related Work 15 3.1 Existing Approaches for DPR . . . . . . . . . . . . . . . . . . . . . . 15 3.1.1 Metric-based Approaches . . . . . . . . . . . . . . . . . . . . . 16 3.1.2 LLM-based Approaches . . . . . . . . . . . . . . . . . . . . . . 17 3.2 Existing evaluation of classification algorithms . . . . . . . . . . . . . 18 ix Contents 4 Method 21 4.1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21 4.1.1 Data Preprocessing . . . . . . . . . . . . . . . . . . . . . . . . 23 4.2 Google Colab . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 4.3 Language Models Used . . . . . . . . . . . . . . . . . . . . . . . . . . 23 4.4 Design Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 4.5 Classifying and Clustering Algorithms . . . . . . . . . . . . . . . . . 25 4.6 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25 4.7 Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 4.7.1 Tokenization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 4.7.2 Extracting Embeddings . . . . . . . . . . . . . . . . . . . . . . 28 4.7.3 Training of Classifiers and Clusterers . . . . . . . . . . . . . . 28 4.7.4 LLM Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . 29 5 Results 31 5.1 RQ1: Classifier Performance on Design Pattern Identification . . . . . 31 5.1.1 Baseline Performance . . . . . . . . . . . . . . . . . . . . . . . 31 5.1.2 Performance with Tuned Hyperparameters . . . . . . . . . . . 35 5.2 RQ2: Influence of Context on Classifier Performance . . . . . . . . . 40 5.2.1 RQ2.1: Performance Variance Between Design Patterns . . . . 40 5.2.1.1 Baseline Performance . . . . . . . . . . . . . . . . . . 40 5.2.1.2 Performance with Tuned Hyperparameters . . . . . . 44 5.2.2 RQ2.2: Performance Variance of Language Models . . . . . . . 47 5.2.2.1 GPT-2 . . . . . . . . . . . . . . . . . . . . . . . . . . 47 5.2.2.2 Instructor-xl . . . . . . . . . . . . . . . . . . . . . . 51 5.2.2.3 Comparison Between Language Models . . . . . . . . 55 5.2.3 RQ2.3: Performance Variance of Programming Languages . . 57 5.2.3.1 Python . . . . . . . . . . . . . . . . . . . . . . . . . 57 5.2.3.2 C# . . . . . . . . . . . . . . . . . . . . . . . . . . . 61 5.2.3.3 Comparison Between Programming Languages . . . . 65 6 Discussion 67 6.1 Algorithm Performance on Design Pattern Identification . . . . . . . 67 6.1.1 Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67 6.1.2 Clusterers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 6.2 Influence of Context on Algorithm Performance . . . . . . . . . . . . 69 6.2.1 Performance Variance Between Design Patterns . . . . . . . . 69 6.2.2 Performance Variance Between Language Models . . . . . . . 71 6.2.3 Performance Variance Between Programming Languages . . . 72 6.3 Threats to Validity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 6.3.1 Internal Validity . . . . . . . . . . . . . . . . . . . . . . . . . 73 6.3.2 External Validity . . . . . . . . . . . . . . . . . . . . . . . . . 74 6.3.3 Construct Validity . . . . . . . . . . . . . . . . . . . . . . . . 74 7 Conclusion 77 7.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77 x Contents Bibliography 79 A Appendix 1 I B Appendix 2 VII xi Contents xii List of Figures 2.1 Document converted to embeddings [31]. . . . . . . . . . . . . . . . . 7 2.2 Decision boundaries for a linearly separable problem [49]. . . . . . . . 10 2.3 A simple model of a Neural Network. . . . . . . . . . . . . . . . . . . 12 4.1 Visualisation of process for RQ1. . . . . . . . . . . . . . . . . . . . . 21 4.2 Visualisation of process for RQ2. . . . . . . . . . . . . . . . . . . . . 21 4.3 Visualisation of the pipeline. . . . . . . . . . . . . . . . . . . . . . . . 28 5.1 Box plot showing the F1-scores of the classifiers across 200 runs with default hyperparameters on CodeBERT embeddings, averaged across all patterns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 5.2 Confusion matrices showing how the baseline classifiers predicted each of the design patterns, averaged across 200 runs on CodeBERT em- beddings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 5.3 t-SNE visualisation of the CodeBERT embeddings. . . . . . . . . . . 35 5.4 Box plot showing the F1-scores of the classifiers across 200 runs with tuned hyperparameters on CodeBERT embeddings, averaged across all patterns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 5.5 Confusion matrices showing how the tuned classifiers predicted each of the design patterns, averaged across 200 runs on CodeBERT em- beddings. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38 5.6 Box plot showing F1-scores of the classifiers across 200 runs with default hyperparameters on CodeBERT embeddings, averaged across all patterns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 5.7 t-SNE visualisation of the CodeBERT embeddings of the seven design patterns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 5.8 Box plot showing F1-scores of the classifiers across 200 runs with hyperparameter tuning on CodeBERT embeddings, averaged across all patterns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44 5.9 Box plot showing F1-scores of the classifiers across 200 runs with hyperparameter tuning on GPT-2 embeddings, averaged across all patterns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 5.10 Confusion matrices showing how the tuned classifiers predicted each of the design patterns, averaged across 200 runs on GPT-2 embeddings. 49 5.11 t-SNE visualisation of the GPT-2 embeddings. . . . . . . . . . . . . . 50 xiii List of Figures 5.12 Box plot showing the F1-scores of the classifiers across 200 runs with hyperparameter tuning on Instructor-xl embeddings, averaged across all patterns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 5.13 Confusion matrices showing how the tuned classifiers predicted each of the design patterns from the Instructor-xl embeddings, results av- eraged across 200 runs. . . . . . . . . . . . . . . . . . . . . . . . . . . 53 5.14 t-SNE visualisation of the Instructor-xl embeddings. . . . . . . . . . . 54 5.15 Box plot the F1-scores of the classifiers across 200 runs with hyper- parameter tuning of the Python code on CodeBERT embeddings, averaged across all patterns. . . . . . . . . . . . . . . . . . . . . . . . 57 5.16 Confusion matrices showing how the tuned classifiers predicted each of the design patterns on the CodeBERT embeddings of the Python code, results averaged across 200 runs. . . . . . . . . . . . . . . . . . 59 5.17 t-SNE visualisation of the CodeBERT embeddings of the Python data. 60 5.18 Box plot showing the F1-scores of the classifiers across 200 runs with hyperparameter tuning on CodeBERT embeddings of the C# data, averaged across all patterns. . . . . . . . . . . . . . . . . . . . . . . . 61 5.19 Confusion matrices showing how the tuned classifiers predicted each of the design patterns on CodeBERT embeddings of the C# data, results averaged across 200 runs. . . . . . . . . . . . . . . . . . . . . . 63 5.20 t-SNE visualisation of the CodeBERT embeddings of the C# data. . 64 xiv List of Tables 4.1 Table showing the amount of files and their source for the Singleton, Prototype, and Builder design patterns written in Java. . . . . . . . . 23 5.1 Table depicting mean, variance, highest, and lowest F1-scores ob- tained from running each classifier 200 times with different training and test sets and with default hyperparameters on CodeBERT em- beddings, results averaged across all design patterns. . . . . . . . . . 32 5.2 Table depicting mean F1-score, precision, and recall obtained from running each classifier 200 times with different training and test sets across the Singleton, Prototype, and Builder design patterns. . . . . . 33 5.3 ARI, Silhouette, F1, precision, and recall scores of the clustering al- gorithms with default hyperparameters on CodeBERT embeddings, averaged across all design patterns over 200 runs. . . . . . . . . . . . 34 5.4 ARI, Silhoutte, F1, precision, and recall scores of the clustering al- gorithms with default hyperparameters on CodeBERT embeddings, without the Singleton pattern, averaged over 200 runs. . . . . . . . . 35 5.5 Table depicting mean, variance, highest, and lowest F1-scores ob- tained from running each classifier with tuned hyperparameters on CodeBERT embeddings, averaged across all design patterns over 200 runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36 5.6 Table depicting F1-score, precision, and recall obtained from running each tuned classifier 200 times with different training and test sets across the Singleton, Prototype, and Builder design patterns. . . . . . 37 5.7 ARI, Silhouette, F1, precision, and recall scores of the clustering al- gorithms with hyperparameter tuning on CodeBERT embeddings, averaged across all design patterns over 200 runs. . . . . . . . . . . . 39 5.8 Table depicting mean, variance, highest, and lowest F1-scores ob- tained from running each classifier with default hyperparameters on CodeBERT embeddings, results averaged across all seven design pat- terns over 200 runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41 5.9 Average F1-score, precision, and recall for all seven design patterns across 200 runs and nine classifiers, with default hyperparameters on CodeBERT embeddings. . . . . . . . . . . . . . . . . . . . . . . . . . 42 5.10 ARI and Silhouette score of clustering algorithms with default hy- perparameters on CodeBERT embeddings, averaged across all seven design patterns over 200 runs. . . . . . . . . . . . . . . . . . . . . . . 42 xv List of Tables 5.11 Table depicting mean, variance, highest, and lowest F1-scores ob- tained from running each classifier 200 times with tuned hyperpa- rameters on CodeBERT embeddings, results averaged across all de- sign patterns. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 5.12 Average F1-score, precision, and recall for all seven design patterns across 200 runs and nine classifiers, with tuned hyperparameters on CodeBERT embeddings. . . . . . . . . . . . . . . . . . . . . . . . . . 45 5.13 ARI and Silhouette score of the clustering algorithms with hyperpa- rameter tuning on CodeBERT embeddings, averaged across all seven design patterns over 200 runs. . . . . . . . . . . . . . . . . . . . . . . 46 5.14 Table depicting mean, variance, highest, and lowest F1-scores ob- tained from running each classifier with tuned hyperparameters on GPT-2 embeddings, results averaged across all design patterns over 200 runs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48 5.15 Table depicting F1-score, precision, and recall obtained from running each tuned classifier 200 times with different training and test sets across the Singleton, Prototype, and Builder embeddings from GPT-2. 48 5.16 ARI and Silhouette score of the clustering algorithms with hyper- parameter tuning on GPT-2 embeddings, averaged across all design patterns over 200 runs. . . . . . . . . . . . . . . . . . . . . . . . . . . 50 5.17 Table depicting mean, variance, highest, and lowest F1-scores ob- tained from running each classifier with tuned hyperparameters on Instructor-xl embeddings, results averaged across all design patterns over 200 runs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51 5.18 Table depicting F1-score, precision, and recall obtained from run- ning each tuned classifier 200 times with different training and test sets across the Singleton, Prototype, and Builder embeddings from Instructor-xl. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 5.19 ARI and Silhouette score of the clustering algorithms with hyper- parameter tuning on Instructor-xl embeddings, averaged across all design patterns over 200 runs. . . . . . . . . . . . . . . . . . . . . . . 54 5.20 Comparison of mean F1-scores achieved by the classifiers with tuned hyperparameters between CodeBERT, GPT-2, and Instructor-xl. . . . 55 5.21 Comparison of mean F1-scores for Singleton, Prototype, and Builder. 55 5.22 Comparison of mean ARI and Silhouette scores achieved by the tuned clusterers between CodeBERT, GPT-2, and Instructor-xl. . . . . . . . 56 5.23 Table depicting mean, variance, highest, and lowest F1-scores ob- tained from running each classifier with tuned hyperparameters 200 times on the CodeBERT embeddings of the Python code, results av- eraged across all design patterns. . . . . . . . . . . . . . . . . . . . . 58 5.24 F1-score, precision, and recall obtained from running each tuned clas- sifier 200 times across the Singleton, Prototype, and Builder embed- dings from CodeBERT on Python. . . . . . . . . . . . . . . . . . . . 58 5.25 ARI and Silhouette score of the tuned clustering algorithms running on CodeBERT embeddings of the Python data, averaged over 200 runs. 60 xvi List of Tables 5.26 Table depicting mean, variance, highest, and lowest F1-scores ob- tained from running each classifier with tuned hyperparameters 200 times on CodeBERT embeddings of the C# data, results averaged across all design patterns. . . . . . . . . . . . . . . . . . . . . . . . . 61 5.27 Table depicting F1-score, precision, and recall obtained from running each tuned classifier 200 times across the Singleton, Prototype, and Builder embeddings from CodeBERT on C#. . . . . . . . . . . . . . 62 5.28 ARI and Silhouette score of the tuned clustering algorithms running on CodeBERT embeddings of the C# data, averaged over 200 runs. . 64 5.29 Comparison of mean F1-scores achieved by the classifiers with tuned hyperparameters between Java, Python, and C#. . . . . . . . . . . . 65 5.30 Comparison of mean F1-scores for Singleton, Prototype, and Builder. 65 5.31 Comparison of mean ARI and Silhouette scores achieved by the tuned clusterers between CodeBERT, GPT-2, and Instructor-xl. . . . . . . . 66 A.1 Baseline Performance Metrics for Each Design Pattern and Classifier for RQ2.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . II A.2 Baseline Performance Metrics for Each Design Pattern and Classifier for RQ2.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . III A.3 Performance metrics for each design pattern and classifier for RQ2.1, with tuned hyperparameters. . . . . . . . . . . . . . . . . . . . . . . . IV A.4 Performance metrics for each design pattern and classifier for RQ2.1, with tuned hyperparameters. . . . . . . . . . . . . . . . . . . . . . . . V B.1 The hyperparameters tested for each algorithm and how many com- binations of each were tested. . . . . . . . . . . . . . . . . . . . . . . VIII B.2 Hyperparameters for Classifiers in RQ1. . . . . . . . . . . . . . . . . IX B.3 Hyperparameters for classifiers and clusterers in RQ2.1. . . . . . . . . X B.4 Hyperparameters for classifiers and clusterers for GPT-2 in RQ2.2. . . XI B.5 Hyperparameters for classifiers and clusterers for Instructor-xl in RQ2.2.XII B.6 Hyperparameters for classifiers and clusterers for Python in RQ2.3. . XIII B.7 Hyperparameters for classifiers and clusterers for C# in RQ2.3. . . . XIV xvii List of Tables xviii 1 Introduction Software maintainability is an important quality attribute to consider when devel- oping code, and it accounts for the majority of the cost of software development [1]. Therefore, keeping the software maintainable reduces the cost of software de- velopment. Design patterns provide developers with tools for addressing recurring problems in code [2]. The impact of these design patterns on software maintain- ability has been demonstrated to be positively correlated [3], [4]. Consequently, identifying the use of design patterns in large-scale codebases is essential for soft- ware maintainability. Manual inspection of code to identify design patterns is challenging and time- consuming [5]. Design patterns can also have different variations, known as variants [6], in order to fit the context of the code, which further complicates the problem. Incorrect identification of a design pattern can lead to incorrect implementations, which can affect the functionality of the system [7]. There are, however, other methods for automatically identifying design patterns in code using design pattern recognition (DPR) [8]. There are several different areas explored in DPR, from metric-based approaches like [9], [10], to machine learning-based approaches like [7], [11]. These methods have evaluated different classification and clustering techniques while focusing only on specific programming languages and design patterns. Research has shown that ML-based approaches show promising results by being able to understand variants of design patterns better than traditional metric-based approaches [12]. A consensus on which detection algorithms to use has yet to be reached [6]. Furthermore, capturing a wider range of design patterns is difficult because of the different methods for capturing features depending on the design pattern. The most recent work on this topic has been using Large Language Models (LLMs) [7], [11]. The LLMs are used to extract an embedding from the code that is then used with another machine-learning algorithm to classify if and which design pattern a code is using. As there are currently no language models designed for the purpose of DPR, utilising LLMs will also require the use of a separate classifier or clusterer. The embeddings extracted from LLMs capture the semantic context of the source code, which can provide further useful information than previously metric-based approaches have used. This method of design pattern recognition using LLMs is a 1 1. Introduction very recent topic of research with many unexplored areas. The previous research that has been done within LLM-based DPR has primarily focused on testing its feasibility. The most recent studies are exploring different areas of LLM-based DPR. They are testing different language models, the effects of pre-training the language models, the performance of different design patterns, and different combinations of these [11]. There has also been research on evaluating different supervised learning methods in non-DPR contexts [13]. But there is still a lack of research evaluating how different classification and clustering algorithms perform in the context of design pattern recognition when using other large language models, such as CodeBERT. This thesis will focus on exploring the performance of classification and clustering algorithms that are trained on embeddings generated by large language models. It is still unclear how the performance of learning algorithms varies when using them to classify embeddings generated by the LLMs. Some possible methods for classification include Decision Trees, Support Vector Machine (SVM), and K-Nearest Neighbour (KNN) [14]. For clustering, K-Means, Hierarchical Clustering, and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) can be utilised [15]. However, it remains unclear which method provides the best performance [6]. Different programming languages have varying syntax and coding styles, making it important to evaluate which algorithm should be used. Furthermore, the choice of language model used for extracting embed- ded feature vectors will influence how the classification and clustering algorithms are trained. Therefore, a study is required to explore the feasibility of a range of classification and clustering algorithms. 1.1 Purpose of the study The purpose of this study is to advance the field of software development by explor- ing and evaluating different classification and clustering algorithms in the context of identifying the use of design patterns in code. Specifically, this research aims to delve into the efficiency and accuracy of these algorithms when applied to em- beddings extracted from code via LLMs. The study will also explore how different design patterns, language models, and programming languages affect the perfor- mance of the algorithms. The algorithms will be quantitatively analysed to measure their performance. Thereby determining effective algorithmic approaches for accu- rately classifying and clustering software design patterns. As a result, enhancing the ability of developers to maintain, understand, and improve large and complex codebases. 1.2 Research questions The research questions are divided into two parts: the performance of the different algorithms (RQ1), and whether and how the choice of classification algorithm may change depending on the choice of design patterns, programming language, and 2 1. Introduction language model that is used (RQ2). RQ1: What are the performances of different classifying and clustering algorithms in the context of identifying design patterns in code? This is performed to find the differences in performances between different classifiers and clusterers and get a baseline of their performance in DPR. RQ2: Does the performance of different classifying and clustering algorithms vary depending on the context in which they are used? RQ2.1: Does the performance vary between design patterns? RQ2.2: Does the performance vary between language models? RQ2.3: Does the performance vary between programming languages? This is performed to explore how the different parts, such as what language the code is written in and what type of LLM is being used, of a whole system affect the performance of the classifier. The goal of RQ2 is to find out if there are different classifiers with optimal performance depending on the context in which they are applied. It is done in order to study the effects on the performance of the classifiers when applied in different contexts. 1.3 Scope and Limitations The study is limited to the assessment of classification and clustering algorithms in the context of DPR. The study will not examine all the design patterns identified by Gamma et al., nor will it examine their industrial adaptations [2]. This is due to the fact that there is not a lot of data available for this type of research. The main programming language that will be used in this study and used for RQ1 will be Java. Other programming languages will only be included for RQ2. This is due to the publicly available data for Java being much higher than other programming languages. The findings will, however, still be applicable to the patterns excluded from the study. The study will not use LLMs that require pre-training, and it will not look into how pre-training the LLMs affects the performance of the classifiers. This is due to the size of the models, and pre-training them requires a lot of time and hardware that is not readily available. The study will not explore the fine-tuning of the LLMs either, as it is outside the scope of the research questions and not feasible to be performed within the time schedule. Very large language models, such as models with multiple billions of parameters, will not be used due to hardware limitations. 3 1. Introduction 4 2 Background The focus of this study lies in the evaluation of a range of classification and cluster- ing algorithms for DPR. The study will use implementations of design patterns as data, then use LLMs to extract important features, which will then be fed into clas- sification and clustering algorithms that will predict which design pattern a given file consists of. This section provides the background and theoretical information necessary to understand the method and the findings of the study and will bring up the following topics: design patterns, large language models, and classification and clustering algorithms. 2.1 Design Patterns A large part of software development is writing understandable and maintainable code. When codebases grow and developers add new functionality, the source code’s complexity grow along with it. New developers learn about the importance of main- taining code and study various tools for this. One of the tools for keeping a struc- tured and maintainable code base is design patterns. Design patterns are used in software development as a guide or description for how to solve commonly occurring problems [2]. Design patterns can be divided into three different types: creational, structural, and behavioral patterns [2]. Creational patterns are patterns that encapsulate knowl- edge of a class that should be instantiated, as well as hide the underlying structure of the class. Some examples of creational patterns are Singleton [16], Builder [17], and Prototype [18]. Structural patterns are patterns that focus on the structure of objects and classes while keeping their structure flexible. Some examples of struc- tural patterns are Adapter [19], Bridge [20], and Composite [21]. Lastly, behavioral patterns are responsible for the communication between objects and classes and characterise complex control flows in programs. Some examples of behavioral pat- terns are Strategy [22], Observer [23], and State [24]. 5 2. Background 2.2 Large Language Models Language Models (LMs) are models that can understand and generate human lan- guage [25]. LMs have the ability to predict a sequence of words and can generate new text based on a given input. The most common type of LM estimates what words may come next based on the prior words and their context. Large Language Models (LLMs) are more advanced language models that have a large number of pa- rameters, often reaching multiple billions of parameters [26]. One key part of LLMs is their transformer architecture [27] and the underlying self-attention mechanism [28] it uses to determine the relevance of different parts of the input. The transformer architecture presented by Vaswani et al. consists of an encoder and a decoder [28]. The encoder takes an input sequence of symbol representations and creates a sequence of continuous representations. The encoder creates a set of vectors for each word that describe the words’ relationship with other words in the sequence. The decoder uses the output of the encoder to create a contextually relevant output of the model. Both the encoder and decoder contain self-attention mechanisms that help with extracting the contextual information of the input. This is done one-by-one by taking each word of the input and calculating and assigning weights to all the other words of the input based on their vector similarities that the model learns during training. Other important parts of LLMs are the ability to pre-train them on datasets and fine-tune them to perform specific tasks [29], as this has proven to improve their performance on multiple different tasks [30]. To represent values and inputs, LLMs use something called embeddings [31]. An embedding is a mathematical representation of objects like images, text, and audio. They represent these objects in the form of vectors. Figure 2.1 illustrates how em- beddings are extracted from a document. Embeddings enable language models to find similarities and relationships between objects. As inputs to language models are objects like texts and images, the models will convert the inputs into embeddings. How a model generates the embeddings is different for each model, and the same input will generate different embeddings for different models. The generated em- bedding is then compared to the data the model was trained on to find similarities between the new input and what the model already knows. It will then use these similarities to perform its task. This study will utilise existing LLMs to extract embeddings from source code, which in turn can be used to train classification and clustering algorithms. Since LLMs can capture the structure and semantic information from a text, classifiers can hopefully learn which semantic information is part of which design pattern and therefore get a higher accuracy when predicting. Section 2.2.1 through 2.2.3 presents the different LLMs used in this study. 6 2. Background Figure 2.1: Document converted to embeddings [31]. 2.2.1 CodeBERT CodeBERT is a transformer-based bimodal pre-trained model that achieves state-of- the-art performance in natural language code search and code documentation gen- eration [32]. Bimodal signifies that it is trained on Natural Language-Programming Language (NL-PL) pairs, which capture the semantic connection between natural language and programming language. Results show that it performs better than previous pre-trained models on NL-PL probing. The BERT in CodeBERT stands for Bidirectional Encoder Representations from Transformers, which is a framework that is used for natural language processing (NLP). This model is similar to other models using this framework, such as RoBERTa [33] and BERT [34]. The idea be- hind these models is to learn the contextual representations of huge natural language, which can, for example, be used to predict masked words in a sentence. The novelty that CodeBERT presents is the fact that it is trained on several pro- gramming languages and is the first large NL-PL pre-trained model for multiple programming languages; those languages are Go, Java, JavaScript, PHP, Python, and Ruby [32]. Moreover, CodeBERT is not only trained on NL-PL pairs, that is, code paired with documentation, but also on a large amount of unimodal data, which is code that is not paired with documentation. For pre-training the model, a special separation token was used to divide the NL-PL pairs, where the natural language was represented as a sequence of words and the code was represented as a sequence of tokens. The output is represented as both the context vector representation of each token and as the aggregated sequence representation [32]. 2.2.2 Instructor-xl INSTRUCTOR is a sentence transformer-based model that is based on the single- encoder architecture that follows prior work [35]. INSTRUCTOR is designed to be applicable to many different tasks across different domains, not only within pro- gramming. The model takes text and task instructions as input. This makes it so INSTRUCTOR is aware of the domain and the task and generates an embed- ding that better fits the given task and domain. This was achieved by training the model on a large collection of datasets across multiple categories of tasks and domains. Results show that INSTRUCTOR performs significantly better than other state-of- the-art models across all different types of tasks [35]. It outperforms the previous Sent-T5-XXL [36] model with a fraction of the parameters, with INSTRUCTOR having 335 million parameters compared to 4.8 billion parameters for Sent-T5-XXL 7 2. Background [35]. The model used in this study is Instructor-xl, which has 1.5 billion parame- ters. 2.2.3 GPT-2 GPT-2 is also a transformer-based model that follows the structure and details of the original GPT model [26]. When the GPT model was introduced, it was difficult for natural language classification models to perform well due to a scarce amount of task-specific labelled data [37]. The GPT model introduced a way for natural language models to realise gains on tasks that are unlabeled by performing generative pre-training on the data, followed by discriminative fine-tuning on the specific task. The GPT model also made use of task-aware transformations of the input, which were not used in previous work. The pre-training made the model gain significant knowledge and the ability to process long-range context and dependencies [37]. The training data for GPT-2 was obtained by scraping web pages from outbound links on Reddit [38]. GPT-2 made improvements to the original model by making detailed changes to the model architecture, increasing the number of parameters, increasing the size of the vocabulary, and increasing the context and batch size [26]. The GPT model proved to be effective and achieved state-of-the-art performance on the majority of the tested datasets [37]. The use of generative pre-training followed by discriminative fine-tuning proved to be a big step in machine learning and is now a method that is employed by a majority of LLMs [39]. 2.3 Classifying Algorithms Classification algorithms represent a category of machine learning methods where a model is trained to correctly identify the label of a given input [40]. This approach falls under supervised learning, necessitating labelled training data for the model to learn the pattern of the data. Classification problems involve two types of target variables: categorical and continuous. When the desired output is discrete, the task is referred to as classification. In contrast, when the output is continuous, the task is categorised as a type of regression. There are two primary types of classification tasks: binary classification and multi- class classification. In binary classification, the target variable is binary, taking on values such as 0 or 1, true or false, etc. Meanwhile, in multiclass classification, the target variable can belong to one of multiple classes [41]. Multiple classification algorithms can be combined for a single classification task. This is known as an ensemble. Ensembles are models that use two or more algo- rithms to make their classifications. This is done to improve the performance, gen- eralisability, and robustness of the model. There are three main types of ensembles: bagging, stacking, and boosting. Bagging fits many classifiers on different samples of the dataset and then averages all the predictions made. Stacking fits many dif- ferent classifiers on the same data and then uses a separate classifier to combine the 8 2. Background predictions. Boosting is when multiple classifiers are fitted in sequence, where each algorithm fitted aims to minimise the errors made by the prior algorithms. The output of boosting is a weighted average of all the classifiers [42]. A variety of classification algorithms will be trained and evaluated in this study to compare their effectiveness when predicting design patterns. This paper will use multiclass classification as there are many different design patterns that will be predicted. Multiclass classification is often handled using a technique called One- Vs-Rest (OVR). The OVR technique means that a classifier will fit one classifier for each label [43]. Each classifier will then handle it as a binary classification problem, with its assigned label being the positive examples and everything else being negative. The classifiers will then produce a confidence score for the sample based on its assigned label, and the label with the highest confidence score is what the sample will be predicted to be. The data used for training classification algorithms is structured as entities in a list, each with features, where each feature represents a property of the class. It is common to encounter an imbalanced dataset, which can negatively impact the performance of many classifiers. The next sections describe a subset of the available classification algorithms. 2.3.1 Random Forest Random Forest is an ensemble learning technique where multiple decision trees are trained with a subset of the training data [44]. The classifier then predicts the data by aggregating the predictions of all the decision trees. This diversity of decision trees makes the model less prone to overfitting [45]. Furthermore, Random Forest is well-suited for handling high-dimensional feature spaces. 2.3.2 Naive Bayes Naive Bayes is a supervised learning method that uses the principles of probability to determine the outcome of a class, specifically Bayes’ theorem [46]. This algorithm assumes that each feature is independent and is often used in test classification problems. Naive Bayes is seen as a very simple and naive algorithm, as it does not care if features are correlated. Bayes’ Theorem is represented by the following formula: P (A|B) = P (B|A)P (A) P (B) Here P (A|B) is the probability of event A occurring given that event B has already occurred. 2.3.3 Gaussian Process Gaussian Process (GP) is a supervised learning method that can be used for prob- abilistic classification problems [47]. This means that the model gives a measure of 9 2. Background how confident it is in its prediction. In short, GP works by assuming some underly- ing relationship in the data known as the prior and then using a link function (often logistic) to get the probabilities that the data is of a certain class. One disadvantage of the Gaussian Process is that they use the whole dataset when predicting, and they are less effective in high-dimensional data, i.e., when the feature vector is more than a few dozen dimensions. 2.3.4 Logistic Regression Logistic Regression is a probabilistic discriminative model that is used for classifica- tion problems [48, Chapter 4.3]. The Logistic Regression model is at the core used for binary classification problems but can be used to classify multiclass problems by using multinomial logistic regression or by using the One-vs-Rest approach. This classifier uses a logistic function to describe the outcome of a class. The following formula describes the Logistic Regression, where P (y = 1|x) stands for probability that y=1 given the input x. P (y = 1|x) = 1 1 + e−(β0+β1x1+...+βpxp) 2.3.5 Support Vector Machine Support Vector Machines (SVMs) are supervised learning methods that can be used for both regression and classification tasks [48, Chapter 7]. They are highly effective in high-dimensional spaces. SVMs work by constructing a set of hyperplanes that maximise the margin between the training data. Figure 2.2 illustrates how an SVM can separate the training data into two distinct classes. The data points closest to the separating hyperplane are called support vectors, as they critically define the margin boundaries. Figure 2.2: Decision boundaries for a linearly separable problem [49]. 10 2. Background 2.3.6 Nearest Neighbour Nearest Neighbour (NN) is a simple algorithm that stores instances of the training data [50]. Classification of new data points is done by a majority vote of their nearest neighbours. There are several variations of the Nearest Neighbour algorithm, but the most common one is KNN. In KNN, the value of K determines the number of nearest data points considered when assigning a class to a new instance. Larger values of K reduce the impact of noise but can also blur boundaries between classes. KNN algorithms can become less effective on large datasets since every new point must be compared against all other points to determine its classification. 2.3.7 Gradient Boosting Gradient Boosting is a powerful classification algorithm and one of the most popular ensemble methods [51]. This algorithm, a type of boosting, excels at capturing non- linear patterns in data. The model works by sequentially fitting simpler models, usually decision trees, that attempt to correct the errors made by previous mod- els. It uses the negative gradient of the loss function to determine the direction of improvement for each subsequent model, repeating this process for the number of estimators used in the model. 2.3.8 Artificial Neural Networks Artificial Neural Networks, or just Neural Networks, also known as Multi-Layer Perceptrons (MLPs), are a type of machine-learning algorithm that can be used for multiple machine-learning tasks, including classification [52]. The building blocks of a Neural Network are the neurons. A neuron has weighted input signals and an output signal that transforms the value using an activation function. Neurons are arranged into networks that consist of an input layer, one or more hidden layers, and one output layer. The input layer is what receives the input and passes it onto the first hidden layer. The hidden layers contain the neurons that transform the value they receive using the weights and activation functions of the neurons. The final layer is the output layer, which generates an output of the model according to the task. A simple model of a Neural Network with input, hidden, and output layers can be seen in Figure 2.3 below. 2.4 Clustering Algorithms Clustering algorithms are tasks in machine learning that group similar objects to- gether into clusters. This method is a form of unsupervised learning, which means it does not require labelled data for training [53]. Clustering methods are com- monly used when the underlying structure of the data is unknown and to discover new patterns within the data. Unlike classification algorithms, clustering does not usually measure performance using accuracy, precision, or recall. If labelled data is available, it can be used to test the accuracy of the algorithm, but it can be tricky as the clustering method doesn’t explicitly say which cluster belongs to a specific 11 2. Background Figure 2.3: A simple model of a Neural Network. class. Furthermore, clustering algorithms use a graphical aid called the Silhouette [54] to measure performance. Silhouettes are created by representing each cluster in a Silhouette diagram based on its tightness and separation. A mean Silhouette coefficient is then calculated by comparing the sample points to the generated Sil- houette, providing a value between -1 and 1. Here, -1 indicates that a point has been incorrectly assigned to a cluster, while 1 represents the optimal assignment [55]. The following sections describe a subset of the available clustering algorithms. For this study, clustering algorithms will be tested to see if they can find any clusters from the extracted embeddings. 2.4.1 K-Means K-Means is a widely used clustering algorithm that tries to group data into a spec- ified number of clusters based on their variance [56]. The main objective of the algorithm is to choose a centroid that minimises the within-cluster sum-of-squares, also known as inertia. The centroids selected are the means of the clusters. After a centroid has been chosen, the samples are assigned to their nearest centroid, which is subsequently updated in that cluster based on the mean of the assigned samples. This process repeats until the centroid stabilises and does not move significantly when adding new samples. Since K-means uses inertia to construct the cluster, it assumes that the clusters are more convex and isotropic-shaped and thus perform worse on elongated and irregular shapes. 12 2. Background 2.4.2 DBSCAN The DBSCAN algorithm clusters data by separating high-density areas from low- density areas [57]. The clusters are generated by finding a set of core samples given by a distance measure between data points. The neighbouring core samples are then measured to find all core samples of those; this is recursively done until no more core samples are found. These core samples are then surrounded by non-core samples, which are samples that are further away from the core samples given a threshold. The advantage of DBSCAN compared to K-means is that it can find clusters of any shape. 2.4.3 Mean Shift Mean Shift clustering is a centroid-based algorithm that works by updating which centroid data points are considered the mean for a given region of data points [58]. The different candidates are then filtered to eliminate candidates that generate the same or very similar clusters. Centroid positions are calculated using hill climbing in order to find local maxima. The algorithm automatically calculates the number of clusters, but it can also be set manually. Mean Shift requires multiple nearest neighbour searches during the execution of the algorithm. This creates a complexity of O(N2) for the algorithm and also makes it not highly scalable. Application domains for Mean Shift include computer vision and image processing. 2.4.4 Gaussian Mixture Gaussian Mixtures are a special type of clustering model called Gaussian Mixture Models [59]. Their probabilistic model assumes that all the data points are generated from a finite number of Gaussian distributions. There are different types of Gaussian mixture models, depending on the chosen estimation strategy. One estimation strat- egy uses the expectation-maximisation algorithm. The main difficulty of Gaussian mixture models is dealing with unlabeled data and knowing what data comes from which cluster. The expectation-maximisation algorithm uses an iterative process to get around this issue. It starts by assuming that clusters are random by center- ing them on randomly chosen data points or by using K-Means. The initialization method can be random or done by K-Means. It then calculates the probability of a data point being generated by each cluster and tweaks the parameters to maximise the likelihood of the data points, given their cluster assignments. This process is repeated and will be guaranteed to converge to a local optimum. Gaussian mixture models are often found in biometric systems, most notably in speech recognition [60], [61]. 2.4.5 BIRCH BIRCH is another clustering algorithm that works by building a tree based on the given data called a Clustering Feature Tree (CFT) [62]. The tree consists of Clustering Feature Nodes (CF Nodes) and the CF Nodes contain subclusters, and the subclusters can contain CF Nodes as children. The subclusters are what contain 13 2. Background the necessary information used to cluster the data. The algorithm works by inserting new samples into the root of the CFT, which is a CF Node. The new sample is then merged with the subcluster in the root that has the smallest radius after merging. If the subcluster it merged with contains a child CF Node, the process is repeated until it reaches a leaf. Splitting of nodes into branches occurs when the radius of a subcluster after merging is above a defined threshold value and the number of subclusters is greater than the branching factor. The number of clusters can be manually set; if it is not set manually, the clusters will be the subclusters at a leaf. Applications for BIRCH include image classification and image compression [63]. 2.4.6 Affinity Propagation Affinity Propagation works by passing messages between pairs of samples until the algorithm converges [64]. A dataset is described using exemplars, which are a small number of data points that are supposed to represent other data points. The mes- sages that data points pass to each other represent how suitable a data point is to be the exemplar of another data point. This gets updated in response to values from other pairs. The updating continues iteratively until it converges. At convergence, the final exemplars are picked and the clusters are created. Like Mean Shift, Affinity Propagation is not highly scalable due to its complexity of O(N2T ) where N is the number of data points and T is the number of iterations until convergence. The next section will present some of the related work done in this field of design pattern recognition, where concepts and terms from this section will be widely used. Information from this section will also be used when discussing the results of the study. 14 3 Related Work This section presents previous studies that utilised various methods for the purpose of DPR as well as the evaluation of classification algorithms in non-DPR contexts. It outlines the main components of these studies, including their evaluation methods and results. 3.1 Existing Approaches for DPR Yarahmadi and Hasheminejad conducted an extensive literature review of papers about design pattern detection that were published between 2008 and 2019 [12]. In their study, they found that the majority of the approaches to DPR worked on creational, structural, and behavioral patterns. They did, however, note that some patterns were harder to identify or could not be identified at all due to their similar structure. Yarahmadi and Hasheminejad also found that a large majority of DPR approaches use source code as their type of input and found that there were many different ways of representing the data, including metrics and feature vectors. They also found that there were multiple different DPR approaches, but the most common were learning-based approaches that made use of artificial intelligence and machine learning. Furthermore, they found that the P-MARt repository [65] and the projects within it were used by a large majority of the papers for training and evaluation of their methods. Yarahmadi and Hasheminejad also looked at the results of different DPR approaches in their study and presented advantages and disadvantages of the approaches [12]. They found that learning-based approaches performed well. It had a high recall and precision, was evaluated on large test cases, and could detect design patterns such as Singleton. But it had difficulties detecting patterns with similar structures. Metric- based approaches performed worse than learning-based approaches but could detect patterns with similar structure. They found that many of the approaches had small test cases, which questioned their ability to function in large systems. They also found that many approaches cannot detect different variants of design patterns, but that learning-based approaches are capable of this. They also state that many of the methods have weak performances and evaluations. Yarahmadi and Hasheminejad believe that many benefits can be gained by combining different methods. 15 3. Related Work As stated by Yarahmadi and Hasheminejad, there are many approaches that have been extensively researched when it comes to design pattern recognition [12]. Dif- ferent approaches have different methods for how they represent their data. Some of these include metrics extracted from the source code, the Abstract Syntax Tree (AST), which is a tree that represents the structure of the code, and the Abstract Semantic Graph (ASG), which captures a higher level of abstraction compared to the AST. For ML-based approaches, the data representation may instead come from using language models. These language models are trained on a large set of data and can extract the semantic meaning of the source code in a more generalisable way. The next section outlines two different DPR approaches researched. 3.1.1 Metric-based Approaches Tsantalis et al. explain a method for detecting Java design patterns from similarity scoring between graph vertices [9]. They evaluated their algorithm on three open source projects: JHotDraw, JRefactory, and JUnit. Each project was parsed, and various metrics were extracted to get a hierarchical tree structure of the classes, methods, and their invocations. These directed graphs were then mapped to matri- ces for easier manipulation. Known design patterns were also hard-coded as matri- ces, each with descriptive attributes. The graphs and matrices from the extracted code could then be used to measure the similarity score to find the best match. The authors also note that some design patterns have identical structures and cannot be individually identified without further context; these are Adapter/Command, and State/Strategy. Their results showed that their method could correctly iden- tify every design pattern in their scope with only two false negatives and no false positives. Dwivedi et al. utilise a metric-based approach where they extracted 67 metrics using the JBuilder software [10]. Some of these metrics are familiar, such as lines of code or the number of parameters. But it also includes more advanced metrics such as the polymorphism factor and coupling between objects. They evaluated the performance of three different supervised learning methods: Artificial Neural Networks, Support Vector Machine, and Random Forest. Five design patterns were used: Abstract Factory, Adapter, Bridge, Composite, and Template Method. For each pattern, all the related classes were used to extract metrics from the pattern. The design patterns used were from the P-MARt repository [65]. Validation was done using cross-validation, and accuracy was used to measure the performance of the models. Their study showed that ANN performed the poorest, while random forest was the best, with an F1-measure of 100% on QuickUML and JUnit and 97.6% on JHotDraw, averaged over all design patterns focused on. Zanoni et al. describe an approach for detecting design patterns in code [6]. Their approach makes use of graph matching and machine learning techniques. Their study builds on previous work by implementing enhancements in their analysis method. They introduce a tool called MAPRLE-DPD. This tool is divided into three main modules. The first module is the information detector engine. This module is responsible for building a model of the programme and collecting and 16 3. Related Work storing microstructures and metrics in the model. The second module is the joiner. The joiner works on the microstructures that the information detector engine has ex- tracted and extracts all possible design patterns. The third module is the classifier. The classifier computes similarities between the design patterns extracted by the joiner by comparing them to known implementations of the design patterns. In their study, Zanoni et al. also define two different categories of design patterns [6]. These categories include single-level patterns and multi-level patterns. They define patterns as trees of levels where each level contains at least one role. A role is a class that is needed to implement a design pattern. A single-level pattern is a design pattern that can be implemented using only one level. Multiple roles are allowed in single-level patterns as long as they are at the same level. Multi-level patterns are design patterns that consist of multiple levels. A pattern can consist of multiple levels if its implementation uses multiple classes and one of the classes inherits from another class. An example of a multi-level design pattern is Template Method. Template Method creates an abstract template class that is then inherited and used by concrete classes. Zanoni et al. evaluated their model in the form of experiments [6]. The dataset of design patterns they used consisted of 10 different projects, and the training set totaled 2794 different pattern instances. One project was gathered from different design pattern examples found on the web. The nine other projects were anal- ysed in the P-MARt repository [65]. They used multiple different classification and clustering algorithms to evaluate the performance of the model, including Decision Trees, Random Forest, SVMs with different kernel functions, Naive Bayes, Sim- pleKMeans, and CLOPE [6]. Their model and the algorithms were evaluated on single-level patterns and multi-level patterns separately, and they only used cluster- ing algorithms for multi-level patterns. The single-level patterns experimented on were Singleton and Adapter. The multi-level patterns were Composite, Decorator and Factory Method. The algorithms were compared on three different metrics: ac- curacy, F1-measure, and Area Under Curve (AUC). For single-level patterns, Zanoni et al. concluded that all algorithms have similar performances, with the exception of Naive Bayes and one variation of SVM that performed significantly worse than the other algorithms. For multi-level patterns, they found the best performance with an SVM. It is worth noting, however, that the goal of the study was to evaluate their approach to DPR, and there was no extensive testing or comparison of different algorithms. 3.1.2 LLM-based Approaches Pandey et al. proposed an approach for DPR that utilised language models that have been trained on programming languages [11]. The study introduced a DPR tech- nique called TransDPR. The language model used in their study was TransCoder. TransCoder is a language model developed by Meta [66], and the purpose of the model is to translate source code from and to Java, C++, and Python. Pandey et al. use TransCoder to extract the semantic information of a program [11]. This was done by extracting the embedding vector produced by TransCoder at the encoder 17 3. Related Work block of the TransCoder model. Their model was evaluated over two patterns: Singleton and Prototype. The pattern instances were found in open-source projects on GitHub and were written in C++. For each instance of a pattern, they also created an artificial counter instance by multiplying the embedding vector of the instance with -1. The counterexamples were annotated as non-singleton and non-prototype. This was done to reduce overfitting and to make sure the model is not biassed towards any design pattern. Their training set was composed of 52 embedding vectors, of which half were artificial counterexamples. The model was assessed by splitting data into a training and test set, with 70% of the data being part of the training set. They only used one classifier, Logistic Regression, and the model gave 50% accuracy for Singleton and 90% accuracy for Prototype. Pandey et al. also performed an overall test where Singleton and Prototype were combined into positive examples and non-singleton and non-prototype were combined into negative examples. The accuracy for the overall test was 90%. A study performed by Chand et al. tested the feasibility of using language models for DPR within industrial contexts [7]. Their approach is similar to the approach used by Pandey et al. in their study [11]. Chand et al. use two language models, Word2Vec and Code2Vec [7]. These models are used to generate an embedding of the source code. Both of the models were pre-trained on two open-source systems often utilised within the automotive industry: Genivi [67] and Android Auto [68]. The pre-training dataset consisted of 43,397 code examples, which were split up into test, validation, and training datasets. Both models had similar performance with the pre-training dataset, with Word2Vec performing slightly better. Chand et al. evaluated the ability of the models to detect design patterns with 27 code examples across three design patterns: Singleton, Prototype, and Builder [7]. The algorithm they used was a K-means cluster algorithm. For the purpose of detecting design patterns, Code2Vec outperformed Word2Vec with a mean F1-score of 0.781 compared to 0.690 achieved by Word2Vec. The results showed that the models could store architectural information about the source code. 3.2 Existing evaluation of classification algorithms There is also work in the field of evaluating classification techniques in a non-DPR context. The studies are mainly performed using different datasets, and the results show that classifiers are often task-dependent; in other words, some classifiers are good on some datasets while others are worse. Caruana et al. conducted a large-scale study in which they compared ten supervised learning methods to determine which methods had the best performance across 11 different binary classification problems [13]. Their study included supervised learn- ing algorithms such as logistic regression, SVMs, naive Bayes, KNN, random forest, decision trees, bagged trees, boosted trees, boosted stumps, and neural networks. 18 3. Related Work They evaluated these learning algorithms using eight performance metrics, includ- ing accuracy, F-score, and Lift Curves. Additionally, calibration methods such as Platt Scaling and Isotonic Regression were applied to compare the algorithms prob- abilistically. For datasets, they used 11 binary classification problems, including ADULT, COV_TYPE, and LETTER. Their results showed that boosted trees per- formed best when considering the average across all eight metrics, with random forests coming in second. They also noted that there is no universally best learning algorithm, as some algorithms that performed poorly on some problems excelled on others. For example, random forest, neural networks, and logistic regression were the best models on the MEDIS dataset. Overall, the worst models were naive Bayes and memory-based learning, such as KNN. Soofi and Awan investigated several machine learning classification techniques and the applications for which they were used in [14]. Due to their effectiveness and simplicity, Decision Trees were the most commonly used algorithm for classification tasks [69]. One problem that Decision Trees excelled at was dealing with multi- valued attributes. Bayesian Networks were another classification technique inves- tigated. This method graphs variables and their conditional dependencies using a Directed Acyclic Graph (DAG). One disadvantage of Bayesian Networks is that they require continuous variables to be discretized, which can cause classification issues [70], [71]. The benefits of Bayesian Networks include their ability to solve both regression and classification problems, as well as their ability to handle missing val- ues. K-nearest neighbour (KNN) is another classification technique that is good at scaling over multimedia datasets and was used to address issues with space and time constraints. KNN is also effective when dealing with large amounts of train- ing data and is resistant to noisy training data [72]. Some disadvantages include computation complexity, memory limitation, and poor run-time performance [73], [74]. Lastly, Support Vector Machines (SVM) is a technique considered to be one of the more convenient methods for solving classification problems [75]. SVMs excel at handling high-dimensional and non-linearly separable problems. However, a notable disadvantage is that achieving accurate results often requires setting key parameters correctly. SVM can be applied to issues involving multi-label classification and is effective at controlling the false-positive rate. 19 3. Related Work 20 4 Method The main method used in this project was an Mining Software Repository (MSR) experiment method [76]. The study used publicly available software repositories in order to understand and improve software maintainability. The study was also a knowledge-seeking study that aimed to evaluate and validate different methods and their performance [77]. The independent variables in this study were the classifica- tion and clustering algorithms. An overview of the process for RQ1 can be seen in Figure 4.1, and an overview of the process for RQ2 can be seen in Figure 4.2. The pipeline referenced in those figures is explained in Section 4.7. Figure 4.1: Visualisation of process for RQ1. Figure 4.2: Visualisation of process for RQ2. 4.1 Data Collection Java data for Singleton, Prototype, and Builder were gathered mainly from the P-MARt repository [65]. The P-MARt repository consists of labelled usage design patterns that have been extracted from different open source software that is written 21 4. Method in Java. The data gathered from P-MARt consisted of 32 examples of Prototype, 25 of Singleton, and 9 of Builder. All the examples were written in Java. Data was also systematically gathered from public GitHub repositories. This is due to the P-MARt data gathered being imbalanced and having an unequal amount of data for the different patterns. An imbalanced dataset can lead to the machine- learning algorithms overfitting on the design pattern(s) that are most common in the dataset. This happens because the algorithm believes that the most common design patterns have a higher chance of appearing than the other design patterns. The P-MARt repository also only have Java data for Singleton, Prototype, and Builder and data had to be gathered for Adapter, Bridge, Strategy, and Observer as well as data for Python and C#. Examples of design pattern implementations were found by searching with specific search terms on GitHub. The search terms used to find data for all three programming languages were the following: • "design patterns language:Java" • "design patterns language:Python" • "design patterns language:C#" The results were then sorted by "Most stars". It was sorted by "Most stars", as it usually indicates a repository being popular and of high quality. The repositories were then manually inspected one-by-one for design pattern implementations. For data to be included in a repository, it had to satisfy a list of requirements. The requirements can be seen below. • The repository should be in English. • The repository should have at least one example of the design patterns part of this study (Singleton, Prototype, Builder, Observer, Strategy, Bridge, and Adapter). • The design patterns should be explicitly annotated by the author(s) of the repository. The annotation of a design pattern can be: the file(s) are named after the design pattern, the folder the file(s) are in is named after the design pattern, or the com- ments in the code explicitly state the design pattern implemented. All the gathered design pattern examples were then manually reviewed to see if the implementation of the pattern looked correct when compared to the implementation presented by Gamma et al. [2]. The number of files and their source for the Singleton, Proto- type, and Builder patterns written in Java are shown in Table 4.1 below. All other patterns in Java had the same number of files and were all from GitHub. Python and C# had the same number of files, all from GitHub. 22 4. Method Table 4.1: Table showing the amount of files and their source for the Singleton, Prototype, and Builder design patterns written in Java. Design Pattern P-MARt GitHub Total Singleton 25 7 32 Prototype 32 0 32 Builder 9 23 32 4.1.1 Data Preprocessing The gathered data went through a manual preprocessing stage. All files used for a single pattern implementation were merged into one file, so each file in the final data set represented a full implementation and usage of a design pattern. Parts of the code that were deemed to be irrelevant to the implementation of the pattern were also removed. Examples of these were: package names, comments explaining the license the code was part of, the author of the code, and educational comments explaining how the pattern works. Comments related to the functionality of the code were not removed as they are relevant for the language model’s understanding of the code. It was especially relevant for CodeBERT, as that model has been trained on pairs of natural languages and programming languages. Meaning that comments in the code can help the model understand the code better. 4.2 Google Colab Google Colab is a hosted Jupyter Notebook service that allows anybody to write and execute Python code through the browser [78]. The pipeline used in this study was written in Google Colab. Colab was opted for as it is free to use and it provides access to computing resources on which the code is executed. Colab gives access to execute code on Graphical Processing Units (GPUs), which are required for running machine learning and AI models. This sped up the process of generating embeddings and helped us train classifiers and clusterers. Colab also supports sharing and collaborating on projects. 4.3 Language Models Used The study utilised language models in order to generate embeddings from the given code that the classifying and clustering algorithms then used as input to detect the presence of a design pattern. Each model was also responsible for tokenizing the input if the model required it to generate an embedding. The model used for RQ1 was CodeBERT. As it is trained on Java data, is well-known, and outperforms other well-known models, CodeBERT was chosen as the language model to be used in RQ1. For RQ2, multiple language models were used to study the effects of the choice of language model on the performance of the classifiers and clusterers. In RQ2, CodeBERT, GPT-2, and Instructor-xl were used. 23 4. Method CodeBERT is a variant of the BERT language model that has been pre-trained on programming languages. CodeBERT was chosen as it is a variant of a well- known language model and its purpose is to aid with coding, making it suitable for this study. As CodeBERT is trained on programming languages, it is known as a Programming Language Model (PLM). It is trained on Java and Python data, which are two of the programming languages used in this study, and achieves state- of-the-art performance. CodeBERT is also the only PLM included in this study, and PLMs are what previous studies utilising language models for DPR have used [7], [11], [79]. CodeBERT also outperformed other language models on C# even though it was not part of its training data [32]. As its performance translates over to C# as well, it was also chosen to be used for RQ2. GPT-2 was included as a model to use in RQ2 to investigate how and if the choice of language model impacts the performance of the machine learning algorithms. Later versions of the GPT LLM were not used as they were not available to down- load locally for free. The model is also available for free, well-known, and showed state-of-the-art performance when it was released [26]. GPT-2 and CodeBERT are similar in their functionality but different in what they are designed for. CodeBERT is designed and focused around code and aiding with coding [32]. GPT-2 is a more general model that aims to aid with many different tasks in many different fields, such as translation, summarization, and question answering [26]. GPT-2 also gen- erates an embedding for each token instead of one embedding for all tokens. The models are similar and both relevant for the study, but still different enough to discuss results and performance differences. Instructor-xl was included as a third model in RQ2 due to it being more recent (late 2022), and a lot of advancements have been made in the area of machine learning and AI between the release of CodeBERT and GPT-2 and the release of Instructor- xl [35]. GPT-2 and CodeBERT are also not designed for the purpose of finding the presence of design patterns in code. Instructor-xl is also not designed for the purpose of DPR, but the model can be given instructions for the task and therefore generate an embedding that takes the task of DPR into account. It is therefore also of interest how a more recent model that can generate an embedding more appropriate for the task changes the performance of the classifiers or clusterers. 4.4 Design Patterns The design patterns used in RQ1 were Singleton, Prototype, and Builder. These patterns were selected for their widespread use in object-oriented programming and their frequent appearance in previous DPR work [12]. Singleton, Prototype, and Builder are all categorised as creational design patterns. As the focus of RQ1 was to explore the performance of different classifiers and clusterers, we chose to select patterns that are of the same type. The difference in performance between the indi- vidual design patterns can then also be studied without having to make inferences about design pattern types. Differences between pattern types will be explored in RQ2. 24 4. Method For RQ2, we expanded the number of patterns from three to seven, including at least two from each design pattern category: creational, behavioral, and structural. The patterns included for RQ2 were Singleton, Prototype, Builder, Observer, Strategy, Adapter, and Bridge. The patterns were chosen as they were often part of previous work in DPR and are often used within object-oriented programming [12]. This also meant that it was not too difficult to find examples of their implementations to use as training and testing data. At least two patterns for each classification were chosen in order to study how machine-learning algorithms perform when classifying or clustering patterns both within and across design pattern classifications and how their performance varies between them. 4.5 Classifying and Clustering Algorithms The classification algorithms used in this study were Logistic Regression, Random Forest, K-Nearest Neighbours, Support Vector Machine, Decision Tree, Gradient Boosting, Naive Bayes, Gaussian Process, and a Neural Network. The Neural Net- work will be a Multilayer Perceptron classifier, where the mapping between input and output is non-linear by using a non-linear activation function. The clustering algorithms used in the study were K-Means, DBSCAN, Mean Shift, Gaussian Mix- ture, BIRCH, and Affinity Propagation. K-Means, Gaussian Mixture and BIRCH allow you to specify the number of classes present, and this will be done for those algorithms. This is done as this is a classification task and this information is avail- able. The embeddings are of dimension 768, and reducing this size can lead to a loss of information. Specifics about training and testing the classifiers and clusterers are presented in Section 4.7.3. A wide range of classifiers and clusterers was chosen to explore the performance of different algorithms and methods of classification. The chosen algorithms were opted for as they are well-known and have been used in previous work within DPR [6], [7], [11]. 4.6 Metrics The metrics calculated for each classifier were accuracy, precision, recall, and F1- score. Accuracy is the simplest metric here, and it shows how many of the predictions made were correct without considering any other information. It is calculated by dividing the number of correct predictions by the number of total predictions made. It is a value between zero and one, and the higher it is, the better the model is at making correct predictions. TP: True Positives TN : True Negatives FP: False Positives 25 4. Method FN : False Negatives accuracy = TP + TN TP + TN + FP + FN Precision is a metric that shows the proportion of how many of the predictions it makes for a certain class are correct. For example, it will show how many of the programmes it predicted as being Singleton are correct compared to how many Singleton programmes there are. It is a value between 0 and 1, and a higher value is better. precision = TP TP + FP Recall shows how good the model is at finding all positive examples. It shows how many of the data points that should be classified as true were classified as true. In this case, it shows how many of the programmes that should be predicted as Singleton were predicted as Singleton. It is a value between 0 and 1, and a higher value is better. recall = TP TP + FN The F1-score metric shows the accuracy of a model, and it takes precision and recall into account. Compared to just calculating the normal accuracy, the F1-score considers class imbalance and better reflects the true performance of a model. It is a value between 0 and 1, and a higher value means a better-performing model. F1 = 2 · precision · recall precision + recall The data used in the study is labelled. Thus, F1, precision, and recall can be calculated for the clusterers as well. This does, however, require testing different mappings between patterns and clusters. Mapping patterns to clusters was only performed for RQ1 to be able to compare the clusterers and classifiers with the same metrics. It was not done for the other RQs as their performance in RQ1 did not show promise of the clusterers being able to compete with or outperform the classifiers. Mapping the patterns to clusters also takes time and was not feasible at all for RQ2.1 due the large amount of unique mappings. F1-score, precision, and recall were calculated by testing all different mappings of pattern to cluster and taking the results from the mapping that gave the highest F1-score. The metrics used for RQ2 were the Adjusted Rand Index (ARI) and Silhouette score. These metrics were included in RQ1 as well for comparison purposes. The 26 4. Method Silhouette and ARI scores will tell whether the clusters are good in terms of how similar the points in a cluster are to each other. If a clusterer performs well in terms of these metrics, it can be used to classify code, but it would require the extra effort of mapping the clusters to the patterns. The clusterers will also be tuned around the Adjusted Rand Index. The Silhouette score for a whole data set is calculated by calculating the Silhouette score for each individual data point, adding them, and dividing by the total number of data points. The Silhouette score ranges from -1 to 1. A negative score means that the data point is likely in the wrong cluster. A positive score indicates that the data point is assigned to the correct cluster. If clustering performs well, it should have a Silhouette score near 1. ai: Average distance of point i to all other points within the same cluster bi: Average distance of point i to all points in the nearest cluster Silhouette score(i) = max(ai, bi) bi − ai Silhouette score = ∑n i=1 Sillhouette Score(i) n The Rand Index (RI) is a metric that measures the similarity between two different clusterings of the same data. RI takes into account all different pairs of data and checks if the pairs have been assigned the same cluster or a different cluster when comparing to the true labels. It checks whether a pair within the same cluster has the same true label or whether a pair with the same true label is in different clusters. The Rand Index is calculated by the following formula: Rand Index = number of agreeing pairs number of pairs The Adjusted Rand Index has the same value, but it adjusts the value according to the chance of making a correct clustering at random. It calculates an expected RI value that would have been achieved by random clustering and then adjusts the RI score according to the expected RI. This adjustment also gives ARI a baseline where scores near 0.0 have the same performance as random clustering. The ARI score ranges from -0.5 to 1, where a higher score is better. 4.7 Pipeline For testing various classifiers and clusterers, a pipeline was created in Google Colab. An overview of the pipeline can be seen in Figure 4.3. This pipeline takes a directory of source code files as input. 27 4. Method Figure 4.3: Visualisation of the pipeline. 4.7.1 Tokenization The first step in the pipeline was to tokenize the code files. Much of the data used in this study would generate more tokens than the maximum input sequence length of the models. This issue was handled by splitting the code files into chunks, where each chunk was as big as the maximum input sequence length of the model. If a chunk or a file was smaller than the maximum input sequence length, it was padded to make the size equal to the maximum input size. This was done to make all inputs to the model have the same size. The padding does not add any semantic value and does not affect the embedding generated by the models. The chunks also had an overlap of 10% where the final 10% of a chunk would be the first 10% of the next chunk. This was done to help the models retain the context between chunks. 4.7.2 Extracting Embeddings The tokenized chunks were then individually sent into the models, and their em- beddings were extracted. All files that had multiple chunks would receive multiple embeddings. These embeddings were put into one final embedding by averaging all of their values. After this, all files would have one singular embedding as their rep- resentation. These embeddings were then saved as a comma-separated file, allowing them to be used later for training various classifiers and clusterers without the need to rerun the source code through the models. The saved embeddings are then split into training and test data, where 80% was used for training the algorithms and 20% was used for testing. The data is split in this way in order to have sufficient data for the classifiers to train and learn from, as well as sufficient data to test and evaluate them. It is also done to reduce overfitting of the algorithms [80]. All patterns were split separately to ensure that there was an equal amount of data for each pattern in the training and test sets. It also removes any bias that can be introduced with imbalanced data sets. The clusterers did not use the separated data and instead created clusters from the full data set. 4.7.3 Training of Classifiers and Clusterers The next step in the pipeline was to train and test all the classifiers and clusterers on the data. All algorithms used the same data, and they were trained and tested once per model and per programming language. Different hyperparameters were also tested for each algorithm by performing a grid search with selected hyperparameters. The grid search was also done per model and per programming language. This was done to ensure that the best possible result was achieved for all algorithms. What 28 4. Method hyperparameters were tested and how many combinations of each algorithm were tested can be seen in Table B.1 in Appendix B. The classifiers were optimised based on their F1-score, and the clusterers were optimised based on their ARI score. The values for the hyperparameters used in the different tests can also be found in the Appendix. The classifiers and clusterers ran 200 times each with different training and test sets. This was done to reduce the effect that randomness has on the performance of the classifiers and clusterers and to see their average performance across multiple runs and how much it varies between runs [81]. Results before and after hyperparameter tuning were generated separately to see how much it affected the performance of the algorithms. 4.7.4 LLM Settings For RQ1, we utilised CodeBERT [32] to extract the embeddings. RQ1 focused ex- clusively on the Singleton, Prototype, and Builder design patterns. All the design patterns analysed were code written in Java. Java was chosen as the language for RQ1, as the P-MARt data is all written in Java, and P-MARt has also been used in multiple previous studies within DPR [12]. The data used in RQ1 consisted of examples found in P-MARt and public GitHub repositories. The P-MARt dataset and the combination of P-MARt and public GitHub repositories were tested sepa- rately. CodeBERT has a maximum input sequence length of 512 and also uses special classification and end-of-sequence tokens. So, the data for CodeBERT was split into chunks of size 510, and a classification token and end-of-sequence token were added as the first and last elements, respectively, making each chunk have a size of 512. As the maximum length is 512, the overlap for each chunk was 51 tokens. RQ2 used two more models for the extraction of the embeddings: Instructor-xl [35], and GPT-2 [26]. The number of patterns was extended from three to seven and now included multiple patterns from all three types of design patterns. Two other programming languages were also tested: Python and C#. The pipeline was modified and extended to include GPT-2 and Instructor-xl. GPT-2 has a maximum input sequence length of 1024. The chunks created for GPT-2 were of size 1024 with an overlap size of 102. Instructor-xl has a maximum input sequence length of 512 and uses an overlap of 51 tokens. None of them use any special classification or end-of-sequence tokens. Instructor-xl, however, can be given instructions on the task, and the instructions given to the model were: "Identify the software design pattern (Singleton, Prototype, Builder, Observer, Adapter, Bridge, Strategy) in the following code:". 29 4. Method 30 5 Results Quantitative data was obtained from the predictions the classifiers made on the test set and the clusters generated by the clusterers. Results relevant for RQ1 are presented in Section 5.1. Results relevant for RQ2 are presented in Section 5.2. 5.1 RQ1: Classifier Performance on Design Pat- tern Identification This section presents results related to answering RQ1. The results are divided into baseline performance, which is the performance with default hyperparameters, presented in Section 5.1.1 and performance with tuned hyperparameters presented in Section 5.1.2. The design patterns used for RQ1 were Singleton, Prototype, and Builder. All design patterns were written in Java. All results presented in this section use embeddings generated by CodeBERT. The default hyperparameters for the algorithms can be found at scikit-learn [82]. 5.1.1 Baseline Performance The results of training and testing each classifier across 200 different runs with different training and test splits can be seen in the box plot in Figure 5.1 and in Table 5.1. Multiple classifiers were able to reach an F1-score of 1 in at least one run, even without any hyperparameter tuning. The classifiers that did not achieve a F1-score of 1 in any of the runs were Decision Tree, KNN, Naive Bayes, and Gaussian Process. The best lowest score was 0.651, which was achieved by Logistic Regression. Gaussian Process is the standout classifier here, with significantly worse performance than any other classifier. The mean F1-score it got was 0.219, with its highest being 0.328. It did have the lowest variance of all classifiers at 0.003, with the next lowest being 0.005, which was achieved by Logistic Regression. The second worst performance was attained by Decision Tree with a mean F1-score of 0.664. Furthermore, Table 5.2 shows the F1-score (F1), Precision (P), and Recall (R) for each design pattern, Singleton, Prototype, and Builder averaged over 200 runs. These results are also visualised in confusion matrices, as can be seen in Figure 5.2. 31 5. Results Figure 5.1: Box plot showing the F1-scores of the classifiers across 200 runs with default hyperparameters on CodeBERT embeddings, averaged across all patterns. Table 5.1: Table depicting mean, variance, highest, and lowest F1-scores obtained from running each classifier 200 times with different training and test sets and with default hyperparameters on CodeBERT embeddings, results averaged across all de- sign patterns. Classifier Mean Variance Highest Lowest MLPClassifier 0.830 0.005 1.000 0.593 Gradient Boosting 0.826 0.006 1.000 0.613 Random Forest 0.815 0.006 1.000 0.573 Decision Tree 0.664 0.011 0.905 0.356 KNN 0.763 0.008 0.952 0.498 Naive Bayes 0.707 0.005 0.860 0.544 Gaussian Process 0.219 0.003 0.328 0.167 Logistic Regression 0.883 0.005 1.000 0.651 SVM 0.832 0.005 1.000 0.563 The confusion matrices show which patterns were often predicted correctly and which ones were predicted wrong. The y-axis are the true labels, which indicate which design pattern the file contains, and the x-axis are the predicted labels, which indicate what the classifier predicted for that file. The diagonal in the matrices shows the percentage where the classifiers predicted the class correct, and the other cells show how the algorithm misclassified which design pattern the file consisted of. The matrices show that Singleton is the design pattern that gets predicted wrong most often. Singleton is also the design pattern that Prototype and Builder get wrongly predicted as being. Gaussian process is the classifier that stands out here, as it almost always predicts everything as being Singleton. 32 5. Results Table 5.2: Table depicting mean F1-score, precision, and recall obtained from running each classifier 200 times with different training and test sets across the Singleton, Prototype, and Builder design patterns. Classifier Singleton Prototype Builder F1, P, R F1, P, R F1, P, R MLPClassifier 0.74, 0.78, 0.73 0.89, 0.92, 0.89 0.86, 0.85, 0.89 Gradient Boosting 0.75, 0.76, 0.76 0.88, 0.92, 0.85 0.85, 0.85, 0.88 Random Forest 0.76, 0.76, 0.78 0.84, 0.89, 0.82 0.85, 0.86, 0.86 Decision Tree 0.58, 0.62, 0.57 0.69, 0.71, 0.71 0.72, 0.73, 0.73 KNN 0.67, 0.77, 0.62 0.81, 0.78, 0.85 0.81, 0.80, 0.84 Naive Bayes 0.67, 0.71, 0.67 0.67, 0.67, 0.70 0.78, 0.81, 0.77 Gaussian Process 0.51, 0.34, 1.00 0.00, 0.00, 0.00 0.15, 0.51, 0.09 Logistic Regression 0.82, 0.85, 0.83 0.93, 0.97, 0.90 0.90, 0.88, 0.93 SVM 0.77, 0.75, 0.83 0.87, 0.97, 0.81 0.85, 0.86, 0.87 Figure 5.2: Confusion matrices showing how the baseline classifiers predicted each of the design patterns, averaged across 200 runs on CodeBERT embeddings. 33 5. Results Table 5.3 shows the baseline performance for different clustering algorithms. These algorithms were evaluated with 200 iterations each, and the mean was recorded for ARI, Silhouette score, F1, precision, and recall. ARI has a baseline score of 0 which is equal to clustering at random. Silhouette score measures the similarity of data points in the same cluster, and a positive score means that most data points are likely in the correct cluster. Both ARI and Silhouette score want to achieve a score close to 1 as 1 is the highest possible score for both metrics. From this baseline performance, K-Means got the highest ARI score at 0.283, and Mean Shift achieved the highest Silhouette score at 0.326. Mean Shift did, however, also get an ARI score of 0.000, meaning it is no better than random clustering. Mean Shift also got the next lowest F1-score at 0.167. BIRCH achieved the highest F1-score with a score of 0.638. The next highest clusterer, in terms of F1-score, was K-Means with an F1 of 0.335. The lowest mean F1-score was achieved by Affinity Propagation with a score of 0.069. Table 5.3: ARI, Silhouette, F1, precision, and recall scores of the clustering algo- rithms with default hyperparameters on CodeBERT embeddings, averaged across all design patterns over 200 runs. Clusterer ARI Silhouette F1 Precision Recall K-Means 0.283 0.221 0.335 0.349 0.339 Mean Shift 0.000 0.326 0.163 0.109 0.323 Gaussian Mixture 0.290 0.222 0.313 0.331 0.331 BIRCH 0.263 0.167 0.638 0.668 0.644 Affinity Propagation 0.130 0.137 0.069 0.330 0.040 DBSCAN NaN NaN NaN NaN NaN Figure 5.3 below shows a t-SNE visualisation of the data, where the data has been standardised and reduced from 768 dimensions to 2. It is important to remember that, due to the high dimensionality of the original data, the t-SNE visualisation has a loss of information. The figure does show two clusters in the top right and bottom left, which are the Builder and Prototype patterns. The Singleton pattern is spread out in the left and middle and not in one singular cluster. Singleton (green) is the pattern that is most intertwined between the other patterns, and it is also the pattern that most often gets predicted wrong by the classifiers. Running the clusterers only on Builder (blue) and Prototype (orange) does yield better results for all clusterers in terms of the clustering metrics. All clusterers, except for BIRCH, also saw an improvement in the classification metrics (F1, precision, and recall) with BIRCH instead achieving a significantly worse score in those metrics. The full results of running without Singleton can be seen in Table 5.4 below. 34 5. Results Figure 5.3: t-SNE visualisation of the CodeBERT embeddings. Table 5.4: ARI, Silhoutte, F1, precision, and recall scores of the clustering al- gorithms with default hyperparameters on CodeBERT embeddings, without the Singleton pattern, averaged over 200 runs. Clusterer ARI Silhouette F1 Precision Recall K-Means 0.490 0.278 0.502 0.505 0.505 Mean Shift 0.008 0.164 0.451 0.762 0.547 Gaussian Mixture 0.484 0.278 0.475 0.479 0.479 BIRCH 0.421 0.274 0.167 0.164 0.172 Affinity Propagation 0.199 0.143 0.160 0.499 0.101 DBSCAN NaN NaN NaN NaN NaN 5.1.2 Performance with Tuned Hyperparameters This section presents the results of classifiers and clusterers after tuning their hy- perparameters. The hyperparameters used in this section can be found in Appendix B Table B.2. The results for 200 runs on tuned hyperparameters for the classifiers can be seen in Figure 5.4 and Table 5.5. They show that almost all classifiers increase their mean value slightly. Even with tuned hyperparameters, the performance is very similar to the baseline result, except for Gaussian Process which increased its mean from 35 5. Results 0.219 to 0.811. It was also able to get a high F1-score of 1. Although it did achieve a much better mean score, it still had bad performances with many outliers and a low score of 0.167. This gave it the highest variance of all classifiers at 0.024, with the next highest being 0.009 from Decision Tree. The reason for the outliers is due to the algorithm overfitting. This is further explained in Section 6.1.1. Logistic Regression is still the best-performing classifier and increased its mean from 0.883 to 0.884. Moreover, Decision Tree is now the worst-performing classifier with a mean of 0.683 due to the high increase in performance of Gaussian Process. SVM and Naive Bayes had the same performance as with default hyperparameters. Figure 5.4: Box plot showing the F1-scores of the classifiers across 200 runs with tuned hyperparameters on CodeBERT embeddings, averaged across all patterns. Table 5.5: Table depicting mean, variance, highest, and lowest F1-scores obtained from running each classifier with tuned hyperparameters on CodeBERT embeddings, averaged across all design patterns over 200 runs. Classifier Mean Variance Highest Lowest MLPClassifier 0.835 0.007 1.000 0.595 Gradient Boosting 0.851 0.007 1.000 0.535 Random Forest 0.818 0.006 1.000 0.592 Decision Tree 0.683 0.009 0.904 0.426 KNN 0.795 0.007 1.000 0.500 Naive Bayes 0.707 0.006 0.860 0.544 Gaussian Process 0.811 0.024 1.000 0.167 Logistic Regression 0.884 0.005 1.000 0.651 SVM 0.832 0.005 1.000 0.563 36 5. Results Table 5.6 and Figure 5.5 show the performance of the tuned classifiers on individual design patterns. The confusion matrices mostly look the same as baseline except for Gaussian Process which now correctly can identify Builder and Prototype to a much higher degree. Singleton is still the pattern that most often gets predicted wrong, and the other patterns get most often wrongly predicted as being. Table 5.6: Table depicting F1-score, precision, and recall obtained from running each tuned classifier 200 times with different training and test sets across the Sin-