Gone Phishin'
An Investigation of Node Classification in Graphical Models for Domain Abuse Detection

Master's thesis in Complex Adaptive Systems

Joel Rosko
William Truvé

Department of Physics
Chalmers University of Technology
Gothenburg, Sweden 2023
www.chalmers.se

© Joel Rosko & William Truvé 2023.

Supervisor: Anders Hansson, Recorded Future
Examiner: Mats Granath, Department of Physics

Master's Thesis 2023
Department of Physics
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Typeset in LaTeX, template by Kyriaki Antoniadou-Plytaria
Printed by Chalmers Reproservice
Gothenburg, Sweden 2023

Abstract

In today's digital era, cyber attacks pose a constant threat as attackers attempt to access proprietary data and disrupt operations on a daily basis. Phishing remains their number one attack method, where users are tricked into entering sensitive information which attackers will later use or sell. The use of domain abuse detection algorithms restricts the range of attack possibilities. Furthermore, since an attack may begin as soon as a domain goes live, finding and evaluating domains quickly is of paramount importance when countering cyber threat actors.

As of now, several feature-based classifiers exist and show good results in detecting domain abuse. However, these classifiers depend on a large set of features, are complicated to interpret, and struggle to generalize as attack patterns change. In this thesis we compare feature-based classifiers with our implementation of belief propagation to evaluate whether the use of structural information and fewer domain-specific features can create a more interpretable and general solution. By constructing a bidirectional graph connecting autonomous system numbers, classless inter-domain routing blocks, IP addresses, domains, and tokens extracted from the URL string, high connectivity between nodes is achieved for propagating inference. We experiment with various techniques when initiating the graph to find an appropriate setup for belief propagation.

Our implementation of belief propagation achieves an accuracy of 91% on the entire dataset, which is lower than the random forest's 94%, albeit with fewer false positives. With an AUC of 0.95 the classes are well distinguishable, and when optimizing thresholds and allowing nodes to be classified as "unknown", the accuracy increases to 96%. Overall, our findings demonstrate the potential of belief propagation for accurately identifying suspicious domains at scale, providing a valuable tool in the fight against cyber threats.

Keywords: phishing, random forest, belief propagation, loopy belief propagation

Acknowledgements

We would like to thank everyone at Recorded Future who has contributed to our master's thesis by engaging in discussions regarding our work. As a special thanks, we would like to express our heartfelt gratitude to our supervisor Anders Hansson, who has been a great mentor during every step of this project.
William Truvé, Gothenburg, June 2023
Joel Rosko, Gothenburg, June 2023

List of Acronyms

Below is a list of the acronyms used throughout this thesis, in alphabetical order:

ASN       Autonomous system number
AUC       Area under the ROC curve
BI-LSTM   Bi-directional long short-term memory
BP        Belief propagation
BPE       Byte-pair encoding
CBOW      Continuous bag-of-words
CIDR      Classless Inter-Domain Routing
CSS       Cascading Style Sheets
DBSCAN    Density-based spatial clustering of applications with noise
DGA       Domain generation algorithm
DNS       Domain name system
DOM tree  Document object model tree
GDPR      General Data Protection Regulation
GNN       Graph neural network
HTML      HyperText Markup Language
ML        Machine learning
RFC       Random forest classifier
RNN       Recurrent neural network
ROC       Receiver operating characteristic
SSL       Secure Sockets Layer
URL       Uniform Resource Locator

Contents

List of Acronyms
List of Figures
List of Tables
1 Introduction
  1.1 Purpose & Aim
  1.2 Problem statement
  1.3 Disposition
  1.4 Contribution
2 Background
  2.1 Cyber security
  2.2 Recorded Future
  2.3 Data origin
  2.4 Domain Features: DNS and Content
3 Related Work
  3.1 Data
  3.2 Models that act on string-based features
  3.3 Models that act on infrastructure-based features
  3.4 Models that act on content-based features
  3.5 Belief Propagation
  3.6 Graph Neural Networks
4 Theory
  4.1 Random Forest Classifier
  4.2 Loopy Belief Propagation
  4.3 Levenshtein distance
  4.4 Tokenization, byte-pair encoding
  4.5 String embeddings and Word2Vec
  4.6 Graph embeddings and Node2Vec
  4.7 Long short-term memory (LSTM) and bi-directional LSTM
5 Methods
  5.1 Data Preparation
    5.1.1 Label encoding
  5.2 Feature based classifier
    5.2.1 Random forest classifier
  5.3 Belief Propagation
    5.3.1 Graph construction
    5.3.2 Levenshtein neighbours
    5.3.3 BPE tokenization
    5.3.4 Edge potentials
    5.3.5 Node potentials
    5.3.6 Handling underflow for high degree nodes
    5.3.7 Algorithm implementation
6 Results
  6.1 RFC Results
  6.2 BP Results
    6.2.1 LSTM model
  6.3 Final model
7 Discussion
  7.1 Dataset
  7.2 RFC
  7.3 Belief propagation
    7.3.1 Final model
  7.4 Comparison of belief propagation and RFC
  7.5 Ethical
8 Conclusion
9 Future Work
Bibliography

List of Figures

2.1 Illustration of the Cyber Kill Chain and how the different steps create a framework to carry out a large-scale attack. By hindering phishing attacks, the delivery step is obstructed.
4.1 Illustration of random forest classification. The algorithm consists of multiple decision trees, where the leaf nodes correspond to that tree's prediction. The majority prediction of all trees in the forest becomes the final prediction.
4.2 Illustration of the message passing algorithm from node i to node j.
4.3 Illustration of CBOW where four words are used as context, two before and two after. The word vectors are projected into the same position, and thus their vectors are averaged. The projection layer then outputs the most probable middle word.
4.4 LSTM cell displaying the three important gates used to process information, as well as the memory cell which stores information from the sequence.
5.1 Flowchart of the steps needed to arrive at the end goal. The first and second steps are highly related and revolve around gathering and analyzing the data. The third step is divided into two areas: selecting and developing a feature-based classifier, and implementing the BP algorithm. The fourth and final step is to analyze the various models' pros and cons.
5.2 Description of the main steps when creating the initial RFC model, and the methods used to improve its performance.
5.3 Illustration of the constructed graph with all possible node and edge types.
5.4 Histogram of domains registered on the specific days that have at least one neighboring domain, where at least one domain in the cluster is a phishing domain. The red bars represent domains found to be phishing domains while the blue bars are domains not said to be phishing domains.
5.5 Stabilizing point, orange circle, used as a cutoff point for stop-word removal. The point is located at the largest perpendicular distance between the token frequencies and a straight line connecting the highest frequency with the lowest. Y-axis in log-scale.
5.6 Bar plot displaying the token frequencies from the dataset for some randomly sampled tokens.
The green bars represent the number of times the token was used in a benign URL while the red bars represent when the token was found in malicious URLs.
5.7 Sequential bidirectional LSTM model displayed with input sizes for each layer.
6.1 Bar plot showing the importance of the used features, with the mean decrease in impurity on the x-axis. The impurity decreases when a feature splits the class labels well.
6.2 Simulation of random forest classifiers with varying tree depth and number of trees. For each step, five models were evaluated with newly generated data splits and the results shown are the average across those five simulations.
6.3 ROC curve for the final model with the true positive rate on the y-axis and the false positive rate on the x-axis. The AUC score is 0.98, which indicates well-separated classes.
6.4 Results from training the LSTM model for ten epochs, where the blue lines represent the results from the training set and the orange lines the validation set. The left figure displays the accuracy and the right figure the loss.
6.5 Illustration of the selected final model. The data is used to construct the G6 graph; the BP algorithm then takes advantage of the Node2Vec embeddings and URL model predictions to predict the unlabeled domains.
6.6 ROC curve for the final model with the true positive rate on the y-axis and the false positive rate on the x-axis. The AUC score is 0.96, which indicates well-separated classes.
6.7 Predictions for domains predicted benign (left) and domains predicted malicious (right). The x-axis corresponds to prediction belief and the y-axis to sample density. The red graphs are wrongly predicted samples while the blue graphs represent accurate predictions.

List of Tables

5.1 Results for some of the evaluated classifiers, used to select the optimal one for the current dataset. The majority perform fairly similarly; however, random forest has a slight edge in most evaluation metrics and will therefore be used moving forward.
5.2 Different node types and the possible connections for each node type.
5.3 Compatibility matrix when using epsilon edge initialization. When passing messages between nodes with the same class label, a higher potential is set than when two different class labels interact.
5.4 Compatibility matrix used for similarities between node embeddings from Word2Vec or Node2Vec. The potential between two equal class labels is the minimum of the scaled cosine similarity, S_{C,n}, and the maximum threshold, t+. For opposite class labels, the maximum of S_{C,n} and the minimum threshold, t−, is used.
6.1 Results from the final RFC model, achieving an accuracy of 95%. The other metrics are shown for both class labels, where support represents the number of samples for each class in the test set.
6.2 Evaluation Results for Different Graphs using the Epsilon Model
6.3 Evaluation Results for G5 using three different Models
6.4 Evaluation Results for G6 using three different Models
6.5 Evaluation Results for G5 using three different Models assisted by an LSTM model
6.6 Evaluation Results for G6 using three different Models assisted by an LSTM model

1 Introduction

Domain abuse is a highly relevant topic in today's society. This thesis investigates and evaluates a few different approaches to detecting new phishing domains in order to prevent potential attacks.

1.1 Purpose & Aim

A recent study performed over a 6-month span in 2022 found more than 255 million phishing attempts during that period [1], a number corresponding to a 61% increase from the previous year. These attacks are not only increasing in number but also in sophistication, making it difficult for the victim to tell whether or not something is a scam. Machine learning (ML) models have become a common approach to identifying and warning a user about malicious domains, using domain name system (DNS) data as features when classifying suspicious domains.

Existing feature-based classifiers have reached promising results, around 98% accuracy, on their specific datasets; however, when exposed to new data their performance will likely decrease, due to constant changes in attack patterns [2]. Aiming to establish a more robust and performant classifier, a graph schema between domains and IPs is created and combined with passive data. The graph is used to propagate reputation between neighboring nodes as a way of utilizing the important structural information [3].

1.2 Problem statement

In light of the presented purpose, we aim to:
• Develop a graph-based model to classify phishing domains.
• Compare feature-only models and graph-based models.
• Minimize falsely predicted malicious domains to avoid alert overflow.

1.3 Disposition

A description of what is included in each chapter.

Introduction
Introduces the ongoing problems with phishing attacks and the weaknesses of many current solutions. We briefly describe our approach as an alternative solution and the problem statements we wish to answer in order to evaluate the benefits of a graph-based solution.

Background
The background chapter gives a short introduction to the field of cybersecurity, mainly focusing on phishing attacks. Recorded Future as an organization is introduced, along with how this study could assist in their work. Lastly, the data sources and their features are described.

Related work
A lot of research has been done in the field of detecting domain abuse. In related work, multiple previous projects used as inspiration are mentioned. These include simple URL-string models, feature classifiers, belief propagation (BP) approaches that highly inspired our project, and more complex neural network solutions for possible future work.

Method
Divided into two parts, the method chapter describes the selection and implementation of a feature-based classifier and the development of our BP algorithm. The chapter provides an in-depth explanation of algorithm modifications and various experimental graph constructions.

Results
The results chapter includes all results obtained when creating the random forest and BP models. The BP results contain several experiments from iterating through alternative solutions. The last sub-chapter then highlights the optimal solution, the final model, and thoroughly presents its performance.
Discussion
The results and problems introduced previously in the report are combined to discuss the initial problem statements. The chapter also includes additional thoughts and problems encountered during the project.

Conclusion
The last chapter is a short and concise explanation of the main insights found throughout the project.

1.4 Contribution

The thesis contains three main contributions compared to previously developed BP solutions. The first contribution is a method to handle underflow when propagating belief through a network with high-degree nodes: by averaging the messages received, numerical issues are avoided. The second contribution is a new kind of connection between domains: the final graph is equipped with edges between domains whose URLs are a short Levenshtein distance apart. The third contribution is to extract the most common URL sub-strings using byte-pair encoding tokenization, increasing graph connectivity and discovering new connections.

2 Background

In this chapter, we motivate why this work is highly relevant. Furthermore, we present and explain this work's relevance for Recorded Future.

2.1 Cyber security

Cybersecurity is the practice of protecting computer systems, networks, and data from unauthorized access. As most companies strive for increased digitalization and automation, cybercriminals are able to pose a great threat to all kinds of infrastructure. Moreover, cyberattackers are becoming more sophisticated in their techniques and tools, which means that businesses must stay updated on the latest potential threats and best practices to protect themselves. With the rise of large language models, cybercriminals now have the potential to train AI chatbots to constantly send out phishing emails and launch malicious websites in order to infiltrate networks [4, 5].

A model that is frequently used within the cybersecurity industry is the "kill chain" model. The kill chain is a framework that describes the different stages of how a cyber attack is prepared and launched. Understanding the process of an attack can help organizations respond to possible threats in an effective manner. Following is a description of the different steps, as well as an illustration of the kill chain in Figure 2.1.

Reconnaissance: The attacker gathers information about the target.
Weaponization: The attacker creates a weapon, such as malware, to exploit a vulnerability.
Delivery: The attacker delivers the weapon to the target, typically through email or a compromised website.
Exploitation: The weapon is activated, and the attacker gains access to the target's system.
Installation: The attacker installs a backdoor or other malware on the target's system to maintain access.
Command and Control: The attacker establishes a command and control channel to communicate with the malware on the target's system.
Actions on Objective: The attacker achieves their objective, which could be stealing data or disrupting operations.

Figure 2.1: Illustration of the Cyber Kill Chain and how the different steps create a framework to carry out a large-scale attack. By hindering phishing attacks, the delivery step is obstructed.

Early in the chain, attackers will attempt to infect their targets with malware. By finding and blocking phishing websites, organizations can prevent users from falling victim to some of these attacks.
This can be done through various methods, such as using email filters to block phishing emails or using web filters to block access to known phishing websites. In conclusion, cyber security is a critical aspect of our daily lives, and understanding the kill chain and how to prevent phishing attacks can help organizations and individuals better protect themselves from cyber-attacks.

2.2 Recorded Future

Recorded Future, founded in 2009, is the world's largest intelligence company with customers in the private sector as well as governments.1 By continuously collecting and analyzing data published on the open web, dark web, and other technical sources, Recorded Future is able to detect existing threats and assist its customers by providing them with information relevant to their company or organization. Moreover, Recorded Future collects all newly registered domains, which presents an opportunity for early phishing detection.

With threat actors constantly working to bypass existing security measures, information about what is occurring outside the firewalls is of vital importance to adapt to current threats. The threats are detected by specific machine learning models created by Recorded Future. However, in order to extend their insight into the threat landscape they must continuously develop additional methods and models. Having the capability to quickly and accurately evaluate domains being abused for phishing is one such field. Customers could then be alerted when they are being targeted, providing them with crucial information in order to avoid security breaches.

1 https://www.recordedfuture.com/company

2.3 Data origin

The data used in this project is gathered from three different sources, Tranco, Phishtank, and Recorded Future, creating a dataset containing both malicious and benign domains.

• Tranco: Tranco2 is an organization that rates websites based on their traffic. A website with more traffic receives a higher ranking and is assumed to be benign.
• PhishTank: PhishTank3 is a website that allows users to submit suspicious phishing domains; other users are later able to vote and validate whether or not submitted domains are used for malicious activity.
• Recorded Future: As described in the previous section, Recorded Future collects information on various events. Collected domains from Recorded Future are domains that were recently flagged as phishing domains.

From Tranco, approximately the 100 000 top-ranked websites were collected, which serve as benign domains. The domains fetched from PhishTank and Recorded Future are merged into domains labeled as malicious. These sources yielded approximately 40 000 domains from Phishtank and 100 000 domains from Recorded Future. The combined dataset of benign and malicious domains consists of approximately 240 000 domains.

Each domain is later enriched with multiple features which are utilized by the models. The two services used to collect features for the domains are the Recorded Future service SecurityTrails4 and Urlscan5. SecurityTrails provides the service to fetch domain name system and IP-related data, while urlscan collects data related to the DOM tree of the website corresponding to the domain.

2 https://tranco-list.eu/
3 https://phishtank.org/
4 https://securitytrails.com/
5 https://urlscan.io/

2.4 Domain Features: DNS and Content

Domain name system features, or DNS features for short, hold information describing how a website was created, and some of its connections, such as IP address(es).
All live websites have to be registered, have an active IP address, and usually contain some sort of content. The list of features is extensive and can tell quite a lot about a website. Below is a brief description of the most important features.

The URL string, for example "google.com", is an important part of a domain as it allows users to connect to a website easily. Furthermore, the URL can be analyzed to find malicious domains. One case where the URL is important is when a phishing website is created as a typosquat of a benign website; following the "google.com" example, a typosquat domain could be "goggle.com". Machine learning models have had success in finding phishing websites using only the URL string [6, 7, 8].

A common term for another specific subset of features is a domain's WHOIS information. For example, every domain has to be registered by an Internet domain registrar such as "GoDaddy Inc". This feature can also give information regarding the credibility of a domain, since some registrars are cheaper than others and may also provide additional anonymity, making it difficult to track the owner of a domain. Other features such as who registered the domain and technical contact information can sometimes also be found; however, since the General Data Protection Regulation (GDPR) was enforced in Europe, this information has become easier to hide for new domains [9].

Another set of interesting features is based on the content of a website, later also referred to as the document object model tree, or DOM tree for short. These content-based features can tell whether or not the website has a login form, the number of JavaScript files, and more.

Many domains today also use encryption when transferring data between the domain and the users. One way to encrypt data is via transport layer security, abbreviated TLS, which can be identified via a padlock next to the URL in your browser. This feature establishes a secure connection between your browser and the domain's server, preventing hackers from reading or modifying transferred data [10]. However, these certificates are often found on both benign and malicious websites. This is likely due to the fact that some companies offer free certificates, which allows malicious domains to disguise themselves as safe without any extra cost.

Every domain must also have at least one active IP address. This feature may be interesting to analyze since, when a phishing campaign is launched, many phishes may resolve to the same IP address, allowing us to propagate risk in that network. However, most IP addresses host both malicious and benign websites, which must be taken into consideration. All IPs are also linked to an autonomous system number (ASN) and are maintained by an organization owning that ASN. A detailed view showing which features this project makes use of is given in the method section.

3 Related Work

Trying to detect malicious websites has been a hot topic for many years now, and just as researchers try to find the best detection method, hackers are similarly using their best evasion techniques. This chapter presents some of the key findings from previous research on phishing detection.

A common approach is to extract features from the URL string, and from there train a machine-learning model to classify whether or not the website is malicious [11, 12].
Even though many previous works have reported results with high accuracy, over 90%, it has been shown that many of these models do not generalize very well. Studies have found that performance may drop by more than 30% when models are tested on previously unseen data [13, 14].

3.1 Data

Related research articles often make use of blacklists and whitelists to construct their datasets. When extracting legitimate domains, a common approach is to use websites that have a high ranking according to some list; however, these lists may be infiltrated, which would damage the quality of the dataset [15]. A dataset that has been made in an attempt to withstand such attacks is Tranco, which is an open-source list of benign domains [16].

Moreover, in the process of building a dataset of malicious domains, a common source is Phishtank.com, a crowdsourced repository that allows anyone to vote on whether or not a website is used for malicious activity. This data may also have some flaws, since the voting can be manipulated by hackers [17]. In our dataset, we have a mix of confirmed phishes from PhishTank as well as phishes from Recorded Future's database.

3.2 Models that act on string-based features

Predicting a domain solely based on its URL string may seem impossible, but it has proven to be quite successful. This can be done in a few different ways, for example by extracting a couple of features from the string, such as the number of dots and dashes, the presence of special symbols, or the output of algorithms that detect whether it is a randomly generated string [18, 19]. These models have shown high accuracy, and even though they make it more difficult for hackers to create websites, these detection methods can easily be avoided by handcrafted domain names. Another way of detecting malicious websites based on their string is to train a language model with transformers [8]. This too has proven quite successful but may suffer the same consequence of generalizing poorly due to poor diversity in the dataset. In our work, we recognize that the URL is obviously an important feature and can tell a lot about a domain; however, it has been shown that these models often show bias towards the dataset being used for training [20].

3.3 Models that act on infrastructure-based features

Many articles have extracted a number of features based on the domain URL string, DNS data, certificates, IP addresses, and WHOIS data, in an attempt to classify domains [21, 22, 23, 24]. This allows for a rather extensive dataset with a lot more information than just the URL string. Moreover, many phishing websites prefer registrars that are cheap and may provide additional anonymity, which is a motivation for why these features may improve a model's accuracy. However, hackers are not the only ones who prefer cheap and anonymous registration options, and even though these models can easily distinguish domains from Tranco versus Phishtank, it becomes more difficult to separate arbitrary benign domains, such as blogs or family websites, from malicious phishing sites. Shirazi [20] highlights several shortcomings exhibited by these models, such as the presence of bias towards the dataset. Instead, in that work, they designed features that can visualize relationships between a domain name and some critical aspects of a phishing website.
Another shortcoming of these models is the fact that GDPR and privacy services allow perpetrators to stay anonymous, which places a great limitation on the WHOIS information of a domain [25].

3.4 Models that act on content-based features

Similarly to infrastructure-based features, one can extract features based on the content of a website, such as screenshots and the DOM tree. This may include features such as the presence of login forms and brand logos. There are also more advanced features that may be extracted from a website's source code by analyzing the underlying code structure, such as HTML tags, CSS styles, and JavaScript functions [7, 26, 27]. Another feature that can often be important to analyze amongst phishing domains is the favicon [28]. Using these features may provide information regarding what type of website it is. Phishing sites tend to have as little text as possible and instead make use of images in order to avoid text-based techniques [29]. Furthermore, since the most common objective of a phishing website is to steal a person's credentials, there are usually several input fields.

3.5 Belief Propagation

BP is an algorithm that has been widely used for probabilistic inference in graphical models [30, 31, 32]. One of the reasons it has grown popular is that it scales really well, even on graphs with billions of edges [33]. A paper that we have taken a lot of inspiration from is Kim et al. [17], which showed promising results using BP. That research emphasizes how different edge potentials can increase the accuracy of BP. By embedding the nodes, they calculate an edge weight based on the two nodes' cosine similarity, thereby decreasing propagated risk if the nodes are not similar despite being connected. Other researchers have, for example, constructed a detailed graph with many different node types, such as ASN, IP address, domain, organization, and certificate [34]. This more detailed graph also showed great results.

For smaller graphs with low node degrees, BP works fine, even though it does not always converge if the graph contains loops. However, graphs that have nodes with high degrees will run into numerical issues, and therefore one must alter either the graph or the algorithm. Moreover, BP can be seen as a white-box model for smaller graphs, but when the graph grows large and complex, the results are not always intuitive to the human eye [35]. Finally, many researchers who base their model on BP usually modify it to the extent that they call it something else: [36] call their model NetConf, and [34] named their model MalRank. This shows that BP is not the optimal solution for these problems, but rather a good foundation for how one can tackle them.

3.6 Graph Neural Networks

Graph neural networks have become a popular method for node classification [37, 38]. Similar to BP, a GNN can also be seen as message passing in a graph; however, GNNs perform their message passing by training a model, whereas BP follows a determined graph structure. Both BP and GNNs face the difficulty of how to best structure the graph in order to get reliable results. For GNNs, it is also easier to incorporate domain features in the model, as the model can train edge potentials and node potentials to improve its results [39]. GNNs have also been applied to domain classification [40].
In conclusion, GNNs can be seen as a sort of hybrid model between a feature-based classifier and BP, as they can make use of features in order to determine the required graph parameters for risk propagation and node classification.

4 Theory

Our work is based on a number of theoretical concepts, which are introduced in this chapter.

4.1 Random Forest Classifier

Random forest is a widely used algorithm applied in various fields of classification. The algorithm is based upon multiple decision trees which together generate a prediction based on the input features of the provided data samples.

A decision tree is a sequential model using a sequence of logical rules to propagate through the tree until a leaf node is reached, i.e., the last node on that branch. The leaf node then assigns the prediction based on which class most frequently reaches that node. The number of logical operations performed before arriving at the prediction is bounded by the tree depth, which is the maximum number of logical operations across all branches. For each decision node, the data is partitioned into two or more subsets with the objective of separating the data classes as well as possible. The more the classes get separated, the more information is gained from that split, and the subsets are seen as more pure. In the left part of Figure 4.1 a simple decision tree can be seen, where the decision nodes represent the logical rules used to partition the data samples [41].

Figure 4.1: Illustration of random forest classification. The algorithm consists of multiple decision trees, where the leaf nodes correspond to that tree's prediction. The majority prediction of all trees in the forest becomes the final prediction.

As stated earlier, the random forest algorithm initiates multiple independent decision trees. With each decision tree producing its own prediction, the classifier collects all predictions and the majority class is selected. An illustration of the random forest classifier with k decision trees can be seen in Figure 4.1 [42].

4.2 Loopy Belief Propagation

This section provides a theoretical explanation of belief propagation, using examples from our domain to give an intuitive understanding of the algorithm.

BP is a message-passing algorithm that can be applied to graphical models and is used to solve inference problems. In this study, the graph contains loops, which means the resulting inference will be approximate. Moreover, the output of the model is a belief, or probability, of belonging to each possible class, for each node in the graph. At the initial phase of the algorithm, each node is given a prior belief, corresponding to the initial belief of it belonging to each class before running the algorithm. This prior belief is often called the node potential.

The message passing is performed such that each node sends messages to all its neighbors, one message for each class. In a case where there are three possible classes, for example criminal, police, or civilian, each node will send out three different messages to all neighbors, posing as a member of each and every class. When sending a message, the sender will also take other neighbors' messages, from the previous generation of messages, into consideration. Moreover, each node on the receiving end of the message will also act as if it belongs to every class, criminal, police, or civilian. This leads to a total of nine candidate messages; however, the receiver will only accept three messages, one for each label.
In our implementation, the message with the minimum value will be accepted. The weight of each message is also influenced by how important the connection between classes is; policemen sending messages to criminals may, for example, have a weaker weight than criminals sending messages to other criminals. This weight is called the edge potential.

The algorithm has three sets of variables that must be initialized: node potentials $\phi_i$, edge potentials $\psi_{ij}$, and messages $m_{n \to i}(x_i)$. In our study, each node will have two node potentials, one for each class. The edge potentials will have four different values, given $\psi_{ij}(x_i, x_j)$ from equation (4.1) and that $x_i$ and $x_j$ can each be either malicious or benign. In general, when $X$ is the set of classes each node may be classified as, $|X|^2$ gives the number of combinations for $\psi_{ij}(x_i, x_j)$.

There are a few common variations of belief propagation when calculating the messages: one can use logarithmic values with either min-sum or max-sum, or probabilistic values with either min-product or max-product [43, 44, 17]. The algorithm showcased here uses logarithmic values and min-sum. The algorithm repeats equation (4.1) until the messages converge. If the algorithm does not converge, due to alternating between two different values, it runs for a chosen number of iterations.

$$ m_{i \to j}(x_j) = \min_{x_i \in X} \left[ \phi_i(x_i) + \psi_{ij}(x_i, x_j) + \sum_{n \in N(i) \setminus j} m_{n \to i}(x_i) \right] \qquad (4.1) $$

The expression above represents the calculation of $m_{i \to j}(x_j)$, where the minimum value is obtained by considering all possible values of $x_i$ from the set $X$. The calculation includes the contribution of several terms: the node potential $\phi_i(x_i)$, the edge potential $\psi_{ij}(x_i, x_j)$, and the messages $m_{n \to i}(x_i)$ from neighboring nodes $n$ (excluding $j$). By selecting the value of $x_i$ that minimizes the overall expression, $m_{i \to j}(x_j)$ is assigned the smallest possible value over $x_i$.

Figure 4.2: Illustration of the message passing algorithm from node i to node j.

After computing $m_{i \to j}(x_j)$ for all $x_j \in X$, the messages are normalized. When the message passing algorithm is complete, a final belief $b$ is set as

$$ b_i(x_i) = \phi_i(x_i) + \sum_{j \in N(i)} m_{j \to i}(x_i). \qquad (4.2) $$

Similar to the normalization of messages, the final beliefs are also normalized. Moreover, the beliefs can be seen as probabilistic values which represent the probability of each node belonging to each class. In the simplest case, the end result will be binary, setting the classification of each node based on the most probable value. However, we will later show how thresholds may be added as a way to tune the end results.

4.3 Levenshtein distance

The Levenshtein distance will later be used to measure string similarity between URL strings, connecting domains directly to other domains in the graph. What follows is a theoretical background of the Levenshtein distance algorithm.

Levenshtein distance, also known as edit distance, is a measure of the difference between two strings, applied in a wide range of applications within the field of computational linguistics. It is defined as the minimum number of character edits needed to transform one string, string s, into the other, string t.
For each edit, a cost is added to the distance, where the possible edit operations are:
• Insertion: cost of $\delta_{ins}$
• Deletion: cost of $\delta_{del}$
• Substitution: if $s_i \neq t_j$, cost of $\delta_{sub}$

The dynamic programming solution can be seen in equation (4.3), where the minimum cost of transforming the first $i$ characters of string s into the first $j$ characters of string t is determined. To evaluate the entire strings, the formula starts at $D(m, n)$, where $m = \mathrm{len}(s)$ and $n = \mathrm{len}(t)$, and recursively iterates through the table of possible operations until string s has been fully transformed into string t. Depending on the penalties defined for the different operations, the algorithm could favour some operations more than others [45].

$$ D(i,j) = \begin{cases} \max(i,j) & \text{if } \min(i,j) = 0 \\[4pt] \min \begin{cases} D(i-1,j) + \delta_{del} \\ D(i,j-1) + \delta_{ins} \\ D(i-1,j-1) + \delta_{sub}\,[s_i \neq t_j] \end{cases} & \text{otherwise} \end{cases} \qquad (4.3) $$

4.4 Tokenization, byte-pair encoding

When constructing the graph, the URL string will be tokenized using this particular method. What follows is a theoretical background to byte-pair encoding.

Byte-pair encoding (BPE) is a technique used for data compression that can be applied as a sub-word tokenizer. BPE generates a vocabulary of a specified size for a given text. In other words, based on the desired vocabulary size, the algorithm creates a set of tokens (which could be letters, sub-words, or whole words) that are later mapped to the text.

To create the tokenization vocabulary, the vocabulary is initialized with all individual characters as tokens. Then, through iterative merging of the most frequently co-occurring token pairs, new tokens are created and added to the vocabulary. This process is repeated until the desired vocabulary size is reached [46].

"BPE ensures that the most common words are represented in the vocabulary as a single token while the rare words are broken down into two or more subword tokens and this is in agreement with what a subword-based tokenization algorithm does."1

1 https://towardsdatascience.com/byte-pair-encoding-subword-based-tokenization-algorithm-77828a70bee0

4.5 String embeddings and Word2Vec

A Word2Vec model is part of our BP implementation; what follows here is a theoretical background.

The Word2Vec algorithm is an unsupervised learning model published by Google in 2013 [47]. Word2Vec has demonstrated several use cases for natural language processing, with applications linked to information retrieval and recommendation systems. The algorithm provides a tool to generate vector embeddings of words while capturing semantic and syntactic relationships between them.

Word2Vec can be applied with two different modeling architectures: skip-gram and continuous bag-of-words (CBOW). As of now, the most common approach is CBOW. With a neural network, CBOW uses the context, i.e., the surrounding words, to predict the word in the middle. The context words serve as input, and their current vector representations are averaged in a projection layer. The order of the input words is not taken into consideration. The projection layer relies on a log-linear classifier to predict the most probable middle word. By specifying a window size, the number of words to include when predicting the middle word is adjusted. An example of CBOW with four words as context input is shown in Figure 4.3.

Figure 4.3: Illustration of CBOW where four words are used as context, two before and two after. The word vectors are projected into the same position, and thus their vectors are averaged.
The projection layer then outputs the most probable middle word.

Once the model is trained on a text corpus, each string occurring in the corpus receives an embedding. The similarity or relationship between words can then be evaluated based on the multidimensional vector representation, for example by calculating cosine similarity. The model has proven to define clear similarities between "big" and "bigger", and, more impressively, relationships between "France" and "Paris" [48].

4.6 Graph embeddings and Node2Vec

Node2Vec is an algorithm that utilizes graph structure in order to create vector representations of each node. The generated node embeddings aim to be of a lower dimension than the initial graph. The Node2Vec embeddings offer users the ability to use network data as input to traditional classifiers or to evaluate similarities between nodes.

Node2Vec preserves the network structure around neighbouring nodes by running several biased random walks of a specified length $l$ from every node. Each step in the walk is determined by the probabilistic outcome, $P$, of equation (4.4). Here, $c$ denotes the nodes traversed in the walk, $v$ is the previous node, $x$ is a possible next node, $\pi_{v,x}$ is the unnormalized transition probability between $v$ and $x$, and $Z$ is a normalizing constant. If no edge exists between the two nodes, $P = 0$.

$$ P(c_i = x \mid c_{i-1} = v) = \begin{cases} \dfrac{\pi_{v,x}}{Z} & \text{if } (x, v) \in E \\ 0 & \text{otherwise} \end{cases} \qquad (4.4) $$

A search bias $\alpha$ is introduced to allow the random walks to explore the immediate proximity of the starting node as well as traverse further away. The transition probability $\pi_{v,x}$ is therefore updated using equation (4.5), where $t$ is the node traversed prior to node $v$, and $w$ is the edge weight.

$$ \pi_{v,x} = \alpha_{p,q}(t, x) \cdot w_{v,x} \qquad (4.5) $$

The bias is defined by equation (4.6), where $p$ and $q$ control how quickly the walk explores the graph, and $d_{t,x}$ denotes the shortest path between nodes $t$ and $x$. Since $d_{t,x}$ represents the minimum number of edges needed to traverse from node $t$ to node $x$, it can only take a value in {0, 1, 2}: if $t = x$, zero edges are needed; if $t$ and $x$ are directly connected, $d_{t,x} = 1$; and if node $v$ has to be included in the path, two edges are traversed.

$$ \alpha_{p,q}(t, x) = \begin{cases} \dfrac{1}{p} & \text{if } d_{t,x} = 0 \\[2pt] 1 & \text{if } d_{t,x} = 1 \\[2pt] \dfrac{1}{q} & \text{if } d_{t,x} = 2 \end{cases} \qquad (4.6) $$

With the generated walks, the embeddings are created using Word2Vec, explained in section 4.5, this time using the skip-gram architecture. Each walk is seen as a sentence where each node represents a word [49].

4.7 Long short-term memory (LSTM) and bi-directional LSTM

A bi-directional LSTM model will be used to assist in the classification of nodes in our graph; here, a theoretical background is presented.

Long short-term memory (LSTM) networks are a type of recurrent neural network (RNN) architecture. RNNs are a powerful tool when working with sequential data, due to their use of feedback loops which pass information from the hidden states, i.e., previous data in the sequence, forward as the sequence proceeds. The classical RNN, however, experiences problems with vanishing gradients as the sequences get longer. LSTM networks provide a solution that avoids vanishing gradients and therefore offer better long-term memory [50].

Figure 4.4: LSTM cell displaying the three important gates used to process information, as well as the memory cell which stores information from the sequence.
The way LSTM models retain long sequences of information is with the help of a memory cell, shown in Figure 4.4 as the cell state. What information is stored or discarded is handled by the three gates in an LSTM cell. The input gate ingests new information from the sequence and the previous hidden state, using a sigmoid activation function to store valuable information in the cell state. The forget gate determines what information should be removed from the cell state, similarly to the input gate, by passing the previous hidden state into a sigmoid function. The last gate, the output gate, outputs the new hidden state of the cell: the gate's input is passed through a sigmoid function and then multiplied with the tanh value of the cell state. The output can either be used as a prediction or serve as the hidden-state input to a new cell [51].

In contrast to a regular LSTM model, which iterates through the data from start to end, a bidirectional LSTM model processes the data from two directions by incorporating two separate layers: one processes the sequence in ordinary order while the other processes the data from end to beginning. This allows the model to benefit from both prior and future information, which has been shown to be an improvement in some classification tasks. The bidirectional output is then merged to be used as input to the following cells or to serve as the prediction [52].

5 Methods

In this chapter, the methods used to create and fine-tune our models are described in detail. Since multiple separate components belong to the two different models, the implementations each have their own sub-section. The flowchart in Figure 5.1 shows, on a large scale, the main steps taken throughout the project to reach the final results. The two first points, data and analysis, are closely related, while the model development and evaluation are more independent.

Figure 5.1: Flowchart of the steps needed to arrive at the end goal. The first and second steps are highly related and revolve around gathering and analyzing the data. The third step is divided into two areas: selecting and developing a feature-based classifier, and implementing the BP algorithm. The fourth and final step is to analyze the various models' pros and cons.

5.1 Data Preparation

The data that was briefly described earlier in section 2.3 includes multiple features, each linked to one domain. As stated earlier, approximately 240 000 domains were fetched. To only evaluate the more relevant domains, expired domains, i.e., domains no longer resolving to any IP address, are removed. The final dataset therefore contains around 150 000 domains, where 90 000 are labeled benign while the remaining 60 000 are labeled malicious.

A list of the enriched features with short descriptions can be seen below; the feature-based classifier uses the majority of these features, while the BP solution includes a far smaller sample of them.

1. Domain: Domain URL string.
2. IP: List of IP or IPs which the domain resolves to.
3. TLD: The domain's top-level domain.
4. WHOIS registrar: Where the domain is hosted.
5. SSL self-signed: Whether or not the SSL certificate is self-signed.
6. SSL issuer organization: Name of the organization which issued the SSL certificate.
7. Number of IPs: Number of IPs the domain resolves to.
8. DGA probability: Probability that the domain's URL string has been generated by an algorithm.
9. Number of dashes: Number of dashes in the URL.
10. Number of dots: Number of dots in the URL.
11. URL length: Number of characters in the URL.
12. URL entropy: Entropy measure for randomness in the URL.
13. ASN: AS number of the IP.
14. ASN owner: Organization owning the AS number.
15. IP degree: Number of domains resolving to the same IP.
16. Number of CSS files: Number of CSS files included in the website.
17. Number of JavaScript files: Number of JavaScript files included in the website.
18. Number of images: Number of image files included in the website.
19. Has login form: Whether there exists a login form on the website.
20. Number of input forms: Number of input forms found on the website.
21. Label: Class the domain belongs to.

5.1.1 Label encoding

Based on the list above, several of the features are categorical, while the majority of classifiers require numerical input. The features that had to be pre-processed, due to having too many different options, were numbers 3, 4, 6, and 14. These features were encoded using a basic label encoder.1 Separate label encoders were built for each of these features, where each possible feature label is assigned an integer representation.

1 https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html

5.2 Feature based classifier

There are numerous traditional classifiers available off the shelf. Therefore, a simulation evaluating several of those was carried out to pick the optimal alternative to use as a comparison model. In Table 5.1 the top five classifiers are shown. Although the results are similar, the random forest classifier stands out with the best performance and will be used moving forward.

Model                            Accuracy  AUC   Recall  Precision  F1
Light gradient boosting machine  0.93      0.97  0.86    0.97       0.91
K nearest neighbours             0.92      0.95  0.85    0.93       0.89
Random forest                    0.94      0.97  0.87    0.97       0.92
SVM                              0.91      0.90  0.82    0.95       0.82
Logistic regression              0.91      0.95  0.83    0.93       0.88

Table 5.1: Results for some of the evaluated classifiers, used to select the optimal one for the current dataset. As can be seen, the majority perform fairly similarly; however, random forest has a slight edge in most evaluation metrics and will therefore be used moving forward.

5.2.1 Random forest classifier

When developing the RFC model, the Python module scikit-learn, which contains an efficient and customizable random forest package2, was used. The model was trained using the features listed at the beginning of section 5.1. Since the classifier only accepts numerical input, all categorical features were encoded as explained in section 5.1.1. For all samples with missing values, "NaN", the values were set to −1. Without any fine-tuning, this setup serves as our initial RFC model. Figure 5.2 briefly explains the creation of the initial model and the subsequent steps taken to improve its performance.

2 https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

Figure 5.2: Description of the main steps when creating the initial RFC model, and the methods used to improve its performance.

With an initial model trained using all features, a feature importance simulation was carried out to find the most dominant features. Reducing the number of features simplifies the decision path and also reduces misleading information. A recursive feature elimination algorithm was applied: each run, the algorithm removes the currently least important feature, until no features are left. Once completed, only the subset of features which maximizes performance is kept [53].
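As an illustration of this elimination loop, the sketch below pairs a random forest with scikit-learn's cross-validated recursive feature elimination (RFECV). The synthetic data stands in for the label-encoded feature matrix described above; all names and parameter values here are assumptions for the sketch, not the thesis' actual setup.

```python
# Minimal sketch of recursive feature elimination with a random forest.
# make_classification stands in for the ~150 000-domain feature matrix.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=8, random_state=0)

rf = RandomForestClassifier(n_estimators=100, random_state=0)

# step=1 drops the currently least important feature each round;
# cross-validation keeps the subset that maximizes accuracy.
selector = RFECV(estimator=rf, step=1, cv=5, scoring="accuracy")
selector.fit(X, y)

print("Optimal number of features:", selector.n_features_)
print("Kept features (boolean mask):", selector.support_)
```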
The last step when fine-tuning the RFC model was to decide the optimal model parameters, the number of trees and the maximum tree depth, as shown in Figure 5.2. Four different tree depths were used, as can be seen in the list below; for each of those, we varied the number of trees from 10 to 300. At each step, five runs with new data splits were executed to calculate the average.

• Tree depths: 6, 12, 24, 48
• Number of trees: 10–300 with steps of five

The fine-tuned RFC model is selected based on the most performant parameter setup. Once the accuracy of the several models converges, an optimal RFC model is found.

5.3 Belief Propagation

This section describes the different steps taken to run our modified BP. Moreover, a brief part of our data analysis is presented here as motivation for the use of Levenshtein distance as well as our chosen method of tokenization.

5.3.1 Graph construction

The first step before we can apply BP is to construct a graph. We experimented with a few different node types and connections between the nodes, and ended up with the following node types: Domains, Tokens, IPs, CIDR-Blocks, and ASNs. The results of a few different graph compositions are presented later in the results. Possible connections can be seen in Table 5.2. Domains are linked to other domains if their normalized Levenshtein distance is at most 0.1. If the URL string of a domain contains a token created by the BPE tokenizer, described in the theory section, an edge connects those two nodes. Furthermore, each domain is connected to one or more IP addresses, each IP belongs to a CIDR block, and each CIDR block in turn connects to an ASN. To clarify further, every node is unique, meaning that if two domains have the same IP address or token, that token or IP will be connected to both domains and allow risk to spread via itself. An illustration of an example graph is shown in Figure 5.3.

The motivation for including IP addresses as well as CIDR-Blocks and ASNs is to increase the connectivity of the graph; this has also been done in previous works on the topic [17]. The motivation for the use of Levenshtein distance and tokens is described in the following section.

Figure 5.3: Illustration of the constructed graph with all possible node and edge types.

Node type    Neighbours
Domain       Token, IP, Domain
Token        Domain
IP           Domain, CIDR-Block
CIDR-Block   ASN
ASN          CIDR-Block

Table 5.2: Different node types and the possible connections for each node type.

5.3.2 Levenshtein neighbours

After performing extensive data analysis, it was found that many phishing domains have very similar names, often differing by only one or two characters. Attackers usually register multiple domains with similar names to start their attack; then, if one domain is detected as malicious and shut down, other domains remain active. Due to this, the Levenshtein distance between all domains was calculated and then normalized using equation (5.1). Here, $s_1$ represents one URL, $s_2$ the other, and $l$ is the Levenshtein distance between the two strings. The normalized distance ranges between zero and one, where a lower value indicates similar strings.

$$ 1 - \frac{|s_1| + |s_2| - l}{|s_1| + |s_2|} \qquad (5.1) $$

The normalized distances are used to create clusters of domains, where each cluster corresponds to a set of domains referred to as Levenshtein neighbors.
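As an illustration, equation (5.1) can be computed directly on top of the dynamic program from equation (4.3). The sketch below uses unit costs for all three edit operations and is not the thesis' implementation; the example URLs echo the defanged domains discussed below.

```python
# Self-contained sketch of the normalized Levenshtein distance,
# equations (4.3) and (5.1), with unit edit costs.
def levenshtein(s: str, t: str) -> int:
    """Dynamic-programming edit distance, equation (4.3)."""
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                      # i deletions
    for j in range(n + 1):
        D[0][j] = j                      # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(D[i - 1][j] + 1,                           # deletion
                          D[i][j - 1] + 1,                           # insertion
                          D[i - 1][j - 1] + (s[i - 1] != t[j - 1]))  # substitution
    return D[m][n]

def normalized_distance(s1: str, s2: str) -> float:
    """Equation (5.1): 0 for identical strings, growing toward 1."""
    l = levenshtein(s1, s2)
    total = len(s1) + len(s2)
    return 1 - (total - l) / total

# Two typosquat-style URLs land far below the 0.1 neighbour threshold.
print(normalized_distance("bankofamerica-secure01.com",
                          "bankofamerica-secure02.com"))  # ~0.019
```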
For two domains to be seen as neighbors, their normalized Levenshtein distance must be less than or equal to 0.1. Depending on the lengths of the URLs, this corresponds to a difference of 10% and is often equivalent to one or two string edits. For domains to be in the same cluster, however, a domain only needs to neighbor at least one other domain in the cluster. Our clustering approach is inspired by the DBSCAN algorithm, which identifies clusters based on their density [54].

Figure 5.4 shows why this information can be beneficial when constructing the graph. Here, all domains registered during a random sample of days were evaluated using the Levenshtein neighboring method. The domains shown are those belonging to a cluster where at least one of the domains was classified by Recorded Future as a phishing website after being active for almost one month. The red bars are phishing domains, while the blue bars represent clustered domains not yet convicted of phishing. As an illustration, one of the red domains is bankofamerica-secure02[.]com, while the suspicious-looking URL bankofamerica-secure01[.]com is one of the blue domains. The classification of these websites does not correspond to any ground truth; it does, however, indicate that Levenshtein distance may be used to find new phishing domains.

Figure 5.4: Histogram of domains registered on the specific days with at least one neighboring domain, where at least one domain in the cluster is a phishing domain. The red bars represent domains found to be phishing domains, while the blue bars are domains not reported as phishing domains.

When constructing the graph, Levenshtein edges are only added between two neighboring domains. The reason for not drawing edges between each and every domain in the cluster is that the risk will propagate throughout the chain of domains when running BP anyway. The addition of these edges allows risk to propagate directly between two domains, eliminating the need for intermediate nodes.

5.3.3 BPE tokenization

The BPE tokenizer was trained on the entire dataset of domains with a vocabulary size of 8000. Allowing a rather large vocabulary means rare and specific words or sub-words are found as tokens in the URLs, while the limit still manages to increase the connectivity of the graph, since the number of tokens is bounded. For each URL, the tokenizer is used to extract the tokens found in it.

Before the tokens are added to the graph, some modifications are made to the token vocabulary. We start by removing all tokens with fewer than four characters, |token| < 4. Next, we use the elbow method, the largest perpendicular distance, to find the stabilizing point, which is used as a cutoff, shown in Figure 5.5. By drawing a straight line from the most frequent token to the least frequent token and calculating the perpendicular distance from this line to the token frequency curve, the largest perpendicular distance is found [17]. The words prior to the stabilizing point are deemed stop-words and are removed, while the remaining words are added to the final vocabulary.

Figure 5.5: Stabilizing point, orange circle, used as the cutoff point for stop-word removal. The point is located at the largest perpendicular distance between the token frequencies and a straight line connecting the highest frequency with the lowest. The y-axis is in log scale.
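The cutoff search just described can be sketched as follows: for every rank, compute the perpendicular distance from the (rank, log-frequency) point to the line between the first and last points, and take the arg max. This is a minimal NumPy illustration with made-up frequencies, not the thesis code.

```python
import numpy as np

def elbow_cutoff(frequencies: np.ndarray) -> int:
    """Index of the largest perpendicular distance from the line that
    connects the most frequent token with the least frequent one."""
    y = np.log(np.sort(frequencies)[::-1])          # log-frequencies, descending
    x = np.arange(len(y), dtype=float)
    p0 = np.array([x[0], y[0]])                     # most frequent token
    p1 = np.array([x[-1], y[-1]])                   # least frequent token
    line = (p1 - p0) / np.linalg.norm(p1 - p0)      # unit vector along the line
    rel = np.stack([x, y], axis=1) - p0             # points relative to p0
    # Perpendicular distance = |component of rel orthogonal to the line|.
    dist = np.abs(rel[:, 0] * line[1] - rel[:, 1] * line[0])
    return int(np.argmax(dist))

freq = np.array([9000, 7000, 5000, 800, 200, 150, 120, 100, 90, 80])
k = elbow_cutoff(freq)
# Tokens ranked before k are treated as stop-words and removed.
print("cutoff rank:", k)
```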
All tokens in the final vocabulary are then added as nodes when constructing the graph. Each domain in the graph that has any of those tokens in its URL receives an edge to the node associated with the token. Due to the removal of multiple tokens, some domains receive no token connection, and the final vocabulary size is approximately 4600. Adding tokens increases the connectivity between phishing domains, since many strings are reused by attackers to appear legitimate. This is exemplified in Figure 5.6.

Figure 5.6: Bar plot displaying the token frequencies from the dataset for some randomly sampled tokens. The green bars represent the number of times a token was used in a benign URL, while the red bars represent when the token was found in malicious URLs.

5.3.4 Edge potentials

Three different approaches were tried for initializing the edge potentials. As a first approach, we applied the classical approach of using a coupling value, epsilon. The edge potential when passing messages between nodes with the same class label receives a higher value than when the class labels differ. We used an epsilon value of 0.3; the values can be seen in Table 5.3.

            Benign   Malicious
Benign      0.5 + ε  0.5 − ε
Malicious   0.5 − ε  0.5 + ε

Table 5.3: Compatibility matrix when using epsilon edge initialization. When passing messages between nodes with the same class label, a higher potential is set than when two different class labels interact.

When setting the edge potentials for the second experiment, using Word2Vec embeddings, multiple steps were carried out to reach the final potentials. The nodes represented by strings, domains and tokens, were embedded using a trained Word2Vec³ model to obtain a vector representation of length 100. For the remaining node types, IP, CIDR block, and ASN, the embeddings were calculated as the average of the neighboring nodes' embeddings. Equation (5.2) gives the averaged vector representation of the ith node, where the sum of the neighboring node embeddings is divided by the number of neighbors. |N(i)| denotes the number of neighbours of node i, whereas N(i)_emb is the set of vector embeddings of the neighbours of node i.

n(i)_emb = (1 / |N(i)|) Σ_{n_emb ∈ N(i)_emb} n_emb    (5.2)

³ https://radimrehurek.com/gensim/models/word2vec.html

The last approach for initializing the edge potentials was Node2Vec. A Node2Vec model⁴ was trained on the entire graph, generating vector embeddings for all node types in one step.

⁴ https://github.com/eliorc/node2vec

With all nodes embedded using either Word2Vec or Node2Vec, the cosine similarity between node embeddings is calculated using equation (5.3). The similarities are scaled to range between zero and one by equation (5.4). The potentials are then as shown in Table 5.4, with some modifications compared to the epsilon potentials.

S_C(n(i), n(j)) = cos(θ) = n(i)_emb · n(j)_emb / (||n(i)_emb|| ||n(j)_emb||)    (5.3)

S_C,n = (S_C + 1) / 2    (5.4)

            Benign           Malicious
Benign      min(t+, S_C,n)   max(t−, S_C,n)
Malicious   max(t−, S_C,n)   min(t+, S_C,n)

Table 5.4: Compatibility matrix used with similarities between node embeddings from Word2Vec or Node2Vec. The potential between two equal class labels is the minimum of the scaled cosine similarity, S_C,n, and the upper threshold, t+. For opposite class labels, the maximum of the lower threshold, t−, and S_C,n is used.

The compatibility matrix introduces two thresholds, t+ and t−, which bound the edge potential from above and below.
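A minimal sketch of how such a compatibility matrix can be derived from two node embeddings follows; the threshold values t_plus and t_minus are illustrative assumptions, as the thesis introduces the thresholds without fixing values here.

```python
import numpy as np

def edge_potentials(e_i: np.ndarray, e_j: np.ndarray,
                    t_plus: float = 0.9, t_minus: float = 0.1) -> np.ndarray:
    """Build the 2x2 compatibility matrix of Table 5.4 from two embeddings."""
    cos = e_i @ e_j / (np.linalg.norm(e_i) * np.linalg.norm(e_j))  # eq. (5.3)
    s = (cos + 1) / 2                                              # eq. (5.4)
    same = min(t_plus, s)    # equal class labels: capped by t+
    diff = max(t_minus, s)   # opposite class labels: floored by t-
    return np.array([[same, diff],
                     [diff, same]])

# Example: two fairly similar 100-dimensional embeddings.
rng = np.random.default_rng(0)
e = rng.normal(size=100)
print(edge_potentials(e, e + 0.1 * rng.normal(size=100)))
```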
When messages are passed between two equal class labels, the minimum of t+ and S_C,n is used as the potential. When two opposite class labels interact, the maximum of t− and S_C,n is used.

5.3.5 Node potentials

Similar to the edge potentials, two different node potential initializations were tried as experiments to improve the final model performance. The following only applies to domain nodes, since all other node types are initialized as unknown or unlabeled in both experiments. The first approach was to set the potentials as seen in the list below:

• Benign: Benign = 0.99, Malicious = 0.01
• Unknown/Unlabeled: Benign = 0.5, Malicious = 0.5
• Malicious: Benign = 0.01, Malicious = 0.99

Each node has two potentials, one for benign belief and one for malicious. For domains labeled as benign, the initial potential is clearly set as benign; the opposite applies to malicious domains. For the domains without any label, the ones to be predicted, the potential is set to 0.5 so as not to favor either of the possible predictions.

Another, less general, approach was also used when initializing the unlabeled domain nodes. Here, the initial node potential was shifted according to the prediction of the LSTM model (see below). The shift was performed using equation (5.5). The scaling reduces the value range to between 0.25 and 0.75 to avoid favouring either class label too much.

p_s(i) = 0.5 + (p(i) − 0.5) / 2    (5.5)

An illustration of the LSTM model used to generate the predictions can be seen in Figure 5.7. The model takes a URL as input and outputs a probability value, since the final layer is a fully connected layer with sigmoid as the activation function. Before being passed into the model, the URLs were padded to equal lengths. At the second layer, the strings are embedded into vector representations. Once embedded, the input reaches the three stacked bidirectional LSTM layers, which form the core of the model. The last LSTM layer returns a one-dimensional vector, which the output layer uses to form a prediction. The architecture of the network is heavily influenced by a project posted by "jackaduma" on GitHub⁵, where it reached an accuracy of 99% in predicting phishing URLs.

⁵ https://github.com/jackaduma/NLP4CyberSecurity

Figure 5.7: Sequential bidirectional LSTM model displayed with input sizes for each layer.

The scaled prediction value, p_s(i), obtained from the model is used in equation (5.6) to set both the benign and the malicious potential. The larger p_s(i) is, the larger the predicted maliciousness of the domain.

φ_i = (Benign: 1 − p_s(i), Malicious: p_s(i))    (5.6)

5.3.6 Handling underflow for high degree nodes

The standard equations for BP can easily cause numerical issues for nodes with a high degree. Researchers often handle this by removing nodes whose degree is too large, for example by excluding tokens that are too common. We went with a slightly different approach: we still removed the most common tokens in order to avoid cluttering the graph, but we kept some high-degree nodes, both common tokens and some IP addresses. This caused numerical issues even though we use logarithmic values, simply because the messages grew too negative for standard message-passing BP. We therefore decided to average the incoming messages when performing the message passing, as described by equation (5.7).
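To make this concrete, the following is a minimal toy sketch of the averaged log-domain update of equation (5.7) below, together with the belief of equation (5.8). The graph, potentials, and message values are hypothetical, and message normalization and the convergence loop are omitted.

```python
import numpy as np

# Two labels (0 = benign, 1 = malicious). phi[i] are log node potentials,
# psi[(i, j)] are 2x2 log edge potentials, msg[(n, i)] are log messages.
LABELS = (0, 1)

def send_message(i, j, phi, psi, msg, neighbors):
    """Averaged log-domain message m_{i->j}, following equation (5.7)."""
    others = [n for n in neighbors[i] if n != j]
    m = np.empty(2)
    for xj in LABELS:
        candidates = []
        for xi in LABELS:
            incoming = sum(msg[(n, i)][xi] for n in others)
            avg = incoming / len(others) if others else 0.0
            candidates.append(phi[i][xi] + psi[(i, j)][xi, xj] + avg)
        m[xj] = min(candidates)
    return m

def belief(i, phi, msg, neighbors):
    """Standard log-domain belief, following equation (5.8)."""
    return phi[i] + sum(msg[(j, i)] for j in neighbors[i])

# One step on a tiny path graph: token (0) -- domain (1) -- IP (2).
neighbors = {0: [1], 1: [0, 2], 2: [1]}
phi = {i: np.log([0.5, 0.5]) for i in neighbors}     # all nodes unknown
psi = {e: np.log([[0.8, 0.2], [0.2, 0.8]]) for e in
       [(0, 1), (1, 0), (1, 2), (2, 1)]}
msg = {(a, b): np.zeros(2) for a in neighbors for b in neighbors[a]}
msg[(0, 1)] = send_message(0, 1, phi, psi, msg, neighbors)
print(belief(1, phi, msg, neighbors))
```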
When inferring the final belief, however, we used the standard formula, equation (5.8). This worked well numerically, since the final belief was only inferred for domains, which typically do not have a very high degree.

m_{i→j}(x_j) = min_{x_i ∈ X} [ φ_i(x_i) + ψ_ij(x_i, x_j) + (1 / |N(i) \ j|) Σ_{n ∈ N(i)\j} m_{n→i}(x_i) ]    (5.7)

b_i(x_i) = φ_i(x_i) + Σ_{j ∈ N(i)} m_{j→i}(x_i)    (5.8)

5.3.7 Algorithm implementation

When implementing the algorithm, we made a few more design choices. First, the data is split: one set of domains keeps its benign/malicious label, while the remaining set of domains has its labels dropped, with the goal of predicting them. The initial potential of the unlabeled domains was set with the help of the BI-LSTM model, whereas the labeled domains had a fixed starting potential. Once the graph and all node and edge potentials had been set, the messages were initialized to the value 0, since logarithmic values were used (0 = log(1)). Moreover, messages are only passed to nodes that are unlabeled, meaning that domains that already have a label cannot change their belief. Secondly, all messages were normalized in each iteration for each label. The message passing was then repeated until the convergence conditions had been met. Finally, a belief was inferred for the previously unlabeled domains, predicting their probability of being either malicious or benign.

6 Results

The results chapter is divided into two main sections, one presenting the results from developing the RFC model and one presenting the results from several BP models.

6.1 RFC Results

As explained in section 5.2.1, two main steps were conducted when developing and fine-tuning the RFC model: feature elimination and parameter selection. Figure 6.1 displays the feature importance metrics for the initial model. The importance is on the x-axis and is measured as mean decrease in impurity; the better a feature manages to separate the class labels, the purer the data splits. As can be seen, the number of dots in the URL is clearly the most important feature, while most other features are fairly similar in importance. The IP risk score feature consists of only missing values and therefore has no importance.

Figure 6.1: Bar plot showing the importance of the used features, with mean decrease in impurity on the x-axis. Impurity decreases when a feature splits the class labels well.

By running the recursive feature elimination algorithm, three features were removed: IP risk score, number of countries, and language. Figure 6.1 shows that these were the features with the least importance. Worth noting is that mean decrease in impurity is not an optimal measurement when exposed to high cardinality features, i.e. features containing many unique values. The feature ranked third, who-is registrar, contains many unique values but still ranks well compared to the other features.

The next step was to select the optimal parameter configuration, as discussed in the method chapter, section 5.2.1. The result of the simulation can be seen in Figure 6.2, where the lines represent different tree depths and how the accuracy changes as more trees are added to the models. The figure shows that using either 24 or 48 as tree depth has no real impact on the accuracy of the models. The performance of those two models also seems to converge at around 75 to 100 trees, even though the accuracy overall changes very little.
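For reference, the parameter sweep just described could be reproduced along the lines of the sketch below. The data here is random placeholder data (the real model uses the features of section 5.1), and the tree-count grid is coarsened relative to the thesis' steps of five.

```python
from statistics import mean

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder features and labels; real ones come from the dataset of section 5.1.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 18))
y = rng.integers(0, 2, size=1000)

results = {}
for depth in (6, 12, 24, 48):
    for n_trees in range(10, 301, 50):  # steps of 50 here; the thesis used five
        scores = []
        for _ in range(5):  # five runs with new data splits, then averaged
            X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3)
            clf = RandomForestClassifier(n_estimators=n_trees, max_depth=depth)
            scores.append(clf.fit(X_tr, y_tr).score(X_te, y_te))
        results[(depth, n_trees)] = mean(scores)

best = max(results, key=results.get)
print("best (depth, trees):", best, "accuracy:", results[best])
```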
To continue, we selected the parameters displayed in the list below.

• Tree depth: 24
• Number of trees: 100

Using 24 as tree depth instead of 48 helps reduce model complexity without impacting model performance. The selection of the number of trees had less impact on final performance; moving forward, 100 trees are used.

Figure 6.2: Simulation of random forest classifiers with varying tree depth and number of trees. For each step, five models were evaluated with newly generated data splits, and the results shown are the average across those five simulations.

The performance of the final model is slightly improved compared to the simulation and reaches an accuracy of 95%, as can be seen in Table 6.1. The support column presents the number of samples from each class label in the test set; since the dataset contains more benign domains than malicious ones, the test set is skewed a bit towards benign domains. The model's precision for finding malicious domains is high, at 98%, which means it rarely predicts a benign domain as malicious.

            Precision  Recall  F1-Score  Support
Benign      0.93       0.99    0.96      18332
Malicious   0.98       0.88    0.92      11438

Table 6.1: Results for the final RFC model, which achieves an accuracy of 95%. The other metrics are shown for both class labels, where support represents the number of samples of each class in the test set.

The model performance can also be seen in Figure 6.3, which shows the receiver operating characteristic (ROC) curve and the area underneath it (AUC). The ROC curve, the dark blue line, demonstrates the trade-off between accurately predicted malicious domains and falsely predicted malicious domains. Studying the ROC curve shows how well the model manages to discriminate between class labels when exposed to a large range of classification thresholds.

The AUC score quantifies the overall performance of the model; it represents how well the model manages to distinguish between class labels without taking any specific classification threshold into account. The AUC score of 0.98 suggests that the classifier has great discriminative capabilities, with a strong ability to differentiate between benign and malicious instances.

Figure 6.3: ROC curve for the final model with true positive rate on the y-axis and false positive rate on the x-axis. The AUC score is 0.98, which indicates well-separated classes.

As a final experiment, some features were removed in order to make the feature set of this model similar to the features available to our BP algorithm. The following features were included in the test:

1. URL: Included in BP
2. IP: Included in BP
3. Apex domain: Resembles tokens used in BP
4. Top level domain: Resembles tokens used in BP
5. AS number: Included in BP

Using only the features listed above and the same parameter settings as described previously, the performance was slightly affected, leading to an accuracy of 93% with a feature set more similar to the one later used by the BP solution.

6.2 BP Results

As a starting point, our graph consisted of three node types: domains, tokens, and IPs. The following results show how the accuracy of our model improves as more nodes and edges are added. The first results make use of the Epsilon model described in the method section. The graphs with the best accuracy and percentage of predictions are then evaluated with the other models, Word2Vec and Node2Vec.

1. Graph 1 (G1): Contains three node types: domains, tokens, and IP addresses. This graph functions as our most basic approach.
2. Graph 2 (G2): Built upon G1 with the addition of CIDR blocks for IP addresses, connecting each IP to its corresponding CIDR block. This allows risk to propagate further and increases the connectivity of the graph.
3. Graph 3 (G3): Constructed similarly to G2, but instead of using CIDR blocks to connect IP addresses, this graph uses the IP's ASN.
4. Graph 4 (G4): Built upon G1, with the addition of Levenshtein neighbors. This connects domains to other domains with similar URL strings.
5. Graph 5 (G5): A combination of G2 and G3; the difference is that the CIDR blocks resolve to an ASN in the order IP-CIDR-ASN.
6. Graph 6 (G6): A combination of G5 and G4, meaning it uses the relation IP-CIDR-ASN as well as the Levenshtein domain-domain edges.

In Table 6.2, we present some of our key results; the highest accuracy, as well as the highest percentage of predictions, is highlighted. Important to consider here is that "Acc1" is the accuracy of the model when counting domains that did not receive a prediction as incorrect classifications, whereas "Acc2" is the accuracy when excluding the domains that did not receive a prediction. A domain does not always receive a prediction, since it may not be connected to a component capable of spreading risk; furthermore, a domain does not receive a prediction if its final belief is 0.5 malicious and 0.5 benign.

Table 6.2: Evaluation results for the different graphs using the Epsilon model.

Graph  False Malicious  False Benign  Predicted (%)  Acc1  Acc2
G1     403              2201          86.46          0.78  0.90
G2     363              2546          91.01          0.81  0.89
G3     233              3130          98.53          0.87  0.89
G4     1880             5955          88.64          0.62  0.70
G5     277              2997          98.60          0.88  0.89
G6     1328             6672          98.87          0.72  0.73

Given these results, we went on to further evaluate G5 and G6, replacing the Epsilon model with Word2Vec and Node2Vec. The results for G5 are presented in Table 6.3 and the results for G6 in Table 6.4.

Table 6.3: Evaluation results for G5 using three different models.

Model     False Malicious  False Benign  Predicted (%)  Acc1  Acc2
Epsilon   277              2997          98.60          0.88  0.89
Word2Vec  306              3207          97.65          0.86  0.88
Node2Vec  296              3039          98.62          0.87  0.89

Table 6.4: Evaluation results for G6 using three different models.

Model     False Malicious  False Benign  Predicted (%)  Acc1  Acc2
Epsilon   1328             6672          98.87          0.72  0.73
Word2Vec  312              3023          98.08          0.87  0.89
Node2Vec  280              2922          98.89          0.88  0.89

6.2.1 LSTM model

To be able to consistently predict all domains, we incorporated an LSTM model to push the initial potentials of the unlabeled domains, as described in the method section. The model was trained on a sub-sample of domains: 20 000 benign and 20 000 malicious. The accuracy and loss during training of the LSTM model are shown in Figure 6.4, where the blue lines represent the training data and the orange lines the validation data. The training was carried out for ten epochs and the loss remained quite high. The validation accuracy starts and ends at about 90%; however, when evaluated against the entire set of URLs, the accuracy is approximately 92%. The figure also shows that the validation accuracy is consistently slightly higher than the training accuracy. This is most likely caused by random variability and the fact that the domains are quite similar to one another.
However, for future work this would require further evaluation, as there could be other causes, such as overfitting or data leakage.

Figure 6.4: Results from training the LSTM model for ten epochs, where the blue lines represent the results on the training set and the orange lines the validation set. The left figure displays the accuracy and the right figure the loss.

In the following results, the prediction rate is 100% and is therefore excluded from the tables. Furthermore, since "Acc1" and "Acc2" are now equal, only "Acc1" is presented. The results for graph G5 are presented in Table 6.5 and the results for G6 in Table 6.6.

Table 6.5: Evaluation results for G5 using three different models assisted by an LSTM model.

Model     False Malicious  False Benign  Accuracy
Epsilon   18               3033          0.90
Word2Vec  14               2882          0.90
Node2Vec  16               2921          0.90

Table 6.6: Evaluation results for G6 using three different models assisted by an LSTM model.

Model     False Malicious  False Benign  Accuracy
Epsilon   576              6099          0.78
Word2Vec  14               2834          0.90
Node2Vec  13               2853          0.90

6.3 Final model

Studying the results in Table 6.5 and Table 6.6, the performance is very similar between graph constructions. The added URL model also had a small impact on the overall accuracy when compared with the results in Table 6.3 and Table 6.4. The final model was selected based on accuracy, prediction rate, and false positive rate. The selected model therefore ended up being the Node2Vec solution with the LSTM model, using the G6 graph. An illustration of the final model can be seen in Figure 6.5, where "Build graph" corresponds to constructing the G6 graph, and the BP algorithm uses Node2Vec and the URL model as it is initialized.

Figure 6.5: Illustration of the selected final model. The data is used to construct the G6 graph; the BP algorithm then takes advantage of the Node2Vec embeddings and URL model predictions to predict the unlabeled domains.

The model demonstrates a good ability to differentiate between benign and malicious samples, as can be seen in Figure 6.6. The figure shows the ROC curve and the AUC. The AUC score obtained by the model is 0.96.

Figure 6.6: ROC curve for the final model with true positive rate on the y-axis and false positive rate on the x-axis. The AUC score is 0.96, which indicates well-separated classes.

The distribution of predictions can be seen in Figure 6.7, where the left figure shows the domains predicted benign and the right figure the domains predicted malicious. The red curves display the wrongly predicted domains, while the blue curves represent the accurately predicted domains. The figures show how well the model separates the falsely predicted domains, especially for malicious predictions, where the overlap is very limited.

Figure 6.7: Predictions for domains predicted benign, left, and domains predicted malicious, right. The x-axis corresponds to prediction belief and the y-axis to sample density. The red curves are wrongly predicted samples, while the blue curves represent accurate predictions.

7 Discussion

7.1 Dataset

The dataset used for this project may cause misleading results for our RFC model. As stated in the related work section, these models tend to generalize poorly. This is most likely a consequence of the fact that we are using domains that are clearly benign as well as domains that are clearly malicious, and nothing in between.
This may cause issues for real-time classification, since websites such as blogs or restaurant web pages may seem malicious to the RFC model. The reason is that small businesses may choose cheaper registration options while still wanting their website to look legitimate, similar to the behavior of attackers.

The dataset naturally affects the BP as well, perhaps even more so due to its size. Our method of BP relies on every domain's neighborhood being included in the graph, which is usually not the case. Secondly, the dataset consists of more benign domains than malicious ones, and as can be seen in the BP results, far more domains are labeled false benign than false malicious. We believe that a larger graph is required for BP to reach its full potential. It is also important to balance benign and malicious domains; otherwise one label will most likely propagate more strongly and diminish the other.

7.2 RFC

Based on our results, we believe that RFC models can definitely be a useful tool for detecting phishing domains, especially since they are easy to implement and can directly be used to classify new domains. The biggest flaw of this model type is that its results depend heavily on the dataset. We believe a good approach is to spend time on the dataset the model will be trained on. Using domains from Tranco and PhishTank gives decent results, but in order to generalize better we believe a more diverse dataset is required, including some low-traffic domains such as blogs, restaurant websites, family websites, etc. A more diverse dataset would help reduce bias and make the model more robust for real-world applications.

The promising results indicate that the RFC model manages to find good feature splits and separate benign from malicious samples. Even when only using the features similar to the ones used by BP, the accuracy remains at 93%, despite the most important features from the previous models being removed.

7.3 Belief propagation

The results of our BP definitely show potential for classifying domains. The difficulty with this model lies in how to properly structure the graph, and propagating risk via IPs and tokens is not always correct. Many IP addresses can, for example, host hundreds of thousands of domains and contain just a few phishing domains, with the majority of the domains being unknown. In situations like this, the IP does not necessarily tell us much, and propagating risk with a fixed edge potential may cause issues. By extending the standard BP with models like Word2Vec and Node2Vec, this type of false propagation can be mitigated. As of now, these models were only used to weaken the connection between two nodes; however, one could also argue that if the link is too weak it should be removed altogether.

Another difficulty lies in how to use models like this to classify new domains. One option is to build an entire graph based on a single domain and then classify it, but this does not guarantee a prediction. Alternatively, one could construct a large graph with hundreds of millions of domains, and each time a new domain needs to be classified it can simply be added to the graph. Moreover, this model is not tested in the same way as an RFC model.
The graph and all its parameters were tuned to give good results, but to really see whether the results generalize, one should probably divide the dataset into a number of smaller datasets and verify that the results hold for all graphs and not just a fine-tuned one. Finally, to guarantee some sort of prediction for all domains, we believe this model performs best when combined with a second model, such as an LSTM model.

7.3.1 Final model

In the selection phase, the optimal solution for our objectives was chosen. Several approaches have very similar performance: Node2Vec versus Word2Vec, and with or without the URL model. The final selection, as described in Section 6.3, showed the highest accuracy while also minimizing the false malicious predictions.

Minimizing the falsely predicted malicious domains is of high importance to avoid an overflow of possibly malicious domains: mainly to avoid raising suspicion towards benign domains with honest objectives, but also to minimize the investigative work when looking into malicious predictions. The predictions in Figure 6.7 show a small overlap for malicious predictions, which indicates these numbers could be further improved by demanding a higher certainty before a domain is labeled malicious. Overall, the false predictions are centered around 0.5, and an improvement to avoid uncertain predictions could be to label these as unknown until a higher degree of certainty is reached.

One possible problem when adding the URL model to initialize the node potentials is consequential errors from relying on an additional model. New data would be required to determine these effects; however, as of now, the results indicate a positive impact from adding the LSTM model.

7.4 Comparison of belief propagation and RFC

Based on the results as well as our methods, we believe that each model has its own place in the world of detecting malicious domains. BP requires more work than an RFC model, as well as a solid framework for tests and experiments to find something that works in the wild, with regard to both computational requirements and results. An RFC model, however, is easy to implement and use, making it a more suitable model for smaller projects. When using RFC with only features similar to the ones found in our BP implementation, the results are very similar, both with an accuracy of around 90%.

Moreover, once an RFC model has been trained, it requires very little computational power to use, whereas BP may need to continuously host and update a large graph. In terms of scalability, the drawback of the RFC model we used is that it uses many features that are expensive to retrieve, limiting the number of classifications that can be done in one day. The data used in our BP is easier to get hold of, but discussing its scalability is difficult as there are several options for how to use the model as a product.

The BP solution, however, offers more interpretability into why domains received their predictions, since a prediction is based only on the neighbouring nodes. As the size of the dataset increases, the graph becomes harder to interpret, but studying smaller sub-graphs still allows us to understand limited areas. The BP solution is also expected to handle concept drift and new attack patterns better, since a far smaller set of features is used.
Both models, like many others, depend heavily on the quality of the dataset; we therefore believe that the top priority for similar projects should be to create a high-quality, diverse dataset.

7.5 Ethical

In this area, there are a few ethical aspects to consider. First of all, it may be harmful to a business if its website is incorrectly classified as malicious. Therefore, all public classifications should maintain a high quality as well as some explanation behind them. Secondly, integrity issues may arise with the data that is collected. It is therefore important to respect the laws that regulate public information, such as the GDPR. Thankfully, Recorded Future does its very best to ensure that no laws are broken when collecting public data.

8 Conclusion

First and foremost, it is of paramount importance to find or construct high-quality data when building a model to detect phishing domains. Secondly, many models have proven successful in this area; with that in mind, we recommend using models with some form of explainability, as it can assist developers searching for flaws and potential improvements. Thirdly, it is important to fully test a model's ability to generalize before building a product out of it. This can, for example, be done by testing the model on some low-traffic domains and analysing the results. Finally, it is important to recognize the flaws of a chosen model, as well as to constantly update it as hackers develop new evasion techniques.

9 Future Work

For future experiments, we would create one more model using the GNN/GCN framework, as it provides a solid foundation for embedding more features into a graph-based model. Furthermore, we would spend more time creating datasets and testing the graph model over a few different datasets to make sure it is not overfitted. We would also like to evaluate the Levenshtein connection further by testing different thresholds. There is also a modified Levenshtein algorithm that includes an operation swapping the positions of two adjacent characters in a string, which would be interesting to implement. Finally, we would like to run an experiment comparing our models' predictions with those currently in place at Recorded Future, to see how models like these may improve existing classifications.

Bibliography

[1] Violino. (2023) Phishing attacks are increasing and getting more sophisticated. Here's how to avoid them. [Online]. Available: https://www.cnbc.com/2023/01/07/phishing-attacks-are-increasing-and-getting-more-sophisticated.html
[2] Q. Wang, L. Li, B. Jiang, Z. Lu, J. Liu, and S. Jian, "Malicious domain detection based on k-means and smote," in Computational Science – ICCS 2020, V. V. Krzhizhanovskaya, G. Závodszky, M. H. Lees, J. J. Dongarra, P. M. A. Sloot, S. Brissos, and J. Teixeira, Eds. Cham: Springer International Publishing, 2020, pp. 468–481.
[3] Najafi et al., "Malrank: A measure of maliciousness in siem-based knowledge graphs," in Proceedings of the 35th Annual Computer Security Applications Conference, ser. ACSAC '19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 417–429. [Online]. Available: https://doi.org/10.1145/3359789.3359791
[4] Wired. (2023) Brace yourself for a tidal wave of chatgpt email scams. [Online]. Available: https://www.wired.com/story/large-language-model-phishing-scams
[5] EC-Council. (2022) The cyber kill chain: The seven steps of a cyberattack. [Online]. Available: https://www.eccouncil.org/cybersecurity-exchange/threat-intelligence/cyber-kill-chain-seven-steps-cyberattack
[6] Bilge et al., "Exposure: A passive dns analysis service to detect and report malicious domains," ACM Trans. Inf. Syst. Secur., vol. 16, no. 4, apr 2014. [Online]. Available: https://doi.org/10.1145/2584679
[7] A. Aljofey, Q. Jiang, A. Rasool, H. Chen, W. Liu, Q. Qu, and Y. Wang, "An effective detection approach for phishing websites using url and html features," Scientific Reports, vol. 12, no. 1, p. 8842, May 2022. [Online]. Available: https://doi.org/10.1038/s41598-022-10841-5
[8] P. Maneriker, J. W. Stokes, E. G. Lazo, D. Carutasu, F. Tajaddodianfar, and A. Gururajan, "Urltran: Improving phishing url detection using transformers," in MILCOM 2021 - 2021 IEEE Military Communications Conference (MILCOM), 2021, pp. 197–204. [Online]. Available: https://doi.org/10.1109/MILCOM52596.2021.9653028
[9] A. J. Ferrante, "The impact of gdpr on whois: Implications for businesses facing cybercrime," 2018.
[10] Kaspersky. What is an ssl certificate – definition and explanation. [Online]. Available: https://www.kaspersky.com/resource-center/definitions/what-is-a-ssl-certificate
[11] L. Tang and Q. Mahmoud, "A survey of machine learning-based solutions for phishing website detection," Machine Learning and Knowledge Extraction, vol. 3, pp. 672–694, 8 2021. [Online]. Available: https://doi.org/10.3390/make3030034
[12] S. A. Khan, W. Khan, and A. Hussain, "Phishing attacks and websites classification using machine learning and multiple datasets (a comparative analysis)," in Intelligent Computing Methodologies. Cham, Switzerland: Springer International Publishing, 2020, pp. 301–313. [Online]. Available: https://arxiv.org/pdf/2101.02552.pdf
[13] Y. Tsai, C. Liow, Y. S. Siang, and S.-D. Lin, "Toward more generalized malicious url detection models," 2022. [Online]. Available: https://arxiv.org/pdf/2202.10027.pdf
[14] M. Sánchez-Paniagua, E. Fidalgo, E. Alegre, A.-N. M. Wesam, and V. González-Castro, "Phishing url detection: A real-case scenario through login urls," IEEE Access, vol. 10, pp. 42 949–42 960, 2022. [Online]. Available: https://ieeexplore.ieee.org/ielx7/6287639/9668973/09759382.pdf
[15] W. Rweyemamu, T. Lauinger, C. Wilson, W. Robertson, and E. Kirda, "Getting under alexa's umbrella: Infiltration attacks against internet top domain lists," in Information Security: 22nd International Conference, sep 2019, pp. 255–276. [Online]. Available: https://wkr.io/publication/isc-2019-domains.pdf
[16] V. L. Pochat, T. van Goethem, S. Tajalizadehkhoob, M. Korczyński, and W. Joosen, "TRANCO: A research-oriented top sites ranking hardened against manipulation," in Proceedings 2019 Network and Distributed System Security Symposium, feb 2019. [Online]. Available: https://tranco-list.eu/assets/tranco-ndss19.pdf
[17] T. Kim, N. Park, J. Hong, and S.-W. Kim, "Phishing url detection: A network-based approach robust to evasion," 2022.
[18] A. Joshi, L. Lloyd, P. Westin, and S. Seethapathy, "Using lexical features for malicious URL detection - A machine learning approach," CoRR, vol. abs/1910.06277, 2019. [Online]. Available: https://arxiv.org/pdf/1910.06277.pdf
[19] C. Johnson, B. Khadka, R. B. Basnet, and T. Doleck, "Towards detecting and classifying malicious urls using deep learning," J. Wirel. Mob. Networks Ubiquitous Comput. Dependable Appl., vol. 11, no. 4, pp. 31–48, 2020. [Online]. Available: https://isyou.info/jowua/papers/jowua-v11n4-3.pdf
[20] H. Shirazi, B. Bezawada, and I. Ray, ""kn0w thy doma1n name": Unbiased phishing detection using domain name based features," in Proceedings of the 23rd ACM Symposium on Access Control Models and Technologies, ser. SACMAT '18. New York, NY, USA: Association for Computing Machinery, 2018, pp. 69–75. [Online]. Available: https://shorturl.at/dGJ28
[21] I. Torroledo, L. D. Camacho, and A. C. Bahnsen, "Hunting malicious tls certificates with deep neural networks," in Proceedings of the 11th ACM Workshop on Artificial Intelligence and Security. New York, NY, USA: Association for Computing Machinery, 2018, pp. 64–73.
[22] L. Bilge, S. Sen, D. Balzarotti, E. Kirda, and C. Kruegel, "Exposure: A passive DNS analysis service to detect and report malicious domains," ACM