Anti-Money Laundering with Unreliable Labels

Master's thesis in Electrical Engineering

Jesper Bergquist

Department of Electrical Engineering
Chalmers University of Technology
Gothenburg, Sweden 2024
www.chalmers.se

© Jesper Bergquist, 2024.

Supervisors: Johan Östman, AI Sweden & Edvin Callisen, AI Sweden
Examiner: Alexandre Graell i Amat, Department of Electrical Engineering

Master's Thesis 2024
Department of Electrical Engineering
Chalmers University of Technology
SE-412 96 Gothenburg
Telephone +46 31 772 1000

Cover: Illustration of a transaction network as a graph.
Typeset in LaTeX, template by Kyriaki Antoniadou-Plytaria
Printed by Chalmers Reproservice
Gothenburg, Sweden 2024

Abstract

This report examines the effectiveness of graph neural networks (GNNs) in detecting money laundering activities using transaction data with unreliable labels. It explores how weakly supervised learning, specifically with GNNs, manages the challenges posed by missing and inaccurate labels in anti-money laundering (AML) systems. The study utilizes simulated transaction datasets to compare the performance of GNNs against traditional statistical models. The findings indicate that GNNs, due to their ability to process relational data structures, demonstrate superior adaptability and accuracy in scenarios with label deficiencies. This research provides effective strategies for enhancing anti-money laundering systems by employing GNNs to manage data challenges more effectively.

Keywords: GNN, AML, money laundering, machine learning, deep learning, graph neural networks.

Acknowledgements

I would like to extend my heartfelt gratitude to my supervisors, Johan Östman and Edvin Callisen from AI Sweden, and Anton Chen from Handelsbanken, for their invaluable feedback, guidance, and support throughout the course of this thesis. Additionally, I would like to thank all the team members at AI Sweden and Handelsbanken who contributed their time, insights, and assistance, making this work possible. Your collective efforts and encouragement have been instrumental in the successful completion of the project.

Jesper Bergquist, Gothenburg, June 2024

Contents

1 Introduction
2 Background
  2.1 Money laundering
  2.2 Current AML practices
  2.3 AMLSim
  2.4 Statistical machine learning models
  2.5 Graphs
  2.6 Graph neural networks
    2.6.1 Graph Convolutional Networks
    2.6.2 GraphSAGE
    2.6.3 Graph attention network
  2.7 Weakly supervised learning
  2.8 Weakly supervised learning in AML
  2.9 Different perspectives in AML
3 Methods
  3.0.1 Data generation and preparation
  3.0.2 Missing labels
  3.0.3 Inaccurate labeling
    3.0.3.1 Class
    3.0.3.2 Topology
    3.0.3.3 Neighbour
  3.0.4 Statistical machine learning models
    3.0.4.1 Model implementation
    3.0.4.2 Data loading
    3.0.4.3 Optimization
    3.0.4.4 Evaluation
    3.0.4.5 Handling incomplete labels
  3.0.5 GNNs
    3.0.5.1 Model implementation
    3.0.5.2 Data loading
    3.0.5.3 Optimization
    3.0.5.4 Evaluation
    3.0.5.5 Handling incomplete labels
4 Results
  4.1 The finished datasets
  4.2 Comparison of known and missing labels
  4.3 Model performance for incorrect labels
    4.3.1 EASY with 10% flipped labels
    4.3.2 EASY with 25% flipped labels
    4.3.3 Summary for EASY
    4.3.4 MID with 10% flipped labels
    4.3.5 MID with 25% flipped labels
    4.3.6 Summary for MID
    4.3.7 HARD with 10% flipped labels
    4.3.8 HARD with 25% flipped labels
    4.3.9 Summary for HARD
5 Discussion
  5.0.1 Implications of findings
  5.0.2 Train and test split
  5.0.3 Node classification vs edge classification
  5.0.4 Ethical considerations
  5.0.5 Future work
6 Conclusion
7 Contributions
Bibliography

1 Introduction

Money laundering is the process of disguising funds acquired from illicit activities to make them appear as though they have been obtained through legitimate means. This practice is a fundamental operation for organized crime groups, involving funds from illegal activities such as corruption, gambling, embezzlement, and drug trafficking. According to estimates from the United Nations in 2009, as much as $1.6 trillion is laundered annually, which equates to about 2 to 5% of global GDP [1]–[3]. Despite various anti-money laundering (AML) measures by banks and financial authorities, the scale of the problem is still large.
Close to 40% of all criminal networks in the EU are involved in the trade of illegal drugs, and more than 80% of the criminal networks use legal business structures, with two thirds partaking in corruption on a regular basis [4]–[6]. Current AML systems have demonstrated limited effectiveness in detecting illegal activities [7]–[9].

Using a data-driven approach in AML is a strategy that improves efficiency and scales effectively to handle large volumes of transactions, but it presents multiple challenges. The intricacy of laundering schemes, and their attempts to stay stealthy by behaving normally, often make them difficult to identify [10]. Additionally, banks generally only have access to data concerning their own customers' accounts and transactions, which provides a restricted perspective. The data available to banks may also be incomplete or contain errors, with potential inaccuracies in labeling due to lengthy investigations [11]. Achieving reliable labels for AML purposes is difficult: determining whether an account really is laundering money is hard, because banks rely on the financial authorities for feedback on whether a suspicious case was correct or not. These factors contribute to the obscurity surrounding money laundering, complicating detection efforts.

Since transaction networks can be represented as graphs, AML can be viewed as either a node or a link classification problem. Viewed as a link classification problem, the approach is to find specific illicit transactions in the network; viewed as a node classification problem, the approach is to find specific accounts partaking in illegal activities. The best approach depends on the situation, but node classification can be preferred because it is less computationally intensive, as there are fewer accounts than transactions in a financial institution. In this thesis, the focus is on classifying accounts; hence, AML is treated as a node classification problem. This choice is further motivated in Section 5.0.3.

One common approach is to use statistical machine learning models to try to capture the behavior of normal versus suspicious accounts [12]. As previously mentioned, a significant challenge in this domain is the limited number of labels, which can be treated as a weakly supervised problem. In weakly supervised settings, statistical models often perform poorly because, with few labeled data points, they are not able to capture the nature of the data [13].

Graph neural networks (GNNs) are particularly promising in the context of financial anomaly detection. Recent research highlights the potential of GNNs to enhance anomaly detection capabilities in financial networks [14]–[16]. GNNs excel at processing graph-based structures, enabling them to uncover complex relationships within data and facilitating the understanding of interactions within financial networks. This approach could significantly improve the detection of sophisticated money laundering schemes. Deploying GNNs on sparsely labeled and inaccurate data presents a weakly supervised scenario, where the models must learn from imprecise and incomplete information [17].

The main goal of this project is to evaluate how well different GNN models detect money laundering using simulated transaction data. We compare these results with those from statistical models to understand the strengths and weaknesses of GNNs.
A key part of this research is to study how data quality affects GNN performance. We create several datasets with specific errors, e.g., missing and incorrect labels, to see how these issues impact both GNNs and statistical models. This provides insights into how different data problems influence the accuracy and reliability of these models. By achieving these objectives, the research aims to improve our understanding of the practical challenges and limitations of current AML technologies.

To address this challenge, the study focuses on the following two main research questions:

1. How does the presence of incomplete labels affect the performance of statistical models and GNN models in detecting money laundering activities? This question aims to explore how each model type compensates for missing labels and the resulting effects.

2. What is the impact of inaccurate labels on the performance of GNN and statistical models? Specifically, how do incorrect labels in the dataset influence the performance and reliability of the models classifying money laundering accounts?

To the best of our knowledge, the specific impact of different types of transaction data noise on GNNs and statistical models in a weakly supervised setting has not been thoroughly investigated. The insights gained could help tailor GNNs to be more reliable and effective for practical use in detecting financial crimes. The findings from this research could provide a foundation for future studies and help improve regulatory practices and financial security measures, leading to stronger defenses against financial crimes. Specifically, our work demonstrates the usability of GNNs in a weakly supervised scenario, showing their potential effectiveness in detecting complex patterns indicative of money laundering activities. This could significantly influence the adoption of GNNs for AML purposes within the financial sector.

By demonstrating GNNs' capability to operate efficiently with limited labeled data, our research encourages further development and integration of these models into real-world AML systems. As financial institutions and regulatory bodies begin to implement GNN-based approaches, we anticipate a substantial improvement in the accuracy and reliability of money laundering detection. This enhancement could lead to more sophisticated monitoring systems that can uncover previously undetected laundering schemes, thereby increasing the amount of illicit activity that is identified and prevented.

2 Background

2.1 Money laundering

Money laundering is an illegal activity that impacts the global financial system by disguising illegally obtained funds as legitimate income. This process enables criminals to use their illegal profits in the legal system without being prosecuted and compromises the integrity of financial institutions, leading to social and economic consequences [2], [3].

The process of money laundering typically consists of three stages [18]. The first stage is placement, where illicit funds are introduced into the legitimate financial system. This may involve depositing large sums of cash into bank accounts or purchasing high-value assets such as real estate or luxury goods. The second stage is layering, where the launderer executes a series of complex transactions to obscure the origins of the illicit funds. This often includes transferring money between various accounts.
The final stage is integration, wherein the laundered money is reintroduced into the economy, now appearing legitimate.

Globally, money laundering poses a significant threat by facilitating other serious crimes, such as drug trafficking and terrorism. The United Nations Office on Drugs and Crime has reported estimates that up to 5 percent of global GDP is laundered each year, highlighting the vast scale and impact of this activity [3]. Effective measures against money laundering are vital for maintaining financial market integrity and ensuring national and global security. These measures include strict regulatory frameworks, thorough monitoring of financial transactions, and international cooperation [19].

2.2 Current AML practices

Traditional AML approaches have relied on rule-based systems [20]. These systems use predefined rules and thresholds to identify potentially suspicious activity. While somewhat effective, these methods can be rigid and may miss complex laundering schemes. These systems result in a false positive rate (FPR) of about 95–98% [21].

To improve detection, research has shifted towards learning-based methods, employing classical machine learning techniques such as logistic regression and support vector machines. These techniques analyze patterns and assess the likelihood of transactions being associated with money laundering [12]. However, they struggle with the complexity of money laundering tactics. In response to these limitations, there has been a shift towards employing advanced deep learning models [22]. With the vast amounts of transactional data available, these models can learn from a broader set of behavioral patterns, where accounts are seen as dependent on their neighbours, providing a more effective tool for detecting laundering activities.

2.3 AMLSim

Machine learning models depend on the quality of the data used to train them. Therefore, it is crucial that the training data is of high quality and represents a broad spectrum of the target domain. This includes ensuring the data is accurate, complete, consistent, and diverse to avoid biases and ensure the model's robustness and generalizability across different scenarios. There is a shortage of such publicly available datasets in the field of AML [23].

One commonly used dataset for developing money-laundering detection methods is the Elliptic Data Set [24]. This dataset contains 203 769 nodes and 234 355 edges over a span of 98 weeks. Each node represents a bitcoin transaction and each edge represents a flow of bitcoins from one transaction to another; 2% of the nodes are labelled as money laundering and 21% are labelled as normal [25]. This dataset has several shortcomings. Firstly, restricting the data to bitcoin transactions may not fully capture the complexity of real-world money laundering activities. Additionally, the dataset has a large class imbalance, with only a small portion of the labelled nodes being illicit, which could introduce biases when training machine learning models. Also, since the dataset only contains transactions, the ability to capture the behavior of specific accounts is limited.

Instead, using a simulation tool gives full control over the properties of the data. In this thesis, simulated data is generated by deploying a version of AMLSim, a simulation platform developed by the MIT-IBM Watson AI Lab to aid in the fight against money laundering [26].
The simulation platform is designed to generate synthetic datasets of financial transactions for research and development in AML technologies. AMLSim has been further enhanced by a team at AI Sweden as part of a project named Federated Learning in Banking [27]. This project is a collaboration between AI Sweden, Handelsbanken, and Swedbank.

In AMLSim, each node within the simulation represents a bank account, and edges between these nodes symbolize financial transactions. This structure allows AMLSim to create complex graphs that mimic the intricate networks through which money might be laundered in the real world.

AMLSim works by first defining the spatial part of the transaction graph, forming predefined agents in graph topologies. These agents are then allowed to perform transactions over a discrete-time simulation, achieving the temporal aspect of the graph. The platform operates by employing a multi-agent approach, where each agent (or node) behaves as an individual bank account. These agents interact by transferring funds among themselves, creating a dynamic graph of transactions. Importantly, AMLSim includes the capability to simulate illicit behaviors by programming a subset of these agents to engage in activities typical of money laundering (see Figure 3.4), based on observed patterns from real cases [28].

AMLSim provides a controlled environment for generating transaction data, making it a valuable resource for researchers. It facilitates a way to generate data to enable the testing of new AML methods, particularly those incorporating deep learning and graph analytics, under various scenarios and conditions. The data produced by AMLSim can be utilized to train algorithms for improved identification and prediction of suspicious activities within extensive, complex financial networks.

2.4 Statistical machine learning models

Detecting suspicious activity in transaction graphs can be done with many different methods. One way of classifying money laundering is with statistical machine learning models. Several popular and widely used models include support vector machines (SVM), k-nearest neighbours (KNN), logistic regression, random forest, and XGBoost.

SVMs work by finding the optimal hyperplane that separates the two classes in the feature space. They are particularly useful with high-dimensional data [29].

KNN is an algorithm that classifies data based on the majority vote of the nearest neighbours in the feature space. KNN is particularly useful with non-linear data since it makes no assumption about the underlying distributions [30].

Logistic regression is a popular method for binary classification. It models the probability of an outcome based on the input variables, making it suitable for money laundering detection, where the goal is to estimate the probability of an account being suspicious. It is a simple and interpretable model, making its decisions easy to understand [31].

Random forest is an ensemble learning method that trains several decision trees in parallel and merges their outputs to improve performance. It is effective for money laundering detection due to its capability to handle large amounts of data and to find complex patterns [32].

XGBoost is another ensemble method that builds a model by sequentially optimizing decision trees with respect to the loss function. It excels at handling unbalanced datasets and is widely used in several fields [33]–[36].
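As a brief illustration, all five models are available through standard libraries. The snippet below is a minimal sketch using scikit-learn and the xgboost package; the synthetic data and hyperparameter values are placeholders for illustration only, not the configuration used in this thesis.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

# Toy stand-in for the node-feature table (the real features appear in Table 3.2).
X, y = make_classification(n_samples=1000, n_features=19, weights=[0.95],
                           random_state=0)

models = {
    "SVM": SVC(kernel="rbf", probability=True),       # optimal separating hyperplane
    "KNN": KNeighborsClassifier(n_neighbors=5),       # neighbour majority vote
    "LOG": LogisticRegression(max_iter=1000),         # interpretable probabilities
    "RF":  RandomForestClassifier(n_estimators=100),  # parallel tree ensemble
    "XGB": XGBClassifier(n_estimators=100),           # sequential boosted trees
}

for name, model in models.items():
    model.fit(X, y)  # in the thesis, fitting uses the engineered node features
```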
It is known from previous work that XGBoost is one of the best-performing methods for detecting money laundering in graph transaction data [15], [37]. This is something that will be challenged by graph neural networks in this thesis.

2.5 Graphs

A graph is composed of nodes (also called vertices) and edges (also called links) that connect pairs of nodes [38]. An undirected graph is defined by a set of nodes V = {v_1, ..., v_N} and a set of edges E = {e_1, ..., e_M}, where an edge e_i = {v, u} connects nodes v and u in V. This implies that if there is an edge between node v and node u, traversal is possible in both directions. Formally, an undirected graph G is represented as G = (V, E), where V is the set of nodes and E is the set of edges, with each edge being an unordered pair of nodes {u, v}. In contrast, a directed graph (or digraph) includes edges with specific directions, where an edge from node v to node u is represented as an ordered pair (v, u), indicating the direction from v to u [39].

In the context of this thesis, the nodes represent bank accounts, and an edge between two nodes represents a transaction between these entities. Furthermore, each node is associated with a node feature vector x_i ∈ R^n and a label; the features and labels for this project are further explained in Table 3.2. The graph structure refers to the sets V and E, while the term graph includes both the graph structure and the node feature vectors X. The neighbors of a node v are defined as the set N(v) = {u | {v, u} ∈ E}, which includes all nodes u connected to v by an edge. An example of an undirected graph with nodes A, B, C, and D with feature vectors can be seen in Figure 2.1.

Figure 2.1: An undirected graph with nodes A, B, C, and D. Node A is connected to all other nodes, while nodes B, C, and D are only connected to node A. The squares represent feature vectors associated with each node.

2.6 Graph neural networks

Current AML systems are usually rule based and have had limited success in detecting illegal activities. Recent research suggests that GNNs could enhance AML [14]–[16]. GNNs can process graph structures and reveal underlying complex relationships, and since transaction networks can be represented as graphs, GNNs show promise not only in considering transactions in isolation, but also in identifying patterns and anomalies through the broader context of interactions in financial networks. This makes GNNs a promising candidate for uncovering complex money laundering schemes and further reducing the illegal practice.

GNNs are sophisticated models that harness the inherent structure of graph data to perform various tasks such as classifying nodes, edges, or entire networks. These models are particularly effective in domains where data relationships can be naturally represented as graphs, including social networks, molecular structures, and transaction networks [40].

A GNN iteratively updates node embeddings by integrating both their own features and the features of their neighboring nodes. This process is structured through layers, where each layer refines the node's representation by employing two key functions (a minimal code sketch of one such layer follows the list):

1. Aggregation function: collects and combines the feature vectors of a node's neighbors into a single vector that summarizes the local neighborhood information.

2. Updating function: takes the aggregated information and the node's current features to produce a new, updated embedding.
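The sketch below implements one round of mean aggregation and a simple update in plain Python to make the two functions concrete; the function and variable names are illustrative, not taken from any particular GNN library.

```python
import numpy as np

def message_passing_layer(H, neighbors, W_self, W_neigh):
    """One illustrative message-passing layer.

    H:         (N, d) array of current node embeddings.
    neighbors: dict mapping node index -> list of neighbor indices.
    W_self, W_neigh: weight matrices of shape (d, d_out).
    """
    H_new = np.zeros((H.shape[0], W_self.shape[1]))
    for v in range(H.shape[0]):
        # Aggregation: summarize the neighborhood with a feature-wise mean.
        if neighbors[v]:
            m_v = np.mean(H[neighbors[v]], axis=0)
        else:
            m_v = np.zeros(H.shape[1])
        # Update: combine the node's own features with the aggregated message.
        H_new[v] = np.maximum(0.0, H[v] @ W_self + m_v @ W_neigh)  # ReLU
    return H_new
```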
Mathematically, the operation of a GNN across its layers can be encapsulated as follows:

\[
h_v^{(k)} = \mathrm{Updating}^{(k-1)}\left(h_v^{(k-1)},\, m_{\mathcal{N}(v)}^{(k-1)}\right) \tag{2.1}
\]

\[
m_{\mathcal{N}(v)}^{(k-1)} = \mathrm{Aggregation}^{(k-1)}\left(\left\{h_u^{(k-1)} : \forall u \in \mathcal{N}(v)\right\}\right) \tag{2.2}
\]

Here, h_v^{(k)} represents the embedding of node v at the k-th layer, while m_{N(v)}^{(k-1)} denotes the aggregated message derived from the embeddings of v's neighbors at the previous layer. These functions are designed to be differentiable, enabling the use of gradient-based learning techniques to optimize the network parameters [41].

As illustrated in Figure 2.2, node A is updated by aggregating information from its neighboring nodes and from itself. This visualization shows how the features from nodes B, C, and D are combined with information from node A and used to refine the embedding of node A. This flexible framework allows GNNs to learn to encode both node and topological features into the embeddings, making them powerful tools for predictive and analytic tasks on graph-structured data.

Figure 2.2: A visualization of how the neighborhood information is aggregated and used to update the embedding of node A. In this diagram, circles represent nodes and squares represent their feature vectors. Node A (initially green) aggregates information from its neighboring nodes B (blue), C (orange), and D (pink), as well as from itself, to produce a new node representation for A (shown as a combination of the colors of its neighbors).

Different graph neural networks employ various aggregation and updating functions, tailored to the specific architecture and objectives of the GNN model being used. The three GNN models utilized in this project are the following:

2.6.1 Graph Convolutional Networks

Graph convolutional networks (GCNs) apply convolutional principles to graph-structured data, allowing each node in the graph to be represented by aggregating features from its immediate neighbors [42]. Developed by Thomas N. Kipf and Max Welling, GCNs use a layer-wise propagation rule based on the eigen-decomposition of graph Laplacians, simplifying the convolution operation in the spectral domain. By doing so, GCNs efficiently blend local node features with their neighborhood topology, resulting in a powerful representation that captures both individual and collective properties. This approach is particularly effective for datasets where the structural connections between entities are pivotal to their overall data representation. GCNs have become a promising model for various applications, including social network analysis, recommendation systems, and bioinformatics, where the intrinsic link between data points critically informs their analysis [43]–[45].

In the GCN used in this thesis, the node representations are updated using the ReLU activation function. The combined aggregation and update rule is given by

\[
h_v^{(l+1)} = \mathrm{ReLU}\left( \sum_{u \in \mathcal{N}(v) \cup \{v\}} \frac{1}{c_{vu}}\, h_u^{(l)} W^{(l)} \right), \tag{2.3}
\]

where ReLU is the activation function, N(v) represents the set of neighbors of node v, c_{vu} is a normalization constant (typically based on the degrees of nodes v and u), h_u^{(l)} denotes the feature vector of node u at layer l, and W^{(l)} is the weight matrix for layer l. This equation captures both the aggregation of features from the neighboring nodes and the node itself, as well as the subsequent update of the node's feature representation.
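Up to the choice of normalization, Equation (2.3) corresponds to the propagation rule of the GCNConv layer in PyTorch Geometric, the library used for the implementation in Section 3.0.5.1. A minimal two-layer node classifier could look as follows; the layer sizes are illustrative, not the tuned configuration from this thesis.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class GCN(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, num_classes):
        super().__init__()
        # Each GCNConv applies the normalized neighborhood aggregation of Eq. (2.3).
        self.conv1 = GCNConv(in_channels, hidden_channels)
        self.conv2 = GCNConv(hidden_channels, num_classes)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))  # aggregation + ReLU update
        return self.conv2(x, edge_index)       # class logits per node
```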
2.6.2 GraphSAGE

GraphSAGE is a framework designed for learning on large graphs, developed by William L. Hamilton, Rex Ying, and Jure Leskovec [46]. GraphSAGE employs a unique approach involving sampling and aggregation: it samples a fixed-size neighborhood and aggregates its features to update a node's representation. This allows GraphSAGE to efficiently handle dynamically growing graphs by incorporating new nodes and continuing to generate high-quality embeddings without needing access to the full graph.

The GraphSAGE model used in this thesis employs specific aggregation and updating functions, mathematically represented as

\[
h_v^{(k)} = \mathrm{ReLU}\left( W^{(k)} \cdot \mathrm{MEAN}\left( \{ h_v^{(k-1)} \} \cup \{ h_u^{(k-1)}, \forall u \in \mathcal{N}(v) \} \right) \right), \tag{2.4}
\]

where the MEAN function calculates the feature-wise mean of the node embedding vectors, ReLU is used as the activation function, and W^{(k)} represents a weight matrix specific to each layer k.

2.6.3 Graph attention network

The graph attention network (GAT) incorporates attention mechanisms into the graph neural network framework, allowing for more nuanced aggregation of neighbor features [47]. GAT computes the hidden representations of each node by attending over its neighbors, thus assigning different weights to different nodes in a neighborhood without any predefined pooling strategy. This attention-based approach enables the model to focus on more relevant information and dynamically diminish the less useful data, enhancing the model's adaptability to complex and noisy data environments. The GAT model utilized in this thesis is represented as

\[
h_v^{(k)} = \mathrm{ReLU}\left( \frac{1}{K} \sum_{k=1}^{K} \sum_{u \in \mathcal{N}(v)} \alpha_{vu}^{k} W^{k} h_u \right). \tag{2.5}
\]

The rectified linear unit (ReLU) activation function is employed in this thesis, K represents the number of attention heads, and α_{vu}^{k} are the normalized attention coefficients that indicate the significance of the information from node u to node v.

2.7 Weakly supervised learning

Weakly supervised learning is a machine learning paradigm in which the model is trained with imprecise, noisy, or incomplete labels [17]. Unlike fully supervised learning, where each training instance is associated with a precise output label, weakly supervised learning tackles scenarios where acquiring high-quality labels is either expensive or impractical.

Let us denote the input space by X and the output space by Y. A typical fully supervised learning model is trained using a dataset D = {(x_i, y_i)}_{i=1}^{n}, where x_i ∈ X and y_i ∈ Y. This thesis investigates incomplete supervision on its own, as well as the combination of incomplete and inaccurate supervision, which one expects to find in real-world data.

Incomplete supervision occurs when only a portion of the training data is labeled. In this scenario, the dataset can be represented as

\[
D = \{(x_i, y_i)\}_{i=1}^{n} \cup \{x_i\}_{i=n+1}^{n+m}. \tag{2.6}
\]

Here, only the first n instances have labels, and the remaining m instances are unlabeled. Each x_i ∈ X represents an input instance, and y_i ∈ Y represents the corresponding label.

Inaccurate supervision, on the other hand, involves training data that contains incorrect labels. This can be represented as

\[
D = \{(x_i, \tilde{y}_i)\}_{i=1}^{n}, \tag{2.7}
\]

where \tilde{y}_i may be an incorrect label. Here, x_i ∈ X represents an input instance, and \tilde{y}_i ∈ Y represents the potentially incorrect label.

2.8 Weakly supervised learning in AML

In real-world bank transaction data, obtaining reliable labels is extremely challenging.
Research conducted by Copenhagen Business School [21] indicates that only about 5% of all money-laundering attempts are detected and intercepted. Given this statistic, it is reasonable to assume that, when curating a graph dataset for anti-money laundering detection, a significant portion of node labels will be unknown. To the best of our knowledge, no research has been conducted on this assumption in the context of anti-money laundering. Labels reported back from the financial police are subject to the burden of proof and human error and may, hence, contain mistakes. Therefore, labeling only nodes known with absolute certainty could help ensure that new money-laundering strategies do not fly under the radar.

Compared to statistical machine learning models (see Section 2.4), GNNs have different capabilities for handling incomplete and inaccurate labels in graph data, giving GNNs an advantage.

Firstly, GNNs are able to capture the relational dependencies between nodes in a graph [48]. In the case of AML, this means understanding the intricate transaction topologies associated with suspicious activities. Statistical models, on the other hand, treat nodes as independent and identically distributed, which limits their ability to exploit the graph-structured data.

Secondly, GNNs can incorporate both node and edge features [48]. This enables GNNs to consider not only the properties of the accounts but also the nature of their interactions with other accounts. Statistical models are only able to extract node features from the edge information and, because of this, fail to capture the relations between accounts.

Moreover, GNNs' neighbour aggregation mechanism allows nodes with missing labels to still contribute valuable information [48]. By aggregating the features and labels of neighboring nodes, GNNs can propagate information across the network to fill in the information gaps left by accounts with an unknown status regarding money laundering. Statistical models have no way of mimicking this aggregation mechanism, putting them at a disadvantage in the field of AML, where missing labels are an intrinsic property.

2.9 Different perspectives in AML

In AML, regulatory bodies and financial institutions approach the problem from different angles, each with unique priorities and measures of success.

The Financial Action Task Force (FATF), along with other regulatory bodies, focuses on maximizing the detection of money laundering activities with high accuracy. Their success is measured by the thoroughness and effectiveness of their enforcement actions, emphasizing comprehensive suspicious activity reporting and rigorous customer due diligence [49].

Financial institutions prioritize regulatory compliance to avoid penalties and ensure smooth operations. Their objective is to detect money laundering efficiently while managing costs and minimizing disruptions. Success for banks is often measured by their ability to adhere to regulatory standards and maintain operational integrity, balancing accurate detection with the demands of daily operations [50].

This difference in focus can lead to varied approaches in AML practices. Regulatory bodies may push for more exhaustive investigative techniques to uncover as many illicit activities as possible, even if it imposes a higher compliance burden on financial institutions. Banks, on the other hand, might streamline their processes to ensure compliance without overwhelming their resources.
3 Methods

3.0.1 Data generation and preparation

Within this thesis, three different datasets were generated: an easy dataset with approximately 25% money laundering transactions, a medium dataset with approximately 5% money laundering transactions, and a difficult dataset with approximately 1% money laundering transactions. Older reports have estimated the rate of money laundering transactions to be very low, around 0.05% to 0.1% [51]. Since exact numbers are hard to determine, we used a range of values to simulate different scenarios. We chose these numbers to ensure that the class imbalance does not obscure our results too much, providing a clearer evaluation of our methods. These datasets were created using a simulation tool based on AMLSim, developed by IBM [26] and further enhanced by AI Sweden (see Section 2.3), to mimic the Swedish Swish transaction network [52]. This tool allows for the configuration of various spatial and temporal parameters for both the normal and the money laundering accounts.

As there is no public information available on how to choose the parameters for the synthetic data, one must qualitatively guess reasonable numbers. To this end, a reverse-engineering pipeline was devised, visually described in Figure 3.1, with the objective of finding parameters that yield a pre-determined false positive rate when the surveillance model is based on a decision tree. The data-generation procedure is outlined next.

Figure 3.1: Visualisation of the data generation pipeline.

Firstly, the spatial properties of the graph needed to be chosen. These parameters dictate how many accounts are present in the transaction network, how many accounts are part of normal patterns, and how many accounts are partaking in money laundering topologies, see Figure 3.4. There is also a degree distribution from which the in- and out-degrees of the accounts are drawn. With these spatial parameters we are able to choose the class balance of the datasets: increasing the number of money-laundering topologies increases the number of money-laundering accounts, and therefore changes the class imbalance.

For all three datasets, a total of 100 000 accounts was chosen. This size was manageable to handle while being sufficient to demonstrate a broad and complex transaction network. For the medium dataset, a class imbalance of 5:95 was chosen. The graph was built as a scale-free network, following a power-law degree distribution with a gamma of 2.0 [53]. The reasoning for modeling this as a scale-free network lies in its ability to more accurately represent real-world transaction networks, where a small number of nodes (accounts) tend to have a large number of connections (transactions), while the majority have relatively few.

Next, the temporal parameters for each of the classes in AMLSim need to be chosen. These parameters control the distributions of transaction properties from account to account, income to accounts, and outcome from accounts; they are listed in Table 3.1. Internal transactions in AMLSim mimic Swish transactions from one account in the network to another. Income in AMLSim is treated as transactions from a source node to an account in the transaction network, allowing money to flow into the network. Accounts have a probability of receiving income from the source node at each simulated time step.
Outcome in AMLSim is treated as accounts sending a transaction to a sink node in the transaction network. The sink node serves the purpose of money leaving the network, such as accounts spending money at businesses outside of the network. The probability of an outcome transaction to the sink node is controlled by the spending behavior. Spending behavior refers to the pattern where accounts are more inclined to make transactions to the sink node as their available balance increases. This tendency can be mathematically represented by a sigmoid function.

As aforementioned, real-life behavioral parameters are highly confidential and only known by banks and financial authorities. Because of this, these parameters are not readily available. To circumvent this issue, a reverse-engineered iterative process was devised to find temporal parameters that reflect the performance of practical AML systems. The paper by Copenhagen Business School [21] states that current rule-based systems exhibit a false positive rate (FPR) of about 95–98% on real data. Inspired by this, the temporal parameters were tuned towards a model that performs at a 95% FPR, as demonstrated in Figure 3.1.

Tuning the temporal parameters of the classes to be more similar results in data with a larger overlap in feature space, making the classes less distinguishable. Conversely, tuning the temporal parameters to be less similar creates data where the classes are easier to distinguish, as conceptually illustrated in Figure 3.2.

Figure 3.2: Visualisation of the two classes' overlap in feature space.

With the spatial parameters set and the temporal parameters tuned, the transaction log (TxLog) can be generated. The TxLog is generated from a discrete-time multi-agent simulation where, in each time step, each agent (account) gets the opportunity to perform transactions with other accounts or with the source/sink. All of the transactions are written into the log file, including initial balances, transaction sizes, account ids, etc. Hence, the TxLog describes a spatio-temporal directed graph with nodes being accounts and edges being transactions. The time-step interval is one day, and the total simulation time is set to one year.

Table 3.1: List of the temporal parameters used in AMLSim with their descriptions. Parameters (A) and (B) are shared between both classes. The remaining parameters are defined separately for each class (Normal and SAR).

Shared parameters
(A) min_amount               Global minimum transaction amount
(B) max_amount               Global maximum transaction amount

Class-specific parameters
A  mean_amount               Mean transaction amount
B  std_amount                Standard deviation of transaction amount
C  prob_income               Probability of income transactions
D  mean_income               Mean income amount
E  std_income                Standard deviation of income amount
F  mean_outcome              Mean outcome amount
G  std_outcome               Standard deviation of outcome amount
H  mean_phone_change_freq    Mean phone number change frequency
I  std_phone_change_freq     Standard deviation of phone number change frequency
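As a concrete illustration of the per-step agent dynamics described above, the sketch below draws income events as Bernoulli trials governed by prob_income, mean_income, and std_income from Table 3.1, and models the balance-dependent spending probability with a sigmoid. The sigmoid's midpoint and steepness, and the spend fraction, are hypothetical choices for illustration only, not AMLSim's actual parametrization.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def step_account(balance, prob_income, mean_income, std_income,
                 midpoint=1000.0, steepness=0.01):
    """One simulated day for a single account (illustrative only)."""
    # Income: with probability prob_income, receive money from the source node.
    if rng.random() < prob_income:
        balance += max(0.0, rng.normal(mean_income, std_income))
    # Spending behavior: the probability of an outcome transaction to the
    # sink node grows with the available balance, via a sigmoid.
    p_spend = 1.0 / (1.0 + np.exp(-steepness * (balance - midpoint)))
    if rng.random() < p_spend:
        balance -= 0.5 * balance  # hypothetical spend fraction
    return balance
```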
From the TxLog, we create the training and test sets through a preprocessing step: the data is split into training and test sets with respect to time, and features are then engineered for each split. To create the splits, we divide the TxLog data based on a 50/50 split in time. This means that all transaction activity occurring in the first half of the simulated year is included in the training set, while the activity from the second half of the year is included in the test set. No overlap is used between the training and test sets, to minimize data leakage. This approach is further discussed in Section 5.0.2.

After splitting the data in time to create the training and test sets, feature engineering is performed separately on each split. The label indicating whether an account is laundering money or not is determined based on whether the account is part of a money laundering topology during the specific time split, ensuring the label accurately reflects the account's status within each period.

For both the training and test sets, a node file and an edge file are created. The node files contain the engineered features for each account, such as the total amount spent by an account, the number of incoming transactions, and other account-specific features. All the node features are shown and further explained in Table 3.2. The edge file represents transactions between accounts, including the source and destination of each transaction. The edges are undirected, meaning the direction of the transaction is not considered in the graph representation.

From the training and test datasets, we are now able to train a model. In this thesis, we treat the classification at the node level, i.e., we try to distinguish accounts partaking in money-laundering activities from normal accounts. Since the rule-based system in the paper by Copenhagen Business School [21] that performs at a 95–98% FPR is unknown, we opted to use a tree-based classifier as a replacement. This tree-based classifier offers a more sophisticated method for classifying money laundering compared to a rule-based system. Consequently, this approach likely results in a dataset that is more challenging than real data. The tree-based classifier is only able to use tabular data, which, in this case, corresponds to the node features described previously.

With the training and test data in place, a classifier is trained and its FPR is compared to the target. If the target is not achieved, the iterative process restarts, as illustrated in Figure 3.1. In particular, if the model achieves lower than 95% FPR, the temporal parameters are updated to attempt to get closer to the target in the next iteration. If the tree-based classifier achieves 95% FPR or higher, we can be confident that the two classes in the data are extremely difficult to distinguish based on the node features, and the benchmarking dataset is complete. Ultimately, this procedure results in a dataset that mimics real data in a more systematic way than guessing the parameters.

From this point, the temporal parameters, shown in Table 3.3, are locked. This benchmarking dataset will be called the medium level dataset (MID). From the MID dataset, two other datasets are generated: an easy one (EASY) and a more difficult one (HARD), distinguished by different class imbalances. These were generated by locking the temporal parameters and tuning the spatial parameters to increase the number of money-laundering accounts to approximately 25% for the EASY dataset, and to decrease the number of money-laundering accounts to approximately 1% for the HARD dataset.
These transaction graph networks, EASY and HARD, with their respective spatial parameters, go through the same pipeline of generating the transaction log, splitting, and preprocessing the data. This finalizes the three datasets, EASY, MID, and HARD, with respective node and edge files for both the training and test sets.

3.0.2 Missing labels

In this thesis, we assume that a substantial portion of the data labels is unknown. This assumption is fair since there is limited feedback on the cases that are sent to the financial authorities, and there is no certain ground truth for the non-suspicious accounts. To strengthen this argument, we also note that only about 5% of all financial crimes are found by today's systems [21]. It is therefore fair to assume that, if all accounts in a transaction network were labeled, a large number of them could be mislabeled. Based on these facts and estimations, we opted for a heavily semi-supervised setting where 90% of the accounts do not have a label in the training data.

To realize this, in each of the three datasets (EASY, MID, and HARD), 90% of all normal nodes and 90% of all money-laundering nodes have their labels removed. Nodes whose labels are removed are selected at random. This procedure results in an additional three datasets: EASY with missing labels, MID with missing labels, and HARD with missing labels.

Figure 3.3: Example of the differences between known labels, incomplete labels, and incomplete and inaccurate labels. Blue nodes with a "?" are incompletely labeled, and nodes with a "!" are inaccurately labeled.

3.0.3 Inaccurate labeling

To investigate the importance of label accuracy, i.e., the correctness of the labels available, we introduce label noise to the nodes that have a label attached, as demonstrated in Figure 3.3. To address realistic scenarios, we identified three different categories of methods for selecting labels: class, topology, and neighbour. We consider two different ratios, 10% and 25%, which translate into the fraction of the selected labeled nodes that have their labels flipped. The selection categories contain different noises that are applied to the training set of the three datasets, EASY, MID, and HARD, at both ratios of 10% and 25%, which implies that, for every noise, six more datasets are generated.

3.0.3.1 Class

In real data, inaccuracies can occur due to accounts being falsely flagged by a detection system or due to illicit accounts that have not been found. The class selection category introduces label noise by handling normal accounts and money-laundering accounts separately. The false positive (FP) noise selects all the normal accounts and flips 10% or 25% of their labels, and the false negative (FN) noise selects all the money-laundering accounts and flips 10% or 25% of their labels. From this type of noise, one may study the importance of labeling and whether one class is more important than the other.

Figure 3.4: The different money laundering topologies.

3.0.3.2 Topology

Money launderers funnel illicit funds in many different ways, so-called topologies, as can be seen in Figure 3.4. The topology selection category introduces noise in seven different ways, all following the same selection principle. In particular, we select all the nodes of one money-laundering topology type at a time and flip 10% or 25% of their labels, thus creating the noises Fan-out, Fan-in, Scatter-gather, Gather-scatter, Cycle, Bipartite, and Stack. This is done to analyze whether the models are more sensitive to inaccurately labeled nodes in certain money-laundering topologies.
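The class and topology noises share the same core operation: select a set of labeled nodes and flip a fraction of their labels at random. The sketch below illustrates this for the class-based noise; the column name "label" is a hypothetical placeholder for the node-file schema.

```python
import numpy as np
import pandas as pd

def flip_class_labels(nodes: pd.DataFrame, target_class: int, ratio: float,
                      seed: int = 0) -> pd.DataFrame:
    """Flip `ratio` of the labels belonging to `target_class`.

    target_class=0 with ratio=0.10 gives the 10% false positive (FP) noise;
    target_class=1 gives the false negative (FN) noise.
    """
    rng = np.random.default_rng(seed)
    noisy = nodes.copy()
    candidates = noisy.index[noisy["label"] == target_class]  # labeled nodes of the class
    n_flip = int(round(ratio * len(candidates)))
    flip_idx = rng.choice(candidates, size=n_flip, replace=False)  # random selection
    noisy.loc[flip_idx, "label"] = 1 - target_class                # flip 0 <-> 1
    return noisy
```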
3.0.3.3 Neighbour

In a real-world scenario, accounts that have engaged in transactions with money laundering accounts may have a higher probability of also being flagged as money launderers. We denote this noise Neighbour. It is created by selecting all normal nodes neighbouring a money laundering topology, whereafter 10% or 25% of their labels are flipped. This is done to analyze whether inaccurately labeling neighbouring normal accounts as money launderers has a significant effect on the performance of the models.

3.0.4 Statistical machine learning models

In this thesis, five widely used statistical machine learning models were considered, i.e., SVM, KNN, LOG, random forest, and XGBoost, as outlined in Section 2.4. Notably, these models are unable to utilize the graph structure of the data; they can only use the node features as independent data points. Therefore, it is highly interesting to compare the results of the statistical machine learning models with those of the GNNs, which are able to use the graph structure, in order to understand how the incomplete and inaccurate datasets affect the different models.

3.0.4.1 Model implementation

All five statistical models were implemented using the scikit-learn library in Python [54]. This widely used library covers classification, regression, clustering, etc., and enables straightforward implementations of various statistical machine learning models in an easy-to-use environment.

3.0.4.2 Data loading

Each of the models was trained on the node files for every dataset. Prior to training, all features were normalized using the standard scaler, which removes the mean and scales to unit variance, to ensure proper functioning of the models.

3.0.4.3 Optimization

All five statistical models were optimized with the grid search technique. Grid search works by defining a search space as a grid of hyper-parameter values and evaluating every combination of hyper-parameters in the grid to find the best one [55]. The performance metric optimized was average precision (AP), explained in the next section. The hyper-parameters were tuned for the EASY, MID, and HARD datasets.

3.0.4.4 Evaluation

In evaluating the performance of the selected statistical machine learning models, the thesis focused on two key metrics: AP for the positive label and the area under the curve (AUC) score. These metrics were chosen to provide a comprehensive view of model effectiveness. AP offers a focused measure of the models' ability to identify accounts that are laundering money, which is critical to the project's objectives, while the AUC score provides a broader assessment of overall model performance across all classes.

To calculate the AUC score, we first determine the true positive rate

\[
\mathrm{TPR} = \frac{\text{True positives}}{\text{True positives} + \text{False negatives}} \tag{3.1}
\]

and the false positive rate

\[
\mathrm{FPR} = \frac{\text{False positives}}{\text{False positives} + \text{True negatives}} \tag{3.2}
\]

at various thresholds. These values are used to plot the receiver operating characteristic (ROC) curve, and the area under this curve quantifies the model's AUC score [Fawcett, 2006].

Since there is no randomness in fitting the statistical machine learning models to the training data, no cross-validation was performed. Moreover, since the data is split in time, there is no intuitive way of generating cross-validation datasets. Therefore, in the results for the statistical machine learning models, only scores without standard deviations are reported.
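Both metrics are available in scikit-learn. As a brief illustration, AP and AUC can be computed from held-out scores as follows; the label and score arrays are illustrative stand-ins.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

# Illustrative stand-ins: true 0/1 labels and predicted probabilities for
# the positive (money laundering) class, e.g. model.predict_proba(X)[:, 1].
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2, 0.05, 0.9])

ap = average_precision_score(y_true, scores)   # AP for the positive label
auc = roc_auc_score(y_true, scores)            # area under the ROC curve
print(f"AP = {ap:.3f}, AUC = {auc:.3f}")
```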
3.0.4.5 Handling incomplete labels

Since all nodes in the datasets are interpreted as independent by the statistical machine learning models, the only intuitive way of handling the incompletely labeled nodes is to not use them. Not being able to use the node features of nodes with missing labels is a significant disadvantage compared to the GNNs.

3.0.5 GNNs

For this thesis, three well-established GNN architectures were selected: GCN, GraphSAGE, and GAT. These models were chosen for their demonstrated success and versatility in handling graph-based learning tasks across various domains [42], [46], [47]. Each model offers a distinct approach to processing graph-structured data, which is crucial for addressing the challenges presented by datasets with missing labels and label noise. The selection of GCN, GraphSAGE, and GAT allows for a detailed comparison of their performance in managing these specific data irregularities, providing valuable insights into their effectiveness and robustness under varying conditions.

3.0.5.1 Model implementation

All three GNN models were implemented using the PyTorch Geometric library [56], [57]. This library is a specialized extension of PyTorch, tailored specifically for the creation and operation of graph neural networks. PyTorch Geometric provides efficient data structures and APIs for managing complex graph data, facilitating the development and application of advanced GNN architectures.

3.0.5.2 Data loading

Data preparation entails a series of steps to ready the graph data for analysis with the selected GNNs. A specialized dataset handling module is used to manage both node and edge data, ensuring that the models receive structured and relevant information. Data is loaded from CSV files containing node and edge information. This includes incorporating node features when specified and applying labels where available, whether to nodes or edges. It should be noted that, in the current setup, the GNN models neither utilize the edge features nor operate on directed edges. This simplifies the model architecture and focuses the learning on node features and their relationships, which are represented as undirected edges.

The data for nodes includes features and, possibly, labels, loaded and normalized to improve model training efficiency. Normalization involves subtracting the mean and then dividing by the standard deviation of the training data's features, a common practice to scale the input features to a similar range, reducing the possibility of instability during training due to vastly different feature scales.

3.0.5.3 Optimization

To enhance the performance of the GNN models in detecting positive labels, the hyperparameters were optimized using the Tree-structured Parzen Estimator (TPE) method [58]. This method predicts which hyperparameter configurations are likely to yield better results by analyzing outcomes from past trials. As for the statistical machine learning models, AP for the positive label was used as the objective function across the different datasets. TPE was deployed to optimize each model individually for each dataset, adjusting key parameters such as the learning rate, dropout rates, and the number of attention heads for attention-based models. For each dataset, 30 trials were conducted, with each trial configuration undergoing a training process spanning 1000 epochs to thoroughly evaluate the model's performance.
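The thesis does not name a specific TPE implementation; one widely used option is Optuna's TPESampler, so the following is a hypothetical sketch of how such a search could be set up. The search ranges and the train_and_evaluate helper are placeholders, not the thesis configuration.

```python
import optuna

def objective(trial):
    # Search space: examples of the kinds of parameters tuned here.
    lr = trial.suggest_float("lr", 1e-4, 1e-1, log=True)
    dropout = trial.suggest_float("dropout", 0.0, 0.6)
    heads = trial.suggest_int("heads", 1, 8)  # only relevant for GAT
    # train_and_evaluate is a placeholder for the real training loop:
    # train for 1000 epochs and return AP for the positive label.
    return train_and_evaluate(lr=lr, dropout=dropout, heads=heads)

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler())
study.optimize(objective, n_trials=30)  # 30 trials per dataset
print(study.best_params)
```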
3.0.5.4 Evaluation

The evaluation of the selected GNN models follows the same principles as for the statistical models, described in Section 3.0.4.4, to enable comparison between the different methods. To ensure the robustness and reliability of the evaluation results, each GNN model underwent multiple runs under different conditions. Specifically, each model was trained and tested five times with random initializations. This approach mitigates potential biases or anomalies that could arise from a single training session and allows for a more generalized assessment of the models' performance.

For each of the five runs, both the AP for the positive label and the overall AUC score were calculated. The results from these runs were then aggregated to compute an average AP and average AUC, along with the standard deviation (std) of these metrics. This evaluation approach not only demonstrates each model's effectiveness in identifying money laundering accounts but also reliably measures their overall performance.

3.0.5.5 Handling incomplete labels

The GNN models can effectively handle partially labeled datasets. Specifically, due to the neighbor aggregation mechanism illustrated in Figure 2.2, these models can leverage the features of nodes with missing labels, making them advantageous for datasets with incomplete labels.

A possible way of incorporating the nodes without a label in the loss function is by means of label propagation, i.e., creating pseudo labels via majority voting between neighbours [59]. However, experiments yielded poor results, since most of the neighbouring nodes are normal accounts and the voting therefore also labels money laundering accounts as normal. Therefore, nodes without a label were not used in the loss function.

Table 3.2: List of node features and the label with their descriptions.

Feature                Description
sum_spending           Total amount spent by an account.
mean_spending          Average amount spent by an account.
median_spending        Median amount spent by an account.
std_spending           Standard deviation of amounts spent by an account.
max_spending           Maximum amount spent in a single transaction.
min_spending           Minimum amount spent in a single transaction.
count_spending         Total number of outgoing transactions.
sum                    Total transaction amount for an account.
mean                   Average transaction amount for an account.
median                 Median transaction amount for an account.
std                    Standard deviation of transaction amounts.
max                    Maximum transaction amount for an account.
min                    Minimum transaction amount for an account.
count_in               Number of incoming transactions.
count_out              Number of outgoing transactions.
count_unique_in        Number of unique counterparties sending transactions.
count_unique_out       Number of unique counterparties receiving transactions.
count_days_in_bank     Number of days an account has been active.
count_phone_changes    Number of phone number changes for an account.

Label                  Description
money_launderer        Indicator of whether an account is laundering money.

Table 3.3: Parameter values for the normal and money-laundering classes. A = mean_amount, B = std_amount, C = prob_income, D = mean_income, E = std_income, F = mean_outcome, G = std_outcome, H = mean_phone_change_freq, I = std_phone_change_freq.

Class    A    B    C    D    E    F    G    H     I
Normal   637  500  0.1  400  250  200  500  1460  365
SAR      850  700  0.1  500  400  300  650  1000  200
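To close the chapter, the sketch below illustrates the masking strategy from Section 3.0.5.5: nodes without a label are kept in the graph, so their features still participate in message passing, but they are excluded from the cross-entropy loss. The convention of marking unlabeled nodes with -1 is an illustrative assumption, not necessarily what the thesis code uses.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, data):
    """One training step; data.y uses -1 for unlabeled nodes (assumed convention)."""
    model.train()
    optimizer.zero_grad()
    out = model(data.x, data.edge_index)  # all nodes join message passing
    labeled = data.y >= 0                 # mask: only labeled nodes enter the loss
    loss = F.cross_entropy(out[labeled], data.y[labeled])
    loss.backward()
    optimizer.step()
    return float(loss)
```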
4 Results

4.1 The finished datasets

Tables 4.1, 4.2, and 4.3 display the key properties of all datasets. Class imbalance is shown as the percentage of normal accounts : the percentage of money launderers.

Table 4.1: Data analysis from the EASY data

Dataset          Total     Labeled   Normal    ML        Class       Inaccurate
                 accounts  accounts  accounts  accounts  imbalance   labels
Ratio: No noise
Known labels     99 938    99 938    85 120    14 818    82.6:17.4   0
Missing labels   99 938    9 994     8 512     1 482     82.6:17.4   0
Ratio: 10%
Neighbour        99 938    9 994     8 197     1 797     78.1:21.9   315
Cycle            99 938    9 994     8 540     1 454     83.0:17.0   28
Fan-out          99 938    9 994     8 537     1 457     82.9:17.1   25
Fan-in           99 938    9 994     8 538     1 456     82.9:17.1   26
FP               99 938    9 994     7 661     2 333     69.6:30.4   851
Bipartite        99 938    9 994     8 533     1 461     82.9:17.1   21
Gather-scatter   99 938    9 994     8 524     1 470     82.8:17.2   12
Scatter-gather   99 938    9 994     8 528     1 466     82.8:17.2   16
Stack            99 938    9 994     8 532     1 462     82.9:17.1   20
FN               99 938    9 994     8 660     1 334     84.6:15.4   148
Ratio: 25%
Neighbour        99 938    9 994     7 724     2 270     70.6:29.3   788
Cycle            99 938    9 994     8 581     1 413     83.5:16.5   69
Fan-out          99 938    9 994     8 575     1 419     83.5:16.5   63
Fan-in           99 938    9 994     8 577     1 417     83.5:16.5   65
FP               99 938    9 994     6 384     3 610     43.5:56.5   2 128
Bipartite        99 938    9 994     8 564     1 430     83.3:16.7   52
Gather-scatter   99 938    9 994     8 542     1 452     83.0:17.0   30
Scatter-gather   99 938    9 994     8 552     1 442     83.1:16.9   40
Stack            99 938    9 994     8 562     1 432     83.3:16.7   50
FN               99 938    9 994     8 882     1 112     87.5:12.5   370

Table 4.2: Data analysis from the MID data

Dataset          Total     Labeled   Normal    ML        Class       Inaccurate
                 accounts  accounts  accounts  accounts  imbalance   labels
Ratio: No noise
Known labels     99 932    99 932    97 032    2 900     97.0:3.0    0
Missing labels   99 932    9 993     9 703     290       97.0:3.0    0
Ratio: 10%
Neighbour        99 932    9 993     9 625     368       96.2:3.8    78
Cycle            99 932    9 993     9 709     284       97.1:2.9    6
Fan-out          99 932    9 993     9 708     285       97.1:2.9    5
Fan-in           99 932    9 993     9 708     285       97.1:2.9    5
FP               99 932    9 993     8 733     1 260     85.6:14.4   970
Bipartite        99 932    9 993     9 708     285       97.1:2.9    5
Gather-scatter   99 932    9 993     9 705     288       97.0:3.0    2
Scatter-gather   99 932    9 993     9 706     287       97.0:3.0    3
Stack            99 932    9 993     9 707     286       97.1:2.9    4
FN               99 932    9 993     9 732     261       97.3:2.7    29
Ratio: 25%
Neighbour        99 932    9 993     9 508     485       94.9:5.1    195
Cycle            99 932    9 993     9 719     274       97.2:2.8    16
Fan-out          99 932    9 993     9 715     278       97.1:2.9    12
Fan-in           99 932    9 993     9 715     278       97.1:2.9    12
FP               99 932    9 993     7 277     2 716     62.7:37.3   2 426
Bipartite        99 932    9 993     9 715     278       97.1:2.9    12
Gather-scatter   99 932    9 993     9 708     285       97.1:2.9    5
Scatter-gather   99 932    9 993     9 710     283       97.1:2.9    7
Stack            99 932    9 993     9 712     281       97.1:2.9    9
FN               99 932    9 993     9 775     218       97.8:2.2    72
Table 4.3: Data analysis from the HARD data

Dataset          Total     Labeled   Normal    ML        Class       Inaccurate
                 accounts  accounts  accounts  accounts  imbalance   labels
Ratio: No noise
Known labels     99 923    99 923    99 344    579       99.4:0.6    0
Missing labels   99 923    9 992     9 934     58        99.4:0.6    0
Ratio: 10%
Neighbour        99 923    9 992     9 918     74        99.3:0.7    16
Cycle            99 923    9 992     9 935     57        99.4:0.6    1
Fan-out          99 923    9 992     9 935     57        99.4:0.6    1
Fan-in           99 923    9 992     9 935     57        99.4:0.6    1
FP               99 923    9 992     8 941     1 051     88.2:11.8   993
Bipartite        99 923    9 992     9 935     57        99.4:0.6    1
Gather-scatter   99 923    9 992     9 934     58        99.4:0.6    0
Scatter-gather   99 923    9 992     9 934     58        99.4:0.6    0
Stack            99 923    9 992     9 935     57        99.4:0.6    1
FN               99 923    9 992     9 940     52        99.5:0.5    6
Ratio: 25%
Neighbour        99 923    9 992     9 894     98        99.0:1.0    40
Cycle            99 923    9 992     9 936     56        99.4:0.6    2
Fan-out          99 923    9 992     9 936     56        99.4:0.6    2
Fan-in           99 923    9 992     9 937     55        99.4:0.6    3
FP               99 923    9 992     7 450     2 542     65.9:34.1   2 484
Bipartite        99 923    9 992     9 936     56        99.4:0.6    2
Gather-scatter   99 923    9 992     9 935     57        99.4:0.6    1
Scatter-gather   99 923    9 992     9 935     57        99.4:0.6    1
Stack            99 923    9 992     9 937     55        99.4:0.6    3
FN               99 923    9 992     9 948     44        99.6:0.4    14

4.2 Comparison of known and missing labels

Table 4.4: AP and AUC scores for the different datasets and label conditions

AP Scores
                EASY                          MID                           HARD
Model   Known labels  Missing labels  Known labels  Missing labels  Known labels  Missing labels
KNN     42.16         37.11           13.03         8.98            1.77          0.84
LOG     37.32         34.52           10.60         10.80           1.93          1.35
RF      52.69         48.82           23.17         18.75           8.30          3.19
SVM     34.91         27.84           3.09          5.69            0.73          0.94
XGB     53.53         51.00           24.61         18.06           7.83          2.57
GSAGE   55.16 ± 0.42  40.84 ± 2.01    25.56 ± 1.25  5.57 ± 0.95     6.54 ± 1.40   1.84 ± 0.52
GCN     40.57 ± 3.69  44.98 ± 0.86    17.06 ± 0.74  16.13 ± 0.56    5.31 ± 0.62   4.26 ± 0.17
GAT     55.30 ± 0.34  54.31 ± 0.56    28.08 ± 1.52  23.40 ± 2.09    8.86 ± 0.73   7.90 ± 0.73

AUC Scores
Model   Known labels  Missing labels  Known labels  Missing labels  Known labels  Missing labels
KNN     79.93         76.06           74.53         70.08           62.96         56.17
LOG     78.23         75.82           74.75         73.24           73.63         68.39
RF      86.12         85.14           86.36         83.36           85.27         77.18
SVM     74.78         66.22           47.54         64.30           49.77         62.14
XGB     86.32         85.48           86.51         84.53           85.70         78.41
GSAGE   86.27 ± 0.15  77.09 ± 1.22    85.86 ± 0.31  62.04 ± 5.17    78.88 ± 2.46  60.67 ± 2.72
GCN     82.97 ± 0.47  83.31 ± 0.17    84.86 ± 0.06  82.78 ± 0.51    80.61 ± 0.45  79.58 ± 0.30
GAT     86.25 ± 0.05  85.57 ± 0.16    87.10 ± 0.13  86.09 ± 0.44    85.27 ± 0.29  83.89 ± 1.32

Table 4.4 shows the results for all models, for both known and missing labels, on the EASY, MID, and HARD datasets. For the EASY dataset, GAT generally shows superior performance compared to the other models, achieving the highest AP scores for both known labels (55.30) and missing labels (54.31). GraphSAGE shows competitive performance with AP scores of 55.16 for known labels and 40.84 for missing labels, a significant drop, but still robust relative to the other models. GCN also shows strong results, with an AP score of 40.57 for known labels and 44.98 for missing labels, displaying better robustness to missing labels than GraphSAGE. XGB follows closely with an AP score of 53.53 for known labels and 51.00 for missing labels, showing slightly lower robustness than GAT. RF performs well among the statistical models with AP scores of 52.69 for known labels and 48.82 for missing labels. SVM and LOG show relatively lower performance, particularly SVM, with a notable drop from 34.91 (known labels) to 27.84 (missing labels).
The overall performance decreases slightly in the presence of missing labels across all models, but GAT maintains the highest robustness. In terms of AUC scores, XGB shows the highest AUC for known labels (86.32), but GAT performs almost equally well (86.25). For missing labels, GAT (85.57) and XGB (85.48) remain very close, with GAT showing slightly better robustness.

For the MID dataset, GAT again leads with an AP score of 28.08 for known labels and 23.40 for missing labels. GraphSAGE shows competitive performance with AP scores of 25.56 for known labels and 5.57 for missing labels, a significant drop. GCN again shows solid results, with an AP score of 17.06 for known labels and 16.13 for missing labels, once more displaying better robustness to missing labels than GraphSAGE. XGB scores 24.61 for known labels and 18.06 for missing labels, a significant drop when labels are missing. Random Forest performs close to XGB, with AP scores of 23.17 for known labels and 18.75 for missing labels. SVM, unusually, performs better with missing labels than with known labels, with its AP rising from 3.09 to 5.69. LOG scores 10.60 for known labels and 10.80 for missing labels, a slight improvement with missing labels. In terms of AUC scores, GAT performs best, with 87.10 for known labels and 86.09 for missing labels. XGB follows closely, with 86.51 for known labels and 84.53 for missing labels.

For the HARD dataset, GAT continues to outperform the other models, with AP scores of 8.86 for known labels and 7.90 for missing labels. GraphSAGE shows competitive performance with AP scores of 6.54 for known labels and 1.84 for missing labels, again a significant drop. GCN shows an AP score of 5.31 for known labels and 4.26 for missing labels, once more displaying better robustness to missing labels than GraphSAGE. Random Forest shows a significant drop in AP, from 8.30 for known labels to 3.19 for missing labels. XGB scores 7.83 for known labels and 2.57 for missing labels, a notable drop. LOG and KNN perform the worst in terms of AP, with KNN scoring 1.77 for known labels and 0.84 for missing labels. SVM performs poorly, though still better than KNN in some cases, with scores of 0.73 for known labels and 0.94 for missing labels. In terms of AUC, GAT is the most robust, scoring 85.27 for known labels and 83.89 for missing labels. XGB achieves the highest AUC for known labels (85.70) but drops to 78.41 for missing labels. Overall, the AP scores of all models are low on the HARD dataset.

There is a clear decline in both AP and AUC scores as the dataset becomes more challenging (from EASY to HARD), with all models showing reduced performance with missing labels. GAT consistently shows the best performance across all difficulty levels and conditions, particularly excelling in robustness to missing labels. Among the statistical models, Random Forest and XGB generally perform best, but they still lag behind GAT in robustness and overall performance. SVM shows the lowest performance for the MID and HARD datasets, while KNN performs the worst in the HARD missing-labels scenario. In summary, GAT demonstrates the highest performance and robustness across all scenarios, particularly excelling with incomplete labels.
Among the statistical models, Random Forest and XGB show the highest performance but still fall behind GAT in robustness. Overall, the presence of missing labels affects the performance of all models, with a notable decline in both AP and AUC scores as dataset difficulty increases.

4.3 Model performance for incorrect labels

Table 4.5: Results on the EASY datasets with 10% of the selected nodes' labels flipped

AP Scores
Model   Missing labels  FN            FP            Neighbour     Bipartite     Cycle
KNN     37.11           35.54         32.81         34.75         36.91         37.50
LOG     34.52           33.98         35.76         34.01         34.31         34.74
RF      48.82           48.41         48.12         48.95         48.36         49.03
SVM     27.84           20.32         21.23         25.71         26.08         26.56
XGB     51.00           49.86         45.23         50.24         50.65         51.54
GSAGE   40.84 ± 2.01    43.15 ± 4.07  43.10 ± 3.31  46.25 ± 1.21  39.40 ± 3.82  36.46 ± 2.32
GCN     44.98 ± 0.86    42.55 ± 1.26  43.87 ± 3.15  46.44 ± 0.33  41.03 ± 3.88  43.57 ± 2.44
GAT     54.31 ± 0.56    53.93 ± 0.21  53.24 ± 0.29  54.37 ± 0.15  51.35 ± 4.75  54.09 ± 0.52

Model   Missing labels  Fan In        Fan Out       Ga-Sc         Sc-Ga         Stack
KNN     37.11           37.17         37.10         36.92         37.13         37.02
LOG     34.52           34.47         34.08         34.66         34.56         34.28
RF      48.82           48.92         48.85         48.95         48.95         48.44
SVM     27.84           28.68         29.24         27.31         27.22         28.24
XGB     51.00           51.20         50.53         50.74         50.72         50.46
GSAGE   40.84 ± 2.01    38.22 ± 4.15  38.90 ± 2.89  36.96 ± 3.19  40.60 ± 5.95  39.66 ± 5.47
GCN     44.98 ± 0.86    43.35 ± 2.53  44.49 ± 0.56  43.10 ± 2.95  40.97 ± 3.01  44.01 ± 1.79
GAT     54.31 ± 0.56    51.49 ± 4.37  53.61 ± 0.23  54.06 ± 0.44  53.43 ± 0.60  53.66 ± 0.31

ROC_AUC Scores
Model   Missing labels  FN            FP            Neighbour     Bipartite     Cycle
KNN     76.06           75.05         70.70         74.51         75.90         76.23
LOG     75.82           75.45         78.84         76.30         75.64         75.79
RF      85.14           84.91         84.26         84.94         84.95         85.15
SVM     66.22           59.52         58.05         65.98         64.62         65.85
XGB     85.48           85.20         83.98         85.04         85.45         85.55
GSAGE   77.09 ± 1.22    79.32 ± 2.79  77.83 ± 2.95  81.21 ± 1.23  76.18 ± 2.70  74.00 ± 1.89
GCN     83.31 ± 0.17    82.80 ± 0.22  83.03 ± 0.28  83.13 ± 0.09  82.64 ± 0.64  83.06 ± 0.29
GAT     85.57 ± 0.16    85.56 ± 0.17  84.43 ± 0.30  85.66 ± 0.17  85.30 ± 0.72  85.72 ± 0.18

Model   Missing labels  Fan In        Fan Out       Ga-Sc         Sc-Ga         Stack
KNN     76.06           76.04         76.06         75.96         75.92         76.00
LOG     75.82           75.71         75.41         76.06         75.84         75.64
RF      85.14           85.07         85.07         85.08         85.12         84.99
SVM     66.22           66.57         67.35         65.63         65.84         66.60
XGB     85.48           85.50         85.35         85.45         85.46         85.32
GSAGE   77.09 ± 1.22    75.73 ± 2.82  75.60 ± 2.51  74.37 ± 2.87  76.54 ± 4.44  75.90 ± 4.34
GCN     83.31 ± 0.17    83.11 ± 0.27  83.16 ± 0.14  82.99 ± 0.35  82.73 ± 0.42  83.13 ± 0.23
GAT     85.57 ± 0.16    85.39 ± 0.51  85.57 ± 0.07  85.79 ± 0.25  85.68 ± 0.25  85.80 ± 0.12

4.3.1 EASY with 10% flipped labels

For the EASY dataset with 10% of the selected nodes' labels flipped, shown in table 4.5, GAT demonstrates the highest performance across most scenarios. GAT consistently achieves the highest AP scores for both missing labels and the various types of inaccurate labels. It also maintains robust AUC scores, indicating strong overall performance and robustness to label inaccuracies.

XGB follows closely with high AP and AUC scores but shows less robustness to label flipping than GAT. While it performs well, its sensitivity to inaccurate labels results in a more noticeable performance decline. RF also performs well among the statistical models. It achieves competitive AP and AUC scores, although it too is affected by label inaccuracies, showing some performance decline with increased label flipping.

GraphSAGE shows competitive performance with moderate drops in AP and AUC scores when labels are flipped. It remains robust, but the impact of label inaccuracies is more pronounced than for GAT. GCN also shows strong results with respectable AP and AUC scores.
Like GraphSAGE, GCN is affected by label inaccuracies but maintains competitive performance levels.

SVM and LOG show relatively lower performance. Both models experience significant declines in AP and AUC scores as label inaccuracies increase, with SVM being particularly sensitive. KNN exhibits moderate performance, showing noticeable declines in both AP and AUC scores with label inaccuracies. Its performance is less robust than that of the other models evaluated.

Table 4.6: Results on the EASY datasets with 25% of the selected nodes' labels flipped

AP Scores
Model   Missing labels  FN            FP            Neighbour     Bipartite     Cycle
KNN     37.11           33.58         26.45         32.25         36.50         36.80
LOG     34.52           32.77         32.93         32.74         33.97         34.60
RF      48.82           48.80         45.59         47.80         48.07         49.37
SVM     27.84           21.63         16.99         25.45         25.79         22.74
XGB     51.00           48.94         44.33         50.40         50.26         50.86
GSAGE   40.84 ± 2.01    46.91 ± 4.78  38.66 ± 2.70  40.22 ± 1.01  33.86 ± 3.64  37.34 ± 3.33
GCN     44.98 ± 0.86    43.62 ± 0.75  46.69 ± 0.56  41.58 ± 4.11  42.98 ± 1.83  41.65 ± 3.50
GAT     54.31 ± 0.56    52.15 ± 0.85  51.96 ± 0.63  54.20 ± 0.69  53.70 ± 0.40  52.35 ± 3.08

Model   Missing labels  Fan In        Fan Out       Ga-Sc         Sc-Ga         Stack
KNN     37.11           36.99         37.02         37.36         37.36         36.85
LOG     34.52           34.05         34.04         34.42         34.42         34.62
RF      48.82           49.18         48.95         48.40         48.97         48.00
SVM     27.84           24.26         26.61         26.59         26.59         22.27
XGB     51.00           51.20         50.72         50.22         50.95         49.31
GSAGE   40.84 ± 2.01    40.51 ± 5.96  38.59 ± 4.76  40.91 ± 3.19  40.73 ± 5.42  35.65 ± 5.11
GCN     44.98 ± 0.86    41.82 ± 2.67  41.92 ± 4.31  41.03 ± 2.60  42.45 ± 1.88  43.46 ± 2.22
GAT     54.31 ± 0.56    53.95 ± 0.27  54.02 ± 0.46  53.97 ± 0.44  54.16 ± 0.25  52.49 ± 1.43

ROC_AUC Scores
Model   Missing labels  FN            FP            Neighbour     Bipartite     Cycle
KNN     76.06           73.58         66.02         73.08         75.62         75.78
LOG     75.82           75.16         77.06         75.89         75.25         75.78
RF      85.14           84.82         83.49         84.39         84.91         85.22
SVM     66.22           62.41         54.26         66.13         64.25         62.62
XGB     85.48           84.87         82.97         84.43         85.29         85.44
GSAGE   77.09 ± 1.22    81.84 ± 3.15  75.79 ± 2.05  76.11 ± 1.69  71.93 ± 3.99  74.64 ± 2.60
GCN     83.31 ± 0.17    83.11 ± 0.15  83.39 ± 0.18  82.59 ± 0.57  82.87 ± 0.29  82.81 ± 0.49
GAT     85.57 ± 0.16    85.43 ± 0.13  83.88 ± 0.64  85.59 ± 0.25  85.54 ± 0.08  85.48 ± 0.26

Model   Missing labels  Fan In        Fan Out       Ga-Sc         Sc-Ga         Stack
KNN     76.06           75.96         75.93         76.07         75.96         75.83
LOG     75.82           75.54         75.48         75.81         75.81         76.04
RF      85.14           85.08         85.12         85.07         85.07         84.95
SVM     66.22           63.17         66.57         65.20         65.20         63.13
XGB     85.48           85.53         85.53         85.32         85.43         85.16
GSAGE   77.09 ± 1.22    76.54 ± 4.54  75.72 ± 4.14  77.37 ± 2.38  77.15 ± 4.61  73.41 ± 4.72
GCN     83.31 ± 0.17    82.89 ± 0.37  82.80 ± 0.61  82.73 ± 0.33  82.99 ± 0.30  83.05 ± 0.16
GAT     85.57 ± 0.16    85.76 ± 0.20  85.62 ± 0.25  85.79 ± 0.25  85.66 ± 0.13  85.70 ± 0.21

4.3.2 EASY with 25% flipped labels

The results for 25% flipped labels are shown in table 4.6. On this dataset, GAT still achieves the highest performance across both the missing-labels and inaccurate-labels scenarios. Despite the increased label flipping, GAT maintains high AP and AUC scores, showcasing its robustness.

XGB continues to perform well but shows greater sensitivity to the increase in label inaccuracies. Its AP and AUC scores decline more noticeably than in the 10% flip scenario, indicating its lower robustness relative to GAT. RF also shows a notable decline in performance with increased label flipping. While it still performs well among the statistical models, the impact of inaccurate labels is more evident.

GraphSAGE and GCN continue to show competitive performance, but with more significant drops than in the 10% flip scenario.
Both models experience larger declines in AP and AUC scores, highlighting their sensitivity to higher levels of label inaccuracies. SVM and LOG continue to show lower performance. Both models exhibit significant drops in AP and AUC scores with increased label flipping, reinforcing their sensitivity to label inaccuracies. KNN also declines notably with increased label flipping; its AP and AUC scores decrease significantly, indicating lower robustness than the other models.

4.3.3 Summary for EASY

Across both the 10% and 25% label-flip scenarios, GAT consistently achieves the highest performance, maintaining robust AP and AUC scores even as label inaccuracies increase. XGB and RF also perform well, but they show a more significant decline in performance with increasing label flips than GAT. GraphSAGE and GCN maintain competitive performance but are more affected by label inaccuracies, particularly in the 25% flip scenario. SVM and LOG continue to show lower performance across all scenarios, with significant drops in both AP and AUC scores as label inaccuracies increase. KNN exhibits moderate performance that declines notably with increased label flips.

4.3.4 MID with 10% flipped labels

The results for 10% flipped labels are shown in table 4.7. On this dataset, GAT is the best-performing model overall. With a roughly five times lower rate of money laundering than in the EASY dataset, GAT loses about half of its previous performance on the AP metric.

XGB and RF are the best-performing statistical models on both metrics. Owing to the greater class imbalance, SVM, KNN, and LOG show a significant drop in AP. GraphSAGE also shows a large decrease on both metrics, indicating a strong effect of the increased class imbalance. GCN's performance drops to just below that of XGB and RF.

The noise with the greatest impact on model performance is the FP noise; the other noise types all have a similar, smaller impact. GCN handles the FP noise best, showing only a slight drop in AP and a slight increase in AUC.
Table 4.7: Results on the MID datasets with 10% of the selected nodes' labels flipped

AP Scores
Model   Missing labels  FN            FP            Neighbour     Bipartite     Cycle
KNN     8.98            8.29          4.72          8.65          8.86          9.17
LOG     10.80           10.53         6.89          9.08          10.87         12.14
RF      18.75           17.87         13.20         18.19         18.72         19.49
SVM     5.69            5.42          3.65          4.35          5.00          5.61
XGB     18.06           17.89         11.19         17.50         17.54         17.86
GSAGE   5.57 ± 0.95     6.37 ± 1.42   3.46 ± 0.48   4.11 ± 0.44   5.39 ± 0.74   4.85 ± 0.78
GCN     16.13 ± 0.56    16.71 ± 1.05  15.70 ± 0.94  17.56 ± 1.31  17.29 ± 1.19  16.79 ± 1.12
GAT     23.40 ± 2.09    22.26 ± 1.22  17.44 ± 2.07  22.50 ± 1.28  23.46 ± 1.59  23.35 ± 1.93

Model   Missing labels  Fan In        Fan Out       Ga-Sc         Sc-Ga         Stack
KNN     8.98            8.97          8.89          8.96          8.97          8.80
LOG     10.80           10.83         10.62         10.69         10.64         10.74
RF      18.75           18.43         18.95         19.46         19.34         18.73
SVM     5.69            5.40          6.24          5.65          5.58          5.35
XGB     18.06           17.90         17.94         18.09         17.20         17.74
GSAGE   5.57 ± 0.95     4.93 ± 0.55   5.52 ± 0.47   6.05 ± 0.92   5.31 ± 0.76   4.41 ± 0.19
GCN     16.13 ± 0.56    16.04 ± 0.31  17.06 ± 1.14  17.40 ± 1.20  16.20 ± 0.33  16.72 ± 1.33
GAT     23.40 ± 2.09    22.45 ± 1.45  23.89 ± 1.82  23.84 ± 1.35  24.26 ± 1.72  22.96 ± 2.32

ROC_AUC Scores
Model   Missing labels  FN            FP            Neighbour     Bipartite     Cycle
KNN     70.08           67.74         61.94         70.14         69.75         70.10
LOG     73.24           73.12         67.23         71.48         73.29         73.74
RF      83.36           83.03         75.70         82.50         83.34         83.58
SVM     64.30           63.07         58.56         57.83         62.04         63.87
XGB     84.53           84.43         71.15         82.76         84.55         84.57
GSAGE   62.04 ± 5.17    66.32 ± 6.29  55.61 ± 4.70  56.68 ± 0.84  61.82 ± 4.06  58.67 ± 3.91
GCN     82.78 ± 0.51    83.01 ± 0.19  83.38 ± 0.16  83.60 ± 0.10  82.94 ± 0.17  83.23 ± 0.29
GAT     86.09 ± 0.44    85.94 ± 0.25  77.57 ± 4.04  85.18 ± 0.55  85.90 ± 0.22  85.85 ± 0.22

Model   Missing labels  Fan In        Fan Out       Ga-Sc         Sc-Ga         Stack
KNN     70.08           70.40         69.37         70.07         70.08         69.63
LOG     73.24           73.29         73.22         73.13         73.17         73.22
RF      83.36           83.54         83.66         83.85         83.78         83.30
SVM     64.30           63.87         65.69         64.57         64.02         62.17
XGB     84.53           84.70         84.46         84.53         84.43         84.47
GSAGE   62.04 ± 5.17    60.25 ± 3.70  64.13 ± 1.37  64.77 ± 3.03  61.76 ± 4.58  56.35 ± 1.62
GCN     82.78 ± 0.51    82.12 ± 0.34  83.09 ± 0.43  82.65 ± 0.45  82.69 ± 0.31  82.50 ± 0.43
GAT     86.09 ± 0.44    86.07 ± 0.16  85.79 ± 0.24  86.02 ± 0.39  86.15 ± 0.20  86.09 ± 0.33

4.3.5 MID with 25% flipped labels

The results for 25% flipped labels are shown in table 4.8. Once again, the strongest performance on both metrics comes from GAT. Compared with the previous 10% flipped labels, the general performance of all models decreases across all noise types.

SVM, KNN, and LOG, together with GraphSAGE, have the largest performance drops on both metrics, with GraphSAGE showing the greatest decrease for the neighbour noise in particular. XGB and RF still perform at high levels, especially in terms of AUC, showing high overall classification performance when both classes are taken into consideration.

Comparing the effects of the different noise types, neighbour noise with 25% flipped labels has a large impact on all models, while FP remains the noise with the largest impact on performance for all models.

4.3.6 Summary for MID

GAT consistently performs well across both the 10% and 25% flipped-label ratios for the various noise types. XGB and RF also show little to no performance drop across the noise types, on both metrics, for the two ratios. SVM, KNN, and LOG consistently show poor performance even with accurate labels, indicating that the increased class imbalance has a strong negative effect, and they decline slightly further with added inaccurate labels.
The FP and neighbour noises are the most detrimental to the performance of all models, with the largest effects at the 25% ratio.

Table 4.8: Results on the MID datasets with 25% of the selected nodes' labels flipped

AP Scores
Model   Missing labels  FN            FP            Neighbour     Bipartite     Cycle
KNN     8.98            8.38          3.99          8.21          8.84          8.97
LOG     10.80           10.75         5.01          7.91          10.36         10.52
RF      18.75           17.90         6.10          14.46         18.79         18.90
SVM     5.69            3.76          3.01          3.92          5.41          5.44
XGB     18.06           17.92         7.46          16.66         16.67         18.17
GSAGE   5.57 ± 0.95     4.61 ± 0.92   5.71 ± 1.54   3.37 ± 0.29   6.64 ± 1.00   5.72 ± 0.22
GCN     16.13 ± 0.56    16.00 ± 0.82  14.74 ± 0.71  16.86 ± 1.04  16.62 ± 0.26  16.86 ± 1.17
GAT     23.40 ± 2.09    21.94 ± 0.87  15.67 ± 4.10  18.26 ± 3.10  22.43 ± 1.97  24.22 ± 2.26

Model   Missing labels  Fan In        Fan Out       Ga-Sc         Sc-Ga         Stack
KNN     8.98            8.49          8.95          8.91          8.61          8.78
LOG     10.80           11.67         10.32         10.77         10.75         10.49
RF      18.75           18.63         19.12         19.10         18.20         18.78
SVM     5.69            4.62          5.36          5.37          4.43          5.98
XGB     18.06           17.38         18.42         17.67         17.83         16.85
GSAGE   5.57 ± 0.95     5.09 ± 0.45   4.73 ± 0.51   5.48 ± 0.93   4.96 ± 0.45   5.54 ± 1.11
GCN     16.13 ± 0.56    16.61 ± 0.99  16.27 ± 0.46  16.58 ± 1.18  15.83 ± 0.17  15.78 ± 0.27
GAT     23.40 ± 2.09    21.80 ± 0.68  24.10 ± 1.70  24.85 ± 1.45  21.86 ± 1.23  23.89 ± 1.37

ROC_AUC Scores
Model   Missing labels  FN            FP            Neighbour     Bipartite     Cycle
KNN     70.08           67.88         58.50         70.76         69.56         69.93
LOG     73.24           72.97         61.83         70.04         73.02         73.23
RF      83.36           82.23         67.24         80.98         83.72         83.00
SVM     64.30           57.23         51.90         56.96         65.22         63.97
XGB     84.53           83.69         68.13         81.56         84.38         84.64
GSAGE   62.04 ± 5.17    61.40 ± 5.43  65.66 ± 5.63  54.25 ± 1.37  66.32 ± 3.73  63.90 ± 1.51
GCN     82.78 ± 0.51    81.98 ± 0.30  83.51 ± 0.10  83.11 ± 0.16  82.95 ± 0.23  82.68 ± 0.24
GAT     86.09 ± 0.44    85.43 ± 0.20  77.72 ± 5.12  82.30 ± 3.21  86.17 ± 0.24  85.41 ± 0.84

Model   Missing labels  Fan In        Fan Out       Ga-Sc         Sc-Ga         Stack
KNN     70.08           69.07         70.39         69.98         69.56         69.89
LOG     73.24           73.56         73.16         73.22         73.33         73.08
RF      83.36           83.31         83.54         83.51         83.52         83.30
SVM     64.30           60.16         63.43         63.49         59.00         66.95
XGB     84.53           84.56         84.68         84.46         84.59         84.14
GSAGE   62.04 ± 5.17    60.70 ± 3.34  59.49 ± 2.58  63.14 ± 4.16  60.67 ± 3.99  62.22 ± 5.92
GCN     82.78 ± 0.51    82.55 ± 0.34  82.67 ± 0.36  82.67 ± 0.52  82.58 ± 0.19  81.97 ± 0.07
GAT     86.09 ± 0.44    85.95 ± 0.26  86.11 ± 0.23  85.82 ± 0.41  86.09 ± 0.28  86.29 ± 0.36

4.3.7 HARD with 10% flipped labels

The results for 10% flipped labels are shown in table 4.9. Here the extreme class imbalance shows its effect on the AP metric across all models. GAT is still the best-performing model, showing some capability of handling extreme class imbalance. SVM, KNN, LOG, and GraphSAGE all score low on both AP and AUC, showing extremely poor overall performance. XGB, RF, and GCN do not show as large a drop in AUC, indicating that these models are still able to find patterns, mainly for the normal class.

The FP noise once again has a large effect across all models, with the neighbour noise having the second-largest effect. Both noises introduce false positives into the dataset, overwhelming the positive class with inaccuracies.
Table 4.9: Results on the HARD datasets with 10% of the selected nodes' labels flipped

AP Scores
Model   Missing labels  FN            FP            Neighbour     Bipartite     Cycle
KNN     0.84            0.84          0.58          0.78          0.84          0.84
LOG     1.35            1.34          0.70          1.34          1.28          1.29
RF      3.19            2.79          0.95          2.67          3.30          2.99
SVM     0.94            0.99          0.55          0.68          1.11          0.96
XGB     2.57            2.06          0.59          2.07          2.26          2.83
GSAGE   1.84 ± 0.52     1.85 ± 0.44   0.57 ± 0.03   1.36 ± 0.17   1.64 ± 0.20   2.01 ± 0.61
GCN     4.26 ± 0.17     3.11 ± 0.13   1.05 ± 0.15   3.43 ± 0.48   3.28 ± 0.51   2.67 ± 0.63
GAT     7.90 ± 0.73     7.94 ± 1.02   1.99 ± 0.30   7.65 ± 0.77   7.39 ± 1.10   6.88 ± 0.85

Model   Missing labels  Fan In        Fan Out       Ga-Sc         Sc-Ga         Stack
KNN     0.84            0.84          0.84          0.84          0.84          0.84
LOG     1.35            1.32          1.32          1.35          1.41          1.43
RF      3.19            3.12          2.98          2.76          2.67          3.52
SVM     0.94            0.97          0.93          0.94          0.98          0.92
XGB     2.57            2.46          2.80          2.70          2.87          2.86
GSAGE   1.84 ± 0.52     1.73 ± 0.32   2.05 ± 0.43   1.72 ± 0.35   1.67 ± 0.39   1.51 ± 0.46
GCN     4.26 ± 0.17     3.24 ± 0.67   4.31 ± 0.51   3.59 ± 0.68   4.06 ± 0.52   3.51 ± 0.72
GAT     7.90 ± 0.73     6.81 ± 1.13   7.16 ± 1.66   8.01 ± 0.36   8.19 ± 0.66   7.73 ± 1.20

ROC_AUC Scores
Model   Missing labels  FN            FP            Neighbour     Bipartite     Cycle
KNN     56.17           56.01         49.64         56.07         56.17         56.17
LOG     68.39           68.25         57.89         69.02         67.46         67.74
RF      77.18           76.25         64.41         74.40         77.05         76.77
SVM     62.14           63.85         49.86         57.51         65.08         62.25
XGB     78.41           75.73         51.59         76.68         78.17         79.06
GSAGE   60.67 ± 2.72    54.89 ± 0.95  52.43 ± 1.30  57.13 ± 0.70  58.82 ± 1.24  60.79 ± 2.84
GCN     79.58 ± 0.30    80.23 ± 0.13  59.73 ± 2.35  79.34 ± 0.56  79.11 ± 0.64  76.35 ± 4.50
GAT     83.89 ± 1.32    82.89 ± 1.37  73.11 ± 4.36  82.51 ± 1.65  83.52 ± 0.98  83.19 ± 1.25

Model   Missing labels  Fan In        Fan Out       Ga-Sc         Sc-Ga         Stack
KNN     56.17           56.17         56.17         56.17         56.17         56.17
LOG     68.39           68.02         68.30         68.39         69.07         69.37
RF      77.18           76.70         76.02         76.79         76.30         77.77
SVM     62.14           62.71         61.71         62.14         62.95         62.09
XGB     78.41           76.07         80.51         79.58         78.78         79.32
GSAGE   60.67 ± 2.72    60.26 ± 2.77  59.23 ± 0.72  58.77 ± 1.08  59.11 ± 1.99  58.62 ± 1.86
GCN     79.58 ± 0.30    78.45 ± 1.68  79.49 ± 0.59  78.77 ± 1.25  79.53 ± 0.41  78.98 ± 1.51
GAT     83.89 ± 1.32    83.96 ± 0.77  82.17 ± 3.37  83.90 ± 0.48  84.14 ± 0.67  83.32 ± 0.70

4.3.8 HARD with 25% flipped labels

The results for 25% flipped labels are shown in table 4.10. Here, once again, the extreme class imbalance shows its negative effects across all models. SVM, KNN, LOG, and GraphSAGE all continue to show extremely poor performance on the AP metric, displaying no capability of finding the money laundering class. XGB, RF, and GCN perform at similar levels on both metrics, with GCN having an edge over XGB and RF. GAT is once again the model with the best performance across all results.

Compared with the previous 10% ratio, the FN and neighbour noises have the largest additional negative impact, while the FP noise has detrimental effects on all models, pushing performance down to the lowest levels, even for GAT. The noise types in the topology category appear to have little to no effect on the performance of any model.

4.3.9 Summary for HARD

Overall, for all HARD datasets the extreme class imbalance is evident across all models on the AP metric. All models except GAT are essentially incapable of detecting the money laundering class, with GAT showing only slight capability. RF, XGB, and GCN retain some performance in terms of AUC. GAT is once again the best-performing model on both metrics. The most damaging noises are FP and neighbour, both of which add false positives as inaccuracies. Among the noises in the topology category, there are no differences in performance.
Table 4.10: Results on the HARD datasets with 25% of the selected nodes' labels flipped

AP Scores
Model   Missing labels  FN            FP            Neighbour     Bipartite     Cycle
KNN     0.84            0.80          0.57          0.71          0.73          0.84
LOG     1.35            0.88          0.67          1.43          1.29          0.98
RF      2.43            1.70          0.66          2.16          2.11          2.47
SVM     0.94            0.80          0.53          0.78          0.80          0.92
XGB     2.57            1.51          0.63          3.12          2.07          2.30
GSAGE   1.84 ± 0.52     1.86 ± 0.49   0.59 ± 0.05   1.38 ± 0.10   1.54 ± 0.17   1.98 ± 0.40
GCN     4.26 ± 0.17     2.77 ± 0.38   1.08 ± 0.09   3.47 ± 0.56   3.77 ± 0.53   4.00 ± 0.20
GAT     7.90 ± 0.73     4.78 ± 0.64   0.61 ± 0.15   4.21 ± 1.16   6.37 ± 1.56   6.33 ± 1.49

Model   Missing labels  Fan In        Fan Out       Ga-Sc         Sc-Ga         Stack
KNN     0.84            0.84          0.84          0.84          0.83          0.78
LOG     1.35            1.43          1.15          1.35          1.33          1.43
RF      2.43            3.06          3.26          2.86          3.19          2.78
SVM     0.94            1.00          0.92          0.94          1.21          0.96
XGB     2.57            2.39          2.35          2.63          2.38          2.61
GSAGE   1.84 ± 0.52     1.96 ± 0.22   1.46 ± 0.32   1.81 ± 0.28   1.75 ± 0.14   1.59 ± 0.26
GCN     4.26 ± 0.17     3.42 ± 0.69   3.47 ± 0.51   3.57 ± 0.47   3.47 ± 0.59   4.34 ± 1.03
GAT     7.90 ± 0.73     8.11 ± 1.19   8.26 ± 0.28   7.31 ± 0.80   7.58 ± 1.39   7.49 ± 0.53

ROC_AUC Scores
Model   Missing labels  FN            FP            Neighbour     Bipartite     Cycle
KNN     56.17           55.22         48.73         54.86         54.18         56.17
LOG     68.39           62.83         56.78         69.82         67.57         63.76
RF      74.26           72.17         54.24         71.47         76.39         77.38
SVM     62.14           60.13         48.42         59.97         59.44         62.90
XGB     78.41           72.08         52.18         74.38         77.37         75.69
GSAGE   60.67 ± 2.72    54.56 ± 0.81  50.62 ± 1.66  58.42 ± 1.56  57.57 ± 1.42  59.61 ± 2.03
GCN     79.58 ± 0.30    78.38 ± 1.69  60.60 ± 4.26  71.82 ± 1.44  79.31 ± 0.87  79.71 ± 0.15
GAT     83.89 ± 1.32    80.35 ± 2.23  51.51 ± 5.79  76.38 ± 6.35  82.52 ± 1.75  82.82 ± 1.81

Model   Missing labels  Fan In        Fan Out       Ga-Sc         Sc-Ga         Stack
KNN     56.17           56.17         56.17         56.17         55.91         55.48
LOG     68.39           69.42         66.07         68.39         68.25         69.31
RF      74.26           76.76         76.82         76.90         75.19         74.89
SVM     62.14           62.83         62.56         62.14         65.01         62.57
XGB     78.41           77.38         77.70         78.06         78.58         77.69
GSAGE   60.67 ± 2.72    63.86 ± 1.57  59.14 ± 1.79  60.62 ± 1.70  59.80 ± 2.19  57.33 ± 1.07
GCN     79.58 ± 0.30    79.05 ± 0.81  79.79 ± 0.15  79.09 ± 0.41  78.74 ± 1.56  77.86 ± 1.45
GAT     83.89 ± 1.32    84.37 ± 0.57  84.18 ± 0.27  83.49 ± 1.08  83.37 ± 1.42  83.75 ± 0.55

5 Discussion

5.0.1 Implications of findings

In scenarios where a significant portion of the labels is missing, our results in table 4.4 demonstrate that GNNs, which are capable of incorporating and utilizing the graph structure, outperform other types of models such as XGBoost. Our comparative analysis revealed that GNNs leverage not only node features but also the underlying graph structure. This is particularly beneficial in the weakly supervised settings typical of AML applications, where obtaining complete and accurate labels can be complicated. The superior performance of GNNs in these conditions underscores their robustness and flexibility, making them a preferable choice for AML tasks with incomplete data.

It should be noted that while XGBoost achieves higher AUC scores in some instances in table 4.4, this does not present the full picture. When looking at AP, it becomes clear that XGBoost does not perform as well as GAT in identifying the positive labels.

Moreover, in scenarios where labels are both incorrect and missing (tables 4.5 to 4.10), GNNs, and particularly the GAT model, demonstrate an even greater advantage. This is a scenario that financial institutions are likely to encounter. GAT's ability to handle both types of label imperfection more effectively than the other models further supports its use in real-world AML efforts. The attention mechanisms within GAT allow it to prioritize relevant parts of the graph, enhancing its capability to make accurate predictions despite the presence of noisy and incomplete labels.
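To make this concrete, a minimal two-layer GAT for node classification in PyTorch Geometric could look as follows; the layer sizes, number of heads, and dropout rate are placeholder values, not the tuned configuration from this study.

    import torch
    import torch.nn.functional as F
    from torch_geometric.nn import GATConv

    class GAT(torch.nn.Module):
        def __init__(self, in_dim: int, hidden: int = 64, heads: int = 4):
            super().__init__()
            # Each attention head learns its own weighting of a node's
            # neighbours; the head outputs are concatenated.
            self.conv1 = GATConv(in_dim, hidden, heads=heads, dropout=0.2)
            # The second layer uses a single head to produce 2 class logits.
            self.conv2 = GATConv(hidden * heads, 2, heads=1, concat=False,
                                 dropout=0.2)

        def forward(self, x, edge_index):
            x = F.elu(self.conv1(x, edge_index))
            return self.conv2(x, edge_index)

    # The loss is computed only over labeled nodes, while message passing
    # still draws on the features of unlabeled neighbours, e.g.:
    # loss = F.cross_entropy(out[labeled_mask], y[labeled_mask])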
This suggests that implementing a model like GAT in a practical AML context could significantly improve the detection of suspicious activities, offering a more reliable and effective solution than traditional models.

Comparing the results for inaccurate labels across the specific money-laundering topologies shows minimal to no differences: inaccurate labels within a specific topology do not affect the models to a large extent. The largest difference in performance is found in table 4.8, where the fan-in and scatter-gather topologies for the GAT model received average AP scores of 21.80 and 21.86, while the other topologies received average AP scores of around 23 to 24. This suggests that, in a real-world scenario, correctly labeling these topologies could prove more important than correctly labeling the others.

Our results also show that significant class imbalance in the dataset is a crucial factor that must be addressed. All models struggled to handle the severe imbalance present in the HARD dataset in tables 4.9 and 4.10, resulting in very low AP and AUC scores. To tackle this problem, one potential approach is to use graph-based sampling methods like GraphSMOTE [60]. These methods generate synthetic samples within the graph structure, improving class balance and enhancing the representation of minority classes.

For financial institutions working with AML, this finding has important implications. High class imbalance is likely to be present in real-world AML datasets, where the number of illicit transactions is typically much smaller than the number of legitimate ones. This imbalance can severely affect the performance of AML models, leading to a high number of false negatives. The low evaluation scores observed for our models under these conditions highlight the need for strategies to mitigate the effects of class imbalance. Financial institutions may need to explore various methods to balance their datasets and enhance model performance in these scenarios. Addressing dataset imbalance is critical to improving the effectiveness of AML systems, ensuring that they can reliably detect money laundering activity. This, in turn, can lead to more efficient and accurate AML efforts.

5.0.2 Train and test split

In this thesis, we chose to use a time-based split for the training and test sets, with a 50/50 split, to prevent data leakage and ensure accurate model evaluation. This approach helps avoid skewed results and provides a clear assessment of model performance. However, in real-world applications, allowing some overlap between the training and test sets might yield better results, as overlap can ensure that sufficient graph structure is available to identify patterns. This is particularly relevant for financial institutions: if classification were performed on a smaller time window than the half-year period used in this thesis, it might be advantageous to include some overlap to maintain enough graph structure to accurately detect money laundering patterns.
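As an illustration, a time-based 50/50 split of this kind could be implemented as follows, assuming a pandas DataFrame of transactions with a hypothetical timestep column; the names are illustrative only.

    import pandas as pd

    def time_based_split(tx: pd.DataFrame, time_col: str = "timestep"):
        """Split transactions into train/test halves by time, so that no
        information from the test period leaks into training."""
        cutoff = tx[time_col].quantile(0.5)  # 50/50 split point in time
        train = tx[tx[time_col] <= cutoff]
        test = tx[tx[time_col] > cutoff]
        return train, test

Splitting on a time cutoff rather than on random indices is what prevents transactions from the test period from influencing training.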
Allowing such overlap is suggested in the paper "Realistic Synthetic Financial Transactions for Anti-Money Laundering Models" by Altman et al. [23]. The authors propose a data split where the validation graph includes both training and validation transactions, but only the validation indices are used for evaluation. The test graph encompasses all transactions, but only the test indices are utilized for evaluation. The train graph contains only the training transactions and the corresponding nodes. However, with this approach, data leakage from surrounding nodes outside of the chosen indices could still occur. To determine the best data-splitting strategy for a real-world scenario, one must therefore investigate different approaches and decide what works best for the specific data and goals.

5.0.3 Node classification vs edge classification

In this thesis, we have framed the task as node classification, meaning we aim to classify accounts, rather than transactions, as involved in money laundering. Alternatively, one could approach it as an edge classification task, focusing on classifying transactions as part of money laundering activities.

Opting for node classification could offer several advantages in real-life applications. Firstly, it could be more efficient in terms of time complexity. Real datasets typically contain significantly more transactions than accounts, and even if all transactions between two accounts are aggregated into a single edge, edge classification could still face greater scalability challenges than node classification. Node classification also leverages local neighborhoods: aggregating features from neighboring nodes is straightforward, usually requires fewer layers of computation, and reduces the risk of over-smoothing [61]. However, this approach also has limitations. It may focus primarily on the attributes of individual nodes and their immediate neighborhoods, potentially overlooking more complex relationships across the entire graph.

5.0.4 Ethical considerations

This study demonstrates the strength of GNNs in node classification within a weakly supervised setting and highlights the shortcomings of statistical models in the same context. It is important to note that no specific information about the systems used by particular banks, or the exact data they work with, has been disclosed, as this research was conducted using synthetic data with estimated features. The intent of this work is to emphasize the need for more robust models to improve the classification capabilities of AML systems and to show the potential of GNNs for AML purposes when a portion of the data is inaccurate or missing.

5.0.5 Future work

Future research could focus on novel approaches to enhance the classification capabilities of GNNs, especially when dealing with missing or inaccurate labels. This could involve building out their architecture or developing complementary methods to improve their performance in AML applications. A promising candidate for further exploration is GAT, as this study has shown that the model outperforms all other evaluated models in settings with inaccurate and incomplete labels. Beyond addressing missing and inaccurate l