Anomaly Detection in Credit Card Transactions using Multivariate Generalized Pareto Distribution

Master's thesis
Engineering mathematics and computational science (MPENM), MSc
Muameleci, Kubilay
Billions of dollars are lost to fraudulent credit card transactions every year. Many of these transactions are never noticed, which puts tremendous pressure on the financial and credit institutions concerned. In addition, the use of credit cards, and thus of e-business, is on the rise, which, together with newly developed methods of data infringement, increases the threat. Progress in Machine Learning (ML) algorithms has been seen as a useful tool for fraud investigators. However, robust frameworks that provide accurate and reliable ML methods are still lacking. This thesis examines how the Multivariate Generalized Pareto distribution (MGPD) performs at anomaly detection on a pre-processed data set of credit card transactions made in Europe over one month, compared to the supervised ML algorithm Feedforward Fully Connected Neural Network (FFCNN) and the two unsupervised ML algorithms Isolation Forest (IF) and Support Vector Machine (SVM). The data set was pre-processed beforehand by means of Principal Component Analysis (PCA). The MGPD is fitted and simulated with independent Gumbel generators. It is constructed in three dimensions, consisting of anomaly threshold excesses from the IF algorithm and from the L2 and L-supremum metrics, each transformed to the standard exponential scale. The comparison is mainly based on Precision-Recall (PR) curves and Receiver Operating Characteristic (ROC) curves, together with the area under the ROC curve (AUROC) and the area under the PR curve (AUPRC). Most emphasis is put on the AUPRC value because the data set is highly imbalanced. The MGPD is found to outperform both unsupervised algorithms, IF and SVM, under the assumption of 0.2% anomalies in the training set. Under the assumption of 1% anomalies in the training set, it slightly underperforms the IF. The supervised FFCNN performs best among all the models, owing to its supervised nature. Nevertheless, when trained and tested on the same data set, the MGPD significantly outperforms both of the unsupervised algorithms. The results of this thesis indicate that the MGPD is a promising subject for future research within unsupervised anomaly detection.
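The emphasis on AUPRC for a highly imbalanced data set can be illustrated with a small sketch. It is not taken from the thesis: the toy data (1000 transactions, 2 of them fraudulent, i.e. 0.2%), the scores, and the helper names `auroc` and `average_precision` are all assumptions made for illustration, and average precision is used here as one common step-wise estimate of the area under the PR curve.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 1/2)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each true positive.

    Assumes at least one positive and no tied scores (hypothetical helper).
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Toy data (assumed): 998 legitimate transactions with evenly spread anomaly
# scores, plus 2 frauds that score high but not highest.
labels = [0] * 998 + [1, 1]
scores = [k / 1000 for k in range(998)] + [0.9905, 0.9505]

roc = auroc(labels, scores)
ap = average_precision(labels, scores)
print(f"AUROC = {roc:.3f}, average precision = {ap:.3f}")
# AUROC = 0.973, average precision = 0.083
```

In this toy case the two frauds are outranked by only 7 and 47 of the 998 legitimate transactions, so the AUROC looks excellent, yet a fraud investigator working down the ranked list would mostly see false alarms, which is what the low average precision reflects.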
Multivariate Generalized Pareto, Support Vector Machine, Artificial Neural Network, Isolation Forest, Unsupervised, Supervised, Anomaly, Credit Card, Fraud, Machine Learning