Anomaly Detection in Credit Card Transactions using Multivariate Generalized Pareto Distribution
Type
Master's thesis
Abstract
Billions of dollars are lost to fraudulent credit card transactions every year.
Many of these transactions are never noticed, which puts tremendous pressure on
the finances of the affected financial and credit institutions. In addition,
credit card usage, and with it e-business, is on the rise, which together with
newly developed data infringement methods poses a growing threat.
Research and progress in Machine Learning (ML) algorithms have been seen as a
useful tool for fraud investigators. However, robust frameworks that provide
accurate and reliable ML methods are still lacking. This thesis examines how the
Multivariate Generalized Pareto distribution (MGPD) performs at anomaly
detection on a pre-processed data set consisting of one month of credit card
transactions in Europe, compared with the supervised ML algorithm Feedforward
Fully Connected Neural Network (FFCNN) and the two unsupervised ML algorithms
Isolation Forest (IF) and Support Vector Machine (SVM). The data set was
pre-processed a priori by means of Principal Component Analysis (PCA). The MGPD
is fitted and simulated with independent Gumbel generators, and it is
constructed in three dimensions consisting of standard exponentially transformed
anomaly threshold excesses from the IF algorithm and from the L2 and L-supremum
metrics.
The comparison is mainly based on Precision-Recall (PR) curves and Receiver
Operating Characteristic (ROC) curves, together with the Area Under the ROC
curve (AUROC) and the Area Under the PR Curve (AUPRC). Most emphasis is placed
on the AUPRC value because the data set is highly imbalanced. The MGPD is found
to outperform both unsupervised algorithms, IF and SVM, under the assumption of
0.2% anomalies in the training set, whereas it slightly underperforms the IF
when 1% anomalies are assumed in the training set. The supervised FFCNN performs
best among all the models, owing to its supervised nature. Nevertheless, when
trained and tested on the same data set, the MGPD significantly outperforms both
unsupervised algorithms. The results of this thesis point to promising future
research on the MGPD in unsupervised anomaly detection.
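The emphasis on AUPRC for imbalanced data can be illustrated with a minimal sketch. The scores and labels below are a hypothetical toy example (998 legitimate transactions and 2 frauds, i.e. 0.2% anomalies), not data or code from the thesis, and the helper functions `auroc` and `average_precision` are illustrative names. Average precision is used here as a common step-wise estimate of the area under the PR curve.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as one half. This equals the area under the ROC curve.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(scores, labels):
    """Mean of the precision values at each rank where a positive is found.

    Assumes at least one positive and no tied scores.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Hypothetical anomaly scores: higher means "more anomalous".
# 998 legitimate transactions spread evenly over [0, 0.997].
scores = [i / 1000 for i in range(998)]
labels = [0] * 998
# 2 frauds (0.2% of the data) that score high, but not highest.
scores += [0.9905, 0.9505]
labels += [1, 1]

print(round(auroc(scores, labels), 3))              # 0.973
print(round(average_precision(scores, labels), 3))  # 0.083
```

In this toy case the detector ranks both frauds above more than 95% of legitimate transactions, so AUROC looks excellent, yet an investigator working down the ranked list would review 49 transactions to find the 2 frauds, which the much lower average precision reflects.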
Subject/keywords
Multivariate Generalized Pareto, Support Vector Machine, Artificial Neural Network, Isolation Forest, Unsupervised, Supervised, Anomaly, Credit Card, Fraud, Machine Learning