Fraud is a problem that is simply not going away. In this post, we apply outlier detection to find fraudulent credit card transactions, using a well-known dataset from Kaggle.
We called this post “fraud detection without tears”, after a classic book. Why do we say without tears? Because you too can detect fraud without custom models, data scientists or consultants. In other words, if you are just an ordinary person with some Excel skills, you can run your data through our service and detect fraud. Let’s get on with the demonstration.
The dataset contains transactions made with European credit cards over two days in September 2013. There are 284,807 transactions, of which 492 are fraudulent. So, fraud is rare, at 0.17% of all transactions or about 1 in 579 transactions.
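As a quick check, the base rate quoted above can be reproduced from the two counts. This is only a sketch of the arithmetic; the variable names are ours.

```python
# Counts quoted in the post for the Kaggle credit card dataset.
total_transactions = 284_807
fraud_transactions = 492

fraud_rate = fraud_transactions / total_transactions
one_in_n = total_transactions / fraud_transactions

print(f"Fraud rate: {fraud_rate:.2%}")          # Fraud rate: 0.17%
print(f"About 1 in {one_in_n:.0f} transactions")  # About 1 in 579 transactions
```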
There are 31 columns:
- Time – number of seconds elapsed between this transaction and the first transaction in the dataset
- V1 – V28 – a set of numeric columns. The columns have been deliberately disguised to maintain confidentiality. (This dataset is one of very few fraud detection datasets in the public domain.)
- Amount – the value (probably in euros) of the transaction
- Class – a flag, 0 if not fraudulent, 1 if fraudulent
Before uploading the file to Penny Analytics, we made three changes to the dataset:
- We transformed the Time variable into an unambiguous datetime, using midnight on 1 Sept 2013 as the baseline. This takes advantage of the datetime processing within Penny Analytics.
- Since this is a large dataset that has to be processed under the high volume/high security option, we added an integer rownumber as the first column of the dataset.
- We removed the Class variable. Because we are using outlier detection to find fraud, knowing the frauds would be cheating!
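If you want to make the same three changes yourself, they could be sketched roughly as follows using only the Python standard library. The column names `Time` and `Class` come from the dataset. The `rownumber` name, numbering from 1, and the output datetime format are our assumptions, and the sample rows are made up.

```python
import csv
import io
from datetime import datetime, timedelta

BASELINE = datetime(2013, 9, 1)  # midnight on 1 Sept 2013, as in the post


def prepare_rows(rows):
    """Apply the three changes to rows read with csv.DictReader."""
    prepared = []
    for rownumber, row in enumerate(rows, start=1):
        new_row = {"rownumber": rownumber}  # integer row number as first column
        for name, value in row.items():
            if name == "Class":
                continue  # drop the fraud label
            if name == "Time":
                # seconds since the first transaction -> unambiguous datetime
                stamp = BASELINE + timedelta(seconds=float(value))
                value = stamp.isoformat(sep=" ")
            new_row[name] = value
        prepared.append(new_row)
    return prepared


# Tiny made-up sample in the same layout (V2 to V28 omitted for brevity).
sample = io.StringIO("Time,V1,Amount,Class\n0,-1.36,149.62,0\n86400,1.19,2.69,1\n")
prepared = prepare_rows(csv.DictReader(sample))
for row in prepared:
    print(row)
```

The prepared rows can then be written back out with `csv.DictWriter` before uploading.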
We then uploaded the file to Penny Analytics and got our free data profiling report. Note that the data profiling report looks at a maximum of 50,000 records. Even so, this report gives us a good sense of the data and enables us to spot anything strange before going ahead with the outlier detection. Here is the data profiling report:
(To enable all features of the data profiling report, including toggle details, you will need to download it and open the downloaded copy.)
In the outlier results, the outliers are ranked from 1 to 8545, with 1 being the biggest outlier. Some ranks are ties, so in total 12,842 records (4.5% of all records) have been given an outlier score and reason code. The first thing we did with this results file was to reattach the Class column that tells us whether the transaction is fraudulent or not. We then sorted the results by the outlier ranking.
Let’s eyeball the top ranked outliers. You can see the outlier ranking as well as the reason codes provided by Penny Analytics. You can also see that these top outliers are all fraudulent. In fact, we have to get to record 27 before we find our first non-fraudulent record. That is encouraging!
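Reattaching the labels and sorting by rank is a simple join on the row number. The sketch below assumes the results have been loaded as a list of dictionaries with `rownumber` and `outlier_rank` keys. Those field names are hypothetical, not the actual layout of the Penny Analytics results file, and the data is made up.

```python
def attach_labels(results, labels):
    """Join outlier results to known fraud labels by rownumber, sorted by rank."""
    joined = [dict(r, Class=labels[r["rownumber"]]) for r in results]
    return sorted(joined, key=lambda r: r["outlier_rank"])


def first_non_fraud_position(ranked):
    """Return the 1-based position of the first non-fraudulent record, if any."""
    for position, row in enumerate(ranked, start=1):
        if row["Class"] == 0:
            return position
    return None


# Made-up results (hypothetical field names) and labels keyed by rownumber.
results = [
    {"rownumber": 3, "outlier_rank": 2},
    {"rownumber": 1, "outlier_rank": 1},
    {"rownumber": 2, "outlier_rank": 3},
]
labels = {1: 1, 2: 0, 3: 1}

ranked = attach_labels(results, labels)
print(first_non_fraud_position(ranked))  # 3
```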
Next, we took the top ranked outliers and put them into groups of 100. In the first 100 records we found 80 frauds, i.e. 16% of all 492 frauds. The following table shows the results for the next 100 outliers, and so on:
As can be expected, the results diminish as we go deeper into the outliers. There appears to be a natural break at the interval 1301 – 1400. From this interval onward, the fraud rate goes below 10% and stays there. In the interval 1301 – 1400 itself, the fraud rate is 5%. This means that, of the transactions in this interval, 5% are actual frauds, while 95% are false alarms (also called false positives). In practice, we do not want to have too many false alarms. Each suspicious case needs to be worked and there is a cost to this: the fraud analyst’s time and also potential inconvenience to the customer. Imagine if you were a fraud analyst and you were given a “hotlist” but in fact only a very small fraction of that list was real fraud.
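The grouping of ranked outliers into blocks of 100 can be sketched as below. It takes the 0/1 fraud flags in rank order and reports the frauds and fraud rate per group. The function name is ours, and the labels are made up (ten records in groups of five) purely to show the mechanics.

```python
def fraud_by_group(labels_in_rank_order, group_size=100):
    """Summarise frauds in consecutive groups of ranked outliers.

    labels_in_rank_order: 0/1 fraud flags, ordered from the biggest outlier down.
    Returns a list of (first, last, frauds, fraud_rate) tuples.
    """
    summary = []
    for start in range(0, len(labels_in_rank_order), group_size):
        group = labels_in_rank_order[start:start + group_size]
        frauds = sum(group)
        summary.append((start + 1, start + len(group), frauds, frauds / len(group)))
    return summary


# Made-up labels for illustration: 10 ranked records in groups of 5.
labels = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]
for first, last, frauds, rate in fraud_by_group(labels, group_size=5):
    print(f"{first}-{last}: {frauds} frauds ({rate:.0%})")
```

A group whose fraud rate drops well below the earlier groups is a candidate for the cutoff.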
So, a natural cutoff would be to work the first 1300 outliers only. This would give us a marginal fraud rate of 13%, an overall fraud rate of 23% (298 of the 1300 records worked) and would capture 298 frauds, 61% of the total. Furthermore, the fraud analysts would have access to the outlier ranks and reason codes when working their lists.
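The overall fraud rate and the share of frauds captured at this cutoff follow directly from the counts in the post. In machine learning terms these are often called precision and recall. The variable names below are ours.

```python
# Counts quoted in the post for a cutoff at the first 1300 outliers.
records_worked = 1300
frauds_caught = 298
total_frauds = 492

overall_fraud_rate = frauds_caught / records_worked  # share of worked cases that are fraud
capture_rate = frauds_caught / total_frauds          # share of all frauds that are found
false_alarms = records_worked - frauds_caught

print(f"Overall fraud rate: {overall_fraud_rate:.0%}")  # Overall fraud rate: 23%
print(f"Frauds captured: {capture_rate:.0%}")           # Frauds captured: 61%
print(f"False alarms to work: {false_alarms}")          # False alarms to work: 1002
```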
Is this the best possible solution? Well, if you are a large financial services company, you will have invested in data science teams and you should be able to beat this solution. You can do this by building several model types, including outlier detection, but also including predictive models that use the information contained within the actual fraud cases. It takes a considerable amount of expertise and effort to model for fraud and similar problems like anti-money laundering, but this is worth it if your organization is large enough.
Why, then, are we suggesting this solution? An outlier detection approach to fraud detection overcomes many issues:
- Since actual fraud is rare, you may not have enough actual fraud cases to build a predictive model. Remember that not all fraud is found quickly, if ever. An example is this episode involving group medical benefits. The employer was being defrauded by its employees, and the insurance company was the go-between.
- Fraud schemes may change rapidly over time, too quickly to be captured by predictive models.
- The data available about transactions may be changing over time, too quickly to be captured by predictive models.
- You simply may not be in a position to invest in custom models for your data, or the data scientists to build them.
- Finally, you might need to find your fraud issues today, not after a long machine learning project is done!
Do you suspect your data contains fraud, but do not know where to start? At Penny Analytics, you can learn how our outlier detection service works by taking advantage of our free trial.
Fraud is a sensitive issue for many organizations. At Penny Analytics, we keep your data securely in the cloud and follow strict data retention policies. If you upload a file to our website, you can choose to delete it immediately, or wait until the file expires in one month’s time. If you go ahead with outlier detection, then you can choose to get your data deleted after a day, or wait until the file expires in one month’s time. Our data retention processes are automated and have been run hundreds of times. You will find this is much more transparent than handing your data over to a typical consultant. Learn more about how to get started.
The Penny Analytics credit card outlier file is here: