In this post, we apply outlier detection to the checking of expense claims. Expense claims in companies often need a second look. According to the Economist, employees’ expense claims are becoming more unusual.
There are also different personality types when it comes to submitting expenses. At Penny Analytics, we know people who would rather incur the expense themselves than go through the paperwork. On the opposite end, there are employees who start going overboard as soon as they start a business trip. And the advice “spend the money as if it was your own” does not in fact always work, as some people simply like to live it up all the time.
Checking expenses is clearly one way that a company can control costs. Companies also need to verify that the expenses are allowed for tax purposes. At Penny Analytics, we have worked in places where our expenses got passed on to clients, so it was important that those expenses were uncontroversial. By checking expenses, companies may find simple human errors, wasteful spending or, at worst, fraud. This sort of checking is a continuous process, but unfortunately it is often manual and tedious. Ensuring that claims adhere to policy rules does help, but it does not go far enough.
Checking expenses is also a typical audit function, part of the larger task in which auditors have to check transactions on the general ledger. A common approach is to randomly sample the transactions and go from there. Machine learning and AI can change this by looking at 100% of the transactions and then prioritizing the riskiest ones. This is the exercise we are going to go through in this post.
Our dataset is from the government of Canada: “The rules and principles governing travel are outlined in the Policies for Ministers’ Offices and in Treasury Board’s Travel Directive and Special Travel Authorities. This dataset consolidates all the Travel Expense reports submitted by federal institutions.” These expense claims have been approved, paid and published i.e. we are not expecting to find any bombshells.
The file we worked with is the “Proactive Disclosure – Travel Expenses” in CSV format, and it includes expense claims from the very end of 2013 to 2019.
We made a few initial changes to the file:
• In Excel we removed columns that were just French translations of English columns
• We calculated a new column called num_days, the difference between the end date and the start date, using the Excel formula IF(ISNUMBER(G2-F2),IF(G2-F2>0,G2-F2,0),0). If the end date is missing, or is not after the start date, num_days defaults to zero.
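If you prefer to do this preparation outside Excel, the same num_days logic is easy to replicate in pandas. This is a sketch on a small made-up frame; the column names are our own, not the dataset's:

```python
import pandas as pd

# Sample claims: a normal trip, a missing end date, and an inverted date pair
df = pd.DataFrame({
    "start_date": pd.to_datetime(["2016-09-12", "2014-10-14", "2016-11-01"]),
    "end_date": pd.to_datetime(["2016-09-13", None, "2016-10-30"]),
})

# Mirror the Excel formula: day difference, defaulting to 0 when the
# end date is missing or the difference is not positive
delta = (df["end_date"] - df["start_date"]).dt.days
df["num_days"] = delta.where(delta > 0, 0).fillna(0).astype(int)
```

The `where(delta > 0, 0)` call handles both the negative case and the missing case, since a comparison against NaN is false.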
The resulting file has over 62,000 records and 19 columns.
We then uploaded the file to the Penny Analytics website. The upload feedback report highlighted two issues and led us to make the following changes:
• We made the date formats unambiguous by applying the custom format “DD-MMM-YYYY” in Excel
• We saved the file from Excel, reopened it in Notepad++, converted it to UTF-8 and resaved
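Both fixes can also be scripted. The sketch below applies the same two changes with pandas, using an in-memory stand-in for the real file (the column names and sample rows are assumptions for illustration):

```python
import io
import pandas as pd

# A two-row stand-in for the real CSV (column names are assumed)
raw = io.StringIO("start_date,end_date\n2016-09-12,2016-09-13\n2014-10-14,\n")
df = pd.read_csv(raw)

# Render dates in the unambiguous DD-MMM-YYYY form
for col in ["start_date", "end_date"]:
    df[col] = pd.to_datetime(df[col], errors="coerce").dt.strftime("%d-%b-%Y")

# Writing with encoding="utf-8" replaces the Excel -> Notepad++ round trip
df.to_csv("travelq_utf8.csv", index=False, encoding="utf-8")
```

Passing `errors="coerce"` keeps unparseable or missing dates as blanks rather than raising, which matters for a file like this one with a few malformed end dates.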
Next, we looked at the free data profiling report. The issues raised there were as follows:
Issue 1 – one really high airfare
Issue 2 – four totals that are too high and follow a pattern
Issue 3 – three end dates that were not valid dates
Issue 4 – one really high lodging amount
Issue 5 – one really high meals amount
Here is a list of the bad records:
• T-2017-Q4-0045,MPSES,Press Secretary,Alexander Wellstead,Staffing MVA for meeting & events with stakeholders/ Media,07-Jan-2018,10-Jan-2018,Ottawa-Calgary-Victoria-Vancouver-Toronto-Ottawa,1.83317E+18,48.52,664.12,237.15,31.08,3354.85,,,vac-acc,Veterans Affairs Canada | Anciens Combattants Canada,4
• T-2016-Q2-00010,,"Commissioner (from September 12th to September 13th, 2016) ","Tedesco, Lucie",Meeting with Financial Institutions,12-Sep-2016,13-Sep-2016,"Toronto, Ontario, Canada",475.14,16.05,416.56,85.4,42,309309,,,fcac-acfc,Financial Consumer Agency of Canada | Agence de la consommation en matière financière du Canada,2
• T-2016 Q2-00014,,"Deputy Commissioner (November 1, 2016)","Goulard, Brigitte",Meeting with Canadian Bankers Association ,01-Nov-2016,01-Nov-2016,"Toronto, Ontario, Canada",508.76,24.42,0,34.25,21,313313,,,fcac-acfc,Financial Consumer Agency of Canada | Agence de la consommation en matière financière du Canada,1
• T-2016 Q2-00015,,"Deputy Commissioner (from September 30 to October 6, 2016)","Goulard, Brigitte",Attended OECD conference,30-Sep-2016,06-Oct-2016,"Paris, France",2051.99,317.5,501.28,644.43,0,314314,,,fcac-acfc,Financial Consumer Agency of Canada | Agence de la consommation en matière financière du Canada,7
• T-2016 Q2-00016,,"Deputy Commissioner (From October 20 to October 21, 2016)","Goulard, Brigitte",Meeting with Financial Institution and attended the FCAC focus group,20-Oct-2016,21-Oct-2016,"Montréal, Québec, Canada",0,238.98,224.91,114.4,21,315315,,,fcac-acfc,Financial Consumer Agency of Canada | Agence de la consommation en matière financière du Canada,2
• T-2017-Q4-0047,MPSES,Press Secretary,Alexander Wellstead,Staffing MVA for meetings & events with stakeholders/ Media,19-Feb-2018,21-Feb-2018,Ottawa-Moncton-Fredericton-Ottawa,774.56,0,3228626.16,199.15,,1270.41,,,vac-acc,Veterans Affairs Canada | Anciens Combattants Canada,2
• T-2018-Q1-00001,,Vice-President - Operations Sector,Heather Smith,Site visit to Site C Dam (Fort St. John B.C.),23-Apr-2017,26-Apr-2017,"Vancouver, British Columbia",2264,82.65,622.57,865324.6,0,3380.32,,,iaac-aeic,Impact Assessment Agency of Canada | Agence d'évaluation d'impact du Canada,3
• TR-2014-Q3-00018,, Financial Literacy Leader ,"Rooney, Jane",Guest speaker at Credit Counselling Canada's Stakeholders Conference,14-Oct-2014,0001-01-01,Toronto,454.77,0,0,15.75,194.6,665.12,,,fcac-acfc,Financial Consumer Agency of Canada | Agence de la consommation en matière financière du Canada,0
• T-2015-Q4-5746,,President,Mr. John McDougall,Parking ,04-Nov-2015,1156-12-09,"Ottawa, ON",0,65,0,0,0,65,,,nrc-cnrc,National Research Council Canada | Conseil national de recherches Canada,0
• T-2016-Q4-6015,,President,Mr. Iain Stewart,Parking and Mileage,30-Nov-2016,1116-12-08,"Ottawa, ON",,79.28,,,,79.28,,,nrc-cnrc,National Research Council Canada | Conseil national de recherches Canada,0
We decided to drop the bad records. The airfare of $1,833,171,084,810,840,000 in particular had real potential to throw off the outlier detection!
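We removed the listed records by hand, but the same cleanup can be scripted. This is a minimal sketch on a toy frame, assuming our own column names and an arbitrary plausibility threshold of our choosing, not a published rule:

```python
import pandas as pd

# Toy frame with one plausible airfare and the runaway 1.83317E+18 value
df = pd.DataFrame({
    "ref_number": ["T-2016-Q2-00010", "T-2017-Q4-0045"],
    "airfare": [475.14, 1.83317e18],
})

# Keep only records whose airfare is below an (arbitrary) sanity threshold
clean = df[df["airfare"] < 1e6]
```

An alternative is to drop by reference number, which is safer when you have already reviewed the specific records and want a reproducible audit trail.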
It should be noted that the data profiling report looks at a maximum of 50,000 records. So, if your dataset is larger than this, it may not catch all such issues and you may need to do some checks of your own before proceeding with outlier detection. But the aim at this stage is to catch the obvious issues. We can leave the less obvious ones to the outlier detection.
Here is the final data profiling report after we had made all the changes to our file:
We then submitted our file for outlier detection. Let’s look at the top 10 ranked outliers.
Five of the top 10 are due to unexpected year, and refer to the handful of claims with a start date beginning in 2013. We have greyed these out. Although they are outliers, they are not of interest.
Rank 1 outlier – This is a 12 day Canadian trade delegation to Lebanon flagged for being expensive. The submitter annotated the claim in the purpose column.
Rank 6 outlier – This is a five day Arctic training program flagged for being expensive and having high “other expenses”. The submitter annotated the claim by saying that meals and accommodation are included in the other expenses.
Rank 7 outlier – This is a professional development course that the submitter misdated. The course should have been three days long but the calculated number of days is three years plus three days.
Rank 9 outlier – This is a claim for an annual parking pass at Ottawa airport. It has been flagged because the number of days (365) is long and the end date is unusual. The end date is next year (2020).
Rank 10 outlier – This is a 15 day trip to Geneva that has been flagged for having expensive lodging and meals.
As expected, no bombshells here. The next 10 outliers look a lot like rank 7, in which the dates appear wrong. This is often the case: outlier detection reveals as many data quality issues as anything else.
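The internals of the Penny Analytics service are not published, but the general idea of scoring 100% of transactions and surfacing the most anomalous ones can be illustrated with a simple stand-in detector. The sketch below ranks synthetic claims (entirely made-up numbers, loosely inspired by this dataset's columns) by their largest per-column robust z-score:

```python
import numpy as np

rng = np.random.default_rng(0)

# 500 synthetic "typical" claims: airfare, lodging, meals, num_days
X = np.column_stack([
    rng.normal(600, 150, 500),   # airfare
    rng.normal(400, 100, 500),   # lodging
    rng.normal(150, 40, 500),    # meals
    rng.integers(1, 6, 500),     # num_days
]).astype(float)

# ...plus one extreme claim, appended at index 500
X = np.vstack([X, [9000.0, 6000.0, 2500.0, 12.0]])

# Stand-in detector: robust z-score per column (median / MAD),
# then rank every record by its worst column
med = np.median(X, axis=0)
mad = np.median(np.abs(X - med), axis=0)
z = np.abs(X - med) / (1.4826 * mad)
scores = z.max(axis=1)
ranking = np.argsort(scores)[::-1]   # most anomalous record first
```

A production detector would be multivariate and far more sophisticated, but the workflow is the same: score everything, sort, and review from the top of the ranking down.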
The full results from the outlier detection are here: travelq_datesfixed_utf8_badrecordremoved_penny_outliers_1466_186
Do you work with transactional data and need to find unusual records? Or are you responsible for data quality? At Penny Analytics, we don’t want to make any “outrageous claims”. But we do think that our outlier detection service is one of the fastest ways that you can get the benefits of AI and machine learning when you audit transactional data. If you haven’t done so already, register and get started with our free trial datasets.