Have you ever had a school detention? At Penny Analytics, we have to admit that we have some experience in that department. If you were lucky, they just let you do your homework (which needed to be done anyway). If you were unlucky, then they gave you a tedious task like writing lines. So, you might be asked to write “I will not interrupt my math teacher” one hundred times. This was indeed a tedious task, but at least it was not difficult.

The human eye has evolved to be a powerful organ. But to the human eye, outlier detection is both tedious and difficult. This is why radiology is a medical specialty. In today’s post, we look at a dataset full of images and run it through our outlier detection system.

The dataset is called “digits” and can be found on data.world:

Handwritten digits dataset

This dataset is typically used to train handwriting recognition models. It has 64 columns of ink densities, one for each pixel in an 8×8 grid, and together these columns describe the image itself. A final column holds the label, which is a digit between 0 and 9.
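To make the layout concrete, here is a minimal Python sketch of how one row could be turned back into a picture. It assumes the 64 pixel columns are stored row by row (left to right, top to bottom) with the label last; the function names `split_record` and `render` are ours, and the record below is synthetic, not a row from the dataset.

```python
def split_record(record):
    """Split one 65-value row into an 8x8 pixel grid and its digit label.

    Assumes the first 64 values are pixels in row-major order and the
    last value is the label.
    """
    if len(record) != 65:
        raise ValueError("expected 64 pixel values plus 1 label")
    pixels, label = record[:64], record[64]
    grid = [pixels[r * 8:(r + 1) * 8] for r in range(8)]
    return grid, label


def render(grid, shades=" .:*#"):
    """Draw the grid as text, using darker characters for more ink."""
    top = max(max(row) for row in grid) or 1  # avoid dividing by zero
    return "\n".join(
        "".join(shades[v * (len(shades) - 1) // top] for v in row)
        for row in grid
    )


# Synthetic record for illustration only: a hollow square of ink, labelled 0.
# The value 16 is an arbitrary stand-in for "full ink".
synthetic = [0] * 64
for r in range(1, 7):
    for c in range(1, 7):
        if r in (1, 6) or c in (1, 6):
            synthetic[r * 8 + c] = 16
synthetic.append(0)

grid, label = split_record(synthetic)
print(label)
print(render(grid))
```

Printing a few rows this way is a quick check that the pixel columns have been read in the right order before any analysis is run.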

Here is what the first record in the dataset represents:

Handwritten digit zero

And here is the data profiling report for this dataset:

optdigits_test_profiling_report.html

(To enable all features of the data profiling report, including toggle details, you will need to download it and open the downloaded copy.)

Unlike many business datasets, it is hard for us to interpret these columns directly, since they each represent a pixel. Instead, we are going to look at the images themselves.

Let’s start by looking at all the images in this dataset, but without the labels. This is quite an eyeful, and if your detention teacher asked you to find outliers in this set of digits you would be right to be horrified, snowflake or not.




download digits_onepage.pdf

We put this data (without labels) through our outlier detection system and here are the results. We are highlighting the top 2% of outliers in blue.
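For readers who want to see what "top 2%" means mechanically, here is a rough Python sketch. It is not our outlier detection system: the score used here (distance of each row from the column-wise average) is a deliberately simple stand-in, and the names `centroid_distance_scores` and `top_fraction` are made up for this example. Only the final step, ranking by score and keeping the highest 2%, mirrors what the highlighting shows.

```python
import math


def centroid_distance_scores(rows):
    """Stand-in outlier score: Euclidean distance from the column means.

    This is an illustrative assumption, not the scoring used in the post.
    """
    n_cols = len(rows[0])
    means = [sum(r[c] for r in rows) / len(rows) for c in range(n_cols)]
    return [
        math.sqrt(sum((r[c] - means[c]) ** 2 for c in range(n_cols)))
        for r in rows
    ]


def top_fraction(scores, fraction=0.02):
    """Return the row indices with the highest scores (at least one row)."""
    k = max(1, math.ceil(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


# Tiny synthetic example: nine identical rows and one that differs.
rows = [[0, 0]] * 9 + [[10, 10]]
scores = centroid_distance_scores(rows)
flagged = top_fraction(scores)
print(flagged)
```

Any scoring method could be swapped in for the stand-in; the cut-off step stays the same.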




download digits_global_outliers1_onepage.pdf

The first thing to notice is that none of the outliers are 0, 1, 5 or 8 but there are a number of outliers that are 4 or 7. So, fours and sevens may have quite different shapes than most of the other digits. But there is also evidence of clustering of outliers. It turns out only 13 individuals provided handwriting samples for this dataset, so that means a person would have written each digit several times. We can definitely also see some distinct handwriting styles being highlighted.

To make the problem easier, let’s now put the dataset including the labels through outlier detection. This is what the problem looks like now:




download digits_ordered1_onepage.pdf

If your detention teacher asked you to find outliers in this set of digits, you might think this was doable at first. But then, after working the problem for a bit, your doubts would creep in and you might worry that you will never finish the detention. We put this data (with labels) through our outlier detection system and here are the results. We are highlighting the top 2% of outliers in blue.




download digits_ordered_outliers1_onepage.pdf

Including the labels gives the outlier detection system more context to work with, but as there are 64 image columns and one label column, perhaps we shouldn’t be surprised that the system is picking up many of the same cases. In fact, the overlap is 78%.
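As a rough illustration of how an overlap figure like this can be computed, the Python sketch below takes two sets of flagged row numbers and reports the share of the first set that also appears in the second. This is one reasonable definition, assumed here for the example; when both runs flag the same number of rows, it gives the same answer in either direction. The function name `overlap_share` and the row numbers are hypothetical.

```python
def overlap_share(flagged_a, flagged_b):
    """Share of rows flagged in run A that were also flagged in run B."""
    a, b = set(flagged_a), set(flagged_b)
    if not a:
        raise ValueError("run A flagged no rows")
    return len(a & b) / len(a)


# Hypothetical row numbers, for illustration only.
without_labels = {12, 40, 77, 103}
with_labels = {40, 77, 103, 250}
share = overlap_share(without_labels, with_labels)
print(f"{share:.0%}")
```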

This dataset contained fewer than 1,800 rows and we were able to represent it on one page, but even then it was hard for the human eye to take in. Your business data probably contains many more rows than this. At Penny Analytics, we can handle approximately 100 million cells of data at a time. Would you like to know what’s hiding in your data? You can get started by downloading the free trial datasets on our website.


optdigits_test_nolabel_penny_outliers_1251_153

optdigits_test_penny_outliers_1251_152


Copyright © 2020 Penny Analytics Limited All rights reserved.