In this post, we visualize the outlier detection results from Penny Analytics using two multivariate datasets. (If instead you are interested in visualizing time series outlier results, please see our earlier blogpost.)
Outliers in the Fisher Iris dataset
Let’s start with one of our free trial datasets, Fisher Iris data. You can download this dataset from our free trial page.
The dataset is quite simple, containing four numeric columns and one categorical column.
Numeric columns – sepal_length, sepal_width, petal_length, petal_width
Categorical column – species (one of three)
There are 150 records. The data was collected by hand. Two of the three species were collected in the Gaspé Peninsula in Canada “all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus”. If you go through our free trial, you can get both the data profiling report and the outlier results for this dataset.
In the outlier results, the outliers are ranked from 1 to 5, with 1 being the biggest outlier. The first and simplest thing to do with the results is to sort by the ranking and take a quick look. We go through sorting and conditionally formatting the Fisher Iris outlier results on our worked examples page. But in this blogpost we have gone further and visualized the outliers using Tableau:
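If you would rather do the sort-and-scan step in code than in a spreadsheet, a minimal sketch in Python might look like the following. It is an illustration only: the column names `outlier_rank` and `reason_code`, the assumption that unranked rows have an empty rank, and the ranks and reason text in the sample are all made up for the example and are not taken from a real results file.

```python
import csv
import io

# Illustrative stand-in for an outlier results file. The column names
# "outlier_rank" and "reason_code" are assumptions, and the ranks and
# reason text are invented for this example (not real results).
SAMPLE_CSV = """\
sepal_length,sepal_width,petal_length,petal_width,species,outlier_rank,reason_code
5.1,3.5,1.4,0.2,setosa,,
7.0,3.2,4.7,1.4,versicolor,2,example reason B
6.3,3.3,6.0,2.5,virginica,1,example reason A
4.9,3.0,1.4,0.2,setosa,,
5.8,2.7,5.1,1.9,virginica,3,example reason C
"""


def ranked_outliers(csv_text, rank_column="outlier_rank"):
    """Return only the rows that carry a rank, sorted with rank 1 first."""
    rows = csv.DictReader(io.StringIO(csv_text))
    # Assumption: rows that are not outliers have an empty rank field.
    ranked = [row for row in rows if row[rank_column].strip()]
    return sorted(ranked, key=lambda row: int(row[rank_column]))


for row in ranked_outliers(SAMPLE_CSV):
    print(row["outlier_rank"], row["species"], row["reason_code"])
```

To try this on a real download, read the file's text in place of `SAMPLE_CSV` and pass whatever the rank column is actually called as `rank_column`.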
You can play with the visualization yourself by going to our pages in Tableau Public. This will enable you to see the reason codes.
Outliers in the zoo dataset
The zoo dataset is another small one, with 101 records, and can be downloaded from the University of California at Irvine:
The dataset contains the name of an animal, 15 boolean columns (hair, feathers, eggs, etc.), one numeric column (number of legs), and a column containing the classification of the animal.
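To make that column breakdown concrete, here is a small sketch that profiles the columns of a zoo-style file. It rests on assumptions: the file is taken to have no header row, the column names are supplied by hand in the order listed, and the three rows are a tiny sample included only to show the layout. A column is called boolean when every value seen is 0 or 1, which is a simple heuristic and not necessarily how the Penny Analytics data profiling report decides types.

```python
import csv
import io
from collections import Counter

# Assumed column order for a header-less zoo-style file.
COLUMNS = [
    "animal_name", "hair", "feathers", "eggs", "milk", "airborne",
    "aquatic", "predator", "toothed", "backbone", "breathes", "venomous",
    "fins", "legs", "tail", "domestic", "catsize", "type",
]

# Three sample rows in that layout, for illustration only.
SAMPLE_ROWS = """\
aardvark,1,0,0,1,0,0,1,1,1,1,0,0,4,0,0,1,1
honeybee,1,0,1,0,1,0,0,0,0,1,1,0,6,0,1,0,6
chicken,0,1,1,0,1,0,0,0,1,1,0,0,2,1,1,0,2
"""


def profile_columns(csv_text, columns, name_column="animal_name",
                    class_column="type"):
    """Label each column as name, class, boolean or numeric."""
    rows = list(csv.DictReader(io.StringIO(csv_text), fieldnames=columns))
    kinds = {}
    for column in columns:
        values = {row[column] for row in rows}
        if column == name_column:
            kinds[column] = "name"
        elif column == class_column:
            kinds[column] = "class"
        elif values <= {"0", "1"}:
            # Heuristic: only 0/1 values seen, so treat as boolean.
            kinds[column] = "boolean"
        else:
            kinds[column] = "numeric"
    return kinds


kinds = profile_columns(SAMPLE_ROWS, COLUMNS)
print(Counter(kinds.values()))
```

On a sample this small the heuristic can mislabel a numeric column whose values all happen to be 0 or 1, so it is worth running it over the full file before trusting the counts.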
The Penny Analytics data profiling report and outlier results can be downloaded from here: zoo_penny_outliers_1395_172.csv
We have imported the outlier results into Tableau and here is our visualization:
The ranked outliers are highlighted in dark blue. The highest ranked outlier is the honeybee, with reason code “venomous unexpected”.
You can play with the visualization yourself by going to our pages in Tableau Public.
Seeing is believing, and we hope that this demonstration made the outliers found by Penny Analytics more tangible. The demonstration was really made possible by the simplicity of our datasets. Both datasets had a small number of records and relatively few columns. Also, almost all the columns were numeric or boolean.
Bear in mind that our outlier detection service can handle 200 columns, including date and categorical types. Visualizing high-dimensional data is notoriously difficult. But in those high-dimensional datasets, we use the same methods to find outliers. So, even if you cannot visualize them, those outliers are likely to be useful, whether in practice you are looking for data quality errors, suspicious transactions or faulty processes.
If you haven’t done so already, please run the Fisher Iris dataset through our free trial.