Here at Penny Analytics, we can’t tell a fender from a bumper, but we do know our way around a dataset. We can at least help you find an unusual used car. Today’s dataset is used car listings from Craigslist.
The original dataset is a CSV file called “CraigslistVehiclesFull” (1.7 million records, 26 columns) and can be found on Kaggle:
At 550 MB, it is just over our file size limit (500 MB), so we narrowed it to listings from the state of New York. This gave us about 71,000 records and 26 columns.
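For readers who want to do the same narrowing themselves, here is a minimal sketch of one way to do it, not necessarily how we did it. It streams the file row by row, so the full 550 MB never has to sit in memory. The function name, the column name "state_code" and the file names in the comment are assumptions for illustration; check the header of your copy of the file.

```python
import csv


def filter_csv_by_value(src, dst, column, value):
    """Copy rows whose `column` equals `value` (ignoring case) from src to dst.

    `src` and `dst` are open text file objects. Returns the number of rows kept.
    """
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    kept = 0
    for row in reader:
        # Missing cells come back as None or "", so normalise before comparing.
        cell = (row.get(column) or "").strip().lower()
        if cell == value.lower():
            writer.writerow(row)
            kept += 1
    return kept


# Hypothetical usage (file names and column name are assumptions):
# with open("CraigslistVehiclesFull.csv", newline="", encoding="utf-8") as src, \
#      open("craigslist_ny.csv", "w", newline="", encoding="utf-8") as dst:
#     print(filter_csv_by_value(src, dst, "state_code", "NY"))
```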
Our data profiling report has flagged a number of potential issues. This is not surprising, since Craigslist gives sellers a lot of freedom in what they enter in an ad. There are many missing values in fields such as condition, odometer, size, paint colour and cylinders. We can sympathise: we don’t know how many cylinders we have either.
The data profiling report also gives us details for each column in the dataset. For example, the most popular paint colour (apart from missing values) is black. We can’t remember which movie it was, but we already knew that New Yorkers were fond of black: there was a scene where they came to a wedding dressed in black, with black sunglasses. Anyway, here is the data profiling report:
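If you would like a rough version of the missing-value check without our report, the sketch below counts empty cells per column. It is an illustration only and is not how our profiling report is produced. The function name and the list of tokens treated as "missing" are our assumptions.

```python
import csv


def missing_counts(src, missing_tokens=("", "nan", "na", "null")):
    """Count missing cells per column in an open CSV text file.

    Returns (total_rows, {column_name: missing_count}).
    """
    reader = csv.DictReader(src)
    counts = {name: 0 for name in (reader.fieldnames or [])}
    total = 0
    for row in reader:
        total += 1
        for name, cell in row.items():
            if name is None:
                # Extra cells beyond the header are ignored.
                continue
            # Short rows yield None; otherwise compare against the token list.
            if cell is None or cell.strip().lower() in missing_tokens:
                counts[name] += 1
    return total, counts
```

Dividing each count by the row total gives the share of listings that left a field blank.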
(To enable all features of the data profiling report including toggle details, you will need to download it and open it from there.)
Moving on to the outlier results: this is a standard dataset rather than a time series, so the first thing we usually do is sort the results by outlier rank and look at the biggest outliers. The #1 outlier is a Volvo truck painted orange with 10 million miles on the odometer. The reasons for it being flagged include the number of cylinders (10), the odometer reading and the paint colour. This record highlights what outlier detection is often about. Is the record entirely bogus? At the very least, there are some data quality issues. Funnily enough, we can find the truck’s VIN (4v4m19gf35n378129) on the internet, and it is indeed an orange tractor, but it has only six cylinders. The odometer reading is not given, but it is probably not 10 million miles. So, this truck is the real orange. But is this orange in fact a lemon?
Other outliers include some enterprising Canadians trying to sell their cars over the border. Or, if you are a James Bond type, there is an Aston Martin with more cylinders than the Volvo, in excellent condition and in black, obviously.
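The sorting step described above can be sketched in a few lines for anyone working with a results file outside a spreadsheet. The function name and the column name "outlier_rank" are hypothetical; use whatever the rank column is called in your file. Rows without a usable rank are skipped.

```python
import csv


def top_outliers(src, rank_column="outlier_rank", n=10):
    """Return the n rows with the smallest rank (rank 1 = biggest outlier).

    `src` is an open CSV text file; `rank_column` is an assumed column name.
    """
    ranked = []
    for row in csv.DictReader(src):
        try:
            rank = int(float(row[rank_column]))
        except (KeyError, TypeError, ValueError, OverflowError):
            # No rank column, empty cell, or a non-numeric value: skip the row.
            continue
        ranked.append((rank, row))
    ranked.sort(key=lambda pair: pair[0])
    return [row for _, row in ranked[:n]]
```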
Here is the results file (outliers only):
We have also featured this used cars dataset in our video about outlier detection in Excel: