In this post we use a public dataset called FiSC (Fiscally Standardized Cities). This dataset is available from the Lincoln Institute of Land Policy here:
According to their website: “The FiSC database makes it possible to compare local government finances for 150 of the largest U.S. cities across more than 120 categories of revenues, expenditures, debt, and assets. The FiSC estimates are critical for making meaningful fiscal comparisons at the city level, because the delivery of public services is organized in very different ways in different cities. While some city governments provide a full array of public services for their residents and businesses, others share the responsibility with a variety of overlying independent governments. Fiscal comparisons across central city municipal governments alone can thus be highly misleading.”
In other words, the Lincoln Institute has put together a dataset that makes comparisons between city finances fairer. Another way it does this is by presenting the dollar amounts per capita. So, for example, the average fiscally standardized city (FiSC) raises about $6,000 per person in revenue and spends a similar amount.
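To make the per-capita idea concrete, here is a minimal pandas sketch of the normalization. The column names and dollar figures are made up for illustration; they are not the FiSC dataset's actual values:

```python
import pandas as pd

# Hypothetical cities with made-up totals; the FiSC data ships
# already normalized, but this is the arithmetic behind it.
df = pd.DataFrame({
    "city": ["A", "B"],
    "rev_total": [1_200_000_000, 450_000_000],   # total dollars (illustrative)
    "city_population": [200_000, 75_000],
})

# Divide each dollar column by population to get per-capita figures.
df["rev_total_percap"] = df["rev_total"] / df["city_population"]
print(df[["city", "rev_total_percap"]])  # both cities come out at $6,000 per person
```

On a per-capita basis, a large city and a small one become directly comparable.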
Preparing the data for outlier detection
We made two changes to this dataset before running it through our outlier detection service. First, the complete dataset has over 600 columns because it contains the finances for each government level: the city, the county, the school districts, the special districts and, finally, the FiSC, which is the rollup of all those levels. We removed the sublevels and kept just the FiSC data, which “provides a full picture of revenues raised from city residents and businesses and spending on their behalf, whether done by the city government or a separate overlying government.”
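The column filtering can be sketched roughly as follows. The prefix names here are hypothetical stand-ins for the sublevel columns, not the dataset's actual naming scheme:

```python
import pandas as pd

# Illustrative mini-frame: one rollup column plus sublevel columns.
# Prefixes like "citygov_" and "county_" are assumptions for this sketch.
df = pd.DataFrame({
    "city": ["A"],
    "fisc_rev_total": [6000],
    "citygov_rev_total": [3500],   # hypothetical sublevel column
    "county_rev_total": [1500],    # hypothetical sublevel column
})

# Drop every column whose name marks it as a government sublevel.
sublevel_prefixes = ("citygov_", "county_", "schooldist_", "specialdist_")
keep = [c for c in df.columns if not c.startswith(sublevel_prefixes)]
fisc_only = df[keep]
print(fisc_only.columns.tolist())  # ['city', 'fisc_rev_total']
```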
The other change we made was to limit the data to the most recent year. The FiSC data covers 1977 to 2016, but we limited our records to 2016, leaving one row per city. So our final dataset has 152 rows and 142 columns. When we uploaded the file, we got the following feedback report:
File upload feedback from Penny Analytics
Note: We have not detected any dates in your file. If you know you have dates, please review your formats and use a more conventional one such as 2-FEB-2008 01:00:00 AM. See FAQs about dates. Also, check the data profiling report produced after file upload.
Note: We have recognized your data as a standard dataset (not a time series).
Note: This file is not eligible for the high volume/high security option, since you have not provided a first column containing unique integer rownumbers.
Warning: Your file has a high number of columns relative to rows. This is not ideal for outlier detection. If possible, reduce the number of columns and/or increase the number of rows and submit a new file.
SUCCESS: File validated
We are adding your file to inventory, so you will be ready to shop.
Also, we are generating a free profiling report of your data for you to keep (see below). This usually takes less than a minute. You can use this report to check your data before you shop.
Most of this feedback is fairly standard, but the warning about columns relative to rows is worth a mention. The outlier detection service works best when there are many rows from which to establish a “normal” baseline; in fact, the system is built to handle approximately 100 million cells of data. A large number of records is also where the service adds the most value, because that is well beyond what a human can process or visualize. For now, we are going to go ahead with our dataset and take the view that the proof of the pudding is in the eating.
Before going ahead, though, it is worth considering the alternative. With only 152 records, you might be tempted to simply apply conditional formatting to each column in Excel to spot the outliers. The problem is that this would only get you so far: you would spot the univariate outliers, but not the multivariate ones. For example, a city could be spending a typical amount on parks and recreation compared with all other cities, yet that amount could be considered low if a peer group of similar cities is spending much more.
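Here is a small NumPy sketch of that multivariate effect on synthetic data, using Mahalanobis distance (one standard way to score multivariate outliers; we are not claiming it is the method Penny Analytics uses). A point that is unremarkable on each axis separately can still be a strong outlier once the correlation between columns is taken into account:

```python
import numpy as np

# Peer cities: "parks spending" roughly tracks "total spending" (correlated).
rng = np.random.default_rng(0)
base = rng.normal(0, 1, size=(200, 1))
X = np.hstack([base, base + rng.normal(0, 0.1, size=(200, 1))])

mean = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

def mahalanobis(p):
    """Distance from p to the data cloud, accounting for correlation."""
    d = p - mean
    return float(np.sqrt(d @ cov_inv @ d))

typical = np.array([1.5, 1.5])    # high on both axes, but follows the trend
suspect = np.array([1.5, -1.5])   # ordinary on each axis, breaks the trend
print(mahalanobis(typical), mahalanobis(suspect))
# The suspect's distance is far larger, despite unremarkable per-axis values.
```

Per-column conditional formatting would treat `typical` and `suspect` identically; only a multivariate view separates them.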
Data profiling results
Here is the data profiling report produced after the file upload:
(To enable all features of the data profiling report including toggle details, you will need to download it and open it from there.)
If you are looking at this report, you should note that columns starting with chg_ are actually revenue items. For example, the average chg_parks_rec is $41 (raised from various fees) whereas the spend is in the column parks_recreation and averages $172.
The other thing to note is that the report highlights 36 columns as redundant, meaning they are highly correlated with other columns and add little new information. So the effective number of columns in the dataset is more like 106, which is still a lot relative to the number of rows, but better than we originally thought.
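One common way to flag redundant columns is to scan the upper triangle of the absolute correlation matrix and mark any column that correlates too strongly with an earlier one. We don't know Penny Analytics' actual rule, so the 0.95 cutoff and the toy data below are assumptions:

```python
import numpy as np
import pandas as pd

# Toy frame: rev_general is a near copy of rev_total; parks is independent.
rng = np.random.default_rng(1)
a = rng.normal(size=100)
df = pd.DataFrame({
    "rev_total": a,
    "rev_general": a * 1.02 + rng.normal(scale=0.01, size=100),
    "parks_recreation": rng.normal(size=100),
})

# Keep only the upper triangle so each pair is checked once.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [c for c in upper.columns if (upper[c] > 0.95).any()]
print(redundant)  # ['rev_general']
```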
Outlier results for city finances – Washington, San Francisco, New York, Oakland, and Tacoma top the list
Here are the five outliers found (in rank order), along with the reasoning produced by our service.
| Rank | City | Reasons |
|------|------|---------|
| 1 | DC: Washington | rev_total high, rev_general high, intergovt_rev high, igr_federal high, own_source_rev high, taxes high, tax_property high, tax_sales_grossrec high, tax_sales_general high, tax_sales_tobacco high, tax_income_indiv high, tax_income_corp high, tax_other high, misc_general_rev high, misc_fines_fees high, misc_other high, social_services high, public_welfare high, welfare_vendors high, welfare_other high, envir_housing high, sewerage high, administration high, admin_financial high, public_buildings high, interest_general high, capital_sewers high |
| 2 | CA: San Francisco | tax_sales_other high, tax_other high, chg_air_transit high, chg_parking high, chg_hous_commdev high, rev_utility_transit high, rev_empl_retire_tr high, health high, transportation high, parking high, utility_water high, utility_transit high, spend_capital high, capital_utility high, cash_holdings high, cash_empl_retire high, cash_other_other high |
| 3 | NY: New York | city_population high, taxes high, tax_income_corp high, chg_highways high, rev_utility_transit high, rev_empl_retire_tr high, utility_transit high, spend_intergovt high, cash_empl_retire high |
| 4 | CA: Oakland | chg_seaport high, misc_property_sale high, envir_housing high, housing_commdevt high, admin_other high, capital_other_gen high |
| 5 | WA: Tacoma | commercial_activ high, capital_other_dir high |
Notice that the largest outliers have more reasons behind them; this is typical. Since almost all the columns in the FiSC data are numeric, we expect the reasoning to be directional (high or low). But the FiSC data is financial and, like personal incomes, tends to be positively skewed. So the outlier detection service is much more likely to pick out high values, and that is what we are seeing here. We are not deliberately targeting the big spenders! In fact, it would have been nice to see a counterexample in this list.
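The skew effect is easy to demonstrate on synthetic data. For a positively skewed variable (here, lognormal, as an illustrative stand-in for financial data), values far above the mean are common while values equally far below it are essentially impossible, so a detector will naturally flag mostly "high" reasons:

```python
import numpy as np

# Draw a large sample from a positively skewed distribution.
rng = np.random.default_rng(2)
x = rng.lognormal(mean=0, sigma=1, size=100_000)

# Count how many points sit more than two standard deviations
# above versus below the mean.
mu, sd = x.mean(), x.std()
high = int((x > mu + 2 * sd).sum())
low = int((x < mu - 2 * sd).sum())
print(high, low)  # thousands on the high side, none on the low side
```

A lognormal variable is strictly positive and its mean minus two standard deviations is negative, so the low-side count is zero by construction; real per-capita finance columns behave similarly, if less extremely.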
If you live in or study these cities, hopefully some of these findings ring true.
Did you know that Penny Analytics offers a permanent discount for public datasets? Public datasets can be processed under our file donation data retention option.
The Penny Analytics blog is looking for more public datasets to showcase its online outlier detection service. If you think your dataset is suitable, register online and upload your dataset. After your free data profiling report is produced, do not shop. Instead, email us at email@example.com with the name of your uploaded file. If your dataset is blogworthy, we will send you a coupon code so that your file donation order will be free of charge. Please allow a few days for us to respond.