Higher education is expensive and outcomes (graduation, employment) are not always certain. This has become increasingly true in Canada and the UK, and perhaps was always true in the US. At Penny Analytics, we have worked at the universities of Oxford, St. Andrews and Waterloo and have always had an interest in the sector. And now that we are parents, we want our kids to make good decisions about their studies, and be able to launch themselves as independent adults!

In this post, we apply outlier detection to the US College Scorecard data. The College Scorecard was introduced by the Obama administration and, according to Wikipedia, it “is an online tool for consumers to compare the cost and value of higher education institutions in the United States. It displays data in five areas: cost, graduation rate, employment rate, average amount borrowed, and loan default rate”.

Our original dataset is called “Most-Recent-Cohorts-Scorecard-Elements.csv” and is available from the US government college scorecard website.

Preparing the US college scorecard data for outlier detection

We made two main changes to the dataset. First, we replaced the short (and cryptic) column names with the “developer friendly name” from the data dictionary, which makes the dataset easier to follow. Second, we replaced every “PrivacySuppressed” value with “NULL”. Before this second change, our data profiling report treated several variables as categorical when they should have been numeric.
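For readers who want to reproduce this kind of preparation themselves, here is a minimal sketch in Python of the two changes described above. It is not the code we ran. The FRIENDLY_NAMES mapping contains only a few illustrative entries that you should check against the actual data dictionary, and the sample rows are made up.

```python
import csv
import io

# Illustrative excerpt only: short column name -> developer-friendly name.
# These entries are assumptions; verify them against the real data dictionary.
FRIENDLY_NAMES = {
    "INSTNM": "name",
    "SAT_AVG": "sat_scores.average.overall",
    "MD_EARN_WNE_P10": "10_yrs_after_entry.median",
}


def clean_scorecard(rows, friendly_names):
    """Rename columns and replace 'PrivacySuppressed' with 'NULL' in each row."""
    for row in rows:
        yield {
            # Columns missing from the mapping keep their original name.
            friendly_names.get(col, col): (
                "NULL" if value == "PrivacySuppressed" else value
            )
            for col, value in row.items()
        }


# Made-up sample rows, standing in for the real CSV file.
sample = io.StringIO(
    "INSTNM,SAT_AVG,MD_EARN_WNE_P10\n"
    "Example College,1100,PrivacySuppressed\n"
    "Another College,NULL,35000\n"
)
cleaned = list(clean_scorecard(csv.DictReader(sample), FRIENDLY_NAMES))
```

To process the real file, you would pass csv.DictReader an open file handle instead of the in-memory sample, and write the cleaned rows back out with csv.DictWriter.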

Here is the data profiling report produced after the final file upload:


(To enable all features of the data profiling report including toggle details, you will need to download it and open it from there.)

The dataset has 7115 rows (colleges) and 190 columns. That is a lot of columns, and the data profiling report is very busy! But the report reveals that many columns are redundant. For example, average SAT scores in math are correlated with average SAT scores in writing. This means the dataset contains less independent information than its 190 columns suggest.
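One way to spot this kind of redundancy outside of a profiling report is to compute pairwise correlations and list the pairs above some cutoff. The sketch below is an illustration, not how our report works internally. The column names, the values and the 0.9 threshold are all made up for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column has no defined correlation
    return cov / sqrt(var_x * var_y)


def redundant_pairs(columns, threshold=0.9):
    """List column pairs whose absolute correlation reaches the threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        # Use only rows where both columns have a value (None = missing).
        complete = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if len(complete) < 3:
            continue
        xs, ys = zip(*complete)
        r = pearson(xs, ys)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, round(r, 3)))
    return pairs


# Made-up values for three hypothetical columns.
columns = {
    "sat_math": [500, 550, 600, None, 700],
    "sat_writing": [490, 540, 610, 580, 690],
    "size": [12000, 800, 5000, 300, 2500],
}
pairs = redundant_pairs(columns)
```

With these sample values, only the two SAT columns are reported as a redundant pair. On the real dataset the choice of threshold is a judgment call.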

Penny Analytics outlier detection results

Looking at the outlier results, there are, as always, several ways to be an outlier and these are spelled out in the reason codes. In the top 20 or so, we find:

  • a number of institutions with very low completion rates
  • specialist institutions (L3 Commercial Training Solutions Airline Academy, Los Angeles Film School)
  • elite institutions (Caltech, Harvard, Harvey Mudd)

[Figure: US College scorecard top 25 outliers by Penny Analytics]

It looks like Obama had a point and we should beware of career colleges that will not get you a qualification, let alone a career.

Let’s look at the reason codes for Harvard, which is ranked 21 out of 7115:

|act_scores.25th_percentile.writing unexpected|program_percentage.mathematics high|program_percentage.social_science high|program_percentage.history high|
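If you want to work with reason codes programmatically, the pipe-delimited string can be split into pairs of variable and flag. The sketch below assumes, based only on the examples in this post, that each code is a variable name followed by a single-word flag such as "high" or "unexpected". The function name is our own invention for illustration.

```python
def parse_reason_codes(reasoning):
    """Split a pipe-delimited reasoning string into (variable, flag) pairs."""
    codes = []
    for part in reasoning.split("|"):
        part = part.strip()
        if not part:
            continue  # skip the empty pieces before and after the outer pipes
        if " " in part:
            # Assumption: the flag is the last space-separated word.
            variable, _, flag = part.rpartition(" ")
        else:
            variable, flag = part, ""
        codes.append((variable, flag))
    return codes


harvard = (
    "|act_scores.25th_percentile.writing unexpected"
    "|program_percentage.mathematics high"
    "|program_percentage.social_science high"
    "|program_percentage.history high|"
)
harvard_codes = parse_reason_codes(harvard)
# First pair: ("act_scores.25th_percentile.writing", "unexpected")
```

For Harvard this yields four pairs: one "unexpected" admissions variable and three "high" program percentages.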

Fans of the Ivy League will be reassured that all eight institutions have been identified as outliers, usually for similar reasons to Harvard’s – high admissions standards and an unusual mix of programs. In fact, 19 out of the Forbes top 20 list are outliers. Here they are in order of their outlier rank:

Institution name | Reasoning
California Institute of Technology|act_scores.25th_percentile.writing unexpected|program_percentage.computer high|program_percentage.engineering high|program_percentage.mathematics high|program_percentage.physical_science high|demographics.race_ethnicity.asian high|10_yrs_after_entry.median high|
Harvard University|act_scores.25th_percentile.writing unexpected|program_percentage.mathematics high|program_percentage.social_science high|program_percentage.history high|
University of Chicago|program_percentage.mathematics high|program_percentage.social_science high|
Massachusetts Institute of Technology|act_scores.25th_percentile.writing unexpected|program_percentage.computer high|program_percentage.engineering high|program_percentage.mathematics high|
Columbia University in the City of New York|act_scores.25th_percentile.writing unexpected|program_percentage.mathematics high|program_percentage.social_science high|
Princeton University|act_scores.25th_percentile.writing unexpected|program_percentage.language high|program_percentage.history high|
Yale University|act_scores.25th_percentile.writing unexpected|program_percentage.language high|program_percentage.history high|
Williams College|program_percentage.ethnic_cultural_gender high|program_percentage.mathematics high|program_percentage.physical_science high|program_percentage.history high|
University of Pennsylvania|act_scores.25th_percentile.writing unexpected|
Duke University|act_scores.25th_percentile.writing unexpected|program_percentage.public_administration_social_service high|
Stanford University|program_percentage.multidiscipline high|
Brown University|program_percentage.mathematics high|
Dartmouth College|program_percentage.social_science high|program_percentage.history high|
University of Michigan-Ann Arbor|act_scores.25th_percentile.writing unexpected|size high|
Pomona College|program_percentage.mathematics high|
University of California-Berkeley|size high|
Northwestern University|act_scores.25th_percentile.writing unexpected|
Georgetown University|program_percentage.social_science high|
Cornell University|program_percentage.agriculture high|
University of Notre Dame|(not an outlier)

What’s interesting is that, for the most part, these great places are not flagged for extraordinary outcomes. Only Caltech is flagged for high median earnings (10_yrs_after_entry.median high), and none of the reason codes mention things like completion rates.

In Excel, we can filter the reason codes on text “price”. Surprisingly, very few institutions were called out for high prices, only these five:

  • L3 Commercial Training Solutions Airline Academy
  • Jewish Theological Seminary of America
  • Elmira Business Institute
  • The International Culinary Center
  • Southern California Institute of Architecture

And only two institutions were called out for low prices:

  • Franklin W Olin College of Engineering
  • Washington and Lee University
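The same text filter can be done without Excel. Here is a minimal Python sketch, assuming the outlier results have been loaded as pairs of institution name and reasoning string. The rows and the variable name "example.price" are placeholders invented for illustration. They are not taken from our results file.

```python
def filter_by_reason(rows, text, flag=None):
    """Return names of institutions with a reason code whose variable contains `text`.

    If `flag` is given (for example "high" or "low"), the code must also
    carry that flag.
    """
    matches = []
    for name, reasoning in rows:
        for code in reasoning.strip("|").split("|"):
            # Assumption: each code is "<variable> <flag>".
            variable, _, code_flag = code.strip().rpartition(" ")
            if text in variable and (flag is None or code_flag == flag):
                matches.append(name)
                break  # one matching code is enough for this institution
    return matches


# Placeholder rows; the real reason codes will differ.
rows = [
    ("Example College A", "|example.price high|size high|"),
    ("Example College B", "|example.price low|"),
    ("Example College C", "|program_percentage.history high|"),
]
high_price = filter_by_reason(rows, "price", flag="high")
low_price = filter_by_reason(rows, "price", flag="low")
```

Passing a flag separates the institutions called out for high prices from those called out for low prices, which a plain text filter on “price” does not do.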

So, where does this outlier analysis leave us? There are few outliers on price, and few outliers on good outcomes. This analysis mostly helps us find institutions with a more unusual mix of programs, which speaks to their unique character, and those with more selective admissions. It also alerts us to some “bad apples” in the US higher education sector. You can download the Penny Analytics outlier results below:


Did you know that Penny Analytics offers a permanent discount for public datasets? Public datasets can be processed under our file donation data retention option.

The Penny Analytics blog is looking for more public datasets to showcase its online outlier detection service. If you think your dataset is suitable, register online and upload your dataset. After your free data profiling report is produced, do not shop. Instead, email us at blog@pennyanalytics.com with the name of your uploaded file. If your dataset is blogworthy, we will send you a coupon code so that your file donation order will be free of charge. Please allow a few days for us to respond.

Copyright © 2020 Penny Analytics Limited All rights reserved.