Fun with data


For those of you who have some time to waste, here’s a site that you have to check out. Be careful though: it is easy to spend an hour or more having fun.

Tyler Vigen is a student at Harvard Law School who has written an algorithm that lets him find correlations in independent data sets. So, we now know that there is a 99% (almost perfect!) correlation between US Spending on Science and Technology and US Suicides by Hanging, Strangulation, and Suffocation, and a 95% correlation between Per Capita Consumption of Cheese in the US and the Number of People in the US who died by becoming tangled in their bedsheets. Plus, did you know that Margarine consumption in the US and the Number of people who starve to death in the US almost perfectly correlate? Perhaps more surprising is that data sets where you would expect a very strong correlation, such as total US crude oil imports and US crude oil imports from Venezuela (88%), are easily trumped by stronger correlations between US crude oil imports and the Number of Lawyers in Louisiana (95%) or the Cost of Red Delicious Apples (90%).

Of course, we already knew that correlation doesn’t imply causation, but we’re seldom so crudely reminded. Nevertheless, there is potential for such an algorithm to be “let loose” on corporate data sets. Who knows what might be uncovered? It could make for some interesting presentations! And, with the right amount of interpretive analysis applied, perhaps actual new causal relationships with predictive value might be found.
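For the curious, here is a rough idea of what such a hunt could look like. This is only a minimal sketch, not Tyler Vigen's actual algorithm: it assumes the Pearson correlation coefficient as the measure, and the function names (`pearson`, `strongest_pairs`), the 0.9 threshold, and the three tiny series are all made up for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 long")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def strongest_pairs(series, threshold=0.9):
    """Compare every pair of named series and keep those with |r| >= threshold,
    strongest first. The threshold is an arbitrary, hypothetical cut-off."""
    hits = []
    for (name_a, a), (name_b, b) in combinations(series.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            hits.append((name_a, name_b, r))
    return sorted(hits, key=lambda hit: abs(hit[2]), reverse=True)


# Invented numbers, purely for illustration (not real statistics).
series = {
    "series_a": [1, 2, 3, 4, 5],
    "series_b": [2, 4, 6, 8, 10.5],
    "series_c": [5, 1, 4, 2, 3],
}

for name_a, name_b, r in strongest_pairs(series):
    print(f"{name_a} vs {name_b}: r = {r:.3f}")
```

Note that the number of pairs grows roughly with the square of the number of data sets, so the more series you throw in, the more strong-looking matches will turn up by chance alone. That is exactly why the results are so entertaining, and why they need interpretation before anyone acts on them.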

Happy hunting? Perhaps there is also reason to be cautious. There is no shortage of data sets to analyze, given the enormous amount of data people are already willing to share with corporations in return for small discounts or targeted advertising. What if your insurance company discovers a correlation between the average number of right turns per trip (you signed up for one of those tracking devices in your car in return for a premium discount) and early death (from their actuarial files and the CDC)? Might they increase your life insurance premium, or perhaps deny coverage? In the United States, insurance companies have already gotten into some trouble for associating credit scores with automobile premiums. Although they found an almost perfect correlation with predictive value, many legislators said that, since this correlation could not be properly explained, it could not be used to determine driving risk and premiums. Get ready for a direct correlation between the amount of data shared with corporations and the number of lawsuits and lawyers needed to arbitrate all this.

Kasper de Boer


Kasper de Boer is a Vice President in Sogeti US, where he is currently responsible for the Infrastructure Practice. Kasper has 25 years of experience in IT Consulting and is particularly interested in IT organizations and how to make them more efficient and effective. He advises clients on how to reduce the ongoing maintenance burden of IT systems and technology, achieve better alignment between IT capabilities and business needs, increase speed-to-market of IT solutions, and increase the overall responsiveness of IT.


  1. Craig Jahnke, May 28, 2014

    I have been hearing a lot of discussions around big data and personal privacy, so I think you are right that the lawyers will be playing a big part in the decision going forward.