Fun with data
May 28, 2014
For those of you with some time to waste, here’s a site you have to check out: http://www.tylervigen.com. Be careful, though: it’s easy to spend an hour or more having fun.
Tyler Vigen is a student at Harvard Law School who has written an algorithm that lets him find correlations between independent data sets. So we now know that there is a 99% (almost perfect!) correlation between US Spending on Science and Technology and US Suicides by Hanging, Strangulation, and Suffocation, and a 95% correlation between Per Capita Consumption of Cheese in the US and the Number of People in the US who died by becoming tangled in their bedsheets. Plus, did you know that Margarine consumption in the US and the Number of people who starve to death in the US almost perfectly correlate? Perhaps more surprising is that pairs of data sets where you would expect a very strong correlation, such as total US crude oil imports and US crude oil imports from Venezuela (88%), are easily trumped by the correlations between US crude oil imports and the Number of Lawyers in Louisiana (95%) or the Cost of Red Delicious Apples (90%).
Of course, we already knew that correlation doesn’t imply causation, but we’re seldom so crudely reminded. Nevertheless, there is potential for such an algorithm to be “let loose” on corporate data sets. Who knows what might be uncovered? It could make for some interesting presentations! And, with the right amount of interpretive analysis applied, perhaps genuine causal relationships with predictive value might be found.
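For the curious, here is a rough sketch of why a search like this is almost bound to turn something up. It is not Vigen's actual algorithm, which isn't described here; it simply assumes a brute-force comparison of every pair of series using the Pearson correlation coefficient. The data is made up (200 random walks standing in for ten years of unrelated annual statistics), and the function names and the seed are my own choices for illustration.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant series; treat as 0
    return cov / math.sqrt(var_x * var_y)


def random_walk(rng, length):
    """One made-up 'annual statistic': a running sum of random steps."""
    value, series = 0.0, []
    for _ in range(length):
        value += rng.gauss(0, 1)
        series.append(value)
    return series


def strongest_pair(series_list):
    """Return (r, i, j) for the pair of series with the largest |r|."""
    best = (0.0, None, None)
    for (i, a), (j, b) in itertools.combinations(enumerate(series_list), 2):
        r = pearson(a, b)
        if abs(r) > abs(best[0]):
            best = (r, i, j)
    return best


rng = random.Random(2014)  # arbitrary fixed seed so the run is repeatable
data = [random_walk(rng, 10) for _ in range(200)]  # 200 unrelated series, 10 "years" each
r, i, j = strongest_pair(data)
print(f"series {i} and {j}: r = {r:.3f}")
```

With 200 series there are 19,900 pairs to compare, and each series has only ten points, so the strongest pair should come out very close to a perfect correlation even though nothing connects the two. Search enough pairs and a "99%" match is the expected result, not a discovery.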
Happy hunting? Perhaps there is also reason to be cautious. There is no shortage of data sets to analyze, given the enormous amount of data people are already willing to share with corporations in return for small discounts or targeted advertising. What if your insurance company discovers a correlation between the average number of right turns per trip (you agreed to have one of those tracking devices in your car in return for a premium discount) and early death (from their actuarial files and the CDC)? Might they increase your life insurance premium, or perhaps deny coverage? In the United States, insurance companies have already gotten into some trouble for tying automobile premiums to credit scores. Although they found an almost perfect correlation with predictive value, many legislators said that, since this correlation could not be properly explained, it could not be used to determine driving risk and premiums. Get ready for a direct correlation between the amount of data shared with corporations and the number of lawsuits, and lawyers, needed to arbitrate all this.