October 23, 2014

Big data analysis requires going back to basic statistical principles

The recent blog post “this piece of data is lying!” (parts 1 and 2) showed that the devil hides in the details and that what we observe in data (such as a correlation) can be misleading. In this new series of articles, I would like to elaborate on that and give more examples of statistical biases we must all be aware of in order not to be fooled by big data statistical results. The content of this series is inspired by a data science course from the University of Washington in the USA (Bill Howe, Data science, autumn 2012).

What about big data and statistics

Let us begin with a statement by Bradley Efron: “Classical statistics was fashioned for small problems, a few hundred data points at most, a few parameters.” In addition, “The bottom line is that we have entered an era of massive scientific data collection, with a demand for answers to large-scale inference problems that lie beyond the scope of classical statistics.”

What goes wrong here? The fact is that you can find many positive correlations in data, for example:

• Decline in Internet Explorer market share and murder rate in the US between 2006 and 2011 (cited by Bill Howe)
• Number of police officers and number of crimes (Glass & Hopkins, 1996)
• Amount of ice cream sold and deaths by drowning (Moore, 1993)
• Stork sightings and population increase (Box, Hunter, Hunter, 1978)

These examples show that “when you search for patterns in very, very large data sets with billions or trillions of data points and thousands of metrics, you are bound to identify coincidences that have no predictive power” (Vincent Granville, the curse of big data). So the problem is the introduction of statistical bias, which we should detect and remove before we begin analyzing a data set. This series will give you some insight into common biases that come with traditional statistical methods for analyzing (big) data.
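To make Granville's point concrete, here is a minimal sketch in Python. It is not taken from the course: the function names, the number of metrics and the choice of six data points per metric (think of six yearly values) are all assumptions made for illustration. It generates metrics that are purely random and independent, then looks for the strongest correlation among all pairs.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def strongest_chance_correlation(n_metrics, n_points, seed=0):
    """Generate independent random metrics; return the largest |r| over all pairs."""
    rng = random.Random(seed)
    # Every metric is pure noise, so any correlation found is a coincidence.
    metrics = [[rng.gauss(0, 1) for _ in range(n_points)] for _ in range(n_metrics)]
    best = 0.0
    for i in range(n_metrics):
        for j in range(i + 1, n_metrics):
            best = max(best, abs(pearson(metrics[i], metrics[j])))
    return best

few = strongest_chance_correlation(n_metrics=5, n_points=6)
many = strongest_chance_correlation(n_metrics=200, n_points=6)
print(f"strongest |r| among 5 metrics:   {few:.2f}")
print(f"strongest |r| among 200 metrics: {many:.2f}")
```

With 200 metrics there are 19,900 pairs to compare, so a near-perfect correlation between two unrelated series is almost guaranteed to show up somewhere, even though none of the metrics has anything to do with the others.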

The publication bias

Publication bias was reported a long time ago, but there is evidence suggesting that it is increasing.

What is an example of publication bias? In biomedical research, publications on autism spectrum disorder suggest that in some areas negative results are completely absent. What does that really mean? It means that only papers showing significant positive gains get published. In fact, when we try several treatments and none of them works except one, we generally publish one article, not 20!
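A short calculation shows why publishing only the "winning" treatment is risky. The sketch below assumes independent tests and the conventional 5% significance level; both are assumptions added for illustration, as is the function name. It computes the chance that at least one of several treatments that do nothing still produces a significant result.

```python
def prob_at_least_one_false_positive(n_tests, alpha=0.05):
    """Chance that at least one of n independent tests of ineffective
    treatments looks significant at level alpha."""
    return 1 - (1 - alpha) ** n_tests

for n in (1, 5, 20):
    print(n, round(prob_at_least_one_false_positive(n), 3))
```

Under these assumptions, 20 ineffective treatments give roughly a 64% chance that at least one of them looks like it works. If only that one is written up, readers never see the other 19 failures.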

When studies cover a larger and larger population over time (i.e. the sample size of the experiment increases), repeating the experiment should normally give more accurate results: a larger sample gives more statistical power, which is a basic statistical rule. But you will often notice that the effect actually reported by meta-analyses regresses toward zero, as in the next figure (Bill Howe).

But if you take all the experiments, including those that could have been reported but were never published, you get another pattern.

In fact, we see that this mysterious decline effect does not exist: it appears only because of publication bias, since only part of the results (in our example, the positive ones) are reported and published.
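The decline effect can be reproduced with a small simulation. This is a sketch under stated assumptions, not the analysis behind Bill Howe's figures: the true effect of 0.1, the unit variance per observation, the sample sizes and the rule "publish only if significantly positive" are all hypothetical choices made for illustration.

```python
import random
from statistics import mean

def simulate(sample_sizes, true_effect=0.1, studies_per_size=2000, seed=1):
    """For each sample size, return (mean effect over all studies,
    mean effect over published studies only)."""
    rng = random.Random(seed)
    results = {}
    for n in sample_sizes:
        std_error = 1 / n ** 0.5  # assumes unit variance per observation
        estimates = [rng.gauss(true_effect, std_error)
                     for _ in range(studies_per_size)]
        # Hypothetical publication rule: only significant positive results.
        published = [e for e in estimates if e / std_error > 1.96]
        results[n] = (mean(estimates), mean(published))
    return results

results = simulate([20, 80, 320, 1280])
for n, (all_mean, published_mean) in results.items():
    print(f"n={n:5d}  all studies: {all_mean:+.3f}  "
          f"published only: {published_mean:+.3f}")
```

In this setup the average over all studies stays close to the true effect at every sample size, while the average over published studies starts far too high for small samples and shrinks as the sample grows. The "decline" comes entirely from the filter on what gets published.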

In the next article, we will look at the mysterious effect size.