In the first post of this series, we explained why R is an ideal framework for statistical work.
Big Data analytics is everywhere: in every vertical, every market, every country. One of the biggest modern challenges is to reveal the full richness and relevance of data while its volume and complexity increase exponentially.
It is not always possible to extract pertinent or statistically accurate information directly from raw data. Just as Michelangelo did with his Pietà, statisticians and scientists need to transform an unhewn block of marble into a piece of art. For example, they need to remove abnormal values, fill in missing data, correct errors, and so on.
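To make the cleaning steps concrete, here is a minimal sketch in base R. The data frame `raw` and its columns are hypothetical, and the 1.5 × IQR rule and median imputation are only one common choice among many; the right rules always depend on the data and on the statistical question.

```r
# Hypothetical raw measurements: one missing value and one implausible reading.
raw <- data.frame(
  id    = 1:8,
  value = c(10.2, 9.8, NA, 10.5, 250.0, 9.9, 10.1, 10.4)
)

# Flag abnormal values with the 1.5 * IQR rule (an assumed convention, not the only one).
q     <- unname(quantile(raw$value, c(0.25, 0.75), na.rm = TRUE))
fence <- 1.5 * (q[2] - q[1])
is_outlier <- !is.na(raw$value) &
  (raw$value < q[1] - fence | raw$value > q[2] + fence)

# Remove the abnormal rows, then fill the missing value with the median of the rest.
clean <- raw[!is_outlier, ]
clean$value[is.na(clean$value)] <- median(clean$value, na.rm = TRUE)

print(clean)
```

Here the reading of 250 is dropped as abnormal, and the missing value is replaced by the median of the remaining observations.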
With such volumes of data being constantly updated, the task is clearly gargantuan. Doing this manually is like asking a 3-year-old child to eat soup with a fork. But with a little knowledge of R, the task can be simplified. To extend the Pietà analogy, R allows one to draw a map and build a mould. When that is done, the artist “just” needs to fill the mould with material to reproduce the piece of art, and the statistician “just” has to fill the mould with data to obtain results.
In more practical terms, R allows scripts to be written: these are sequences of pertinent rules and computations aimed at transforming raw data. For example, it is possible to create automated reports in which the raw data is taken directly from the database, then transformed and cleaned to make it fit for statistical processing, and finally presented as graphics and statistical results, even if the data is constantly being updated.
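The idea of a script as a reusable “mould” can be sketched as three small functions chained together. Everything here is illustrative: `load_raw()`, `clean_data()` and `build_report()` are hypothetical names, the in-memory data frame stands in for a real database query, and the cleaning rules are assumptions chosen for the example.

```r
# Stand-in for a database query; a real script might call DBI::dbGetQuery() here.
load_raw <- function() {
  data.frame(
    region = c("north", "north", "south", "south", "south", NA),
    sales  = c(120, 135, NA, 98, 110, 105)
  )
}

# Cleaning rules (assumed): drop rows with no region,
# then fill missing sales with the mean of the same region.
clean_data <- function(df) {
  df <- df[!is.na(df$region), ]
  region_mean <- ave(df$sales, df$region,
                     FUN = function(x) mean(x, na.rm = TRUE))
  df$sales <- ifelse(is.na(df$sales), region_mean, df$sales)
  df
}

# Reporting step: one summary table and one chart written to a PDF file.
build_report <- function(df, out_file = file.path(tempdir(), "report.pdf")) {
  summary_table <- aggregate(sales ~ region, data = df, FUN = mean)
  pdf(out_file)
  barplot(summary_table$sales, names.arg = summary_table$region,
          main = "Mean sales by region", ylab = "Sales")
  dev.off()
  summary_table
}

# Re-running this single line refreshes the report whenever the data changes.
report <- build_report(clean_data(load_raw()))
print(report)
```

Because each step is a function, the same script can be rerun unchanged on updated data, which is what makes the report automatic.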
Nevertheless, the scripts need to be designed with care, rigour and statistical knowledge, which makes writing them a time- and mind-demanding task. But once this is done, raw data treatment decreases from 6 months to several days of work.
Not only does R drastically reduce the time needed to prepare data, from 6 months to about a week, but its open-source framework also makes it easy to incorporate images, videos, sounds or other types of data. This will be the topic of our next post.
This post was co-authored by Kamel Abid and Paul Majerus.
Paul Majerus is a Data analyst – Statistician at Sogeti Luxembourg.