We are living in the Information Age, surrounded by smartphones, social networks, platforms, IoT devices, and technologies of all kinds, with ever larger volumes of data shared more frequently and in richer forms. This is a boon for scientists, who see their raw material expanding, but also a challenge for statisticians: processing the data demands constant adaptation of software, algorithms, and statistical methods. Even when working on the same case, scientists and statisticians can sometimes feel far apart. But before we delve deeper into how the R statistical programming language can bring the two closer together, let’s first lay the cornerstones of my story.
Present in 20 countries, Sogeti is an IT company whose main practices are Digital, Cyber, Cloud, and Testing. And, like the small “Gallic village”, we have a special population here in Luxembourg: statisticians.
They are more concerned with collecting, cleaning, arranging, analyzing, and disseminating data than with integrating cloud solutions or developing apps. There are 70 statisticians at Sogeti Luxembourg, among them two great colleagues: Paul Majerus and Alexandre Poncin. Their main duties include selecting and preparing the best ingredients, and using (and sometimes creating) the most useful utensils to reveal all the richness and relevance of the data.
In this task they are helped by an essential tool: R. Before adopting R, our colleagues used to spend months repeating the same tasks year after year: cleaning, sorting, arranging, and presenting. Meanwhile, scientists were trying to analyze and understand the patterns of the world, some of them unintentionally neglecting available statistical practices.
Some user-friendly statistics software packages were part of the problem: by reducing statistical reasoning to mindless button clicks, they allowed poor data analyses and thereby contributed to the reproducibility crisis that the scientific world has been facing for more than a decade.
So why is R an ideal framework for working with statistics? As a simple but robust programming language, it makes it easy to automate procedures, and it is supported by an active community. It is constantly enriched with new features and applications, it is heavily documented, and it is all open source.
“Thanks to R, we are saving precious time by automating our procedures. This time is reinvested in deepening our analyses and proposing new dissemination tools – such as web applications built in R,” says Paul about the possibilities of R.
“On the other hand, scientists can find in R a tool that is quite simple to use but requires an understanding of every part of the statistical process. By methodically constructing each stage of the statistical analysis, scientists are assured of a better understanding of their construction, their limitations, and possible errors, and they maintain complete control and transparency over the calculations and methods applied. Therefore, R is less error-prone and supports scientists in good statistical practices,” explains Alexandre.
Over the next four blog posts, Paul, Alexandre and I will describe and explain how R changes the game by:
– Reducing data preparation from 6 months to one week
– Making it easy to inject images, videos, sounds, or other kinds of data
– Supporting the Bayesian revolution, and
– Providing a lasting source of innovation
We’ll be right back. Stay tuned!
This blog post was co-authored by Paul Majerus and Kamel Abid.
Paul Majerus is a Data analyst – Statistician at Sogeti Luxembourg.