This piece of data is lying! (1/2)

Sep 25, 2014
Capgemini

Data analysis is fascinating. With some good data and some appropriate tools, both becoming more and more accessible these days, you can see more clearly, explain what is happening and even predict the future.

As always with automated processes (and math), these tools have to be used carefully: becoming more accessible also means removing some barriers, and sometimes a tool becomes “too simple to use”.

In this first part, we will walk through a pair of cases where data appears to be lying.

First case: The mall and the subway

In a mall we see people coming and going all day long, but couldn’t we predict the crowd a little better in order to adjust our in-store sales events?

Let’s measure people entering the mall every 7 minutes:

Fig 1: number of people entering the mall during a 6-hour extract, one blue point every 7 minutes

Based on this data (in fact, based on one month of such data) and using the “Power Spectral Density Estimator” tool in the new version of our data analysis system, we were able to identify the frequencies at which larger groups of people come into the mall!

We find two main frequencies, one at 45 minutes and one at nearly 17 minutes, which, used to build a simulation, correlate quite well with the measurements.

Fig 2: same as figure 1, with the red curve showing the estimation

Well, knowing that the main train frequency during the day is 45 minutes, this is quite logical. But I can’t figure out why 17 minutes, as none of the main subway schedules indicate this kind of timing. Can you?
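If you want to play with this kind of analysis yourself, here is a minimal sketch of the estimation step in Python. The counts, component amplitudes and noise level below are synthetic assumptions, not our actual measurements, and a plain periodogram stands in for the “Power Spectral Density Estimator” of our tool:

```python
# Minimal sketch on synthetic data: the 7-minute sampling step matches the
# article, but the counts and the 45 / 17 minute components are illustrative.
import numpy as np
from scipy.signal import periodogram

SAMPLE_MINUTES = 7
t = np.arange(0, 30 * 24 * 60, SAMPLE_MINUTES)  # one month of samples, in minutes

rng = np.random.default_rng(0)
counts = (50                                    # baseline arrivals per sample
          + 20 * np.cos(2 * np.pi * t / 45)     # 45-minute component
          + 10 * np.cos(2 * np.pi * t / 17)     # ~17-minute component
          + rng.normal(0, 5, t.size))           # measurement noise

# Power spectral density estimate; fs is expressed in samples per minute,
# so the resulting frequencies are in cycles per minute.
freqs, psd = periodogram(counts, fs=1 / SAMPLE_MINUTES)

# Report the two strongest non-zero frequencies as periods, in minutes.
strongest = np.argsort(psd[1:])[::-1][:2] + 1
for i in strongest:
    print(f"dominant period ≈ {1 / freqs[i]:.1f} minutes")
```

With a month of 7-minute samples, the two dominant periods come out clearly; the interesting question is what happens when the real crowd does not follow the periods you put into the simulation.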

Second case: The test

It is time to improve our software performance. A complete “live” measurement campaign has been conducted on our services layer, the most comprehensive test to date: the response times of one thousand services in real conditions.

Fig 3: service performance, first measurement campaign

The network team thinks they could improve the results by using QoS on the network to prioritize the best-performing services (which are also the most frequently used ones; their code is already quite optimized).

In my team, we believe code reviews are the way to go. We think we can improve the results by reviewing the services and giving some advice to the development team. We take the 100 “worst” performing services on the list and begin our work.

A month later, a new campaign is performed; we observe the same kind of measurements overall, with a comparable mean and standard deviation.

And the results are…

Fig 4: impact of code reviews on the bottom 100 services

Very good indeed! As you can see, the improvement (data moving “left”) is 10-50% in each rank, and some services have improved by at least 40%.

Well, not so good results for some friends of ours…

Fig 5: impact of QoS on the top 100 services

The “best performing” services are now even worse (data moving right), with some nearly doubling their response time; most ranks are worse (the first two are a little better, but not enough to matter).

Based on these first data, we can conclude that the network was already very well set up and should be reverted to its previous settings. We can also schedule code reviews for the next 100 “worst performing” services, to evaluate the ROI more closely before generalizing this approach to every service used in critical applications.

But talking with our colleagues, we realized two very strange things:

  • the network team, in the end, did not set up the priorities, as the network monitoring tools showed a very fluid network
  • the development team was swamped with a new mobile app to build and integrate, and couldn’t act on our recommendations yet.

So no one did anything, and yet the results changed dramatically. I can’t figure out why they were so positive. Chance, perhaps? Can you figure it out?
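If you want to explore the puzzle yourself, here is a minimal sketch of the same before/after comparison on synthetic data. The latencies, noise levels and group sizes are illustrative assumptions, not our measurement campaigns:

```python
# Minimal sketch on synthetic data: each "campaign" is one noisy measurement
# per service around an underlying latency; all numbers are illustrative.
import numpy as np

rng = np.random.default_rng(1)
N_SERVICES = 1000

true_latency = rng.lognormal(mean=4.0, sigma=0.5, size=N_SERVICES)  # in ms
campaign1 = true_latency * rng.lognormal(0.0, 0.3, N_SERVICES)
campaign2 = true_latency * rng.lognormal(0.0, 0.3, N_SERVICES)

# Groups are selected from the FIRST campaign only: the 100 slowest services
# (code-review candidates) and the 100 fastest (QoS candidates).
order = np.argsort(campaign1)
groups = {"bottom 100 (code reviews)": order[-100:],
          "top 100 (QoS)": order[:100]}

for label, idx in groups.items():
    before, after = campaign1[idx], campaign2[idx]
    change = 100 * (after.mean() / before.mean() - 1)
    print(f"{label}: mean {before.mean():.0f} ms -> {after.mean():.0f} ms "
          f"({change:+.0f}%)")
```

Run it a few times with different seeds and watch what happens to each group between the two campaigns; it might give you a hint before the second part.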

Interlude

Those two stories were a simplified illustration of things that can go wrong (or go too well) when using data. Of course, these data weren’t lying at all; they were even showing us some interesting trends that will be useful, and not only in future analyses.

(Disclaimer: no real live data were used or hurt in this experiment)

In the next part, we will take a closer look at the tools we used and at how these results can be explained.

About the author

Information Systems Architect | France
Claude Bamberger has been an information systems architect since his first job in 1994, realizing over nearly 20 years that it is a role one grows into more than a title one holds, mainly by enlarging the scope of technologies known, skills mastered and contexts experienced.
