This piece of data is lying! (1/2)

Data analysis is fascinating. With good data and the appropriate tools, both of which are becoming more and more accessible these days, you can see more clearly, explain what is happening, and even predict the future.

As always with automated processes (and math), the tools have to be used carefully, and becoming more accessible also means removing some barriers, sometimes to the point of being “too simple to use”.

In this first part, we will walk through a pair of cases where data appears to be lying.

First case: The mall and the subway

In a mall, we see people coming and going all day long; couldn’t we predict the crowd a little better so we can adjust our in-store promotions?

Let’s measure people entering the mall every 7 minutes:

Fig 1: number of people entering the mall during a 6-hour extract, one blue point every 7 minutes

Based on this data (in fact, based on one month of such data) and the “Power Spectral Density Estimator” tool in the new version of our data analysis system, we were able to identify the frequencies at which larger groups of people come into the mall!

We found two main periodic components, one at 45 minutes and one at nearly 17 minutes, which, used in a simulation, correlate quite well with the measurements.

Fig 2: same as Figure 1, with the red curve showing the estimation

Well, knowing that the main train interval during the day is 45 minutes, the first one is quite logical. But I can’t figure out the 17 minutes, as none of the main subway schedules show that kind of timing. Can you?
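For readers who would like to try this kind of analysis themselves, here is a minimal sketch in Python of a power spectral density estimate on counts sampled every 7 minutes. The arrival model, the numbers, and the choice of scipy.signal.periodogram are illustrative assumptions on my part, not the actual tool or data behind the figures above.

# Minimal sketch of a PSD estimate on evenly sampled entry counts.
# The data is synthetic: a steady background flow plus a burst of
# people after each train, which is assumed to run every 45 minutes.
import numpy as np
from scipy import signal

rng = np.random.default_rng(42)

sample_step_min = 7              # one measurement every 7 minutes
duration_min = 30 * 24 * 60      # roughly one month of data, in minutes
t = np.arange(0, duration_min, sample_step_min)

background = 20
train_period_min = 45
burst = 30 * (np.cos(2 * np.pi * t / train_period_min) > 0.9)
counts = background + burst + rng.poisson(5, size=t.size)

# Power spectral density estimate (sampling frequency expressed per minute)
fs = 1.0 / sample_step_min
freqs, psd = signal.periodogram(counts, fs=fs, detrend="constant")

# Report the strongest periodic components (skipping the zero-frequency term)
top = np.argsort(psd[1:])[::-1][:3] + 1
for i in top:
    print(f"period of about {1.0 / freqs[i]:.1f} min, power {psd[i]:.1f}")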

Second case: The test

It is time to improve our software performance. A complete “live” measurement campaign has been conducted on our services layer, establishing the most comprehensive test to date: the response times of one thousand services under real conditions.

Fig 3: services performance, first measurement campaign

The network team thinks it could improve the results by using QoS on the network to prioritize the best-performing services (which are also the most often used ones, and whose code is already quite optimized).

In my team, we believe code reviews are the way to go. We think we can improve the results by reviewing the services and giving some advice to the development team. We take the 100 “worst” performing services on the list and begin our work.

A month later, a new campaign is performed; overall, we observe the same kind of measurements, with a comparable mean and standard deviation.

And the results are…

Fig 4: impact of code reviews on the bottom 100 services

Very good indeed! As you can see, the improvement (data moving “left”) is 10-50% in each rank, and some services have improved by at least 40%.

Well, not such good results for some friends of ours…

Fig 5: impact of QoS on the top 100 services


The “best-performing” services are now even worse (data moving “right”), with some nearly doubling their response times; most ranks are worse (the first two are a little bit better, but it’s not worth it).

Based on this first set of data, we can conclude that the network was already very well set up and should be set back to its previous settings, and we can schedule code reviews for the next 100 “worst-performing” services to evaluate the ROI more closely before generalizing this approach to every service used in critical applications.

But talking with our colleagues, we realized two very strange things:

  • in the end, the network team did not set up the priorities, as the network monitoring tools showed a very fluid network
  • the development team was swamped with a new mobile app to build and integrate, and could not act on our recommendations yet.

So no one did anything, and yet the results changed dramatically. I can’t figure out why our results were so positive. Chance, perhaps? Can you figure it out?
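If you want to explore the puzzle yourself before the next part, here is a minimal sketch in Python that simulates two independent measurement campaigns of the same one thousand services, using purely synthetic response times, and then looks at how the extremes of the first campaign behave in the second. All distributions and numbers are assumptions for illustration only, not the article’s data.

# Minimal sketch: two independent measurement campaigns of the same
# 1000 services, with nothing changed in between. All numbers synthetic.
import numpy as np

rng = np.random.default_rng(7)
n_services = 1000

# Each service has a "true" mean response time (in ms), but any single
# campaign observes it with some measurement noise on top.
true_mean = rng.lognormal(mean=4.0, sigma=0.3, size=n_services)
campaign_1 = true_mean + rng.normal(0, 15, size=n_services)
campaign_2 = true_mean + rng.normal(0, 15, size=n_services)

# Take the 100 slowest and the 100 fastest services of the first campaign...
worst_100 = np.argsort(campaign_1)[-100:]
best_100 = np.argsort(campaign_1)[:100]

# ...and see how those same services measure in the second campaign.
print("bottom 100: %.1f ms -> %.1f ms" %
      (campaign_1[worst_100].mean(), campaign_2[worst_100].mean()))
print("top 100:    %.1f ms -> %.1f ms" %
      (campaign_1[best_100].mean(), campaign_2[best_100].mean()))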

Interlude

Those two stories were a simplified illustration of things that can go wrong (or seem too good) when using data. Of course, the data wasn’t lying at all; it was even showing us some interesting trends that will be useful, and not only for future analysis.

(Disclaimer: no real live data were used or hurt in this experiment)

In the next part, we will take a closer look at the tools we used and how all of this can be explained.

Claude Bamberger

About

Claude Bamberger has been an information systems architect since his first job in 1994, realizing over nearly 20 years that it is a role one becomes more than one is, mainly by enlarging the scope of technologies known, the skills mastered, and the contexts experienced. Particularly interested in technologies and what they can mean for improving business results, Claude went from consulting in the early days of object-oriented development and distributed computing, to project, team, and I.T. department management for half a decade, before coming back to consulting at Sogeti in 2008, after co-founding an innovative start-up in the Talent Management field.

More on Claude Bamberger.
