According to Kapow Software, typical Big Data sources fall into nine categories: archives, docs, media, business apps, social media, public Web, data storage, machine log data, and sensor data. In the color coding, red marks external data (public Web), yellow marks internal data (archives and data storage), and orange marks sources that can be both (docs, media, business apps, social media, sensor data).
Doug Laney, who introduced the three Vs of Big Data (Volume, Velocity, and Variety) in 2001, agrees with this view of Big Data Intelligence by Variety: “the variety of data is by 2:1 both the greatest challenge and the greatest opportunity for businesses. Volume and Velocity can be accommodated by scaling and swapping infrastructure components. Not so for Variety.”
Besides these Big Data domain issues, there is the people side of things. I wondered how this new breed of so-called “data scientists” actually views its work. I therefore turned to Claudia Perlich of m6d, Drew Conway of IA Ventures, and Pete Skomoroch of LinkedIn to capture their insights in plain verbs that they feel best describe the burden of their daily tasks: fiddling, educating, judging, understanding, converging, and hopping. Of course, these descriptions should be understood at the appropriate level.
Claudia Perlich, Chief Scientist at m6d: “Models estimate how good a candidate someone is for running shoes, say. But then the question is also, what is he doing right now? Is he reading his email on Yahoo, is he sitting on Facebook, is he reading a blog about the New York Marathon? So we have different layers of models that then kick in to say, should we pay the usual price or should we pay more? I’m constantly fiddling so I can see if we can add additional features or information to those models.
My challenge is still trying to communicate to the powers in charge what can and cannot be done with data. I think there’s a breed of people who understand data and know what to do and then there’s this huge expectation that people have for what data should be able to do. Some people have too low expectations for data; they just don’t get certain aspects of it. And others have too high expectations, believing that just because you have data you can answer any question. I spend time working with people, trying to understand what they expect to see, and helping them understand what they can realistically hope for or how long it takes to get there.
Evaluating a data scientist is really hard. Quality control around data science is incredibly difficult. If I build a model that predicts something, I have a hunch, but I don’t know how good it is. And then you ask me to evaluate somebody else’s work, where I only get exposed to about 5% of what the person really did. It’s impossible for me to judge how good a job the other person did. And that makes it extremely hard to evaluate candidates as well.”
Drew Conway, Scientist-in-Residence at IA Ventures: “The thing that data scientists are not so good at, is that the story is not being told in a way that describes the challenging sociological questions that we really need to be focusing our attention on and spending much more time thinking about. [...] Data science is really about understanding human behavior and trying to find interesting patterns about that so that we can inform lots of problem areas that we haven’t been able to address yet. Things like social policy, health care and medicine, local, national, and international policies about national security and war and peace and things like that, we haven’t really addressed those before.”
Pete Skomoroch, Principal Data Scientist at LinkedIn: “If you have to know different things to be a data scientist, you have to have some programming and stats and a lot of other things, it can be hard to keep sharp on all of those skills. Oftentimes when I list the things that a data scientist should know, it seems overwhelming. Often, data scientists seem to be in the 80th percentile of a large number of areas in terms of skills, and maybe they’re a rock star in one particular area like machine learning or data visualization.
Working in different domains is good for people who are intellectually curious and just like solving problems in general. It can be a challenge, but it keeps life interesting. You see commonalities. If you take a really good data scientist and they’ve been working in bioinformatics, and then you drop them into a consumer internet company, they can often ramp up fairly quickly, pick up some domain knowledge and then start solving problems.”
Unlocking your Big Data Potential
Transformative Big Data initiatives begin with “magic moments”: choosing a domain in which your organization wishes to excel, while taking the risks and side effects into account. Performance Big Data initiatives target existing projects with the aim of improving their performance. With “No More Secrets with Big Data Analytics,” VINT aims to create clarity by putting experience and vision in perspective, independently and supported by examples. The book is available for download.