Google DataFlow: Get rid of your embarrassingly parallel problems
Mar 9, 2015
“My hypothesis is that we can solve [the software crisis in parallel computing], but only if we work from the algorithm down to the hardware — not the traditional hardware first mentality.” – Tim Mattson
I wanted to start with this eloquent quote from Tim Mattson to highlight the most pressing need generated by Big Data and IoT today: an efficient model for processing large amounts of data. Parallel computing technologies provide tangible answers to this problem, but they remain quite complex and difficult to use.
On June 25, 2014, Google announced Cloud Dataflow, a Cloud solution for large-scale data processing. The technology is based on a highly efficient and popular model used internally at Google, which evolved from MapReduce and successor technologies such as Flume and MillWheel. You can use Dataflow to solve “embarrassingly parallel” data processing problems – the ones that can easily be decomposed into “bundles” and processed simultaneously.
Fortunately, I had the chance to be whitelisted for the Dataflow private alpha release and was therefore able to get my hands on this very new Google Cloud service. I will not keep you in suspense any longer: I have been really impressed by the solution, and I am pleased to share five reasons why every business concerned with large-scale data processing should try Google Cloud Dataflow at least once:
1 – Simple: Dataflow offers a unified programming model built on three simple components:
A – Pipelines – Independent entities that read, transform, and produce data
B – Pipeline data – Input data that will be read by the pipeline
C – Pipeline transforms – Operations to perform on any given piece of data
Building a powerful pipeline that processes 100 million records can be done in about six lines of code, as sketched below.
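To make this concrete, here is a minimal sketch of such a pipeline written with the open-source Java SDK: it reads text files from Cloud Storage, counts the occurrences of each distinct line, and writes the results back out. The bucket and file names are placeholders I made up for illustration, and class or method names may differ slightly in your SDK version.

```java
// Minimal Dataflow pipeline sketch (Java SDK): read lines, count them, write results.
// The "gs://my-bucket/..." paths are hypothetical placeholders.
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Count;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.KV;

public class MinimalPipeline {
  public static void main(String[] args) {
    // Runner, project and staging location come from the command-line options.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(TextIO.Read.from("gs://my-bucket/input-*.txt"))     // pipeline data: lines of text
     .apply(Count.<String>perElement())                         // transform: count each distinct line
     .apply(ParDo.of(new DoFn<KV<String, Long>, String>() {     // transform: format results as text
       @Override
       public void processElement(ProcessContext c) {
         c.output(c.element().getKey() + ": " + c.element().getValue());
       }
     }))
     .apply(TextIO.Write.to("gs://my-bucket/output"));          // write results back to Cloud Storage

    p.run();  // executes locally or on the managed Cloud service, depending on the chosen runner
  }
}
```

The same code runs unchanged on your workstation or on the managed platform; only the runner selected in the options changes.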
2 – Efficient: I regularly run computations on large data sets. One of my programs, which generates statistics on more than 56 million records, took 7 to 8 hours on a 12-core machine; with Dataflow, it took about 30 minutes. It is also worth noting that Dataflow provides a real-time streaming feature, allowing developers to process data on the fly (a small sketch follows below).
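As an illustration of the streaming mode, here is a hedged sketch that reads messages from a Pub/Sub topic and counts them per one-minute window. The topic name is hypothetical, and I am assuming the streaming flag exposed through DataflowPipelineOptions in the open-source SDK.

```java
// Sketch of a streaming pipeline: count Pub/Sub messages in one-minute windows.
// The topic name is a made-up placeholder.
import org.joda.time.Duration;

import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.PubsubIO;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.Count;
import com.google.cloud.dataflow.sdk.transforms.windowing.FixedWindows;
import com.google.cloud.dataflow.sdk.transforms.windowing.Window;

public class StreamingCount {
  public static void main(String[] args) {
    DataflowPipelineOptions options =
        PipelineOptionsFactory.fromArgs(args).as(DataflowPipelineOptions.class);
    options.setStreaming(true);  // run as a streaming job instead of a batch job

    Pipeline p = Pipeline.create(options);
    p.apply(PubsubIO.Read.topic("/topics/my-project/events"))                    // unbounded input
     .apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1))))   // one-minute windows
     .apply(Count.<String>perElement());                                         // counts per window
    p.run();
  }
}
```

In a real job you would of course add a sink (BigQuery, for instance) after the count.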
3 – Open Source: Dataflow consists of two major components: an SDK and a fully managed Cloud platform that handles resources and jobs for you. The Dataflow SDK is open source and available on GitHub, which means a community can quickly grow around it and that you can easily extend the solution to meet your requirements.
4 – Integrated: Dataflow can natively read input from and write output to several locations: Cloud Storage, BigQuery, and Pub/Sub. This integration can really speed up and ease adoption if you are already relying on these technologies.
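For example, swapping Cloud Storage for BigQuery as the input is just a matter of changing the source transform. Below is a small sketch that reads rows from a made-up BigQuery table, keeps one column, and writes it to Cloud Storage; the table, column, and bucket names are assumptions for illustration only.

```java
// Sketch: read from BigQuery, extract one column, write to Cloud Storage.
// Table, column and bucket names are hypothetical.
import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;

public class BigQueryToStorage {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply(BigQueryIO.Read.from("my-project:my_dataset.events"))  // rows from a BigQuery table
     .apply(ParDo.of(new DoFn<TableRow, String>() {                // keep a single column as text
       @Override
       public void processElement(ProcessContext c) {
         c.output((String) c.element().get("user_id"));
       }
     }))
     .apply(TextIO.Write.to("gs://my-bucket/users"));              // write to Cloud Storage

    p.run();
  }
}
```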
5 – Documented: The Dataflow documentation website is well stocked with theory and examples. If you are familiar with the pipeline concept, you should be able to run your own customized pipeline in less than an hour. Many valuable examples are also provided on the SDK GitHub page.
Dataflow is still in alpha, but it already promises to be a strong player in the world of Cloud parallel computing.
Stay tuned for more posts on this topic.