"It wasn't long ago that Apache Hadoop MapReduce was the obvious engine for all things big data, then Apache Spark came along, and more recently Apache Flink, a streaming-native engine. Unlike upgrading hardware, adopting these more modern engines has generally required rewriting pipelines to adopt engine-specific APIs, often with different implementations for streaming and batch scenarios. This can mean throwing away user code that had just been weathered enough to be considered (mostly) bug-free, and replacing it with immature new code. All of this just because the data pipelines needed to scale better, or have lower latency, or run more cheaply, or complete faster.
Adjusting such aspects should not require throwing away well-tested business logic. You should be able to move your application or data pipeline to the appropriate engine, or to the appropriate environment (e.g., from on-prem to cloud) while keeping the business logic intact. But, to do this, two conditions need to be met. First, you need a portable SDK, which can produce programs that can execute on one of many pluggable execution environments. Second, that SDK has to expose a programming model whose semantics are focused on your workload and not on the capabilities of the underlying engine. For example, MapReduce as a programming model doesn’t meet the bill (even though MapReduce as an execution method might be appropriate in some cases) because it cannot productively express low-latency computations."
Thursday, January 21, 2016
Google Cloud Platform Blog: Dataflow and open source - proposal to join the Apache Incubator
Google asserts Dataflow will address the problems described above; also see the Dataflow proposal to the Apache Incubator, which notes: "Dataflow started as a set of Google projects focused on making data processing easier, faster, and less costly. The Dataflow model is a successor to MapReduce, FlumeJava, and Millwheel inside Google and is focused on providing a unified solution for batch and stream processing."
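The core portability idea — business logic written once, with pluggable execution — can be sketched in miniature. This is not Dataflow's or Beam's actual API; the names (`word_count`, `run_batch`, `run_streaming`) are illustrative, and a real engine would add windowing, triggers, and distribution. The sketch only shows the separation of concerns the quoted passage argues for:

```python
from typing import Callable, Dict, Iterable, Iterator

def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Business logic: count words. Knows nothing about the engine."""
    counts: Dict[str, int] = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

def run_batch(pipeline: Callable[[Iterable[str]], Dict[str, int]],
              data: list) -> Dict[str, int]:
    """Hypothetical batch runner: the whole dataset is available at once."""
    return pipeline(data)

def run_streaming(pipeline: Callable[[Iterable[str]], Dict[str, int]],
                  source: Iterator[str]) -> Dict[str, int]:
    """Hypothetical streaming runner: records arrive one at a time.
    Here we simply drain the iterator; a production engine would
    window the stream and emit incremental results."""
    return pipeline(source)

# Same untouched business logic, two different "engines":
lines = ["to be or not to be"]
assert run_batch(word_count, lines) == run_streaming(word_count, iter(lines))
```

Swapping `run_batch` for `run_streaming` changes nothing in `word_count` — which is the property the passage says MapReduce-as-a-programming-model lacks, since its semantics are tied to a batch execution style.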