Apache Spark 2.0 Highlights

December 25, 2016
Reference: https://databricks.com/blog/2016/07/26/introducing-apache-spark-2-0.html

  1. Structured Streaming supports interactive queries on streaming data through Spark SQL, joins against static data, and many libraries that already use DataFrames, letting developers build complete applications instead of just streaming pipelines. (94)
  2. To run a streaming computation, developers simply write a batch computation against the DataFrame / Dataset API, and Spark automatically incrementalizes the computation to run it in a streaming fashion (i.e. update the result as data comes in). (88)
  3. Spark Streaming has long led the big data space as one of the first systems unifying batch and streaming computation. (82)
  4. When you look into a modern data engine (e.g. Spark or other MPP databases), the majority of the CPU cycles are spent in useless work, such as making virtual function calls or reading/writing intermediate data to CPU cache or memory. (81)
  5. Structured Streaming handles fault tolerance and consistency holistically across the engine and storage systems, making it easy to write applications that update a live database used for serving, join in static data, or move data reliably between storage systems. (74)
  6. When its streaming API, called DStreams, was introduced in Spark 0.7, it offered developers several powerful properties: exactly-once semantics, fault tolerance at scale, strong consistency guarantees, and high throughput. (73)
  7. However, after working with hundreds of real-world deployments of Spark Streaming, we found that applications that need to make decisions in real-time often require more than just a streaming engine. (72)
  8. Spark 2.0 ships with an initial, alpha version of Structured Streaming, as a (surprisingly small!) extension to the DataFrame/Dataset API. (66)
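
The idea in sentence 2, writing a batch computation and having the engine run it incrementally, can be illustrated with a small plain-Python sketch. This is not Spark's API and not how Spark implements it: the names batch_count and IncrementalCount, and the sample events, are hypothetical. It only shows that applying the same batch logic to each newly arrived chunk and merging into running state gives the same answer as one batch run over all the data.

```python
from collections import Counter
from typing import Dict, Iterable

Event = Dict[str, str]


def batch_count(events: Iterable[Event]) -> Counter:
    """Batch version: count events per action over a complete dataset."""
    return Counter(e["action"] for e in events)


class IncrementalCount:
    """Streaming version (hypothetical): keeps running state across micro-batches."""

    def __init__(self) -> None:
        self.state: Counter = Counter()

    def update(self, micro_batch: Iterable[Event]) -> Counter:
        # Reuse the batch logic on the new rows only, then merge into the state.
        self.state.update(batch_count(micro_batch))
        return self.state


# Made-up sample data for illustration.
events = [
    {"user": "a", "action": "open"},
    {"user": "b", "action": "open"},
    {"user": "a", "action": "close"},
    {"user": "c", "action": "open"},
]

# Feed the same events as two micro-batches, as if they arrived over time.
stream = IncrementalCount()
stream.update(events[:2])
stream.update(events[2:])

# The incrementally maintained result matches the one-shot batch result.
print(stream.state == batch_count(events))
```

In Structured Streaming the developer would write only the batch-style query; the state handling sketched in IncrementalCount is what the engine takes care of.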
Best words:
  1. spark (34)
  2. streaming (17)
  3. data (12)
  4. apache (7)
  5. databricks (6)
  6. batch (6)
  7. systems (5)
  8. generation (5)
  9. structured (5)
  10. performance (5)