Free Book Mastering Apache Spark 2.0 by Jacek Laskowski

Summary for https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-overview.html:
  1. Apache Spark is an open-source, distributed, general-purpose cluster computing framework with an in-memory data processing engine that can perform ETL, analytics, machine learning, and graph processing on large volumes of data at rest (batch processing) or in motion (stream processing), with rich, concise, high-level APIs for Scala, Python, Java, R, and SQL. (307)
  2. Regardless of which Spark tools you use – the Spark API for the supported programming languages (Scala, Java, Python, R), the Spark shell, or the many Spark Application Frameworks that leverage the concept of RDD, i.e. Spark SQL, Spark Streaming, Spark MLlib, and Spark GraphX – you use the same development and deployment environment to process large data sets and yield a result, be it a prediction (Spark MLlib), a structured data query (Spark SQL), or just a large distributed batch (Spark Core) or streaming (Spark Streaming) computation; see the word-count sketch after this list. (278)
  3. In contrast to Hadoop's two-stage, disk-based MapReduce processing engine, Spark's multi-stage in-memory computing engine runs most computations in memory and hence often provides better performance for certain applications, e.g. iterative algorithms or interactive data mining (there are reports of Spark being up to 100 times faster – read Spark officially sets a new record in large-scale sorting!); see the caching sketch after this list. (261)
  4. When you hear "Apache Spark", it can mean two things: the Spark engine, aka Spark Core, or the Apache Spark open-source project – an "umbrella" term for Spark Core, the accompanying Spark Application Frameworks that sit on top of it, i.e. Spark SQL, Spark Streaming, Spark MLlib, and Spark GraphX, and the main data abstraction in Spark called RDD – Resilient Distributed Dataset. (234)
  5. You could also describe Spark as a distributed data processing engine for batch and streaming modes, featuring SQL queries, graph processing, and Machine Learning. (233)
  6. One of the Spark project's goals was to deliver a platform that supports a very wide array of diverse workflows – not only MapReduce batch jobs (already available in Hadoop at that time), but also iterative computations like graph algorithms or Machine Learning. (215)
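
To make the RDD idea in point 2 concrete, here is a minimal word-count sketch in Scala – the kind of computation that produces a frequency list like the "Best words" section below. It is an illustrative sketch, not an example from the book: the input file name and the local[*] master URL are assumptions made so the program is self-contained.

  import org.apache.spark.sql.SparkSession

  object WordCount {
    def main(args: Array[String]): Unit = {
      // Local session for illustration; in spark-shell a `spark` value is pre-built.
      val spark = SparkSession.builder()
        .appName("WordCount")
        .master("local[*]")                            // assumption: run locally on all cores
        .getOrCreate()
      val sc = spark.sparkContext

      // The RDD pipeline: read lines, split into words, count occurrences.
      val counts = sc.textFile("spark-overview.txt")   // hypothetical input file
        .flatMap(_.toLowerCase.split("""\W+"""))
        .filter(_.nonEmpty)
        .map(word => (word, 1))
        .reduceByKey(_ + _)
        .sortBy(_._2, ascending = false)

      counts.take(10).foreach { case (word, n) => println(s"$word ($n)") }
      spark.stop()
    }
  }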
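
For the in-memory claim in point 3, a sketch of why caching pays off for iterative algorithms: the same RDD is scanned on every iteration, and after the first pass it is read from memory instead of being recomputed from its lineage (or re-read from disk, as in MapReduce). The synthetic data, learning rate, and iteration count are illustrative assumptions.

  // Toy gradient descent fitting y = w * x on synthetic data (assumes a live `sc`).
  val data = sc.parallelize(1 to 1000)
    .map(i => (i.toDouble, 3.0 * i))   // (x, y) pairs with true slope 3.0
    .cache()                           // keep the partitions in memory across iterations

  var w = 0.0
  val lr = 1e-6                        // illustrative step size
  val n = data.count()                 // first action materializes the cache
  for (_ <- 1 to 20) {
    // Each pass reads the cached partitions rather than recomputing from scratch.
    val grad = data.map { case (x, y) => 2 * x * (w * x - y) }.sum() / n
    w -= lr * grad
  }
  println(s"learned w = $w (true value 3.0)")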
Best words:
  1. spark (94)
  2. data (36)
  3. processing (15)
  4. machine (13)
  5. learning (12)
  6. engine (12)
  7. distributed (12)
  8. graph (12)
  9. using (12)
  10. mapreduce (11)