Category Archives: Big Data

Free Book: Mastering Apache Spark 2.0 by Jacek Laskowski

Summary for https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-overview.html: Apache Spark is an open-source, distributed, general-purpose cluster computing framework with an in-memory data processing engine. It can do ETL, analytics, machine learning, and graph processing on large volumes of data at rest (batch processing) or in motion (stream processing), with rich, concise, high-level APIs for Scala, Python, Java, R, and SQL.… Read More »
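
As a rough illustration of how concise those high-level APIs are, here is a minimal batch sketch in Scala; the local master setting and the people.json input file are assumptions for illustration, not something taken from the book excerpt:

```scala
import org.apache.spark.sql.SparkSession

object QuickExample {
  def main(args: Array[String]): Unit = {
    // Local SparkSession, for illustration only
    val spark = SparkSession.builder()
      .appName("QuickExample")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    // Batch processing with the DataFrame API: read, filter, aggregate
    // "people.json" is a hypothetical input file
    val people = spark.read.json("people.json")
    people.filter($"age" > 21)
      .groupBy("country")
      .count()
      .show()

    spark.stop()
  }
}
```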

Apache Spark 2.0 Memory Management

Summary for https://0x0fff.com/spark-memory-management/: Initial Storage Memory region size, as you might remember, is calculated as “Spark Memory” * spark.memory.storageFraction = (“Java Heap” – “Reserved Memory”) * spark.memory.fraction * spark.memory.storageFraction. With default values, this is equal to (“Java Heap” – 300MB) * 0.75 * 0.5 = (“Java Heap” – 300MB) * 0.375. For a 4GB heap this would result in 1423.5MB… Read More »
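
A quick sketch of that arithmetic in Scala, using only the values quoted above (300MB reserved memory, spark.memory.fraction = 0.75, spark.memory.storageFraction = 0.5); the object and constant names are made up for illustration:

```scala
object StorageMemoryEstimate {
  // Values quoted in the summary above
  val ReservedMemoryMB = 300.0
  val MemoryFraction   = 0.75 // spark.memory.fraction
  val StorageFraction  = 0.5  // spark.memory.storageFraction

  // Storage Memory = (Java Heap - Reserved Memory) * fraction * storageFraction
  def storageMemoryMB(javaHeapMB: Double): Double =
    (javaHeapMB - ReservedMemoryMB) * MemoryFraction * StorageFraction

  def main(args: Array[String]): Unit = {
    // For a 4GB (4096MB) heap: (4096 - 300) * 0.375 = 1423.5MB
    println(f"Storage Memory for a 4GB heap: ${storageMemoryMB(4096)}%.1f MB")
  }
}
```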

Apache Spark 2.0 Highlights

Reference: https://databricks.com/blog/2016/07/26/introducing-apache-spark-2-0.html. Structured Streaming supports interactive queries on streaming data through Spark SQL, joins against static data, and many libraries that already use DataFrames, letting developers build complete applications instead of just streaming pipelines. To run a streaming computation, developers simply write a batch computation against the DataFrame / Dataset API, and Spark automatically incrementalizes the computation… Read More »
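
A minimal sketch of that idea in Scala, following the standard Structured Streaming word-count pattern: the flatMap/groupBy/count is written exactly as it would be for a static Dataset, and Spark keeps the result up to date as new data arrives. The socket source on localhost:9999 and the console sink are assumptions for illustration:

```scala
import org.apache.spark.sql.SparkSession

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("StreamingWordCount")
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    // Streaming source: lines of text from a socket (assumed for illustration)
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // The "batch-style" computation: split lines into words and count them
    val wordCounts = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()

    // Spark incrementalizes the computation and updates the counts continuously
    val query = wordCounts.writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()
  }
}
```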