Apache Spark

Developer: Apache Software Foundation
Release date: May 30, 2014, March 1, 2014
Current version: 3.0.0 (June 6, 2020)
Operating system: Windows, OS X, Linux
Programming language: Scala, Java, Python
Category: Development framework, big data analysis
License: Apache License, version 2.0

Apache Spark is a framework for cluster computing that originated in a research project at the AMPLab of the University of California, Berkeley, and has been publicly available under an open-source license since 2010. The Apache Software Foundation has maintained the project since 2013 and has classified it as a Top-Level Project since 2014.


Spark consists of several partially interdependent components:

Spark Core

Spark Core forms the basis of the entire Spark system. It provides the basic infrastructure functionality (task distribution, scheduling, I/O, etc.). The basic data structure for all operations carried out in Spark is the Resilient Distributed Dataset (RDD): a (partial) set of data, formed according to logical criteria, that can be distributed over several computers. RDDs can be generated from external sources (e.g. SQL, file, ...) or as the result of applying various transformation functions (map, reduce, filter, join, group, ...).
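The idea of a dataset that is split across machines and transformed step by step can be illustrated with a small toy model. The sketch below is plain Python and does not use Spark; the class name `ToyRDD`, its methods, and the round-robin split are assumptions made for illustration, and the partitions are ordinary lists in one process rather than data on separate computers. In this sketch `map` and `filter` only record work and return a new dataset, while `collect` and `reduce` actually run it and return plain values.

```python
from functools import reduce as fold


class ToyRDD:
    """Toy stand-in for an RDD: partitioned data plus recorded transformations."""

    def __init__(self, partitions, ops=()):
        self.partitions = partitions  # one list per (imaginary) machine
        self.ops = tuple(ops)         # transformations recorded but not yet run

    @classmethod
    def parallelize(cls, data, num_partitions):
        data = list(data)
        # Round-robin split, chosen here only for simplicity.
        return cls([data[i::num_partitions] for i in range(num_partitions)])

    def map(self, fn):
        # Returns a new dataset; nothing is computed yet.
        return ToyRDD(self.partitions, self.ops + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self.partitions, self.ops + (("filter", pred),))

    def _compute(self, partition):
        # Apply the recorded transformations to one partition, in order.
        items = iter(partition)
        for kind, fn in self.ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    def collect(self):
        # Evaluate every partition and concatenate the results.
        return [x for part in self.partitions for x in self._compute(part)]

    def reduce(self, fn):
        # Reduce each non-empty partition locally, then combine the partial results.
        computed = [self._compute(part) for part in self.partitions]
        partials = [fold(fn, items) for items in computed if items]
        return fold(fn, partials)


numbers = ToyRDD.parallelize(range(1, 11), num_partitions=3)
squares_of_even = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

print(sorted(squares_of_even.collect()))            # [4, 16, 36, 64, 100]
print(squares_of_even.reduce(lambda a, b: a + b))   # 220
```

Because each partition is processed independently and only the partial results are combined at the end, the same pattern can in principle be spread across several machines.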

Spark SQL

Spark SQL makes it possible to convert RDDs into a so-called DataFrame, on which SQL queries can be run. For this purpose, DataFrames are registered as temporary tables under a user-defined table name, which can then be used in the FROM clause of SQL queries. This makes it easy to carry out selections, projections, joins, groupings and more.
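The workflow described above (tabular data registered under a user-chosen temporary table name, then referenced in a FROM clause) can be sketched without a Spark installation. The example below uses Python's built-in `sqlite3` module purely as a stand-in for the SQL engine; it is not Spark SQL, and the table name `people`, its columns, and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical sample rows standing in for the contents of a DataFrame.
rows = [("alice", "berlin", 34), ("bob", "paris", 45), ("carol", "berlin", 29)]

conn = sqlite3.connect(":memory:")

# Register the data as a temporary table under a user-chosen name.
conn.execute("CREATE TEMP TABLE people (name TEXT, city TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?, ?)", rows)

# The table name can now appear in the FROM clause; the query combines
# a selection (WHERE), a projection (SELECT list) and a grouping (GROUP BY).
query = """
    SELECT city, COUNT(*) AS n, AVG(age) AS avg_age
    FROM people
    WHERE age >= 30
    GROUP BY city
    ORDER BY city
"""
result = conn.execute(query).fetchall()
print(result)  # [('berlin', 1, 34.0), ('paris', 1, 45.0)]
```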

Spark Streaming

Spark Streaming enables the processing of data streams by dividing them into small batches (micro-batches), on which transformations can then be carried out.
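The batching idea can be sketched in a few lines of plain Python. This is a toy model, not the Spark Streaming API: the function names `micro_batches` and `process_stream` are assumptions made for illustration, and the stream is cut after a fixed number of elements, whereas Spark Streaming cuts batches by time interval.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield consecutive lists of at most batch_size elements from stream."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_stream(stream, batch_size, transform):
    """Apply the same transformation to every batch and gather the results."""
    return [transform(batch) for batch in micro_batches(stream, batch_size)]


def count_words(batch):
    # Per-batch transformation: count how often each word occurs in the batch.
    return {word: batch.count(word) for word in sorted(set(batch))}


events = ["a", "b", "a", "c", "b", "a", "c"]  # hypothetical incoming events
counts_per_batch = process_stream(events, 3, count_words)
print(counts_per_batch)
# [{'a': 2, 'b': 1}, {'a': 1, 'b': 1, 'c': 1}, {'c': 1}]
```

Each batch is a small, finite dataset, so the same kind of transformation that works on stored data can be reused on a stream.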

MLlib / SparkML Machine Learning Library

MLlib and its successor SparkML are libraries that make typical machine learning algorithms available on distributed Spark systems.


GraphX

GraphX is a Spark-based, distributed framework for computations on graphs.
