Apache Spark
| Basic data | |
|---|---|
| Developer | Apache Software Foundation |
| Release date | May 30, 2014, March 1, 2014 |
| Current version | 3.0.0 (June 6, 2020) |
| Operating system | Windows, OS X, Linux |
| Programming language | Scala, Java, Python |
| Category | Development framework, big data analysis |
| License | Apache License, version 2.0 |
| Website | spark.apache.org |
Apache Spark is a cluster-computing framework that originated as a research project at the AMPLab of the University of California, Berkeley, and has been publicly available under an open-source license since 2010. The Apache Software Foundation has maintained the project since 2013 and has classified it as a Top-Level Project since 2014.
Architecture
Spark consists of several partially interdependent components:
Spark Core
Spark Core forms the foundation of the entire Spark system. It provides basic infrastructure functionality such as task distribution, scheduling, and I/O. The basic data structure for all operations carried out in Spark is the Resilient Distributed Dataset (RDD): a (sub)set of data, formed according to logical criteria, that can be distributed across several computers. RDDs can be created from external sources (e.g. SQL databases or files) or as the result of applying transformation functions (map, filter, join, group, ...) to existing RDDs. Aggregations such as reduce then combine the elements of an RDD into a single result.
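The RDD idea can be illustrated without a cluster. The following is a minimal sketch in plain Python, not Spark's API: the class `ToyRDD` and its methods are hypothetical names invented for this illustration. It assumes a simple round-robin partitioning and shows two properties described above: the data is split into partitions, and transformations such as map and filter are only recorded until an aggregation like reduce asks for a result.

```python
from functools import reduce as fold


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's implementation)."""

    def __init__(self, partitions, ops=()):
        self.partitions = partitions  # list of lists, one per imaginary worker
        self.ops = ops                # recorded transformations, not yet executed

    @classmethod
    def parallelize(cls, data, num_partitions):
        data = list(data)
        # Round-robin split; an assumption made for this sketch only.
        return cls([data[i::num_partitions] for i in range(num_partitions)])

    def map(self, fn):
        # Transformations return a new ToyRDD and do no work yet.
        return ToyRDD(self.partitions, self.ops + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self.partitions, self.ops + (("filter", pred),))

    def _compute(self, partition):
        # Apply all recorded transformations to a single partition.
        items = iter(partition)
        for kind, fn in self.ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    def collect(self):
        # Gather the transformed elements of every partition into one list.
        return [x for part in self.partitions for x in self._compute(part)]

    def reduce(self, fn):
        # Reduce each non-empty partition locally, then combine the partial results.
        computed = [self._compute(part) for part in self.partitions]
        partials = [fold(fn, part) for part in computed if part]
        return fold(fn, partials)


numbers = ToyRDD.parallelize(range(1, 11), num_partitions=3)
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
total = even_squares.reduce(lambda a, b: a + b)
print(total)  # 220
```

Because each partition is reduced on its own before the partial results are merged, the combining function has to be associative and commutative, otherwise the result would depend on how the data happened to be partitioned.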
Spark SQL
Spark SQL makes it possible to convert RDDs into a DataFrame on which SQL queries can be run. To do this, a DataFrame is registered as a temporary table under a user-defined name, which can then be used in the FROM clause of SQL queries. This makes selections, projections, joins, groupings, and similar operations easy to express.
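The register-then-query pattern can be sketched as follows. This example deliberately uses Python's built-in `sqlite3` module as a stand-in so that it runs without a Spark installation; it is an analogy, not Spark code. The table name `people` and the sample rows are made up for illustration. In PySpark, the corresponding steps would be `DataFrame.createOrReplaceTempView(name)` followed by `SparkSession.sql(query)`.

```python
import sqlite3

# Hypothetical sample data standing in for the rows of a DataFrame.
rows = [("alice", "berlin", 34), ("bob", "munich", 45), ("carol", "berlin", 29)]

conn = sqlite3.connect(":memory:")

# "Register" the data as a temporary table under a user-defined name.
conn.execute("CREATE TEMP TABLE people (name TEXT, city TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?, ?)", rows)

# The chosen table name appears in the FROM clause; the query combines
# a selection (WHERE), a projection (SELECT list), and a grouping (GROUP BY).
query = """
    SELECT city, COUNT(*) AS n, AVG(age) AS avg_age
    FROM people
    WHERE age >= 30
    GROUP BY city
    ORDER BY city
"""
result = conn.execute(query).fetchall()
conn.close()

print(result)  # [('berlin', 1, 34.0), ('munich', 1, 45.0)]
```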
Spark Streaming
Spark Streaming processes data streams by dividing them into small batches, to which transformations are then applied.
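The batching idea can be sketched in a few lines of plain Python. This is a simplified illustration, not Spark's API: the helper names `micro_batches` and `word_counts` are hypothetical, and batches are cut by item count here, whereas Spark Streaming forms them per time interval.

```python
from itertools import islice


def micro_batches(stream, batch_size):
    """Yield consecutive lists of at most batch_size items from an iterable."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return  # stream exhausted
        yield batch


def word_counts(batch):
    """Transformation applied to one batch: count word occurrences."""
    counts = {}
    for line in batch:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts


# A finite iterator stands in for an unbounded stream of text lines.
lines = iter(["a b", "b c", "c c", "a", "b b"])
per_batch = [word_counts(batch) for batch in micro_batches(lines, batch_size=2)]

print(per_batch)
# [{'a': 1, 'b': 2, 'c': 1}, {'c': 2, 'a': 1}, {'b': 2}]
```

Each batch is processed independently with an ordinary batch-style transformation, which is what allows the same operations to be used for streams and for static data.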
MLlib / SparkML Machine Learning Library
MLlib and its successor SparkML are libraries that make common machine learning algorithms available on distributed Spark systems.
GraphX
GraphX is a distributed framework, built on Spark, for computations on graphs.
Web links
- Apache Spark website (English)
- Apache Spark Tutorial (German)
- Apache Spark: Introduction (German)