Apache Hadoop

from Wikipedia, the free encyclopedia
Apache Hadoop

Apache Hadoop logo
Basic data

Developer: Apache Software Foundation
Initial release: April 1, 2006
Current version: 3.2.0 (January 16, 2019)
Operating system: platform independent
Programming language: Java
Category: Distributed file system
License: Apache License 2.0
Available in German: no
hadoop.apache.org

Apache Hadoop is a free, Java-based framework for scalable, distributed software. It is based on the MapReduce algorithm from Google Inc. and on concepts from the Google File System, and it enables computationally intensive processes on large amounts of data (big data, in the petabyte range) to be carried out on computer clusters. Hadoop was initiated by Lucene inventor Doug Cutting and first released in 2006. On January 23, 2008, it became a top-level project of the Apache Software Foundation. Users include Facebook, a9.com, AOL, Baidu, IBM, ImageShack and Yahoo.

Components

Hadoop Distributed File System (HDFS)

HDFS is a highly available file system for storing very large amounts of data across the file systems of several computers (nodes). Files are broken down into data blocks of fixed length and distributed redundantly across the participating nodes. There are master and slave nodes: a master node, the so-called NameNode, processes incoming data requests, organizes the placement of files on the slave nodes and stores the metadata. HDFS supports file systems with several hundred million files. Both the file block length and the degree of redundancy are configurable.
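The block-splitting and replication scheme described above can be sketched in a few lines of plain Python. This is a conceptual illustration only, not the real Hadoop API; the names, the tiny block size and the round-robin placement are assumptions chosen for readability (real HDFS defaults to 128 MiB blocks, a replication factor of 3, and rack-aware placement).

```python
# Conceptual sketch of HDFS-style storage (not the real Hadoop API):
# a file is cut into fixed-length blocks and each block is replicated
# to several data nodes, while a NameNode keeps only the metadata.
BLOCK_SIZE = 4          # HDFS default is 128 MiB; tiny here for illustration
REPLICATION = 2         # HDFS default replication factor is 3

def store_file(data: bytes, nodes: list) -> dict:
    """Split data into blocks, place each block on REPLICATION nodes,
    and return NameNode-style metadata (block id -> list of node names)."""
    metadata = {}
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for block_id, block in enumerate(blocks):
        # simple round-robin placement; real HDFS also considers rack topology
        targets = [nodes[(block_id + r) % len(nodes)] for r in range(REPLICATION)]
        for node in targets:
            node.setdefault("blocks", {})[block_id] = block
        metadata[block_id] = [n["name"] for n in targets]
    return metadata

nodes = [{"name": f"datanode{i}"} for i in range(3)]
meta = store_file(b"hello hdfs!", nodes)
# Every block now lives on exactly two of the three data nodes, so the
# loss of any single node leaves every block recoverable.
```
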

HDFS is designed for large files. Many small files are inefficient and should be grouped in an application-transparent manner using Hadoop Archives (HAR). In future releases, small files are to be supported transparently by the Hadoop Distributed Data Store (HDDS).

HDFS can be replaced by other distributed file systems such as CassandraFS, MapRFS, GPFS, S3 and Azure Blob Storage. With restrictions, FTP servers are also supported as file systems. Hadoop ecosystem applications that use third-party file systems must support the corresponding data locality for optimal performance, which should be verified through tests.

Yet Another Resource Negotiator (YARN)

YARN makes it possible to manage the resources of a cluster dynamically for different jobs. Queues let YARN allocate the cluster's capacity among individual jobs. In addition to CPU and memory, version 3.1.0 and later also manage GPU and FPGA resources, which are primarily relevant for machine learning; this can be configured per application and per user.

MapReduce

Hadoop implements the MapReduce algorithm with configurable classes for the Map, Reduce and Combine phases. Within the Hadoop ecosystem, however, MapReduce is increasingly considered obsolete and is being replaced by execution engines based on a directed acyclic graph (DAG).
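The MapReduce paradigm can be illustrated with the classic word-count example, sketched here in plain Python rather than as a real Hadoop job (which would implement Mapper and Reducer classes in Java and run distributed across the cluster):

```python
# Minimal word-count sketch of the MapReduce paradigm in plain Python;
# conceptual only, not the Hadoop API.
from itertools import groupby

def map_phase(line: str):
    # Map: emit an intermediate (word, 1) pair for every word in the line
    for word in line.split():
        yield (word, 1)

def reduce_phase(word: str, counts):
    # Reduce: sum all partial counts for one key
    return (word, sum(counts))

lines = ["to be or not", "to be"]
# Shuffle/sort: group all intermediate pairs by key, as the framework would
pairs = sorted(kv for line in lines for kv in map_phase(line))
result = dict(reduce_phase(k, (c for _, c in g))
              for k, g in groupby(pairs, key=lambda kv: kv[0]))
# result == {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

Because each map call sees only one line and each reduce call only one key, both phases parallelize trivially, which is what makes the model attractive on a cluster.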

Directed Acyclic Graph (DAG)

Execution engines based on a directed acyclic graph are provided for the Hadoop ecosystem by, for example, Apache Tez, Apache Flink and Apache Spark. They enable the fast execution of complex distributed algorithms. Thanks to Hadoop's modular architecture, these engines can easily run side by side.

Transparent compression

Hadoop supports the transparent compression of files for optimal storage and resource usage. A wide variety of formats are supported, including Snappy for fast compression, zlib for high compression rates and bzip2 for the highest compression. Any other format can be made transparently available to Hadoop applications. Compression can improve performance, since it significantly reduces the necessary I/O operations. However, not all compression formats are "splittable", i.e. decompressable in parallel. Modern file formats such as ORC or Parquet circumvent this by internally dividing the files to be compressed into blocks, so that every compression format becomes suitable for processing files in parallel.
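The block-wise trick used by ORC and Parquet can be sketched with Python's standard zlib module. This is an illustration of the idea only, not either format's actual on-disk layout: each block is compressed independently, so blocks can be decompressed in parallel even though the codec itself is not splittable.

```python
# Sketch of block-wise compression as used by formats such as ORC/Parquet:
# compressing each block independently restores parallelism for any codec.
import zlib

def compress_blocks(data: bytes, block_size: int = 8):
    # each block gets its own independent compressed stream
    return [zlib.compress(data[i:i + block_size])
            for i in range(0, len(data), block_size)]

def decompress_blocks(blocks):
    # each call is independent and could run on a different worker
    return b"".join(zlib.decompress(b) for b in blocks)

data = b"hadoop " * 10
blocks = compress_blocks(data)
assert decompress_blocks(blocks) == data
```

A real columnar writer uses much larger blocks (and stores per-block offsets so readers can seek), but the parallelism argument is the same.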

Transparent file format support

Hadoop transparently supports different file formats depending on the application. Both unstructured and structured formats are supported, including simple text formats such as CSV and JSON, highly optimized schema-based formats (Apache Avro), and highly optimized tabular formats such as ORC and Parquet. Additional file formats can easily be developed. Other plugins support the analysis of crypto ledgers such as the Bitcoin blockchain.
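Why tabular formats like ORC and Parquet are "highly optimized" for analytics can be shown with a small sketch: they store data column by column rather than row by row, so a query touching one column reads only that column's data. The example below is a hypothetical in-memory illustration, not either format's actual encoding.

```python
# Sketch of the row-to-column transformation behind ORC/Parquet:
# a query over one column scans a single contiguous list instead of
# skipping through whole rows.
rows = [
    {"id": 1, "city": "Berlin", "temp": 21},
    {"id": 2, "city": "Paris",  "temp": 25},
    {"id": 3, "city": "Berlin", "temp": 19},
]

# Row layout -> column layout, as a columnar writer would do on disk
columns = {key: [row[key] for row in rows] for key in rows[0]}

# An aggregate over "temp" now touches only the "temp" column
avg_temp = sum(columns["temp"]) / len(columns["temp"])
```

The same layout also compresses better, since values within one column tend to be similar (e.g. repeated city names).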

XML is considered obsolete in the Hadoop ecosystem because it is not suitable for high-performance big data applications. Instead, Apache Avro is recommended as the exchange format, and ORC or Parquet as the query format for highly structured data.

Extensions

HBase

HBase is a scalable, simple database for managing very large amounts of data within a Hadoop cluster. It is a free implementation of Google's Bigtable. This data structure is suitable for data that is rarely changed but frequently appended. HBase can manage billions of rows in a distributed and efficient manner. It is suited to quickly retrieving small portions of large data sets and to quickly writing frequently changed or individual records. The Apache Phoenix project provides a SQL99 interface for HBase.
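The Bigtable data model that HBase implements is essentially a sorted, versioned map from row key and column to timestamped values. The sketch below illustrates that model in plain Python; the class and method names are hypothetical and this is not the HBase client API.

```python
# Sketch of HBase's data model: row key -> "family:column" -> timestamped
# versions. Illustrative only; not the HBase client API.
import time

class MiniWideColumnStore:
    def __init__(self):
        self.rows = {}  # row key -> {"family:column": [(ts, value), ...]}

    def put(self, row, column, value, ts=None):
        # appends a new version rather than overwriting, like HBase
        versions = self.rows.setdefault(row, {}).setdefault(column, [])
        versions.append((ts if ts is not None else time.time(), value))

    def get(self, row, column):
        # HBase returns the newest version by default
        versions = self.rows.get(row, {}).get(column, [])
        return max(versions)[1] if versions else None

store = MiniWideColumnStore()
store.put("row1", "info:name", "Hadoop", ts=1)
store.put("row1", "info:name", "Apache Hadoop", ts=2)
assert store.get("row1", "info:name") == "Apache Hadoop"
```

The append-only versioning is why the model favors data that is frequently added but rarely changed in place.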

Hive

Hive extends Hadoop with data warehouse functionality, namely the query language HiveQL and indexes. HiveQL is based on SQL and gives developers a SQL99-like syntax. Since Hive 2.0, Hybrid Procedural SQL On Hadoop (HPL/SQL) is supported, which allows PL/SQL and many other SQL dialects to be executed. In addition, through the ORC table format, LLAP and many other optimizations, complex interactive queries are increasingly supported alongside batch applications. These optimizations come from the Stinger initiative, which also provides support for SQL:2011 Analytics. Extensions such as HiveMall offer in-database analytics for complex machine learning applications. Transactionality is also supported by the ORC table format. Traditional indexes such as the B-tree index and the bitmap index can be defined. For data warehouse scenarios, however, it is recommended not to use these, but instead the ORC format with its support for compression, Bloom filters and storage indexes. Provided the data is sorted, this enables much more efficient queries. Modern database appliances such as Oracle Exadata support these optimizations and likewise recommend avoiding traditional indexes for performance reasons.
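Since HiveQL closely follows standard SQL, a typical Hive aggregation query has the same shape as one for any SQL database. The example below runs such a query against SQLite purely for illustration; the table and column names are hypothetical, and Hive itself would compile this query into a distributed Tez or Spark job.

```python
# A HiveQL-shaped aggregation query, executed here on SQLite only to
# illustrate the syntax; Hive would run it distributed on the cluster.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (user TEXT, bytes INTEGER)")
conn.executemany("INSERT INTO logs VALUES (?, ?)",
                 [("alice", 100), ("bob", 50), ("alice", 25)])

rows = conn.execute(
    "SELECT user, SUM(bytes) FROM logs GROUP BY user ORDER BY user"
).fetchall()
# rows == [("alice", 125), ("bob", 50)]
```

The point of Hive is that a query like this runs unchanged over petabytes stored in HDFS, rather than over a local database file.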

Hive executes queries through so-called "engines". MapReduce (MR) is considered obsolete and should no longer be used (marked as deprecated since Hive 2.0). Tez is recommended instead; alternatively, Spark is offered as an engine. Both are based on optimization methods using directed acyclic graphs.

LLAP offers a transparent in-memory cache that is geared towards interactive big data warehouse applications.

In the summer of 2008, Facebook, the original developer of Hive, made the project available to the open-source community. The Hadoop database used by Facebook, at just over 100 petabytes (as of August 2012), is one of the largest in the world; by 2014 this warehouse had grown to 300 PB.

Pig

Pig can be used to create MapReduce programs for Hadoop in the high-level language Pig Latin. Pig is characterized by the following properties:

  • Simplicity. The parallel execution of complex analyses is easy to understand and implement.
  • Optimization. Pig independently optimizes the execution of complex operations.
  • Extensibility. Pig can be extended with custom functionality and thus adapted to individual areas of application.
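A typical Pig Latin script follows the pattern LOAD → GROUP → FOREACH … GENERATE. The sketch below expresses that dataflow in plain Python to show what Pig computes; the data and names are hypothetical, and Pig itself would compile the same pipeline into distributed jobs rather than running it locally.

```python
# The Pig Latin pattern LOAD -> GROUP BY -> FOREACH ... GENERATE SUM,
# sketched sequentially in Python (Pig compiles this to cluster jobs).
records = [("alice", 3), ("bob", 5), ("alice", 2)]   # LOAD

grouped = {}                                          # GROUP records BY name
for name, value in records:
    grouped.setdefault(name, []).append(value)

# FOREACH grouped GENERATE group, SUM(values)
totals = {name: sum(vals) for name, vals in grouped.items()}
# totals == {"alice": 5, "bob": 5}
```
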

Chukwa

Chukwa enables real-time monitoring of very large distributed systems.

ZooKeeper

ZooKeeper is used for the (distributed) configuration of distributed systems.

Spark

Spark is an in-memory batch processing engine that was developed primarily for machine learning applications. Graph applications, streaming applications and file-based batch jobs are supported. A machine learning library is available, as well as an in-memory batch-processing SQL engine that supports Hive.

Flink

Flink is an in-memory stream processing engine and offers broadly similar functionality to Spark, with a stronger focus on machine learning and complex event processing. It emerged from the European research project Stratosphere. Although Flink was released after Spark, it included efficient memory management for large amounts of data much earlier, without relying on Java's slow serialization methods.

Ignite

Ignite is a distributed big data cache for interactive queries that speeds up access to frequently used data. It supports HDFS and Spark. Thanks to the HDFS support, selected Hive tables/partitions can be kept in memory.

JanusGraph

JanusGraph is a graph database suited to complex graph applications with quadrillions of nodes and edges. Various storage and index engines are supported, including Apache HBase as the storage engine and Apache Solr as the index engine. JanusGraph is based on TinkerPop, the standard stack for graph applications, which supports the interactive graph query language Gremlin. JanusGraph is the open-source successor to TitanDB.

Architecture

Hadoop should be understood as an ecosystem in which it interacts with many other extensions. A suitable architecture therefore has to be chosen.

Lambda architecture

A popular architecture here is the lambda architecture. It distinguishes the following layers:

  • Batch layer: processes data as part of long-running batch processes. This is often covered by Hadoop MapReduce, Spark or Hive in combination with the HDFS file system.
  • Speed layer: processes data streams (streaming) from "live" events. These are large data streams, often several terabytes per hour, from Internet of Things / Industry 4.0 devices or social networks such as Twitter or Facebook. Online machine learning algorithms are often used here because they can adapt the model to the latest events. Kafka is frequently used to bundle the data streams, along with Spark Streaming, Flink Streaming or Storm.
  • Serving layer: makes the results from the batch layer and the speed layer available to users as quickly as possible for interactive analyses. This area is often covered by traditional databases, but increasingly also by NoSQL databases, since these offer more suitable data structures, such as document databases (e.g. MongoDB), graph databases (e.g. TitanDB), column-oriented databases (e.g. HBase) or key-value stores (e.g. Redis).
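The interplay of the three layers can be sketched as a serving-layer query that merges the complete-but-stale batch view with the fresh-but-partial speed view. The keys and counters below are hypothetical examples, not part of any particular system:

```python
# Sketch of a lambda-architecture serving layer: a query combines the
# batch view (recomputed periodically, complete but stale) with the
# speed view (updated live, covering only events since the last batch).
batch_view = {"clicks:2024-01-01": 1000}   # produced by the batch layer
speed_view = {"clicks:2024-01-01": 42}     # incremented by the speed layer

def query(key: str) -> int:
    # batch result plus the delta accumulated since the last batch run
    return batch_view.get(key, 0) + speed_view.get(key, 0)

assert query("clicks:2024-01-01") == 1042
```

When the next batch run completes, its view absorbs the events the speed layer had covered, and the corresponding speed-view entries are discarded.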

Kappa architecture

The Kappa architecture dispenses with the batch layer entirely. Only "live" events are considered and processed in order to make them available to users in the serving layer. This poses particular challenges with regard to availability, reliability and exactly-once delivery.

Awards

A cluster system based on Apache Hadoop won the Terabyte Sort Benchmark in 2008 and 2009. Among the systems tested in this IT benchmark, it was able to sort large amounts of data (one hundred terabytes of integers in 2009) the fastest, albeit with a significantly larger number of nodes than its competitors, as the benchmark rules do not regulate this. It was the first Java program and also the first open-source program to win this benchmark.

In March 2011, The Guardian gave Apache Hadoop the Innovator of the Year award at the MediaGuardian Innovation Awards, ahead of innovations such as WikiLeaks and the iPad. The judges highlighted that Hadoop enables such versatile and far-reaching applications that it could prove to be the beginning of a new data revolution.

Commercial Support and Commercial Forks

Since the use of Hadoop is particularly interesting for companies, a number of companies offer commercial support for, or forks of, Hadoop:

  • With CDH, Cloudera provides an "enterprise ready" open-source distribution for Hadoop (current version: CDH 6.0.0). At the beginning of 2019, the other large big data distribution provider, Hortonworks, was merged in. Hortonworks originated as a spin-off of Yahoo and Benchmark Capital.
  • Teradata has partnered with Hortonworks to provide an expanded distribution. Teradata Open Distribution for Hadoop (TDH) 2.1 thus links Hadoop with Teradata products. Teradata is the global leader in data warehousing.
  • Microsoft is currently integrating Hadoop into Windows Azure and SQL Server . The integration will be part of SQL Server 2019.
  • The Google App Engine MapReduce supports Hadoop programs.
  • The IBM product InfoSphere BigInsights is based on Hadoop.
  • With Greenplum HD, EMC offers Hadoop as part of a product package.
  • With SAP HANA Vora, SAP SE offers a connection from Hadoop to SAP HANA .
  • SAS enables SAS scripts to be executed in a distributed manner on a Hadoop cluster.
  • Matlab from Mathworks supports the distributed execution of Matlab scripts on a Hadoop cluster.

There are also other providers.

Literature

  • Ramon Wartala: Hadoop. Reliable, distributed and scalable big data applications. Open Source Press, Munich 2012. ISBN 978-3-941841-61-1
