Amazon Dynamo

Amazon Dynamo is a distributed hash table used internally at Amazon.com. Like the Google File System, Dynamo is optimized for a specific use case and tailored to the requirements of a number of Amazon web services that demand a high level of reliability.

Requirements

Amazon's applications expect the storage system to be highly available and extremely failure-tolerant. In particular, writes must be possible in every situation:

[...] even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados.

"[...] even if hard drives fail, network connections go crazy or data centers are destroyed by tornadoes."

- Werner Vogels, amazon.com

The system must be incrementally scalable at any time in order to cover peak loads, for example during the Christmas season. Complicated database queries are avoided; data is accessed directly via a key. Furthermore, security does not have to be addressed at this level, since the system runs in a "friendly" environment inside Amazon's services that is isolated from the outside.

Architecture

Dynamo is based on a network of fully equal computers, i.e. there is no central control or administration; each node can perform any task. This architecture was chosen to ensure the scalability of the system.

Services such as the shopping cart service (the service that manages the shopping cart) expect the storage system to be writable at all times, highly available, and to respond with low latency. Since the consistency and high availability defined by the ACID criteria are conflicting goals, consistency has been relaxed, in contrast to traditional database systems, to eventual consistency. Another characteristic is that mostly small objects (less than 1 MB in size) are stored in the form of key-value pairs; complicated database queries do not have to be supported.
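
To make the access model concrete, the following is a minimal sketch of such a key-value interface: values are opaque blobs addressed only by a key, and there is no query language. The class and method names are purely illustrative and are not Dynamo's actual API.

    class KeyValueStore:
        """Toy key-value store: opaque values addressed only by a key."""

        def __init__(self):
            self._data = {}  # key -> small binary object (typically < 1 MB)

        def put(self, key, value):
            self._data[key] = value

        def get(self, key):
            return self._data.get(key)

    store = KeyValueStore()
    store.put("cart:customer-4711", b"<serialized shopping cart>")
    print(store.get("cart:customer-4711"))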

In order to achieve the desired properties, a number of techniques already known from other contexts were combined:

Consistent hashing

All nodes are arranged in a ring (at least logically; the physical network topology differs). A hash value is computed from each key using MD5. Each node is assigned a specific part of the value range of the hash function and stores the objects whose keys fall into that range. A certain number of the following nodes in the ring also store a copy, and the total number of storing nodes is configurable. To maximize fault tolerance, nodes are distributed not only across different machines and racks, but also across different data centers.

Figure: Consistent hashing in Dynamo with six nodes and triple redundancy

An example with a total of six nodes and redundant storage on three nodes each (N = 3) is shown in the figure above.

Since Dynamo runs on a heterogeneous system landscape in which the node machines can have different storage capacities (and, in addition, some objects are requested more frequently than others), Dynamo uses so-called virtual nodes. The concept of virtual nodes allows several virtual nodes on the same ring to be mapped to a single physical machine. Because the hash function distributes keys evenly, every virtual node carries roughly the same load, so a physical machine with a higher storage capacity can simply be assigned more virtual nodes, which leads to better utilization despite differing capacities.
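
The following Python sketch illustrates the technique under the assumptions described above: keys are placed on the ring via MD5, each physical machine owns several virtual nodes, and the N distinct machines following a key's position form its preference list. Node names and the number of virtual nodes per machine are invented for the example.

    import bisect
    import hashlib

    def ring_position(key):
        """Map a key (or virtual-node name) onto the ring using MD5."""
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    class ConsistentHashRing:
        def __init__(self, replicas=3):
            self.replicas = replicas   # N: number of machines storing each key
            self.positions = []        # sorted virtual-node positions on the ring
            self.owner = {}            # ring position -> physical machine

        def add_node(self, node, vnodes=4):
            # Machines with more capacity can simply be given more virtual nodes.
            for i in range(vnodes):
                pos = ring_position(f"{node}#{i}")
                bisect.insort(self.positions, pos)
                self.owner[pos] = node

        def preference_list(self, key):
            """Return the N distinct physical machines responsible for a key."""
            start = bisect.bisect(self.positions, ring_position(key))
            nodes, i = [], 0
            while len(nodes) < self.replicas and i < len(self.positions):
                node = self.owner[self.positions[(start + i) % len(self.positions)]]
                if node not in nodes:
                    nodes.append(node)
                i += 1
            return nodes

    ring = ConsistentHashRing(replicas=3)
    for name in ["A", "B", "C", "D", "E", "F"]:
        ring.add_node(name)
    print(ring.preference_list("cart:customer-4711"))  # e.g. ['C', 'F', 'A']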

Sloppy quorum and hinted handoff

To make the system fault-tolerant, the likewise configurable parameters R (read) and W (write) were introduced in addition to the parameter N (the number of nodes to which each value is replicated). These parameters are known in a similar form from quorum systems, but have been relaxed here to such an extent that one speaks of a 'sloppy' quorum. They determine how many nodes must report a read or write operation as successful for the operation as a whole to be considered successful. In the standard configuration, the tuple (N, R, W) is assigned the values (3, 2, 2). This means that

  • each value is stored on three nodes,
  • a read is considered successful as soon as at least two nodes return the value, and
  • a write is considered successful as soon as at least two nodes report it as successful.

The parameters also allow an application to tailor the system exactly to its needs. A configuration of (3, 1, 3), for example, behaves like a read cache: only one node has to respond to a read, while every copy must always be written successfully, since W = N. A system with W = 1, on the other hand, is optimized for the fastest possible writes. The configuration (1, 1, 1) in turn simply implements an ordinary (though not highly available) storage system, as do (2, 2, 2), (3, 3, 3) and so on when replication is used.
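
As a rough sketch, the quorum rule itself fits in a few lines; the replica objects and their put/get methods below are hypothetical placeholders for the actual storage nodes, and the operation simply counts acknowledgements against R and W.

    def quorum_write(replicas, key, value, w):
        """Write to all N replicas; succeed once at least W acknowledge."""
        acks = sum(1 for rep in replicas if rep.put(key, value))
        return acks >= w

    def quorum_read(replicas, key, r):
        """Read from the replicas; succeed once at least R return a value."""
        answers = [v for v in (rep.get(key) for rep in replicas) if v is not None]
        return answers if len(answers) >= r else None

    # Standard configuration: N=3, R=2, W=2. Because R + W > N, the read and
    # write quorums overlap, so every read sees at least one replica that took
    # part in the most recent successful write.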

If the coordinator node (the node into whose range the hash value actually falls) is unavailable, the so-called hinted handoff takes effect: if, in the example in the figure above, the hash value in question were 3 and node A were unavailable, the copy would instead be passed on to node D (handoff), together with the note (hint) that this object actually belongs to node A. D therefore stores such copies in a separate local database and asks A from time to time whether it is available again. As soon as this is the case, all hinted copies are transferred to A; after a successful transfer, D can delete the hinted object.
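
A sketch of this behaviour, with is_up, store and store_hint as assumed helper methods rather than Dynamo's real interfaces, could look like this:

    def write_with_hinted_handoff(preference_list, fallback_nodes, key, value):
        """Write to every node of the preference list; divert unreachable ones."""
        for node in preference_list:
            if node.is_up():
                node.store(key, value)
            else:
                # Hand the copy to a substitute node (in Dynamo the next node on
                # the ring), together with a hint naming the intended owner.
                substitute = next(n for n in fallback_nodes if n.is_up())
                substitute.store_hint(key, value, intended_owner=node)

    def return_hints(node):
        """Periodically try to deliver hinted copies back to their owners."""
        for key, value, owner in list(node.hints):
            if owner.is_up():
                owner.store(key, value)
                node.hints.remove((key, value, owner))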

Vector clocks

The sloppy quorum configuration (3, 2, 2) can lead to divergent versions. Since updates may also be propagated in the background (for example to the third node), a successful write that has only reached two nodes can be followed by a read that now returns two different versions. To resolve this conflict, Dynamo uses so-called vector clocks, which in principle are just per-node version counters. Each object carries a vector of tuples of the form (node ID, version number), and each node increments its own version number in the vector by one on every update it coordinates. In the problem case described above, the coordinator would, for example, get back version 14 and version 15 for the same node and can use these version numbers to determine which version is the latest. Accordingly, the requesting client would only receive the latest version, with version number 15.

It only becomes problematic if the actual coordinator has failed for some reason and parallel accesses occur at the same time. For example, the following sequence could result:

  1. Node A coordinates a write ⇒ ([A, 1]).
  2. Node A coordinates a write ⇒ ([A, 2]).
  3. Node A fails.
  4. Node B coordinates a write ⇒ ([A, 2], [B, 1]). At the same time, node C coordinates a write ⇒ ([A, 2], [C, 1]).
  5. Node A is available again.
  6. Node A coordinates a read and gets back both the version ([A, 2], [B, 1]) and the version ([A, 2], [C, 1]).

In this case it cannot be decided whether B's or C's version is the newer one, and conflict resolution is pushed up to the application level: the client receives both versions. The shopping cart service, for example, would merge both versions, and node A would then write the new version ([A, 3], [B, 1], [C, 1]). How conflicts are resolved, however, depends on the respective application. If an application prefers not to deal with conflict resolution, simple last-write-wins strategies are also available out of the box.
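
The comparison and the client-side merge can be sketched as follows; a clock is represented here simply as a dictionary mapping node IDs to counters, which is an illustrative simplification of the tuple vectors described above.

    def descends(a, b):
        """True if clock a is equal to or newer than clock b on every node."""
        return all(a.get(node, 0) >= counter for node, counter in b.items())

    def reconcile(a, b, coordinator):
        """Client-side merge: element-wise maximum, then bump the coordinator."""
        merged = {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}
        merged[coordinator] = merged.get(coordinator, 0) + 1
        return merged

    v_b = {"A": 2, "B": 1}   # written via node B while A was down
    v_c = {"A": 2, "C": 1}   # written via node C at the same time
    assert not descends(v_b, v_c) and not descends(v_c, v_b)   # concurrent versions
    print(reconcile(v_b, v_c, "A"))   # {'A': 3, 'B': 1, 'C': 1}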

Anti-entropy through Merkle Trees

Hinted handoff can cause further problems. For example, the following sequence is possible:

  1. Node A fails; nodes B, C and D have to store new replicas.
  2. A write is coordinated by B; D marks its copy as a hinted handoff.
  3. Node D fails.
  4. Node A becomes available again, but does not get the copy back because D is offline.

The problem: A does not even notice that it holds an old version and that only N − 1 copies exist at this point. To avoid this, A compares its copies with those of B and C when it comes back online. To keep network traffic and computational load as low as possible, so-called Merkle trees are used. Merkle trees are trees whose leaves contain the hash values of the stored objects, whose root contains a hash over all of these hashes, and whose inner nodes contain the corresponding hashes of their subtrees. As a result, A and B only need to exchange the root hash to determine whether their copies are all identical. If they are not, the tree is traversed until the offending leaf is found; the vector clocks then reveal which version is newer, and it is copied accordingly.
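
A minimal sketch of this comparison is shown below. It builds a binary Merkle tree over a flat list of per-key hashes and, if the root hashes differ, reports the differing leaves directly; the real implementation descends the tree level by level and organises one tree per key range, details that are omitted here.

    import hashlib

    def h(data):
        return hashlib.sha1(data).digest()

    def build_tree(leaf_hashes):
        """Return all levels of the Merkle tree, leaves first, root level last."""
        levels = [list(leaf_hashes)]
        while len(levels[-1]) > 1:
            prev = levels[-1]
            # Pair up hashes; duplicate the last one if the level has odd length.
            pairs = [(prev[i], prev[i + 1] if i + 1 < len(prev) else prev[i])
                     for i in range(0, len(prev), 2)]
            levels.append([h(a + b) for a, b in pairs])
        return levels

    def differing_leaves(tree_a, tree_b):
        """Compare two equally shaped trees and report leaf indices that differ."""
        if tree_a[-1] == tree_b[-1]:       # identical root hash: replicas match
            return []
        return [i for i, (a, b) in enumerate(zip(tree_a[0], tree_b[0])) if a != b]

    keys_a = [h(k.encode()) for k in ["cart:1", "cart:2", "cart:3", "cart:4"]]
    keys_b = [h(k.encode()) for k in ["cart:1", "cart:2", "cart:X", "cart:4"]]
    print(differing_leaves(build_tree(keys_a), build_tree(keys_b)))   # [2]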

In the case that (analogous to the example) the network connection to A is interrupted and A does not notice this directly, either A will determine with the help of the vector clocks during the next read that it holds an old version, or the hashes of the Merkle trees are exchanged as part of the regular gossip messages.

Gossip-based protocol

Hinted handoffs ensure that the entire ring structure does not have to be rebuilt every time a node fails temporarily. However, it must also be possible to add nodes to the network or remove them permanently. To make this possible, an administrator logs in on any node and makes an entry in a corresponding configuration file using a command-line tool or a browser. This change is then communicated to all other nodes in the ring via a gossip-based protocol. The protocol keeps both the assignment of virtual nodes to machines and the membership list constantly up to date.

A simple example of explicitly adding node X to a network consisting of A, B, C and D could look as follows:

Step | Action                 | Table of A | Table of B | Table of C | Table of D | Table of X
1    | Initial state          | ABCD       | ABCD       | ABCD       | ABCD       | X
2    | X registers with A     | ABCDX      | ABCD       | ABCD       | ABCD       | ABCDX
3    | A communicates with B  | ABCDX      | ABCDX      | ABCD       | ABCD       | ABCDX
4    | C communicates with D  | ABCDX      | ABCDX      | ABCD       | ABCD       | ABCDX
5    | B communicates with D  | ABCDX      | ABCDX      | ABCD       | ABCDX      | ABCDX
6    | A communicates with C  | ABCDX      | ABCDX      | ABCDX      | ABCDX      | ABCDX
7    | Final state reached    | ABCDX      | ABCDX      | ABCDX      | ABCDX      | ABCDX

The order of communication (who exchanges with whom) is random, and not every exchange necessarily results in a change (step 4 in the example).
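
The convergence behaviour can be simulated with a few lines of Python; the membership tables are modelled here simply as sets of node names, and the pairing of nodes per round is random, just as in the table above.

    import random

    def gossip_round(tables):
        """Pick two random nodes and let them merge their membership tables."""
        a, b = random.sample(list(tables), 2)
        merged = tables[a] | tables[b]
        tables[a] = tables[b] = merged

    tables = {n: {"A", "B", "C", "D"} for n in "ABCD"}
    tables["X"] = {"X"}
    tables["A"] |= {"X"}          # step 2: X registers with A ...
    tables["X"] |= tables["A"]    # ... and receives A's table in return

    rounds = 0
    while any(t != {"A", "B", "C", "D", "X"} for t in tables.values()):
        gossip_round(tables)
        rounds += 1
    print(f"membership converged after {rounds} gossip rounds")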

If a node is removed or added, the assignment of virtual nodes to physical machines must also change, for which there are several strategies. The simplest variant, in the case of a homogeneous system landscape, is to keep the same number of equally sized virtual nodes on each physical machine. When a node is removed, its virtual nodes are copied to randomly selected physical nodes that hold fewer virtual nodes than the rest of the ring. Conversely, a newly added node takes over virtual nodes from fully loaded nodes, again chosen at random.

DynamoDB

Since 2012, Dynamo has been offered by Amazon Web Services as a managed storage service under the name DynamoDB. The service differs in several respects from the original Dynamo implementation; for example, DynamoDB offers a Bigtable-like interface in which multi-dimensional keys map to a value. This allows a table structure similar to that of a relational database to be represented.
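
For comparison with the key-value interface of the original Dynamo, the following sketch stores a value under a composite (multi-dimensional) key using the boto3 client library; the table name "ShoppingCart" and its attributes are invented for the example, and the table is assumed to already exist with "customer_id" as partition key and "item_id" as sort key.

    import boto3

    # Connect to DynamoDB and reference an (assumed) existing table.
    dynamodb = boto3.resource("dynamodb")
    table = dynamodb.Table("ShoppingCart")

    # The composite key (partition key + sort key) addresses a single item.
    table.put_item(Item={
        "customer_id": "c-4711",   # partition key
        "item_id": "sku-42",       # sort key
        "quantity": 2,
    })

    response = table.get_item(Key={"customer_id": "c-4711", "item_id": "sku-42"})
    print(response.get("Item"))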

Sources

References

  1. Werner Vogels: Amazon's Dynamo. In: allthingsdistributed.com. October 2, 2007, accessed March 21, 2017.