Hash table

from Wikipedia, the free encyclopedia

In computer science, a hash table (English: hash table or hash map) is a special kind of index structure. It is used to search for or locate data elements in a large amount of data (hash or scatter storage method).

Compared to alternative index data structures such as tree structures (e.g. a B+ tree) or skip lists, hash tables are usually characterized by a constant time cost for insert and remove operations.

Hash method

The hash method is an algorithm for searching for data objects in large amounts of data. It is based on the idea of using a mathematical function, the hash function, to calculate the position of an object in a table. This eliminates the need to search through many data objects until the target object is found.

The algorithm

With the hash method, the target data is stored in a hash table. It is not the key that uniquely identifies the data object that serves as the index, but the hash value calculated from the key by a hash function. The location in the table defined by a data object's hash value is also known as a bucket (English for "container").

Ideally, each object has its own bucket (i.e. no collisions, see below).

In practice the table is implemented as an array .
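The idea can be sketched in a few lines of Python. The table size and the character-sum hash function are illustrative assumptions, not part of the article, and collisions are deliberately ignored here (they are treated below):

```python
TABLE_SIZE = 11  # illustrative number of buckets

def hash_value(key: str) -> int:
    """Toy hash function: sum of character codes, reduced to a table index."""
    return sum(ord(c) for c in key) % TABLE_SIZE

table = [None] * TABLE_SIZE  # one bucket per possible hash value

def store(key: str, value) -> None:
    # The hash value is computed directly from the key; no search is needed.
    # Collisions are ignored in this sketch.
    table[hash_value(key)] = (key, value)

def find(key: str):
    entry = table[hash_value(key)]
    if entry is not None and entry[0] == key:
        return entry[1]
    return None
```

`store("cat", 1)` writes the pair to index `hash_value("cat")`, and `find("cat")` reads back the same index in constant time.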

Collisions

Since hash functions are generally not injective, two different keys can lead to the same hash value, i.e. to the same field in the table. This event is known as a collision. In this case, the hash table must store multiple values in the same place / in the same bucket.

During the search, a collision requires special handling by the method: first, a hash value is calculated from the search key, which determines the bucket of the data object sought; then the desired target object must be determined by directly comparing the search key with the objects in the bucket.

To handle collisions, collided data is stored in alternative fields or in a list, according to the chosen strategy. In the worst case, collisions can lead to degeneration of the hash table if a few hash values are assigned a large number of objects while other hash values remain unused.

Collision resolution strategies

There are various collision resolution strategies to handle the collision problem.

Closed hashing with open addressing

One possibility is called closed hashing with open addressing . If an entry is to be stored in a position in the table that is already occupied, another free position is used instead. A distinction is often made between three variants (see #Algorithms ):

linear probing
a free position is searched for, shifted by a constant interval. Most of the time, the interval size is set to 1.
quadratic probing
the interval to the original position grows quadratically with each unsuccessful search step (1, 4, 9, …).
double hashing
another hash function supplies the interval.
Open hashing with closed addressing

Another option is open hashing with closed addressing. Here the hash table does not store the requested data directly; instead, each position holds a container (English: bucket) that takes up all data with the same hash value. When searching, the correct target container is computed first. This considerably limits the number of possible targets.

Nevertheless, the remaining elements in the container must finally be searched. In the worst case, it can happen that all elements have the same hash value and are therefore stored in the same bucket. In practice, however, this can be avoided by choosing a suitable size for the hash table and a suitable hash function.

Chaining is often implemented using a linked list per container.
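A minimal sketch of such chaining in Python, assuming a plain list as the per-container data structure (class and method names are illustrative):

```python
class ChainedHashTable:
    """Open hashing with closed addressing: each bucket holds a list of
    (key, value) pairs for all keys that share the same hash value."""

    def __init__(self, buckets: int = 8):
        self.buckets = [[] for _ in range(buckets)]

    def _bucket(self, key):
        # First compute the target container from the hash value …
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)  # overwrite an existing key
                return
        bucket.append((key, value))       # … otherwise chain a new entry

    def get(self, key, default=None):
        # … then search the remaining elements inside the container.
        for k, v in self._bucket(key):
            if k == key:
                return v
        return default
```

Collisions cost nothing but a short scan of one bucket's list, as long as the buckets stay small.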

Advantages

Depending on the application, a hash table has advantages both in access time (compared to tree index structures) and in the storage space required (compared to conventional arrays).

Ideally, the hash function for the data to be stored should be chosen so that the number of collisions is minimized and stays below a constant; depending on the hash function, the hash table must contain unused fields for this (in practice usually 20 to 30 percent). If this is the case, then a hash table with n stored elements requires only constant time on average (O(1)) for an access (English: look-up) to a hash table entry.

In comparison, an access to an element in a B-tree takes on the order of O(log n), where n corresponds to the number of stored entries. The complete tree data structure requires memory on the order of O(n).

Disadvantages

Because of the possible collisions, a hash table has very poor access-time behavior in the worst case. This is estimated as O(n): all entries in the table are searched.

The fill level is the number of stored elements divided by the number of all buckets:

fill level = stored elements / buckets.
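The formula above as a one-line function (the example numbers are purely illustrative):

```python
def fill_level(stored_elements: int, buckets: int) -> float:
    """Fill level (load factor) = stored elements / number of buckets."""
    return stored_elements / buckets

# e.g. 150 stored elements in 200 buckets give a fill level of 0.75
```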

As the fill level increases, the probability of collisions grows and degeneration sets in. Then only enlarging the table, with subsequent restructuring, can restore acceptable runtime behavior.

In addition, a neighborhood search following the search operation is impossible, since any ordering relationship that might exist between the keys is discarded (see also section #Use in databases).

Variants

There are several variants of the hashing method, some of which are better suited to specific data. An important factor is how dynamically the number of elements changes. Open hashing solves this problem, but at the cost of longer access times. Closed hashing, in contrast, relies on explicit strategies for collision handling.
Caution: the terms open and closed hashing are also used with exactly the opposite meanings.

Hashing with concatenation

With hashing by chaining (English: separate chaining), the hash table is structured so that each container can hold a dynamic data structure - for example, a list or a tree. Each key is then entered or searched for in that data structure. It is therefore possible to store several keys in one container without any problems, but this leads to more or less extended access times. The efficiency of access is determined by how quickly data records can be inserted into and found in the chosen data structure. Hashing with chaining is a very common indexing variant for databases, where very large amounts of data are indexed by means of hash tables. In database systems, the size of the buckets is a multiple of the sector size of the storage medium, because the amount of data can no longer be kept in main memory: when a search query is made, the database system must read the buckets from disk sector by sector.

Hashing with open addressing

This method is also known for short as open hashing or, alternatively, closed hashing. The name open hashing refers to the open addressing, while the name closed hashing refers to the limited number of possible keys per container.

With hashing with open addressing, each container can be assigned only a fixed number of keys. Often a single possible key per container is chosen. In the event of a collision, an alternative container must then be found. The approach is to define a whole sequence of m hash functions for m containers. If applying the first hash function, call it h_1, leads to a collision, then h_2 is used. If this also leads to a collision (i.e. the corresponding container is already occupied), h_3 is used, and so on, up to h_m or until an empty container is found. The term "open addressing" comes from the property that the same key can be assigned different addresses as a result of collisions.
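The scheme of trying h_1, h_2, … until an empty container is found can be sketched as follows; the concrete probe function (a simple linear offset on top of Python's built-in hash) and the table size are assumptions for illustration:

```python
M = 7                 # number of containers (illustrative)
table = [None] * M    # one key per container

def h(key, i):
    """The i-th hash function h_i of the sequence; here simply the base
    hash shifted by i, i.e. linear probing (an assumption for the sketch)."""
    return (hash(key) + i) % M

def insert(key, value):
    for i in range(M):                       # try h_1, h_2, … in turn
        slot = h(key, i)
        if table[slot] is None or table[slot][0] == key:
            table[slot] = (key, value)       # empty container found
            return True
    return False                             # all M containers occupied

def lookup(key):
    for i in range(M):                       # retrace the same sequence
        slot = h(key, i)
        if table[slot] is None:
            return None                      # key cannot be further along
        if table[slot][0] == key:
            return table[slot][1]
    return None
```

Note how the same key may end up at different addresses depending on which containers were already occupied, which is exactly what "open addressing" means.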

Cuckoo hashing

Cuckoo hashing is another way of avoiding collisions in a table. The name is derived from the behavior of the cuckoo, which removes eggs from a nest in order to lay its own egg in their place.

The principle is to use two hash functions. This yields two possible storage locations in the hash table, which guarantees constant access time. A new key is stored in one of its two possible locations. If the first target position is occupied, the existing key is moved to its alternative position and the new key is saved in its place. If that alternative position is occupied as well, the key found there is in turn transferred to its own alternative position, and so on. If this procedure leads to an infinite loop (usually it is broken off after O(log n) steps), the hash table is rebuilt with two new hash functions. The probability of such a rehash is on the order of O(1/n²) per insert.
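A minimal sketch of this displacement scheme, assuming two separate tables, distinct keys, and hash functions derived from Python's built-in hash (all illustrative); a real implementation would rebuild the table with two new hash functions instead of returning False:

```python
SIZE = 16
tables = [[None] * SIZE, [None] * SIZE]  # one table per hash function

def h(which, key):
    # Two different hash functions, here simulated by seeding the built-in
    # hash with the table number (an assumption for the sketch).
    return hash((which, key)) % SIZE

def insert(key, value, max_steps=32):
    entry = (key, value)
    which = 0
    for _ in range(max_steps):
        slot = h(which, entry[0])
        # Place the entry; whatever was there before is evicted.
        entry, tables[which][slot] = tables[which][slot], entry
        if entry is None:
            return True          # placed without displacing anything further
        which = 1 - which        # evicted entry moves to its alternative table
    return False                 # loop detected: a rehash would be needed

def lookup(key):
    # A key can only ever be in one of its two possible locations.
    for which in (0, 1):
        e = tables[which][h(which, key)]
        if e is not None and e[0] == key:
            return e[1]
    return None
```

Because only two positions have to be inspected, `lookup` runs in guaranteed constant time.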

Algorithms

Linear probing

The simplest way to define such a sequence is to check the next container until a free one is found. The sequence of hash functions is defined as:

    h_i(x) = (h(x) + i) mod m,   i = 0, 1, …, m − 1

where m is the number of containers and h is the underlying hash function.

The modulo operation reflects the limited number of containers: when the last container has been checked, the search starts again at the first one. The problem with this method is that chains (clusters) form quickly, and access times in the vicinity of such chains grow rapidly. Linear probing is therefore inefficient. Its advantage, however, is that (in contrast to other probing methods) all containers in the table are used.

Quadratic probing

As with linear probing, a search is made for a new free position, but not sequentially; instead, the distance to the original position grows quadratically, and probing proceeds in both directions. If h(x) causes a collision, the positions h(x) + 1, h(x) − 1, h(x) + 4, h(x) − 4, h(x) + 9, … are tried one after the other. Expressed in formulas:

    h_i(x) = (h(x) + (−1)^(i+1) · ⌈i/2⌉²) mod m,   i = 1, 2, …

The constant change of sign in this collision strategy is also called "alternating quadratic probing" or "quadratic probing with refinement". If the number of containers m is chosen cleverly (namely m a prime number with m ≡ 3 mod 4), then each probing sequence up to i = m − 1 generates a permutation of the numbers 0 to m − 1; this ensures that every container is hit.

Quadratic probing does not change the probability that probing becomes necessary in the first place, but it can reduce the probability of collisions during the probing, i.e. the formation of clusters is avoided.
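The alternating probe sequence can be sketched as a small generator. The table size 7 is an illustrative choice (7 is prime and 7 mod 4 = 3, so the first 7 probes visit every container exactly once):

```python
def quadratic_probe(h0, m):
    """Alternating quadratic probing: yields h, h+1, h-1, h+4, h-4, h+9, …
    (all taken mod m), starting from the base position h0 = h(x)."""
    yield h0 % m
    for j in range(1, m):
        yield (h0 + j * j) % m   # step +j² to the right
        yield (h0 - j * j) % m   # step -j² to the left
```

For m = 7 and base position 3, the first seven probes are 3, 4, 2, 0, 6, 5, 1: a permutation of all container indices.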

Double hashing

With double hashing, two independent hash functions h_1(x) and h_2(x) are applied. They are called independent if the probability of a so-called double collision, i.e. h_1(x) = h_1(y) and h_2(x) = h_2(y) at the same time, is minimal (1/m²). The sequence of hash functions formed from h_1 and h_2 looks like this:

    h_i(x) = (h_1(x) + i · h_2(x)) mod m,   i = 0, 1, …, m − 1

The cost of this method is close to the cost of ideal hashing.
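A sketch of the probe sequence as a generator; the two hash functions below are illustrative assumptions derived from Python's built-in hash, and m should be prime so that every step width reaches all containers:

```python
def double_hash_probe(key, m=7):
    """Probe sequence h_i(x) = (h_1(x) + i * h_2(x)) mod m, i = 0 … m-1.
    m = 7 (a prime) is an illustrative default."""
    h1 = hash(key) % m                    # first hash function h_1
    h2 = 1 + (hash(key) // m) % (m - 1)   # second hash function h_2, never 0
    for i in range(m):
        yield (h1 + i * h2) % m
```

Because m is prime and the step width h_2(x) lies in 1 … m − 1, each sequence is a permutation of all m container indices.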

Brent hashing

With Brent hashing, it is first checked whether the position at which the element is to be inserted is free. If so, the element is inserted there. If the position is occupied, a new position in the table is calculated both for the element to be inserted and for the element already stored there, starting from the position just calculated. If the newly calculated position for the element to be inserted is free, it is inserted there. If instead the newly calculated position for the already stored element is free, that element is moved there, and the element to be inserted takes over its old position. If both newly calculated positions are occupied, the procedure is repeated from the newly calculated, occupied position of the element to be inserted.

Dynamic hashing

The likelihood of collisions increases significantly as the table fills up. At the latest when the number of indexed data records exceeds the capacity of the table, collisions become inevitable, and the method has to spend ever more effort resolving them. To avoid this, with dynamic hashing the hash table is enlarged as needed. This, however, inevitably affects the value range of the hash function, which must be expanded as well. Changing the hash function has the disadvantage that the hash values of data already stored also change. A class of hash functions was developed specifically for dynamic hashing whose value range can be increased without changing the hash values already stored.
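A sketch of the simpler, naive alternative that the text contrasts this with: growing the table when the fill level exceeds a threshold and re-hashing every stored element into the larger table (which is exactly the cost dynamic hashing tries to avoid). The threshold and sizes are illustrative assumptions:

```python
class GrowingTable:
    """Chained hash table that doubles its bucket count when the fill
    level would exceed MAX_FILL, re-hashing all stored elements."""

    MAX_FILL = 0.75  # illustrative threshold

    def __init__(self):
        self.buckets = [[] for _ in range(4)]
        self.count = 0

    def put(self, key, value):
        if (self.count + 1) / len(self.buckets) > self.MAX_FILL:
            self._grow()
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.count += 1

    def get(self, key):
        for k, v in self.buckets[hash(key) % len(self.buckets)]:
            if k == key:
                return v
        return None

    def _grow(self):
        old = self.buckets
        self.buckets = [[] for _ in range(2 * len(old))]
        for bucket in old:
            for k, v in bucket:  # every stored element is hashed again
                self.buckets[hash(k) % len(self.buckets)].append((k, v))
```

The re-hashing pass in `_grow` changes the position of every stored element, which is the very drawback that order-extensible hash functions for dynamic hashing were designed to eliminate.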

Advantages

  • There is no upper limit for the data volume
  • Entries can be deleted without any problems
  • Address collisions do not lead to clustering.

Disadvantages

If an order-preserving hash function was not used:

  • no efficient traversal of the entries according to an order
  • no efficient search for the entry with the smallest or largest key

Applications

Hash tables are used in virtually every modern application, for example to implement sets or caches.

Associative arrays

A typical application is also the associative array (also known as map, lookup table, or dictionary); looking up the data associated with a key can be implemented quickly and elegantly with a hash table.
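In Python, for instance, the built-in dict is exactly such a hash-table-backed associative array; the example data here is purely illustrative:

```python
# dict maps keys to values via a hash table: look-up, insertion, and
# deletion each take constant time on average.
capitals = {"France": "Paris", "Japan": "Tokyo"}
capitals["Italy"] = "Rome"          # insert under a new key
capital = capitals.get("Japan")     # hashed look-up by key
missing = capitals.get("Atlantis")  # absent key: get returns None
```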

Symbol tables

Symbol tables such as those used by compilers or interpreters are usually also implemented as hash tables.

Databases

Hash tables are also important for databases, for indexing tables. A hash index can lead to ideal access times under favorable conditions.

Hash tables enable a very fast search in large amounts of data, since the calculation of the hash value in a single step limits the number of possible target objects. This makes hash tables one of the most efficient index structures.

A major disadvantage, however, is the risk of degeneration due to collisions, which is inevitable with a steadily growing amount of data (unless the table is enlarged and every element it contains is hashed again). For this reason, and because of unfavorable I/O access patterns when the hash table is stored on a data carrier, as well as the missing ability to efficiently iterate over intervals according to an order relation, the use of hash tables in database systems must be weighed against alternative index data structures such as B+ trees.

Most hash functions do not allow moving to the next or previous data record according to an order relation, since they deliberately "mix" the data in order to distribute it evenly in the value space. Only special "order-preserving" hash functions allow such an iteration according to the order relation, and thus queries with inequality conditions ("greater than", "less than") or sorted access to all values. To use such methods efficiently, a prior analysis of the data distribution is usually necessary. They are therefore mostly used only in database systems that also perform such an analysis, for example for query optimization.
