# Hash function

A hash function that maps names to integers . There is a collision for the names "John Smith" and "Sandra Dee".

A hash function (also known as scatter function ) is a mapping that maps a large input set (the key ) to a smaller target set (the hash values). A hash function is therefore generally not injective . The input set can contain elements of different lengths, whereas the elements of the target set usually have a fixed length.

The name "hash function" comes from the English verb to hash , which can be translated as "hack". The German name is Streuwertfunktion. Both names indicate that these functions are normally designed to "scatter" and "chop up" the data (see also chopper in radio technology). Especially in the computer science is also used the term hash algorithm ( English hash algorithm ), because of a hash functions are often in the form of algorithm to be specified, which describes the calculation of the mathematical function.

The hash or scatter values ​​are mostly scalar values ​​from a limited subset of the natural numbers . A “good” hash function supplies values ​​for the (expected) input data, so that two different inputs also lead to different output values. A hash value is therefore also referred to as an English fingerprint , as it represents an almost unambiguous identification of a larger amount of data, just as a fingerprint almost uniquely identifies a person.

A collision occurs when the same hash value is assigned to different input data. Since the amount of possible hash values ​​is usually smaller than that of the possible inputs, such collisions are then in principle unavoidable, which is why there must be methods for collision detection. A good hash function is characterized by the fact that it generates as few collisions as possible for the inputs for which it was designed.

A hash value can be used in data storage to calculate the storage location of the requested data (e.g. hash table ). In checksums to hash values used to detect transmission errors. In cryptology , special cryptological hash functions are used which also require that it is practically impossible to find collisions on purpose.

For known and limited input sets, perfect (in the sense of collision-free) hash functions can also be found.

## Explanation

Hash functions are typically used to:

• to assign a storage location to a complex object
Example: "John Doe" is  stored in the "m" folder - the first letter of the surname.
• calculate a (short) checksum for the object
Example: checksum of an ISBN , cyclic redundancy check to detect file transfer errors
• Identify content almost unambiguously (but still “briefly”) without revealing anything about the content.
Example: Applications in cryptography .

Depending on the application, there are different functional requirements. For example, if you group an address file according to the first letter of the surname, you obviously save a lot of time in the search - you only need to search through one of 26 parts. This “hash function” is very convenient for humans (because it is very easy to “calculate”), but a computer program would use other methods to organize an address book. It is important for the program to have as few collisions as possible; but there are apparently many names that begin with the same initial letter, and they appear unevenly. For example, if you store personnel files according to this principle, you often have many files in the folder with the letter “S”, while the “Q” folder remains empty.

The single digit checksum is a simple hash function. It assigns a single-digit number to any number, for example is shown on . This checksum is not well suited as a checksum, however, since the interchanging of digits - a typical case when typing long numbers - is not recognized. So the number has the same cross sum . Checksums such as the ISBN of a book or the CRC-32 checksum of a file (e.g. when checking a file downloaded from the Internet for transmission errors) are better suited to detect such errors. ${\ displaystyle 25}$${\ displaystyle 2 + 5 = 7}$${\ displaystyle 52}$${\ displaystyle 5 + 2 = 7}$

When identifying content with cryptological hash functions, it is not only important that the hash value changes apparently randomly (i.e. not with just a few bits ) for every small modification (a normal checksum would be sufficient for this) and that it is not easy to change a to generate second content with the same hash value (to avoid a complete exchange of the content). It is just as important that the content cannot be reconstructed from the hash value . If you have exchanged two documents and would like to check the successful transfer on the phone, for example, it is sufficient to check the correctness of the hash value on the phone. If the conversation is tapped , nothing is revealed about the content of the message, even if parts of the content are already known.

## Definition of a hash function

A mapping is called a hash function if applies. In particular, a hash table provides the size . The set represents the data to be hashed and is also called the set of keys; the amount is the amount of possible hash values. Typically, the amount of the hash values is selected as the initial segment of the natural numbers: . This set is then also called the address space . ${\ displaystyle h \ colon K \ rightarrow S}$${\ displaystyle | K | \ geq \ | S |}$${\ displaystyle h}$${\ displaystyle | S |}$${\ displaystyle K}$${\ displaystyle S}$${\ displaystyle S \ subseteq \ left \ {0, \ dotsc, m-1 \ right \}}$

Typically, in practice, only a small subset of the key is always with mapped. The amount is then the hash values ​​actually used. ${\ displaystyle K '{} \ subseteq {} K}$${\ displaystyle h}$${\ displaystyle S ': = \ {h (k) | k \ in K' \}}$

The ratio provides the occupancy factor. ${\ displaystyle \ beta = {\ frac {\ left | S '\ right |} {\ left | S \ right |}}}$

The fall is known as a collision. An injective hash function is called perfect, u. a. because it doesn't create collisions. ${\ displaystyle k \ not = k '\ land h (k) = h (k')}$

## Criteria for a good hash function

• Low probability of collisions of the hash values ​​for the input value range, i.e. an even distribution of the hash values ​​on the expected input values if possible .
• Surjectivity - no result value (hash value) should be impossible, every result (every hash value in the defined value range) should actually be able to occur.
• Efficiency - The function must be calculable quickly, without large memory consumption (the memory requirement of the hash value should be significantly smaller than that of the key / input value) and should only have to read the source data (input values) once if possible.

The following criteria play a different role depending on the application:

• if the hash function a stocked to allow access in the hash table of a database: Trim preserving
• for cryptological hash functions: chaos or avalanche effect - the hash function should have good diffusion ; Similar source elements (input values) should lead to completely different hash values. Ideally, flipping a bit in the input changes an average of half of all bits in the resulting hash value.
• With cryptological hash functions: Confusion - No conclusions can be drawn from the hash value about the input value.
• at cryptologic hash functions: Un irreversibility - It should be possible no practical method that determines a hash value the input value.

## Applications

### Databases

Database management systems use hash functions to search for data in large databases using hash tables . A database index is implemented over this.

The fragmentation of data records can also be implemented using hash functions . The hash function is applied to the primary key of the relevant object. The result then references its storage location.

Hash functions are also used for comparatively small amounts of data, for example in compression algorithms such as  LZW .

### Checksums

Checksums are a simple means of increasing the plausibility for recognizing changes to transmitted data. Only the subset of the data variants that generate the same result as the original data when calculating the checksum can still remain undetected as a falsification. The probability of a collision can be greatly reduced with several different, suitably generated checksums.

An error can always be detected if the calculated checksum of the received data differs from the transmitted checksum, i.e. that of the original data. If an error is found, the corruption can also be contained exclusively in the checksum. The suitability of various hash functions for calculating checksums depends on their probability of collision.

If the checksum is to protect against targeted manipulation of the data, a cryptological hash function is used, since a collision can only be found here with a very high computing effort.

#### Examples

Hash values ​​play an important role in P2P applications for various reasons. The hash values ​​are used here both for searching and identifying files and for recognizing and checking transferred file fragments. In this way, large files can be exchanged reliably in small segments.

Stepped hash functions are primarily used in P2P networks, in which the hash value is calculated for smaller parts of a file and then a total value is calculated from these values. For the Gnutella G2, Shareaza and Direct Connect networks, these are, for example, Tiger Tree hash functions.

Finding files using the hash value of their content is protected as a software patent, at least in the USA. The owner pursues programs and companies that enable the search for files on the basis of this system, including companies that want to identify providers of unlicensed content on behalf of RIAA or MPA .

When programming Internet applications, hash functions are used to generate session IDs by calculating a hash value using changing status values ​​(such as time, IP address).

### Cryptology

Cryptological hash functions have special properties, in practice they are collision - resistant one-way functions . They are used to sign messages and to ensure the integrity of data. Special hash functions (e.g. from the class of "secure hash algorithms" ) are used to hash passwords with the aim of securely storing them or obtaining keys from them . Ideally, these are particularly complex to calculate in order to make brute force attacks more difficult. Furthermore, they must in particular satisfy the properties of confusion and irreversibility, so that the plain text password or a set of candidates cannot be easily generated from the key value.

## Hash algorithms

### Non-cryptological hash functions

Hash function speed developer year
xxHash 5.4 0GB / s Yann Collet 2012
MurmurHash 3a 2.7 0GB / s Austin Appleby 2008
SBox 1.4 0GB / s Bret Mulvey 2007?
Lookup3 1.2 0GB / s Bob Jenkins 2006
CityHash64 1.05 GB / s Geoff Pike & Jyrki Alakuijala 2011
FNV 0.55 GB / s Fowler, Noll, Vo 1991
SipHash / HighwayHash Jan Wassenberg & Jyrki Alakuijala 2016/2012