Message-Digest Algorithm 5

Message-Digest Algorithm 5 ( MD5 ) is a widely used cryptographic hash function that generates a 128-bit hash value from any message. This allows, for example, an easy check of a download for correctness. It is one of a series of cryptographic hash functions developed by Ronald L. Rivest at the Massachusetts Institute of Technology in 1991 , when analysis showed that its predecessor, MD4, is likely to be insecure. In the meantime, MD5 is also no longer considered secure, as it is possible to generate different messages that have the same MD5 hash value with reasonable effort.

MD5 hashes

The 128-bit long MD5 hashes (also known as "message digests") are usually noted as 32-digit hexadecimal numbers. The following example shows a 59 byte long ASCII input and the associated MD5 hash:

md5("Franz jagt im komplett verwahrlosten Taxi quer durch Bayern") =
a3cca2b2aa1e3b5b3b5aad99a8529074

A small change in the text creates a completely different hash. For example, with Frank instead of Franz (only one letter changed):

md5("Frank jagt im komplett verwahrlosten Taxi quer durch Bayern") =
7e716d0e702df0505fc72e2b89467910

The hash of a zero-length string is:

md5("") = d41d8cd98f00b204e9800998ecf8427e

Use and availability

Most Linux distributions install the md5sum program as part of the coreutils by default.

On BSD- derived operating systems such as macOS there is the md5 command.

On many other Unix derivatives you can make do with the mostly installed program OpenSSL .

Microsoft Windows operating systems from versions Windows 8.1 or Windows Server 2012 R2 have the PowerShell cmdlet Get-Filehash by default.

Checking the MD5 hash value

After a file or a folder with files has been downloaded successfully, the associated MD5 hash value is often made available in another file. A test program can then be used to calculate the hash value from the downloaded file, which is then compared with the hash value made available. If both hash values are identical, the integrity of the downloaded file is confirmed. No errors occurred when downloading the file. This offers no security with regard to targeted data manipulation by an attacker ( man-in-the-middle attack ), since the attacker can also manipulate the transmission of the MD5 hash value.

algorithm

An MD5 operation. MD5 consists of 64 operations of this type, grouped in 4 runs with 16 operations each. F is a non-linear function that is used in the respective run. M _i denotes a 32-bit block of the input stream and K _i a 32-bit constant different for each operation; _s denotes the bit-by-bit left rotation by s places, where s varies for each operation. denotes the addition modulo 2 ³² .

MD5 is based on the Merkle-Damgård construction to generate an output of fixed length (128 bit) from a message of variable length. First, a one is appended to the output message. The output message is then padded with zeros so that its length is 64 bits away from being divisible by 512. A 64-bit number encoding the length of the output message is now appended. The message length is now divisible by 512.

The main algorithm of MD5 uses a 128-bit buffer in four 32-bit words A , B , C and D is divided. These are initialized with certain constants. The compression function is now called on this buffer with the first 512-bit block as the key parameter. A message block is handled in four similar stages, called "rounds" by cryptographers. Each round consists of 16 operations based on a non-linear function "F", modular addition and left rotation. There are four possible "F" functions, a different one is used in each round:

{\ displaystyle F (X, Y, Z) = (X \ wedge {Y}) \ vee (\ neg {X} \ wedge {Z})}

{\ displaystyle G (X, Y, Z) = (X \ wedge {Z}) \ vee (Y \ wedge \ neg {Z})}

{\ displaystyle H (X, Y, Z) = X \ oplus Y \ oplus Z}

{\ displaystyle I (X, Y, Z) = Y \ oplus (X \ vee \ neg {Z})}

${\ displaystyle \ oplus, \ wedge, \ vee, \ neg}$ each stand for XOR, AND, OR and NOT operations.

The same function is called on the result with the second message block as a parameter, and so on, up to the last 512-bit block. The result is again a 128-bit value - the MD5 sum.

Reference implementation

RFC1321 also contains an implementation of the algorithm in C under the title "Appendix A Reference Implementation". This implementation from 1992 by RSA Data Security, Inc. runs incorrectly on many 64-bit systems and calculates incorrect hash values. This is because in the global.h file the lines

/* UINT4 defines a four byte word */
typedef unsigned long int UINT4;

are not necessarily given. The error can be fixed by reading these lines through

#include <inttypes.h>
...
/* UINT4 defines a four byte word */
typedef uint32_t UINT4;

replaced. Another executable implementation by L Peter Deutsch can be found on Sourceforge.net. This implementation is derived from the specification of RFC1321 and not from the aforementioned reference implementation in RFC1321. Therefore, no references to RSA Data Security, Inc. are necessary when using this implementation .

Pseudocode

The pseudocode for the MD5 algorithm follows .

// Beachte: Alle Variablen sind vorzeichenlose (unsigned) 32-Bit-Werte und
// verhalten sich bei Berechnungen kongruent (≡) modulo 2^32
    
 // Definition der linksrotation Funktion, c ist der übergebene Wert von s[i] - siehe Hauptschleife
linksrotation(x, c)
    return (x << c)binär or (x >> (32-c));
    
// s definiert die Anzahl der Bits, die pro Runde rotiert werden:
var uint[64] s, K
s[ 0..15] := { 7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22, 7, 12, 17, 22}
s[16..31] := { 5, 9, 14, 20, 5, 9, 14, 20, 5, 9, 14, 20, 5, 9, 14, 20}
s[32..47] := { 4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23, 4, 11, 16, 23}
s[48..63] := { 6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21, 6, 10, 15, 21}
    
// Verwende den binären Vorkommateil vom 2^32-fachen Betrag des Sinus
// von Integerwerten als Konstanten:
für alle i von 0 bis 63
(
    K[i] := floor(abs(sin(i + 1)) × 2^32)
)

// Alternativ kann man auch folgende Tabelle nutzen:
K[ 0.. 3] := { 0xd76aa478, 0xe8c7b756, 0x242070db, 0xc1bdceee }
K[ 4.. 7] := { 0xf57c0faf, 0x4787c62a, 0xa8304613, 0xfd469501 }
K[ 8..11] := { 0x698098d8, 0x8b44f7af, 0xffff5bb1, 0x895cd7be }
K[12..15] := { 0x6b901122, 0xfd987193, 0xa679438e, 0x49b40821 }
K[16..19] := { 0xf61e2562, 0xc040b340, 0x265e5a51, 0xe9b6c7aa }
K[20..23] := { 0xd62f105d, 0x02441453, 0xd8a1e681, 0xe7d3fbc8 }
K[24..27] := { 0x21e1cde6, 0xc33707d6, 0xf4d50d87, 0x455a14ed }
K[28..31] := { 0xa9e3e905, 0xfcefa3f8, 0x676f02d9, 0x8d2a4c8a }
K[32..35] := { 0xfffa3942, 0x8771f681, 0x6d9d6122, 0xfde5380c }
K[36..39] := { 0xa4beea44, 0x4bdecfa9, 0xf6bb4b60, 0xbebfbc70 }
K[40..43] := { 0x289b7ec6, 0xeaa127fa, 0xd4ef3085, 0x04881d05 }
K[44..47] := { 0xd9d4d039, 0xe6db99e5, 0x1fa27cf8, 0xc4ac5665 }
K[48..51] := { 0xf4292244, 0x432aff97, 0xab9423a7, 0xfc93a039 }
K[52..55] := { 0x655b59c3, 0x8f0ccc92, 0xffeff47d, 0x85845dd1 }
K[56..59] := { 0x6fa87e4f, 0xfe2ce6e0, 0xa3014314, 0x4e0811a1 }
K[60..63] := { 0xf7537e82, 0xbd3af235, 0x2ad7d2bb, 0xeb86d391 }
    
// Initialisiere die Variablen: (lt. RFC 1321)
var uint a0 := 0x67452301
var uint b0 := 0xEFCDAB89
var uint c0 := 0x98BADCFE
var uint d0 := 0x10325476
    
// Vorbereitung der Nachricht 'message':
var uint message_laenge := bit_length(message)
erweitere message um bit "1"
erweitere message um bits "0" bis Länge von message in bits ≡ 448 (mod 512)
erweitere message um message_laenge als 64-Bit little-endian Integer
    
// Verarbeite die Nachricht in aufeinander folgenden 512-Bit-Blöcken:
für alle 512-Bit Block von message
(
    unterteile Block in 16 32-bit little-endian Worte M[i], 0 ≤ i ≤ 15
    
    // Initialisiere den Hash-Wert für diesen Block:
    var uint A := a0
    var uint B := b0
    var uint C := c0
    var uint D := d0
    
    // Hauptschleife:
    // not Operator entspricht dem Einerkomplement
    für alle i von 0 bis 63
    (
        wenn 0 ≤ i ≤ 15 dann
            F := (B and C) or ((not B) and D)
            g := i
        sonst wenn 16 ≤ i ≤ 31 dann
            F := (B and D) or (C and (not D))
            g := (5×i + 1) mod 16
        sonst wenn 32 ≤ i ≤ 47 dann
            F := B xor C xor D
            g := (3×i + 5) mod 16
        sonst wenn 48 ≤ i ≤ 63 dann
            F := C xor (B or (not D))
            g := (7×i) mod 16
        wenn_ende
        
        temp := D
        D := C
        C := B
        B := B + linksrotation((A + F + K[i] + M[g]), s[i])
        A := temp
    )
    
    // Addiere den Hash-Wert des Blocks zur Summe der vorherigen Hashes:
    a0 := a0 + A
    b0 := b0 + B
    c0 := c0 + C
    d0 := d0 + D
)
    
var uint digest := a0 anfügen b0 anfügen c0 anfügen d0 // Darstellung als little-endian

Instead of the original formulation from RFC 1321 , the following can be used to increase efficiency:

( 0 ≤ i ≤ 15): F := D xor (B and (C xor D))
(16 ≤ i ≤ 31): F := C xor (D and (B xor C))

Attacks

As early as 1993, Bert de Boer and Antoon Bosselaers published an algorithm for generating pseudocollisions on the compression function of MD5: two different initialization constants result in the same hash value for the same message.

In 1996 Hans Dobbertin found a collision for two different messages. This is a real collision, i.e. two specially prepared messages that differ, but still result in the same hash value. However, Dobbertin used a modified MD5 variant in which other initialization constants (for A, B, C, D) are used. It was also not possible to specify the content of the colliding messages. Practical attacks on MD5 were therefore not possible, but the weaknesses of MD5 became clear, so that cryptologists advised switching to other hash functions.

In 2004, a Chinese research group led by Xiaoyun Wang succeeded in systematically generating collisions if the beginning of the message can be chosen at will, but both messages are identical (common-prefix collision) . At this beginning of the message, two different continuations of the message can be calculated with reasonable effort, which lead to the same hash value. This collision is retained even if the same suffix is appended to both messages (each consisting of the same beginning and one or the other continuation). This attack was improved by Wang and other research groups, so that today a PC can calculate an MD5 collision within seconds.

The effort to find a collision is greater if the beginning of the two messages is different (chosen-prefix collision) . In 2008, a team led by Marc Stevens and Alexander Sotirov managed to carry out such a collision attack in order to create a forged CA certificate that was recognized as trustworthy. With this, they were in principle able to forge an SSL certificate for any URL and thus bypass the security mechanisms of HTTPS on the web. The work was first presented in December 2008 at the 25th Chaos Communication Congress and published a few months later in a scientific article. To calculate the collision, they used a cluster of 200 Sony PlayStation 3s .

The Windows malware Flame , discovered in 2012, uses a forged code signing certificate based on a new and previously unknown variant of a chosen prefix collision for MD5.

Even with the methods mentioned, pre-image attacks cannot yet be carried out in a reasonable time. As a result, it is still impossible to subsequently create a forged document that matches a specific certificate generated with MD5 . In many cases, however, collision attacks make it possible to create two documents that result in the same MD5 hash value, then have the first, legitimate document signed, and then exchange this with the second, forged document. Against this background, it is not advisable to continue using MD5.

safety

MD5 is widely used and was originally thought to be cryptographically secure. As early as 1994, Bert den Boer and Antoon Bosselaers discovered pseudocollisions in MD5. Fundamental work to find real collisions was also done by Hans Dobbertin (then at the BSI ), who had already developed the successful attack on MD4 and transferred the techniques used to MD5.

Collision resistance

In August 2004, a Chinese team of scientists found the first collision in the full MD5 function. On an IBM P690 cluster, their first attack took an hour, and based on this, further collisions could be found within a maximum of five minutes. Shortly after the publication of the Chinese work, MD5CRK was discontinued and tried to find collisions using brute force methods .

Brief description: An input block (512 bits) is modified, whereby an attempt is made to produce a certain difference to the original in the output. A complex analysis of the algorithm can reduce the number of unknown bits to such an extent that this is mathematically successful. In the next 512-bit block, the same methods are used to attempt to cancel out the difference. The forgery therefore requires a coherent block of data of 1024 bits = 128 bytes, which greatly restricts its use.

In the meantime, the collision attacks have progressed so far that further use of MD5, especially in those scenarios in which the user does not fully control the files to be signed, must be rejected. A test carried out by the computer magazine c't in 2009 using GPGPU enables a high-end game PC, about a year old, with two Nvidia GeForce 9800 GX2 (four graphics processors in total ) to find a collision in just under 35 minutes.

One way feature

Rainbow tables are another attack method. These tables contain strings with the associated MD5 hash values. The attacker searches these tables for the specified hash value and can then read out suitable strings. This attack can mainly be used to discover passwords that are stored as MD5 hashes. The rainbow tables required for this, however, are very large and require a lot of computational effort to create them. Therefore, this attack is generally only possible with short passwords. For this case, there are precalculated rainbow tables in which at least the computational effort to create the list is omitted. However, the use of a salt , i.e. a random, unpredictable value that is added to the plain text, destroys the effectiveness of pre-calculated rainbow tables.

Summary

Finding collisions means finding two different texts Mand M'with hash(M) = hash(M'). In a pre-image attack, one searches for a given Mor hash(M)one M'so that hash(M) = hash(M'). Since you only M'control a pre-image attack , not also M, it is much more difficult.

Currently, MD5 is only broken regarding the collision attacks.

For the secure storage of passwords, however, other algorithms should be considered that have been specially developed for this purpose, e.g. B. bcrypt or PBKDF2 .

Since no first pre-image attack is known, MD5 hashes signed in the past are still safe (2013).

literature

Hans Dobbertin: Cryptanalysis of MD5 compress . Announcement on Internet, May 1996 (English, online )
Hans Dobbertin: The Status of MD5 After a Recent Attack . In: CryptoBytes 2 (2), 1996 (English, online )
Philip Hawkes, Michael Paddon, Gregory G. Rose: Musings on the Wang et al. MD5 collision . Detailed analysis of the differential attack on the MD5
Vlastimil Klima: Finding MD5 Collisions on a Notebook PC using multi-message modifications . Again improved attack technique (English)

Web links

RFC 1321 - The MD5 Message-Digest Algorithm; R. Rivest, April 1992 (English)
MD5 Collision Demo Two different programs with the same MD5 hash and a library for generating other such programs (English)
papers and demos to MD5-collisions (English)
Jürgen Schmidt: Hash me, the second
Online generator for generating MD5 hashes
Sourceforge.net : RFC1321-based (RSA-free) MD5 library

Individual evidence

↑ Description of the Powershell cmdlet Get-Filehash
^ Bert de Boer, Antoon Bosselaers: Collisions for the compression function of MD5 . In: Proceedings of EUROCRYPT '93 Workshop on the theory and application of cryptographic techniques on Advances in cryptology . Springer-Verlag, New York 1994, ISBN 3-540-57600-2
↑ Alexander Sotirov, Marc Stevens, Jacob Appelbaum, Arjen Lenstra, David Molnar, Dag Arne Osvik, Benne de Weger: MD5 considered harmful today. December 30, 2008, accessed December 30, 2008 .
↑ Marc Stevens: TECHNICAL BACKGROUND OF THE FLAME COLLISION ATTACK. CWI, June 7, 2012, accessed June 8, 2012 (English): "the results have shown that not our published chosen-prefix collision attack was used, but an entirely new and unknown variant."
↑ Collision analysis (PDF; 57 kB)
↑ Explanation of the collision problem when manipulating md5 hash values
^ Stefan Arbeiter, Matthias Deeg: Bunte Rechenknechte - graphics cards accelerate password crackers . In: c't, edition 06/2009, p. 204. ( downloadable fee )

[1] Description of the Powershell cmdlet Get-Filehash

[2] Bert de Boer, Antoon Bosselaers: Collisions for the compression function of MD5 . In: Proceedings of EUROCRYPT '93 Workshop on the theory and application of cryptographic techniques on Advances in cryptology . Springer-Verlag, New York 1994, ISBN 3-540-57600-2

[sslHarmful-3] Alexander Sotirov, Marc Stevens, Jacob Appelbaum, Arjen Lenstra, David Molnar, Dag Arne Osvik, Benne de Weger: MD5 considered harmful today. December 30, 2008, accessed December 30, 2008 .

[4] Marc Stevens: TECHNICAL BACKGROUND OF THE FLAME COLLISION ATTACK. CWI, June 7, 2012, accessed June 8, 2012 (English): "the results have shown that not our published chosen-prefix collision attack was used, but an entirely new and unknown variant."

[5] Collision analysis (PDF; 57 kB)

[6] Explanation of the collision problem when manipulating md5 hash values

[7] Stefan Arbeiter, Matthias Deeg: Bunte Rechenknechte - graphics cards accelerate password crackers . In: c't, edition 06/2009, p. 204. ( downloadable fee )