Byte order

from Wikipedia, the free encyclopedia

Byte order (also called endianness) describes how simple numerical values are organized in computer memory, primarily how whole numbers (integers) are stored in main memory.

The first computer architectures adopted the everyday representation of multi-digit numbers according to the convention of the place-value system, first for decimal numbers and then also for numbers represented in binary. In this convention, the notation of a number begins with the digit in the most significant place. Addition, subtraction and multiplication, however, begin with the least significant digit, the ones place.

As long as one stayed within similar computer architectures, there was no need to worry about endianness, since it matched the familiar convention. However, because the three basic arithmetic operations mentioned can start a machine cycle earlier if the order of the places is reversed, some manufacturers subsequently made this an architectural principle: the ones place is stored at the start address, and the three algorithms mentioned advance to the right into the higher places and addresses. This deviation from the familiar convention made the concept of endianness necessary:

  • In the big-endian (literally: "big-end", see also the section Etymology) format, the most significant byte is stored first, i.e. at the lowest memory address. More generally, the term means that in composite data the most significant (highest-ranking) component is given first, as in the German notation for the time of day: hour:minute:second.
  • In the little-endian (literally: "small-end") format, by contrast, the least significant byte is stored at the start address, or the least significant component is given first, as in the conventional German date notation: day.month.year.

The terms big-endian and little-endian designate the end of the representation that is written first or stored at the lowest address. Since that address usually addresses the entire (multi-digit) field, the terms "big-startian" and "little-startian" would be even more appropriate, because the field does not end at the point in question but starts there.

Colloquially, the two variants are often named after the microprocessor manufacturers that use or have used the respective variant in several processor families: "Motorola format" stands for big-endian, "Intel format" for little-endian.

If data is transmitted serially bit by bit, the bit order must also be specified. By analogy,

  • transmitting the most significant bit of a byte first corresponds to the big-endian byte order (e.g. I²C),
  • transmitting the least significant bit of a byte first corresponds to the little-endian byte order (e.g. RS-232).

Occasionally, however, the reverse assignment is also found, for example in frame buffers (video refresh memory).

Conventions

The following statements, on which there is a high degree of consensus in the literature, are made explicit as a starting point for the discussion and definition of the subject:

  • Main memory has a smallest addressable unit, also called a "memory location". In this article it is, by way of example, the byte. It consists of 8 bits, and its content is mostly given in this article as two hexadecimal digits, each digit corresponding to a nibble (half-byte) of 4 bits. The smallest addressable unit could also consist of a different number of bits or, if the machine calculates in the decimal system, hold one decimal digit.
  • The (byte) addresses of main memory are non-negative integers.
  • A (simple) data field is stored in main memory in a contiguous memory area (a gapless sequence of addresses) that has a start (byte) address and a (byte) length.
  • Machine instructions address a data field via its start address. Analogously, in assembly languages and in high-level programming languages, the start address plays the role of a pointer to the data field.
  • The byte at the start address is often referred to as the left byte, the one at the end address as the right byte (see bit order). Horizontal graphical representations of data fields follow this orientation very often, but not always.
  • A byte within a simple data field has a (non-negative) distance from the start address of the field, which is referred to as its offset.
  • The numerical data considered here are represented in memory according to a place-value system. In such a system a digit has, in addition to its value as an individual character, a weight that depends on its position within the whole number, also known as its "place value" or "significance".
  • The hexadecimal notations 3A4B5C6Dh or 0x3A4B5C6D appearing in manuals and below denote a numerical value, namely the number 978,017,389, and not its representation in memory, unless one means the storage 3Ah 4Bh 5Ch 6Dh. To specify a different way of storing the same number 0x3A4B5C6D, other notations have to be used, such as 6Dh 5Ch 4Bh 3Ah or 0x6D,0x5C,0x4B,0x3A.
  • In general, a left shift multiplies a binary number by a power of two and thus shifts the bits towards the "big end" (the direction of the most significant bit), while a right shift divides by such a power and shifts the bits towards the "little end" (the direction of the least significant bit). The shift operations induce an unambiguous, consistent "addressing" from the bytes down to the bits (see the section on addressing bits). (The left-right orientation of shift instructions is something different from, and completely independent of, the orientation used for addressing, where left = low address and right = high address.)
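The difference between a numerical value and its storage can be illustrated with a minimal C sketch. The helper names load_be32 and load_le32 are assumptions made for this illustration and not part of any standard library. Both assemble the value purely with shift operations, so the result does not depend on the byte order of the machine running the code.

```c
#include <stdint.h>

/* Assemble a 32-bit value from 4 bytes stored most significant byte
   first (big-endian storage such as 3Ah 4Bh 5Ch 6Dh). */
uint32_t load_be32(const uint8_t *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Assemble a 32-bit value from 4 bytes stored least significant byte
   first (little-endian storage such as 6Dh 5Ch 4Bh 3Ah). */
uint32_t load_le32(const uint8_t *p) {
    return  (uint32_t)p[0]        | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```

Applied to the byte sequences 3A 4B 5C 6D and 6D 5C 4B 3A respectively, both functions yield the same value 0x3A4B5C6D.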

If the significance of a digit in a number stored in memory increases with increasing address, the number is represented in little-endian format.

If the significance of a digit in a number stored in memory decreases with increasing address, the number is represented in big-endian format.

If a computer system consistently uses one of these two formats for storing numerical fields, it is called a little-endian system in the first case and a big-endian system in the second.

Example: Storage of a 32-bit integer in 4 bytes

Address   Big-endian   Mixed-endian   Little-endian
10000     01           02             04
10001     02           01             03
10002     03           04             02
10003     04           03             01

In the example, the integer 16,909,060 is stored as a 32-bit integer value (hexadecimal: 01020304h). The storage occupies 4 bytes starting at an assumed memory address of 10000:

  • Big-endian stores the sample number in the order 01 02 03 04 = 01h 02h 03h 04h.
  • Little-endian stores the sample number in the reverse order 04 03 02 01 = 04h 03h 02h 01h.

Some older systems (e.g. PDP-11) also store the bytes in the order 02 01 04 03 = 02h 01h 04h 03h (but not as 03 04 01 02 = 03h 04h 01h 02h). This is known as mixed-endian or middle-endian.
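The three storage orders of the example can be reproduced with a short C sketch. The function names store_be32, store_le32 and store_pdp32 are hypothetical and chosen only for this illustration. The bytes are written explicitly, so the result is the same on any host.

```c
#include <stdint.h>

/* Big-endian: most significant byte at the lowest address. */
void store_be32(uint8_t *p, uint32_t v) {
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/* Little-endian: least significant byte at the lowest address. */
void store_le32(uint8_t *p, uint32_t v) {
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/* PDP-11 style mixed-endian: the high 16-bit half comes first,
   but each half is itself stored low byte first. */
void store_pdp32(uint8_t *p, uint32_t v) {
    p[0] = (uint8_t)(v >> 16);  /* low byte of the high half  */
    p[1] = (uint8_t)(v >> 24);  /* high byte of the high half */
    p[2] = (uint8_t)v;          /* low byte of the low half   */
    p[3] = (uint8_t)(v >> 8);   /* high byte of the low half  */
}
```

For the value 0x01020304 these functions fill the buffer with 01 02 03 04, 04 03 02 01 and 02 01 04 03 respectively, matching the three columns of the table.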

Some systems can store data in both big-endian and little-endian order, which is known as bi-endian.

Order of digits within numbers in language

The usual representation of (decimal) numbers is big-endian with respect to the left-to-right reading direction of most European languages. This is because the order of digits of the Indo-Arabic numerals was retained in the scripts of Central Europe. In Arabic, which is read from right to left, the numbers are written the same way, i.e. numbers below 100 are read as "little-endian" (numbers above 100 are read as "big-endian"). In German, too, the numbers from 13 to 99 are pronounced little-endian: "one-and-twenty"; the one, as the less significant digit, is spoken first (this order also exists in other languages).

An example for decimal numbers: in the most common representation (big-endian) the decimal number one thousand two hundred thirty is written "1230", where the "1" has the value 1000, the "2" the value 100 and the "3" the value 10. In the "little-endian" representation it is the other way around, so that the number would be written "0321" (pronounced perhaps "thirty, two hundred, one thousand").
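The two digit orders can be produced with the following C sketch. The function names and the fixed-width, zero-padded output are assumptions made for this illustration.

```c
#include <stddef.h>

/* Write the decimal digits of n into out, least significant digit first
   ("little-endian"), using exactly `width` digits, then NUL-terminate.
   out must have room for width + 1 characters; higher digits that do not
   fit into `width` places are dropped. */
void digits_little_endian(unsigned n, size_t width, char *out) {
    for (size_t i = 0; i < width; i++) {
        out[i] = (char)('0' + n % 10);
        n /= 10;
    }
    out[width] = '\0';
}

/* The same digits in the usual "big-endian" order: the least significant
   digit is placed at the highest index. */
void digits_big_endian(unsigned n, size_t width, char *out) {
    for (size_t i = width; i > 0; i--) {
        out[i - 1] = (char)('0' + n % 10);
        n /= 10;
    }
    out[width] = '\0';
}
```

With n = 1230 and width = 4, the first function produces "0321" and the second "1230".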

Contexts of the byte order problem

The byte order problem affects data types that are composed of several bytes and are directly supported by the respective processor, i.e. mainly integer and floating-point types, as well as data types that the processor effectively treats as such internal data types, e.g. UTF-16.

A byte order mark (BOM) is often used to get around this problem with Unicode characters. In a hex editor, the text "Die" looks like this:

00 44 00 69 00 65 |  D  i  e| byte pair 00 44: BOM FE FF at the start of the file, UTF-16 big-endian / UCS-2BE
44 00 69 00 65 00 |D  i  e  | byte pair 44 00: BOM FF FE at the start of the file, UTF-16 little-endian / UCS-2LE
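Detecting the byte order from the BOM might look like the following C sketch. The enum and the function name detect_utf16_bom are hypothetical. The sketch only inspects the first two bytes and does not try to distinguish UTF-16 from other encodings whose BOM starts with the same bytes.

```c
#include <stddef.h>
#include <stdint.h>

enum utf16_order { UTF16_UNKNOWN, UTF16_BIG_ENDIAN, UTF16_LITTLE_ENDIAN };

/* Look at the first two bytes of a buffer and report which UTF-16
   byte order the BOM indicates, if any. */
enum utf16_order detect_utf16_bom(const uint8_t *buf, size_t len) {
    if (len >= 2) {
        if (buf[0] == 0xFE && buf[1] == 0xFF)
            return UTF16_BIG_ENDIAN;     /* FE FF */
        if (buf[0] == 0xFF && buf[1] == 0xFE)
            return UTF16_LITTLE_ENDIAN;  /* FF FE */
    }
    return UTF16_UNKNOWN;                /* no BOM: order must be known otherwise */
}
```

A reader would call this once at the start of the file and then decode every following 16-bit code unit in the reported order.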

Cross-platform presentation of numbers

To enable error-free data exchange between computers on different platforms, network protocols always fix the byte order. This is known as the "network byte order". In contrast, the natural byte order of the system is referred to as the "host byte order". If the system does not work with the network byte order, values must be converted accordingly in the network driver or, in part, in the application program.

In the Internet protocol suite, which is predominant today, the network byte order corresponds to the big-endian format. However, there are still protocols that use a different byte order. In addition, there are data types that are not, or not only, characterized by endianness, such as floating-point numbers, whose conversion can also lead to a loss of accuracy.

The BSD IP socket API offered on most operating systems provides functions for converting the byte order. The 16-bit and 32-bit functions are standardized in POSIX; the 64-bit double variants are offered only by some implementations (for example Winsock):

Data type (word length)   Host-to-network   Network-to-host
double (64 bit)           htond()           ntohd()
long (32 bit)             htonl()           ntohl()
short (16 bit)            htons()           ntohs()

Correct conversion is guaranteed for unsigned integers. Negative integers are converted correctly if they are represented in two's complement and the bit width matches.

These functions are trivial on big-endian machines, since the host and network byte orders are identical.

Programmers of network applications are advised to use these functions, since the source code then remains portable to other systems.
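A minimal sketch of how these functions are typically used, assuming a POSIX system that provides <arpa/inet.h>. The wrapper names put_u32_network and get_u32_network are hypothetical.

```c
#include <arpa/inet.h>  /* POSIX: htonl(), ntohl() */
#include <stdint.h>
#include <string.h>

/* Write v into a 4-byte buffer in network byte order (big-endian). */
void put_u32_network(uint8_t out[4], uint32_t v) {
    uint32_t wire = htonl(v);         /* no-op on big-endian hosts */
    memcpy(out, &wire, sizeof wire);  /* copy the in-memory image  */
}

/* Read a 4-byte network-order field back into a host value. */
uint32_t get_u32_network(const uint8_t in[4]) {
    uint32_t wire;
    memcpy(&wire, in, sizeof wire);
    return ntohl(wire);
}
```

Whatever the host byte order, put_u32_network stores the value 0x01020304 as the bytes 01 02 03 04, and get_u32_network recovers the original value.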

The implementation that matches the hardware at hand is usually selected implicitly by the operating system or, if necessary, by the user at download time.

The endianness of a machine can be determined with a program such as the following:

 #include <stdint.h>
 #include <stdio.h>

 int main(void) {
   union {
     uint16_t sixteenBits;
     uint8_t twoBytes[2];
   } test_endianness;

   test_endianness.sixteenBits = 1 << 15; // 0x8000, 32768
   if (test_endianness.twoBytes[0] != 0) {
     // The most significant byte is at the lowest address:
     // the program is running on a big-endian machine.
     puts("big-endian");
   }
   else {
     // The program is running on a little-endian machine.
     puts("little-endian");
   }
   return 0;
 }

The BitConverter class of the .NET Framework has the field IsLittleEndian, which allows the endianness (of the running hardware) to be queried.

Byte order problems can also occur when exchanging files and sometimes when exchanging data carriers between different platforms. This must be remedied either through a clear definition of the corresponding file format or file system or through a compatibility mode that recognizes and converts the files during loading.

The problem of representing data on different systems and exchanging it between them is addressed in general by the presentation layer of the OSI model.

"Nuxi"

The problem of differing endianness between architectures is often jokingly referred to as the NUXI problem: if the word UNIX is stored in two two-byte words (two 16-bit registers for "UN" and "IX"), it lies in memory as "UNIX" on a big-endian system, but on a little-endian system as "NUXI", because the bytes in each word are swapped (on 32-bit systems, by contrast, a single 32-bit register would yield "XINU").
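The effect can be reproduced with a small C sketch that swaps bytes within fixed-size words. The function names are hypothetical, and the buffers are modified in place.

```c
#include <stddef.h>

/* Swap the two bytes of every 16-bit word in buf. A trailing odd byte
   is left untouched. */
void swap_bytes_16(char *buf, size_t len) {
    for (size_t i = 0; i + 1 < len; i += 2) {
        char t = buf[i];
        buf[i] = buf[i + 1];
        buf[i + 1] = t;
    }
}

/* Reverse the four bytes of every 32-bit word in buf. Trailing bytes
   that do not fill a whole word are left untouched. */
void swap_bytes_32(char *buf, size_t len) {
    for (size_t i = 0; i + 3 < len; i += 4) {
        char t = buf[i];
        buf[i] = buf[i + 3];
        buf[i + 3] = t;
        t = buf[i + 1];
        buf[i + 1] = buf[i + 2];
        buf[i + 2] = t;
    }
}
```

Applied to the four characters "UNIX", the 16-bit variant yields "NUXI" and the 32-bit variant yields "XINU".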

Important properties

With the first microprocessors, the data bus width was only 4 bits (later, for a long time, 8 bits), while the address bus of these CPUs was much wider. This created the need to load or store, with a single instruction, data that was distributed over at least two coupled registers. To reduce the complexity of the CPU (every single transistor function was still expensive), it was easier to load the low-order "data snippet" first for each operation. During this memory operation the instruction could be decoded further and, if necessary, the remaining data processed in the next cycle. This was less of an issue for mainframes, since they already worked with data bus widths of 16 to 48 bits and could therefore load such data in a single memory cycle, so the (byte) order was irrelevant.

Big-endian format

  1. Since machine instructions address operands at their lowest address, operations whose algorithm starts at the least significant place must first be positioned there by increasing the start address by the operand length minus 1. Adding, subtracting and multiplying are therefore slightly more complex.
  2. Dividing and comparing, on the other hand, start with the most significant byte and are therefore marginally easier.
  3. The same comparison operations can be used to compare (unsigned) big-endian numbers as well as short texts (2, 4 or 8 bytes long), since both are ordered lexicographically.
  4. For comparing strings, the IBM System/370 mainframe provides the machine instruction CLCL (Compare Logical Character Long) with two memory operands of (different and arbitrary) length, which implements the lexicographic ordering.
  5. In big-endian format, hex dumps of numbers are easier to read because the order of the digits is the same as in the usual notation of the place-value system.
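Point 3 can be checked with a short C sketch that compares the byte images of two unsigned numbers using memcmp(). The function names are assumptions for this illustration. The byte images are built explicitly, so the behaviour is the same on any host.

```c
#include <stdint.h>
#include <string.h>

/* Compare two unsigned 32-bit numbers by comparing their big-endian
   byte images lexicographically. The sign of the result agrees with
   the numeric order of a and b. */
int compare_via_big_endian(uint32_t a, uint32_t b) {
    uint8_t ba[4], bb[4];
    for (int i = 0; i < 4; i++) {
        ba[i] = (uint8_t)(a >> (24 - 8 * i));
        bb[i] = (uint8_t)(b >> (24 - 8 * i));
    }
    return memcmp(ba, bb, 4);
}

/* The same comparison on little-endian byte images. Here the result
   can contradict the numeric order, because memcmp() looks at the
   least significant byte first. */
int compare_via_little_endian(uint32_t a, uint32_t b) {
    uint8_t ba[4], bb[4];
    for (int i = 0; i < 4; i++) {
        ba[i] = (uint8_t)(a >> (8 * i));
        bb[i] = (uint8_t)(b >> (8 * i));
    }
    return memcmp(ba, bb, 4);
}
```

For a = 256 and b = 1, the big-endian comparison reports a as greater, while the little-endian comparison reports it as smaller.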

Little-endian format

  1. Since machine instructions address operands at their lowest address, the initial increment by the operand length is not required for operations whose algorithm starts at the least significant place, such as addition, subtraction and multiplication. These operations are therefore slightly easier to implement in hardware.
  2. To convert a two-byte number into a four-byte number on a little-endian machine, it is enough to append two zero-filled bytes at the end without changing the memory address. On a big-endian machine, the value must first be shifted two bytes in memory. The reverse conversion is also easier: on a little-endian machine, the more significant bytes are simply discarded without the memory address changing.
  3. In contrast, the implementation of operations such as division, whose algorithm starts at the most significant place, is marginally more complex.
  4. Machine instructions for the lexicographical comparison of long texts are missing on some machines and have to be replaced by subroutines such as memcmp().
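Point 2 can be illustrated with explicit byte arrays in C. The readers read_le and read_be are hypothetical helpers that interpret the first n bytes at an address (n at most 4) in the respective order.

```c
#include <stddef.h>
#include <stdint.h>

/* Read an n-byte unsigned little-endian number (n <= 4) starting at p. */
uint32_t read_le(const uint8_t *p, size_t n) {
    uint32_t v = 0;
    for (size_t i = 0; i < n; i++)
        v |= (uint32_t)p[i] << (8 * i);
    return v;
}

/* Read an n-byte unsigned big-endian number (n <= 4) starting at p. */
uint32_t read_be(const uint8_t *p, size_t n) {
    uint32_t v = 0;
    for (size_t i = 0; i < n; i++)
        v = (v << 8) | p[i];
    return v;
}
```

For the little-endian field 34 12 00 00, reading 2 or 4 bytes at the same address gives 0x1234 in both cases. For the big-endian field 00 00 12 34, the 4-byte read at offset 0 gives 0x1234, but the 2-byte value is only found at offset 2.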

Example of a decimal little endian addition:

   717
 + 0452
   ----
   7523
   ====

(Check in the usual notation: 717 + 2540 = 3257)
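The addition above can be sketched in C as a loop that walks from the lowest index to the highest and carries into the next higher place. The function name add_decimal_le and the representation as arrays of digit values (0 to 9, not ASCII characters) are assumptions made for this illustration.

```c
#include <stddef.h>

/* Add two n-digit decimal numbers whose digits are stored least
   significant digit first. sum receives n digits; the carry out of the
   highest place is returned. */
int add_decimal_le(const unsigned char *a, const unsigned char *b,
                   unsigned char *sum, size_t n) {
    int carry = 0;
    for (size_t i = 0; i < n; i++) {  /* ascending index = ascending significance */
        int d = a[i] + b[i] + carry;
        sum[i] = (unsigned char)(d % 10);
        carry = d / 10;
    }
    return carry;
}
```

With the operands 7 1 7 0 and 0 4 5 2 (717 padded with a leading zero place, and 2540), the result digits are 7 5 2 3 with no final carry, as in the example.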

Example: Interpretation of a hex dump

The purpose of a dump is to display memory content unambiguously, e.g. for error analysis. For machines whose memory location (byte) consists of 8 bits, the hexadecimal system is chosen for this, in which the 2^8 = 256 = 16^2 different contents of a byte are expressed in 2 hexadecimal digits. This encoding, which directly covers binary values, machine instructions and decimal values in BCD code, is usually accompanied by a column that shows each individual byte as an alphabetic character where possible, so that any text in memory can be recognized and read more easily.

The following example shows how two consecutive bytes (4 half-bytes) with the readable hexadecimal content a732 in a hex dump are to be interpreted.

                        Hex dump      2 unsigned 8-bit binary numbers                         1 unsigned 16-bit binary number
                        Bytes  Text   Byte 0: bits, hex, dec      Byte 1: bits, hex, dec      Bits, hex, dec
Offset                  0  1          0123 4567                   0123 4567                   0123 4567 89ab cdef
Readable                a7 32  §2
Big-endian
  internal bit sequence               1010 0111                   0011 0010                   1010 0111 0011 0010
  interpretation                      1010 0111₂, a7h, 167₁₀      0011 0010₂, 32h, 50₁₀       1010 0111 0011 0010₂, a732h, 42802₁₀
Little-endian
  internal bit sequence               1110 0101                   0100 1100                   1110 0101 0100 1100
  interpretation                      1010 0111₂, a7h, 167₁₀      0011 0010₂, 32h, 50₁₀       0011 0010 1010 0111₂, 32a7h, 12967₁₀

If the field consists of only a single byte (an 8-bit binary number with or without sign) or a collection of such bytes (e.g. text in an ISO 8859 code), shown in the table in the columns "2 unsigned 8-bit binary numbers", then the interpretation does not differ between the two formats big-endian and little-endian. The internal sequence of the bits per byte is mirrored between the two formats in exactly the same way as that of the bytes per integer (see bit significance, section on addressing of bits). However, because the hexadecimal representation is fixed byte by byte, the hex dump shows no difference between big-endian and little-endian here.

If the field consists of more than one byte, the so-called "Intel convention" comes into play with little-endian. This means that, unlike with big-endian, the low-order byte is stored at the lower memory address and the higher-order bytes at the subsequent memory addresses. As a result, for integer fields with a length of 16, 32 or 64 bits, for example, the two hex dump representations are byte-wise mirror images of one another. For clarification, the content of the first of the 2 bytes in the column "1 unsigned 16-bit binary number" is provided with an overline.
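The interpretations of the two dump bytes a7 32 can be recomputed with a small C sketch. The function names are hypothetical. The helper reverse_bits8 mirrors the bit order within a byte and thereby reproduces the "internal bit sequence" rows of the table.

```c
#include <stdint.h>

/* Interpret two dump bytes as a 16-bit number, first byte most significant. */
uint16_t as_be16(const uint8_t b[2]) {
    return (uint16_t)((b[0] << 8) | b[1]);
}

/* Interpret two dump bytes as a 16-bit number, first byte least significant. */
uint16_t as_le16(const uint8_t b[2]) {
    return (uint16_t)((b[1] << 8) | b[0]);
}

/* Mirror the order of the 8 bits within one byte. */
uint8_t reverse_bits8(uint8_t x) {
    uint8_t r = 0;
    for (int i = 0; i < 8; i++) {
        r = (uint8_t)((r << 1) | (x & 1));  /* append the lowest bit of x */
        x >>= 1;
    }
    return r;
}
```

For the bytes a7 32, as_be16 gives a732h (42802) and as_le16 gives 32a7h (12967). Mirroring the bits of a7h (1010 0111) gives 1110 0101, and mirroring 32h (0011 0010) gives 0100 1100.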

Usage and hardware examples

Big-endian

The big-endian format was used, for example, in the Motorola 6800 as well as the Motorola 68000 and ColdFire families, in the System z processors, in Sun SPARC CPUs, and in Power (up to POWER7) and PowerPC.

Big-endian is used by mainframe systems (e.g. IBM mainframes) and by MIPS, SPARC, Power, PowerPC, Motorola 6800/68k, Atmel AVR32 and TMS9900 processors. Alpha processors can also be operated in this mode, but this is unusual. With the IBM POWER8, the Power architecture (PAPR) was converted to little-endian, but the POWER8 can still be operated in big-endian mode as well.

PowerPC can also be switched to little-endian on some models, and POWER8 can be switched from little-endian to big-endian mode; IBM, however, has been pushing little-endian mode since POWER8.

Little-endian

The little-endian format was originally used in the 6502 processor, the NEC V800 series, PICmicro and Intel x86 processors.

Today's PC systems (x86-compatible) also use little-endian. Others are Alpha, Altera Nios, Atmel AVR, IBM Power from POWER8 onwards, some SH3/SH4 systems and VAX. These are true little-endian systems.

In contrast, there are architectures, such as some PowerPC variants (including 603, 740, 750), that can merely be configured as little-endian systems (see bi-endian below): they use little-endian from the perspective of the running program but store values in memory in big-endian format. The representation is converted implicitly on load and store operations. This may have to be taken into account when writing software for these systems, e.g. in driver programming.

Mixed variants (bi-endian)

Some processors, e.g. certain MIPS variants and POWER/PowerPC (PAPR) as well as all Alpha processors, can be switched between little-endian and big-endian.

ARM processors (including Intel XScale) can likewise be operated in little-endian or big-endian mode for data accesses; code, however, is always accessed in little-endian format on ARM processors.

The Itanium architecture "IA-64", developed jointly by Hewlett-Packard and Intel, also supports both byte orders, which was intended to make porting operating systems easier (especially between HP-UX (big-endian) and Windows (little-endian)).

File formats

The byte order a processor architecture typically uses for storing values in main memory influences the byte order of values in secondary storage (often hard drives). When new file formats were created, the byte order of numerical values was chosen so that no conversion is needed when saving to and reloading from secondary storage. With memory virtualization, a program can even address data in secondary storage directly.

This is important for container formats with a general structure definition. The Interchange File Format (IFF) was designed for Amiga programs, and in keeping with the Amiga's Motorola 68000 processor, the four-byte chunk lengths were stored in the Motorola big-endian format. On the Macintosh, which also used Motorola processors, this was adopted for the AIFF audio format, among others.

When the format was adopted on the Windows platform with Intel processors, the chunk lengths were redefined to the four-byte Intel little-endian format, and the new general container format was called Resource Interchange File Format (RIFF). This RIFF file format is the basis of common file formats such as RIFF WAVE (*.wav files) for audio and Audio Video Interleave (*.avi files) for video.

For file formats, too, it is possible to develop a definition that allows both byte orders of the processor architectures. In TIFF files (Tagged Image File Format), for example, the first two bytes of the file contain:

  • II for Intel format (little-endian) or
  • MM for Motorola format (big-endian).

The length and offset values that follow in the file are then encoded accordingly.
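Reading such a header might look like the following C sketch. The enum and the function name tiff_byte_order are hypothetical. The check for the 16-bit magic number 42, which follows the byte-order mark in classic TIFF files, is not described in this article and is added here as an assumption about the format.

```c
#include <stddef.h>
#include <stdint.h>

enum tiff_order { TIFF_INVALID, TIFF_LITTLE_ENDIAN, TIFF_BIG_ENDIAN };

/* Inspect the first 4 bytes of a file: the mark "II" or "MM", followed
   by the 16-bit number 42 encoded in the byte order the mark announces. */
enum tiff_order tiff_byte_order(const uint8_t *hdr, size_t len) {
    if (len < 4)
        return TIFF_INVALID;
    if (hdr[0] == 'I' && hdr[1] == 'I') {
        /* Intel format: low byte first */
        uint16_t magic = (uint16_t)(hdr[2] | (hdr[3] << 8));
        return magic == 42 ? TIFF_LITTLE_ENDIAN : TIFF_INVALID;
    }
    if (hdr[0] == 'M' && hdr[1] == 'M') {
        /* Motorola format: high byte first */
        uint16_t magic = (uint16_t)((hdr[2] << 8) | hdr[3]);
        return magic == 42 ? TIFF_BIG_ENDIAN : TIFF_INVALID;
    }
    return TIFF_INVALID;
}
```

All subsequent multi-byte fields of the file would then be decoded in the order this function reports.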

Etymology

The names go back to the satirical novel Gulliver's Travels by Jonathan Swift, in which the inhabitants of the land of Lilliput live in two warring groups: some crack open their breakfast eggs at the "big" end and are therefore called Big-Enders, while the Little-Enders open their eggs at the pointed, "little" end. Swift was alluding to the split of the English Church (Little-Enders) from the Catholic Church (Big-Enders). In connection with byte order, the terms were first used in 1980 by Danny Cohen in the April Fools' paper On Holy Wars and a Plea for Peace.

References and comments

  1. This applies above all to machines in which the length of the operands is encoded in the machine instruction.
    Machines like the IBM 1401, in which so-called "word marks" in memory determine the extent of the memory operands, do not fit into this scheme. Such machines can, depending on the machine instruction, address a field (consisting of several contiguous memory locations) at its low or high address and process it towards the other end; for example, the add instruction of the IBM 1401 addresses the operands at the ones place (on this machine at the high address) and works its way up to the word mark. With the divide instruction, the dividend is addressed at the most significant place (the lower address) and the divisor at the ones place. IBM 1410 Principles of Operation. P. 9. Retrieved November 3, 2014.
  2. This connection between byte and bit order was already established by Cohen (p. 3).
  3. The comparison of character strings by machine instructions or by the C functions memcmp() and strcmp() starts on every machine at the beginning of the character string and thus treats the positions at the lower address as having higher priority, i.e. it works in big-endian style. This transfer of the ordering from the individual bytes to multi-digit fields is called the lexicographical order if it starts with the first position. (See also strncmp. Retrieved March 26, 2015.)
    Where character strings ("strings") are mentioned in the literature in the context of endianness, e.g. in Big and Little Endian C-style strings and byte ordering, the discussion is often limited to the mode of transmission.
  4. https://msdn.microsoft.com/de-de/library/system.bitconverter.islittleendian(v=vs.110).aspx?cs-save-lang=1&cs-lang=cpp#code-snippet-1
  5. Although, for example, the C or C++ compiler "has to know" for which endianness it is compiling, there is no standardized #define constant that allows the endianness to be queried. See ISO/IEC 14882:2014, also known as C++14 (accessed on May 21, 2016).
  6. So that the instruction, which potentially takes a large number of machine cycles, does not monopolize the main processor, it is designed to be interruptible and, after a hardware interrupt, can be continued at the point at which it was interrupted. (See ESA/390 Principles of Operation, chapter 7-44 General Instructions, SA22-7201-08. Accessed June 25, 2014.)
  7. memcmp. en.cppreference.com. Retrieved March 6, 2014.
  8. Gerd Küveler, Dietrich Schwoch: Computer science for engineers and scientists: PC and microcomputer technology, computer networks (German), 5th edition, volume 2, Vieweg, reprint: Springer-Verlag, October 4, 2007, ISBN 3834891916, 9783834891914 (accessed on August 5, 2015).
  9. AVR32 Architecture Document. (PDF; 5.1 MB) Atmel, November 2007.
  10. Jeff Scheel: Little endian and Linux on IBM Power Systems. Answers to your frequently asked questions. In: IBM Developer. June 16, 2016, accessed July 14, 2019.
  11. Danny Cohen: On Holy Wars and a Plea for Peace.