Machine language

from Wikipedia, the free encyclopedia

A machine language, as used in machine code or native code, is a programming language in which the instructions to be executed by the processor are defined as formal language elements. Because of its proximity to the hardware, it is also commonly called the "programming language of a computer". The scope and syntax of the machine instructions are defined in the instruction set and depend on the processor type. Machine language is usually represented as binary code or, in simplified form, as hexadecimal numbers.

A machine instruction directs the processor to carry out an operation, for example an addition or a comparison of values. Everything a processor does is therefore the result of executing machine code, that is, a program in machine language.

Programs in machine language are usually not written directly by the programmer. Instead, they are written in a high-level programming language or an assembly language, and the machine code is only produced with the aid of a compiler or assembler. The phrase “programming in machine language” is sometimes used, incorrectly, to mean programming in assembly language. When a program is run by an interpreter, by contrast, the machine instructions are generated when the program starts or at run time.

Sometimes expressions such as “machine code, machine language, binary code, native code, program code” are used interchangeably. However, they can have two different meanings:

  • As a designation for a kind of code, in the sense of a syntax definition. Example: "source code (for the XYZ programming language)"
  • As the program code of a specific program. Example: "binary code (for program ABC)"

Machine program

Figure: "Machine program", related terms and synonyms in common usage

Machine programs are used in all devices with a processor, from mainframe computers to personal computers and smartphones to embedded systems in modern washing machines, radios, or motor vehicle controls for ABS or airbags. On PCs, they are usually contained in executable files.

On Windows, executable programs are stored in files with the “.exe” filename extension. On many other operating systems, executable files are also managed without a filename extension and in other formats. They sometimes go by other names, e.g. load module under z/OS. In many embedded systems or microcontrollers, certain machine programs reside permanently in ROM, e.g. a boot loader.

Machine programs can be viewed with a hex editor and, in theory, also created and modified with one. In practice, however, a machine program is produced by an assembler or compiler from the source text of the respective programming language. Machine code can be translated back into assembler format by a disassembler, but conversion into a high-level programming language by a decompiler is subject to severe limitations.

Differences from assembly language

A program in machine code consists of a sequence of bytes that represent both instructions and data. Because this code is difficult for humans to read, assembly language represents the instructions with more easily understood abbreviations, so-called mnemonics. The operation code, the source and target fields, and other information in the instructions can be written with symbolic identifiers (such as MOVE, ZIP, LENGTH), supplemented where necessary by numerical values, e.g. for an individual length specification or register numbers.

File format
As is usual with source texts, an assembler program is normally available as a text file, while the machine program is normally saved as a binary file.
Instructions
Programming in text format, with subsequent translation into machine code by an assembler, allows the programmer to create programs much faster and more easily than coding in machine code. As a rule, one assembler instruction corresponds to exactly one instruction in the machine code, except with macro assemblers, which can generate several machine instructions from one statement.
Character formats
Common assemblers allow the programmer to code characters and numbers in different formats (text, decimal, hexadecimal, octal, binary) and place them in the machine instruction in the format the instruction requires. Example: the source text specifications 'A', X'C1' or B'11000001' (in EBCDIC code) mean the same thing and all become X'C1' in the machine code, which has the value +193 for binary arithmetic instructions and corresponds to the character 'A' for character operations.
Data declaration
An assembler lets the programmer mark data fields as such, declare them in different formats, and give them symbolic names. In the generated machine code, memory is reserved according to this information and (for constants) pre-filled with content. In the generated machine instructions, the symbolic address is replaced by the numeric address, and the length of the defined fields is inserted.
Addressing
An assembler allows the storage locations of data and instructions to be named symbolically, so that the programmer does not need to know their numeric addresses. In machine language, memory addresses are specified directly. Even a small change to the program would shift the addresses of all subsequent program parts, which (when programming in machine language) would make it necessary to adjust all of these addresses. Symbolic addressing also allows subroutines to be called by name in assembly language; their actual address is inserted into the machine code only by the assembler or a linker.
Program scope
An assembler program normally relates to one (1) defined task and is independent of other programs at assembly time. Using techniques such as linking, and depending on the development platform, the results of several assemblies (called object modules, for example) can be combined to form the machine program as a whole.
Documentation
An assembler makes it possible to add comments and further documentation to a program. As a rule, these parts of the source code are not carried over into the machine program.
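
The character-format example in the list above can be checked with a short Python sketch. It assumes that Python's built-in "cp037" codec, one common EBCDIC code page, stands in for the EBCDIC code meant in the text; the variable names are purely illustrative.

```python
# Three source-text spellings of the same byte, as an assembler might accept them.
# Assumption: Python's "cp037" codec serves as a representative EBCDIC code page.
from_char = "A".encode("cp037")[0]    # character constant 'A'
from_hex = int("C1", 16)              # hexadecimal constant X'C1'
from_binary = int("11000001", 2)      # binary constant B'11000001'

# All three notations produce the same single byte in the machine code.
assert from_char == from_hex == from_binary

print(f"{from_char:02X}")                    # the byte as stored, in hexadecimal
print(from_char)                             # the same byte read as an unsigned number
print(bytes([from_char]).decode("cp037"))    # the same byte read as a character
```

The byte itself carries no type; whether it counts as the number 193 or the letter A depends only on the instruction that processes it.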

Most of the aspects of assembly language mentioned above also apply in a similar way to higher-level programming languages, although these differ from assembly language in further capabilities.
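
The benefit of symbolic addressing described in the list above can be sketched with a toy two-pass assembler. Everything here is hypothetical: the mnemonics, the fixed instruction size of 4 bytes, and the function name resolve_labels are assumptions made for illustration and do not belong to any real instruction set or tool.

```python
# Toy two-pass "assembler" (hypothetical; not a real instruction set).
# Assumption: every instruction occupies exactly 4 bytes.
INSTRUCTION_SIZE = 4

def resolve_labels(lines, base=0):
    """Replace symbolic labels with numeric addresses."""
    symbols = {}
    instructions = []
    address = base
    # Pass 1: record the address at which each label is defined.
    for line in lines:
        if line.endswith(":"):
            symbols[line[:-1]] = address
        else:
            instructions.append(line)
            address += INSTRUCTION_SIZE
    # Pass 2: substitute the numeric address for each symbolic operand.
    resolved = []
    for instruction in instructions:
        mnemonic, _, operand = instruction.partition(" ")
        if operand in symbols:
            operand = f"{symbols[operand]:#06x}"
        resolved.append((mnemonic, operand))
    return resolved

program = ["start:", "CMP", "JUMP done", "ADD", "JUMP start", "done:", "HALT"]
print(resolve_labels(program))

# Inserting one instruction shifts every later address. The assembler simply
# recomputes them; hand-written machine code would need manual corrections.
longer = program[:1] + ["NOP"] + program[1:]
print(resolve_labels(longer))
```

In the second run, the target of "JUMP done" moves by one instruction size without any change to the source line itself.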

Program creation

Internally, each machine language instruction is encoded as one or more numerical values. These numerical values consist of the opcode, which defines the type of instruction, possibly followed by one or more bytes of data for that instruction. A program is thus a meaningful sequence of such numerical codes, stored in main memory or in a file. There are different ways of creating such programs:

  • Direct entry of the binary code (extremely cumbersome and highly error-prone, uncommon since the 1950s)
  • Writing the numeric code (opcodes) with a hex editor (error-prone)
  • With an assembler: assembly languages express the processor instructions of the machine code as mnemonics in a simple syntax. This source text is then converted into machine code by the assembler.
  • A program is written in a high-level language and then translated (compiled) into machine code by a compiler. Object code is often generated first as an intermediate step.
  • Alternatively, programs in a high-level language can be processed by an interpreter, either directly or after being compiled into an intermediate code. One example is the Java programming language, whose intermediate code (also called bytecode) is executed by an interpreter. This happens transparently for the user, for example when an applet is run in a web browser. Besides Java, all .NET languages, such as C#, are also translated into an intermediate code, which is then translated into the corresponding machine language at run time by a JIT compiler within the CLR.
  • When software is installed, including the operating system, it is often already supplied as machine code for the respective platform. This saves the user from having to compile the program.

Example

Programming language C

Consider the following code in the C programming language. It calculates the sum of the numbers 2 and 3 and returns the result:

int main() {
    int a = 2;
    int b = 3;
    int c = a + b;
    return c;
}

Compiling this program can result in the following machine code:

Machine code (hexadecimal)   Associated assembler code        Associated C code

55                           push rbp                         int main() {
48 89 E5                     mov rbp, rsp
    Explanation: Save register RBP on the stack and set RBP to the value of register RSP, the stack pointer (not part of the actual calculation). This preparation is necessary in order to store the values of the variables a, b and c on the stack.

C7 45 FC 02 00 00 00         mov DWORD PTR [rbp-4], 2         int a = 2;
    Explanation: Set variable a, which is addressed via register RBP, to the value 2.

C7 45 F8 03 00 00 00         mov DWORD PTR [rbp-8], 3         int b = 3;
    Explanation: Set variable b, which is addressed via register RBP, to the value 3.

8B 45 F8                     mov eax, DWORD PTR [rbp-8]       int c = a + b;
8B 55 FC                     mov edx, DWORD PTR [rbp-4]
01 D0                        add eax, edx
89 45 F4                     mov DWORD PTR [rbp-12], eax
    Explanation: Set register EAX to the value of variable b. Set register EDX to the value of variable a. Add the value of EDX to the value of EAX. Set variable c, addressed via RBP, to the value of EAX.

8B 45 F4                     mov eax, DWORD PTR [rbp-12]      return c;
    Explanation: Set register EAX to the value of variable c. Because register EAX already contains this value, this instruction could be omitted in an optimized program.

5D                           pop rbp                          }
C3                           ret
    Explanation: Restore RBP to its original value. Return to the point from which main was called. Register EAX contains the return value.

Together with other necessary information, a compiler could generate an executable file from this. For execution, the machine code is loaded into main memory by the operating system's loader. The runtime environment then calls the main() function, and the CPU begins processing the machine instructions.
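
How the hexadecimal bytes in the example relate to the assembler operands can be illustrated with a small Python sketch. It takes the instruction 8B 55 FC from the example and splits its second byte (the ModRM byte of the x86 encoding) into its bit fields, then reads the third byte as a signed displacement. The helper names split_modrm and signed_disp8 are assumptions chosen for illustration; the sketch covers only this one addressing form and is not a general decoder.

```python
# Decode the ModRM byte and the 8-bit displacement of the instruction
# 8B 55 FC (mov edx, DWORD PTR [rbp-4]) from the example.

def split_modrm(byte):
    """Split a ModRM byte into its mod (2 bits), reg (3 bits) and rm (3 bits) fields."""
    return byte >> 6, (byte >> 3) & 0b111, byte & 0b111

def signed_disp8(byte):
    """Interpret one byte as a signed two's-complement displacement."""
    return int.from_bytes(bytes([byte]), "little", signed=True)

instruction = bytes.fromhex("8B 55 FC")
opcode, modrm, disp = instruction
mod, reg, rm = split_modrm(modrm)
print(f"opcode={opcode:02X} mod={mod:02b} reg={reg} rm={rm} disp={signed_disp8(disp)}")
```

In this encoding, register number 2 in the reg field stands for EDX, and rm = 5 together with mod = 01 selects "RBP plus an 8-bit displacement", which is why the byte FC appears as [rbp-4] in the assembler column.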

Machine code on IBM computers using the example of OS/390

The machine code is created when the source code files are assembled or compiled and is made available by the "Linkage Editor", possibly with additional modules added, as an executable program in a program library. This program is loaded into main memory for execution. The machine code of these programs contains instructions and data intermixed, as is possible on computers with a Von Neumann architecture (in contrast, for example, to the Harvard architecture).

The data is created according to the specified storage format. The value "12" can, for example, look as follows (hexadecimal representation, minimum length):

F1F2   text or unpacked number
012C   packed positive: one nibble per digit, with a sign nibble at the end
012D   packed negative (ditto)
0C     binary positive, corresponds to B'00001100'
Longer data fields may additionally contain leading zeros or, in the case of text, trailing blanks. For each data field, an 'address' is defined at which it begins and from which it is stored according to its length and format.
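
The storage formats listed above for the value 12 can be reproduced with a short Python sketch. The function names zoned, packed and binary are illustrative assumptions, as is the use of Python's "cp037" codec as the EBCDIC variant; the sketch handles only the minimum-length cases shown in the text.

```python
def zoned(number):
    """Unpacked (text) digits of a non-negative number.
    Assumption: code page cp037 is used as the EBCDIC variant."""
    return str(number).encode("cp037")

def packed(number):
    """Packed decimal: one nibble per digit plus a trailing sign nibble (C = +, D = -)."""
    nibbles = str(abs(number)) + ("D" if number < 0 else "C")
    if len(nibbles) % 2:          # pad to a whole number of bytes
        nibbles = "0" + nibbles
    return bytes.fromhex(nibbles)

def binary(number, length=1):
    """Unsigned big-endian binary integer of the given length in bytes."""
    return number.to_bytes(length, "big")

for label, data in [("text", zoned(12)), ("packed +", packed(12)),
                    ("packed -", packed(-12)), ("binary", binary(12))]:
    print(f"{label:9} {data.hex().upper()}")
```

Passing a larger length to binary, for example binary(12, 4), yields the leading zeros mentioned for longer fields.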

The instructions consist of the instruction code and, depending on the instruction, parameters of varying structure. The following examples are shown in hexadecimal. Instruction examples:

D5.1C.92A4.8C2B (dots inserted only for readability):

D5 = instruction code for CLC = Compare Logical Character; comparison of characters
1C = length minus 1 of the fields to be compared (00 means 1 byte is compared, and so on; here 29 bytes)
92A4 = address of the first operand: 9 = base register, 2A4 = displacement from the register
8C2B = address of the second operand: 8 = base register, C2B = displacement from the register

47.80.B654:

47 = instruction code for BC = Branch on Condition: branch if the condition (from the previous instruction) is fulfilled
8 = condition; here: if 'equal', assembler mnemonic BE (branch on equal)
0 = optional register whose content is added to the branch address; not used when '0'
B = destination address (base register)
654 = destination address (displacement); with the content of B = 6C4410, the branch goes to address 6C4A64.

<etc>

In assembler code, this could look as follows, for example:

CLC   FELDA(29),FELDB
BE    XXX

By contrast, the corresponding source code in a high-level language could be:

IF Field_A = Field_B then GOTO XXX.

If the condition is met, a branch is made to XXX (= real address 6C4A64); otherwise, execution of the machine code continues with <etc>. High-level languages often generate additional instructions, e.g. to equalize field lengths or data formats, to load registers, or to calculate addresses in arrays.

As can be seen, the instructions have different lengths. The computer's control unit recognizes the length from the first two bits of the instruction code and advances the instruction counter register accordingly. The program continues at exactly that point, unless a branch instruction is to be executed.
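
The length rule can be sketched in Python as follows. The mapping used here (leading bits 00 for 2 bytes, 01 or 10 for 4 bytes, 11 for 6 bytes) is the convention of this processor family; the function names are assumptions for illustration. Apart from the BC instruction 47 80 B6 54 taken from the text, the bytes in the sample stream are arbitrary values chosen only for their leading bits and do not represent meaningful instructions.

```python
# Instruction length derived from the two high-order bits of the first byte:
# 00 -> 2 bytes, 01 or 10 -> 4 bytes, 11 -> 6 bytes.
LENGTH_BY_TOP_BITS = {0b00: 2, 0b01: 4, 0b10: 4, 0b11: 6}

def instruction_length(opcode):
    """Return the instruction length in bytes for a given first byte."""
    return LENGTH_BY_TOP_BITS[opcode >> 6]

def split_instructions(code):
    """Cut a byte string into instructions the way the instruction counter advances."""
    position, result = 0, []
    while position < len(code):
        length = instruction_length(code[position])
        result.append(code[position:position + length])
        position += length
    return result

# BC from the text, followed by arbitrary illustrative bytes (not real instructions).
code = bytes.fromhex("4780B654" "1A2B" "E00102030405")
for instruction in split_instructions(code):
    print(instruction.hex().upper())
```

Because the length is encoded in the instruction itself, the stream needs no separators between instructions.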

Memory addresses are always represented in the machine code by one (or two) register specifications and, optionally, by a "displacement" given in the instruction. When the program is started, the operating system loads a particular register with the address at which the program was loaded into memory. Starting from this value, the base registers are loaded in the program code (coded explicitly in assembler programs, generated automatically for high-level languages), so that the instructions with their relative addresses reach the actual memory locations.
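
The base-plus-displacement scheme can be traced with the BC example from above (47.80.B654). The following Python sketch splits the four bytes into their fields and computes the branch target from the register content given in the text. The function names and the dictionary used to hold register contents are assumptions for illustration only.

```python
def decode_bc(instruction):
    """Split a 4-byte BC (branch on condition) instruction into its fields."""
    opcode = instruction[0]
    mask = instruction[1] >> 4             # condition mask
    index = instruction[1] & 0x0F          # optional register; 0 means "not used"
    base = instruction[2] >> 4             # base register number
    displacement = ((instruction[2] & 0x0F) << 8) | instruction[3]
    return opcode, mask, index, base, displacement

def effective_address(registers, index, base, displacement):
    """Base + index + displacement; register number 0 contributes nothing."""
    address = displacement
    if base:
        address += registers[base]
    if index:
        address += registers[index]
    return address

opcode, mask, index, base, displacement = decode_bc(bytes.fromhex("4780B654"))
registers = {0xB: 0x6C4410}                # register content taken from the text
target = effective_address(registers, index, base, displacement)
print(f"mask={mask:X} base={base:X} displacement={displacement:03X} target={target:06X}")
```

Only the register content changes when the program is loaded at a different address; the instruction bytes themselves stay the same.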

To execute system functions (such as input/output instructions, querying the date and time, keyboard input, loading subroutines, and many more), all that is required in the machine program is a system call with the 'SVC' (Supervisor Call) instruction. The function to be carried out is specified in the second byte (see the table of SVC codes); further parameters for the function are passed via a data interface whose structure is defined and whose address is held in an implicitly agreed register (not specified in the instruction). Example: X'0A 08' = LOAD, parameter = program name, etc. The instructions that carry out the called functions are machine code of the operating system. They are executed there and then return to the instruction following the SVC.

Overview of the typical functionality of a machine language

Instruction set

The mnemonics (instruction abbreviations) below were chosen as examples and depend on the assembly language.

Addressing and display of results: Almost all instructions address the memory positions concerned (often source/target, value to be compared/comparison value, etc.) via defined registers. The processor also returns its results and relevant additional information via defined registers and/or via flags in the status register. This makes it possible to evaluate this information later in the program and to react to it. The length of the instructions and the size of the source and destination operands can vary depending on the architecture.

Example: An addition instruction such as ADC (add with carry) signals to the subsequent program flow, by setting the carry and overflow flags, that the valid range of values has been exceeded.
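
As a minimal sketch of this behaviour, the following Python function models an 8-bit add with carry and reports both flags. The 8-bit width and the name adc8 are assumptions for illustration; real processors offer several operand sizes and set further flags that are not modelled here.

```python
def adc8(a, b, carry_in=0):
    """Add two 8-bit values plus a carry; return (result, carry_flag, overflow_flag)."""
    total = a + b + carry_in
    result = total & 0xFF
    carry = total > 0xFF                                   # unsigned range exceeded
    overflow = bool((a ^ result) & (b ^ result) & 0x80)    # signed range exceeded
    return result, carry, overflow

print(adc8(0xFF, 0x01))   # unsigned 255 + 1 wraps around: carry flag set
print(adc8(0x7F, 0x01))   # signed 127 + 1 leaves the signed range: overflow flag set
```

The two flags answer different questions: the carry flag concerns the unsigned interpretation of the operands, the overflow flag the signed one.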

Differences: Instruction sets differ between processors. Not all instructions are available on every processor type and in every processor generation.

Example: A simple basic instruction such as SHL/SHR, which shifts a register value a given number of places to the left or right, is already available on the 8086. The more powerful variant SHLD/SHRD, which additionally fills the vacated bit positions from another integer value, is only implemented from the 80386 onward.
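
The difference between the two shift variants can be sketched in Python for 32-bit operands. The function names mirror the mnemonics, but this is only an illustration of the data movement under the assumption of a shift count between 1 and 31; flag effects and other operand sizes are not modelled.

```python
MASK32 = 0xFFFFFFFF

def shl(value, count):
    """Shift left; the vacated low bits are filled with zeros."""
    return (value << count) & MASK32

def shld(destination, source, count):
    """Shift left; the vacated low bits are filled from the high bits of source.
    Sketch for 1 <= count <= 31 only."""
    return ((destination << count) | (source >> (32 - count))) & MASK32

print(f"{shl(0x12345678, 8):08X}")
print(f"{shld(0x12345678, 0x9ABCDEF0, 8):08X}")
```

With SHL alone, the same effect would need a second shift of the source value and an OR operation to merge the two results.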

Power: The instruction set of a processor provides instructions of varying power. In addition to simple, single-step basic operations, there are also instructions that combine several operations in one.

Examples: The CMP (compare) instruction compares two values for <, >, =. The XCHG (exchange) instruction swaps the positions of two operands. The CMPXCHG (compare and exchange) instruction combines these two and enables a conditional data exchange in a single instruction. While the BT (bit test) instruction only checks the state of a single bit in an integer value, the BTC, BTR and BTS instructions additionally allow the tested bit to be set (BTS), cleared (BTR), or inverted (BTC).
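
The combined instructions named above can be modelled in a few lines of Python. This is a sketch of the logic only: the functions return the tested bit instead of setting a processor flag, the dictionary standing in for memory is an assumption, and the atomicity that makes CMPXCHG useful on real hardware is not reproduced.

```python
def bt(value, bit):
    """Bit test: return the tested bit (the real instruction copies it to the carry flag)."""
    return (value >> bit) & 1

def bts(value, bit):
    return bt(value, bit), value | (1 << bit)     # test, then set

def btr(value, bit):
    return bt(value, bit), value & ~(1 << bit)    # test, then reset (clear)

def btc(value, bit):
    return bt(value, bit), value ^ (1 << bit)     # test, then complement (invert)

def cmpxchg(memory, address, expected, new):
    """Compare and exchange: store new only if memory holds expected.
    Returns (success, value seen in memory)."""
    current = memory[address]
    if current == expected:
        memory[address] = new
        return True, current
    return False, current

memory = {0x10: 5}                        # hypothetical memory cell at address 0x10
print(bts(0b1000, 0), btr(0b1001, 3), btc(0b0101, 2))
print(cmpxchg(memory, 0x10, 5, 9), memory)
```

Each helper performs the test and the modification in one call, which is the point of the combined instructions.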

A general distinction is made between CPUs with a RISC (reduced instruction set computer) and a CISC (complex instruction set computer) instruction set. The former have a significantly less powerful instruction set but can typically process each individual instruction in one clock cycle. Modern CPUs with a CISC instruction set (today almost exclusively x86-compatible CPUs) internally decode the complex CISC instructions into RISC-like micro-operations for faster execution.

Performance: Each instruction is processed in a number of processor clock cycles specified in data sheets. Knowing these allows the programmer (in extremely time-critical applications) to replace, for example, instructions that take many clock cycles with several instructions that are more efficient overall.

Categorization of commands

Basic machine commands can be divided into the following categories:

  • Arithmetic operations: perform calculations (ADD, ADC, SUB, SBB, DIV, MUL, INC, DEC)
  • Logical operations: combine bit fields logically (AND, OR, XOR, NOT)
  • Bit-oriented operations: address and read out (BSF, BSR), shift (SHL, SHR, RCL, RCR, ROL, ROR) or manipulate (BT, BTC, BTR) individual bits in a bit field
  • Memory operations: transfer data between processor registers (MOV, MOVSX, MOVZX, XCHG), within a register (BSWAP), and between registers and memory
  • Comparison operations: compare values using <, > and = (CMP, TEST)
  • Combined instructions made up of comparison operations, arithmetic operations and data exchange (XADD, CMPXCHG)
  • Control operations: branches that influence the program flow
  • Data conversion: convert values from one representation to another, possibly with loss of information. Examples: a byte into a word (CBW), a long integer into a byte (CVTLB), or a double-precision floating-point number into an integer (CVTSD2SI).
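
The data conversions in the last list item can be illustrated with a Python sketch. The function names are assumptions. cbw models the sign extension of a byte to a word; long_to_byte is a plain truncation that only shows where information is lost and does not model any specific instruction's overflow signalling; double_to_int assumes round-to-nearest with ties to even, which is taken here to be the default rounding mode, while other modes are not modelled.

```python
def cbw(byte):
    """Sign-extend an 8-bit value to 16 bits."""
    return byte | 0xFF00 if byte & 0x80 else byte

def long_to_byte(value):
    """Keep only the low 8 bits of a wider integer; the higher bits are lost."""
    return value & 0xFF

def double_to_int(value):
    """Round a float to the nearest integer, ties to even (assumed default mode)."""
    return round(value)

print(f"{cbw(0x80):04X} {cbw(0x7F):04X}")    # negative and positive byte widened
print(f"{long_to_byte(0x1234):02X}")         # upper byte discarded
print(double_to_int(2.7), double_to_int(2.5))
```

Widening preserves the value, whereas the two narrowing conversions can discard digits or the fractional part.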

In many modern processors, the machine language instructions, at least the more complex ones, are implemented internally by microprograms. This is particularly the case with the CISC architecture.

Literature

  • Assembler - machine-level programming from the start. rororo paperbacks No. 61224 (2003), ISBN 3-499-61224-0.

References

  1. Duden Computer Science. ISBN 3-411-05232-5.
  2. Machine code. In: Gabler Wirtschaftslexikon.
  3. Table of SVC codes for IBM's MVS, OS/390 and z/OS.
  4. Supervisor Call instruction, in the English-language Wikipedia.