Self-modifying code

Self-modifying code is a portion of a computer program that deliberately changes parts of its own program code during execution in order to solve its task. Konrad Zuse had already included self-modifying code as a possibility in the Plankalkül programming language he designed, under the name “free calculation plan”.

The program must be able to replace certain instructions in its machine code with other meaningful machine instructions. In higher-level programming languages (e.g. APL), the program usually manipulates the source code as a character string (text string).
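The source-string approach can be sketched in Python. This is only an illustration under assumptions: the function name `step` and the way the text is edited are hypothetical, and Python stands in for a language such as APL.

```python
# Hypothetical example: the program keeps the source of one of its own
# functions as a text string, edits that string, and recompiles it.
source = "def step(x):\n    return x + 1\n"

namespace = {}
exec(source, namespace)           # compile and install the original version
before = namespace["step"](10)    # runs the "x + 1" variant

# Self-modification at source level: change the operator in the text ...
source = source.replace("x + 1", "x - 1")
exec(source, namespace)           # ... and replace the installed definition
after = namespace["step"](10)     # runs the "x - 1" variant

print(before, after)
```

The same call site `namespace["step"]` behaves differently before and after the edit, which is the defining property of self-modifying code at the source level.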

Self-modifying code can be used, among other things, where several program parts that differ in only a few places can be combined into a single one.

Self-modification usually has nothing to do with a program learning or improving itself. Programs that modify their own high-level source code may, however, prove helpful in the future for increasing machine intelligence.

Motivation

The technique of letting code modify itself mainly dates from a time when resources (CPU time, memory) were still very scarce, so the aim was often to optimize runtime behavior or memory consumption. So-called runtime packers decompress the actual program with an auxiliary routine before starting it. Both runtime optimization through self-modification and the reduction of memory requirements are now needed only very rarely (e.g. in "retro computing", that is, when programming for very old systems). Another reason for self-modification was to protect intellectual property by hiding the actual algorithms. Given these historical motivations, the presence of self-modifying code should not be judged solely by modern standards of code quality; the historical and technical circumstances should always be taken into account as well.

Architecture and language dependency

Modifying machine-language program code in memory is straightforward on a Von Neumann architecture, since parts of the program can be treated as data at one moment and as program code again later; in the Von Neumann architecture, program and data share the same address space. Processors with a Harvard architecture do not provide for modifying machine code at runtime, because program and data have separate address spaces. Where applicable, special instructions are available for transferring information between the address spaces, or detours outside main memory have to be taken.

Higher-level programming languages can embed a compiler in the "normal program", in which case the modification may not have to be carried out directly in machine language. It helps if the language is homoiconic, that is, if programs are (or can be) data structures of the same language; in such languages it is easy to write programs that write programs. Porting self-modifying machine code to another processor is almost impossible. Many processor architectures that are actually built along Von Neumann lines now offer mechanisms that prevent writing to (machine) code areas and executing data areas (e.g. the NX bit) as a protective measure against buffer overflow attacks. In higher-level programming languages, self-modifying code usually requires an interpreting (i.e. not compiling) system.

Advantages

A very compact program can be constructed for certain tasks.
The program solution found can appear elegant.
The program can be better protected against reverse engineering.

Disadvantages

Compilers do not support the creation of self-modifying code.
The program code is difficult or impossible to port.
The machine code is difficult to understand.
The CPU design becomes significantly more complicated; errors sometimes occur on other CPU versions.

Examples

Video game

In a video tennis game, the part of the program that controls the ball can replace an increment instruction with a decrement instruction when the ball hits a wall, thereby reversing the direction of movement.

The bytes containing the coordinates of the ball can be stored in the memory in such a way that they can be interpreted as direct parameters of a command. For example, imagine a command that causes the ball to be displayed in a specific location. Instead of addressing the two arguments "X-Position" and "Y-Position" indirectly as variables, they can be stored directly in the memory in such a way that they are part of the command "Show ball".

Combination of the two examples as a pseudo-program:

If the ball has hit a vertical wall and the program code says "increment x-coordinate", then write the command for "decrement x-coordinate" in the appropriate memory location and skip the next command
If the ball has hit a vertical wall and the program code says "decrement x-coordinate", then write the command for "increment x-coordinate" in the appropriate memory location
If the ball has hit a horizontal wall and the program code says "increment y-coordinate", then write the command for "decrement y-coordinate" in the appropriate memory location and skip the next command
If the ball has hit a horizontal wall and the program code says "decrement y-coordinate", then write the command for "increment y-coordinate" in the appropriate memory location
Increment the x-coordinate of the ball display command
Increment the y-coordinate of the ball display command
Place the ball in position 1, 1 and start over

Both increment commands and the coordinates "1,1" are merely initial values in this example; the program itself modifies them.
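The pseudo-program can be simulated in Python as a rough sketch. Everything specific here is an assumption made for illustration: the opcode values `INC`, `DEC` and `SHOW`, the field size, and the memory layout are hypothetical and do not belong to any real machine. The wall test checks the direction of travel directly, so the "skip the next command" step of the pseudo-program is not needed.

```python
# Hypothetical opcodes of an imaginary machine (not a real instruction set).
INC, DEC, SHOW = 0x01, 0x02, 0x03
WIDTH, HEIGHT = 5, 4              # assumed field: x in 0..4, y in 0..3

# Code and data share one memory: two movement opcodes, then the
# "show ball" command with its coordinates stored as direct operands.
X_OP, Y_OP, SHOW_OP, X_ARG, Y_ARG = range(5)
memory = bytearray([INC, INC, SHOW, 1, 1])

def patch_on_wall(mem, op, arg, size):
    """Overwrite the movement opcode when the ball has hit a wall."""
    if mem[arg] == size - 1 and mem[op] == INC:
        mem[op] = DEC
    elif mem[arg] == 0 and mem[op] == DEC:
        mem[op] = INC

def execute(mem, op, arg):
    """Run the (possibly patched) opcode; it edits the SHOW operand."""
    mem[arg] += 1 if mem[op] == INC else -1

def frame(mem):
    patch_on_wall(mem, X_OP, X_ARG, WIDTH)
    patch_on_wall(mem, Y_OP, Y_ARG, HEIGHT)
    execute(mem, X_OP, X_ARG)
    execute(mem, Y_OP, Y_ARG)
    return mem[X_ARG], mem[Y_ARG]  # operands the SHOW command would use

path = [frame(memory) for _ in range(6)]
print(path)
```

No separate direction variable or coordinate variable exists: the direction is the opcode currently stored in memory, and the position is the operand pair of the display command.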

Math program

In Microsoft BASIC on Commodore computers (e.g. PET, VC 20, C64), a running program could effectively hand a user function (e.g. "SIN(X)") that it had read with the INPUT command over to the program editor by briefly stopping itself. The editor then changed a line of the BASIC program accordingly, after which the program was continued with the GOTO command, without losing its variable contents, and could use the new line for calculations. To do this, the program printed the desired new program line (using the Microsoft BASIC expression "DEF FN") in the top line of the screen and the command "GOTO xxx", which jumps back into the program, in the second line. Filling the keyboard buffer with the HOME character and several carriage-return control characters ensured that, after the STOP command, the system's own program editor processed the program line just printed and then resumed the BASIC program when it reached the GOTO command (both triggered by the carriage-return characters).
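A loose Python analogue of this idea, not a reproduction of the Commodore mechanism, might look as follows. The names `state`, `redefine_fn` and `fn` are hypothetical, and the fixed string passed to `redefine_fn` stands in for text read with INPUT. Executing user-entered text like this is unsafe outside a toy setting.

```python
import math

# Variables of the "running program"; they must survive the edit.
state = {"x": 0.5, "runs": 3, "sin": math.sin}

# The program line that will be rewritten starts out as a placeholder.
program_line = "def fn(x): return 0.0"
exec(program_line, state)

def redefine_fn(expression, state):
    """Build a new program line from the entered text and install it
    in the same namespace, so existing variables are kept."""
    new_line = f"def fn(x): return {expression}"
    exec(new_line, state)
    return new_line

program_line = redefine_fn("sin(x) * 2", state)  # stands in for INPUT
result = state["fn"](state["x"])                 # program continues
print(program_line, result, state["runs"])
```

The point of the sketch is the shared namespace: the definition of `fn` changes while `x` and `runs` keep their values, mirroring the way the BASIC program continued with its variables intact.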

Copy routines (6502-CPU)

A copy subroutine was given the start address, target address and size in bytes or memory pages (256 bytes each). The usual way to copy was to store the addresses in two pointers in the zero page and then use load and store instructions with indirect indexed zero-page addressing. On the 6502 CPU, however, these take up to two clock cycles longer than their absolute-addressed counterparts. The trick for increasing speed is to use absolute-addressed instructions instead. In this kind of self-modifying code, it is not the index register and the pointer addresses that are incremented but the addresses stored in the program code after the opcodes of the absolute load and store instructions. This allows copy routines to be accelerated significantly.
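The operand-patching idea can be sketched as a small Python simulation. The opcode bytes `0xAD` (LDA absolute) and `0x8D` (STA absolute) are the real 6502 encodings; everything else is an assumption for illustration: the addresses, the helper names, and the fact that the Python host performs the address increments that a real routine would do with 6502 instructions. Cycle counts are not modeled.

```python
LDA_ABS, STA_ABS = 0xAD, 0x8D     # 6502 opcodes: LDA absolute, STA absolute
CODE = 0xC000                     # assumed location of the copy loop

def read_operand(mem, addr):
    """Read the 16-bit little-endian address stored inside the code."""
    return mem[addr] | (mem[addr + 1] << 8)

def bump_operand(mem, addr):
    """Self-modification: increment the address embedded in the code."""
    value = (read_operand(mem, addr) + 1) & 0xFFFF
    mem[addr], mem[addr + 1] = value & 0xFF, value >> 8

def copy(mem, src, dst, count):
    # Assemble "LDA src / STA dst" with the addresses as direct operands.
    mem[CODE:CODE + 6] = bytes([LDA_ABS, src & 0xFF, src >> 8,
                                STA_ABS, dst & 0xFF, dst >> 8])
    for _ in range(count):
        acc = mem[read_operand(mem, CODE + 1)]   # LDA <operand in code>
        mem[read_operand(mem, CODE + 4)] = acc   # STA <operand in code>
        bump_operand(mem, CODE + 1)              # patch the load address
        bump_operand(mem, CODE + 4)              # patch the store address

mem = bytearray(0x10000)          # 64 KiB shared by code and data
mem[0x2000:0x2005] = b"HELLO"
copy(mem, 0x2000, 0x3000, 5)
print(bytes(mem[0x3000:0x3005]))
```

After the loop, the operands inside the code no longer hold the original start and target addresses, so a real routine of this kind has to rewrite them before it can be called again.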

Remarks

Self-modifying code was used, for example, to distinguish the Intel 8088 from the Intel 8086, because one had a longer instruction pipeline: the processor with the short pipeline followed the change, but the processor with the longer pipeline continued to execute the "old" instruction because it was already stored in the pipeline.