Bytecode

In computer science, bytecode is a set of instructions for a virtual machine. When source code in some programming languages or environments, such as Java, is compiled, the compiler does not produce machine code directly but an intermediate code, the bytecode. This code is usually independent of real hardware. It results from a semantic analysis of the source text and, compared with that text, is often relatively compact and can be interpreted much more efficiently.
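As an illustration of source text being turned into a compact intermediate code, the following minimal sketch uses Python's built-in compile() function and the standard dis module. It assumes the CPython implementation; the exact instruction names and byte counts differ between CPython versions, so none are shown as expected output.

```python
import dis

source = "result = (a + b) * 2"

# compile() turns source text into a code object that holds CPython bytecode.
code = compile(source, "<example>", "exec")

# co_code is the raw bytecode: a plain bytes object, not processor machine code.
print(type(code.co_code).__name__, len(code.co_code), "bytes")

# dis decodes the bytes into readable virtual machine instructions.
for instruction in dis.get_instructions(code):
    print(instruction.opname, instruction.argrepr)

# The same code object is then executed by the virtual machine.
namespace = {"a": 3, "b": 4}
exec(code, namespace)
print(namespace["result"])  # 14
```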

Technical details

The virtual machine, in the case of Java the Java Virtual Machine (JVM), then executes this intermediate result in one of two ways: it either translates the bytecode into machine language for the respective processor at runtime (just-in-time compilation) or executes corresponding machine code routines for each instruction (interpreter). The virtual machine must be available for each computer platform on which the compiled program is to be executed.
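The interpreter variant can be sketched as a loop that reads one opcode at a time and dispatches to a matching routine. The instruction set below (PUSH, ADD, MUL, HALT and their byte values) is hypothetical and chosen only for illustration; it is not the instruction set of the JVM or of any other real virtual machine.

```python
# Hypothetical one-byte opcodes for a toy stack machine (not a real VM's set).
PUSH, ADD, MUL, HALT = 0x01, 0x02, 0x03, 0xFF

def run(bytecode: bytes) -> int:
    """Interpret toy bytecode by dispatching each opcode to a host routine."""
    stack = []
    pc = 0  # program counter: index of the next byte to decode
    while True:
        opcode = bytecode[pc]
        pc += 1
        if opcode == PUSH:
            stack.append(bytecode[pc])  # the operand is the following byte
            pc += 1
        elif opcode == ADD:
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif opcode == MUL:
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        elif opcode == HALT:
            return stack.pop()
        else:
            raise ValueError(f"unknown opcode {opcode:#04x} at offset {pc - 1}")

# (2 + 3) * 4 encoded as nine bytes of toy bytecode
program = bytes([PUSH, 2, PUSH, 3, ADD, PUSH, 4, MUL, HALT])
print(run(program))  # 20
```

The same byte sequence would run unchanged on any platform for which such a run function exists, which is the sense in which the bytecode is independent of the hardware.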

Java is just one of the more prominent examples of a bytecode-based programming language. Other languages and systems that use bytecode are P-Code, Lua, the .NET languages C#, F# and Visual Basic, Python, Ruby, Perl, PHP, Prolog, Limbo, Gambas and Tcl. While in Java, Python and .NET the bytecode is saved as a compiled file and then executed independently of the source code, in the scripting languages Perl (up to version 5) and Tcl the source code is translated into bytecode when the program is started and is kept only in main memory.

The use of bytecode makes it possible to use the same virtual machine for multiple languages, such as the JVM for Java, Scala, Groovy and many others. A new high-level language then needs only a bytecode compiler, which takes significantly less effort to develop. By contrast, the effort would be much higher if a compiler had to translate directly into machine code for several operating systems and architectures. A bytecode can also be developed for a specific purpose independently of any particular language, for example WebAssembly.
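The idea of several languages sharing one virtual machine can be sketched with two tiny front ends that emit the same instructions for one back end. Everything here is hypothetical: the two "languages" are just postfix and prefix arithmetic notation, and the instruction names are invented for the example.

```python
def run(code):
    """Shared back end: a tiny stack machine for (operation, argument) pairs."""
    stack = []
    for op, arg in code:
        if op == "PUSH":
            stack.append(arg)
        elif op == "ADD":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif op == "MUL":
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        else:
            raise ValueError(f"unknown instruction {op}")
    return stack.pop()

OPS = {"+": "ADD", "*": "MUL"}

def compile_postfix(text):
    """Front end 1: postfix notation such as '2 3 + 4 *'."""
    return [(OPS[t], None) if t in OPS else ("PUSH", int(t)) for t in text.split()]

def compile_prefix(text):
    """Front end 2: prefix notation such as '* + 2 3 4'."""
    tokens = text.split()

    def expr():
        token = tokens.pop(0)
        if token in OPS:
            left, right = expr(), expr()
            return left + right + [(OPS[token], None)]
        return [("PUSH", int(token))]

    return expr()

postfix_code = compile_postfix("2 3 + 4 *")
prefix_code = compile_prefix("* + 2 3 4")
print(postfix_code == prefix_code)            # True
print(run(postfix_code), run(prefix_code))    # 20 20
```

Each front end is only a few lines long because neither has to know anything about processors or operating systems; that knowledge lives once, in the shared back end.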

Executing bytecode through the program that implements the virtual machine comes at the expense of start time, although the delay is generally only measurable, not perceptible. Special just-in-time compilers (JIT compilers) translate pieces of bytecode into corresponding pieces of machine code once during program execution and then execute those. As a result, execution times, but not start times, can often be brought into the range of precompiled machine code.
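The translate-once principle can be sketched as follows. This is only a simulation under a strong assumption: a real JIT compiler emits processor machine code, whereas here a generated Python function stands in for the machine code. The bytecode format and all names are hypothetical.

```python
# Hypothetical toy bytecode: a tuple of (operation, argument) pairs for (x + 3) * 2.
PROGRAM = (("LOAD_X", None), ("PUSH", 3), ("ADD", None), ("PUSH", 2), ("MUL", None))

_translated = {}       # cache: bytecode sequence -> translated function
translation_count = 0  # how often a translation actually took place

def translate(code):
    """Translate a bytecode sequence once; later calls reuse the cached result."""
    global translation_count
    if code in _translated:
        return _translated[code]
    translation_count += 1
    exprs = []  # translation-time stack of expression strings
    for op, arg in code:
        if op == "PUSH":
            exprs.append(str(arg))
        elif op == "LOAD_X":
            exprs.append("x")
        elif op in ("ADD", "MUL"):
            right, left = exprs.pop(), exprs.pop()
            symbol = "+" if op == "ADD" else "*"
            exprs.append(f"({left} {symbol} {right})")
        else:
            raise ValueError(f"unknown instruction {op}")
    # Stand-in for machine code generation: build a host-language function.
    function = eval(f"lambda x: {exprs.pop()}")
    _translated[code] = function
    return function

results = [translate(PROGRAM)(x) for x in range(5)]
print(results)            # [6, 8, 10, 12, 14]
print(translation_count)  # 1
```

The cost of translation is paid on the first call only, which mirrors why a JIT compiler improves execution time but not start time.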

Background

Many interpreted languages also use bytecode internally. The bytecode is then hidden from the programmer and end user and is generated automatically as an intermediate step in interpreting the program. Examples of current languages that use this technique are Perl, PHP, Prolog, Tcl and Python. In Python, the bytecode is stored in .pyc files after the source has first been parsed; the procedure is basically similar to that of Java. However, this step is optional.
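The following sketch writes a .pyc file explicitly with the standard py_compile module and then loads the stored bytecode again without using the source. It assumes CPython 3.7 or later, where the .pyc header is 16 bytes long; the file layout is an implementation detail and may change. The file name hello.py is made up for the example.

```python
import importlib.util
import marshal
import pathlib
import py_compile
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    source_path = pathlib.Path(tmp) / "hello.py"
    source_path.write_text("answer = 6 * 7\n")

    # py_compile.compile writes the bytecode file and returns its path.
    pyc_path = py_compile.compile(str(source_path), cfile=str(source_path) + "c")

    data = pathlib.Path(pyc_path).read_bytes()
    # The file starts with a magic number that identifies the bytecode version.
    assert data[:4] == importlib.util.MAGIC_NUMBER
    # Skip the 16-byte header (CPython 3.7+) and unmarshal the code object.
    code = marshal.loads(data[16:])

# The source file is gone by now; the bytecode alone is enough to run the program.
namespace = {}
exec(code, namespace)
print(namespace["answer"])  # 42
```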

This procedure is also very old. Byte compiling was already used for Lisp in the 1960s: the 256 atomic functions were each coded in one byte, which is where the name comes from. Early BASIC versions of the 1970s and 1980s stored certain byte values, so-called tokens, instead of keywords in order to speed up program execution and to save the program text in a more compact form. The rest of the text, such as variables, mathematical expressions and strings, was saved unchanged. When the program was displayed with the LIST command, the tokens were converted back into readable keywords.
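The BASIC tokenization described above can be sketched like this. The token values are hypothetical (each real BASIC dialect had its own table), and matching whole words is a simplification of how those interpreters recognized keywords.

```python
import re

# Hypothetical token values; real BASIC dialects each used their own tables.
TOKENS = {"PRINT": 0x80, "GOTO": 0x81, "IF": 0x82, "THEN": 0x83}
KEYWORDS = {value: word for word, value in TOKENS.items()}

def tokenize(line: str) -> bytes:
    """Replace keywords by one-byte tokens; keep all other text unchanged."""
    stored = bytearray()
    # Split into quoted strings, words, and everything else.
    for part in re.findall(r'"[^"]*"?|[A-Za-z]+|[^A-Za-z"]+', line):
        if part in TOKENS:
            stored.append(TOKENS[part])
        else:
            stored.extend(part.encode("ascii"))
    return bytes(stored)

def list_line(stored: bytes) -> str:
    """Like LIST: expand tokens back into readable keywords."""
    return "".join(KEYWORDS.get(byte, chr(byte)) for byte in stored)

line = '10 IF X>5 THEN PRINT "IF"'
stored = tokenize(line)
print(len(line), "characters stored in", len(stored), "bytes")
print(list_line(stored) == line)  # True
```

The keyword inside the quoted string is left alone, matching the statement that strings were saved unchanged.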

A well-known early home computer that used bytecode is the TI-99/4A from Texas Instruments.

Recovering source code from compiled programs

Programmers who work on programs whose source code should not be disclosed need to consider one important aspect. With programming languages such as C, which are compiled directly into machine code, it is usually not possible to recover the original source code from the machine code. With compilation to bytecode, however, recovery is often not a major obstacle. Although the exact source code cannot be reconstructed, it is often possible to recover at least equivalent code in the source language, sometimes with astonishing similarity to the original. With Java and .NET, for example, this is possible in most cases; with Prolog compiled to WAM bytecode, recovery is always possible.
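Why bytecode tends to decompile into equivalent rather than identical source can be sketched with a toy compiler and decompiler for arithmetic expressions. The bytecode format is hypothetical; Python's standard ast module is used only to parse the input expression.

```python
import ast

def compile_expr(source):
    """Compile an arithmetic expression to toy stack bytecode (hypothetical format)."""
    ops = {ast.Add: "ADD", ast.Mult: "MUL"}

    def emit(node):
        if isinstance(node, ast.BinOp) and type(node.op) in ops:
            return emit(node.left) + emit(node.right) + [(ops[type(node.op)], None)]
        if isinstance(node, ast.Name):
            return [("LOAD", node.id)]
        if isinstance(node, ast.Constant):
            return [("PUSH", node.value)]
        raise ValueError("unsupported expression")

    return emit(ast.parse(source, mode="eval").body)

def decompile(code):
    """Rebuild source text by symbolically replaying the stack operations."""
    symbols = {"ADD": "+", "MUL": "*"}
    stack = []
    for op, arg in code:
        if op in ("LOAD", "PUSH"):
            stack.append(str(arg))
        else:
            right, left = stack.pop(), stack.pop()
            stack.append(f"({left} {symbols[op]} {right})")
    return stack.pop()

original = "(a+b)*2   # comment and spacing are lost"
recovered = decompile(compile_expr(original))
print(recovered)  # ((a + b) * 2)

values = {"a": 3, "b": 4}
print(eval(original, dict(values)) == eval(recovered, dict(values)))  # True
```

The structure of the expression and the variable names survive in the bytecode, so the recovered text computes the same result, while comments and layout are gone.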

With the help of a so-called obfuscator, the usability of the source text obtained by decompiling can be severely restricted; sometimes it is then no longer possible to decompile into the source language at all.

For one's own .NET projects, the .NET Reflector can be used to restore the source code. It supports the three .NET languages C#, Visual Basic .NET and Delphi.NET. Alternatively, the open source tool dnSpy can be used.

Web links

Java bytecode decrypted (English; page no longer available, via archive.org)

References

[1] ITWissen: Bytecode. Retrieved October 19, 2018.
[2] TechTerms: Bytecode. Retrieved October 19, 2018.
[3] Carles Mateo: Performance of several languages (blog entry).
[4] What Is: bytecode. Retrieved October 19, 2018.
[5] CS1Bh Lecture Note 7, Compilation I: Java Byte Code (PDF). Retrieved October 19, 2018.
