Explicitly Parallel Instruction Computing

Explicitly Parallel Instruction Computing (EPIC) describes a programming paradigm of an instruction set architecture (ISA) and the processing structure of a family of microprocessors, for example the Itanium. When programming EPIC CPUs, the instructions of an instruction stream are explicitly parallelized. The ISA has properties that support explicit parallelization, whereas a conventional ISA assumes that instructions are processed sequentially. A program in a non-EPIC machine language can also be parallelized, but this requires complex logic at run time to identify instructions that can be executed in parallel, since the instruction format itself says nothing about which instructions are parallelizable. An EPIC CPU works on the principle of in-order execution, in contrast to the out-of-order execution of superscalar CPUs.

The motivation for developing an EPIC processor is to reduce the number of logic gates in the processor. The chip area freed up in this way can be used to

integrate further functional units (e.g. arithmetic units) into the CPU in order to increase the number of operations that can be carried out in parallel,
integrate larger caches into the processor in order to reduce the influence of the main memory bottleneck, or
reduce power consumption, power loss and thus heat dissipation.

Out-of-order execution is partly a consequence of the need to remain backward compatible with older processors. Since the instruction format of an older processor still had to be supported, improvements to parallel execution could only be made under the hood. In principle, however, this task can be entrusted to the compiler, which in most cases is better suited to it: it can spend more time on optimization and has access to more information about the program flow.

Features

The most important features of this instruction set architecture are:

Static instruction grouping: The compiler determines which instructions can be processed in parallel. This makes the processor considerably simpler (Pentium 4: 42 million transistors; Itanium: 25 million transistors).
VLIW architecture: The processor receives very long instruction words, each of which contains several instructions together with the information on which processor unit each instruction is to be executed. In IA-64, three instructions are packed into one VLIW.
Predication: Predication (statement, assertion) means the conditional execution of instructions without the use of branch instructions.
Speculation: So that the instruction sequence does not have to wait for data, data can be loaded and processed speculatively at an early point in time.
Load/store architecture: Memory is accessed only by load and store instructions (and, of course, by instruction fetch).
Large register sets: The load/store architecture requires many registers in order to keep the number of memory accesses as small as possible.
Register stack and register engine: The registers are arranged so that, within a procedure, the static registers and the registers used by that procedure are visible. The registers are renamed dynamically when a procedure is called. If necessary, the register engine saves the currently invisible registers to memory, so that the number of registers appears unlimited to a user program.
High-performance instruction set: The instruction set contains a large number of powerful instructions suitable for parallel processing.
Little-endian and big-endian: A control register of the processor defines the byte order in which the processor stores data in memory.

Implementation

With EPIC, the processor is told at programming time which instructions can be executed in parallel. Such parallelizable instructions are collected into groups (instruction groups). The instructions of a group can then, in principle, be executed in any order and with any degree of parallelism.

To separate dependent instructions from each other, stops must be inserted into the instruction stream. A stop marks the end of one instruction group and the beginning of the next. The stops are the actual explicit parallelization information, because they identify parallelizable instructions without the instructions having to be analyzed.

The optimization goal for a given EPIC program is to minimize the number of instruction groups required, i.e. to increase the average number of instructions per instruction group.
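
The placement of stops can be illustrated with a small Python model. This is only a sketch under simplifying assumptions, not how a real IA-64 compiler works: the Instr type and the insert_stops function are hypothetical names introduced here, only register read-after-write and write-after-write dependencies are tracked, memory dependencies are ignored, and the grouping is greedy rather than optimal. The ";;" printed after each group follows the stop notation of the IA-64 assembler.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    text: str     # pseudo-assembly text, for display only
    dest: str     # register written by the instruction
    srcs: tuple   # registers read by the instruction


def insert_stops(program):
    """Greedily split a sequential program into instruction groups.

    A stop is placed before any instruction that reads or writes a
    register already written earlier in the current group.
    """
    groups, current, written = [], [], set()
    for ins in program:
        depends = ins.dest in written or any(s in written for s in ins.srcs)
        if depends and current:
            groups.append(current)          # stop: close the current group
            current, written = [], set()
        current.append(ins)
        written.add(ins.dest)
    if current:
        groups.append(current)
    return groups


# Hypothetical four-instruction program in the article's simplified notation.
program = [
    Instr("add gr1 = gr2,gr3", "gr1", ("gr2", "gr3")),
    Instr("add gr4 = gr5,gr6", "gr4", ("gr5", "gr6")),
    Instr("add gr7 = gr1,gr4", "gr7", ("gr1", "gr4")),
    Instr("add gr8 = gr7,gr2", "gr8", ("gr7", "gr2")),
]

groups = insert_stops(program)
for group in groups:
    print("   ".join(ins.text for ins in group), ";;")
```

In this model the first two additions are independent and share a group, while the third and fourth each depend on a result from the group before and therefore each start a new group. Fewer, larger groups correspond to the optimization goal described above.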

The rule that the instructions of a group may be executed in any order has exceptions (taking IA-64 as the example). For instance, a fault (exception) raised by an earlier instruction of a group must always be handled as if the later instructions of the group had not been executed at all. Any instructions that have already been executed speculatively have their results discarded when an earlier instruction raises an exception. The processor must therefore give the impression that the instructions of a group were executed in sequence. Other exceptions to the rule concern special instructions that by definition must appear at the beginning or the end of an instruction group.

Properties

EPIC saves the resources that non-EPIC processors must spend on distributing instructions to the parallel functional units during execution. These savings are the motivation behind the invention of EPIC. They can lower the cost of the chip and reduce power consumption. With EPIC, the calculations needed to parallelize the instruction stream are carried out once, during compilation; with non-EPIC processors they are carried out every time the code is executed, with the help of a large number of logic gates.

Modern compilers optimize the instruction stream even for non-EPIC ISAs (e.g. by moving independent instructions within the stream) in order to support the processor in parallelization. With an EPIC ISA, this support is mandatory: code that places dependent instructions in the same instruction group produces a dependency error, even if the instructions appear in the correct sequential order.

EPIC is related to VLIW, because VLIW also serves to group instructions. Note that in IA-64 the VLIW grouping into bundles and the EPIC grouping into instruction groups are independent of each other: an instruction group can span any number of bundles, and a stop can also be placed between the instructions of a single bundle.

ISAs that have EPIC as an architectural feature are relatively difficult to program in assembly language, and their compilers are considerably more complex, because the parallelization is no longer performed by the implementation of the ISA, i.e. the processor itself, but has to be done explicitly. In EPIC programming there are, for example, situations that cannot be expressed in a non-EPIC machine language, because the model there is strictly sequential execution.

Since the calculations that are necessary for the parallelization are carried out independently of the execution, more computing time can be spent on this task.

Static instruction grouping

Static instruction grouping means that the compiler is responsible for grouping the instructions that run in parallel. The compiler must be optimized for the architecture of the processor in order to exploit its properties. It groups the instructions so that as many as possible can be processed in parallel. In addition, it determines which unit of the processor is required to process each instruction and marks the instructions accordingly. The instructions are passed to the processor in groups and, based on the assignment determined by the compiler, are distributed to the processor units and processed there.

Predication

Predication is a method of executing instructions depending on a condition without using branch instructions. The procedure is shown in simplified form below. The execution of an instruction can depend on the content of a predicate register. In the following example, the mov instruction is executed only if the predicate register p1 is true; otherwise it acts like a NOP.

p1  mov gr8 = gr5  ; load gr8 with the value of gr5 if p1 = true
                   ; if p1 = false, the instruction acts like a
                   ; NOP

The predicate registers can be set with compare instructions. The following compare instruction tests the registers gr10 and gr11 for equality. The predicate register p1 is loaded with the result, the register p2 with its negation.

    cmp.eq p1,p2 = gr10,gr11  ; test gr10 and gr11 for equality
                              ; if equal: p1 true, p2 false
                              ; if not equal: p1 false, p2 true
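
How a compare and two predicated moves can replace an if/else branch may be easier to see in a small Python model. This is a hedged sketch, not IA-64 code: cmp_eq, predicated_mov and select_branch_free are hypothetical helper names, and registers are modelled as dictionary entries. The `if pred` inside predicated_mov stands for the hardware discarding the instruction's result, not for a jump in the program being modelled.

```python
def cmp_eq(a, b):
    """Model of 'cmp.eq p1,p2 = a,b': returns the result and its negation."""
    equal = (a == b)
    return equal, not equal


def predicated_mov(pred, regs, dest, src):
    """Model of 'pred mov dest = src': acts like a NOP when pred is false."""
    if pred:
        regs[dest] = regs[src]


def select_branch_free(regs):
    """Model of: if gr10 == gr11 then gr8 = gr5 else gr8 = gr6.

    Both moves are always issued; the complementary predicates p1 and p2
    decide which one takes effect, so no branch instruction is needed.
    """
    p1, p2 = cmp_eq(regs["gr10"], regs["gr11"])
    predicated_mov(p1, regs, "gr8", "gr5")
    predicated_mov(p2, regs, "gr8", "gr6")
    return regs["gr8"]


equal_case = {"gr5": 50, "gr6": 60, "gr8": 0, "gr10": 1, "gr11": 1}
unequal_case = {"gr5": 50, "gr6": 60, "gr8": 0, "gr10": 1, "gr11": 2}
print(select_branch_free(equal_case), select_branch_free(unequal_case))
```

Because p1 and p2 are always complementary, exactly one of the two moves writes gr8, whichever way the comparison turns out.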

Speculation

The faster processors become, the greater the loss when data has to be loaded from memory and the processor has to wait for it. The goal is therefore to execute instructions earlier in the program sequence so that the required data is available when it is needed.

In the first example, execution has to wait after each load instruction until the data has been loaded into gr4 or gr5, respectively. In the second example, the instructions have been reordered so that the distance between dependent instructions is larger.

ld  gr4, x
add gr4 = gr4,gr8
st  y, gr4
ld  gr5, a
add gr5 = gr5,gr9
st  b, gr5

ld  gr4, x
ld  gr5, a
add gr4 = gr4,gr8
add gr5 = gr5,gr9
st  y, gr4
st  b, gr5
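
The effect of the reordering can be estimated with a toy timing model in Python. The numbers are assumptions chosen for illustration, not measurements of any real processor: the model issues at most one instruction per cycle strictly in order, a loaded value becomes usable LOAD_LATENCY = 3 cycles after the load issues, and every other result is usable one cycle later. The function name run_cycles and the tuple encoding of the instructions are likewise hypothetical.

```python
LOAD_LATENCY = 3  # assumed number of cycles until a loaded value is usable


def run_cycles(program, load_latency=LOAD_LATENCY):
    """Return the cycle count of an in-order, single-issue toy model.

    program: list of (opcode, dest_register_or_None, source_registers).
    """
    ready = {}   # register -> first cycle in which its value can be used
    cycle = 0    # earliest cycle in which the next instruction may issue
    for op, dest, srcs in program:
        # Stall until every source register is ready.
        start = max([cycle] + [ready.get(s, 0) for s in srcs])
        latency = load_latency if op == "ld" else 1
        if dest is not None:
            ready[dest] = start + latency
        cycle = start + 1
    return cycle


# First listing: each load is followed directly by its use.
original = [
    ("ld", "gr4", ()), ("add", "gr4", ("gr4", "gr8")), ("st", None, ("gr4",)),
    ("ld", "gr5", ()), ("add", "gr5", ("gr5", "gr9")), ("st", None, ("gr5",)),
]
# Second listing: both loads are issued first.
reordered = [
    ("ld", "gr4", ()), ("ld", "gr5", ()),
    ("add", "gr4", ("gr4", "gr8")), ("add", "gr5", ("gr5", "gr9")),
    ("st", None, ("gr4",)), ("st", None, ("gr5",)),
]

print(run_cycles(original), run_cycles(reordered))  # 10 7 under these assumptions
```

Under these assumptions the reordered sequence hides most of the load latency behind the second load, but one stall cycle still remains before the first addition.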

In many cases, however, it is not sufficient to bring an instruction forward by one or two instructions, since the speed gap between processor and memory is too great. The processing of the data still has to wait until the data has been loaded. The load should therefore be brought forward far enough that no waiting is necessary.

If a load instruction is moved ahead of a branch, as in the example further below, one speaks of control speculation. If an error occurs during such a load, it should not be handled yet, since it is not yet known whether the data will be needed at all. The load is speculative. Before the data is processed, however, it must be checked whether an error occurred during loading, and if so the error must be corrected.

The second type of speculation is data speculation. The large number of working registers allows many data elements to be held in registers. To avoid waiting for data, the required data is loaded into registers early. The following instruction sequence shows an example of a normal load. The instruction add gr5 = gr2,gr7 can only be carried out after the data has been loaded from memory into gr2.

add gr3 = 4,gr0
st  [gr32] = gr3
ld  gr2 = [gr33]
add gr5 = gr2,gr7

If the load is moved ahead of the store, the store may afterwards overwrite the memory location that was loaded. The processor therefore records the addresses of all advanced loads in the Advanced Load Address Table (ALAT). In the following example the load instruction contains a control speculation, because there is a branch between loading and processing, and a data speculation, because the store instruction with the pointer [gr32] could affect the loaded value. It is a speculative advanced load (ld.sa). If an error occurs during ld.sa, no ALAT entry is created. If the load completes without errors, an entry is made in the ALAT. If the value at the address [gr33] is changed, the ALAT entry is deleted.

             ld.sa gr2 = [gr33]      ; speculative advanced load
             ...
             add gr5 = gr2,gr7       ; use data
             ...
             ...
             add gr3 = 4,gr0
             st [gr32] = gr3
             ...
             cmp.eq p3,p4 = gr7,gr8
         p3  chk.a gr2,recover       ; checks the ALAT
back:    p3  add gr9 = 1, gr6
             ...
             ...
recover:     ld gr2 = [gr33]
             add gr5 = gr2,gr7
             br back

The value is not only read ahead of time (ld.sa gr2 = [gr33]) but also processed (add gr5 = gr2,gr7). If the data element used is changed by a later operation (e.g. by st [gr32] = gr3), the check instruction (chk.a gr2,recover) detects this, since the entry in the ALAT is missing, and branches to the recovery code, which reloads the value and repeats the addition.
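
The interplay of advanced load, store and check can be traced with a small Python model. It is a sketch under strong simplifications, not a description of the real hardware: the Machine class, its method names and the addresses 0x100 and 0x200 are hypothetical, the ALAT is modelled as an unbounded dictionary keyed by register name, and faults during the load (the control speculation part) are not modelled at all.

```python
class Machine:
    """Toy model of data speculation with an ALAT."""

    def __init__(self, memory):
        self.mem = dict(memory)   # address -> value
        self.regs = {}            # register name -> value
        self.alat = {}            # target register -> address of advanced load

    def ld_a(self, reg, addr):
        """Advanced load: load early and record the address in the ALAT."""
        self.regs[reg] = self.mem[addr]
        self.alat[reg] = addr

    def st(self, addr, value):
        """Store: delete every ALAT entry that refers to the same address."""
        self.mem[addr] = value
        self.alat = {r: a for r, a in self.alat.items() if a != addr}

    def chk_a(self, reg):
        """Check: True if the ALAT entry survived (loaded value still valid)."""
        return reg in self.alat


def run(store_addr):
    """Mirror the article's sequence; store_addr plays the role of [gr32]."""
    m = Machine({0x100: 7, 0x200: 0})
    m.regs["gr7"] = 1
    m.ld_a("gr2", 0x100)                             # advanced load from [gr33]
    m.regs["gr5"] = m.regs["gr2"] + m.regs["gr7"]    # early use of the data
    m.st(store_addr, 4)                              # st [gr32] = gr3
    if not m.chk_a("gr2"):                           # chk.a gr2,recover
        m.regs["gr2"] = m.mem[0x100]                 # recover: reload ...
        m.regs["gr5"] = m.regs["gr2"] + m.regs["gr7"]  # ... and recompute
    return m.regs["gr5"]


print(run(0x200))  # store to a different address: speculation succeeds
print(run(0x100))  # store to the loaded address: recovery code runs
```

When the store goes to a different address, the early result is kept. When it hits the loaded address, the missing ALAT entry triggers the recovery path and the addition is repeated with the new value.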

