Vector processor

Processor board of a CRAY-YMP vector computer

Vector processors (also called vector computers or array processors) perform a calculation on a large number of data items (in a vector or array) at the same time. When a lot of data of the same type has to be processed in the same way (for example in matrix operations), vector processors are far superior to pure general-purpose processors (e.g. x86), which process the data items one after the other. This holds at least when the vector computer also has parallel access to main memory.

Functionality and fields of application

Vector processors are mainly used in high-performance computing (HPC). The Cray supercomputers used vector processors. Other vendors of vector computers included NEC, the Convex Computer Corporation (for example with the C38xx series, which used gallium arsenide technology), and Fujitsu Siemens Computers with its VPP series.

Significant performance gains can also result from a "vectorized" formulation of an algorithm: compare, for example, the performance of vectorized numerical operations with that of a sequential ("for-loop") formulation in Matlab. As with vector processors, the gain comes from reduced overhead (loop and call delays), and a formulation that is not explicitly sequential also makes parallel processing possible.

In HPC applications in particular, there are often large amounts of similar data that must be processed in a similar way, for example in simulations in meteorology and geology, where vector computers are often used.

In recent years, vector computers have faced strong competition from parallel computing clusters made up of many thousands of standard processors. Using standard components that are also widespread outside the HPC sector saves costs, especially since such standard processors have become very powerful through intensive technological development. Distributed computing is cheaper still.

Because of the advantages of executing one arithmetic operation on several data items simultaneously (single instruction, multiple data, SIMD), standard processors have also been extended with such instructions since the 1990s in order to accelerate this type of calculation. See the architecture of the x86 processor, or AltiVec for PowerPC processors.

Besides the applications for vector processors mentioned above, graphics simulation is another main application. Complex 3D games in particular require an enormous number of calculations (matrix operations on 3D coordinates, antialiasing of the screen output) on large amounts of data, which is why today's graphics processors closely resemble pure vector processors.
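
As a rough sketch of the kind of workload meant here, the following C function applies one 4x4 transformation matrix to an array of 3D points in homogeneous coordinates. The names `Vec4` and `transform_points` and the row-major matrix layout are assumptions made for illustration; they are not taken from any particular graphics API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical point type in homogeneous coordinates (illustration only). */
typedef struct { float x, y, z, w; } Vec4;

/* Applies the same 4x4 matrix m (assumed row-major, 16 floats) to every
   point in `in` and writes the results to `out`. Each iteration is
   independent of the others, which is what makes the loop a good fit
   for vector or SIMD hardware. */
void transform_points(const float *m, const Vec4 *in, Vec4 *out, size_t n)
{
    assert(m != NULL);
    assert(n == 0 || (in != NULL && out != NULL));

    for (size_t i = 0; i < n; i++) {
        Vec4 p = in[i];
        out[i].x = m[0]  * p.x + m[1]  * p.y + m[2]  * p.z + m[3]  * p.w;
        out[i].y = m[4]  * p.x + m[5]  * p.y + m[6]  * p.z + m[7]  * p.w;
        out[i].z = m[8]  * p.x + m[9]  * p.y + m[10] * p.z + m[11] * p.w;
        out[i].w = m[12] * p.x + m[13] * p.y + m[14] * p.z + m[15] * p.w;
    }
}
```

Every point goes through exactly the same sequence of multiplications and additions, so the work per point is identical and only the data differs.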

Vector processor at work

MIPS architecture example

A simple example is used to show the difference between a scalar and a vector processor.

$Y = a \cdot X + Y$

X and Y are two vectors of equal length and a is a scalar quantity. This problem is solved on scalar processors by a loop. The same loop is also used in the LINPACK benchmark to determine the performance of the tested computers. In C syntax it looks like this:

 for (i = 0; i < 64; i++)
     Y[i] = a * X[i] + Y[i];

It is assumed here that the vectors consist of 64 elements.

In MIPS code, this program fragment looks like this:

        L.D     F0, a          ; load scalar a
        DADDIU  R4, Rx, #512   ; last address (512/8 = 64 elements)
 Loop:  L.D     F2, 0(Rx)      ; load X(i)
        MUL.D   F2, F2, F0     ; a * X(i)
        L.D     F4, 0(Ry)      ; load Y(i)
        ADD.D   F4, F4, F2     ; a * X(i) + Y(i)
        S.D     0(Ry), F4      ; store Y(i)
        DADDIU  Rx, Rx, #8     ; increment index (i) of X
        DADDIU  Ry, Ry, #8     ; increment index (i) of Y
        DSUBU   R20, R4, Rx    ; compute bound
        BNEZ    R20, Loop      ; if 0, done

In VMIPS code, however, it looks like this:

 L.D      F0, a       ; load scalar a
 LV       V1, Rx      ; load vector X
 MULVS.D  V2, V1, F0  ; vector-scalar multiplication
 LV       V3, Ry      ; load vector Y
 ADDV.D   V4, V2, V3  ; vector addition
 SV       Ry, V4      ; store result

This example shows how efficiently the vector processor solves the task. With VMIPS, six instructions suffice, whereas MIPS executes 64 * 9 + 2 = 578 instructions. Above all, the loop disappears. With VMIPS, only a fraction of the instructions have to be fetched from memory and decoded.
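
The arithmetic behind these figures can be written down as a small C sketch. The function names are hypothetical. The counts are read off the two listings above: the scalar version runs 2 setup instructions plus 9 per element, and the vector version runs 6 instructions as long as the whole vector fits into one vector register (assumed here to hold 64 elements, as in the example).

```c
#include <assert.h>

/* Instructions executed by the scalar MIPS listing for n elements:
   2 setup instructions (L.D, DADDIU) plus 9 instructions per iteration.
   The loop body always runs at least once, hence n > 0. */
unsigned scalar_instruction_count(unsigned n)
{
    assert(n > 0);
    return 2u + 9u * n;
}

/* Instructions executed by the VMIPS listing, assuming the vector fits
   into a single vector register of 64 elements (an assumption matching
   the example; longer vectors would need additional handling). */
unsigned vector_instruction_count(unsigned n)
{
    assert(n > 0 && n <= 64);
    return 6u;
}
```

For n = 64 this gives 578 instructions against 6. These are counts of fetched and decoded instructions, not execution times: each vector instruction still performs 64 element operations.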

In the MIPS code, multiplications and additions alternate, so each addition must wait for the slower multiplication it depends on. In the vector processor, by contrast, all the independent multiplications are carried out first, followed by all the dependent additions. This is another significant difference.
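
The difference in ordering can be imitated in plain C by splitting the single loop into two passes. This is only a sketch of the order of operations, not of how vector hardware is built. The name `axpy_vector_order`, the constant `VLEN` and the temporary array `v2` (standing in for vector register V2) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { VLEN = 64 };  /* assumed vector length, as in the example above */

/* Computes y = a * x + y in the order a vector processor would use:
   first every multiplication, then every addition. */
void axpy_vector_order(double a, const double x[VLEN], double y[VLEN])
{
    double v2[VLEN];  /* plays the role of vector register V2 */

    assert(x != NULL && y != NULL);

    /* Corresponds to MULVS.D: the multiplications are independent of
       one another, so none of them has to wait for an addition. */
    for (size_t i = 0; i < VLEN; i++)
        v2[i] = a * x[i];

    /* Corresponds to ADDV.D: each addition depends only on a product
       that has already been computed. */
    for (size_t i = 0; i < VLEN; i++)
        y[i] = v2[i] + y[i];
}
```

The result is the same as that of the single scalar loop; only the order in which the operations are issued changes.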

x86 architecture example, embedded in a high-level language

A more recent example comes from the x86 processor architecture and uses the SSE instruction set extension. It shows the vectorized multiplication of two single-precision floating-point arrays. The source code is written in the high-level language C, with substantial inline assembler parts (Intel syntax), and can be compiled directly with GCC.

// SSE function for the vectorized multiplication of two arrays of single-precision floats.
// First parameter: pointer to the destination/source array; second parameter: second source array;
// third parameter: number of floats in each array.
// 32-bit version (x86 only; with GCC on a 64-bit host this presumably needs -m32 -msse)
void mul_asm(float* out, float* in, unsigned int leng)
{
     unsigned int count, rest;
     float *out_rest, *in_rest;

     rest  = (leng*4)%16;   // bytes left over after the last full group of 4 floats
     count = (leng*4)-rest; // bytes handled by the vectorized loop

     // start of the scalar remainder, computed before the loops count their offsets down
     out_rest = out + count/4;
     in_rest  = in  + count/4;

     if (count>0){
     // vectorized part; 4 floats per loop iteration, from the end of the block down to offset 0
     // (comments inside the assembler text use '#', because GNU as treats ';' as a statement separator on x86)
     __asm__ __volatile__  (".intel_syntax noprefix\n\t"
     "vec_loop%=:              \n\t"
     "movups xmm0,[ebx+ecx-16] # load 4 floats into the first register (xmm0)\n\t"
     "movups xmm1,[eax+ecx-16] # load 4 floats into the second register (xmm1)\n\t"
     "mulps xmm0,xmm1          # multiply the two vector registers element by element\n\t"
     "movups [eax+ecx-16],xmm0 # write the result back to memory\n\t"
     "sub ecx,16               # move the byte offset down by 4 floats\n\t"
     "jnz vec_loop%=           \n\t"
     ".att_syntax prefix       \n\t"
       : "+c"(count) : "a" (out), "b" (in) : "xmm0","xmm1","cc","memory");
     }

     // scalar part; 1 float per loop iteration
     if (rest!=0)
     {
      __asm__ __volatile__  (".intel_syntax noprefix\n\t"
     "rest_loop%=:            \n\t"
     "movss xmm0,[ebx+edx-4]  # load 1 float into the first register (xmm0)\n\t"
     "movss xmm1,[eax+edx-4]  # load 1 float into the second register (xmm1)\n\t"
     "mulss xmm0,xmm1         # multiply the two scalar values\n\t"
     "movss [eax+edx-4],xmm0  # write the result back\n\t"
     "sub edx,4               # move the byte offset down by 1 float\n\t"
     "jnz rest_loop%=         \n\t"
     ".att_syntax prefix      \n\t"
       : "+d"(rest) : "a" (out_rest), "b" (in_rest) : "xmm0","xmm1","cc","memory");
     }

     return;
}

Programming vector processors with high-level programming languages

The example above is written directly in assembly language, which is no longer common these days but is certainly possible (SIMD intrinsics or inline assembler code). Architectures with special machine instructions for vectors require support from one of the following:

parallelizing compilers (i.e. those that can convert a whole loop in the source code into a SIMD instruction)
a language extension for generating the array functions
or, at the least, special library functions

At least in the last two cases, the software developer must know the architecture and explicitly use the special functions in order to benefit from vector processing.
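
As a sketch of the intrinsics route, the element-wise multiplication from the inline assembler example might be written as follows. The function name `mul_intrin` is hypothetical. `_mm_loadu_ps`, `_mm_mul_ps` and `_mm_storeu_ps` are SSE intrinsics declared in `<xmmintrin.h>`. The `__SSE__` check is an assumption about the compiler (GCC and Clang define this macro when SSE is enabled); without it, only the scalar loop is compiled.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics */
#endif

/* out[i] = out[i] * in[i] for i in [0, len). */
void mul_intrin(float *out, const float *in, size_t len)
{
    size_t i = 0;

    assert(len == 0 || (out != NULL && in != NULL));

#if defined(__SSE__)
    /* Vectorized part: 4 floats per iteration, unaligned loads and stores. */
    for (; i + 4 <= len; i += 4) {
        __m128 a = _mm_loadu_ps(out + i);
        __m128 b = _mm_loadu_ps(in + i);
        _mm_storeu_ps(out + i, _mm_mul_ps(a, b));
    }
#endif

    /* Scalar part: the remaining 0 to 3 floats (or everything, without SSE). */
    for (; i < len; i++)
        out[i] *= in[i];
}
```

Compared with inline assembler, the compiler takes care of register allocation and addressing here, but the developer still has to know that the target offers 4-wide single-precision SIMD operations.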


