Cell (processor)

from Wikipedia, the free encyclopedia
The cell processor

Cell (or Cell Broadband Engine) is a processor series developed by IBM together with Sony and Toshiba. The processors are characterized by a 64-bit PowerPC core, a pipelined architecture, support for simultaneous multithreading, and a heterogeneous multi-core architecture, which makes them well suited to parallel computation.

Architecture

Cell processor diagram

The basic design of the Cell processors provides eight Synergistic Processing Elements (SPEs) and one Power Processor Element (PPE). The individual processor cores are connected by an Element Interconnect Bus (EIB), which can transfer up to 96 bytes per CPU cycle. The PPE and each SPE can access the EIB at 8 bytes per CPU cycle. The EIB is implemented as a ring bus (4 × 128 bits) and is clocked at half the CPU clock rate. Main memory is accessed via a Memory Interface Controller (MIC).
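
As a rough illustration of what these per-cycle figures mean, the following sketch converts them into bandwidth. It assumes the 3.2 GHz clock rate quoted elsewhere in this article and decimal gigabytes (10^9 bytes); the function and variable names are illustrative only.

```python
# Rough conversion of the EIB per-cycle figures into bandwidth.
# Assumptions: 3.2 GHz CPU clock, 1 GB = 10**9 bytes.
CPU_CLOCK_HZ = 3_200_000_000


def bandwidth_gb_per_s(bytes_per_cycle, clock_hz=CPU_CLOCK_HZ):
    """Return the bandwidth in GB/s for a given number of bytes per CPU cycle."""
    return bytes_per_cycle * clock_hz / 1e9


eib_peak = bandwidth_gb_per_s(96)     # whole bus, 96 bytes per CPU cycle
element_port = bandwidth_gb_per_s(8)  # one PPE or SPE connection, 8 bytes per CPU cycle

print(f"EIB peak:    {eib_peak:.1f} GB/s")
print(f"Per element: {element_port:.1f} GB/s")
```

These are theoretical upper bounds; the bandwidth actually sustained depends on the traffic pattern on the ring.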

Synergistic Processing Element (SPE)

Each SPE consists of an arithmetic logic unit (ALU) with four-way SIMD, referred to as a Synergistic Processing Unit (SPU or SPX). It has 128 registers, each 128 bits wide. The SPE also includes a Memory Flow Controller (MFC), which controls DMA transfers to main memory or to other SPEs, and its own local memory of 256 KB.

Local memory and memory management

The local memory (also called the Local Store, LS for short) consists of four separate 64 KB memory blocks with a latency of six clock cycles. An SPU can directly access only its local memory. A Memory Flow Controller (MFC), which operates completely independently, handles access to and communication with main memory, the PPE, and other SPUs. As a result, the memory of the individual SPEs can in principle be freely partitioned or protected with specific access rights. The MFC can handle up to 16 memory operations simultaneously.
Because caches are abandoned entirely in favor of a directly addressable, SRAM-based local memory, memory latencies can be controlled and kept low compared with a cache-based in-order architecture. This gives compilers and hand-written code a high degree of control over program flow, so that out-of-order execution and sophisticated branch prediction, which would have needlessly increased the complexity of the processor, are not required for high performance.

Block diagram of the SPE

Synergistic Processing Unit (SPU)

An SPU has two pipelines (even and odd), each 23 stages long. The even pipeline houses the floating-point and fixed-point units; all other functional units are on the odd pipeline. An SPU can issue two instructions per cycle (dual issue), one per pipeline. This corresponds to a maximum of eight single-precision floating-point operations per cycle. At a clock rate of 3.2 GHz, this gives a theoretical performance of 25.6 GFLOPS per SPU.
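
The peak figure follows from simple arithmetic. The sketch below assumes that the eight operations per cycle come from four 32-bit lanes in one 128-bit register, each performing a multiply and an add that are counted as two operations; this breakdown is an interpretation and is not stated in the text above.

```python
REGISTER_BITS = 128          # width of one SPU register
SINGLE_PRECISION_BITS = 32   # width of one single-precision value
OPS_PER_LANE = 2             # assumption: multiply and add counted separately
CLOCK_GHZ = 3.2

lanes = REGISTER_BITS // SINGLE_PRECISION_BITS   # values processed per instruction
flops_per_cycle = lanes * OPS_PER_LANE           # operations per cycle
peak_gflops = flops_per_cycle * CLOCK_GHZ        # theoretical peak per SPU

print(f"{lanes} lanes, {flops_per_cycle} FLOPs per cycle, {peak_gflops:.1f} GFLOPS per SPU")
```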
Only static branch prediction is implemented. The quality of the compiler's output is therefore important, since pipeline hazards incur a stall of 18 cycles. The large number of registers also helps to hide latencies through loop unrolling or by running several instances of an algorithm in parallel.
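
Loop unrolling can be illustrated in a few lines. The sketch below is written in Python purely for readability and is not SPU code; the function names are made up for this example. It computes a dot product once with a single running total and once unrolled by four with independent accumulators, so that consecutive operations do not depend on each other's results. On an SPU, a compiler could keep such accumulators in separate registers.

```python
def dot_simple(xs, ys):
    """Dot product with one accumulator: every step depends on the previous one."""
    total = 0.0
    for x, y in zip(xs, ys):
        total += x * y
    return total


def dot_unrolled(xs, ys):
    """Dot product unrolled by four with independent accumulators.

    Assumes xs and ys have the same length.
    """
    acc0 = acc1 = acc2 = acc3 = 0.0
    n = len(xs) - len(xs) % 4          # largest multiple of four
    for i in range(0, n, 4):
        acc0 += xs[i] * ys[i]
        acc1 += xs[i + 1] * ys[i + 1]
        acc2 += xs[i + 2] * ys[i + 2]
        acc3 += xs[i + 3] * ys[i + 3]
    total = (acc0 + acc1) + (acc2 + acc3)
    for i in range(n, len(xs)):        # leftover elements
        total += xs[i] * ys[i]
    return total


xs = [float(i) for i in range(1, 11)]
ys = [2.0] * 10
print(dot_simple(xs, ys), dot_unrolled(xs, ys))
```

With general floating-point data the two versions can differ in the last bits, because the additions are performed in a different order.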
The SPUs are not coprocessors. They can work independently of one another and can also run PPE program code, provided that it has been recompiled and extended with DMA calls. Although SPUs are designed for more specialized applications, they are processors with a general-purpose instruction set.

Block diagram of the PPE

Power Processor Element (PPE)

The control processor (PPE) is based on IBM's 64-bit PowerPC architecture but, unlike conventional PowerPC processors, its pipeline works in order, i.e. it executes instructions strictly in program sequence. However, the PPE has delayed-execution pipelines, which allow out-of-order execution at least for load instructions. Since it can process two threads at the same time, the usual in-order disadvantages caused by stalled pipelines are less pronounced in suitably written programs. The PPE has 512 KB of L2 cache. The CPU has a total of 2.5 MB of internal memory.
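
The 2.5 MB total appears to be the sum of the eight SPE local memories and the PPE's L2 cache. The article does not state this breakdown, so the following quick check should be read as an assumption about how the figure is composed.

```python
SPE_COUNT = 8
LOCAL_MEMORY_KB = 256   # local memory per SPE
PPE_L2_KB = 512         # L2 cache of the PPE

# Assumed breakdown: all SPE local memories plus the PPE's L2 cache.
total_kb = SPE_COUNT * LOCAL_MEMORY_KB + PPE_L2_KB
total_mb = total_kb / 1024

print(f"{total_kb} KB = {total_mb} MB")
```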

History

The Cell processor is a joint development by Sony, Toshiba and IBM. Development began in March 2001 at a development center in Austin with the participation of engineers from all three companies. In total, more than 400 specialists at ten locations worldwide were involved in the development of the Cell. The Synergistic Processing Units were largely designed at the IBM site in Boeblingen, Swabia.

Development is estimated to have cost over 400 million US dollars in total; billions more were invested in the construction of fabrication plants, including the IBM production site in East Fishkill, New York.

The first Cell processor was fabricated in a 90 nm SOI process and had a die area of about 235 mm². Reports from before April 2005, which relate to an earlier prototype (DD1), give a slightly smaller die area of 221 mm². The final version (DD2) has an improved PPE with higher SIMD performance, which takes up more space. From March 2007, IBM manufactured the processor in a 65 nm process, which reduced the die area and thus production costs. With the introduction of the PlayStation 3 Slim in August 2009, a further shrink to 45 nm followed, with a die area of just 115 mm².

In 2007, an improved variant of the Cell processor, the PowerXCell 8i, was launched. It was manufactured in 65 nm from the start and, compared to its predecessor, natively supports double-precision floating-point calculations, i.e. without auxiliary routines and therefore much faster.

Applications

The Cell processor was developed with particular attention to broadband computing applications, especially graphics processing and video encoding and decoding. The design was first used commercially in September 2006, in IBM blade servers with eight SPEs. The processor is best known for its use in Sony's PlayStation 3 game console, where it runs at 3.2 GHz but with only seven SPEs. This allows Cell chips with only seven functioning SPEs to be used, which can reduce costs. Even with just seven SPEs, the processor achieves a theoretical peak performance of over 200 GFlops in single precision, which is superior to the processors of the competing seventh-generation consoles (Xbox 360 and Wii).

The processor is also used in televisions with extended video functions; Cell derivatives with only four SPEs and additional hardware for video encoding and decoding are also used in special Toshiba notebooks and in expansion cards for PCs. The successor processor, the PowerXCell 8i, has been used in servers since March 2007.

Further information

Peter Hofstee, one of the leading architects of the processor

In the LINPACK performance comparison with other processors, the Cell BE performs as follows:

LINPACK (DP)       Clock frequency   Theoretical performance   Average performance   Efficiency   Matrix
Cell BE 1)         3.2 GHz           100.00 GFlops             2)                    2)           4k × 4k
SPU 3)             3.2 GHz           1.83 GFlops               1.45 GFlops           79.23%       1k × 1k
8 SPUs 3)          3.2 GHz           14.63 GFlops              9.46 GFlops           64.66%       1k × 1k
Pentium 4          3.2 GHz           6.40 GFlops               3.10 GFlops           48.44%       1k × 1k
Pentium 4 + SSE3   3.6 GHz           14.40 GFlops              7.20 GFlops           50.00%       1k × 1k
Itanium            1.6 GHz           6.40 GFlops               5.95 GFlops           92.97%       1k × 1k

1) Implementation by Jack Dongarra's group
2) unknown
3) Implementation by IBM
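
The efficiency column is the ratio of average to theoretical performance. The sketch below recomputes it from the values in the table for the rows where both figures are given; the names are illustrative.

```python
# (theoretical GFlops, average GFlops) as listed in the table above
linpack_rows = {
    "SPU": (1.83, 1.45),
    "8 SPUs": (14.63, 9.46),
    "Pentium 4": (6.40, 3.10),
    "Pentium 4 + SSE3": (14.40, 7.20),
    "Itanium": (6.40, 5.95),
}


def efficiency_percent(theoretical, average):
    """Average performance as a percentage of theoretical performance."""
    return round(100 * average / theoretical, 2)


efficiencies = {
    name: efficiency_percent(theoretical, average)
    for name, (theoretical, average) in linpack_rows.items()
}

for name, value in efficiencies.items():
    print(f"{name}: {value:.2f}%")
```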

The values refer to double-precision floating-point numbers (64 bits), for which the SPUs of the Cell processor are not designed. With the help of the VMX unit in the PPE, which is optimized for double precision, the Cell processor reaches up to 21.03 GFlops with IBM's implementation. A working group led by Jack Dongarra optimized the code by using an iterative method. With this approach, LINPACK achieves a performance equivalent to 100 GFlops in double precision on a 4K × 4K matrix. Here too, the PPE does not contribute to the actual calculation but serves as the control unit for the SPUs.
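
The article does not describe the iterative method in detail. One common technique that fits the description is mixed-precision iterative refinement: the system is solved quickly in single precision, the residual is computed in double precision, and the solution is corrected repeatedly. The sketch below is a minimal pure-Python illustration under that assumption, not the LINPACK implementation. The helper names `to_single`, `solve_single`, `residual` and `refine` are hypothetical, and single precision is emulated by rounding through `struct`.

```python
import struct


def to_single(value):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def solve_single(a, b):
    """Solve a x = b by Gaussian elimination, rounding every step to single precision."""
    n = len(b)
    # Augmented matrix [a | b] in emulated single precision.
    m = [[to_single(v) for v in row] + [to_single(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = to_single(m[r][col] / m[col][col])
            for c in range(col, n + 1):
                m[r][c] = to_single(m[r][c] - to_single(factor * m[col][c]))
    # Back substitution.
    x = [0.0] * n
    for r in reversed(range(n)):
        s = m[r][n]
        for c in range(r + 1, n):
            s = to_single(s - to_single(m[r][c] * x[c]))
        x[r] = to_single(s / m[r][r])
    return x


def residual(a, b, x):
    """Compute b - a x in double precision."""
    return [bi - sum(aij * xj for aij, xj in zip(row, x)) for row, bi in zip(a, b)]


def refine(a, b, steps=5):
    """Mixed-precision refinement: single-precision solves, double-precision residuals."""
    x = solve_single(a, b)
    for _ in range(steps):
        correction = solve_single(a, residual(a, b, x))
        x = [xi + di for xi, di in zip(x, correction)]
    return x


a = [[4.0, 1.0, 2.0], [1.0, 5.0, 1.0], [2.0, 1.0, 6.0]]
b = [1.0, 2.0, 3.0]

x_single = solve_single(a, b)
x_refined = refine(a, b)
err_single = max(abs(r) for r in residual(a, b, x_single))
err_refined = max(abs(r) for r in residual(a, b, x_refined))
print(f"residual, single precision only: {err_single:.2e}")
print(f"residual, after refinement:      {err_refined:.2e}")
```

The point of the technique is that the expensive solves run at single-precision speed, while the cheap residual computation restores double-precision accuracy for well-conditioned systems.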

LINPACK calculations with single-precision floating-point numbers (32 bits) achieve over 73 GFlops on a Cell processor with eight SPUs. Computing efficiency rises with matrix size, so that eight SPUs achieve around 156 GFlops under LINPACK on a 4K × 4K matrix.

It is also interesting to compare the Cell processor with other multiprocessors:

Multi-array processors

Manufacturer   Processor          Cores   SIMD units   Clock rate (GHz)   FMUL + FADD (FLOPs per cycle)   Peak performance (GFLOPS)   BLAS / SGEMM (GFLOPS)   Power dissipation (W)   Form
IBM            Cell BE 1)         8       4            3.200              2                               204.8                       201                     80                      processor
Nvidia         8800Ultra (G80)    128     1            1.512              2                               387.1                       2)                      > 170                   card
Nvidia         8800GTX (G80)      128     1            1.350              2                               345.6                       105 3)                  120–170                 card
Nvidia         GT200b             240     1            1.476              n/a                             1062.7                      2)                      180–240                 card
ATI            HD2900 XT (R600)   320     5            0.742              2                               474.9                       2)                      150–200                 card
ATI            1900XTX (R580)     48      4            0.650              2                               249.6                       120                     130–170 4)              card
ATI            RV770              800     5            0.750              n/a                             1200                        2)                      80–160                  card
ClearSpeed     CSX700             192     1            0.250              2                               96                          80                      10                      processor
ClearSpeed     e710               192     1            0.250              2                               96                          80                      25                      card

1) without considering the PPE
2) unknown
3) under DirectX 9
4) under CTM
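
For several rows, the peak figure can be reproduced as cores × SIMD units × clock rate × operations per cycle, reading the FMUL + FADD column as floating-point operations per lane per cycle. This reading is an assumption. It does not hold for every row (for the ATI R600, for example, the core count already appears to include the SIMD lanes), so the sketch below only checks rows where the product matches the listed value.

```python
# (cores, SIMD units, clock in GHz, FMUL + FADD operations per cycle, listed peak in GFLOPS)
peak_rows = {
    "IBM Cell BE": (8, 4, 3.200, 2, 204.8),
    "Nvidia 8800Ultra (G80)": (128, 1, 1.512, 2, 387.1),
    "Nvidia 8800GTX (G80)": (128, 1, 1.350, 2, 345.6),
    "ATI 1900XTX (R580)": (48, 4, 0.650, 2, 249.6),
    "ClearSpeed CSX700": (192, 1, 0.250, 2, 96.0),
}


def peak_gflops(cores, simd_units, clock_ghz, ops_per_cycle):
    """Theoretical peak in GFLOPS under the assumed reading of the columns."""
    return cores * simd_units * clock_ghz * ops_per_cycle


for name, (cores, simd, clock, ops, listed) in peak_rows.items():
    computed = peak_gflops(cores, simd, clock, ops)
    print(f"{name}: computed {computed:.1f} GFLOPS, listed {listed} GFLOPS")
```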

