Nvidia Tesla

from Wikipedia, the free encyclopedia

Tesla is a processor from Nvidia with a heavily parallelized design, also called a stream processor. The processor is based on GPU technology and can be programmed through Nvidia's own CUDA API and through OpenCL. The product competed directly with FireStream and FirePro from AMD.

After the first cards based on the G80 GPU were presented in mid-2007, Tesla cards with the GT200 graphics chip followed a year later. The GT200 is also used on desktop graphics cards of the Geforce 200 series.

Nvidia presented the next generation of its graphics processor, code-named “Fermi”, on September 30, 2009 at its own “GPU Technology Conference”. Fermi is used in Tesla and Quadro products and, in a modified form (for example with reduced double-precision performance), in the Geforce 400 series. At the Supercomputing 09 exhibition, Nvidia announced Tesla cards based on the Fermi graphics processor for the second and third quarters of 2010.

More recent drivers have in some cases considerably expanded the capabilities available through OpenGL, CUDA and OpenCL.

Technology

Tesla

G80

The G80 graphics processor was the first Nvidia processor based on the newly developed unified shader architecture. The G80 had been used on the Geforce 8800 GTX and GTS graphics cards since the end of 2006, and Nvidia presented the first Tesla models in mid-2007. The Tesla cards primarily use the G80 in A3 stepping, as installed on the Geforce 8800 Ultra.

GT200

The GT200 processor was the second chip that Nvidia used in the Tesla series. In contrast to the G80, Nvidia planned from the start to use the GT200 in Tesla models (hence the T in the identifier) and implemented its double-precision capabilities through 30 additional MADD units conforming to the IEEE-754R specification, which would not have been necessary for the Geforce graphics cards.

Fermi

The Fermi core is manufactured in a 40 nm process and has around three billion transistors. In contrast to its predecessor, the GT200, it is largely a new development based on the unified shader architecture of the G80 graphics processor. Fermi is divided into 16 shader clusters, each with 32 stream processors, for a total of 512 stream processors. Each cluster also has 16 load/store units and four separate special function units for calculating functions such as sine and cosine. The Fermi core has six 64-bit memory controllers for GDDR5 memory, resulting in a 384-bit memory interface. This allows memory configurations of 1.5 GB, 3 GB and 6 GB. The memory controller can also handle ECC memory, which provides its own error correction.
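
The unit counts above follow from simple multiplication. The short Python sketch below reproduces them. The per-controller capacities are an assumption made here to illustrate how the 1.5 GB, 3 GB and 6 GB configurations could arise from six equally populated controllers, and all names are illustrative.

```python
# Figures for the full Fermi chip as quoted in the text.
SHADER_CLUSTERS = 16
STREAM_PROCESSORS_PER_CLUSTER = 32
MEMORY_CONTROLLERS = 6
BITS_PER_CONTROLLER = 64

stream_processors = SHADER_CLUSTERS * STREAM_PROCESSORS_PER_CLUSTER
bus_width_bits = MEMORY_CONTROLLERS * BITS_PER_CONTROLLER

# Assumption: every controller carries the same amount of memory (in MB).
capacities_gb = [MEMORY_CONTROLLERS * mb / 1024 for mb in (256, 512, 1024)]

print(stream_processors)  # 512
print(bus_width_bits)     # 384
print(capacities_gb)      # [1.5, 3.0, 6.0]
```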

Nvidia attached increasing importance to GPU computing, which is why many architectural changes were made to the Fermi core to improve performance in this area. Fermi is the first graphics processor with full support for C++ and is fully compatible with the IEEE-754-2008 standard (previously IEEE-754-1985). The latter became necessary in order to use FMA (fused multiply-add), which is more precise than MAD, and thereby improve the double-precision capabilities. Each Fermi shader cluster can perform 16 double-precision operations per clock cycle. Fermi can therefore perform a total of 256 double-precision calculations per cycle, whereas only 30 were possible on the GT200. In addition to shared memory, the Fermi graphics processor also has an L1 and an L2 cache to improve GPU computing capabilities.
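
Why FMA is more precise than MAD can be illustrated on a CPU. The following Python sketch is not GPU code. It emulates a fused multiply-add by computing a·b + c exactly with rational arithmetic and rounding once, and compares that with the two roundings of a separate multiply and add. The function names and the input values are chosen here purely for illustration.

```python
from fractions import Fraction

def mad(a: float, b: float, c: float) -> float:
    """Multiply, round, then add and round again (two roundings)."""
    product = a * b
    return product + c

def fma_emulated(a: float, b: float, c: float) -> float:
    """Compute a*b + c exactly, then round once to double precision."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)

# a * b equals 1 - 2**-60 exactly, which rounds to 1.0 in double precision.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

print(mad(a, b, c))           # 0.0, the small term is lost
print(fma_emulated(a, b, c))  # -2**-60, the small term survives

# Per-clock double-precision total quoted in the text: 16 clusters x 16 operations.
FERMI_DP_PER_CLOCK = 16 * 16
```

The single rounding keeps the low-order bits of the product that a separate multiply would discard before the addition.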

Kepler

GK104

On March 22, 2012 Nvidia presented the Geforce GTX 680, the first graphics card of the Geforce 600 series, which introduced the new Kepler architecture. The Geforce GTX 680 is based on the GK104 graphics processor, which consists of 3.54 billion transistors and has 1536 stream processors and 128 texture units, organized in eight shader clusters. The GK104 GPU is manufactured in the 28 nm process at TSMC and has a die area of 294 mm². The GK104 was originally planned as a graphics chip for the performance segment, which can be recognized, among other things, by its reduced double-precision performance. After Nvidia dropped the GK100 graphics processor in favor of the GK110, the GK104 also had to serve the high-end segment, because the GK110 was only to become available for the Kepler refresh generation.

GK110

With 7.1 billion transistors on around 561 mm² (533 mm² in the production-optimized B1 stepping), the GK110 GPU is the largest and most complex graphics processor of the Kepler generation. It consists of 2880 shader units and 240 texture units, which are distributed over 15 SMX blocks (shader clusters). These, in turn, are distributed over five graphics processing clusters, which gives the GK110 a ratio of 3:1 SMX blocks per cluster (in contrast to the other graphics processors of the Kepler generation, which use a ratio of 2:1). Another special feature of the GK110 is the additional 64 separate ALUs per SMX block, which handle double-precision (FP64) rather than single-precision (FP32) operations. The features "Dynamic Parallelism", "Hyper-Q" and "GPUDirect" are also intended for the professional sector and are only available on the GK110 GPU.
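
The ratios mentioned for the GK110 can be derived directly from the unit counts. The Python sketch below does so using only the figures quoted above. The variable names are illustrative.

```python
from fractions import Fraction

# GK110 figures as quoted in the text.
SHADERS = 2880
TEXTURE_UNITS = 240
SMX_BLOCKS = 15
GRAPHICS_PROCESSING_CLUSTERS = 5
FP64_ALUS_PER_SMX = 64

shaders_per_smx = SHADERS // SMX_BLOCKS
texture_units_per_smx = TEXTURE_UNITS // SMX_BLOCKS
smx_per_gpc = SMX_BLOCKS // GRAPHICS_PROCESSING_CLUSTERS

# Double-precision rate: FP64 ALUs relative to FP32 shader units per SMX.
dp_rate = Fraction(FP64_ALUS_PER_SMX, shaders_per_smx)

print(shaders_per_smx, texture_units_per_smx, smx_per_gpc, dp_rate)
# 192 16 3 1/3
```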

GK210

Because of the limited double-precision performance of the Maxwell architecture, an improved version of the GK110, the GK210 graphics processor, was designed for the Tesla series.

Maxwell

GM200

The GM200 graphics processor serves as the high-end chip of the Geforce 900 series, where it replaced the GK110 GPU of the Geforce 700 series. The GM200 has 8 billion transistors on a chip area of 601 mm², making it the largest and most complex graphics processor on the market up to that point. From a technical point of view, the GM200 with 96 ROPs, 3072 shader units and 192 texture units is a 50% larger variant of the GM204. It also differs significantly from its predecessors: the GF100, GF110 and GK110 GPUs still had extended double-precision (FP64) capabilities and were also used on the Quadro and Tesla professional series. On the GK110, for example, 64 separate ALUs were installed in each SMX block for this purpose, which resulted in a DP rate of 1/3. These separate ALUs are missing on the GM200, which therefore has a DP rate of only 1/32. They were probably omitted for reasons of space, since producing graphics processors larger than about 600 mm² is hardly feasible for technical and economic reasons. Double-precision operations are not required for 3D applications, so this aspect played no role in the gaming sector, but it made the GM200 unsuitable for the Quadro and Tesla professional series.

Nvidia therefore moved away from its previous strategy of developing one high-end/enthusiast chip for all three series and, within the Tesla series, used the GM200 only on the Tesla M40. Instead, an improved version of the Kepler GK110, the GK210 graphics processor, was designed for the Tesla K80.
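
The effect of the DP rate on throughput can be checked against the model data later in this article. The following Python sketch applies a rate of 1/3 to the single-precision figure of the GK110-based Tesla K40 and a rate of 1/32 to the figures of the GM200-based Tesla M40. The helper function name is hypothetical.

```python
from fractions import Fraction

def dp_gflops(sp_gflops: float, dp_rate: Fraction) -> float:
    """Theoretical double-precision throughput from the single-precision figure."""
    return float(Fraction(sp_gflops) * dp_rate)

# Tesla K40 (GK110): 4290 GFLOPS single precision, DP rate 1/3.
k40_dp = dp_gflops(4290, Fraction(1, 3))

# Tesla M40 (GM200): 5825 to 6844 GFLOPS single precision, DP rate 1/32.
m40_dp_base = dp_gflops(5825, Fraction(1, 32))
m40_dp_boost = dp_gflops(6844, Fraction(1, 32))

print(k40_dp)                                   # 1430.0
print(round(m40_dp_base), round(m40_dp_boost))  # 182 214
```

The results match the double-precision values of 1430 and 182-214 GFLOPS listed for these two cards.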

GM204

The GM204 graphics processor was the first GPU of the Geforce 900 series and uses the "second generation Maxwell architecture". As with the first Kepler generation, the Geforce 600 series, Nvidia brought the performance chip (GM204) to market before the high-end chip (GM200). After Nvidia, like AMD, decided against 20 nm production at TSMC, the GM204 was still produced in 28 nm, contrary to original plans. It has 5.2 billion transistors on a chip area of 398 mm². The basic structure is identical to that of the GM107 GPU of the first Maxwell generation: the shader clusters (SMM) still contain 128 shader units and 8 texture units, but the level 1 cache was increased from 64 kByte to 96 kByte and the texture cache from 24 kByte to 48 kByte per cluster. The GM204 consists of a total of 16 shader clusters, with four clusters attached to each raster engine, which gives the GM204 2048 stream processors, 128 texture units, 64 ROPs and a 2 MB level 2 cache. To compensate for the memory interface of only 256 bits, which is small compared to other GPUs of this class, Nvidia introduced the “Third Generation Delta Color Compression” feature, a bandwidth saver that is supposed to reduce the memory load by around 25%.
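
The totals for the GM204 follow from the per-cluster figures, as the Python sketch below shows. The last step is only one possible reading of the 25% saving: it assumes that 25% less memory traffic is equivalent to a correspondingly higher effective bandwidth, and it uses the 160 GB/s listed for the GM204-based Tesla M6 in the model data as the raw figure.

```python
# GM204 figures as quoted in the text.
SHADER_CLUSTERS = 16
SHADERS_PER_CLUSTER = 128
TEXTURE_UNITS_PER_CLUSTER = 8
CLUSTERS_PER_RASTER_ENGINE = 4

stream_processors = SHADER_CLUSTERS * SHADERS_PER_CLUSTER
texture_units = SHADER_CLUSTERS * TEXTURE_UNITS_PER_CLUSTER
raster_engines = SHADER_CLUSTERS // CLUSTERS_PER_RASTER_ENGINE

# Assumption: a 25% reduction in traffic acts like a higher effective bandwidth.
RAW_BANDWIDTH_GB_S = 160.0
TRAFFIC_SAVING = 0.25
effective_bandwidth_gb_s = RAW_BANDWIDTH_GB_S / (1.0 - TRAFFIC_SAVING)

print(stream_processors, texture_units, raster_engines)  # 2048 128 4
print(round(effective_bandwidth_gb_s, 1))                # 213.3
```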

Because double-precision performance on Maxwell is limited to 1/32 of single-precision performance, the Tesla K cards with Kepler architecture continued to be offered for their higher double-precision performance.

Pascal

GP100

The Pascal chip is called "GP100" and, thanks to its high computing power and efficiency, is particularly suitable for high-performance computing and deep learning. With the Tesla P100, Nvidia presented the first computing accelerator with a GP100 chip in the spring of 2016 at the GTC 2016. Pascal was intended to replace Kepler and Maxwell graphics chips in the professional sector in the medium term. The GP100 consists of 15 billion transistors and contains up to 3840 shader cores. Nvidia manufactures the GP100 GPU at TSMC using the 16 nm FinFET process, which is significantly more energy efficient than the previous 28 nm technology. For memory, Nvidia uses HBM2 (High Bandwidth Memory 2), at least on the Tesla P100. Compared to HBM1, which at the time only AMD used on graphics cards with Fiji GPUs, HBM2 enables higher transfer rates and more memory per GPU.
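
One way to see the effect of the move from 28 nm to 16 nm is to compare transistor density. The Python sketch below uses the transistor counts and die areas that the processor table in this article lists for the GM200 and the GP100. The function name is illustrative.

```python
def transistor_density(transistors_millions: float, die_area_mm2: float) -> float:
    """Millions of transistors per square millimetre."""
    return transistors_millions / die_area_mm2

gm200_density = transistor_density(8000, 601)    # 28 nm Maxwell
gp100_density = transistor_density(15300, 610)   # 16 nm FinFET Pascal

print(round(gm200_density, 1))                  # 13.3
print(round(gp100_density, 1))                  # 25.1
print(round(gp100_density / gm200_density, 2))  # 1.88
```

On these figures the GP100 packs almost twice as many transistors into roughly the same die area.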

Like AMD's Fiji counterpart, the GP100 sits on an interposer (or "carrier") and is connected to the 16 GByte of ECC-protected HBM2 memory via a total of 4096 data lines. The four memory stacks are located very close to the GPU in order to shorten signal paths and thereby maximize the transfer rate. On the Tesla P100, the transfer rate is 720 GB per second.
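
The quoted transfer rate can be approximated from the interface width. The Python sketch below takes the 4096 data lines from the text and the 700 MHz memory clock listed for the Tesla P100 in the model data. That the memory transfers data twice per clock is an assumption made here, and the function name is hypothetical.

```python
def memory_bandwidth_gb_s(bus_width_bits: int, clock_mhz: float,
                          transfers_per_clock: int) -> float:
    """Theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9

# Assumption: double data rate, i.e. two transfers per memory clock.
p100_bandwidth = memory_bandwidth_gb_s(4096, 700, transfers_per_clock=2)

print(round(p100_bandwidth, 1))  # 716.8
```

The result of about 717 GB/s is consistent with the rounded figure of 720 GB per second.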

Volta

GV100

Volta is primarily tailored to calculations in the field of artificial intelligence and deep learning. The GPU, called "GV100", consists of 21.1 billion transistors and contains 5376 shader cores on a chip area of 815 mm². Nvidia has the GV100 GPU produced by the Taiwanese contract manufacturer TSMC in the 12-nanometer FFN process.

In the Tesla V100, however, Nvidia activates only 80 of the 84 shader clusters in order to increase the chip yield. This means that 5120 shader cores are available for FP32. The GV100 GPU can carry out single-precision calculations at up to 15 TFlops (30 TFlops for FP16), and the 2560 double-precision units theoretically manage 7.5 TFlops for FP64. As a PCIe card, Volta achieves a slightly lower theoretical computing power of 14 TFlops (FP32) and 7 TFlops (FP64) than the SXM2 variant, because of the slightly lower clock frequency (1370 instead of 1455 MHz). In addition, Volta contains 640 special units for deep learning. Eight of these so-called tensor cores are contained in each streaming multiprocessor. They can achieve a computing power of up to 120 TFlops both during training and during inferencing of neural networks. However, they can only be programmed to a limited extent.
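
The theoretical figures above can be reproduced from the core counts and clock frequencies. The Python sketch below assumes that each unit completes one fused multiply-add, counted as two floating-point operations, per clock cycle. The function name is illustrative.

```python
def peak_tflops(units: int, clock_mhz: float, flops_per_clock: int = 2) -> float:
    """Theoretical peak throughput; one FMA per clock counts as 2 operations."""
    return units * flops_per_clock * clock_mhz * 1e6 / 1e12

# Cores per shader cluster, derived from the full chip (5376 cores, 84 clusters).
CORES_PER_CLUSTER = 5376 // 84
active_fp32_cores = 80 * CORES_PER_CLUSTER
tensor_cores = 80 * 8

sxm2_fp32 = peak_tflops(active_fp32_cores, 1455)
sxm2_fp64 = peak_tflops(2560, 1455)
pcie_fp32 = peak_tflops(active_fp32_cores, 1370)
pcie_fp64 = peak_tflops(2560, 1370)

print(active_fp32_cores, tensor_cores)               # 5120 640
print(round(sxm2_fp32, 2), round(sxm2_fp64, 2))      # 14.9 7.45
print(round(pcie_fp32, 2), round(pcie_fp64, 2))      # 14.03 7.01
```

The computed values round to the 15 and 7.5 TFlops quoted for the SXM2 variant and to 14 and 7 TFlops for the PCIe card.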

The memory used is HBM2 (High Bandwidth Memory 2), which reaches 900 GByte/s on the Tesla V100. As with its predecessor, the Tesla P100, the memory size remains at 16 GB. Theoretically, an expansion to 32 GB is possible. Compared to the Pascal chip in the Tesla P100, Volta's L1 cache has a latency that is 4 times lower and achieves a throughput of around 14 terabytes/s.

Turing

TU104

The Turing-based T4 card of the Tesla series stays within the 75-watt power limit of a PCIe 3.0 slot and is therefore well suited to servers.

Processors

Since most cards have no display outputs because they are aimed at GPU computation, the compute interfaces OpenCL and CUDA are the most important ones here. Evaluation support for OpenCL 2.0 is available from driver version 378.66 for Kepler, Maxwell and Pascal. OpenGL 4.6 is available from Fermi onward with drivers from version 381 for Linux and 387 for Windows.

| Architecture | Chip | Process (nm) | Transistors (millions) | Die area (mm²) | ROP partitions | ROPs | Stream processors | Shader clusters | Shader model | DirectX | OpenGL | OpenCL | CUDA compute capability | CUDA SDK (max.) | Interface |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Tesla | G80 | 90 | 681 | 484 | 6 | 24 | 128 | 8 | 4.0 | 10.0 | 3.3 | 1.1 | 1.0 | 6.5 | PCIe |
| Tesla | GT200 / GT200b | 65 / 55 | 1400 | 576 / 470 | 8 | 32 | 240 | 10 | 4.0 | 10.1 | 3.3 | 1.1 | 1.3 | 6.5 | PCIe 2.0 |
| Fermi | GF100 | 40 | 3000 | 526 | 6 | 48 | 512 | 16 | 5.0 | 11.0 | 4.6 | 1.1 | 2.1 | 8.0 | PCIe 2.0 |
| Fermi | GF110 | 40 | 3000 | 526 | 6 | 48 | 512 | 16 | 5.0 | 11.0 | 4.6 | 1.1 | 2.1 | 8.0 | PCIe 2.0 |
| Kepler | GK104 | 28 | 3540 | 294 | 4 | 32 | 1536 | 8 | 5.0 | 11.0 | 4.6 | 1.2 (2.0) | 3.0 | 10.0 | PCIe 3.0 |
| Kepler | GK110 | 28 | 7100 | 561 | 6 | 48 | 2880 | 15 | 5.0 | 11.0 | 4.6 | 1.2 (2.0) | 3.0 | 10.0 | PCIe 3.0 |
| Kepler | GK210 | 28 | approx. 7100 | approx. 561 | 6 | 48 | 2880 | 15 | 5.0 | 11.0 | 4.6 | 1.2 (2.0) | 3.5 | 10.0 | PCIe 3.0 |
| Maxwell | GM200 | 28 | 8000 | 601 | 6 | 96 | 3072 | 24 | 5.0 | 12.0 | 4.6 | 1.2 (2.0) | 5.2 | 10.0 | PCIe 3.0 |
| Maxwell | GM204 | 28 | 5200 | 398 | 4 | 64 | 2048 | 16 | 5.0 | 12.1 | 4.6 | 1.2 (2.0) | 5.2 | 10.0 | PCIe 3.0 |
| Maxwell | GM206 | 28 | 2940 | 227 | 2 | 32 | 1024 | 8 | 5.0 | 12.1 | 4.6 | 1.2 (2.0) | 5.2 | 10.0 | PCIe 3.0 |
| Pascal | GP100 | 16 | 15300 | 610 | 10 | 96 | 3840 | 60 | 5.0 | 12.1+ | 4.6 | 1.2 (2.0) | 6.0 | 10.0 | PCIe 3.0, NVLink |
| Volta | GV100 | 12 | 21100 | 815 | | 128 | 5376 | 84 | 5.0 | 12.1+ | 4.6 | 1.2 (2.0) | 7.0 | 10.0 | PCIe 3.0, NVLink |
| Turing | TU104 | 12 | 13600 | 545 | | 64 | 2560 | 40 | 6.3 | 12.1+ | 4.6 | 1.2 (2.0) | 7.5 | 10.0 | PCIe 3.0, NVLink |

Model data

| Model | Chip | Stream processors | Chip clock (MHz) | Shader clock (MHz) | Half precision, FP16 (GFLOPS) | Single precision, MAD + MUL (GFLOPS) | Single precision, MAD or FMA (GFLOPS) | Double precision, FMA (GFLOPS) | Memory size (MB) | Memory clock (MHz) | Memory type | Memory interface | Memory bandwidth (GB/s) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Tesla C870 | G80 | 128 | 600 | 1350 | No | ? | 519 | No | 1536 | 800 | GDDR3 | 384 bit | 77 |
| Tesla D870 | 2 × G80 | 256 | 600 | 1350 | No | ? | 1037 | No | 3072 | 800 | GDDR3 | 2 × 384 bit | 2 × 77 |
| Tesla S870 | 4 × G80 | 512 | 600 | 1350 | No | ? | 2074 | No | 6144 | 800 | GDDR3 | 4 × 384 bit | 4 × 77 |
| Tesla C1060 | GT200 | 240 | 602 | 1296 | ? | 933 | 622 | 78 | 4096 | 800 | GDDR3 | 512 bit | 102 |
| Tesla S1070 | 4 × GT200 | 960 | 602 | 1296 | ? | 3732 | 2488 | 311 | 16384 | 800 | GDDR3 | 4 × 512 bit | 4 × 102 |
| Tesla S1070 | 4 × GT200b | | | 1440 | ? | 4147 | 2765 | 345 | | | | | |
| Tesla C2050 | GF100 | 448 | 575 | 1150 | ? | No | 1030 | 515 | 3072 | 1500 | GDDR5 | 384 bit | 144 |
| Tesla M2050 | GF100 | 448 | 575 | 1150 | ? | No | 1030 | 515 | 3072 | 1550 | GDDR5 | 384 bit | 148 |
| Tesla C2070 | GF100 | 448 | 575 | 1150 | ? | No | 1030 | 515 | 6144 | 1500 | GDDR5 | 384 bit | 144 |
| Tesla M2070 | GF100 | 448 | 575 | 1150 | ? | No | 1030 | 515 | 6144 | 1550 | GDDR5 | 384 bit | 150 |
| Tesla S2050 | 4 × GF100 | 1792 | 575 | 1150 | ? | No | 4120 | 2060 | 12288 | 1500 | GDDR5 | 4 × 384 bit | 4 × 144 |
| Tesla S2070 | 4 × GF100 | 1792 | 575 | 1150 | ? | No | 4122 | 2061 | 24576 | 1500 | GDDR5 | 4 × 384 bit | 4 × 144 |
| Tesla M2090 | GF110 | 512 | 650 | 1300 | ? | No | 1331 | 666 | 6144 | 1850 | GDDR5 | 384 bit | 177 |
| Tesla K10 | 2 × GK104 | 3072 | 745 | | ? | No | 4580 | 191 | 8192 | 2500 | GDDR5 | 2 × 256 bit | 2 × 160 |
| Tesla K20 | GK110 | 2496 | 705 | | ? | No | 3524 | 1175 | 5120 | 2600 | GDDR5 | 320 bit | 208 |
| Tesla K20X | GK110 | 2688 | 735 | | ? | No | 3935 | 1312 | 6144 | 2600 | GDDR5 | 384 bit | 250 |
| Tesla K40 | GK110B | 2880 | 745 (Boost: 810/875) | | ? | No | 4290 | 1430 | 12288 | 3004 | GDDR5 | 384 bit | 288 |
| Tesla K80 | 2 × GK210 | 5760 | 590 | | ? | No | 5591-8736 | 1864-2912 | 24576 | 3004 | GDDR5 | 2 × 384 bit | 2 × 288 |
| Tesla M4 | GM206 | 1024 | 872 (Boost: 1072) | | ? | No | 1786-2195 | 56-69 | 4096 | 2750 | GDDR5 | 128 bit | 88 |
| Tesla M40 | GM200 | 3072 | 948 (Boost: 1114) | | ? | No | 5825-6844 | 182-214 | 12288 | 3000 | GDDR5 | 384 bit | 288 |
| Tesla M6 | GM204 | 1536 | 930 (Boost: 1180) | | ? | No | 2857 | 2857/32 | 8192 | 2750 | GDDR5 | 256 bit | 160 |
| Tesla M60 | 2 × GM204 | 4096 | 900 (Boost: 1180) | | ? x SP | No | 7373-9667 | 230-302 | 2 × 8192 | 2500 | GDDR5 | 2 × 256 bit | 2 × 160 |
| Tesla P4 | GP104 | 2560 | 810 (Boost: 1063) | | 2x SP | No | 5500 | 1/32 SP | 8000 | 1500 (eff. 6000) | GDDR5 | 256 bit | 192 |
| Tesla P40 | GP102 | 3840 | 1303 (Boost: 1531) | | 2x SP | No | 12000 | 1/32 SP | 24000 | 1251 (eff. 10008) | GDDR5X | 384 bit | 346 |
| Tesla P100 "PCIe 12GB 250W" | GP100 | 3584 | 1175 (Boost: 1300)? | | 2x SP | No | 8000-9300 | 4700 | 12288 | 700 | HBM2 | 3072 bit | 540 |
| Tesla P100 "PCIe 16GB 250W" | GP100 | 3584 | 1175 (Boost: 1300)? | | 2x SP | No | 8000-9300 | 4700 | 16384 | 700 | HBM2 | 4096 bit | 720 |
| Tesla P100 "NVLink 300W" | GP100 | 3584 | 1328 (Boost: 1480) | | 2x SP | No | 9519-10609 | 5300 | 16384 | 700 | HBM2 | 4096 bit | 720 |
| Tesla V100 PCIe 250 W | GV100 | 5120 | (Boost: 1370) | | 8x SP (tensor mode) | No | 14000 | 7000 | 16384 | 876 (eff. 1752) | HBM2 | 4096 bit | 900 |
| Tesla V100 SXM2 NVLink 300 W | GV100 | 5120 | (Boost: 1455) | | 8x SP (tensor mode) | No | 15000 | 7500 | 16384 | 876 (eff. 1752) | HBM2 | 4096 bit | 900 |
| Tesla T4 PCIe 70 W | TU104 | 2560 | 1005 (Boost: 1515) | | 8x SP (tensor mode) | No | 8100 | 1/32 SP | 16384 | 1250 (eff. 10000) | GDDR6 | 256 bit | 320 |
