Nvidia Tesla

Nvidia Tesla 2075

Tesla is a line of highly parallel processors, also called stream processors, from Nvidia. The processors are based on GPU technology and can be programmed through Nvidia's own CUDA API and through OpenCL. The product line competed directly with FireStream and FirePro from AMD.

After the first cards based on the G80 GPU were presented in mid-2007, Tesla cards with the GT200 graphics chip followed a year later. The same chip is also used on desktop graphics cards of the GeForce 200 series.

Under the code name "Fermi", Nvidia presented its next-generation graphics processor on September 30, 2009 at its own "GPU Technology Conference". The processor is used in Tesla and Quadro products and, in a modified form (e.g. with reduced double-precision performance), in the GeForce 400 series. At the Supercomputing 09 exhibition, Nvidia announced Tesla cards based on the Fermi graphics processor for the second and third quarters of 2010.

With more recent drivers, the OpenGL, CUDA and OpenCL capabilities of the cards have in some cases been considerably expanded.

Technology

Tesla

G80

The G80 graphics processor was the first Nvidia processor to be based on the newly developed unified shader architecture. After the G80 had been used on the GeForce 8800 GTX and GTS graphics cards since the end of 2006, Nvidia presented the first Tesla models in mid-2007. The Tesla models primarily use the G80 in the A3 stepping, which was also installed on the GeForce 8800 Ultra.

GT200

The GT200 processor was the second chip that Nvidia used in the Tesla series. In contrast to the G80, Nvidia planned from the start to use the GT200 in Tesla models (hence the T in the identifier) and implemented its double-precision capabilities through 30 additional MADD units according to the IEEE 754R specification, which would not have been necessary for GeForce graphics cards.

Fermi

The Fermi chip is manufactured in a 40 nm process and has around three billion transistors. In contrast to its predecessor, the GT200, it is largely a new development based on the unified shader architecture of the G80 graphics processor. Fermi is divided into 16 shader clusters, each with 32 stream processors, for a total of 512 stream processors. Each shader cluster also has 16 load/store units and four separate special function units for calculating functions such as sine and cosine. The Fermi chip has six 64-bit memory controllers for GDDR5 memory, resulting in a 384-bit memory interface. This allows memory configurations of 1.5 GB, 3 GB and 6 GB. The memory controller also supports ECC, i.e. memory with built-in error correction.
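The totals in the Fermi description above follow from simple multiplication. A minimal Python sketch, using only the figures quoted in the text (the constant and variable names are illustrative, and the even split of memory across the six controllers is an assumption):

```python
# Figures from the Fermi description; names are illustrative only.
SHADER_CLUSTERS = 16
STREAM_PROCESSORS_PER_CLUSTER = 32
MEMORY_CONTROLLERS = 6
BITS_PER_CONTROLLER = 64

total_stream_processors = SHADER_CLUSTERS * STREAM_PROCESSORS_PER_CLUSTER
memory_interface_bits = MEMORY_CONTROLLERS * BITS_PER_CONTROLLER

print(total_stream_processors)  # 512
print(memory_interface_bits)    # 384

# Assumption: memory is split evenly across the six controllers.
for total_mb in (1536, 3072, 6144):  # 1.5 GB, 3 GB, 6 GB
    print(total_mb, "MB total ->", total_mb // MEMORY_CONTROLLERS, "MB per controller")
```

The memory sizes of 1.5 GB, 3 GB and 6 GB are all multiples of six, which matches the six-controller layout.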

Nvidia attached increasing importance to GPU computing, which is why many architectural changes were made in Fermi to improve performance in this area. Fermi is the first graphics processor with full support for C++ and is fully compatible with the IEEE 754-2008 standard (previously IEEE 754-1985). Compliance with the newer standard was necessary in order to use FMA (fused multiply-add), which is more precise than MAD, to improve the double-precision capabilities. Each Fermi shader cluster can perform 16 double-precision operations per clock cycle, so Fermi can perform a total of 256 double-precision operations per cycle, whereas the GT200 managed only 30. In addition to shared memory, the Fermi graphics processor also has an L1 and an L2 cache to improve its GPU computing capabilities.
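The precision difference between MAD and FMA comes from rounding: MAD rounds the product and then the sum, while FMA rounds only once at the end. The following sketch illustrates this on the CPU in Python, not on a GPU. The function names are hypothetical, and the fused operation is emulated with exact rational arithmetic rather than a hardware FMA instruction.

```python
from fractions import Fraction


def mad(a: float, b: float, c: float) -> float:
    """Multiply-add with two roundings: product first, then the sum."""
    return a * b + c


def fma_emulated(a: float, b: float, c: float) -> float:
    """Emulated fused multiply-add: exact intermediate, one final rounding."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

# The exact product is 1 - 2**-60, which a double cannot represent,
# so the rounded product becomes 1.0 and the MAD result collapses to zero.
print(mad(a, b, c))
# The single-rounding result keeps the small term -2**-60.
print(fma_emulated(a, b, c))
```

In this example the two-step variant loses the result entirely, while the single-rounding variant returns the exact value.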

Kepler

GK104

On March 22, 2012, Nvidia presented the GeForce GTX 680, the first graphics card of the GeForce 600 series, which introduced the new Kepler architecture. The GeForce GTX 680 is based on the GK104 graphics processor, which consists of 3.54 billion transistors and has 1536 stream processors and 128 texture units organized in eight shader clusters. The GK104 GPU is manufactured in the 28 nm process at TSMC and has a die area of 294 mm². The GK104 was originally planned as a graphics chip for the performance segment, which can be recognized, among other things, by its reduced double-precision performance. After Nvidia dropped the GK100 graphics processor in favor of the GK110, the GK104 also had to serve the high-end segment, as the GK110 was only to become available for the Kepler refresh generation.

GK110

With 7.1 billion transistors on around 561 mm² (533 mm² in the production-optimized B1 stepping), the GK110 GPU is the largest and most complex graphics processor of the Kepler generation. It consists of 2880 shader units and 240 texture units, which are distributed over 15 SMX blocks (shader clusters). These, in turn, are distributed over five graphics processing clusters, which gives the GK110 a ratio of 3:1 (in contrast to the other graphics processors of the Kepler generation, which use a ratio of 2:1). Another special feature of the GK110 is the additional 64 separate ALUs per SMX block, which handle double-precision (FP64) rather than single-precision (FP32) operations. The features "Dynamic Parallelism", "Hyper-Q" and "GPUDirect" are also intended for the professional sector and are only available on the GK110 GPU.
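The per-block figures and the resulting double-precision rate can be derived from the totals given for the GK110. A short sketch with illustrative names, assuming that the shader and texture units are spread evenly over the SMX blocks:

```python
from fractions import Fraction

# Totals quoted for the GK110; names are illustrative only.
SMX_BLOCKS = 15
GRAPHICS_PROCESSING_CLUSTERS = 5
SHADER_UNITS = 2880
TEXTURE_UNITS = 240
FP64_ALUS_PER_SMX = 64

fp32_units_per_smx = SHADER_UNITS // SMX_BLOCKS              # 192
texture_units_per_smx = TEXTURE_UNITS // SMX_BLOCKS          # 16
smx_per_gpc = SMX_BLOCKS // GRAPHICS_PROCESSING_CLUSTERS     # 3

# Double-precision rate relative to single precision.
dp_rate = Fraction(FP64_ALUS_PER_SMX, fp32_units_per_smx)

print(fp32_units_per_smx, texture_units_per_smx, smx_per_gpc)  # 192 16 3
print(dp_rate)  # 1/3
```

The ratio of 64 FP64 ALUs to 192 FP32 units per SMX block gives a double-precision rate of 1/3.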

GK210

Because of the limited double-precision performance of the Maxwell architecture, an improved version of the GK110, the GK210 graphics processor, was designed for the Tesla series.

Maxwell

GM200

The GM200 graphics processor also serves as the high-end chip of the GeForce 900 series, where it replaced the GK110 GPU of the GeForce 700 series. The GM200 has 8 billion transistors on a chip area of 601 mm², making it the largest and most complex graphics processor on the market up to that time. From a technical point of view, the GM200 with 96 raster units, 3072 shader units and 192 texture units is a 50% larger variant of the GM204. It also differs significantly from its predecessors: the GF100, GF110 and GK110 GPUs still had advanced double-precision capabilities (FP64) and were also used in the Quadro and Tesla professional series. For this purpose, the GK110, for example, had 64 separate ALUs in each SMX block, which resulted in a DP rate of 1/3. Since these separate ALUs are missing on the GM200 (they were probably omitted for reasons of space, since producing graphics processors larger than 600 mm² is hardly feasible for technical and economic reasons), it only has a DP rate of 1/32. Since double-precision operations are not required for 3D applications, this did not matter in the gaming sector, but it made the GM200 unsuitable for the Quadro and Tesla professional series.

Nvidia therefore departed from its previous strategy of developing one high-end/enthusiast chip for all three series. Within the Tesla series, the GM200 was used only for the Tesla M40. For the Tesla K80, an improved version of the Kepler GK110, the GK210 graphics processor, was designed instead.

GM204

The GM204 graphics processor was the first GPU of the GeForce 900 series and uses the "second-generation Maxwell architecture". As with the first Kepler generation, the GeForce 600 series, Nvidia brought the performance chip (GM204) to market before the high-end chip (GM200). Because Nvidia, like AMD, skipped 20 nm production at TSMC, the GM204 was still manufactured in the 28 nm process, contrary to original plans. It has 5.2 billion transistors on a chip area of 398 mm². The basic structure is identical to that of the GM107 GPU of the first Maxwell generation: the shader clusters (SMM) still contain 128 shader units and 8 texture units, but the level 1 cache was increased from 64 kByte to 96 kByte and the texture cache from 24 kByte to 48 kByte per cluster. The GM204 consists of a total of 16 shader clusters, with four clusters attached to each raster engine, which gives the GM204 2048 stream processors, 128 texture units, 64 ROPs and a 2 MB level 2 cache. To compensate for the memory interface of only 256 bits, which is narrow compared to other GPUs of this class, Nvidia introduced the "Third Generation Delta Color Compression" feature, a bandwidth-saving technique that is supposed to reduce the memory load by around 25%.

Because double-precision performance on Maxwell is limited to 1/32 of single-precision performance, the Tesla K cards with Kepler architecture continued to be offered for their higher double-precision performance.

Pascal

GP100

The Pascal chip for this segment is called "GP100" and, thanks to its high computing power and efficiency, is particularly suitable for high-performance computing and deep learning. With the Tesla P100, Nvidia presented the first computing accelerator with a GP100 chip in the spring of 2016 at the GPU Technology Conference (GTC) 2016. Pascal was intended to replace Kepler and Maxwell graphics chips in the professional sector in the medium term. The GP100 consists of around 15 billion transistors and contains up to 3840 shader cores. Nvidia manufactures the GP100 GPU at TSMC using the 16 nm FinFET process, which is significantly more energy efficient than the previous 28 nm technology. For memory, Nvidia uses HBM 2 (High Bandwidth Memory 2), at least on the Tesla P100. Compared to HBM 1, which at the time only AMD used on graphics cards with Fiji GPUs, HBM 2 enables higher transfer rates and more memory per GPU.

Like AMD's Fiji counterpart, the GP100 sits on an interposer (or "carrier") and is connected to the 16 GByte of ECC-protected HBM 2 memory via a total of 4096 data lines. The four memory stacks are located very close to the GPU in order to shorten signal paths and thereby maximize the transfer rate. On the Tesla P100, the transfer rate is 720 GB per second.

Volta

GV100

Volta is primarily tailored to calculations in the field of artificial intelligence and deep learning. The GPU, called "GV100", consists of 21.1 billion transistors and contains 5376 shader cores on a chip area of 815 mm². Nvidia has the GV100 GPU manufactured by the Taiwanese contract manufacturer TSMC in the 12-nanometer FFN process.

In the Tesla V100, however, Nvidia only activates 80 of the 84 shader clusters in order to increase the chip yield. This means that 5120 shader cores are available for FP32. The GV100 GPU can carry out single-precision calculations at up to 15 TFLOPS (30 TFLOPS for FP16), and the 2560 double-precision units theoretically manage 7.5 TFLOPS in FP64. The PCIe card achieves a slightly lower theoretical computing power of 14 and 7 TFLOPS respectively compared to the SXM2 variant, due to its slightly lower clock frequency (1370 instead of 1455 MHz). In addition, Volta contains 640 special units for deep learning. Eight of these so-called tensor cores are contained in each streaming multiprocessor. They achieve a computing power of up to 120 TFLOPS both in training and in inference of neural networks. However, they can only be programmed to a limited extent.
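The theoretical figures for the Tesla V100 can be approximated as number of units × operations per clock × clock frequency. The sketch below counts one fused multiply-add as two floating-point operations. The value of 64 FMA operations per tensor core and clock is an assumption that is not stated in the text, and the function name is hypothetical.

```python
def peak_tflops(units: int, clock_mhz: int, flops_per_unit_per_clock: int = 2) -> float:
    """Theoretical peak in TFLOPS; an FMA is counted as two operations."""
    return units * flops_per_unit_per_clock * clock_mhz / 1_000_000


SXM2_CLOCK_MHZ = 1455
PCIE_CLOCK_MHZ = 1370

print(f"{peak_tflops(5120, SXM2_CLOCK_MHZ):.2f}")  # 14.90 (FP32, SXM2)
print(f"{peak_tflops(2560, SXM2_CLOCK_MHZ):.2f}")  # 7.45  (FP64, SXM2)
print(f"{peak_tflops(5120, PCIE_CLOCK_MHZ):.2f}")  # 14.03 (FP32, PCIe)
print(f"{peak_tflops(2560, PCIE_CLOCK_MHZ):.2f}")  # 7.01  (FP64, PCIe)

# Assumption: each tensor core performs 64 FMAs (128 operations) per clock.
print(f"{peak_tflops(640, SXM2_CLOCK_MHZ, 128):.2f}")  # 119.19 (tensor cores)
```

The results round to the values quoted above: about 15 and 7.5 TFLOPS for the SXM2 variant, 14 and 7 TFLOPS for the PCIe card, and roughly 120 TFLOPS for the tensor cores.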

The memory is HBM 2 (High Bandwidth Memory 2), which reaches 900 GByte/s on the Tesla V100. As with the predecessor Tesla P100, the memory size remains at 16 GB. Theoretically, a memory expansion to 32 GB is possible. Compared to the Pascal chip in the Tesla P100, Volta's L1 cache has a latency that is 4 times lower and achieves a throughput of around 14 terabytes/s.
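The quoted memory throughput follows from the width of the memory interface and the effective transfer rate. A small sketch, using the 4096-bit interface and the effective memory clock of 1752 MHz listed for the Tesla V100 in the model data table (the function name is illustrative):

```python
def memory_bandwidth_gb_s(bus_width_bits: int, effective_rate_mt_s: float) -> float:
    """Peak memory bandwidth in GB/s from bus width and effective transfer rate."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * effective_rate_mt_s / 1000


v100 = memory_bandwidth_gb_s(4096, 1752)   # HBM 2 on the Tesla V100
t4 = memory_bandwidth_gb_s(256, 10000)     # GDDR6 on the Tesla T4

print(f"{v100:.1f}")  # 897.0
print(f"{t4:.1f}")    # 320.0
```

The computed 897 GB/s corresponds to the rounded figure of 900 GByte/s, and the Tesla T4 value matches the 320 GB/s given in the table.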

Turing

TU104

The Turing-based T4 card of the Tesla series stays within the 75-watt power limit of a PCIe 3.0 slot and is therefore well suited for servers.

Processors

Since most cards lack display outputs because of their focus on GPU computing, the compute interfaces OpenCL and CUDA are the most important here. Evaluation support for OpenCL 2.0 is available from driver version 378.66 for Kepler, Maxwell and Pascal. OpenGL 4.6 is supported from Fermi onward with driver version 381 or later for Linux and 387 or later for Windows.

Chip	Process in nm	Transistors in millions	Die area in mm²	ROP partitions	ROPs	Stream processors	Shader clusters	Shader model	DirectX	OpenGL	OpenCL	CUDA compute capability	CUDA SDK (max.)	Bus interface
Tesla G80	90	681	484	6	24	128	8	4.0	10.0	3.3	1.1	1.0	6.5	PCIe
Tesla GT200/GT200b	65/55	1400	576/470	8	32	240	10	4.0	10.1	3.3	1.1	1.3	6.5	PCIe 2.0
Fermi GF100	40	3000	526	6	48	512	16	5.0	11.0	4.6	1.1	2.1	8.0	PCIe 2.0
Fermi GF110	40	3000	526	6	48	512	16	5.0	11.0	4.6	1.1	2.1	8.0	PCIe 2.0
Kepler GK104	28	3540	294	4	32	1536	8	5.0	11.0	4.6	1.2 (2.0)	3.0	10.0	PCIe 3.0
Kepler GK110	28	7100	561	6	48	2880	15	5.0	11.0	4.6	1.2 (2.0)	3.0	10.0	PCIe 3.0
Kepler GK210	28	approx. 7100	approx. 561	6	48	2880	15	5.0	11.0	4.6	1.2 (2.0)	3.5	10.0	PCIe 3.0
Maxwell GM200	28	8000	601	6	96	3072	24	5.0	12.0	4.6	1.2 (2.0)	5.2	10.0	PCIe 3.0
Maxwell GM204	28	5200	398	4	64	2048	16	5.0	12.1	4.6	1.2 (2.0)	5.2	10.0	PCIe 3.0
Maxwell GM206	28	2940	227	2	32	1024	8	5.0	12.1	4.6	1.2 (2.0)	5.2	10.0	PCIe 3.0
Pascal GP100	16	15300	610	10	96	3840	60	5.0	12.1+	4.6	1.2 (2.0)	6.0	10.0	PCIe 3.0, NVLink
Volta GV100	12	21100	815		128	5376	84	5.0	12.1+	4.6	1.2 (2.0)	7.0	10.0	PCIe 3.0, NVLink
Turing TU104	12	13600	545		64	2560	40	6.3	12.1+	4.6	1.2 (2.0)	7.5	10.0	PCIe 3.0, NVLink

Model data

Model name	Processor type	Stream processors	Chip clock in MHz	Shader clock in MHz	Half prec. (FP16) in GFLOPS	Single prec. (MAD + MUL) in GFLOPS	Single prec. (MAD or FMA) in GFLOPS	Double prec. (FMA) in GFLOPS	Memory size in MB	Memory clock in MHz	Memory type	Memory interface	Memory bandwidth in GB/s
Tesla C870	G80	128	600	1350	No?	519		No	1536	800	GDDR3	384 bit	77
Tesla D870	2 × G80	256	600	1350	No?	1037		No	3072	800	GDDR3	2 × 384 bit	2 × 77
Tesla S870	4 × G80	512	600	1350	No?	2074		No	6144	800	GDDR3	4 × 384 bit	4 × 77
Tesla C1060	GT200	240	602	1296	?	933	622	78	4096	800	GDDR3	512 bit	102
Tesla S1070	4 × GT200	960	602	1296	?	3732	2488	311	16384	800	GDDR3	4 × 512 bit	4 × 102
Tesla S1070	4 × GT200b	960	602	1440	?	4147	2765	345	16384	800	GDDR3	4 × 512 bit	4 × 102
Tesla C2050	GF100	448	575	1150	?	No	1030	515	3072	1500	GDDR5	384 bit	144
Tesla M2050	GF100	448	575	1150	?	No	1030	515	3072	1550	GDDR5	384 bit	148
Tesla C2070	GF100	448	575	1150	?	No	1030	515	6144	1500	GDDR5	384 bit	144
Tesla M2070	GF100	448	575	1150	?	No	1030	515	6144	1550	GDDR5	384 bit	150
Tesla S2050	4 × GF100	1792	575	1150	?	No	4120	2060	12288	1500	GDDR5	4 × 384 bit	4 × 144
Tesla S2070	4 × GF100	1792	575	1150	?	No	4122	2061	24576	1500	GDDR5	4 × 384 bit	4 × 144
Tesla M2090	GF110	512	650	1300	?	No	1331	666	6144	1850	GDDR5	384 bit	177
Tesla K10	2 × GK104	3072	745		?	No	4580	191	8192	2500	GDDR5	2 × 256 bit	2 × 160
Tesla K20	GK110	2496	705		?	No	3524	1175	5120	2600	GDDR5	320 bit	208
Tesla K20X	GK110	2688	735		?	No	3935	1312	6144	2600	GDDR5	384 bit	250
Tesla K40	GK110B	2880	745 (Boost: 810/875)		?	No	4290	1430	12288	3004	GDDR5	384 bit	288
Tesla K80	2 × GK210	5760	590		?	No	5591-8736	1864-2912	24576	3004	GDDR5	2 × 384 bit	2 × 288
Tesla M4	GM206	1024	872 (Boost: 1072)		?	No	1786-2195	56-69	4096	2750	GDDR5	128 bit	88
Tesla M40	GM200	3072	948 (Boost: 1114)		?	No	5825-6844	182-214	12288	3000	GDDR5	384 bit	288
Tesla M6	GM204	1536	930 (Boost: 1180)		?	No	2857	2857/32	8192	2750	GDDR5	256 bit	160
Tesla M60	2 × GM204	4096	900 (Boost: 1180)		? × SP	No	7373-9667	230-302	2 × 8192	2500	GDDR5	2 × 256 bit	2 × 160
Tesla P4	GP104	2560	810 (Boost: 1063)		2 × SP	No	5500	1/32 SP	8000	1500 (eff. 6000)	GDDR5	256 bit	192
Tesla P40	GP102	3840	1303 (Boost: 1531)		2 × SP	No	12000	1/32 SP	24000	1251 (eff. 10008)	GDDR5X	384 bit	346
Tesla P100 "PCIe 12GB 250W"	GP100	3584	1175 (Boost: 1300)?		2 × SP	No	8000-9300	4700	12288	700	HBM2	3072 bit	540
Tesla P100 "PCIe 16GB 250W"	GP100	3584	1175 (Boost: 1300)?		2 × SP	No	8000-9300	4700	16384	700	HBM2	4096 bit	720
Tesla P100 "NVLink 300W"	GP100	3584	1328 (Boost: 1480)		2 × SP	No	9519-10609	5300	16384	700	HBM2	4096 bit	720
Tesla V100 PCIe 250 W	GV100	5120	(Boost: 1370)		8 × SP Tensor Mode	No	14000	7000	16384	876 (eff. 1752)	HBM2	4096 bit	900
Tesla V100 SXM2 NVLink 300 W	GV100	5120	(Boost: 1455)		8 × SP Tensor Mode	No	15000	7500	16384	876 (eff. 1752)	HBM2	4096 bit	900
Tesla T4 PCIe 70 W	TU104	2560	1005 (Boost: 1515)		8 × SP Tensor Mode	No	8100	1/32 SP	16384	1250 (eff. 10000)	GDDR6	256 bit	320
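
The computing power columns of the table can be reproduced from the number of stream processors and the shader clock. The following sketch assumes that "MAD + MUL" counts three operations per clock and stream processor, and that "MAD or FMA" counts two. The helper name is hypothetical, and the half-rate double precision on Fermi follows from the 16 double-precision operations per 32-processor cluster described above.

```python
def peak_gflops(stream_processors: int, shader_clock_mhz: int, flops_per_clock: int) -> float:
    """Theoretical peak in GFLOPS for a given number of operations per clock."""
    return stream_processors * shader_clock_mhz * flops_per_clock / 1000


# Tesla C1060 (GT200): 240 stream processors at 1296 MHz.
c1060_mad_mul = peak_gflops(240, 1296, 3)  # MAD + MUL: 3 operations per clock
c1060_mad = peak_gflops(240, 1296, 2)      # MAD only: 2 operations per clock

# Tesla C2050 (GF100): 448 stream processors at 1150 MHz.
c2050_fma = peak_gflops(448, 1150, 2)      # FMA: 2 operations per clock
c2050_dp = c2050_fma / 2                   # Fermi: double precision at half rate

print(round(c1060_mad_mul), round(c1060_mad))  # 933 622
print(round(c2050_fma), round(c2050_dp))       # 1030 515
```

The rounded results agree with the table entries for the Tesla C1060 and the Tesla C2050.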