Tensor Processing Unit
Tensor Processing Units (TPUs), also known as tensor processors, are application-specific chips designed to accelerate machine-learning applications. TPUs are mainly used to process data in artificial neural networks (cf. deep learning).
The TPUs developed by Google were designed specifically for the TensorFlow software collection. TPUs are the basis for all Google services that use machine learning, and they were also used in the AlphaGo machine-versus-human matches against Lee Sedol, one of the world's best Go players.
First generation
The first generation of Google's TPU was presented at Google I/O 2016 and was designed specifically to accelerate the use of an already trained artificial neural network. This was achieved, among other things, by lower precision compared to conventional CPUs or GPUs and by a specialization in matrix operations.
The TPU consists of a systolic array with a 256 × 256 8-bit matrix multiplication unit (MMU), controlled by a microprocessor with a CISC instruction set. The chip is manufactured in a 28 nm process and clocks at 700 MHz with a TDP of 28 to 40 W. The TPU has 28 MiB of on-chip RAM. In addition, it has 4 MiB of 32-bit accumulators, which take over the results of the matrix multiplication unit. The TPU can perform matrix multiplications, convolutions and activation functions, as well as data transfers to the host system via PCIe 3.0 or to the DDR3 DRAM located on the board.
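The split between an 8-bit multiplication unit and wider 32-bit accumulators can be illustrated with a minimal NumPy sketch. This is not Google's actual hardware or microcode, only an arithmetic model of why 32-bit accumulators suffice for a 256-wide 8-bit dot product:

```python
import numpy as np

def quantized_matmul(a_u8: np.ndarray, b_u8: np.ndarray) -> np.ndarray:
    """Multiply two 8-bit matrices, accumulating in 32 bits.

    Illustrative model of the TPU's split between the 8-bit matrix
    multiplication unit and the 32-bit accumulators: a product of two
    8-bit values needs up to 16 bits, and summing 256 such products
    needs at most 24 bits, so a 32-bit accumulator never overflows.
    """
    return a_u8.astype(np.int32) @ b_u8.astype(np.int32)

# A 256 x 256 tile, matching the width of the systolic array.
rng = np.random.default_rng(0)
a = rng.integers(0, 256, size=(256, 256), dtype=np.uint8)
b = rng.integers(0, 256, size=(256, 256), dtype=np.uint8)
result = quantized_matmul(a, b)  # int32 matrix of shape (256, 256)
```

The worst case confirms the comment: 255 × 255 × 256 = 16,646,400, which fits in 24 bits (2^24 = 16,777,216).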
Second generation
The second generation of Google's TPU (TPUv2) was presented at Google I/O 2017. It accelerates not only the use of neural networks (inference) but also their training. Each of these TPUs has two matrix execution units (MXUs) with 8 GiB of RAM each. Each MXU has a computing power of 22.5 TFLOPS, using the bfloat16 data type, which does not comply with IEEE 754. A TPU board with 4 TPUs thus reaches 180 TFLOPS.
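bfloat16 keeps float32's 8 exponent bits (and thus its dynamic range) but only 7 mantissa bits, so a float32 can be converted by simply truncating its low 16 bits. The following sketch illustrates that truncation and the board-level arithmetic quoted above; it is an illustration, not TPU firmware (truncation here is round-toward-zero for simplicity):

```python
import struct

def float32_to_bfloat16_bits(x: float) -> int:
    """Truncate a float32 to bfloat16 by keeping its top 16 bits."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bfloat16_bits_to_float32(b: int) -> float:
    """Widen bfloat16 bits back to float32 by zero-padding the mantissa."""
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x

# The reduced 7-bit mantissa makes nearby values collapse:
y = bfloat16_bits_to_float32(float32_to_bfloat16_bits(1.001))  # -> 1.0

# The quoted board performance follows from the unit counts:
tpus_per_board, mxus_per_tpu, tflops_per_mxu = 4, 2, 22.5
board_tflops = tpus_per_board * mxus_per_tpu * tflops_per_mxu  # 180.0
```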
The TPUs are interconnected into a "pod" with 11.5 PFLOPS, a computer network (cluster system architecture) of 256 TPUs and 128 server CPUs. The TPUs are interconnected in a toroidal (2D torus) network topology of 8 × 8 TPUs each. PCI Express 3.0 with 32 lanes (8 lanes per TPU) is used to connect the CPUs to the TPUs.
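In a 2D torus, the edges of the grid wrap around, so every node has exactly four neighbors; this distinguishes it from a plain 2D mesh, whose border nodes have fewer. A small sketch of the neighbor relation (illustrative only, not Google's interconnect routing), together with the pod-level arithmetic:

```python
def torus_neighbors(x: int, y: int, n: int = 8):
    """Neighbors of node (x, y) in an n x n 2D torus.

    Edges wrap around modulo n, so every node -- including the
    "corners" -- has exactly four neighbors.
    """
    return [((x - 1) % n, y), ((x + 1) % n, y),
            (x, (y - 1) % n), (x, (y + 1) % n)]

# A corner node still has four neighbors thanks to wrap-around:
corners = torus_neighbors(0, 0)  # [(7, 0), (1, 0), (0, 7), (0, 1)]

# Pod performance follows from the chip count (2 MXUs x 22.5 TFLOPS each):
pod_pflops = 256 * 2 * 22.5 / 1000  # 11.52, quoted as 11.5 PFLOPS
```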
Second-generation TPUs can be used via the Google Compute Engine, Google's cloud offering.
HBM memory is used to increase the memory bandwidth of the architecture.
The chip area of the second generation is likely larger than that of the first generation due to the more complex memory interface and the two cores per chip.
Third generation
The third generation of Google's TPU (TPU 3.0) was presented at Google I/O 2018. These TPUs have 4 MXUs with 8 GiB of working memory each (32 GiB per TPU). The network topology of the TPUs is now a 3D torus. The racks are water-cooled to keep the TPUs cool. TPU 3.0 pods consist of 8 racks with a total of 1024 TPUs and 256 server CPUs. The computing power of a pod is just over 100 PFLOPS.
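Moving from a 2D to a 3D torus doubles each chip's direct links from four to six (two per axis, with wrap-around). The sketch below models that neighbor relation and the per-TPU memory arithmetic from the figures above; the topology function is illustrative, not Google's actual routing:

```python
def torus3d_neighbors(p, dims):
    """Neighbors of node p = (x, y, z) in a 3D torus of size dims.

    Each of the three axes wraps around modulo its length, giving
    every node six neighbors (two per axis), versus four in the
    2D torus of the previous TPU generation.
    """
    out = []
    for axis in range(3):
        for step in (-1, 1):
            q = list(p)
            q[axis] = (q[axis] + step) % dims[axis]
            out.append(tuple(q))
    return out

# Memory figures follow from the MXU counts quoted above:
gib_per_mxu, mxus_per_tpu, tpus_per_pod = 8, 4, 1024
gib_per_tpu = gib_per_mxu * mxus_per_tpu    # 32 GiB per TPU
pod_tib = gib_per_tpu * tpus_per_pod / 1024  # 32 TiB across the pod
```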
Literature
- Norman P. Jouppi, Cliff Young, Nishant Patil, David Patterson: A domain-specific architecture for deep neural networks. In: Communications of the ACM. 61, 2018, p. 50, doi:10.1145/3154484.
Web links
- Cloud TPUs (TensorFlow @ O'Reilly AI Conference, San Francisco '18) on YouTube, October 25, 2018
- A simple classification model using Keras with Cloud TPUs. In: colab.research.google.com. Retrieved November 10, 2018.
- Edge TPU devices. In: aiyprojects.withgoogle.com. Retrieved March 22, 2019.
- Sebastian Grüner: Tensor Processing Unit: Google builds its own chips for machine learning. In: golem.de. May 19, 2016. Retrieved November 23, 2016.
- Harald Bögeholz: Artificial Intelligence: Architecture and Performance of Google's AI chip TPU - heise online. In: heise.de. April 6, 2017. Retrieved April 7, 2017.
- David Patterson, Google TPU Team: In-Data Center Performance Analysis of a Tensor Processing Unit. (PDF) April 2, 2017, accessed May 23, 2017.
Patents
- Patent US20160342889: Vector Computation Unit in Neural Network Processor. Filed September 3, 2015, published November 24, 2016, applicant: Google Inc., inventors: Gregory Michael Thorson, Christopher Aaron Clark, Dan Luu.
- Patent WO2016186823: Batch Processing in a Neural Network Processor. Filed March 3, 2016, published November 24, 2016, applicant: Google Inc., inventor: Reginald Clifford Young.
- Patent WO2016186801: Neural Network Processor. Filed April 26, 2016, published November 24, 2016, applicant: Google Inc., inventors: Jonathan Ross, Norman Paul Jouppi, Andrew Everett Phelps, Reginald Clifford Young, Thomas Norrie, Gregory Michael Thorson, Dan Luu.
- Patent WO2014105865: System and method for parallelizing convolutional neural networks. Filed December 23, 2013, published July 3, 2014, applicant: Google Inc., inventors: Alexander Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton.
Individual evidence
- Jeff Dean, Rajat Monga: TensorFlow - Google's latest machine learning system, open sourced for everyone. In: Google Research Blog. Google, November 9, 2015, accessed June 29, 2016.
- Christof Windeck: Google I/O 2016: "Tensor processors" helped win at Go - heise online. In: heise.de. May 19, 2016. Retrieved November 23, 2016.
- Norm Jouppi: Google supercharges machine learning tasks with TPU custom chip. In: Google Cloud Platform Blog. May 18, 2016. Retrieved June 29, 2016.
- Timothy Prickett Morgan: Tearing apart Google's TPU 3.0 AI Coprocessor. In: The Next Platform. May 10, 2018, accessed May 24, 2018.
- System architecture | Cloud TPU. Retrieved January 12, 2020.