The Google I/O developer conference was cancelled last year because of the pandemic. This year it made a strong return in an online format.
Surpassing the world’s fastest supercomputer: the landmark Google TPU v4
On a Google campus with no developers in attendance, Google CEO Sundar Pichai announced a number of new technologies. Besides the eye-catching holographic video chat technology Project Starline, which gives users the feeling of being “teleported” into the same room, there was also the latest generation of Google’s AI chip, TPU v4.
“This is the fastest system we have ever deployed at Google, and it is a historic milestone for us,” Pichai said.
The most powerful TPU yet: twice the speed, 10 times the interconnect bandwidth
According to Google’s official introduction, at the same 64-chip scale and excluding any gains from software, TPU v4 delivers on average 2.7 times the performance of the previous-generation TPU v3.
In practice, TPU v4 chips do their work as part of a Pod. Each TPU v4 Pod contains 4096 TPU v4 chips. Thanks to Google’s unique interconnect technology, hundreds of independent processors can be combined into a single system, with interconnect bandwidth 10 times that of any other networking technology at scale.
Each TPU v4 Pod delivers exaFLOP-class computing power, that is, 10^18 floating-point operations per second. That is about twice the performance of the world’s fastest supercomputer, “Fugaku”.
“If 10 million people were using their laptops at the same time, the combined computing power of all those computers would just about reach 1 exaFLOP. Previously, reaching 1 exaFLOP might have required building a custom supercomputer,” Pichai said.
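The figures quoted above can be checked with simple arithmetic. The following is a minimal Python sketch that uses only the numbers given in this article (1 exaFLOP per Pod, 4096 chips per Pod, 10 million laptops). The names are illustrative, and the per-chip and per-laptop results are derived averages, not specifications published by Google.

```python
# Back-of-the-envelope check of the Pod figures quoted in the article.
# All inputs come from the article; the derived values are rough averages,
# not official specifications.

EXAFLOP = 10**18          # floating-point operations per second in 1 exaFLOP
CHIPS_PER_POD = 4096      # TPU v4 chips in one Pod
LAPTOPS = 10_000_000      # laptops in Pichai's comparison


def per_unit_flops(total_flops: int, units: int) -> float:
    """Average FLOP/s each unit must contribute to reach total_flops."""
    return total_flops / units


per_chip = per_unit_flops(EXAFLOP, CHIPS_PER_POD)
per_laptop = per_unit_flops(EXAFLOP, LAPTOPS)

print(f"Per TPU v4 chip: {per_chip / 1e12:.0f} TFLOP/s")
print(f"Per laptop:      {per_laptop / 1e9:.0f} GFLOP/s")
```

Under these assumptions, each chip would contribute roughly 244 TFLOP/s on average, and each laptop in the comparison would contribute about 100 GFLOP/s.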
The results of this year’s MLPerf show that the strength of Google TPU v4 should not be underestimated. In the image classification training test on the ImageNet dataset (to an accuracy of at least 75.90%), 256 TPU v4 chips completed the task in 1.82 minutes. That is nearly as fast as 768 NVIDIA A100 graphics cards combined with 192 AMD EPYC 7742 cores (1.06 minutes), or 512 of Huawei’s AI-optimized Ascend 910 chips combined with 128 Intel Xeon Platinum 8168 cores (1.56 minutes).
TPU v4 also scored very well when training the Transformer-based reading comprehension model BERT on a large Wikipedia corpus. Training with 256 TPU v4 chips takes 1.82 minutes, which is more than 1 minute slower than the 0.39 minutes needed with 4096 TPU v3 chips.
Meanwhile, reaching a training time of 0.81 minutes on NVIDIA hardware takes 2048 A100 cards and 512 AMD EPYC 7742 CPU cores.
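One rough way to compare the MLPerf figures quoted above is to multiply the number of accelerators by the training time. The sketch below does this in Python using only the numbers in this article. “Accelerator-minutes” is an illustrative measure assumed here, not an official MLPerf metric; it ignores the host CPU cores and the fact that training does not scale linearly with chip count.

```python
# Accelerator-minutes = number of accelerators x training time in minutes.
# Illustrative metric only: it is not an official MLPerf measure, and it
# ignores host CPUs and non-linear scaling.

# (benchmark, system, accelerator count, training minutes), as quoted in the article
RESULTS = [
    ("ImageNet", "TPU v4", 256, 1.82),
    ("ImageNet", "A100", 768, 1.06),
    ("ImageNet", "Ascend 910", 512, 1.56),
    ("BERT", "TPU v4", 256, 1.82),
    ("BERT", "TPU v3", 4096, 0.39),
    ("BERT", "A100", 2048, 0.81),
]


def accelerator_minutes(count: int, minutes: float) -> float:
    """Total accelerator time consumed by one training run."""
    return count * minutes


for benchmark, system, count, minutes in RESULTS:
    total = accelerator_minutes(count, minutes)
    print(f"{benchmark:<9}{system:<11}{count:>5} x {minutes:.2f} min = {total:8.2f}")
```

By this measure the 256 TPU v4 chips use roughly 466 accelerator-minutes on each task, compared with roughly 814 (A100) and 799 (Ascend 910) on ImageNet, and roughly 1597 (TPU v3) and 1659 (A100) on BERT.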
At the I/O conference Google also showed concrete examples of AI models that can make use of TPU v4: MUM (Multitask Unified Model), which can process multiple kinds of data such as web pages and images at the same time, and LaMDA, a model designed for dialogue. The former is 1000 times more powerful than the reading comprehension model BERT and is suited to powering search engines so that users find the information they want more efficiently, while the latter can hold uninterrupted conversations with humans.
This TPU, which is not sold externally, will soon be deployed in Google’s data centers, where many of the TPU v4 Pods will run on about 90% carbon-free energy.
Google also said that TPU v4 Pods will be made available to Google Cloud customers later this year.
Google’s in-house TPU: four generations in five years
In 2016 Google announced its first custom in-house AI chip, a departure from the CPU-plus-GPU combination most commonly used to train and deploy AI models. The first-generation TPU powered AlphaGo in its world-famous man-versus-machine Go match. By defeating Lee Sedol it shot to fame overnight and showed that GPUs are not the only chips that can handle training and inference.
Google’s first-generation TPU was built on a 28nm process and consumed about 40W of power. It was suitable only for deep learning inference. Besides AlphaGo, it was also used for machine learning models such as those behind Google Search and Google Translate.
In May 2017, Google released TPU v2, which supports both training and inference of machine learning models. It reached 180 TFLOPS of floating-point compute with improved memory bandwidth, running AI workloads 30 times faster than CPUs launched in the same period and 15 times faster than GPUs. World Go champion Ke Jie, who was defeated by an AlphaGo running on four TPU v2 chips, felt this more directly than anyone.
In May 2018, Google released the third-generation TPU v3, which has twice the performance of the previous generation, reaching 420 TFLOPS of floating-point compute with 128GB of high-bandwidth memory.
Following its once-a-year update rhythm, Google should have launched the fourth-generation TPU in 2019. Instead, at that year’s I/O conference, Google launched second- and third-generation TPU Pods, which can be configured with more than 1,000 TPUs and greatly shorten the time required to train complex models.
In the history of AI chip development, Google’s TPU is a rare technological innovation in both on-chip memory and programmability. It broke the GPU “monopoly” and opened up a new competitive landscape for cloud AI chips.
Five years into its development, Google’s TPU remains highly competitive. What will the world look like in the future? Google’s TPU has already given us a small part of the answer.