After helping to define the modern internet era with Search and Android, Google is already at the forefront of the next wave in computing research and development: AI. Many consider artificial intelligence and neural network computing to be the next step for the industry, enabling new use cases and faster computation to tackle problems that are currently intractable. The search giant, which now calls itself an “AI first” company, has been leading the adoption of these new technologies in a number of ways.
Neural network algorithms and machine learning are already at the heart of many of Google’s services. They filter out spam in Gmail, optimize targeted advertising, and analyze your voice when you talk to Google Assistant or your Home speaker. Inside smartphones, features like Google Lens and Samsung’s Bixby are showing the power of “AI” vision processing. Even companies like Spotify and Netflix are using Google’s Cloud servers to tailor content to their users.
Google’s Cloud Platform is at the centre of its efforts (and those of third parties) to utilize this increasingly popular area of computing. However, this new field requires new kinds of hardware to run efficiently, and Google has invested heavily in its own processing hardware, which it calls a Cloud Tensor Processing Unit (Cloud TPU). This custom hardware is packed into Google’s servers and already powers the current and expanding AI ecosystem. But how does it work?
TPUs vs CPUs – searching for better efficiency
Google unveiled its second-generation TPU at Google I/O earlier this year, offering increased performance and better scaling for larger clusters. The TPU is an application-specific integrated circuit (ASIC). It’s custom silicon designed very specifically for a particular use case, rather than a general-purpose processing unit like a CPU. The unit is designed to handle common machine learning and neural network calculations for training and inference; specifically, matrix multiplies, dot products, and quantization transforms, which typically use just 8 bits of precision.
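To make those three operations concrete, here is a minimal NumPy sketch (not the TPU’s actual implementation, just an illustration of what the hardware accelerates): a matrix multiply built from dot products, followed by a simple linear quantization of the result down to 8 bits.

```python
import numpy as np

def quantize_uint8(x):
    """A simple linear quantization sketch: map a float array onto 0..255."""
    lo, hi = float(x.min()), float(x.max())
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, scale, lo

rng = np.random.default_rng(0)

# A tiny "layer": weights times activations, i.e. a grid of dot products.
weights = rng.standard_normal((4, 8)).astype(np.float32)
activations = rng.standard_normal((8, 3)).astype(np.float32)
result = weights @ activations  # the matrix multiply a TPU batches in hardware

# Compress the result to 8 bits, then reconstruct it to see the precision cost.
q, scale, zero = quantize_uint8(result)
dequantized = q.astype(np.float32) * scale + zero
error = np.max(np.abs(dequantized - result))  # bounded by about half the scale step
```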
While these kinds of calculations can be done on a CPU, and often more efficiently on a GPU, these architectures are limited in terms of performance and energy efficiency when scaling across operation types. For example, designs optimized for 8-bit integer multiplication can be up to 5.5X more energy efficient and 6X more area efficient than designs optimized for 16-bit floating-point multiplication. They’re also 18.5X more efficient in terms of energy and 27X smaller in terms of area than 32-bit floating-point multiply units. (IEEE 754 is the technical standard for floating-point arithmetic used in all modern CPUs.)
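The storage side of that trade-off is easy to demonstrate. This sketch uses NumPy dtypes as a stand-in for hardware number formats (the dtype names and the linear quantization scheme are illustrative choices, not TPU specifics): the same vector held as 8-bit integers takes a quarter of the memory of 32-bit floats, while a dot product computed from the quantized values stays close to the full-precision answer.

```python
import numpy as np

v = np.linspace(-1.0, 1.0, 1024)

as_fp32 = v.astype(np.float32)
as_fp16 = v.astype(np.float16)

# Linear quantization to signed 8-bit integers, scale chosen from the data range.
scale = np.max(np.abs(v)) / 127.0
as_int8 = np.round(v / scale).astype(np.int8)

# Memory footprint per format: 4096, 2048, and 1024 bytes respectively.
print(as_fp32.nbytes, as_fp16.nbytes, as_int8.nbytes)

# Dot product from the 8-bit values vs. the full-precision reference.
ref = v @ v
recon = as_int8.astype(np.float64) * scale
rel_err = abs(recon @ recon - ref) / abs(ref)  # stays within a few percent
```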