Your computer can do anything, and therefore is optimized for nothing. What's the benefit of a chip that focuses on just one task?
The central processing unit (CPU) of your computer is a silicon microchip that works exceptionally well because data can be transferred quickly over small distances. In chip fabrication, a process called lithography, the defect rate scales roughly with the size of the chip: the larger the chip's surface, the more likely it is faulty and has to be thrown away. It is therefore crucial to fit as many transistors as possible onto the smallest surface. The resulting miniaturization of transistors has been the industry trend for decades. Every year or two, transistors shrank enough that the same chip area could hold twice as many. This exponential growth in transistor count, known as Moore's law, is the reason for the incredibly powerful general-purpose computers we have today. Doubling the number of transistors every two years, for many decades in a row, results in tens of billions of transistors on a chip today, compared to just a thousand or so in 1970. Unfortunately, Moore's law is ending. Transistors cannot be miniaturized forever; at some point there is simply no more room. At these length scales we leave the familiar dimensions behind and enter the spooky quantum world.
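As a quick back-of-the-envelope check of that doubling claim (illustrative arithmetic only, not taken from the original sources):

```python
# Sanity check of the Moore's law arithmetic: start from on the order of a
# thousand transistors in 1970 and double every two years until 2020.
transistors_1970 = 1_000
doublings = (2020 - 1970) // 2               # 25 doublings in 50 years
print(transistors_1970 * 2 ** doublings)     # ~34 billion: the right order of magnitude
```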
At the same time, chip manufacturers designed chips to fit the growing demands of the software we use. Each chip implements a set of low-level instructions. These instructions, such as loading a value from memory or adding two numbers, form the basis set of operations from which all the software we use is ultimately built. In 1978, Intel's processor included 91 instructions. Over time, additional instructions were added, so that today's Intel processors support around 1,500 different operations. Each of these instructions is carefully designed to perform some task as efficiently as possible. However, all of them take up scarce space on the chip.
This raises the question of whether we need all that complexity. After all, the surface of a chip is extremely valuable. If a particular task needs only a small number of instructions, we could turn the chip into a specialized powerhouse by removing most of the others. It's like a factory: instead of having workers run around doing a thousand different things, Ford introduced the assembly line, where each station was designed for one particular task. The assembly line changed the game in manufacturing. The same is happening in the chip industry. If we can't innovate by making transistors ever smaller, we can optimize chip design by rethinking how the silicon real estate is spent. Welcome to the world of accelerators.
Let's start with the most obvious one. Graphics processing units (GPUs) have changed the game for deep learning. GPUs were developed for the gaming industry to do one thing well: vector arithmetic. When rendering graphics, the same operation is performed over and over again, once for every pixel on the screen. In the past decade, another application for GPUs showed up. The computational workload of training a deep learning model consists, for the most part, of a single algorithm: backpropagation. Because backpropagation is essentially vector arithmetic, deep learning models can run on GPUs.
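To make that concrete, here is a minimal sketch (using PyTorch, which this post does not otherwise assume) of a single training step. Every expensive operation in it, the matrix multiplications of the forward pass and the gradients computed by backpropagation, is vector arithmetic, which is exactly what a GPU accelerates.

```python
# One training step of a tiny network; the same code runs on CPU or GPU.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU if no GPU

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).to(device)

x = torch.randn(64, 1024, device=device)          # a batch of 64 input vectors
y = torch.randint(0, 10, (64,), device=device)    # 64 target labels

loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()   # backpropagation: again just matrix/vector arithmetic
```

Note that nothing in the model code changes when the device switches from CPU to GPU; the speedup comes purely from where the vector arithmetic is executed.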
With the rise of deep learning, the need for compute power grew rapidly. The computational resources used to produce top-notch AI models have been doubling every 3.4 months on average; between 2012 and 2018, they increased roughly 300,000-fold. Large neural networks, like those used in natural language processing, can have hundreds of millions or even billions of parameters; BERT, for example, has over 300 million. Besides the hardware costs, all this computation is responsible for significant greenhouse gas emissions: training a single large AI model can emit as much carbon as five cars over their lifetimes.
Although GPUs made the deep learning revolution possible, further optimization is possible. After all, GPUs were designed for graphics, not for backpropagation. In recent years, other accelerators have entered the market. One of them is Google's tensor processing unit (TPU), which, compared to a GPU, is designed for high volumes of low-precision arithmetic at minimal energy consumption. GPU functionality that machine learning does not need, such as the silicon for rasterization, has been left out, so TPUs are optimized for machine learning only. In the future, we will see a whole variety of AI accelerators.
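What "low-precision arithmetic" buys can be illustrated in a few lines. This is a NumPy-only sketch of 8-bit quantization, not how a TPU is actually programmed: weights and activations are rounded to 8-bit integers, multiplied, and accumulated in 32 bits, trading a little accuracy for far cheaper and more energy-efficient silicon.

```python
# Low-precision (int8) matrix multiply with 32-bit accumulation.
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((256, 256)).astype(np.float32)   # float32 weights
x = rng.standard_normal((1, 256)).astype(np.float32)     # one input vector

def quantize(t):
    """Symmetric quantization to int8 with a single scale per tensor."""
    scale = np.abs(t).max() / 127.0
    return np.round(t / scale).astype(np.int8), scale

wq, w_scale = quantize(w)
xq, x_scale = quantize(x)

# Integer matrix multiply, accumulated in int32, then rescaled back to float.
y_int = xq.astype(np.int32) @ wq.astype(np.int32)
y_approx = y_int * (w_scale * x_scale)

y_exact = x @ w
print(np.max(np.abs(y_exact - y_approx)))   # small quantization error
```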
Given that specialized hardware can provide order-of-magnitude speedups and that vast amounts of compute are spent on repetitive tasks, it makes sense to scan applications for other compute-intensive operations. That is exactly what Google did. In a paper published in 2015, Google reported on profiling the computational workload of its datacentres. Although the workload seems extremely diverse at first, spanning thousands of different applications, patterns emerged. A cross-application, microarchitectural analysis showed that a handful of common building blocks account for a large share of all cycles: across the thousands of applications running in Google's datacentres, roughly 30 percent of compute time went to these few operations. Optimizing them (memory allocation, hashing, compression, and data movement) can therefore lead to significant performance gains. The paper draws two interesting conclusions. First, optimizing common building blocks across a variety of applications may yield more speedup than hunting for optimizations within single applications. Second, such common building blocks are excellent candidates for hardware accelerators.
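The same idea can be made tangible at the scale of a single program. The workload below is a toy, hypothetical one, not taken from Google's paper: profile it, and notice that generic building blocks like hashing and compression dominate the runtime even though they are not what we think of as the application's "real" work.

```python
# Profile a small job and see the generic building blocks in the hot path.
import cProfile
import hashlib
import zlib

def business_logic(records):
    # Pretend application code: store each record under a digest, compressed.
    store = {}
    for record in records:
        key = hashlib.sha256(record).hexdigest()       # hashing: a common building block
        store[key] = zlib.compress(record, level=6)    # compression: another one
    return store

records = [bytes([i % 256]) * 10_000 for i in range(2_000)]
cProfile.run("business_logic(records)", sort="cumtime")
```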
Finally, a third example where hardware accelerators may provide significant speedups is knowledge graphs. Knowledge graphs are data structures that emphasize the relations between concepts. Unlike relational databases, where data is stored as rows in tables, graph databases store data as the edges and vertices that make up a graph. Knowledge graphs are extremely promising because they provide a natural substrate for many algorithms. The reason is twofold. First, graphs are intuitive: as humans, we tend to think in concepts and the relations between them, and because graphs are built up in the same fashion, algorithms that run on graph data are easier to explain. Second, graphs allow for fast traversal between nodes, which becomes crucial for algorithms that need to visit a large part of the graph. Recommendation models, for example, find a natural implementation on knowledge graphs: central to producing recommendations is finding similar nodes, which requires the algorithm to visit many nodes of the graph.
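As a minimal sketch (pure Python, with a hypothetical toy graph), here is the kind of node-similarity computation a recommender relies on: represent the graph as an adjacency structure and score nodes by how many neighbours they share.

```python
# A tiny "user likes item" graph stored as adjacency sets.
graph = {
    "alice": {"matrix", "inception", "blade_runner"},
    "bob":   {"matrix", "inception", "alien"},
    "carol": {"alien", "blade_runner"},
}

def jaccard(a, b):
    """Similarity between two nodes based on shared neighbours."""
    na, nb = graph[a], graph[b]
    return len(na & nb) / len(na | nb)

# Recommend to Alice what her most similar user also likes.
most_similar = max((u for u in graph if u != "alice"), key=lambda u: jaccard("alice", u))
recommendations = graph[most_similar] - graph["alice"]
print(most_similar, recommendations)   # bob {'alien'}
```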
The performance of graph algorithms boils down to how many traversals (hops between nodes) can be made per second (TPS). Native graph databases store direct memory pointers between connected nodes, which lets a graph algorithm jump rapidly from one memory address to the next. This "pointer hopping" is fast on conventional CPUs, easily allowing for millions of TPS. Still, pointer hopping requires only a small subset of the chip's instructions, so it is worth looking into specialized hardware designed for pointer hopping and pointer hopping only. And because each hop fans out into several new hops, the problem can be processed massively in parallel. Graphcore is an exciting start-up that leverages this idea. Instead of wasting chip surface on silicon that graph workloads never use, it packs six thousand cores running 115,000 threads onto a chip to analyze graph data concurrently. In this way, hardware built for knowledge graphs makes it possible to run algorithms on billions of nodes in near real time. Now that's parallelism.
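A breadth-first traversal makes that fan-out visible (pure Python, hypothetical graph): each frontier of the search is a wave of independent hops, and on a massively parallel accelerator every hop in that wave could run on its own thread.

```python
# Breadth-first "pointer hopping" through an adjacency list, one frontier at a time.
graph = {0: [1, 2], 1: [3, 4], 2: [5, 6], 3: [], 4: [7], 5: [7], 6: [], 7: []}

def bfs_frontiers(start):
    visited = {start}
    frontier = [start]
    while frontier:
        yield frontier                        # one "wave" of hops
        next_frontier = []
        for node in frontier:                 # on an accelerator, these hops run concurrently
            for neighbour in graph[node]:
                if neighbour not in visited:
                    visited.add(neighbour)
                    next_frontier.append(neighbour)
        frontier = next_frontier

for level, wave in enumerate(bfs_frontiers(0)):
    print(level, wave)   # 0 [0] / 1 [1, 2] / 2 [3, 4, 5, 6] / 3 [7]
```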
These are just some examples of the full range of accelerators we will see. We will have to get used to a wide variety of hardware that accelerates particular tasks; the time of one-size-fits-all is over. Parallel computation on graphics processing units (GPUs), tensor processing units (TPUs), or intelligence processing units (IPUs) will benefit machine learning and other applications. Further out, quantum processing units (QPUs), photonic processing units (PPUs), and neuromorphic processing units (NPUs) promise order-of-magnitude performance gains or entirely new fields of application. Some of these accelerators are still in full development and will take significant time to mature. Others are available today and already provide substantial speedups. In public clouds such as AWS and Azure, accelerators and the required toolkits are becoming standard, and for these the business case is easy to make. Either way, preparing now for a heterogeneous computing landscape pays off.
This blog has been authored by Julian van Velzen
I am an enthusiastic big data engineer with a strong background in computational physics. As the leading consultant for the quantum exploration center, I take clients on a journey into the exciting era of quantum computing.