Breakthrough tech can no longer rely on CPUs

In 1993, Intel released the first Pentium processor. By today’s standards, it was primitive. At the time, it represented a major and exciting leap forward, even if it didn’t quite match the Macintosh’s PowerPC 601 chip in sheer number-crunching performance.

The 1990s and 2000s were, generally speaking, a great time for the CPU industry. Every other year, there was something new. Intel’s MMX promised better performance in media and gaming applications. In 2003, AMD and IBM brought 64-bit computing to the masses with the Athlon 64 and PowerPC G5, respectively. Multicore processors followed a few years later.

It would be impossible to list each breakthrough (or near-breakthrough, like Intel’s promising-though-doomed Xeon Phi coprocessor) in this introduction. The point is, this was a time when CPUs got measurably better with each passing year. Each hardware upgrade cycle brought new possibilities for consumers and businesses alike.

And then, it stopped. Or, if we’re being charitable, the pace of advancement slowed to a crawl. For developers, this presents a serious dilemma. We can no longer rely on CPUs to deliver the performance we need in the long term. To build the next generation of high-performance applications, we need to design for hardware acceleration from the outset.

The Root of the Problem

For nearly 15 years, we’ve heard warnings, and complaints, about the slumping pace of CPU development. In 2010, the MIT Technology Review published a piece headlined “Why CPUs Aren’t Getting Any Faster,” which blamed the slowdown on a number of factors: insufficient memory bandwidth, thermal and power constraints, and a prioritization of integrated graphics over raw CPU performance.

Similar articles have cropped up over the years. Arguably the most significant alarm bell came in 2016, when Intel formally declared it had abandoned the “Tick-Tock model” that had defined its product road map for almost a decade.

In the Tick-Tock model, each new generation focused on either a new manufacturing process (a “tick”) or a new underlying microarchitecture (a “tock”). One followed the other, usually at 12- to 18-month intervals. This model created a steady cadence of improvements, with ticks delivering better energy efficiency and thermal performance, and tocks providing new capabilities.

Although Intel framed the change as a move to a three-stage “process, architecture, optimization” approach, which adds a dedicated optimization phase to the product development cycle, it’s also true that the company simply couldn’t continue at its previous velocity. As semiconductor manufacturing processes shrink, it becomes harder and costlier to improve further.

For context, the first Pentium 4 processor, released in November 2000, used a 180nm manufacturing process. The final Pentium 4, released in early 2006, used a 65nm process, a nearly two-thirds reduction in just over five years. By contrast, it took TSMC, the world’s largest semiconductor fabricator, nearly three years to go from 5nm to 3nm.

Again, this is just one factor in why CPU development has slowed; others are equally significant. For example, CPU manufacturers still have to contend with power consumption and heat dissipation, which, even though careful chip design can mitigate them to a degree, impose hard limits on how much performance you can squeeze out of a CPU.

And there are other, smaller factors that have dented CPU performance, too. I won’t dwell on these too much, but from time to time we encounter CPU-level flaws that introduce security vulnerabilities. Meltdown, which primarily affected Intel CPUs as well as some POWER and ARM chips, is a good example of this.

Although the underlying flaws that allowed Meltdown were addressed in subsequent chip generations, mitigating it on existing chips resulted in a 5% to 30% performance hit in certain scenarios.

Ultimately, the cause isn’t nearly as interesting as the solution—shifting our workloads away from CPUs to hardware accelerators, like GPUs.

Computing Beyond Moore’s Law

This isn’t an entirely new idea. GPU acceleration for general-purpose computing tasks has existed for over two decades at this point, with the first significant research taking place in 2003.

What’s changed—and what makes this moment particularly exciting—is the increasing ease of development for GPU-accelerated software, the growing capabilities of GPUs, and the increasing prevalence of tasks that can benefit from the unique characteristics of GPUs.

Initially, writing general-purpose code for GPUs required a low-level understanding of how the GPU worked, and a level of perseverance. The GPU manufacturers of the early 2000s—namely ATI and Nvidia, but also Matrox and Silicon Integrated Systems—didn’t have general-purpose computing in mind. They were making cards for gamers.

In essence, you had to work around these companies’ original intent. They didn’t stop you, but they didn’t help you either. That changed in 2007, when Nvidia released the CUDA platform, which provided a (relatively) simple avenue for developers to write general-purpose code for certain graphics cards. Over the years, CUDA has grown in capability, and it now competes against similar products from Intel and AMD (oneAPI and ROCm, respectively).
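
To give a sense of what that avenue looks like, here’s a minimal sketch of the workflow CUDA introduced: copy data to the GPU, launch a kernel across thousands of threads, then copy the results back. It’s illustrative only, and assumes a CUDA-capable GPU and the nvcc toolchain.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread adds a single pair of elements; the GPU runs thousands of these in parallel.
__global__ void vectorAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;                       // one million elements
    const size_t bytes = n * sizeof(float);

    // Allocate and initialize host data.
    float *hA = (float *)malloc(bytes), *hB = (float *)malloc(bytes), *hC = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    // Allocate device memory and copy the inputs over.
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vectorAdd<<<blocks, threads>>>(dA, dB, dC, n);

    // Copy the result back and spot-check it.
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hC[0]);                // expect 3.0

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```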

The barriers to entry that existed in 2003 are now, for the most part, gone. While I wouldn’t say that writing GPU-accelerated code is easy, it’s more accessible than ever.

Meanwhile, the cards themselves are growing more capable at a seemingly exponential rate. This is true for consumer-centric cards as well as those intended for server and enterprise environments. The Nvidia A100, for example, offers up to 80GB of high-bandwidth memory and roughly 2TB/s of memory bandwidth, and it packs 6,912 general-purpose CUDA cores. That’s many multiples more than even the most capable server and workstation processors, which typically top out at 128 cores.
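
If you want to see what your own card offers, the CUDA runtime can report these figures. Here’s a short sketch; the bandwidth figure is a rough peak derived from the reported memory clock and bus width, and the exact numbers will vary by device and toolkit version.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query the first visible GPU

    // Streaming multiprocessors (SMs); each SM contains many CUDA cores.
    printf("Device:            %s\n", prop.name);
    printf("SM count:          %d\n", prop.multiProcessorCount);
    printf("Global memory:     %.1f GB\n", prop.totalGlobalMem / 1e9);

    // Approximate peak bandwidth: 2 transfers per clock * memory clock (kHz) * bus width (bytes).
    double gbPerSec = 2.0 * prop.memoryClockRate * 1e3 * (prop.memoryBusWidth / 8.0) / 1e9;
    printf("Approx. bandwidth: %.0f GB/s\n", gbPerSec);
    return 0;
}
```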

These attributes lend themselves well to tasks that benefit from parallelization. Essentially, if a task can be broken down into small pieces that can be executed simultaneously, it will typically complete faster on a GPU than on a CPU. And there are many, many examples that we’re already seeing today, from AI training and inference, to data processing, to analytics and scientific computing.
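
SAXPY (y = a*x + y) is a classic illustration: every iteration of the loop is independent, so each can become its own GPU thread. The sketch below shows the CPU and GPU versions side by side. It assumes device buffers (the hypothetical dX and dY) have already been allocated and populated, and in practice the speedup depends on data size and the cost of moving data to and from the card.

```cuda
// CPU version: the n iterations run one after another on a single core.
void saxpy_cpu(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

// GPU version: each iteration is independent, so it becomes its own thread.
// With millions of elements, tens of thousands of threads run concurrently
// across the GPU's compute cores.
__global__ void saxpy_gpu(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Launch example (dX and dY are device pointers set up beforehand):
//   saxpy_gpu<<<(n + 255) / 256, 256>>>(n, 2.0f, dX, dY);
```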

Life Beyond the CPU

Beyond GPUs, there’s a growing range of specialist hardware designed to perform specific tasks faster than a conventional CPU can. Dedicated AI accelerators are the most visible example: they’re found in phones and laptops, as well as in discrete cards in servers, and they’re built specifically to speed up machine learning and AI workloads.

We’re already seeing a flurry of activity around purpose-specific semiconductors, from the cloud hyperscalers building AI chips (Google’s TPU, Microsoft’s Maia) to startups implementing the open-source RISC-V architecture.

When trying to build these purpose-specific semiconductors, the options have always been limited: either undertake the incredibly costly process of building an application-specific integrated circuit (ASIC) from scratch, or tweak a licensed ARM design and pay development and royalty fees.

In an ideal world, we would build an ASIC for every algorithm or workload. Unfortunately, this is cost-prohibitive. It requires an army of engineers to build software that leverages the proprietary ASIC, constant investment to keep up with the latest silicon fabrication technologies and semiconductor design techniques, and economies of scale to reach reasonable price points.

Licensing from ARM mitigates most of the risks of an ASIC, but it comes with hefty development and royalty fees.

RISC-V offers the best of both worlds: its instruction set architecture (ISA) is open source and free to use, which mitigates the risks of building an ASIC from scratch. This has enabled countless companies to build new state-of-the-art processors, including SiFive, which is building RISC-V components for Google’s latest generation of TPUs.

As with GPU-accelerated software development, the barriers to using RISC-V have dropped over the past decade, making it an increasingly viable option for companies that want to accelerate their workloads. And for the right workloads, these purpose-built chips can deliver performance orders of magnitude beyond what a general-purpose CPU can offer.

I’d add that their appeal isn’t just that they’re fast now. It’s that they’re going to keep getting better as time goes on. Hardware accelerators are not subject to the same constraints as general-purpose hardware like a CPU.

There’s nothing preventing Nvidia from adding more cores or dramatically more memory to its next generation of GPUs, for example, beyond what its customers are willing to pay for. These devices are, ultimately, owned top to bottom by their manufacturers, who can add and improve as they see fit.

And this makes them a compelling, long-term solution.

The question then becomes: Are developers ready to take advantage of the opportunity? It’s clear that CPU development has, if not stalled, then slowed to the point of stalling. CPUs will never achieve the exponential performance increases we’ve seen from GPUs, RISC-V, and other dedicated accelerator hardware.

The onus is now on us—as engineers—to change course. The future of software is accelerated.
