GPU Life Concerns: Reality And Implications

Oct 26, 2024


Note: This article uses the terms GPUs, TPUs, and accelerators interchangeably, as they are substantially similar in the context of this article.

A tweet about recent comments by a Google (GOOG) (GOOGL) GenAI architect in an expert network interview has raised investor interest in the lifespan of GPUs. The predominant sentiment seems to be that a short GPU lifespan is a positive for Nvidia (NVDA), although some question the “lifespan will be three years at the most” aspect of the Google architect’s assessment. Before we discuss the validity and implications of the architect’s claim, see the publicly available part of the quote below:

First, note that this is one snippet of a larger interview, and the full interview was not made available. That aside, the first thing to consider is whether the claims have any merit.

It is well known that silicon and electronic components (the various parts that combine to make server circuit boards and systems) can deteriorate rapidly in highly stressed environments. Stress in this context mostly means operating chips at high current, high voltage, high temperature, and high utilization. Given that this is precisely how high-performance chips are operated, deterioration and failures are not exactly a surprise.

What has been changing over time is that leading edge chips, as they pack more transistors and push higher performance, are increasingly pushing the boundaries of power consumption and temperatures. The image below shows historic trends of various CPU attributes over time.

Evolution of electronic CMOS characteristics over time [41]. Transistor counts (orange triangles) are growing exponentially following Moore's law while performance growth is limited by power consumption. Single thread performance (blue circles) had been increasing by 60% per year until 2005 and slowed down to +20% per year after 2005. The operation frequency (green squares) is also limited due to power restrictions (after 2005). Typical power consumption (red triangles) and number of cores (black rhombuses) are also presented.
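To put the caption's growth rates in perspective, here is a minimal sketch of the compounding arithmetic. It assumes the quoted +60% and +20% figures are constant compound annual rates, which is a simplification of the plotted data, and the function names are illustrative rather than taken from any source.

```python
import math


def doubling_time_years(annual_growth: float) -> float:
    """Years needed to double at a constant compound annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


def cumulative_gain(annual_growth: float, years: int) -> float:
    """Total multiplier after compounding the annual rate for `years` years."""
    return (1 + annual_growth) ** years


# Rates as quoted in the figure caption; treating them as constant is an assumption.
for label, rate in [("until 2005 (+60%/yr)", 0.60), ("after 2005 (+20%/yr)", 0.20)]:
    print(
        f"{label}: doubles about every {doubling_time_years(rate):.1f} years, "
        f"about {cumulative_gain(rate, 10):.0f}x over a decade"
    )
```

Under that assumption, single thread performance doubled roughly every 1.5 years before 2005 and roughly every 3.8 years afterwards, which helps explain why chip designers have leaned on higher power budgets and more cores to keep performance growing.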

About a decade back, leading edge compute devices were consuming about 100W. During the crypto wave a few years back, the industry witnessed high failure rates with mining GPUs, which were run around the clock at peak performance to generate cryptocurrency tokens. These chips used to consume between 200W and 500W. While there are no known records of failure rates, there is no shortage of anecdotal evidence about GPUs failing in as little as 2 to 3 years. GPUs and AI accelerators are now being pushed to the limit more than mining GPUs, Intel (INTC) Xeons, Advanced Micro Devices (AMD) EPYCs, or any other electronic devices in the history of high-performance computing. The Blackwell generation’s leading-edge accelerators are expected to consume 1000W to 1500W.
