AI Silicon Opportunity Is Changing – Part 2

Beyond The Hype
Dec 23, 2024
The article “AI Model Scaling Slowdown: Hardware Implications” discussed the impact of the AI scaling slowdown on the AI hardware market, while “AI Accelerator Hardware Update” discussed how various AI hardware developments impact the key players. As the facts on the ground increasingly support the scaling hypothesis previously discussed at Beyond The Hype, these topics take on even more importance. This article steps beyond those discussions and looks at other emerging AI hardware trends and how they are likely to impact the key players.

The AI hardware space is exhibiting hypergrowth and, within a few short years, is already bigger than any other semiconductor vertical. With a TAM already over $100B and growing at an estimated 60% to 70% CAGR, the competition is, and will be, intense. As such, Beyond The Hype is forecasting that the industry will see the most cutthroat competition the semiconductor industry has ever seen. Given the expected intensity of competition, investors should not trust any CEO making big claims about customer lock-in or long-term revenues. No one’s market share is safe, and no one’s design wins are safe.
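For a rough sense of what that growth rate implies, the short Python sketch below compounds a $100B base at both ends of the cited 60% to 70% CAGR range. This is back-of-envelope only: the base and growth rates come from the paragraph above, while the five-year horizon is an illustrative assumption.

```python
# Back-of-envelope projection of AI hardware TAM under the CAGR range
# cited above. The $100B base and 60-70% growth rates are the article's
# estimates; the five-year horizon is an illustrative assumption.

BASE_TAM_B = 100  # starting TAM in billions of dollars

for cagr in (0.60, 0.70):
    projection = [BASE_TAM_B * (1 + cagr) ** year for year in range(6)]
    formatted = ", ".join(f"${tam:,.0f}B" for tam in projection)
    print(f"CAGR {cagr:.0%}: {formatted}")

# CAGR 60%: $100B, $160B, $256B, $410B, $655B, $1,049B
# CAGR 70%: $100B, $170B, $289B, $491B, $835B, $1,420B
```

If the cited growth rates hold, even the low end of the range puts the TAM near $1T within five years, which is consistent with the intensity of competition forecast above.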

Even the company with the biggest market share and software moat, Nvidia (NVDA), is not safe from direct or indirect attacks. To its credit, Nvidia management understood the size of the opportunity, and the risks that come with it, before anyone else, and CEO Jensen Huang has been pushing the envelope on multiple fronts to stay ahead. Nvidia has:

- Pushed the performance envelope with advanced designs on advanced processes.

- Accelerated the product roadmap to a one-year cadence instead of a two-year cadence (a strategy that is inherently risky, with the risks playing out in the Blackwell rollout delays).

- Gone vertical on the technology stack to create a higher barrier to entry.

- Used various contractual tools and bundling techniques to lock in revenues as far out as possible.

The strategy has been working because Nvidia was ahead and the performance gains from one generation to the next are so large that the market is motivated to move rapidly to the latest generation of Nvidia hardware. The dynamic is especially pronounced in training. Frontier model training, which creates demand for hundreds of thousands of the highest-performance GPUs, runs its course in a few quarters at most. Once the training is done, the once-leading-edge hardware gets used mostly for inference. The next frontier model, due to scaling requirements, must use next-generation hardware.

Another dynamic that has been helping Nvidia is that it is easy to run inference on Nvidia chips if the model was trained on Nvidia chips. Running inference on any other chips means the inference software must be ported and optimized for the new hardware, an expense that customers without high-volume inference needs will not incur.

Given that Advanced Micro Devices (AMD) hardware is not suited for frontier model training, all inference, except at Google (GOOG) (GOOGL), is optimized at training time for Nvidia. This creates a situation where Nvidia’s training hardware moves down to absorb inference needs, or customers mostly buy Nvidia hardware for incremental inference needs.

By being the unquestioned training leader, Nvidia was capturing most of the inference demand by default. Note that Nvidia’s latest GPU production ramps hard while demand for its previous-generation GPUs fades rapidly. For example, the lead time for Blackwell is about a year, whereas previous-generation Hoppers can be procured with practically zero lead time.

This mode of operation works extremely well in Nvidia’s favor as long as training demand exceeds inference demand. In such a scenario, there is little scope for other players unless they can compete with high-performance training solutions. These dynamics mean the incremental hardware opportunity for other silicon suppliers has been small, and Nvidia’s structural strength gave it the opportunity to command usurious margins on its products.

But this MO is starting to break down for several reasons:

- Scaling slowdown reduces the need for exponentially larger training clusters, although training investments will continue. Readers should not take scaling slowdown to mean that training cluster sizes will stall at the current ~100K-GPU levels. The relevant question is how far CSPs are willing to grow cluster sizes without scaling needs driving that growth. Even if scaling stalls, it is always advantageous to compress training runs from the current several-month spans to something much shorter (see the sketch after this list). We can be confident that cluster sizes will grow to several hundred thousand GPUs. Whether companies will continue to chase million-GPU clusters is a more difficult question to answer.

- Training itself will see more competition from AMD, Broadcom (AVGO), and Marvell (MRVL). Increasingly, more training will occur away from Nvidia silicon.

- Inference, for many reasons including test-time compute, is growing much more rapidly than training and will create an opportunity for players who do not participate in training. Microsoft (MSFT) is the first example: it already deploys far more compute for inference than for training. Meta (META) is in a similar situation. There is a reason why Microsoft and Meta are the biggest customers for AMD’s MI300. The odds are high that in as little as 3 to 5 years, the inference hardware business will be 10 times bigger than training.

- Last but not least, Nvidia’s dominance has provoked an industry-level response. The largest consumers of AI hardware, the hyperscalers, have been working on alternatives. Google was the first of the group to field its own custom chip-level and system-level solution, with Amazon (AMZN), Meta, Microsoft, and others following.
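On the cluster-size question raised in the first bullet, a minimal sketch makes the tradeoff concrete. Under an idealized linear-scaling assumption, training time is roughly total compute divided by cluster throughput. All numbers below are illustrative assumptions, not figures from this article: a hypothetical frontier run of 5e25 FLOPs, ~2e15 FLOP/s sustained per accelerator, and 40% utilization.

```python
# A minimal sketch of the cluster-size vs. training-time tradeoff.
# All constants are illustrative assumptions: a hypothetical frontier
# run of 5e25 FLOPs, ~2e15 FLOP/s sustained per GPU, 40% utilization,
# and ideal (linear) scaling across the cluster.

TOTAL_FLOPS = 5e25    # assumed compute budget of a frontier training run
PER_GPU_FLOPS = 2e15  # assumed sustained throughput per accelerator
UTILIZATION = 0.40    # assumed real-world efficiency (model FLOPs utilization)

SECONDS_PER_DAY = 86_400

for gpus in (25_000, 100_000, 1_000_000):
    days = TOTAL_FLOPS / (gpus * PER_GPU_FLOPS * UTILIZATION) / SECONDS_PER_DAY
    print(f"{gpus:>9,} GPUs -> ~{days:.1f} days")

#    25,000 GPUs -> ~28.9 days
#   100,000 GPUs -> ~7.2 days
# 1,000,000 GPUs -> ~0.7 days
```

Real-world scaling is sublinear past some point (network topology, checkpointing overhead, failure rates), which is exactly why the million-GPU question is harder to answer than the arithmetic suggests.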

This article will focus on this last aspect and its implications for the AI hardware business.
