Friday after the close, The Information reported that Nvidia’s (NVDA) Blackwell platform is delayed by at least a quarter. The “at least” part is a key subtlety that seems to be lost on many of those discussing the subject on social networks.
“Nvidia’s upcoming artificial intelligence chips will be delayed by three months or more due to design flaws, a snafu that could affect customers such as Meta Platforms, Google and Microsoft that have collectively ordered tens of billions of dollars worth of the chips, according to two people who help produce the chip and server hardware for it.”
Nvidia investor relations issued a non-denial denial statement to Reuters and several other news outlets in response to The Information article saying:
“Hopper demand is very strong, broad Blackwell sampling has started, and production is on track to ramp in the second half”
Not only does this statement fail to address The Information story, it also seems to contrast with what CEO Jensen Huang said during the Q1 earnings call in May:
“While supply for H100 grew, we are still constrained on H200. At the same time, Blackwell is in full production”
Additional information and leaks that have come out since The Information article indicate that Nvidia may have two distinct problems (it is unclear whether they are related):
1. There is a chip-level problem affecting the Blackwell B100 and B200 dies that makes the 2-die solution problematic (note that unlike Hopper, Blackwell uses two back-to-back reticle-sized dies). While The Information suggests a delay of three months or more, given how semiconductor design goes, a 6-month delay is more likely. While it is not Nvidia’s current official plan, it is also entirely possible that the delay causes Nvidia to skip Blackwell altogether and move directly to Blackwell Ultra.
2. There appears to be a problem with the ramp of CoWoS-L, a new packaging technology, at TSMC (TSM). This will limit the number of Blackwell chips that can be manufactured. If this problem is related to the Blackwell re-spin, it is reasonable to assume that both will be solved at the same time. If not, then generally speaking, packaging problems tend to be easier to solve, and this one could be rectified before the chip-level problem is solved. In either scenario, by the time Nvidia has a revised version of the Blackwell dies, TSMC will likely have addressed the capacity problems.
Nvidia Product Competitiveness Implications
Before we get deep into the weeds, it is useful to discuss why Nvidia finds itself in this situation. Most investors and analysts are likely to see this as a one-off execution problem. But such analysis would be incorrect. Instead, what we are seeing is a direct result of a very aggressive product push. When the market potential is tens of billions of dollars per quarter, companies will take big risks to chase it, and sometimes the risks will bite. With that much revenue at stake each quarter, delays can have consequences running into the tens of billions of dollars. Note that Advanced Micro Devices (AMD) faced a similar delay with MI300, which launched about 6 months later than anticipated.
The key problem here is Nvidia’s much-ballyhooed one-year cadence. It is very tough to get to a one-year product cadence in semiconductors, especially with highly complex products like data center GPUs. For good reasons, semiconductor companies have historically settled on an 18-month or 24-month cadence. With a one-year cadence, there is no slack in the schedule. If anything goes wrong, it becomes immediately visible to customers.
It does not help that most chip-level problems take three to six months to fix (the time needed for two minor metal-layer fixes or a single all-layer change). Given this type of product cycle, there is also a high chance of a product generation becoming obsolete after just one or two iterations of a chip. But Nvidia took this high-risk path because it was looking to shake AMD and other competitors off its tail. AMD, in particular, was a big threat, with its MI300 family being superior to Hopper in the inference space. As AMD’s roadmap gained ground, Nvidia’s need to accelerate its own roadmap became stronger.
Nvidia now finds itself in a position where the risks of a one-year cadence have been realized. While it is theoretically possible that the Company is facing a one-quarter slip, a two-quarter slip is more realistic (that is typically how long it takes to implement and validate chip-level changes). Nvidia is not only making metal-layer changes to the Blackwell chip but will also have to adjust its board designs and rack designs to the new reality. Such changes would push the Blackwell generation uncomfortably close to the Blackwell Ultra generation.
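The schedule arithmetic behind this argument can be sketched in a few lines. This is an illustration only, not Nvidia’s actual roadmap: it uses the cadences (12, 18, and 24 months) and slip lengths (3 and 6 months) mentioned above, and it assumes the follow-on generation ships on time. The function and variable names are hypothetical.

```python
# Illustrative only: cadence and slip values come from the discussion above;
# the assumption that the successor generation ships on schedule is ours.

CADENCES_MONTHS = (12, 18, 24)   # product cadences discussed in the article
SLIPS_MONTHS = (3, 6)            # one-quarter and two-quarter slips


def remaining_window(cadence_months: int, slip_months: int) -> int:
    """Months a delayed generation remains the newest product before its
    successor arrives, assuming the successor is not delayed."""
    return max(cadence_months - slip_months, 0)


def window_table() -> dict:
    """Remaining window for every (cadence, slip) combination."""
    return {
        (cadence, slip): remaining_window(cadence, slip)
        for cadence in CADENCES_MONTHS
        for slip in SLIPS_MONTHS
    }


if __name__ == "__main__":
    for (cadence, slip), window in window_table().items():
        share = window / cadence
        print(f"{cadence}-month cadence, {slip}-month slip: "
              f"{window} months left ({share:.0%} of the cycle)")
```

Under these assumptions, a two-quarter slip consumes half of a one-year cycle but only a quarter of a 24-month cycle, which is why the same slip is far more visible on the faster cadence.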
Given this likelihood, Nvidia seems to have scrapped its Blackwell B100/B200 generation. Instead, Nvidia seems to be opting for a single-die Blackwell as a stopgap (in place of the previously planned 2-die Blackwell). This chip, called B200A, is a cut-down version of the original Blackwell. In addition to cutting the logic capabilities by almost half, the single-die solution also reduces HBM memory by half. This makes Blackwell not much better than the Hopper generation for inference.
As covered in recent Beyond The Hype articles, Nvidia Hopper is already disadvantaged compared to AMD’s MI300. With Meta Llama 3.1 heavily favoring AMD MI300 due to its large memory, and with MI325 on the horizon, Nvidia is at a massive disadvantage when it comes to inference. For now, H200 may be the best inference chip that Nvidia can offer. Even if Nvidia drops the price of H200 and the new Blackwell chips, it may not help much, as Nvidia is now fundamentally disadvantaged compared to AMD. The situation will only get worse when Meta launches Llama 4.
This is a very large setback for Nvidia. It is fair to say that Nvidia’s competitive fate now depends on AMD’s execution. AMD does not need to deliver anything beyond MI300 to stay competitive through mid-2024. If AMD can execute on its roadmap without slips, and deliver MI325 and MI350 on time, it will become an extremely strong player relative to Nvidia.
So, what does this news mean for Nvidia?