According to Wccftech, NVIDIA’s next-generation Feynman GPUs, potentially arriving by 2028, could incorporate Groq’s LPUs (Language Processing Units) as separate dies stacked on the main compute chip. This integration would leverage TSMC’s hybrid bonding technology, akin to how AMD adds 3D V-Cache to its CPUs. The main Feynman compute die is expected to be built on TSMC’s A16 (1.6nm) node, while the stacked LPU dies would contain large banks of SRAM. This architectural gamble is a core part of NVIDIA’s strategy to dominate the AI inference market, following an IP licensing agreement with Groq. The report cites analysis from GPU expert AGF, who argues this approach avoids the cost and scaling issues of building SRAM monolithically on an advanced node. However, significant challenges around thermal management, execution logic, and software integration with CUDA remain.
The Stacked-Die Strategy
So, here’s NVIDIA’s apparent play. Instead of baking everything into one monstrous piece of silicon, they’re looking at a multi-die sandwich. The main course is the Feynman compute die on a bleeding-edge A16 process, packed with tensor cores and all the usual GPU goodness. The side dish? A separate LPU die, loaded with SRAM, stacked right on top using TSMC’s fancy SoIC hybrid bonding. Think of it like AMD’s X3D tech, but for AI acceleration. The logic, as AGF points out, is pretty sound. SRAM doesn’t scale well with advanced nodes, so making a giant slab of it on 1.6nm is a waste of expensive silicon real estate. This way, they can optimize each die for its purpose. And with backside power delivery on A16, the front of the chip is free for those crucial, low-latency vertical connections to the memory. It’s clever. But is it too clever?
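To make AGF’s cost argument concrete, here’s a back-of-envelope sketch. Every number in it is an illustrative placeholder (the SRAM area, per-square-millimeter costs, and bonding overhead are assumptions, not figures from the report); the point is simply that SRAM area barely shrinks on newer nodes, so fabbing it at leading-edge prices buys you almost nothing.

```cuda
// Back-of-envelope sketch: why a stacked SRAM die on a mature node can beat
// building the same SRAM monolithically on the leading-edge node.
// All numbers below are illustrative assumptions, not reported figures.
#include <cstdio>

int main() {
    const double sram_area_mm2       = 300.0;  // SRAM block that has to live somewhere (assumed)
    const double cost_per_mm2_a16    = 0.50;   // leading-edge node, $/mm^2 (assumed)
    const double cost_per_mm2_mature = 0.15;   // older node for the stacked die, $/mm^2 (assumed)
    const double bonding_overhead    = 20.0;   // hybrid-bonding cost per stack, $ (assumed)

    // SRAM cells barely scale, so the area is roughly the same on either node;
    // only the price of the silicon underneath it changes.
    const double monolithic_cost = sram_area_mm2 * cost_per_mm2_a16;
    const double stacked_cost    = sram_area_mm2 * cost_per_mm2_mature + bonding_overhead;

    printf("SRAM built monolithically on A16:  $%.0f\n", monolithic_cost);
    printf("SRAM on a stacked mature-node die: $%.0f\n", stacked_cost);
    return 0;
}
```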
The Engineering Marathon Ahead
Now, let’s talk about the mountain of problems this creates. First up: heat. You’re stacking a power-hungry compute die and then plopping another active die on top of it. Thermal density is already a nightmare in modern chips, and this just cranks it to eleven. How do you keep a stacked, LPU-topped Feynman chip cool under sustained inference workloads? That’s a huge, unanswered question. Then there’s the execution model conflict. LPUs, like Groq’s, are all about deterministic, fixed-order execution for raw throughput. GPUs, and especially NVIDIA’s CUDA ecosystem, are built on flexibility and hardware abstraction. Melding these two philosophies won’t be a simple software update. It will require a fundamental rethinking of how data moves and tasks are scheduled.
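For a rough sense of why the heat question looms so large, here’s a toy calculation. The die area and power figures are placeholders assumed purely for illustration; nothing here comes from NVIDIA or the report.

```cuda
// Toy thermal sketch: stacking an active LPU die adds heat over the same
// footprint, and the compute die's heat now has extra silicon between it and
// the cooler. All figures are assumed for illustration only.
#include <cstdio>

int main() {
    const double footprint_mm2 = 800.0;  // assumed compute-die area
    const double compute_watts = 700.0;  // assumed compute-die power
    const double lpu_watts     = 150.0;  // assumed stacked LPU-die power

    const double density_alone   = compute_watts / footprint_mm2;
    const double density_stacked = (compute_watts + lpu_watts) / footprint_mm2;

    printf("Compute die alone:    %.2f W/mm^2\n", density_alone);
    printf("With stacked LPU die: %.2f W/mm^2\n", density_stacked);
    // On top of the raw density increase, the bottom die's heat has to pass
    // through (or around) the die above it, so junction temperatures rise
    // even if total power were held constant.
    return 0;
}
```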
CUDA’s Existential Question
This is where it gets really tricky. The biggest hurdle might not be the hardware at all; it’s the software. NVIDIA’s entire empire is built on CUDA. But CUDA is designed to let programmers largely stop worrying about where data physically lives in memory. LPU-style execution, in contrast, often requires explicit memory placement to hit those insane throughput numbers. So, does NVIDIA ask its army of developers to completely change how they code? Or does it build a monstrously complex compiler and scheduler to hide the LPU’s intricacies? Either path is a herculean task. They’d have to maintain backward compatibility while pushing developers toward new, optimized methods. It’s a tightrope walk over a canyon. Basically, they need to pull off an engineering marvel to make this LPU-GPU hybrid feel seamless. But look, if anyone has the resources and stubbornness to try it, it’s NVIDIA.
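To see what that abstraction looks like in practice, here’s a minimal CUDA sketch using unified memory: one pointer works on host and device, and the runtime decides where the pages physically live and migrates them on demand. The kernel and sizes are hypothetical; it’s only meant to show the hands-off placement model that LPU-style explicit staging would cut against.

```cuda
// Minimal unified-memory example (hypothetical workload). The programmer
// never says whether x sits in host DRAM or device memory; the CUDA runtime
// handles placement and migration.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;  // no notion here of which memory tier x occupies
}

int main() {
    const int n = 1 << 20;
    float *x = nullptr;

    // One allocation, valid on both host and device; the driver migrates
    // pages on demand instead of the programmer staging data explicitly.
    cudaMallocManaged(&x, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;

    scale<<<(n + 255) / 256, 256>>>(x, 2.0f, n);
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);  // expect 2.000000
    cudaFree(x);
    return 0;
}
```

An LPU-style toolchain, by contrast, typically wants the programmer or compiler to pin each tensor to a specific bank of on-chip SRAM and schedule its movement cycle by cycle, which is exactly the burden CUDA was built to hide.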
Why Bother? The Inference Prize
So why would NVIDIA sign up for this pain? The answer is simple: control and market dominance. Training AI models gets the headlines, but inference—running those models—is where the real, sustained money is. Every chatbot query, every image generation, every autonomous vehicle decision is inference. By integrating LPU tech, NVIDIA isn’t just making a faster GPU. It’s trying to own the entire inference stack, from the silicon up. They’re seeing competitors like Groq touting raw speed on specific workloads and saying, “We can do that, but inside our universe.” The 2028 timeline is aggressive, but it shows how seriously they’re taking this threat. If they pull it off, they could lock in the next decade of AI infrastructure. If they stumble, it opens a door for challengers. The stakes couldn’t be higher.
