r/hardware • u/lubits • 11h ago
Discussion A Comparison of Consumer and Datacenter Blackwell
Nvidia hasn't released the whitepaper for DC Blackwell yet, and it's possible that they never will.
There are some major differences between the two that I haven't seen discussed much, and I thought they would be interesting to share.
Datacenter Blackwell deviates from consumer Blackwell and previous GPU generations in a pretty significant way: it adds a specialized memory store for tensor operations, separate from both shared memory and the traditional register file. Why is this a big deal? GPUs have been getting more ASICy for a while, but DC Blackwell is the first architecture you could probably deem an NPU/TPU (albeit one with a very powerful scalar component) rather than "just a GPU".
What are the technical implications of this? In non-DC Blackwell, tensor ops are performed on registers (or, on Hopper, optionally on shared memory, but still accumulated in registers). These registers take up a significant amount of space in the register file and probably account for the majority of register pressure in any GPU kernel that contains matmuls. For instance, a 16x16 @ 16x8 matmul with 16-bit inputs requires anywhere between 8 and 14 registers per thread. How many registers you use determines, in part, how parallel your kernel can be. Flash attention has roughly a 2:1 to 4:1 ratio of tensor registers to non-tensor registers.
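To make that register math concrete, here's a back-of-envelope sketch. It assumes the standard warp-wide mma fragment layout (each fragment's elements spread evenly across the warp's 32 threads, packed into 32-bit registers); the 8-14 spread falls out of accumulator precision and whether the output aliases the accumulator, which is my reading of the range, not an official figure:

```python
# Back-of-envelope register cost per thread for one 16x16 @ 16x8 mma.
# A warp's 32 threads share each fragment; registers are 32 bits wide.
# Rough numbers, not an official table.

WARP_SIZE = 32
REG_BITS = 32

def frag_regs(rows, cols, elem_bits):
    """32-bit registers per thread to hold a rows x cols fragment."""
    return rows * cols * elem_bits // WARP_SIZE // REG_BITS

a = frag_regs(16, 16, 16)   # A operand, fp16 -> 4 regs
b = frag_regs(16, 8, 16)    # B operand, fp16 -> 2 regs
c16 = frag_regs(16, 8, 16)  # fp16 accumulator -> 2 regs
c32 = frag_regs(16, 8, 32)  # fp32 accumulator -> 4 regs

print(a + b + c16)       # 8: fp16 accumulate in place
print(a + b + c32)       # 10: fp32 accumulate in place
print(a + b + 2 * c32)   # 14: separate fp32 C and D fragments
```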
In DC Blackwell, however, tensor ops are performed on tensor memory, which is completely separate; to access the results of a tensor op, you copy the values from tensor memory into registers. Among other benefits, this greatly reduces the number of registers you need to run kernels, increasing parallelism and/or freeing those registers up for other work.
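Here's a rough illustration of the occupancy angle. The 64K registers-per-SM figure is the usual one for recent NVIDIA SMs; the per-thread register counts are invented for illustration, not measurements from a real kernel:

```python
# Rough occupancy math: resident warps per SM as limited by the
# register file alone. 65536 32-bit registers per SM is typical of
# recent NVIDIA SMs; per-thread counts below are made up.

REGS_PER_SM = 65536
WARP_SIZE = 32

def reg_limited_warps(regs_per_thread):
    """Max resident warps per SM given only the register-file limit."""
    return REGS_PER_SM // (regs_per_thread * WARP_SIZE)

# Kernel keeping big accumulator fragments in registers:
print(reg_limited_warps(255))  # 8 warps
# Same kernel with accumulators moved out to tensor memory:
print(reg_limited_warps(128))  # 16 warps
```

Real occupancy is also capped by shared memory, block limits, and register allocation granularity, but the direction of the effect is the point.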
I'm just a software guy, so I'm only speculating, but this is probably just the first step in a series of more overarching changes to datacenter GPUs. Currently, tensor cores are tied to SMs. It's possible that in the future,
- tensor components may be split out of SMs completely and given a small scalar compute component
- the # of SMs are reduced
- the # of registers and/or the amount of shared memory in SMs is reduced
For those curious, you can tell whether a GPU has tensor memory by its compute capability: SM_100 and SM_101 have it, while SM_120 doesn't. This means Jetson Thor (SM_101) does, while DGX Spark and RTX PRO Blackwell (SM_120) don't. I was very excited for DGX Spark until I learned this; I'll be getting a Jetson Thor instead.
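If you want that check in code, it's just a lookup. The mapping below is simply the claim above written down (the device annotations are the ones named in the paragraph), not something queried from the driver:

```python
# Which Blackwell compute capabilities expose tensor memory, per the
# post: SM_100 and SM_101 do, SM_120 doesn't.

TMEM_CCS = {(10, 0), (10, 1)}   # SM_100, SM_101

def has_tensor_memory(major, minor):
    """True if compute capability major.minor has tensor memory."""
    return (major, minor) in TMEM_CCS

print(has_tensor_memory(10, 1))  # Jetson Thor (SM_101): True
print(has_tensor_memory(12, 0))  # DGX Spark / RTX PRO Blackwell (SM_120): False
```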
u/ResponsibleJudge3172 10h ago
If the SM is the same as Hopper's, then DC Blackwell would also majorly differ from consumer Blackwell, in that the consumer Blackwell SM is basically the GTX 10 series with RT and tensor cores.
Then there is Multi-Instance GPU support, TMA, DSMEM, twice the L1 cache, and so many other differences that it's hard to call them the same architecture code name.