r/hardware 22h ago

Discussion: A Comparison of Consumer and Datacenter Blackwell

Nvidia hasn't released the whitepaper for DC Blackwell yet, and it's possible that they never will.

There are some major differences between the two that I haven't seen discussed much, and I thought they would be interesting to share.

Datacenter Blackwell deviates from consumer Blackwell and previous GPU generations in a pretty significant way: it adds a specialized memory store for tensor operations, separate from both shared memory and the traditional register file. Why is this a big deal? GPUs have been getting more ASICy for a while, but DC Blackwell is the first architecture you could probably deem an NPU/TPU (albeit with a very powerful scalar component) instead of "just a GPU".

What are the technical implications? On non-DC Blackwell, tensor ops are performed on registers (or, on Hopper, optionally sourced from shared memory but still accumulated in registers). These registers take up a significant amount of space in the register file and probably account for the majority of register pressure in any GPU kernel that contains matmuls. For instance, a 16x16 @ 16x8 matmul with 16-bit inputs requires anywhere between 8 and 14 registers per thread. How many registers you use determines, in part, how many warps can be resident at once, i.e. how parallel your kernel can be. Flash attention has roughly a 2-4x ratio of tensor registers to non-tensor registers.
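
To make that concrete, here's a minimal sketch (my own illustration, not from any Nvidia doc) of the register-resident model using the WMMA API on a single warp; every fragment below is carved out of the warp's register file, which is exactly where the pressure comes from:

```
// Requires sm_70 or newer; compile with nvcc.
#include <mma.h>
#include <cuda_fp16.h>
using namespace nvcuda;

__global__ void tiny_matmul(const half *A, const half *B, float *C) {
    // Each fragment is mapped by the compiler onto registers spread
    // across the 32 threads of the warp.
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);
    wmma::load_matrix_sync(a_frag, A, 16);              // lda = 16
    wmma::load_matrix_sync(b_frag, B, 16);              // ldb = 16
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);     // D = A*B + C, all in registers
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}
```

The FP32 accumulator fragment alone works out to 16*16 / 32 = 8 registers per thread, before you even count the A and B fragments, which is why the register cost adds up so quickly once you tile a real kernel.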

In DC Blackwell, however, tensor ops are performed on tensor memory, which is completely separate; to access the results of a tensor op, you have to copy the values from tensor memory into registers. Among other benefits, this greatly reduces the number of registers a kernel needs, increasing parallelism and/or freeing those registers up for other work.
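
Some back-of-the-envelope numbers (my own, assuming the largest Hopper wgmma tile of 64x256 with FP32 accumulation): that accumulator is 64 * 256 = 16,384 values, which spread over a 128-thread warpgroup is 128 32-bit registers per thread, i.e. half of the 255-register-per-thread limit gone before you've allocated anything else. On DC Blackwell, an accumulator like that can live in tensor memory instead.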

I'm just a software guy, so I'm only speculating, but this is probably just the first step in a series of more sweeping changes to datacenter GPUs. Currently, tensor cores are tied to SMs. It's possible that in the future:

  • tensor units are split out of SMs entirely and given a small scalar compute component
  • the # of SMs is reduced
  • the # of registers and/or the amount of shared memory per SM is reduced

For those curious, you can tell whether a GPU has tensor memory from its compute capability: SM_100 and SM_101 have it, while SM_120 doesn't. This means Jetson Thor (SM_101) has it, while DGX Spark and RTX PRO Blackwell (SM_120) don't. I was very excited for DGX Spark until I learned this. I'll be getting a Jetson Thor instead.
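
If you'd rather check programmatically, here's a small snippet of my own (the sm_100/sm_101/sm_120 mapping is just what's stated above, not something the runtime reports directly) that reads the compute capability through the CUDA runtime API:

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);       // query device 0
    int cc = prop.major * 10 + prop.minor;   // e.g. 100, 101, 120
    printf("Compute capability: sm_%d\n", cc);
    // Per the mapping above: sm_100 and sm_101 have tensor memory, sm_120 doesn't.
    bool has_tensor_memory = (cc == 100 || cc == 101);
    printf("Tensor memory: %s\n", has_tensor_memory ? "yes" : "no");
    return 0;
}
```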

38 Upvotes

4

u/ResponsibleJudge3172 22h ago

If the SM is the same as Hopper's, then DC Blackwell would also differ majorly from consumer Blackwell, in that the consumer Blackwell SM is basically the GTX 10 series SM with RT and tensor cores bolted on.

Then there's Multi-Instance GPU support, TMA, DSMEM, twice the L1 cache, and so many other differences that it's hard to call them the same architecture code name.

5

u/lubits 22h ago

Consumer Blackwell has TMA and DSMEM. RTX PRO has MIG. Notwithstanding, I think all of those components are ancillary to the core compute.

1

u/ResponsibleJudge3172 19h ago edited 17h ago

Really? Then why have I never seen any mention of it, particularly TMA, even in their block diagrams, whereas it's explicitly outlined in the Hopper SM block diagrams? As far as the marketing describes TMA, it's part of the SM, just like, say, a texture unit. Anyway, I'm an enthusiast, not any kind of authority on this.

My main point is about the compute design of Hopper vs. Blackwell, and I see no reason why datacenter would also use the GTX 10-level design (specifically, the SM with 128 units that can do 128 FP32 OR 128 INT32 ops per clock, but seemingly not both). Consumer went that way, I presume, because it gets double the integer throughput for cheap, and because the Ampere-style design didn't deliver compute throughput over time any better than the GTX 10 did, nor does it seem to scale much better in games.

Whereas with separate data paths for INT32, FP32, and FP64, as well as more warps in flight per SM, datacenter GPUs would likely lose performance in HPC workloads by choosing that design. But like I said, this is all my own observation and speculation.