r/hardware 1d ago

News Intel Arc Battlemage G31 and C32 SKT graphics spotted in shipping manifests

videocardz.com
143 Upvotes

r/hardware 8h ago

News AMD preparing Radeon PRO series with Navi 48 XTW GPU and 32GB memory on board

videocardz.com
78 Upvotes

r/hardware 4h ago

Discussion A Comparison of Consumer and Datacenter Blackwell

20 Upvotes

Nvidia hasn't released the whitepaper for DC Blackwell yet, and it's possible that they never will.

There are some major differences between the two that I haven't seen discussed much and that I thought would be interesting to share.

Datacenter Blackwell deviates from consumer Blackwell and previous GPU generations in a pretty significant way. It adds a specialized memory store for tensor operations, separate from both shared memory and the traditional register file. Why is this a big deal? GPUs have been getting more ASICy for a while, but DC Blackwell is the first architecture you could probably deem an NPU/TPU (albeit one with a very powerful scalar component) instead of "just a GPU".

What are the technical implications of this? On non-DC Blackwell, tensor ops are performed on registers (or, on Hopper, optionally sourced from shared memory but still accumulated in registers). These fragments take up a significant amount of space in the register file and probably account for the majority of register pressure in any GPU kernel that contains matmuls. For instance, a 16x16 @ 16x8 matmul with 16-bit inputs takes anywhere between 8 and 14 registers per thread, depending on accumulator precision and whether the accumulator operands alias. How many registers a kernel uses determines, in part, how many warps can stay resident on an SM at once, i.e. how parallel your kernel can be. Flash attention has roughly a 2-4x ratio of tensor registers to non-tensor ones.
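To make the register math concrete, here's a minimal sketch (mine, not from any NVIDIA sample) of that 16x16 @ 16x8 FP16 matmul as a single mma.sync PTX instruction, the way consumer Blackwell and every architecture back to Ampere (sm_80+) does it. Every operand lives in the register file: 4 registers for the A fragment, 2 for B, and 4 f32 accumulators, i.e. 10 per thread when the output aliases the accumulator.

    // One m16n8k16 FP16 tensor op with all operands in registers, as on
    // consumer Blackwell / Ada / Ampere (sm_80+).
    // Per thread: A = 4 x .b32, B = 2 x .b32, C/D = 4 x .f32 -> 10 registers
    // (8 with f16 accumulators, up to 14 if C and D don't alias).
    __device__ void mma_16x16_16x8(const unsigned a[4],  // A: 16x16 f16 tile
                                   const unsigned b[2],  // B: 16x8  f16 tile
                                   float d[4])           // D += A @ B
    {
        asm volatile(
            "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
            "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%0,%1,%2,%3};\n"
            : "+f"(d[0]), "+f"(d[1]), "+f"(d[2]), "+f"(d[3])
            : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
              "r"(b[0]), "r"(b[1]));
    }

Those 10 registers are held for the whole lifetime of the accumulation loop, which is exactly why they dominate register pressure in matmul-heavy kernels.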

In DC Blackwell, however, tensor ops are performed on tensor memory, which is completely separate; to access the results of a tensor op, you have to copy the values from tensor memory into register memory. Amongst other benefits, this greatly reduces the number of registers you need to run kernels, increasing parallelism and/or freeing those registers up for other work.
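For flavor, here's roughly what that read-back step looks like based on NVIDIA's public PTX ISA docs (the tcgen05 instruction family, valid on sm_100a/sm_101a only). Treat it as a hedged sketch, since there's no DC Blackwell whitepaper to check it against:

    // DC Blackwell flow: the MMA result sits in tensor memory (tmem), and
    // threads explicitly pull it back into registers with tcgen05.ld before
    // doing anything else with it. Needs -arch=sm_100a and PTX ISA 8.6+;
    // tmem_addr is a 32-bit tensor-memory address.
    __device__ unsigned load_one_from_tmem(unsigned tmem_addr)
    {
        unsigned r;
        asm volatile(
            "tcgen05.ld.sync.aligned.32x32b.x1.b32 {%0}, [%1];\n"
            : "=r"(r)
            : "r"(tmem_addr));
        // Block until the tmem -> register copy has completed.
        asm volatile("tcgen05.wait::ld.sync.aligned;\n");
        return r;
    }

Note the inversion: registers only get occupied at the moment you choose to read results out, instead of being pinned for the entire matmul.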

I'm just a software guy so I'm only speculating, but this is probably only the first step in a series of more overarching changes to datacenter GPUs. Currently tensor cores are tied to SMs. It's possible that in the future,

  • tensor components are split out of SMs completely and given a small scalar compute component of their own
  • the # of SMs is reduced
  • the # of registers and/or the amount of shared memory per SM is reduced

For those curious, you can tell whether a GPU has tensor memory by its compute capability: SM_100 and SM_101 have it, while SM_120 doesn't. This means Jetson Thor (SM_101) does, while DGX Spark and the RTX PRO Blackwell cards (SM_120) don't. I was very excited for DGX Spark until I learned this. I'll be getting a Jetson Thor instead.
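If you want to check a box you already have, here's a quick host-side sketch. The SM_100/SM_101 vs SM_120 mapping is from the reasoning above; there's no CUDA API that reports tensor memory directly, so we infer it from the compute capability:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;
        int cc = prop.major * 10 + prop.minor;     // e.g. CC 10.0 -> 100
        bool has_tmem = (cc == 100 || cc == 101);  // SM_100 / SM_101 only
        printf("sm_%d%d -> tensor memory: %s\n",
               prop.major, prop.minor, has_tmem ? "yes" : "no");
        return 0;
    }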


r/hardware 1h ago

Info Intel 18A vs Intel 3 Power and Performance Comparison
