NVIDIA’s flagship data center GPU, the Hopper H100, has been pictured in all its glory. (Image credit: CNET)
At GTC 2022, NVIDIA unveiled its Hopper H100 GPU, a compute powerhouse designed for next-generation data centers. It’s been a while since we talked about this powerful chip, but it looks like NVIDIA has given select media a close-up of its flagship chip.
NVIDIA Hopper H100 GPU Pictured in High Resolution: First With 4nm and HBM3 Technology
CNET managed to photograph not only the board on which the H100 GPU is mounted, but also the H100 chip itself. The H100 GPU is a monster chip built on the latest 4nm process technology, packing 80 billion transistors alongside bleeding-edge HBM3 memory. According to the tech outlet, the H100 sits on the PG520 PCB, which has over 30 power VRMs, and uses a massive integrated interposer based on TSMC’s CoWoS technology to combine the H100 GPU with a 6-stack HBM3 design.
NVIDIA Hopper H100 GPU images (Image credit: CNET):
Of the six stacks, one is kept inactive to ensure yield integrity, leaving five active stacks. Even so, the new HBM3 standard allows for up to 80 GB of capacity at 3 TB/s, which is insane. For comparison, the current fastest gaming graphics card, the RTX 3090 Ti, offers only 1 TB/s of bandwidth and 24 GB of VRAM. In addition, the Hopper H100 GPU supports the new FP8 data format, and its new SXM5 connection helps accommodate the 700W power design the chip is built around.
NVIDIA Hopper H100 GPU Specifications at a Glance
As per the specifications, the NVIDIA Hopper GH100 GPU is made up of a massive 144 SM (Streaming Multiprocessor) chip layout, arranged in a total of 8 GPCs. Each GPC contains 9 TPCs, and each TPC is made up of 2 SM units. This gives us 18 SMs per GPC and 144 SMs across the full 8 GPC configuration. Each SM is made up of 128 FP32 units, which gives us a total of 18,432 CUDA cores. The following are some of the configurations you can expect from the H100 chip:
The full implementation of the GH100 GPU consists of the following units:
- 8 GPC, 72 TPC (9 TPC/GPC), 2 SM/TPC, 144 SM per full GPU
- 128 FP32 CUDA cores per SM, 18432 FP32 CUDA cores per full GPU
- 4 4th generation Tensor cores per SM, 576 per full GPU
- 6 HBM3 or HBM2e stacks, 12 512-bit memory controllers
- 60 MB L2 Cache
- Fourth Generation NVLink and PCIe Gen 5
The NVIDIA H100 GPU with SXM5 board form-factor consists of the following units:
- 8 GPC, 66 TPC, 2 SM/TPC, 132 SM per GPU
- 128 FP32 CUDA Cores per SM, 16896 FP32 CUDA Cores per GPU
- 4 4th generation Tensor cores per SM, 528 per GPU
- 80 GB HBM3, 5 HBM3 stacks, 10 512-bit memory controllers
- 50 MB L2 Cache
- Fourth Generation NVLink and PCIe Gen 5
The 18,432 FP32 CUDA cores of the full GH100 are a 2.25x increase over the full GA100 GPU configuration. NVIDIA is also leveraging more FP64, FP16, and Tensor cores within its Hopper GPUs, which will greatly increase performance. That is going to be a necessity to rival Intel’s Ponte Vecchio, which is also expected to feature 1:1 FP64.
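The unit counts in the two lists above follow directly from the GPC/TPC/SM hierarchy, and a few lines of Python are enough to re-derive them. This is only a sketch: the `GpuConfig` class and its field names are made up for illustration, and the full GA100 figure (128 SMs with 64 FP32 cores each) is not stated in this article and is assumed from NVIDIA's published Ampere specifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuConfig:
    """Unit counts for one GPU configuration (hypothetical helper class)."""
    name: str
    tpcs: int                # enabled Texture Processing Clusters
    mem_controllers: int     # enabled 512-bit memory controllers
    sms_per_tpc: int = 2
    fp32_per_sm: int = 128
    tensor_per_sm: int = 4

    @property
    def sms(self) -> int:
        return self.tpcs * self.sms_per_tpc

    @property
    def fp32_cores(self) -> int:
        return self.sms * self.fp32_per_sm

    @property
    def tensor_cores(self) -> int:
        return self.sms * self.tensor_per_sm

    @property
    def bus_width_bits(self) -> int:
        return self.mem_controllers * 512


# Figures taken from the two lists above.
FULL_GH100 = GpuConfig("GH100 (full)", tpcs=8 * 9, mem_controllers=12)
H100_SXM5 = GpuConfig("H100 SXM5", tpcs=66, mem_controllers=10)

# Assumption (not from this article): full GA100 = 128 SMs x 64 FP32 cores.
FULL_GA100_FP32_CORES = 128 * 64

for cfg in (FULL_GH100, H100_SXM5):
    print(f"{cfg.name}: {cfg.sms} SMs, {cfg.fp32_cores} FP32 cores, "
          f"{cfg.tensor_cores} Tensor cores, {cfg.bus_width_bits}-bit bus")

print("FP32 core ratio vs full GA100:",
      FULL_GH100.fp32_cores / FULL_GA100_FP32_CORES)
```

Under those assumptions, the ratio printed on the last line is where the 2.25x figure comes from, and the 10 enabled memory controllers on the SXM5 part account for its 5120-bit memory interface.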
Cache is another area where NVIDIA has paid a lot of attention, bumping it up to 50 MB on the H100 (60 MB on the full GH100 GPU). That’s a 25% increase over the 40 MB of cache featured on the Ampere A100 GPU and roughly 3 times the size of the cache on AMD’s flagship Aldebaran MCM GPU, the MI250X.
Rounding up the performance figures, NVIDIA’s GH100 Hopper GPU will offer 4000 TFLOPs of FP8, 2000 TFLOPs of FP16, 1000 TFLOPs of TF32, and 60 TFLOPs of FP64 compute performance. These record-breaking figures surpass all other HPC accelerators that have come before it. For comparison, it’s 3.3x faster than NVIDIA’s own A100 GPU and 28% faster than AMD’s Instinct MI250X in FP64 compute. In FP16 compute, the H100 GPU is 3x faster than the A100 and 5.2x faster than the MI250X, which is really bonkers.
The PCIe variant, which is a cut-down model, was recently listed in Japan for over US$30,000, so one can imagine that the SXM variant with the beefier configuration would cost around $50 grand.
NVIDIA Hopper H100 vs. Previous-Generation Tesla GPU Specs:
| NVIDIA Tesla Graphics Card | NVIDIA H100 (SXM5) | NVIDIA H100 (PCIe) | NVIDIA A100 (SXM4) | NVIDIA A100 (PCIe4) | Tesla V100S (PCIe) | Tesla V100 (SXM2) | Tesla P100 (SXM2) | Tesla P100 (PCI-Express) | Tesla M40 (PCI-Express) | Tesla K40 (PCI-Express) |
|---|---|---|---|---|---|---|---|---|---|---|
| GPU | GH100 (Hopper) | GH100 (Hopper) | GA100 (Ampere) | GA100 (Ampere) | GV100 (Volta) | GV100 (Volta) | GP100 (Pascal) | GP100 (Pascal) | GM200 (Maxwell) | GK110 (Kepler) |
| Process Node | 4nm | 4nm | 7nm | 7nm | 12nm | 12nm | 16nm | 16nm | 28nm | 28nm |
| Transistors | 80 billion | 80 billion | 54.2 billion | 54.2 billion | 21.1 billion | 21.1 billion | 15.3 billion | 15.3 billion | 8 billion | 7.1 billion |
| GPU Die Size | 814 mm² | 814 mm² | 826 mm² | 826 mm² | 815 mm² | 815 mm² | 610 mm² | 610 mm² | 601 mm² | 551 mm² |
| SMs | 132 | 114 | 108 | 108 | 80 | 80 | 56 | 56 | 24 | 15 |
| TPCs | 66 | 57 | 54 | 54 | 40 | 40 | 28 | 28 | 24 | 15 |
| FP32 CUDA Cores Per SM | 128 | 128 | 64 | 64 | 64 | 64 | 64 | 64 | 128 | 192 |
| FP64 CUDA Cores Per SM | 128 | 128 | 32 | 32 | 32 | 32 | 32 | 32 | 4 | 64 |
| FP32 CUDA Cores | 16896 | 14592 | 6912 | 6912 | 5120 | 5120 | 3584 | 3584 | 3072 | 2880 |
| FP64 CUDA Cores | 16896 | 14592 | 3456 | 3456 | 2560 | 2560 | 1792 | 1792 | 96 | 960 |
| Tensor Cores | 528 | 456 | 432 | 432 | 640 | 640 | n/a | n/a | n/a | n/a |
| Texture Units | 528 | 456 | 432 | 432 | 320 | 320 | 224 | 224 | 192 | 240 |
| Boost Clock | TBD | TBD | 1410 MHz | 1410 MHz | 1601 MHz | 1530 MHz | 1480 MHz | 1329 MHz | 1114 MHz | 875 MHz |
| TOPs (DNN/AI) | 2000 TOPs / 4000 TOPs | 1600 TOPs / 3200 TOPs | 1248 TOPs / 2496 TOPs with sparsity | 1248 TOPs / 2496 TOPs with sparsity | 130 TOPs | 125 TOPs | n/a | n/a | n/a | n/a |
| FP16 Compute | 2000 TFLOPs | 1600 TFLOPs | 312 TFLOPs / 624 TFLOPs with sparsity | 312 TFLOPs / 624 TFLOPs with sparsity | 32.8 TFLOPs | 30.4 TFLOPs | 21.2 TFLOPs | 18.7 TFLOPs | n/a | n/a |
| FP32 Compute | 1000 TFLOPs | 800 TFLOPs | 156 TFLOPs (19.5 TFLOPs standard) | 156 TFLOPs (19.5 TFLOPs standard) | 16.4 TFLOPs | 15.7 TFLOPs | 10.6 TFLOPs | 10.0 TFLOPs | 6.8 TFLOPs | 5.04 TFLOPs |
| FP64 Compute | 60 TFLOPs | 48 TFLOPs | 19.5 TFLOPs (9.7 TFLOPs standard) | 19.5 TFLOPs (9.7 TFLOPs standard) | 8.2 TFLOPs | 7.80 TFLOPs | 5.30 TFLOPs | 4.7 TFLOPs | 0.2 TFLOPs | 1.68 TFLOPs |
| Memory Interface | 5120-bit HBM3 | 5120-bit HBM2e | 6144-bit HBM2e | 6144-bit HBM2e | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 4096-bit HBM2 | 384-bit GDDR5 | 384-bit GDDR5 |
| Memory Size | Up to 80 GB HBM3 @ 3.0 Gbps | Up to 80 GB HBM2e @ 2.0 Gbps | Up to 40 GB HBM2 @ 1.6 TB/s / Up to 80 GB HBM2 @ 1.6 TB/s | Up to 40 GB HBM2 @ 1.6 TB/s / Up to 80 GB HBM2 @ 2.0 TB/s | 16 GB HBM2 @ 1134 GB/s | 16 GB HBM2 @ 900 GB/s | 16 GB HBM2 @ 732 GB/s | 16 GB HBM2 @ 732 GB/s / 12 GB HBM2 @ 549 GB/s | 24 GB GDDR5 @ 288 GB/s | 12 GB GDDR5 @ 288 GB/s |
| L2 Cache Size | 51200 KB | 51200 KB | 40960 KB | 40960 KB | 6144 KB | 6144 KB | 4096 KB | 4096 KB | 3072 KB | 1536 KB |
| TDP | 700W | 350W | 400W | 250W | 250W | 300W | 300W | 250W | 250W | 235W |
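
For the older cards in the table, the standard (non-Tensor) FP32 figures can be roughly reproduced from the core count and boost clock, using the common rule of thumb that each CUDA core retires one fused multiply-add (two floating-point operations) per clock. The sketch below assumes that rule; the function name `peak_fp32_tflops` and the `CARDS` mapping are hypothetical and only restate values from the table. The H100 entries are left out because their boost clocks are still listed as TBD.

```python
def peak_fp32_tflops(cuda_cores: int, boost_mhz: float) -> float:
    """Theoretical peak FP32 throughput, assuming 2 FLOPs (one FMA) per core per clock."""
    flops_per_second = 2 * cuda_cores * boost_mhz * 1e6
    return flops_per_second / 1e12


# (FP32 CUDA cores, boost clock in MHz, FP32 TFLOPs as listed in the table)
CARDS = {
    "A100 (SXM4)": (6912, 1410, 19.5),
    "Tesla V100S (PCIe)": (5120, 1601, 16.4),
    "Tesla V100 (SXM2)": (5120, 1530, 15.7),
    "Tesla P100 (SXM2)": (3584, 1480, 10.6),
    "Tesla M40": (3072, 1114, 6.8),
    "Tesla K40": (2880, 875, 5.04),
}

for name, (cores, clock_mhz, listed) in CARDS.items():
    estimate = peak_fp32_tflops(cores, clock_mhz)
    print(f"{name}: estimated {estimate:.2f} TFLOPs, listed {listed} TFLOPs")
```

These are theoretical peaks, so they say nothing about sustained performance; the estimate is only useful as a sanity check on the core counts and clocks quoted in the table.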