NVIDIA Hopper H100 Pictured: The World’s First 4nm GPU With the World’s Fastest HBM3 Memory

NVIDIA’s flagship datacenter GPU, the Hopper H100, is featured in all its glory. (image credit: CNET)

At GTC 2022, NVIDIA unveiled its Hopper H100 GPU, a compute powerhouse designed for next-generation data centers. It’s been a while since we talked about this powerful chip, but it looks like NVIDIA has given select media a close-up of its flagship chip.

NVIDIA Hopper H100 GPU: First With 4nm and HBM3 Technology Gets High-Resolution Pictures

CNET managed to capture not only the graphics board to which the H100 GPU is attached, but also the H100 chip itself. The H100 GPU is a monster chip built on the latest 4nm process, packing 80 billion transistors alongside bleeding-edge HBM3 memory. According to the tech outlet, the H100 sits on a PG520 PCB with over 30 power VRMs and a massive integrated interposer that uses TSMC’s CoWoS technology to combine the H100 GPU with a 6-stack HBM3 design.


NVIDIA Hopper H100 GPU images (Image credit: CNET):


Of the six stacks, one is kept disabled to ensure yield integrity, leaving five active. Even so, the new HBM3 standard allows up to 80 GB of capacity at 3 TB/s, which is insane. For comparison, the current fastest gaming graphics card, the RTX 3090 Ti, only offers 1 TB/s of bandwidth and 24 GB of VRAM. In addition, the H100 Hopper GPU supports the new FP8 data format, and its new SXM5 connection accommodates the 700W power design the chip is built around.

NVIDIA Hopper H100 GPU Specifications at a Glance

So as per the specifications, the NVIDIA Hopper GH100 GPU is composed of a massive 144 SM (Streaming Multiprocessor) chip layout arranged in a total of 8 GPCs. Each GPC contains 9 TPCs, and each TPC is made up of 2 SM units. This gives us 18 SMs per GPC and 144 across the complete 8 GPC configuration. Each SM is made up of 128 FP32 units, which gives us a total of 18,432 CUDA cores. The following are some of the configurations you can expect from the H100 chip:

The full implementation of the GH100 GPU consists of the following units:


8 GPC, 72 TPC (9 TPC/GPC), 2 SM/TPC, 144 SM per full GPU
128 FP32 CUDA cores per SM, 18432 FP32 CUDA cores per full GPU
4 4th generation Tensor cores per SM, 576 per full GPU
6 HBM3 or HBM2e stacks, 12 512-bit memory controllers
60 MB L2 Cache
Fourth Generation NVLink and PCIe Gen 5

The NVIDIA H100 GPU with SXM5 board form-factor consists of the following units:

8 GPC, 66 TPC, 2 SM/TPC, 132 SM per GPU
128 FP32 CUDA Cores per SM, 16896 FP32 CUDA Cores per GPU
4 4th generation Tensor cores per SM, 528 per GPU
80 GB HBM3, 5 HBM3 stacks, 10 512-bit memory controllers
50 MB L2 Cache
Fourth Generation NVLink and PCIe Gen 5
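
The unit counts in the two lists above follow from simple multiplication down the GPC > TPC > SM hierarchy. The sketch below reproduces them; the `GpuConfig` class and its field names are illustrative assumptions, and only the per-unit figures come from the lists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuConfig:
    """Hypothetical helper: derives totals from the per-unit counts listed above."""
    name: str
    tpcs: int                # enabled TPCs across all GPCs
    sms_per_tpc: int = 2     # 2 SM/TPC in both configurations
    fp32_per_sm: int = 128   # 128 FP32 CUDA cores per SM
    tensor_per_sm: int = 4   # 4 fourth-generation Tensor cores per SM

    @property
    def sms(self) -> int:
        return self.tpcs * self.sms_per_tpc

    @property
    def fp32_cores(self) -> int:
        return self.sms * self.fp32_per_sm

    @property
    def tensor_cores(self) -> int:
        return self.sms * self.tensor_per_sm


# Full GH100: 8 GPCs x 9 TPCs per GPC = 72 TPCs
full_gh100 = GpuConfig("GH100 (full)", tpcs=8 * 9)
# H100 SXM5: 66 of the 72 TPCs enabled
h100_sxm5 = GpuConfig("H100 (SXM5)", tpcs=66)

for gpu in (full_gh100, h100_sxm5):
    print(f"{gpu.name}: {gpu.sms} SMs, "
          f"{gpu.fp32_cores} FP32 cores, {gpu.tensor_cores} Tensor cores")
```

Going from 72 to 66 enabled TPCs accounts for the whole gap between the full die and the shipping SXM5 part, since the per-SM figures are the same in both lists.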

The full GH100’s 18,432 FP32 CUDA cores are a 2.25x increase over the full GA100 GPU configuration, which has 8,192. NVIDIA is also leveraging more FP64, FP16 and Tensor cores within its Hopper GPUs, which will greatly increase performance. And that’s going to be a necessity to rival Intel’s Ponte Vecchio, which is also expected to feature 1:1 FP64.

Cache is another area where NVIDIA has paid a lot of attention, bumping it up to 50 MB on the H100 (60 MB on the full GH100 GPU). That’s a 25% increase over the 40 MB of cache featured on the Ampere-based A100 and about 3 times the size of the cache on AMD’s flagship Aldebaran MCM GPU, the MI250X.

Rounding out the performance figures, NVIDIA’s GH100 Hopper GPU will offer 4000 TFLOPs of FP8, 2000 TFLOPs of FP16, 1000 TFLOPs of TF32 and 60 TFLOPs of FP64 compute performance. These record-breaking figures surpass all other HPC accelerators that have come before it. For comparison, it’s 3.3x faster than NVIDIA’s own A100 GPU and 28% faster than AMD’s Instinct MI250X in FP64 compute. In FP16 compute, the H100 GPU is 3x faster than the A100 and 5.2x faster than the MI250X, which is really bonkers.
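
One pattern in the quoted H100 figures is worth spelling out: across the Tensor-core formats, each step down in precision doubles the peak rate. A minimal sketch, using only the numbers quoted in the paragraph above (the dictionary and the `ratio` helper are illustrative names, not anything NVIDIA publishes):

```python
# Peak throughput in TFLOPs, as quoted in the article for the H100
peak_tflops = {"FP8": 4000, "FP16": 2000, "TF32": 1000, "FP64": 60}


def ratio(faster: str, slower: str) -> float:
    """How many times higher the quoted peak is for one format than another."""
    return peak_tflops[faster] / peak_tflops[slower]


print(f"FP8 vs FP16:  {ratio('FP8', 'FP16'):.1f}x")
print(f"FP16 vs TF32: {ratio('FP16', 'TF32'):.1f}x")
print(f"FP8 vs TF32:  {ratio('FP8', 'TF32'):.1f}x")
```

These are peak figures from the spec sheet, so they show how the formats relate to one another on paper and say nothing about sustained performance in a given workload.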

The PCIe variant, which is a cut-down model, was recently listed in Japan for over US$30,000, so one can imagine that the SXM variant with its beefier configuration would cost around US$50,000.

NVIDIA Hopper GH100 and Ampere GA100 GPU Based Data Center Accelerator Specs:

NVIDIA Tesla Graphics Card	NVIDIA H100 (SXM5)	NVIDIA H100 (PCIe)	NVIDIA A100 (SXM4)	NVIDIA A100 (PCIe4)	Tesla V100S (PCIe)	Tesla V100 (SXM2)	Tesla P100 (SXM2)	Tesla P100 (PCI-Express)	Tesla M40 (PCI-Express)	Tesla K40 (PCI-Express)
GPU	GH100 (Hopper)	GH100 (Hopper)	GA100 (Ampere)	GA100 (Ampere)	GV100 (Volta)	GV100 (Volta)	GP100 (Pascal)	GP100 (Pascal)	GM200 (Maxwell)	GK110 (Kepler)
Process Node	4nm	4nm	7nm	7nm	12nm	12nm	16nm	16nm	28nm	28nm
Transistors	80 billion	80 billion	54.2 billion	54.2 billion	21.1 billion	21.1 billion	15.3 billion	15.3 billion	8 billion	7.1 billion
GPU Die Size	814 mm²	814 mm²	826 mm²	826 mm²	815 mm²	815 mm²	610 mm²	610 mm²	601 mm²	551 mm²
SMs	132	114	108	108	80	80	56	56	24	15
TPCs	66	57	54	54	40	40	28	28	24	15
FP32 CUDA Cores Per SM	128	128	64	64	64	64	64	64	128	192
FP64 CUDA Cores Per SM	128	128	32	32	32	32	32	32	4	64
FP32 CUDA Cores	16896	14592	6912	6912	5120	5120	3584	3584	3072	2880
FP64 CUDA Cores	16896	14592	3456	3456	2560	2560	1792	1792	96	960
Tensor Cores	528	456	432	432	640	640	N/A	N/A	N/A	N/A
Texture Units	528	456	432	432	320	320	224	224	192	240
Boost Clock	TBD	TBD	1410 MHz	1410 MHz	1601 MHz	1530 MHz	1480 MHz	1329 MHz	1114 MHz	875 MHz
TOPs (DNN/AI)	2000 TOPs / 4000 TOPs	1600 TOPs / 3200 TOPs	1248 TOPs / 2496 TOPs with sparsity	1248 TOPs / 2496 TOPs with sparsity	130 TOPs	125 TOPs	N/A	N/A	N/A	N/A
FP16 Compute	2000 TFLOPs	1600 TFLOPs	312 TFLOPs / 624 TFLOPs with sparsity	312 TFLOPs / 624 TFLOPs with sparsity	32.8 TFLOPs	30.4 TFLOPs	21.2 TFLOPs	18.7 TFLOPs	N/A	N/A
FP32 Compute	1000 TFLOPs	800 TFLOPs	156 TFLOPs (19.5 TFLOPs standard)	156 TFLOPs (19.5 TFLOPs standard)	16.4 TFLOPs	15.7 TFLOPs	10.6 TFLOPs	10.0 TFLOPs	6.8 TFLOPs	5.04 TFLOPs
FP64 Compute	60 TFLOPs	48 TFLOPs	19.5 TFLOPs (9.7 TFLOPs standard)	19.5 TFLOPs (9.7 TFLOPs standard)	8.2 TFLOPs	7.80 TFLOPs	5.30 TFLOPs	4.7 TFLOPs	0.2 TFLOPs	1.68 TFLOPs
Memory Interface	5120-bit HBM3	5120-bit HBM2e	6144-bit HBM2e	6144-bit HBM2e	4096-bit HBM2	4096-bit HBM2	4096-bit HBM2	4096-bit HBM2	384-bit GDDR5	384-bit GDDR5
Memory Size	Up to 80 GB HBM3 @ 3.0 Gbps	Up to 80 GB HBM2e @ 2.0 Gbps	Up to 40 GB HBM2 @ 1.6 TB/s / Up to 80 GB HBM2 @ 1.6 TB/s	Up to 40 GB HBM2 @ 1.6 TB/s / Up to 80 GB HBM2 @ 2.0 TB/s	16 GB HBM2 @ 1134 GB/s	16 GB HBM2 @ 900 GB/s	16 GB HBM2 @ 732 GB/s	16 GB HBM2 @ 732 GB/s / 12 GB HBM2 @ 549 GB/s	24 GB GDDR5 @ 288 GB/s	12 GB GDDR5 @ 288 GB/s
L2 Cache Size	51200 KB	51200 KB	40960 KB	40960 KB	6144 KB	6144 KB	4096 KB	4096 KB	3072 KB	1536 KB
TDP	700W	350W	400W	250W	250W	300W	300W	250W	250W	235W
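
A quick way to read the table is to relate peak compute to the power budget. The sketch below divides the FP64 figures by the TDP values for the four newest cards. The numbers are copied from the table above, while the `gflops_per_watt` helper and the dictionary layout are illustrative assumptions. TDP is a rating and the TFLOPs are theoretical peaks, so treat the result as a rough paper comparison, not a measurement.

```python
# (peak FP64 TFLOPs, TDP in watts), copied from the comparison table above
cards = {
    "H100 (SXM5)": (60.0, 700),
    "H100 (PCIe)": (48.0, 350),
    "A100 (SXM4)": (19.5, 400),
    "A100 (PCIe)": (19.5, 250),
}


def gflops_per_watt(tflops: float, watts: int) -> float:
    """Convert a peak TFLOPs figure and a TDP rating into GFLOPs per watt."""
    return tflops * 1000.0 / watts


efficiency = {name: gflops_per_watt(*spec) for name, spec in cards.items()}

# Print the cards from most to least efficient on paper
for name, value in sorted(efficiency.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:12s} {value:6.1f} GFLOPs/W")
```

On these paper figures the 350W PCIe H100 comes out ahead of the 700W SXM5 part per watt, even though the SXM5 card has the higher absolute throughput.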

