After announcing its Tesla A100 GPU and early details of Ampere, Nvidia has published a more comprehensive post covering what's new in the architecture and the specifications of the GA100 GPU on which the Tesla A100 is based.
The most interesting detail revealed by the post is that the Tesla A100 does not use the full GA100 die, but only about 7/8 of it. The complete die has the following specifications:
GA100 Specifications
- 8 GPCs, 8 TPCs/GPC, 2 SMs/TPC, 16 SMs/GPC, 128 SMs per full GPU
- 64 FP32 CUDA Cores/SM, 8192 FP32 CUDA Cores per full GPU
- 4 third-generation Tensor Cores/SM, 512 third-generation Tensor Cores per full GPU
- 6 HBM2 memory stacks, 12 512-bit memory controllers
That works out to a 6144-bit bus and up to 48 GB of HBM2, with a bandwidth of up to 1.866 TB/s at the 1215 MHz memory clock the Tesla A100 runs at.
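As a sanity check, the bandwidth figure follows directly from the bus width and the memory clock. A minimal sketch of the arithmetic (it assumes each HBM2 stack exposes a 1024-bit interface, i.e. two of the 512-bit controllers per stack, and that HBM2 transfers data twice per clock):

```python
# Back-of-the-envelope check of the full-GA100 memory figures above.
STACKS = 6            # HBM2 stacks on the full die
BITS_PER_STACK = 1024 # assumed: two 512-bit controllers per stack (12 total)
CLOCK_MHZ = 1215      # HBM2 clock; double data rate -> 2 transfers per cycle

bus_width = STACKS * BITS_PER_STACK                   # total bus width, bits
bandwidth_gbs = bus_width / 8 * 2 * CLOCK_MHZ / 1000  # bytes/transfer * transfers/us

print(bus_width)             # 6144-bit bus
print(round(bandwidth_gbs))  # ~1866 GB/s, i.e. about 1.866 TB/s
```

The cut-down Tesla A100 ships with only five of the six stacks active, which is why its advertised bandwidth comes in below this full-die ceiling.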
RT Cores, raster units, video outputs, and NVENC encoders are not included, as the chip is aimed squarely at AI workloads.
GA100 SM Architecture
- Third Generation Tensor Cores
- Acceleration for all types of data, including FP16, BF16, TF32, FP64, INT8, INT4 and Binary
- The Tensor Cores' TF32 operations provide an easy way to accelerate FP32 input/output data in Deep Learning and High Performance Computing frameworks, running up to 10x faster than the Tesla V100's FP32 FMA operations, or up to 20x faster with sparse matrices.
- The FP16/FP32 mixed-precision Tensor Cores provide unprecedented processing power for Deep Learning, running up to 2.5x faster than Volta's Tensor Cores, and up to 5x faster with sparse matrices.
- FP64 operations on the Tensor Cores run up to 2.5x faster than the Tesla V100's DFMA FP64 operations.
- INT8 operations with sparse matrices offer unprecedented processing power for Deep Learning inference, running up to 20x faster than INT8 operations on the Tesla V100.
- 192 KB of combined shared memory and L1 data cache, 1.5x larger than in a Tesla V100 SM
- New asynchronous copy instruction that loads data directly from global memory into shared memory, optionally bypassing the L1 cache and eliminating the need to stage the data through the register file.
- New shared memory barrier unit (asynchronous barrier) for use in conjunction with the new asynchronous copy instruction.
- New instructions for L2 cache management and residency controls.
- New programming improvements to reduce software complexity.
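To make the TF32 point above concrete: TF32 keeps FP32's 8-bit exponent (so the dynamic range is unchanged and FP32 tensors can be fed in as-is) but reduces the mantissa to 10 bits. A minimal Python sketch of that reduced precision, modelled here as simple bit truncation (the actual Tensor Core hardware rounds rather than truncates, so this is only an approximation):

```python
import struct

def to_tf32(x: float) -> float:
    """Approximate TF32 by zeroing the low 13 mantissa bits of a float32,
    leaving 1 sign bit, 8 exponent bits, and 10 mantissa bits."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    bits &= 0xFFFFE000  # keep sign + exponent + top 10 mantissa bits
    (y,) = struct.unpack("<f", struct.pack("<I", bits))
    return y

print(to_tf32(1.5))         # 1.5 fits in 10 mantissa bits -> unchanged
print(to_tf32(3.14159265))  # precision beyond ~3 decimal digits is lost
```

Because the exponent field is untouched, no range clamping or loss scaling is needed, which is what makes TF32 a drop-in acceleration path for existing FP32 code.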
Ampere undoubtedly brings major improvements, and we have not yet seen the complete architecture, only part of it. Nvidia is also expected to introduce second-generation RT Cores and a new version of NVENC, so stay tuned for the GeForce and Quadro variants expected in the second half of the year.