Chapter 4. Supported AI accelerator model quantization formats
Different AI accelerator architectures support different model quantization formats, depending on the compute capability of the hardware. The following tables list, by vendor, the AI accelerators that support the INT8, INT4, FP8, and NVFP4 quantization formats.
- INT8 (W8A8) quantization reduces model weights and activations to 8-bit integers, providing significant memory savings while maintaining acceptable accuracy for many use cases.
- INT8 (W4A8) quantization reduces model weights to 4-bit integers while keeping activations at 8-bit precision, halving weight memory compared to W8A8 without lowering activation fidelity for inference.
- INT4 (W4A16) quantization reduces model weights to 4-bit integers while maintaining 16-bit activations, enabling larger models to fit in GPU memory with minimal accuracy loss.
- FP8 (W8A8) quantization uses 8-bit floating point representation for weights and activations, offering a balance between memory efficiency and numerical precision for training and inference workloads.
- NVFP4 quantization uses NVIDIA’s 4-bit floating point format with two-level scaling (FP8 fine-grained scales and FP32 tensor-level scale), providing maximum memory efficiency for inference on NVIDIA Blackwell hardware.
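As an illustration of the W8A8-style arithmetic described above, the following sketch shows symmetric per-tensor INT8 quantization in plain Python. This is a simplified teaching example, not the vLLM kernel implementation; the function names are illustrative only:

```python
def quantize_int8(values):
    """Symmetric per-tensor INT8 quantization: one FP32 scale maps the
    largest magnitude onto the signed 8-bit range [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize_int8(codes, scale):
    """Recover approximate float values from the int8 codes."""
    return [code * scale for code in codes]

weights = [0.02, -1.5, 0.75, 3.0, -0.33]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
# Every restored weight is within one quantization step (scale) of the original.
```

The same scale-and-round principle underlies the other integer formats: W4A16 applies it to weights only with a 4-bit range, and production kernels typically use per-channel or per-group scales rather than a single per-tensor scale.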
NVIDIA AI accelerators that support INT8 quantization:

| Architecture | Supported AI accelerators | Minimum compute capability |
|---|---|---|
| Turing | Tesla T4 | 7.5 |
| Ampere | A10, A30, A40, A100 | 8.0 |
| Ada Lovelace | L4, L40, L40S | 8.9 |
| Hopper | H100, H200, GH200 | 9.0 |
NVIDIA Blackwell architecture (B200, B300, GB200, GB300) does not support INT8 quantization in vLLM due to kernel limitations. Use FP8 or NVFP4 quantization instead.
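Because format support depends on compute capability, a deployment script can pick a format from the tables in this chapter. The following helper is illustrative only (it is not a vLLM API) and encodes the NVIDIA support rules described here:

```python
def pick_quantization(compute_capability):
    """Illustrative helper (not a vLLM API): choose a quantization format
    for an NVIDIA AI accelerator based on the support tables in this chapter."""
    major, minor = compute_capability
    cc = major + minor / 10
    if cc >= 10.0:   # Blackwell: no INT8 kernels; prefer NVFP4 or FP8
        return "nvfp4"
    if cc >= 8.9:    # Ada Lovelace and Hopper: FP8 W8A8 supported
        return "fp8"
    if cc >= 8.0:    # Ampere: INT8 W8A8 and INT4 W4A16, no FP8 W8A8
        return "int8"
    if cc >= 7.5:    # Turing: INT8 only, no optimized INT4 kernels
        return "int8"
    raise ValueError("No supported quantization format for this accelerator")

# For example, on an H100 (compute capability 9.0):
# pick_quantization((9, 0)) -> "fp8"
```

When several formats are supported, the choice between them is a quality/throughput trade-off; this sketch simply prefers the newest format each architecture supports.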
AMD AI accelerators that support INT8 quantization:

| Architecture | Supported AI accelerators |
|---|---|
| CDNA 2 | MI210 |
| CDNA 3 | MI300X, MI325X |
NVIDIA AI accelerators that support INT4 quantization:

| Architecture | Supported AI accelerators | Minimum compute capability |
|---|---|---|
| Ampere | A10, A30, A40, A100 | 8.0 |
| Ada Lovelace | L4, L40, L40S | 8.9 |
| Hopper | H100, H200, GH200 | 9.0 |
| Blackwell | B200, B300, GB200, GB300 | 10.0 |
NVIDIA Turing architecture (Tesla T4) does not have optimized vLLM kernel support for INT4 quantization. Use Ampere or newer architectures for INT4 inference.
AMD AI accelerators that support INT4 quantization:

| Architecture | Supported AI accelerators |
|---|---|
| CDNA 3 | MI300X, MI325X |
AMD CDNA 2 architecture (MI210) does not have optimized vLLM kernel support for INT4 quantization.
NVIDIA AI accelerators that support FP8 quantization:

| Architecture | Supported AI accelerators | Minimum compute capability |
|---|---|---|
| Ada Lovelace | L4, L40, L40S | 8.9 |
| Hopper | H100, H200, GH200 | 9.0 |
| Blackwell | B200, B300, GB200, GB300 | 10.0 |
NVIDIA Turing architecture (Tesla T4) and Ampere architecture (A10, A30, A40, A100) AI accelerators do not support FP8 W8A8 quantization due to hardware limitations. However, FP8 weight-only (W8A16) quantization is available on these architectures by using Marlin kernels.
AMD AI accelerators that support FP8 quantization:

| Architecture | Supported AI accelerators |
|---|---|
| CDNA 3 | MI300X, MI325X |
AMD CDNA 2 architecture AI accelerators (MI210) do not support FP8 quantization due to hardware limitations.
NVIDIA AI accelerators that support NVFP4 quantization:

| Architecture | Supported AI accelerators | Minimum compute capability |
|---|---|---|
| Blackwell | B200, B300, GB200, GB300 | 10.0 |
NVFP4 quantization is available only on NVIDIA Blackwell architecture AI accelerators. AMD AI accelerators do not support NVFP4 quantization.
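The two-level scaling that NVFP4 uses can be sketched in plain Python. This is a simplified model for intuition only: real NVFP4 packs values into the hardware FP4 (E2M1) format and stores the fine-grained block scales in FP8 E4M3, whereas this sketch keeps all scales in full precision and only snaps values to the E2M1 magnitude grid:

```python
# FP4 E2M1 representable magnitudes (maximum value is 6.0)
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def to_fp4(x):
    """Round a value to the nearest representable FP4 E2M1 value."""
    mag = min(FP4_GRID, key=lambda g: abs(g - abs(x)))
    return mag if x >= 0 else -mag

def quantize_nvfp4(values, block_size=16):
    """Two-level scaling sketch: one FP32 tensor-level scale, plus a
    fine-grained scale per 16-element block (FP8 in real NVFP4)."""
    tensor_scale = max(abs(v) for v in values) / 6.0
    blocks = []
    for i in range(0, len(values), block_size):
        block = values[i:i + block_size]
        # Per-block scale, expressed relative to the tensor-level scale
        block_scale = max(abs(v) for v in block) / (6.0 * tensor_scale) or 1.0
        codes = [to_fp4(v / (block_scale * tensor_scale)) for v in block]
        blocks.append((codes, block_scale))
    return blocks, tensor_scale

def dequantize_nvfp4(blocks, tensor_scale):
    """Undo both scaling levels to recover approximate float values."""
    out = []
    for codes, block_scale in blocks:
        out.extend(c * block_scale * tensor_scale for c in codes)
    return out

values = [0.1 * i - 1.5 for i in range(32)]
restored = dequantize_nvfp4(*quantize_nvfp4(values))
```

The per-block scales let each 16-element group use the full FP4 dynamic range, which is why NVFP4 retains more fidelity than a single per-tensor 4-bit scale would.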