Chapter 4. Supported AI accelerator model quantization formats


Different AI accelerator architectures support different types of model quantization, depending on the compute capabilities of the hardware. The following tables list the AI accelerators that support INT8, INT4, FP8, and NVFP4 quantization formats.

  • INT8 (W8A8) quantization reduces model weights and activations to 8-bit integers, providing significant memory savings while maintaining acceptable accuracy for many use cases.
  • INT8 (W4A8) quantization reduces model weights to 4-bit integers while keeping activations at 8-bit precision. W4A8 improves weight memory efficiency compared to W8A8 while retaining 8-bit activation precision for fast inference.
  • INT4 (W4A16) quantization reduces model weights to 4-bit integers while maintaining 16-bit activations, enabling larger models to fit in GPU memory with minimal accuracy loss.
  • FP8 (W8A8) quantization uses 8-bit floating point representation for weights and activations, offering a balance between memory efficiency and numerical precision for training and inference workloads.
  • NVFP4 quantization uses NVIDIA’s 4-bit floating point format with two-level scaling (FP8 fine-grained scales and FP32 tensor-level scale), providing maximum memory efficiency for inference on NVIDIA Blackwell hardware.
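
Taken together, the NVIDIA tables and notes in this chapter reduce to a compute-capability lookup. The following sketch is an illustrative helper, not part of vLLM; it simply encodes the thresholds from Tables 4.1, 4.3, 4.5, and 4.7 and the accompanying notes:

```python
def nvidia_quantization_formats(capability):
    """Map an NVIDIA compute capability (major, minor) to the
    quantization formats supported in vLLM, per the tables and
    notes in this chapter."""
    major, minor = capability
    cc = major + minor / 10
    formats = set()
    if 7.5 <= cc < 10.0:   # Turing through Hopper; Blackwell lacks INT8 kernels
        formats.add("INT8 (W8A8)")
    if cc >= 8.0:          # Ampere and newer
        formats.add("INT4 (W4A16)")
    if cc >= 8.9:          # Ada Lovelace and newer
        formats.add("FP8 (W8A8)")
    if cc >= 10.0:         # Blackwell only
        formats.add("NVFP4")
    return formats

# Example: Hopper (H100, compute capability 9.0)
print(sorted(nvidia_quantization_formats((9, 0))))
# → ['FP8 (W8A8)', 'INT4 (W4A16)', 'INT8 (W8A8)']
```

Note that the lookup reflects the INT8 exclusion for Blackwell described in the note following Table 4.1.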
Table 4.1. Supported NVIDIA AI accelerators for INT8 (W8A8) quantization

  Architecture   | Supported AI accelerators  | Minimum compute capability
  ---------------+----------------------------+---------------------------
  Turing         | Tesla T4                   | 7.5
  Ampere         | A10, A30, A40, A100        | 8.0
  Ada Lovelace   | L4, L40, L40S              | 8.9
  Hopper         | H100, H200, GH200          | 9.0

Note

NVIDIA Blackwell architecture (B200, B300, GB200, GB300) does not support INT8 quantization in vLLM due to kernel limitations. Use FP8 or NVFP4 quantization instead.

Table 4.2. Supported AMD AI accelerators for INT8 (W8A8) quantization

  Architecture   | Supported AI accelerators
  ---------------+---------------------------
  CDNA 2         | MI210
  CDNA 3         | MI300X, MI325X

Table 4.3. Supported NVIDIA AI accelerators for INT4 (W4A16) quantization

  Architecture   | Supported AI accelerators  | Minimum compute capability
  ---------------+----------------------------+---------------------------
  Ampere         | A10, A30, A40, A100        | 8.0
  Ada Lovelace   | L4, L40, L40S              | 8.9
  Hopper         | H100, H200, GH200          | 9.0
  Blackwell      | B200, B300, GB200, GB300   | 10.0

Note

NVIDIA Turing architecture (Tesla T4) does not have optimized vLLM kernel support for INT4 quantization. Use Ampere or newer architectures for INT4 inference.

Table 4.4. Supported AMD AI accelerators for INT4 (W4A16) quantization

  Architecture   | Supported AI accelerators
  ---------------+---------------------------
  CDNA 3         | MI300X, MI325X

Note

AMD CDNA 2 architecture (MI210) does not have optimized vLLM kernel support for INT4 quantization.

Table 4.5. Supported NVIDIA AI accelerators for FP8 (W8A8) quantization

  Architecture   | Supported AI accelerators  | Minimum compute capability
  ---------------+----------------------------+---------------------------
  Ada Lovelace   | L4, L40, L40S              | 8.9
  Hopper         | H100, H200, GH200          | 9.0
  Blackwell      | B200, B300, GB200, GB300   | 10.0

Note

NVIDIA Turing architecture (Tesla T4) and Ampere architecture (A10, A30, A40, A100) AI accelerators do not support FP8 W8A8 quantization due to hardware limitations. However, FP8 weight-only (W8A16) quantization is available on these architectures by using Marlin kernels.
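
The distinction in this note can be sketched as a small lookup. The helper name below is illustrative, not a vLLM API; it only restates Table 4.5 and the W8A16 fallback described above:

```python
def fp8_mode(capability):
    """Return the FP8 quantization mode available on an NVIDIA
    accelerator, per Table 4.5 and the accompanying note."""
    cc = capability[0] + capability[1] / 10
    if cc >= 8.9:
        # Ada Lovelace and newer: native FP8 compute, full W8A8.
        return "W8A8"
    if cc >= 7.5:
        # Turing and Ampere: FP8 weight-only via Marlin kernels;
        # activations stay at 16-bit precision.
        return "W8A16 (weight-only)"
    return None

print(fp8_mode((8, 0)))  # Ampere A100 → W8A16 (weight-only)
```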

Table 4.6. Supported AMD AI accelerators for FP8 (W8A8) quantization

  Architecture   | Supported AI accelerators
  ---------------+---------------------------
  CDNA 3         | MI300X, MI325X

Note

AMD CDNA 2 architecture AI accelerators (MI210) do not support FP8 quantization due to hardware limitations.

Table 4.7. Supported NVIDIA AI accelerators for NVFP4 quantization

  Architecture   | Supported AI accelerators  | Minimum compute capability
  ---------------+----------------------------+---------------------------
  Blackwell      | B200, B300, GB200, GB300   | 10.0

Note

NVFP4 quantization is only available on NVIDIA Blackwell architecture AI accelerators. AMD AI accelerators do not support NVFP4 quantization.
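
The two-level scaling described for NVFP4 at the start of this chapter can be illustrated numerically. The following is a simplified sketch: element values are rounded to the FP4 (E2M1) grid, and a per-block scale sits on top of a global FP32 tensor-level scale. Real NVFP4 additionally stores each block scale in FP8 (E4M3), which this sketch omits for clarity:

```python
# Representable magnitudes of a 4-bit E2M1 (FP4) element.
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_nvfp4_block(block, tensor_scale):
    """Quantize one 16-element block with two-level scaling:
    a per-block scale on top of a global FP32 tensor scale."""
    assert len(block) == 16
    amax = max(abs(x) for x in block)
    # Choose the block scale so the largest element maps to 6.0,
    # the maximum E2M1 magnitude. Fall back to 1.0 for all-zero blocks.
    block_scale = (amax / tensor_scale) / 6.0 or 1.0
    quantized = []
    for x in block:
        target = x / (block_scale * tensor_scale)
        # Round to the nearest representable signed E2M1 value.
        mag = min(FP4_VALUES, key=lambda v: abs(v - abs(target)))
        quantized.append(mag if target >= 0 else -mag)
    return quantized, block_scale

def dequantize(quantized, block_scale, tensor_scale):
    """Recover approximate values by applying both scale levels."""
    return [q * block_scale * tensor_scale for q in quantized]
```

Because the scale is chosen per 16-element block rather than per tensor, outliers in one block do not degrade the precision of the others, which is what makes the 4-bit format usable for inference.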
