Supported product and hardware configurations
Supported hardware and software configurations for deploying Red Hat AI software
Abstract
Supported hardware and software configurations for deploying Red Hat AI software.
Chapter 1. About Red Hat AI Inference Server supported hardware and software
Red Hat AI software runs on a variety of supported hardware, software, and delivery platforms in production environments.
Technology Preview and Developer Preview features provide early access to potential new features, but they are not supported and are not recommended for production workloads.
Supported configurations span multiple AI accelerator types including NVIDIA GPUs, AMD GPUs, Google TPUs, and IBM Spyre accelerators. Red Hat AI Inference Server can be deployed in OpenShift Container Platform clusters, on standalone Red Hat Enterprise Linux (RHEL) hosts with Podman, or integrated with Red Hat OpenShift AI for managed AI/ML workflows.
Chapter 2. Product and version compatibility
The following tables list the supported product versions for Red Hat AI Inference Server, Red Hat Enterprise Linux AI, and Red Hat OpenShift AI.

Table 2.1. Red Hat AI Inference Server versions
| Product version | vLLM core version | LLM Compressor version |
|---|---|---|
| 3.4.0-ea.1 | v0.14.1 | v0.9.0.2 |
| 3.3 | v0.13.0 | v0.9.0.1 |
| 3.2.5 | v0.11.2 | v0.8.1 |
| 3.2.4 | v0.11.0 | v0.8.1 |
| 3.2.3 | v0.11.0 | v0.8.1 |
| 3.2.2 | v0.10.1.1 | v0.7.1 |
| 3.2.1 | v0.10.0 | Not included in this release |
| 3.2.0 | v0.9.2 | Not included in this release |
Table 2.2. Red Hat Enterprise Linux AI versions

| Product version | vLLM core version | LLM Compressor version |
|---|---|---|
| 3.3 | v0.13.0 | v0.9.0.1 |
| 3.2 | v0.11.2 | v0.8.1 |
| 3.0 | v0.11.0 | v0.8.1 |
Table 2.3. Red Hat OpenShift AI versions

| Product version | vLLM core version | LLM Compressor version |
|---|---|---|
| 3.3 | v0.13.0 | v0.9.0.1 |
| 3.2 | v0.11.2 | v0.8.1 |
| 3.0 | v0.11.0 | v0.8.1 |
Chapter 3. Supported AI accelerators for Red Hat AI Inference Server
The following tables list the data center grade AI accelerators that are supported for Red Hat AI Inference Server 3.4.
Red Hat AI Inference Server supports data center grade AI accelerators only.
Table 3.1. NVIDIA AI accelerators

| vLLM release | AI accelerators | Requirements | vLLM architecture support | LLM Compressor support |
|---|---|---|---|---|
| vLLM v0.14.1 | NVIDIA data center GPUs (Turing, Ampere, Ada Lovelace, Hopper, and Blackwell architectures) | CUDA 13.0 | x86 | Supported, now packaged separately in its own container image |
Red Hat AI Inference Server 3.4.0-ea.1 is built with CUDA 13.0. The container images are backward compatible with CUDA 12.9 drivers.
If your host driver version is older than the CUDA toolkit version shipped in the AI Inference Server container, you can use NVIDIA Forward Compatibility to avoid driver upgrades.
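You can script this driver check before deploying the container. The following is a minimal Python sketch, assuming `nvidia-smi` is available on the host PATH; the R580 driver floor used here for CUDA 13.0 is an assumption that you should verify against the NVIDIA CUDA compatibility matrix.

```python
# Minimal sketch: compare the host NVIDIA driver against the driver floor
# of the CUDA toolkit shipped in the container. Assumes nvidia-smi is on
# the PATH; the (580, 65) floor for CUDA 13.0 is an assumed example value.
import subprocess

MIN_DRIVER_FOR_CUDA_13 = (580, 65)  # assumption: verify in the NVIDIA matrix

def host_driver_version() -> str:
    """Return the host driver version string, for example '580.65.06'."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip().splitlines()[0]

driver = host_driver_version()
major, minor = (int(part) for part in driver.split(".")[:2])

if (major, minor) < MIN_DRIVER_FOR_CUDA_13:
    print(f"Driver {driver} predates the container CUDA toolkit: "
          "enable NVIDIA Forward Compatibility or upgrade the driver.")
else:
    print(f"Driver {driver} satisfies the container CUDA toolkit.")
```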
NVIDIA T4 and A100 accelerators do not support FP8 (W8A8) quantization.
Table 3.2. AMD AI accelerators

| vLLM release | AI accelerators | Requirements | vLLM architecture support | LLM Compressor support |
|---|---|---|---|---|
| vLLM v0.14.1 | AMD Instinct MI210, MI300X, and MI325X | ROCm | x86 | Not supported |
AMD GPUs support FP8 (W8A8) and GGUF quantization schemes only.
Table 3.3. Google TPU AI accelerators

| vLLM release | AI accelerators | Requirements | vLLM architecture support | LLM Compressor support |
|---|---|---|---|---|
| vLLM v0.14.1 | Google TPU v4, v5e, v5p, v6e (Trillium) | - | x86 Technology Preview | Not supported |
Table 3.4. IBM AI accelerators

| vLLM release | AI accelerators | Requirements | vLLM architecture support | LLM Compressor support |
|---|---|---|---|---|
| vLLM v0.14.1 | IBM Spyre for Power (ppc64le) | - | IBM Power (ppc64le) | Not supported |
| vLLM v0.14.1 | IBM Spyre for Z (s390x) | - | IBM Z (s390x) | Not supported |
| vLLM v0.14.1 | IBM AIU (x86) | - | x86 Technology Preview | Not supported |
IBM AIU support for x86 is available as a Technology Preview feature only. IBM AIU for x86 is not a Generally Available (GA) feature. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
Table 3.5. AWS AI accelerators

| vLLM release | AI accelerators | Requirements | vLLM architecture support | LLM Compressor support |
|---|---|---|---|---|
| vLLM v0.14.1 | AWS Inferentia2 (Inf2), AWS Trainium (Trn1, Trn1n, Trn2) | - | x86 Technology Preview | Not supported |
AWS Trainium and Inferentia support is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
Chapter 4. Supported AI accelerator model quantization formats
Different AI accelerator architectures support different types of model quantization, depending on the compute capabilities of the hardware. The following tables list the AI accelerators that support INT8, INT4, FP8, and NVFP4 quantization formats.
- INT8 (W8A8) quantization reduces model weights and activations to 8-bit integers, providing significant memory savings while maintaining acceptable accuracy for many use cases.
- INT4 (W4A8) quantization reduces model weights to 4-bit integers while keeping activations at 8-bit precision. W4A8 improves weight memory efficiency compared to W8A8 while retaining 8-bit activation fidelity for inference.
- INT4 (W4A16) quantization reduces model weights to 4-bit integers while maintaining 16-bit activations, enabling larger models to fit in GPU memory with minimal accuracy loss.
- FP8 (W8A8) quantization uses 8-bit floating point representation for weights and activations, offering a balance between memory efficiency and numerical precision for training and inference workloads.
- NVFP4 quantization uses NVIDIA’s 4-bit floating point format with two-level scaling (FP8 fine-grained scales and FP32 tensor-level scale), providing maximum memory efficiency for inference on NVIDIA Blackwell hardware.
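To make these schemes concrete, the following is a minimal sketch of producing an FP8 (W8A8) dynamic-quantized checkpoint with LLM Compressor, modeled on the upstream LLM Compressor examples. The model ID is a placeholder, and the exact import paths can vary between the LLM Compressor versions listed in Chapter 2.

```python
# Minimal sketch: one-shot FP8 (W8A8) dynamic quantization with LLM
# Compressor. The model ID is a placeholder; FP8_DYNAMIC quantizes weights
# offline and computes activation scales at runtime, so no calibration
# dataset is needed.
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model ID

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize every Linear layer except the output head, which is kept at
# full precision to limit accuracy loss.
recipe = QuantizationModifier(
    targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"]
)
oneshot(model=model, recipe=recipe)

save_dir = MODEL_ID.split("/")[-1] + "-FP8-dynamic"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)
```

The saved directory can then be served by vLLM, which picks up the quantization configuration from the checkpoint metadata.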
Table 4.1. INT8 quantization: NVIDIA AI accelerators

| Architecture | Supported AI accelerators | Minimum compute capability |
|---|---|---|
| Turing | Tesla T4 | 7.5 |
| Ampere | A10, A30, A40, A100 | 8.0 |
| Ada Lovelace | L4, L40, L40S | 8.9 |
| Hopper | H100, H200, GH200 | 9.0 |
NVIDIA Blackwell architecture (B200, B300, GB200, GB300) does not support INT8 quantization in vLLM due to kernel limitations. Use FP8 or NVFP4 quantization instead.
Table 4.2. INT8 quantization: AMD AI accelerators

| Architecture | Supported AI accelerators |
|---|---|
| CDNA 2 | MI210 |
| CDNA 3 | MI300X, MI325X |
Table 4.3. INT4 quantization: NVIDIA AI accelerators

| Architecture | Supported AI accelerators | Minimum compute capability |
|---|---|---|
| Ampere | A10, A30, A40, A100 | 8.0 |
| Ada Lovelace | L4, L40, L40S | 8.9 |
| Hopper | H100, H200, GH200 | 9.0 |
| Blackwell | B200, B300, GB200, GB300 | 10.0 |
NVIDIA Turing architecture (Tesla T4) does not have optimized vLLM kernel support for INT4 quantization. Use Ampere or newer architectures for INT4 inference.
Table 4.4. INT4 quantization: AMD AI accelerators

| Architecture | Supported AI accelerators |
|---|---|
| CDNA 3 | MI300X, MI325X |
AMD CDNA 2 architecture (MI210) does not have optimized vLLM kernel support for INT4 quantization.
Table 4.5. FP8 quantization: NVIDIA AI accelerators

| Architecture | Supported AI accelerators | Minimum compute capability |
|---|---|---|
| Ada Lovelace | L4, L40, L40S | 8.9 |
| Hopper | H100, H200, GH200 | 9.0 |
| Blackwell | B200, B300, GB200, GB300 | 10.0 |
NVIDIA Turing architecture (Tesla T4) and Ampere architecture (A10, A30, A40, A100) AI accelerators do not support FP8 W8A8 quantization due to hardware limitations. However, FP8 weight-only (W8A16) quantization is available on these architectures by using Marlin kernels.
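Because the FP8 (W8A8) floor is compute capability 8.9, a short runtime check can route older accelerators to a fallback scheme. The following is a minimal sketch, assuming PyTorch and at least one visible CUDA device:

```python
# Minimal sketch: check the local GPU against the 8.9 compute capability
# floor for FP8 (W8A8) shown in the table above. Assumes PyTorch and at
# least one visible CUDA device.
import torch

major, minor = torch.cuda.get_device_capability(0)
capability = major + minor / 10

if capability >= 8.9:
    print(f"Compute capability {capability}: FP8 (W8A8) is supported.")
else:
    print(f"Compute capability {capability}: fall back to FP8 weight-only "
          "(W8A16) via Marlin kernels, or to an INT8 scheme.")
```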
Table 4.6. FP8 quantization: AMD AI accelerators

| Architecture | Supported AI accelerators |
|---|---|
| CDNA 3 | MI300X, MI325X |
AMD CDNA 2 architecture AI accelerators (MI210) do not support FP8 quantization due to hardware limitations.
Table 4.7. NVFP4 quantization: NVIDIA AI accelerators

| Architecture | Supported AI accelerators | Minimum compute capability |
|---|---|---|
| Blackwell | B200, B300, GB200, GB300 | 10.0 |
NVFP4 quantization is only available on NVIDIA Blackwell architecture AI accelerators. AMD AI accelerators do not support NVFP4 quantization.
Chapter 5. Supported AI accelerators for RHEL AI
The following AI accelerators are supported for inference serving with Red Hat AI Inference Server on RHEL AI.
Bare metal deployments of RHEL AI are supported for all NVIDIA CUDA and AMD ROCm AI accelerators listed in Supported AI accelerators for Red Hat AI Inference Server.
Actual requirements vary based on the specific models you deploy, quantization methods, context lengths, and concurrent request loads. Aggregate GPU memory refers to the total GPU memory available across all GPUs in the system that can be used for tensor parallelism or pipeline parallelism.
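As an illustration of how aggregate GPU memory is consumed, the following minimal vLLM sketch shards a single model across four GPUs with tensor parallelism; the model ID, GPU count, and memory fraction are placeholder assumptions:

```python
# Minimal sketch: use tensor parallelism so the aggregate memory of four
# GPUs serves a model too large for any single GPU. The model ID, GPU
# count, and memory fraction below are placeholder assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder model ID
    tensor_parallel_size=4,        # shard weights across 4 GPUs
    gpu_memory_utilization=0.90,   # fraction of each GPU's memory to use
)

outputs = llm.generate(
    ["Summarize tensor parallelism in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```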
For more information about inference serving on bare metal or cloud platforms, see Red Hat Enterprise Linux AI.
The recommended minimum additional disk storage for all platforms is 1 TB.
| NVIDIA AI accelerator | Aggregate GPU memory | AWS instance family |
|---|---|---|
| GB200 | 384 GB | P6e series |
| B200 | 192 GB | P6 series |
| RTX PRO 6000 Blackwell Server Edition | 96 GB | G7e series |
| H100 | 80 GB | P5 series |
| L40S | 48 GB | G6e series |
| A100 | 40 GB | P4d series |
| L4 | 24 GB | G6 series |
| NVIDIA AI accelerator | Aggregate GPU memory | IBM Cloud instance family |
|---|---|---|
| H200 | 141 GB | gx3 series |
| H100 | 80 GB | gx3 series |
| A100 | 80 GB | gx3 series |
| L40S | 48 GB | gx3 series |
| L4 | 24 GB | gx3 series |
| AI accelerator | Aggregate GPU memory | Azure instance family |
|---|---|---|
| NVIDIA GB200 | 384 GB | ND series |
| AMD Instinct MI300X | 192 GB | ND series |
| NVIDIA H100 | 80 GB | ND series |
| NVIDIA A100 | 80 GB | ND series |
| AMD Instinct MI210 | 64 GB | ND series |
| NVIDIA AI accelerator | Aggregate GPU memory | Google Cloud instance family |
|---|---|---|
| GB200 | 384 GB | A4X series |
| B200 | 192 GB | A4 series |
| 4xL4 | 96 GB | G2 series |
| H100 | 80 GB | A3 series |
| A100 | 40 GB | A2 series |
Chapter 6. Supported AI accelerators for Red Hat OpenShift AI
You must install the Operator that corresponds to the AI accelerator that you want to use with Red Hat OpenShift AI.
OpenShift AI provides Operators that support integration with AI accelerators. OpenShift AI also provides images that include libraries that work with NVIDIA, AMD, and Intel Gaudi data center grade AI accelerators.
Chapter 7. Supported deployment environments
The following deployment environments for Red Hat AI Inference Server are supported.
Red Hat AI Inference Server is available only as a container image. The host operating system and kernel must support the required accelerator drivers. For more information, see Supported AI accelerators.
| Environment | Supported versions | Deployment notes |
|---|---|---|
| OpenShift Container Platform (self‑managed) | 4.14+ | Deploy on bare‑metal hosts or virtual machines. |
| Red Hat OpenShift Service on AWS (ROSA) | 4.14+ | Requires a ROSA cluster with STS and GPU‑enabled P5 or G5 node types. See Prepare your environment for more information. |
| Red Hat Enterprise Linux AI | 3.0+ | Deploy on bare‑metal hosts or virtual machines. |
| Red Hat Enterprise Linux (RHEL) | 9.2+ | Deploy on bare‑metal hosts or virtual machines. |
| Linux (not RHEL) | - | Supported under the third‑party policy when deployed on bare‑metal hosts or virtual machines. OpenShift Container Platform Operators are not required. |
| Kubernetes (not OpenShift Container Platform) | - | Supported under the third‑party policy when deployed on bare‑metal hosts or virtual machines. |
- Single-host deployments for IBM Spyre AI accelerators on IBM Z and IBM Power are supported on RHEL 9.6+.
- Cluster deployments for IBM Spyre AI accelerators on IBM Z are supported as part of Red Hat OpenShift AI version 3.0+ only.
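In any of these environments, a running server can be smoke-tested through its OpenAI-compatible endpoint. The following is a minimal sketch, assuming the openai Python client, a server listening on localhost port 8000, and a placeholder model name:

```python
# Minimal sketch: query a running AI Inference Server instance through the
# OpenAI-compatible API that vLLM exposes. The endpoint URL, API key, and
# model name are placeholder assumptions for a local test deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.completions.create(
    model="placeholder-served-model",   # must match the served model name
    prompt="Say hello from the inference server.",
    max_tokens=32,
)
print(response.choices[0].text)
```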
Chapter 8. OpenShift Container Platform software prerequisites for GPU deployments
The following table lists the minimum OpenShift Container Platform software prerequisites for AI accelerator workloads.
| Component | Minimum version |
|---|---|
| NVIDIA GPU Operator | 24.3 |
| AMD GPU Operator | 6.2 |
| IBM Spyre Operator | 1.0 |
| Node Feature Discovery [1] | 4.14 |
[1] Included by default with OpenShift Container Platform. Node Feature Discovery is required for scheduling NUMA-aware workloads.
Chapter 9. Lifecycle and update policy
Security and critical bug fixes are delivered as container images available from the registry.access.redhat.com/rhaiis container registry and are announced through RHSA advisories. See RHAII container images on catalog.redhat.com for more details.