Chapter 1. Version 3.2.1 release notes


The Red Hat AI Inference Server 3.2.1 release provides container images that optimize inference with large language models (LLMs) for NVIDIA CUDA, AMD ROCm, and Google TPU AI accelerators. The container images are available from registry.redhat.io:

  • registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.1
  • registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.1
  • registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.1

Red Hat AI Inference Server 3.2.1 packages the upstream vLLM v0.10.0 release.

You can review the complete list of updates in the upstream vLLM v0.10.0 release notes.

Note

The Red Hat AI Inference Server 3.2.1 release does not package LLM Compressor. Pull the earlier 3.2.0 container image to use LLM Compressor with AI Inference Server.

The Red Hat AI Inference Server supported product and hardware configurations have been expanded. For more information, see Supported product and hardware configurations.

1.1. New models enabled

Red Hat AI Inference Server 3.2.1 expands its capabilities by enabling the following newly validated models in vLLM v0.10.0:

  • Llama 4 with EAGLE support
  • EXAONE 4.0
  • Microsoft Phi-4-mini-flash-reasoning
  • Hunyuan V1 Dense + A13B, including reasoning and tool-parsing abilities
  • Ling mixture-of-experts (MoE) models
  • JinaVL Reranker
  • Nemotron-Nano-VL-8B-V1
  • Arcee
  • Voxtral

1.2. New developer features

Inference engine updates
  • V0 engine cleanup - removed legacy CPU/XPU/TPU V0 backends.
  • Experimental asynchronous scheduling can be enabled with the --async-scheduling flag to overlap engine core scheduling with the GPU runner for improved inference throughput; see the sketch after this list.
  • Reduced startup time for CUDA graphs by calling gc.freeze before capture.
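The following is a minimal sketch of enabling the experimental asynchronous scheduler for offline inference through the vLLM Python API. It assumes that the --async-scheduling server flag is also exposed as an async_scheduling engine argument in the Python API, and the model name is a placeholder; verify both against the vLLM v0.10.0 documentation before use.

    # Sketch: enable experimental asynchronous scheduling (vLLM v0.10.0).
    # Assumption: the --async-scheduling CLI flag maps to an async_scheduling
    # engine argument in the Python API; the model ID below is a placeholder.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
        async_scheduling=True,  # overlap engine core scheduling with the GPU runner
    )

    outputs = llm.generate(
        ["Summarize the 3.2.1 release in one sentence."],
        SamplingParams(max_tokens=64),
    )
    print(outputs[0].outputs[0].text)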
Performance improvements
  • 48% reduction in request duration by using micro-batch tokenization for concurrent requests.
  • Added fused MLA QKV and strided layernorm.
  • Added Triton causal-conv1d for Mamba models.
New quantization options
  • MXFP4 quantization for Mixture of Experts models.
  • BNB (bitsandbytes) support for Mixtral models; see the sketch after this list.
  • Hardware-specific quantization improvements.
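As one example of the new quantization paths, the following sketch loads a Mixtral checkpoint with bitsandbytes quantization through the vLLM Python API. The model name is illustrative, and the quantization="bitsandbytes" argument is an assumption based on vLLM's existing bitsandbytes integration; confirm the supported options in the vLLM v0.10.0 documentation.

    # Sketch: in-flight bitsandbytes (BNB) quantization of a Mixtral model.
    # Assumptions: the model ID is a placeholder, and quantization="bitsandbytes"
    # follows vLLM's existing bitsandbytes integration.
    from vllm import LLM

    llm = LLM(
        model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # placeholder Mixtral checkpoint
        quantization="bitsandbytes",
        dtype="bfloat16",
    )
    print(llm.generate(["Hello"])[0].outputs[0].text)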
Expanded model support
  • Llama 4 with EAGLE speculative decoding support.
  • EXAONE 4.0 and Microsoft Phi-4-mini model families.
  • Hunyuan V1 Dense and Ling MoE architectures.
OpenAI compatibility
  • Added new OpenAI Responses API implementation.
  • Added tool calling with required tool choice and support for $defs in JSON schemas; see the sketch after this list.
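The sketch below exercises the OpenAI-compatible tool-calling path with tool_choice set to "required" by using the standard openai Python client against a locally served model. The endpoint URL, model name, and tool definition are illustrative assumptions; adapt them to your deployment.

    # Sketch: required tool calling against the OpenAI-compatible endpoint.
    # Assumptions: the server runs locally on port 8000, and the model name
    # and tool schema below are placeholders.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
        messages=[{"role": "user", "content": "What is the weather in Boston?"}],
        tools=tools,
        tool_choice="required",  # force the model to call a tool
    )
    print(response.choices[0].message.tool_calls)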
Dependency updates
  • The Red Hat AI Inference Server Google TPU container image uses a PyTorch 2.9.0 nightly build.
  • The NVIDIA CUDA container image uses PyTorch 2.7.1.
  • The AMD ROCm container image remains on PyTorch 2.7.0.
  • The FlashInfer library is updated to v0.2.8rc1.

1.3. Known issues

  • Model deployment fails for Red Hat AI Inference Server deployments on OpenShift Container Platform 4.19 with CoreOS 9.6, ROCm driver 6.4.2, and multiple ROCm AI accelerators. This issue does not occur with CoreOS 9.4 paired with the matching ROCm driver 6.4.2 version.

    To work around this ROCm driver issue, ensure that you deploy compatible OpenShift Container Platform and ROCm driver versions:

    Table 1.1. Supported OpenShift Container Platform and ROCm driver versions

    OpenShift Container Platform version    ROCm driver version
    4.17                                    6.4.2
    4.17                                    6.3.4
