Chapter 1. Version 3.2.1 release notes


The Red Hat AI Inference Server 3.2.1 release provides container images that optimize inference with large language models (LLMs) for NVIDIA CUDA, AMD ROCm, and Google TPU AI accelerators. The container images are available from registry.redhat.io:

  • registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.1
  • registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.1
  • registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.1

Red Hat AI Inference Server 3.2.1 packages the upstream vLLM v0.10.0 release.

You can review the complete list of updates in the upstream vLLM v0.10.0 release notes.

Note

The Red Hat AI Inference Server 3.2.1 release does not package LLM Compressor. Pull the earlier 3.2.0 container image to use LLM Compressor with AI Inference Server.

The Red Hat AI Inference Server supported product and hardware configurations have been expanded. For more information, see Supported product and hardware configurations.

1.1. New models enabled

Red Hat AI Inference Server 3.2.1 expands its capabilities by enabling the following newly validated models in vLLM v0.10.0:

  • Llama 4 with EAGLE support
  • EXAONE 4.0
  • Microsoft Phi-4-mini-flash-reasoning
  • Hunyuan V1 Dense + A13B, including reasoning and tool-parsing abilities
  • Ling mixture-of-experts (MoE) models
  • JinaVL Reranker
  • Nemotron-Nano-VL-8B-V1
  • Arcee
  • Voxtral

1.2. New developer features

Inference engine updates
  • V0 engine cleanup - removed legacy CPU/XPU/TPU V0 backends.
  • Experimental asynchronous scheduling can be enabled with the --async-scheduling flag to overlap engine core scheduling with the GPU runner for improved inference throughput; see the sketch after this list.
  • Reduced startup time for CUDA graphs by calling gc.freeze before capture.
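The following is a minimal sketch of enabling the experimental asynchronous scheduler for offline inference through the vLLM Python API. It assumes that the --async-scheduling server flag is also exposed as an async_scheduling engine argument in the Python API, and the model name is a placeholder; verify both against the vLLM v0.10.0 documentation before use.

    # Sketch: enable experimental asynchronous scheduling (vLLM v0.10.0).
    # Assumption: the --async-scheduling CLI flag maps to an async_scheduling
    # engine argument in the Python API; the model ID below is a placeholder.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
        async_scheduling=True,  # overlap engine core scheduling with the GPU runner
    )

    outputs = llm.generate(
        ["Summarize the 3.2.1 release in one sentence."],
        SamplingParams(max_tokens=64),
    )
    print(outputs[0].outputs[0].text)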
Performance improvements
  • 48% reduction in request duration by using micro-batch tokenization for concurrent requests.
  • Added fused MLA QKV and strided layernorm.
  • Added Triton causal-conv1d for Mamba models.
New quantization options
  • MXFP4 quantization for Mixture of Experts models.
  • BNB (bitsandbytes) support for Mixtral models; see the sketch after this list.
  • Hardware-specific quantization improvements.
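As one example of the new quantization paths, the following sketch loads a Mixtral checkpoint with bitsandbytes quantization through the vLLM Python API. The model name is illustrative, and the quantization="bitsandbytes" argument is an assumption based on vLLM's existing bitsandbytes integration; confirm the supported options in the vLLM v0.10.0 documentation.

    # Sketch: in-flight bitsandbytes (BNB) quantization of a Mixtral model.
    # Assumptions: the model ID is a placeholder, and quantization="bitsandbytes"
    # follows vLLM's existing bitsandbytes integration.
    from vllm import LLM

    llm = LLM(
        model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # placeholder Mixtral checkpoint
        quantization="bitsandbytes",
        dtype="bfloat16",
    )
    print(llm.generate(["Hello"])[0].outputs[0].text)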
Expanded model support
  • Llama 4 with EAGLE speculative decoding support.
  • EXAONE 4.0 and Microsoft Phi-4-mini model families.
  • Hunyuan V1 Dense and Ling MoE architectures.
OpenAI compatibility
  • Added new OpenAI Responses API implementation.
  • Added tool calling with required tool choice and support for $defs in JSON schemas; see the sketch after this list.
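The sketch below exercises the OpenAI-compatible tool-calling path with tool_choice set to "required" by using the standard openai Python client against a locally served model. The endpoint URL, model name, and tool definition are illustrative assumptions; adapt them to your deployment.

    # Sketch: required tool calling against the OpenAI-compatible endpoint.
    # Assumptions: the server runs locally on port 8000, and the model name
    # and tool schema below are placeholders.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    response = client.chat.completions.create(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
        messages=[{"role": "user", "content": "What is the weather in Boston?"}],
        tools=tools,
        tool_choice="required",  # force the model to call a tool
    )
    print(response.choices[0].message.tool_calls)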
Dependency updates
  • The Red Hat AI Inference Server Google TPU container image uses a PyTorch 2.9.0 nightly build.
  • The NVIDIA CUDA container image uses PyTorch 2.7.1.
  • The AMD ROCm container image remains on PyTorch 2.7.0.
  • The FlashInfer library is updated to v0.2.8rc1.

1.3. Known issues

  • Model deployment fails for Red Hat AI Inference Server deployments on OpenShift Container Platform 4.19 with CoreOS 9.6, ROCm driver 6.4.2, and multiple ROCm AI accelerators. This issue does not occur with CoreOS 9.4 paired with the matching ROCm driver 6.4.2 version.

    To work around this ROCm driver issue, ensure that you deploy compatible OpenShift Container Platform and ROCm driver versions:

    Table 1.1. Supported OpenShift Container Platform and ROCm driver versions

    OpenShift Container Platform version    ROCm driver version
    4.17                                    6.4.2
    4.17                                    6.3.4
