Chapter 2. Version 3.2.1 release notes


The Red Hat AI Inference Server 3.2.1 release provides container images that optimize inferencing with large language models (LLMs) for NVIDIA CUDA, AMD ROCm, and Google TPU AI accelerators. The container images are available from registry.redhat.io:

  • registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.1
  • registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.1
  • registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.1

Red Hat AI Inference Server 3.2.1 packages the upstream vLLM v0.10.0 release.

You can review the complete list of updates in the upstream vLLM v0.10.0 release notes.

Note

The Red Hat AI Inference Server 3.2.1 release does not package LLM Compressor. Pull the earlier 3.2.0 container image to use LLM Compressor with AI Inference Server.

The Red Hat AI Inference Server supported product and hardware configurations have been expanded. For more information, see Supported product and hardware configurations.

2.1. New models enabled

Red Hat AI Inference Server 3.2.1 enables the following newly validated models in vLLM v0.10.0 (a short offline inference sketch follows the list):

  • Llama 4 with EAGLE support
  • EXAONE 4.0
  • Microsoft Phi-4-mini-flash-reasoning
  • Hunyuan V1 Dense + A13B, including reasoning and tool-parsing abilities
  • Ling mixture-of-experts (MoE) models
  • JinaVL Reranker
  • Nemotron-Nano-VL-8B-V1
  • Arcee
  • Voxtral
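
To validate one of the newly enabled models, you can run a quick offline inference check with the vLLM Python API. The following is a minimal sketch; the EXAONE 4.0 model identifier, prompt, and sampling settings are illustrative assumptions, so substitute the checkpoint you actually deploy.

    # Minimal offline inference sketch with vLLM.
    # The model identifier below is an example; replace it with the checkpoint you deploy.
    from vllm import LLM, SamplingParams

    llm = LLM(model="LGAI-EXAONE/EXAONE-4.0-32B")

    sampling = SamplingParams(temperature=0.7, max_tokens=128)
    outputs = llm.generate(["Summarize this release in one sentence."], sampling)

    for output in outputs:
        print(output.outputs[0].text)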

2.2. New developer features

Inference engine updates
  • V0 engine cleanup - removed legacy CPU/XPU/TPU V0 backends.
  • Experimental asynchronous scheduling can be enabled by using the --async-scheduling flag to overlap engine core scheduling with the GPU runner for improved inference throughput; see the sketch after this list.
  • Reduced startup time for CUDA graphs by calling gc.freeze before capture.
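
The following minimal Python sketch shows one way to try the experimental scheduler outside of the server CLI. It assumes that the engine argument name async_scheduling mirrors the --async-scheduling flag and that the model identifier shown is available; both are assumptions, not a definitive interface.

    # Hedged sketch: enable experimental asynchronous scheduling from the Python API.
    # Assumption: the engine argument async_scheduling mirrors the --async-scheduling CLI flag.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen2.5-7B-Instruct",  # example model identifier
        async_scheduling=True,             # overlap engine core scheduling with the GPU runner
    )

    print(llm.generate(["Hello"], SamplingParams(max_tokens=16))[0].outputs[0].text)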
Performance improvements
  • 48% request duration reduction by using micro-batch tokenization for concurrent requests.
  • Added fused MLA QKV and strided layernorm.
  • Added Triton causal-conv1d for Mamba models.
New quantization options
  • MXFP4 quantization for Mixture of Experts models.
  • BNB (bitsandbytes) support for Mixtral models; see the sketch after this list.
  • Hardware-specific quantization improvements.
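
As a rough illustration of the new quantization paths, bitsandbytes weight quantization can be requested through the quantization engine argument, as in the following sketch. The Mixtral checkpoint name is an example, and whether in-flight BNB quantization of that checkpoint fits your accelerator is an assumption to verify against the vLLM quantization documentation.

    # Hedged sketch: load a Mixtral checkpoint with bitsandbytes (BNB) weight quantization.
    # Assumption: in-flight BNB quantization of this checkpoint is supported on your accelerator.
    from vllm import LLM

    llm = LLM(
        model="mistralai/Mixtral-8x7B-Instruct-v0.1",  # example model identifier
        quantization="bitsandbytes",                   # request BNB weight quantization
    )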
Expanded model support
  • Llama 4 with EAGLE speculative decoding support; see the sketch after this list.
  • EXAONE 4.0 and Microsoft Phi-4-mini model families.
  • Hunyuan V1 Dense and Ling MoE architectures.
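
A minimal sketch of EAGLE speculative decoding through vLLM's speculative_config engine argument follows. The Llama 4 base model identifier and the EAGLE draft-head path are placeholders, and the exact configuration keys should be checked against the vLLM speculative decoding documentation for this release.

    # Hedged sketch: EAGLE speculative decoding with a Llama 4 base model.
    # The base model identifier and draft-head path are placeholders, not verified names.
    from vllm import LLM

    llm = LLM(
        model="meta-llama/Llama-4-Scout-17B-16E-Instruct",  # placeholder base model
        speculative_config={
            "method": "eagle",                     # speculative decoding method
            "model": "path/to/eagle-draft-head",   # placeholder EAGLE draft head checkpoint
            "num_speculative_tokens": 4,           # draft tokens proposed per step
        },
    )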
OpenAI compatibility
  • Added a new OpenAI Responses API implementation; see the sketch after this list.
  • Added tool calling support for the required tool choice and JSON schema $defs.
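
The sketch below exercises the new Responses API path through the standard OpenAI Python client against a locally running AI Inference Server endpoint. The base URL, API key, and served model name are placeholders for your deployment.

    # Hedged sketch: call the OpenAI-compatible Responses API exposed by the server.
    # The base URL, API key, and model name are placeholders for your deployment.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    response = client.responses.create(
        model="Qwen/Qwen2.5-7B-Instruct",  # placeholder: the model served by the endpoint
        input="List two new features in this release.",
    )
    print(response.output_text)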
Dependency updates
  • The Red Hat AI Inference Server Google TPU container image uses a PyTorch 2.9.0 nightly build.
  • The NVIDIA CUDA container image uses PyTorch 2.7.1.
  • The AMD ROCm container image remains on PyTorch 2.7.0.
  • The FlashInfer library is updated to v0.2.8rc1.

2.3. Known issues

  • Red Hat AI Inference Server model deployment fails on OpenShift Container Platform 4.19 with CoreOS 9.6, ROCm driver 6.4.2, and multiple ROCm AI accelerators. The issue does not occur with CoreOS 9.4 paired with the same ROCm driver 6.4.2 version.

    To work around this ROCm driver issue, ensure that you deploy compatible OpenShift Container Platform and ROCm driver versions:

    Table 2.1. Supported OpenShift Container Platform and ROCm driver versions

    OpenShift Container Platform version    ROCm driver version
    4.17                                    6.4.2
    4.17                                    6.3.4
