
Chapter 3. Version 3.2.1 release notes


The Red Hat AI Inference Server 3.2.1 release provides container images that optimize inferencing with large language models (LLMs) for NVIDIA CUDA, AMD ROCm, and Google TPU AI accelerators. The container images are available from registry.redhat.io:

  • registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.1
  • registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.1
  • registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.1
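
For example, you can pull the NVIDIA CUDA image from registry.redhat.io and start it with podman. The following is a minimal sketch driven from Python: the GPU device flag, published port, and served model name are illustrative assumptions, not a complete deployment procedure.

    import subprocess

    IMAGE = "registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.1"

    # Pull the AI Inference Server CUDA image.
    # Requires a prior "podman login registry.redhat.io".
    subprocess.run(["podman", "pull", IMAGE], check=True)

    # Start the container. The image serves the vLLM OpenAI-compatible API.
    # The device flag, port, and model name below are assumptions for illustration.
    subprocess.run(
        [
            "podman", "run", "--rm",
            "--device", "nvidia.com/gpu=all",        # CDI device name; adjust for your host
            "-p", "8000:8000",                       # publish the API port
            IMAGE,
            "--model", "Qwen/Qwen2.5-7B-Instruct",   # hypothetical model, for illustration only
        ],
        check=True,
    )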

Red Hat AI Inference Server 3.2.1 packages the upstream vLLM v0.10.0 release.

You can review the complete list of updates in the upstream vLLM v0.10.0 release notes.

Note

The Red Hat AI Inference Server 3.2.1 release does not package LLM Compressor. Pull the earlier 3.2.0 container image to use LLM Compressor with AI Inference Server.

The Red Hat AI Inference Server supported product and hardware configurations have been expanded. For more information, see Supported product and hardware configurations.

3.1. New models enabled

Red Hat AI Inference Server 3.2.1 expands capabilities by enabling the following newly validated models with vLLM v0.10.0:

  • Llama 4 with EAGLE support
  • EXAONE 4.0
  • Microsoft Phi‑4‑mini‑flash‑reasoning
  • Hunyuan V1 Dense + A13B, including reasoning and tool-parsing abilities
  • Ling mixture-of-experts (MoE) models
  • JinaVL Reranker
  • Nemotron‑Nano‑VL‑8B‑V1
  • Arcee
  • Voxtral

3.2. New developer features

Inference engine updates
  • V0 engine cleanup - removed legacy CPU/XPU/TPU V0 backends.
  • Experimental asynchronous scheduling can be enabled by using the --async-scheduling flag to overlap engine core scheduling with the GPU runner for improved inference throughput (see the example after this list).
  • Reduced startup time for CUDA graphs by calling gc.freeze before capture.
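
A minimal sketch of enabling the experimental flag when starting the upstream vllm serve entry point from Python; the model name is a hypothetical placeholder.

    import subprocess

    # Start the vLLM OpenAI-compatible server with experimental asynchronous scheduling.
    # The model name is a placeholder; --async-scheduling is the flag described above.
    subprocess.run(
        ["vllm", "serve", "Qwen/Qwen2.5-7B-Instruct", "--async-scheduling"],
        check=True,
    )
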
Performance improvements
  • 48% reduction in request duration by using micro-batch tokenization for concurrent requests.
  • Added fused MLA QKV and strided layernorm.
  • Added Triton causal-conv1d for Mamba models.
New quantization options
  • MXFP4 quantization for Mixture of Experts models.
  • BNB (Bits and Bytes) support for Mixtral models.
  • Hardware-specific quantization improvements.
Expanded model support
  • Llama 4 with EAGLE speculative decoding support.
  • EXAONE 4.0 and Microsoft Phi-4-mini model families.
  • Hunyuan V1 Dense and Ling MoE architectures.
OpenAI compatibility
  • Added a new OpenAI Responses API implementation (see the example after this list).
  • Added tool calling with required choice and $defs.
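
A minimal sketch of calling the new Responses API of a running server with the OpenAI Python client; the base URL, API key, and model name are placeholders, and a recent openai package that includes the Responses client is assumed.

    from openai import OpenAI

    # Point the standard OpenAI client at a running AI Inference Server endpoint.
    # The base URL, API key, and model name are illustrative assumptions.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    # Call the Responses API exposed by the server.
    response = client.responses.create(
        model="Qwen/Qwen2.5-7B-Instruct",  # hypothetical model name
        input="Summarize the 3.2.1 release in one sentence.",
    )
    print(response.output_text)
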
Dependency updates
  • The Red Hat AI Inference Server Google TPU container image uses a PyTorch 2.9.0 nightly build.
  • The NVIDIA CUDA container image uses PyTorch 2.7.1.
  • The AMD ROCm container image remains on PyTorch 2.7.0.
  • FlashInfer library is updated to v0.2.8rc1.
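
You can confirm the packaged versions from inside a running container by using its Python environment; the sketch below only reads version strings and assumes it is executed inside one of the 3.2.1 images.

    # Run inside the container image's Python environment.
    import torch
    import vllm

    print("vLLM:", vllm.__version__)      # expected to report 0.10.0 for the 3.2.1 images
    print("PyTorch:", torch.__version__)  # varies by accelerator image, as listed above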

3.3. Known issues

  • Model deployment fails for Red Hat AI Inference Server deployments in OpenShift Container Platform 4.19 with CoreOS 9.6, ROCm driver 6.4.2, and multiple ROCm AI accelerators. This issue does not occur with CoreOS 9.4 paired with the same ROCm driver 6.4.2 version.

    To work around this ROCm driver issue, ensure that you deploy compatible OpenShift Container Platform and ROCm driver versions:

    Table 3.1. Supported OpenShift Container Platform and ROCm driver versions

    OpenShift Container Platform version    ROCm driver version
    4.17                                    6.4.2
    4.17                                    6.3.4
