
Chapter 3. Version 3.2.1 release notes


The Red Hat AI Inference Server 3.2.1 release provides container images that optimize inferencing with large language models (LLMs) for NVIDIA CUDA, AMD ROCm, and Google TPU AI accelerators. The container images are available from registry.redhat.io:

  • registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.1
  • registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.1
  • registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.1
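
For example, you can pull the NVIDIA CUDA image from registry.redhat.io and start it with podman. The following is a minimal sketch driven from Python: the GPU device flag, published port, and served model name are illustrative assumptions, not a complete deployment procedure.

    import subprocess

    IMAGE = "registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.1"

    # Pull the AI Inference Server CUDA image.
    # Requires a prior "podman login registry.redhat.io".
    subprocess.run(["podman", "pull", IMAGE], check=True)

    # Start the container. The image serves the vLLM OpenAI-compatible API.
    # The device flag, port, and model name below are assumptions for illustration.
    subprocess.run(
        [
            "podman", "run", "--rm",
            "--device", "nvidia.com/gpu=all",        # CDI device name; adjust for your host
            "-p", "8000:8000",                       # publish the API port
            IMAGE,
            "--model", "Qwen/Qwen2.5-7B-Instruct",   # hypothetical model, for illustration only
        ],
        check=True,
    )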

Red Hat AI Inference Server 3.2.1 packages the upstream vLLM v0.10.0 release.

You can review the complete list of updates in the upstream vLLM v0.10.0 release notes.

Note

The Red Hat AI Inference Server 3.2.1 release does not package LLM Compressor. Pull the earlier 3.2.0 container image to use LLM Compressor with AI Inference Server.

The Red Hat AI Inference Server supported product and hardware configurations have been expanded. For more information, see Supported product and hardware configurations.

3.1. New models enabled

Red Hat AI Inference Server 3.2.1 expands capabilities by enabling the following newly validated models with vLLM v0.10.0:

  • Llama 4 with EAGLE support
  • EXAONE 4.0
  • Microsoft Phi‑4‑mini‑flash‑reasoning
  • Hunyuan V1 Dense + A13B, including reasoning and tool-parsing abilities
  • Ling mixture-of-experts (MoE) models
  • JinaVL Reranker
  • Nemotron‑Nano‑VL‑8B‑V1
  • Arcee
  • Voxtral

3.2. New developer features

Inference engine updates
  • V0 engine cleanup - removed legacy CPU/XPU/TPU V0 backends.
  • Experimental asynchronous scheduling can be enabled by using the --async-scheduling flag to overlap engine core scheduling with the GPU runner for improved inference throughput (see the example after this list).
  • Reduced startup time for CUDA graphs by calling gc.freeze before capture.
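
A minimal sketch of enabling the experimental flag when starting the upstream vllm serve entry point from Python; the model name is a hypothetical placeholder.

    import subprocess

    # Start the vLLM OpenAI-compatible server with experimental asynchronous scheduling.
    # The model name is a placeholder; --async-scheduling is the flag described above.
    subprocess.run(
        ["vllm", "serve", "Qwen/Qwen2.5-7B-Instruct", "--async-scheduling"],
        check=True,
    )
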
Performance improvements
  • 48% reduction in request duration by using micro-batch tokenization for concurrent requests.
  • Added fused MLA QKV and strided layernorm.
  • Added Triton causal-conv1d for Mamba models.
New quantization options
  • MXFP4 quantization for Mixture of Experts models.
  • BNB (Bits and Bytes) support for Mixtral models.
  • Hardware-specific quantization improvements.
Expanded model support
  • Llama 4 with EAGLE speculative decoding support.
  • EXAONE 4.0 and Microsoft Phi-4-mini model families.
  • Hunyuan V1 Dense and Ling MoE architectures.
OpenAI compatibility
  • Added a new OpenAI Responses API implementation (see the example after this list).
  • Added tool calling with required choice and $defs.
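
A minimal sketch of calling the new Responses API of a running server with the OpenAI Python client; the base URL, API key, and model name are placeholders, and a recent openai package that includes the Responses client is assumed.

    from openai import OpenAI

    # Point the standard OpenAI client at a running AI Inference Server endpoint.
    # The base URL, API key, and model name are illustrative assumptions.
    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    # Call the Responses API exposed by the server.
    response = client.responses.create(
        model="Qwen/Qwen2.5-7B-Instruct",  # hypothetical model name
        input="Summarize the 3.2.1 release in one sentence.",
    )
    print(response.output_text)
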
Dependency updates
  • The Red Hat AI Inference Server Google TPU container image uses a PyTorch 2.9.0 nightly build.
  • The NVIDIA CUDA container image uses PyTorch 2.7.1.
  • The AMD ROCm container image remains on PyTorch 2.7.0.
  • FlashInfer library is updated to v0.2.8rc1.
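
You can confirm the packaged versions from inside a running container by using its Python environment; the sketch below only reads version strings and assumes it is executed inside one of the 3.2.1 images.

    # Run inside the container image's Python environment.
    import torch
    import vllm

    print("vLLM:", vllm.__version__)      # expected to report 0.10.0 for the 3.2.1 images
    print("PyTorch:", torch.__version__)  # varies by accelerator image, as listed above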

3.3. Known issues

  • Model deployment fails for Red Hat AI Inference Server deployments in OpenShift Container Platform 4.19 with CoreOS 9.6, ROCm driver 6.4.2, and multiple ROCm AI accelerators. This issue does not occur with CoreOS 9.4 paired with the same ROCm driver 6.4.2 version.

    To work around this ROCm driver issue, ensure that you deploy compatible OpenShift Container Platform and ROCm driver versions:

    Table 3.1. Supported OpenShift Container Platform and ROCm driver versions

    OpenShift Container Platform version    ROCm driver version
    4.17                                    6.4.2
    4.17                                    6.3.4
