Chapter 1. Version 3.2.4 release notes


Red Hat AI Inference Server 3.2.4 provides container images that optimize inferencing with large language models (LLMs) for NVIDIA CUDA, AMD ROCm, Google TPU, and IBM Spyre AI accelerators. The following container images are Generally Available (GA) from registry.redhat.io:

  • registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.4
  • registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.4
  • registry.redhat.io/rhaiis/vllm-spyre-rhel9:3.2.4
  • registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.2.4

The following container image is a Technology Preview feature:

  • registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.4

    Important

    The rhaiis/vllm-tpu-rhel9:3.2.4 container is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

    For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

Note

To facilitate customer testing of new models, early access fast release Red Hat AI Inference Server images are available in near-upstream preview builds. Fast release container images are not functionally complete or production-ready, have minimal productization, and are not supported by Red Hat in any way.

You can find available fast release images in the Red Hat ecosystem catalog.

The Red Hat AI Inference Server supported product and hardware configurations have been expanded. For more information, see Supported product and hardware configurations.
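
If you want to try one of the GA images locally, a minimal sketch with Podman is shown below. It assumes you have Podman installed and a Red Hat account with the entitlements needed to pull from registry.redhat.io; the CUDA image is used only as an example, and model serving arguments are omitted:

    # Authenticate to the Red Hat container registry
    podman login registry.redhat.io

    # Pull the Generally Available CUDA container image
    podman pull registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.4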

1.1. New Red Hat AI Inference Server developer features

Red Hat AI Inference Server 3.2.4 packages the upstream vLLM v0.11.0 release. This is unchanged from the Red Hat AI Inference Server 3.2.3 release. See the Version 3.2.3 release notes for more information.

1.2. New Red Hat AI Model Optimization Toolkit developer features

Red Hat AI Model Optimization Toolkit 3.2.4 packages the upstream LLM Compressor v0.8.1 release. This is unchanged from the Red Hat AI Inference Server 3.2.3 release. See the Version 3.2.3 release notes for more information.

1.3. Known issues

  • The FlashInfer kernel sampler was disabled by default in Red Hat AI Inference Server 3.2.3 to address non-deterministic behavior and correctness errors in model output.

    This change affects sampling behavior when using FlashInfer top-p and top-k sampling methods. If required, you can re-enable FlashInfer by setting the VLLM_USE_FLASHINFER_SAMPLER environment variable at runtime, as shown in the example after this list:

    VLLM_USE_FLASHINFER_SAMPLER=1
  • AMD ROCm AI accelerators do not support inference serving of encoder-decoder models when using the vLLM v1 inference engine.

    Encoder-decoder model architectures cause NotImplementedError failures with AMD ROCm accelerators, because ROCm attention backends support only decoder-only attention.

    Affected models include, but are not limited to, the following:

    • Speech-to-text Whisper models, for example openai/whisper-large-v3-turbo and mistralai/Voxtral-Mini-3B-2507
    • Vision-language models, for example microsoft/Phi-3.5-vision-instruct
    • Translation models, for example T5, BART, MarianMT
    • Any models using cross-attention or an encoder-decoder architecture
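
For the FlashInfer known issue above, the following is a minimal sketch of passing the environment variable to the container at runtime with Podman. The CUDA image is used only as an example, the --device flag assumes an NVIDIA CDI device configuration on the host, and model serving arguments are omitted:

    # Re-enable the FlashInfer sampler by setting the variable in the container environment
    podman run --rm \
        --device nvidia.com/gpu=all \
        -e VLLM_USE_FLASHINFER_SAMPLER=1 \
        registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.4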