
Chapter 1. Version 3.2.5 release notes


Red Hat AI Inference Server 3.2.5 provides container images that optimize inferencing with large language models (LLMs) for NVIDIA CUDA, AMD ROCm, Google TPU, and IBM Spyre AI accelerators with multi-architecture support for s390x (IBM Z) and ppc64le (IBM Power).

The following container images are Generally Available (GA) from registry.redhat.io:

  • registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.5
  • registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.5
  • registry.redhat.io/rhaiis/vllm-spyre-rhel9:3.2.5 (s390x and ppc64le)
  • registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.2.5
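
For example, you can log in to registry.redhat.io, pull the CUDA image, and serve a model with Podman. The following is a minimal sketch only: the GPU device flag, port, and model name are placeholders that depend on your host configuration, and it assumes the image entrypoint accepts vLLM serve arguments.

    # Log in to the Red Hat registry and pull the CUDA image
    podman login registry.redhat.io
    podman pull registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.5

    # Serve a model on the default vLLM port (the device flag and model name are
    # illustrative; adjust them for your host and the model you want to deploy)
    podman run --rm -it \
      --device nvidia.com/gpu=all \
      -p 8000:8000 \
      registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.5 \
      --model <model_name>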

The following container images are Technology Preview features:

  • registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.5
  • registry.redhat.io/rhaiis/vllm-spyre-rhel9:3.2.5 (x86)
Important

The registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.5 and registry.redhat.io/rhaiis/vllm-spyre-rhel9:3.2.5 (x86) containers are Technology Preview features only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

The Red Hat AI Inference Server supported product and hardware configurations have been expanded. For more information, see Supported product and hardware configurations.

1.1. Early access AI Inference Server images

To facilitate customer testing of new models, early access fast release Red Hat AI Inference Server images are available as near-upstream preview builds. Fast release container images are not functionally complete or production-ready, have minimal productization, and are not supported by Red Hat in any way.

You can find available fast release images in the Red Hat ecosystem catalog.

1.2. New Red Hat AI Inference Server developer features

Red Hat AI Inference Server 3.2.5 packages the upstream vLLM v0.11.2 release. You can review the complete list of updates in the upstream vLLM v0.11.2 release notes.

PyTorch 2.9.0, CUDA 12.9.1 updates
The NVIDIA CUDA image has been updated with PyTorch 2.9.0, enabling Inductor partitioning and incorporating multiple fixes in graph-partition rules and compile-cache integration.
Batch-invariant torch.compile
Generalized batch-invariant support across attention and MoE model backends, with explicit support for DeepGEMM and FlashInfer on NVIDIA Hopper and Blackwell AI accelerators.
Robust async scheduling
Fixed several correctness and stability issues in async scheduling, especially when combined with chunked prefill, structured outputs, priority scheduling, MTP, DeepEP or Dynamic Compressing Prompts (DCP) processing. The --async-scheduling option will be enabled by default in a future release.
Stronger scheduler + KV ecosystem
The scheduler is now more robust with KV connectors, prefix caching, and multi-node deployments.
Anthropic API support
Added support for the /v1/messages API endpoint. You can now use vllm serve with Anthropic-compatible clients.
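The following is a minimal sketch of a request to a locally running server using the Anthropic Messages request format; the host, port, and model name are placeholders for your own deployment.

    # Send an Anthropic-style messages request to the vLLM server (illustrative values)
    curl http://localhost:8000/v1/messages \
      -H "Content-Type: application/json" \
      -d '{
        "model": "<model_name>",
        "max_tokens": 256,
        "messages": [
          {"role": "user", "content": "Summarize the 3.2.5 release notes in one sentence."}
        ]
      }'
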
AI accelerator hardware updates

IBM Spyre support for IBM Power and IBM Z is now Generally Available.

Note
  • Single-host deployments for IBM Spyre AI accelerators on IBM Z and IBM Power are supported for RHEL AI 9.6 only.
  • Cluster deployments for IBM Spyre AI accelerators on IBM Z are supported as part of Red Hat OpenShift AI version 3.0+ only.

1.3. New Red Hat AI Model Optimization Toolkit developer features

Red Hat AI Model Optimization Toolkit 3.2.5 packages the upstream LLM Compressor v0.8.1 release. This is unchanged from the Red Hat AI Inference Server 3.2.3 and 3.2.4 releases. See the Version 3.2.3 release notes for more information.

1.4. Known issues

  • The FlashInfer kernel sampler was disabled by default in Red Hat AI Inference Server 3.2.3 to address non-deterministic behavior and correctness errors in model output.

    This change affects sampling behavior when using FlashInfer top-p and top-k sampling methods. If required, you can enable FlashInfer by setting the VLLM_USE_FLASHINFER_SAMPLER environment variable at runtime:

    VLLM_USE_FLASHINFER_SAMPLER=1
  • AMD ROCm AI accelerators do not support inference serving encoder-decoder models when using the vLLM v1 inference engine.

    Encoder-decoder model architectures cause NotImplementedError failures with AMD ROCm accelerators. ROCm attention backends support only decoder-only attention.

    Affected models include, but are not limited to, the following:

    • Speech-to-text Whisper models, for example openai/whisper-large-v3-turbo and mistralai/Voxtral-Mini-3B-2507
    • Vision-language models, for example microsoft/Phi-3.5-vision-instruct
    • Translation models, for example T5, BART, MarianMT
    • Any models using cross-attention or an encoder-decoder architecture
  • Inference fails for MP3 and M4A file formats. When querying audio models with these file formats, the system returns a "format not recognized" error.

    {"error":{"message":"Error opening <_io.BytesIO object at 0x7fc052c821b0>: Format not recognised.","type":"Internal Server Error","param":null,"code":500}}

    This issue affects audio transcription models such as openai/whisper-large-v3 and mistralai/Voxtral-Small-24B-2507. To work around this issue, convert audio files to WAV format before processing, for example with ffmpeg as shown in the sketch after this list.

  • Jemalloc consumes more memory than glibc when deploying models on IBM Spyre AI accelerators.

    When deploying models with jemalloc as the memory allocator, overall memory usage is significantly higher than when using glibc. In testing, jemalloc increased memory consumption by more than 50% compared to glibc. To work around this issue, disable jemalloc by unsetting the LD_PRELOAD environment variable so that the system uses glibc as the memory allocator instead; see the sketch after this list.

  • On IBM Z systems with FIPS mode enabled, Red Hat AI Inference Server fails to start when the IBM Spyre platform plugin is in use. A _hashlib.UnsupportedDigestmodError error is shown in the model startup logs. This issue occurs in Red Hat AI Inference Server 3.2.5 with the IBM Spyre plugin on IBM Z, which uses vLLM v0.11.0. The issue is fixed in vLLM v0.11.1, and will be included in a future version of Red Hat AI Inference Server.
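
As referenced in the audio file format issue above, the following is a minimal sketch of converting an MP3 recording to WAV with ffmpeg before sending it for transcription; the file names are placeholders.

    # Convert an MP3 recording to WAV so the transcription endpoint can read it
    ffmpeg -i input.mp3 output.wav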
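
For the jemalloc memory consumption issue above, the following is a minimal sketch of switching back to the glibc allocator by clearing LD_PRELOAD; the container image and model name are illustrative, and the exact deployment command depends on your environment.

    # Clear LD_PRELOAD in the serving environment so glibc malloc is used instead of jemalloc
    unset LD_PRELOAD

    # Or override the variable when the container starts (image and arguments are illustrative)
    podman run --rm -it -e LD_PRELOAD= \
      registry.redhat.io/rhaiis/vllm-spyre-rhel9:3.2.5 --model <model_name>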