
Chapter 2. Version 3.4.0-ea.2 release notes


Red Hat AI Inference Server 3.4.0-ea.2 provides container images that optimize inferencing with large language models (LLMs) for NVIDIA CUDA, AMD ROCm, Google TPU, Intel Gaudi, and IBM Spyre AI accelerators with multi-architecture support for s390x (IBM Z) and ppc64le (IBM Power).

Important

Red Hat AI Inference Server 3.4.0-ea.2 is an Early Access release. Early Access releases are not supported by Red Hat in any way and are not functionally complete or production-ready. Do not use Early Access releases for production or business-critical workloads. Use Early Access releases to test upcoming product features in advance of their possible inclusion in a Red Hat product offering, and to test functionality and provide feedback during the development process. These features might not have any documentation, are subject to change or removal at any time, and testing is limited. Red Hat might provide ways to submit feedback on Early Access features without an associated SLA.

The following container images are available as early access releases from registry.redhat.io:

  • registry.redhat.io/rhaii-early-access/vllm-cuda-rhel9:3.4.0-ea.2
  • registry.redhat.io/rhaii-early-access/vllm-rocm-rhel9:3.4.0-ea.2
  • registry.redhat.io/rhaii-early-access/vllm-spyre-rhel9:3.4.0-ea.2 (s390x, ppc64le, x86_64)
  • registry.redhat.io/rhaii-early-access/model-opt-cuda-rhel9:3.4.0-ea.2

The following container images are Technology Preview features:

  • registry.redhat.io/rhaii-early-access/vllm-tpu-rhel9:3.4.0-ea.2
  • registry.redhat.io/rhaii-early-access/vllm-neuron-rhel9:3.4.0-ea.2
  • registry.redhat.io/rhaii-early-access/vllm-cpu-rhel9:3.4.0-ea.2
  • registry.redhat.io/rhaii-early-access/vllm-gaudi-rhel9:3.4.0-ea.2
Important

Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

The Red Hat AI Inference Server supported product and hardware configurations have been expanded. For more information, see Supported product and hardware configurations.

Note

The following Technology Preview container images bundle different upstream vLLM versions:

  • vllm-tpu-rhel9:3.4.0-ea.2 bundles vLLM v0.13.0.
  • vllm-neuron-rhel9:3.4.0-ea.2 bundles vLLM v0.13.0.

2.1. New Red Hat AI Inference Server developer features

Red Hat AI Inference Server 3.4.0-ea.2 packages the upstream vLLM v0.16.0 release. You can review the complete list of updates in the upstream vLLM v0.16.0 release notes. The vLLM v0.16.0 release branch was cut on February 8; features added to vLLM after that date are not included in this release.

Introducing Speculators library (Technology Preview)
Speculators 3.4.0-ea.2 packages the upstream Speculators 0.4.0a1 release. Speculators is an end-to-end training framework for creating EAGLE3 draft models that accelerate inference through speculative decoding, reducing model latency by 1.5-3x while maintaining output quality.
AI accelerator and platform hardware updates
  • Intel Gaudi 3: Added Intel Gaudi 3 AI accelerator support as a Technology Preview feature through the vllm-gaudi hardware plugin (vllm-gaudi 0.16.0)
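The EAGLE3 speculative decoding that Speculators targets is exercised at serve time. The following sketch assumes the upstream vLLM --speculative-config option and an EAGLE3 draft model; the model names are placeholders, not artifacts of this release, so substitute your own verifier model and a matching draft trained with Speculators:

```
# Hypothetical model names; "method", "model", and "num_speculative_tokens"
# follow the upstream vLLM speculative decoding configuration.
vllm serve example-org/base-model \
  --speculative-config '{
    "method": "eagle3",
    "model": "example-org/base-model-eagle3-draft",
    "num_speculative_tokens": 3
  }'
```

With a well-matched draft model, accepted speculative tokens are what produce the 1.5-3x latency reduction described above.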

2.2. New Red Hat AI Model Optimization Toolkit developer features

Red Hat AI Model Optimization Toolkit 3.4.0-ea.2 packages the upstream LLM Compressor v0.10.0.1 release. You can review the complete list of updates in the upstream LLM Compressor v0.10.0.1 release notes.

LLM Compressor v0.10.0.1 brings significant performance improvements, updated quantization capabilities, and enhanced model offload support.

Distributed GPTQ with major performance improvements
GPTQ quantization now supports fully distributed operation, which results in significant speedups. It takes advantage of the underlying Distributed Data Parallel (DDP) improvements for calibration and adds weight-parallel compression.
Enhanced compressed-tensors offloading (disk and distributed)
Compressed-tensors supports loading transformers models that are offloaded to disk, across distributed process ranks, or both. With disk offloading, you can load and compress very large models that would not normally fit in CPU memory. When an offloaded model is loaded across distributed process ranks, its memory is shared between the ranks.
Migration from accelerate to compressed-tensors offloading
LLM Compressor v0.10 no longer uses offloading logic provided by Hugging Face’s accelerate library, instead opting to integrate with model offloading provided by compressed-tensors.
GPTQ support for FP4 microscale schemes (NVFP4, MXFP4)
GPTQ supports FP4 microscale schemes including NVFP4 and MXFP4. Applying GPTQ to these schemes can result in improved recovery and overall quantization accuracy.
MXFP4 accuracy improvements
MXFP4 support has been updated with accuracy improvements for its weight scale generation. MXFP4 with activation quantization is not yet enabled in vLLM for compressed-tensors models.
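As a rough illustration of the FP4 support above, a quantization recipe along these lines could request GPTQ with an NVFP4 scheme. This is a hedged sketch assuming the upstream LLM Compressor recipe format; the stage and field names are illustrative and should be verified against the LLM Compressor documentation:

```yaml
# Hypothetical recipe sketch; check field names against LLM Compressor docs.
quant_stage:
  quant_modifiers:
    GPTQModifier:
      targets: ["Linear"]
      scheme: "NVFP4"
      ignore: ["lm_head"]
```

The same shape with scheme: "MXFP4" would target the MXFP4 path, subject to the vLLM activation-quantization limitation noted above.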

2.3. Known issues

  • The model-opt-cuda-rhel9 image fails to deploy on clusters with CUDA versions earlier than 13.0.

    Deployment fails with a timeout error because the image requires CUDA 13.0 or later. To work around this issue, add the following environment variable to the deployment YAML file for the model-opt image:

    env:
    - name: NVIDIA_DISABLE_REQUIRE
      value: "1"
  • Neuron variant returns plain text instead of JSON for structured output requests.

    When sending a chat completion request with response_format set to { "type": "json_schema" } to the Neuron variant, the server accepts the request but returns unstructured plain text instead of JSON. Basic /v1/completions and /v1/chat/completions requests without response_format work correctly. No error or warning is returned to the client. To work around this issue, use the CPU variant (vllm-cpu-rhel9) instead of the Neuron variant.
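For reference, a request that triggers this behavior has the following shape. This is a minimal sketch of an OpenAI-compatible request body; the model name and schema are placeholders, not defaults shipped with the product:

```python
import json

# Placeholder model name and schema, shown only to illustrate the
# response_format shape that the Neuron variant mishandles.
payload = {
    "model": "example-model",
    "messages": [{"role": "user", "content": "List two fruits as JSON."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "fruits",
            "schema": {
                "type": "object",
                "properties": {
                    "fruits": {"type": "array", "items": {"type": "string"}}
                },
                "required": ["fruits"],
            },
        },
    },
}

# With this issue, the Neuron variant returns plain text for such a
# request; the CPU variant honors the schema.
print(json.dumps(payload, indent=2))
```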

  • Unable to query video models deployed using Red Hat AI Inference Server 3.4.0-ea.2.

    While the model loads correctly, it fails to respond when queried. This occurs because base images do not ship the Cisco OpenH264 codec for legal and compliance reasons. The ffmpeg-free-rhai package only enables H.264 support when a compatible libopenh264.so.7 is provided at runtime. To work around this issue, provide libopenh264.so.7 at runtime. For OpenShift or Kubernetes deployments, use volume mounts to make the library available to the container.
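On OpenShift or Kubernetes, the runtime library can be supplied with a volume mount along the following lines. This is a hedged sketch: the container name, volume source, and paths are assumptions that depend on where you obtain libopenh264.so.7 and on the image's library search path:

```yaml
# Hypothetical deployment fragment; adjust paths to where the container
# actually resolves libopenh264.so.7.
spec:
  containers:
  - name: vllm
    volumeMounts:
    - name: openh264
      mountPath: /usr/lib64/libopenh264.so.7
  volumes:
  - name: openh264
    hostPath:
      path: /opt/openh264/libopenh264.so.7
      type: File
```

A PersistentVolumeClaim or an init container that fetches the library would serve the same purpose as the hostPath volume shown here.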
