Chapter 2. Version 3.4.0 release notes


Red Hat AI Inference 3.4.0 introduces distributed inference capabilities with Distributed Inference with llm-d and expands CPU inference support. This release provides container images that optimize inference with large language models (LLMs) for NVIDIA CUDA, AMD ROCm, Google TPU, Intel Gaudi, and IBM Spyre AI accelerators, as well as Intel Xeon and AMD EPYC CPUs, with multi-architecture support for s390x (IBM Z) and ppc64le (IBM Power).

Red Hat AI Inference Server is now Red Hat AI Inference

Red Hat AI Inference Server (RHAIIS) has expanded in scope and is now called Red Hat AI Inference. The new name reflects a broader offering that includes distributed inference capabilities alongside optimized standalone inference deployments:

  • Distributed Inference with llm-d (Technology Preview): Deploy large language models across Kubernetes clusters with intelligent scheduling that reduces latency and increases throughput.
  • Enterprise-grade model optimization: Access Red Hat’s curated repository of validated and optimized models on Hugging Face, tuned for performance across multiple hardware accelerators.
  • Extensive AI accelerator hardware support: Deploy on NVIDIA, AMD, Google TPU, AWS Inferentia, and IBM AI accelerators. Standalone inference supports any Kubernetes or Linux environment; distributed inference is available for managed Kubernetes services including Azure AKS, CoreWeave CKS, and OpenShift Container Platform.
  • Observability: Monitor cluster performance for Distributed Inference with llm-d by using inference-aware metrics.
Important

The standalone inference experience from the previous Red Hat AI Inference Server version is preserved and continues as a supported deployment option.

Distributed Inference with llm-d (Technology Preview)

Distributed Inference with llm-d is a Kubernetes-native framework for serving large language models at scale on OpenShift Container Platform. Created and led by Red Hat with contributions from Google, NVIDIA, AMD, and Hugging Face, Distributed Inference with llm-d provides enterprise-grade inference serving for production AI workloads.

Distributed Inference with llm-d delivers the following core capabilities:

  • Intelligent inference scheduling: A prefix cache aware scheduler routes requests to the replica most likely to have relevant KV cache entries already populated. The scheduler evaluates GPU utilization, queue depth, cache residency, and load distribution to optimize throughput and time-to-first-token latency.
  • KV cache management: Efficient management of key-value cache across distributed inference servers. The scheduler routes requests to replicas with warm KV cache entries to avoid redundant prompt processing.
  • Prefill-decode disaggregation (Developer Preview): Separates the compute-intensive prefill phase from the latency-sensitive decode phase, allowing each phase to scale independently on optimized hardware. This increases GPU utilization, reduces tail latency, and lowers cost per token.
  • Wide expert parallelism (Developer Preview): Supports distributed inference of mixture-of-experts (MoE) models across many GPU nodes for cost-effective scaling of large models.

Distributed Inference with llm-d separates the model serving control plane from the inference data plane, enabling platform teams to swap runtimes or schedulers independently. You can deploy Distributed Inference with llm-d on OpenShift Container Platform 4.19 or later, or on any CNCF-certified managed Kubernetes service running version 1.33 or later, including Azure Kubernetes Service (AKS) and CoreWeave Kubernetes Service.

2.1. Container images

The following container images are Generally Available (GA) from registry.redhat.io:

  • registry.redhat.io/rhaii/vllm-cuda-rhel9:3.4.0
  • registry.redhat.io/rhaii/vllm-rocm-rhel9:3.4.0
  • registry.redhat.io/rhaii/vllm-spyre-rhel9:3.4.0 (s390x, ppc64le, x86_64)
  • registry.redhat.io/rhaii/model-opt-cuda-rhel9:3.4.0
  • registry.redhat.io/rhaii/vllm-cpu-rhel9:3.4.0
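
For example, after authenticating to registry.redhat.io, you can pull one of the GA images with Podman. The login step assumes you have valid Red Hat registry credentials or a registry service account; substitute whichever image from the list above you need.

    # Log in to the Red Hat container registry (prompts for credentials).
    podman login registry.redhat.io

    # Pull the CUDA-enabled vLLM image for this release.
    podman pull registry.redhat.io/rhaii/vllm-cuda-rhel9:3.4.0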

The following container images are Technology Preview features:

  • registry.redhat.io/rhaii/vllm-tpu-rhel9:3.4.0
  • registry.redhat.io/rhaii/vllm-neuron-rhel9:3.4.0
  • registry.redhat.io/rhaii/vllm-gaudi-rhel9:3.4.0
Important

Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

The Red Hat AI Inference supported product and hardware configurations have been expanded. For more information, see Supported product and hardware configurations.

Note

The following Technology Preview container images bundle different upstream vLLM versions:

  • vllm-tpu-rhel9:3.4.0 bundles vLLM v0.13.0.
  • vllm-neuron-rhel9:3.4.0 bundles vLLM v0.13.0.
  • vllm-gaudi-rhel9:3.4.0 bundles vLLM v0.17.1.

2.2. New Red Hat AI Inference developer features

Red Hat AI Inference 3.4.0 packages the upstream vLLM v0.18.0 release. You can review the complete list of updates in the upstream vLLM v0.18.0 release notes.
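
If you need to confirm which vLLM build a given image bundles, one way is to query the installed package inside the container. The sketch below assumes the image provides a Python interpreter named python3 on the PATH and that overriding the entrypoint is acceptable in your environment; adjust the image reference as needed.

    # Print the vLLM version packaged in the CUDA image (entrypoint override is illustrative).
    podman run --rm --entrypoint python3 \
        registry.redhat.io/rhaii/vllm-cuda-rhel9:3.4.0 \
        -c "import vllm; print(vllm.__version__)"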

AMD Instinct MI355X and MI350X accelerator support
Added support for AMD Instinct MI355X (OAM) and MI350X accelerators for high-performance AI inference workloads. MI355X and MI350X accelerators provide 288 GB HBM3e memory per GPU. This feature requires ROCm 7.1 and PyTorch 2.10.
CPU inference support

CPU inference is now generally available (GA) in Red Hat AI Inference through the vllm-cpu-rhel9 container. With this capability, enterprises can serve models cost-effectively on existing CPU fleets, making it well suited for use cases such as AI virtual agents, Retrieval-Augmented Generation (RAG), guardrail LLMs, edge AI, embedding models, batch inference, and observability. For optimal performance, set LD_PRELOAD=/usr/lib64/libomp.so when starting the container, as shown in the example after the following list. For deployments that serve small language models with fewer than 20B parameters at moderate concurrency, vLLM CPU inference can deliver competitive performance with strong total cost of ownership (TCO) advantages.

  • Intel Xeon CPU inference support: This integration is optimized for the AVX2, AVX-512, and Advanced Matrix Extensions (AMX) instruction sets, and automatically detects and applies the best instruction set available on your processor. AMX provides dedicated on-die matrix-multiplication engines that accelerate BF16, FP16, INT8, and INT4 (W4A16) workloads.
  • AMD EPYC CPU inference support: Red Hat AI Inference introduces CPU inference for AMD EPYC processors, powered by the vLLM-CPU engine and the ZenDNN backend. ZenDNN delivers tuned kernels, optimized primitives, and graph enhancements (through ZenTorch) to run frameworks like PyTorch efficiently. This feature supports Zen 5 (Turin) and Zen 4 (Genoa) architectures with BF16 inference and INT8/INT4 quantization for highly efficient model execution.
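
The following sketch shows one way to start the CPU image with the recommended LD_PRELOAD setting. The published port, the model placeholder, and the assumption that the container entrypoint launches the vLLM OpenAI-compatible server (so trailing arguments are passed to vllm serve) are illustrative; adapt them to your deployment.

    # Serve a model on CPU with the recommended OpenMP preload (placeholders are illustrative).
    podman run --rm -p 8000:8000 \
        -e LD_PRELOAD=/usr/lib64/libomp.so \
        registry.redhat.io/rhaii/vllm-cpu-rhel9:3.4.0 \
        --model <model>
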
IBM Spyre prefix caching and chunked prefill support

Prefix caching and chunked prefill are now supported for IBM Spyre accelerator deployments. This feature improves inference performance by enabling reuse of previously computed prefix states during model execution.

When using pre-compiled model caches, the supported chunk lengths are:

  • 1024 for x86 and Power architectures
  • 512 for IBM Z (s390x) architecture

For detailed implementation and configuration information for the IBM Power architecture, see IBM Spyre Accelerator for Power.
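
As a rough orientation only: upstream vLLM exposes the --enable-prefix-caching and --enable-chunked-prefill options, but how the IBM Spyre plugin surfaces them and how the chunk lengths above interact with pre-compiled model caches is configuration specific, so treat the following as a hypothetical sketch and rely on the linked IBM documentation for the authoritative settings.

    # Hypothetical invocation using generic upstream vLLM flags; Spyre deployments may differ.
    vllm serve <model> \
        --enable-prefix-caching \
        --enable-chunked-prefill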

2.3. New Red Hat AI Model Optimization Toolkit developer features

Red Hat AI Model Optimization Toolkit 3.4.0 packages the upstream LLM Compressor v0.10.0.1 release. You can review the complete list of updates in the upstream LLM Compressor v0.10.0.1 release notes.

2.4. Known issues

  • Neuron variant does not support structured output with on-device sampling enabled.

    When sending a chat completion request with response_format: { type: "json_schema" } to the Neuron variant, the server accepts the request but returns unstructured plain text instead of JSON. Structured output is silently ignored when on-device sampling is enabled.

    To work around this issue, disable on-device sampling by passing the following configuration:

    --additional-config '{"override_neuron_config": {"on_device_sampling_config": null}}'

    Or in Python:

    from vllm import LLM

    # Disable on-device sampling by overriding the Neuron configuration.
    llm = LLM(
        model="my-model",
        additional_config={
            "override_neuron_config": {
                "on_device_sampling_config": None,
            }
        },
    )
    Note
    • async_mode: true is incompatible with CPU sampling. You must disable async_mode when disabling on-device sampling.
    • Performance is reduced when on-device sampling is disabled.

    For more information about on-device sampling, see AWS Neuron on-device sampling documentation.

  • FP8-quantized models with Multi-head Latent Attention (MLA) crash on Ampere GPUs.

    When serving FP8-quantized models that use MLA on GPUs with compute capability less than 8.9, the vLLM API server crashes during inference. This issue affects NVIDIA A100, A6000, and other Ampere architecture GPUs.

    Affected models include RedHatAI/sarvam-105b-FP8-Dynamic and other FP8-quantized DeepSeek V2 or MLA-based models.

    GPUs with compute capability 8.9 or higher, such as H100 and L40S, are not affected because they use native FP8 compute paths.

    To work around this issue, use H100, L40S, or other GPUs with compute capability 8.9 or higher, or use non-FP8 quantized model variants on Ampere GPUs.
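
    To check whether a GPU in your environment falls below the compute capability 8.9 threshold, you can query the driver directly. This assumes a reasonably recent NVIDIA driver, because older nvidia-smi builds do not support the compute_cap query field.

    # Prints each visible GPU and its compute capability, for example 8.0 for A100 or 9.0 for H100.
    nvidia-smi --query-gpu=name,compute_cap --format=csv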

  • Unable to query video models deployed using Red Hat AI Inference 3.4.0.

    While the model loads correctly, it fails to respond when queried. This occurs because base images do not ship the Cisco OpenH264 codec for legal and compliance reasons. The ffmpeg-free-rhai package only enables H.264 support when a compatible libopenh264.so.7 is provided at runtime. To work around this issue, provide libopenh264.so.7 at runtime. For OpenShift or Kubernetes deployments, use volume mounts to make the library available to the container.
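
    For a standalone Podman deployment, one approach is a read-only bind mount of a host copy of the library. The host path and in-container destination below are illustrative assumptions; the only requirement is that libopenh264.so.7 resolves on the container's library search path at runtime.

    # Bind-mount a host copy of the OpenH264 library into the container (paths are examples).
    podman run --rm -p 8000:8000 \
        -v /usr/lib64/libopenh264.so.7:/usr/lib64/libopenh264.so.7:ro,Z \
        registry.redhat.io/rhaii/vllm-cuda-rhel9:3.4.0 \
        --model <model>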

  • The vllm-rocm-rhel9 container fails to start when serving mistralai/Mistral-Small-3.1-24B-Instruct-2503 on AMD ROCm GPUs.

    The vLLM API server crashes during engine initialization while profiling the vision encoder with a HIP runtime error (invalid argument). This issue occurs because the model uses the Pixtral multimodal architecture, which requires precompiled GPU kernel images that are not included in Red Hat AI Inference 3.4.0. This issue will be fixed in a future z-stream release.

  • The vLLM API server fails to start when loading the RedHatAI/sarvam-105b-FP8-Dynamic model due to a Transformers v5 incompatibility.

    During initialization, the model configuration triggers a RoPE validation error caused by a breaking API change in Transformers v5. The model’s configuration calls validate_rope(ignore_keys=...), but Transformers v5 removed the ignore_keys parameter from RotaryEmbeddingConfigMixin.validate_rope(). This causes a runtime exception during AutoConfig.from_pretrained(), preventing model loading.

    The issue occurs because trust_remote_code=True causes the model’s custom configuration code to execute directly.
