Chapter 2. Version 3.4.0-ea.2 release notes


Red Hat AI Inference Server 3.4.0-ea.2 provides container images that optimize inference with large language models (LLMs) for NVIDIA CUDA, AMD ROCm, Google TPU, Intel Gaudi, and IBM Spyre AI accelerators, with multi-architecture support for s390x (IBM Z) and ppc64le (IBM Power).

Important

Red Hat AI Inference Server 3.4.0-ea.2 is an Early Access release. Early Access releases are not supported by Red Hat in any way and are not functionally complete or production-ready. Do not use Early Access releases for production or business-critical workloads. Use Early Access releases to test upcoming product features in advance of their possible inclusion in a Red Hat product offering, and to test functionality and provide feedback during the development process. These features might not have any documentation, are subject to change or removal at any time, and testing is limited. Red Hat might provide ways to submit feedback on Early Access features without an associated SLA.

The following container images are available as early access releases from registry.redhat.io:

  • registry.redhat.io/rhaii-early-access/vllm-cuda-rhel9:3.4.0-ea.2
  • registry.redhat.io/rhaii-early-access/vllm-rocm-rhel9:3.4.0-ea.2
  • registry.redhat.io/rhaii-early-access/vllm-spyre-rhel9:3.4.0-ea.2 (s390x, ppc64le, x86_64)
  • registry.redhat.io/rhaii-early-access/model-opt-cuda-rhel9:3.4.0-ea.2

The following container images are Technology Preview features:

  • registry.redhat.io/rhaii-early-access/vllm-tpu-rhel9:3.4.0-ea.2
  • registry.redhat.io/rhaii-early-access/vllm-neuron-rhel9:3.4.0-ea.2
  • registry.redhat.io/rhaii-early-access/vllm-cpu-rhel9:3.4.0-ea.2
  • registry.redhat.io/rhaii-early-access/vllm-gaudi-rhel9:3.4.0-ea.2
Important

Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

The Red Hat AI Inference Server supported product and hardware configurations have been expanded. For more information, see Supported product and hardware configurations.

Note

The following Technology Preview container images bundle different upstream vLLM versions:

  • vllm-tpu-rhel9:3.4.0-ea.2 bundles vLLM v0.13.0.
  • vllm-neuron-rhel9:3.4.0-ea.2 bundles vLLM v0.13.0.

2.1. New Red Hat AI Inference Server developer features

Red Hat AI Inference Server 3.4.0-ea.2 packages the upstream vLLM v0.16.0 release. You can review the complete list of updates in the upstream vLLM v0.16.0 release notes. The vLLM v0.16.0 branch was cut on February 8; features added to vLLM after that date are not included.

Introducing the Speculators library (Technology Preview)
Speculators 3.4.0-ea.2 packages the upstream Speculators 0.4.0a1 release. Speculators is an end-to-end training framework for creating EAGLE3 draft models that accelerate inference through speculative decoding, reducing inference latency by 1.5x to 3x while maintaining output quality.
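
For example, a draft model produced with Speculators can be served through vLLM's speculative decoding support. The following is a minimal sketch that uses the vLLM Python API; the base and draft model names are illustrative placeholders, and the exact speculative_config keys can vary between vLLM releases:

    from vllm import LLM, SamplingParams

    # Serve a base model with an EAGLE3 draft model produced by Speculators.
    # Both model names below are illustrative placeholders.
    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",
        speculative_config={
            "method": "eagle3",
            "model": "example-org/llama-3.1-8b-eagle3-draft",
            "num_speculative_tokens": 5,
        },
    )

    outputs = llm.generate(
        ["Briefly explain speculative decoding."],
        SamplingParams(max_tokens=128),
    )
    print(outputs[0].outputs[0].text)
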
AI accelerator and platform hardware updates
  • Intel Gaudi 3: Added Intel Gaudi 3 AI accelerator support as a Technology Preview feature through the vllm-gaudi hardware plugin (vllm-gaudi 0.16.0).

2.2. New Red Hat AI Model Optimization Toolkit features

Red Hat AI Model Optimization Toolkit 3.4.0-ea.2 packages the upstream LLM Compressor v0.10.0.1 release. You can review the complete list of updates in the upstream LLM Compressor v0.10.0.1 release notes.

LLM Compressor v0.10.0.1 brings significant performance improvements, updated quantization capabilities, and enhanced model offload support.

Distributed GPTQ with major performance improvements
GPTQ quantization now supports fully distributed operation, which results in significant speedups. It takes advantage of the underlying Distributed Data Parallel (DDP) improvements for calibration and adds weight-parallel compression.
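
For reference, a basic GPTQ one-shot run with LLM Compressor looks like the following minimal sketch; the model name, dataset, and calibration settings are illustrative, and the distributed calibration and weight-parallel compression described above apply to this same flow:

    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import GPTQModifier

    # Quantize Linear weights to 4-bit with GPTQ; names and settings are
    # illustrative.
    recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

    oneshot(
        model="meta-llama/Llama-3.1-8B-Instruct",
        dataset="open_platypus",
        recipe=recipe,
        output_dir="Llama-3.1-8B-Instruct-W4A16",
        max_seq_length=2048,
        num_calibration_samples=512,
    )
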
Enhanced compressed-tensors offloading (disk and distributed)
Compressed-tensors supports loading transformers models that are offloaded to disk, across distributed process ranks, or both. With disk offloading, you can load and compress very large models that would not normally fit in CPU memory. When you load offloaded models across distributed process ranks, the offloaded model memory is shared between ranks.
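
One way to produce a disk-offloaded model is the standard transformers loading path. This is a minimal sketch under that assumption, with placeholder names; per these release notes, a model loaded this way can then be passed into the compression flow:

    from transformers import AutoModelForCausalLM

    # Load a model too large for CPU memory, spilling weights to disk.
    # Per these release notes, llm-compressor v0.10 can compress models
    # loaded this way.
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Llama-3.1-70B-Instruct",  # illustrative
        device_map="auto",
        offload_folder="offload",  # weights spill to this directory
        torch_dtype="auto",
    )
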
Migration from accelerate to compressed-tensors offloading
LLM Compressor v0.10 no longer uses the offloading logic provided by Hugging Face’s accelerate library, and instead integrates with the model offloading provided by compressed-tensors.
GPTQ support for FP4 microscale schemes (NVFP4, MXFP4)
GPTQ supports FP4 microscale schemes including NVFP4 and MXFP4. Applying GPTQ to these schemes can result in improved recovery and overall quantization accuracy.
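
Assuming GPTQModifier accepts the FP4 scheme identifiers by name, switching the earlier W4A16 sketch to an FP4 microscale scheme is a one-line change to the recipe:

    from llmcompressor.modifiers.quantization import GPTQModifier

    # NVFP4 microscale scheme; swap in "MXFP4" for the MX variant.
    recipe = GPTQModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])
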
MXFP4 accuracy improvements
MXFP4 support has been updated with accuracy improvements for its weight scale generation. MXFP4 with activation quantization is not yet enabled in vLLM for compressed-tensors models.

2.3. Known issues

  • The model-opt-cuda-rhel9 image fails to deploy on clusters with CUDA versions earlier than 13.0.

    Deployment fails with a timeout error because the image requires CUDA 13.0 or later. To work around this issue, add the following environment variable to the deployment YAML file for the model-opt image:

    env:
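    # Disables the NVIDIA container runtime's CUDA version requirement check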
    - name: NVIDIA_DISABLE_REQUIRE
      value: "1"
  • Neuron variant returns plain text instead of JSON for structured output requests.

    When you send a chat completion request with response_format: { type: "json_schema" } to the Neuron variant, the server accepts the request but returns unstructured plain text instead of JSON. Basic /v1/completions and /v1/chat/completions requests without response_format work correctly. No error or warning is returned to the client. To work around this issue, use the CPU variant (vllm-cpu-rhel9) instead of the Neuron variant. A request sketch that reproduces the issue follows this list.

  • Unable to query video models deployed using Red Hat AI Inference Server 3.4.0-ea.2.

    While the model loads correctly, it fails to respond when queried. This occurs because the base images do not ship the Cisco OpenH264 codec for legal and compliance reasons. The ffmpeg-free-rhai package enables H.264 support only when a compatible libopenh264.so.7 is provided at runtime. To work around this issue, provide libopenh264.so.7 at runtime. For OpenShift or Kubernetes deployments, use volume mounts to make the library available to the container.
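
The following sketch shows the kind of request that triggers the Neuron structured-output issue described above, using the OpenAI Python client against a local endpoint. The base URL, model name, and schema are illustrative placeholders:

    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    # On the Neuron variant, this returns plain text instead of the requested
    # JSON; on the CPU variant it behaves as expected.
    response = client.chat.completions.create(
        model="ibm-granite/granite-3.1-8b-instruct",  # illustrative
        messages=[{"role": "user", "content": "Name one city, as JSON."}],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "city_answer",
                "schema": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        },
    )
    print(response.choices[0].message.content)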
