
Chapter 2. Version 3.3.0 release notes


Red Hat AI Inference Server 3.3.0 provides container images that optimize inferencing with large language models (LLMs) for NVIDIA CUDA, AMD ROCm, Google TPU, and IBM Spyre AI accelerators with multi-architecture support for s390x (IBM Z) and ppc64le (IBM Power).

The following container images are Generally Available (GA) from registry.redhat.io:

  • registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0
  • registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.3.0
  • registry.redhat.io/rhaiis/vllm-spyre-rhel9:3.3.0 (s390x, ppc64le, x86_64)
  • registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.3.0

The following container images are Technology Preview features:

  • registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.3.0
  • registry.redhat.io/rhaiis/vllm-neuron-rhel9:3.3.0
Important

The registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.3.0 and registry.redhat.io/rhaiis/vllm-neuron-rhel9:3.3.0 containers are Technology Preview features only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

The Red Hat AI Inference Server supported product and hardware configurations have been expanded. For more information, see Supported product and hardware configurations.

2.1. Early access AI Inference Server images

To facilitate customer testing of new models, early access "fast release" Red Hat AI Inference Server images are available as near-upstream preview builds. Fast release container images are not functionally complete or production-ready, have minimal productization, and are not supported by Red Hat in any way.

You can find available fast release images in the Red Hat ecosystem catalog.

2.2. New Red Hat AI Inference Server developer features

Red Hat AI Inference Server 3.3.0 packages the upstream vLLM v0.13.0 release. You can review the complete list of updates in the upstream vLLM v0.13.0 release notes.

Mistral 3 model support
Added support for Mistral 3 models including Mixture of Experts (MoE) architecture variants. This includes validated model configurations, tool calling parser updates, quantization options, and multimodal configuration guidance.
Geospatial model support
Added support for IBM Prithvi geospatial foundation models using Vision Transformer (ViT) architecture. This enables geospatial inference workflows with Terratorch integration for earth observation and climate analysis use cases.
NVIDIA B300 and GB300 AI accelerator support
Added support for NVIDIA B300 and GB300 Blackwell AI accelerators with CUDA 13.0. NVIDIA B300 and GB300 accelerators deliver significantly higher throughput, improved memory bandwidth, and enhanced efficiency for large-scale AI training and inference workloads compared to previous generations.
AMD MI325X AI accelerator support
Added support for the AMD Instinct MI325X AI accelerator. MI325X extends the AMD ROCm platform options for high-performance inference workloads.
Inference support for CPU-only x86_64 AVX2 (Technology Preview)
Added support for running inference on CPU-only x86_64 systems without GPU acceleration. This deployment option includes validated CPU models and a Podman deployment procedure for CPU-based inference.
AWS Trainium and Inferentia support (Technology Preview)
Added Technology Preview support for AWS Neuron accelerators including AWS Trainium and AWS Inferentia. This includes a Podman deployment procedure and updates to supported accelerator documentation.
Various new model support features
  • Added support for BAGEL (AR only), AudioFlamingo3, JAIS 2, and latent MoE architecture.
  • Added new tool parsers for DeepSeek-V3.2, Gigachat 3, and Holo2 reasoning.
  • Added Qwen3-VL enhancements including embeddings and efficient video sampling.
  • Enabled chunked prefill for all pooling tasks.
  • Added multi-vector retrieval API.
Performance improvements
  • Whisper models now run approximately 3 times faster compared to vLLM v0.12.0.
  • DeepSeek-V3.1 models provide 5.3% throughput improvement and 4.4% TTFT improvement with DeepEP High-Throughput CUDA.
  • DeepGEMM fused layout provides 4.3% throughput improvement and 10.7% TTFT improvement.
AI accelerator and platform hardware updates
  • NVIDIA: Added support for NVIDIA Blackwell Ultra (SM103/GB300) with CUDA 13 and W4A8 grouped GEMM on Hopper GPUs.
  • AMD: Added ROCm enhancements including MXFP4 w4a4 inference.
  • Intel: Added XPU wNa16 compressed tensors support.
  • CPU: Added CPU backend encoder-decoder model support and ARM NEON vectorized attention optimizations.
Inference engine updates
  • Added conditional compilation via compile_ranges for selective kernel compilation.
  • Added xxHash high-performance option for prefix caching.
  • Added PrefixLM support for FlexAttention and TritonAttention.
  • Added online FP8 with streaming post-processing.
API and compatibility changes
  • The VLLM_ATTENTION_BACKEND environment variable has been replaced with the --attention-backend CLI argument.
  • Removed deprecated -O.xx flag and deprecated plugin and compilation fields.
  • Removed deprecated task, seed, and multimodal settings.
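The attention backend change above can be sketched as follows; the backend name and model placeholder are illustrative, not prescriptive:

```shell
# Previously (now removed): the backend was selected with an environment variable.
#   VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve <model>

# In vLLM v0.13.0, pass the backend as a CLI argument instead:
vllm serve <model> --attention-backend FLASH_ATTN
```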

2.3. New Red Hat AI Model Optimization Toolkit developer features

Red Hat AI Model Optimization Toolkit 3.3.0 packages the upstream LLM Compressor v0.9.0.1 release. You can review the complete list of updates in the upstream LLM Compressor v0.9.0 release notes.

Model-free post-training quantization
A new pathway allows quantization directly on safetensors files without requiring a transformers model definition. Model-free post-training quantization (PTQ) currently supports data-free methods only, specifically FP8 quantization.
Extended quantization support
  • Enhanced KV cache and attention quantization capabilities with support for new per-head strategies.
  • Added batched calibration support with configurable batch_size and data_collator arguments.
  • Added experimental MXFP4 quantization support.
AutoRound Modifier
Added AutoRoundModifier which employs an advanced algorithm that optimizes rounding and clipping ranges through sign-gradient descent, combining post-training efficiency with parameter tuning adaptability.
Generalized AWQ Modifier
The AWQ modifier now supports INT8, FP8, and mixed quantization schemes.
Breaking changes
  • Training support APIs have been completely removed. Use the LLM Compressor Axolotl integration instead.
  • Python 3.9 support has been discontinued. LLM Compressor v0.9.0 requires Python 3.10 or later.
  • AutoRound is now an optional installation accessible through the llmcompressor[autoround] package specification.
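For example, to install LLM Compressor with the optional AutoRound dependency, using the package extra named in the note above:

```shell
pip install "llmcompressor[autoround]"
```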

2.4. Upgrading from Red Hat AI Inference Server 3.2.2 or earlier

Customers upgrading from Red Hat AI Inference Server 3.2.2 or earlier to 3.3.0 should be aware of the following changes. Customers already on Red Hat AI Inference Server 3.2.3 or later are not affected by these changes.

vLLM V0 engine deprecation is complete

Starting from Red Hat AI Inference Server 3.2.3, the vLLM V0 engine has been completely removed. The V1 engine is now the only inference engine in vLLM.

  • Removed AsyncLLMEngine, LLMEngine, MQLLMEngine, attention backends, encoder-decoder, samplers, LoRA interface, and hybrid model support.
  • Removed legacy attention classes, multimodal registry, compilation fallbacks, and default args from the old system during clean-up.
Encoder-decoder model support removed

With the removal of the V0 engine, encoder-decoder model support was removed. The following model classes are no longer supported:

  • BartForConditionalGeneration
  • MBartForConditionalGeneration
  • DonutForConditionalGeneration
  • Florence2ForConditionalGeneration
  • MllamaForConditionalGeneration

BART support was temporarily removed in vLLM v0.11.0 as part of the finalization of the vLLM V0 engine deprecation. BART support is expected to return in a future release.

2.5. Known issues

  • The FlashInfer kernel sampler was disabled by default in Red Hat AI Inference Server 3.2.3 to address non-deterministic behavior and correctness errors in model output.

    This change affects sampling behavior when using FlashInfer top-p and top-k sampling methods. If required, you can enable FlashInfer by setting the VLLM_USE_FLASHINFER_SAMPLER environment variable at runtime:

    VLLM_USE_FLASHINFER_SAMPLER=1
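For example, the variable can be passed into the serving container with Podman; the GPU device flag and model name are illustrative and depend on your environment:

```shell
# Re-enable the FlashInfer sampler at runtime (not recommended unless required).
podman run --rm -it \
  --device nvidia.com/gpu=all \
  -e VLLM_USE_FLASHINFER_SAMPLER=1 \
  registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0 \
  --model <model>
```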
  • AMD ROCm AI accelerators do not support inference serving encoder-decoder models when using the vLLM v1 inference engine.

Encoder-decoder model architectures fail with NotImplementedError on AMD ROCm accelerators because ROCm attention backends support only decoder-only attention.

    Affected models include, but are not limited to, the following:

    • Speech-to-text Whisper models, for example openai/whisper-large-v3-turbo and mistralai/Voxtral-Mini-3B-2507
    • Vision-language models, for example microsoft/Phi-3.5-vision-instruct
    • Translation models, for example T5, BART, MarianMT
    • Any models using cross-attention or an encoder-decoder architecture
  • Inference fails for MP3 and M4A file formats. When querying audio models with these file formats, the system returns a "format not recognized" error.

    {"error":{"message":"Error opening <_io.BytesIO object at 0x7fc052c821b0>: Format not recognised.","type":"Internal Server Error","param":null,"code":500}}

    This issue affects audio transcription models such as openai/whisper-large-v3 and mistralai/Voxtral-Small-24B-2507. To work around this issue, convert audio files to WAV format before processing.
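For example, assuming ffmpeg is available on the host, converting an affected file to WAV might look like this:

```shell
# Convert an unsupported MP3 file to WAV before sending it for transcription.
ffmpeg -i input.mp3 input.wav
```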

  • Jemalloc consumes more memory than glibc when deploying models on IBM Spyre AI accelerators.

    When deploying models with jemalloc as the memory allocator, overall memory usage is significantly higher than when using glibc. In testing, jemalloc increased memory consumption by more than 50% compared to glibc. To work around this issue, disable jemalloc by unsetting the LD_PRELOAD environment variable so the system uses glibc as the memory allocator instead.
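A minimal sketch of the workaround; the Podman invocation in the comment is illustrative:

```shell
# Unset LD_PRELOAD so glibc is used as the memory allocator instead of jemalloc.
unset LD_PRELOAD

# When launching with Podman, passing an empty value has the same effect:
#   podman run --env LD_PRELOAD= registry.redhat.io/rhaiis/vllm-spyre-rhel9:3.3.0 ...
```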

  • GPT-OSS model produces empty or gibberish responses when using multiple GPUs.

    When deploying the GPT-OSS model with tensor parallelism greater than 1, the model produces empty or incorrect output. This issue is related to the Triton attention kernel. To work around this issue, use the --no-enable-prefix-caching CLI argument when running the model.
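For example, an illustrative multi-GPU serving command with the workaround applied; the model name and GPU device flag are placeholders:

```shell
# Serve with tensor parallelism across 2 GPUs and prefix caching disabled.
podman run --rm -it \
  --device nvidia.com/gpu=all \
  registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0 \
  --model <gpt-oss-model> \
  --tensor-parallel-size 2 \
  --no-enable-prefix-caching
```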

  • Google TPU inference supports only a limited set of model architectures.

    Native TPU inference currently supports a subset of model architectures. For unsupported models, inference falls back to PyTorch, which may result in reduced performance. Additionally, native TPU inference support is not available for some Google models.

    For the list of models with native TPU support, see vLLM TPU Installation.

  • Models quantized with Red Hat AI Model Optimization Toolkit may fail to load in Red Hat AI Inference Server 3.3.0.

    Models generated using the registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.3.0 container image may encounter errors when loaded with the registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0 container image.

    To work around this issue, remove the scale_dtype and zp_dtype configuration entries from the config.json file in the quantized model’s directory.
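A minimal Python sketch of the workaround. It assumes the keys may appear at the top level or nested (for example under quantization_config), so it strips them recursively; the model directory name is a placeholder:

```python
import json
from pathlib import Path

def strip_quantization_dtypes(obj):
    """Recursively remove scale_dtype and zp_dtype entries from a config tree."""
    drop = {"scale_dtype", "zp_dtype"}
    if isinstance(obj, dict):
        return {k: strip_quantization_dtypes(v) for k, v in obj.items() if k not in drop}
    if isinstance(obj, list):
        return [strip_quantization_dtypes(v) for v in obj]
    return obj

# Clean the config.json in the quantized model's directory (path is illustrative).
config_path = Path("my-quantized-model") / "config.json"
if config_path.exists():
    config = json.loads(config_path.read_text())
    config_path.write_text(json.dumps(strip_quantization_dtypes(config), indent=2))
```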

  • Streaming tool calls with Mistral models return invalid JSON through the /v1/messages endpoint.

    When using Mistral models through the Anthropic-compatible /v1/messages endpoint with streaming tool calls, the generated arguments for tool calls contain invalid JSON. This causes most clients to fail when parsing tool call responses.

    This issue affects only the /v1/messages Anthropic-compatible endpoint with Mistral models. Mistral models using the OpenAI-compatible endpoints and non-Mistral models using the /v1/messages endpoint are not affected.
