Chapter 2. Version 3.3.0 release notes
Red Hat AI Inference Server 3.3.0 provides container images that optimize inferencing with large language models (LLMs) for NVIDIA CUDA, AMD ROCm, Google TPU, and IBM Spyre AI accelerators with multi-architecture support for s390x (IBM Z) and ppc64le (IBM Power).
The following container images are Generally Available (GA) from registry.redhat.io:
- registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0
- registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.3.0
- registry.redhat.io/rhaiis/vllm-spyre-rhel9:3.3.0 (s390x, ppc64le, x86_64)
- registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.3.0
The following container images are Technology Preview features:
- registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.3.0
- registry.redhat.io/rhaiis/vllm-neuron-rhel9:3.3.0
The registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.3.0 and registry.redhat.io/rhaiis/vllm-neuron-rhel9:3.3.0 containers are Technology Preview features only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.
For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.
The Red Hat AI Inference Server supported product and hardware configurations have been expanded. For more information, see Supported product and hardware configurations.
2.1. Early access AI Inference Server images
To facilitate customer testing of new models, early access fast release Red Hat AI Inference Server images are available in near-upstream preview builds. Fast release container images are not functionally complete or production-ready, have minimal productization, and are not supported by Red Hat in any way.
You can find available fast release images in the Red Hat ecosystem catalog.
2.2. New Red Hat AI Inference Server developer features
Red Hat AI Inference Server 3.3.0 packages the upstream vLLM v0.13.0 release. You can review the complete list of updates in the upstream vLLM v0.13.0 release notes.
- Mistral 3 model support
- Added support for Mistral 3 models including Mixture of Experts (MoE) architecture variants. This includes validated model configurations, tool calling parser updates, quantization options, and multimodal configuration guidance.
- Geospatial model support
- Added support for IBM Prithvi geospatial foundation models using Vision Transformer (ViT) architecture. This enables geospatial inference workflows with Terratorch integration for earth observation and climate analysis use cases.
- NVIDIA B300 and GB300 AI accelerator support
- Added support for NVIDIA B300 and GB300 Blackwell AI accelerators with CUDA 13.0. NVIDIA B300 and GB300 accelerators deliver significantly higher throughput, improved memory bandwidth, and enhanced efficiency for large-scale AI training and inference workloads compared to previous generations.
- AMD MI325X AI accelerator support
- Added support for the AMD Instinct MI325X AI accelerator. MI325X extends the AMD ROCm platform options for high-performance inference workloads.
- Inference support for CPU-only x86_64 AVX2 (Technology Preview)
- Added support for running inference on CPU-only x86_64 systems without GPU acceleration. This deployment option enables inference on CPU-only systems and includes validated CPU models and a Podman deployment procedure for CPU-based inference.
- AWS Trainium and Inferentia support (Technology Preview)
- Added Technology Preview support for AWS Neuron accelerators including AWS Trainium and AWS Inferentia. This includes a Podman deployment procedure and updates to supported accelerator documentation.
- Various new model support features
- Added support for BAGEL (AR only), AudioFlamingo3, JAIS 2, and latent MoE architecture.
- Added new tool parsers for DeepSeek-V3.2, Gigachat 3, and Holo2 reasoning.
- Added Qwen3-VL enhancements including embeddings and efficient video sampling.
- Enabled chunked prefill for all pooling tasks.
- Added multi-vector retrieval API.
- Performance improvements
- Whisper models now run approximately 3 times faster compared to vLLM v0.12.0.
- DeepSeek-V3.1 models provide 5.3% throughput improvement and 4.4% TTFT improvement with DeepEP High-Throughput CUDA.
- DeepGEMM fused layout provides 4.3% throughput improvement and 10.7% TTFT improvement.
- AI accelerator and platform hardware updates
- NVIDIA: Added support for NVIDIA Blackwell Ultra (SM103/GB300) with CUDA 13 and W4A8 grouped GEMM on Hopper GPUs.
- AMD: Added ROCm enhancements including MXFP4 w4a4 inference.
- Intel: Added XPU wNa16 compressed tensors support.
- CPU: Added CPU backend encoder-decoder model support and ARM NEON vectorized attention optimizations.
- Inference engine updates
- Added conditional compilation via compile_ranges for selective kernel compilation.
- Added xxHash high-performance option for prefix caching.
- Added PrefixLM support for FlexAttention and TritonAttention.
- Added online FP8 with streaming post-processing.
- API and compatibility changes
- The VLLM_ATTENTION_BACKEND environment variable has been replaced with the --attention-backend CLI argument.
- Removed the deprecated -O.xx flag and deprecated plugin and compilation fields.
- Removed the deprecated task, seed, and multimodal settings.
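The environment-variable-to-CLI migration for the attention backend can be sketched as follows; the backend name FLASHINFER and the <model> placeholder are illustrative, not prescribed by the release notes:

```shell
# vLLM v0.13.0 removes the VLLM_ATTENTION_BACKEND environment variable; the
# attention backend is now selected with a CLI argument.
# Old invocation (no longer works):
#   VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve <model>
# New invocation -- echoed here rather than executed, since serving requires
# a model and an accelerator:
echo 'vllm serve <model> --attention-backend FLASHINFER'
```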
2.3. New Red Hat AI Model Optimization Toolkit developer features
Red Hat AI Model Optimization Toolkit 3.3.0 packages the upstream LLM Compressor v0.9.0.1 release. You can review the complete list of updates in the upstream LLM Compressor v0.9.0 release notes.
- Model-free post-training quantization
- A new pathway allows quantization directly on safetensors files without requiring a transformers model definition. Model-free post-training quantization (PTQ) currently supports data-free methods only, specifically FP8 quantization.
- Extended quantization support
- Enhanced KV cache and attention quantization capabilities with support for new per-head strategies.
- Added batched calibration support with configurable batch_size and data_collator arguments.
- Added experimental MXFP4 quantization support.
- AutoRound Modifier
- Added AutoRoundModifier, which employs an advanced algorithm that optimizes rounding and clipping ranges through sign-gradient descent, combining post-training efficiency with parameter tuning adaptability.
- Generalized AWQ Modifier
- The AWQ modifier now supports INT8, FP8, and mixed quantization schemes.
- Breaking changes
- Training support APIs have been completely removed. Use the LLM Compressor Axolotl integration instead.
- Python 3.9 support has been discontinued. LLM Compressor v0.9.0 requires Python 3.10 or later.
- AutoRound is now an optional installation accessible through the llmcompressor[autoround] package specification.
2.4. Upgrading from Red Hat AI Inference Server 3.2.2 or earlier
Customers upgrading from Red Hat AI Inference Server 3.2.2 or earlier to 3.3.0 should be aware of the following changes. Customers already on Red Hat AI Inference Server 3.2.3 or later are not affected by these changes.
- vLLM V0 engine deprecation is complete
Starting from Red Hat AI Inference Server 3.2.3, the vLLM V0 engine has been completely removed. The V1 engine is now the only inference engine in vLLM.
- Removed AsyncLLMEngine, LLMEngine, MQLLMEngine, attention backends, encoder-decoder support, samplers, the LoRA interface, and hybrid model support.
- Removed legacy attention classes, the multimodal registry, compilation fallbacks, and default args from the old system during clean-up.
- Encoder-decoder model support removed
With the removal of the V0 engine, encoder-decoder model support was removed. The following model classes are no longer supported:
- BartForConditionalGeneration
- MBartForConditionalGeneration
- DonutForConditionalGeneration
- Florence2ForConditionalGeneration
- MllamaForConditionalGeneration
BART support was temporarily removed in vLLM v0.11.0 as part of the finalization of the vLLM V0 engine deprecation. BART support is expected to return in a future release.
2.5. Known issues
The FlashInfer kernel sampler was disabled by default in Red Hat AI Inference Server 3.2.3 to address non-deterministic behavior and correctness errors in model output.
This change affects sampling behavior when using FlashInfer top-p and top-k sampling methods. If required, you can enable FlashInfer by setting the VLLM_USE_FLASHINFER_SAMPLER environment variable at runtime:
VLLM_USE_FLASHINFER_SAMPLER=1
AMD ROCm AI accelerators do not support inference serving for encoder-decoder models when using the vLLM V1 inference engine.
Encoder-decoder model architectures cause NotImplementedError failures with AMD ROCm accelerators. ROCm attention backends support only decoder-only attention. Affected models include, but are not limited to, the following:
- Speech-to-text Whisper models, for example openai/whisper-large-v3-turbo and mistralai/Voxtral-Mini-3B-2507
- Vision-language models, for example microsoft/Phi-3.5-vision-instruct
- Translation models, for example T5, BART, MarianMT
- Any models using cross-attention or an encoder-decoder architecture
Inference fails for MP3 and M4A file formats. When querying audio models with these file formats, the system returns a "format not recognized" error.
{"error":{"message":"Error opening <_io.BytesIO object at 0x7fc052c821b0>: Format not recognised.","type":"Internal Server Error","param":null,"code":500}}
This issue affects audio transcription models such as openai/whisper-large-v3 and mistralai/Voxtral-Small-24B-2507. To work around this issue, convert audio files to WAV format before processing.
Jemalloc consumes more memory than glibc when deploying models on IBM Spyre AI accelerators.
When deploying models with jemalloc as the memory allocator, overall memory usage is significantly higher than when using glibc. In testing, jemalloc increased memory consumption by more than 50% compared to glibc. To work around this issue, disable jemalloc by unsetting the LD_PRELOAD environment variable so that the system uses glibc as the memory allocator instead.
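The workaround of unsetting LD_PRELOAD can be sketched as follows; the container invocation in the comment is an illustrative assumption, not a documented command:

```shell
# Clear LD_PRELOAD so the process falls back to the default glibc allocator
# instead of jemalloc. In a container deployment you would typically pass an
# empty value to the runtime instead, for example:
#   podman run -e LD_PRELOAD= ...   (illustrative)
unset LD_PRELOAD
echo "LD_PRELOAD='${LD_PRELOAD:-}' (empty means glibc malloc is used)"
```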
GPT-OSS model produces empty or gibberish responses when using multiple GPUs.
When deploying the GPT-OSS model with tensor parallelism greater than 1, the model produces empty or incorrect output. This issue is related to the Triton attention kernel. To work around this issue, use the --no-enable-prefix-caching CLI argument when running the model.
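A hedged sketch of the workaround; the model name and tensor parallel size below are illustrative placeholders, not values from the release notes:

```shell
# Illustrative vllm serve invocation for a multi-GPU GPT-OSS deployment with
# prefix caching disabled. Echoed rather than executed, since serving requires
# GPUs and model weights.
cmd="vllm serve openai/gpt-oss-20b --tensor-parallel-size 2 --no-enable-prefix-caching"
echo "$cmd"
```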
Google TPU inference supports only a limited set of model architectures.
Native TPU inference currently supports a subset of model architectures. For unsupported models, inference falls back to PyTorch, which may result in reduced performance. Additionally, native TPU inference support is not available for some Google models.
For the list of models with native TPU support, see vLLM TPU Installation.
Models quantized with Red Hat AI Model Optimization Toolkit may fail to load in Red Hat AI Inference Server 3.3.0.
Models generated using the registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.3.0 container image may encounter errors when loaded with the registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.3.0 container image. To work around this issue, remove the scale_dtype and zp_dtype configuration entries from the config.json file in the quantized model's directory.
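The config.json edit can be scripted. The sketch below creates a throwaway config.json with hypothetical quantization entries and then strips them; whether the entries sit at the top level or under a quantization_config section is an assumption, so inspect your own config.json first:

```shell
# Create a throwaway model directory with a hypothetical config.json.
MODEL_DIR="$(mktemp -d)"
cat > "$MODEL_DIR/config.json" <<'EOF'
{"model_type": "llama", "scale_dtype": "float32", "zp_dtype": "int8"}
EOF

# Strip scale_dtype and zp_dtype so the quantized model loads in
# Red Hat AI Inference Server 3.3.0.
python3 - "$MODEL_DIR/config.json" <<'PY'
import json, sys

path = sys.argv[1]
with open(path) as f:
    cfg = json.load(f)
# The entries may also sit under quantization_config (location is an assumption).
for section in (cfg, cfg.get("quantization_config", {})):
    for key in ("scale_dtype", "zp_dtype"):
        section.pop(key, None)
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
PY

cat "$MODEL_DIR/config.json"
```

Point MODEL_DIR at the real quantized model directory to apply the same edit in place.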
Streaming tool calls with Mistral models return invalid JSON through the /v1/messages endpoint.
When using Mistral models through the Anthropic-compatible /v1/messages endpoint with streaming tool calls, the generated arguments for tool calls contain invalid JSON. This causes most clients to fail when parsing tool call responses.
This issue affects only the /v1/messages Anthropic-compatible endpoint with Mistral models. Mistral models using the OpenAI-compatible endpoints and non-Mistral models using the /v1/messages endpoint are not affected.