Chapter 1. Version 3.2.2 release notes


The Red Hat AI Inference Server 3.2.2 release provides container images that optimize inferencing with large language models (LLMs) on NVIDIA CUDA, AMD ROCm, Google TPU, and IBM Spyre AI accelerators. The container images are available from registry.redhat.io:

  • registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.2
  • registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.2
  • registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.2
  • registry.redhat.io/rhaiis/vllm-spyre-rhel9:3.2.2

This release also includes the new registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.2.2 container image, which provides the new Red Hat AI Model Optimization Toolkit.

The supported product and hardware configurations for Red Hat AI Inference Server have been expanded. For more information, see Supported product and hardware configurations.

1.1. New vLLM developer features

Red Hat AI Inference Server 3.2.2 packages the upstream vLLM v0.10.1.1 release.

You can review the complete list of updates in the upstream vLLM v0.10.1.1 release notes.

Inference engine updates
  • CUDA graph performance: Full CUDA graph support with separate attention routines, FA2 and FlashInfer compatibility
  • Attention system improvements: Multiple attention metadata builders per KV cache, tree attention backend for v1 engine
  • Speculative decoding: N-gram speculative decoding with single KMP token proposal algorithm
  • Configuration improvements: Model loader plugin system, rate limiting with bucket algorithm
Performance improvements
  • Improved startup time: enhanced headless models for pooling in the Transformers backend
  • NVIDIA Blackwell/SM100 optimizations: CutlassMLA as default backend, FlashInfer MoE per-tensor scale FP8 support
  • NVIDIA RTX PRO 6000 (SM120): Block FP8 quantization and CUTLASS NVFP4 4-bit weights/activations
  • AMD ROCm enhancements: Flash Attention backend for Qwen-VL models, optimized kernel performance for small batch sizes
  • Memory and throughput: Improved efficiency through reduced memory copying, fused RMSNorm kernels, faster multimodal hashing for repeated image prompts, and multithreaded async input loading
  • Parallelization and MoE: Faster guided decoding, better expert sharding for MoE, expanded fused kernel support for top-k softmax, and fused MoE support for nomic-embed-text-v2-moe
  • Hardware and kernels: Fixed ARM CPU builds without BF16, improved Machete on memory-bound tasks, added FlashInfer TRT-LLM prefill kernel, sped up CUDA reshape_and_cache_flash, and enabled CPU transfer in NixlConnector
  • Specialized CUDA kernels: GPT-OSS activation functions implemented, faster RLHF weight loading
New quantization options
  • Added MXFP4/bias support in Marlin and NVFP4 GEMM backends, introduced dynamic 4-bit CPU quantization with Kleidiai, and expanded model support with BitsAndBytes for MoE and Gemma3n compatibility.
API and frontend improvements
  • Added OpenAI API Unix socket support and better error alignment, new reward model interface and chunked input processing, multi-key and custom config support, plus HermesToolParser and multi-turn benchmarking.
Dependency updates
  • FlashInfer v0.3.1: now an optional dependency, installed with pip install vllm[flashinfer]
  • Mamba SSM 2.2.5: removed from core dependencies
  • Docker: Precompiled wheel support for easier containerized deployment
  • Python: OpenAI dependency bumped for API compatibility
  • Various dependency optimizations: Dropped xformers for Mistral models, added DeepGEMM deprecation warnings
V0 deprecation breaking changes
  • V0 deprecation: Continued cleanup of legacy engine components including removal of multi-step scheduling
  • CLI updates: Various flag updates and deprecated argument removals as part of V0 engine cleanup
  • Quantization: Removed AQLM quantization support; users should migrate to alternative quantization methods
Tool calling support for gpt-oss models

Red Hat AI Inference Server now supports calling built-in tools directly with gpt-oss models. Tool calling is available through the Chat Completions and Responses APIs, both of which support function calling for gpt-oss models. For more information, see Tool use.

Note

Tool calling for gpt-oss models is supported on NVIDIA CUDA AI accelerators only.
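
The following is a minimal sketch of calling a function tool through the OpenAI-compatible Chat Completions API. The server URL, the model name, and the get_weather tool definition are illustrative assumptions and are not part of this release:

    # Minimal sketch: function tool calling with a gpt-oss model through the
    # OpenAI-compatible Chat Completions API. The base_url, model name, and
    # get_weather tool are illustrative assumptions.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    tools = [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool for illustration
            "description": "Return the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }]

    response = client.chat.completions.create(
        model="openai/gpt-oss-20b",  # assumed model identifier
        messages=[{"role": "user", "content": "What is the weather in Boston?"}],
        tools=tools,
    )

    # When the model chooses to call the tool, the call arrives as structured
    # tool_calls rather than plain text content.
    print(response.choices[0].message.tool_calls)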

1.2. New Red Hat AI Model Optimization Toolkit features

Red Hat AI Model Optimization Toolkit 3.2.2 packages the upstream LLM Compressor v0.7.1 release.

You can review the complete list of updates in the upstream llm-compressor v0.7.1 release notes.

New Red Hat AI Model Optimization Toolkit container
The rhaiis/model-opt-cuda-rhel9 container image packages LLM Compressor v0.7.1 separately in its own runtime image, shipped as a second container image alongside the primary rhaiis/vllm-cuda-rhel9 container image. This reduces the coupling between vLLM and LLM Compressor, streamlining model compression and inference serving workflows.
Introducing transforms
Red Hat AI Model Optimization Toolkit now supports transforms. With transforms, you can inject additional matrix operations into a model to improve accuracy recovery after quantization.
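
A hedged sketch of this workflow with the upstream llmcompressor library follows; the QuIPModifier import path and arguments are assumptions based on the upstream v0.7.1 release, and the model name is illustrative:

    # Hedged sketch: inject Hadamard-style transform matrices before weight
    # quantization to improve accuracy recovery. The QuIPModifier import path
    # and arguments are assumptions; the model name is illustrative.
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier
    from llmcompressor.modifiers.transform import QuIPModifier  # assumed path

    recipe = [
        # Rotations spread outlier weights across channels before quantization.
        QuIPModifier(transform_type="random-hadamard"),  # argument assumed
        QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
    ]

    oneshot(
        model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
        recipe=recipe,
        output_dir="Llama-3.1-8B-Instruct-W4A16-transformed",
    )
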
Applying multiple compressors to a single model
Red Hat AI Model Optimization Toolkit now supports applying multiple compressors to a single model. This extends support for non-uniform quantization recipes, such as combining NVFP4 and FP8 quantization.
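
As a sketch, a non-uniform recipe might pair two quantization modifiers that target different parts of the model. The layer-name patterns, scheme identifiers, and calibration settings below are assumptions for illustration only:

    # Hedged sketch: a non-uniform recipe that applies NVFP4 to most linear
    # layers and FP8 to the attention projections. Layer patterns, scheme
    # names, and calibration settings are assumptions for illustration.
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    recipe = [
        QuantizationModifier(
            targets="Linear",
            scheme="NVFP4",
            ignore=["lm_head", "re:.*self_attn.*"],  # assumed patterns
        ),
        QuantizationModifier(
            targets=["re:.*self_attn.*"],  # assumed attention-layer pattern
            scheme="FP8_DYNAMIC",
            ignore=["lm_head"],
        ),
    ]

    oneshot(
        model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
        recipe=recipe,
        dataset="open_platypus",            # calibration data for NVFP4 scales
        num_calibration_samples=256,
        output_dir="Llama-3.1-8B-Instruct-nvfp4-fp8",
    )
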
Support for DeepSeekV3-style block FP8 quantization
You can now apply DeepSeekV3-style block FP8 quantization during model compression, a technique designed to further compress large language models for more efficient inference.
Mixture of Experts support
Red Hat AI Model Optimization Toolkit now includes enhanced general Mixture of Experts (MoE) calibration support, including support for MoEs with NVFP4 quantization.
Llama4 quantization
Llama4 quantization is now supported in Red Hat AI Model Optimization Toolkit.
Simplified and updated Recipe classes
The Recipe system has been streamlined by merging multiple classes into one unified Recipe class. Modifier creation, lifecycle management, and parsing are now simpler. Serialization and deserialization are improved.
Configurable Observer arguments
Observer arguments can now be configured as a dict through the observer_kwargs quantization argument, which can be set through oneshot recipes.
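
A hedged sketch of passing observer arguments in a oneshot recipe follows; the surrounding config_groups structure is inferred from the upstream llmcompressor API, and the observer settings shown are illustrative:

    # Hedged sketch: configuring observer arguments as a dict through
    # observer_kwargs inside the quantization arguments of a oneshot recipe.
    # The config_groups structure and the observer values are illustrative.
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    recipe = QuantizationModifier(
        ignore=["lm_head"],
        config_groups={
            "group_0": {
                "targets": ["Linear"],
                "weights": {
                    "num_bits": 8,
                    "type": "float",
                    "strategy": "channel",
                    "observer": "mse",
                    # New in this release: observer arguments as a dict.
                    "observer_kwargs": {"maxshrink": 0.2},  # illustrative value
                },
            },
        },
    )

    oneshot(
        model="meta-llama/Llama-3.1-8B-Instruct",  # illustrative model
        recipe=recipe,
        output_dir="Llama-3.1-8B-Instruct-fp8-mse-observer",
    )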

1.3. Anonymous statistics collection

Anonymous Red Hat AI Inference Server 3.2.2 usage statistics are now sent to Red Hat. Model consumption and usage statistics are collected and stored centrally through Red Hat Observatorium.

1.4. Known issues

  • The gpt-oss language model family is supported in Red Hat AI Inference Server 3.2.2 for NVIDIA CUDA AI accelerators only.
  • Red Hat AI Inference Server 3.2.2 includes RPMs provided by IBM to support the IBM Spyre AIU. The RPMs in the 3.2.2 release are pre-GA and are not GPG signed; IBM does not sign pre-GA RPMs.