Chapter 1. Version 3.2.2 release notes
The Red Hat AI Inference Server 3.2.2 release provides container images that optimize inferencing with large language models (LLMs) for NVIDIA CUDA, AMD ROCm, Google TPU, and IBM Spyre AI accelerators. The container images are available from registry.redhat.io:
- registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.2
- registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.2
- registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.2
- registry.redhat.io/rhaiis/vllm-spyre-rhel9:3.2.2
This release also includes a new rhaiis/model-opt-cuda-rhel9:3.2.2 container image, which packages the new Red Hat AI Model Optimization Toolkit.
The supported product and hardware configurations for Red Hat AI Inference Server have been expanded. For more information, see Supported product and hardware configurations.
1.1. New vLLM developer features
Red Hat AI Inference Server 3.2.2 packages the upstream vLLM v0.10.1.1 release.
You can review the complete list of updates in the upstream vLLM v0.10.1.1 release notes.
- Inference engine updates
- CUDA graph performance: Full CUDA graph support with separate attention routines, FA2 and FlashInfer compatibility
- Attention system improvements: Multiple attention metadata builders per KV cache, tree attention backend for v1 engine
- Speculative decoding: N-gram speculative decoding with single KMP token proposal algorithm
- Configuration improvements: Model loader plugin system, rate limiting with bucket algorithm
- Performance improvements
- Improved startup time: enhanced headless models for pooling in the Transformers backend
- NVIDIA Blackwell/SM100 optimizations: CutlassMLA as default backend, FlashInfer MoE per-tensor scale FP8 support
- NVIDIA RTX PRO 6000 (SM120): Block FP8 quantization and CUTLASS NVFP4 4-bit weights/activations
- AMD ROCm enhancements: Flash Attention backend for Qwen-VL models, optimized kernel performance for small batch sizes
- Memory and throughput: Improved efficiency through reduced memory copying, fused RMSNorm kernels, faster multimodal hashing for repeated image prompts, and multithreaded async input loading
- Parallelization and MoE: Faster guided decoding, better expert sharding for MoE, expanded fused kernel support for top-k softmax, and fused MoE support for nomic-embed-text-v2-moe
- Hardware and kernels: Fixed ARM CPU builds without BF16, improved Machete on memory-bound tasks, added FlashInfer TRT-LLM prefill kernel, sped up CUDA reshape_and_cache_flash, and enabled CPU transfer in NixlConnector
- Specialized CUDA kernels: GPT-OSS activation functions implemented, faster RLHF weight loading
- New quantization options
- Added MXFP4/bias support in Marlin and NVFP4 GEMM backends, introduced dynamic 4-bit CPU quantization with KleidiAI, and expanded model support with BitsAndBytes for MoE and Gemma3n compatibility.
- API and frontend improvements
- Added OpenAI API Unix socket support and better error alignment, new reward model interface and chunked input processing, multi-key and custom config support, plus HermesToolParser and multi-turn benchmarking.
- Dependency updates
- FlashInfer v0.3.1: now an optional dependency via pip install vllm[flashinfer]
- Mamba SSM 2.2.5: removed from core dependencies
- Docker: Precompiled wheel support for easier containerized deployment
- Python: OpenAI dependency bumped for API compatibility
- Various dependency optimizations: Dropped xformers for Mistral models, added DeepGEMM deprecation warnings
- V0 deprecation breaking changes
- V0 deprecation: Continued cleanup of legacy engine components including removal of multi-step scheduling
- CLI updates: Various flag updates and deprecated argument removals as part of V0 engine cleanup
- Quantization: Removed AQLM quantization support; users should migrate to alternative quantization methods
- Tool calling support for gpt-oss models
Red Hat AI Inference Server now supports calling built-in tools directly in gpt-oss models. Tool calling uses the Chat Completions and Responses APIs, both of which can carry function-calling capabilities for gpt-oss models. For more information, see Tool use. A minimal client sketch follows this list.
Note: Tool calling for gpt-oss models is supported on NVIDIA CUDA AI accelerators only.
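The following is a minimal sketch that calls a gpt-oss model through the OpenAI-compatible Chat Completions API of a running Red Hat AI Inference Server instance. The server URL, the model name, and the example weather function are placeholder assumptions, not part of the release.

# Minimal tool-calling sketch against an OpenAI-compatible endpoint.
# Assumptions: server URL, model name, and the weather tool are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Return the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",  # placeholder gpt-oss model name
    messages=[{"role": "user", "content": "What is the weather in Boston?"}],
    tools=tools,
    tool_choice="auto",
)

# The model responds with a tool call that your application executes.
print(response.choices[0].message.tool_calls)

After executing the tool, your application sends the tool result back in a follow-up chat completion request so the model can produce the final answer.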
1.2. New Red Hat AI Model Optimization Toolkit developer features
Red Hat AI Model Optimization Toolkit 3.2.2 packages the upstream LLM Compressor v0.7.1 release.
You can review the complete list of updates in the upstream llm-compressor v0.7.1 release notes.
- New Red Hat AI Model Optimization Toolkit container
- The rhaiis/model-opt-cuda-rhel9 container image packages LLM Compressor v0.7.1 separately in its own runtime image, shipped as a second container image alongside the primary rhaiis/vllm-cuda-rhel9 container image. This reduces the coupling between vLLM and LLM Compressor, streamlining model compression and inference serving workflows.
- Introducing transforms
- Red Hat AI Model Optimization Toolkit now supports transforms. With transforms, you can inject additional matrix operations into a model to improve accuracy recovery after quantization.
- Applying multiple compressors to a single model
- Red Hat AI Model Optimization Toolkit now supports applying multiple compressors to a single model. This extends support for non-uniform quantization recipes, such as combining NVFP4 and FP8 quantization.
- Support for DeepSeekV3-style block FP8 quantization
- You can now apply DeepSeekV3-style block FP8 quantization during model compression, a technique designed to further compress large language models for more efficient inference.
- Mixture of Experts support
- Red Hat AI Model Optimization Toolkit now includes enhanced general Mixture of Experts (MoE) calibration support, including support for MoEs with NVFP4 quantization.
- Llama4 quantization
- Llama4 quantization is now supported in Red Hat AI Model Optimization Toolkit.
- Simplified and updated Recipe classes
- The Recipe system has been streamlined by merging multiple classes into one unified Recipe class. Modifier creation, lifecycle management, and parsing are now simpler. Serialization and deserialization are improved.
- Configurable Observer arguments
- Observer arguments can now be configured as a dict through the observer_kwargs quantization argument, which can be set through oneshot recipes, as shown in the sketch after this list.
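The following is a minimal sketch of a oneshot compression run that passes observer arguments through observer_kwargs in a YAML recipe. The model name, calibration dataset, observer choice, and observer_kwargs values are illustrative assumptions; only the observer_kwargs field itself is the new capability described above, and the recipe layout follows the upstream llm-compressor recipe format.

# Illustrative oneshot run; model, dataset, and observer settings are placeholders.
from llmcompressor import oneshot

# An int8 weight quantization scheme whose observer receives extra
# arguments through the observer_kwargs dict.
recipe = """
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      ignore: ["lm_head"]
      config_groups:
        group_0:
          targets: ["Linear"]
          weights:
            num_bits: 8
            type: "int"
            symmetric: true
            strategy: "channel"
            observer: "mse"
            observer_kwargs:
              maxshrink: 0.2
"""

oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    dataset="open_platypus",                   # placeholder calibration dataset
    recipe=recipe,
    output_dir="Llama-3.1-8B-Instruct-W8-mse",
    max_seq_length=2048,
    num_calibration_samples=512,
)

The compressed model written to the output directory can then be served with the rhaiis/vllm-cuda-rhel9 container image, keeping compression and inference serving in separate workflows.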
1.3. Anonymous statistics collection
Anonymous Red Hat AI Inference Server 3.2.2 usage statistics are now sent to Red Hat. Model consumption and usage statistics are collected and stored centrally by using Red Hat Observatorium.
1.4. Known issues
- The gpt-oss language model family is supported in Red Hat AI Inference Server 3.2.2 for NVIDIA CUDA AI accelerators only.
- Red Hat AI Inference Server 3.2.2 includes RPMs provided by IBM to support the IBM Spyre AIU. The RPMs in the 3.2.2 release are pre-GA and are not GPG signed. IBM does not sign pre-GA RPMs.