Chapter 1. Version 3.2.2 release notes


The Red Hat AI Inference Server 3.2.2 release provides container images that optimize inferencing with large language models (LLMs) on NVIDIA CUDA, AMD ROCm, Google TPU, and IBM Spyre AI accelerators. The container images are available from registry.redhat.io:

  • registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.2
  • registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.2
  • registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.2
  • registry.redhat.io/rhaiis/vllm-spyre-rhel9:3.2.2

This release also includes a new container image, rhaiis/model-opt-cuda-rhel9:3.2.2, which provides the new Red Hat AI Model Optimization Toolkit.

The Red Hat AI Inference Server supported product and hardware configurations have been expanded. For more information, see Supported product and hardware configurations.

1.1. New vLLM developer features

Red Hat AI Inference Server 3.2.2 packages the upstream vLLM v0.10.1.1 release.

You can review the complete list of updates in the upstream vLLM v0.10.1.1 release notes.

Inference engine updates
  • CUDA graph performance: Full CUDA graph support with separate attention routines, FA2 and FlashInfer compatibility
  • Attention system improvements: Multiple attention metadata builders per KV cache, tree attention backend for v1 engine
  • Speculative decoding: N-gram speculative decoding with single KMP token proposal algorithm
  • Configuration improvements: Model loader plugin system, rate limiting with bucket algorithm
Performance improvements
  • Improved startup time: enhanced headless models for pooling in the Transformers backend
  • NVIDIA Blackwell/SM100 optimizations: CutlassMLA as default backend, FlashInfer MoE per-tensor scale FP8 support
  • NVIDIA RTX PRO 6000 (SM120): Block FP8 quantization and CUTLASS NVFP4 4-bit weights/activations
  • AMD ROCm enhancements: Flash Attention backend for Qwen-VL models, optimized kernel performance for small batch sizes
  • Memory and throughput: Improved efficiency through reduced memory copying, fused RMSNorm kernels, faster multimodal hashing for repeated image prompts, and multithreaded async input loading
  • Parallelization and MoE: Faster guided decoding, better expert sharding for MoE, expanded fused kernel support for top-k softmax, and fused MoE support for nomic-embed-text-v2-moe
  • Hardware and kernels: Fixed ARM CPU builds without BF16, improved Machete on memory-bound tasks, added FlashInfer TRT-LLM prefill kernel, sped up CUDA reshape_and_cache_flash, and enabled CPU transfer in NixlConnector
  • Specialized CUDA kernels: GPT-OSS activation functions implemented, faster RLHF weight loading
New quantization options
  • Added MXFP4/bias support in Marlin and NVFP4 GEMM backends, introduced dynamic 4-bit CPU quantization with Kleidiai, and expanded model support with BitsAndBytes for MoE and Gemma3n compatibility.
API and frontend improvements
  • Added OpenAI API Unix socket support and better error alignment, new reward model interface and chunked input processing, multi-key and custom config support, plus HermesToolParser and multi-turn benchmarking.
Dependency updates
  • FlashInfer v0.3.1: now an optional dependency, installed with pip install vllm[flashinfer]
  • Mamba SSM 2.2.5: removed from core dependencies
  • Docker: Precompiled wheel support for easier containerized deployment
  • Python: OpenAI dependency bumped for API compatibility
  • Various dependency optimizations: Dropped xformers for Mistral models, added DeepGEMM deprecation warnings
V0 deprecation breaking changes
  • V0 deprecation: Continued cleanup of legacy engine components including removal of multi-step scheduling
  • CLI updates: Various flag updates and deprecated argument removals as part of V0 engine cleanup
  • Quantization: Removed AQLM quantization support; users should migrate to alternative methods
Tool calling support for gpt-oss models

Red Hat AI Inference Server now supports calling built-in tools directly in gpt-oss models. Tool calling is available through the Chat Completions and Responses APIs, both of which support function calling for gpt-oss models. For more information, see Tool use.

Note

Tool calling for gpt-oss models is supported on NVIDIA CUDA AI accelerators only.
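
The following is a minimal sketch of invoking function calling through the Chat Completions API of a running Red Hat AI Inference Server endpoint, using the OpenAI Python client. The endpoint URL, the model name, and the get_weather tool definition are illustrative assumptions, not values defined by this release.

from openai import OpenAI

# Assumed local vLLM-compatible endpoint; substitute the URL of your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Hypothetical function tool, used only to illustrate the request shape.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Return the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",  # example gpt-oss model; substitute your deployed model
    messages=[{"role": "user", "content": "What is the weather in Boston?"}],
    tools=tools,
)

# If the model elects to call the tool, the call arrives as structured tool_calls.
print(response.choices[0].message.tool_calls)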

1.2. New Red Hat AI Model Optimization Toolkit developer features

Red Hat AI Model Optimization Toolkit 3.2.2 packages the upstream LLM Compressor v0.7.1 release.

You can review the complete list of updates in the upstream llm-compressor v0.7.1 release notes.

New Red Hat AI Model Optimization Toolkit container
The rhaiis/model-opt-cuda-rhel9 container image packages LLM Compressor v0.7.1 in its own runtime image, shipped alongside the primary rhaiis/vllm-cuda-rhel9 container image. This decouples LLM Compressor from vLLM and streamlines model compression and inference serving workflows.
Introducing transforms
Red Hat AI Model Optimization Toolkit now supports transforms. With transforms, you can inject additional matrix operations into a model to improve accuracy recovery after quantization.
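The following is a minimal sketch of a recipe that pairs a transform with weight quantization, modeled on the upstream llm-compressor v0.7 transform examples. The QuIPModifier name, module paths, model name, and scheme choice are assumptions used to illustrate the flow; check the documentation packaged with the toolkit for the exact API.

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.transform import QuIPModifier  # assumed module path

recipe = [
    # Inject Hadamard-style rotations before quantization to improve accuracy recovery.
    QuIPModifier(transform_type="random-hadamard"),
    # Quantize linear layers to 4-bit weights, keeping the output head in full precision.
    QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]

oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model; substitute your own
    recipe=recipe,
    output_dir="Llama-3.1-8B-Instruct-quip-w4a16",
)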
Applying multiple compressors to a single model
Red Hat AI Model Optimization Toolkit now supports applying multiple compressors to a single model. This extends support for non-uniform quantization recipes, such as combining NVFP4 and FP8 quantization.
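The sketch below shows one way to express a non-uniform recipe that applies two compressors to the same model. The layer regexes and the split between NVFP4A16 and FP8 schemes are illustrative assumptions modeled on the upstream non-uniform quantization examples.

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = [
    # Weight-only NVFP4 for the MLP projections ...
    QuantizationModifier(targets=["re:.*mlp.*"], scheme="NVFP4A16"),
    # ... and dynamic FP8 for the attention projections.
    QuantizationModifier(targets=["re:.*self_attn.*"], scheme="FP8_DYNAMIC"),
]

oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model; substitute your own
    recipe=recipe,
    output_dir="Llama-3.1-8B-Instruct-nvfp4a16-fp8",
)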
Support for DeepSeekV3-style block FP8 quantization
You can now apply DeepSeekV3-style block FP8 quantization during model compression, a technique designed to further compress large language models for more efficient inference.
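A minimal sketch of applying block FP8 during oneshot compression follows. The FP8_BLOCK scheme name is assumed from the upstream llm-compressor v0.7 release; verify it against the documentation shipped with the toolkit.

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",      # quantize all linear layers
    scheme="FP8_BLOCK",    # DeepSeekV3-style block-wise FP8 (assumed scheme name)
    ignore=["lm_head"],    # keep the output head in higher precision
)

oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model; substitute your own
    recipe=recipe,
    output_dir="Llama-3.1-8B-Instruct-fp8-block",
)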
Mixture of Experts support
Red Hat AI Model Optimization Toolkit now includes enhanced general Mixture of Experts (MoE) calibration support, including support for MoEs with NVFP4 quantization.
Llama4 quantization
Llama4 quantization is now supported in Red Hat AI Model Optimization Toolkit.
Simplified and updated Recipe classes
The Recipe system has been streamlined by merging multiple classes into one unified Recipe class. Modifier creation, lifecycle management, and parsing are now simpler. Serialization and deserialization are improved.
Configurable Observer arguments
Observer arguments can now be configured as a dict through the observer_kwargs quantization argument, which can be set through oneshot recipes.
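As a rough sketch, observer_kwargs can be passed alongside the other quantization arguments in a config group. The config group layout and the specific keys shown here (an MSE observer with a maxshrink value) are assumptions for illustration; the supported keys depend on the observer you select.

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    config_groups={
        "group_0": {
            "targets": ["Linear"],
            "weights": {
                "num_bits": 8,
                "type": "int",
                "symmetric": True,
                "strategy": "channel",
                "observer": "mse",                      # example observer choice
                "observer_kwargs": {"maxshrink": 0.2},  # assumed tunable for the MSE observer
            },
        }
    },
    ignore=["lm_head"],
)

oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model; substitute your own
    recipe=recipe,
    output_dir="Llama-3.1-8B-Instruct-w8-mse",
)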

1.3. Anonymous statistics collection

Anonymous Red Hat AI Inference Server 3.2.2 usage statistics are now sent to Red Hat. Model consumption and usage statistics are collected and stored centrally via Red Hat Observatorium.

1.4. Known issues

  • The gpt-oss language model family is supported in Red Hat AI Inference Server 3.2.2 for NVIDIA CUDA AI accelerators only.
  • Red Hat AI Inference Server 3.2.2 includes RPMs provided by IBM to support the IBM Spyre AIU. The RPMs in the 3.2.2 release are pre-GA and are not GPG signed. IBM does not sign pre-GA RPMs.