Chapter 1. Version 3.2.2 release notes


The Red Hat AI Inference Server 3.2.2 release provides container images that optimize inferencing with large language models (LLMs) on NVIDIA CUDA, AMD ROCm, Google TPU, and IBM Spyre AI accelerators. The container images are available from registry.redhat.io:

  • registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.2
  • registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.2
  • registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.2
  • registry.redhat.io/rhaiis/vllm-spyre-rhel9:3.2.2

This release also includes a new container image, rhaiis/model-opt-cuda-rhel9:3.2.2, which provides the new Red Hat AI Model Optimization Toolkit.

The Red Hat AI Inference Server supported product and hardware configurations have been expanded. For more information, see Supported product and hardware configurations.

1.1. New vLLM developer features

Red Hat AI Inference Server 3.2.2 packages the upstream vLLM v0.10.1.1 release.

You can review the complete list of updates in the upstream vLLM v0.10.1.1 release notes.

Inference engine updates
  • CUDA graph performance: Full CUDA graph support with separate attention routines, FA2 and FlashInfer compatibility
  • Attention system improvements: Multiple attention metadata builders per KV cache, tree attention backend for v1 engine
  • Speculative decoding: N-gram speculative decoding with single KMP token proposal algorithm
  • Configuration improvements: Model loader plugin system, rate limiting with bucket algorithm
Performance improvements
  • Improved startup time: enhanced headless models for pooling in the Transformers backend
  • NVIDIA Blackwell/SM100 optimizations: CutlassMLA as default backend, FlashInfer MoE per-tensor scale FP8 support
  • NVIDIA RTX PRO 6000 (SM120): Block FP8 quantization and CUTLASS NVFP4 4-bit weights/activations
  • AMD ROCm enhancements: Flash Attention backend for Qwen-VL models, optimized kernel performance for small batch sizes
  • Memory and throughput: Improved efficiency through reduced memory copying, fused RMSNorm kernels, faster multimodal hashing for repeated image prompts, and multithreaded async input loading
  • Parallelization and MoE: Faster guided decoding, better expert sharding for MoE, expanded fused kernel support for top-k softmax, and fused MoE support for nomic-embed-text-v2-moe
  • Hardware and kernels: Fixed ARM CPU builds without BF16, improved Machete on memory-bound tasks, added FlashInfer TRT-LLM prefill kernel, sped up CUDA reshape_and_cache_flash, and enabled CPU transfer in NixlConnector
  • Specialized CUDA kernels: GPT-OSS activation functions implemented, faster RLHF weight loading
New quantization options
  • Added MXFP4/bias support in Marlin and NVFP4 GEMM backends, introduced dynamic 4-bit CPU quantization with Kleidiai, and expanded model support with BitsAndBytes for MoE and Gemma3n compatibility.
API and frontend improvements
  • Added OpenAI API Unix socket support and better error alignment, new reward model interface and chunked input processing, multi-key and custom config support, plus HermesToolParser and multi-turn benchmarking.
Dependency updates
  • FlashInfer v0.3.1: now an optional dependency, installed with pip install vllm[flashinfer]
  • Mamba SSM 2.2.5: removed from core dependencies
  • Docker: Precompiled wheel support for easier containerized deployment
  • Python: OpenAI dependency bumped for API compatibility
  • Various dependency optimizations: Dropped xformers for Mistral models, added DeepGEMM deprecation warnings
V0 deprecation breaking changes
  • V0 deprecation: Continued cleanup of legacy engine components including removal of multi-step scheduling
  • CLI updates: Various flag updates and deprecated argument removals as part of V0 engine cleanup
  • Quantization: Removed AQLM quantization support; users should migrate to alternative methods
Tool calling support for gpt-oss models

Red Hat AI Inference Server now supports calling built-in tools directly in gpt-oss models. Tool calling is available through the Chat Completions and Responses APIs, both of which support function calling for gpt-oss models. For more information, see Tool use.

Note

Tool calling for gpt-oss models is supported on NVIDIA CUDA AI accelerators only.
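
The following is a minimal sketch of invoking function calling through the Chat Completions API of a running Red Hat AI Inference Server endpoint, using the OpenAI Python client. The endpoint URL, the model name, and the get_weather tool definition are illustrative assumptions, not values defined by this release.

from openai import OpenAI

# Assumed local vLLM-compatible endpoint; substitute the URL of your deployment.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Hypothetical function tool, used only to illustrate the request shape.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Return the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="openai/gpt-oss-20b",  # example gpt-oss model; substitute your deployed model
    messages=[{"role": "user", "content": "What is the weather in Boston?"}],
    tools=tools,
)

# If the model elects to call the tool, the call arrives as structured tool_calls.
print(response.choices[0].message.tool_calls)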

1.2. New Red Hat AI Model Optimization Toolkit developer features

Red Hat AI Model Optimization Toolkit 3.2.2 packages the upstream LLM Compressor v0.7.1 release.

You can review the complete list of updates in the upstream llm-compressor v0.7.1 release notes.

New Red Hat AI Model Optimization Toolkit container
The rhaiis/model-opt-cuda-rhel9 container image packages LLM Compressor v0.7.1 in its own runtime image, shipped alongside the primary rhaiis/vllm-cuda-rhel9 container image. This decouples LLM Compressor from vLLM and streamlines model compression and inference serving workflows.
Introducing transforms
Red Hat AI Model Optimization Toolkit now supports transforms. With transforms, you can inject additional matrix operations into a model to improve accuracy recovery after quantization.
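The following is a minimal sketch of a recipe that pairs a transform with weight quantization, modeled on the upstream llm-compressor v0.7 transform examples. The QuIPModifier name, module paths, model name, and scheme choice are assumptions used to illustrate the flow; check the documentation packaged with the toolkit for the exact API.

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.modifiers.transform import QuIPModifier  # assumed module path

recipe = [
    # Inject Hadamard-style rotations before quantization to improve accuracy recovery.
    QuIPModifier(transform_type="random-hadamard"),
    # Quantize linear layers to 4-bit weights, keeping the output head in full precision.
    QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]

oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model; substitute your own
    recipe=recipe,
    output_dir="Llama-3.1-8B-Instruct-quip-w4a16",
)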
Applying multiple compressors to a single model
Red Hat AI Model Optimization Toolkit now supports applying multiple compressors to a single model. This extends support for non-uniform quantization recipes, such as combining NVFP4 and FP8 quantization.
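The sketch below shows one way to express a non-uniform recipe that applies two compressors to the same model. The layer regexes and the split between NVFP4A16 and FP8 schemes are illustrative assumptions modeled on the upstream non-uniform quantization examples.

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = [
    # Weight-only NVFP4 for the MLP projections ...
    QuantizationModifier(targets=["re:.*mlp.*"], scheme="NVFP4A16"),
    # ... and dynamic FP8 for the attention projections.
    QuantizationModifier(targets=["re:.*self_attn.*"], scheme="FP8_DYNAMIC"),
]

oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model; substitute your own
    recipe=recipe,
    output_dir="Llama-3.1-8B-Instruct-nvfp4a16-fp8",
)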
Support for DeepSeekV3-style block FP8 quantization
You can now apply DeepSeekV3-style block FP8 quantization during model compression, a technique designed to further compress large language models for more efficient inference.
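A minimal sketch of applying block FP8 during oneshot compression follows. The FP8_BLOCK scheme name is assumed from the upstream llm-compressor v0.7 release; verify it against the documentation shipped with the toolkit.

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    targets="Linear",      # quantize all linear layers
    scheme="FP8_BLOCK",    # DeepSeekV3-style block-wise FP8 (assumed scheme name)
    ignore=["lm_head"],    # keep the output head in higher precision
)

oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model; substitute your own
    recipe=recipe,
    output_dir="Llama-3.1-8B-Instruct-fp8-block",
)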
Mixture of Experts support
Red Hat AI Model Optimization Toolkit now includes enhanced general Mixture of Experts (MoE) calibration support, including support for MoEs with NVFP4 quantization.
Llama4 quantization
Llama4 quantization is now supported in Red Hat AI Model Optimization Toolkit.
Simplified and updated Recipe classes
The Recipe system has been streamlined by merging multiple classes into one unified Recipe class. Modifier creation, lifecycle management, and parsing are now simpler. Serialization and deserialization are improved.
Configurable Observer arguments
Observer arguments can now be configured as a dict through the observer_kwargs quantization argument, which can be set through oneshot recipes.
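As a rough sketch, observer_kwargs can be passed alongside the other quantization arguments in a config group. The config group layout and the specific keys shown here (an MSE observer with a maxshrink value) are assumptions for illustration; the supported keys depend on the observer you select.

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

recipe = QuantizationModifier(
    config_groups={
        "group_0": {
            "targets": ["Linear"],
            "weights": {
                "num_bits": 8,
                "type": "int",
                "symmetric": True,
                "strategy": "channel",
                "observer": "mse",                      # example observer choice
                "observer_kwargs": {"maxshrink": 0.2},  # assumed tunable for the MSE observer
            },
        }
    },
    ignore=["lm_head"],
)

oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",  # example model; substitute your own
    recipe=recipe,
    output_dir="Llama-3.1-8B-Instruct-w8-mse",
)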

1.3. Anonymous statistics collection

Anonymous Red Hat AI Inference Server 3.2.2 usage statistics are now sent to Red Hat. Model consumption and usage statistics are collected and stored centrally via Red Hat Observatorium.

1.4. Known issues

  • The gpt-oss language model family is supported in Red Hat AI Inference Server 3.2.2 for NVIDIA CUDA AI accelerators only.
  • Red Hat AI Inference Server 3.2.2 includes RPMs provided by IBM to support the IBM Spyre AIU. The RPMs in the 3.2.2 release are pre-GA and are not GPG signed. IBM does not sign pre-GA RPMs.