Chapter 1. Version 3.2.3 release notes


Red Hat AI Inference Server 3.2.3 provides container images that optimize inferencing with large language models (LLMs) for NVIDIA CUDA, AMD ROCm, Google TPU, and IBM Spyre AI accelerators. The container images are available from registry.redhat.io:

  • registry.redhat.io/rhaiis/vllm-cuda-rhel9:3.2.3
  • registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.3
  • registry.redhat.io/rhaiis/vllm-tpu-rhel9:3.2.3
  • registry.redhat.io/rhaiis/vllm-spyre-rhel9:3.2.3
  • registry.redhat.io/rhaiis/model-opt-cuda-rhel9:3.2.3
Note

To facilitate customer testing of new models, early-access fast release Red Hat AI Inference Server images are now available as near-upstream preview builds. Fast release container images are not functionally complete or production-ready, have minimal productization, and are not supported by Red Hat in any way.

You can find the available fast release images in the Red Hat Ecosystem Catalog.

The Red Hat AI Inference Server supported product and hardware configurations have been expanded. For more information, see Supported product and hardware configurations.

1.1. New vLLM developer features

Red Hat AI Inference Server 3.2.3 packages the upstream vLLM v0.11.0 release. You can review the complete list of updates in the upstream vLLM v0.11.0 release notes.

The release completes the removal of the vLLM V0 engine. V1 is now the only inference engine in vLLM.

FULL_AND_PIECEWISE is now the default CUDA graph mode. This provides better performance for most models, particularly fine-grained MoEs, while preserving compatibility with models that support only PIECEWISE mode.
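
If you need to keep a model on the previous behavior, you can set the CUDA graph mode explicitly. The following is a minimal sketch that uses the vLLM Python API and assumes the compilation_config engine argument and its cudagraph_mode field behave as in upstream vLLM v0.11.0; the model name is illustrative.

    # Minimal sketch: pinning the CUDA graph mode instead of relying on the
    # new FULL_AND_PIECEWISE default. Assumes the upstream vLLM v0.11.0
    # Python API; the model name is illustrative.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen2.5-1.5B-Instruct",                   # illustrative model
        compilation_config={"cudagraph_mode": "PIECEWISE"},   # or "FULL_AND_PIECEWISE"
    )

    outputs = llm.generate(
        ["Summarize what CUDA graphs are in one sentence."],
        SamplingParams(max_tokens=64),
    )
    print(outputs[0].outputs[0].text)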

Inference engine updates
  • Added KV cache offloading with CPU offload and LRU cache management.
  • Added new vLLM V1 engine features including prompt embeddings, sharded state loading, and sliding window attention.
  • Added pipeline parallel and variable hidden size support to the hybrid allocator.
  • Extended async scheduling to support uniprocessor execution.
  • Removed tokenizer groups and added multimodal caching in shared memory as part of architecture changes.
  • Improved attention with hybrid SSM/Attention and FlashAttention 3 for ViT.
  • Achieved multiple Triton and RoPE kernel speedups, with speculative decoding now 8 times faster.
  • Optimized LoRA weight loading.
  • Changed CUDA graph mode default to FULL_AND_PIECEWISE and disabled the standalone compile feature in the Inductor.
  • Added integrated CUDA graph inductor partition for torch.compile.
Model support
  • Added support for new architectures including DeepSeek-V3.2-Exp, Qwen3-VL, Qwen3-Next, OLMo3, LongCat-Flash, Dots OCR, Ling2.0, and CWM.
  • Added RADIO encoder and transformer backend support for encoder-only models.
  • Enabled new tasks including BERT NER/token classification and multimodal pooling tasks.
  • Added data parallelism for InternVL, Qwen2-VL, and Qwen3-VL.
  • Implemented EAGLE3 speculative decoding for MiniCPM3 and GPT-OSS.
  • Added new features including Qwen3-VL text-only mode, EVS video pruning, Mamba2 quantization, MRoPE and YaRN, and LongCat-Flash-Chat tools.
  • Delivered performance optimizations across GLM, Qwen, and LongCat series.
  • Added SeedOSS reason parser for reasoning tasks.
AI Accelerator hardware updates
  • NVIDIA: Added FP8 FlashInfer decoding and BF16 fused MoE for NVIDIA Hopper and Blackwell AI accelerators.
  • AMD: Added MI300X tuning for GLM-4.5.
  • Enabled DeepGEMM by default, providing a 5.5% throughput gain for model serving.
Performance improvements
  • Introduced dual-batch overlap (DBO) as an overlapping compute mechanism for higher throughput.
  • Enhanced data parallelism with the new torchrun launcher, Ray placement groups, and Triton DP/EP kernels.
  • Reduced EPLB overhead and added static placement.
  • Added KV metrics and latent dimension support for disaggregated serving.
  • Optimized MoE with shared expert overlap optimization, SiLU kernel, and Allgather/ReduceScatter backend.
  • Improved distributed NCCL symmetric memory support, resulting in a 3-4% throughput improvement.
New quantization options
  • Enhanced FP8 with per-token-group quantization, hardware acceleration, and paged attention update.
  • Added FP4 support for dense NVFP4 models and large Llama/Gemma variants.
  • Updated W4A8 to perform faster preprocessing.
  • Added blocked FP8 support for MoE models in compressed tensors.
API and front-end improvements
  • Enhanced OpenAI compatibility with full-token logprobs, reasoning event streaming, MCP tools, and better error handling. See the sketch after this list.
  • Improved multimodal support with UUID caching and updated image path formats.
  • Added XML parser for Qwen3-Coder and Hermes token format for tool calling.
  • Added new --enable-logging flag and improved help output in the command line interface.
  • Enhanced configuration with speculative engine args, NVTX profiling, and backward compatibility fixes.
  • Cleaned up metrics outputs and added KV cache units in GiB.
  • Removed misleading quantization warning to improve UX.
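
The following is a minimal sketch of calling a running server through the OpenAI-compatible API and requesting per-token logprobs. It assumes a server is already listening on localhost port 8000; the served model name is illustrative.

    # Minimal sketch: requesting token logprobs from a running Red Hat AI
    # Inference Server endpoint through the OpenAI-compatible API. Assumes a
    # server is already listening on localhost:8000; the model name is
    # illustrative.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

    response = client.chat.completions.create(
        model="RedHatAI/Llama-3.2-1B-Instruct-FP8",   # illustrative model name
        messages=[{"role": "user", "content": "Name three prime numbers."}],
        logprobs=True,
        top_logprobs=3,
        max_tokens=64,
    )

    print(response.choices[0].message.content)
    # Per-token logprobs for the generated completion.
    for token_info in response.choices[0].logprobs.content:
        print(token_info.token, token_info.logprob)
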
Dependency updates
  • Upgraded PyTorch to 2.8 for CUDA and ROCm, FlashInfer to 0.3.1, and CUDA to version 13.
  • Enforced C++17 globally across builds.
  • Replaced xm.mark_step with torch_xla.sync for Google TPU.
Security updates
Fixed advisory GHSA-wr9h-g72x-mwhm.
vLLM V0 engine deprecation is complete
  • Removed AsyncLLMEngine, LLMEngine, MQLLMEngine, attention backends, encoder-decoder, samplers, LoRA interface, and hybrid model support.
  • Removed legacy attention classes, multimodal registry, compilation fallbacks, and default args from the old system during clean-up.

1.2. New Red Hat AI Model Optimization Toolkit developer features

Red Hat AI Model Optimization Toolkit 3.2.3 packages the upstream LLM Compressor v0.8.1 release.

The registry.redhat.io/rhaiis/model-opt-cuda-rhel9 container image packages LLM Compressor v0.8.1 separately in its own runtime image, shipped as a second container image alongside the primary registry.redhat.io/rhaiis/vllm-cuda-rhel9 container image. This reduces the coupling between vLLM and LLM Compressor, streamlining model compression and inference serving workflows.

You can review the complete list of updates in the upstream llm-compressor v0.8.1 release notes.

Support for multiple modifiers in oneshot compression runs

LLM Compressor now supports using multiple modifiers in oneshot compression runs.

You can apply multiple modifiers across model layers. This includes applying different modifiers, such as AWQ and GPTQ, to specific submodules for W4A16 quantization, all within a single oneshot call with pass-through calibration data.
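
The following is a minimal sketch of a mixed-modifier oneshot run, assuming the llm-compressor v0.8.1 Python API. The model name, calibration dataset, and submodule-targeting patterns are illustrative rather than taken from the release notes.

    # Minimal sketch: one oneshot call that applies AWQ to the attention
    # projections and GPTQ to the MLP projections, both as W4A16. Assumes the
    # llm-compressor v0.8.1 API; model, dataset, and targeting regexes are
    # illustrative.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    from llmcompressor import oneshot
    from llmcompressor.modifiers.awq import AWQModifier
    from llmcompressor.modifiers.quantization import GPTQModifier

    MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative model

    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    recipe = [
        # AWQ on the attention projection layers (illustrative targeting).
        AWQModifier(
            targets=[r"re:.*self_attn\..*_proj"],
            scheme="W4A16",
            ignore=["lm_head"],
        ),
        # GPTQ on the MLP projection layers (illustrative targeting).
        GPTQModifier(
            targets=[r"re:.*mlp\..*_proj"],
            scheme="W4A16",
            ignore=["lm_head"],
        ),
    ]

    oneshot(
        model=model,
        recipe=recipe,
        dataset="open_platypus",          # calibration data passed through once
        max_seq_length=2048,
        num_calibration_samples=256,
    )

    model.save_pretrained("Llama-3.1-8B-Instruct-W4A16-mixed")
    tokenizer.save_pretrained("Llama-3.1-8B-Instruct-W4A16-mixed")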

Quantization and calibration support for Qwen3 models

Quantization and calibration support for Qwen3 models has been added to LLM Compressor.

An updated Qwen3NextSparseMoeBlock modeling definition has been added that temporarily modifies the MoE block during calibration so that every expert sees data and is calibrated appropriately. As a result, all experts receive calibrated scales while only the gated activation values are used.

FP8 and NVFP4 quantization examples have been added for the Qwen3-Next-80B-A3B-Instruct model.

FP8 quantization support for Qwen3 VL MoE models
LLM Compressor now supports quantization for Qwen3 VL MoE models. You can now use data-free pathways such as FP8 channel-wise and block-wise quantization. Pathways that require data, such as W4A16 and NVFP4, are planned for a future release.
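
The following is a minimal data-free FP8 sketch for this kind of model, assuming the llm-compressor QuantizationModifier with the FP8_DYNAMIC preset scheme; the model identifier, loading class, and ignore patterns are illustrative and may differ for your transformers version.

    # Minimal sketch: data-free FP8 quantization (channel-wise weights, dynamic
    # per-token activations) for a Qwen3 VL MoE checkpoint. Assumes the
    # llm-compressor QuantizationModifier API; the model ID, loading class, and
    # ignore patterns are illustrative.
    from transformers import AutoModelForImageTextToText, AutoProcessor

    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    MODEL_ID = "Qwen/Qwen3-VL-30B-A3B-Instruct"   # illustrative model

    model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, torch_dtype="auto")
    processor = AutoProcessor.from_pretrained(MODEL_ID)

    recipe = QuantizationModifier(
        targets="Linear",
        scheme="FP8_DYNAMIC",   # a block-wise scheme is an alternative data-free pathway
        # Keep the language-model head and vision components unquantized.
        ignore=["lm_head", "re:.*visual.*"],
    )

    # No calibration dataset is needed for the data-free FP8 pathway.
    oneshot(model=model, recipe=recipe)

    model.save_pretrained("Qwen3-VL-30B-A3B-Instruct-FP8-dynamic")
    processor.save_pretrained("Qwen3-VL-30B-A3B-Instruct-FP8-dynamic")
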
Transforms support for non-full-size rotation sizes

You can now set a transform_block_size field in the Transform-based modifier classes SpinQuantModifier and QuIPModifier. With this field, you can configure transforms of variable size, and you no longer need to restrict Hadamard transforms to match the weight size.

It is typically beneficial to set the Hadamard block size to match the quantization group size. Examples have been updated to show how to use this field when applying QuIPModifier.
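
The following is a minimal sketch that pairs a block-size-constrained QuIP transform with W4A16 group quantization, assuming the llm-compressor v0.8.1 API; the model name, import path, and transform type are illustrative.

    # Minimal sketch: QuIP-style rotations with a Hadamard block size of 128,
    # matched to the default group size of the W4A16 quantization scheme.
    # Assumes the llm-compressor v0.8.1 API; the model name is illustrative.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier
    from llmcompressor.modifiers.transform import QuIPModifier

    MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative model

    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

    recipe = [
        # Hadamard block size chosen to match the W4A16 group size of 128.
        QuIPModifier(
            transform_type="hadamard",
            transform_block_size=128,
            targets="Linear",
        ),
        QuantizationModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
    ]

    oneshot(model=model, recipe=recipe)

    model.save_pretrained("Llama-3.1-8B-Instruct-quip-w4a16")
    tokenizer.save_pretrained("Llama-3.1-8B-Instruct-quip-w4a16")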

Improved accuracy recovery by updating W4A16 schemes to use actorder weight by default
The GPTQModifier class now uses weight activation ordering by default. Weight or "static" activation ordering has been shown to significantly improve accuracy recovery with no additional cost at runtime.
Re-enabled support for W8A8 INT8 decompression
W8A8 INT8 decompression and model generation have been re-enabled in LLM Compressor.
Updated ignore lists in example recipes to capture all vision components
Ignore lists in example recipes were updated to correctly capture all vision components. Previously, some vision components, such as model.vision_tower, were not being captured, which caused downstream issues when serving models with vLLM.
Deprecated and removed unittest.TestCase
Tests based on unittest.TestCase have been deprecated and removed, and replaced with standardized pytest test definitions.

1.3. Known issues

  • The FlashInfer kernel sampler is disabled by default in Red Hat AI Inference Server to address non-deterministic behavior and correctness errors in model output. This change affects sampling behavior when using FlashInfer top-p and top-k sampling methods.

    If required, you can enable FlashInfer by setting the VLLM_USE_FLASHINFER_SAMPLER environment variable at runtime:

    VLLM_USE_FLASHINFER_SAMPLER=1
  • When serving a model, the --async-scheduling flag produces incorrect output when requests are preempted and in certain other modes.
  • BART support is temporarily removed in vLLM v0.11.0 as part of the finalization of the vLLM V0 engine deprecation. It will be reinstated in a future release.
  • The aiter Python package is disabled by default in registry.redhat.io/rhaiis/vllm-rocm-rhel9:3.2.3.

    To enable aiter, configure the following Red Hat AI Inference Server runtime environment variables:

    VLLM_ROCM_USE_AITER=1
    VLLM_ROCM_USE_AITER_RMSNORM=0
    VLLM_ROCM_USE_AITER_MHA=0