Chapter 2. New features and enhancements
New versions of vLLM and LLM Compressor are included in this release:
- vLLM: 900+ upstream commits since vLLM v0.8.4. New features include FP8 fused Mixture of Experts (MoE) kernels, 14 newly supported models, a /server_info endpoint, and dynamic LoRA hot reload.
- LLM Compressor v0.5.1
- The Red Hat AI Inference Server container base is now built on PyTorch 2.7 and Triton 3.2.
- Red Hat AI Inference Server is now fully supported on FIPS-compliant Red Hat Enterprise Linux (RHEL) hosts.
- The Red Hat AI Inference Server supported product and hardware configurations have been expanded. For more information, see Supported product and hardware configurations.
| Feature | Benefit | Supported GPUs |
|---|---|---|
| Blackwell support | Runs on NVIDIA B200 (compute capability 10.0) GPUs with FP8 kernels and full CUDA Graph acceleration | NVIDIA Blackwell |
| FP8 KV cache on ROCm | Roughly twice as large context windows with no accuracy loss (see the sketch after this table) | All AMD GPUs |
| Skinny GEMMs | Roughly 10% lower inference latency | AMD MI300X |
| Full CUDA Graph mode | 6–8% improvement in average Time Per Output Token (TPOT) for small models | NVIDIA A100 and H100 |
| Auto FP16 fallback | Stable runs on pre-Ampere cards, for example NVIDIA T4 GPUs, without manual flags | Older NVIDIA GPUs |
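The FP8 KV-cache row above corresponds to vLLM's `kv_cache_dtype` setting. The following is a minimal sketch of enabling it with the offline Python API; the model name, prompt, and sampling settings are illustrative assumptions, not part of this release note. The equivalent option on the `vllm serve` command line is `--kv-cache-dtype fp8`.

```python
# Minimal sketch: enable the FP8 KV cache with the vLLM offline API.
# The model name and sampling settings below are illustrative assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # hypothetical model choice
    kv_cache_dtype="fp8",  # FP8 KV cache roughly doubles the context that fits in GPU memory
)

outputs = llm.generate(
    ["Summarize the benefits of an FP8 KV cache in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```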
2.1. New models enabled
Red Hat AI Inference Server 3.1 expands capabilities by enabling the following models:
Added in vLLM version 0.8.5:
- Qwen3 and Qwen3MoE
- ModernBERT
- Granite Speech
- PLaMo2
- Kimi-VL
- Snowflake Arctic Embed
Added in vLLM version 0.9.0:
- MiMo-7B
- MiniMax-VL-01
- Ovis 1.6, Ovis 2
- Granite 4
- FalconH1
- LlamaGuard4
2.2. New developer features
- /server_info REST endpoint: Query model, KV cache, and device settings for observability and automation (see the sketch after this list).
- Dynamic LoRA hot reload: Swap fine-tuned adapters from a URL with zero downtime.
- vllm-bench CLI: "Ship-in-container" tool for instant latency and throughput sizing.
- Faster incremental detokenization: Streaming responses start twice as fast on CUDA and ROCm GPUs.
- torch.compile caching: Cached first prompt compilation shortens warm-up times across host restarts.
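For the /server_info endpoint, the following is a minimal sketch of how a monitoring job might query it. The base URL assumes vLLM's default listen address of localhost:8000; the exact fields in the response depend on the deployment.

```python
# Minimal sketch: query the /server_info endpoint for observability data.
# The base URL is an assumption (vLLM's default listen address); adjust it for your deployment.
import json
import urllib.request

BASE_URL = "http://localhost:8000"

with urllib.request.urlopen(f"{BASE_URL}/server_info") as response:
    info = json.load(response)

# Print the returned model, KV cache, and device settings for inspection.
print(json.dumps(info, indent=2))
```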
2.3. New operational features
- Lower total cost of ownership (TCO): FP8/INT8 kernels and skinny GEMMs allow the same GPUs to serve more tokens per second.
- Larger models on AMD GPUs: ROCm now matches CUDA for FP8 and fused MoE model performance, making AMD MI300X a first-class deployment target.
- Operational agility: LoRA hot swap and the /server_info endpoint enable continuous integration and deployment of fine-tuned models without pod restarts (see the sketch after this list).
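The following is a minimal sketch of swapping a LoRA adapter on a running server, assuming the server was started with runtime LoRA updating enabled and using hypothetical adapter names and paths. Because the swap happens through the API, a CI/CD pipeline can publish a new adapter and activate it without restarting the serving pod.

```python
# Minimal sketch: hot-swap a LoRA adapter on a running server without a restart.
# Assumes the server was started with runtime LoRA updating enabled
# (for example, VLLM_ALLOW_RUNTIME_LORA_UPDATING=True in the container environment).
# The adapter names and path below are illustrative assumptions.
import json
import urllib.request

BASE_URL = "http://localhost:8000"

def post(path: str, payload: dict) -> None:
    """Send a JSON POST request to the inference server and print the status."""
    request = urllib.request.Request(
        f"{BASE_URL}{path}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        print(path, response.status)

# Unload the previous adapter, then load the newly fine-tuned one.
post("/v1/unload_lora_adapter", {"lora_name": "support-bot-v1"})
post("/v1/load_lora_adapter", {
    "lora_name": "support-bot-v2",
    "lora_path": "/models/adapters/support-bot-v2",
})
```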