
Chapter 2. Inference scheduling and caching capabilities


Distributed Inference with llm-d provides intelligent scheduling, caching, and resource management capabilities for distributed inference. These features optimize GPU usage, reduce inference latency, and enable cost-effective scaling of large language models.

Intelligent inference scheduling
Provides prefix-cache-aware routing that directs each request to the replica most likely to have relevant KV cache entries already populated, maximizing GPU KV cache reuse. The inference scheduler evaluates GPU utilization metrics, queue depth, cache residency, service level agreement (SLA) constraints, and load distribution across nodes to select the optimal replica for each request.
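The scheduler's actual scoring function is internal to llm-d and is not reproduced here. The following minimal Python sketch only illustrates the idea of combining the signals listed above; the replica fields, weights, and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    gpu_utilization: float      # 0.0 to 1.0, from GPU metrics
    queue_depth: int            # requests waiting on this replica
    cached_prefix_tokens: int   # tokens of this prompt already in KV cache

def score(replica: Replica, prompt_tokens: int) -> float:
    """Higher is better: reward cache residency, penalize load.

    The weights (2.0, 1.0, 0.1) are illustrative, not llm-d's.
    """
    cache_hit_ratio = replica.cached_prefix_tokens / max(prompt_tokens, 1)
    return (2.0 * cache_hit_ratio
            - replica.gpu_utilization
            - 0.1 * replica.queue_depth)

def pick_replica(replicas: list[Replica], prompt_tokens: int) -> Replica:
    """Select the replica with the best combined score for this request."""
    return max(replicas, key=lambda r: score(r, prompt_tokens))
```

A replica holding most of the prompt's KV cache can win even when a colder replica is less loaded, because reusing the cache avoids recomputing the prefill.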
KV cache management
Manages the key-value (KV) cache efficiently across distributed inference servers, reducing memory requirements and enabling longer context windows. Routing requests to replicas with warm KV cache entries avoids redundant prompt processing, which improves both throughput and time-to-first-token.
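One common way to detect a warm replica is to hash the prompt in fixed-size token blocks, so that identical prefixes produce identical chains of block keys. The sketch below assumes a hypothetical 16-token block size and a per-replica index of resident block hashes; it is an illustration of the technique, not llm-d's implementation.

```python
import hashlib

BLOCK = 16  # tokens per KV cache block (hypothetical size)

def block_hashes(token_ids: list[int]) -> list[str]:
    """Chain-hash each full prefix-aligned block.

    Because each hash folds in all earlier blocks, two prompts share a
    block key only if they share the entire prefix up to that block.
    """
    hashes, h = [], hashlib.sha256()
    for i in range(0, len(token_ids) - len(token_ids) % BLOCK, BLOCK):
        h.update(str(token_ids[i:i + BLOCK]).encode("utf-8"))
        hashes.append(h.copy().hexdigest())
    return hashes

def cached_prefix_len(prompt_ids: list[int], replica_index: set[str]) -> int:
    """Count the leading prompt tokens already resident on a replica."""
    cached = 0
    for block_hash in block_hashes(prompt_ids):
        if block_hash not in replica_index:
            break
        cached += BLOCK
    return cached
```

A scheduler can feed `cached_prefix_len` into its replica score: the longer the resident prefix, the less prompt processing the replica has to repeat.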
Prefill-decode disaggregation

Separates the compute-intensive prefill phase from the latency-sensitive decode phase, allowing you to assign each phase to appropriately optimized resources and scale them independently. The prefill phase processes the full input prompt in parallel and is assigned to compute-optimized resources. The decode phase generates tokens incrementally and is assigned to latency-optimized resources. This phase-aware architecture increases GPU utilization, reduces tail latency, and lowers cost per token.

Important

Prefill-decode disaggregation is a Developer Preview feature only. Developer Preview features are not supported by Red Hat in any way and are not functionally complete or production-ready. Do not use Developer Preview features for production or business-critical workloads. Developer Preview features provide early access to upcoming product features in advance of their possible inclusion in a Red Hat product offering, enabling customers to test functionality and provide feedback during the development process. These features might not have any documentation, are subject to change or removal at any time, and testing is limited. Red Hat might provide ways to submit feedback on Developer Preview features without an associated SLA.
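At its core, disaggregation means each request phase is dispatched to a separate worker pool that can be sized and provisioned independently. The sketch below shows only that routing decision; the pool structure and field names are hypothetical, not part of llm-d's API.

```python
from enum import Enum

class Phase(Enum):
    PREFILL = "prefill"  # whole prompt at once, compute-bound
    DECODE = "decode"    # one token per step, latency-bound

# Hypothetical worker records: each pool scales independently.
def dispatch(phase: Phase, prefill_workers: list[dict], decode_workers: list[dict]) -> dict:
    """Route a request phase to its own pool, then pick the least-loaded worker.

    Prefill lands on compute-optimized GPUs; decode lands on
    latency-optimized GPUs, so neither phase queues behind the other.
    """
    pool = prefill_workers if phase is Phase.PREFILL else decode_workers
    return min(pool, key=lambda worker: worker["queue_depth"])
```

After prefill completes, the resulting KV cache is transferred to the chosen decode worker, which generates tokens against the cached context.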

Wide expert parallelism

Supports efficient distributed inference of mixture of experts (MoE) models across many GPU nodes, enabling cost-effective scaling of large models.

Important

Wide expert parallelism is a Developer Preview feature only. Developer Preview features are not supported by Red Hat in any way and are not functionally complete or production-ready. Do not use Developer Preview features for production or business-critical workloads. Developer Preview features provide early access to upcoming product features in advance of their possible inclusion in a Red Hat product offering, enabling customers to test functionality and provide feedback during the development process. These features might not have any documentation, are subject to change or removal at any time, and testing is limited. Red Hat might provide ways to submit feedback on Developer Preview features without an associated SLA.
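In an MoE model, a gating network selects a small top-k subset of experts per token, and with wide expert parallelism those experts are sharded across GPU nodes, so each token's hidden state travels only to the nodes hosting its selected experts. The following minimal sketch shows standard top-k gating with softmax weights and a simple round-robin expert placement; the placement scheme is illustrative, not llm-d's.

```python
import math

def top_k_experts(gate_logits: list[float], k: int = 2) -> list[tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    idx = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    exps = [math.exp(gate_logits[i]) for i in idx]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(idx, exps)]

def expert_to_node(expert_id: int, num_nodes: int) -> int:
    """Illustrative round-robin sharding of experts across nodes.

    Only the nodes hosting a token's top-k experts receive its hidden state,
    which keeps per-token communication bounded as the expert count grows.
    """
    return expert_id % num_nodes
```

Because each token activates only k of the model's experts, total parameter count can grow with the number of nodes while per-token compute and traffic stay roughly constant.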
