
Chapter 2. Inference scheduling and caching capabilities


Distributed Inference with llm-d provides intelligent scheduling, caching, and resource management capabilities for serving large language models across multiple inference servers. These features optimize GPU usage, reduce inference latency, and enable cost-effective scaling.

Intelligent inference scheduling
Provides prefix-cache aware routing that directs each request to the replica most likely to have relevant KV cache entries already populated, maximizing GPU KV cache reuse. The inference scheduler evaluates GPU utilization metrics, queue depth, cache residency, service level agreement (SLA) constraints, and load distribution across nodes to select the optimal replica for each request.
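The replica selection described above can be illustrated with a minimal sketch. It is not the llm-d scheduler: the `Replica` fields, the weights, and the `max_queue` cap are hypothetical values chosen only to show how cache residency and load signals might be combined into a single score.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """Hypothetical snapshot of the signals a scheduler might see per replica."""
    name: str
    gpu_utilization: float      # fraction of GPU in use, 0.0 to 1.0
    queue_depth: int            # requests waiting on this replica
    cached_prefix_tokens: int   # prompt tokens estimated to be in its KV cache


def score(replica, prompt_tokens, max_queue=32, weights=(0.5, 0.3, 0.2)):
    """Return a score in [0, 1]; higher means a better routing target.

    The weights are illustrative assumptions, not values used by llm-d.
    """
    w_cache, w_queue, w_util = weights
    # Fraction of the prompt that would not need to be recomputed.
    cache_hit = min(replica.cached_prefix_tokens, prompt_tokens) / max(prompt_tokens, 1)
    # Remaining queue headroom, clamped at max_queue.
    queue_free = 1.0 - min(replica.queue_depth, max_queue) / max_queue
    # Remaining GPU headroom.
    util_free = 1.0 - replica.gpu_utilization
    return w_cache * cache_hit + w_queue * queue_free + w_util * util_free


def pick_replica(replicas, prompt_tokens):
    """Select the replica with the highest combined score."""
    return max(replicas, key=lambda r: score(r, prompt_tokens))
```

In this sketch a replica with a warm cache wins when load is comparable, but a saturated replica can still lose to an idle one with a cold cache, which is the trade-off that cache-aware and load-aware routing has to balance.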
KV cache management
Manages key-value cache efficiently across distributed inference servers, reducing memory requirements and enabling longer context windows. Routing requests to replicas with warm KV cache entries avoids redundant prompt processing, which improves both throughput and time-to-first-token.
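To show why a warm KV cache avoids redundant prompt processing, the following sketch estimates how many leading prompt tokens a replica could reuse. It assumes the cache is organized in fixed-size blocks identified by chained hashes, a common approach in inference engines; the block size, hash function, and helper names are illustrative assumptions rather than llm-d internals.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per KV block; an illustrative value only


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Hash each full block of tokens, chaining in the previous block's hash.

    Chaining means a block hash identifies the entire prefix up to that block,
    not just the tokens inside it.
    """
    hashes = []
    parent = b""
    full = len(token_ids) - len(token_ids) % block_size  # ignore a trailing partial block
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes


def cached_token_count(token_ids, cached_hashes, block_size=BLOCK_SIZE):
    """Count leading prompt tokens whose KV blocks are already cached."""
    matched = 0
    for h in block_hashes(token_ids, block_size):
        if h not in cached_hashes:
            break  # reuse stops at the first block that is not cached
        matched += 1
    return matched * block_size
```

For example, two requests that share an 8-token system prompt would reuse two blocks, so only the tokens after the shared prefix need prefill work on that replica.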
Prefill-decode disaggregation
Separates the compute-intensive prefill phase from the latency-sensitive decode phase, allowing you to assign each phase to appropriately optimized resources and scale them independently. The prefill phase processes the full input prompt in parallel and is assigned to compute-optimized resources. The decode phase generates tokens incrementally and is assigned to latency-optimized resources. This phase-aware architecture increases GPU utilization, reduces tail latency, and lowers cost per token.
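The handoff between the two phases can be sketched as follows. This is a toy model under stated assumptions: `PrefillWorker`, `DecodeWorker`, `KVCache`, and `serve` are hypothetical names, the worker selection is a trivial placeholder, and the "model" is stand-in arithmetic so that the example runs without a GPU.

```python
from dataclasses import dataclass


@dataclass
class KVCache:
    """Stand-in for the KV state that prefill produces and decode consumes."""
    request_id: str
    prompt_tokens: list


@dataclass
class PrefillWorker:
    name: str

    def prefill(self, request_id, prompt_tokens):
        # A real worker would process the whole prompt in one parallel pass.
        return KVCache(request_id, list(prompt_tokens))


@dataclass
class DecodeWorker:
    name: str

    def decode(self, kv, max_new_tokens):
        # Tokens are generated one at a time, each depending on the context so far.
        context = list(kv.prompt_tokens)
        generated = []
        for _ in range(max_new_tokens):
            next_token = (sum(context) + len(context)) % 1000  # placeholder for a model step
            generated.append(next_token)
            context.append(next_token)
        return generated


def serve(request_id, prompt_tokens, prefill_pool, decode_pool, max_new_tokens=3):
    """Run prefill on one pool, then hand the KV state to the other pool."""
    prefill_worker = prefill_pool[len(prompt_tokens) % len(prefill_pool)]
    decode_worker = decode_pool[len(prompt_tokens) % len(decode_pool)]
    kv = prefill_worker.prefill(request_id, prompt_tokens)
    return decode_worker.decode(kv, max_new_tokens)
```

Because the two pools are separate lists, each can be sized independently, for example a small number of prefill workers feeding a larger number of decode workers.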

Important

Prefill-decode disaggregation is a Developer Preview feature only. Developer Preview features are not supported by Red Hat in any way and are not functionally complete or production-ready. Do not use Developer Preview features for production or business-critical workloads. Developer Preview features provide early access to upcoming product features in advance of their possible inclusion in a Red Hat product offering, enabling customers to test functionality and provide feedback during the development process. These features might not have any documentation, are subject to change or removal at any time, and testing is limited. Red Hat might provide ways to submit feedback on Developer Preview features without an associated SLA.

Wide expert parallelism
Supports efficient distributed inference of mixture of experts (MoE) models across many GPU nodes, enabling cost-effective scaling of large models.
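As a rough illustration of how an MoE layer spreads work across devices, the sketch below routes each token to its top-k experts and groups the work by the rank that hosts each expert. It assumes experts are sharded contiguously and evenly across ranks; the function names and the gate scores in the usage note are hypothetical, and the sketch does not describe how llm-d places experts.

```python
def top_k_experts(gate_scores, k):
    """Return the indices of the k highest-scoring experts for one token."""
    ranked = sorted(range(len(gate_scores)), key=lambda e: gate_scores[e], reverse=True)
    return ranked[:k]


def expert_to_rank(expert_id, num_experts, num_ranks):
    """Map an expert to the rank that hosts it, assuming contiguous even sharding."""
    if num_experts % num_ranks != 0:
        raise ValueError("this sketch assumes experts divide evenly across ranks")
    return expert_id // (num_experts // num_ranks)


def dispatch(token_gate_scores, num_ranks, k=2):
    """Build a per-rank list of (token_index, expert_id) pairs to process."""
    num_experts = len(token_gate_scores[0])
    plan = {rank: [] for rank in range(num_ranks)}
    for token_idx, scores in enumerate(token_gate_scores):
        for expert in top_k_experts(scores, k):
            rank = expert_to_rank(expert, num_experts, num_ranks)
            plan[rank].append((token_idx, expert))
    return plan
```

With four experts on two ranks, a token whose gate scores are `[0.1, 0.7, 0.05, 0.15]` is sent to expert 1 on rank 0 and expert 3 on rank 1, so each token activates only a small subset of the model's parameters on each device.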

Important

Wide expert parallelism is a Developer Preview feature only. Developer Preview features are not supported by Red Hat in any way and are not functionally complete or production-ready. Do not use Developer Preview features for production or business-critical workloads. Developer Preview features provide early access to upcoming product features in advance of their possible inclusion in a Red Hat product offering, enabling customers to test functionality and provide feedback during the development process. These features might not have any documentation, are subject to change or removal at any time, and testing is limited. Red Hat might provide ways to submit feedback on Developer Preview features without an associated SLA.

© 2026 Red Hat