Chapter 2. Inference scheduling and caching capabilities


Distributed Inference with llm-d provides intelligent scheduling, caching, and resource management capabilities for distributed inference. These features optimize GPU usage, reduce inference latency, and enable cost-effective scaling of large language models.

Intelligent inference scheduling
Provides prefix-cache aware routing that directs each request to the replica most likely to have relevant KV cache entries already populated, maximizing GPU KV cache reuse. The inference scheduler evaluates GPU utilization metrics, queue depth, cache residency, service level agreement (SLA) constraints, and load distribution across nodes to select the optimal replica for each request.
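As an illustration of how such a scheduler might weigh cache residency against load, the following sketch scores replicas on estimated prefix-cache hits, GPU utilization, and queue depth. The names (`Replica`, `score_replica`) and weights are hypothetical, not the llm-d API.

```python
# Hypothetical replica-scoring sketch; weights and fields are illustrative,
# not the actual llm-d scheduler implementation.
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    gpu_utilization: float     # 0.0-1.0, lower is better
    queue_depth: int           # pending requests, lower is better
    cached_prefix_tokens: int  # prompt tokens already resident in this replica's KV cache

def score_replica(r: Replica, prompt_tokens: int) -> float:
    """Higher score = better candidate for this request."""
    cache_hit_ratio = min(r.cached_prefix_tokens / prompt_tokens, 1.0) if prompt_tokens else 0.0
    load_penalty = 0.5 * r.gpu_utilization + 0.1 * r.queue_depth  # assumed weights
    return cache_hit_ratio - load_penalty

def pick_replica(replicas: list[Replica], prompt_tokens: int) -> Replica:
    return max(replicas, key=lambda r: score_replica(r, prompt_tokens))

replicas = [
    Replica("pod-a", gpu_utilization=0.9, queue_depth=4, cached_prefix_tokens=900),
    Replica("pod-b", gpu_utilization=0.3, queue_depth=1, cached_prefix_tokens=0),
]
best = pick_replica(replicas, prompt_tokens=1000)  # pod-a: a 90% cache hit outweighs its load
```

In this toy example the busier replica still wins because reusing 900 cached prompt tokens saves more work than the lighter load on the cold replica, which is the essence of prefix-cache aware routing.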
KV cache management
Manages key-value cache efficiently across distributed inference servers, reducing memory requirements and enabling longer context windows. Routing requests to replicas with warm KV cache entries avoids redundant prompt processing, which improves both throughput and time-to-first-token.
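A common way to estimate how many prompt tokens a replica can reuse is to chain-hash fixed-size blocks of the prompt and match them against the hashes of blocks the replica already holds. The block size and hashing scheme below are assumptions for the sketch, not the actual llm-d implementation.

```python
# Illustrative block-hash prefix matching; BLOCK_SIZE and the hashing scheme
# are assumptions, not the real llm-d KV cache layout.
import hashlib

BLOCK_SIZE = 16  # tokens per KV cache block (assumed)

def block_hashes(token_ids: list[int]) -> list[bytes]:
    """Chain-hash full blocks so each hash encodes its entire prefix."""
    hashes, prev = [], b""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for i in range(0, full, BLOCK_SIZE):
        block = token_ids[i:i + BLOCK_SIZE]
        prev = hashlib.sha256(prev + str(block).encode("utf-8")).digest()
        hashes.append(prev)
    return hashes

def reusable_tokens(prompt_token_ids: list[int], cached_hashes: set[bytes]) -> int:
    """Count leading prompt tokens covered by blocks the replica has cached."""
    matched = 0
    for h in block_hashes(prompt_token_ids):
        if h not in cached_hashes:
            break
        matched += BLOCK_SIZE
    return matched
```

For example, if a replica has cached the KV blocks of a 32-token shared system prompt, a new 40-token request that starts with the same system prompt can skip prefill for those 32 tokens and only process the remainder.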
Prefill-decode disaggregation

Separates the compute-intensive prefill phase from the latency-sensitive decode phase, allowing you to assign each phase to appropriately optimized resources and scale them independently. The prefill phase processes the full input prompt in parallel and is assigned to compute-optimized resources. The decode phase generates tokens incrementally and is assigned to latency-optimized resources. This phase-aware architecture increases GPU utilization, reduces tail latency, and lowers cost per token.

Important

Prefill-decode disaggregation is a Developer Preview feature only. Developer Preview features are not supported by Red Hat in any way and are not functionally complete or production-ready. Do not use Developer Preview features for production or business-critical workloads. Developer Preview features provide early access to upcoming product features in advance of their possible inclusion in a Red Hat product offering, enabling customers to test functionality and provide feedback during the development process. These features might not have any documentation, are subject to change or removal at any time, and testing is limited. Red Hat might provide ways to submit feedback on Developer Preview features without an associated SLA.
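The prefill/decode split described above can be sketched as two functions running on separate worker pools: a compute-bound prefill step that builds the KV cache and emits the first token, and a latency-bound decode loop that extends it one token at a time. The function names and the string stand-ins for KV tensors are purely illustrative; in practice the KV cache is transferred between pools over the network.

```python
# Minimal sketch of a disaggregated request flow; function names and the
# string stand-ins for KV tensors are hypothetical.
def prefill(prompt_tokens: list[str]) -> tuple[list[str], str]:
    """Compute-bound: process the whole prompt in parallel, emit KV cache + first token."""
    kv_cache = [f"kv({t})" for t in prompt_tokens]  # stand-in for real KV tensors
    first_token = "tok0"
    return kv_cache, first_token

def decode(kv_cache: list[str], first_token: str, max_new_tokens: int) -> list[str]:
    """Latency-bound: generate one token at a time, appending to the KV cache."""
    out = [first_token]
    for i in range(1, max_new_tokens):
        kv_cache.append(f"kv(tok{i - 1})")  # each step extends the cache by one entry
        out.append(f"tok{i}")
    return out

kv, tok = prefill(["a", "b", "c"])              # runs on a compute-optimized pool
completion = decode(kv, tok, max_new_tokens=4)  # runs on a latency-optimized pool
```

Because the two phases are separate processes, each pool can be sized and scaled independently, for example more decode replicas for chat workloads with long outputs.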

Wide expert parallelism

Supports efficient distributed inference of mixture of experts (MoE) models across many GPU nodes, enabling cost-effective scaling of large models.

Important

Wide expert parallelism is a Developer Preview feature only. Developer Preview features are not supported by Red Hat in any way and are not functionally complete or production-ready. Do not use Developer Preview features for production or business-critical workloads. Developer Preview features provide early access to upcoming product features in advance of their possible inclusion in a Red Hat product offering, enabling customers to test functionality and provide feedback during the development process. These features might not have any documentation, are subject to change or removal at any time, and testing is limited. Red Hat might provide ways to submit feedback on Developer Preview features without an associated SLA.
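To show the routing problem that expert parallelism solves, the following sketch picks the top-k experts for a token from its gate scores and groups them by the GPU rank that hosts each expert. The round-robin expert placement and all names here are assumptions for illustration, not the llm-d implementation.

```python
# Illustrative top-k MoE expert routing across GPU ranks; the round-robin
# placement policy and all names are assumptions for this sketch.
NUM_EXPERTS = 8
NUM_RANKS = 4  # GPUs; each rank hosts NUM_EXPERTS // NUM_RANKS experts

def rank_for_expert(expert_id: int) -> int:
    return expert_id % NUM_RANKS  # assumed placement policy

def route_token(gate_scores: list[float], top_k: int = 2) -> dict[int, list[int]]:
    """Pick the top-k experts for a token and group them by hosting rank."""
    top = sorted(range(len(gate_scores)), key=lambda e: gate_scores[e], reverse=True)[:top_k]
    by_rank: dict[int, list[int]] = {}
    for e in top:
        by_rank.setdefault(rank_for_expert(e), []).append(e)
    return by_rank  # the token's activations are dispatched (all-to-all) to these ranks

scores = [0.05, 0.30, 0.02, 0.10, 0.25, 0.08, 0.15, 0.05]
plan = route_token(scores)  # experts 1 and 4, hosted on ranks 1 and 0
```

Spreading experts over many ranks keeps per-GPU memory bounded as the expert count grows; the cost is the all-to-all token dispatch between ranks, which is why this technique pays off mainly at large scale.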
