Chapter 2. Inference scheduling and caching capabilities


Distributed Inference with llm-d provides intelligent scheduling, caching, and resource management capabilities that optimize GPU usage, reduce inference latency, and enable cost-effective scaling of large language models.

Intelligent inference scheduling
Provides prefix-cache-aware routing that directs each request to the replica most likely to already hold the relevant key-value (KV) cache entries, maximizing GPU KV cache reuse. The inference scheduler evaluates GPU utilization metrics, queue depth, cache residency, service level agreement (SLA) constraints, and load distribution across nodes to select the optimal replica for each request.
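The following Python sketch illustrates this kind of scoring, combining cache affinity with load signals into a single per-replica score. The metric fields, weights, and function names are illustrative assumptions, not the llm-d API.

```python
from dataclasses import dataclass

@dataclass
class ReplicaMetrics:
    """Point-in-time metrics reported by one model-server replica.
    Field names are illustrative, not the llm-d API."""
    gpu_utilization: float   # busy fraction, 0.0-1.0
    queue_depth: int         # requests waiting on this replica
    prefix_hit_blocks: int   # KV cache blocks already matching the request prefix

def score_replica(m: ReplicaMetrics, prompt_blocks: int) -> float:
    """Combine cache affinity and load into one score (higher is better).
    The weights are placeholders; a real scheduler tunes them against SLA targets."""
    cache_affinity = m.prefix_hit_blocks / max(prompt_blocks, 1)
    load_penalty = 0.5 * m.gpu_utilization + 0.1 * m.queue_depth
    return cache_affinity - load_penalty

def pick_replica(replicas: dict[str, ReplicaMetrics], prompt_blocks: int) -> str:
    """Route the request to the best-scoring replica."""
    return max(replicas, key=lambda name: score_replica(replicas[name], prompt_blocks))

# Replica "a" has a warm prefix cache; replica "b" is idle but cold.
replicas = {
    "a": ReplicaMetrics(gpu_utilization=0.7, queue_depth=2, prefix_hit_blocks=12),
    "b": ReplicaMetrics(gpu_utilization=0.2, queue_depth=0, prefix_hit_blocks=0),
}
print(pick_replica(replicas, prompt_blocks=16))  # prefers "a" despite its higher load
```

In this example the cache-affinity term outweighs the load penalty, so the scheduler keeps the request on the replica whose KV cache is already warm; with no cache hits, the least-loaded replica wins instead.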
KV cache management
Manages the KV cache efficiently across distributed inference servers, reducing memory requirements and enabling longer context windows. Routing requests to replicas with warm KV cache entries avoids redundant prompt processing, which improves both throughput and time to first token.
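To show how cache residency can be detected, the following sketch hashes fixed-size blocks of prompt tokens, chaining each hash with its predecessor so that a single hash identifies the whole prefix up to that block. The block size, hashing scheme, and names are assumptions for illustration, not the llm-d implementation.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV cache block (illustrative value)

def prefix_block_hashes(token_ids: list[int]) -> list[str]:
    """Hash each full block of tokens, chaining in the previous hash so that
    equal hashes imply equal prefixes, not just equal blocks."""
    hashes: list[str] = []
    parent = b""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        digest = hashlib.sha256(parent + str(block).encode()).hexdigest()
        hashes.append(digest)
        parent = digest.encode()
    return hashes

def warm_blocks(request_hashes: list[str], replica_cache: set[str]) -> int:
    """Count the leading blocks already resident on a replica; only a
    contiguous prefix of the prompt can be reused."""
    count = 0
    for h in request_hashes:
        if h not in replica_cache:
            break
        count += 1
    return count

# A replica holding a 64-token warm prefix can serve 4 of this request's blocks.
cached = set(prefix_block_hashes(list(range(64))))
request = prefix_block_hashes(list(range(64)) + list(range(32)))
print(warm_blocks(request, cached))  # 4
```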
Prefill-decode disaggregation
Separates the compute-intensive prefill phase from the latency-sensitive decode phase, allowing you to assign each phase to appropriately optimized resources and scale them independently. The prefill phase processes the full input prompt in parallel and is assigned to compute-optimized resources. The decode phase generates tokens incrementally and is assigned to latency-optimized resources. This phase-aware architecture increases GPU utilization, reduces tail latency, and lowers cost per token.

Important

Prefill-decode disaggregation is a Developer Preview feature only. Developer Preview features are not supported by Red Hat in any way and are not functionally complete or production-ready. Do not use Developer Preview features for production or business-critical workloads. Developer Preview features provide early access to upcoming product features in advance of their possible inclusion in a Red Hat product offering, enabling customers to test functionality and provide feedback during the development process. These features might not have any documentation, are subject to change or removal at any time, and testing is limited. Red Hat might provide ways to submit feedback on Developer Preview features without an associated SLA.
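A toy sketch of the disaggregated flow, under simplified assumptions: a prefill worker processes whole prompts and hands the resulting KV cache to a decode worker that generates tokens incrementally. The queue-based handoff and all names are illustrative; a real deployment transfers KV blocks between separately sized GPU pools rather than threads in one process.

```python
from queue import Queue
from threading import Thread

handoff: Queue = Queue()  # stand-in for the KV cache transfer between pools

def prefill_worker(prompts):
    """Compute-bound phase: process each full prompt on compute-optimized
    resources, producing its KV cache."""
    for req_id, prompt in prompts:
        kv_cache = f"kv({prompt})"       # placeholder for the real KV tensors
        handoff.put((req_id, kv_cache))  # hand off to the decode pool

def decode_worker(n_requests: int):
    """Latency-bound phase: reuse the transferred KV cache to generate
    tokens one at a time on latency-optimized resources."""
    for _ in range(n_requests):
        req_id, kv_cache = handoff.get()
        print(f"request {req_id}: decoding with {kv_cache}")

prompts = [(1, "hello"), (2, "world")]
Thread(target=prefill_worker, args=(prompts,)).start()
decode_worker(len(prompts))
```

Because the two workers communicate only through the handoff, each pool can be sized, scheduled, and placed on hardware independently, which is the property the phase-aware architecture exploits.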

Wide expert parallelism
Supports efficient distributed inference of mixture of experts (MoE) models across many GPU nodes, enabling cost-effective scaling of large models.

Important

Wide expert parallelism is a Developer Preview feature only. Developer Preview features are not supported by Red Hat in any way and are not functionally complete or production-ready. Do not use Developer Preview features for production or business-critical workloads. Developer Preview features provide early access to upcoming product features in advance of their possible inclusion in a Red Hat product offering, enabling customers to test functionality and provide feedback during the development process. These features might not have any documentation, are subject to change or removal at any time, and testing is limited. Red Hat might provide ways to submit feedback on Developer Preview features without an associated SLA.
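To make the dispatch pattern concrete, the following sketch routes each token to the node hosting its highest-scoring expert. The expert count, node layout, and function names are illustrative assumptions; production systems exchange tokens with an all-to-all collective across GPU nodes rather than an in-process dictionary.

```python
NUM_EXPERTS = 8
EXPERTS_PER_NODE = 2  # 4 nodes host 2 experts each (illustrative layout)

def route_tokens(router_scores: list[list[float]]) -> dict[int, list[int]]:
    """Map node index -> token indices dispatched to that node.
    router_scores[t][e] is token t's affinity for expert e."""
    per_node: dict[int, list[int]] = {}
    for token_idx, scores in enumerate(router_scores):
        expert = max(range(NUM_EXPERTS), key=scores.__getitem__)  # top-1 routing
        node = expert // EXPERTS_PER_NODE
        per_node.setdefault(node, []).append(token_idx)
    return per_node

scores = [
    [0.1, 0.9, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # token 0 -> expert 1 -> node 0
    [0.0, 0.0, 0.0, 0.0, 0.8, 0.1, 0.0, 0.0],  # token 1 -> expert 4 -> node 2
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.7],  # token 2 -> expert 7 -> node 3
]
print(route_tokens(scores))  # {0: [0], 2: [1], 3: [2]}
```

Because only the experts a token actually activates do work, sharding experts widely keeps per-node memory low while spreading load across the cluster, which is what makes very large MoE models economical to serve.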
