Chapter 2. Inference scheduling and caching capabilities


Distributed Inference with llm-d provides intelligent scheduling, caching, and resource management capabilities that optimize GPU usage, reduce inference latency, and enable cost-effective scaling of large language models.

Intelligent inference scheduling
Provides prefix-cache-aware routing that directs each request to the replica most likely to already hold the relevant key-value (KV) cache entries, maximizing GPU KV cache reuse. The inference scheduler evaluates GPU utilization metrics, queue depth, cache residency, service level agreement (SLA) constraints, and load distribution across nodes to select the optimal replica for each request.
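The following Python sketch illustrates this kind of scoring, combining cache affinity with load signals into a single per-replica score. The metric fields, weights, and function names are illustrative assumptions, not the llm-d API.

```python
from dataclasses import dataclass

@dataclass
class ReplicaMetrics:
    """Point-in-time metrics reported by one model-server replica.
    Field names are illustrative, not the llm-d API."""
    gpu_utilization: float   # busy fraction, 0.0-1.0
    queue_depth: int         # requests waiting on this replica
    prefix_hit_blocks: int   # KV cache blocks already matching the request prefix

def score_replica(m: ReplicaMetrics, prompt_blocks: int) -> float:
    """Combine cache affinity and load into one score (higher is better).
    The weights are placeholders; a real scheduler tunes them against SLA targets."""
    cache_affinity = m.prefix_hit_blocks / max(prompt_blocks, 1)
    load_penalty = 0.5 * m.gpu_utilization + 0.1 * m.queue_depth
    return cache_affinity - load_penalty

def pick_replica(replicas: dict[str, ReplicaMetrics], prompt_blocks: int) -> str:
    """Route the request to the best-scoring replica."""
    return max(replicas, key=lambda name: score_replica(replicas[name], prompt_blocks))

# Replica "a" has a warm prefix cache; replica "b" is idle but cold.
replicas = {
    "a": ReplicaMetrics(gpu_utilization=0.7, queue_depth=2, prefix_hit_blocks=12),
    "b": ReplicaMetrics(gpu_utilization=0.2, queue_depth=0, prefix_hit_blocks=0),
}
print(pick_replica(replicas, prompt_blocks=16))  # prefers "a" despite its higher load
```

In this example the cache-affinity term outweighs the load penalty, so the scheduler keeps the request on the replica whose KV cache is already warm; with no cache hits, the least-loaded replica wins instead.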
KV cache management
Manages the KV cache efficiently across distributed inference servers, reducing memory requirements and enabling longer context windows. Routing requests to replicas with warm KV cache entries avoids redundant prompt processing, which improves both throughput and time to first token.
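To show how cache residency can be detected, the following sketch hashes fixed-size blocks of prompt tokens, chaining each hash with its predecessor so that a single hash identifies the whole prefix up to that block. The block size, hashing scheme, and names are assumptions for illustration, not the llm-d implementation.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV cache block (illustrative value)

def prefix_block_hashes(token_ids: list[int]) -> list[str]:
    """Hash each full block of tokens, chaining in the previous hash so that
    equal hashes imply equal prefixes, not just equal blocks."""
    hashes: list[str] = []
    parent = b""
    full = len(token_ids) - len(token_ids) % BLOCK_SIZE
    for start in range(0, full, BLOCK_SIZE):
        block = token_ids[start:start + BLOCK_SIZE]
        digest = hashlib.sha256(parent + str(block).encode()).hexdigest()
        hashes.append(digest)
        parent = digest.encode()
    return hashes

def warm_blocks(request_hashes: list[str], replica_cache: set[str]) -> int:
    """Count the leading blocks already resident on a replica; only a
    contiguous prefix of the prompt can be reused."""
    count = 0
    for h in request_hashes:
        if h not in replica_cache:
            break
        count += 1
    return count

# A replica holding a 64-token warm prefix can serve 4 of this request's blocks.
cached = set(prefix_block_hashes(list(range(64))))
request = prefix_block_hashes(list(range(64)) + list(range(32)))
print(warm_blocks(request, cached))  # 4
```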
Prefill-decode disaggregation
Separates the compute-intensive prefill phase from the latency-sensitive decode phase, allowing you to assign each phase to appropriately optimized resources and scale them independently. The prefill phase processes the full input prompt in parallel and is assigned to compute-optimized resources. The decode phase generates tokens incrementally and is assigned to latency-optimized resources. This phase-aware architecture increases GPU utilization, reduces tail latency, and lowers cost per token.

Important

Prefill-decode disaggregation is a Developer Preview feature only. Developer Preview features are not supported by Red Hat in any way and are not functionally complete or production-ready. Do not use Developer Preview features for production or business-critical workloads. Developer Preview features provide early access to upcoming product features in advance of their possible inclusion in a Red Hat product offering, enabling customers to test functionality and provide feedback during the development process. These features might not have any documentation, are subject to change or removal at any time, and testing is limited. Red Hat might provide ways to submit feedback on Developer Preview features without an associated SLA.
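A toy sketch of the disaggregated flow, under simplified assumptions: a prefill worker processes whole prompts and hands the resulting KV cache to a decode worker that generates tokens incrementally. The queue-based handoff and all names are illustrative; a real deployment transfers KV blocks between separately sized GPU pools rather than threads in one process.

```python
from queue import Queue
from threading import Thread

handoff: Queue = Queue()  # stand-in for the KV cache transfer between pools

def prefill_worker(prompts):
    """Compute-bound phase: process each full prompt on compute-optimized
    resources, producing its KV cache."""
    for req_id, prompt in prompts:
        kv_cache = f"kv({prompt})"       # placeholder for the real KV tensors
        handoff.put((req_id, kv_cache))  # hand off to the decode pool

def decode_worker(n_requests: int):
    """Latency-bound phase: reuse the transferred KV cache to generate
    tokens one at a time on latency-optimized resources."""
    for _ in range(n_requests):
        req_id, kv_cache = handoff.get()
        print(f"request {req_id}: decoding with {kv_cache}")

prompts = [(1, "hello"), (2, "world")]
Thread(target=prefill_worker, args=(prompts,)).start()
decode_worker(len(prompts))
```

Because the two workers communicate only through the handoff, each pool can be sized, scheduled, and placed on hardware independently, which is the property the phase-aware architecture exploits.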

Wide expert parallelism
Supports efficient distributed inference of mixture of experts (MoE) models across many GPU nodes, enabling cost-effective scaling of large models.

Important

Wide expert parallelism is a Developer Preview feature only. Developer Preview features are not supported by Red Hat in any way and are not functionally complete or production-ready. Do not use Developer Preview features for production or business-critical workloads. Developer Preview features provide early access to upcoming product features in advance of their possible inclusion in a Red Hat product offering, enabling customers to test functionality and provide feedback during the development process. These features might not have any documentation, are subject to change or removal at any time, and testing is limited. Red Hat might provide ways to submit feedback on Developer Preview features without an associated SLA.
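To make the dispatch pattern concrete, the following sketch routes each token to the node hosting its highest-scoring expert. The expert count, node layout, and function names are illustrative assumptions; production systems exchange tokens with an all-to-all collective across GPU nodes rather than an in-process dictionary.

```python
NUM_EXPERTS = 8
EXPERTS_PER_NODE = 2  # 4 nodes host 2 experts each (illustrative layout)

def route_tokens(router_scores: list[list[float]]) -> dict[int, list[int]]:
    """Map node index -> token indices dispatched to that node.
    router_scores[t][e] is token t's affinity for expert e."""
    per_node: dict[int, list[int]] = {}
    for token_idx, scores in enumerate(router_scores):
        expert = max(range(NUM_EXPERTS), key=scores.__getitem__)  # top-1 routing
        node = expert // EXPERTS_PER_NODE
        per_node.setdefault(node, []).append(token_idx)
    return per_node

scores = [
    [0.1, 0.9, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # token 0 -> expert 1 -> node 0
    [0.0, 0.0, 0.0, 0.0, 0.8, 0.1, 0.0, 0.0],  # token 1 -> expert 4 -> node 2
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.7],  # token 2 -> expert 7 -> node 3
]
print(route_tokens(scores))  # {0: [0], 2: [1], 3: [2]}
```

Because only the experts a token actually activates do work, sharding experts widely keeps per-node memory low while spreading load across the cluster, which is what makes very large MoE models economical to serve.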
