Chapter 9. Configure token-based rate limiting with TokenRateLimitPolicy
Red Hat Connectivity Link provides the TokenRateLimitPolicy custom resource to enforce rate limits based on token consumption rather than the number of requests. This policy extends the Envoy Rate Limit Service (RLS) protocol with automatic token usage extraction. It is particularly useful for protecting Large Language Model (LLM) APIs, where the cost and resource usage correlate more closely with the number of tokens processed.
Unlike the standard RateLimitPolicy, which counts requests, TokenRateLimitPolicy counts tokens by extracting usage metrics from the body of the AI inference API response, allowing finer-grained control over API usage based on actual workload.
9.1. How token rate limiting works
The TokenRateLimitPolicy tracks cumulative token usage per client. Before forwarding a request, it checks if the client has already exceeded their limit from previous usage. After the upstream responds, it extracts the actual token cost and updates the client’s counter.
The flow is as follows:
- On an incoming request, the gateway evaluates the matching rules and predicates from the TokenRateLimitPolicy resources.
- If the request matches, the gateway prepares the necessary rate limit descriptors and monitors the response.
- After receiving the response, the gateway extracts the usage.total_tokens field from the JSON response body.
- The gateway then sends a RateLimitRequest to Limitador, including the actual token count as a hits_addend.
- Limitador tracks the cumulative token usage and responds to the gateway with OK or OVER_LIMIT.
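For reference, an OpenAI-style chat completion response carries a top-level usage object; the gateway reads the usage.total_tokens field from it. The following body is illustrative, with placeholder values:

```json
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "model": "gpt-4",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "Hello!" },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 3,
    "total_tokens": 12
  }
}
```

With this response, Limitador would add 12 to the client's counter for the current window.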
9.2. Key features and use cases
- Enforces limits based on token usage by extracting the usage.total_tokens field from an OpenAI-style inference JSON response body.
- Suitable for consumption-based APIs such as LLMs, where cost is tied to token counts.
- Allows defining different limits based on criteria such as user identity, API endpoints, or HTTP methods.
- Works with AuthPolicy to apply specific limits to authenticated users or groups.
- Inherits functionality from RateLimitPolicy, including defining multiple limits with different durations and using Redis for shared counters in multi-cluster environments.
9.3. Integrating with AuthPolicy
You can combine TokenRateLimitPolicy with AuthPolicy to apply token limits based on authenticated user identity. When an AuthPolicy successfully authenticates a request, it injects identity information, which the TokenRateLimitPolicy can then use to select the appropriate limit.
For example, you can define different token limits for users belonging to 'free-tier' versus 'premium-tier' groups, identified using claims in a JWT validated by AuthPolicy.
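As a sketch, tier selection can be expressed with predicates over the injected identity object. The field names below are assumptions modeled on the RateLimitPolicy API, and the claim path (auth.identity.groups) depends on how your AuthPolicy maps JWT claims, so adjust both to your environment:

```yaml
# Fragment of a TokenRateLimitPolicy spec (field names assumed):
# each limit applies only when its predicate matches the user's group claim.
limits:
  free-tier:
    rates:
      - limit: 10000
        window: 24h
    when:
      - predicate: auth.identity.groups.split(",").exists(g, g == "free-tier")
  premium-tier:
    rates:
      - limit: 100000
        window: 24h
    when:
      - predicate: auth.identity.groups.split(",").exists(g, g == "premium-tier")
```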
9.4. Configure token-based rate limiting for LLM APIs
This guide shows how to configure TokenRateLimitPolicy to protect a hypothetical LLM API deployed on OpenShift Container Platform, integrated with AuthPolicy for user-specific limits.
Prerequisites
- Connectivity Link is installed on your OpenShift Container Platform cluster.
- A Gateway and an HTTPRoute are configured to expose your service.
- An AuthPolicy is configured for authentication (for example, using API keys or OIDC).
- Redis is configured for Limitador if running in a multi-cluster setup or requiring persistent counters.
- Your upstream service is configured to return an OpenAI-compatible JSON response containing a usage.total_tokens field in the response body.
Procedure
1. Create a TokenRateLimitPolicy resource. This example defines two limits: one for free users with a limit of 10,000 tokens per day, and one for pro users with a limit of 100,000 tokens per day.
2. Apply the policy:

   oc apply -f your-tokenratelimitpolicy.yaml -n my-api-namespace

3. Check the status of the policy to ensure it has been accepted and enforced on the target HTTPRoute. Look for conditions with type: Accepted and type: Enforced with status: "True".

   oc get tokenratelimitpolicy llm-protection -n my-api-namespace -o jsonpath='{.status.conditions}'

4. Send requests to your API endpoint, including the required authentication details.

   curl -H "Authorization: <auth-details>" \
     -d '{"model": "gpt-4", "messages": [{"role": "user", "content": "Hello"}]}' \
     <your-api-endpoint>
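The policy manifest referenced in the procedure is not shown above. A minimal sketch follows; the apiVersion, the CEL predicates, the counters expression, and the HTTPRoute name llm-api are assumptions modeled on the Kuadrant RateLimitPolicy API and a typical AuthPolicy claim layout, so verify them against your installed CRD:

```yaml
apiVersion: kuadrant.io/v1alpha1          # assumed API group/version
kind: TokenRateLimitPolicy
metadata:
  name: llm-protection
  namespace: my-api-namespace
spec:
  targetRef:
    group: gateway.networking.k8s.io
    kind: HTTPRoute
    name: llm-api                         # hypothetical HTTPRoute name
  limits:
    free:
      rates:
        - limit: 10000                    # 10,000 tokens per day
          window: 24h
      when:
        - predicate: auth.identity.groups.split(",").exists(g, g == "free")
      counters:
        - expression: auth.identity.userid  # one counter per authenticated user
    pro:
      rates:
        - limit: 100000                   # 100,000 tokens per day
          window: 24h
      when:
        - predicate: auth.identity.groups.split(",").exists(g, g == "pro")
      counters:
        - expression: auth.identity.userid
```

Because the counters expression keys on the user identity, each authenticated user consumes an independent daily token budget rather than sharing one global counter.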
Verification
- Ensure your upstream service responds with an OpenAI-compatible JSON body containing the usage.total_tokens field.
- Requests made while the client is within their token limits receive a 200 OK response (or another success status), and the client's token counter is updated.
- Requests made after the client has exceeded their token limits receive a 429 Too Many Requests response.