Observability Guide
Abstract
Chapter 1. Tracking instance status with health checks
Check if an instance has finished its startup and is ready to serve requests by calling its health REST endpoints.
Red Hat build of Keycloak has built-in support for health checks. This chapter describes how to enable and use the Red Hat build of Keycloak health checks. The health checks are exposed on the management port 9000 by default. For more details, see Configuring the Management Interface.
1.1. Red Hat build of Keycloak health check endpoints
Red Hat build of Keycloak exposes four health endpoints:
- /health/live
- /health/ready
- /health/started
- /health
See the Quarkus SmallRye Health docs for information on the meaning of each endpoint.
These endpoints respond with HTTP status 200 OK on success or 503 Service Unavailable on failure, and a JSON object like the following:
Successful response for endpoints without additional per-check information:
{
"status": "UP",
"checks": []
}
Successful response for endpoints with information on the database connection:
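The original example is not reproduced here; the following is a sketch of what such a response can look like when the database check is included (the exact check name may differ in your version):
{
    "status": "UP",
    "checks": [
        {
            "name": "Keycloak database connections health check",
            "status": "UP"
        }
    ]
}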
1.2. Enabling the health checks
It is possible to enable the health checks using the build time option health-enabled:
bin/kc.[sh|bat] build --health-enabled=true
By default, no check is returned from the health endpoints.
1.3. Using the health checks
It is recommended that the health endpoints be monitored by external HTTP requests. Due to security measures that remove curl and other packages from the Red Hat build of Keycloak container image, local command-based monitoring will not function easily.
If you are not using Red Hat build of Keycloak in a container, use any HTTP client to access the health check endpoints.
1.3.1. curl
You may use a simple HTTP HEAD request to determine the live or ready state of Red Hat build of Keycloak. curl is a good HTTP client for this purpose.
If Red Hat build of Keycloak is deployed in a container, you must run this command from outside it due to the previously mentioned security measures. For example:
curl --head -fsS http://localhost:9000/health/ready
If the command returns with exit status 0, then Red Hat build of Keycloak is live or ready, depending on which endpoint you called. Otherwise, there is a problem.
1.3.2. Kubernetes
Define an HTTP Probe so that Kubernetes may externally monitor the health endpoints. Do not use a liveness command.
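A minimal sketch of such probes, assuming the default management port 9000 and that they are added to the Keycloak container spec (adjust timings to your deployment):
readinessProbe:
  httpGet:
    path: /health/ready
    port: 9000
livenessProbe:
  httpGet:
    path: /health/live
    port: 9000
startupProbe:
  httpGet:
    path: /health/started
    port: 9000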
1.3.3. HEALTHCHECK
The Containerfile HEALTHCHECK instruction defines a command that will be periodically executed inside the container as it runs. The Red Hat build of Keycloak container does not have any CLI HTTP clients installed. Consider installing curl as an additional RPM, as detailed by the Running Red Hat build of Keycloak in a container chapter. Note that your container may be less secure because of this.
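A sketch of what such an instruction could look like, assuming curl has been installed into the image as described in that chapter (intervals are illustrative):
# Hypothetical Containerfile snippet; assumes curl was added as an additional RPM.
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD curl --head -fsS http://localhost:9000/health/ready || exit 1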
1.4. Available Checks
The table below shows the available checks.
Check | Description | Requires Metrics |
---|---|---|
Database | Returns the status of the database connection pool. | Yes |
For some checks, you’ll need to also enable metrics as indicated by the Requires Metrics column. To enable metrics, use the metrics-enabled option as follows:
bin/kc.[sh|bat] build --health-enabled=true --metrics-enabled=true
1.5. Relevant options
Chapter 2. Gaining insights with metrics
Collect metrics to gain insights about the state and activities of a running instance of Red Hat build of Keycloak.
Red Hat build of Keycloak has built-in support for metrics. This chapter describes how to enable and configure server metrics.
2.1. Enabling Metrics
It is possible to enable metrics using the build time option metrics-enabled:
bin/kc.[sh|bat] start --metrics-enabled=true
2.2. Querying Metrics
Red Hat build of Keycloak exposes metrics on the management interface at the following endpoint:
- /metrics
For more information about the management interface, see Configuring the Management Interface. The response from the endpoint uses an application/openmetrics-text content type and is based on the Prometheus (OpenMetrics) text format. The snippet below is an example of a response:
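The original snippet is not included here; the following is a shortened, illustrative sketch of what the output can look like (the metrics present and their values depend on your configuration):
# HELP jvm_memory_used_bytes The amount of used memory
# TYPE jvm_memory_used_bytes gauge
jvm_memory_used_bytes{area="heap",id="Eden Space",} 1.2345678E7
jvm_memory_used_bytes{area="nonheap",id="Metaspace",} 6.54321E7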
2.3. Next steps
Read the chapters Monitoring performance with Service Level Indicators and Troubleshooting using metrics to see how to use the metrics.
2.4. Relevant options
Chapter 3. Monitoring user activities with event metrics
Event metrics provide an aggregated view of user activities in a Red Hat build of Keycloak instance.
For now, only metrics for user events are captured. For example, you can monitor the number of logins, login failures, or token refreshes performed.
The metrics are exposed using the standard metrics endpoint, and you can use it in your own metrics collection system to create dashboards and alerts.
The metrics are reported as counters per Red Hat build of Keycloak instance. The counters are reset when the instance restarts. If you have multiple instances running in a cluster, you will need to collect the metrics from all instances and aggregate them to get a per-cluster view.
3.1. Enable event metrics
To start collecting event metrics, enable metrics and enable the metrics for user events.
The following shows the required startup parameters:
bin/kc.[sh|bat] start --metrics-enabled=true --event-metrics-user-enabled=true ...
By default, there is a separate metric for each realm. To break down the metrics by client and identity provider, you can add those metric dimensions using the configuration option event-metrics-user-tags. This can be useful on installations with a small number of clients and identity providers. It is not recommended for installations with a large number of clients or identity providers, as it will increase both the memory usage of Red Hat build of Keycloak and the load on your monitoring system.
The following shows how to configure Red Hat build of Keycloak to break down the metrics by all three metric dimensions:
bin/kc.[sh|bat] start ... --event-metrics-user-tags=realm,idp,clientId ...
You can limit the events for which Red Hat build of Keycloak will expose metrics. See the Server Administration Guide on event types for an overview of the available events.
The following example limits the events collected to LOGIN and LOGOUT events:
bin/kc.[sh|bat] start ... --event-metrics-user-events=login,logout ...
See Self-provided metrics for a description of the metrics collected.
3.2. Relevant options
Chapter 4. Monitoring performance with Service Level Indicators
Track performance and reliability as perceived by users with Service Level Indicators (SLIs) and Service Level Objectives (SLOs).
Service Level Indicators (SLIs) and Service Level Objectives (SLOs) are essential components in monitoring and maintaining the performance and reliability of Red Hat build of Keycloak in production environments.
The Google Site Reliability Engineering book defines this as follows:
- A Service Level Indicator (SLI) is a carefully defined quantitative measure of some aspect of the level of service that is provided.
- A Service Level Objective (SLO) is a target value or range of values for a service level that is measured by an SLI.
By agreeing on these with the stakeholders and tracking them, service owners can ensure that deployments are aligned with users’ expectations and that they neither over- nor under-deliver on the service they provide.
4.1. Prerequisites
- Metrics need to be enabled for Red Hat build of Keycloak, and the http-metrics-slos option needs to be set to the latency to be measured for the SLO defined below. Follow the Gaining insights with metrics chapter for more details.
- A monitoring system collecting the metrics. The following paragraphs assume Prometheus or a similar system that supports the PromQL query language is used.
4.2. Definition of the service delivered
The following service definition is used in the next steps to identify the appropriate SLIs and SLOs. It should capture the behavior observed by its users.
As a Red Hat build of Keycloak user,
- I want to be able to log in,
- refresh my token and
- log out,
so that I can use the applications that use Red Hat build of Keycloak for authentication.
4.3. Definition of SLI and SLO
The following provides example SLIs and SLOs based on the service description above and the metrics available in Red Hat build of Keycloak.
While these SLOs are independent of the actual load of the system, this is intentional: a single user does not care about the system load if they receive slow responses.
At the same time, if you enter a Service Level Agreement (SLA) with stakeholders, you as the operator of Red Hat build of Keycloak have an interest in defining limits for the traffic Red Hat build of Keycloak receives, because response times will grow and error rates might increase as the load on the system increases and scaling thresholds are reached.
Characteristic | Service Level Indicator | Service Level Objective* | Metric Source |
---|---|---|---|
Availability | Percentage of the time Red Hat build of Keycloak is able to answer requests as measured by the monitoring system | Red Hat build of Keycloak should be available 99.9% of the time within a month (44 minutes unavailability per month). | Use the Prometheus |
Latency | Response time for authentication related HTTP requests as measured by the server | 95% of all authentication related requests should be faster than 250 ms within 30 days. | Red Hat build of Keycloak server-side metrics to track latency for specific endpoints along with Response Time Distribution using |
Errors | Failed authentication requests due to server problems as measured by the server | The rate of errors due to server problems for authentication requests should be less than 0.1% within 30 days. | Identify server side error by filtering the metric |
* These SLO target values are an example and should be tailored to fit your use case and deployment.
4.4. PromQL queries
These are example queries created in a Kubernetes environment and are used with Prometheus as a monitoring tool. They are provided as blueprints, and you will need to adapt them for a different runtime or monitoring environment.
For a production environment, you might want to replace those queries or subqueries with a recording rule to make sure they do not use too many resources if you want to use them for alerting or live dashboards.
4.4.1. Availability
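The original query is not reproduced here; a minimal sketch based on the Prometheus up metric, assuming the instances are scraped under the job label keycloak with a 15-second scrape interval (adjust the label selectors to your environment):
# instantaneous availability reported by the monitoring system
up{job="keycloak"}
# 30-day availability SLI computed from the same metric
avg_over_time(up{job="keycloak"}[30d:15s])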
This metric will have a value of at least one if the Red Hat build of Keycloak instance is available and responding to Prometheus scrape requests, and 0 if the service is down or unreachable.
Then use a tool like Grafana to show a 30-day time range and let it calculate the average of the metric in that time window.
In Grafana, you can replace the value 30d:15s with $range:$interval to compute the availability SLI in the time range selected for the dashboard.
4.4.2. Latency of authentication requests
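The original query is not reproduced here; the following is a sketch of what it could look like, assuming the Quarkus HTTP metric http_server_requests_seconds and illustrative uri patterns for the authentication endpoints (both are assumptions; adjust the patterns and labels to your deployment):
sum(rate(http_server_requests_seconds_bucket{le="0.25", uri=~"/realms/.*/protocol/.*|/realms/.*/login-actions/.*", namespace="$namespace", pod="$pod"}[30d]))
/
sum(rate(http_server_requests_seconds_count{uri=~"/realms/.*/protocol/.*|/realms/.*/login-actions/.*", namespace="$namespace", pod="$pod"}[30d]))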
This Prometheus query calculates the percentage of authentication requests that completed within 0.25 seconds relative to all authentication requests for specific Red Hat build of Keycloak endpoints, targeting a particular namespace and pod, over the past 30 days.
This example requires the Red Hat build of Keycloak configuration http-metrics-slos to contain the value 250, indicating that buckets for requests faster and slower than 250 ms should be recorded. Setting http-metrics-histograms-enabled to true would capture additional buckets, which can help with performance troubleshooting.
In Grafana, you can replace the value 30d with $__range to compute the latency SLI in the time range selected for the dashboard.
4.4.3. Errors for authentication requests
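Again, the original query is not reproduced here; a sketch under the same assumptions as the latency query above, counting responses with a 5xx status code as server-side errors:
sum(rate(http_server_requests_seconds_count{status=~"5..", uri=~"/realms/.*/protocol/.*|/realms/.*/login-actions/.*", namespace="$namespace"}[30d]))
/
sum(rate(http_server_requests_seconds_count{uri=~"/realms/.*/protocol/.*|/realms/.*/login-actions/.*", namespace="$namespace"}[30d]))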
This Prometheus query calculates the percentage of authentication requests that returned a server side error for all authentication requests, targeting a particular namespace, over the past 30 days.
In Grafana, you can replace the value 30d with $__range to compute the errors SLI in the time range selected for the dashboard.
4.5. Further Reading
Chapter 5. Troubleshooting using metrics
Use metrics for troubleshooting errors and performance issues.
For a running Red Hat build of Keycloak deployment it is important to understand how the system performs and whether it meets your service level objectives (SLOs). For more details on SLOs, proceed to the Monitoring performance with Service Level Indicators chapter.
This guide will provide directions to answer the question: “What can I do when my SLOs are not met?”
Red Hat build of Keycloak consists of several components where an issue or misconfiguration of one of them can move your service level indicators to undesirable numbers.
The guidance provided by this chapter is illustrated in the following example:
Observation: Latency service level objective is not met.
Metrics that indicate a problem:
- Red Hat build of Keycloak’s database connection pool is often exhausted, and there are threads queuing for a connection to be retrieved from the pool.
- Red Hat build of Keycloak’s users cache hit ratio is at a low percentage, around 5%. This means only 1 out of 20 user searches is able to obtain user data from the cache; the rest needs to load it from the database.
Possible mitigations suggested:
- Increasing the users cache size to a higher number, which would decrease the number of reads from the database.
- Increasing the number of connections in the connection pool. This would need to be checked against metrics for your database, tuning it for a higher load, for example, by increasing the number of available processors.
- This guide focuses on Red Hat build of Keycloak metrics. Troubleshooting the database itself is out of scope.
- This guide provides general guidance. You should always confirm the configuration change by conducting a performance test comparing the metrics in question for the old and the new configuration.
Grafana dashboards for the metrics below can be found in Visualizing activities in dashboards chapter.
5.1. List of Red Hat build of Keycloak key metrics
- Self-provided metrics
- JVM metrics
- Database Metrics
- HTTP metrics
- Single site metrics (without external Data Grid)
- Multiple sites metrics (as described in Multi-site deployments)
5.2. Self-provided metrics
Learn about the key metrics that Red Hat build of Keycloak provides.
This is part of the Troubleshooting using metrics chapter.
5.2.1. Prerequisites
- Metrics need to be enabled for Red Hat build of Keycloak. Follow the Gaining insights with metrics chapter for more details.
- A monitoring system collecting the metrics.
5.2.2. Metrics
5.2.2.1. User Event Metrics
User event metrics are disabled by default. See Monitoring user activities with event metrics on how to enable them and how to configure which tags are recorded.
Metric | Description |
---|---|
| Counting the occurrence of user events. |
Tags
The tags client_id and idp are disabled by default to avoid excessive cardinality.
realm
- Realm
client_id
- Client ID
idp
- Identity Provider
event
- User event, for example login or logout. See the Server Administration Guide on event types for an overview of the available events.
error
- Error specific to the event, for example invalid_user_credentials for the event login. Empty string if no error occurred.
The snippet below is an example of a response provided by the metric endpoint:
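The original snippet is not included here; a hypothetical example, assuming the counter is named keycloak_user_events_total and only the default tags are enabled (name and values are illustrative):
# HELP keycloak_user_events_total Counting the occurrence of user events
# TYPE keycloak_user_events_total counter
keycloak_user_events_total{error="",event="login",realm="realm-0",} 128.0
keycloak_user_events_total{error="invalid_user_credentials",event="login",realm="realm-0",} 5.0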
5.2.2.2. Password hashing
Metric | Description |
---|---|
| Counting password hashes validations. |
Tags
realm
- Realm
algorithm
- Algorithm used for hashing the password, for example argon2
hashing_strength
- String denoting the strength of the hashing algorithm, for example, the number of iterations depending on the algorithm. For example, Argon2id-1.3[m=7168,t=5,p=1]
outcome
- Outcome of the password validation. Possible values:
valid
- Password correct
invalid
- Password incorrect
error
- Error when creating the hash of the password
To configure which tags are available, provide a comma-separated list of tag names to the option spi-credential-keycloak-password-validations-counter-tags. By default, all tags are enabled.
The snippet below is an example of a response provided by the metric endpoint:
# HELP keycloak_credentials_password_hashing_validations_total Password validations
# TYPE keycloak_credentials_password_hashing_validations_total counter
keycloak_credentials_password_hashing_validations_total{algorithm="argon2",hashing_strength="Argon2id-1.3[m=7168,t=5,p=1]",outcome="valid",realm="realm-0",} 39949.0
5.2.3. Next steps
Return to the Troubleshooting using metrics chapter or proceed to JVM metrics.
5.3. JVM metrics
Use JVM metrics to observe performance of Red Hat build of Keycloak.
This is part of the Troubleshooting using metrics chapter.
5.3.1. Prerequisites
- Metrics need to be enabled for Red Hat build of Keycloak. Follow the Gaining insights with metrics chapter for more details.
- A monitoring system collecting the metrics.
5.3.2. Metrics
5.3.2.1. JVM info
Metric | Description |
---|---|
| Information about the JVM such as version, runtime and vendor. |
5.3.2.2. Heap memory usage
Metric | Description |
---|---|
| The amount of memory that the JVM has committed for use, reflecting the portion of the allocated memory that is guaranteed to be available for the JVM to use. |
| The amount of memory currently used by the JVM, indicating the actual memory consumption by the application and JVM internals. |
5.3.2.3. Garbage collection
Metric | Description |
---|---|
| The maximum duration, in seconds, of garbage collection pauses experienced by the JVM due to a particular cause, which helps you quickly differentiate between types of GC (minor, major) pauses. |
| The total cumulative time spent in garbage collection pauses, indicating the impact of GC pauses on application performance in the JVM. |
| Counts the total number of garbage collection pause events, helping to assess the frequency of GC pauses in the JVM. |
| The percentage of CPU time spent on garbage collection, indicating the impact of GC on application performance in the JVM. It refers to the proportion of the total CPU processing time that is dedicated to executing garbage collection (GC) operations, as opposed to running application code or performing other tasks. This metric helps determine how much overhead GC introduces, affecting the overall performance of the Red Hat build of Keycloak’s JVM. |
5.3.2.4. CPU Usage in Kubernetes
Metric | Description |
---|---|
| Cumulative CPU time consumed by the container in core-seconds. |
5.3.3. Next steps
Return to the Troubleshooting using metrics chapter or proceed to Database Metrics.
5.4. Database Metrics
Use metrics to describe Red Hat build of Keycloak’s connection to the database.
This is part of the Troubleshooting using metrics chapter.
5.4.1. Prerequisites
- Metrics need to be enabled for Red Hat build of Keycloak. Follow the Gaining insights with metrics chapter for more details.
- A monitoring system collecting the metrics.
5.4.2. Database connection pool metrics
Configure Red Hat build of Keycloak to use a fixed size database connection pool. See the Concepts for database connection pools chapter for more information.
If there is a high count of threads waiting for a database connection, increasing the database connection pool size is not always the best option. It might overload the database which would then become the bottleneck. Consider the following options instead:
- Reduce the number of HTTP worker threads using the option http-pool-max-threads to make it match the available database connections, and thereby reduce contention and resource usage in Red Hat build of Keycloak and increase throughput (see the example after this list).
- Check which database statements are executed on the database. If you see, for example, a lot of information about clients and groups being fetched, and the users and realms caches are full, this might indicate that it is time to increase the sizes of those caches and see if this reduces your database load.
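A sketch of how a fixed-size pool and a matching worker thread count could be configured; the db-pool-min-size and db-pool-max-size options are assumptions based on the standard database options, and the values are purely illustrative:
bin/kc.[sh|bat] start --db-pool-min-size=30 --db-pool-max-size=30 --http-pool-max-threads=30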
Metric | Description |
---|---|
| Idle database connections. |
| Database connections used in ongoing transactions. |
| Threads waiting for a database connection to become available. |
5.4.3. Next steps
Return to the Troubleshooting using metrics chapter or proceed to HTTP metrics.
5.5. HTTP metrics
Use metrics to monitor Red Hat build of Keycloak HTTP request processing.
This is part of the Troubleshooting using metrics chapter.
5.5.1. Prerequisites
- Metrics need to be enabled for Red Hat build of Keycloak. Follow the Gaining insights with metrics chapter for more details.
- A monitoring system collecting the metrics.
5.5.2. Metrics
5.5.2.1. Processing time
The processing time is exposed by these metrics to monitor Red Hat build of Keycloak performance and how long it takes to process requests.
On a healthy cluster, the average processing time will remain stable. Spikes or increases in the processing time may be an early sign that some node is under load.
Tags
method
- HTTP method.
outcome
- A more general outcome tag.
status
- The HTTP status code.
uri
- The requested URI.
Metric | Description |
---|---|
| The total number of requests processed. |
| The total duration for all the requests processed. |
You can enable histograms for these metrics by setting http-metrics-histograms-enabled to true, and add additional buckets for service level objectives using the option http-metrics-slos.
When histograms are enabled, the percentile buckets are available. Those are useful to create heat maps and analyze latencies, but collecting and exposing the percentile buckets will increase the load on your monitoring system.
5.5.2.2. Active requests
The current number of active requests is also available.
Metric | Description |
---|---|
| The current number of active requests |
5.5.2.3. Bandwidth
The metrics below help to monitor the bandwidth and traffic consumed by Red Hat build of Keycloak through the requests and responses received or sent.
Metric | Description |
---|---|
| The total number of responses sent. |
| The total number of bytes sent. |
| The total number of requests received. |
| The total number of bytes received. |
When histograms are enabled, the percentile buckets are available. Those are useful to create heat maps and analyze latencies, but collecting and exposing the percentile buckets will increase the load on your monitoring system.
5.5.3. Next steps
Return to the Troubleshooting using metrics chapter or,
- for single site deployments, proceed to Clustering metrics,
- for multiple sites deployments, proceed to Embedded Infinispan metrics for multi-site deployments.
5.5.4. Relevant options
5.6. Clustering metrics
Use metrics to monitor communication between Red Hat build of Keycloak nodes.
This is part of the Troubleshooting using metrics chapter.
5.6.1. Prerequisites
- Metrics need to be enabled for Red Hat build of Keycloak. Follow the Gaining insights with metrics chapter for more details.
- A monitoring system collecting the metrics.
5.6.2. Metrics
Deploying multiple Red Hat build of Keycloak nodes allows the load to be distributed amongst them, but this requires communication between the nodes. This section describes metrics that are useful for monitoring the communication between Red Hat build of Keycloak in order to identify possible faults.
This is relevant only for single site deployments. When multiple sites are used, as described in Multi-site deployments, Red Hat build of Keycloak nodes are not clustered together and therefore there is no communication between them directly.
Global tags
cluster=<name>
- The cluster name. If metrics from multiple clusters are being collected, this tag helps identify which cluster they belong to.
node=<node>
- The name of the node reporting the metric.
All metric names prefixed with vendor_jgroups_ are provided for troubleshooting and debugging purposes only. The metric names can change in upcoming releases of Red Hat build of Keycloak without further notice. Therefore, we advise not using them in dashboards or in monitoring and alerting.
5.6.2.1. Response Time
The following metrics expose the response time for the remote requests. The response time is measured between two nodes and includes the processing time. All requests are measured by these metrics, and the response time should remain stable through the cluster lifecycle.
In a healthy cluster, the response time will remain stable. An increase in response time may indicate a degraded cluster or a node under heavy load.
Tags
node=<node>
- It identifies the sender node.
target_node=<node>
- It identifies the receiver node.
Metric | Description |
---|---|
| The number of synchronous requests to a receiver node. |
| The total duration of synchronous request to a receiver node |
When histograms are enabled, the percentile buckets are available. Those are useful to create heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment performance.
5.6.2.2. Bandwidth
All the bytes received and sent by Red Hat build of Keycloak are collected by these metrics. Internal messages, such as heartbeats, are counted too. They allow computing the bandwidth currently used by each node.
The metric name depends on the JGroups transport protocol in use.
Metric | Protocol | Description |
---|---|---|
|
| The total number of bytes received by a node. |
|
| |
|
| |
|
| The total number of bytes sent by a node. |
|
| |
|
|
5.6.2.3. Thread Pool
Monitoring the thread pool size is a good indicator that a node is under heavy load. All requests received are added to the thread pool for processing and, when it is full, the request is discarded. A retransmission mechanism ensures reliable communication at the cost of increased resource usage.
In a healthy cluster, the thread pool should never be close to its maximum size (by default, 200 threads).
Thread pool metrics are not available with virtual threads. Virtual threads are enabled by default when running with OpenJDK 21.
The metric name depends on the JGroups transport protocol in use. The default transport protocol is TCP.
Metric | Protocol | Description |
---|---|---|
|
| Current number of threads in the thread pool. |
|
| |
|
| |
|
| The largest number of threads that have ever simultaneously been in the pool. |
|
| |
|
|
5.6.2.4. Flow Control
Flow control takes care of adjusting the rate of a message sender to the rate of the slowest receiver over time. This is implemented through a credit-based system, where each sender decrements its credits when sending. The sender blocks when the credits fall below 0, and only resumes sending messages when it receives a replenishment message from the receivers.
The metrics below show the number of blocked messages and the average blocking time. When a value is different from zero, it may signal that a receiver is overloaded and may degrade the cluster performance.
Each node has two independent flow control protocols: UFC for unicast messages and MFC for multicast messages.
A healthy cluster shows a value of zero for all metrics.
Metric | Description |
---|---|
| The number of times flow control blocks the sender for unicast messages. |
| Average time blocked (in ms) in flow control when trying to send a unicast message. |
| The number of times flow control blocks the sender for multicast messages. |
| Average time blocked (in ms) in flow control when trying to send a multicast message. |
5.6.2.5. Retransmissions
JGroups provides reliable delivery of messages. When a message is dropped on the network, or the receiver cannot handle the message, a retransmission is required. Retransmissions increase resource usage, and they are usually a signal of an overloaded system.
Random Early Drop (RED) monitors the sender queues. When the queues are almost full, the message is dropped, and a retransmission must happen. It prevents threads from being blocked by a full sender queue.
A healthy cluster shows a value of zero for all metrics.
Metric | Description |
---|---|
| The number of retransmitted messages. |
| The total number of dropped messages by the sender. |
| Percentage of all messages that were dropped by the sender. |
5.6.2.6. Network Partitions
5.6.2.6.1. Cluster Size
The cluster size metric reports the number of nodes present in the cluster. If it differs between nodes, it may signal that a node is joining or shutting down or, in the worst case, that a network partition is happening.
A healthy cluster shows the same value in all nodes.
Metric | Description |
---|---|
| The number of nodes in the cluster. |
5.6.2.6.2. Network Partition Events
Network partitions in a cluster can happen due to various reasons. This metric does not help predict network splits, but it signals that one happened and that the cluster has since been merged.
A healthy cluster shows a value of zero for this metric.
Metric | Description |
---|---|
| The number of times a network split was detected and healed. |
5.6.3. Next steps
Return to the Troubleshooting using metrics chapter or proceed to Embedded Infinispan metrics for single site deployments.
5.7. Embedded Infinispan metrics for single site deployments
Use metrics to monitor caching health and cluster replication.
This is part of the Troubleshooting using metrics chapter.
5.7.1. Prerequisites
- Metrics need to be enabled for Red Hat build of Keycloak. Follow the Gaining insights with metrics chapter for more details.
- A monitoring system collecting the metrics.
5.7.2. Metrics
Global tags
cache=<name>
- The cache name.
5.7.2.1. Size
Monitor the number of entries in your cache using these two metrics. If the cache is clustered, each entry has an owner node and zero or more backup copies on different nodes.
Sum the unique entry size metric to get the total number of entries in the cluster.
Metric | Description |
---|---|
| The approximate number of entries stored by the node, including backup copies. |
| The approximate number of entries stored by the node, excluding backup copies. |
5.7.2.2. Data Access
The following metrics monitor the cache accesses, such as the reads, writes and their duration.
5.7.2.2.1. Stores
A store operation is a write operation that writes or updates a value stored in the cache.
Metric | Description |
---|---|
| The total number of store requests. |
| The total duration of all store requests. |
When histograms are enabled, the percentile buckets are available. Those are useful to create heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment performance.
5.7.2.2.2. Reads
A read operation reads a value from the cache. It divides into two groups, a hit if a value is found, and a miss if not found.
Metric | Description |
---|---|
| The total number of read hits requests. |
| The total duration of all read hits requests. |
| The total number of read misses requests. |
| The total duration of all read misses requests. |
When histograms are enabled, the percentile buckets are available. Those are useful to create heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment performance.
5.7.2.2.3. Removes
A remove operation removes a value from the cache. It divides in two groups, a hit if a value exists, and a miss if the value does not exist.
Metric | Description |
---|---|
| The total number of remove hits requests. |
| The total duration of all remove hits requests. |
| The total number of remove misses requests. |
| The total duration of all remove misses requests. |
When histograms are enabled, the percentile buckets are available. Those are useful to create heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment performance.
For the users and realms caches, a database invalidation translates into a remove operation. These metrics are a good indicator of how frequently the database entities are modified and therefore removed from the cache.
Hit Ratio for read and remove operations
An expression can be used to compute the hit ratio for a cache in systems such as Prometheus. As an example, the hit ratio for read operations can be expressed as:
vendor_statistics_hit_times_seconds_count / (vendor_statistics_hit_times_seconds_count + vendor_statistics_miss_times_seconds_count)
Read/Write ratio
An expression can be used to compute the read-write ratio for a cache, using the metrics above:
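The original expression is not reproduced here; a sketch, assuming the store metric follows the same naming pattern as the hit and miss metrics and is called vendor_statistics_store_times_seconds_count (this name is an assumption):
(vendor_statistics_hit_times_seconds_count + vendor_statistics_miss_times_seconds_count) / vendor_statistics_store_times_seconds_count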
5.7.2.2.4. Eviction
Eviction is the process of limiting the cache size: when the cache is full, an entry is removed to make room for a new entry to be cached. As Red Hat build of Keycloak caches the database entities in the users, realms and authorization caches, an evicted entry has to be loaded again from the database the next time it is needed.
Metric | Description |
---|---|
| The total number of eviction events. |
Eviction rate
A rapid increase of evictions together with very high database CPU usage means the users or realms cache is too small for smooth Red Hat build of Keycloak operation, as data needs to be re-loaded very often from the database, which slows down responses. If enough memory is available, consider increasing the maximum cache size using the CLI options cache-embedded-users-max-count or cache-embedded-realms-max-count.
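A sketch of how these options could be applied; the values are purely illustrative and should be sized to your deployment and available memory:
bin/kc.[sh|bat] start --cache-embedded-users-max-count=100000 --cache-embedded-realms-max-count=1000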
5.7.2.3. Locking
Write and remove operations hold the lock until the value is replicated in the local cluster and to the remote site.
On a healthy cluster, the number of locks held should remain constant, but deadlocks may create temporary spikes.
Metric | Description |
---|---|
| The number of locks currently being held by this node. |
5.7.2.4. Transactions
Transactional caches use both One-Phase-Commit and Two-Phase-Commit protocols to complete a transaction. These metrics keep track of the operation duration.
The PESSIMISTIC locking mode uses One-Phase-Commit and does not create commit requests.
In a healthy cluster, the number of rollbacks should remain zero. Deadlocks should be rare, but they increase the number of rollbacks.
Metric | Description |
---|---|
| The total number of prepare requests. |
| The total duration of all prepare requests. |
| The total number of rollback requests. |
| The total duration of all rollback requests. |
| The total number of commit requests. |
| The total duration of all commit requests. |
When histograms are enabled, the percentile buckets are available. Those are useful to create heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment performance.
5.7.2.5. State Transfer
State transfer happens when a node joins or leaves the cluster. It is required to balance the data stored and guarantee the desired number of copies.
This operation increases resource usage and will negatively affect the overall performance.
Metric | Description |
---|---|
| The number of in-flight transactional segments the local node requested from other nodes. |
| The number of in-flight segments the local node requested from other nodes. |
5.7.2.6. Cluster Data Replication
The cluster data replication can be the main source of failure. These metrics not only report the response time, i.e., the time it takes to replicate an update, but also the failures.
On a healthy cluster, the average replication time will be stable or with little variance. The number of failures should not increase.
Metric | Description |
---|---|
| The total number of successful replications. |
| The total number of failed replications. |
| The average time spent, in milliseconds, replicating data in the cluster. |
Success ratio
An expression can be used to compute the replication success ratio:
(vendor_rpc_manager_replication_count) / (vendor_rpc_manager_replication_count + vendor_rpc_manager_replication_failures)
5.7.3. Next steps
Return to the Troubleshooting using metrics chapter.
5.8. Embedded Infinispan metrics for multi-site deployments
Use metrics to monitor caching health.
This is part of the Troubleshooting using metrics chapter.
5.8.1. Prerequisites
- Metrics need to be enabled for Red Hat build of Keycloak. Follow the Gaining insights with metrics chapter for more details.
- A monitoring system collecting the metrics.
5.8.2. Metrics
Global tags
cache=<name>
- The cache name.
5.8.2.1. Size
Monitor the number of entries in your cache using these two metrics. If the cache is clustered, each entry has an owner node and zero or more backup copies on different nodes.
Sum the unique entry size metric to get the total number of entries in the cluster.
Metric | Description |
---|---|
| The approximate number of entries stored by the node, including backup copies. |
| The approximate number of entries stored by the node, excluding backup copies. |
5.8.2.2. Data Access
The following metrics monitor the cache accesses, such as the reads, writes and their duration.
5.8.2.2.1. Stores
A store operation is a write operation that writes or updates a value stored in the cache.
Metric | Description |
---|---|
| The total number of store requests. |
| The total duration of all store requests. |
When histograms are enabled, the percentile buckets are available. Those are useful to create heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment performance.
5.8.2.2.2. Reads
A read operation reads a value from the cache. It divides into two groups, a hit if a value is found, and a miss if not found.
Metric | Description |
---|---|
| The total number of read hits requests. |
| The total duration of all read hits requests. |
| The total number of read misses requests. |
| The total duration of all read misses requests. |
When histograms are enabled, the percentile buckets are available. Those are useful to create heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment performance.
5.8.2.2.3. Removes
A remove operation removes a value from the cache. It divides in two groups, a hit if a value exists, and a miss if the value does not exist.
Metric | Description |
---|---|
| The total number of remove hits requests. |
| The total duration of all remove hits requests. |
| The total number of remove misses requests. |
| The total duration of all remove misses requests. |
When histograms are enabled, the percentile buckets are available. Those are useful to create heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment performance.
For the users and realms caches, a database invalidation translates into a remove operation. These metrics are a good indicator of how frequently the database entities are modified and therefore removed from the cache.
Hit Ratio for read and remove operations
An expression can be used to compute the hit ratio for a cache in systems such as Prometheus. As an example, the hit ratio for read operations can be expressed as:
vendor_statistics_hit_times_seconds_count / (vendor_statistics_hit_times_seconds_count + vendor_statistics_miss_times_seconds_count)
Read/Write ratio
An expression can be used to compute the read-write ratio for a cache, using the metrics above:
5.8.2.2.4. Eviction
Eviction is the process of limiting the cache size: when the cache is full, an entry is removed to make room for a new entry to be cached. As Red Hat build of Keycloak caches the database entities in the users, realms and authorization caches, an evicted entry has to be loaded again from the database the next time it is needed.
Metric | Description |
---|---|
| The total number of eviction events. |
Eviction rate
A rapid increase of evictions together with very high database CPU usage means the users or realms cache is too small for smooth Red Hat build of Keycloak operation, as data needs to be re-loaded very often from the database, which slows down responses. If enough memory is available, consider increasing the maximum cache size using the CLI options cache-embedded-users-max-count or cache-embedded-realms-max-count.
5.8.2.3. Transactions
Transactional caches use both One-Phase-Commit and Two-Phase-Commit protocols to complete a transaction. These metrics keep track of the operation duration.
The PESSIMISTIC locking mode uses One-Phase-Commit and does not create commit requests.
In a healthy cluster, the number of rollbacks should remain zero. Deadlocks should be rare, but they increase the number of rollbacks.
Metric | Description |
---|---|
| The total number of prepare requests. |
| The total duration of all prepare requests. |
| The total number of rollback requests. |
| The total duration of all rollback requests. |
| The total number of commit requests. |
| The total duration of all commit requests. |
When histograms are enabled, the percentile buckets are available. Those are useful to create heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment performance.
5.8.3. Next steps
Return to the Troubleshooting using metrics chapter or proceed to External Data Grid metrics.
5.9. External Data Grid metrics
Use metrics to monitor external Data Grid performance.
This is part of the Troubleshooting using metrics chapter.
5.9.1. Prerequisites
5.9.1.1. Enabled Data Grid server metrics
Data Grid exposes metrics at the endpoint /metrics. By default, they are enabled. We recommend enabling the attribute name-as-tags, as it makes the metric names independent of the cache name.
To configure metrics in the Data Grid server, enable them as shown in the XML below.
infinispan.xml
<infinispan>
<cache-container statistics="true">
<metrics gauges="true" histograms="false" name-as-tags="true" />
</cache-container>
</infinispan>
When using the Data Grid Operator in Kubernetes, metrics can be enabled by using a ConfigMap with a custom configuration. An example is shown below.
ConfigMap
infinispan.yaml CR
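The bodies of these examples are not reproduced here; the following is a sketch of what they could look like, reusing the XML configuration above. The resource names and the ConfigMap data key are assumptions; check the Infinispan operator documentation for the exact conventions.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-config
data:
  infinispan-config.xml: |
    <infinispan>
      <cache-container statistics="true">
        <metrics gauges="true" histograms="false" name-as-tags="true" />
      </cache-container>
    </infinispan>
---
apiVersion: infinispan.org/v1
kind: Infinispan
metadata:
  name: infinispan
spec:
  replicas: 2
  configMapName: cluster-config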
Additional information can be found in the Infinispan documentation and Infinispan operator documentation.
5.9.2. Clustering and Network
This section describes metrics that are useful for monitoring the communication between Data Grid nodes to identify possible network issues.
Global tags
cluster=<name>
- The cluster name. If metrics from multiple clusters are being collected, this tag helps identify which cluster they belong to.
node=<node>
- The name of the node reporting the metric.
All metric names prefixed with vendor_jgroups_ are provided for troubleshooting and debugging purposes only. The metric names can change in upcoming releases of Red Hat build of Keycloak without further notice. Therefore, we advise not using them in dashboards or in monitoring and alerting.
5.9.2.1. Response Time
The following metrics expose the response time for the remote requests. The response time is measured between two nodes and includes the processing time. All requests are measured by these metrics, and the response time should remain stable through the cluster lifecycle.
In a healthy cluster, the response time will remain stable. An increase in response time may indicate a degraded cluster or a node under heavy load.
Tags
node=<node>
- It identifies the sender node.
target_node=<node>
- It identifies the receiver node.
Metric | Description |
---|---|
| The number of synchronous requests to a receiver node. |
| The total duration of synchronous request to a receiver node |
When histograms are enabled, the percentile buckets are available. Those are useful to create heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment performance.
5.9.2.2. Bandwidth
All the bytes received and sent by Data Grid are collected by these metrics. Internal messages, such as heartbeats, are counted too. They allow computing the bandwidth currently used by each node.
The metric name depends on the JGroups transport protocol in use.
Metric | Protocol | Description |
---|---|---|
|
| The total number of bytes received by a node. |
|
| |
|
| |
|
| The total number of bytes sent by a node. |
|
| |
|
|
5.9.2.3. Thread Pool
Monitoring the thread pool size is a good indicator that a node is under heavy load. All requests received are added to the thread pool for processing and, when it is full, the request is discarded. A retransmission mechanism ensures reliable communication at the cost of increased resource usage.
In a healthy cluster, the thread pool should never be close to its maximum size (by default, 200 threads).
Thread pool metrics are not available with virtual threads. Virtual threads are enabled by default when running with OpenJDK 21.
The metric name depends on the JGroups transport protocol in use. The default transport protocol is TCP.
Metric | Protocol | Description |
---|---|---|
|
| Current number of threads in the thread pool. |
|
| |
|
| |
|
| The largest number of threads that have ever simultaneously been in the pool. |
|
| |
|
|
5.9.2.4. Flow Control
Flow control takes care of adjusting the rate of a message sender to the rate of the slowest receiver over time. This is implemented through a credit-based system, where each sender decrements its credits when sending. The sender blocks when the credits fall below 0, and only resumes sending messages when it receives a replenishment message from the receivers.
The metrics below show the number of blocked messages and the average blocking time. When a value is different from zero, it may signal that a receiver is overloaded and may degrade the cluster performance.
Each node has two independent flow control protocols: UFC for unicast messages and MFC for multicast messages.
A healthy cluster shows a value of zero for all metrics.
Metric | Description |
---|---|
| The number of times flow control blocks the sender for unicast messages. |
| Average time blocked (in ms) in flow control when trying to send a unicast message. |
| The number of times flow control blocks the sender for multicast messages. |
| Average time blocked (in ms) in flow control when trying to send a multicast message. |
5.9.2.5. Retransmissions
JGroups provides reliable delivery of messages. When a message is dropped on the network, or the receiver cannot handle the message, a retransmission is required. Retransmissions increase resource usage, and they are usually a signal of an overloaded system.
Random Early Drop (RED) monitors the sender queues. When the queues are almost full, the message is dropped, and a retransmission must happen. It prevents threads from being blocked by a full sender queue.
A healthy cluster shows a value of zero for all metrics.
Metric | Description |
---|---|
| The number of retransmitted messages. |
| The total number of dropped messages by the sender. |
| Percentage of all messages that were dropped by the sender. |
5.9.2.6. Network Partitions
5.9.2.6.1. Cluster Size
The cluster size metric reports the number of nodes present in the cluster. If it differs between nodes, it may signal that a node is joining or shutting down or, in the worst case, that a network partition is happening.
A healthy cluster shows the same value in all nodes.
Metric | Description |
---|---|
| The number of nodes in the cluster. |
5.9.2.6.2. Cross-Site Status
The cross-site status reports the connection status to the other site. It returns a value of 1 if the site is online or 0 if it is offline. The value of 2 is used on nodes where the status is unknown, because not all nodes establish connections to the remote sites and therefore do not have this information.
A healthy cluster shows a value greater than zero.
Metric | Description |
---|---|
| The single site status (1 if online). |
Tags
site=<name>
- The name of the destination site.
5.9.2.6.3. Network Partition Events
Network partitions in a cluster can happen due to various reasons. This metric does not help predict network splits, but it signals that one happened and that the cluster has since been merged.
A healthy cluster shows a value of zero for this metric.
Metric | Description |
---|---|
| The number of times a network split was detected and healed. |
5.9.3. Data Grid Caches
The metrics in this section help monitor the health of the Data Grid caches and the cluster replication.
Global tags
cache=<name>
- The cache name.
5.9.3.1. Size
Monitor the number of entries in your cache using these two metrics. If the cache is clustered, each entry has an owner node and zero or more backup copies on different nodes.
Sum the unique entry size metric to get the total number of entries in the cluster.
Metric | Description |
---|---|
| The approximate number of entries stored by the node, including backup copies. |
| The approximate number of entries stored by the node, excluding backup copies. |
5.9.3.2. Data Access
The following metrics monitor the cache accesses, such as the reads, writes and their duration.
5.9.3.2.1. Stores
A store operation is a write operation that writes or updates a value stored in the cache.
Metric | Description |
---|---|
| The total number of store requests. |
| The total duration of all store requests. |
When histograms are enabled, the percentile buckets are available. Those are useful to create heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment performance.
5.9.3.2.2. Reads
A read operation reads a value from the cache. It divides into two groups, a hit if a value is found, and a miss if not found.
Metric | Description |
---|---|
| The total number of read hits requests. |
| The total duration of all read hits requests. |
| The total number of read misses requests. |
| The total duration of all read misses requests. |
When histograms are enabled, the percentile buckets are available. Those are useful to create heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment performance.
5.9.3.2.3. Removes
A remove operation removes a value from the cache. It divides in two groups, a hit if a value exists, and a miss if the value does not exist.
Metric | Description |
---|---|
| The total number of remove hits requests. |
| The total duration of all remove hits requests. |
| The total number of remove misses requests. |
| The total duration of all remove misses requests. |
When histograms are enabled, the percentile buckets are available. Those are useful to create heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment performance.
5.9.3.3. Locking
Write and remove operations hold the lock until the value is replicated in the local cluster and to the remote site.
On a healthy cluster, the number of locks held should remain constant, but deadlocks may create temporary spikes.
Metric | Description |
---|---|
| The number of locks currently being held by this node. |
5.9.3.4. Transactions
Transactional caches use both One-Phase-Commit and Two-Phase-Commit protocols to complete a transaction. These metrics keep track of the operation duration.
The PESSIMISTIC locking mode uses One-Phase-Commit and does not create commit requests.
In a healthy cluster, the number of rollbacks should remain zero. Deadlocks should be rare, but they increase the number of rollbacks.
Metric | Description |
---|---|
| The total number of prepare requests. |
| The total duration of all prepare requests. |
| The total number of rollback requests. |
| The total duration of all rollback requests. |
| The total number of commit requests. |
| The total duration of all commit requests. |
When histograms are enabled, the percentile buckets are available. Those are useful to create heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment performance.
5.9.3.5. State Transfer
State transfer happens when a node joins or leaves the cluster. It is required to balance the data stored and guarantee the desired number of copies.
This operation increases resource usage and will negatively affect the overall performance.
Metric | Description |
---|---|
| The number of in-flight transactional segments the local node requested from other nodes. |
| The number of in-flight segments the local node requested from other nodes. |
5.9.3.6. Cluster Data Replication
The cluster data replication can be the main source of failure. These metrics report not only the response time, that is, the time it takes to replicate an update, but also the failures.
On a healthy cluster, the average replication time will be stable or with little variance. The number of failures should not increase.
Metric | Description |
---|---|
| The total number of successful replications. |
| The total number of failed replications. |
| The average time spent, in milliseconds, replicating data in the cluster. |
Success ratio
An expression can be used to compute the replication success ratio:
(vendor_rpc_manager_replication_count) / (vendor_rpc_manager_replication_count + vendor_rpc_manager_replication_failures)
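As a sketch of how to keep this ratio continuously available, the expression can be registered as a Prometheus recording rule; the group and record names below are illustrative and not part of the product:
groups:
  - name: keycloak-cache-replication
    rules:
      - record: keycloak:replication_success_ratio   # illustrative record name
        expr: vendor_rpc_manager_replication_count / (vendor_rpc_manager_replication_count + vendor_rpc_manager_replication_failures)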
5.9.3.7. Cross Site Data Replication
Like cluster data replication, the metrics in this section measure the time it takes to replicate the data to the other sites.
On a healthy cluster, the average cross-site replication time will be stable or with little variance.
Tags
site=<name> - indicates the receiving site.
Metric | Description |
---|---|
| The total number of cross-site requests. |
| The total duration of all cross-site requests. |
| The total number of cross-site requests, broken down with a per-site counter. |
| The total duration of all cross-site requests, broken down with a per-site duration. |
| The total number of cross-site requests handled by this node, broken down with a per-site counter. |
|
The site status. A value of 1 indicates that it is online. This value reacts to the Data Grid CLI commands |
When histograms are enabled, percentile buckets are available. These are useful for creating heat maps, but collecting and exposing the percentile buckets may negatively affect deployment performance.
5.9.4. Next steps
Return to the Troubleshooting using metrics chapter.
Chapter 6. Root cause analysis with tracing
Record information during the request lifecycle with OpenTelemetry tracing to identify root causes for latencies and errors in Red Hat build of Keycloak and connected systems.
This chapter explains how you can enable and configure distributed tracing in Red Hat build of Keycloak by utilizing OpenTelemetry (OTel). Tracing allows for detailed monitoring of each request’s lifecycle, which helps quickly identify and diagnose issues, leading to more efficient debugging and maintenance.
It provides valuable insights into performance bottlenecks and can help optimize the system’s overall efficiency, including across system boundaries. Red Hat build of Keycloak uses a supported Quarkus OTel extension that provides smooth integration and exposure of application traces.
6.1. Enable tracing
It is possible to enable exposing traces using the build time option tracing-enabled as follows:
bin/kc.[sh|bat] start --tracing-enabled=true
By default, the trace exporters send out data in batches, using the gRPC protocol and the endpoint http://localhost:4317.
The default service name is keycloak, specified via the tracing-service-name property, which takes precedence over service.name defined in the tracing-resource-attributes property.
For more information about resource attributes that can be provided via the tracing-resource-attributes property, see the Quarkus OpenTelemetry Resource guide.
Tracing can be enabled only when the opentelemetry feature is enabled (which it is by default).
For more tracing settings, see all possible configurations below.
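As an illustrative sketch, a deployment that ships traces to a central collector might combine several of these options. The tracing-service-name and tracing-resource-attributes options are described above; tracing-endpoint is the commonly documented option for pointing the exporter at a collector and should be verified against your version's option list:
bin/kc.[sh|bat] start --tracing-enabled=true \
  --tracing-endpoint=http://otel-collector.example.com:4317 \
  --tracing-service-name=keycloak-prod \
  --tracing-resource-attributes="deployment.environment=production"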
6.2. Development setup
To see the captured Red Hat build of Keycloak traces, a basic setup leveraging the Jaeger tracing platform can be used. For development purposes, Jaeger all-in-one is the easiest way to view traces.
Jaeger-all-in-one includes the Jaeger agent, an OTel collector, and the query service/UI. You do not need to install a separate collector, as you can directly send the trace data to Jaeger.
podman run --name jaeger \
-p 16686:16686 \
-p 4317:4317 \
-p 4318:4318 \
jaegertracing/all-in-one
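With Jaeger listening locally, a development instance can then be started with tracing enabled. Because the default exporter endpoint is http://localhost:4317 over gRPC, no endpoint option is needed in this sketch:
bin/kc.[sh|bat] start-dev --tracing-enabled=true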
6.2.1. Exposed ports
- 16686 - Jaeger UI
- 4317 - OpenTelemetry Protocol gRPC receiver (default)
- 4318 - OpenTelemetry Protocol HTTP receiver
You can visit the Jaeger UI on http://localhost:16686/ to see the tracing information. The Jaeger UI might look like this with an arbitrary Red Hat build of Keycloak trace:
6.3. Information in traces
6.3.1. Spans
Red Hat build of Keycloak creates spans for the following activities:
- Incoming HTTP requests
- Outgoing database requests, including acquiring a database connection
- Outgoing LDAP requests, including connecting to the LDAP server
- Outgoing HTTP requests, including IdP brokerage
6.3.2. Tags
Red Hat build of Keycloak adds tags to traces depending on the type of the request. All tags use the kc. prefix.
Example tags are:
- kc.clientId - Client ID
- kc.realmName - Realm name
- kc.sessionId - User session ID
- kc.token.id - id as mentioned in the token
- kc.token.issuer - issuer as mentioned in the token
- kc.token.sid - sid as mentioned in the token
- kc.authenticationSessionId - Authentication session ID
- kc.authenticationTabId - Authentication tab ID
6.3.3. Logs
If a trace is being sampled, it will contain any user events created during the request. This includes, for example, LOGIN, LOGOUT, or REFRESH_TOKEN events with all details and IDs found in user events.
LDAP communication errors are shown as log entries in recorded traces as well with a stack trace and details of the failed operation.
6.4. Trace IDs in logs
When tracing is enabled, the trace IDs are included in the log messages of all enabled log handlers (see more in Configuring logging). This can be useful for associating log events with request execution, which might provide better traceability and debugging. All log lines originating from the same request will have the same traceId in the log.
The log message also contains a sampled flag, which relates to the sampling described below and indicates whether the span was sampled, that is, sent to the collector.
The format of the log records may start as follows:
2024-08-05 15:27:07,144 traceId=b636ac4c665ceb901f7fdc3fc7e80154, parentId=d59cea113d0c2549, spanId=d59cea113d0c2549, sampled=true WARN [org.keycloak.events] ...
6.4.1. Hide trace IDs in logs
You can hide trace IDs in specific log handlers by specifying their associated Red Hat build of Keycloak option log-<handler-name>-include-trace, where <handler-name> is the name of the log handler. For instance, to disable trace info in the console log, you can turn it off as follows:
bin/kc.[sh|bat] start --tracing-enabled=true --log=console --log-console-include-trace=false
When you explicitly override the log format for the particular log handlers, the *-include-trace options do not have any effect, and no tracing is included.
6.5. Sampling
The sampler decides whether a trace should be discarded or forwarded, effectively reducing overhead by limiting the number of collected traces sent to the collector. It helps manage resource consumption and avoids the high storage costs, and the potential performance penalty, of tracing every single request.
For a production-ready environment, sampling should be properly set to minimize infrastructure costs.
Red Hat build of Keycloak supports several built-in OpenTelemetry samplers, such as:
- always_on
- always_off
- traceidratio (default)
- parentbased_always_on
- parentbased_always_off
- parentbased_traceidratio
The sampler in use can be changed via the tracing-sampler-type property.
6.5.1. Default sampler
The default sampler for Red Hat build of Keycloak is traceidratio, which controls the rate of trace sampling based on a specified ratio configurable via the tracing-sampler-ratio property.
6.5.1.1. Trace ratio
The default trace ratio is 1.0, which means all traces are sampled, that is, sent to the collector. The ratio is a floating-point number in the range [0,1]. For instance, when the ratio is 0.1, only 10% of the traces are sampled.
For a production-ready environment, the trace ratio should be a smaller number to prevent the massive cost of trace store infrastructure and avoid performance overhead.
The ratio can be set to 0.0 to disable sampling entirely at runtime.
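For example, here is a sketch of a production-leaning configuration that keeps roughly 5% of traces; the exact ratio is an assumption to tune against your traffic and storage budget:
bin/kc.[sh|bat] start --tracing-enabled=true --tracing-sampler-ratio=0.05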
6.5.1.2. Rationale
The sampler makes its own sampling decisions based on the current ratio of sampled spans, regardless of the decision made on the parent span, unlike the parentbased_traceidratio sampler.
The parentbased_traceidratio sampler could be the preferred default type as it ensures sampling consistency between parent and child spans. Specifically, if a parent span is sampled, all its child spans will be sampled as well - the same sampling decision for all. It helps to keep all spans together and prevents storing incomplete traces.
However, it might introduce certain security risks leading to DoS attacks. External callers can manipulate trace headers, parent spans can be injected, and the trace store can be overwhelmed. Proper filtering of HTTP headers (especially tracestate) and adequate measures of caller trust would need to be assessed.
For more information, see the W3C Trace context document.
6.6. Tracing in Kubernetes environment
When tracing is enabled while using the Red Hat build of Keycloak Operator, certain information about the deployment is propagated to the underlying containers.
6.6.1. Configuration via Keycloak CR
You can change the tracing configuration via the Keycloak CR. For more information, see the Advanced configuration.
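As a rough sketch, tracing options can be passed through the CR's additionalOptions list; the values below are illustrative and the authoritative schema is in the Advanced configuration reference:
apiVersion: k8s.keycloak.org/v2alpha1
kind: Keycloak
metadata:
  name: example-kc            # illustrative CR name
spec:
  additionalOptions:
    - name: tracing-enabled
      value: "true"
    - name: tracing-endpoint  # assumed option name; verify against your version
      value: http://otel-collector.observability.svc:4317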
6.6.2. Filter traces based on Kubernetes attributes
You can filter out the required traces in your tracing backend based on their tags:
- service.name - Red Hat build of Keycloak deployment name
- k8s.namespace.name - Namespace
- host.name - Pod name
The Red Hat build of Keycloak Operator automatically sets the KC_TRACING_SERVICE_NAME and KC_TRACING_RESOURCE_ATTRIBUTES environment variables for each Red Hat build of Keycloak container included in pods it manages.
The KC_TRACING_RESOURCE_ATTRIBUTES variable always contains (if not overridden) the k8s.namespace.name attribute representing the current namespace.
6.7. Relevant options
Value | |
---|---|
Available only when Console log handler and Tracing is activated |
|
Available only when File log handler and Tracing is activated |
|
Available only when Syslog handler and Tracing is activated |
|
Available only when Tracing is enabled |
|
🛠
Available only when 'opentelemetry' feature is enabled |
|
Available only when Tracing is enabled | (default) |
🛠
Available only when Tracing is enabled |
|
Available only when Tracing is enabled |
|
Available only when Tracing is enabled | |
Available only when Tracing is enabled | (default) |
🛠
Available only when Tracing is enabled |
|
Available only when Tracing is enabled | (default) |
Chapter 7. Visualizing activities in dashboards
Install the Red Hat build of Keycloak Grafana dashboards to visualize the metrics that capture the status and activities of your deployment.
Red Hat build of Keycloak provides metrics to observe what is happening inside the deployment. To understand how metrics evolve over time, it is helpful to collect and visualize them in graphs.
This guide provides instructions on how to visualize collected Red Hat build of Keycloak metrics in a running Grafana instance.
7.1. Prerequisites
- Red Hat build of Keycloak metrics are enabled. Follow the Gaining insights with metrics chapter for more details.
- A Grafana instance is running and Red Hat build of Keycloak metrics are collected into a Prometheus instance.
- For the HTTP request latency heatmaps to work, enable histograms for HTTP metrics by setting http-metrics-histograms-enabled to true, as sketched after this list.
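A sketch of the corresponding startup flags, combining the metrics-enabled option with the histogram option named above (verify the exact spelling against your version's option list):
bin/kc.[sh|bat] start --metrics-enabled=true --http-metrics-histograms-enabled=true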
7.2. Red Hat build of Keycloak Grafana dashboards
Grafana dashboards are distributed in the form of a JSON file that is imported into a Grafana instance. JSON definitions of Red Hat build of Keycloak Grafana dashboards are available in the keycloak/keycloak-grafana-dashboard GitHub repository.
Follow these steps to download the JSON file definitions.
Identify the branch of keycloak-grafana-dashboard to use from the following table.
Red Hat build of Keycloak version | keycloak-grafana-dashboard branch
---|---|
>= 26.1 | main
Clone the GitHub repository
git clone -b BRANCH_FROM_STEP_1 https://github.com/keycloak/keycloak-grafana-dashboard.git
The dashboards are available in the directory keycloak-grafana-dashboard/dashboards.
The following sections describe the purpose of each dashboard.
7.2.1. Red Hat build of Keycloak troubleshooting dashboard
This dashboard is available in the JSON file: keycloak-troubleshooting-dashboard.json.
On the top of the dashboard, graphs display the service level indicators as defined in Monitoring performance with Service Level Indicators. This dashboard can also be used while troubleshooting a Red Hat build of Keycloak deployment following the Troubleshooting using metrics chapter, for example, when SLI graphs do not show expected results.
Figure 7.1. Troubleshooting dashboard
7.2.2. Keycloak capacity planning dashboard
This dashboard is available in the JSON file: keycloak-capacity-planning-dashboard.json.
This dashboard shows metrics that are important when estimating the load handled by a Red Hat build of Keycloak deployment. For example, it shows the number of password validations or login flows performed by Red Hat build of Keycloak. For more detail on these metrics, see the chapter Self-provided metrics.
Red Hat build of Keycloak event metrics must be enabled for this dashboard to work correctly. To enable them, see the chapter Monitoring user activities with event metrics.
Figure 7.2. Capacity planning dashboard
7.3. Import a dashboard
- Open the dashboard page from the left Grafana menu.
- Click New and Import.
- Click Upload dashboard JSON file and select the JSON file of the dashboard you want to import.
- Pick your Prometheus datasource.
- Click Import.
7.4. Export a dashboard
Exporting a dashboard to JSON format may be useful. For example, you may want to suggest a change in our dashboard repository.
- Open a dashboard you would like to export.
- Click share in the top left corner next to the dashboard name.
- Click the Export tab.
- Enable Export for sharing externally.
- Click either Save to file or View JSON and Copy to Clipboard according to where you want to store the resulting JSON.
7.5. Further reading
Continue reading on how to connect traces to dashboards in the Analyzing outliers and errors with exemplars chapter.
Chapter 8. Analyzing outliers and errors with exemplars
Use exemplars to connect a metric to a recorded trace to analyze the root cause of errors or latencies.
Metrics are aggregations over several events, and show you if your system is operating within defined bounds. They are great to monitor error rates or tail latencies and to set up alerting or drive performance optimizations. Still, the aggregation makes it difficult to find root causes for latencies or errors reported in metrics.
Root causes for errors and latencies can be found by enabling tracing. To connect a metric to a recorded trace, there is the concept of exemplars.
Once exemplars are set up, Red Hat build of Keycloak reports metrics with their last recorded trace as an exemplar. A dashboard tool like Grafana can link the exemplar from a metrics dashboard to a trace view.
Metrics that support exemplars are:
- http_server_requests_seconds_count (including histograms). See the chapter HTTP metrics for details on this metric.
- keycloak_credentials_password_hashing_validations_total. See the chapter Self-provided metrics for details on this metric.
- keycloak_user_events_total. See the chapter Self-provided metrics for details on this metric.
See below for a screenshot of a heatmap visualization for latencies that shows an exemplar when hovering over one of the pink indicators.
Figure 8.1. Heatmap diagram with exemplar
8.1. Setting up exemplars
To benefit from exemplars, perform the following steps:
- Enable metrics for Red Hat build of Keycloak as described in chapter Gaining insights with metrics.
- Enable tracing for Red Hat build of Keycloak as described in chapter Root cause analysis with tracing.
- Enable exemplar storage in your monitoring system. For Prometheus, this is a preview feature that you need to enable.
- Scrape the metrics using the OpenMetricsText1.0.0 protocol, which is not enabled by default in Prometheus. If you are using PodMonitors or similar in a Kubernetes environment, this can be achieved by adding it to the spec of the custom resource, as sketched after this list.
- Configure your metrics datasource where to link to for traces. When using Grafana and Prometheus, this would be setting up an exemplarTraceIdDestinations entry for the Prometheus datasource, which then points to your tracing datasource that is provided by tools like Jaeger or Tempo.
- Enable exemplars in your dashboards. Enable the Exemplars toggle in each query on each dashboard where you want to show exemplars. When set up correctly, you will notice little dots or stars in your dashboards that you can click on to view the traces.
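The custom resource change mentioned in the scraping step could look roughly like the following PodMonitor sketch. The scrapeProtocols field belongs to the Prometheus Operator API rather than this guide, and the port name and labels are illustrative:
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: keycloak-metrics        # illustrative name
spec:
  scrapeProtocols:
    - OpenMetricsText1.0.0      # request the OpenMetrics format so exemplars are exposed
  podMetricsEndpoints:
    - port: management          # assumed port name of the Keycloak management interface
  selector:
    matchLabels:
      app: keycloak             # assumed pod label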
- If you do not specify the scrape protocol, Prometheus will by default not send it in the content negotiation, and Keycloak will then fall back to the PrometheusText protocol, which will not contain the exemplars. A plain Prometheus configuration sketch follows these notes.
- If you enabled tracing and metrics, but the request sampling did not record a trace, the exposed metric will not contain any exemplars.
- If you access the metrics endpoint with your browser, the content negotiation will lead to the format PrometheusText being returned, and you will not see any exemplars.
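For a plain (non-Operator) Prometheus, the equivalent knob is the scrape_protocols setting in prometheus.yml; this is a Prometheus configuration field, not a Red Hat build of Keycloak option, and the job name and target are illustrative:
scrape_configs:
  - job_name: keycloak
    scrape_protocols: ["OpenMetricsText1.0.0"]   # negotiate OpenMetrics so exemplars are included
    metrics_path: /metrics
    static_configs:
      - targets: ["localhost:9000"]              # management port exposing /metrics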
8.2. Verifying that exemplars work as expected
Perform the following steps to verify that Red Hat build of Keycloak is set up correctly for exemplars:
- Follow the instructions to set up metrics and tracing for Red Hat build of Keycloak.
- For test purposes, record all traces by setting the tracing ratio to 1.0. See Root cause analysis with tracing for recommended sampling settings in production systems.
- Log in to the Keycloak instance to create some traces.
Scrape the metrics with a command similar to the following and search for those metrics that have an exemplar set:
curl -s http://localhost:9000/metrics \
  -H 'Accept: application/openmetrics-text; version=1.0.0; charset=utf-8' \
  | grep "#.*trace_id"
This should result in an output similar to the following. Note the additional # after which the span and trace IDs are added:
http_server_requests_seconds_count {...} ... # {span_id="...",trace_id="..."} ...