Observability Guide


Red Hat build of Keycloak 26.2

Red Hat Customer Content Services

Abstract

This guide helps administrators to monitor and troubleshoot Red Hat build of Keycloak 26.2 with health checks, metrics, dashboards and tracing.

Check if an instance has finished its start up and is ready to serve requests by calling its health REST endpoints.

Red Hat build of Keycloak has built-in support for health checks. This chapter describes how to enable and use the Red Hat build of Keycloak health checks. The Red Hat build of Keycloak health checks are exposed on the management port 9000 by default. For more details, see Configuring the Management Interface.

Red Hat build of Keycloak exposes 4 health endpoints:

  • /health/live
  • /health/ready
  • /health/started
  • /health

See the Quarkus SmallRye Health docs for information on the meaning of each endpoint.

These endpoints respond with HTTP status 200 OK on success or 503 Service Unavailable on failure, and a JSON object like the following:

Successful response for endpoints without additional per-check information:

{
    "status": "UP",
    "checks": []
}

Successful response for endpoints with information on the database connection:

{
    "status": "UP",
    "checks": [
        {
            "name": "Keycloak database connections health check",
            "status": "UP"
        }
    ]
}

1.2. Enabling the health checks

You can enable the health checks using the build-time option health-enabled:

bin/kc.[sh|bat] build --health-enabled=true

By default, no check is returned from the health endpoints.

1.3. Using the health checks

It is recommended that the health endpoints be monitored by external HTTP requests. Due to security measures that remove curl and other packages from the Red Hat build of Keycloak container image, local command-based monitoring will not function easily.

If you are not using Red Hat build of Keycloak in a container, use any HTTP client available on your system to access the health check endpoints.

1.3.1. curl

You may use a simple HTTP HEAD request to determine the live or ready state of Red Hat build of Keycloak. curl is a good HTTP client for this purpose.

If Red Hat build of Keycloak is deployed in a container, you must run this command from outside it due to the previously mentioned security measures. For example:

curl --head -fsS http://localhost:9000/health/ready

If the command exits with status 0, then Red Hat build of Keycloak is live or ready, depending on which endpoint you called. Otherwise there is a problem.

1.3.2. Kubernetes

Define an HTTP probe so that Kubernetes can externally monitor the health endpoints. Do not use a liveness command.
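
The following is a minimal sketch of such probes in a container spec, assuming the default management port 9000; the timing values are illustrative and should be tuned for your deployment:

readinessProbe:
  httpGet:
    path: /health/ready
    port: 9000
  periodSeconds: 10
  failureThreshold: 3
livenessProbe:
  httpGet:
    path: /health/live
    port: 9000
  periodSeconds: 10
  failureThreshold: 3
startupProbe:
  httpGet:
    path: /health/started
    port: 9000
  periodSeconds: 1
  failureThreshold: 600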

1.3.3. HEALTHCHECK

The Containerfile HEALTHCHECK instruction defines a command that will be periodically executed inside the container as it runs. The Red Hat build of Keycloak container does not have any CLI HTTP clients installed. Consider installing curl as an additional RPM, as detailed by the Running Red Hat build of Keycloak in a container chapter. Note that your container may be less secure because of this.
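
If you accept that trade-off, the following is a minimal sketch of the instruction; it assumes you have already built a custom image that contains curl, and the base image name is a placeholder:

FROM localhost/my-keycloak-with-curl:latest
# Probe the readiness endpoint on the management port; mark the container unhealthy after 3 failed attempts.
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
  CMD curl --head -fsS http://localhost:9000/health/ready || exit 1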

1.4. Available Checks

The table below shows the available checks.

Check | Description | Requires Metrics
Database | Returns the status of the database connection pool. | Yes

For some checks, you also need to enable metrics, as indicated by the Requires Metrics column. To enable metrics, use the metrics-enabled option as follows:

bin/kc.[sh|bat] build --health-enabled=true --metrics-enabled=true

1.5. Relevant options


health-enabled 🛠

If the server should expose health check endpoints.

If enabled, health checks are available at the /health, /health/ready and /health/live endpoints.

CLI: --health-enabled
Env: KC_HEALTH_ENABLED

true, false (default)

Chapter 2. Gaining insights with metrics

Collect metrics to gain insights about state and activities of a running instance of Red Hat build of Keycloak.

Red Hat build of Keycloak has built in support for metrics. This chapter describes how to enable and configure server metrics.

2.1. Enabling Metrics

You can enable metrics using the build-time option metrics-enabled:

bin/kc.[sh|bat] start --metrics-enabled=true

2.2. Querying Metrics

Red Hat build of Keycloak exposes metrics at the following endpoint of the management interface:

  • /metrics

For more information about the management interface, see Configuring the Management Interface. The response from the endpoint uses an application/openmetrics-text content type and is based on the Prometheus (OpenMetrics) text format. The snippet below is an example of a response:

# HELP base_gc_total Displays the total number of collections that have occurred. This attribute lists -1 if the collection count is undefined for this collector.
# TYPE base_gc_total counter
base_gc_total{name="G1 Young Generation",} 14.0
# HELP jvm_memory_usage_after_gc_percent The percentage of long-lived heap pool used after the last GC event, in the range [0..1]
# TYPE jvm_memory_usage_after_gc_percent gauge
jvm_memory_usage_after_gc_percent{area="heap",pool="long-lived",} 0.0
# HELP jvm_threads_peak_threads The peak live thread count since the Java virtual machine started or peak was reset
# TYPE jvm_threads_peak_threads gauge
jvm_threads_peak_threads 113.0
# HELP agroal_active_count Number of active connections. These connections are in use and not available to be acquired.
# TYPE agroal_active_count gauge
agroal_active_count{datasource="default",} 0.0
# HELP base_memory_maxHeap_bytes Displays the maximum amount of memory, in bytes, that can be used for memory management.
# TYPE base_memory_maxHeap_bytes gauge
base_memory_maxHeap_bytes 1.6781410304E10
# HELP process_start_time_seconds Start time of the process since unix epoch.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.675188449054E9
# HELP system_load_average_1m The sum of the number of runnable entities queued to available processors and the number of runnable entities running on the available processors averaged over a period of time
# TYPE system_load_average_1m gauge
system_load_average_1m 4.005859375

...

2.3. Next steps

Read the chapters Monitoring performance with Service Level Indicators and Troubleshooting using metrics to see how to use the metrics.

2.4. Relevant options


cache-metrics-histograms-enabled

Enable histograms for metrics for the embedded caches.

CLI: --cache-metrics-histograms-enabled
Env: KC_CACHE_METRICS_HISTOGRAMS_ENABLED

Available only when metrics are enabled

true, false (default)

http-metrics-histograms-enabled

Enables a histogram with default buckets for the duration of HTTP server requests.

CLI: --http-metrics-histograms-enabled
Env: KC_HTTP_METRICS_HISTOGRAMS_ENABLED

Available only when metrics are enabled

true, false (default)

http-metrics-slos

Service level objectives for HTTP server requests.

Use this instead of the default histogram, or use it in combination to add additional buckets. Specify a list of comma-separated values defined in milliseconds. Example with buckets from 5ms to 10s: 5,10,25,50,250,500,1000,2500,5000,10000

CLI: --http-metrics-slos
Env: KC_HTTP_METRICS_SLOS

Available only when metrics are enabled

 

metrics-enabled 🛠

If the server should expose metrics.

If enabled, metrics are available at the /metrics endpoint.

CLI: --metrics-enabled
Env: KC_METRICS_ENABLED

true, false (default)

Chapter 3. Monitoring user activities with event metrics

Event metrics provide an aggregated view of user activities in a Red Hat build of Keycloak instance.

For now, only metrics for user events are captured. For example, you can monitor the number of logins, login failures, or token refreshes performed.

The metrics are exposed using the standard metrics endpoint, and you can use it in your own metrics collection system to create dashboards and alerts.

The metrics are reported as counters per Red Hat build of Keycloak instance. The counters are reset on the restart of the instance. If you have multiple instances running in a cluster, you will need to collect the metrics from all instances and aggregate them to get a per-cluster view.

3.1. Enable event metrics

To start collecting event metrics, enable metrics and enable the metrics for user events.

The following shows the required startup parameters:

bin/kc.[sh|bat] start --metrics-enabled=true --event-metrics-user-enabled=true ...

By default, there is a separate metric for each realm. To break down the metrics by client and identity provider, you can add those dimensions using the configuration option event-metrics-user-tags. This can be useful on installations with a small number of clients and IDPs. It is not recommended for installations with a large number of clients or IDPs, as it will increase the memory usage of Red Hat build of Keycloak and the load on your monitoring system.

The following shows how to configure Red Hat build of Keycloak to break down the metrics by all three metrics dimensions:

bin/kc.[sh|bat] start ... --event-metrics-user-tags=realm,idp,clientId ...

You can limit the events for which Red Hat build of Keycloak will expose metrics. See the Server Administration Guide on event types for an overview of the available events.

The following example limits the events collected to LOGIN and LOGOUT events:

bin/kc.[sh|bat] start ... --event-metrics-user-events=login,logout ...

See Self-provided metrics for a description of the metrics collected.
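
Once collected by a system such as Prometheus, these counters can drive dashboards and alerts. For example, the following sketch shows the rate of failed logins per realm over the last five minutes; the label names follow the keycloak_user_events_total metric described there:

sum by (realm) (
  rate(keycloak_user_events_total{event="login", error!=""}[5m])
)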

3.2. Relevant options


metrics-enabled 🛠

If the server should expose metrics.

If enabled, metrics are available at the /metrics endpoint.

CLI: --metrics-enabled
Env: KC_METRICS_ENABLED

true, false (default)

event-metrics-user-enabled 🛠

Create metrics based on user events.

CLI: --event-metrics-user-enabled
Env: KC_EVENT_METRICS_USER_ENABLED

Available only when metrics are enabled and feature user-event-metrics is enabled

true, false (default)

event-metrics-user-events

Comma-separated list of events to be collected for user event metrics.

This option can be used to reduce the number of metrics created as by default all user events create a metric.

CLI: --event-metrics-user-events
Env: KC_EVENT_METRICS_USER_EVENTS

Available only when user event metrics are enabled

Use remove_credential instead of remove_totp, and update_credential instead of update_totp and update_password. Deprecated values: remove_totp, update_totp, update_password

authreqid_to_token, client_delete, client_info, client_initiated_account_linking, client_login, client_register, client_update, code_to_token, custom_required_action, delete_account, execute_action_token, execute_actions, federated_identity_link, federated_identity_override_link, grant_consent, identity_provider_first_login, identity_provider_link_account, identity_provider_login, identity_provider_post_login, identity_provider_response, identity_provider_retrieve_token, impersonate, introspect_token, invalid_signature, invite_org, login, logout, oauth2_device_auth, oauth2_device_code_to_token, oauth2_device_verify_user_code, oauth2_extension_grant, permission_token, pushed_authorization_request, refresh_token, register, register_node, remove_credential, remove_federated_identity, remove_totp (deprecated), reset_password, restart_authentication, revoke_grant, send_identity_provider_link, send_reset_password, send_verify_email, token_exchange, unregister_node, update_consent, update_credential, update_email, update_password (deprecated), update_profile, update_totp (deprecated), user_disabled_by_permanent_lockout, user_disabled_by_temporary_lockout, user_info_request, verify_email, verify_profile

event-metrics-user-tags

Comma-separated list of tags to be collected for user event metrics.

By default only realm is enabled to avoid a high metrics cardinality.

CLI: --event-metrics-user-tags
Env: KC_EVENT_METRICS_USER_TAGS

Available only when user event metrics are enabled

realm, idp, clientId

Chapter 4. Monitoring performance with Service Level Indicators

Track performance and reliability as perceived by users with Service Level Indicators (SLIs) and Service Level Objectives (SLOs).

Service Level Indicators (SLIs) and Service Level Objectives (SLOs) are essential components in monitoring and maintaining the performance and reliability of Red Hat build of Keycloak in production environments.

The Google Site Reliability Engineering book defines this as follows:

  • A Service Level Indicator (SLI) is a carefully defined quantitative measure of some aspect of the level of service that is provided.
  • A Service level objective (SLO) is a target value or range of values for a service level that is measured by an SLI.

By agreeing on these with the stakeholders and tracking them, service owners can ensure that deployments are aligned with users' expectations and that they neither over- nor under-deliver on the service they provide.

4.1. Prerequisites

  • Metrics need to be enabled for Red Hat build of Keycloak, and the http-metrics-slos option needs to be set to the latency to be measured for the SLO defined below (250 for the 250 ms target used in this example). Follow the Gaining insights with metrics chapter for more details.
  • A monitoring system collecting the metrics. The following paragraphs assume Prometheus or a similar system is used that supports the PromQL query language.

4.2. Definition of the service delivered

The following service definition is used in the next steps to identify the appropriate SLIs and SLOs. It should capture the behavior observed by its users.

As a Red Hat build of Keycloak user,

  • I want to be able to log in,
  • refresh my token and
  • log out,

so that I can use the applications that use Red Hat build of Keycloak for authentication.

4.3. Definition of SLI and SLO

The following provides example SLIs and SLOs based on the service description above and the metrics available in Red Hat build of Keycloak.

Note

While these SLOs are independent of the actual load of the system, this is expected, as a single user does not care about the system load if they get slow responses.

At the same time, if you enter a Service Level Agreement (SLA) with stakeholders, you as the one running Red Hat build of Keycloak have an interest in defining limits on the traffic Red Hat build of Keycloak receives, as response times will be prolonged and error rates might increase as the load on the system grows and scaling thresholds are reached.

Characteristic | Service Level Indicator | Service Level Objective* | Metric Source
Availability | Percentage of the time Red Hat build of Keycloak is able to answer requests as measured by the monitoring system | Red Hat build of Keycloak should be available 99.9% of the time within a month (44 minutes unavailability per month). | Use the Prometheus up metric which indicates if the Prometheus server is able to scrape metrics from the Red Hat build of Keycloak instances.
Latency | Response time for authentication related HTTP requests as measured by the server | 95% of all authentication related requests should be faster than 250 ms within 30 days. | Red Hat build of Keycloak server-side metrics to track latency for specific endpoints along with Response Time Distribution using http_server_requests_seconds_bucket and http_server_requests_seconds_count.
Errors | Failed authentication requests due to server problems as measured by the server | The rate of errors due to server problems for authentication requests should be less than 0.1% within 30 days. | Identify server side errors by filtering the metric http_server_requests_seconds_count on the tag outcome for value SERVER_ERROR.

* These SLO target values are an example and should be tailored to fit your use case and deployment.

4.4. PromQL queries

These are example queries created in a Kubernetes environment and are used with Prometheus as a monitoring tool. They are provided as blueprints, and you will need to adapt them for a different runtime or monitoring environment.

Note

For a production environment, you might want to replace those queries or subqueries with a recording rule to make sure they do not use too many resources if you want to use them for alerting or live dashboards.

4.4.1. Availability

This metric will have a value of at least one if the Red Hat build of Keycloak instances are available and responding to Prometheus scrape requests, and 0 if the service is down or unreachable.

Then use a tool like Grafana to show a 30-day time range and let it calculate the average of the metric in that time window.

count_over_time(
  sum (up{
    container="keycloak",                 # (1)
    namespace="$namespace"
  } > 0)[30d:15s]
)                                         # (2)
/
count_over_time(vector(1)[30d:15s])       # (3)

(1) Filter by additional tags to identify Red Hat build of Keycloak nodes
(2) Count all data points in the given range and interval when at least one Red Hat build of Keycloak node was available
(3) Divide by the number of all data points in the same range and interval

Note

In Grafana you can replace the value 30d:15s with $__range:$__interval to compute the availability SLI in the time range selected for the dashboard.

4.4.2. Latency of authentication requests

This Prometheus query calculates the percentage of authentication requests that completed within 0.25 seconds relative to all authentication requests for specific Red Hat build of Keycloak endpoints, targeting a particular namespace and pod, over the past 30 days.

This example requires the Red Hat build of Keycloak configuration http-metrics-slos to contain value 250 indicating that buckets for requests faster and slower than 250 ms should be recorded. Setting http-metrics-histograms-enabled to true would capture additional buckets which can help with performance troubleshooting.

sum(
  rate(
    http_server_requests_seconds_bucket{
      uri=~"/realms/{realm}/protocol/{protocol}.*|/realms/{realm}/login-actions.*",    # (1)
      le="0.25",                          # (2)
      container="keycloak",               # (3)
      namespace="$namespace"}
    [30d]                                 # (4)
  )
) without (le,uri,status,outcome,method,pod,instance)    # (5)
/
sum(
  rate(
    http_server_requests_seconds_count{
      uri=~"/realms/{realm}/protocol/{protocol}.*|/realms/{realm}/login-actions.*",    # (6)
      container="keycloak",
      namespace="$namespace"}
    [30d]                                 # (7)
  )
) without (le,uri,status,outcome,method,pod,instance)    # (8)

(1) (6) URLs related to logging in
(2) Response time as defined by the SLO
(3) (7) Filter by additional tags to identify Red Hat build of Keycloak nodes
(4) Time range as specified by the SLO
(5) (8) Ignore as many labels as necessary to create a single sum

Note

In Grafana, you can replace value 30d with $__range to compute latency SLI in the time range selected for the dashboard.

4.4.3. Errors for authentication requests

This Prometheus query calculates the percentage of authentication requests that returned a server side error for all authentication requests, targeting a particular namespace, over the past 30 days.

sum(
  rate(
    http_server_requests_seconds_count{
      uri=~"/realms/{realm}/protocol/{protocol}.*|/realms/{realm}/login-actions.*",    # (1)
      outcome="SERVER_ERROR",             # (2)
      container="keycloak",               # (3)
      namespace="$namespace"}
    [30d]                                 # (4)
  )
) without (le,uri,status,outcome,method,pod,instance)    # (5)
/
sum(
  rate(
    http_server_requests_seconds_count{
      uri=~"/realms/{realm}/protocol/{protocol}.*|/realms/{realm}/login-actions.*",    # (6)
      container="keycloak",               # (7)
      namespace="$namespace"}
    [30d]                                 # (8)
  )
) without (le,uri,status,outcome,method,pod,instance)    # (9)

(1) (6) URLs related to logging in
(2) Filter for all requests that responded with a server error (HTTP status 5xx)
(3) (7) Filter by additional tags to identify Red Hat build of Keycloak nodes
(4) (8) Time range as specified by the SLO
(5) (9) Ignore as many labels as necessary to create a single sum

Note

In Grafana, you can replace value 30d with $__range to compute errors SLI in the time range selected for the dashboard.

4.5. Further Reading

Chapter 5. Troubleshooting using metrics

Use metrics for troubleshooting errors and performance issues.

For a running Red Hat build of Keycloak deployment it is important to understand how the system performs and whether it meets your service level objectives (SLOs). For more details on SLOs, proceed to the Monitoring performance with Service Level Indicators chapter.

This guide will provide directions to answer the question: What can I do when my SLOs are not met?

Red Hat build of Keycloak consists of several components where an issue or misconfiguration of one of them can move your service level indicators to undesirable numbers.

The guidance provided by this chapter is illustrated in the following example:

Observation: Latency service level objective is not met.

Metrics that indicate a problem:

  1. Red Hat build of Keycloak’s database connection pool is often exhausted, and there are threads queuing for a connection to be retrieved from the pool.
  2. Red Hat build of Keycloak’s users cache hit ratio is at a low percentage, around 5%. This means only 1 out of 20 user searches is able to obtain user data from the cache and the rest needs to load it from the database.

Possible mitigations suggested:

  • Increasing the users cache size, which would decrease the number of reads from the database.
  • Increasing the number of connections in the connection pool. This would need to be checked with metrics for your database and tuning it for a higher load, for example, by increasing the number of available processors.
Note
  • This guide focuses on Red Hat build of Keycloak metrics. Troubleshooting the database itself is out of scope.
  • This guide provides general guidance. You should always confirm the configuration change by conducting a performance test comparing the metrics in question for the old and the new configuration.
Note

Grafana dashboards for the metrics below can be found in Visualizing activities in dashboards chapter.

5.1. List of Red Hat build of Keycloak key metrics

5.2. Self-provided metrics

Learn about the key metrics that Red Hat build of Keycloak provides.

This is part of the Troubleshooting using metrics chapter.

5.2.1. Prerequisites

  • Metrics need to be enabled for Red Hat build of Keycloak. Follow the Gaining insights with metrics chapter for more details.
  • A monitoring system collecting the metrics.

5.2.2. Metrics

5.2.2.1. User Event Metrics

User event metrics are disabled by default. See Monitoring user activities with event metrics on how to enable them and how to configure which tags are recorded.


keycloak_user_events_total

Counting the occurrence of user events.

Tags

The tags client_id and idp are disabled by default to avoid excessive cardinality.

realm
Realm
client_id
Client ID
idp
Identity Provider
event
User event, for example login or logout. See the Server Administration Guide on event types for an overview of the available events.
error
Error specific to the event, for example invalid_user_credentials for the event login. Empty string if no error occurred.

The snippet below is an example of a response provided by the metric endpoint:

# HELP keycloak_user_events_total Keycloak user events
# TYPE keycloak_user_events_total counter
keycloak_user_events_total{client_id="security-admin-console",error="",event="code_to_token",idp="",realm="master",} 1.0
keycloak_user_events_total{client_id="security-admin-console",error="",event="login",idp="",realm="master",} 1.0
keycloak_user_events_total{client_id="security-admin-console",error="",event="logout",idp="",realm="master",} 1.0
keycloak_user_events_total{client_id="security-admin-console",error="invalid_user_credentials",event="login",idp="",realm="master",} 1.0
5.2.2.2. Password hashing

keycloak_credentials_password_hashing_validations_total

Counting password hash validations.

Tags

realm
Realm
algorithm
Algorithm used for hashing password, for example argon2
hashing_strength
String denoting the strength of the hashing algorithm, such as the number of iterations, depending on the algorithm. For example, Argon2id-1.3[m=7168,t=5,p=1]
outcome

Outcome of password validation. Possible values:

valid
Password correct
invalid
Password incorrect
error
Error when creating the hash of the password

To configure which tags are available, provide a comma-separated list of tag names to the option spi-credential-keycloak-password-validations-counter-tags. By default, all tags are enabled.

The snippet below is an example of a response provided by the metric endpoint:

# HELP keycloak_credentials_password_hashing_validations_total Password validations
# TYPE keycloak_credentials_password_hashing_validations_total counter
keycloak_credentials_password_hashing_validations_total{algorithm="argon2",hashing_strength="Argon2id-1.3[m=7168,t=5,p=1]",outcome="valid",realm="realm-0",} 39949.0

5.2.3. Next steps

Return to the Troubleshooting using metrics chapter or proceed to JVM metrics.

5.3. JVM metrics

Use JVM metrics to observe performance of Red Hat build of Keycloak.

This is part of the Troubleshooting using metrics chapter.

5.3.1. Prerequisites

  • Metrics need to be enabled for Red Hat build of Keycloak. Follow the Gaining insights with metrics chapter for more details.
  • A monitoring system collecting the metrics.

5.3.2. Metrics

5.3.2.1. JVM info

jvm_info_total

Information about the JVM such as version, runtime and vendor.

5.3.2.2. Heap memory usage

jvm_memory_committed_bytes

The amount of memory that the JVM has committed for use, reflecting the portion of the allocated memory that is guaranteed to be available for the JVM to use.

jvm_memory_used_bytes

The amount of memory currently used by the JVM, indicating the actual memory consumption by the application and JVM internals.

5.3.2.3. Garbage collection

jvm_gc_pause_seconds_max

The maximum duration, in seconds, of garbage collection pauses experienced by the JVM due to a particular cause, which helps you quickly differentiate between types of GC (minor, major) pauses.

jvm_gc_pause_seconds_sum

The total cumulative time spent in garbage collection pauses, indicating the impact of GC pauses on application performance in the JVM.

jvm_gc_pause_seconds_count

Counts the total number of garbage collection pause events, helping to assess the frequency of GC pauses in the JVM.

jvm_gc_overhead

The percentage of CPU time spent on garbage collection, indicating the impact of GC on application performance in the JVM. It refers to the proportion of the total CPU processing time that is dedicated to executing garbage collection (GC) operations, as opposed to running application code or performing other tasks. This metric helps determine how much overhead GC introduces, affecting the overall performance of the Red Hat build of Keycloak’s JVM.
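
As a cross-check, the share of wall-clock time spent in GC pauses over a recent window can be derived from the pause metrics; a PromQL sketch, assuming Prometheus and a Kubernetes pod label:

sum by (pod) (rate(jvm_gc_pause_seconds_sum[5m]))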

5.3.2.4. CPU Usage in Kubernetes

container_cpu_usage_seconds_total

Cumulative CPU time consumed by the container in core-seconds.

5.3.3. Next steps

Return to the Troubleshooting using metrics chapter or proceed to Database Metrics.

5.4. Database Metrics

Use metrics to describe Red Hat build of Keycloak’s connection to the database.

This is part of the Troubleshooting using metrics chapter.

5.4.1. Prerequisites

  • Metrics need to be enabled for Red Hat build of Keycloak. Follow the Gaining insights with metrics chapter for more details.
  • A monitoring system collecting the metrics.

5.4.2. Database connection pool metrics

Configure Red Hat build of Keycloak to use a fixed size database connection pool. See the Concepts for database connection pools chapter for more information.

Tip

If there is a high count of threads waiting for a database connection, increasing the database connection pool size is not always the best option. It might overload the database which would then become the bottleneck. Consider the following options instead:

  • Reduce the number of HTTP worker threads using the option http-pool-max-threads to make it match the available database connections, and thereby reduce contention and resource usage in Red Hat build of Keycloak and increase throughput.
  • Check which database statements are executed on the database. If you see, for example, a lot of information about clients and groups being fetched, and the users and realms cache are full, this might indicate that it is time to increase the sizes of those caches and see if this reduces your database load.

agroal_available_count

Idle database connections.

agroal_active_count

Database connections used in ongoing transactions.

agroal_awaiting_count

Threads waiting for a database connection to become available.
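
For example, a PromQL sketch, assuming Prometheus, that could back an alert on threads queuing for a connection (the pod label assumes a Kubernetes deployment):

max by (pod) (agroal_awaiting_count{datasource="default"}) > 0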

5.4.3. Next steps

Return to the Troubleshooting using metrics chapter or proceed to HTTP metrics.

5.5. HTTP metrics

Use metrics to monitor the Red Hat build of Keycloak HTTP requests processing.

This is part of the Troubleshooting using metrics chapter.

5.5.1. Prerequisites

  • Metrics need to be enabled for Red Hat build of Keycloak. Follow the Gaining insights with metrics chapter for more details.
  • A monitoring system collecting the metrics.

5.5.2. Metrics

5.5.2.1. Processing time

The processing time is exposed by these metrics to monitor the Red Hat build of Keycloak performance and how long it takes to process the requests.

Tip

On a healthy cluster, the average processing time will remain stable. Spikes or increases in the processing time may be an early sign that some node is under load.

Tags

method
HTTP method.
outcome
A more general outcome tag.
status
The HTTP status code.
uri
The requested URI.

http_server_requests_seconds_count

The total number of requests processed.

http_server_requests_seconds_sum

The total duration for all the requests processed.
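
Dividing the rate of the sum by the rate of the count yields the average processing time; a PromQL sketch, assuming Prometheus and a 5 minute window:

sum by (uri) (rate(http_server_requests_seconds_sum[5m]))
/
sum by (uri) (rate(http_server_requests_seconds_count[5m]))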

You can enable histograms for this metric by setting http-metrics-histograms-enabled to true, and add additional buckets for service level objectives using the option http-metrics-slos.

Note

When histograms are enabled, the percentile buckets are available. Those are useful to create heat maps and analyze latencies, but collecting and exposing the percentile buckets will increase the load on your monitoring system.

5.5.2.2. Active requests

The current number of active requests is also available.


http_server_active_requests

The current number of active requests

5.5.2.3. Bandwidth

The metrics below help to monitor the bandwidth used by Red Hat build of Keycloak, that is, the traffic consumed by the requests and responses received or sent.


http_server_bytes_written_count

The total number of responses sent.

http_server_bytes_written_sum

The total number of bytes sent.

http_server_bytes_read_count

The total number of requests received.

http_server_bytes_read_sum

The total number of bytes received.
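
The outgoing bandwidth in bytes per second can be approximated from these counters, and the incoming bandwidth analogously from http_server_bytes_read_sum; a PromQL sketch, assuming Prometheus:

sum(rate(http_server_bytes_written_sum[5m]))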

Note

When histograms are enabled, the percentile buckets are available. Those are useful to create heat maps and analyze latencies, but collecting and exposing the percentile buckets will increase the load on your monitoring system.

5.5.3. Next steps

Return to the Troubleshooting using metrics chapter or proceed to Clustering metrics.

5.5.4. Relevant options


http-metrics-histograms-enabled

Enables a histogram with default buckets for the duration of HTTP server requests.

CLI: --http-metrics-histograms-enabled
Env: KC_HTTP_METRICS_HISTOGRAMS_ENABLED

Available only when metrics are enabled

true, false (default)

http-metrics-slos

Service level objectives for HTTP server requests.

Use this instead of the default histogram, or use it in combination to add additional buckets. Specify a list of comma-separated values defined in milliseconds. Example with buckets from 5ms to 10s: 5,10,25,50,250,500,1000,2500,5000,10000

CLI: --http-metrics-slos
Env: KC_HTTP_METRICS_SLOS

Available only when metrics are enabled

 

5.6. Clustering metrics

Use metrics to monitor communication between Red Hat build of Keycloak nodes.

This is part of the Troubleshooting using metrics chapter.

5.6.1. Prerequisites

  • Metrics need to be enabled for Red Hat build of Keycloak. Follow the Gaining insights with metrics chapter for more details.
  • A monitoring system collecting the metrics.

5.6.2. Metrics

Deploying multiple Red Hat build of Keycloak nodes allows the load to be distributed amongst them, but this requires communication between the nodes. This section describes metrics that are useful for monitoring the communication between Red Hat build of Keycloak in order to identify possible faults.

Note

This is relevant only for single site deployments. When multiple sites are used, as described in Multi-site deployments, Red Hat build of Keycloak nodes are not clustered together and therefore there is no communication between them directly.

Global tags

cluster=<name>
The cluster name. If metrics from multiple clusters are being collected, this tag helps identify which cluster they belong to.
node=<node>
The name of the node reporting the metric.
Warning

All metric names prefixed with vendor_jgroups_ are provided for troubleshooting and debugging purposes only. The metric names can change in upcoming releases of Red Hat build of Keycloak without further notice. Therefore, we advise not using them in dashboards or in monitoring and alerting.

5.6.2.1. Response Time

The following metrics expose the response time for the remote requests. The response time is measured between two nodes and includes the processing time. All requests are measured by these metrics, and the response time should remain stable through the cluster lifecycle.

Tip

In a healthy cluster, the response time will remain stable. An increase in response time may indicate a degraded cluster or a node under heavy load.

Tags

node=<node>
It identifies the sender node.
target_node=<node>
It identifies the receiver node.

vendor_jgroups_stats_sync_requests_seconds_count

The number of synchronous requests to a receiver node.

vendor_jgroups_stats_sync_requests_seconds_sum

The total duration of synchronous requests to a receiver node.

Note

When histograms are enabled, the percentile buckets are available. Those are useful to create heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment performance.

5.6.2.2. Bandwidth

All the bytes received and sent by Red Hat build of Keycloak are collected by these metrics. Internal messages, such as heartbeats, are counted too. These metrics allow computing the bandwidth currently used by each node.

Important

The metric name depends on the JGroups transport protocol in use.

Metric | Protocol | Description
vendor_jgroups_tcp_get_num_bytes_received | TCP | The total number of bytes received by a node.
vendor_jgroups_udp_get_num_bytes_received | UDP | The total number of bytes received by a node.
vendor_jgroups_tunnel_get_num_bytes_received | TUNNEL | The total number of bytes received by a node.
vendor_jgroups_tcp_get_num_bytes_sent | TCP | The total number of bytes sent by a node.
vendor_jgroups_udp_get_num_bytes_sent | UDP | The total number of bytes sent by a node.
vendor_jgroups_tunnel_get_num_bytes_sent | TUNNEL | The total number of bytes sent by a node.

5.6.2.3. Thread Pool

Monitoring the thread pool size is a good way to tell whether a node is under heavy load. All requests received are added to the thread pool for processing and, when the pool is full, requests are discarded. A retransmission mechanism ensures reliable communication at the cost of increased resource usage.

Tip

In a healthy cluster, the thread pool should never be close to its maximum size (by default, 200 threads).

Note

Thread pool metrics are not available with virtual threads. Virtual threads are enabled by default when running with OpenJDK 21.

Important

The metric name depends on the JGroups transport protocol in use. The default transport protocol is TCP.

Metric | Protocol | Description
vendor_jgroups_tcp_get_thread_pool_size | TCP | Current number of threads in the thread pool.
vendor_jgroups_udp_get_thread_pool_size | UDP | Current number of threads in the thread pool.
vendor_jgroups_tunnel_get_thread_pool_size | TUNNEL | Current number of threads in the thread pool.
vendor_jgroups_tcp_get_largest_size | TCP | The largest number of threads that have ever simultaneously been in the pool.
vendor_jgroups_udp_get_largest_size | UDP | The largest number of threads that have ever simultaneously been in the pool.
vendor_jgroups_tunnel_get_largest_size | TUNNEL | The largest number of threads that have ever simultaneously been in the pool.

5.6.2.4. Flow Control

Flow control takes care of adjusting the rate of a message sender to the rate of the slowest receiver over time. This is implemented through a credit-based system, where each sender decrements its credits when sending. The sender blocks when the credits fall below 0, and only resumes sending messages when it receives a replenishment message from the receivers.

The metrics below show the number of blocked messages and the average blocking time. When a value is different from zero, it may signal that a receiver is overloaded and may degrade the cluster performance.

Each node has two independent flow control protocols, UFC for unicast messages and MFC for multicast messages.

Tip

A healthy cluster shows a value of zero for all metrics.


vendor_jgroups_ufc_get_number_of_blockings

The number of times flow control blocks the sender for unicast messages.

vendor_jgroups_ufc_get_average_time_blocked

Average time blocked (in ms) in flow control when trying to send a unicast message.

vendor_jgroups_mfc_get_number_of_blockings

The number of times flow control blocks the sender for multicast messages.

vendor_jgroups_mfc_get_average_time_blocked

Average time blocked (in ms) in flow control when trying to send a multicast message.

5.6.2.5. Retransmissions

JGroups provides reliable delivery of messages. When a message is dropped on the network, or the receiver cannot handle the message, a retransmission is required. Retransmissions increase resource usage, and they are usually a signal of an overloaded system.

Random Early Drop (RED) monitors the sender queues. When the queues are almost full, the message is dropped, and a retransmission must happen. It prevents threads from being blocked by a full sender queue.

Tip

A healthy cluster shows a value of zero for all metrics.


vendor_jgroups_unicast3_get_num_xmits

The number of retransmitted messages.

vendor_jgroups_red_get_dropped_messages

The total number of dropped messages by the sender.

vendor_jgroups_red_get_drop_rate

Percentage of all messages that were dropped by the sender.

5.6.2.6. Network Partitions
5.6.2.6.1. Cluster Size

The cluster size metric reports the number of nodes present in the cluster. If it differs between nodes, it may signal that a node is joining or shutting down or, in the worst case, that a network partition is happening.

Tip

A healthy cluster shows the same value in all nodes.


vendor_cluster_size

The number of nodes in the cluster.
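
A simple way to detect such a disagreement from the monitoring side is to compare the values reported by all nodes; a PromQL sketch, assuming Prometheus, that returns 1 when the nodes disagree on the cluster size:

min(vendor_cluster_size) != bool max(vendor_cluster_size)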

5.6.2.6.2. Network Partition Events

Network partitions in a cluster can happen for various reasons. This metric does not help predict network splits, but it signals that one happened and that the cluster has since been merged.

Tip

A healthy cluster shows a value of zero for this metric.


vendor_jgroups_merge3_get_num_merge_events

The number of times a network split was detected and healed.

5.6.3. Next steps

Return to the Troubleshooting using metrics chapter or proceed to Embedded Infinispan metrics for single site deployments.

5.7. Embedded Infinispan metrics for single site deployments

Use metrics to monitor caching health and cluster replication.

This is part of the Troubleshooting using metrics chapter.

5.7.1. Prerequisites

  • Metrics need to be enabled for Red Hat build of Keycloak. Follow the Gaining insights with metrics chapter for more details.
  • A monitoring system collecting the metrics.

5.7.2. Metrics

Global tags

cache=<name>
The cache name.
5.7.2.1. Size

Monitor the number of entries in your cache using these two metrics. If the cache is clustered, each entry has an owner node and zero or more backup copies of different nodes.

Tip

Sum the unique entries metric across all nodes to get the total number of entries in the cluster.


vendor_statistics_approximate_entries

The approximate number of entries stored by the node, including backup copies.

vendor_statistics_approximate_entries_unique

The approximate number of entries stored by the node, excluding backup copies.
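
For example, the cluster-wide number of entries per cache can be computed by summing the unique entries metric across all nodes; a PromQL sketch, assuming Prometheus:

sum by (cache) (vendor_statistics_approximate_entries_unique)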

5.7.2.2. Data Access

The following metrics monitor the cache accesses, such as the reads, writes and their duration.

5.7.2.2.1. Stores

A store operation is a write operation that writes or updates a value stored in the cache.


vendor_statistics_store_times_seconds_count

The total number of store requests.

vendor_statistics_store_times_seconds_sum

The total duration of all store requests.

Note

When histograms are enabled, the percentile buckets are available. Those are useful to create heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment performance.

5.7.2.2.2. Reads

A read operation reads a value from the cache. It divides into two groups, a hit if a value is found, and a miss if not found.


vendor_statistics_hit_times_seconds_count

The total number of read hits requests.

vendor_statistics_hit_times_seconds_sum

The total duration of all read hits requests.

vendor_statistics_miss_times_seconds_count

The total number of read misses requests.

vendor_statistics_miss_times_seconds_sum

The total duration of all read misses requests.

Note

When histograms are enabled, the percentile buckets are available. Those are useful to create heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment performance.

5.7.2.2.3. Removes

A remove operation removes a value from the cache. It divides into two groups, a hit if a value exists, and a miss if the value does not exist.


vendor_statistics_remove_hit_times_seconds_count

The total number of remove hits requests.

vendor_statistics_remove_hit_times_seconds_sum

The total duration of all remove hits requests.

vendor_statistics_remove_miss_times_seconds_count

The total number of remove misses requests.

vendor_statistics_remove_miss_times_seconds_sum

The total duration of all remove misses requests.

Note

When histograms are enabled, the percentile buckets are available. Those are useful to create heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment performance.

Tip

For the users and realms caches, a database invalidation translates into a remove operation. These metrics are a good indicator of how frequently the database entities are modified and therefore removed from the cache.

Hit Ratio for read and remove operations

An expression can be used to compute the hit ratio for a cache in systems such as Prometheus. As an example, the hit ratio for read operations can be expressed as:

vendor_statistics_hit_times_seconds_count
/
(vendor_statistics_hit_times_seconds_count
 + vendor_statistics_miss_times_seconds_count)

Read/Write ratio

An expression can be used to compute the read-write ratio for a cache, using the metrics above:

(vendor_statistics_hit_times_seconds_count
 + vendor_statistics_miss_times_seconds_count)
/
(vendor_statistics_hit_times_seconds_count
 + vendor_statistics_miss_times_seconds_count
 + vendor_statistics_remove_hit_times_seconds_count
 + vendor_statistics_remove_miss_times_seconds_count
 + vendor_statistics_store_times_seconds_count)
5.7.2.2.4. Eviction

Eviction is the process of limiting the cache size: when the cache is full, an entry is removed to make room for a new entry to be cached. As Red Hat build of Keycloak caches the database entities in the users, realms and authorization caches, a database access always comes with an eviction event once these caches are full.


vendor_statistics_evictions

The total number of eviction events.

Eviction rate

A rapid increase of evictions combined with very high database CPU usage means the users or realms cache is too small for smooth Red Hat build of Keycloak operation, as data needs to be re-loaded very often from the database, which slows down responses. If enough memory is available, consider increasing the maximum cache size using the CLI options cache-embedded-users-max-count or cache-embedded-realms-max-count.
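
The eviction rate per cache can be watched with a query like the following PromQL sketch, assuming Prometheus:

sum by (cache) (rate(vendor_statistics_evictions[5m]))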

5.7.2.3. Locking

Write and remove operations hold the lock until the value is replicated in the local cluster and to the remote site.

Tip

On a healthy cluster, the number of locks held should remain constant, but deadlocks may create temporary spikes.


vendor_lock_manager_number_of_locks_held

The number of locks currently being held by this node.

5.7.2.4. Transactions

Transactional caches use both One-Phase-Commit and Two-Phase-Commit protocols to complete a transaction. These metrics keep track of the operation duration.

Note

The PESSIMISTIC locking mode uses One-Phase-Commit and does not create commit requests.

Tip

In a healthy cluster, the number of rollbacks should remain zero. Deadlocks should be rare, but they increase the number of rollbacks.


vendor_transactions_prepare_times_seconds_count

The total number of prepare requests.

vendor_transactions_prepare_times_seconds_sum

The total duration of all prepare requests.

vendor_transactions_rollback_times_seconds_count

The total number of rollback requests.

vendor_transactions_rollback_times_seconds_sum

The total duration of all rollback requests.

vendor_transactions_commit_times_seconds_count

The total number of commit requests.

vendor_transactions_commit_times_seconds_sum

The total duration of all commit requests.

Note

When histograms are enabled, the percentile buckets are available. Those are useful to create heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment performance.

5.7.2.5. State Transfer

State transfer happens when a node joins or leaves the cluster. It is required to balance the data stored and guarantee the desired number of copies.

This operation increases resource usage and will negatively affect the overall performance.


vendor_state_transfer_manager_inflight_transactional_segment_count

The number of in-flight transactional segments the local node requested from other nodes.

vendor_state_transfer_manager_inflight_segment_transfer_count

The number of in-flight segments the local node requested from other nodes.

5.7.2.6. Cluster Data Replication

The cluster data replication can be the main source of failure. These metrics not only report the response time, i.e., the time it takes to replicate an update, but also the failures.

Tip

On a healthy cluster, the average replication time will be stable or with little variance. The number of failures should not increase.


vendor_rpc_manager_replication_count

The total number of successful replications.

vendor_rpc_manager_replication_failures

The total number of failed replications.

vendor_rpc_manager_average_replication_time

The average time spent, in milliseconds, replicating data in the cluster.

Success ratio

An expression can be used to compute the replication success ratio:

(vendor_rpc_manager_replication_count)
/
(vendor_rpc_manager_replication_count
 + vendor_rpc_manager_replication_failures)

5.7.3. Next steps

Return to the Troubleshooting using metrics chapter.

5.8. Embedded Infinispan metrics for multi-site deployments

Use metrics to monitor caching health.

This is part of the Troubleshooting using metrics chapter.

5.8.1. Prerequisites

  • Metrics need to be enabled for Red Hat build of Keycloak. Follow the Gaining insights with metrics chapter for more details.
  • A monitoring system collecting the metrics.

5.8.2. Metrics

Global tags

cache=<name>
The cache name.
5.8.2.1. Size

Monitor the number of entries in your cache using these two metrics. If the cache is clustered, each entry has an owner node and zero or more backup copies of different nodes.

Tip

Sum the unique entries metric across all nodes to get the total number of entries in the cluster.


vendor_statistics_approximate_entries

The approximate number of entries stored by the node, including backup copies.

vendor_statistics_approximate_entries_unique

The approximate number of entries stored by the node, excluding backup copies.

5.8.2.2. Data Access

The following metrics monitor the cache accesses, such as the reads, writes and their duration.

5.8.2.2.1. Stores

A store operation is a write operation that writes or updates a value stored in the cache.


vendor_statistics_store_times_seconds_count

The total number of store requests.

vendor_statistics_store_times_seconds_sum

The total duration of all store requests.

Note

When histograms are enabled, the percentile buckets are available. Those are useful to create heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment performance.

5.8.2.2.2. Reads

A read operation reads a value from the cache. It divides into two groups, a hit if a value is found, and a miss if not found.


vendor_statistics_hit_times_seconds_count

The total number of read hits requests.

vendor_statistics_hit_times_seconds_sum

The total duration of all read hits requests.

vendor_statistics_miss_times_seconds_count

The total number of read misses requests.

vendor_statistics_miss_times_seconds_sum

The total duration of all read misses requests.

Note

When histograms are enabled, the percentile buckets are available. Those are useful to create heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment performance.

5.8.2.2.3. Removes

A remove operation removes a value from the cache. It divides into two groups, a hit if a value exists, and a miss if the value does not exist.


vendor_statistics_remove_hit_times_seconds_count

The total number of remove hits requests.

vendor_statistics_remove_hit_times_seconds_sum

The total duration of all remove hits requests.

vendor_statistics_remove_miss_times_seconds_count

The total number of remove misses requests.

vendor_statistics_remove_miss_times_seconds_sum

The total duration of all remove misses requests.

Note

When histograms are enabled, the percentile buckets are available. Those are useful to create heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment performance.

Tip

For the users and realms caches, a database invalidation translates into a remove operation. These metrics are a good indicator of how frequently the database entities are modified and therefore removed from the cache.

Hit Ratio for read and remove operations

An expression can be used to compute the hit ratio for a cache in systems such as Prometheus. As an example, the hit ratio for read operations can be expressed as:

vendor_statistics_hit_times_seconds_count
/
(vendor_statistics_hit_times_seconds_count
 + vendor_statistics_miss_times_seconds_count)

Read/Write ratio

An expression can be used to compute the read-write ratio for a cache, using the metrics above:

(vendor_statistics_hit_times_seconds_count
 + vendor_statistics_miss_times_seconds_count)
/
(vendor_statistics_hit_times_seconds_count
 + vendor_statistics_miss_times_seconds_count
 + vendor_statistics_remove_hit_times_seconds_count
 + vendor_statistics_remove_miss_times_seconds_count
 + vendor_statistics_store_times_seconds_count)
5.8.2.2.4. Eviction

Eviction is the process of limiting the cache size: when the cache is full, an entry is removed to make room for a new entry to be cached. As Red Hat build of Keycloak caches the database entities in the users, realms and authorization caches, a database access always comes with an eviction event once these caches are full.


vendor_statistics_evictions

The total number of eviction events.

Eviction rate

A rapid increase of evictions combined with very high database CPU usage means the users or realms cache is too small for smooth Red Hat build of Keycloak operation, as data needs to be re-loaded very often from the database, which slows down responses. If enough memory is available, consider increasing the maximum cache size using the CLI options cache-embedded-users-max-count or cache-embedded-realms-max-count.

5.8.2.3. Transactions

Transactional caches use both One-Phase-Commit and Two-Phase-Commit protocols to complete a transaction. These metrics keep track of the operation duration.

Note

The PESSIMISTIC locking mode uses One-Phase-Commit and does not create commit requests.

Tip

In a healthy cluster, the number of rollbacks should remain zero. Deadlocks should be rare, but they increase the number of rollbacks.


vendor_transactions_prepare_times_seconds_count

The total number of prepare requests.

vendor_transactions_prepare_times_seconds_sum

The total duration of all prepare requests.

vendor_transactions_rollback_times_seconds_count

The total number of rollback requests.

vendor_transactions_rollback_times_seconds_sum

The total duration of all rollback requests.

vendor_transactions_commit_times_seconds_count

The total number of commit requests.

vendor_transactions_commit_times_seconds_sum

The total duration of all commit requests.

Note

When histograms are enabled, the percentile buckets are available. Those are useful to create heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment performance.

5.8.3. Next steps

Return to the Troubleshooting using metrics chapter or proceed to External Data Grid metrics.

5.9. External Data Grid metrics

Use metrics to monitor external Data Grid performance.

This is part of the Troubleshooting using metrics chapter.

5.9.1. Prerequisites

5.9.1.1. Enabled Data Grid server metrics

Data Grid exposes metrics at the /metrics endpoint, and they are enabled by default. We recommend enabling the attribute name-as-tags as it makes the metric names independent of the cache name.

To configure metrics in the Data Grid server, enable them as shown in the XML below.

infinispan.xml

<infinispan>
    <cache-container statistics="true">
        <metrics gauges="true" histograms="false" name-as-tags="true" />
    </cache-container>
</infinispan>

When using the Data Grid Operator in Kubernetes, metrics can be enabled by using a ConfigMap with a custom configuration. An example is shown below.

ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-config
data:
  infinispan-config.yaml: >
    infinispan:
      cacheContainer:
        metrics:
          gauges: true
          namesAsTags: true
          histograms: false

infinispan.yaml CR

apiVersion: infinispan.org/v1
kind: Infinispan
metadata:
  name: infinispan
  annotations:
    infinispan.org/monitoring: 'true'  # (1)
spec:
  configMapName: "cluster-config"      # (2)

(1) Enables monitoring for the deployment
(2) Sets the ConfigMap name with the custom configuration.

Additional information can be found in the Infinispan documentation and Infinispan operator documentation.

5.9.2. Clustering and Network

This section describes metrics that are useful for monitoring the communication between Data Grid nodes to identify possible network issues.

Global tags

cluster=<name>
The cluster name. If metrics from multiple clusters are being collected, this tag helps identify which cluster they belong to.
node=<node>
The name of the node reporting the metric.
Warning

All metric names prefixed with vendor_jgroups_ are provided for troubleshooting and debugging purposes only. The metric names can change in upcoming releases of Red Hat build of Keycloak without further notice. Therefore, we advise not using them in dashboards or in monitoring and alerting.

5.9.2.1. Response Time

The following metrics expose the response time for the remote requests. The response time is measured between two nodes and includes the processing time. All requests are measured by these metrics, and the response time should remain stable through the cluster lifecycle.

Tip

In a healthy cluster, the response time will remain stable. An increase in response time may indicate a degraded cluster or a node under heavy load.

Tags

node=<node>
It identifies the sender node.
target_node=<node>
It identifies the receiver node.
Metric / Description

vendor_jgroups_stats_sync_requests_seconds_count

The number of synchronous requests to a receiver node.

vendor_jgroups_stats_sync_requests_seconds_sum

The total duration of all synchronous requests to a receiver node.
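
As an ad-hoc troubleshooting query, assuming the metrics are scraped into a Prometheus-compatible system, a sketch like the following computes the average response time per sender and receiver pair over an arbitrary five-minute window. Keep the warning above in mind: vendor_jgroups_ metrics should not be relied on in dashboards or alerting.

# Average synchronous request response time per sender/receiver pair
sum by (node, target_node) (rate(vendor_jgroups_stats_sync_requests_seconds_sum[5m]))
/
sum by (node, target_node) (rate(vendor_jgroups_stats_sync_requests_seconds_count[5m]))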

Note

When histograms are enabled, percentile buckets are available. They are useful for creating heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment performance.

5.9.2.2. Bandwidth

These metrics collect all the bytes received and sent by Data Grid, including internal messages such as heartbeats. They allow you to compute the bandwidth currently used by each node.

Important

The metric name depends on the JGroups transport protocol in use.

Metric / Protocol / Description

vendor_jgroups_tcp_get_num_bytes_received

TCP

The total number of bytes received by a node.

vendor_jgroups_udp_get_num_bytes_received

UDP

vendor_jgroups_tunnel_get_num_bytes_received

TUNNEL

vendor_jgroups_tcp_get_num_bytes_sent

TCP

The total number of bytes sent by a node.

vendor_jgroups_udp_get_num_bytes_sent

UDP

vendor_jgroups_tunnel_get_num_bytes_sent

TUNNEL
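
As a sketch, assuming the default TCP transport and a Prometheus-compatible datasource, an ad-hoc query similar to the following estimates the bandwidth used by each node; the five-minute window is arbitrary.

# Approximate bytes per second sent and received by each node (default TCP transport assumed)
rate(vendor_jgroups_tcp_get_num_bytes_sent[5m])
+
rate(vendor_jgroups_tcp_get_num_bytes_received[5m])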

5.9.2.3. Thread Pool

The thread pool size is a good indicator that a node is under heavy load. All received requests are added to the thread pool for processing; when the pool is full, requests are discarded. A retransmission mechanism ensures reliable communication, at the cost of increased resource usage.

Tip

In a healthy cluster, the thread pool should never be close to its maximum size (by default, 200 threads).

Note

Thread pool metrics are not available with virtual threads. Virtual threads are enabled by default when running with OpenJDK 21.

Important

The metric name depends on the JGroups transport protocol in use. The default transport protocol is TCP.

Metric / Protocol / Description

vendor_jgroups_tcp_get_thread_pool_size

TCP

Current number of threads in the thread pool.

vendor_jgroups_udp_get_thread_pool_size

UDP

vendor_jgroups_tunnel_get_thread_pool_size

TUNNEL

vendor_jgroups_tcp_get_largest_size

TCP

The largest number of threads that have ever simultaneously been in the pool.

vendor_jgroups_udp_get_largest_size

UDP

vendor_jgroups_tunnel_get_largest_size

TUNNEL
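
As a sketch, assuming the default TCP transport, the default maximum of 200 threads, and platform threads (the thread pool metrics are not exposed with virtual threads), an ad-hoc query like the following reports the recent peak utilization of the thread pool as a fraction of that maximum.

# Peak thread pool utilization over the last five minutes, relative to the default maximum of 200 threads
max_over_time(vendor_jgroups_tcp_get_thread_pool_size[5m]) / 200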

5.9.2.4. Flow Control

Flow control takes care of adjusting the rate of a message sender to the rate of the slowest receiver over time. This is implemented through a credit-based system, where each sender decrements its credits when sending. The sender blocks when the credits fall below 0, and only resumes sending messages when it receives a replenishment message from the receivers.

The metrics below show the number of blocked messages and the average blocking time. A value different from zero may signal that a receiver is overloaded, which may degrade the cluster performance.

Each node has two independent flow control protocols, UFC for unicast messages and MFC for multicast messages.

Tip

A healthy cluster shows a value of zero for all metrics.

Metric / Description

vendor_jgroups_ufc_get_number_of_blockings

The number of times flow control blocks the sender for unicast messages.

vendor_jgroups_ufc_get_average_time_blocked

Average time blocked (in ms) in flow control when trying to send a unicast message.

vendor_jgroups_mfc_get_number_of_blockings

The number of times flow control blocks the sender for multicast messages.

vendor_jgroups_mfc_get_average_time_blocked

Average time blocked (in ms) in flow control when trying to send a multicast message.
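
As an illustrative troubleshooting query, assuming a Prometheus-compatible datasource and that these metrics behave as counters, the following sketch lists the nodes that were blocked by flow control during the last 15 minutes.

# Nodes blocked by flow control (unicast or multicast) within the last 15 minutes
increase(vendor_jgroups_ufc_get_number_of_blockings[15m]) > 0
or
increase(vendor_jgroups_mfc_get_number_of_blockings[15m]) > 0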

5.9.2.5. Retransmissions

JGroups provides reliable delivery of messages. When a message is dropped on the network, or the receiver cannot handle it, a retransmission is required. Retransmissions increase resource usage and are usually a signal of an overloaded system.

Random Early Drop (RED) monitors the sender queues. When the queues are almost full, the message is dropped, and a retransmission must happen. It prevents threads from being blocked by a full sender queue.

Tip

A healthy cluster shows a value of zero for all metrics.

Metric / Description

vendor_jgroups_unicast3_get_num_xmits

The number of retransmitted messages.

vendor_jgroups_red_get_dropped_messages

The total number of dropped messages by the sender.

vendor_jgroups_red_get_drop_rate

Percentage of all messages that were dropped by the sender.
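
As a sketch, assuming a Prometheus-compatible datasource, an ad-hoc query similar to the following shows the recent retransmission rate per node; a sustained non-zero value usually points to an overloaded system.

# Retransmitted messages per second over the last five minutes
rate(vendor_jgroups_unicast3_get_num_xmits[5m])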

5.9.2.6. Network Partitions
5.9.2.6.1. Cluster Size

The cluster size metric reports the number of nodes present in the cluster. If the value differs between nodes, it may signal that a node is joining or shutting down or, in the worst case, that a network partition is happening.

Tip

A healthy cluster shows the same value in all nodes.

Metric / Description

vendor_cluster_size

The number of nodes in the cluster.
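
As a sketch, assuming a Prometheus-compatible datasource, the following query returns a result only when the nodes of a cluster disagree about the cluster size, which may indicate a partition.

# A non-empty result means the nodes of a cluster report different cluster sizes
max by (cluster) (vendor_cluster_size)
!=
min by (cluster) (vendor_cluster_size)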

5.9.2.6.2. Cross-Site Status

The cross-site status reports the connection status to the other site. It returns a value of 1 if the site is online or 0 if it is offline. A value of 2 is used on nodes where the status is unknown; not all nodes establish connections to the remote sites, so they do not have this information.

Tip

A healthy cluster shows a value greater than zero.

Metric / Description

vendor_jgroups_site_view_status

The single site status (1 if online).

Tags

site=<name>
The name of the destination site.
5.9.2.6.3. Network Partition Events

Network partitions in a cluster can happen for various reasons. This metric does not help predict network splits, but it signals that a split happened and that the cluster has since been merged.

Tip

A healthy cluster shows a value of zero for this metric.

Metric / Description

vendor_jgroups_merge3_get_num_merge_events

The number of times a network split was detected and healed.

5.9.3. Data Grid Caches

The metrics in this section help you monitor the health of the Data Grid caches and the cluster replication.

Global tags

cache=<name>
The cache name.
5.9.3.1. Size

Monitor the number of entries in your cache using these two metrics. If the cache is clustered, each entry has an owner node and zero or more backup copies on different nodes.

Tip

Sum the unique entries metric across all nodes to get the total number of entries in the cluster.

Metric / Description

vendor_statistics_approximate_entries

The approximate number of entries stored by the node, including backup copies.

vendor_statistics_approximate_entries_unique

The approximate number of entries stored by the node, excluding backup copies.
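
Following the tip above, and assuming a Prometheus-compatible datasource, such a query could look like the following sketch.

# Approximate total number of unique entries per cache across the cluster
sum by (cache) (vendor_statistics_approximate_entries_unique)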

5.9.3.2. Data Access

The following metrics monitor the cache accesses, such as reads and writes, and their duration.

5.9.3.2.1. Stores

A store operation is a write operation that writes or updates a value stored in the cache.

Metric / Description

vendor_statistics_store_times_seconds_count

The total number of store requests.

vendor_statistics_store_times_seconds_sum

The total duration of all store requests.
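
As a sketch, assuming a Prometheus-compatible datasource, the average store latency per cache can be derived from these two counters; the five-minute window is arbitrary.

# Average store (write) latency per cache over the last five minutes
sum by (cache) (rate(vendor_statistics_store_times_seconds_sum[5m]))
/
sum by (cache) (rate(vendor_statistics_store_times_seconds_count[5m]))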

Note

When histograms are enabled, percentile buckets are available. They are useful for creating heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment performance.

5.9.3.2.2. Reads

A read operation reads a value from the cache. Reads divide into two groups: a hit if a value is found, and a miss if it is not.

Metric / Description

vendor_statistics_hit_times_seconds_count

The total number of read hit requests.

vendor_statistics_hit_times_seconds_sum

The total duration of all read hit requests.

vendor_statistics_miss_times_seconds_count

The total number of read miss requests.

vendor_statistics_miss_times_seconds_sum

The total duration of all read miss requests.
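
As a sketch, assuming a Prometheus-compatible datasource, the read hit ratio per cache can be derived from these counters as follows.

# Read hit ratio per cache over the last five minutes (1.0 means every read found a value)
sum by (cache) (rate(vendor_statistics_hit_times_seconds_count[5m]))
/
(
  sum by (cache) (rate(vendor_statistics_hit_times_seconds_count[5m]))
  +
  sum by (cache) (rate(vendor_statistics_miss_times_seconds_count[5m]))
)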

Note

When histograms are enabled, percentile buckets are available. They are useful for creating heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment performance.

5.9.3.2.3. Removes

A remove operation removes a value from the cache. Removes divide into two groups: a hit if the value exists, and a miss if it does not.

Metric / Description

vendor_statistics_remove_hit_times_seconds_count

The total number of remove hit requests.

vendor_statistics_remove_hit_times_seconds_sum

The total duration of all remove hit requests.

vendor_statistics_remove_miss_times_seconds_count

The total number of remove miss requests.

vendor_statistics_remove_miss_times_seconds_sum

The total duration of all remove miss requests.

Note

When histograms are enabled, percentile buckets are available. They are useful for creating heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment performance.

5.9.3.3. Locking

Write and remove operations hold the lock until the value is replicated in the local cluster and to the remote site.

Tip

On a healthy cluster, the number of locks held should remain constant, but deadlocks may create temporary spikes.

Metric / Description

vendor_lock_manager_number_of_locks_held

The number of locks currently being held by this node.

5.9.3.4. Transactions

Transactional caches use both One-Phase-Commit and Two-Phase-Commit protocols to complete a transaction. These metrics keep track of the operation duration.

Note

The PESSIMISTIC locking mode uses One-Phase-Commit and does not create commit requests.

Tip

In a healthy cluster, the number of rollbacks should remain zero. Deadlocks should be rare, but they increase the number of rollbacks.

Metric / Description

vendor_transactions_prepare_times_seconds_count

The total number of prepare requests.

vendor_transactions_prepare_times_seconds_sum

The total duration of all prepare requests.

vendor_transactions_rollback_times_seconds_count

The total number of rollback requests.

vendor_transactions_rollback_times_seconds_sum

The total duration of all rollback requests.

vendor_transactions_commit_times_seconds_count

The total number of commit requests.

vendor_transactions_commit_times_seconds_sum

The total duration of all commit requests.
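
Following the tip above, and assuming a Prometheus-compatible datasource, a sketch such as the following surfaces caches in which rollbacks occurred recently.

# Transaction rollbacks per cache within the last 15 minutes; a healthy cluster stays at zero
sum by (cache) (increase(vendor_transactions_rollback_times_seconds_count[15m]))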

Note

When histograms are enabled, percentile buckets are available. They are useful for creating heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment performance.

5.9.3.5. State Transfer

State transfer happens when a node joins or leaves the cluster. It is required to balance the data stored and guarantee the desired number of copies.

This operation increases resource usage and negatively affects the overall performance.

Metric / Description

vendor_state_transfer_manager_inflight_transactional_segment_count

The number of in-flight transactional segments the local node requested from other nodes.

vendor_state_transfer_manager_inflight_segment_transfer_count

The number of in-flight segments the local node requested from other nodes.

5.9.3.6. Cluster Data Replication

Cluster data replication can be the main source of failure. These metrics report not only the response time, that is, the time it takes to replicate an update, but also the failures.

Tip

On a healthy cluster, the average replication time will be stable or show little variance. The number of failures should not increase.

Metric / Description

vendor_rpc_manager_replication_count

The total number of successful replications.

vendor_rpc_manager_replication_failures

The total number of failed replications.

vendor_rpc_manager_average_replication_time

The average time spent, in milliseconds, replicating data in the cluster.

Success ratio

An expression can be used to compute the replication success ratio:

(vendor_rpc_manager_replication_count)
/
(vendor_rpc_manager_replication_count
 + vendor_rpc_manager_replication_failures)
5.9.3.7. Cross Site Data Replication

Like cluster data replication, the metrics in this section measure the time it takes to replicate the data to the other sites.

Tip

On a healthy cluster, the average cross-site replication time will be stable or show little variance.

Tags

site=<name>
The name of the receiving site.
Metric / Description

vendor_rpc_manager_cross_site_replication_times_seconds_count

The total number of cross-site requests.

vendor_rpc_manager_cross_site_replication_times_seconds_sum

The total duration of all cross-site requests.

vendor_rpc_manager_replication_times_to_site_seconds_count

The total number of cross-site requests. This metric is more detailed with a per-site counter.

vendor_rpc_manager_replication_times_to_site_seconds_sum

The total duration of all cross-site requests. This metric is more detailed with a per-site duration.

vendor_rpc_manager_number_xsite_requests_received_from_site

The total number of cross-site requests handled by this node. This metric is more detailed with a per-site counter.

vendor_x_site_admin_status

The site status. A value of 1 indicates that it is online. This value reacts to the Data Grid CLI commands bring-online and take-offline.
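
As a sketch, assuming a Prometheus-compatible datasource, the average cross-site replication time per destination site can be derived from the per-site counters as follows.

# Average cross-site replication time per destination site over the last five minutes
sum by (site) (rate(vendor_rpc_manager_replication_times_to_site_seconds_sum[5m]))
/
sum by (site) (rate(vendor_rpc_manager_replication_times_to_site_seconds_count[5m]))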

Note

When histograms are enabled, percentile buckets are available. They are useful for creating heat maps, but collecting and exposing the percentile buckets may have a negative impact on the deployment performance.

5.9.4. Next steps

Return to the Troubleshooting using metrics chapter.

Chapter 6. Root cause analysis with tracing

Record information during the request lifecycle with OpenTelemetry tracing to identify root causes for latencies and errors in Red Hat build of Keycloak and connected systems.

This chapter explains how you can enable and configure distributed tracing in Red Hat build of Keycloak by utilizing OpenTelemetry (OTel). Tracing allows for detailed monitoring of each request’s lifecycle, which helps quickly identify and diagnose issues, leading to more efficient debugging and maintenance.

It provides valuable insights into performance bottlenecks and can help optimize the system’s overall efficiency and across system boundaries. Red Hat build of Keycloak uses a supported Quarkus OTel extension that provides smooth integration and exposure of application traces.

6.1. Enable tracing

It is possible to enable exposing traces using the build time option tracing-enabled as follows:

bin/kc.[sh|bat] start --tracing-enabled=true

By default, the trace exporters send out data in batches, using the gRPC protocol and endpoint http://localhost:4317.

The default service name is keycloak, specified via the tracing-service-name property, which takes precedence over service.name defined in the tracing-resource-attributes property.

For more information about resource attributes that can be provided via the tracing-resource-attributes property, see the Quarkus OpenTelemetry Resource guide.
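
For example, the defaults described above can be stated explicitly, or adjusted to point at your own collector, with a command along the lines of the following sketch; the endpoint shown is only the built-in default and should be replaced with the address of your collector.

bin/kc.[sh|bat] start --tracing-enabled=true --tracing-endpoint=http://localhost:4317 --tracing-protocol=grpc --tracing-service-name=keycloak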

Note

Tracing can be enabled only when the opentelemetry feature is enabled, which it is by default.

For more tracing settings, see all possible configurations below.

6.2. Development setup

In order to see the captured Red Hat build of Keycloak traces, a basic setup leveraging the Jaeger tracing platform can be used. For development purposes, Jaeger all-in-one can be used to see traces as easily as possible.

Note

Jaeger-all-in-one includes the Jaeger agent, an OTel collector, and the query service/UI. You do not need to install a separate collector, as you can directly send the trace data to Jaeger.

podman run --name jaeger \
-p 16686:16686 \
-p 4317:4317 \
-p 4318:4318 \
jaegertracing/all-in-one

6.2.1. Exposed ports

16686
Jaeger UI
4317
OpenTelemetry Protocol gRPC receiver (default)
4318
OpenTelemetry Protocol HTTP receiver

You can visit the Jaeger UI on http://localhost:16686/ to see the tracing information. The Jaeger UI might look like the following with an arbitrary Red Hat build of Keycloak trace.

6.3. Information in traces

6.3.1. Spans

Red Hat build of Keycloak creates spans for the following activities:

  • Incoming HTTP requests
  • Outgoing database requests, including acquiring a database connection
  • Outgoing LDAP requests, including connecting to the LDAP server
  • Outgoing HTTP requests, including IdP brokerage

6.3.2. Tags

Red Hat build of Keycloak adds tags to traces depending on the type of the request. All tags are prefixed with kc..

Example tags are:

kc.clientId
Client ID
kc.realmName
Realm name
kc.sessionId
User session ID
kc.token.id
id as mentioned in the token
kc.token.issuer
issuer as mentioned in the token
kc.token.sid
sid as mentioned in the token
kc.authenticationSessionId
Authentication session ID
kc.authenticationTabId
Authentication Tab ID

6.3.3. Logs

If a trace is being sampled, it will contain any user events created during the request. This includes, for example, LOGIN, LOGOUT or REFRESH_TOKEN events with all details and IDs found in user events.

LDAP communication errors are also shown as log entries in recorded traces, with a stack trace and details of the failed operation.

6.4. Trace IDs in logs

When tracing is enabled, the trace IDs are included in the log messages of all enabled log handlers (see more in Configuring logging). This can be useful for associating log events with request execution, which might provide better traceability and debugging. All log lines originating from the same request will have the same traceId in the log.

The log message also contains a sampled flag, which relates to the sampling described below and indicates whether the span was sampled - sent to the collector.

The format of the log records may start as follows:

2024-08-05 15:27:07,144 traceId=b636ac4c665ceb901f7fdc3fc7e80154, parentId=d59cea113d0c2549, spanId=d59cea113d0c2549, sampled=true WARN  [org.keycloak.events] ...

6.4.1. Hide trace IDs in logs

You can hide trace IDs in specific log handlers by specifying their associated Red Hat build of Keycloak option log-<handler-name>-include-trace, where <handler-name> is the name of the log handler. For instance, to disable trace info in the console log, you can turn it off as follows:

bin/kc.[sh|bat] start --tracing-enabled=true --log=console --log-console-include-trace=false
Note

When you explicitly override the log format for the particular log handlers, the *-include-trace options do not have any effect, and no trace information is included.

6.5. Sampling

The sampler decides whether a trace should be discarded or forwarded, effectively reducing overhead by limiting the number of collected traces sent to the collector. It helps manage resource consumption and avoids the high storage cost and potential performance penalty of tracing every single request.

Warning

For a production-ready environment, sampling should be properly set to minimize infrastructure costs.

Red Hat build of Keycloak supports several built-in OpenTelemetry samplers, such as:

  • always_on
  • always_off
  • traceidratio (default)
  • parentbased_always_on
  • parentbased_always_off
  • parentbased_traceidratio

The used sampler can be changed via the tracing-sampler-type property.

6.5.1. Default sampler

The default sampler for Red Hat build of Keycloak is traceidratio, which controls the rate of trace sampling based on a specified ratio configurable via the tracing-sampler-ratio property.

6.5.1.1. Trace ratio

The default trace ratio is 1.0, which means all traces are sampled - sent to the collector. The ratio is a floating number in the range [0,1]. For instance, when the ratio is 0.1, only 10% of the traces are sampled.

Warning

For a production-ready environment, the trace ratio should be a smaller number to prevent the massive cost of trace store infrastructure and avoid performance overhead.

Tip

The ratio can be set to 0.0 to disable sampling entirely at runtime.
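
For example, to keep the default traceidratio sampler but sample only 10% of the traces, a command similar to the following sketch could be used.

bin/kc.[sh|bat] start --tracing-enabled=true --tracing-sampler-type=traceidratio --tracing-sampler-ratio=0.1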

6.5.1.2. Rationale

The sampler makes its own sampling decision based on the configured ratio, regardless of the decision made on the parent span, unlike the parentbased_traceidratio sampler.

The parentbased_traceidratio sampler could be the preferred default type as it ensures the sampling consistency between parent and child spans. Specifically, if a parent span is sampled, all its child spans will be sampled as well - the same sampling decision for all. It helps to keep all spans together and prevents storing incomplete traces.

However, it might introduce certain security risks leading to DoS attacks. External callers can manipulate trace headers, parent spans can be injected, and the trace store can be overwhelmed. Proper filtering of HTTP headers (especially tracestate) and adequate measures of caller trust would need to be assessed.

For more information, see the W3C Trace context document.

6.6. Tracing in Kubernetes environment

When tracing is enabled while using the Red Hat build of Keycloak Operator, certain information about the deployment is propagated to the underlying containers.

6.6.1. Configuration via Keycloak CR

You can change tracing configuration via Keycloak CR. For more information, see the Advanced configuration.

You can filter out the required traces in your tracing backend based on their tags:

  • service.name - Red Hat build of Keycloak deployment name
  • k8s.namespace.name - Namespace
  • host.name - Pod name

Red Hat build of Keycloak Operator automatically sets the KC_TRACING_SERVICE_NAME and KC_TRACING_RESOURCE_ATTRIBUTES environment variables for each Red Hat build of Keycloak container included in pods it manages.

Note

The KC_TRACING_RESOURCE_ATTRIBUTES variable always contains (if not overridden) the k8s.namespace.name attribute representing the current namespace.

6.7. Relevant options

Option / Value

log-console-include-trace

Include tracing information in the console log.

If the log-console-format option is specified, this option has no effect.

CLI: --log-console-include-trace
Env: KC_LOG_CONSOLE_INCLUDE_TRACE

Available only when Console log handler and Tracing is activated

true (default), false

log-file-include-trace

Include tracing information in the file log.

If the log-file-format option is specified, this option has no effect.

CLI: --log-file-include-trace
Env: KC_LOG_FILE_INCLUDE_TRACE

Available only when File log handler and Tracing is activated

true (default), false

log-syslog-include-trace

Include tracing information in the Syslog.

If the log-syslog-format option is specified, this option has no effect.

CLI: --log-syslog-include-trace
Env: KC_LOG_SYSLOG_INCLUDE_TRACE

Available only when Syslog handler and Tracing is activated

true (default), false

tracing-compression

OpenTelemetry compression method used to compress payloads.

If unset, compression is disabled.

CLI: --tracing-compression
Env: KC_TRACING_COMPRESSION

Available only when Tracing is enabled

gzip, none (default)

tracing-enabled 🛠

Enables the OpenTelemetry tracing.

CLI: --tracing-enabled
Env: KC_TRACING_ENABLED

Available only when 'opentelemetry' feature is enabled

true, false (default)

tracing-endpoint

OpenTelemetry endpoint to connect to.

CLI: --tracing-endpoint
Env: KC_TRACING_ENDPOINT

Available only when Tracing is enabled

http://localhost:4317 (default)

tracing-jdbc-enabled 🛠

Enables the OpenTelemetry JDBC tracing.

CLI: --tracing-jdbc-enabled
Env: KC_TRACING_JDBC_ENABLED

Available only when Tracing is enabled

true (default), false

tracing-protocol

OpenTelemetry protocol used for the telemetry data.

CLI: --tracing-protocol
Env: KC_TRACING_PROTOCOL

Available only when Tracing is enabled

grpc (default), http/protobuf

tracing-resource-attributes

OpenTelemetry resource attributes present in the exported trace to characterize the telemetry producer.

Values in format key1=val1,key2=val2. For more information, check the Tracing guide.

CLI: --tracing-resource-attributes
Env: KC_TRACING_RESOURCE_ATTRIBUTES

Available only when Tracing is enabled

 

tracing-sampler-ratio

OpenTelemetry sampler ratio.

Probability that a span will be sampled. Expected double value in interval [0,1].

CLI: --tracing-sampler-ratio
Env: KC_TRACING_SAMPLER_RATIO

Available only when Tracing is enabled

1.0 (default)

tracing-sampler-type 🛠

OpenTelemetry sampler to use for tracing.

CLI: --tracing-sampler-type
Env: KC_TRACING_SAMPLER_TYPE

Available only when Tracing is enabled

always_on, always_off, traceidratio (default), parentbased_always_on, parentbased_always_off, parentbased_traceidratio

tracing-service-name

OpenTelemetry service name.

Takes precedence over service.name defined in the tracing-resource-attributes property.

CLI: --tracing-service-name
Env: KC_TRACING_SERVICE_NAME

Available only when Tracing is enabled

keycloak (default)

Chapter 7. Visualizing activities in dashboards

Install the Red Hat build of Keycloak Grafana dashboards to visualize the metrics that capture the status and activities of your deployment.

Red Hat build of Keycloak provides metrics to observe what is happening inside the deployment. To understand how metrics evolve over time, it is helpful to collect and visualize them in graphs.

This guide provides instructions on how to visualize collected Red Hat build of Keycloak metrics in a running Grafana instance.

7.1. Prerequisites

  • Red Hat build of Keycloak metrics are enabled. Follow the Gaining insights with metrics chapter for more details.
  • A Grafana instance is running, and Red Hat build of Keycloak metrics are collected into a Prometheus instance.
  • For the HTTP request latency heatmaps to work, enable histograms for HTTP metrics by setting http-metrics-histograms-enabled to true.

7.2. Red Hat build of Keycloak Grafana dashboards

Grafana dashboards are distributed in the form of a JSON file that is imported into a Grafana instance. JSON definitions of Red Hat build of Keycloak Grafana dashboards are available in the keycloak/keycloak-grafana-dashboard GitHub repository.

Follow these steps to download JSON file definitions.

  1. Identify the branch from keycloak-grafana-dashboards to use from the following table.

    Red Hat build of Keycloak version / keycloak-grafana-dashboards branch

    >= 26.1

    main

  2. Clone the GitHub repository

    git clone -b BRANCH_FROM_STEP_1 https://github.com/keycloak/keycloak-grafana-dashboard.git
  3. The dashboards are available in the directory keycloak-grafana-dashboard/dashboards.

The following sections describe the purpose of each dashboard.

7.2.1. Keycloak troubleshooting dashboard

This dashboard is available in the JSON file: keycloak-troubleshooting-dashboard.json.

On the top of the dashboard, graphs display the service level indicators as defined in Monitoring performance with Service Level Indicators. This dashboard can also be used while troubleshooting a Red Hat build of Keycloak deployment following the Troubleshooting using metrics chapter, for example, when SLI graphs do not show expected results.

Figure 7.1. Troubleshooting dashboard

7.2.2. Keycloak capacity planning dashboard

This dashboard is available in the JSON file: keycloak-capacity-planning-dashboard.json.

This dashboard shows metrics that are important when estimating the load handled by a Red Hat build of Keycloak deployment. For example, it shows the number of password validations or login flows performed by Red Hat build of Keycloak. For more detail on these metrics, see the chapter Self-provided metrics.

Note

Red Hat build of Keycloak event metrics must be enabled for this dashboard to work correctly. To enable them, see the chapter Monitoring user activities with event metrics.

Figure 7.2. Capacity planning dashboard

7.3. Import a dashboard

  1. Open the dashboard page from the left Grafana menu.
  2. Click New and Import.
  3. Click Upload dashboard JSON file and select the JSON file of the dashboard you want to import.
  4. Pick your Prometheus datasource.
  5. Click Import.

7.4. Export a dashboard

Exporting a dashboard to JSON format may be useful. For example, you may want to suggest a change in our dashboard repository.

  1. Open a dashboard you would like to export.
  2. Click share in the top left corner next to the dashboard name.
  3. Click the Export tab.
  4. Enable Export for sharing externally.
  5. Click either Save to file or View JSON and Copy to Clipboard according to where you want to store the resulting JSON.

7.5. Further reading

Continue reading on how to connect traces to dashboards in the Analyzing outliers and errors with exemplars chapter.

Chapter 8. Analyzing outliers and errors with exemplars

Use exemplars to connect a metric to a recorded trace to analyze the root cause of errors or latencies.

Metrics are aggregations over several events, and show you if your system is operating within defined bounds. They are great for monitoring error rates or tail latencies and for setting up alerting or driving performance optimizations. Still, the aggregation makes it difficult to find root causes for latencies or errors reported in metrics.

Root causes for errors and latencies can be found by enabling tracing. To connect a metric to a recorded trace, there is the concept of exemplars.

Once exemplars are set up, Red Hat build of Keycloak reports metrics with their last recorded trace as an exemplar. A dashboard tool like Grafana can link the exemplar from a metrics dashboard to a trace view.

Metrics that support exemplars are:

  • http_server_requests_seconds_count (including histograms)
    See the chapter HTTP metrics for details on this metric.
  • keycloak_credentials_password_hashing_validations_total
    See the chapter Self-provided metrics for details on this metric.
  • keycloak_user_events_total
    See the chapter Self-provided metrics for details on this metric.

See below for a screenshot of a heatmap visualization for latencies that shows an exemplar when hovering over one of the pink indicators.

Figure 8.1. Heatmap diagram with exemplar

8.1. Setting up exemplars

To benefit from exemplars, perform the following steps:

  1. Enable metrics for Red Hat build of Keycloak as described in chapter Gaining insights with metrics.
  2. Enable tracing for Red Hat build of Keycloak as described in chapter Root cause analysis with tracing.
  3. Enable exemplar storage in your monitoring system.

    For Prometheus, this is a preview feature that you need to enable.

  4. Scrape the metrics using the OpenMetricsText1.0.0 protocol, which is not enabled by default in Prometheus.

    If you are using PodMonitors or similar in a Kubernetes environment, this can be achieved by adding the scrape protocol to the spec of the custom resource:

    apiVersion: monitoring.coreos.com/v1
    kind: PodMonitor
    metadata:
      ...
    spec:
      scrapeProtocols:
        - OpenMetricsText1.0.0
  5. Configure your metrics datasource with the location to link to for traces.

    When using Grafana and Prometheus, this means setting up exemplarTraceIdDestinations for the Prometheus datasource, which then points to your tracing datasource provided by tools like Jaeger or Tempo; see the provisioning sketch after this list.

  6. Enable exemplars in your dashboards.

    Enable the Exemplars toggle in each query on each dashboard where you want to show exemplars. When set up correctly, you will notice little dots or stars in your dashboards that you can click on to view the traces.
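
The following is a minimal sketch of a Grafana datasource provisioning file with such a configuration; the URL and the datasourceUid are placeholders and must match your own Prometheus instance and the UID of your Jaeger or Tempo datasource.

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090          # placeholder: your Prometheus instance
    jsonData:
      exemplarTraceIdDestinations:
        - name: trace_id                  # exemplar label that carries the trace ID
          datasourceUid: my-tracing-uid   # placeholder: UID of your Jaeger or Tempo datasource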

Note
  • If you do not specify the scrape protocol, Prometheus will by default not send it in the content negotiation, and Keycloak will then fall back to the PrometheusText protocol which will not contain the exemplars.
  • If you enabled tracing and metrics, but the request sampling did not record a trace, the exposed metric will not contain any exemplars.
  • If you access the metrics endpoint with your browser, the content negotiation will lead to the format PrometheusText being returned, and you will not see any exemplars.

8.2. Verifying that exemplars work as expected

Perform the following steps to verify that Red Hat build of Keycloak is set up correctly for exemplars:

  1. Follow the instructions to set up metrics and tracing for Red Hat build of Keycloak.
  2. For test purposes, record all traces by setting the tracing ratio to 1.0. See Root cause analysis with tracing for recommended sampling settings in production systems.
  3. Log in to the Keycloak instance to create some traces.
  4. Scrape the metrics with a command similar to the following and search for those metrics that have an exemplar set:

    $ curl -s http://localhost:9000/metrics \
    -H 'Accept: application/openmetrics-text; version=1.0.0; charset=utf-8' \
    | grep "#.*trace_id"

    This should result in an output similar to the following. Note the additional # after which the span and trace IDs are added:

    http_server_requests_seconds_count {...} ... # {span_id="...",trace_id="..."} ...

Legal Notice

Copyright © 2025 Red Hat, Inc.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.