Observability Guide

Red Hat build of Keycloak 26.2

Red Hat Customer Content Services

Abstract

This guide helps administrators to monitor and troubleshoot Red Hat build of Keycloak 26.2 with health checks, metrics, dashboards and tracing.

Chapter 1. Tracking instance status with health checks
Copy link

Check if an instance has finished its start up and is ready to serve requests by calling its health REST endpoints.

Red Hat build of Keycloak has built in support for health checks. This chapter describes how to enable and use the Red Hat build of Keycloak health checks. The Red Hat build of Keycloak health checks are exposed on the management port 9000 by default. For more details, see Configuring the Management Interface

1.1. Red Hat build of Keycloak health check endpoints
Copy link

Red Hat build of Keycloak exposes 4 health endpoints:

/health/live
/health/ready
/health/started
/health

See the Quarkus SmallRye Health docs for information on the meaning of each endpoint.

These endpoints respond with HTTP status 200 OK on success or 503 Service Unavailable on failure, and a JSON object like the following:

Successful response for endpoints without additional per-check information:

{
    "status": "UP",
    "checks": []
}

{
    "status": "UP",
    "checks": []
}

Copy to Clipboard

Toggle word wrap

Successful response for endpoints with information on the database connection:

{
    "status": "UP",
    "checks": [
        {
            "name": "Keycloak database connections health check",
            "status": "UP"
        }
    ]
}

{
    "status": "UP",
    "checks": [
        {
            "name": "Keycloak database connections health check",
            "status": "UP"
        }
    ]
}

Copy to Clipboard

Toggle word wrap

1.2. Enabling the health checks
Copy link

It is possible to enable the health checks using the build time option health-enabled:

bin/kc.[sh|bat] build --health-enabled=true

bin/kc.[sh|bat] build --health-enabled=true

Copy to Clipboard

Toggle word wrap

By default, no check is returned from the health endpoints.

1.3. Using the health checks
Copy link

It is recommended that the health endpoints be monitored by external HTTP requests. Due to security measures that remove curl and other packages from the Red Hat build of Keycloak container image, local command-based monitoring will not function easily.

If you are not using Red Hat build of Keycloak in a container, use whatever you want to access the health check endpoints.

1.3.1. curl
Copy link

You may use a simple HTTP HEAD request to determine the live or ready state of Red Hat build of Keycloak. curl is a good HTTP client for this purpose.

If Red Hat build of Keycloak is deployed in a container, you must run this command from outside it due to the previously mentioned security measures. For example:

curl --head -fsS http://localhost:9000/health/ready

curl --head -fsS http://localhost:9000/health/ready

Copy to Clipboard

Toggle word wrap

If the command returns with status 0, then Red Hat build of Keycloak is live or ready, depending on which endpoint you called. Otherwise there is a problem.

1.3.2. Kubernetes
Copy link

Define a HTTP Probe so that Kubernetes may externally monitor the health endpoints. Do not use a liveness command.

1.3.3. HEALTHCHECK
Copy link

The Containerfile HEALTHCHECK instruction defines a command that will be periodically executed inside the container as it runs. The Red Hat build of Keycloak container does not have any CLI HTTP clients installed. Consider installing curl as an additional RPM, as detailed by the Running Red Hat build of Keycloak in a container chapter. Note that your container may be less secure because of this.

1.4. Available Checks
Copy link

The table below shows the available checks.

Expand

Check	Description	Requires Metrics
Database	Returns the status of the database connection pool.	Yes

For some checks, you’ll need to also enable metrics as indicated by the Requires Metrics column. To enable metrics use the metrics-enabled option as follows:

bin/kc.[sh|bat] build --health-enabled=true --metrics-enabled=true

bin/kc.[sh|bat] build --health-enabled=true --metrics-enabled=true

Copy to Clipboard

Toggle word wrap

1.5. Relevant options
Copy link

Expand

Value

	Value
`health-enabled` 🛠 If the server should expose health check endpoints. If enabled, health checks are available at the `/health`, `/health/ready` and `/health/live` endpoints. CLI: `--health-enabled` Env: `KC_HEALTH_ENABLED`	`true`, `false` (default)

health-enabled 🛠

If the server should expose health check endpoints.

If enabled, health checks are available at the /health, /health/ready and /health/live endpoints.

CLI: --health-enabled
Env: KC_HEALTH_ENABLED

true, false (default)

Chapter 2. Gaining insights with metrics
Copy link

Collect metrics to gain insights about state and activities of a running instance of Red Hat build of Keycloak.

Red Hat build of Keycloak has built in support for metrics. This chapter describes how to enable and configure server metrics.

2.1. Enabling Metrics
Copy link

It is possible to enable metrics using the build time option metrics-enabled:

bin/kc.[sh|bat] start --metrics-enabled=true

bin/kc.[sh|bat] start --metrics-enabled=true

Copy to Clipboard

Toggle word wrap

2.2. Querying Metrics
Copy link

Red Hat build of Keycloak exposes metrics at the following endpoint on the management interface at:

/metrics

For more information about the management interface, see Configuring the Management Interface. The response from the endpoint uses a application/openmetrics-text content type and it is based on the Prometheus (OpenMetrics) text format. The snippet below is an example of a response:

HELP base_gc_total Displays the total number of collections that have occurred. This attribute lists -1 if the collection count is undefined for this collector.
TYPE base_gc_total counter
HELP jvm_memory_usage_after_gc_percent The percentage of long-lived heap pool used after the last GC event, in the range [0..1]
TYPE jvm_memory_usage_after_gc_percent gauge
HELP jvm_threads_peak_threads The peak live thread count since the Java virtual machine started or peak was reset
TYPE jvm_threads_peak_threads gauge
HELP agroal_active_count Number of active connections. These connections are in use and not available to be acquired.
TYPE agroal_active_count gauge
HELP base_memory_maxHeap_bytes Displays the maximum amount of memory, in bytes, that can be used for memory management.
TYPE base_memory_maxHeap_bytes gauge
HELP process_start_time_seconds Start time of the process since unix epoch.
TYPE process_start_time_seconds gauge
HELP system_load_average_1m The sum of the number of runnable entities queued to available processors and the number of runnable entities running on the available processors averaged over a period of time
TYPE system_load_average_1m gauge

# HELP base_gc_total Displays the total number of collections that have occurred. This attribute lists -1 if the collection count is undefined for this collector.
# TYPE base_gc_total counter
base_gc_total{name="G1 Young Generation",} 14.0
# HELP jvm_memory_usage_after_gc_percent The percentage of long-lived heap pool used after the last GC event, in the range [0..1]
# TYPE jvm_memory_usage_after_gc_percent gauge
jvm_memory_usage_after_gc_percent{area="heap",pool="long-lived",} 0.0
# HELP jvm_threads_peak_threads The peak live thread count since the Java virtual machine started or peak was reset
# TYPE jvm_threads_peak_threads gauge
jvm_threads_peak_threads 113.0
# HELP agroal_active_count Number of active connections. These connections are in use and not available to be acquired.
# TYPE agroal_active_count gauge
agroal_active_count{datasource="default",} 0.0
# HELP base_memory_maxHeap_bytes Displays the maximum amount of memory, in bytes, that can be used for memory management.
# TYPE base_memory_maxHeap_bytes gauge
base_memory_maxHeap_bytes 1.6781410304E10
# HELP process_start_time_seconds Start time of the process since unix epoch.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.675188449054E9
# HELP system_load_average_1m The sum of the number of runnable entities queued to available processors and the number of runnable entities running on the available processors averaged over a period of time
# TYPE system_load_average_1m gauge
system_load_average_1m 4.005859375

...

Copy to Clipboard

Toggle word wrap

2.3. Next steps
Copy link

Read the chapters Monitoring performance with Service Level Indicators and Troubleshooting using metrics to see how to use the metrics.

2.4. Relevant options
Copy link

Expand

	Value
`cache-metrics-histograms-enabled` Enable histograms for metrics for the embedded caches. CLI: `--cache-metrics-histograms-enabled` Env: `KC_CACHE_METRICS_HISTOGRAMS_ENABLED` Available only when metrics are enabled	`true`, `false` (default)
`http-metrics-histograms-enabled` Enables a histogram with default buckets for the duration of HTTP server requests. CLI: `--http-metrics-histograms-enabled` Env: `KC_HTTP_METRICS_HISTOGRAMS_ENABLED` Available only when metrics are enabled	`true`, `false` (default)
`http-metrics-slos` Service level objectives for HTTP server requests. Use this instead of the default histogram, or use it in combination to add additional buckets. Specify a list of comma-separated values defined in milliseconds. Example with buckets from 5ms to 10s: 5,10,25,50,250,500,1000,2500,5000,10000 CLI: `--http-metrics-slos` Env: `KC_HTTP_METRICS_SLOS` Available only when metrics are enabled
`metrics-enabled` 🛠 If the server should expose metrics. If enabled, metrics are available at the `/metrics` endpoint. CLI: `--metrics-enabled` Env: `KC_METRICS_ENABLED`	`true`, `false` (default)

Chapter 3. Monitoring user activities with event metrics
Copy link

Event metrics provide an aggregated view of user activities in a Red Hat build of Keycloak instance.

For now, only metrics for user events are captured. For example, you can monitor the number of logins, login failures, or token refreshes performed.

The metrics are exposed using the standard metrics endpoint, and you can use it in your own metrics collection system to create dashboards and alerts.

The metrics are reported as counters per Red Hat build of Keycloak instance. The counters are reset on the restart of the instance. If you have multiple instances running in a cluster, you will need to collect the metrics from all instances and aggregate them to get per a cluster view.

3.1. Enable event metrics
Copy link

To start collecting event metrics, enable metrics and enable the metrics for user events.

The following shows the required startup parameters:

bin/kc.[sh|bat] start --metrics-enabled=true --event-metrics-user-enabled=true ...

bin/kc.[sh|bat] start --metrics-enabled=true --event-metrics-user-enabled=true ...

Copy to Clipboard

Toggle word wrap

By default, there is a separate metric for each realm. To break down the metric by client and identity provider, you can add those metrics dimension using the configuration option event-metrics-user-tags. This can be useful on installations with a small number of clients and IDPs. This is not recommended for installations with a large number of clients or IDPs as it will increase the memory usage of Red Hat build of Keycloak and as it will increase the load on your monitoring system.

The following shows how to configure Red Hat build of Keycloak to break down the metrics by all three metrics dimensions:

bin/kc.[sh|bat] start ... --event-metrics-user-tags=realm,idp,clientId ...

bin/kc.[sh|bat] start ... --event-metrics-user-tags=realm,idp,clientId ...

Copy to Clipboard

Toggle word wrap

You can limit the events for which Red Hat build of Keycloak will expose metrics. See the Server Administration Guide on event types for an overview of the available events.

The following example limits the events collected to LOGIN and LOGOUT events:

bin/kc.[sh|bat] start ... --event-metrics-user-events=login,logout ...

bin/kc.[sh|bat] start ... --event-metrics-user-events=login,logout ...

Copy to Clipboard

Toggle word wrap

See Self-provided metrics for a description of the metrics collected.

3.2. Relevant options
Copy link

Expand

	Value
`metrics-enabled` 🛠 If the server should expose metrics. If enabled, metrics are available at the `/metrics` endpoint. CLI: `--metrics-enabled` Env: `KC_METRICS_ENABLED`	`true`, `false` (default)
`event-metrics-user-enabled` 🛠 Create metrics based on user events. CLI: `--event-metrics-user-enabled` Env: `KC_EVENT_METRICS_USER_ENABLED` Available only when metrics are enabled and feature user-event-metrics is enabled	`true`, `false` (default)
`event-metrics-user-events` Comma-separated list of events to be collected for user event metrics. This option can be used to reduce the number of metrics created as by default all user events create a metric. CLI: `--event-metrics-user-events` Env: `KC_EVENT_METRICS_USER_EVENTS` Available only when user event metrics are enabled Use `remove_credential` instead of `remove_totp`, and `update_credential` instead of `update_totp` and `update_password`. Deprecated values: `remove_totp`, `update_totp`, `update_password`	`authreqid_to_token`, `client_delete`, `client_info`, `client_initiated_account_linking`, `client_login`, `client_register`, `client_update`, `code_to_token`, `custom_required_action`, `delete_account`, `execute_action_token`, `execute_actions`, `federated_identity_link`, `federated_identity_override_link`, `grant_consent`, `identity_provider_first_login`, `identity_provider_link_account`, `identity_provider_login`, `identity_provider_post_login`, `identity_provider_response`, `identity_provider_retrieve_token`, `impersonate`, `introspect_token`, `invalid_signature`, `invite_org`, `login`, `logout`, `oauth2_device_auth`, `oauth2_device_code_to_token`, `oauth2_device_verify_user_code`, `oauth2_extension_grant`, `permission_token`, `pushed_authorization_request`, `refresh_token`, `register`, `register_node`, `remove_credential`, `remove_federated_identity`, `remove_totp` (deprecated), `reset_password`, `restart_authentication`, `revoke_grant`, `send_identity_provider_link`, `send_reset_password`, `send_verify_email`, `token_exchange`, `unregister_node`, `update_consent`, `update_credential`, `update_email`, `update_password` (deprecated), `update_profile`, `update_totp` (deprecated), `user_disabled_by_permanent_lockout`, `user_disabled_by_temporary_lockout`, `user_info_request`, `verify_email`, `verify_profile`
`event-metrics-user-tags` Comma-separated list of tags to be collected for user event metrics. By default only `realm` is enabled to avoid a high metrics cardinality. CLI: `--event-metrics-user-tags` Env: `KC_EVENT_METRICS_USER_TAGS` Available only when user event metrics are enabled	`realm`, `idp`, `clientId`

Chapter 4. Monitoring performance with Service Level Indicators
Copy link

Track performance and reliability as perceived by users with Service Level Indicators (SLIs) and Service Level Objectives (SLOs).

Service Level Indicators (SLIs) and Service Level Objectives (SLOs) are essential components in monitoring and maintaining the performance and reliability of Red Hat build of Keycloak in production environments.

The Google Site Reliability Engineering book defines this as follows:

A Service Level Indicator (SLI) is a carefully defined quantitative measure of some aspect of the level of service that is provided.
A Service level objective (SLO) is a target value or range of values for a service level that is measured by an SLI.

By agreeing those with the stakeholders and tracking these, service owners can ensure that deployments are aligned with user’s expectations and that they neither over- nor under-deliver on the service they provide.

4.1. Prerequisites
Copy link

Metrics need to be enabled for Red Hat build of Keycloak, and the http-metrics-slos option needs to be set to latency to be measured for the SLO defined below. Follow Gaining insights with metrics chapter for more details.
A monitoring system collecting the metrics. The following paragraphs assume Prometheus or a similar system is used that supports the PromQL query language.

4.2. Definition of the service delivered
Copy link

The following service definition is used in the next steps to identify the appropriate SLIs and SLOs. It should capture the behavior observed by its users.

As a Red Hat build of Keycloak user,

I want to be able to log in,
refresh my token and
log out,

so that I can use the applications that use Red Hat build of Keycloak for authentication.

4.3. Definition of SLI and SLO
Copy link

The following provides example SLIs and SLOs based on the service description above and the metrics available in Red Hat build of Keycloak.

Note

While these SLOs are independent of the actual load the system, this is expected as a single user does not care about the system load if they get slow responses.

At the same time, if you enter a Service Level Agreement (SLA) with stakeholders, you as the one running Red Hat build of Keycloak have an interest to define limits of the traffic Red Hat build of Keycloak receives, as response times will be prolonged and error rates might increase as the load of the system increases and scaling thresholds are reached.

Expand

Characteristic	Service Level Indicator	Service Level Objective^*	Metric Source
Availability	Percentage of the time Red Hat build of Keycloak is able to answer requests as measured by the monitoring system	Red Hat build of Keycloak should be available 99.9% of the time within a month (44 minutes unavailability per month).	Use the Prometheus `up` metric which indicates if the Prometheus server is able to scrape metrics from the Red Hat build of Keycloak instances.
Latency	Response time for authentication related HTTP requests as measured by the server	95% of all authentication related requests should be faster than 250 ms within 30 days.	Red Hat build of Keycloak server-side metrics to track latency for specific endpoints along with Response Time Distribution using `http_server_requests_seconds_bucket` and `http_server_requests_seconds_count`.
Errors	Failed authentication requests due to server problems as measured by the server	The rate of errors due to server problems for authentication requests should be less than 0.1% within 30 days.	Identify server side error by filtering the metric `http_server_requests_seconds_count` on the tag `outcome` for value `SERVER_ERROR`.

^* These SLO target values are an example and should be tailored to fit your use case and deployment.

4.4. PromQL queries
Copy link

These are example queries created in a Kubernetes environment and are used with Prometheus as a monitoring tool. They are provided as blueprints, and you will need to adapt them for a different runtime or monitoring environment.

Note

For a production environment, you might want to replace those queries or subqueries with a recording rule to make sure they do not use too many resources if you want to use them for alerting or live dashboards.

4.4.1. Availability
Copy link

This metric will have a value of at least one if the Red Hat build of Keycloak instances is available and responding to Prometheus scrape requests, and 0 if the service is down or unreachable.

Then use a tool like Grafana to show a 30-day time range and let it calculate the average of the metric in that time window.

count_over_time(
  sum (up{
    container="keycloak", 
    namespace="$namespace"
  } > 0)[30d:15s]
) 
/
count_over_time(vector(1)[30d:15s])

count_over_time(
  sum (up{
    container="keycloak",


    namespace="$namespace"
  } > 0)[30d:15s]
)


/
count_over_time(vector(1)[30d:15s])

Copy to Clipboard

Toggle word wrap

1: Filter by additional tags to identify Red Hat build of Keycloak nodes
2: Count all data points in the given range and interval when at least one Red Hat build of Keycloak node was available
3: Divide by the number of all data points in the same range and interval

Note

In Grafana you can replace value 30d:15s with $range:$interval to compute availability SLI in the time range selected for the dashboard.

4.4.2. Latency of authentication requests
Copy link

This Prometheus query calculates the percentage of authentication requests that completed within 0.25 seconds relative to all authentication requests for specific Red Hat build of Keycloak endpoints, targeting a particular namespace and pod, over the past 30 days.

This example requires the Red Hat build of Keycloak configuration http-metrics-slos to contain value 250 indicating that buckets for requests faster and slower than 250 ms should be recorded. Setting http-metrics-histograms-enabled to true would capture additional buckets which can help with performance troubleshooting.

sum(
  rate(
    http_server_requests_seconds_bucket{
      uri=~"/realms/{realm}/protocol/{protocol}.*|/realms/{realm}/login-actions.*", 
      le="0.25", 
      container="keycloak", 
      namespace="$namespace"}
    [30d] 
  )
) without (le,uri,status,outcome,method,pod,instance) 
/
sum(
  rate(
    http_server_requests_seconds_count{
      uri=~"/realms/{realm}/protocol/{protocol}.*|/realms/{realm}/login-actions.*", 
      container="keycloak",
      namespace="$namespace"}
    [30d] 
  )
) without (le,uri,status,outcome,method,pod,instance)

sum(
  rate(
    http_server_requests_seconds_bucket{
      uri=~"/realms/{realm}/protocol/{protocol}.*|/realms/{realm}/login-actions.*",


      le="0.25",


      container="keycloak",


      namespace="$namespace"}
    [30d]


  )
) without (le,uri,status,outcome,method,pod,instance)


/
sum(
  rate(
    http_server_requests_seconds_count{
      uri=~"/realms/{realm}/protocol/{protocol}.*|/realms/{realm}/login-actions.*",


      container="keycloak",
      namespace="$namespace"}
    [30d]


  )
) without (le,uri,status,outcome,method,pod,instance)

Copy to Clipboard

Toggle word wrap

1 6: URLs related to logging in
2: Response time as defined by SLO
3 7: Filter by additional tags to identify Red Hat build of Keycloak nodes
4: Time range as specified by the SLO
5 8: Ignore as many labels necessary to create a single sum

Note

In Grafana, you can replace value 30d with $__range to compute latency SLI in the time range selected for the dashboard.

4.4.3. Errors for authentication requests
Copy link

This Prometheus query calculates the percentage of authentication requests that returned a server side error for all authentication requests, targeting a particular namespace, over the past 30 days.

sum(
  rate(
    http_server_requests_seconds_count{
      uri=~"/realms/{realm}/protocol/{protocol}.*|/realms/{realm}/login-actions.*", 
      outcome="SERVER_ERROR", 
      container="keycloak", 
      namespace="$namespace"}
    [30d] 
  )
) without (le,uri,status,outcome,method,pod,instance) 
/
sum(
  rate(
    http_server_requests_seconds_count{
      uri=~"/realms/{realm}/protocol/{protocol}.*|/realms/{realm}/login-actions.*", 
      container="keycloak", 
      namespace="$namespace"}
    [30d] 
  )
) without (le,uri,status,outcome,method,pod,instance)

sum(
  rate(
    http_server_requests_seconds_count{
      uri=~"/realms/{realm}/protocol/{protocol}.*|/realms/{realm}/login-actions.*",


      outcome="SERVER_ERROR",


      container="keycloak",


      namespace="$namespace"}
    [30d]


  )
) without (le,uri,status,outcome,method,pod,instance)


/
sum(
  rate(
    http_server_requests_seconds_count{
      uri=~"/realms/{realm}/protocol/{protocol}.*|/realms/{realm}/login-actions.*",


      container="keycloak",


      namespace="$namespace"}
    [30d]


  )
) without (le,uri,status,outcome,method,pod,instance)

Copy to Clipboard

Toggle word wrap

1 6: URLs related to logging in
2: Filter for all requests that responded with a server error (HTTP status 5xx)
3 7: Filter by additional tags to identify Red Hat build of Keycloak nodes
4 8: Time range as specified by the SLO
5 9: Ignore as many labels necessary to create a single sum

Note

In Grafana, you can replace value 30d with $__range to compute errors SLI in the time range selected for the dashboard.

4.5. Further Reading
Copy link

Chapter 5. Troubleshooting using metrics
Copy link

Use metrics for troubleshooting errors and performance issues.

For a running Red Hat build of Keycloak deployment it is important to understand how the system performs and whether it meets your service level objectives (SLOs). For more details on SLOs, proceed to the Monitoring performance with Service Level Indicators chapter.

This guide will provide directions to answer the question: “What can I do when my SLOs are not met?”

Red Hat build of Keycloak consists of several components where an issue or misconfiguration of one of them can move your service level indicators to undesirable numbers.

A guidance provided by this guide is illustrated in the following example:

Observation: Latency service level objective is not met.

Metrics that indicate a problem:

Red Hat build of Keycloak’s database connection pool is often exhausted, and there are threads queuing for a connection to be retrieved from the pool.
Red Hat build of Keycloak’s users cache hit ratio is at a low percentage, around 5%. This means only 1 out of 20 user searches is able to obtain user data from the cache and the rest needs to load it from the database.

Possible mitigations suggested:

Increasing the users cache size to a higher number which would decrease the number of reads from the database.
Increasing the number of connections in the connection pool. This would need to be checked with metrics for your database and tuning it for a higher load, for example, by increasing the number of available processors.

Note

This guide focuses on Red Hat build of Keycloak metrics. Troubleshooting the database itself is out of scope.
This guide provides general guidance. You should always confirm the configuration change by conducting a performance test comparing the metrics in question for the old and the new configuration.

Note

Grafana dashboards for the metrics below can be found in Visualizing activities in dashboards chapter.

5.1. List of Red Hat build of Keycloak key metrics
Copy link

Self-provided metrics
JVM metrics
Database Metrics
HTTP metrics
Single site metrics (without external Data Grid)
- Clustering metrics
- Embedded Infinispan metrics for single site deployments
Multiple sites metrics (as described in Multi-site deployments)
- Embedded Infinispan metrics for multi-site deployments
- External Data Grid metrics

5.2. Self-provided metrics
Copy link

Learn about the key metrics that Red Hat build of Keycloak provides.

This is part of the Troubleshooting using metrics chapter.

5.2.1. Prerequisites
Copy link

Metrics need to be enabled for Red Hat build of Keycloak. Follow the Gaining insights with metrics chapter for more details.
A monitoring system collecting the metrics.

5.2.2. Metrics
Copy link

5.2.2.1. User Event Metrics
Copy link

User event metrics are disabled by default. See Monitoring user activities with event metrics on how to enable them and how to configure which tags are recorded.

Expand

Metric	Description
`keycloak_user_events_total`	Counting the occurrence of user events.

Tags

The tags client_id and idp are disabled by default to avoid a too high cardinality.

realm: Realm
client_id: Client ID
idp: Identity Provider
event: User event, for example login or logout. See the Server Administration Guide on event types for an overview of the available events.
error: Error specific to the event, for example invalid_user_credentials for the event login. Empty string if no error occurred.

The snippet below is an example of a response provided by the metric endpoint:

HELP keycloak_user_events_total Keycloak user events
TYPE keycloak_user_events_total counter

# HELP keycloak_user_events_total Keycloak user events
# TYPE keycloak_user_events_total counter
keycloak_user_events_total{client_id="security-admin-console",error="",event="code_to_token",idp="",realm="master",} 1.0
keycloak_user_events_total{client_id="security-admin-console",error="",event="login",idp="",realm="master",} 1.0
keycloak_user_events_total{client_id="security-admin-console",error="",event="logout",idp="",realm="master",} 1.0
keycloak_user_events_total{client_id="security-admin-console",error="invalid_user_credentials",event="login",idp="",realm="master",} 1.0

Copy to Clipboard

Toggle word wrap

5.2.2.2. Password hashing
Copy link

Expand

Metric	Description
`keycloak_credentials_password_hashing_validations_total`	Counting password hashes validations.

Tags

realm

Realm

algorithm

Algorithm used for hashing password, for example argon2

hashing_strength

String denoting strength of hashing algorithm, for example, number of iterations depending on the algorithm. For example, Argon2id-1.3[m=7168,t=5,p=1]

outcome

Outcome of password validation. Possible values:

valid: Password correct
invalid: Password incorrect
error: Error when creating the hash of the password

To configure what tags are available provide a comma-separated list of tag names to the following option spi-credential-keycloak-password-validations-counter-tags. By default, all tags are enabled.

The snippet below is an example of a response provided by the metric endpoint:

HELP keycloak_credentials_password_hashing_validations_total Password validations
TYPE keycloak_credentials_password_hashing_validations_total counter

# HELP keycloak_credentials_password_hashing_validations_total Password validations
# TYPE keycloak_credentials_password_hashing_validations_total counter
keycloak_credentials_password_hashing_validations_total{algorithm="argon2",hashing_strength="Argon2id-1.3[m=7168,t=5,p=1]",outcome="valid",realm="realm-0",} 39949.0

Copy to Clipboard

Toggle word wrap

5.2.3. Next steps
Copy link

Return back to the Troubleshooting using metrics or proceed to JVM metrics.

5.3. JVM metrics
Copy link

Use JVM metrics to observe performance of Red Hat build of Keycloak.

This is part of the Troubleshooting using metrics chapter.

5.3.1. Prerequisites
Copy link

Metrics need to be enabled for Red Hat build of Keycloak. Follow the Gaining insights with metrics chapter for more details.
A monitoring system collecting the metrics.

5.3.2. Metrics
Copy link

5.3.2.1. JVM info
Copy link

Expand

Metric	Description
`jvm_info_total`	Information about the JVM such as version, runtime and vendor.

5.3.2.2. Heap memory usage
Copy link

Expand

Metric	Description
`jvm_memory_committed_bytes`	The amount of memory that the JVM has committed for use, reflecting the portion of the allocated memory that is guaranteed to be available for the JVM to use.
`jvm_memory_used_bytes`	The amount of memory currently used by the JVM, indicating the actual memory consumption by the application and JVM internals.

5.3.2.3. Garbage collection
Copy link

Expand

Metric	Description
`jvm_gc_pause_seconds_max`	The maximum duration, in seconds, of garbage collection pauses experienced by the JVM due to a particular cause, which helps you quickly differentiate between types of GC (minor, major) pauses.
`jvm_gc_pause_seconds_sum`	The total cumulative time spent in garbage collection pauses, indicating the impact of GC pauses on application performance in the JVM.
`jvm_gc_pause_seconds_count`	Counts the total number of garbage collection pause events, helping to assess the frequency of GC pauses in the JVM.
`jvm_gc_overhead`	The percentage of CPU time spent on garbage collection, indicating the impact of GC on application performance in the JVM. It refers to the proportion of the total CPU processing time that is dedicated to executing garbage collection (GC) operations, as opposed to running application code or performing other tasks. This metric helps determine how much overhead GC introduces, affecting the overall performance of the Red Hat build of Keycloak’s JVM.

5.3.2.4. CPU Usage in Kubernetes
Copy link

Expand

Metric	Description
`container_cpu_usage_seconds_total`	Cumulative CPU time consumed by the container in core-seconds.

5.3.3. Next steps
Copy link

Return back to the Troubleshooting using metrics or proceed to Database Metrics.

5.4. Database Metrics
Copy link

Use metrics to describe Red Hat build of Keycloak’s connection to the database.

This is part of the Troubleshooting using metrics chapter.

5.4.1. Prerequisites
Copy link

Metrics need to be enabled for Red Hat build of Keycloak. Follow the Gaining insights with metrics chapter for more details.
A monitoring system collecting the metrics.

5.4.2. Database connection pool metrics
Copy link

Configure Red Hat build of Keycloak to use a fixed size database connection pool. See the Concepts for database connection pools chapter for more information.

Tip

If there is a high count of threads waiting for a database connection, increasing the database connection pool size is not always the best option. It might overload the database which would then become the bottleneck. Consider the following options instead:

Reduce the number of HTTP worker threads using the option http-pool-max-threads to make it match the available database connections, and thereby reduce contention and resource usage in Red Hat build of Keycloak and increase throughput.
Check which database statements are executed on the database. If you see, for example, a lot of information about clients and groups being fetched, and the users and realms cache are full, this might indicate that it is time to increase the sizes of those caches and see if this reduces your database load.

Expand

Metric	Description
`agroal_available_count`	Idle database connections.
`agroal_active_count`	Database connections used in ongoing transactions.
`agroal_awaiting_count`	Threads waiting for a database connection to become available.

5.4.3. Next steps
Copy link

Return back to the Troubleshooting using metrics or proceed to HTTP metrics.

5.5. HTTP metrics
Copy link

Use metrics to monitor the Red Hat build of Keycloak HTTP requests processing.

This is part of the Troubleshooting using metrics chapter.

5.5.1. Prerequisites
Copy link

Metrics need to be enabled for Red Hat build of Keycloak. Follow the Gaining insights with metrics chapter for more details.
A monitoring system collecting the metrics.

5.5.2. Metrics
Copy link

5.5.2.1. Processing time
Copy link

The processing time is exposed by these metrics, to monitor the Red Hat build of Keycloak performance and how long it takes to processing the requests.

Tip

On a healthy cluster, the average processing time will remain stable. Spikes or increases in the processing time may be an early sign that some node is under load.

Tags

method: HTTP method.
outcome: A more general outcome tag.
status: The HTTP status code.
uri: The requested URI.

Expand

Metric	Description
`http_server_requests_seconds_count`	The total number of requests processed.
`http_server_requests_seconds_sum`	The total duration for all the requests processed.

You can enable histograms for this metric by setting http-metrics-histograms-enabled to true, and add additional buckets for service level objectives using the option http-metrics-slos.

Note

When histograms are enabled, the percentile buckets are available. Those are useful to create heat maps and analyze latencies, still collecting and exposing the percentile buckets will increase the load of to your monitoring system.

5.5.2.2. Active requests
Copy link

The current number of active requests is also available.

Expand

Metric	Description
`http_server_active_requests`	The current number of active requests

5.5.2.3. Bandwidth
Copy link

The metrics below helps to monitor the bandwidth and consumed traffic used by Red Hat build of Keycloak and consumed by the requests and responses received or sent.

Expand

Metric	Description
`http_server_bytes_written_count`	The total number of responses sent.
`http_server_bytes_written_sum`	The total number of bytes sent.
`http_server_bytes_read_count`	The total number of requests received.
`http_server_bytes_read_sum`	The total number of bytes received.

Note

5.5.3. Next steps
Copy link

Return back to the Troubleshooting using metrics or,

For single site deployments proceed to Clustering metrics,
and for multiple sites deployments proceed to Embedded Infinispan metrics for multi-site deployments

5.5.4. Relevant options
Copy link

Expand

Value

	Value
`http-metrics-histograms-enabled` Enables a histogram with default buckets for the duration of HTTP server requests. CLI: `--http-metrics-histograms-enabled` Env: `KC_HTTP_METRICS_HISTOGRAMS_ENABLED` Available only when metrics are enabled	`true`, `false` (default)
`http-metrics-slos` Service level objectives for HTTP server requests. Use this instead of the default histogram, or use it in combination to add additional buckets. Specify a list of comma-separated values defined in milliseconds. Example with buckets from 5ms to 10s: 5,10,25,50,250,500,1000,2500,5000,10000 CLI: `--http-metrics-slos` Env: `KC_HTTP_METRICS_SLOS` Available only when metrics are enabled

http-metrics-histograms-enabled

Enables a histogram with default buckets for the duration of HTTP server requests.

CLI: --http-metrics-histograms-enabled
Env: KC_HTTP_METRICS_HISTOGRAMS_ENABLED

Available only when metrics are enabled

true, false (default)

http-metrics-slos

Service level objectives for HTTP server requests.

Use this instead of the default histogram, or use it in combination to add additional buckets. Specify a list of comma-separated values defined in milliseconds. Example with buckets from 5ms to 10s: 5,10,25,50,250,500,1000,2500,5000,10000

CLI: --http-metrics-slos
Env: KC_HTTP_METRICS_SLOS

Available only when metrics are enabled

5.6. Clustering metrics
Copy link

Use metrics to monitor communication between Red Hat build of Keycloak nodes.

This is part of the Troubleshooting using metrics chapter.

5.6.1. Prerequisites
Copy link

Metrics need to be enabled for Red Hat build of Keycloak. Follow the Gaining insights with metrics chapter for more details.
A monitoring system collecting the metrics.

5.6.2. Metrics
Copy link

Deploying multiple Red Hat build of Keycloak nodes allows the load to be distributed amongst them, but this requires communication between the nodes. This section describes metrics that are useful for monitoring the communication between Red Hat build of Keycloak in order to identify possible faults.

Note

This is relevant only for single site deployments. When multiple sites are used, as described in Multi-site deployments, Red Hat build of Keycloak nodes are not clustered together and therefore there is no communication between them directly.

Global tags

cluster=<name>: The cluster name. If metrics from multiple clusters are being collected, this tag helps identify where they belong to.
node=<node>: The name of the node reporting the metric.

Warning

All metric names prefixed with vendor_jgroups_ are provided for troubleshooting and debugging purposes only. The metric names can change in upcoming releases of Red Hat build of Keycloak without further notice. Therefore, we advise not using them in dashboards or in monitoring and alerting.

5.6.2.1. Response Time
Copy link

The following metrics expose the response time for the remote requests. The response time is measured between two nodes and includes the processing time. All requests are measured by these metrics, and the response time should remain stable through the cluster lifecycle.

Tip

In a healthy cluster, the response time will remain stable. An increase in response time may indicate a degraded cluster or a node under heavy load.

Tags

node=<node>: It identifies the sender node.
target_node=<node>: It identifies the receiver node.

Expand

Metric	Description
`vendor_jgroups_stats_sync_requests_seconds_count`	The number of synchronous requests to a receiver node.
`vendor_jgroups_stats_sync_requests_seconds_sum`	The total duration of synchronous request to a receiver node

Note

When histogram is enabled, the percentile buckets are available. Those are useful to create heat maps but, collecting and exposing the percentile buckets may have a negative impact on the deployment performance.

5.6.2.2. Bandwidth
Copy link

All the bytes received and sent by the Red Hat build of Keycloak are collected by these metrics. Also, all the internal messages, as heartbeats, are counted too. They allow computing the bandwidth currently used by each node.

Important

The metric name depends on the JGroups transport protocol in use.

Expand

Metric	Protocol	Description
`vendor_jgroups_tcp_get_num_bytes_received`	`TCP`	The total number of bytes received by a node.
`vendor_jgroups_udp_get_num_bytes_received`	`UDP`
`vendor_jgroups_tunnel_get_num_bytes_received`	`TUNNEL`
`vendor_jgroups_tcp_get_num_bytes_sent`	`TCP`	The total number of bytes sent by a node.
`vendor_jgroups_udp_get_num_bytes_sent`	`UDP`
`vendor_jgroups_tunnel_get_num_bytes_sent`	`TUNNEL`

5.6.2.3. Thread Pool
Copy link

Monitoring the thread pool size is a good indicator that a node is under a heavy load. All requests received are added to the thread pool for processing and, when it is full, the request is discarded. A retransmission mechanism ensures a reliable communication with an increase of resource usage.

Tip

In a healthy cluster, the thread pool should never be closer to its maximum size (by default, 200 threads).

Note

Thread pool metrics are not available with virtual threads. Virtual threads are enabled by default when running with OpenJDK 21.

Important

The metric name depends on the JGroups transport protocol in use. The default transport protocol is TCP.

Expand

Metric	Protocol	Description
`vendor_jgroups_tcp_get_thread_pool_size`	`TCP`	Current number of threads in the thread pool.
`vendor_jgroups_udp_get_thread_pool_size`	`UDP`
`vendor_jgroups_tunnel_get_thread_pool_size`	`TUNNEL`
`vendor_jgroups_tcp_get_largest_size`	`TCP`	The largest number of threads that have ever simultaneously been in the pool.
`vendor_jgroups_udp_get_largest_size`	`UDP`
`vendor_jgroups_tunnel_get_largest_size`	`TUNNEL`

5.6.2.4. Flow Control
Copy link

Flow control takes care of adjusting the rate of a message sender to the rate of the slowest receiver over time. This is implemented through a credit-based system, where each sender decrements its credits when sending. The sender blocks when the credits fall below 0, and only resumes sending messages when it receives a replenishment message from the receivers.

The metrics below show the number of blocked messages and the average blocking time. When a value is different from zero, it may signal that a receiver is overloaded and may degrade the cluster performance.

Each node has two independent flow control protocols, UFC for unicast messages and MFC for multicast messages.

Tip

A healthy cluster shows a value of zero for all metrics.

Expand

Metric	Description
`vendor_jgroups_ufc_get_number_of_blockings`	The number of times flow control blocks the sender for unicast messages.
`vendor_jgroups_ufc_get_average_time_blocked`	Average time blocked (in ms) in flow control when trying to send an unicast message.
`vendor_jgroups_mfc_get_number_of_blockings`	The number of times flow control blocks the sender for multicast messages.
`vendor_jgroups_mfc_get_average_time_blocked`	Average time blocked (in ms) in flow control when trying to send a multicast message.

5.6.2.5. Retransmissions
Copy link

JGroups provides reliable delivery of messages. When a message is dropped on the network, or the receiver cannot handle the message, a retransmission is required. Retransmissions increase resource usage, and it is usually a signal of an overload system.

Random Early Drop (RED) monitors the sender queues. When the queues are almost full, the message is dropped, and a retransmission must happen. It prevents threads from being blocked by a full sender queue.

Tip

A healthy cluster shows a value of zero for all metrics.

Expand

Metric	Description
`vendor_jgroups_unicast3_get_num_xmits`	The number of retransmitted messages.
`vendor_jgroups_red_get_dropped_messages`	The total number of dropped messages by the sender.
`vendor_jgroups_red_get_drop_rate`	Percentage of all messages that were dropped by the sender.

5.6.2.6. Network Partitions
Copy link

5.6.2.6.1. Cluster Size
Copy link

The cluster size metric reports the number of nodes present in the cluster. If it differs, it may signal that a node is joining, shutdown or, in the worst case, a network partition is happening.

Tip

A healthy cluster shows the same value in all nodes.

Expand

Metric	Description
`vendor_cluster_size`	The number of nodes in the cluster.

5.6.2.6.2. Network Partition Events
Copy link

Network partitions in a cluster can happen due to various reasons. This metrics does not help predict network splits but signals that it happened, and the cluster has been merged.

Tip

A healthy cluster shows a value of zero for this metric.

Expand

Metric	Description
`vendor_jgroups_merge3_get_num_merge_events`	The amount of time a network split was detected and healed.

5.6.3. Next steps
Copy link

Return back to the Troubleshooting using metrics or proceed to Embedded Infinispan metrics for single site deployments.

5.7. Embedded Infinispan metrics for single site deployments
Copy link

Use metrics to monitor caching health and cluster replication.

This is part of the Troubleshooting using metrics chapter.

5.7.1. Prerequisites
Copy link

Metrics need to be enabled for Red Hat build of Keycloak. Follow the Gaining insights with metrics chapter for more details.
A monitoring system collecting the metrics.

5.7.2. Metrics
Copy link

Global tags

cache=<name>: The cache name.

5.7.2.1. Size
Copy link

Monitor the number of entries in your cache using these two metrics. If the cache is clustered, each entry has an owner node and zero or more backup copies of different nodes.

Tip

Sum the unique entry size metric to get a cluster total number of entries.

Expand

Metric	Description
`vendor_statistics_approximate_entries`	The approximate number of entries stored by the node, including backup copies.
`vendor_statistics_approximate_entries_unique`	The approximate number of entries stored by the node, excluding backup copies.

5.7.2.2. Data Access
Copy link

The following metrics monitor the cache accesses, such as the reads, writes and their duration.

5.7.2.2.1. Stores
Copy link

A store operation is a write operation that writes or updates a value stored in the cache.

Expand

Metric	Description
`vendor_statistics_store_times_seconds_count`	The total number of store requests.
`vendor_statistics_store_times_seconds_sum`	The total duration of all store requests.

Note

5.7.2.2.2. Reads
Copy link

A read operation reads a value from the cache. It divides into two groups, a hit if a value is found, and a miss if not found.

Expand

Metric	Description
`vendor_statistics_hit_times_seconds_count`	The total number of read hits requests.
`vendor_statistics_hit_times_seconds_sum`	The total duration of all read hits requests.
`vendor_statistics_miss_times_seconds_count`	The total number of read misses requests.
`vendor_statistics_miss_times_seconds_sum`	The total duration of all read misses requests.

Note

5.7.2.2.3. Removes
Copy link

A remove operation removes a value from the cache. It divides in two groups, a hit if a value exists, and a miss if the value does not exist.

Expand

Metric	Description
`vendor_statistics_remove_hit_times_seconds_count`	The total number of remove hits requests.
`vendor_statistics_remove_hit_times_seconds_sum`	The total duration of all remove hits requests.
`vendor_statistics_remove_miss_times_seconds_count`	The total number of remove misses requests.
`vendor_statistics_remove_miss_times_seconds_sum`	The total duration of all remove misses requests.

Note

Tip

For users and realms cache, the database invalidation translates into a remove operation. These metrics are a good indicator of how frequent the database entities are modified and therefore removed from the cache.

Hit Ratio for read and remove operations

An expression can be used to compute the hit ratio for a cache in systems such as Prometheus. As an example, the hit ratio for read operations can be expressed as:

vendor_statistics_hit_times_seconds_count
/
(vendor_statistics_hit_times_seconds_count
 + vendor_statistics_miss_times_seconds_count)

vendor_statistics_hit_times_seconds_count
/
(vendor_statistics_hit_times_seconds_count
 + vendor_statistics_miss_times_seconds_count)

Copy to Clipboard

Toggle word wrap

Read/Write ratio

An expression can be used to compute the read-write ratio for a cache, using the metrics above:

(vendor_statistics_hit_times_seconds_count
 + vendor_statistics_miss_times_seconds_count)
/
(vendor_statistics_hit_times_seconds_count
 + vendor_statistics_miss_times_seconds_count
 + vendor_statistics_remove_hit_times_seconds_count
 + vendor_statistics_remove_miss_times_seconds_count
 + vendor_statistics_store_times_seconds_count)

(vendor_statistics_hit_times_seconds_count
 + vendor_statistics_miss_times_seconds_count)
/
(vendor_statistics_hit_times_seconds_count
 + vendor_statistics_miss_times_seconds_count
 + vendor_statistics_remove_hit_times_seconds_count
 + vendor_statistics_remove_miss_times_seconds_count
 + vendor_statistics_store_times_seconds_count)

Copy to Clipboard

Toggle word wrap

5.7.2.2.4. Eviction
Copy link

Eviction is the process to limit the cache size and, when full, an entry is removed to make room for a new entry to be cached. As Red Hat build of Keycloak caches the database entities in the users, realms and authorization, database access always proceeds with an eviction event.

Expand

Metric	Description
`vendor_statistics_evictions`	The total number of eviction events.

Eviction rate

A rapid increase of eviction and very high database CPU usage means the users or realms cache is too small for smooth Red Hat build of Keycloak operation, as data needs to be re-loaded very often from the database which slows down responses. If enough memory is available, consider increasing the max cache size using the CLI options cache-embedded-users-max-count or cache-embedded-realms-max-count

5.7.2.3. Locking
Copy link

Write and remove operations hold the lock until the value is replicated in the local cluster and to the remote site.

Tip

On a healthy cluster, the number of locks held should remain constant, but deadlocks may create temporary spikes.

Expand

Metric	Description
`vendor_lock_manager_number_of_locks_held`	The number of locks currently being held by this node.

5.7.2.4. Transactions
Copy link

Transactional caches use both One-Phase-Commit and Two-Phase-Commit protocols to complete a transaction. These metrics keep track of the operation duration.

Note

The PESSMISTIC locking mode uses One-Phase-Commit and does not create commit requests.

Tip

In a healthy cluster, the number of rollbacks should remain zero. Deadlocks should be rare, but they increase the number of rollbacks.

Expand

Metric	Description
`vendor_transactions_prepare_times_seconds_count`	The total number of prepare requests.
`vendor_transactions_prepare_times_seconds_sum`	The total duration of all prepare requests.
`vendor_transactions_rollback_times_seconds_count`	The total number of rollback requests.
`vendor_transactions_rollback_times_seconds_sum`	The total duration of all rollback requests.
`vendor_transactions_commit_times_seconds_count`	The total number of commit requests.
`vendor_transactions_commit_times_seconds_sum`	The total duration of all commit requests.

Note

5.7.2.5. State Transfer
Copy link

State transfer happens when a node joins or leaves the cluster. It is required to balance the data stored and guarantee the desired number of copies.

This operation increases the resource usage, and it will affect negatively the overall performance.

Expand

Metric	Description
`vendor_state_transfer_manager_inflight_transactional_segment_count`	The number of in-flight transactional segments the local node requested from other nodes.
`vendor_state_transfer_manager_inflight_segment_transfer_count`	The number of in-flight segments the local node requested from other nodes.

5.7.2.6. Cluster Data Replication
Copy link

The cluster data replication can be the main source of failure. These metrics not only report the response time, i.e., the time it takes to replicate an update, but also the failures.

Tip

On a healthy cluster, the average replication time will be stable or with little variance. The number of failures should not increase.

Expand

Metric	Description
`vendor_rpc_manager_replication_count`	The total number of successful replications.
`vendor_rpc_manager_replication_failures`	The total number of failed replications.
`vendor_rpc_manager_average_replication_time`	The average time spent, in milliseconds, replicating data in the cluster.

Success ratio

An expression can be used to compute the replication success ratio:

(vendor_rpc_manager_replication_count)
/
(vendor_rpc_manager_replication_count
 + vendor_rpc_manager_replication_failures)

(vendor_rpc_manager_replication_count)
/
(vendor_rpc_manager_replication_count
 + vendor_rpc_manager_replication_failures)

Copy to Clipboard

Toggle word wrap

5.7.3. Next steps
Copy link

Return back to the Troubleshooting using metrics.

5.8. Embedded Infinispan metrics for multi-site deployments
Copy link

Use metrics to monitor caching health.

This is part of the Troubleshooting using metrics chapter.

5.8.1. Prerequisites
Copy link

Metrics need to be enabled for Red Hat build of Keycloak. Follow the Gaining insights with metrics chapter for more details.
A monitoring system collecting the metrics.

5.8.2. Metrics
Copy link

Global tags

cache=<name>: The cache name.

5.8.2.1. Size
Copy link

Monitor the number of entries in your cache using these two metrics. If the cache is clustered, each entry has an owner node and zero or more backup copies of different nodes.

Tip

Sum the unique entry size metric to get a cluster total number of entries.

Expand

Metric	Description
`vendor_statistics_approximate_entries`	The approximate number of entries stored by the node, including backup copies.
`vendor_statistics_approximate_entries_unique`	The approximate number of entries stored by the node, excluding backup copies.

5.8.2.2. Data Access
Copy link

The following metrics monitor the cache accesses, such as the reads, writes and their duration.

5.8.2.2.1. Stores
Copy link

A store operation is a write operation that writes or updates a value stored in the cache.

Expand

Metric	Description
`vendor_statistics_store_times_seconds_count`	The total number of store requests.
`vendor_statistics_store_times_seconds_sum`	The total duration of all store requests.

Note

5.8.2.2.2. Reads
Copy link

A read operation reads a value from the cache. It divides into two groups, a hit if a value is found, and a miss if not found.

Expand

Metric	Description
`vendor_statistics_hit_times_seconds_count`	The total number of read hits requests.
`vendor_statistics_hit_times_seconds_sum`	The total duration of all read hits requests.
`vendor_statistics_miss_times_seconds_count`	The total number of read misses requests.
`vendor_statistics_miss_times_seconds_sum`	The total duration of all read misses requests.

Note

5.8.2.2.3. Removes
Copy link

A remove operation removes a value from the cache. It divides in two groups, a hit if a value exists, and a miss if the value does not exist.

Expand

Metric	Description
`vendor_statistics_remove_hit_times_seconds_count`	The total number of remove hits requests.
`vendor_statistics_remove_hit_times_seconds_sum`	The total duration of all remove hits requests.
`vendor_statistics_remove_miss_times_seconds_count`	The total number of remove misses requests.
`vendor_statistics_remove_miss_times_seconds_sum`	The total duration of all remove misses requests.

Note

Tip

Hit Ratio for read and remove operations

An expression can be used to compute the hit ratio for a cache in systems such as Prometheus. As an example, the hit ratio for read operations can be expressed as:

vendor_statistics_hit_times_seconds_count
/
(vendor_statistics_hit_times_seconds_count
 + vendor_statistics_miss_times_seconds_count)

vendor_statistics_hit_times_seconds_count
/
(vendor_statistics_hit_times_seconds_count
 + vendor_statistics_miss_times_seconds_count)

Copy to Clipboard

Toggle word wrap

Read/Write ratio

An expression can be used to compute the read-write ratio for a cache, using the metrics above:

(vendor_statistics_hit_times_seconds_count
 + vendor_statistics_miss_times_seconds_count)
/
(vendor_statistics_hit_times_seconds_count
 + vendor_statistics_miss_times_seconds_count
 + vendor_statistics_remove_hit_times_seconds_count
 + vendor_statistics_remove_miss_times_seconds_count
 + vendor_statistics_store_times_seconds_count)

(vendor_statistics_hit_times_seconds_count
 + vendor_statistics_miss_times_seconds_count)
/
(vendor_statistics_hit_times_seconds_count
 + vendor_statistics_miss_times_seconds_count
 + vendor_statistics_remove_hit_times_seconds_count
 + vendor_statistics_remove_miss_times_seconds_count
 + vendor_statistics_store_times_seconds_count)

Copy to Clipboard

Toggle word wrap

5.8.2.2.4. Eviction
Copy link

Expand

Metric	Description
`vendor_statistics_evictions`	The total number of eviction events.

Eviction rate

5.8.2.3. Transactions
Copy link

Transactional caches use both One-Phase-Commit and Two-Phase-Commit protocols to complete a transaction. These metrics keep track of the operation duration.

Note

The PESSMISTIC locking mode uses One-Phase-Commit and does not create commit requests.

Tip

In a healthy cluster, the number of rollbacks should remain zero. Deadlocks should be rare, but they increase the number of rollbacks.

Expand

Metric	Description
`vendor_transactions_prepare_times_seconds_count`	The total number of prepare requests.
`vendor_transactions_prepare_times_seconds_sum`	The total duration of all prepare requests.
`vendor_transactions_rollback_times_seconds_count`	The total number of rollback requests.
`vendor_transactions_rollback_times_seconds_sum`	The total duration of all rollback requests.
`vendor_transactions_commit_times_seconds_count`	The total number of commit requests.
`vendor_transactions_commit_times_seconds_sum`	The total duration of all commit requests.

Note

5.8.3. Next steps
Copy link

Return back to the Troubleshooting using metrics or proceed to External Data Grid metrics.

5.9. External Data Grid metrics
Copy link

Use metrics to monitor external Data Grid performance.

This is part of the Troubleshooting using metrics chapter.

5.9.1. Prerequisites
Copy link

5.9.1.1. Enabled Data Grid server metrics
Copy link

Data Grid exposes metrics in the endpoint /metrics. By default, they are enabled. We recommend enabling the attribute name-as-tags as it makes the metrics name independent on the cache name.

To configure metrics in the Data Grid server, just enabled as shown in the XML below.

infinispan.xml

<infinispan>
    <cache-container statistics="true">
        <metrics gauges="true" histograms="false" name-as-tags="true" />
    </cache-container>
</infinispan>

<infinispan>
    <cache-container statistics="true">
        <metrics gauges="true" histograms="false" name-as-tags="true" />
    </cache-container>
</infinispan>

Copy to Clipboard

Toggle word wrap

Using the Data Grid Operator in Kubernetes, metrics can be enabled by using a ConfigMap with a custom configuration. It is shown below an example.

ConfigMap

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-config
data:
  infinispan-config.yaml: >
    infinispan:
      cacheContainer:
        metrics:
          gauges: true
          namesAsTags: true
          histograms: false

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-config
data:
  infinispan-config.yaml: >
    infinispan:
      cacheContainer:
        metrics:
          gauges: true
          namesAsTags: true
          histograms: false

Copy to Clipboard

Toggle word wrap

infinispan.yaml CR

apiVersion: infinispan.org/v1
kind: Infinispan
metadata:
  name: infinispan
  annotations:
    infinispan.org/monitoring: 'true' 
spec:
  configMapName: "cluster-config"

apiVersion: infinispan.org/v1
kind: Infinispan
metadata:
  name: infinispan
  annotations:
    infinispan.org/monitoring: 'true'


spec:
  configMapName: "cluster-config"

Copy to Clipboard

Toggle word wrap

1: Enables monitoring for the deployment
2: Sets the ConfigMap name with the custom configuration.

Additional information can be found in the Infinispan documentation and Infinispan operator documentation.

5.9.2. Clustering and Network
Copy link

This section describes metrics that are useful for monitoring the communication between Data Grid nodes to identify possible network issues.

Global tags

cluster=<name>: The cluster name. If metrics from multiple clusters are being collected, this tag helps identify where they belong to.
node=<node>: The name of the node reporting the metric.

Warning

5.9.2.1. Response Time
Copy link

Tip

In a healthy cluster, the response time will remain stable. An increase in response time may indicate a degraded cluster or a node under heavy load.

Tags

node=<node>: It identifies the sender node.
target_node=<node>: It identifies the receiver node.

Expand

Metric	Description
`vendor_jgroups_stats_sync_requests_seconds_count`	The number of synchronous requests to a receiver node.
`vendor_jgroups_stats_sync_requests_seconds_sum`	The total duration of synchronous request to a receiver node

Note

5.9.2.2. Bandwidth
Copy link

All the bytes received and sent by the Data Grid are collected by these metrics. Also, all the internal messages, as heartbeats, are counted too. They allow computing the bandwidth currently used by each node.

Important

The metric name depends on the JGroups transport protocol in use.

Expand

Metric	Protocol	Description
`vendor_jgroups_tcp_get_num_bytes_received`	`TCP`	The total number of bytes received by a node.
`vendor_jgroups_udp_get_num_bytes_received`	`UDP`
`vendor_jgroups_tunnel_get_num_bytes_received`	`TUNNEL`
`vendor_jgroups_tcp_get_num_bytes_sent`	`TCP`	The total number of bytes sent by a node.
`vendor_jgroups_udp_get_num_bytes_sent`	`UDP`
`vendor_jgroups_tunnel_get_num_bytes_sent`	`TUNNEL`

5.9.2.3. Thread Pool
Copy link

Tip

In a healthy cluster, the thread pool should never be closer to its maximum size (by default, 200 threads).

Note

Thread pool metrics are not available with virtual threads. Virtual threads are enabled by default when running with OpenJDK 21.

Important

The metric name depends on the JGroups transport protocol in use. The default transport protocol is TCP.

Expand

Metric	Protocol	Description
`vendor_jgroups_tcp_get_thread_pool_size`	`TCP`	Current number of threads in the thread pool.
`vendor_jgroups_udp_get_thread_pool_size`	`UDP`
`vendor_jgroups_tunnel_get_thread_pool_size`	`TUNNEL`
`vendor_jgroups_tcp_get_largest_size`	`TCP`	The largest number of threads that have ever simultaneously been in the pool.
`vendor_jgroups_udp_get_largest_size`	`UDP`
`vendor_jgroups_tunnel_get_largest_size`	`TUNNEL`

5.9.2.4. Flow Control
Copy link

Each node has two independent flow control protocols, UFC for unicast messages and MFC for multicast messages.

Tip

A healthy cluster shows a value of zero for all metrics.

Expand

Metric	Description
`vendor_jgroups_ufc_get_number_of_blockings`	The number of times flow control blocks the sender for unicast messages.
`vendor_jgroups_ufc_get_average_time_blocked`	Average time blocked (in ms) in flow control when trying to send an unicast message.
`vendor_jgroups_mfc_get_number_of_blockings`	The number of times flow control blocks the sender for multicast messages.
`vendor_jgroups_mfc_get_average_time_blocked`	Average time blocked (in ms) in flow control when trying to send a multicast message.

5.9.2.5. Retransmissions
Copy link

Tip

A healthy cluster shows a value of zero for all metrics.

Expand

Metric	Description
`vendor_jgroups_unicast3_get_num_xmits`	The number of retransmitted messages.
`vendor_jgroups_red_get_dropped_messages`	The total number of dropped messages by the sender.
`vendor_jgroups_red_get_drop_rate`	Percentage of all messages that were dropped by the sender.

5.9.2.6. Network Partitions
Copy link

5.9.2.6.1. Cluster Size
Copy link

The cluster size metric reports the number of nodes present in the cluster. If it differs, it may signal that a node is joining, shutdown or, in the worst case, a network partition is happening.

Tip

A healthy cluster shows the same value in all nodes.

Expand

Metric	Description
`vendor_cluster_size`	The number of nodes in the cluster.

5.9.2.6.2. Cross-Site Status
Copy link

The cross-site status reports connection status to the other site. It returns a value of 1 if is online or 0 if offline. The value of 2 is used on nodes where the status is unknown; not all nodes establish connections to the remote sites and do not contain this information.

Tip

A healthy cluster shows a value greater than zero.

Expand

Metric	Description
`vendor_jgroups_site_view_status`	The single site status (1 if online).

Tags

site=<name>: The name of the destination site.

5.9.2.6.3. Network Partition Events
Copy link

Network partitions in a cluster can happen due to various reasons. This metrics does not help predict network splits but signals that it happened, and the cluster has been merged.

Tip

A healthy cluster shows a value of zero for this metric.

Expand

Metric	Description
`vendor_jgroups_merge3_get_num_merge_events`	The amount of time a network split was detected and healed.

5.9.3. Data Grid Caches
Copy link

The metrics in this section help monitoring the Data Grid caches health and the cluster replication.

Global tags

cache=<name>: The cache name.

5.9.3.1. Size
Copy link

Monitor the number of entries in your cache using these two metrics. If the cache is clustered, each entry has an owner node and zero or more backup copies of different nodes.

Tip

Sum the unique entry size metric to get a cluster total number of entries.

Expand

Metric	Description
`vendor_statistics_approximate_entries`	The approximate number of entries stored by the node, including backup copies.
`vendor_statistics_approximate_entries_unique`	The approximate number of entries stored by the node, excluding backup copies.

5.9.3.2. Data Access
Copy link

The following metrics monitor the cache accesses, such as the reads, writes and their duration.

5.9.3.2.1. Stores
Copy link

A store operation is a write operation that writes or updates a value stored in the cache.

Expand

Metric	Description
`vendor_statistics_store_times_seconds_count`	The total number of store requests.
`vendor_statistics_store_times_seconds_sum`	The total duration of all store requests.

Note

5.9.3.2.2. Reads
Copy link

A read operation reads a value from the cache. It divides into two groups, a hit if a value is found, and a miss if not found.

Expand

Metric	Description
`vendor_statistics_hit_times_seconds_count`	The total number of read hits requests.
`vendor_statistics_hit_times_seconds_sum`	The total duration of all read hits requests.
`vendor_statistics_miss_times_seconds_count`	The total number of read misses requests.
`vendor_statistics_miss_times_seconds_sum`	The total duration of all read misses requests.

Note

5.9.3.2.3. Removes
Copy link

A remove operation removes a value from the cache. It divides in two groups, a hit if a value exists, and a miss if the value does not exist.

Expand

Metric	Description
`vendor_statistics_remove_hit_times_seconds_count`	The total number of remove hits requests.
`vendor_statistics_remove_hit_times_seconds_sum`	The total duration of all remove hits requests.
`vendor_statistics_remove_miss_times_seconds_count`	The total number of remove misses requests.
`vendor_statistics_remove_miss_times_seconds_sum`	The total duration of all remove misses requests.

Note

5.9.3.3. Locking
Copy link

Write and remove operations hold the lock until the value is replicated in the local cluster and to the remote site.

Tip

On a healthy cluster, the number of locks held should remain constant, but deadlocks may create temporary spikes.

Expand

Metric	Description
`vendor_lock_manager_number_of_locks_held`	The number of locks currently being held by this node.

5.9.3.4. Transactions
Copy link

Transactional caches use both One-Phase-Commit and Two-Phase-Commit protocols to complete a transaction. These metrics keep track of the operation duration.

Note

The PESSMISTIC locking mode uses One-Phase-Commit and does not create commit requests.

Tip

In a healthy cluster, the number of rollbacks should remain zero. Deadlocks should be rare, but they increase the number of rollbacks.

Expand

Metric	Description
`vendor_transactions_prepare_times_seconds_count`	The total number of prepare requests.
`vendor_transactions_prepare_times_seconds_sum`	The total duration of all prepare requests.
`vendor_transactions_rollback_times_seconds_count`	The total number of rollback requests.
`vendor_transactions_rollback_times_seconds_sum`	The total duration of all rollback requests.
`vendor_transactions_commit_times_seconds_count`	The total number of commit requests.
`vendor_transactions_commit_times_seconds_sum`	The total duration of all commit requests.

Note

5.9.3.5. State Transfer
Copy link

State transfer happens when a node joins or leaves the cluster. It is required to balance the data stored and guarantee the desired number of copies.

This operation increases the resource usage, and it will affect negatively the overall performance.

Expand

Metric	Description
`vendor_state_transfer_manager_inflight_transactional_segment_count`	The number of in-flight transactional segments the local node requested from other nodes.
`vendor_state_transfer_manager_inflight_segment_transfer_count`	The number of in-flight segments the local node requested from other nodes.

5.9.3.6. Cluster Data Replication
Copy link

The cluster data replication can be the main source of failure. These metrics not only report the response time, i.e., the time it takes to replicate an update, but also the failures.

Tip

On a healthy cluster, the average replication time will be stable or with little variance. The number of failures should not increase.

Expand

Metric	Description
`vendor_rpc_manager_replication_count`	The total number of successful replications.
`vendor_rpc_manager_replication_failures`	The total number of failed replications.
`vendor_rpc_manager_average_replication_time`	The average time spent, in milliseconds, replicating data in the cluster.

Success ratio

An expression can be used to compute the replication success ratio:

(vendor_rpc_manager_replication_count)
/
(vendor_rpc_manager_replication_count
 + vendor_rpc_manager_replication_failures)

(vendor_rpc_manager_replication_count)
/
(vendor_rpc_manager_replication_count
 + vendor_rpc_manager_replication_failures)

Copy to Clipboard

Toggle word wrap

5.9.3.7. Cross Site Data Replication
Copy link

Like cluster data replication, the metrics in this section measure the time it takes to replicate the data to the other sites.

Tip

On a healthy cluster, the average cross-site replication time will be stable or with little variance.

Tags

site=<name>: indicates the receiving site.

Expand

Metric	Description
`vendor_rpc_manager_cross_site_replication_times_seconds_count`	The total number of cross-site requests.
`vendor_rpc_manager_cross_site_replication_times_seconds_sum`	The total duration of all cross-site requests.
`vendor_rpc_manager_replication_times_to_site_seconds_count`	The total number of cross-site requests. This metric is more detailed with a per-site counter.
`vendor_rpc_manager_replication_times_to_site_seconds_sum`	The total duration of all cross-site requests. This metric is more detailed with a per-site duration.
`vendor_rpc_manager_number_xsite_requests_received_from_site`	The total number of cross-site requests handled by this node. This metric is more detailed with a per-site counter.
`vendor_x_site_admin_status`	The site status. A value of 1 indicates that it is online. This value reacts to the Data Grid CLI commands `bring-online` and `take-offline`.

Note

5.9.4. Next steps
Copy link

Return back to the Troubleshooting using metrics.

Chapter 6. Root cause analysis with tracing
Copy link

Record information during the request lifecycle with OpenTelementry tracing to identify root cases for latencies and errors in Red Hat build of Keycloak and connected systems.

This chapter explains how you can enable and configure distributed tracing in Red Hat build of Keycloak by utilizing OpenTelemetry (OTel). Tracing allows for detailed monitoring of each request’s lifecycle, which helps quickly identify and diagnose issues, leading to more efficient debugging and maintenance.

It provides valuable insights into performance bottlenecks and can help optimize the system’s overall efficiency and across system boundaries. Red Hat build of Keycloak uses a supported Quarkus OTel extension that provides smooth integration and exposure of application traces.

6.1. Enable tracing
Copy link

It is possible to enable exposing traces using the build time option tracing-enabled as follows:

bin/kc.[sh|bat] start --tracing-enabled=true

bin/kc.[sh|bat] start --tracing-enabled=true

Copy to Clipboard

Toggle word wrap

By default, the trace exporters send out data in batches, using the gRPC protocol and endpoint http://localhost:4317.

The default service name is keycloak, specified via the tracing-service-name property, which takes precedence over service.name defined in the tracing-resource-attributes property.

For more information about resource attributes that can be provided via the tracing-resource-attributes property, see the Quarkus OpenTelemetry Resource guide.

Note

Tracing can be enabled only when the opentelemetry feature is enabled (by default).

For more tracing settings, see all possible configurations below.

6.2. Development setup
Copy link

In order to see the captured Red Hat build of Keycloak traces, the basic setup with leveraging the Jaeger tracing platform might be used. For development purposes, the Jaeger-all-in-one can be used to see traces as easily as possible.

Note

Jaeger-all-in-one includes the Jaeger agent, an OTel collector, and the query service/UI. You do not need to install a separate collector, as you can directly send the trace data to Jaeger.

podman run --name jaeger \
-p 16686:16686 \
-p 4317:4317 \
-p 4318:4318 \
jaegertracing/all-in-one

podman run --name jaeger \
-p 16686:16686 \
-p 4317:4317 \
-p 4318:4318 \
jaegertracing/all-in-one

Copy to Clipboard

Toggle word wrap

6.2.1. Exposed ports
Copy link

16686: Jaeger UI
4317: OpenTelemetry Protocol gRPC receiver (default)
4318: OpenTelemetry Protocol HTTP receiver

You can visit the Jaeger UI on http://localhost:16686/ to see the tracing information. The Jaeger UI might look like this with an arbitrary Red Hat build of Keycloak trace:

6.3. Information in traces
Copy link

6.3.1. Spans
Copy link

Red Hat build of Keycloak creates spans for the following activities:

Incoming HTTP requests
Outgoing Database including acquiring a database connections
Outgoing LDAP requests including connecting to the LDAP server
Outgoing HTTP requests including IdP brokerage

6.3.2. Tags
Copy link

Red Hat build of Keycloak adds tags to traces depending on the type of the request. All tags are prefixed with kc..

Example tags are:

kc.clientId: Client ID
kc.realmName: Realm name
kc.sessionId: User session ID
kc.token.id: id as mentioned in the token
kc.token.issuer: issuer as mentioned in the token
kc.token.sid: sid as mentioned in the token
kc.authenticationSessionId: Authentication session ID
kc.authenticationTabId: Authentication Tab ID

6.3.3. Logs
Copy link

If a trace is being sampled, it will contain any user events created during the request. This includes, for example, LOGIN, LOGOUT or REFRESH_TOKEN events with all details and IDs found in user events.

LDAP communication errors are shown as log entries in recorded traces as well with a stack trace and details of the failed operation.

6.4. Trace IDs in logs
Copy link

When tracing is enabled, the trace IDs are included in the log messages of all enabled log handlers (see more in Configuring logging). It can be useful for associating log events to request execution, which might provide better traceability and debugging. All log lines originating from the same request will have the same traceId in the log.

The log message also contains a sampled flag, which relates to the sampling described below and indicates whether the span was sampled - sent to the collector.

The format of the log records may start as follows:

2024-08-05 15:27:07,144 traceId=b636ac4c665ceb901f7fdc3fc7e80154, parentId=d59cea113d0c2549, spanId=d59cea113d0c2549, sampled=true WARN  [org.keycloak.events] ...

2024-08-05 15:27:07,144 traceId=b636ac4c665ceb901f7fdc3fc7e80154, parentId=d59cea113d0c2549, spanId=d59cea113d0c2549, sampled=true WARN  [org.keycloak.events] ...

Copy to Clipboard

Toggle word wrap

6.4.1. Hide trace IDs in logs
Copy link

You can hide trace IDs in specific log handlers by specifying their associated Red Hat build of Keycloak option log-<handler-name>-include-trace, where <handler-name> is the name of the log handler. For instance, to disable trace info in the console log, you can turn it off as follows:

bin/kc.[sh|bat] start --tracing-enabled=true --log=console --log-console-include-trace=false

bin/kc.[sh|bat] start --tracing-enabled=true --log=console --log-console-include-trace=false

Copy to Clipboard

Toggle word wrap

Note

When you explicitly override the log format for the particular log handlers, the *-include-trace options do not have any effect, and no tracing is included.

6.5. Sampling
Copy link

The sampler decides whether a trace should be discarded or forwarded, effectively reducing overhead by limiting the number of collected traces sent to the collector. It helps manage resource consumption, which leads to avoiding the huge storage costs of tracing every single request and potential performance penalty.

Warning

For a production-ready environment, sampling should be properly set to minimize infrastructure costs.

Red Hat build of Keycloak supports several built-in OpenTelemetry samplers, such as:

always_on
always_off
traceidratio (default)
parentbased_always_on
parentbased_always_off
parentbased_traceidratio

The used sampler can be changed via the tracing-sampler-type property.

6.5.1. Default sampler
Copy link

The default sampler for Red Hat build of Keycloak is traceidratio, which controls the rate of trace sampling based on a specified ratio configurable via the tracing-sampler-ratio property.

6.5.1.1. Trace ratio
Copy link

The default trace ratio is 1.0, which means all traces are sampled - sent to the collector. The ratio is a floating number in the range [0,1]. For instance, when the ratio is 0.1, only 10% of the traces are sampled.

Warning

For a production-ready environment, the trace ratio should be a smaller number to prevent the massive cost of trace store infrastructure and avoid performance overhead.

Tip

The ratio can be set to 0.0 to disable sampling entirely at runtime.

6.5.1.2. Rationale
Copy link

The sampler makes its own sampling decisions based on the current ratio of sampled spans, regardless of the decision made on the parent span, as with using the parentbased_traceidratio sampler.

The parentbased_traceidratio sampler could be the preferred default type as it ensures the sampling consistency between parent and child spans. Specifically, if a parent span is sampled, all its child spans will be sampled as well - the same sampling decision for all. It helps to keep all spans together and prevents storing incomplete traces.

However, it might introduce certain security risks leading to DoS attacks. External callers can manipulate trace headers, parent spans can be injected, and the trace store can be overwhelmed. Proper HTTP headers (especially tracestate) filtering and adequate measures of caller trust would need to be assessed.

For more information, see the W3C Trace context document.

6.6. Tracing in Kubernetes environment
Copy link

When the tracing is enabled when using the Red Hat build of Keycloak Operator, certain information about the deployment is propagated to the underlying containers.

6.6.1. Configuration via Keycloak CR
Copy link

You can change tracing configuration via Keycloak CR. For more information, see the Advanced configuration.

6.6.2. Filter traces based on Kubernetes attributes
Copy link

You can filter out the required traces in your tracing backend based on their tags:

service.name - Red Hat build of Keycloak deployment name
k8s.namespace.name - Namespace
host.name - Pod name

Red Hat build of Keycloak Operator automatically sets the KC_TRACING_SERVICE_NAME and KC_TRACING_RESOURCE_ATTRIBUTES environment variables for each Red Hat build of Keycloak container included in pods it manages.

Note

The KC_TRACING_RESOURCE_ATTRIBUTES variable always contains (if not overridden) the k8s.namespace.name attribute representing the current namespace.

6.7. Relevant options
Copy link

Expand

	Value
`log-console-include-trace` Include tracing information in the console log. If the `log-console-format` option is specified, this option has no effect. CLI: `--log-console-include-trace` Env: `KC_LOG_CONSOLE_INCLUDE_TRACE` Available only when Console log handler and Tracing is activated	`true` (default), `false`
`log-file-include-trace` Include tracing information in the file log. If the `log-file-format` option is specified, this option has no effect. CLI: `--log-file-include-trace` Env: `KC_LOG_FILE_INCLUDE_TRACE` Available only when File log handler and Tracing is activated	`true` (default), `false`
`log-syslog-include-trace` Include tracing information in the Syslog. If the `log-syslog-format` option is specified, this option has no effect. CLI: `--log-syslog-include-trace` Env: `KC_LOG_SYSLOG_INCLUDE_TRACE` Available only when Syslog handler and Tracing is activated	`true` (default), `false`
`tracing-compression` OpenTelemetry compression method used to compress payloads. If unset, compression is disabled. CLI: `--tracing-compression` Env: `KC_TRACING_COMPRESSION` Available only when Tracing is enabled	`gzip`, `none` (default)
`tracing-enabled` 🛠 Enables the OpenTelemetry tracing. CLI: `--tracing-enabled` Env: `KC_TRACING_ENABLED` Available only when 'opentelemetry' feature is enabled	`true`, `false` (default)
`tracing-endpoint` OpenTelemetry endpoint to connect to. CLI: `--tracing-endpoint` Env: `KC_TRACING_ENDPOINT` Available only when Tracing is enabled	`http://localhost:4317` (default)
`tracing-jdbc-enabled` 🛠 Enables the OpenTelemetry JDBC tracing. CLI: `--tracing-jdbc-enabled` Env: `KC_TRACING_JDBC_ENABLED` Available only when Tracing is enabled	`true` (default), `false`
`tracing-protocol` OpenTelemetry protocol used for the telemetry data. CLI: `--tracing-protocol` Env: `KC_TRACING_PROTOCOL` Available only when Tracing is enabled	`grpc` (default), `http/protobuf`
`tracing-resource-attributes` OpenTelemetry resource attributes present in the exported trace to characterize the telemetry producer. Values in format `key1=val1,key2=val2`. For more information, check the Tracing guide. CLI: `--tracing-resource-attributes` Env: `KC_TRACING_RESOURCE_ATTRIBUTES` Available only when Tracing is enabled
`tracing-sampler-ratio` OpenTelemetry sampler ratio. Probability that a span will be sampled. Expected double value in interval [0,1]. CLI: `--tracing-sampler-ratio` Env: `KC_TRACING_SAMPLER_RATIO` Available only when Tracing is enabled	`1.0` (default)
`tracing-sampler-type` 🛠 OpenTelemetry sampler to use for tracing. CLI: `--tracing-sampler-type` Env: `KC_TRACING_SAMPLER_TYPE` Available only when Tracing is enabled	`always_on`, `always_off`, `traceidratio` (default), `parentbased_always_on`, `parentbased_always_off`, `parentbased_traceidratio`
`tracing-service-name` OpenTelemetry service name. Takes precedence over `service.name` defined in the `tracing-resource-attributes` property. CLI: `--tracing-service-name` Env: `KC_TRACING_SERVICE_NAME` Available only when Tracing is enabled	`keycloak` (default)

Chapter 7. Visualizing activities in dashboards
Copy link

Install the Red Hat build of Keycloak Grafana dashboards to visualize the metrics that capture the status and activities of your deployment.

Red Hat build of Keycloak provides metrics to observe what is happening inside the deployment. To understand how metrics evolve over time, it is helpful to collect and visualize them in graphs.

This guide provides instructions on how to visualize collected Red Hat build of Keycloak metrics in a running Grafana instance.

7.1. Prerequisites
Copy link

Red Hat build of Keycloak metrics are enabled. Follow Gaining insights with metrics chapter for more details.
Grafana instance is running and Red Hat build of Keycloak metrics are collected into a Prometheus instance.
For the HTTP request latency heatmaps to work, enable histograms for HTTP metrics by setting http-metrics-histograms-enabled to true.

7.2. Red Hat build of Keycloak Grafana dashboards
Copy link

Grafana dashboards are distributed in the form of a JSON file that is imported into a Grafana instance. JSON definitions of Red Hat build of Keycloak Grafana dashboards are available in the keycloak/keycloak-grafana-dashboard GitHub repository.

Follow these steps to download JSON file definitions.

Identify the branch from keycloak-grafana-dashboards to use from the following table.
Expand
Red Hat build of Keycloak version keycloak-grafana-dashboards branch

>= 26.1

main

Clone the GitHub repository

git clone -b BRANCH_FROM_STEP_1 https://github.com/keycloak/keycloak-grafana-dashboard.git

git clone -b BRANCH_FROM_STEP_1 https://github.com/keycloak/keycloak-grafana-dashboard.git

Copy to Clipboard

Toggle word wrap

The dashboards are available in the directory keycloak-grafana-dashboard/dashboards.

The following sections describe the purpose of each dashboard.

7.2.1. Red Hat build of Keycloak troubleshooting dashboard
Copy link

This dashboard is available in the JSON file: keycloak-troubleshooting-dashboard.json.

On the top of the dashboard, graphs display the service level indicators as defined in Monitoring performance with Service Level Indicators. This dashboard can be also used while troubleshooting a Red Hat build of Keycloak deployment following the Troubleshooting using metrics chapter, for example, when SLI graphs do not show expected results.

Figure 7.1. Troubleshooting dashboard

7.2.2. Keycloak capacity planning dashboard
Copy link

This dashboard is available in the JSON file: keycloak-capacity-planning-dashboard.json.

This dashboard shows metrics that are important when estimating the load handled by a Red Hat build of Keycloak deployment. For example, it shows the number of password validations or login flows performed by Red Hat build of Keycloak. For more detail on these metrics, see the chapter Self-provided metrics.

Note

Red Hat build of Keycloak event metrics must be enabled for this dashboard to work correctly. To enable them, see the chapter Monitoring user activities with event metrics.

Figure 7.2. Capacity planning dashboard

7.3. Import a dashboard
Copy link

Open the dashboard page from the left Grafana menu.
Click New and Import.
Click Upload dashboard JSON file and select the JSON file of the dashboard you want to import.
Pick your Prometheus datasource.
Click Import.

7.4. Export a dashboard
Copy link

Exporting a dashboard to JSON format may be useful. For example, you may want to suggest a change in our dashboard repository.

Open a dashboard you would like to export.
Click share in the top left corner next to the dashboard name.
Click the Export tab.
Enable Export for sharing externally.
Click either Save to file or View JSON and Copy to Clipboard according to where you want to store the resulting JSON.

7.5. Further reading
Copy link

Continue reading on how to connect traces to dashboard in the Analyzing outliers and errors with exemplars chapter.

Chapter 8. Analyzing outliers and errors with exemplars
Copy link

Use exemplars to connect a metric to a recorded trace to analyze the root cause of errors or latencies.

Metrics are aggregations over several events, and show you if your system is operating within defined bounds. They are great to monitor error rates or tail latencies and to set up alerting or drive performance optimizations. Still, the aggregation makes it difficult to find root causes for latencies or errors reported in metrics.

Root causes for errors and latencies can be found by enabling tracing. To connect a metric to a recorded trace, there is the concept of exemplars.

Once exemplars are set up, Red Hat build of Keycloak reports metrics with their last recorded trace as an exemplar. A dashboard tool like Grafana can link the exemplar from a metrics dashboard to a trace view.

Metrics that support exemplars are:

http_server_requests_seconds_count (including histograms)
See the chapter HTTP metrics for details on this metric.
keycloak_credentials_password_hashing_validations_total
See the chapter Self-provided metrics for details on this metric.
keycloak_user_events_total
See the chapter Self-provided metrics for details on this metric.

See below for a screenshot of a heatmap visualization for latencies that is showing an exemplar when hovering over one of the pink indicators.

Figure 8.1. Heatmap diagram with exemplar

8.1. Setting up exemplars
Copy link

To benefit from exemplars, perform the following steps:

Enable metrics for Red Hat build of Keycloak as described in chapter Gaining insights with metrics.
Enable tracing for Red Hat build of Keycloak as described in chapter Root cause analysis with tracing.
Enable exemplar storage in your monitoring system.
For Prometheus, this is a preview feature that you need to enable.
Scrape the metrics using the OpenMetricsText1.0.0 protocol, which is not enabled by default in Prometheus.
If you are using PodMonitors or similar in a Kubernetes environment, this can be achieved by adding it to the spec of the custom resource:
```
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  ...
spec:
  scrapeProtocols:
    - OpenMetricsText1.0.0
```
```
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  ...
spec:
  scrapeProtocols:
    - OpenMetricsText1.0.0
```
Copy to Clipboard Toggle word wrap
Configure your metrics datasource where to link to for traces.
When using Grafana and Prometheus, this would be setting up a exemplarTraceIdDestinations for the Prometheus datasource, which then points to your tracing datasource that is provided by tools like Jaeger or Tempo.
Enable exemplars in your dashboards.
Enable the Exemplars toggle in each query on each dashboard where you want to show exemplars. When set up correctly, you will notice little dots or stars in your dashboards that you can click on to view the traces.

Note

If you do not specify the scrape protocol, Prometheus will by default not send it in the content negotiation, and Keycloak will then fall back to the PrometheusText protocol which will not contain the exemplars.
If you enabled tracing and metrics, but the request sampling did not record a trace, the exposed metric will not contain any exemplars.
If you access the metrics endpoint with your browser, the content negotiation will lead to the format PrometheusText being returned, and you will not see any exemplars.

8.2. Verifying that exemplars work as expected
Copy link

Perform the following steps to verify that Red Hat build of Keycloak is set up correctly for exemplars:

Follow the instructions to set up metrics and tracing for Red Hat build of Keycloak.
For test purposes, record all traces by setting the tracing ration to 1.0. See Root cause analysis with tracing for recommended sampling settings in production systems.
Log in to the Keycloak instance to create some traces.

Scrape the metrics with a command similar to the following and search for those metrics that have an exemplar set:

curl -s http://localhost:9000/metrics \
-H 'Accept: application/openmetrics-text; version=1.0.0; charset=utf-8' \
| grep "#.*trace_id"

$ curl -s http://localhost:9000/metrics \
-H 'Accept: application/openmetrics-text; version=1.0.0; charset=utf-8' \
| grep "#.*trace_id"

Copy to Clipboard

Toggle word wrap

This should result in an output similar to the following. Note the additional # after which the span and trace IDs are added:

http_server_requests_seconds_count {...} ... # {span_id="...",trace_id="..."} ...

http_server_requests_seconds_count {...} ... # {span_id="...",trace_id="..."} ...

Copy to Clipboard

Toggle word wrap

Legal Notice
Copy link

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Observability Guide

Chapter 1. Tracking instance status with health checksCopy linkLink copied to clipboard!

1.1. Red Hat build of Keycloak health check endpointsCopy linkLink copied to clipboard!

1.2. Enabling the health checksCopy linkLink copied to clipboard!

1.3. Using the health checksCopy linkLink copied to clipboard!

1.3.1. curlCopy linkLink copied to clipboard!

1.3.2. KubernetesCopy linkLink copied to clipboard!

1.3.3. HEALTHCHECKCopy linkLink copied to clipboard!

1.4. Available ChecksCopy linkLink copied to clipboard!

1.5. Relevant optionsCopy linkLink copied to clipboard!

Chapter 2. Gaining insights with metricsCopy linkLink copied to clipboard!

2.1. Enabling MetricsCopy linkLink copied to clipboard!

2.2. Querying MetricsCopy linkLink copied to clipboard!

2.3. Next stepsCopy linkLink copied to clipboard!

2.4. Relevant optionsCopy linkLink copied to clipboard!

Chapter 3. Monitoring user activities with event metricsCopy linkLink copied to clipboard!

3.1. Enable event metricsCopy linkLink copied to clipboard!

3.2. Relevant optionsCopy linkLink copied to clipboard!

Chapter 4. Monitoring performance with Service Level IndicatorsCopy linkLink copied to clipboard!

4.1. PrerequisitesCopy linkLink copied to clipboard!

4.2. Definition of the service deliveredCopy linkLink copied to clipboard!

4.3. Definition of SLI and SLOCopy linkLink copied to clipboard!

4.4. PromQL queriesCopy linkLink copied to clipboard!

4.4.1. AvailabilityCopy linkLink copied to clipboard!

4.4.2. Latency of authentication requestsCopy linkLink copied to clipboard!

4.4.3. Errors for authentication requestsCopy linkLink copied to clipboard!

4.5. Further ReadingCopy linkLink copied to clipboard!

Chapter 5. Troubleshooting using metricsCopy linkLink copied to clipboard!

5.1. List of Red Hat build of Keycloak key metricsCopy linkLink copied to clipboard!

5.2. Self-provided metricsCopy linkLink copied to clipboard!

5.2.1. PrerequisitesCopy linkLink copied to clipboard!

5.2.2. MetricsCopy linkLink copied to clipboard!

5.2.2.1. User Event MetricsCopy linkLink copied to clipboard!

5.2.2.2. Password hashingCopy linkLink copied to clipboard!

5.2.3. Next stepsCopy linkLink copied to clipboard!

5.3. JVM metricsCopy linkLink copied to clipboard!

5.3.1. PrerequisitesCopy linkLink copied to clipboard!

5.3.2. MetricsCopy linkLink copied to clipboard!

5.3.2.1. JVM infoCopy linkLink copied to clipboard!

5.3.2.2. Heap memory usageCopy linkLink copied to clipboard!

5.3.2.3. Garbage collectionCopy linkLink copied to clipboard!

5.3.2.4. CPU Usage in KubernetesCopy linkLink copied to clipboard!

5.3.3. Next stepsCopy linkLink copied to clipboard!

5.4. Database MetricsCopy linkLink copied to clipboard!

5.4.1. PrerequisitesCopy linkLink copied to clipboard!

5.4.2. Database connection pool metricsCopy linkLink copied to clipboard!

5.4.3. Next stepsCopy linkLink copied to clipboard!

5.5. HTTP metricsCopy linkLink copied to clipboard!

5.5.1. PrerequisitesCopy linkLink copied to clipboard!

5.5.2. MetricsCopy linkLink copied to clipboard!

5.5.2.1. Processing timeCopy linkLink copied to clipboard!

5.5.2.2. Active requestsCopy linkLink copied to clipboard!

5.5.2.3. BandwidthCopy linkLink copied to clipboard!

5.5.3. Next stepsCopy linkLink copied to clipboard!

5.5.4. Relevant optionsCopy linkLink copied to clipboard!

5.6. Clustering metricsCopy linkLink copied to clipboard!

5.6.1. PrerequisitesCopy linkLink copied to clipboard!

5.6.2. MetricsCopy linkLink copied to clipboard!

5.6.2.1. Response TimeCopy linkLink copied to clipboard!

5.6.2.2. BandwidthCopy linkLink copied to clipboard!

5.6.2.3. Thread PoolCopy linkLink copied to clipboard!

5.6.2.4. Flow ControlCopy linkLink copied to clipboard!

5.6.2.5. RetransmissionsCopy linkLink copied to clipboard!

5.6.2.6. Network PartitionsCopy linkLink copied to clipboard!

5.6.2.6.1. Cluster SizeCopy linkLink copied to clipboard!

5.6.2.6.2. Network Partition EventsCopy linkLink copied to clipboard!

5.6.3. Next stepsCopy linkLink copied to clipboard!

5.7. Embedded Infinispan metrics for single site deploymentsCopy linkLink copied to clipboard!

5.7.1. PrerequisitesCopy linkLink copied to clipboard!

5.7.2. MetricsCopy linkLink copied to clipboard!

5.7.2.1. SizeCopy linkLink copied to clipboard!

5.7.2.2. Data AccessCopy linkLink copied to clipboard!

5.7.2.2.1. StoresCopy linkLink copied to clipboard!

5.7.2.2.2. ReadsCopy linkLink copied to clipboard!

5.7.2.2.3. RemovesCopy linkLink copied to clipboard!

5.7.2.2.4. EvictionCopy linkLink copied to clipboard!

5.7.2.3. LockingCopy linkLink copied to clipboard!

5.7.2.4. TransactionsCopy linkLink copied to clipboard!

5.7.2.5. State TransferCopy linkLink copied to clipboard!

5.7.2.6. Cluster Data ReplicationCopy linkLink copied to clipboard!

Chapter 1. Tracking instance status with health checks
Copy link

1.1. Red Hat build of Keycloak health check endpoints
Copy link

1.2. Enabling the health checks
Copy link

1.3. Using the health checks
Copy link

1.3.1. curl
Copy link

1.3.2. Kubernetes
Copy link

1.3.3. HEALTHCHECK
Copy link

1.4. Available Checks
Copy link

1.5. Relevant options
Copy link

Chapter 2. Gaining insights with metrics
Copy link

2.1. Enabling Metrics
Copy link

2.2. Querying Metrics
Copy link

2.3. Next steps
Copy link

2.4. Relevant options
Copy link

Chapter 3. Monitoring user activities with event metrics
Copy link

3.1. Enable event metrics
Copy link

3.2. Relevant options
Copy link

Chapter 4. Monitoring performance with Service Level Indicators
Copy link

4.1. Prerequisites
Copy link

4.2. Definition of the service delivered
Copy link

4.3. Definition of SLI and SLO
Copy link

4.4. PromQL queries
Copy link

4.4.1. Availability
Copy link

4.4.2. Latency of authentication requests
Copy link

4.4.3. Errors for authentication requests
Copy link

4.5. Further Reading
Copy link

Chapter 5. Troubleshooting using metrics
Copy link

5.1. List of Red Hat build of Keycloak key metrics
Copy link

5.2. Self-provided metrics
Copy link

5.2.1. Prerequisites
Copy link

5.2.2. Metrics
Copy link

5.2.2.1. User Event Metrics
Copy link

5.2.2.2. Password hashing
Copy link

5.2.3. Next steps
Copy link

5.3. JVM metrics
Copy link

5.3.1. Prerequisites
Copy link

5.3.2. Metrics
Copy link

5.3.2.1. JVM info
Copy link

5.3.2.2. Heap memory usage
Copy link

5.3.2.3. Garbage collection
Copy link

5.3.2.4. CPU Usage in Kubernetes
Copy link

5.3.3. Next steps
Copy link

5.4. Database Metrics
Copy link

5.4.1. Prerequisites
Copy link

5.4.2. Database connection pool metrics
Copy link

5.4.3. Next steps
Copy link

5.5. HTTP metrics
Copy link

5.5.1. Prerequisites
Copy link

5.5.2. Metrics
Copy link

5.5.2.1. Processing time
Copy link

5.5.2.2. Active requests
Copy link

5.5.2.3. Bandwidth
Copy link

5.5.3. Next steps
Copy link

5.5.4. Relevant options
Copy link

5.6. Clustering metrics
Copy link

5.6.1. Prerequisites
Copy link

5.6.2. Metrics
Copy link

5.6.2.1. Response Time
Copy link

5.6.2.2. Bandwidth
Copy link

5.6.2.3. Thread Pool
Copy link

5.6.2.4. Flow Control
Copy link

5.6.2.5. Retransmissions
Copy link

5.6.2.6. Network Partitions
Copy link

5.6.2.6.1. Cluster Size
Copy link

5.6.2.6.2. Network Partition Events
Copy link

5.6.3. Next steps
Copy link

5.7. Embedded Infinispan metrics for single site deployments
Copy link

5.7.1. Prerequisites
Copy link

5.7.2. Metrics
Copy link

5.7.2.1. Size
Copy link

5.7.2.2. Data Access
Copy link

5.7.2.2.1. Stores
Copy link

5.7.2.2.2. Reads
Copy link

5.7.2.2.3. Removes
Copy link

5.7.2.2.4. Eviction
Copy link

5.7.2.3. Locking
Copy link

5.7.2.4. Transactions
Copy link

5.7.2.5. State Transfer
Copy link

5.7.2.6. Cluster Data Replication
Copy link

5.7.3. Next steps
Copy link