Chapter 6. Loki query performance troubleshooting
This documentation details methods for optimizing your Logging stack to improve query performance and provides steps for troubleshooting.
6.1. Best practices for Loki query performance
You can take the following steps to improve Loki query performance:
- Ensure that you are running the latest version of the Loki Operator.
- Ensure that you have migrated the LokiStack schema to the `v13` version.
- Ensure that you use reliable and fast object storage. Loki places significant demands on object storage. If you are not using an object storage solution from a cloud provider, use solid-state drives (SSDs) for your object storage. By using SSDs, you can benefit from the high parallelization capabilities of Loki.

  To better understand the utilization of object storage by Loki, you can use the following query in the Metrics dashboard in the OpenShift Container Platform web console:

  ```
  sum by(status, container, operation) (label_replace(rate(loki_s3_request_duration_seconds_count{namespace="openshift-logging"}[5m]), "status", "${1}xx", "status_code", "([0-9]).."))
  ```

- The Loki Operator enables automatic stream sharding by default. The default automatic stream sharding mechanism is adequate in most cases, and you should not need to configure the `perStream*` attributes.
- If you use the OpenTelemetry Protocol (OTLP) data model, you can configure additional stream labels in LokiStack. For more information, see Best practices for Loki labels.
- Different types of queries have different performance characteristics. Use simple filter queries instead of regular expressions for better performance.
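As an illustration of the last point, the following LogQL queries search the same hypothetical stream; the stream selector and filter values are only examples. The exact-match line filter (`|=`) is typically much cheaper to evaluate than the regular expression filter (`|~`):

```
# Faster: exact-substring line filter
{log_type="application"} |= "connection refused"

# Slower: regular expression evaluated against every line
{log_type="application"} |~ "connection (refused|reset)"
```

When a regular expression is unavoidable, placing a cheap `|=` filter before it reduces the number of lines the regular expression must process.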
6.2. Best practices for Loki labels
Labels in Loki are the keyspace on which Loki shards incoming data. They are also the index used for finding logs at query-time. You can optimize query performance by properly using labels.
Consider the following criteria when creating labels:
- Labels should describe infrastructure. This could include regions, clusters, servers, applications, namespaces, or environments.
- Labels are long-lived. Label values should generate logs perpetually, or at least for several hours.
- Labels are intuitive for querying.
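For example, a query whose stream selector narrows the search with infrastructure labels before applying a filter might look like the following sketch; the label names match the OpenShift Logging defaults, but the values are illustrative:

```
{log_type="infrastructure", kubernetes_namespace_name="openshift-etcd"} |= "error"
```

Because labels are the index, a selective stream selector like this lets Loki skip most chunks entirely, whereas a filter-only query must scan every stream.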
6.3. Configuration of stream labels in Loki Operator
Configuring which labels the Loki Operator uses as stream labels depends on the data model you are using: ViaQ or OpenTelemetry Protocol (OTLP).
Both models come with a predefined set of stream labels. For more information, see OpenTelemetry data model.
- ViaQ model
  ViaQ does not support structured metadata. To configure stream labels for the ViaQ model, add the configuration in the `ClusterLogForwarder` resource. The `lokiStack.labelKeys` field contains the configuration that maps log record keys to Loki labels used to identify streams.
- OTLP model
  In the OTLP model, all labels that are not specified as stream labels are attached as structured metadata.
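As a sketch of the ViaQ configuration, a `ClusterLogForwarder` output for LokiStack can set the `lokiStack.labelKeys` field. The field layout below is an assumption based on the `observability.openshift.io/v1` API, and the key names are illustrative; verify the exact schema against the CRD installed in your cluster:

```yaml
apiVersion: observability.openshift.io/v1
kind: ClusterLogForwarder
metadata:
  name: instance
  namespace: openshift-logging
spec:
  outputs:
  - name: default-lokistack
    type: lokiStack
    lokiStack:
      target:
        name: logging-loki
        namespace: openshift-logging
      labelKeys:
        global:                            # illustrative log record keys mapped to Loki stream labels
        - log_type
        - kubernetes.namespace_name
```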
The following are the best practices for creating stream labels:
- The labels have a low cardinality, with at most tens of values.
- The values are long lived. For example, the first level of an HTTP path: `/load`, `/save`, and `/update`.
- The labels can be used in queries to improve query performance.
6.4. Analyzing Loki query performance
Every query and subquery in Loki generates a `metrics.go` log line with performance statistics. Subqueries emit the log line in the queriers. Every query also has a single associated summary `metrics.go` line emitted by the query frontend. Use these statistics to calculate the query performance metrics.
Prerequisites
- You have administrator permissions.
- You have access to the OpenShift Container Platform web console.
- You installed and configured Loki Operator.
Procedure
- In the OpenShift Container Platform web console, navigate to Observe → Metrics. Note the following values:
- duration: Denotes the amount of time a query took to run.
- queue_time: Denotes the time a query spent in the queue before being processed.
- chunk_refs_fetch_time: Denotes the amount of time spent in getting chunk information from the index.
- store_chunks_download_time: Denotes the amount of time spent in getting chunks from cache or storage.
Calculate the following performance metrics:
- Calculate the total query time as `total_duration`:

  ```
  total_duration = duration + queue_time
  ```

- Calculate the percentage of the total duration that a query spent in the queue as `Queue Time`:

  ```
  Queue Time = queue_time / total_duration * 100
  ```

- Calculate the percentage of the total duration that was spent in getting chunk information from the index as `Chunk Refs Fetch Time`:

  ```
  Chunk Refs Fetch Time = chunk_refs_fetch_time / total_duration * 100
  ```

- Calculate the percentage of the total duration that was spent in getting chunks from cache or storage as `Chunks Download Time`:

  ```
  Chunks Download Time = store_chunks_download_time / total_duration * 100
  ```

- Calculate the percentage of the total duration that was spent in executing the query as `Execution Time`:

  ```
  Execution Time = (duration - chunk_refs_fetch_time - store_chunks_download_time) / total_duration * 100
  ```
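The calculations above can be sketched as a small helper; the sample numbers are illustrative, not taken from a real `metrics.go` line:

```python
def query_breakdown(duration, queue_time, chunk_refs_fetch_time,
                    store_chunks_download_time):
    """Return each query phase as a percentage of the total query time."""
    total_duration = duration + queue_time
    return {
        "queue_time_pct": queue_time / total_duration * 100,
        "chunk_refs_fetch_pct": chunk_refs_fetch_time / total_duration * 100,
        "chunks_download_pct": store_chunks_download_time / total_duration * 100,
        "execution_pct": (duration - chunk_refs_fetch_time
                          - store_chunks_download_time) / total_duration * 100,
    }

# Example: a query that ran for 4 s after waiting 1 s in the queue.
breakdown = query_breakdown(duration=4.0, queue_time=1.0,
                            chunk_refs_fetch_time=0.5,
                            store_chunks_download_time=1.5)
for name, pct in breakdown.items():
    print(f"{name}: {pct:.0f}%")
```

Note that the four percentages always sum to 100, which is a quick sanity check on the values you read from the log line.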
- Refer to Query performance analysis to understand the reason for each metric and how each metric affects query performance.
6.5. Query performance analysis
For best query performance, you want as much time as possible spent in execution, denoted by the Execution Time metric. See the following table for the reasons the other performance metrics might be high and the steps you can take to improve them. You can also reduce the execution time by modifying your queries, thereby improving the overall performance.
| Issue | Reason | Fix |
|---|---|---|
| High `Execution Time` | Queries might be doing many CPU-intensive operations, such as regular expression processing. | Simplify the queries, for example by replacing regular expression filters with simple line filters. |
| High `Execution Time` | Your queries have many small log lines. | If your queries have many small lines, execution becomes dependent on how fast Loki can iterate the lines themselves. This becomes a CPU clock frequency bottleneck. To make things faster, you need a faster CPU. |
| High `Queue Time` | You do not have enough queriers running. | The only fix is to increase the number of querier replicas in the `LokiStack` custom resource. |
| High `Chunk Refs Fetch Time` | Insufficient number of index-gateway replicas in the `LokiStack` custom resource. | Increase the number of index-gateway replicas or ensure that they have enough CPU resources. |
| High `Chunks Download Time` | The chunks might be too small. | Check the average chunk size by dividing the total bytes downloaded by the number of chunks downloaded, as reported in the `metrics.go` line. |
| Query timing out | The query timeout value might be too low. | Increase the query timeout value in the `LokiStack` custom resource. |
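As a sketch of the replica-related fixes in the table, the `LokiStack` custom resource exposes per-component replica counts under `spec.template`. The replica values below are illustrative, not recommendations; size them against your own queue and fetch times:

```yaml
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
  # ...existing size and storage configuration...
  template:
    querier:
      replicas: 3        # more querier replicas reduce Queue Time
    indexGateway:
      replicas: 2        # more index-gateway replicas reduce Chunk Refs Fetch Time
```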