Chapter 2. Configuring the log store
You can configure a LokiStack
custom resource (CR) to store application, audit, and infrastructure-related logs.
Loki is a horizontally scalable, highly available, multi-tenant log aggregation system offered as a GA log store for logging for Red Hat OpenShift that can be visualized with the OpenShift Observability UI. The Loki configuration provided by OpenShift Logging is a short-term log store designed to enable users to perform fast troubleshooting with the collected logs. For that purpose, the logging for Red Hat OpenShift configuration of Loki has short-term storage, and is optimized for very recent queries.
For long-term storage or queries over a long time period, users should look to log stores external to their cluster. Loki sizing is only tested and supported for short-term storage, for a maximum of 30 days.
2.1. Loki deployment sizing
Sizing for Loki follows the format of 1x.<size>, where the value 1x is the number of instances and <size> specifies performance capabilities.
The 1x.pico configuration defines a single Loki deployment with minimal resource and limit requirements, offering high availability (HA) support for all Loki components. This configuration is suited for deployments that do not require a single replication factor or auto-compaction.
Disk requests are similar across size configurations, allowing customers to test different sizes to determine the best fit for their deployment needs.
It is not possible to change the number 1x for the deployment size.
 | 1x.demo | 1x.pico [6.1+ only] | 1x.extra-small | 1x.small | 1x.medium |
---|---|---|---|---|---|
Data transfer | Demo use only | 50GB/day | 100GB/day | 500GB/day | 2TB/day |
Queries per second (QPS) | Demo use only | 1-25 QPS at 200ms | 1-25 QPS at 200ms | 25-50 QPS at 200ms | 25-75 QPS at 200ms |
Replication factor | None | 2 | 2 | 2 | 2 |
Total CPU requests | None | 7 vCPUs | 14 vCPUs | 34 vCPUs | 54 vCPUs |
Total CPU requests if using the ruler | None | 8 vCPUs | 16 vCPUs | 42 vCPUs | 70 vCPUs |
Total memory requests | None | 17Gi | 31Gi | 67Gi | 139Gi |
Total memory requests if using the ruler | None | 18Gi | 35Gi | 83Gi | 171Gi |
Total disk requests | 40Gi | 590Gi | 430Gi | 430Gi | 590Gi |
Total disk requests if using the ruler | 60Gi | 910Gi | 750Gi | 750Gi | 910Gi |
2.2. Loki object storage
The Loki Operator supports AWS S3, as well as other S3 compatible object stores such as Minio and OpenShift Data Foundation. Azure, GCS, and Swift are also supported.
The recommended nomenclature for Loki storage is logging-loki-<your_storage_provider>.
The following table shows the type values within the LokiStack custom resource (CR) for each storage provider. For more information, see the section on your storage provider.
Storage provider | Secret type value |
---|---|
AWS | s3 |
Azure | azure |
Google Cloud | gcs |
Minio | s3 |
OpenShift Data Foundation | s3 |
Swift | swift |
2.2.1. AWS storage
You can create object storage in Amazon Web Services (AWS) to store logs.
Prerequisites
- You installed the Loki Operator.
- You installed the OpenShift CLI (oc).
- You created a bucket on AWS.
- You created an AWS IAM Policy and IAM User.
Procedure
Create an object storage secret with the name logging-loki-aws by running the following command:
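A representative form of the command follows; the placeholder values and the exact set of --from-literal keys (bucketnames, endpoint, region, access_key_id, access_key_secret, forcepathstyle) are assumptions to adapt to your AWS configuration:
$ oc -n openshift-logging create secret generic logging-loki-aws \
  --from-literal=bucketnames="<bucket_name>" \
  --from-literal=endpoint="<aws_bucket_endpoint>" \
  --from-literal=region="<aws_region_of_your_bucket>" \
  --from-literal=access_key_id="<aws_access_key_id>" \
  --from-literal=access_key_secret="<aws_access_key_secret>" \
  --from-literal=forcepathstyle="false"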
In this command:
- logging-loki-aws is the name of the secret.
- AWS endpoints (those ending in .amazonaws.com) use a virtual-hosted style by default, which is equivalent to setting the forcepathstyle attribute to false. Conversely, non-AWS endpoints use a path style, equivalent to setting the forcepathstyle attribute to true. If you need to use a virtual-hosted style with non-AWS S3 services, you must explicitly set forcepathstyle to false.
2.2.1.1. AWS storage for STS enabled clusters
If your cluster has the AWS Security Token Service (STS) enabled, the Cloud Credential Operator (CCO) supports short-term authentication by using AWS tokens.
You can create the Loki object storage secret manually by running the following command:
$ oc -n openshift-logging create secret generic "logging-loki-aws" \
--from-literal=bucketnames="<s3_bucket_name>" \
--from-literal=region="<bucket_region>" \
--from-literal=audience="<oidc_audience>"
The audience value is optional; the default value is openshift.
2.2.2. Azure storage
Prerequisites
- You installed the Loki Operator.
- You installed the OpenShift CLI (oc).
- You created a bucket on Azure.
Procedure
Create an object storage secret with the name logging-loki-azure by running the following command:
$ oc create secret generic logging-loki-azure \
  --from-literal=container="<azure_container_name>" \
  --from-literal=environment="<azure_environment>" \
  --from-literal=account_name="<azure_account_name>" \
  --from-literal=account_key="<azure_account_key>"
The supported environment values are AzureGlobal, AzureChinaCloud, AzureGermanCloud, or AzureUSGovernment.
2.2.2.1. Azure storage for Microsoft Entra Workload ID enabled clusters
If your cluster has Microsoft Entra Workload ID enabled, the Cloud Credential Operator (CCO) supports short-term authentication using Workload ID.
You can create the Loki object storage secret manually by running the following command:
$ oc -n openshift-logging create secret generic logging-loki-azure \
--from-literal=environment="<azure_environment>" \
--from-literal=account_name="<storage_account_name>" \
--from-literal=container="<container_name>"
2.2.3. Google Cloud Platform storage
Prerequisites
- You installed the Loki Operator.
- You installed the OpenShift CLI (oc).
- You created a project on Google Cloud Platform (GCP).
- You created a bucket in the same project.
- You created a service account in the same project for GCP authentication.
Procedure
Copy the service account credentials received from GCP into a file called key.json.
Create an object storage secret with the name logging-loki-gcs by running the following command:
$ oc create secret generic logging-loki-gcs \
  --from-literal=bucketname="<bucket_name>" \
  --from-file=key.json="<path/to/key.json>"
2.2.4. Minio storage
You can create object storage in Minio to store logs.
Prerequisites
- You installed the Loki Operator.
- You installed the OpenShift CLI (oc).
- You created a bucket on Minio.
Procedure
Create an object storage secret with the name logging-loki-minio by running the following command:
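A representative form of the command follows; the placeholder values and the exact set of --from-literal keys (bucketnames, endpoint, access_key_id, access_key_secret, forcepathstyle) are assumptions to adapt to your Minio deployment:
$ oc -n openshift-logging create secret generic logging-loki-minio \
  --from-literal=bucketnames="<bucket_name>" \
  --from-literal=endpoint="<minio_bucket_endpoint>" \
  --from-literal=access_key_id="<minio_access_key_id>" \
  --from-literal=access_key_secret="<minio_access_key_secret>" \
  --from-literal=forcepathstyle="true"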
In this command:
- logging-loki-minio is the name of the secret.
- AWS endpoints (those ending in .amazonaws.com) use a virtual-hosted style by default, which is equivalent to setting the forcepathstyle attribute to false. Conversely, non-AWS endpoints use a path style, equivalent to setting the forcepathstyle attribute to true. If you need to use a virtual-hosted style with non-AWS S3 services, you must explicitly set forcepathstyle to false.
2.2.5. OpenShift Data Foundation storage
You can create object storage in OpenShift Data Foundation to store logs.
Prerequisites
- You installed the Loki Operator.
- You installed the OpenShift CLI (oc).
- You deployed OpenShift Data Foundation.
- You configured your OpenShift Data Foundation cluster for object storage.
Procedure
Create an ObjectBucketClaim custom resource in the openshift-logging namespace:
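A minimal sketch of the ObjectBucketClaim CR follows; the loki-bucket-odf name matches the ConfigMap and secret referenced in the next steps, while the generateBucketName value and the openshift-storage.noobaa.io storage class are assumptions to adapt to your environment:
apiVersion: objectbucket.io/v1alpha1
kind: ObjectBucketClaim
metadata:
  name: loki-bucket-odf
  namespace: openshift-logging
spec:
  generateBucketName: loki-bucket-odf
  storageClassName: openshift-storage.noobaa.io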
Get bucket properties from the associated ConfigMap object by running the following commands:
BUCKET_HOST=$(oc get -n openshift-logging configmap loki-bucket-odf -o jsonpath='{.data.BUCKET_HOST}')
BUCKET_NAME=$(oc get -n openshift-logging configmap loki-bucket-odf -o jsonpath='{.data.BUCKET_NAME}')
BUCKET_PORT=$(oc get -n openshift-logging configmap loki-bucket-odf -o jsonpath='{.data.BUCKET_PORT}')
Get the bucket access key from the associated secret by running the following commands:
ACCESS_KEY_ID=$(oc get -n openshift-logging secret loki-bucket-odf -o jsonpath='{.data.AWS_ACCESS_KEY_ID}' | base64 -d)
SECRET_ACCESS_KEY=$(oc get -n openshift-logging secret loki-bucket-odf -o jsonpath='{.data.AWS_SECRET_ACCESS_KEY}' | base64 -d)
Create an object storage secret with the name logging-loki-odf by running the following command:
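A representative form of the command follows, reusing the variables captured in the previous steps; the endpoint scheme and the exact set of --from-literal keys are assumptions to adapt to your cluster:
$ oc -n openshift-logging create secret generic logging-loki-odf \
  --from-literal=access_key_id="${ACCESS_KEY_ID}" \
  --from-literal=access_key_secret="${SECRET_ACCESS_KEY}" \
  --from-literal=bucketnames="${BUCKET_NAME}" \
  --from-literal=endpoint="https://${BUCKET_HOST}:${BUCKET_PORT}"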
In this command:
- logging-loki-odf is the name of the secret.
- AWS endpoints (those ending in .amazonaws.com) use a virtual-hosted style by default, which is equivalent to setting the forcepathstyle attribute to false. Conversely, non-AWS endpoints use a path style, equivalent to setting the forcepathstyle attribute to true. If you need to use a virtual-hosted style with non-AWS S3 services, you must explicitly set forcepathstyle to false.
2.2.6. Swift storage
Prerequisites
- You installed the Loki Operator.
- You installed the OpenShift CLI (oc).
- You created a bucket on Swift.
Procedure
Create an object storage secret with the name logging-loki-swift by running the following command:
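A representative form of the command follows; the key names reflect the Swift connection fields, and the placeholder values are assumptions to adapt to your Swift deployment:
$ oc create secret generic logging-loki-swift \
  --from-literal=auth_url="<swift_auth_url>" \
  --from-literal=username="<swift_username>" \
  --from-literal=user_domain_name="<swift_user_domain_name>" \
  --from-literal=user_domain_id="<swift_user_domain_id>" \
  --from-literal=user_id="<swift_user_id>" \
  --from-literal=password="<swift_password>" \
  --from-literal=domain_id="<swift_domain_id>" \
  --from-literal=domain_name="<swift_domain_name>" \
  --from-literal=container_name="<swift_container_name>"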
You can optionally provide project-specific data, region, or both by running the following command:
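A sketch of the extended command, with the additional project and region keys as assumptions:
$ oc create secret generic logging-loki-swift \
  --from-literal=auth_url="<swift_auth_url>" \
  --from-literal=username="<swift_username>" \
  --from-literal=user_domain_name="<swift_user_domain_name>" \
  --from-literal=user_domain_id="<swift_user_domain_id>" \
  --from-literal=user_id="<swift_user_id>" \
  --from-literal=password="<swift_password>" \
  --from-literal=domain_id="<swift_domain_id>" \
  --from-literal=domain_name="<swift_domain_name>" \
  --from-literal=container_name="<swift_container_name>" \
  --from-literal=project_id="<swift_project_id>" \
  --from-literal=project_name="<swift_project_name>" \
  --from-literal=project_domain_id="<swift_project_domain_id>" \
  --from-literal=project_domain_name="<swift_project_domain_name>" \
  --from-literal=region="<swift_region>"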
2.2.7. Deploying a Loki log store on a cluster that uses short-term credentials
For some storage providers, you can use the Cloud Credential Operator utility (ccoctl) during installation to implement short-term credentials. These credentials are created and managed outside the OpenShift Container Platform cluster. For more information, see Manual mode with short-term credentials for components.
Short-term credential authentication must be configured during a new installation of the Loki Operator on a cluster that uses this credentials strategy. You cannot configure an existing cluster that uses a different credentials strategy to use this feature.
2.2.7.1. Authenticating with workload identity federation to access cloud-based log stores
You can use workload identity federation with short-lived tokens to authenticate to cloud-based log stores. With workload identity federation, you do not have to store long-lived credentials in your cluster, which reduces the risk of credential leaks and simplifies secret management.
Prerequisites
- You have administrator permissions.
Procedure
Use one of the following options to enable authentication:
- If you used the OpenShift Container Platform web console to install the Loki Operator, the system automatically detects clusters that use short-lived tokens. You are prompted to create roles and supply the data required for the Loki Operator to create a CredentialsRequest object, which populates a secret.
- If you used the OpenShift CLI (oc) to install the Loki Operator, you must manually create a Subscription object. Use the appropriate template for your storage provider, as shown in the following samples. This authentication strategy supports only the storage providers indicated within the samples.
Microsoft Azure sample subscription
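A minimal sketch of the Subscription object, assuming the openshift-operators-redhat namespace and a stable-6.1 channel; the CLIENTID, TENANTID, SUBSCRIPTIONID, and REGION environment variables carry the Azure identity values:
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: loki-operator
  namespace: openshift-operators-redhat
spec:
  channel: "stable-6.1"
  installPlanApproval: Automatic
  name: loki-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  config:
    env:
    - name: CLIENTID
      value: <your_client_id>
    - name: TENANTID
      value: <your_tenant_id>
    - name: SUBSCRIPTIONID
      value: <your_subscription_id>
    - name: REGION
      value: <your_region>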
Amazon Web Services (AWS) sample subscription
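A corresponding sketch for AWS STS, where the ROLEARN environment variable supplies the IAM role that the Operator assumes; the channel and namespace values are the same assumptions as above:
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: loki-operator
  namespace: openshift-operators-redhat
spec:
  channel: "stable-6.1"
  installPlanApproval: Automatic
  name: loki-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  config:
    env:
    - name: ROLEARN
      value: <role_ARN>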
Google Cloud Platform (GCP) sample subscription
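A corresponding sketch for GCP workload identity federation, where AUDIENCE and SERVICE_ACCOUNT_EMAIL identify the federated provider and service account, again under the same channel and namespace assumptions:
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: loki-operator
  namespace: openshift-operators-redhat
spec:
  channel: "stable-6.1"
  installPlanApproval: Automatic
  name: loki-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  config:
    env:
    - name: AUDIENCE
      value: <audience_url>
    - name: SERVICE_ACCOUNT_EMAIL
      value: <service_account_email>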
2.2.7.2. Creating a LokiStack custom resource by using the web console
You can create a LokiStack custom resource (CR) by using the OpenShift Container Platform web console.
Prerequisites
- You have administrator permissions.
- You have access to the OpenShift Container Platform web console.
- You installed the Loki Operator.
Procedure
- Go to the Operators → Installed Operators page. Click the All instances tab.
- From the Create new drop-down list, select LokiStack.
Select YAML view, and then use the following template to create a LokiStack CR:
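A minimal sketch of the template, assuming an S3-backed secret named logging-loki-s3 and a v13 schema; adapt the schema effective date, secret, and storage class to your environment:
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
  size: 1x.small
  storage:
    schemas:
    - version: v13
      effectiveDate: "<yyyy>-<mm>-<dd>"
    secret:
      name: logging-loki-s3
      type: s3
      credentialMode: static
  storageClassName: <storage_class_name>
  tenants:
    mode: openshift-logging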
In this template:
- name: Use the name logging-loki.
- size: Specify the deployment size. In the logging 5.8 and later versions, the supported size options for production instances of Loki are 1x.extra-small, 1x.small, or 1x.medium.
- secret.name: Specify the secret used for your log storage.
- secret.type: Specify the corresponding storage type.
- credentialMode: Optional field, logging 5.9 and later. Supported user-configured values are as follows:
  - static is the default authentication mode, available for all supported object storage types, and uses credentials stored in a Secret.
  - token is for short-lived tokens retrieved from a credential source. In this mode, the static configuration does not contain the credentials needed for the object storage. Instead, they are generated during runtime by using a service, which allows for shorter-lived credentials and much more granular control. This authentication mode is not supported for all object storage types.
  - token-cco is the default value when Loki is running in managed STS mode and using CCO on STS/WIF clusters.
- storageClassName: Enter the name of a storage class for temporary storage. For best performance, specify a storage class that allocates block storage. You can list the available storage classes for your cluster by using the oc get storageclasses command.
2.2.7.3. Creating a secret for Loki object storage by using the CLI
To configure Loki object storage, you must create a secret. You can do this by using the OpenShift CLI (oc).
Prerequisites
- You have administrator permissions.
- You installed the Loki Operator.
- You installed the OpenShift CLI (oc).
Procedure
Create a secret in the directory that contains your certificate and key files by running the following command:
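A representative form of the command follows; the file arguments and the tls.key, tls.crt, and ca-bundle.crt key names are assumptions to adapt to your certificate material:
$ oc create secret generic -n openshift-logging <your_secret_name> \
  --from-file=tls.key=<your_key_file> \
  --from-file=tls.crt=<your_crt_file> \
  --from-file=ca-bundle.crt=<your_bundle_file> \
  --from-literal=username=<your_username> \
  --from-literal=password=<your_password>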
Use generic or opaque secrets for best results.
Verification
Verify that a secret was created by running the following command:
$ oc get secrets
2.2.8. Fine-grained access for Loki logs
The Red Hat OpenShift Logging Operator does not grant all users access to logs by default. As an administrator, you must configure your users' access unless the Operator was upgraded and prior configurations are in place. Depending on your configuration and need, you can configure fine-grained access to logs by using the following:
- Cluster wide policies
- Namespace scoped policies
- Creation of custom admin groups
As an administrator, you need to create the role bindings and cluster role bindings appropriate for your deployment. The Red Hat OpenShift Logging Operator provides the following cluster roles:
- cluster-logging-application-view grants permission to read application logs.
- cluster-logging-infrastructure-view grants permission to read infrastructure logs.
- cluster-logging-audit-view grants permission to read audit logs.
If you have upgraded from a prior version, an additional cluster role logging-application-logs-reader and associated cluster role binding logging-all-authenticated-application-logs-reader provide backward compatibility, allowing any authenticated user read access in their namespaces.
Users with access by namespace must provide a namespace when querying application logs.
2.2.8.1. Cluster wide access
Cluster role binding resources reference cluster roles, and set permissions cluster wide.
Example ClusterRoleBinding
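A sketch of a ClusterRoleBinding that grants every authenticated user the application-log reader role; the binding name is an assumption:
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: logging-all-application-logs-reader
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-logging-application-view
subjects:
- kind: Group
  name: system:authenticated
  apiGroup: rbac.authorization.k8s.io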
2.2.8.2. Namespaced access
You can use RoleBinding resources with ClusterRole objects to define the namespace for which a user or group can access logs.
Example RoleBinding
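A sketch of a RoleBinding that grants one user read access to application logs in a single namespace; the binding name, namespace, and username are assumptions:
kind: RoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: allow-read-logs
  namespace: log-test-0
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-logging-application-view
subjects:
- kind: User
  apiGroup: rbac.authorization.k8s.io
  name: testuser-0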
The metadata.namespace field specifies the namespace this RoleBinding applies to.
2.2.8.3. Custom admin group access
If you have a large deployment with several users who require broader permissions, you can create a custom group by using the adminGroups field. Users who are members of any group specified in the adminGroups field of the LokiStack CR are considered administrators.
Administrator users have access to all application logs in all namespaces, if they also get assigned the cluster-logging-application-view role.
Example LokiStack CR
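A sketch of the relevant LokiStack CR fields, assuming the openshift-logging tenancy mode; the group names are placeholders:
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
  tenants:
    mode: openshift-logging
    openshift:
      adminGroups:
      - cluster-admin
      - custom-admin-group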
2.2.9. Creating a new group for the cluster-admin user role
Querying application logs for multiple namespaces as a cluster-admin user, where the sum total of characters of all of the namespaces in the cluster is greater than 5120, results in the error Parse error: input size too long (XXXX > 5120). For better control over access to logs in LokiStack, make the cluster-admin user a member of the cluster-admin group. If the cluster-admin group does not exist, create it and add the desired users to it.
Use the following procedure to create a new group for users with cluster-admin permissions.
Procedure
Enter the following command to create a new group:
$ oc adm groups new cluster-admin
Enter the following command to add the desired user to the cluster-admin group:
$ oc adm groups add-users cluster-admin <username>
Enter the following command to add the cluster-admin user role to the group:
$ oc adm policy add-cluster-role-to-group cluster-admin cluster-admin
2.3. Enhanced reliability and performance
Use the following configurations to ensure reliability and efficiency of Loki in production.
2.3.1. Loki pod placement
You can control which nodes the Loki pods run on, and prevent other workloads from using those nodes, by using tolerations or node selectors on the pods.
You can apply tolerations to the log store pods with the LokiStack custom resource (CR) and apply taints to a node with the node specification. A taint on a node is a key:value pair that instructs the node to repel all pods that do not allow the taint. Using a specific key:value pair that is not on other pods ensures that only the log store pods can run on that node.
Example LokiStack with node selectors
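A sketch that pins Loki components to infrastructure nodes through nodeSelector; only two components are shown, and the remaining components (distributor, gateway, indexGateway, querier, queryFrontend, ruler) accept the same fields:
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
  template:
    compactor:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
    ingester:
      nodeSelector:
        node-role.kubernetes.io/infra: ""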
Example LokiStack CR with node selectors and tolerations
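A sketch that adds matching tolerations so the pods can land on tainted infrastructure nodes; the taint key, value, and effects are assumptions that must mirror the taints you applied:
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
  template:
    compactor:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      tolerations:
      - key: node-role.kubernetes.io/infra
        value: reserved
        effect: NoSchedule
      - key: node-role.kubernetes.io/infra
        value: reserved
        effect: NoExecute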
To configure the nodeSelector and tolerations fields of the LokiStack CR, you can use the oc explain command to view the description and fields for a particular resource:
$ oc explain lokistack.spec.template
Example output
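A truncated sketch of the output, assuming the standard oc explain format; the field list mirrors the Loki components named below:
KIND:     LokiStack
VERSION:  loki.grafana.com/v1

RESOURCE: template <Object>

FIELDS:
   compactor    <Object>
     Compactor defines the compaction component spec.

   distributor  <Object>
     Distributor defines the distributor component spec.
...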
For more detailed information, you can add a specific field:
$ oc explain lokistack.spec.template.compactor
Example output
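A similar sketch of the more specific query, again assuming the standard oc explain output format:
KIND:     LokiStack
VERSION:  loki.grafana.com/v1

RESOURCE: compactor <Object>

FIELDS:
   nodeSelector   <map[string]string>
     NodeSelector defines the labels required by a node to schedule the
     component onto it.

   tolerations    <[]Object>
     Tolerations defines the tolerations required by a node to schedule the
     component onto it.
...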
2.3.2. Configuring Loki to tolerate node failure
In the logging 5.8 and later versions, the Loki Operator supports setting pod anti-affinity rules to request that pods of the same component are scheduled on different available nodes in the cluster.
Affinity is a property of pods that controls the nodes on which they prefer to be scheduled. Anti-affinity is a property of pods that prevents a pod from being scheduled on a node.
In OpenShift Container Platform, pod affinity and pod anti-affinity allow you to constrain which nodes your pod is eligible to be scheduled on based on the key-value labels on other pods.
The Operator sets default, preferred podAntiAffinity rules for all Loki components, which includes the compactor, distributor, gateway, indexGateway, ingester, querier, queryFrontend, and ruler components.
You can override the preferred podAntiAffinity settings for Loki components by configuring required settings in the requiredDuringSchedulingIgnoredDuringExecution field:
Example user settings for the ingester component
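A sketch of a required anti-affinity rule for the ingester, assuming the app.kubernetes.io/component label identifies ingester pods:
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
  template:
    ingester:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/component: ingester
          topologyKey: kubernetes.io/hostname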
2.3.3. Enabling stream-based retention with Loki
You can configure retention policies based on log streams. You can set retention rules globally, per-tenant, or both. If you configure both, tenant rules apply before global rules.
If there is no retention period defined on the S3 bucket or in the LokiStack custom resource (CR), then the logs are not pruned and they stay in the S3 bucket forever, which might fill up the S3 storage.
Although logging version 5.9 and later supports schema v12, schema v13 is recommended for future compatibility.
For cost-effective log pruning, configure retention policies directly on the object storage provider. Use the lifecycle management features of the storage provider to ensure automatic deletion of old logs. This also avoids extra processing from Loki and delete requests to S3.
If the object storage does not support lifecycle policies, you must configure LokiStack to enforce retention internally. The supported retention period is up to 30 days.
Prerequisites
- You have administrator permissions.
- You have installed the Loki Operator.
- You have installed the OpenShift CLI (oc).
Procedure
To enable stream-based retention, create a LokiStack CR and save it as a YAML file. In the following example, it is called lokistack.yaml.
Example global stream-based retention for S3
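A sketch of a global retention configuration; the retention periods, stream selectors, and storage details are assumptions to adapt:
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
  limits:
    global:
      retention:
        days: 20
        streams:
        - days: 4
          priority: 1
          selector: '{kubernetes_namespace_name=~"test.+"}'
        - days: 1
          priority: 1
          selector: '{log_type="infrastructure"}'
  size: 1x.small
  storage:
    schemas:
    - version: v13
      effectiveDate: "<yyyy>-<mm>-<dd>"
    secret:
      name: logging-loki-s3
      type: s3
  storageClassName: <storage_class_name>
  tenants:
    mode: openshift-logging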
In this example:
- The retention block, added to the limits.global section of the CR, enables retention in the cluster.
- The days field sets the retention policy for all log streams. This policy does not impact the retention period for stored logs in object storage.
- Each selector field specifies the LogQL query to match log streams to the retention rule.
Example per-tenant stream-based retention for S3
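A sketch of a per-tenant retention configuration; only the limits section differs from the previous example, and the values are again assumptions:
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
  limits:
    global:
      retention:
        days: 20
    tenants:
      application:
        retention:
          days: 1
          streams:
          - days: 4
            selector: '{kubernetes_namespace_name=~"test.+"}'
      infrastructure:
        retention:
          days: 5
          streams:
          - days: 1
            selector: '{kubernetes_namespace_name=~"openshift-cluster.+"}'
  size: 1x.small
  storage:
    schemas:
    - version: v13
      effectiveDate: "<yyyy>-<mm>-<dd>"
    secret:
      name: logging-loki-s3
      type: s3
  storageClassName: <storage_class_name>
  tenants:
    mode: openshift-logging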
In this example:
- The limits.tenants section sets the retention policy per-tenant. Valid tenant types are application, audit, and infrastructure.
- Each selector field specifies the LogQL query to match log streams to the retention rule.
Apply the LokiStack CR:
$ oc apply -f lokistack.yaml
2.3.4. Configuring Loki to tolerate memberlist creation failure
In an OpenShift Container Platform cluster, administrators generally use a non-private IP network range. As a result, the LokiStack memberlist configuration fails because, by default, it only uses private IP networks.
As an administrator, you can select the pod network for the memberlist configuration. You can modify the LokiStack custom resource (CR) to use the podIP address in the hashRing spec. To configure the LokiStack CR, use the following command:
$ oc patch LokiStack logging-loki -n openshift-logging --type=merge -p '{"spec": {"hashRing":{"memberlist":{"instanceAddrType":"podIP"},"type":"memberlist"}}}'
Example LokiStack to include podIP
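A sketch of the resulting hashRing configuration in the LokiStack CR, mirroring the patch above:
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
  hashRing:
    type: memberlist
    memberlist:
      instanceAddrType: podIP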
2.3.5. LokiStack behavior during cluster restarts
When an OpenShift Container Platform cluster is restarted, LokiStack ingestion and the query path continue to operate within the CPU and memory resources available for the node. This means that there is no downtime for the LokiStack during OpenShift Container Platform cluster updates. This behavior is achieved by using PodDisruptionBudget resources. The Loki Operator provisions PodDisruptionBudget resources for Loki, which determine the minimum number of pods that must be available per component to ensure normal operations under certain conditions.
2.4. Advanced deployment and scalability
To configure high availability, scalability, and error handling, use the following information.
2.4.1. Zone aware data replication
The Loki Operator offers support for zone-aware data replication through pod topology spread constraints. Enabling this feature enhances reliability and safeguards against log loss in the event of a single zone failure. When configuring the deployment size as 1x.extra-small, 1x.small, or 1x.medium, the replication.factor field is automatically set to 2.
To ensure proper replication, you need to have at least as many availability zones as the replication factor specifies. While it is possible to have more availability zones than the replication factor, having fewer zones can lead to write failures. Each zone should host an equal number of instances for optimal operation.
Example LokiStack CR with zone replication enabled
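A sketch of the replication settings, assuming the standard zone topology key used for node labels:
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
  replicationFactor: 2
  replication:
    factor: 2
    zones:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
  size: 1x.small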
In this example:
- replicationFactor is a deprecated field; values entered are overwritten by replication.factor.
- replication.factor is automatically set when the deployment size is selected at setup.
- maxSkew is the maximum difference in number of pods between any two topology domains. The default is 1, and you cannot specify a value of 0.
- topologyKey defines zones in the form of a topology key that corresponds to a node label.
2.4.2. Recovering Loki pods from failed zones
In OpenShift Container Platform, a zone failure happens when specific availability zone resources become inaccessible. Availability zones are isolated areas within a cloud provider’s data center, aimed at enhancing redundancy and fault tolerance. If your OpenShift Container Platform cluster is not configured to handle this, a zone failure can lead to service or data loss.
Loki pods are part of a StatefulSet, and they come with Persistent Volume Claims (PVCs) provisioned by a StorageClass object. Each Loki pod and its PVCs reside in the same zone. When a zone failure occurs in a cluster, the StatefulSet controller automatically attempts to recover the affected pods in the failed zone.
The following procedure deletes the PVCs in the failed zone, and all data contained therein. To avoid complete data loss, the replication factor field of the LokiStack CR should always be set to a value greater than 1 to ensure that Loki is replicating.
Prerequisites
- Verify your LokiStack CR has a replication factor greater than 1.
- Zone failure is detected by the control plane, and nodes in the failed zone are marked by cloud provider integration.
The StatefulSet controller automatically attempts to reschedule pods in a failed zone. Because the associated PVCs are also in the failed zone, automatic rescheduling to a different zone does not work. You must manually delete the PVCs in the failed zone to allow successful re-creation of the stateful Loki Pod and its provisioned PVC in the new zone.
Procedure
List the pods in Pending status by running the following command:
$ oc get pods --field-selector status.phase==Pending -n openshift-logging
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example
oc get pods
outputNAME READY STATUS RESTARTS AGE logging-loki-index-gateway-1 0/1 Pending 0 17m logging-loki-ingester-1 0/1 Pending 0 16m logging-loki-ruler-1 0/1 Pending 0 16m
NAME READY STATUS RESTARTS AGE
1 logging-loki-index-gateway-1 0/1 Pending 0 17m logging-loki-ingester-1 0/1 Pending 0 16m logging-loki-ruler-1 0/1 Pending 0 16m
Copy to Clipboard Copied! Toggle word wrap Toggle overflow - 1
- These pods are in
Pending
status because their corresponding PVCs are in the failed zone.
List the PVCs in Pending status by running the following command:
$ oc get pvc -o=json -n openshift-logging | jq '.items[] | select(.status.phase == "Pending") | .metadata.name' -r
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Example
oc get pvc
outputstorage-logging-loki-index-gateway-1 storage-logging-loki-ingester-1 wal-logging-loki-ingester-1 storage-logging-loki-ruler-1 wal-logging-loki-ruler-1
storage-logging-loki-index-gateway-1 storage-logging-loki-ingester-1 wal-logging-loki-ingester-1 storage-logging-loki-ruler-1 wal-logging-loki-ruler-1
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Delete the PVC(s) for a pod by running the following command:
oc delete pvc <pvc_name> -n openshift-logging
$ oc delete pvc <pvc_name> -n openshift-logging
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Delete the pod(s) by running the following command:
oc delete pod <pod_name> -n openshift-logging
$ oc delete pod <pod_name> -n openshift-logging
Copy to Clipboard Copied! Toggle word wrap Toggle overflow Once these objects have been successfully deleted, they should automatically be rescheduled in an available zone.
2.4.2.1. Troubleshooting PVC in a terminating state
The PVCs might hang in the terminating state without being deleted if the PVC metadata finalizers are set to kubernetes.io/pv-protection. Removing the finalizers should allow the PVCs to delete successfully.
Remove the finalizer for each PVC by running the following command, then retry deletion:
$ oc patch pvc <pvc_name> -p '{"metadata":{"finalizers":null}}' -n openshift-logging
2.4.3. Troubleshooting Loki rate limit errors
If the Log Forwarder API forwards a large block of messages that exceeds the rate limit to Loki, Loki generates rate limit (429) errors.
These errors can occur during normal operation. For example, when adding the logging to a cluster that already has some logs, rate limit errors might occur while the logging tries to ingest all of the existing log entries. In this case, if the rate of addition of new logs is less than the total rate limit, the historical data is eventually ingested, and the rate limit errors are resolved without requiring user intervention.
In cases where the rate limit errors continue to occur, you can fix the issue by modifying the LokiStack custom resource (CR).
The LokiStack CR is not available on Grafana-hosted Loki. This topic does not apply to Grafana-hosted Loki servers.
Conditions
- The Log Forwarder API is configured to forward logs to Loki.
Your system sends a block of messages that is larger than 2 MB to Loki.
After you enter oc logs -n openshift-logging -l component=collector, the collector logs in your cluster show a line containing one of the following error messages:
429 Too Many Requests Ingestion rate limit exceeded
Example Vector error message
2023-08-25T16:08:49.301780Z WARN sink{component_kind="sink" component_id=default_loki_infra component_type=loki component_name=default_loki_infra}: vector::sinks::util::retries: Retrying after error. error=Server responded with an error: 429 Too Many Requests internal_log_rate_limit=true
The error is also visible on the receiving end. For example, in the LokiStack ingester pod:
Example Loki ingester error message
level=warn ts=2023-08-30T14:57:34.155592243Z caller=grpc_logging.go:43 duration=1.434942ms method=/logproto.Pusher/Push err="rpc error: code = Code(429) desc = entry with timestamp 2023-08-30 14:57:32.012778399 +0000 UTC ignored, reason: 'Per stream rate limit exceeded (limit: 3MB/sec) while attempting to ingest for stream
Procedure
Update the ingestionBurstSize and ingestionRate fields in the LokiStack CR:
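A sketch of the relevant limits section; the numeric values are placeholders to tune for your workload:
apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
  limits:
    global:
      ingestion:
        ingestionBurstSize: 16
        ingestionRate: 8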
In this example:
- The ingestionBurstSize field defines the maximum local rate-limited sample size per distributor replica in MB. This value is a hard limit. Set this value to at least the maximum logs size expected in a single push request. Single requests that are larger than the ingestionBurstSize value are not permitted.
- The ingestionRate field is a soft limit on the maximum amount of ingested samples per second in MB. Rate limit errors occur if the rate of logs exceeds the limit, but the collector retries sending the logs. As long as the total average is lower than the limit, the system recovers and errors are resolved without user intervention.
2.5. Log-based alerts
2.5.1. Authorizing LokiStack rules RBAC permissions
Administrators can allow users to create and manage their own alerting and recording rules by binding cluster roles to usernames. Cluster roles are defined as ClusterRole objects that contain the necessary role-based access control (RBAC) permissions for users.
The following cluster roles for alerting and recording rules are available for LokiStack:
Rule name | Description |
---|---|
alertingrules.loki.grafana.com-v1-admin | Users with this role have administrative-level access to manage alerting rules. This cluster role grants permissions to create, read, update, delete, list, and watch AlertingRule resources. |
alertingrules.loki.grafana.com-v1-crdview | Users with this role can view the definitions of Custom Resource Definitions (CRDs) related to AlertingRule resources, but do not have permissions for modifying or managing these resources. |
alertingrules.loki.grafana.com-v1-edit | Users with this role have permission to create, update, and delete AlertingRule resources. |
alertingrules.loki.grafana.com-v1-view | Users with this role can read AlertingRule resources. They can inspect configurations, labels, and annotations for existing alerting rules but cannot modify them. |
recordingrules.loki.grafana.com-v1-admin | Users with this role have administrative-level access to manage recording rules. This cluster role grants permissions to create, read, update, delete, list, and watch RecordingRule resources. |
recordingrules.loki.grafana.com-v1-crdview | Users with this role can view the definitions of Custom Resource Definitions (CRDs) related to RecordingRule resources, but do not have permissions for modifying or managing these resources. |
recordingrules.loki.grafana.com-v1-edit | Users with this role have permission to create, update, and delete RecordingRule resources. |
recordingrules.loki.grafana.com-v1-view | Users with this role can read RecordingRule resources. They can inspect configurations, labels, and annotations for existing recording rules but cannot modify them. |
2.5.1.1. Examples
To apply cluster roles for a user, you must bind an existing cluster role to a specific username.
Cluster roles can be cluster or namespace scoped, depending on which type of role binding you use. When a RoleBinding object is used, as when using the oc adm policy add-role-to-user command, the cluster role only applies to the specified namespace. When a ClusterRoleBinding object is used, as when using the oc adm policy add-cluster-role-to-user command, the cluster role applies to all namespaces in the cluster.
The following example command gives the specified user create, read, update, and delete (CRUD) permissions for alerting rules in a specific namespace in the cluster:
Example cluster role binding command for alerting rule CRUD permissions in a specific namespace
$ oc adm policy add-role-to-user alertingrules.loki.grafana.com-v1-admin -n <namespace> <username>
The following command gives the specified user administrator permissions for alerting rules in all namespaces:
Example cluster role binding command for administrator permissions
$ oc adm policy add-cluster-role-to-user alertingrules.loki.grafana.com-v1-admin <username>
2.5.2. Creating a log-based alerting rule with Loki
The AlertingRule CR contains a set of specifications and webhook validation definitions to declare groups of alerting rules for a single LokiStack instance. In addition, the webhook validation definition provides support for rule validation conditions:
- If an AlertingRule CR includes an invalid interval period, it is an invalid alerting rule.
- If an AlertingRule CR includes an invalid for period, it is an invalid alerting rule.
- If an AlertingRule CR includes an invalid LogQL expr, it is an invalid alerting rule.
- If an AlertingRule CR includes two groups with the same name, it is an invalid alerting rule.
- If none of the above applies, an alerting rule is considered valid.
Tenant type | Valid namespaces for AlertingRule CRs |
---|---|
application | <your_application_namespace> |
audit | openshift-logging |
infrastructure | openshift-*, kube-*, default |
Procedure
Create an AlertingRule custom resource (CR):
Example infrastructure AlertingRule CR
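A sketch of an infrastructure rule that alerts on the Loki Operator's own error rate; the rule name, expression, and threshold are assumptions:
apiVersion: loki.grafana.com/v1
kind: AlertingRule
metadata:
  name: loki-operator-alerts
  namespace: openshift-operators-redhat
  labels:
    openshift.io/<label_name>: "true"
spec:
  tenantID: "infrastructure"
  groups:
  - name: LokiOperatorHighReconciliationError
    rules:
    - alert: HighPercentageError
      expr: |
        sum(rate({kubernetes_namespace_name="openshift-operators-redhat", kubernetes_pod_name=~"loki-operator-controller-manager.*"} |= "error" [1m])) by (job)
          /
        sum(rate({kubernetes_namespace_name="openshift-operators-redhat", kubernetes_pod_name=~"loki-operator-controller-manager.*"}[1m])) by (job)
          > 0.01
      for: 10s
      labels:
        severity: critical
      annotations:
        summary: High Loki Operator Reconciliation Errors
        description: High Loki Operator Reconciliation Errors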
In this example:
- The namespace where this AlertingRule CR is created must have a label matching the LokiStack spec.rules.namespaceSelector definition.
- The labels block must match the LokiStack spec.rules.selector definition.
- AlertingRule CRs for infrastructure tenants are only supported in the openshift-*, kube-*, or default namespaces.
- The value for kubernetes_namespace_name: must match the value for metadata.namespace.
- The severity label is mandatory, and its value must be critical, warning, or info.
- The summary annotation is mandatory.
- The description annotation is mandatory.
Example application AlertingRule CR
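A corresponding sketch for an application tenant; again, the rule name, namespace, expression, and threshold are assumptions:
apiVersion: loki.grafana.com/v1
kind: AlertingRule
metadata:
  name: app-user-workload
  namespace: app-ns
  labels:
    openshift.io/<label_name>: "true"
spec:
  tenantID: "application"
  groups:
  - name: AppUserWorkloadHighError
    rules:
    - alert: HighPercentageError
      expr: |
        sum(rate({kubernetes_namespace_name="app-ns", kubernetes_pod_name=~"<pod_name>.*"} |= "error" [1m])) by (job)
          /
        sum(rate({kubernetes_namespace_name="app-ns", kubernetes_pod_name=~"<pod_name>.*"}[1m])) by (job)
          > 0.01
      for: 10s
      labels:
        severity: critical
      annotations:
        summary: <summary_of_the_rule>
        description: <detailed_description_of_the_rule>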
In this example:
- The namespace where this AlertingRule CR is created must have a label matching the LokiStack spec.rules.namespaceSelector definition.
- The labels block must match the LokiStack spec.rules.selector definition.
- The value for kubernetes_namespace_name: must match the value for metadata.namespace.
- The severity label is mandatory, and its value must be critical, warning, or info.
- The summary annotation is mandatory; its value is a summary of the rule.
- The description annotation is mandatory; its value is a detailed description of the rule.
Apply the AlertingRule CR:
$ oc apply -f <filename>.yaml