Questo contenuto non è disponibile nella lingua selezionata.

Chapter 5. Tuning the log store


Configure advanced settings for Loki to optimize reliability, performance, scalability, security, and alerting in production environments.

5.1. Prerequisites

  • You have created a LokiStack custom resource.

5.2. Enhanced reliability and performance for Loki

Use the following configurations to ensure reliability and efficiency of Loki in production environments. These settings help optimize pod placement, data retention, cluster hardening, and recovery behavior to minimize data loss and maintain consistent performance under load.

5.2.1. Loki pod placement

You can control which nodes the Loki pods run on, and prevent other workloads from using those nodes, by using tolerations or node selectors on the pods.

You can apply tolerations to the log store pods with the LokiStack custom resource (CR) and apply taints to a node with the node specification. A taint on a node is a key:value pair that instructs the node to repel all pods that do not allow the taint. Using a specific key:value pair that is not on other pods ensures that only the log store pods can run on that node.

Example LokiStack with node selectors

apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
# ...
  template:
    compactor: 
1

      nodeSelector:
        node-role.kubernetes.io/infra: "" 
2

    distributor:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
    gateway:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
    indexGateway:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
    ingester:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
    querier:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
    queryFrontend:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
    ruler:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
# ...

1
Specifies the component pod type that applies to the node selector.
2
Specifies the pods that are moved to nodes containing the defined label.

Example LokiStack CR with node selectors and tolerations

apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
# ...
  template:
    compactor:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/infra
        value: reserved
      - effect: NoExecute
        key: node-role.kubernetes.io/infra
        value: reserved
    distributor:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/infra
        value: reserved
      - effect: NoExecute
        key: node-role.kubernetes.io/infra
        value: reserved
    indexGateway:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/infra
        value: reserved
      - effect: NoExecute
        key: node-role.kubernetes.io/infra
        value: reserved
    ingester:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/infra
        value: reserved
      - effect: NoExecute
        key: node-role.kubernetes.io/infra
        value: reserved
    querier:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/infra
        value: reserved
      - effect: NoExecute
        key: node-role.kubernetes.io/infra
        value: reserved
    queryFrontend:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/infra
        value: reserved
      - effect: NoExecute
        key: node-role.kubernetes.io/infra
        value: reserved
    ruler:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/infra
        value: reserved
      - effect: NoExecute
        key: node-role.kubernetes.io/infra
        value: reserved
    gateway:
      nodeSelector:
        node-role.kubernetes.io/infra: ""
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/infra
        value: reserved
      - effect: NoExecute
        key: node-role.kubernetes.io/infra
        value: reserved
# ...

To configure the nodeSelector and tolerations fields of the LokiStack (CR), you can use the oc explain command to view the description and fields for a particular resource:

$ oc explain lokistack.spec.template

Example output

KIND:     LokiStack
VERSION:  loki.grafana.com/v1

RESOURCE: template <Object>

DESCRIPTION:
     Template defines the resource/limits/tolerations/nodeselectors per
     component

FIELDS:
   compactor	<Object>
     Compactor defines the compaction component spec.

   distributor	<Object>
     Distributor defines the distributor component spec.
...

For more detailed information, you can add a specific field:

$ oc explain lokistack.spec.template.compactor

Example output

KIND:     LokiStack
VERSION:  loki.grafana.com/v1

RESOURCE: compactor <Object>

DESCRIPTION:
     Compactor defines the compaction component spec.

FIELDS:
   nodeSelector	<map[string]string>
     NodeSelector defines the labels required by a node to schedule the
     component onto it.
...

5.2.2. Configuring Loki to tolerate node failure

In the logging 5.8 and later versions, the Loki Operator supports setting pod anti-affinity rules to request that pods of the same component are scheduled on different available nodes in the cluster.

Affinity is a property of pods that controls the nodes on which they prefer to be scheduled. Anti-affinity is a property of pods that prevents a pod from being scheduled on a node.

In OpenShift Container Platform, pod affinity and pod anti-affinity allow you to constrain which nodes your pod is eligible to be scheduled on based on the key-value labels on other pods.

The Operator sets default, preferred podAntiAffinity rules for all Loki components, which includes the compactor, distributor, gateway, indexGateway, ingester, querier, queryFrontend, and ruler components.

You can override the preferred podAntiAffinity settings for Loki components by configuring required settings in the requiredDuringSchedulingIgnoredDuringExecution field:

Example user settings for the ingester component

apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
# ...
  template:
    ingester:
      podAntiAffinity:
      # ...
        requiredDuringSchedulingIgnoredDuringExecution: 
1

        - labelSelector:
            matchLabels: 
2

              app.kubernetes.io/component: ingester
          topologyKey: kubernetes.io/hostname
# ...

1
The stanza to define a required rule.
2
The key-value pair (label) that must be matched to apply the rule.

5.2.3. Enabling stream-based retention with Loki

You can configure retention policies based on log streams. You can set retention rules globally, per-tenant, or both. If you configure both, tenant rules apply before global rules.

Important

If there is no retention period defined on the s3 bucket or in the LokiStack custom resource (CR), then the logs are not pruned and they stay in the s3 bucket forever, which might fill up the s3 storage.

Note
  • Although logging version 5.9 and later supports schema v12, schema v13 is recommended for future compatibility.
  • For cost-effective log pruning, configure retention policies directly on the object storage provider. Use the lifecycle management features of the storage provider to ensure automatic deletion of old logs. This also avoids extra processing from Loki and delete requests to S3.

    If the object storage does not support lifecycle policies, you must configure LokiStack to enforce retention internally. The supported retention period is up to 30 days.

Prerequisites

  • You have administrator permissions.
  • You have installed the Loki Operator.
  • You have installed the OpenShift CLI (oc).

Procedure

  1. To enable stream-based retention, create a LokiStack CR and save it as a YAML file. In the following example, it is called lokistack.yaml.

    Example global stream-based retention for S3

    apiVersion: loki.grafana.com/v1
    kind: LokiStack
    metadata:
      name: logging-loki
      namespace: openshift-logging
    spec:
      limits:
       global: 
    1
    
          retention: 
    2
    
            days: 20
            streams:
            - days: 4
              priority: 1
              selector: '{kubernetes_namespace_name=~"test.+"}' 
    3
    
            - days: 1
              priority: 1
              selector: '{log_type="infrastructure"}'
      managementState: Managed
      replicationFactor: 1
      size: 1x.small
      storage:
        schemas:
        - effectiveDate: "2020-10-11"
          version: v13
        secret:
          name: logging-loki-s3
          type: s3
      storageClassName: gp3-csi
      tenants:
        mode: openshift-logging

    1
    Set the retention policy for all log streams. This policy does not impact the retention period for stored logs in object storage.
    2
    Enable retention in the cluster by adding the retention block to the CR.
    3
    Specify the LogQL query to match log streams to the retention rule.

    Example per-tenant stream-based retention for S3

    apiVersion: loki.grafana.com/v1
    kind: LokiStack
    metadata:
      name: logging-loki
      namespace: openshift-logging
    spec:
      limits:
        global:
          retention:
            days: 20
        tenants: 
    1
    
          application:
            retention:
              days: 1
              streams:
                - days: 4
                  selector: '{kubernetes_namespace_name=~"test.+"}' 
    2
    
          infrastructure:
            retention:
              days: 5
              streams:
                - days: 1
                  selector: '{kubernetes_namespace_name=~"openshift-cluster.+"}'
      managementState: Managed
      replicationFactor: 1
      size: 1x.small
      storage:
        schemas:
        - effectiveDate: "2020-10-11"
          version: v13
        secret:
          name: logging-loki-s3
          type: s3
      storageClassName: gp3-csi
      tenants:
        mode: openshift-logging

    1
    Set the retention policy per-tenant. Valid tenant types are application, audit, and infrastructure.
    2
    Specify the LogQL query to match log streams to the retention rule.
  2. Apply the LokiStack CR:

    $ oc apply -f lokistack.yaml

5.2.4. Configuring Loki to tolerate memberlist creation failure

In an OpenShift Container Platform cluster, administrators generally use a non-private IP network range. As a result, the LokiStack memberlist configuration fails because, by default, it only uses private IP networks.

As an administrator, you can select the pod network for the memberlist configuration. You can modify the LokiStack custom resource (CR) to use the podIP address in the hashRing spec. To configure the LokiStack CR, use the following command:

$ oc patch LokiStack logging-loki -n openshift-logging  --type=merge -p '{"spec": {"hashRing":{"memberlist":{"instanceAddrType":"podIP"},"type":"memberlist"}}}'

Example LokiStack to include podIP

apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: logging-loki
  namespace: openshift-logging
spec:
# ...
  hashRing:
    type: memberlist
    memberlist:
      instanceAddrType: podIP
# ...

5.2.5. LokiStack behavior during cluster restarts

When an OpenShift Container Platform cluster is restarted, LokiStack ingestion and the query path continue to operate within the available CPU and memory resources available for the node. This means that there is no downtime for the LokiStack during OpenShift Container Platform cluster updates. This behavior is achieved by using PodDisruptionBudget resources. The Loki Operator provisions PodDisruptionBudget resources for Loki, which determine the minimum number of pods that must be available per component to ensure normal operations under certain conditions.

5.3. Advanced deployment and scalability for Loki

Configure high availability, scalability, and error handling for Loki to support large-scale deployments across multiple availability zones. These features enable Loki to tolerate zone failures, manage rate limit errors, and scale horizontally to handle increased log ingestion rates.

5.3.1. Zone aware data replication

The Loki Operator offers support for zone-aware data replication through pod topology spread constraints. Enabling this feature enhances reliability and safeguards against log loss in the event of a single zone failure. When configuring the deployment size as 1x.extra-small, 1x.small, or 1x.medium, the replication.factor field is automatically set to 2.

To ensure proper replication, you need to have at least as many availability zones as the replication factor specifies. While it is possible to have more availability zones than the replication factor, having fewer zones can lead to write failures. Each zone should host an equal number of instances for optimal operation.

Example LokiStack CR with zone replication enabled

apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
 name: logging-loki
 namespace: openshift-logging
spec:
 replicationFactor: 2 
1

 replication:
   factor: 2 
2

   zones:
   -  maxSkew: 1 
3

      topologyKey: topology.kubernetes.io/zone 
4

1
Deprecated field, values entered are overwritten by replication.factor.
2
This value is automatically set when deployment size is selected at setup.
3
The maximum difference in number of pods between any two topology domains. The default is 1, and you cannot specify a value of 0.
4
Defines zones in the form of a topology key that corresponds to a node label.

5.3.2. Recovering Loki pods from failed zones

In OpenShift Container Platform a zone failure happens when specific availability zone resources become inaccessible. Availability zones are isolated areas within a cloud provider’s data center, aimed at enhancing redundancy and fault tolerance. If your OpenShift Container Platform cluster is not configured to handle this, a zone failure can lead to service or data loss.

Loki pods are part of a StatefulSet, and they come with Persistent Volume Claims (PVCs) provisioned by a StorageClass object. Each Loki pod and its PVCs reside in the same zone. When a zone failure occurs in a cluster, the StatefulSet controller automatically attempts to recover the affected pods in the failed zone.

Warning

The following procedure will delete the PVCs in the failed zone, and all data contained therein. To avoid complete data loss the replication factor field of the LokiStack CR should always be set to a value greater than 1 to ensure that Loki is replicating.

Prerequisites

  • Verify your LokiStack CR has a replication factor greater than 1.
  • Zone failure detected by the control plane, and nodes in the failed zone are marked by cloud provider integration.

The StatefulSet controller automatically attempts to reschedule pods in a failed zone. Because the associated PVCs are also in the failed zone, automatic rescheduling to a different zone does not work. You must manually delete the PVCs in the failed zone to allow successful re-creation of the stateful Loki Pod and its provisioned PVC in the new zone.

Procedure

  1. List the pods in Pending status by running the following command:

    $ oc get pods --field-selector status.phase==Pending -n openshift-logging

    Example oc get pods output

    NAME                           READY   STATUS    RESTARTS   AGE 
    1
    
    logging-loki-index-gateway-1   0/1     Pending   0          17m
    logging-loki-ingester-1        0/1     Pending   0          16m
    logging-loki-ruler-1           0/1     Pending   0          16m

    1
    These pods are in Pending status because their corresponding PVCs are in the failed zone.
  2. List the PVCs in Pending status by running the following command:

    $ oc get pvc -o=json -n openshift-logging | jq '.items[] | select(.status.phase == "Pending") | .metadata.name' -r

    Example oc get pvc output

    storage-logging-loki-index-gateway-1
    storage-logging-loki-ingester-1
    wal-logging-loki-ingester-1
    storage-logging-loki-ruler-1
    wal-logging-loki-ruler-1

  3. Delete the PVC(s) for a pod by running the following command:

    $ oc delete pvc <pvc_name> -n openshift-logging
  4. Delete the pod(s) by running the following command:

    $ oc delete pod <pod_name> -n openshift-logging

    Once these objects have been successfully deleted, they should automatically be rescheduled in an available zone.

5.3.2.1. Troubleshooting PVC in a terminating state

The PVCs might hang in the terminating state without being deleted, if PVC metadata finalizers are set to kubernetes.io/pv-protection. Removing the finalizers should allow the PVCs to delete successfully.

  • Remove the finalizer for each PVC by running the command below, then retry deletion.

    $ oc patch pvc <pvc_name> -p '{"metadata":{"finalizers":null}}' -n openshift-logging

5.3.3. Troubleshooting Loki rate limit errors

If the Log Forwarder API forwards a large block of messages that exceeds the rate limit to Loki, Loki generates rate limit (429) errors.

These errors can occur during normal operation. For example, when adding the logging to a cluster that already has some logs, rate limit errors might occur while the logging tries to ingest all of the existing log entries. In this case, if the rate of addition of new logs is less than the total rate limit, the historical data is eventually ingested, and the rate limit errors are resolved without requiring user intervention.

In cases where the rate limit errors continue to occur, you can fix the issue by modifying the LokiStack custom resource (CR).

Important

The LokiStack CR is not available on Grafana-hosted Loki. This topic does not apply to Grafana-hosted Loki servers.

Conditions

  • The Log Forwarder API is configured to forward logs to Loki.
  • Your system sends a block of messages that is larger than 2 MB to Loki. For example:

    "values":[["1630410392689800468","{\"kind\":\"Event\",\"apiVersion\":\
    .......
    ......
    ......
    ......
    \"received_at\":\"2021-08-31T11:46:32.800278+00:00\",\"version\":\"1.7.4 1.6.0\"}},\"@timestamp\":\"2021-08-31T11:46:32.799692+00:00\",\"viaq_index_name\":\"audit-write\",\"viaq_msg_id\":\"MzFjYjJkZjItNjY0MC00YWU4LWIwMTEtNGNmM2E5ZmViMGU4\",\"log_type\":\"audit\"}"]]}]}
  • After you enter oc logs -n openshift-logging -l component=collector, the collector logs in your cluster show a line containing one of the following error messages:

    429 Too Many Requests Ingestion rate limit exceeded

    Example Vector error message

    2023-08-25T16:08:49.301780Z  WARN sink{component_kind="sink" component_id=default_loki_infra component_type=loki component_name=default_loki_infra}: vector::sinks::util::retries: Retrying after error. error=Server responded with an error: 429 Too Many Requests internal_log_rate_limit=true

    The error is also visible on the receiving end. For example, in the LokiStack ingester pod:

    Example Loki ingester error message

    level=warn ts=2023-08-30T14:57:34.155592243Z caller=grpc_logging.go:43 duration=1.434942ms method=/logproto.Pusher/Push err="rpc error: code = Code(429) desc = entry with timestamp 2023-08-30 14:57:32.012778399 +0000 UTC ignored, reason: 'Per stream rate limit exceeded (limit: 3MB/sec) while attempting to ingest for stream

Procedure

  • Update the ingestionBurstSize and ingestionRate fields in the LokiStack CR:

    apiVersion: loki.grafana.com/v1
    kind: LokiStack
    metadata:
      name: logging-loki
      namespace: openshift-logging
    spec:
      limits:
        global:
          ingestion:
            ingestionBurstSize: 16 
    1
    
            ingestionRate: 8 
    2
    
    # ...
    1
    The ingestionBurstSize field defines the maximum local rate-limited sample size per distributor replica in MB. This value is a hard limit. Set this value to at least the maximum logs size expected in a single push request. Single requests that are larger than the ingestionBurstSize value are not permitted.
    2
    The ingestionRate field is a soft limit on the maximum amount of ingested samples per second in MB. Rate limit errors occur if the rate of logs exceeds the limit, but the collector retries sending the logs. As long as the total average is lower than the limit, the system recovers and errors are resolved without user intervention.

5.4. Loki network policies for added security

The Loki Operator can deploy and manage a set of network policies that restrict communications to and from Loki components to enhance security. These network policies control ingress and egress traffic at the pod level, limiting exposure to only necessary services while allowing integration with external monitoring systems when required.

5.4.1. Loki network policies

You can enable the Loki Operator to automatically create a NetworkPolicy resource that implements a "default deny" security model with explicit allow rules for required communications. Network policies provide network segmentation for your LokiStack deployment by controlling ingress and egress traffic between Loki components and external services. The network policies in Loki Operator are designed to be secure by default while maintaining compatibility across diverse environments.

Network policies for Loki on OpenShift Container Platform include the following additional integrations:

  • Monitoring: Automatic integration with the OpenShift Container Platform monitoring stack.
  • DNS: Support for both standard and OpenShift Container Platform DNS services (port 5353).

5.4.2. Configuring a network policy for Loki

Enable or disable the deployment of NetworkPolicies per LokiStack by setting the networkPolicies field.

Prerequisites

  • You have administrator permissions.
  • You have installed the OpenShift CLI (oc).
  • You have installed the Loki Operator.
  • You have created a LokiStack custom resource (CR).

Procedure

  1. Update the LokiStack CR:

    apiVersion: loki.grafana.com/v1
    kind: LokiStack
    metadata:
      name: logging-loki
      namespace: openshift-logging
    spec:
      size: 1x.small
      storage:
        schemas:
        - version: v13
          effectiveDate: "<yyyy>-<mm>-<dd>"
        secret:
          name: logging-loki-s3
          type: s3
      storageClassName: <storage_class_name>
      tenants:
        mode: openshift-logging
      networkPolicies:
        ruleSet: RestrictIngressEgress

    You can set one of the following values for the spec.networkPolicies.ruleSet field:

    None
    Loki Operator will not deploy any network policy.
    RestrictIngressEgress

    Loki Operator will deploy a set of network policies that restrict the communications to and from the Loki components.

    If you do not define a spec.networkPolicies.ruleSet value, the platform and operator default values are inherited and full network access is allowed.

  2. Apply the LokiStack CR object by running the following command:

    $ oc apply -f <filename>.yaml

5.4.3. Loki NetworkPolicy resources

When network policies are enabled, the Loki Operator creates several NetworkPolicy resources to secure different aspects of your LokiStack deployment.

Expand

Policy name

Purpose

Components affected

{name}-default-deny

A baseline deny-all policy

All LokiStack pods

{name}-loki-allow

Inter-component communication allowed

All Loki components

{name}-loki-allow-metrics

Allow metric scraping on the prometheus endpoint

All Loki components

{name}-loki-allow-bucket-egress

Policy for object storage access

ingester, querier, index-gateway, compactor, ruler

{name}-loki-allow-gateway-ingress

Allow gateway access to Loki components

distributor, query-frontend, ruler

{name}-gateway-allow

Gateway external and monitoring access

LokiStack-gateway

{name}-gateway-allow-metrics

Allow metric scraping on the prometheus endpoint

LokiStack-gateway

{name}-ruler-allow-alert-egress

Allow ruler egress to AlertManager

ruler

{name}-loki-allow-query-frontend

Query frontend external access

query-frontend (OpenShift network mode)

5.4.4. Integrating Loki network policy with external systems

To integrate Loki with external systems such as custom dashboards, or external alerting, create additional network policies. You can select specific components by using the label app.kubernetes.io/component. Always include the labels app.kubernetes.io/name=lokistack and app.kubernetes.io/instance={name} to avoid collision with other pods deployed in the namespace.

Prerequisites

  • You have administrator permissions.
  • You have installed the OpenShift CLI (oc).
  • You have installed the Loki Operator.
  • You have created a LokiStack custom resource (CR).

Procedure

  1. Create a network policy:

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: <name>
    spec:
      podSelector:
        matchLabels:
          app.kubernetes.io/name: lokistack
          app.kubernetes.io/instance: <instance_name>
          app.kubernetes.io/component: <loki_component>
      policyTypes:
      - Egress
      egress:
      - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: <namespace_name>
        ports:
        - protocol: TCP
          port: <port_number>

    Replace <component_name> with the component you want to integrate with.

  2. Apply the network policy:

    $ oc apply -f <file_name>.yaml

5.5. Log-based alerts for Loki

Configure log-based alerts for Loki by creating AlertingRule custom resources. Log-based alerting enables you to trigger alerts based on log patterns and volumes, complementing metric-based alerting to provide comprehensive observability. These alerts integrate with Red Hat OpenShift Logging monitoring and can route to external alerting systems.

5.5.1. Authorizing LokiStack rules RBAC permissions

Administrators bind cluster roles to users to enable them to create and manage alerting and recording rules. A cluster role is defined as a ClusterRole object that has the required role-based access control (RBAC) permissions.

The following cluster roles for alerting and recording rules are available for LokiStack:

Expand
Rule nameDescription

alertingrules.loki.grafana.com-v1-admin

Users with this role have administrative-level access to manage alerting rules. This cluster role grants permissions to create, read, update, delete, list, and watch AlertingRule resources within the loki.grafana.com/v1 API group.

alertingrules.loki.grafana.com-v1-crdview

Users with this role can view the definitions of Custom Resource Definitions (CRDs) related to AlertingRule resources within the loki.grafana.com/v1 API group, but do not have permissions for modifying or managing these resources.

alertingrules.loki.grafana.com-v1-edit

Users with this role have permission to create, update, and delete AlertingRule resources.

alertingrules.loki.grafana.com-v1-view

Users with this role can read AlertingRule resources within the loki.grafana.com/v1 API group. They can inspect configurations, labels, and annotations for existing alerting rules but cannot make any modifications to them.

recordingrules.loki.grafana.com-v1-admin

Users with this role have administrative-level access to manage recording rules. This cluster role grants permissions to create, read, update, delete, list, and watch RecordingRule resources within the loki.grafana.com/v1 API group.

recordingrules.loki.grafana.com-v1-crdview

Users with this role can view the definitions of Custom Resource Definitions (CRDs) related to RecordingRule resources within the loki.grafana.com/v1 API group, but do not have permissions for modifying or managing these resources.

recordingrules.loki.grafana.com-v1-edit

Users with this role have permission to create, update, and delete RecordingRule resources.

recordingrules.loki.grafana.com-v1-view

Users with this role can read RecordingRule resources within the loki.grafana.com/v1 API group. They can inspect configurations, labels, and annotations for existing alerting rules but cannot make any modifications to them.

5.5.1.1. Examples

To apply cluster roles for a user, you must bind an existing cluster role to a specific username.

Cluster roles can be cluster or namespace scoped, depending on which type of role binding you use. When a RoleBinding object is used, as when using the oc adm policy add-role-to-user command, the cluster role only applies to the specified namespace. When a ClusterRoleBinding object is used, as when using the oc adm policy add-cluster-role-to-user command, the cluster role applies to all namespaces in the cluster.

The following example command gives the specified user create, read, update and delete (CRUD) permissions for alerting rules in a specific namespace in the cluster:

The following example displays cluster role binding command for alerting rule CRUD permissions in a specific namespace:

$ oc adm policy add-role-to-user alertingrules.loki.grafana.com-v1-admin -n <namespace> <username>

The following command gives the specified user administrator permissions for alerting rules in all namespaces:

$ oc adm policy add-cluster-role-to-user alertingrules.loki.grafana.com-v1-admin <username>

5.5.2. Creating a log-based alerting rule with Loki

The AlertingRule CR contains a set of specifications and webhook validation definitions to declare groups of alerting rules for a single LokiStack instance. In addition, the webhook validation definition provides support for rule validation conditions:

  • If an AlertingRule CR includes an invalid interval period, it is an invalid alerting rule
  • If an AlertingRule CR includes an invalid for period, it is an invalid alerting rule.
  • If an AlertingRule CR includes an invalid LogQL expr, it is an invalid alerting rule.
  • If an AlertingRule CR includes two groups with the same name, it is an invalid alerting rule.
  • If none of the above applies, an alerting rule is considered valid.
Expand
Table 5.1. AlertingRule definitions
Tenant typeValid namespaces for AlertingRule CRs

application

<your_application_namespace>

audit

openshift-logging

infrastructure

openshift-/*, kube-/\*, default

Procedure

  1. Create an AlertingRule custom resource (CR):

    Example infrastructure AlertingRule CR

      apiVersion: loki.grafana.com/v1
      kind: AlertingRule
      metadata:
        name: loki-operator-alerts
        namespace: openshift-operators-redhat 
    1
    
        labels: 
    2
    
          openshift.io/<label_name>: "true"
      spec:
        tenantID: "infrastructure" 
    3
    
        groups:
          - name: LokiOperatorHighReconciliationError
            rules:
              - alert: HighPercentageError
                expr: | 
    4
    
                  sum(rate({kubernetes_namespace_name="openshift-operators-redhat", kubernetes_pod_name=~"loki-operator-controller-manager.*"} |= "error" [1m])) by (job)
                    /
                  sum(rate({kubernetes_namespace_name="openshift-operators-redhat", kubernetes_pod_name=~"loki-operator-controller-manager.*"}[1m])) by (job)
                    > 0.01
                for: 10s
                labels:
                  severity: critical 
    5
    
                annotations:
                  summary: High Loki Operator Reconciliation Errors 
    6
    
                  description: High Loki Operator Reconciliation Errors 
    7

    1
    The namespace where this AlertingRule CR is created must have a label matching the LokiStack spec.rules.namespaceSelector definition.
    2
    The labels block must match the LokiStack spec.rules.selector definition.
    3
    AlertingRule CRs for infrastructure tenants are only supported in the openshift-*, kube-\*, or default namespaces.
    4
    The value for kubernetes_namespace_name: must match the value for metadata.namespace.
    5
    The value of this mandatory field must be critical, warning, or info.
    6
    This field is mandatory.
    7
    This field is mandatory.

    Example application AlertingRule CR

      apiVersion: loki.grafana.com/v1
      kind: AlertingRule
      metadata:
        name: app-user-workload
        namespace: app-ns 
    1
    
        labels: 
    2
    
          openshift.io/<label_name>: "true"
      spec:
        tenantID: "application"
        groups:
          - name: AppUserWorkloadHighError
            rules:
              - alert:
                expr: | 
    3
    
                  sum(rate({kubernetes_namespace_name="app-ns", kubernetes_pod_name=~"podName.*"} |= "error" [1m])) by (job)
                for: 10s
                labels:
                  severity: critical 
    4
    
                annotations:
                  summary:  
    5
    
                  description:  
    6

    1
    The namespace where this AlertingRule CR is created must have a label matching the LokiStack spec.rules.namespaceSelector definition.
    2
    The labels block must match the LokiStack spec.rules.selector definition.
    3
    Value for kubernetes_namespace_name: must match the value for metadata.namespace.
    4
    The value of this mandatory field must be critical, warning, or info.
    5
    The value of this mandatory field is a summary of the rule.
    6
    The value of this mandatory field is a detailed description of the rule.
  2. Apply the AlertingRule CR:

    $ oc apply -f <filename>.yaml
Red Hat logoGithubredditYoutubeTwitter

Formazione

Prova, acquista e vendi

Community

Informazioni su Red Hat

Forniamo soluzioni consolidate che rendono più semplice per le aziende lavorare su piattaforme e ambienti diversi, dal datacenter centrale all'edge della rete.

Rendiamo l’open source più inclusivo

Red Hat si impegna a sostituire il linguaggio problematico nel codice, nella documentazione e nelle proprietà web. Per maggiori dettagli, visita il Blog di Red Hat.

Informazioni sulla documentazione di Red Hat

Legal Notice

Theme

© 2026 Red Hat
Torna in cima