Chapter 11. Logging alerts

11.1. Default logging alerts

Logging alerts are installed as part of the Red Hat OpenShift Logging Operator installation. Alerts depend on metrics exported by the log collection and log storage backends. These metrics are enabled if you selected the option to Enable Operator recommended cluster monitoring on this namespace when installing the Red Hat OpenShift Logging Operator.

Default logging alerts are sent to the OpenShift Dedicated monitoring stack Alertmanager in the openshift-monitoring namespace, unless you have disabled the local Alertmanager instance.

11.1.1. Accessing the Alerting UI in the Administrator and Developer perspectives

The Alerting UI is accessible through the Administrator perspective and the Developer perspective of the OpenShift Dedicated web console.

In the Administrator perspective, go to Observe → Alerting. The three main pages in the Alerting UI in this perspective are the Alerts, Silences, and Alerting rules pages.

In the Developer perspective, go to Observe → <project_name> → Alerts. In this perspective, alerts, silences, and alerting rules are all managed from the Alerts page. The results shown in the Alerts page are specific to the selected project.

Note

In the Developer perspective, you can select from core OpenShift Dedicated and user-defined projects that you have access to in the Project: <project_name> list. However, alerts, silences, and alerting rules relating to core OpenShift Dedicated projects are not displayed if you are not logged in as a cluster administrator.

11.1.2. Logging collector alerts

In logging 5.8 and later versions, the following alerts are generated by the Red Hat OpenShift Logging Operator. You can view these alerts in the OpenShift Dedicated web console.

Alert Name	Message	Description	Severity
CollectorNodeDown	Prometheus could not scrape `namespace`/`pod` collector component for more than 10m.	Collector cannot be scraped.	Critical
CollectorHighErrorRate	`value`% of records have resulted in an error by `namespace`/`pod` collector component.	`namespace`/`pod` collector component errors are high.	Critical
CollectorVeryHighErrorRate	`value`% of records have resulted in an error by `namespace`/`pod` collector component.	`namespace`/`pod` collector component errors are very high.	Critical

11.1.3. Vector collector alerts

In logging 5.7 and later versions, the following alerts are generated by the Vector collector. You can view these alerts in the OpenShift Dedicated web console.

Table 11.1. Vector collector alerts
Alert	Message	Description	Severity
`CollectorHighErrorRate`	`<value> of records have resulted in an error by vector <instance>.`	The number of Vector output errors is high, by default more than 10 in the previous 15 minutes.	Warning
`CollectorNodeDown`	`Prometheus could not scrape vector <instance> for more than 10m.`	Vector is reporting that Prometheus could not scrape a specific Vector instance.	Critical
`CollectorVeryHighErrorRate`	`<value> of records have resulted in an error by vector <instance>.`	The number of Vector component errors is very high, by default more than 25 in the previous 15 minutes.	Critical
`FluentdQueueLengthIncreasing`	`In the last 1h, fluentd <instance> buffer queue length constantly increased more than 1. Current value is <value>.`	Fluentd is reporting that the queue size is increasing.	Warning

11.1.4. Fluentd collector alerts

The following alerts are generated by the legacy Fluentd log collector. You can view these alerts in the OpenShift Dedicated web console.

Table 11.2. Fluentd collector alerts
Alert	Message	Description	Severity
`FluentDHighErrorRate`	`<value> of records have resulted in an error by fluentd <instance>.`	The number of Fluentd output errors is high, by default more than 10 in the previous 15 minutes.	Warning
`FluentdNodeDown`	`Prometheus could not scrape fluentd <instance> for more than 10m.`	Fluentd is reporting that Prometheus could not scrape a specific Fluentd instance.	Critical
`FluentdQueueLengthIncreasing`	`In the last 1h, fluentd <instance> buffer queue length constantly increased more than 1. Current value is <value>.`	Fluentd is reporting that the queue size is increasing.	Warning
`FluentDVeryHighErrorRate`	`<value> of records have resulted in an error by fluentd <instance>.`	The number of Fluentd output errors is very high, by default more than 25 in the previous 15 minutes.	Critical

11.1.5. Elasticsearch alerting rules

You can view these alerting rules in the OpenShift Dedicated web console.

Table 11.3. Alerting rules
Alert	Description	Severity
`ElasticsearchClusterNotHealthy`	The cluster health status has been RED for at least 2 minutes. The cluster does not accept writes, shards may be missing, or the master node has not been elected yet.	Critical
`ElasticsearchClusterNotHealthy`	The cluster health status has been YELLOW for at least 20 minutes. Some shard replicas are not allocated.	Warning
`ElasticsearchDiskSpaceRunningLow`	The cluster is expected to be out of disk space within the next 6 hours.	Critical
`ElasticsearchHighFileDescriptorUsage`	The cluster is predicted to be out of file descriptors within the next hour.	Warning
`ElasticsearchJVMHeapUseHigh`	The JVM Heap usage on the specified node is high.	Alert
`ElasticsearchNodeDiskWatermarkReached`	The specified node has hit the low watermark due to low free disk space. Shards cannot be allocated to this node anymore. Consider adding more disk space to the node.	Info
`ElasticsearchNodeDiskWatermarkReached`	The specified node has hit the high watermark due to low free disk space. Some shards will be re-allocated to different nodes if possible. Make sure more disk space is added to the node or drop old indices allocated to this node.	Warning
`ElasticsearchNodeDiskWatermarkReached`	The specified node has hit the flood watermark due to low free disk space. A read-only block is enforced on every index that has a shard allocated on this node. The index block must be manually released when the disk use falls below the high watermark.	Critical
`ElasticsearchJVMHeapUseHigh`	The JVM Heap usage on the specified node is too high.	Alert
`ElasticsearchWriteRequestsRejectionJumps`	Elasticsearch is experiencing an increase in write rejections on the specified node. This node might not be keeping up with the indexing speed.	Warning
`AggregatedLoggingSystemCPUHigh`	The CPU used by the system on the specified node is too high.	Alert
`ElasticsearchProcessCPUHigh`	The CPU used by Elasticsearch on the specified node is too high.	Alert

11.2. Custom logging alerts

In logging 5.7 and later versions, users can configure the LokiStack deployment to produce customized alerts and recorded metrics. If you want to use customized alerting and recording rules, you must enable the LokiStack ruler component.

LokiStack log-based alerts and recorded metrics are triggered by providing LogQL expressions to the ruler component. The Loki Operator manages a ruler that is optimized for the selected LokiStack size, which can be 1x.extra-small, 1x.small, or 1x.medium.

To provide these expressions, you must create an AlertingRule custom resource (CR) containing Prometheus-compatible alerting rules, or a RecordingRule CR containing Prometheus-compatible recording rules.
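
As a rough illustration of the second option, a RecordingRule CR might look like the following minimal sketch. The resource name, the app-ns namespace, the group name, and the recorded metric name are all hypothetical, and the LogQL expression is only an assumption modeled on the alerting examples in this chapter; adapt them to your own tenant and labels.

```yaml
# Minimal sketch of a RecordingRule CR. All names below are illustrative
# assumptions, not values required by the Loki Operator.
apiVersion: loki.grafana.com/v1
kind: RecordingRule
metadata:
  name: app-error-rate                  # hypothetical resource name
  namespace: app-ns                     # hypothetical application namespace
  labels:
    openshift.io/<label_name>: "true"   # must match the LokiStack spec.rules.selector
spec:
  tenantID: "application"
  groups:
    - name: AppErrorRate                # hypothetical group name
      interval: 1m                      # how often the group is evaluated
      rules:
        # "record" names the metric that stores the result of "expr".
        - record: app_ns:error_lines:rate1m
          expr: |
            sum(rate({kubernetes_namespace_name="app-ns"} |= "error" [1m])) by (job)
```

Unlike an alerting rule, a recording rule has no for period, severity label, or annotations; it only stores the result of the expression as a metric.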

Administrators can configure log-based alerts or recorded metrics for application, audit, or infrastructure tenants. Users without administrator permissions can configure log-based alerts or recorded metrics for application tenants of the applications that they have access to.

Application, audit, and infrastructure alerts are sent by default to the OpenShift Dedicated monitoring stack Alertmanager in the openshift-monitoring namespace, unless you have disabled the local Alertmanager instance. If the Alertmanager that is used to monitor user-defined projects in the openshift-user-workload-monitoring namespace is enabled, application alerts are sent to the Alertmanager in this namespace by default.

11.2.1. Configuring the ruler

When the LokiStack ruler component is enabled, users can define a group of LogQL expressions that trigger logging alerts or recorded metrics.

Administrators can enable the ruler by modifying the LokiStack custom resource (CR).

Prerequisites

You have installed the Red Hat OpenShift Logging Operator and the Loki Operator.
You have created a LokiStack CR.
You have administrator permissions.

Procedure

Enable the ruler by ensuring that the LokiStack CR contains the following spec configuration:

apiVersion: loki.grafana.com/v1
kind: LokiStack
metadata:
  name: <name>
  namespace: <namespace>
spec:
# ...
  rules:
    enabled: true # 1
    selector:
      matchLabels:
        openshift.io/<label_name>: "true" # 2
    namespaceSelector:
      matchLabels:
        openshift.io/<label_name>: "true" # 3

1: Enable Loki alerting and recording rules in your cluster.
2: Add a custom label that selects the AlertingRule and RecordingRule CRs to load. The labels block of each rule CR must match this selector.
3: Add a custom label that can be added to namespaces where you want to enable the use of logging alerts and metrics.
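
For the namespaceSelector to match, the namespace that holds your rule CRs needs the same label. A minimal sketch of such a namespace follows; the name app-ns is a hypothetical placeholder, and <label_name> stands for whatever label you chose in the LokiStack CR.

```yaml
# Sketch of a namespace labeled so that the ruler picks up rule CRs in it.
# "app-ns" is an assumed example name.
apiVersion: v1
kind: Namespace
metadata:
  name: app-ns
  labels:
    openshift.io/<label_name>: "true"   # must match spec.rules.namespaceSelector
```

If the namespace already exists, adding the label to its existing metadata has the same effect.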

11.2.2. Authorizing LokiStack rules RBAC permissions

Administrators can allow users to create and manage their own alerting and recording rules by binding cluster roles to usernames. Cluster roles are defined as ClusterRole objects that contain necessary role-based access control (RBAC) permissions for users.

In logging 5.8 and later, the following cluster roles for alerting and recording rules are available for LokiStack:

Role name	Description
`alertingrules.loki.grafana.com-v1-admin`	Users with this role have administrative-level access to manage alerting rules. This cluster role grants permissions to create, read, update, delete, list, and watch `AlertingRule` resources within the `loki.grafana.com/v1` API group.
`alertingrules.loki.grafana.com-v1-crdview`	Users with this role can view the definitions of Custom Resource Definitions (CRDs) related to `AlertingRule` resources within the `loki.grafana.com/v1` API group, but do not have permissions for modifying or managing these resources.
`alertingrules.loki.grafana.com-v1-edit`	Users with this role have permission to create, update, and delete `AlertingRule` resources.
`alertingrules.loki.grafana.com-v1-view`	Users with this role can read `AlertingRule` resources within the `loki.grafana.com/v1` API group. They can inspect configurations, labels, and annotations for existing alerting rules but cannot make any modifications to them.
`recordingrules.loki.grafana.com-v1-admin`	Users with this role have administrative-level access to manage recording rules. This cluster role grants permissions to create, read, update, delete, list, and watch `RecordingRule` resources within the `loki.grafana.com/v1` API group.
`recordingrules.loki.grafana.com-v1-crdview`	Users with this role can view the definitions of Custom Resource Definitions (CRDs) related to `RecordingRule` resources within the `loki.grafana.com/v1` API group, but do not have permissions for modifying or managing these resources.
`recordingrules.loki.grafana.com-v1-edit`	Users with this role have permission to create, update, and delete `RecordingRule` resources.
`recordingrules.loki.grafana.com-v1-view`	Users with this role can read `RecordingRule` resources within the `loki.grafana.com/v1` API group. They can inspect configurations, labels, and annotations for existing recording rules but cannot make any modifications to them.

11.2.2.1. Examples

To apply cluster roles for a user, you must bind an existing cluster role to a specific username.

Cluster roles can be cluster or namespace scoped, depending on which type of role binding you use. When a RoleBinding object is used, as when using the oc adm policy add-role-to-user command, the cluster role only applies to the specified namespace. When a ClusterRoleBinding object is used, as when using the oc adm policy add-cluster-role-to-user command, the cluster role applies to all namespaces in the cluster.

The following example command gives the specified user create, read, update and delete (CRUD) permissions for alerting rules in a specific namespace in the cluster:

Example cluster role binding command for alerting rule CRUD permissions in a specific namespace

$ oc adm policy add-role-to-user alertingrules.loki.grafana.com-v1-admin -n <namespace> <username>

The following command gives the specified user administrator permissions for alerting rules in all namespaces:

Example cluster role binding command for administrator permissions

$ oc adm policy add-cluster-role-to-user alertingrules.loki.grafana.com-v1-admin <username>
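
The namespace-scoped binding from the first example can also be written declaratively. The following is a sketch of a roughly equivalent RoleBinding object, assuming a regular user account; the metadata.name value is a hypothetical choice, and the object that the oc command generates may be named differently.

```yaml
# Sketch of a RoleBinding that grants the alerting rule admin cluster role
# to one user in one namespace. The binding name is an assumption.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: alertingrules-admin-binding     # hypothetical name
  namespace: <namespace>
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole                     # a cluster role, scoped here by the RoleBinding
  name: alertingrules.loki.grafana.com-v1-admin
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: User
    name: <username>
```

Replacing kind: RoleBinding with kind: ClusterRoleBinding and dropping metadata.namespace would give the cluster-wide variant.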

11.2.3. Creating a log-based alerting rule with Loki

The AlertingRule CR contains a set of specifications and webhook validation definitions to declare groups of alerting rules for a single LokiStack instance. In addition, the webhook validation definition provides support for rule validation conditions:

If an AlertingRule CR includes an invalid interval period, it is an invalid alerting rule.
If an AlertingRule CR includes an invalid for period, it is an invalid alerting rule.
If an AlertingRule CR includes an invalid LogQL expr, it is an invalid alerting rule.
If an AlertingRule CR includes two groups with the same name, it is an invalid alerting rule.
If none of the above applies, an alerting rule is considered valid.
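
To show where each validated field sits, here is a sketch of the spec section of an AlertingRule CR that should satisfy the conditions above. The group name, alert name, durations, namespace, and annotation text are illustrative assumptions only.

```yaml
# Sketch of an AlertingRule spec with the validated fields annotated.
# All names and values are assumed examples.
spec:
  tenantID: "application"
  groups:
    - name: ExampleGroup        # group names must be unique within the CR
      interval: 1m              # evaluation interval; must be a valid duration
      rules:
        - alert: ExampleAlert
          # expr must be a valid LogQL expression
          expr: |
            sum(rate({kubernetes_namespace_name="app-ns"} |= "error" [1m])) by (job) > 0
          for: 5m               # must be a valid duration
          labels:
            severity: warning
          annotations:
            summary: Example summary
            description: Example description
```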

Tenant type	Valid namespaces for `AlertingRule` CRs
application	`<your_application_namespace>`
audit	`openshift-logging`
infrastructure	`openshift-*`, `kube-*`, `default`

Prerequisites

Red Hat OpenShift Logging Operator 5.7 and later
OpenShift Dedicated 4.13 and later

Procedure

Create an AlertingRule custom resource (CR):

Example infrastructure AlertingRule CR

  apiVersion: loki.grafana.com/v1
  kind: AlertingRule
  metadata:
    name: loki-operator-alerts
    namespace: openshift-operators-redhat # 1
    labels: # 2
      openshift.io/<label_name>: "true"
  spec:
    tenantID: "infrastructure" # 3
    groups:
      - name: LokiOperatorHighReconciliationError
        rules:
          - alert: HighPercentageError
            expr: | # 4
              sum(rate({kubernetes_namespace_name="openshift-operators-redhat", kubernetes_pod_name=~"loki-operator-controller-manager.*"} |= "error" [1m])) by (job)
                /
              sum(rate({kubernetes_namespace_name="openshift-operators-redhat", kubernetes_pod_name=~"loki-operator-controller-manager.*"}[1m])) by (job)
                > 0.01
            for: 10s
            labels:
              severity: critical # 5
            annotations:
              summary: High Loki Operator Reconciliation Errors # 6
              description: High Loki Operator Reconciliation Errors # 7

1: The namespace where this AlertingRule CR is created must have a label matching the LokiStack spec.rules.namespaceSelector definition.
2: The labels block must match the LokiStack spec.rules.selector definition.
3: AlertingRule CRs for infrastructure tenants are only supported in the openshift-*, kube-*, or default namespaces.
4: The value for kubernetes_namespace_name: must match the value for metadata.namespace.
5: The value of this mandatory field must be critical, warning, or info.
6: This field is mandatory.
7: This field is mandatory.

Example application AlertingRule CR

  apiVersion: loki.grafana.com/v1
  kind: AlertingRule
  metadata:
    name: app-user-workload
    namespace: app-ns # 1
    labels: # 2
      openshift.io/<label_name>: "true"
  spec:
    tenantID: "application"
    groups:
      - name: AppUserWorkloadHighError
        rules:
          - alert: <alert_name>
            expr: | # 3
              sum(rate({kubernetes_namespace_name="app-ns", kubernetes_pod_name=~"podName.*"} |= "error" [1m])) by (job)
            for: 10s
            labels:
              severity: critical # 4
            annotations:
              summary: <summary> # 5
              description: <description> # 6

1: The namespace where this AlertingRule CR is created must have a label matching the LokiStack spec.rules.namespaceSelector definition.
2: The labels block must match the LokiStack spec.rules.selector definition.
3: The value for kubernetes_namespace_name: must match the value for metadata.namespace.
4: The value of this mandatory field must be critical, warning, or info.
5: The value of this mandatory field is a summary of the rule.
6: The value of this mandatory field is a detailed description of the rule.

Apply the AlertingRule CR:
```
$ oc apply -f <filename>.yaml
```