Chapter 5. Using operational features of Service Telemetry Framework
You can use the following operational features to provide additional functionality to the Service Telemetry Framework (STF):
5.1. Dashboards in Service Telemetry Framework
Use the third-party application, Grafana, to visualize system-level metrics that the data collectors collectd and Ceilometer gather for each individual host node.
For more information about configuring data collectors, see Section 4.1, “Deploying Red Hat OpenStack Platform overcloud for Service Telemetry Framework using director”.
You can use dashboards to monitor a cloud:
- Infrastructure dashboard
- Use the infrastructure dashboard to view metrics for a single node at a time. Select a node from the upper left corner of the dashboard.
- Cloud view dashboard
- Use the cloud view dashboard to view panels to monitor service resource usage, API stats, and cloud events. You must enable API health monitoring and service monitoring to provide the data for this dashboard. API health monitoring is enabled by default in the STF base configuration. For more information, see Section 4.1.2, “Creating the base configuration for STF”.
- For more information about API health monitoring, see Section 5.8, “Red Hat OpenStack Platform API status and containerized services health”.
- For more information about RHOSP service monitoring, see Section 5.7, “Resource usage of Red Hat OpenStack Platform services”.
- Virtual machine view dashboard
- Use the virtual machine view dashboard to view panels to monitor virtual machine infrastructure usage. Select a cloud and project from the upper left corner of the dashboard. You must enable event storage if you want to enable the event annotations on this dashboard. For more information, see Section 3.2, “Creating a ServiceTelemetry object in Red Hat OpenShift Container Platform”.
- Memcached view dashboard
- Use the memcached view dashboard to view panels to monitor connections, availability, system metrics and cache performance. Select a cloud from the upper left corner of the dashboard.
5.1.1. Configuring Grafana to host the dashboard
Grafana is not included in the default Service Telemetry Framework (STF) deployment, so you must deploy the Grafana Operator from the community-operators CatalogSource. If you use the Service Telemetry Operator to deploy Grafana, the result is a Grafana instance with the default data sources configured for the local STF deployment.
Procedure
- Log in to Red Hat OpenShift Container Platform.
- Change to the service-telemetry namespace:
$ oc project service-telemetry
- Subscribe to the Grafana Operator by using the community-operators CatalogSource.
Warning: Community Operators are Operators which have not been vetted or verified by Red Hat. Community Operators should be used with caution because their stability is unknown. Red Hat provides no support for community Operators.
Learn more about Red Hat’s third party software support policy
- Verify that the Operator launched successfully. In the command output, if the value of the PHASE column is Succeeded, the Operator launched successfully:
$ oc get csv --selector operators.coreos.com/grafana-operator.service-telemetry
NAME                       DISPLAY            VERSION   REPLACES                   PHASE
grafana-operator.v4.10.1   Grafana Operator   4.10.1    grafana-operator.v4.10.0   Succeeded
- To launch a Grafana instance, create or modify the ServiceTelemetry object. Set graphing.enabled and graphing.grafana.ingressEnabled to true. Optionally, set the value of graphing.grafana.baseImage to the Grafana workload container image that will be deployed.
- Verify that the Grafana instance deployed:
$ oc get pod -l app=grafana
NAME                                  READY   STATUS    RESTARTS   AGE
grafana-deployment-7fc7848b56-sbkhv   1/1     Running   0          1m
- Verify that the Grafana data sources installed correctly:
$ oc get grafanadatasources
NAME                  AGE
default-datasources   20h
- Verify that the Grafana route exists:
$ oc get route grafana-route
NAME            HOST/PORT                                          PATH   SERVICES          PORT   TERMINATION   WILDCARD
grafana-route   grafana-route-service-telemetry.apps.infra.watch          grafana-service   3000   edge          None
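The Subscription and the graphing settings that the procedure above describes can be sketched as the following two manifests. This is a minimal sketch, not the definitive configuration: the Subscription channel, installPlanApproval, and sourceNamespace values, the infra.watch/v1beta1 API version, the object name default, and the baseImage value are assumptions that you should check against your cluster before you apply anything.

```yaml
# Assumed Subscription for the Grafana Operator.
# channel, installPlanApproval, and sourceNamespace are illustrative values.
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: grafana-operator
  namespace: service-telemetry
spec:
  channel: v4
  installPlanApproval: Automatic
  name: grafana-operator
  source: community-operators
  sourceNamespace: openshift-marketplace
---
# Fragment of the ServiceTelemetry object that enables Grafana.
# Only the graphing section is shown; keep the rest of your spec unchanged.
apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
  name: default
  namespace: service-telemetry
spec:
  graphing:
    enabled: true
    grafana:
      ingressEnabled: true
      # Optional: the Grafana workload image to deploy (assumed value).
      baseImage: docker.io/grafana/grafana:8.1.5
```

You could save each document to a file and apply it with oc apply -f, or add the graphing section through oc edit stf default.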
5.1.2. Overriding the default Grafana container image
The dashboards in Service Telemetry Framework (STF) require features that are available only in Grafana version 8.1.0 and later. By default, the Service Telemetry Operator installs a compatible version. You can override the base Grafana image by specifying the image path to an image registry with graphing.grafana.baseImage.
Procedure
- Ensure that you have the correct version of Grafana:
$ oc get pod -l "app=grafana" -ojsonpath='{.items[0].spec.containers[0].image}'
docker.io/grafana/grafana:7.3.10
- If the running image is older than 8.1.0, patch the ServiceTelemetry object to update the image. The Service Telemetry Operator updates the Grafana manifest, which restarts the Grafana deployment:
$ oc patch stf/default --type merge -p '{"spec":{"graphing":{"grafana":{"baseImage":"docker.io/grafana/grafana:8.1.5"}}}}'
- Verify that a new Grafana pod exists and has a STATUS value of Running:
$ oc get pod -l "app=grafana"
NAME                                 READY   STATUS    RESTARTS   AGE
grafana-deployment-fb9799b58-j2hj2   1/1     Running   0          10s
- Verify that the new instance is running the updated image:
$ oc get pod -l "app=grafana" -ojsonpath='{.items[0].spec.containers[0].image}'
docker.io/grafana/grafana:8.1.5
5.1.3. Importing dashboards
The Grafana Operator can import and manage dashboards by creating GrafanaDashboard objects. You can view example dashboards at https://github.com/infrawatch/dashboards.
Procedure
- Import the infrastructure dashboard:
$ oc apply -f https://raw.githubusercontent.com/infrawatch/dashboards/master/deploy/stf-1/rhos-dashboard.yaml
grafanadashboard.integreatly.org/rhos-dashboard-1 created
- Import the cloud dashboard:
Warning: In the stf-connectors.yaml file, ensure you set the value of the collectd virt plugin parameter hostname_format to name uuid hostname, otherwise some of the panels on the cloud dashboard display no information. For more information about the virt plugin, see collectd plugins.
$ oc apply -f https://raw.githubusercontent.com/infrawatch/dashboards/master/deploy/stf-1/rhos-cloud-dashboard.yaml
grafanadashboard.integreatly.org/rhos-cloud-dashboard-1 created
- Import the cloud events dashboard:
$ oc apply -f https://raw.githubusercontent.com/infrawatch/dashboards/master/deploy/stf-1/rhos-cloudevents-dashboard.yaml
grafanadashboard.integreatly.org/rhos-cloudevents-dashboard created
- Import the virtual machine dashboard:
$ oc apply -f https://raw.githubusercontent.com/infrawatch/dashboards/master/deploy/stf-1/virtual-machine-view.yaml
grafanadashboard.integreatly.org/virtual-machine-view-1 configured
- Import the memcached dashboard:
$ oc apply -f https://raw.githubusercontent.com/infrawatch/dashboards/master/deploy/stf-1/memcached-dashboard.yaml
grafanadashboard.integreatly.org/memcached-dashboard-1 created
- Verify that the dashboards are available, for example by listing the GrafanaDashboard objects with oc get grafanadashboards.
- Retrieve the Grafana route address:
$ oc get route grafana-route -ojsonpath='{.spec.host}'
grafana-route-service-telemetry.apps.infra.watch
- In a web browser, navigate to https://<grafana_route_address>. Replace <grafana_route_address> with the value that you retrieved in the previous step.
- To view the dashboard, click Dashboards and Manage.
5.1.4. Retrieving and setting Grafana login credentials
When Grafana is enabled, you can log in by using OpenShift authentication, or by using the default username and password set by the Grafana Operator.
You can override the credentials in the ServiceTelemetry object to have Service Telemetry Framework (STF) set the username and password for Grafana instead.
Procedure
- Log in to Red Hat OpenShift Container Platform.
- Change to the service-telemetry namespace:
$ oc project service-telemetry
- Retrieve the existing username and password from the STF object:
$ oc get stf default -o jsonpath="{.spec.graphing.grafana['adminUser','adminPassword']}"
- To modify the default values of the Grafana administrator username and password through the ServiceTelemetry object, use the graphing.grafana.adminUser and graphing.grafana.adminPassword parameters:
$ oc edit stf default
- Wait for the Grafana pod to restart with the new credentials in place:
$ oc get po -l app=grafana -w
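As an illustration of the credential override in the procedure above, the relevant fragment of the ServiceTelemetry object might look like the following. The API version and object name are assumptions, and the two quoted values are placeholders that you replace with your own credentials.

```yaml
# Fragment of the ServiceTelemetry object; only the graphing section is shown.
apiVersion: infra.watch/v1beta1   # assumed API version
kind: ServiceTelemetry
metadata:
  name: default
  namespace: service-telemetry
spec:
  graphing:
    enabled: true
    grafana:
      # Placeholders: replace with the administrator credentials you want STF to set.
      adminUser: "<grafana_admin_user>"
      adminPassword: "<grafana_admin_password>"
```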
5.2. Metrics retention time period in Service Telemetry Framework
The default retention time for metrics stored in Service Telemetry Framework (STF) is 24 hours, which provides enough data for trends to develop for the purposes of alerting.
For long-term storage, use systems designed for long-term data retention, for example, Thanos.
Additional resources
- To adjust STF for additional metrics retention time, see Section 5.2.1, “Editing the metrics retention time period in Service Telemetry Framework”.
- For recommendations about Prometheus data storage and estimating storage space, see https://prometheus.io/docs/prometheus/latest/storage/#operational-aspects
- For more information about Thanos, see https://thanos.io/
5.2.1. Editing the metrics retention time period in Service Telemetry Framework
You can adjust Service Telemetry Framework (STF) for additional metrics retention time.
Procedure
- Log in to Red Hat OpenShift Container Platform.
- Change to the service-telemetry namespace:
$ oc project service-telemetry
- Edit the ServiceTelemetry object:
$ oc edit stf default
- Add retention: 7d to the backends.metrics.prometheus.storage section to increase the retention period to seven days.
Note: If you set a long retention period, retrieving data from heavily populated Prometheus systems can result in queries returning results slowly.
- Save your changes and close the object.
- Wait for Prometheus to restart with the new settings:
$ oc get po -l app.kubernetes.io/name=prometheus -w
- Verify the new retention setting by checking the command-line arguments used in the pod:
$ oc describe po prometheus-default-0 | grep retention.time
--storage.tsdb.retention.time=7d
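The retention change in the procedure above can be sketched as the following fragment of the ServiceTelemetry object. Only the retention key and its location under backends.metrics.prometheus.storage come from the procedure; the API version, the enabled key, and the strategy value are assumptions about the surrounding spec.

```yaml
# Fragment of the ServiceTelemetry object; other spec sections are omitted.
apiVersion: infra.watch/v1beta1   # assumed API version
kind: ServiceTelemetry
metadata:
  name: default
  namespace: service-telemetry
spec:
  backends:
    metrics:
      prometheus:
        enabled: true            # assumed existing setting
        storage:
          strategy: persistent   # assumed existing setting
          retention: 7d          # keep metrics for seven days instead of 24 hours
```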
Additional resources
- For more information about the metrics retention time, see Section 5.2, “Metrics retention time period in Service Telemetry Framework”.
5.3. Alerts in Service Telemetry Framework
You create alert rules in Prometheus and alert routes in Alertmanager. Alert rules in Prometheus servers send alerts to an Alertmanager, which manages the alerts. Alertmanager can silence, inhibit, or aggregate alerts, and send notifications by using email, on-call notification systems, or chat platforms.
To create an alert, complete the following tasks:
- Create an alert rule in Prometheus. For more information, see Section 5.3.1, “Creating an alert rule in Prometheus”.
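- Create an alert route in Alertmanager. There are two ways in which you can create an alert route:
- Create a standard alert route. For more information, see Section 5.3.3, “Creating a standard alert route in Alertmanager”.
- Create an alert route with templating. For more information, see Section 5.3.4, “Creating an alert route with templating in Alertmanager”.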
Additional resources
For more information about alerts or notifications with Prometheus and Alertmanager, see https://prometheus.io/docs/alerting/overview/
To view an example set of alerts that you can use with Service Telemetry Framework (STF), see https://github.com/infrawatch/service-telemetry-operator/tree/master/deploy/alerts
5.3.1. Creating an alert rule in Prometheus
Prometheus evaluates alert rules to trigger notifications. If the rule condition returns an empty result set, the condition is false. Otherwise, the rule is true and it triggers an alert.
Procedure
- Log in to Red Hat OpenShift Container Platform.
- Change to the service-telemetry namespace:
$ oc project service-telemetry
- Create a PrometheusRule object that contains the alert rule. The Prometheus Operator loads the rule into Prometheus.
- To change the rule, edit the value of the expr parameter.
- To verify that the Operator loaded the rules into Prometheus, run the curl command against the default-prometheus-proxy route with basic authentication:
$ curl -k --user "internal:$(oc get secret default-prometheus-htpasswd -ogo-template='{{ .data.password | base64decode }}')" https://$(oc get route default-prometheus-proxy -ogo-template='{{ .spec.host }}')/api/v1/rules
{"status":"success","data":{"groups":[{"name":"./openstack.rules","file":"/etc/prometheus/rules/prometheus-default-rulefiles-0/service-telemetry-prometheus-alarm-rules.yaml","rules":[{"state":"inactive","name":"Collectd metrics receive count is zero","query":"rate(sg_total_collectd_msg_received_count[1m]) == 0","duration":0,"labels":{},"annotations":{},"alerts":[],"health":"ok","evaluationTime":0.00034627,"lastEvaluation":"2021-12-07T17:23:22.160448028Z","type":"alerting"}],"interval":30,"evaluationTime":0.000353787,"lastEvaluation":"2021-12-07T17:23:22.160444017Z"}]}}
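A PrometheusRule object that would produce the rule shown in the verification output above might look like the following sketch. The object name, namespace, group name, alert name, and expression are taken from that output and from Section 5.3.2; the two metadata labels are assumptions about how the Prometheus instance selects rules, so confirm them against your deployment.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: prometheus-alarm-rules
  namespace: service-telemetry
  labels:
    # Assumed rule selector labels; check the ruleSelector of your Prometheus object.
    prometheus: default
    role: alert-rules
spec:
  groups:
    - name: ./openstack.rules
      rules:
        - alert: Collectd metrics receive count is zero
          # The alert fires when this expression returns a non-empty result set.
          expr: rate(sg_total_collectd_msg_received_count[1m]) == 0
```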
Additional resources
- For more information on alerting, see https://github.com/coreos/prometheus-operator/blob/master/Documentation/user-guides/alerting.md
5.3.2. Configuring custom alerts
You can add custom alerts to the PrometheusRule object that you created in Section 5.3.1, “Creating an alert rule in Prometheus”.
Procedure
- Use the oc edit command:
$ oc edit prometheusrules prometheus-alarm-rules
- Edit the PrometheusRules manifest.
- Save and close the manifest.
Additional resources
- For more information about how to configure alerting rules, see https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/.
- For more information about PrometheusRules objects, see https://github.com/coreos/prometheus-operator/blob/master/Documentation/user-guides/alerting.md
5.3.3. Creating a standard alert route in Alertmanager
Use Alertmanager to deliver alerts to an external system, such as email, IRC, or other notification channel. The Prometheus Operator manages the Alertmanager configuration as a Red Hat OpenShift Container Platform secret. By default, Service Telemetry Framework (STF) deploys a basic configuration that results in no receivers.
To deploy a custom Alertmanager route with STF, you must add an alertmanagerConfigManifest parameter to the ServiceTelemetry object. The Service Telemetry Operator then produces an updated secret, which the Prometheus Operator manages.
If your alertmanagerConfigManifest contains a custom template, for example, to construct the title and text of the sent alert, you must deploy the contents of the alertmanagerConfigManifest using a base64-encoded configuration. For more information, see Section 5.3.4, “Creating an alert route with templating in Alertmanager”.
Procedure
- Log in to Red Hat OpenShift Container Platform.
- Change to the service-telemetry namespace:
$ oc project service-telemetry
- Edit the ServiceTelemetry object for your STF deployment:
$ oc edit stf default
- Add the new parameter alertmanagerConfigManifest and the Secret object contents to define the alertmanager.yaml configuration for Alertmanager.
Note: This step loads the default template that the Service Telemetry Operator manages. To verify that the changes are populating correctly, change a value, return the alertmanager-default secret, and verify that the new value is loaded into memory. For example, change the value of the parameter global.resolve_timeout from 5m to 10m.
- Verify that the configuration has been applied to the alertmanager-default secret.
- Run the wget command from the prometheus pod against the alertmanager-proxy service to retrieve the status and configYAML contents, and verify that the supplied configuration matches the configuration in Alertmanager:
$ oc exec -it prometheus-default-0 -c prometheus -- sh -c "wget --header \"Authorization: Bearer \$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)\" https://default-alertmanager-proxy:9095/api/v1/status -q -O -"
{"status":"success","data":{"configYAML":"...",...}}
- Verify that the configYAML field contains the changes you expect.
- If you created a separate curl pod to run checks, delete it to clean up the environment:
$ oc delete pod curl
pod "curl" deleted
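The alertmanagerConfigManifest step in the procedure above can be sketched as follows. This is an illustration under assumptions, not the definitive default: the API version, the route timing values, and the 'null' receiver are assumed to approximate the baseline configuration with no receivers, and resolve_timeout is set to 10m to match the verification example in the note.

```yaml
apiVersion: infra.watch/v1beta1   # assumed API version
kind: ServiceTelemetry
metadata:
  name: default
  namespace: service-telemetry
spec:
  backends:
    metrics:
      prometheus:
        enabled: true
  # The value is itself a Secret manifest, embedded as a multi-line string.
  alertmanagerConfigManifest: |
    apiVersion: v1
    kind: Secret
    metadata:
      name: 'alertmanager-default'
      namespace: 'service-telemetry'
    type: Opaque
    stringData:
      alertmanager.yaml: |-
        global:
          resolve_timeout: 10m
        route:
          group_by: ['job']
          group_wait: 30s
          group_interval: 5m
          repeat_interval: 12h
          receiver: 'null'
        receivers:
        - name: 'null'
```

Because the configuration is supplied through stringData, it is plain text in the ServiceTelemetry object; configurations that contain templates need the base64-encoded approach in Section 5.3.4.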
Additional resources
- For more information about the Red Hat OpenShift Container Platform secret and the Prometheus operator, see Prometheus user guide on alerting.
5.3.4. Creating an alert route with templating in Alertmanager
Use Alertmanager to deliver alerts to an external system, such as email, IRC, or other notification channel. The Prometheus Operator manages the Alertmanager configuration as a Red Hat OpenShift Container Platform secret. By default, Service Telemetry Framework (STF) deploys a basic configuration that results in no receivers.
If the alertmanagerConfigManifest parameter contains a custom template, for example, to construct the title and text of the sent alert, you must deploy the contents of the alertmanagerConfigManifest by using a base64-encoded configuration.
Procedure
- Log in to Red Hat OpenShift Container Platform.
- Change to the service-telemetry namespace:
$ oc project service-telemetry
- Create the necessary Alertmanager configuration in a file called alertmanager.yaml.
- Generate the config manifest and add it to the ServiceTelemetry object for your STF deployment:
$ CONFIG_MANIFEST=$(oc create secret --dry-run=client generic alertmanager-default --from-file=alertmanager.yaml -o json)
$ oc patch stf default --type=merge -p '{"spec":{"alertmanagerConfigManifest":'"$CONFIG_MANIFEST"'}}'
- Verify that the configuration has been applied to the secret.
Note: There is a short delay while the Operators update each object.
- Run the wget command from the prometheus pod against the alertmanager-proxy service to retrieve the status and configYAML contents, and verify that the supplied configuration matches the configuration in Alertmanager:
$ oc exec -it prometheus-default-0 -c prometheus -- /bin/sh -c "wget --header \"Authorization: Bearer \$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)\" https://default-alertmanager-proxy:9095/api/v1/status -q -O -"
{"status":"success","data":{"configYAML":"...",...}}
- Verify that the configYAML field contains the changes you expect.
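An alertmanager.yaml file with a templated title and text, as used in the procedure above, might look like the following sketch. The Slack receiver, the channel name, the webhook placeholder, and the route timing values are hypothetical choices for illustration; any receiver type that Alertmanager supports follows the same pattern.

```yaml
# Hypothetical alertmanager.yaml with a templated Slack notification.
global:
  resolve_timeout: 10m
  # Placeholder: replace with your incoming webhook URL.
  slack_api_url: '<slack_api_url>'
route:
  group_by: ['job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'slack'
receivers:
  - name: 'slack'
    slack_configs:
      # Quoted because an unquoted # starts a YAML comment.
      - channel: '#stf-alerts'
        # Go templates that Alertmanager expands for each notification.
        title: '[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *Alert:* {{ .Labels.alertname }}
          *Description:* {{ .Annotations.description }}
          {{ end }}
```

The template braces are the reason for generating the Secret from a file: the resulting base64-encoded data passes through the ServiceTelemetry object without being interpreted.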
Additional resources
- For more information about the Red Hat OpenShift Container Platform secret and the Prometheus operator, see Prometheus user guide on alerting.
5.4. Sending alerts as SNMP traps
To enable SNMP traps, modify the ServiceTelemetry object and configure the snmpTraps parameters. SNMP traps are sent using version 2c.
5.4.1. Configuration parameters for snmpTraps
The snmpTraps parameter contains the following sub-parameters for configuring the alert receiver:
- enabled
- Set the value of this sub-parameter to true to enable the SNMP trap alert receiver. The default value is false.
- target
- Target address to send SNMP traps. Value is a string. Default is 192.168.24.254.
- port
- Target port to send SNMP traps. Value is an integer. Default is 162.
- community
- Target community to send SNMP traps to. Value is a string. Default is public.
- retries
- SNMP trap retry delivery limit. Value is an integer. Default is 5.
- timeout
- SNMP trap delivery timeout defined in seconds. Value is an integer. Default is 1.
- alertOidLabel
- Label name in the alert that defines the OID value to send the SNMP trap as. Value is a string. Default is oid.
- trapOidPrefix
- SNMP trap OID prefix for variable bindings. Value is a string. Default is 1.3.6.1.4.1.50495.15.
- trapDefaultOid
- SNMP trap OID when no alert OID label has been specified with the alert. Value is a string. Default is 1.3.6.1.4.1.50495.15.1.2.1.
- trapDefaultSeverity
- SNMP trap severity when no alert severity has been set. Value is a string. Defaults to an empty string.
Configure the snmpTraps parameter as part of the alerting.alertmanager.receivers definition in the ServiceTelemetry object.
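The following fragment sketches where the snmpTraps sub-parameters sit in the ServiceTelemetry object. Every value shown is the documented default except enabled, which is set to true; the API version and object name are assumptions.

```yaml
# Fragment of the ServiceTelemetry object; other spec sections are omitted.
apiVersion: infra.watch/v1beta1   # assumed API version
kind: ServiceTelemetry
metadata:
  name: default
  namespace: service-telemetry
spec:
  alerting:
    alertmanager:
      receivers:
        snmpTraps:
          enabled: true
          # Replace with the address of your SNMP trap receiver.
          target: 192.168.24.254
          port: 162
          community: public
          retries: 5
          timeout: 1
          alertOidLabel: oid
          trapOidPrefix: "1.3.6.1.4.1.50495.15"
          trapDefaultOid: "1.3.6.1.4.1.50495.15.1.2.1"
          trapDefaultSeverity: ""
```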
5.4.2. Overview of the MIB definition
Delivery of SNMP traps uses object identifier (OID) value 1.3.6.1.4.1.50495.15.1.2.1 by default. The management information base (MIB) schema is available at https://github.com/infrawatch/prometheus-webhook-snmp/blob/master/PROMETHEUS-ALERT-CEPH-MIB.txt.
The OID number consists of the following component values:
- The value 1.3.6.1.4.1 is a global OID defined for private enterprises.
- The next identifier 50495 is a private enterprise number assigned by IANA for the Ceph organization.
- The other values are child OIDs of the parent.
- 15
- prometheus objects
- 15.1
- prometheus alerts
- 15.1.2
- prometheus alert traps
- 15.1.2.1
- prometheus alert trap default
The prometheus alert trap default is an object that consists of several sub-objects under OID 1.3.6.1.4.1.50495.15, which is defined by the alerting.alertmanager.receivers.snmpTraps.trapOidPrefix parameter:
- <trapOidPrefix>.1.1.1
- alert name
- <trapOidPrefix>.1.1.2
- status
- <trapOidPrefix>.1.1.3
- severity
- <trapOidPrefix>.1.1.4
- instance
- <trapOidPrefix>.1.1.5
- job
- <trapOidPrefix>.1.1.6
- description
- <trapOidPrefix>.1.1.7
- labels
- <trapOidPrefix>.1.1.8
- timestamp
- <trapOidPrefix>.1.1.9
- rawdata
The following is example output from a simple SNMP trap receiver that outputs the received trap to the console:
5.4.3. Configuring SNMP traps
Prerequisites
- Ensure that you know the IP address or hostname of the SNMP trap receiver where you want to send the alerts.
Procedure
- Log in to Red Hat OpenShift Container Platform.
- Change to the service-telemetry namespace:
$ oc project service-telemetry
- To enable SNMP traps, modify the ServiceTelemetry object:
$ oc edit stf default
- Set the alerting.alertmanager.receivers.snmpTraps parameters.
- Ensure that you set the value of target to the IP address or hostname of the SNMP trap receiver.
Additional Information
For more information about available parameters for snmpTraps, see Section 5.4.1, “Configuration parameters for snmpTraps”.
5.4.4. Creating alerts for SNMP traps
You can create alerts that are configured for delivery by SNMP traps by adding labels that are parsed by the prometheus-webhook-snmp middleware to define the trap information and delivered object identifiers (OID). Adding the oid or severity labels is only required if you need to change the default values for a particular alert definition.
Note: When you set the oid label, the top-level SNMP trap OID changes, but the sub-OIDs remain defined by the global trapOidPrefix value plus the child OID values .1.1.1 through .1.1.9. For more information about the MIB definition, see Section 5.4.2, “Overview of the MIB definition”.
Procedure
- Log in to Red Hat OpenShift Container Platform.
- Change to the service-telemetry namespace:
$ oc project service-telemetry
- Create a PrometheusRule object that contains the alert rule and an oid label that contains the SNMP trap OID override value.
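A PrometheusRule object with the labels that the procedure above describes might look like the following sketch. The object name prometheus-alarm-rules-snmp, the metadata labels, the alert name, and the severity value are hypothetical, and the oid value shown is the documented default trap OID; replace it with the OID that you want this alert to use.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: prometheus-alarm-rules-snmp   # hypothetical name
  namespace: service-telemetry
  labels:
    # Assumed rule selector labels; check the ruleSelector of your Prometheus object.
    prometheus: default
    role: alert-rules
spec:
  groups:
    - name: ./openstack.rules
      rules:
        - alert: Collectd metrics receive count is zero
          expr: rate(sg_total_collectd_msg_received_count[1m]) == 0
          labels:
            # Read by the SNMP webhook through the alertOidLabel setting (default: oid).
            oid: "1.3.6.1.4.1.50495.15.1.2.1"
            # Optional: overrides trapDefaultSeverity for this alert.
            severity: critical
```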
Additional information
For more information about configuring alerts, see Section 5.3, “Alerts in Service Telemetry Framework”.
5.5. High availability
With high availability, Service Telemetry Framework (STF) can rapidly recover from failures in its component services. Although Red Hat OpenShift Container Platform restarts a failed pod if nodes are available to schedule the workload, this recovery process might take more than one minute, during which time events and metrics are lost. A high availability configuration includes multiple copies of STF components, which reduces recovery time to approximately 2 seconds. To protect against failure of a Red Hat OpenShift Container Platform node, deploy STF to a Red Hat OpenShift Container Platform cluster with three or more nodes.
STF is not yet a fully fault tolerant system. Delivery of metrics and events during the recovery period is not guaranteed.
Enabling high availability has the following effects:
- Three Elasticsearch pods run instead of the default one.
- The following components run two pods instead of the default one:
- AMQ Interconnect
- Alertmanager
- Prometheus
- Events Smart Gateway
- Metrics Smart Gateway
- Recovery time from a lost pod in any of these services reduces to approximately 2 seconds.
5.5.1. Configuring high availability
To configure Service Telemetry Framework (STF) for high availability, add highAvailability.enabled: true to the ServiceTelemetry object in Red Hat OpenShift Container Platform. You can set this parameter at installation time or, if you already deployed STF, complete the following steps:
Procedure
- Log in to Red Hat OpenShift Container Platform.
- Change to the service-telemetry namespace:
$ oc project service-telemetry
- Use the oc command to edit the ServiceTelemetry object:
$ oc edit stf default
- Add highAvailability.enabled: true to the spec section.
- Save your changes and close the object.
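In the ServiceTelemetry object, the high availability setting from the procedure above can be sketched as the following fragment. The API version and object name are assumptions; the rest of your spec stays unchanged.

```yaml
# Fragment of the ServiceTelemetry object; other spec sections are omitted.
apiVersion: infra.watch/v1beta1   # assumed API version
kind: ServiceTelemetry
metadata:
  name: default
  namespace: service-telemetry
spec:
  highAvailability:
    enabled: true
```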
5.6. Observability Strategy in Service Telemetry Framework
Service Telemetry Framework (STF) does not include storage backends and alerting tools. STF uses community operators to deploy Prometheus, Alertmanager, Grafana, and Elasticsearch. STF makes requests to these community operators to create instances of each application configured to work with STF.
Instead of having Service Telemetry Operator create custom resource requests, you can use your own deployments of these applications or other compatible applications, and scrape the metrics Smart Gateways for delivery to your own Prometheus-compatible system for telemetry storage. If you set observabilityStrategy to none, storage backends are not deployed, so STF does not require persistent storage.
5.6.1. Configuring an alternate observability strategy
To configure STF to skip the deployment of storage, visualization, and alerting backends, add observabilityStrategy: none to the ServiceTelemetry spec. In this mode, only AMQ Interconnect routers and metrics Smart Gateways are deployed, and you must configure an external Prometheus-compatible system to collect metrics from the STF Smart Gateways.
Currently, only metrics are supported when you set observabilityStrategy to none. Events Smart Gateways are not deployed.
Procedure
- Create a ServiceTelemetry object with the property observabilityStrategy: none in the spec parameter. Such a manifest results in a default deployment of STF that is suitable for receiving telemetry from a single cloud with all metrics collector types.
- Delete the leftover objects that are managed by community operators:
  $ for o in alertmanager/default prometheus/default elasticsearch/elasticsearch grafana/default; do oc delete $o; done
- To verify that all workloads are operating correctly, view the pods and the status of each pod.
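The create and verify steps can be sketched as follows. This is an illustration under stated assumptions, not the reference manifest: the apiVersion value infra.watch/v1beta1 is not stated in this section and must match the ServiceTelemetry CRD installed on your cluster, and the file name stf-no-observability.yaml is hypothetical. The object name default and the service-telemetry namespace are taken from the other procedures in this chapter.

```shell
# Write the manifest to a local file so that you can review it before applying it.
cat > stf-no-observability.yaml <<'EOF'
# Assumption: confirm the apiVersion against the CRD on your cluster.
apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
  name: default
  namespace: service-telemetry
spec:
  observabilityStrategy: none
EOF

# Apply the manifest and list the pods only when the oc client is installed.
if command -v oc >/dev/null 2>&1; then
  oc apply -f stf-no-observability.yaml
  oc get pods --namespace service-telemetry
else
  echo "oc not found; review stf-no-observability.yaml and apply it manually"
fi
```

In the oc get pods output, check that each pod reports a Running status. With this strategy, expect only AMQ Interconnect, metrics Smart Gateway, and operator pods.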
Additional resources
- For more information about configuring additional clouds or changing the set of supported collectors, see Section 4.3.2, “Deploying Smart Gateways”.
5.7. Resource usage of Red Hat OpenStack Platform services
You can monitor the resource usage of the Red Hat OpenStack Platform (RHOSP) services, such as the APIs and other infrastructure processes, to identify bottlenecks in the overcloud by showing services that run out of compute power. Resource usage monitoring is enabled by default.
Additional resources
- To disable resource usage monitoring, see Section 5.7.1, “Disabling resource usage monitoring of Red Hat OpenStack Platform services”.
5.7.1. Disabling resource usage monitoring of Red Hat OpenStack Platform services
To disable the monitoring of RHOSP containerized service resource usage, you must set the CollectdEnableLibpodstats parameter to false.
Prerequisites
- You have created the stf-connectors.yaml file. For more information, see Section 4.1, “Deploying Red Hat OpenStack Platform overcloud for Service Telemetry Framework using director”.
- You are using the most current version of Red Hat OpenStack Platform (RHOSP) 16.1.
Procedure
- Open the stf-connectors.yaml file and add the CollectdEnableLibpodstats parameter to override the setting in enable-stf.yaml. Ensure that stf-connectors.yaml is called from the openstack overcloud deploy command after enable-stf.yaml:
  CollectdEnableLibpodstats: false
- Continue with the overcloud deployment procedure. For more information, see Section 4.1.4, “Deploying the overcloud”.
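Before you deploy, you can confirm that the override is in place. The following sketch assumes that the parameter sits under parameter_defaults, which is where heat environment files set parameter values. It writes a hypothetical example file, stf-connectors.example.yaml, rather than touching your real stf-connectors.yaml, and the check itself is an illustration, not part of the documented procedure.

```shell
# Hypothetical example file; a real stf-connectors.yaml contains further settings.
EXAMPLE_FILE=./stf-connectors.example.yaml

cat > "$EXAMPLE_FILE" <<'EOF'
parameter_defaults:
  CollectdEnableLibpodstats: false
EOF

# Succeeds only when the parameter is present, indented under a parent key,
# and set to false.
libpodstats_disabled() {
  grep -Eq '^[[:space:]]+CollectdEnableLibpodstats:[[:space:]]*false[[:space:]]*$' "$1"
}

if libpodstats_disabled "$EXAMPLE_FILE"; then
  echo "resource usage monitoring is disabled in $EXAMPLE_FILE"
else
  echo "CollectdEnableLibpodstats: false not found in $EXAMPLE_FILE" >&2
fi
```

To check your own environment file, call libpodstats_disabled with its path.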
5.8. Red Hat OpenStack Platform API status and containerized services health
You can use the OCI (Open Container Initiative) standard to assess the container health status of each Red Hat OpenStack Platform (RHOSP) service by periodically running a health check script. Most RHOSP services implement a health check that logs issues and returns a binary status. For the RHOSP APIs, the health checks query the root endpoint and determine the health based on the response time.
Monitoring of RHOSP container health and API status is enabled by default.
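To illustrate how a response time can be reduced to a binary status, consider the following sketch. It is not the RHOSP health check script: the 2000 ms threshold, the function names, and the use of curl to time the request are all assumptions made for the example.

```shell
# Hypothetical threshold in milliseconds; not taken from the RHOSP health check scripts.
THRESHOLD_MS=2000

# Measure the total response time of a URL in whole milliseconds by using curl.
response_time_ms() {
  seconds=$(curl --silent --output /dev/null --max-time 10 \
    --write-out '%{time_total}' "$1") || return 1
  awk -v t="$seconds" 'BEGIN { printf "%d\n", t * 1000 }'
}

# Reduce a response time to a binary status: return 0 (healthy) or 1 (unhealthy).
api_status() {
  if [ "$1" -le "$THRESHOLD_MS" ]; then
    echo "healthy"
    return 0
  fi
  echo "unhealthy"
  return 1
}

# Offline demonstration with fixed response times.
api_status 150
api_status 5000 || true
```

To check a live endpoint, pass the measured value to the status function, for example api_status "$(response_time_ms <api-root-url>)", where the URL is the root endpoint of the API that you want to probe.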
Additional resources
- To disable RHOSP container health and API status monitoring, see Section 5.8.1, “Disabling container health and API status monitoring”.
5.8.1. Disabling container health and API status monitoring
To disable RHOSP containerized service health and API status monitoring, you must set the CollectdEnableSensubility parameter to false.
Prerequisites
- You have created the stf-connectors.yaml file in your templates directory. For more information, see Section 4.1, “Deploying Red Hat OpenStack Platform overcloud for Service Telemetry Framework using director”.
- You are using the most current version of Red Hat OpenStack Platform (RHOSP) 16.1.
Procedure
- Open the stf-connectors.yaml file and add the CollectdEnableSensubility parameter to override the setting in enable-stf.yaml. Ensure that stf-connectors.yaml is called from the openstack overcloud deploy command after enable-stf.yaml:
  CollectdEnableSensubility: false
- Continue with the overcloud deployment procedure. For more information, see Section 4.1.4, “Deploying the overcloud”.
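The override takes effect only if stf-connectors.yaml is passed to the deploy command after enable-stf.yaml. As a minimal sketch, the following hypothetical helper checks that ordering in an argument list. The function name and the templates/ paths are placeholders, not the real template locations.

```shell
# Return 0 when stf-connectors.yaml appears after enable-stf.yaml in the arguments.
stf_env_order_ok() {
  seen_enable=0
  for arg in "$@"; do
    case "$arg" in
      */enable-stf.yaml|enable-stf.yaml)
        seen_enable=1
        ;;
      */stf-connectors.yaml|stf-connectors.yaml)
        # The override file is valid only if enable-stf.yaml came first.
        [ "$seen_enable" -eq 1 ] && return 0
        return 1
        ;;
    esac
  done
  # stf-connectors.yaml was not passed at all.
  return 1
}

# Example argument list with placeholder paths.
if stf_env_order_ok openstack overcloud deploy --templates \
    -e templates/enable-stf.yaml -e templates/stf-connectors.yaml; then
  echo "environment file order is correct"
else
  echo "stf-connectors.yaml must follow enable-stf.yaml" >&2
fi
```

The same ordering check applies when you override CollectdEnableLibpodstats.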
Additional resources
- For more information about multiple cloud addresses, see Section 4.3, “Configuring multiple clouds”.