Monitoring OpenShift Container Storage
Monitoring OpenShift Container Storage using storage dashboards
Abstract
Making open source more inclusive
Red Hat is committed to replacing problematic language in our code, documentation, and web properties. We are beginning with these four terms: master, slave, blacklist, and whitelist. Because of the enormity of this endeavor, these changes will be implemented gradually over several upcoming releases. For more details, see our CTO Chris Wright’s message.
Providing feedback on Red Hat documentation
We appreciate your input on our documentation. Do let us know how we can make it better. To give feedback:
For simple comments on specific passages:
- Make sure you are viewing the documentation in the Multi-page HTML format. In addition, ensure you see the Feedback button in the upper right corner of the document.
- Use your mouse cursor to highlight the part of text that you want to comment on.
- Click the Add Feedback pop-up that appears below the highlighted text.
- Follow the displayed instructions.
For submitting more complex feedback, create a Bugzilla ticket:
- Go to the Bugzilla website.
- As the Component, use Documentation.
- Fill in the Description field with your suggestion for improvement. Include a link to the relevant part(s) of documentation.
- Click Submit Bug.
Chapter 1. Cluster health
1.1. Verifying OpenShift Container Storage is healthy
Storage health is visible on the Persistent Storage and Object Service dashboards.
Procedure
- Log in to OpenShift Web Console.
- Check the Status card in the following locations:
  - Home → Overview → Persistent Storage
  - Home → Overview → Object Service
If a green tick appears on the Status card, the cluster is healthy.
If the state is not Healthy, see Section 1.2, “Storage health levels and cluster state” for more information about the current state and any alerts that appear.
1.2. Storage health levels and cluster state
Status information and alerts related to OpenShift Container Storage are displayed in the storage dashboards.
1.2.1. Persistent storage dashboard indicators
The Persistent Storage dashboard shows the state of OpenShift Container Storage as a whole, as well as the state of persistent volumes.
The states that are possible for each resource type are listed in the following table.
State | Description |
---|---|
UNKNOWN | OpenShift Container Storage is not deployed or is unavailable. |
Green Tick | Cluster health is good. |
Warning | The OpenShift Container Storage cluster is in a warning state. In internal mode, an alert is displayed along with the issue details. Alerts are not displayed for external mode. |
Error | The OpenShift Container Storage cluster has encountered an error and some component is nonfunctional. In internal mode, an alert is displayed along with the issue details. Alerts are not displayed for external mode. |
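The states in the table above can be related to the ceph_health_status metric listed in Section 4.2. The following is a minimal sketch, not the console's actual logic: it assumes the numeric convention of the Ceph manager Prometheus module (0 for HEALTH_OK, 1 for HEALTH_WARN, 2 for HEALTH_ERR), and the function name `dashboard_state` is hypothetical.

```python
# Assumption: ceph_health_status follows the Ceph manager Prometheus module
# convention (0 = HEALTH_OK, 1 = HEALTH_WARN, 2 = HEALTH_ERR).
CEPH_HEALTH_TO_STATE = {0: "Healthy", 1: "Warning", 2: "Error"}


def dashboard_state(ceph_health_status):
    """Map a ceph_health_status sample to a dashboard-style state name.

    A missing sample (None) or an unrecognized value maps to UNKNOWN,
    mirroring the "not deployed or unavailable" row of the table.
    """
    if ceph_health_status is None:
        return "UNKNOWN"
    return CEPH_HEALTH_TO_STATE.get(int(ceph_health_status), "UNKNOWN")


if __name__ == "__main__":
    for sample in (0, 1, 2, None):
        print(sample, "->", dashboard_state(sample))
```

This hypothetical helper only illustrates how the four states relate to one another. The dashboard remains the reference for the actual cluster state.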
1.2.2. Object Service dashboard indicators
The Object Service dashboard shows the state of the Multicloud Object Gateway and any object claims in the cluster.
The states that are possible for each resource type are listed in the following table.
State | Description |
---|---|
Green Tick | Object Storage is healthy. |
Multicloud Object Gateway is not running | Shown when NooBaa system is not found. |
All resources are unhealthy | Shown when all NooBaa pools are unhealthy. |
Many buckets have issues | Shown when >= 50% of buckets encounter error(s). |
Some buckets have issues | Shown when >= 30% of buckets encounter error(s). |
Unavailable | Shown when network issues and/or errors exist. |
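The percentage thresholds in the table above overlap (a cluster with 60% failing buckets also satisfies the 30% rule), so the order of evaluation matters. The following sketch shows one way to apply the rules. The function name, its parameters, and the precedence of the checks are assumptions made for illustration. The Unavailable state is omitted because it depends on network conditions rather than counts.

```python
def object_service_state(system_found, healthy_pools, total_pools,
                         buckets_with_errors, total_buckets):
    """Classify Object Service health using the thresholds from the table.

    The more severe conditions are checked first so that the 50% rule
    takes precedence over the 30% rule (assumed ordering).
    """
    if not system_found:
        return "Multicloud Object Gateway is not running"
    if total_pools > 0 and healthy_pools == 0:
        return "All resources are unhealthy"
    if total_buckets > 0:
        # Integer arithmetic avoids floating-point rounding at the boundaries.
        if buckets_with_errors * 100 >= 50 * total_buckets:
            return "Many buckets have issues"
        if buckets_with_errors * 100 >= 30 * total_buckets:
            return "Some buckets have issues"
    return "Healthy"


if __name__ == "__main__":
    print(object_service_state(True, 2, 2, 3, 10))
```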
1.2.3. Alert panel
The Alert panel appears below the Status card in both the Persistent Storage dashboard and the Object Service dashboard when the cluster state is not healthy.
Information about specific alerts and how to respond to them is available in Troubleshooting OpenShift Container Storage.
Chapter 2. Metrics
2.1. Metrics in the persistent storage dashboard
To view the persistent storage dashboard, click Home → Overview → Persistent Storage in OpenShift Web Console.
The following cards on the Persistent Storage dashboard provide metrics based on the deployment mode (internal or external):
- Details card
The Details card shows the following:
- Service Name
- Cluster name
- The name of the Provider on which the system runs (for example, AWS, vSphere, or ‘None’ for bare metal)
- Mode (the deployment mode, either Internal or External)
- OpenShift Container Storage operator version
- Inventory card
The Inventory card shows the number of active nodes, PVCs, and PVs backed by the OpenShift Container Storage provisioner. The left side of the card shows the total number of storage nodes, PVCs, and PVs. The right side shows the number of storage nodes in the Not Ready state, PVCs in the Pending state, and PVs in the Released state.
For external mode, the number of nodes will be 0 by default, since there are no dedicated nodes for OpenShift Container Storage.
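The counts on the Inventory card can be pictured as a simple tally over resource states. The sketch below is illustrative only: the input records are simplified, hypothetical dictionaries rather than real API objects, and `inventory_summary` is not part of any product API. The phase names Pending and Released are the standard Kubernetes PVC and PV phases.

```python
def inventory_summary(nodes, pvcs, pvs):
    """Return totals (left side of the card) and problem counts (right side).

    nodes: list of {"name": str, "ready": bool}   (hypothetical shape)
    pvcs:  list of {"name": str, "phase": str}    (hypothetical shape)
    pvs:   list of {"name": str, "phase": str}    (hypothetical shape)
    """
    return {
        "nodes": {
            "total": len(nodes),
            "not_ready": sum(1 for n in nodes if not n["ready"]),
        },
        "pvcs": {
            "total": len(pvcs),
            "pending": sum(1 for c in pvcs if c["phase"] == "Pending"),
        },
        "pvs": {
            "total": len(pvs),
            "released": sum(1 for v in pvs if v["phase"] == "Released"),
        },
    }


if __name__ == "__main__":
    summary = inventory_summary(
        nodes=[{"name": "node-1", "ready": True}, {"name": "node-2", "ready": False}],
        pvcs=[{"name": "pvc-1", "phase": "Bound"}, {"name": "pvc-2", "phase": "Pending"}],
        pvs=[{"name": "pv-1", "phase": "Bound"}],
    )
    print(summary)
```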
- Status card
This card shows whether the cluster is up and running without any errors or is experiencing some issues.
For internal mode, Data Resiliency indicates the status of data re-balancing in Ceph across the replicas. When the internal mode cluster is in a warning or error state, the Alerts section is shown along with the relevant alerts.
For external mode, Data Resiliency and alerts are not displayed.
- Raw Capacity
In this card, you can view the total raw storage capacity of the cluster, which includes replication.
- Used - displays the used raw storage capacity on the cluster.
- Available - displays the available raw storage capacity on the cluster.
This card is not applicable for external mode clusters.
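Because raw capacity includes replication, the numbers on this card are larger than the amount of application data stored. The following sketch works through the arithmetic. The function names are hypothetical, and the replica count of 3 is an assumption for illustration; the actual value depends on how the storage pools are configured.

```python
TIB = 1024 ** 4  # bytes in one tebibyte


def raw_capacity_summary(total_raw_bytes, used_raw_bytes):
    """Compute the Used and Available figures from raw totals."""
    available = total_raw_bytes - used_raw_bytes
    used_percent = (
        round(100 * used_raw_bytes / total_raw_bytes, 1) if total_raw_bytes else 0.0
    )
    return {
        "used": used_raw_bytes,
        "available": available,
        "used_percent": used_percent,
    }


def raw_bytes_consumed(data_bytes, replicas=3):
    """Estimate raw bytes consumed by non-replicated data.

    replicas=3 is an assumed replication factor, not a documented constant.
    """
    return data_bytes * replicas


if __name__ == "__main__":
    print(raw_capacity_summary(12 * TIB, 3 * TIB))
    print(raw_bytes_consumed(1 * TIB) // TIB, "TiB raw for 1 TiB of data")
```

Under the assumed replica count, 1 TiB of application data consumes 3 TiB of raw capacity.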
- Used Capacity Breakdown
This card shows the actual amount of non-replicated data stored in the cluster and its distribution. You can choose between Projects, Storage Classes, and Pods from the drop-down menu at the top of the card. These options filter the data shown in the graph. The graph displays the used capacity for only the top five entities, based on usage. The aggregate usage of the remaining entities is displayed as Other.
Option | Display |
---|---|
Projects | The aggregated capacity of each project that uses OpenShift Container Storage and how much is being used. |
Storage Classes | The aggregate capacity usage from the OpenShift Container Storage based storage classes. |
Pods | The capacity usage per pod from the attached PVC backed by OpenShift Container Storage provisioners. |
For external mode, see the Capacity breakdown card.
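The "top five plus Other" grouping used by the capacity breakdown graph can be sketched as follows. This is an illustration of the grouping described above, not the console's implementation. The function name and the sample project names are hypothetical.

```python
def top_entities_with_other(usage_bytes, top_n=5):
    """Keep the top_n entities by usage and fold the rest into "Other".

    usage_bytes maps an entity name (project, storage class, or pod)
    to its used capacity in bytes.
    """
    ranked = sorted(usage_bytes.items(), key=lambda item: item[1], reverse=True)
    top = ranked[:top_n]
    remainder = ranked[top_n:]
    if remainder:
        top.append(("Other", sum(value for _, value in remainder)))
    return top


if __name__ == "__main__":
    sample = {
        "proj-a": 50, "proj-b": 40, "proj-c": 30, "proj-d": 20,
        "proj-e": 10, "proj-f": 5, "proj-g": 1,
    }
    for name, used in top_entities_with_other(sample):
        print(name, used)
```

With the sample data, the two smallest projects are combined into a single Other entry of 6 bytes.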
- Capacity breakdown card
This card is only applicable for external mode clusters. In this card, you can view a graphical breakdown of capacity per project, storage class, and pod. You can choose between Projects, Storage Classes, and Pods from the drop-down menu at the top of the card. These options filter the data shown in the graph. The graph displays the used capacity for only the top five entities, based on usage. The aggregate usage of the remaining entities is displayed as Other.
Option | Display |
---|---|
Projects | The aggregated capacity of each project that uses OpenShift Container Storage and how much is being used. |
Storage Classes | The aggregate capacity usage from the OpenShift Container Storage based storage classes. |
Pods | The capacity usage per pod from the attached PVC backed by OpenShift Container Storage provisioners. |
- Utilization card
The card shows Used Capacity, input/output operations per second, latency, throughput, and recovery information for the internal mode cluster.
For external mode, this card shows only the used and requested capacity details for that cluster.
- Storage Efficiency card
This card shows the system-wide compression ratio and the amount of space saved for persistent volume claims using the storage classes with compression.
- Activity card
This card displays what activities are happening or have recently happened in the OpenShift Container Storage cluster. The card is separated into two sections:
- Ongoing: Displays the progress of ongoing activities related to rebuilding of data resiliency and upgrading of the OpenShift Container Storage operator.
- Recent Events: Displays the list of events that happened in the openshift-storage namespace.
2.2. Metrics in the object service dashboard
To view the object service dashboard, click Home → Overview → Object Service in OpenShift Container Platform Web Console.
The following metrics are available in the Object Service dashboard:
- Details card
This card shows the following information:
- Service Name: The Multicloud Object Gateway (MCG) service name.
- System Name: The Multicloud Object Gateway and RADOS Object Gateway system names. The Multicloud Object Gateway system name is also a hyperlink to the MCG management user interface.
- Provider: The name of the provider on which the system runs (for example, AWS, vSphere, or ‘None’ for bare metal)
- Version: OpenShift Container Storage operator version.
- Storage Efficiency card
In this card, you can view how the MCG optimizes the consumption of storage backend resources through deduplication and compression. The card provides a calculated efficiency ratio (application data vs logical data) and an estimated savings figure (how many bytes the MCG did not send to the storage provider), based on the capacity of bare metal and cloud-based storage and the egress of cloud-based storage.
- Buckets card
Buckets are containers maintained by the MCG and RADOS Object Gateway to store data on behalf of the applications. These buckets are created and accessed through object bucket claims (OBCs). A specific policy can be applied to a bucket to customize data placement, data spill-over, data resiliency, capacity quotas, and so on.
In this card, information about object buckets (OBs) and object bucket claims (OBCs) is shown separately. OBs include all the buckets that are created using S3 or the user interface (UI), and OBCs include all the buckets created using YAMLs or the command line interface (CLI). The number displayed on the left of the bucket type is the total count of OBs or OBCs. The number displayed on the right shows the error count and is visible only when the error count is greater than zero. You can click the number to see the list of buckets that have a warning or error status.
- Resource Providers card
This card displays a list of all Multicloud Object Gateway and RADOS Object Gateway resources that are currently in use. These resources store data according to the bucket policies and can be cloud-based or bare metal resources.
- Status card
This card shows whether the system and its services are running without any issues. When the system is in a warning or error state, the alerts section is shown and the relevant alerts are displayed there. Click the alert links beside each alert for more information about the issue. For information about health checks, see Cluster health.
If multiple object storage services are available in the cluster, click the service type (such as Object Service or Data Resiliency) to see the state of the individual services.
Data resiliency in the status card indicates if there is any resiliency issue regarding the data stored through the Multicloud Object Gateway and RADOS Object Gateway.
- Capacity breakdown card
In this card, you can visualize how applications consume the object storage through the Multicloud Object Gateway and RADOS Object Gateway. You can use the Service Type drop-down to view the capacity breakdown for the Multicloud Object Gateway and RADOS Object Gateway separately. When viewing the Multicloud Object Gateway, you can use the Break By drop-down to filter the results in the graph by Total, Projects, or Bucket Classes.
- Performance card
In this card, you can view the performance of the Multicloud Object Gateway or RADOS Object Gateway. Use the Service Type drop-down to choose which you would like to view.
For Multicloud Object Gateway accounts, you can view the I/O operations and logical used capacity. For providers, you can view I/O operation, physical and logical usage, and egress.
The following table explains the metrics that you can view based on your selection from the drop-down menus at the top of the card:
Table 2.1. Indicators for Multicloud Object Gateway
Consumer types | Metrics | Chart display |
---|---|---|
Accounts | I/O operations | Displays read and write I/O operations for the top five consumers. The total reads and writes of all the consumers are displayed at the bottom. This information helps you monitor the throughput demand (IOPS) per application or account. |
Accounts | Logical Used Capacity | Displays total logical usage of each account for the top five consumers. This helps you monitor the throughput demand per application or account. |
Providers | I/O operations | Displays the count of I/O operations generated by the MCG when accessing the storage backend hosted by the provider. This helps you understand the traffic in the cloud so that you can improve resource allocation according to the I/O pattern, thereby optimizing the cost. |
Providers | Physical vs Logical usage | Displays the data consumption in the system by comparing the physical usage with the logical usage per provider. This helps you control the storage resources and devise a placement strategy in line with your usage characteristics and your performance requirements while potentially optimizing your costs. |
Providers | Egress | The amount of data the MCG retrieves from each provider (read bandwidth originated with the applications). This helps you understand the traffic in the cloud to improve resource allocation according to the egress pattern, thereby optimizing the cost. |
For the RADOS Object Gateway, you can use the Metric drop-down to view the Latency or Bandwidth.
- Latency: Provides a visual indication of the average GET/PUT latency imbalance across RADOS Object Gateway instances.
- Bandwidth: Provides a visual indication of the sum of GET/PUT bandwidth across RADOS Object Gateway instances.
- Activity card
This card displays what activities are happening or have recently happened in the OpenShift Container Storage cluster. The card is separated into two sections:
- Ongoing: Displays the progress of ongoing activities related to rebuilding of data resiliency and upgrading of the OpenShift Container Storage operator.
- Recent Events: Displays the list of events that happened in the openshift-storage namespace.
Chapter 3. Alerts
3.1. Setting up alerts
For internal mode clusters, various alerts related to the storage metrics services, storage cluster, disk devices, cluster health, cluster capacity, and so on are displayed in the persistent storage and the object service dashboards. These alerts are not available for external mode.
It might take a few minutes for alerts to be shown in the alert panel, because only firing alerts are visible in this panel.
You can also view alerts with additional details and customize the display of alerts in OpenShift Container Platform.
For more information, see Managing alerts.
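The delay mentioned above follows from how Prometheus-style alerting typically works: an alert whose condition has just become true is reported as pending, and it becomes firing only after the condition has held for the duration configured in the rule. The sketch below filters firing alerts from a payload shaped like the response of the Prometheus `/api/v1/alerts` endpoint. The sample payload and the alert names in it are invented for illustration and do not come from a real cluster.

```python
import json

# Hypothetical sample shaped like a Prometheus /api/v1/alerts response.
SAMPLE_RESPONSE = """
{
  "status": "success",
  "data": {
    "alerts": [
      {"labels": {"alertname": "ExampleCapacityAlert", "severity": "warning"},
       "state": "firing"},
      {"labels": {"alertname": "ExampleDiskAlert", "severity": "critical"},
       "state": "pending"}
    ]
  }
}
"""


def firing_alert_names(payload):
    """Return the names of alerts in the firing state.

    Pending alerts are skipped, which matches the panel showing only
    firing alerts.
    """
    alerts = payload.get("data", {}).get("alerts", [])
    return [
        alert.get("labels", {}).get("alertname", "<unnamed>")
        for alert in alerts
        if alert.get("state") == "firing"
    ]


if __name__ == "__main__":
    print(firing_alert_names(json.loads(SAMPLE_RESPONSE)))
```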
Chapter 4. Remote health monitoring
OpenShift Container Storage collects anonymized aggregated information about the health, usage, and size of clusters and reports it to Red Hat via an integrated component called Telemetry. This information allows Red Hat to improve OpenShift Container Storage and to react to issues that impact customers more quickly.
A cluster that reports data to Red Hat via Telemetry is considered a connected cluster.
4.1. About Telemetry
Telemetry sends a carefully chosen subset of the cluster monitoring metrics to Red Hat. These metrics are sent continuously and describe:
- The size of an OpenShift Container Storage cluster
- The health and status of OpenShift Container Storage components
- The health and status of any upgrade being performed
- Limited usage information about OpenShift Container Storage components and features
- Summary info about alerts reported by the cluster monitoring component
This continuous stream of data is used by Red Hat to monitor the health of clusters in real time and to react as necessary to problems that impact our customers. It also allows Red Hat to roll out OpenShift Container Storage upgrades to customers so as to minimize service impact and continuously improve the upgrade experience.
This debugging information is available to Red Hat Support and engineering teams with the same restrictions as accessing data reported via support cases. All connected cluster information is used by Red Hat to help make OpenShift Container Storage better and more intuitive to use. None of the information is shared with third parties.
4.2. Information collected by Telemetry
Primary information collected by Telemetry includes:
- The size of the Ceph cluster in bytes: "ceph_cluster_total_bytes"
- The amount of Ceph cluster storage used in bytes: "ceph_cluster_total_used_raw_bytes"
- Ceph cluster health status: "ceph_health_status"
- The total count of OSDs: "job:ceph_osd_metadata:count"
- The total number of Persistent Volumes present in the RHOCP cluster: "job:kube_pv:count"
- The total IOPS (reads+writes) value for all the pools in the Ceph cluster: "job:ceph_pools_iops:total"
- The total IOPS (reads+writes) value in bytes for all the pools in the Ceph cluster: "job:ceph_pools_iops_bytes:total"
- The total count of Ceph cluster versions running: "job:ceph_versions_running:count"
- The total number of unhealthy NooBaa buckets: "job:noobaa_total_unhealthy_buckets:sum"
- The total number of NooBaa buckets: "job:noobaa_bucket_count:sum"
- The total number of NooBaa objects: "job:noobaa_total_object_count:sum"
- The count of NooBaa accounts: "noobaa_accounts_num"
- The total usage of NooBaa storage in bytes: "noobaa_total_usage"
- The total amount of storage requested by PVCs from a particular storage provisioner in bytes: "cluster:kube_persistentvolumeclaim_resource_requests_storage_bytes:provisioner:sum"
- The total amount of storage used by PVCs from a particular storage provisioner in bytes: "cluster:kubelet_volume_stats_used_bytes:provisioner:sum"
Telemetry does not collect identifying information such as user names, passwords, or the names or addresses of user resources. In addition to the telemetry information stated above, NooBaa sends statistical information about accounts, buckets, objects, capacity, nodes, and connectivity health to phonehome.noobaa.com.
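The metric names listed in this section are ordinary Prometheus metric and recording-rule names, so the same values can in principle be inspected locally with an instant query against a Prometheus-compatible endpoint. The sketch below builds such a query URL and parses a response in the standard Prometheus HTTP API format. The base URL is a placeholder, the sample response is invented, and authentication is deliberately left out. How to reach the monitoring endpoint on a given cluster is outside the scope of this example.

```python
import json
from urllib.parse import urlencode


def instant_query_url(base_url, metric):
    """Build a Prometheus instant-query URL (/api/v1/query) for a metric."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': metric})}"


def first_sample_value(response_text):
    """Return the first sample value from an instant-query response, or None."""
    body = json.loads(response_text)
    if body.get("status") != "success":
        return None
    result = body.get("data", {}).get("result", [])
    if not result:
        return None
    # Each vector sample is [timestamp, "value-as-string"].
    return float(result[0]["value"][1])


if __name__ == "__main__":
    # Placeholder host; replace with the address of your own endpoint.
    print(instant_query_url("https://prometheus.example.com", "ceph_health_status"))

    # Invented sample response for illustration.
    sample = (
        '{"status": "success", "data": {"resultType": "vector", "result": '
        '[{"metric": {"__name__": "ceph_health_status"}, '
        '"value": [1700000000, "0"]}]}}'
    )
    print(first_sample_value(sample))
```

Names that contain colons, such as job:ceph_osd_metadata:count, are percent-encoded by `urlencode`, so they are safe to pass as the query parameter.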