Monitoring OpenShift Container Storage
Monitoring OpenShift Container Storage using storage dashboards
Abstract
Chapter 1. Cluster health
1.1. Verifying OpenShift Container Storage is healthy
Storage health is visible on the Persistent Storage and Object Service dashboards.
Procedure
- Log in to OpenShift Web Console.
- Check the Health card in the following locations:
  - Home → Dashboards → Persistent Storage
  - Home → Dashboards → Object Service
If Healthy appears on the Health card, the cluster is healthy.
If the state is not Healthy, see Section 1.2, “Storage health levels and cluster state” for more information about the current state and any alerts that appear.
1.2. Storage health levels and cluster state
Status information and alerts related to OpenShift Container Storage are displayed in the storage dashboards.
1.2.1. Persistent storage dashboard indicators
The Persistent Storage dashboard shows the state of OpenShift Container Storage as a whole, as well as the state of persistent volumes.
The states that are possible for each resource type are listed in the following table.
State | Icon | Description |
---|---|---|
UNKNOWN | - | OpenShift Container Storage is not deployed or is unavailable. |
Healthy | | Cluster health is good. |
Warning | | The Ceph cluster is in a warning state. An alert that describes the issue with the Ceph system is displayed. |
Error | | The Ceph cluster has encountered an error and some component is nonfunctional. An alert that contains the details is displayed. |
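The states in the table above can be related to the ceph_health_status metric that appears later in this guide. The following Python sketch is illustrative only: it assumes the metric uses the values 0 (HEALTH_OK), 1 (HEALTH_WARN), and 2 (HEALTH_ERR), and the function name dashboard_state and the mapping to dashboard states are hypothetical, not the console's actual logic.

```python
from typing import Optional

# Assumption: ceph_health_status reports 0 = HEALTH_OK, 1 = HEALTH_WARN,
# 2 = HEALTH_ERR. The mapping to dashboard states is illustrative only.
_STATE_BY_VALUE = {0: "Healthy", 1: "Warning", 2: "Error"}


def dashboard_state(ceph_health_status: Optional[int]) -> str:
    """Return the Health card state for one ceph_health_status sample."""
    if ceph_health_status is None:
        # No sample at all: storage is not deployed or is unavailable.
        return "UNKNOWN"
    # Any unexpected value is also treated as UNKNOWN.
    return _STATE_BY_VALUE.get(ceph_health_status, "UNKNOWN")
```

For example, dashboard_state(1) returns "Warning", and dashboard_state(None) returns "UNKNOWN".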
1.2.2. Object Service dashboard indicators
The Object Service dashboard shows the state of the Multicloud Object Gateway and any object claims in the cluster.
The states that are possible for each resource type are listed in the following table.
State | Description |
---|---|
Healthy | Object Storage is healthy. |
Multicloud Object Gateway is not running | Shown when NooBaa system is not found. |
All resources are unhealthy | Shown when all NooBaa pools are unhealthy. |
Many buckets have issues | Shown when >= 50% of buckets encounter error(s). |
Some buckets have issues | Shown when >= 30% of buckets encounter error(s). |
Unavailable | Shown when network issues and/or errors exist. |
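The two bucket thresholds in the table above can be sketched in Python as follows. This is a minimal illustration, not the dashboard's implementation: the function name bucket_health_state is hypothetical, the 50% threshold is assumed to take precedence over the 30% threshold, and the states that depend on the NooBaa system, its pools, or the network are not modeled.

```python
def bucket_health_state(total_buckets: int, buckets_with_errors: int) -> str:
    """Classify bucket health using the 50% and 30% thresholds."""
    if total_buckets <= 0:
        # Assumption: with no buckets there is nothing to report.
        return "Healthy"
    # Integer comparisons avoid floating-point rounding at the boundaries.
    if 2 * buckets_with_errors >= total_buckets:  # errors >= 50%
        return "Many buckets have issues"
    if 10 * buckets_with_errors >= 3 * total_buckets:  # errors >= 30%
        return "Some buckets have issues"
    return "Healthy"
```

With 10 buckets, 5 failing buckets yield "Many buckets have issues", 3 yield "Some buckets have issues", and 2 yield "Healthy".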
1.2.3. Alert panel
The Alert panel appears below the Health card in both the Persistent Storage dashboard and the Object Service dashboard when the cluster state is not healthy.
Information about specific alerts and how to respond to them is available in Troubleshooting OpenShift Container Storage.
Chapter 2. Metrics
2.1. Viewing metrics in persistent storage dashboard
To view the persistent storage dashboard, click Home → Dashboards → Persistent Storage in OpenShift Web Console.
Figure 2.1. Persistent Storage dashboard
The following metrics are available in the persistent storage dashboard:
- Details card
- The Details card shows the OpenShift Container Storage cluster name, the provider name, and the OpenShift Container Storage operator version.
- Inventory card
- The Inventory card shows the number of active nodes, PVCs, and PVs in the cluster. The total number of nodes, PVCs, and PVs is displayed on the left, and the number of nodes, PVCs, and PVs that are in a good, error, or processing state is displayed on the right.
- Health card
- This card shows whether the cluster is up and running without any errors or is experiencing some issues. When the cluster is in a warning or error state, the Alerts section is shown and the relevant alerts are displayed there.
- Capacity card
- In this card, you can choose different options in the drop down menu to view the following for different provider types:
Option | Display |
---|---|
Total capacity (default) | The total storage capacity that OpenShift Container Storage is using to store user data and the overhead caused by ensuring redundancy and reliability of the data. |
Requested vs Used | Requested is the sum total of storage requested by all the pods and Used is the sum total of storage actually used by all the pods. The card shows the available storage against the requested storage and the percentage of storage used. |
- Data Resiliency card
- In this card, you can view if there is any resiliency issue in the cluster. If the cluster is recovering or rebalancing the application data present in the cluster, you will see a progress bar that indicates the progression.
- Top Consumers card
- In this card you can display the top consumers ordered by used capacity, requested capacity, storage class, or pod. This information helps you understand how the cluster resources are consumed and develop an effective capacity planning strategy.
- Events card
- This card shows the most recent events of the cluster.
- Utilization card
- This card shows input/output operations per second, latency, throughput, and recovery information for the cluster.
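The Requested vs Used option of the Capacity card can be illustrated with a short Python calculation. This is a sketch under stated assumptions: the guide does not give the exact formulas, so available storage is assumed here to be requested minus used, and the percentage is assumed to be used storage relative to requested storage. The function name requested_vs_used and the sample figures are hypothetical.

```python
from typing import Dict


def requested_vs_used(requested_bytes: int, used_bytes: int) -> Dict[str, float]:
    """Summarize requested and used storage (assumed formulas, see above)."""
    if requested_bytes <= 0:
        # Nothing requested: avoid dividing by zero.
        return {"available_bytes": 0, "used_percent": 0.0}
    # Assumption: available = requested - used, never below zero.
    available_bytes = max(requested_bytes - used_bytes, 0)
    # Assumption: the percentage is used storage relative to requested storage.
    used_percent = 100 * used_bytes / requested_bytes
    return {"available_bytes": available_bytes, "used_percent": used_percent}


GIB = 2**30
# Made-up figures: pods requested 100 GiB in total and actually use 25 GiB.
SUMMARY = requested_vs_used(requested_bytes=100 * GIB, used_bytes=25 * GIB)
```

Under these assumptions, SUMMARY reports 75 GiB available and 25.0 percent used.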
2.2. Viewing metrics in object service dashboard
To view the object service dashboard, click Home → Dashboards → Object Service in OpenShift Web Console.
Figure 2.2. Object Service dashboard
The following metrics are available in the object service dashboard:
- Details card
This card shows the following information:
- The Multicloud Object Gateway (MCG) service name.
- The system name, which is also a hyperlink to the MCG management user interface.
- The name of the provider on which the system runs.
- OpenShift Container Storage operator version.
- Buckets card
Buckets are containers maintained by the MCG to store data on behalf of the applications. These buckets are created and accessed through object bucket claims (OBCs). A specific policy can be applied to a bucket to customize data placement, data spill-over, data resiliency, capacity quotas, and so on.
In this card, information about object buckets (OBs) and object bucket claims (OBCs) is shown separately. OBs include all the buckets that are created using S3 or the user interface (UI), and OBCs include all the buckets created using YAML files or the command line interface (CLI). The number displayed on the left of the bucket type is the total count of OBs or OBCs. The number displayed on the right shows the error count and is visible only when the error count is greater than zero. You can click the number to see the list of buckets that have a warning or error status.
- Resource Providers card (Type, count, and health)
- This card lists the different storage backend providers used. A storage backend can be cloud-based, such as Amazon S3, or bare-metal-based. For each type of storage backend, the card indicates the number of backends configured and their respective status. A status is indicated only for unhealthy backends. If the resource is healthy, only the total numbers are shown.
- Health card
This card shows if the system is up and running without any issues. When the system is in a warning or error state, the alerts section is shown and the relevant alerts are displayed there. For information about health checks, see Cluster health.
You can click on the links on the right of the alerts to get more information about the issue.
- Data Consumption card
In this card, you can view physical usage (raw storage), logical usage (usable storage), I/O, and egress traffic per provider and MCG account.
For MCG accounts, you can view the I/O operations and logical used capacity. For providers, you can view I/O operations, physical and logical usage, and egress.
The following table provides the different key performance indicators (KPIs) that you can view based on your selection from the drop down menus on the top of the card:
Consumer types | KPIs | Chart Display |
---|---|---|
Accounts | I/O operations | Displays read and write I/O operations for the top five consumers. The total reads and writes of all the consumers are displayed at the bottom. This information helps you monitor the throughput demand (IOPS) per application or account. |
Accounts | Logical Used Capacity | Displays total logical usage of each account for the top five consumers. This helps you monitor the throughput demand per application or account. |
Providers | I/O operations | Displays the count of I/O operations generated by the MCG when accessing the storage backend hosted by the provider. This helps you understand the traffic in the cloud so that you can improve resource allocation according to the I/O pattern, thereby optimizing the cost. |
Providers | Physical vs Logical usage | Displays the data consumption in the system by comparing the physical usage with the logical usage per provider. This helps you control the storage resources and devise a placement strategy in line with your usage characteristics and your performance requirements while potentially optimizing your costs. |
Providers | Egress | The amount of data the MCG retrieves from each provider (read bandwidth originated with the applications). This helps you understand the traffic in the cloud to improve resource allocation according to the egress pattern, thereby optimizing the cost. |
- Data Resiliency card
Data resiliency is the ability of stored objects to recover and continue operating in the case of a failure. In this card, you can view whether there is any resiliency issue with the data stored through the MCG. After a failure, the MCG automatically initiates recovery to bring data resiliency back in line with the requested configuration. While a rebuild is in progress, a progress bar shows how far the recovery has advanced, together with an estimate of the time remaining before resiliency is back to normal.
Note: Certain changes in the system, such as an unavailable resource or a change of bucket policy, cause an object to require a rebuilding process in order to stay resilient.
- Capacity Breakdown card
- In this card you can visualize how applications consume the object storage through the MCG. The drop-down menu at the top of the card offers the Projects and Bucket Class options, which filter the data shown in the graph so that you can view the breakdown per project or per bucket class.
- Object Data Reduction card
In this card you can view how the MCG optimizes the consumption of the storage backend resources through deduplication and compression and provides you with a calculated efficiency ratio (application data vs logical data) and an estimated savings figure (how many bytes the MCG did not send to the storage provider).
Note: Savings are twofold: capacity savings (applies to bare metal and cloud-based storage providers) and egress traffic savings (applies to cloud-based storage providers).
Chapter 3. Alerts
3.1. Setting up alerts
Various alerts related to the storage metrics services, storage cluster, disk devices, cluster health, cluster capacity, and so on are displayed in the persistent storage and the object service dashboards.
It might take a few minutes for alerts to be shown in the alert panel, because only firing alerts are visible in this panel.
You can also view alerts with additional details and customize the display of Alerts in the OpenShift Container Platform. For more information, see Managing cluster alerts.
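Because only firing alerts are visible in the alert panel, an alert that is still pending does not appear there yet. The following Python sketch shows that kind of filtering on data shaped like the response of the Prometheus alerts endpoint (GET /api/v1/alerts). It is illustrative only: the function name firing_alerts and the alert names in the sample are hypothetical, and the sketch does not describe how the console itself retrieves alerts.

```python
from typing import Any, Dict, List


def firing_alerts(alerts_response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return only the alerts whose state is "firing"."""
    alerts = alerts_response.get("data", {}).get("alerts", [])
    return [alert for alert in alerts if alert.get("state") == "firing"]


# Made-up sample shaped like a Prometheus GET /api/v1/alerts response.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "alerts": [
            {"labels": {"alertname": "ExampleCapacityAlert"}, "state": "pending"},
            {"labels": {"alertname": "ExampleHealthAlert"}, "state": "firing"},
        ]
    },
}

# Only the firing alert remains; the pending one is filtered out.
FIRING_NAMES = [a["labels"]["alertname"] for a in firing_alerts(SAMPLE_RESPONSE)]
```

In this sample, only ExampleHealthAlert would be shown; ExampleCapacityAlert would appear once it moves from pending to firing.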
Chapter 4. Remote health monitoring
OpenShift Container Storage collects anonymized aggregated information about the health, usage, and size of clusters and reports it to Red Hat via an integrated component called Telemetry. This information allows Red Hat to improve OpenShift Container Storage and to react to issues that impact customers more quickly.
A cluster that reports data to Red Hat via Telemetry is considered a connected cluster.
4.1. About Telemetry
Telemetry sends a carefully chosen subset of the cluster monitoring metrics to Red Hat. These metrics are sent continuously and describe:
- The size of an OpenShift Container Storage cluster
- The health and status of OpenShift Container Storage components
- The health and status of any upgrade being performed
- Limited usage information about OpenShift Container Storage components and features
- Summary info about alerts reported by the cluster monitoring component
This continuous stream of data is used by Red Hat to monitor the health of clusters in real time and to react as necessary to problems that impact our customers. It also allows Red Hat to roll out OpenShift Container Storage upgrades to customers so as to minimize service impact and continuously improve the upgrade experience.
This debugging information is available to Red Hat Support and engineering teams with the same restrictions as accessing data reported via support cases. All connected cluster information is used by Red Hat to help make OpenShift Container Storage better and more intuitive to use. None of the information is shared with third parties.
4.2. Information collected by Telemetry
Primary information collected by Telemetry includes:
- The size of the Ceph cluster in bytes: {__name__="ceph_cluster_total_bytes"}
- The amount of Ceph cluster storage used in bytes: {__name__="ceph_cluster_total_used_raw_bytes"}
- Ceph cluster health status: {__name__="ceph_health_status"}
- The total count of OSDs: {__name__="job:ceph_osd_metadata:count"}
- The total number of Persistent Volumes present in the OCP cluster: {__name__="job:kube_pv:count"}
- The total IOPS (reads + writes) value for all the pools in the Ceph cluster: {__name__="job:ceph_pools_iops:total"}
- The total IOPS (reads + writes) value in bytes for all the pools in the Ceph cluster: {__name__="job:ceph_pools_iops_bytes:total"}
- The total count of Ceph cluster versions running: {__name__="job:ceph_versions_running:count"}
- The total number of unhealthy NooBaa buckets: {__name__="job:noobaa_total_unhealthy_buckets:sum"}
- The total number of NooBaa buckets: {__name__="job:noobaa_bucket_count:sum"}
- The total number of NooBaa objects: {__name__="job:noobaa_total_object_count:sum"}
- The count of NooBaa accounts: {__name__="noobaa_accounts_num"}
- The total usage of NooBaa storage in bytes: {__name__="noobaa_total_usage"}
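To inspect one of these metrics yourself, you can build a selector on the Prometheus __name__ label and pass it to the Prometheus HTTP API instant query endpoint (/api/v1/query). The following Python sketch only constructs the URL and does not contact any server. The base URL prometheus.example.com and the function name instant_query_url are placeholders, and reaching the monitoring stack of a real cluster also requires a route and credentials that are not shown here.

```python
from urllib.parse import urlencode

# A few of the metric names listed above.
TELEMETRY_METRICS = [
    "ceph_cluster_total_bytes",
    "ceph_health_status",
    "job:kube_pv:count",
    "noobaa_total_usage",
]


def instant_query_url(base_url: str, metric_name: str) -> str:
    """Build a Prometheus instant-query URL that selects one metric by name."""
    selector = '{__name__="%s"}' % metric_name
    # urlencode percent-encodes the braces, quotes, and equals sign.
    return base_url.rstrip("/") + "/api/v1/query?" + urlencode({"query": selector})


# Placeholder host: substitute the address of your own Prometheus endpoint.
URLS = [instant_query_url("https://prometheus.example.com", m) for m in TELEMETRY_METRICS]
```

Each resulting URL can then be requested with an HTTP client of your choice, using whatever authentication your environment requires.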
Telemetry does not collect identifying information such as user names, passwords, or the names or addresses of user resources. In addition to the telemetry information stated above, NooBaa sends statistical information about accounts, buckets, objects, capacity, nodes, and connectivity health to phonehome.noobaa.com.