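Chapter 12. Managing observability

Red Hat OpenShift AI provides centralized platform observability: an integrated, out-of-the-box solution for monitoring the health and performance of your OpenShift AI instance and user workloads.

This centralized solution includes a dedicated, pre-configured observability stack that uses the OpenTelemetry Collector (OTC) for standardized data ingestion, Prometheus for metrics, and the Red Hat build of Tempo for distributed tracing. This architecture enables a common set of health metrics and alerts for OpenShift AI components and provides mechanisms for integrating with your existing external observability tools.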

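Important

This feature is currently available in Red Hat OpenShift AI 3.0 as a Technology Preview feature. Technology Preview features are not covered by Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. Technology Preview features provide early access to upcoming product features so that you can test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.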

12.1. Enabling the observability stack

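The observability stack collects and correlates metrics, traces, and alerts for OpenShift AI so that you can monitor, troubleshoot, and optimize OpenShift AI components. A cluster administrator must explicitly enable this capability in the DSCInitialization (DSCI) custom resource.

After you enable the stack, you can perform the following actions:

Accelerate troubleshooting by viewing metrics, traces, and alerts for OpenShift AI components in one place.
Maintain platform stability by monitoring health and resource usage and by receiving alerts for critical issues.
Integrate with existing tools by exporting telemetry to third-party observability solutions through the Red Hat build of OpenTelemetry.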

Prerequisites

You have cluster administrator privileges for your OpenShift cluster.
You have installed Red Hat OpenShift AI.
You have installed the following Operators, which provide the components of the observability stack:
- Cluster Observability Operator: Deploys and manages Prometheus and Alertmanager for metrics and alerts.
- Tempo Operator: Provides the Tempo backend for distributed tracing.
- Red Hat build of OpenTelemetry: Deploys the OpenTelemetry Collector for collecting and exporting telemetry data.
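
The Operator prerequisites above can be checked mechanically against a list of installed ClusterServiceVersion names (for example, names copied from the Installed Operators page). The following is a minimal sketch, not an official check: the name prefixes in `REQUIRED_CSV_PREFIXES` and the version strings in the sample list are assumptions for illustration, so adjust them to match what your cluster actually reports.

```python
# Assumed ClusterServiceVersion name prefixes for the three required Operators.
# These prefixes are illustrative assumptions; verify them on your cluster.
REQUIRED_CSV_PREFIXES = {
    "Cluster Observability Operator": "cluster-observability-operator",
    "Tempo Operator": "tempo-operator",
    "Red Hat build of OpenTelemetry": "opentelemetry-operator",
}


def missing_operators(csv_names):
    """Return the display names of required Operators with no matching CSV name."""
    return [
        label
        for label, prefix in REQUIRED_CSV_PREFIXES.items()
        if not any(name.startswith(prefix) for name in csv_names)
    ]


# Hypothetical list of installed CSV names (the versions are made up).
installed = ["cluster-observability-operator.v1.0.0", "rhods-operator.3.0.0"]
print(missing_operators(installed))
```

An empty list means that every required Operator has at least one matching entry.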

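Procedure

Log in to the OpenShift web console as a cluster administrator.
In the OpenShift console, click Operators → Installed Operators.
Search for the Red Hat OpenShift AI Operator, and then click the Operator name to open the Operator details page.
Click the DSCInitialization tab.
Click the default instance name (for example, default-dsci) to open the instance details page.
Click the YAML tab to show the instance specifications.

In the spec.monitoring section, set the value of the managementState field to Managed, and configure the metrics, alerting, and tracing settings as shown in the following example:

Example monitoring configuration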

# ...
spec:
  monitoring:
    managementState: Managed                  # Required: Enables and manages the observability stack
    namespace: redhat-ods-monitoring          # Required: Namespace where monitoring components are deployed
    alerting: {}                              # Alertmanager configuration, uses default settings if empty
    metrics:                                  # Prometheus configuration for metrics collection
      replicas: 1                             # Optional: Number of Prometheus instances
      resources:                              # CPU and memory requests and limits for Prometheus pods
        cpulimit: 500m                        # Optional: Maximum CPU allocation in millicores
        cpurequest: 100m                      # Optional: Minimum CPU allocation in millicores
        memorylimit: 512Mi                    # Optional: Maximum memory allocation in mebibytes
        memoryrequest: 256Mi                  # Optional: Minimum memory allocation in mebibytes
      storage:                                # Storage configuration for metrics data
        size: 5Gi                             # Required: Storage size for Prometheus data
        retention: 90d                        # Required: Retention period for metrics data in days
      exporters: {}                           # External metrics exporters
    traces:                                   # Tempo backend for distributed tracing
      sampleRatio: '0.1'                      # Optional: Portion of traces to sample, expressed as a decimal
      storage:                                # Storage configuration for trace data
        backend: pv                           # Required: Storage backend for Tempo traces (pv, s3, or gcs)
        retention: 2160h                      # Optional: Retention period for trace data in hours
      exporters: {}                           # External traces exporters
# ...
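
If you prefer to generate this configuration rather than edit it by hand, the same structure can be built as a JSON merge patch. The following is a minimal sketch under stated assumptions: `build_monitoring_patch` and `retention_to_hours` are hypothetical helper names, the defaults mirror the example values above, and the optional `resources` and `exporters` fields are omitted for brevity. The sketch also shows that the two example retention values describe the same period (90 days is 2160 hours).

```python
import json


def retention_to_hours(value):
    """Convert a retention string such as '90d' or '2160h' to whole hours."""
    number, unit = int(value[:-1]), value[-1]
    if unit == "h":
        return number
    if unit == "d":
        return number * 24
    raise ValueError(f"unsupported retention unit: {unit!r}")


def build_monitoring_patch(namespace="redhat-ods-monitoring",
                           metrics_size="5Gi", metrics_retention="90d",
                           trace_backend="pv", trace_retention="2160h",
                           sample_ratio="0.1"):
    """Return a merge-patch body that mirrors the example spec.monitoring block."""
    # The example lists pv, s3, and gcs as the available trace storage backends.
    if trace_backend not in ("pv", "s3", "gcs"):
        raise ValueError(f"unknown trace storage backend: {trace_backend!r}")
    return {
        "spec": {
            "monitoring": {
                "managementState": "Managed",
                "namespace": namespace,
                "alerting": {},
                "metrics": {
                    "replicas": 1,
                    "storage": {"size": metrics_size, "retention": metrics_retention},
                },
                "traces": {
                    "sampleRatio": sample_ratio,  # kept as a string, as in the example
                    "storage": {"backend": trace_backend, "retention": trace_retention},
                },
            }
        }
    }


patch = build_monitoring_patch()
print(json.dumps(patch, indent=2))
# Both example retention values cover the same period.
print(retention_to_hours("90d") == retention_to_hours("2160h"))
```

One possible use, assuming the instance is named default-dsci as in the procedure, is to pass the printed JSON to `oc patch dscinitialization default-dsci --type merge -p '<json>'` instead of editing the YAML in the web console.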

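Click Save to apply your changes.

Verification

Verify that the observability stack components are running in the configured namespace:

In the OpenShift web console, click Workloads → Pods.
From the project list, select redhat-ods-monitoring.

Confirm that there are running pods for your configuration. The following pods indicate that the observability stack is active: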

alertmanager-data-science-monitoringstack-#      2/2   Running   0   1m
data-science-collector-collector-#               1/1   Running   0   1m
prometheus-data-science-monitoringstack-#        2/2   Running   0   1m
tempo-data-science-tempomonolithic-#             1/1   Running   0   1m
thanos-querier-data-science-thanos-querier-#     2/2   Running   0   1m
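
The same check can be scripted against a pod listing in the format shown above (for example, text captured from `oc get pods -n redhat-ods-monitoring`). The following is a minimal sketch: `missing_components` is a hypothetical helper, the expected name prefixes are taken from the listing above, and the pod suffixes and statuses in the sample text are made up for illustration.

```python
# Pod name prefixes taken from the expected listing above.
EXPECTED_PREFIXES = (
    "alertmanager-data-science-monitoringstack-",
    "data-science-collector-collector-",
    "prometheus-data-science-monitoringstack-",
    "tempo-data-science-tempomonolithic-",
    "thanos-querier-data-science-thanos-querier-",
)


def missing_components(listing):
    """Return expected prefixes that have no fully ready, Running pod in the listing."""
    healthy = set()
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 3 or "/" not in fields[1]:
            continue  # skip blank lines and a NAME/READY/STATUS header row
        name, ready, status = fields[0], fields[1], fields[2]
        ready_count, _, total = ready.partition("/")
        if status != "Running" or ready_count != total:
            continue  # the pod exists but is not fully ready
        for prefix in EXPECTED_PREFIXES:
            if name.startswith(prefix):
                healthy.add(prefix)
    return [prefix for prefix in EXPECTED_PREFIXES if prefix not in healthy]


# Hypothetical listing: Prometheus is only partly ready and Thanos Querier is absent.
sample = """\
alertmanager-data-science-monitoringstack-0      2/2   Running   0   1m
data-science-collector-collector-0               1/1   Running   0   1m
prometheus-data-science-monitoringstack-0        1/2   Running   0   1m
tempo-data-science-tempomonolithic-0             1/1   Running   0   1m
"""
print(missing_components(sample))
```

An empty result means that every expected component has at least one pod that is Running with all of its containers ready.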

Next steps

Collecting metrics from user workloads
