26.4. Customizing Detected Conditions


You can configure the Node Problem Detector to watch for any log string by editing the Node Problem Detector configuration map.

Sample Node Problem Detector configuration map

apiVersion: v1
kind: ConfigMap
metadata:
  name: node-problem-detector
data:
  docker-monitor.json: |  1
    {
        "plugin": "journald", 2
        "pluginConfig": {
                "source": "docker"
        },
        "logPath": "/host/log/journal", 3
        "lookback": "5m",
        "bufferSize": 10,
        "source": "docker-monitor",
        "conditions": [],
        "rules": [              4
                {
                        "type": "temporary", 5
                        "reason": "CorruptDockerImage", 6
                        "pattern": "Error trying v2 registry: failed to register layer: rename /var/lib/docker/image/(.+) /var/lib/docker/image/(.+): directory not empty.*" 7
                }
        ]
    }
  kernel-monitor.json: |  8
    {
        "plugin": "journald", 9
        "pluginConfig": {
                "source": "kernel"
        },
        "logPath": "/host/log/journal", 10
        "lookback": "5m",
        "bufferSize": 10,
        "source": "kernel-monitor",
        "conditions": [                 11
                {
                        "type": "KernelDeadlock", 12
                        "reason": "KernelHasNoDeadlock", 13
                        "message": "kernel has no deadlock"  14
                }
        ],
        "rules": [
                {
                        "type": "temporary",
                        "reason": "OOMKilling",
                        "pattern": "Kill process \\d+ (.+) score \\d+ or sacrifice child\\nKilled process \\d+ (.+) total-vm:\\d+kB, anon-rss:\\d+kB, file-rss:\\d+kB"
                },
                {
                        "type": "temporary",
                        "reason": "TaskHung",
                        "pattern": "task \\S+:\\w+ blocked for more than \\w+ seconds\\."
                },
                {
                        "type": "temporary",
                        "reason": "UnregisterNetDevice",
                        "pattern": "unregister_netdevice: waiting for \\w+ to become free. Usage count = \\d+"
                },
                {
                        "type": "temporary",
                        "reason": "KernelOops",
                        "pattern": "BUG: unable to handle kernel NULL pointer dereference at .*"
                },
                {
                        "type": "temporary",
                        "reason": "KernelOops",
                        "pattern": "divide error: 0000 \\[#\\d+\\] SMP"
                },
                {
                        "type": "permanent",
                        "condition": "KernelDeadlock",
                        "reason": "AUFSUmountHung",
                        "pattern": "task umount\\.aufs:\\w+ blocked for more than \\w+ seconds\\."
                },
                {
                        "type": "permanent",
                        "condition": "KernelDeadlock",
                        "reason": "DockerHung",
                        "pattern": "task docker:\\w+ blocked for more than \\w+ seconds\\."
                }
        ]
    }

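1
Rules and conditions that apply to container images.
2, 9
Services to monitor, as a comma-separated list.
3, 10
Path to the log of the monitored service.
4, 11
List of events to monitor.
5, 12
Label that indicates whether the error is an event (temporary) or a NodeCondition (permanent).
6, 13
Text message that describes the error.
7, 14
Error message that the Node Problem Detector watches for.
8
Rules and conditions that apply to the kernel.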
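
The structure described by the callouts above can be checked before the configuration map is applied. The following is a minimal sketch in Python and is not part of the Node Problem Detector; the function name `check_monitor` and the specific checks are assumptions chosen for illustration. It verifies that each rule has a `type` of `temporary` or `permanent`, that every permanent rule names a condition declared under `conditions`, and that each pattern compiles. Python's `re` module is used as a stand-in, so the detector's own regular expression engine may differ in edge cases.

```python
import json
import re

def check_monitor(text):
    """Return a list of problems found in one monitor JSON document."""
    config = json.loads(text)
    problems = []
    # Condition types declared in the "conditions" list, e.g. "KernelDeadlock".
    declared = {c.get("type") for c in config.get("conditions", [])}
    for i, rule in enumerate(config.get("rules", [])):
        kind = rule.get("type")
        if kind not in ("temporary", "permanent"):
            problems.append(f"rule {i}: unknown type {kind!r}")
        # Assumption: as in the sample, a permanent rule should point at a declared condition.
        if kind == "permanent" and rule.get("condition") not in declared:
            problems.append(f"rule {i}: condition {rule.get('condition')!r} is not declared")
        try:
            re.compile(rule.get("pattern", ""))
        except re.error as exc:
            problems.append(f"rule {i}: pattern does not compile: {exc}")
    return problems

# Shortened copy of kernel-monitor.json from the sample above.
SAMPLE = r"""
{
    "plugin": "journald",
    "source": "kernel-monitor",
    "conditions": [
        {"type": "KernelDeadlock", "reason": "KernelHasNoDeadlock", "message": "kernel has no deadlock"}
    ],
    "rules": [
        {"type": "temporary", "reason": "TaskHung",
         "pattern": "task \\S+:\\w+ blocked for more than \\w+ seconds\\."},
        {"type": "permanent", "condition": "KernelDeadlock", "reason": "DockerHung",
         "pattern": "task docker:\\w+ blocked for more than \\w+ seconds\\."}
    ]
}
"""

if __name__ == "__main__":
    for problem in check_monitor(SAMPLE):
        print(problem)
```

An empty result means that none of these three checks failed; it does not guarantee that the Node Problem Detector accepts the file.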

To configure the Node Problem Detector, add or remove problem conditions and events.

  1. Edit the Node Problem Detector configuration map with a text editor.

    $ oc edit configmap -n openshift-node-problem-detector node-problem-detector
  2. Remove, add, or edit any node conditions or events as needed.

    {
           "type": <`temporary` or `permanent`>,
           "reason": <free-form text describing the error>,
           "pattern": <log message to watch for>
    },

    For example:

    {
           "type": "temporary",
           "reason": "UnregisterNetDevice",
           "pattern": "unregister_netdevice: waiting for \\w+ to become free. Usage count = \\d+"
    },
  3. Restart the running pods to apply the changes. To restart the pods, you can delete all existing pods:

    # oc delete pods -n openshift-node-problem-detector -l name=node-problem-detector
  4. To display Node Problem Detector output on standard output (stdout) and standard error (stderr), add the following to the Node Problem Detector daemon set:

    spec:
      template:
        spec:
          containers:
          - name: node-problem-detector
            command:
            - node-problem-detector
            - --alsologtostderr=true 1
            - --log_dir="/tmp" 2
            - --system-log-monitors=/etc/npd/kernel-monitor.json,/etc/npd/docker-monitor.json 3
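    1
    Sends the output to standard error (stderr) in addition to the log files.
    2
    Path to the error log directory.
    3
    Comma-separated paths to the plug-in configuration files.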
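
Before adding a rule as in step 2, it can help to confirm that a pattern matches the log line it is meant to catch, and to see how the pattern must be escaped inside JSON. The sketch below is an illustration only: the helper name `rule_entry` and the sample log line are made up, and Python's `re` module stands in for the detector's own regular expression engine, which may behave differently in edge cases.

```python
import json
import re

def rule_entry(rule_type, reason, pattern):
    """Build one rule as JSON text; json.dumps doubles each backslash."""
    re.compile(pattern)  # raises re.error if the pattern is malformed
    return json.dumps({"type": rule_type, "reason": reason, "pattern": pattern}, indent=4)

# The UnregisterNetDevice pattern from the example, written as a raw string
# so that it holds single backslashes, as the regex engine sees them.
PATTERN = r"unregister_netdevice: waiting for \w+ to become free. Usage count = \d+"

# Hypothetical log line, invented for this illustration.
LOG_LINE = "unregister_netdevice: waiting for lo to become free. Usage count = 1"

if __name__ == "__main__":
    print("matches:", re.search(PATTERN, LOG_LINE) is not None)
    print(rule_entry("temporary", "UnregisterNetDevice", PATTERN))
```

The printed entry shows the pattern with doubled backslashes (`\\w+`, `\\d+`), which is the form used in the configuration map.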
© 2024 Red Hat, Inc.