Chapter 25. Node Problem Detector

25.1. Overview

The Node Problem Detector monitors the health of your nodes by finding certain problems and reporting these problems to the API server. The detector runs as a daemonset on each node.

Important

The Node Problem Detector is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs), might not be functionally complete, and Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information on Red Hat Technology Preview features support scope, see https://access.redhat.com/support/offerings/techpreview/.

The Node Problem Detector reads system logs, watches for specific entries, and makes these problems visible to the control plane. You can view the problems using OpenShift Container Platform commands, such as oc get node and oc get event. You can then take action to correct these problems as appropriate or capture the messages using a tool of your choice, such as the OpenShift Container Platform log monitoring. Detected problems can be in one of the following categories:

NodeCondition: A permanent problem that makes the node unavailable for pods. The node condition will not be cleared until the host is rebooted.
Event: A temporary problem that has limited impact on a node, but is informative.

The Node Problem Detector can detect:

container runtime issues:
- unresponsive runtime daemons
hardware issues:
- bad CPU
- bad memory
- bad disk
kernel issues:
- kernel deadlock conditions
- corrupted file systems
- unresponsive runtime daemons
infrastructure daemon issues:
- NTP service outages

25.2. Example Node Problem Detector Output

The following examples show output from the Node Problem Detector watching for the KernelDeadlock node condition on a specific node. The command uses oc get node to retrieve the node definition and filters the output for the KernelDeadlock entry.

oc get node <node> -o yaml | grep -B5 KernelDeadlock

Sample Node Problem Detector output with no issues

message: kernel has no deadlock
reason: KernelHasNoDeadlock
status: false
type: KernelDeadlock

Sample output for KernelDeadlock condition

message: task docker:1234 blocked for more than 120 seconds
reason: DockerHung
status: true
type: KernelDeadlock

This example shows output from the Node Problem Detector watching for events on a node. The following command uses oc get event against the default project watching for events listed in the kernel-monitor.json section of the Node Problem Detector configuration map.

oc get event -n default --field-selector=source=kernel-monitor --watch

Sample output showing events on nodes

LAST SEEN                       FIRST SEEN                    COUNT NAME     KIND  SUBOBJECT TYPE    REASON      SOURCE                   MESSAGE
2018-06-27 09:08:27 -0400 EDT   2018-06-27 09:08:27 -0400 EDT 1     my-node1 node            Warning TaskHung    kernel-monitor.my-node1  docker:1234 blocked for more than 300 seconds
2018-06-27 09:08:27 -0400 EDT   2018-06-27 09:08:27 -0400 EDT 3     my-node2 node            Warning KernelOops  kernel-monitor.my-node2  BUG: unable to handle kernel NULL pointer dereference at nowhere
2018-06-27 09:08:27 -0400 EDT   2018-06-27 09:08:27 -0400 EDT 1     my-node1 node            Warning KernelOops  kernel-monitor.my-node2  divide error: 0000 [#0] SMP

Note

The Node Problem Detector consumes resources. If you use the Node Problem Detector, make sure you have enough nodes to balance cluster performance.
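
The KernelDeadlock check at the start of this section filters YAML output with grep. As a rough alternative, the following Python sketch reads the JSON form of a node, such as the output of oc get node <node> -o json, and looks up one condition by type. The helper names find_condition and is_condition_set and the embedded sample document are illustrative assumptions and are not part of the Node Problem Detector.

```python
import json

# Trimmed, hypothetical stand-in for `oc get node <node> -o json` output.
SAMPLE_NODE_JSON = """
{
  "status": {
    "conditions": [
      {"type": "Ready", "status": "True", "reason": "KubeletReady",
       "message": "kubelet is posting ready status"},
      {"type": "KernelDeadlock", "status": "True", "reason": "DockerHung",
       "message": "task docker:1234 blocked for more than 120 seconds"}
    ]
  }
}
"""


def find_condition(node, condition_type):
    """Return the condition dict whose type matches, or None if absent."""
    for condition in node.get("status", {}).get("conditions", []):
        if condition.get("type") == condition_type:
            return condition
    return None


def is_condition_set(node, condition_type):
    """True when the named condition exists and its status reads as true."""
    condition = find_condition(node, condition_type)
    # The status is compared case-insensitively so "True" and "true" both count.
    return condition is not None and str(condition.get("status")).lower() == "true"


if __name__ == "__main__":
    node = json.loads(SAMPLE_NODE_JSON)
    deadlock = find_condition(node, "KernelDeadlock")
    print(deadlock["reason"], "-", deadlock["message"])
    print("KernelDeadlock set:", is_condition_set(node, "KernelDeadlock"))
```

On a cluster, the same functions could be given the saved output of oc get node <node> -o json instead of the embedded sample.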

25.3. Installing the Node Problem Detector

If openshift_node_problem_detector_install was set to true in the /etc/ansible/hosts inventory file, the installation creates a Node Problem Detector daemonset by default and creates a project for the detector, called openshift-node-problem-detector.

Note

Because the Node Problem Detector is in Technology Preview, the openshift_node_problem_detector_install parameter is set to false by default. You must manually change the parameter to true when installing the Node Problem Detector.

If the Node Problem Detector is not installed, run the openshift-node-problem-detector/config.yml playbook to install the Node Problem Detector:

ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/openshift-node-problem-detector/config.yml

25.4. Customizing Detected Conditions

You can configure the Node Problem Detector to watch for any log string by editing the Node Problem Detector configuration map.

Sample Node Problem Detector Configuration Map

apiVersion: v1
kind: ConfigMap
metadata:
  name: node-problem-detector
data:
  docker-monitor.json: |  
    {
        "plugin": "journald", 
        "pluginConfig": {
                "source": "docker"
        },
        "logPath": "/host/log/journal", 
        "lookback": "5m",
        "bufferSize": 10,
        "source": "docker-monitor",
        "conditions": [],
        "rules": [              
                {
                        "type": "temporary", 
                        "reason": "CorruptDockerImage", 
                        "pattern": "Error trying v2 registry: failed to register layer: rename /var/lib/docker/image/(.+) /var/lib/docker/image/(.+): directory not empty.*" 
                }
        ]
    }
  kernel-monitor.json: |  
    {
        "plugin": "journald", 
        "pluginConfig": {
                "source": "kernel"
        },
        "logPath": "/host/log/journal", 
        "lookback": "5m",
        "bufferSize": 10,
        "source": "kernel-monitor",
        "conditions": [                 
                {
                        "type": "KernelDeadlock", 
                        "reason": "KernelHasNoDeadlock", 
                        "message": "kernel has no deadlock"  
                }
        ],
        "rules": [
                {
                        "type": "temporary",
                        "reason": "OOMKilling",
                        "pattern": "Kill process \\d+ (.+) score \\d+ or sacrifice child\\nKilled process \\d+ (.+) total-vm:\\d+kB, anon-rss:\\d+kB, file-rss:\\d+kB"
                },
                {
                        "type": "temporary",
                        "reason": "TaskHung",
                        "pattern": "task \\S+:\\w+ blocked for more than \\w+ seconds\\."
                },
                {
                        "type": "temporary",
                        "reason": "UnregisterNetDevice",
                        "pattern": "unregister_netdevice: waiting for \\w+ to become free. Usage count = \\d+"
                },
                {
                        "type": "temporary",
                        "reason": "KernelOops",
                        "pattern": "BUG: unable to handle kernel NULL pointer dereference at .*"
                },
                {
                        "type": "temporary",
                        "reason": "KernelOops",
                        "pattern": "divide error: 0000 \\[#\\d+\\] SMP"
                },
                {
                        "type": "permanent",
                        "condition": "KernelDeadlock",
                        "reason": "AUFSUmountHung",
                        "pattern": "task umount\\.aufs:\\w+ blocked for more than \\w+ seconds\\."
                },
                {
                        "type": "permanent",
                        "condition": "KernelDeadlock",
                        "reason": "DockerHung",
                        "pattern": "task docker:\\w+ blocked for more than \\w+ seconds\\."
                }
        ]
    }

docker-monitor.json: Rules and conditions that apply to Docker images.
kernel-monitor.json: Rules and conditions that apply to the kernel.
plugin: Monitoring services, in a comma-separated list.
logPath: Path to the monitoring service log.
rules and conditions: Lists of events and node conditions to be monitored.
type: In a rule, label to indicate whether the error is an event (temporary) or a NodeCondition (permanent). In a conditions entry, the name of the node condition.
reason: Text message to describe the error.
pattern: Error message that the Node Problem Detector watches for.
message: In a conditions entry, the text reported for the condition, as in the sample output with no issues.

To configure the Node Problem Detector, add or remove problem conditions and events.

Edit the Node Problem Detector configuration map with a text editor.

oc edit configmap -n openshift-node-problem-detector node-problem-detector

Remove, add, or edit any node conditions or events as needed.

{
       "type": <`temporary` or `permanent`>,
       "reason": <free-form text describing the error>,
       "pattern": <log message to watch for>
},

For example:

{
       "type": "temporary",
       "reason": "UnregisterNetDevice",
       "pattern": "unregister_netdevice: waiting for \\w+ to become free. Usage count = \\d+"
},
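
An edited monitor definition can be checked before the configuration map is saved. The following Python sketch is an illustration under stated assumptions and is not a tool shipped with the Node Problem Detector. The hypothetical helper validate_monitor checks that each rule type is temporary or permanent, that a permanent rule names a condition declared in the conditions list (as the DockerHung rule in the sample configuration map does), and that the pattern compiles. Python's re module is used as a stand-in for the detector's own regular expression engine, so syntax support can differ. The rule with the reason BadExample is deliberately broken.

```python
import json
import re

VALID_RULE_TYPES = {"temporary", "permanent"}

# Shortened copy of the sample kernel-monitor.json plus one deliberately broken rule.
SAMPLE_MONITOR = r"""
{
    "conditions": [
        {"type": "KernelDeadlock", "reason": "KernelHasNoDeadlock",
         "message": "kernel has no deadlock"}
    ],
    "rules": [
        {"type": "temporary", "reason": "UnregisterNetDevice",
         "pattern": "unregister_netdevice: waiting for \\w+ to become free. Usage count = \\d+"},
        {"type": "permanent", "condition": "KernelDeadlock", "reason": "DockerHung",
         "pattern": "task docker:\\w+ blocked for more than \\w+ seconds\\."},
        {"type": "transient", "reason": "BadExample", "pattern": "oops ("}
    ]
}
"""


def validate_monitor(monitor_json):
    """Return a list of problems found in one monitor definition (empty if none)."""
    problems = []
    monitor = json.loads(monitor_json)
    declared = {c["type"] for c in monitor.get("conditions", []) if "type" in c}
    for index, rule in enumerate(monitor.get("rules", [])):
        label = "rule {} ({})".format(index, rule.get("reason", "no reason"))
        rule_type = rule.get("type")
        if rule_type not in VALID_RULE_TYPES:
            problems.append(label + ": type must be temporary or permanent")
        # A permanent rule sets a node condition, so that condition must be declared.
        if rule_type == "permanent" and rule.get("condition") not in declared:
            problems.append(label + ": condition is not declared in conditions")
        pattern = rule.get("pattern")
        if not isinstance(pattern, str):
            problems.append(label + ": pattern is missing")
            continue
        try:
            re.compile(pattern)
        except re.error as err:
            problems.append("{}: pattern does not compile: {}".format(label, err))
    return problems


if __name__ == "__main__":
    for problem in validate_monitor(SAMPLE_MONITOR):
        print(problem)
```

An empty result means only that the checks in this sketch passed. It does not guarantee that the detector accepts the definition.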

Restart running pods to apply the changes. To restart pods, you can delete all existing pods:

oc delete pods -n openshift-node-problem-detector -l name=node-problem-detector

To display Node Problem Detector output to standard output (stdout) and standard error (stderr), add the following to the Node Problem Detector daemonset:

spec:
  template:
    spec:
      containers:
      - name: node-problem-detector
        command:
        - node-problem-detector
        - --alsologtostderr=true 
        - --log_dir="/tmp" 
        - --system-log-monitors=/etc/npd/kernel-monitor.json,/etc/npd/docker-monitor.json

--alsologtostderr: Sends the log output to standard error (stderr) in addition to the log files.
--log_dir: Path to the error log directory.
--system-log-monitors: Comma-separated list of paths to the plug-in configuration files.

25.5. Verifying that the Node Problem Detector is Running

To verify that the Node Problem Detector is active:

Run the following command to get the name of the Node Problem Detector pod:

oc get pods -n openshift-node-problem-detector

NAME                          READY     STATUS    RESTARTS   AGE
node-problem-detector-8z8r8   1/1       Running   0          1h
node-problem-detector-nggjv   1/1       Running   0          1h

Run the following command to view log information on the Node Problem Detector pod:

oc logs -n openshift-node-problem-detector <pod_name>

The output should be similar to the following:

oc logs -n openshift-node-problem-detector node-problem-detector-c6kng
I0416 23:22:00.641354       1 log_monitor.go:63] Finish parsing log monitor config file: {WatcherConfig:{Plugin:journald PluginConfig:map[source:kernel] LogPath:/host/log/journal Lookback:5m} BufferSize:10 Source:kernel-monitor DefaultConditions:[{Type:KernelDeadlock Status:false Transition:0001-01-01 00:00:00 +0000 UTC Reason:KernelHasNoDeadlock Message:kernel has no deadlock}]

Test the Node Problem Detector by simulating an event on the node:
```
echo "kernel: divide error: 0000 [#0] SMP." >> /dev/kmsg
```

Test the Node Problem Detector by simulating a condition on the node:

echo "kernel: task docker:7 blocked for more than 300 seconds." >> /dev/kmsg
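
To see which rules the two simulated messages are expected to trigger, the messages can be matched offline against patterns copied from the sample kernel-monitor.json. The following Python sketch makes two assumptions: the helper matching_rules is hypothetical, and Python's re module stands in for the detector's own regular expression engine, so edge cases can differ.

```python
import re

# (type, reason, pattern) copied from the sample kernel-monitor.json rules.
RULES = [
    ("temporary", "TaskHung", r"task \S+:\w+ blocked for more than \w+ seconds\."),
    ("temporary", "KernelOops", r"divide error: 0000 \[#\d+\] SMP"),
    ("permanent", "DockerHung", r"task docker:\w+ blocked for more than \w+ seconds\."),
]


def matching_rules(message):
    """Return (type, reason) for every rule whose pattern occurs in the message."""
    return [(rule_type, reason) for rule_type, reason, pattern in RULES
            if re.search(pattern, message)]


if __name__ == "__main__":
    for message in (
        "kernel: divide error: 0000 [#0] SMP.",
        "kernel: task docker:7 blocked for more than 300 seconds.",
    ):
        print(message, "->", matching_rules(message))
```

In this sketch the docker message satisfies both the temporary TaskHung pattern and the permanent DockerHung pattern, while the divide error message satisfies only the KernelOops pattern.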

25.6. Uninstall the Node Problem Detector

To uninstall the Node Problem Detector:

Add the following option to the Ansible inventory file:

[OSEv3:vars]
openshift_node_problem_detector_state=absent

Run the following Ansible playbook:

ansible-playbook /usr/share/ansible/openshift-ansible/playbooks/openshift-node-problem-detector/config.yml
