Node Problem Detector Custom Plugins

Published on January 24, 2025

I was recently asked to prototype a custom plugin for node-problem-detector. I found that the documentation for the plugin interface is pretty inscrutable. The official documentation links to the plugin interface proposal document on Google Docs which doesn’t make a good guide to writing your own plugin. So here’s a primer I developed based on my brief experience.

Quick Introduction to Node Problem Detector

Node Problem Detector (NPD) runs as a DaemonSet, and sets conditions in the status field of the node that it is running on. It can also output an Event when a problem occurs. The condition and events would be visible when running kubectl describe node <node-name>.

NPD has several built-in problems it will detect. But it also has a way to take advantage of its infrastructure to add your own conditions and events.

Writing Custom Plugins

The “script interface” is based on exit status and a message on stdout. Exit status of 0 means no problem detected; 1 means the problem was detected; any other non-zero status means the status could not be determined.

The path to this script is added to a JSON configuration file, and the path to that configuration file is passed to the node-problem-detector binary via a command-line argument. The helm chart abstracts some of this plumbing away so that you only need to author the script, the configuration, and probably modify the node-problem-detector image so that it contains the tools necessary for your script.

You might want to start by adding a configuration like this to your helm values:

      {
        "plugin": "custom",
        "pluginConfig": {
          "invoke_interval": "30s",
          "timeout": "5s",
          "max_output_length": 80,
          "concurrency": 3
        },
        "source": "my-custom-plugin-monitor",
        "metricsReporting": true,

        "conditions": [
          {
            "type": "MyProblemCondition",
            "reason": "NoProblem",
            "message": "Everything is normal"
          }
        ],
        "rules": [
          {
            "type": "permanent",
            "condition": "MyProblemCondition",
            "reason": "ProblemCause",
            "path": "./custom-config/plugin-my_problem.sh"
          }
        ]
      }

But what do these configuration keys even mean? What are the “conditions” and “rules”?

The best explanation I’ve seen for custom plugin configuration is in the source for node-problem-detector at docs/custom_plugin_monitor.md. The struct definitions for Condition and CustomRule serve as additional references.

It’s not the most obvious configuration surface. Instead of starting from configuration, it’s simpler to think about this from the bottom-up, starting from the script, how it gets executed, and how the result is turned into a status condition or an event.

How a plugin script invocation is turned into a Condition or Event

I write a script that can return an exit status of 0 or 1. For example, below is an abridged version of a sample script from the NPD repository that checks if a systemd service (NTP) is running.

# Return success if service active (i.e. running)
if systemctl -q is-active ntp.service; then
  echo "NTP is running"
  exit 0
else
  echo "NTP is not running"
  exit 1
fi

I put this script as the "path" of a "rules" entry with "type": "temporary".

Since this is a "temporary" rule, NPD may output an Event.
If the script returns 0, then do nothing.
If the script returns 1, then output an Event with "reason" from this "rule", and "message" from stdout.

Alternatively, I could add this script to the configuration as the "path" of a "rules" entry with "type": "permanent".

Since this is a "permanent" rule, NPD will update an entry in status.conditions of the node. Which Condition entry? The one with "type" equal to this rule’s "condition" field. (Or if none exists, then it will create it, of course.)
If the script returns 0, then NPD will set the Condition to its default state. The default state comes from the entry in the "conditions" section whose "type" matches this rule’s "condition". NPD will update the Condition using both the "reason" and "message" from the "conditions" entry.
If it returns 1, then NPD will set the Condition to the "reason" from this rule and the "message" to the stdout produced by the script.

Recipes

We can distill this further into three recipes.

I want an Event to be emitted when my script returns non-zero status.

Don’t add anything to "conditions".
Add an entry in "rules" with "type": "temporary", and "path" with the path to your script.
Set the rule’s "reason" to what you want emitted in the event.

I want a Condition to be set according to how my script returns

Add an entry in "conditions" with "type", and an entry in "rules" with "condition", set to the same value: The name of the Condition you want to output.
The rule should have "type": "permanent", and "path" with the path to your script.
The condition should have a "reason" and "message" for the passing case.
The rule should have a “reason” for the failing case. The detailed “message” will come from the script’s stdout.

I want a Condition and an Event when my script returns non-zero.

Combine the above…
A "conditions" entry for the default state of the Condition.
A "permanent" rule for the erroring state of the Condition.
A "temporary" rule to emit an event.
That is, one condition and two rules with the same "path".

Deployment

You now have a script and custom plugin configuration to tell node-problem-detector to run the script and update a Condition or output an Event. A little more work is needed to bundle it all together to deploy. Using the helm chart, the following values should be set to enable the custom plugin:

image: If a custom image was needed to extend the node-problem-detector image with additional binaries, then specify it here.
settings.custom_plugin_monitors: This is a list of file paths within the container to the JSON configuration file for custom plugins. If using the chart’s custom_monitor_definitions to populate a ConfigMap, then these paths should start with /custom-config/.
Settings.custom_monitor_definitions define the contents of a ConfigMap mounted at /custom-config/. Add a key under here with a filename like "my-custom-monitor.json". Your script can also go in a key here.

That’s it! Deploy the helm chart with these values, and your custom conditions and events should begin appearing on the node objects when you run kubectl get nodes -o yaml or kubectl describe a node.

Demo

As part of my exploration, I made a demo repository. The exploration was to see if NPD could be used to detect that the node had a network connection issue. The conclusion of that exploration was that NPD was not a good candidate for detecting network connectivity problems, since it would be unable to write the status back to the api-server over the very network it was detecting a problem within. Nevertheless, the repository shows a complete custom plugin deployment that can be deployed in a kind cluster.