Published on February 15, 2025
This post continues on from our previous article on building custom plugins for Node Problem Detector.
Managing GPU-enabled Kubernetes clusters presents unique challenges that require close monitoring of GPU health and rapid response to hardware issues. While Kubernetes excels at container orchestration, it needs to be extended to monitor specialized hardware like GPUs. The combination of Node Problem Detector (NPD) and GPUd could provide a solution for automated GPU health monitoring through Kubernetes’ native health reporting mechanisms.
Introduction
Node Problem Detector (NPD) is a Kubernetes monitoring agent that detects system-level issues and reports them as node conditions and events. The conditions and events are exposed through the Kubernetes API and are visible with kubectl describe node. NPD comes with built-in problem monitors, and supports custom plugins for extending its capabilities.
GPUd is a system monitoring daemon specializing in GPU metrics. It has a component-based architecture that allows it to monitor GPU-specific metrics and related system components affecting GPU clusters. Its output includes states and events that indicate the health of each component.
The monitoring models between NPD and GPUd appear to be compatible—states mapping to node conditions and events aligning with Kubernetes events—and should enable effective GPU health monitoring in Kubernetes.
In Collaboration with Sailplane
This blog post was produced in collaboration with Sailplane. I paired with their AI agent to develop the proof-of-concept plugin, create and manage a test environment, and deploy GPUd and NPD within it, as well as to author this post.
The GPUd Architecture and API
GPUd’s NVIDIA-specific GPU monitoring components include: GPU status and performance metrics, temperature monitoring, driver and CUDA toolkit health, GPU memory usage, and ECC errors.
GPUd also has components for monitoring non-GPU general system health, like systemd services, memory and CPU usage, kernel module status, and kernel dmesg logs.
GPUd caches monitoring data in a SQLite database and exposes it through a RESTful API. The API endpoints include:
- /v1/components to list the components;
- /v1/states to show the instantaneous current health of the component;
- /v1/events to show a timestamped series of notable events within the component;
- /v1/metrics to gather measurements from the component, similar to Prometheus metrics; and
- /v1/info to gather all component information in one response.
Each accepts a components query parameter to filter the results to one or a set of components, and the events and metrics endpoints accept startTime and endTime to query a time range.
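For example, assuming GPUd is serving on the local node on port 15132 (the port used by the proof-of-concept script in the appendix), filtered queries look like this:
# Current health of just the systemd component
curl -sk "https://localhost:15132/v1/states?components=systemd"
# Memory events from the last 10 minutes (startTime as a Unix timestamp,
# matching how the appendix script calls the API)
curl -sk "https://localhost:15132/v1/events?components=memory&startTime=$(date -d '10min ago' +%s)"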
Here’s an example of a state response from the systemd component (/v1/states?components=systemd):
[{
"component": "systemd",
"states": [{
"name": "unit",
"healthy": true,
"reason": "name: kubelet active: true uptime: 1 day ago",
"extra_info": {
"active": "true",
"name": "kubelet",
"uptime_humanized": "1 day ago",
"uptime_seconds": "90344"
}
}]
}]
The /v1/events API produces similar output, but requires the startTime parameter to actually produce any output. endTime is also an available parameter, and both default to the current time (i.e., the default time range has zero duration, so no events would be selected).
Here’s an example of an event response from the memory component (/v1/events?components=memory&startTime=[...]):
[{
"component": "memory",
"startTime": "2025-02-11T20:59:30Z",
"endTime": "2025-02-12T00:59:30.450503144Z",
"events": [{
"time": "2025-02-11T21:09:19Z",
"name": "memory_oom_cgroup",
"type": "Warning",
"message": "oom cgroup detected",
"extra_info": {
"log_line": "Memory cgroup out of memory: Killed process 339038 (python) total-vm:92920kB, anon-rss:64672kB, file-rss:4608kB, shmem-rss:0kB, UID:0 pgtables:184kB oom_score_adj:992"
}
}]
}]
NPD Custom Plugin Implementation
We can use NPD’s custom plugin system to bridge GPUd’s monitoring capabilities to Kubernetes’ node health model. NPD interprets the exit status of a plugin script as the detection of a problem and, if a problem is detected, uses any message on stdout as the condition reason or event message. The plugin script can query GPUd’s API, process the response with jq to filter for the relevant state or events, and then print a message for the state’s reason (or the event message) and set the process exit status accordingly.
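To make that concrete, here’s a condensed sketch for the states case, checking the kubelet unit from the systemd example above (it assumes NODE_NAME is set in the NPD pod, as the appendix script does; the full script generalizes this with arguments and error handling):
#!/bin/sh
# Sketch: report a problem if GPUd says the kubelet systemd unit is unhealthy.
response=$(curl -sk "https://$NODE_NAME:15132/v1/states?components=systemd") || exit 2
reason=$(echo "$response" | jq -r \
  '[.[].states[] | select(.extra_info.name == "kubelet" and .healthy == false) | .reason][0]')
if [ -n "$reason" ] && [ "$reason" != "null" ]; then
  echo -n "$reason"  # NPD uses this stdout message as the condition reason
  exit 1             # non-zero exit status: problem detected
fi
exit 0               # zero exit status: no problem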
See the appendix below for a proof-of-concept implementation of such a script.
For state monitoring, the plugin can detect a problem when a state is reported with "healthy": false.
For events, the plugin can detect a problem whenever a matching event is emitted.
Here’s an example configuration of rules for monitoring the state of the kubelet service, and event monitoring for OOM kills:
{
...
"rules": [
{
"type": "permanent",
"condition": "GPUdKubeletHealthy",
"reason": "KubeletRunning",
"path": "/usr/local/bin/gpud-npd-plugin.sh",
"args": [
"--mode", "states",
"--component", "systemd",
"--state-name", "unit",
"--match-extra-info", ".name == \"kubelet\""
]
},
{
"type": "temporary",
"reason": "OOMKilling",
"path": "/usr/local/bin/gpud-npd-plugin.sh",
"args": [
"--mode", "events",
"--component", "memory",
"--event-name", "memory_oom_cgroup"
]
},
...
]
}
Limitations
Some limitations arise from the way NPD queries plugins for problem detection.
Event Handling
The plugin can only emit one event per polling interval. This will miss sequences of events that occur faster than the polling interval. While shrinking the polling interval may help, that approach cannot guarantee events will not be missed. Instead, the script should output an indicator that multiple events occurred. Additionally, the event rules should be split up with finely sliced (more specific) queries to match the smallest number of events for a "reason".
However, splitting these into finely sliced queries explodes the configuration above and will compound the performance overhead, as explained next.
Performance Overhead
The NPD custom plugin architecture requires polling GPUd’s API separately for each configured component. Each poll of each rule must fork/exec a process, and each script execution launches several other processes. The most expensive step is contacting the GPUd API, with the connection overhead (TLS) that entails. Depending on the component, GPUd will read in-memory caches or run a SQLite query to collect the requested information. With many components, this can add up to significant overhead.
A potential mitigation (not implemented for this proof-of-concept) is to run another per-node process (e.g., another daemonset, or added to the NPD daemonset) that periodically polls GPUd for the information from all components in one request, and then splits it out into individual files that the individual rules can read more cheaply.
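A rough sketch of that approach, assuming the /v1/info response is an array with one entry per component carrying a component field (like the states and events responses), and using an illustrative cache path:
#!/bin/sh
# Hypothetical sidecar loop: poll GPUd once for all components, then split the
# response into per-component files that the plugin rules can read cheaply.
CACHE_DIR=/var/run/gpud-cache   # illustrative location
mkdir -p "$CACHE_DIR"
while true; do
  info=$(curl -sk "https://localhost:15132/v1/info") || { sleep 30; continue; }
  for name in $(echo "$info" | jq -r '.[].component'); do
    echo "$info" | jq --arg c "$name" '[.[] | select(.component == $c)]' \
      > "$CACHE_DIR/$name.json"
  done
  sleep 30   # match the NPD polling interval
done
The plugin rules would then read and filter these files with jq instead of hitting the API on every invocation.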
GPUd Code Quality
GPUd is a young project published by a fast-moving startup. As such it shows signs of immaturity that we can hope will improve over time.
There is not a lot of documentation for the project. The list of components has one-sentence descriptions that do little more than restate the name, and then links to the GoDocs for each component, which add no further information.
The API documentation lists the API endpoints, but it shows component as a query parameter, instead of the correct parameter components.
Meanwhile, startTime and endTime are not documented, yet these are critical to getting any information from the /v1/events endpoint, as noted earlier. The shortcomings of the documentation leave you to either probe the API directly to figure out what information is available, or read the code.
Some components appear to have tunable thresholds: for example, the fd component has a Config struct with a threshold_allocated_file_handles field, which also appears in the component’s output. If you’re looking to change this threshold, though, you are out of luck.
You might, as I did, look at the code to see how to set this configuration and have some hope: there is a global configuration object (which includes the fd component’s Config struct), and there is a function that reads it in from a YAML file in a set of fixed locations.
But, at the time of writing, that function is dead code, never referenced by any other code. GPUd always launches with its built-in, automatic configuration, with limited modification from command-line arguments.
Conclusion
The integration of GPUd with Node Problem Detector demonstrates how Kubernetes’ node health monitoring can be extended to cover specialized hardware like GPUs. By mapping GPUd’s monitoring capabilities to Kubernetes’ native health reporting mechanisms through NPD’s plugin system, clusters gain visibility into GPU health and can potentially automate responses to GPU-related issues. While the plugin architecture has limitations around event handling and performance, it provides a starting point for exploring automated GPU health monitoring in Kubernetes environments and should work fine for a limited number of extracted event and condition types.
Appendix
This is the proof-of-concept plugin script written for this blog post. curl and jq must be installed in the node-problem-detector image.
#!/bin/sh
# Exit code definitions
EXIT_SUCCESS=0 # No problem detected
EXIT_PROBLEM_DETECTED=1 # Problem detected in GPUd events/states
EXIT_SYSTEM_ERROR=2 # System errors like API failures or invalid arguments
die() {
echo "$1"
exit $EXIT_SYSTEM_ERROR
}
MODE=""
COMPONENT=""
EVENT_NAME=""
STATE_NAME=""
MATCH_EXTRA_INFO=""
while [ $# -gt 0 ]; do
case "$1" in
--mode) MODE="$2"; shift 2 ;;
--component) COMPONENT="$2"; shift 2 ;;
--event-name) EVENT_NAME="$2"; shift 2 ;;
--state-name) STATE_NAME="$2"; shift 2 ;;
--match-extra-info) MATCH_EXTRA_INFO="$2"; shift 2 ;;
*) die "Unknown argument: $1" ;;
esac
done
query_events() {
# Query GPUd events for specific component and filter by event name and message pattern
# startTime is needed to get any events. endTime defaults to now. 30sec matches the polling interval.
response=$(curl -sk "https://$NODE_NAME:15132/v1/events?components=${COMPONENT}&startTime=$(date -d "30sec ago" +%s)")
if [ $? -ne 0 ]; then
die "Failed to query GPUd events API"
fi
event_count=$(echo "$response" | jq --arg name "$EVENT_NAME" \
'[.[].events[] | select(.name == $name)] | length')
event_msg=$(echo "$response" | jq -r --arg name "$EVENT_NAME" \
'[.[].events[] | select(.name == $name) | .message][0]')
if [ "$event_count" -gt 0 ]; then
echo -n "$event_msg"
if [ "$event_count" -gt 1 ]; then
echo -n " ($((event_count - 1)) events missed)"
fi
exit $EXIT_PROBLEM_DETECTED
fi
return $EXIT_SUCCESS
}
query_states() {
# Query GPUd states for specific component and filter by state name and extra info
response=$(curl -sk "https://$NODE_NAME:15132/v1/states?components=${COMPONENT}")
if [ $? -ne 0 ]; then
die "Failed to query GPUd states API"
fi
state_reason=$(echo "$response" | jq -r --arg name "$STATE_NAME" \
"[.[].states[] | select(.name == \$name and (.extra_info | $MATCH_EXTRA_INFO) and .healthy == false) | .reason][0]")
if [ -n "$state_reason" ] && [ "$state_reason" != "null" ]; then
echo -n "$state_reason"
exit $EXIT_PROBLEM_DETECTED
fi
return $EXIT_SUCCESS
}
case "$MODE" in
"events") query_events ;;
"states") query_states ;;
*) die "Invalid mode: $MODE" ;;
esac
exit $EXIT_SUCCESS