check_pbsnodes
Example:
./check_pbsnodes -w 1 -c 2
This warns Nagios if one node is unresponsive. If two nodes are down, it sends
Nagios a critical message. In addition, the plugin reports the names of the crashed nodes, along with the job IDs and the users who own them.
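The threshold behavior described above can be sketched as follows. This is a minimal illustration, not the plugin's actual code: the function name `nagios_status` and the message wording are hypothetical, and it assumes a threshold is met when the count of down nodes is greater than or equal to it. The numeric codes follow the standard Nagios plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL).

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2


def nagios_status(down_nodes, warn, crit):
    """Map a list of down node names to a (exit_code, message) pair.

    Hypothetical helper: assumes a threshold is met when the number of
    down nodes is >= the threshold, matching the -w 1 -c 2 example.
    """
    count = len(down_nodes)
    if count >= crit:
        code, label = CRITICAL, "CRITICAL"
    elif count >= warn:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    # The message format is illustrative only.
    message = f"PBSNODES {label} - {count} node(s) down"
    if down_nodes:
        message += ": " + ", ".join(down_nodes)
    return code, message


if __name__ == "__main__":
    import sys

    # Equivalent in spirit to: ./check_pbsnodes -w 1 -c 2
    code, message = nagios_status(["node01"], warn=1, crit=2)
    print(message)
    sys.exit(code)
```

A real plugin would print the message on a single line and exit with the returned code so that Nagios can pick up the state.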
FULL DESCRIPTION:
This plugin tests for the presence of crashed nodes in a high performance
computing cluster. In such clusters, it is not uncommon for load to reach very
high levels on compute nodes. Under such load, many parts of the system may
bog down and become unresponsive. For example, SSH logins may no longer work, and
polling via Ganglia or Cacti may cease. And yet, this does not mean that the compute
node has crashed or is no longer doing the work assigned to it by the cluster scheduler.
Under such circumstances, the only way to know if a node is really down is to check
whether a job's remaining time has gone negative. Torque runs at a higher scheduling
priority (nice level) than the jobs it runs, so it is always guaranteed a processor time
slice. If walltime is exceeded and Torque is able to get a slice, it will kill the job.
If it can't, then it's because the node has crashed, and showq will show a negative
time in the (time) REMAINING column.
Therefore, this plugin is designed to be run on the Cluster Service Node, calling the
showq command, parsing the output, and searching for values in the REMAINING column
that are negative numbers. When it finds them, it reports the problem using
correct Nagios syntax and includes the crashed node names in the output string. It needs
to be called from a remote plugin executor such as NRPE, or MRPE if using Mathias Kettner's
Check_MK.
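As a rough sketch of the parsing step described above, the following scans showq-style text for job rows whose REMAINING value is negative. It is an assumption-laden illustration rather than the plugin's own implementation: the sample text, the column positions (job ID first, user second, REMAINING fifth), and the name `find_overdue_jobs` are all hypothetical, and real showq output may differ between scheduler versions. Mapping the overdue jobs to node names is left out, since the listing does not say how the plugin does it.

```python
import re

# Hypothetical sample; real showq output may be laid out differently.
SAMPLE_SHOWQ = """\
ACTIVE JOBS--------------------
JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

1001               alice       Running     8    12:30:00  Mon Jan  1 10:00:00
1002               bob         Running    16    -1:15:42  Sun Dec 31 08:00:00
1003               carol       Running     4 -1:02:00:00  Sat Dec 30 09:00:00

3 Active Jobs
"""

# A negative time: a leading minus sign, then colon-separated digit groups.
NEGATIVE_TIME = re.compile(r"^-\d+(:\d+)*$")


def find_overdue_jobs(showq_text):
    """Return (job_id, user, remaining) for rows with negative REMAINING.

    Assumes whitespace-separated columns with REMAINING in the fifth
    position, as in the sample above.
    """
    overdue = []
    for line in showq_text.splitlines():
        fields = line.split()
        # Skip blank lines, section banners, and summary lines.
        if len(fields) < 5:
            continue
        job_id, user, remaining = fields[0], fields[1], fields[4]
        if NEGATIVE_TIME.match(remaining):
            overdue.append((job_id, user, remaining))
    return overdue


if __name__ == "__main__":
    for job_id, user, remaining in find_overdue_jobs(SAMPLE_SHOWQ):
        print(f"job {job_id} (user {user}) remaining {remaining}")
```

In a live check, the text would come from running showq on the service node (for example via `subprocess.run`) instead of the embedded sample.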