Search Exchange
Search All Sites
Nagios Live Webinars
Let our experts show you how Nagios can help your organization.Login
Directory Tree
check_oomkiller
2.0
2010-09-07
- Nagios 3.x
100320
File | Description |
---|---|
check_oomkiller.pl.txt | check_oomkiller client plugin |
check_oomkiller.c.txt | check_oomkiller suid wrapper |
Meet The New Nagios Core Services Platform
Built on over 25 years of monitoring experience, the Nagios Core Services Platform provides insightful monitoring dashboards, time-saving monitoring wizards, and unmatched ease of use. Use it for free indefinitely.
Monitoring Made Magically Better
- Nagios Core on Overdrive
- Powerful Monitoring Dashboards
- Time-Saving Configuration Wizards
- Open Source Powered Monitoring On Steroids
- And So Much More!
The LINUX OS will assassinate "big memory" processes during extreme memory shortages as a self defense. This plugin checks for such activity and returns a critical status if any has occurred since the previous check.
This plugin was written on and for RHEL4 using Nagios and NRPE and may need to be tweaked for other distro's.
IMPORTANT - This check requires read only access to the system messages file /var/log/messages which is not by default available to unprivileged accounts. For this reason it is either necessary to make /var/log/messages readable to the nagios user account (DON'T DO THIS), or it is necessary to write a small compiled wrapper program around the script and install the wrapper as a root owned SUID executable (DO THIS!)
Installation Instructions
1) Put the PERL script and C program in the nagios/libexec directory on the client system that will be checked.
2) Edit the C program if needed changing the REAL_PATH definition for your environment.
3) Compile the C program and install it as an SUID application. (chmod 4555 and chown root)
4) Use the following plugin definition on the client system in the nrpe.cfg configuration file. As with the C program edit the path if needed for your environment.
5) Use the following service check definition on the Nagios server to perform the check on monitored systems.
Because each instance of the OOM-Killer check resets the current status, the service check definition on the Nagios server MUST contain "max_check_attempts 1". If you don't do this you will NEVER be notified.
Also notice that I am using a custom check_command called check_nrpe60. The only difference between check_nrpe60 and the standard check_nrpe is the addition of a 60 second timeout specification (see below). This is necessary because on systems with large /var/log/messages files (or busy systems with few CPU cycles to spare) the standard NRPE check on the server can timeout before the plugin has actually completed on the client.
The plugin returns a warning status if it can't perform its task, and a critical status if any OOM killer activity has taken place. If the status is critical, it also returns extended status information detailing the PID's and users affected.
And finally, it is worth noting that on a properly tuned system this activity will probably not occur. We discovered it "by accident" when a physical server was converted to a virtual machine and not given the same amount of memory it had previously. When we identified and applied the correct memory tuning parameter (vm.lower_zone_protection applied in /etc/syctl.conf in this case) the OOM-Killer activity ceased.
This plugin was written on and for RHEL4 using Nagios and NRPE and may need to be tweaked for other distro's.
IMPORTANT - This check requires read only access to the system messages file /var/log/messages which is not by default available to unprivileged accounts. For this reason it is either necessary to make /var/log/messages readable to the nagios user account (DON'T DO THIS), or it is necessary to write a small compiled wrapper program around the script and install the wrapper as a root owned SUID executable (DO THIS!)
Installation Instructions
1) Put the PERL script and C program in the nagios/libexec directory on the client system that will be checked.
2) Edit the C program if needed changing the REAL_PATH definition for your environment.
3) Compile the C program and install it as an SUID application. (chmod 4555 and chown root)
4) Use the following plugin definition on the client system in the nrpe.cfg configuration file. As with the C program edit the path if needed for your environment.
command[check_oomkiller]=/usr/local/nagios/libexec/check_oomkiller
5) Use the following service check definition on the Nagios server to perform the check on monitored systems.
define service{ use generic-service host_name possible-oom-killer-victim service_description OOM Killer check_command check_nrpe60!check_oomkiller max_check_attempts 1 }
Because each instance of the OOM-Killer check resets the current status, the service check definition on the Nagios server MUST contain "max_check_attempts 1". If you don't do this you will NEVER be notified.
Also notice that I am using a custom check_command called check_nrpe60. The only difference between check_nrpe60 and the standard check_nrpe is the addition of a 60 second timeout specification (see below). This is necessary because on systems with large /var/log/messages files (or busy systems with few CPU cycles to spare) the standard NRPE check on the server can timeout before the plugin has actually completed on the client.
define command{ command_name check_nrpe60 command_line $USER1$/check_nrpe -H $HOSTADDRESS$ -t 60 -c $ARG1$ }
The plugin returns a warning status if it can't perform its task, and a critical status if any OOM killer activity has taken place. If the status is critical, it also returns extended status information detailing the PID's and users affected.
And finally, it is worth noting that on a properly tuned system this activity will probably not occur. We discovered it "by accident" when a physical server was converted to a virtual machine and not given the same amount of memory it had previously. When we identified and applied the correct memory tuning parameter (vm.lower_zone_protection applied in /etc/syctl.conf in this case) the OOM-Killer activity ceased.
Reviews (1)
I've got the compiled C wrapper working on the command line as ./check_oomkiller, but when attempting to run with NRPE as ./check_nrpe -H 127.0.0.1 -c check_oomkiller I continue to get NRPE: Unable to read output. Using NRPE 3.0.1. I have also update my nrpe.cfg file with the matching command and restarted nrpe.