check_cdu - Monitor Server Technology Cabinet Distribution (CDU) Products
2.3
2018-03-12
- Nagios 3.x
- Nagios 4.x
GPL
35774
File | Description |
---|---|
check_cdu.pl | Version 1.0 |
check_cdu.pl | Version 1.4 |
check_cdu.pl | Version 2.1 |
(This is a partial dump of 'perldoc check_cdu.pl')
NAME
check_cdu - Check various metrics from a Server Technology Cabinet Distribution Unit (CDU)
VERSION
This documentation refers to check_cdu version 2.1
APPLICATION REQUIREMENTS
Several standard Perl libraries are required for this program to function. Namely, Net::SNMP,
Getopt::Std, Getopt::Long, Nagios::Plugin::Threshold
GENERAL USAGE
check_cdu.pl -H <hostname> -C <community> [-t SNMP timeout] [-p SNMP port]
REQUIRED ARGUMENTS
Only the hostname and community are required. Timeout will default to 2 seconds, port 161.
THRESHOLDS
I opted to use the Nagios::Plugin::Threshold class to handle thresholds. In general I do
not prefer Nagios::Plugins objects, but I just simply could not avoid using the Threshold
class. I apologize for the added dependency, I just could not afford re-inventing the wheel.
The benefit is that the threshold logic used in this plugin follows the standard used in
many other plugins. For reference, here are the general threshold guidelines:
Range definition    Generate an alert if x...
10                  < 0 or > 10, (outside the range of {0 .. 10})
10:                 < 10, (outside {10 .. ∞})
~:10                > 10, (outside the range of {-∞ .. 10})
10:20               < 10 or > 20, (outside the range of {10 .. 20})
@10:20              ≥ 10 and ≤ 20, (inside the range of {10 .. 20})
Read http://nagiosplug.sourceforge.net/developer-guidelines.html#THRESHOLDFORMAT
for the full, official documentation.
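To make the table above concrete, here is a minimal Python sketch of how such a range string could be interpreted. It is not the code of Nagios::Plugin::Threshold (which is Perl); the function names parse_range and should_alert are made up for this illustration.

```python
import math


def parse_range(spec):
    """Parse a range such as '10', '10:', '~:10', '10:20' or '@10:20'.

    Returns (low, high, inside). inside=True means the '@' prefix was
    used, so an alert is raised when the value is INSIDE the range.
    """
    inside = spec.startswith("@")
    if inside:
        spec = spec[1:]
    if ":" in spec:
        low_text, high_text = spec.split(":", 1)
    else:
        # A bare number N is shorthand for the range 0..N.
        low_text, high_text = "0", spec
    low = -math.inf if low_text == "~" else float(low_text or 0)
    high = math.inf if high_text == "" else float(high_text)
    return low, high, inside


def should_alert(spec, value):
    """Return True if value violates the range described by spec."""
    low, high, inside = parse_range(spec)
    in_range = low <= value <= high
    return in_range if inside else not in_range


for spec in ("10", "10:", "~:10", "10:20", "@10:20"):
    print(spec, should_alert(spec, 15))
```

Note that the end points belong to the range, so a value of exactly 10 does not alert for '10' but does alert for '@10:20'.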
FULL DOCUMENTATION
check_cdu is intended to provide extremely flexible and extensive monitoring support for
Server Technology Cabinet Distribution Units (CDU). In general the workflow for this application
follows this procedure:
1. Pull in an entire SNMP table using a Net::SNMP session and get_table().
2. Reorganize these "flat" values into a structured hash.
3. Evaluate any options or thresholds passed on the command line by the user.
4. Process the command line options against the data collected from the CDU.
5. Exit appropriately given the status results.
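Step 2 is probably the least obvious of these, so here is a rough Python sketch of the idea. The plugin itself is Perl and its real OID layout is not shown in this documentation; BASE_OID, the column numbers in COLUMNS and the function name restructure are all placeholders invented for the example.

```python
from collections import defaultdict

# Placeholder values for illustration only; not taken from the real MIB.
BASE_OID = "1.3.6.1.4.1.99999.1"
COLUMNS = {"2": "ID", "3": "Name", "5": "Status"}


def restructure(flat_table, base_oid=BASE_OID, columns=COLUMNS):
    """Group 'base.column.index' OIDs into {index: {metric: value}}."""
    rows = defaultdict(dict)
    prefix = base_oid + "."
    for oid, value in flat_table.items():
        if not oid.startswith(prefix):
            continue  # not part of this table
        column, _, index = oid[len(prefix):].partition(".")
        if column in columns and index:
            rows[index][columns[column]] = value
    return dict(rows)


# A flat OID -> value mapping, shaped like the result of an SNMP table walk.
flat = {
    BASE_OID + ".2.1": "A",
    BASE_OID + ".3.1": "TowerA",
    BASE_OID + ".5.1": "normal",
    BASE_OID + ".2.2": "B",
    BASE_OID + ".3.2": "TowerB",
    BASE_OID + ".5.2": "noComm",
}
towers = restructure(flat)
print(towers["1"]["Name"], towers["2"]["Status"])
```

Once the data is keyed by row and metric name, the later steps can look up values such as towers["2"]["Status"] without touching raw OIDs.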
This workflow is generally followed in eight slightly different ways depending on the desired options.
These eight procedures are:
1. General System
2. Environment
3. Towers (Sentry3 Products)
4. Infeeds (Sentry3 Products)
5. Cords (Sentry4 Products)
6. Lines (Sentry4 Products)
7. Phases (Sentry4 Products)
8. Branches (Sentry4 Products)
Environment
Temperature and humidity (T/H) probes are an optional feature of a CDU. On most units, only two T/H
ports exist. Some "Link" or "Expansion" CDUs also have T/H ports. When using an EMCU 1-1B even
more T/H probes are available. This application is designed to support any number of T/H probes
available to the system. The way these are identified varies between Sentry3 and Sentry4 products.
Prior to running any checks for temperature or humidity, this plugin will check the T/H probe
status. The following states will result in an UNKNOWN return:
notFound
readError
lost
noComm
This applies to both temperature and humidity. The LowThresh and HighThresh states are ignored.
Any of these states will issue an UNKNOWN return. Since there is no data available, it's not
logical to initiate a WARNING or CRITICAL and roll someone out of bed. This behavior can easily
be changed in the code, if desired.
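The status pre-check described above can be pictured with the following Python sketch. It only illustrates the rule; the plugin's Perl code may be organized differently. The function name split_probes and the sample status strings "normal" and "HighThresh" are assumptions made for the example.

```python
# Probe states that carry no usable reading, as listed above.
UNKNOWN_PROBE_STATES = {"notFound", "readError", "lost", "noComm"}


def split_probes(probe_status):
    """Split probe IDs into (usable, unknown) lists based on their status."""
    usable, unknown = [], []
    for probe_id, status in probe_status.items():
        if status in UNKNOWN_PROBE_STATES:
            unknown.append(probe_id)
        else:
            # Anything else, including the LowThresh/HighThresh states,
            # still has a reading that can be compared to thresholds.
            usable.append(probe_id)
    return usable, unknown


usable, unknown = split_probes(
    {"A1": "normal", "A2": "notFound", "B1": "HighThresh"}
)
print("check:", usable, "UNKNOWN:", unknown)
```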
In its simplest form, the environment checks will query all available T/H probes connected to the
system. Unfortunately, if any ports have no sensor connected the plugin will return an UNKNOWN state
indicating that some sensors are notFound. In this case you will need to explicitly indicate which
sensors to query (I have no way of knowing if a sensor isn't really there, or if it failed).
The CDU has internal High/Low thresholds configured for both Temperature and Humidity, and
this is done on a per sensor basis. Without any arguments, this plugin will honor those values.
Considering that there is only one high:low range, I opted to designate this as a WARNING threshold.
This behavior can easily be changed in the code to the CRITICAL state, if desired, but it is NOT
modifiable from the command line. A basic invocation would resemble:
$ check_cdu.pl -H 192.168.0.1 -C public --temp --humid
OK - BLDG_ROOM_RACK, Bottom-Rack-Inlet_F31(A1): 18C, Bottom-Rack-Inlet_F31(A1): 43%,
Bottom-Rack-Exhaust_F32(A2): 33C, Bottom-Rack-Exhaust_F32(A2): 16%, Top-Rack-Inlet_F31(B1): 24C,
Top-Rack-Inlet_F31(B1): 28%, Top-Rack-Exhaust_F32(B2): 36.5C, Top-Rack-Exhaust_F32(B2): 12%
The plugin output always includes the systemLocation defined on the CDU first. The various objects
queried are then returned in a comma separated list. For temperature and humidity probes, the sensor
Name is returned along with the ID in parentheses. If names haven't been set, the defaults will still
be displayed. Finally the value is listed for each sensor. The temperature scale is automatically
determined from the TempScale object provided via SNMP. For instances where the CDU is configured
for one scale, but the user desires the plugin to report in another scale, the --fahrenheit and
--celsius options are quite handy:
$ check_cdu.pl -H 192.168.0.1 -C public --temp --humid --fahrenheit
OK - BLDG_ROOM_RACK, Bottom-Rack-Inlet_F31(A1): 62.6F, Bottom-Rack-Inlet_F31(A1): 45%, Bottom-Rack-Exhaust_F32(A2): 91.4F
--celsius works in a similar fashion. If a scale is passed to the plugin and the T/H probe is already
configured for that scale, no error will occur. The values will be reported in the native scale for
that sensor.
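The scale handling amounts to the standard Celsius/Fahrenheit conversion plus a no-op when the requested scale already matches. A small Python sketch follows; the function name convert_temperature and the rounding to one decimal place are assumptions for illustration, not taken from the plugin.

```python
def convert_temperature(value, native_scale, wanted_scale=None):
    """Return (value, scale), converted only if a different scale is wanted."""
    if wanted_scale is None or wanted_scale == native_scale:
        return value, native_scale  # report in the sensor's native scale
    if native_scale == "C" and wanted_scale == "F":
        return round(value * 9 / 5 + 32, 1), "F"
    if native_scale == "F" and wanted_scale == "C":
        return round((value - 32) * 5 / 9, 1), "C"
    raise ValueError("unsupported scale: %r -> %r" % (native_scale, wanted_scale))


print(convert_temperature(33, "C", "F"))
print(convert_temperature(24, "C", "C"))
```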
Expanding on this basic functionality is the --ths option. --ths allows the user to select
which sensors to query, based on the sensor ID (not the name!). --ths will automatically determine
if the sensors exist, and exit UNKNOWN if they were not found. All of the regular sensor status
checks are still performed.
$ check_cdu.pl -H 192.168.0.1 -C public --temp --ths A1,B2 --fahrenheit
OK - BLDG_ROOM_RACK, Bottom-Rack-Inlet_F31(A1): 62.6F, Top-Rack-Exhaust_F32(B2): 97.7F
Note I also left out the --humid option. Either option can be specified alone, or both together,
providing maximum flexibility for designing purpose-built Nagios service checks.
User supplied WARNING and CRITICAL thresholds can be applied to the temperature and humidity
sensors using the --warning and --critical directives. This overrides the automatic threshold
logic that relies upon the internal CDU configuration. Either --warning or --critical can be used,
or both can be used together. When querying multiple temperature sensors, a single threshold is
applied across all sensors. The same is true for querying multiple humidity sensors. Both temperature
and humidity can be queried together in the same command, by "chaining" the thresholds together.
Here are a couple examples:
$ check_cdu.pl -H 192.168.0.1 -C public --temp --fahrenheit --ths A1,B1 --warning 60:80
OK - BLDG_ROOM_RACK, Bottom-Rack-Inlet_F31(A1): 62.6F, Top-Rack-Inlet_F31(B1): 77F
(Query just the temperature from T/H probes A1 and B1 and apply a warning threshold to alarm if
either sensor falls below 60F or above 80F)
$ check_cdu.pl -H 192.168.0.1 -C public --humid --ths A2,B2 --warning 10:70
OK - BLDG_ROOM_RACK, Bottom-Rack-Exhaust_F32(A2): 18%, Top-Rack-Exhaust_F32(B2): 13%
(Query just the humidity from T/H probes A2 and B2 and apply a warning threshold to alarm if
either sensor falls below 10% or above 70% relative humidity)
$ check_cdu.pl -H 192.168.0.1 -C public --temp --humid --fahrenheit --ths A1 --warning 80,20: --critical 95,10:
OK - BLDG_ROOM_RACK, Bottom-Rack-Inlet_F31(A1): 64.4F, Bottom-Rack-Inlet_F31(A1): 48%
(Check just sensor A1, but query both temperature and humidity from this sensor. If the temperature
rises above 80F or the humidity falls below 20% generate a WARNING. If the temperature rises above
95 or the humidity falls below 10% generate a CRITICAL.)
IMPORTANT NOTE: When specifying both --temp and --humid the thresholds are chained together as
temperature_threshold,humidity_threshold regardless of which order --temp and --humid are passed!!
aka the following are equivalent:
'--temp --humid --warning 45,60' , '--humid --temp --warning 45,60'
The following are NOT equivalent:
'--temp --humid --warning 45,60', '--humid --temp --warning 60,45'
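The fixed ordering can be illustrated with a few lines of Python. This is only a sketch of the rule stated above; the function name assign_thresholds is hypothetical, and raising an error on a count mismatch is an assumption about how bad input might be handled.

```python
# Chained thresholds are always read in this order, no matter how the
# options were ordered on the command line.
CHAIN_ORDER = ["temp", "humid"]


def assign_thresholds(selected, chained):
    """Map a chained threshold string such as '45,60' onto the selected checks."""
    ordered = [name for name in CHAIN_ORDER if name in selected]
    parts = chained.split(",")
    if len(parts) != len(ordered):
        raise ValueError("expected %d threshold(s), got %d" % (len(ordered), len(parts)))
    return dict(zip(ordered, parts))


print(assign_thresholds(["temp", "humid"], "45,60"))
print(assign_thresholds(["humid", "temp"], "45,60"))
print(assign_thresholds(["humid", "temp"], "60,45"))
```

The first two calls give the same mapping (temperature 45, humidity 60), while the third assigns 60 to temperature and 45 to humidity.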
Starting in version 1.3 monitoring dewpoint temperature and dewpoint delta is supported. The
CDU does not natively support dewpoint, but it can be calculated given temperature and humidity.
Dewpoint is calculated using constants from J Applied Meteorology and Climatology and the
dewpoint calculations provided at: http://en.wikipedia.org/wiki/Dew_point#Calculating_the_dew_point
There are two ways to monitor dewpoint. First is with the "--dewtemp" option. This simply
calculates the air temperature dewpoint of any given sensor and applies the user supplied
thresholds to the value. Using the "--dewdelta" directive calculates the differential temperature
between the air temperature and calculated air temperature dewpoint values. This is especially
useful for determining how close a sensor is to reaching the dewpoint temperature, and hence
when condensation might start forming within a data center. An example invocation would look like:
$ check_cdu.pl -H 192.168.0.1 -C public --dewdelta --fahrenheit --ths A1,C1 --warning 10: --critical 5:
OK - BLDG_ROOM_RACK, Bottom-Rack-Inlet(A1) Delta: 39.00F, Top-Rack-Inlet(C1) Delta: 40.37F
This check would initiate a WARNING if the dewpoint is 10F or less from the air temperature and a
CRITICAL if the dewpoint is 5F or less from the air temperature. I believe this would be a typical
use for this function. The dewpoint temperature can never be greater than air temperature, only
less than, or equal to.
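For readers curious about the arithmetic, here is a Python sketch of a dewpoint calculation using the Magnus approximation. The constants B and C below are one commonly cited set and are an assumption; the constants actually used by the plugin may differ, so results can deviate slightly from the plugin's output. The function names are made up for this example.

```python
import math

# Assumed Magnus constants; the plugin's own values may be different.
B = 17.62
C = 243.12  # degrees Celsius


def dewpoint_c(temp_c, humidity_pct):
    """Approximate dewpoint in Celsius from air temperature and relative humidity."""
    gamma = math.log(humidity_pct / 100.0) + (B * temp_c) / (C + temp_c)
    return (C * gamma) / (B - gamma)


def dew_delta_c(temp_c, humidity_pct):
    """Difference between air temperature and dewpoint, in Celsius degrees."""
    return temp_c - dewpoint_c(temp_c, humidity_pct)


print(round(dewpoint_c(19.5, 24), 2), round(dew_delta_c(19.5, 24), 2))
```

A delta expressed in Celsius degrees is converted to Fahrenheit degrees by multiplying by 9/5 only; the 32 degree offset applies to temperatures, not to differences.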
Since the CDU does not have built-in thresholds for dewpoint, it is required to use either
--warning or --critical in conjunction with either --dewtemp or --dewdelta. As with the --temp and
--humid options, chaining is supported with the dewpoint options. The order of the chained
thresholds is always temp,humidity,dewpoint. You cannot specify --dewdelta and --dewtemp in
the same invocation. A more complex example invocation would be:
$ check_cdu.pl -H 192.168.0.1 -C public --temp --humid --dewdelta --fahrenheit --ths A1,C1
--warning 80,50,10: --critical 90,80,5:
OK - BLDG_ROOM_RACK, Bottom-Rack-Inlet(A1): 67.1F, Bottom-Rack-Inlet(A1): 24%, Bottom-Rack-Inlet(A1)
Delta: 37.96F, Top-Rack-Inlet(C1): 68F, Top-Rack-Inlet(C1): 22%, Top-Rack-Inlet(C1) Delta: 40.22F
This command checks two sensors for temperature, humidity and dewpoint delta. Temperature
WARNING above 80, CRITICAL above 90. Humidity WARNING above 50, CRITICAL above 80. Dewpoint Delta
WARNING if less than 10 and CRITICAL if less than 5.
Towers (Sentry3 Products)
Tower state and statistics are checked using the --tower directive. If specified with no arguments,
only the overall state of the tower(s) is checked. The ability to query a specific tower does not
exist at this time. If the 'noComm' state is encountered for a tower a WARNING state is generated.
This is likely only possible on a slave tower. If the master tower is in state 'noComm', I doubt you'd
get this far with it ;) If 'fanFail', 'overTemp' or 'nvmFail' states are encountered, the state is
returned as CRITICAL. The 'outOfBalance' state returns WARNING.
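The tower status rules above boil down to a lookup table. A Python sketch, assuming a "normal" status maps to OK (as the example output suggests) and that any status not listed here would be treated as UNKNOWN (an assumption, not documented behavior):

```python
# Tower status -> Nagios state, following the rules described above.
TOWER_STATE = {
    "normal": "OK",
    "noComm": "WARNING",
    "outOfBalance": "WARNING",
    "fanFail": "CRITICAL",
    "overTemp": "CRITICAL",
    "nvmFail": "CRITICAL",
}


def tower_result(status):
    """Return the Nagios state for a tower status string."""
    # Fallback for unlisted statuses is an assumption for this sketch.
    return TOWER_STATE.get(status, "UNKNOWN")


for status in ("normal", "noComm", "fanFail"):
    print(status, "->", tower_result(status))
```

Moving a status between WARNING and CRITICAL then only means editing one entry of the table.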
Various metrics from the tower can also be queried by passing them to the --tower directive as a
comma separated list. At the time of development, these metrics are only supported on PIPS units.
A regular SMART or SWITCHED CDU will likely not benefit from any of these enhancements. The plugin
will correctly identify the absence of these metrics if you attempt to query them. The metrics are:
VACapacity
ApparentPower
VACapacityUsed
ActivePower
Energy
LineFrequency
It is very important to note that the 'Status' checks are largely skipped when querying any of these
metrics. The 'fanFail' and 'overTemp' states are completely ignored. If the 'noComm' state is
encountered, the metric(s) are skipped and a state UNKNOWN is returned. Given this, to fully utilize
the features of this plugin one should ALWAYS have a service check using just '--tower'. It was not
logical to exit on WARNING/CRITICAL for a 'noComm' state multiple times (say, for instance if there
are separate service checks defined for every metric listed above).
The towers are identified similar to the T/H probes, in the form of NAME(ID): VALUE. These are all
configurable on the CDU itself. Typically, a circuit name would be used for a Tower name. Thresholds
are applied in a similar manner to the --temp and --humid checks. ORDER DOES MATTER. The order in
which the metrics are listed is the order in which the thresholds should be "chained". The same logic
applies to these thresholds, see the THRESHOLDS section for specifics.
Here are some examples:
$ check_cdu.pl -H 192.168.0.1 -C public --tower
OK - BLDG_ROOM_RACK, TowerA(A) Status: normal(0), TowerB(B) Status: normal(0)
$ check_cdu.pl -H 192.168.0.1 -C public --tower ApparentPower,ActivePower,VACapacityUsed --warning 1200,1000,30
OK - BLDG_ROOM_RACK, TowerA(A) ApparentPower: 993VA, TowerA(A) ActivePower: 939W, TowerA(A) VACapacityUsed: 9.1%,
TowerB(B) ApparentPower: 927VA, TowerB(B) ActivePower: 870W, TowerB(B) VACapacityUsed: 8.5%
(Check that ApparentPower does not exceed 1200VA, ActivePower does not exceed 1000W and the Capacity
used does not exceed 30%. If any of these scenarios occur, generate a WARNING)
$ check_cdu.pl -H 192.168.0.1 -C public --tower Energy --warning 10000 --critical 15000
OK - BLDG_ROOM_RACK, TowerA(A) Energy: 6654kWh, TowerB(B) Energy: 7658kWh
(If the kWh consumption of either tower exceeds 10,000 generate a WARNING. If it exceeds 15,000
generate a CRITICAL. Say you're in a co-lo paying for power utilization and your piggy bank will
run dry if you use too much power ...)
$ check_cdu.pl -H 192.168.0.1 -C public --tower VACapacity --warning 10800
WARNING - BLDG_ROOM_RACK, TowerA(A) VACapacity: -1VA
(This is a very bizarre but interesting scenario. I included VACapacity because it was there, but
who would logically check a static value such as the capacity of a tower? Well, it turns out that
this particular unit is slightly broken and the Capacity is -1. This should just provide some ideas
on why it may be useful to monitor things that otherwise wouldn't make sense)
Infeeds (Sentry3 Products)
Infeed state and statistics are checked using the --infeed directive. It is very similar to the --tower
check. If specified with no arguments, the infeed 'Status' and 'LoadStatus' objects are checked. The
ability to query a specific infeed does not exist at this time (and likely never will). The following
infeed Statuses will generate a WARNING:
noComm
offWait
onWait
off
reading
A CRITICAL will be generated if the Infeed has the following Status:
offError
onError
offFuse
onFuse
Likewise the LoadStatus object is checked for each infeed as well. A WARNING is generated for the
following LoadStatus conditions:
noComm
reading
loadLow
I wasn't sure what the 'reading' state means; it is also present across many other CDU
objects. There is a good chance it simply implies that the value is currently being
"read" or updated, and if so it will likely be ignored in future versions of the
plugin. The loadLow state must be determined by an internal CDU threshold; however,
this threshold isn't available via SNMP, so I left it alone. A CRITICAL is generated for the
other LoadStatus states:
notOn
loadHigh
overLoad
readError
Simple modifications to the code can be done to move these various Statuses between the CRITICAL
and WARNING states if desired, but it is not possible from the command line.
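Because each infeed carries two status objects, the two results have to be combined. The Python sketch below shows one way to do that by taking the more severe of the two. The state lists come from the text above; treating every unlisted value (such as 'on' or 'normal') as OK, and the function names, are assumptions for illustration.

```python
SEVERITY = {"OK": 0, "WARNING": 1, "CRITICAL": 2}

STATUS_WARNING = {"noComm", "offWait", "onWait", "off", "reading"}
STATUS_CRITICAL = {"offError", "onError", "offFuse", "onFuse"}
LOAD_WARNING = {"noComm", "reading", "loadLow"}
LOAD_CRITICAL = {"notOn", "loadHigh", "overLoad", "readError"}


def classify(value, warning_set, critical_set):
    """Map one status value to OK, WARNING or CRITICAL."""
    if value in critical_set:
        return "CRITICAL"
    if value in warning_set:
        return "WARNING"
    return "OK"  # assumption: anything unlisted is healthy


def infeed_result(status, load_status):
    """Return the more severe of the Status and LoadStatus results."""
    results = [
        classify(status, STATUS_WARNING, STATUS_CRITICAL),
        classify(load_status, LOAD_WARNING, LOAD_CRITICAL),
    ]
    return max(results, key=SEVERITY.get)


print(infeed_result("on", "normal"), infeed_result("off", "overLoad"))
```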
Similar to the --tower directive, many of these Status checks are skipped when querying specific
metrics from the infeed. If any metrics are provided to --infeed, the infeed Status is checked
for the 'noComm' status. If this is true, the plugin will append this to the UNKNOWN 'bucket'
and skip checking the metric. The following infeed metrics are currently supported:
PhaseVoltage *
Voltage
CapacityUsed *
Power
ApparentPower *
Energy *
LoadValue
PhaseCurrent *
CrestFactor *
PowerFactor *
* These metrics are only available on PIPS units.
The infeeds are identified similar to the T/H probes, in the form of NAME(ID): VALUE. These are all
configurable on the CDU itself. Typically, a circuit name would be used for an infeed name. Thresholds
are applied in a similar manner to the --temp and --humid checks. ORDER DOES MATTER. The order in
which the metrics are listed is the order in which the thresholds should be "chained". The same logic
applies to these thresholds, see the THRESHOLDS section for specifics.
A special note on PowerFactor: An unloaded infeed will typically report -0.01 for the Power Factor.
It does not seem logical to apply the provided threshold to this value. So if the Power Factor is
less than 0 the threshold is not used and the state is simply assumed to be 'OK'.
Some examples:
$ check_cdu.pl -H 192.168.0.1 -C public --infeed
OK - BLDG_ROOM_RACK, TowerA_InfeedA(AA) Status: on(1), TowerA_InfeedA(AA) LoadStatus: normal(0),
TowerA_InfeedB(AB) Status: on(1), TowerA_InfeedB(AB) LoadStatus: normal(0), TowerA_InfeedC(AC) Status:
on(1), TowerA_InfeedC(AC) LoadStatus: normal(0), TowerB_InfeedA(BA) Status: on(1), TowerB_InfeedA(BA)
LoadStatus: normal(0), TowerB_InfeedB(BB) Status: on(1), TowerB_InfeedB(BB) LoadStatus: normal(0),
TowerB_InfeedC(BC) Status: on(1), TowerB_InfeedC(BC) LoadStatus: normal(0)
(This is a basic infeed check for a master/slave 3 phase CDU. There are 6 infeeds total across both
towers, and two separate checks are performed (Status,LoadStatus) for each infeed. This is a lot of data)
$ check_cdu.pl -H 192.168.0.1 -C public --infeed LoadValue --warning 12 --critical 24
OK - BLDG_ROOM_RACK, TowerA_InfeedA(AA) LoadValue: 4.07A, TowerA_InfeedB(AB) LoadValue: 3.21A,
TowerA_InfeedC(AC) LoadValue: 1.62A, TowerB_InfeedA(BA) LoadValue: 3.61A, TowerB_InfeedB(BB) LoadValue:
2.76A, TowerB_InfeedC(BC) LoadValue: 1.73A
(This is a simple load/current check which applies a warning and critical threshold to the load of all 6
infeeds on a dual tower 3 phase CDU.)
$ check_cdu.pl -H 192.168.0.1 -C public --infeed ApparentPower,CapacityUsed --warning 1000,20
OK - BLDG_ROOM_RACK, TowerA_InfeedA(AA) ApparentPower: 673VA, TowerA_InfeedA(AA) CapacityUsed: 12.6%,
TowerA_InfeedB(AB) ApparentPower: 0VA, TowerA_InfeedB(AB) CapacityUsed: 10.5%, TowerA_InfeedC(AC)
ApparentPower: 317VA, TowerA_InfeedC(AC) CapacityUsed: 5.3%, TowerB_InfeedA(BA) ApparentPower: 575VA,
TowerB_InfeedA(BA) CapacityUsed: 12%, TowerB_InfeedB(BB) ApparentPower: 0VA, TowerB_InfeedB(BB) CapacityUsed:
8.9%, TowerB_InfeedC(BC) ApparentPower: 348VA, TowerB_InfeedC(BC) CapacityUsed: 5.7%
(Generate a warning if the ApparentPower of any infeed exceeds 1000VA, and generate a warning if the
Capacity Used exceeds 20% on any infeed)
PhaseVoltage and PhaseCurrent use the PhaseID instead of infeedID in the plugin output. Throughout our
testing, it has been difficult to ascertain a difference between PhaseVoltage and Voltage. There is
generally a considerable difference between PhaseCurrent and LoadValue, however it most likely makes
sense to only check one of these.
Enhanced Infeed checks (Sentry3 Products)
There are two additional metrics that can be checked with the '--infeed' directive. They are:
LoadImbalance
VoltageImbalance
These metrics are not provided directly by the CDU, rather they are computed internally by the plugin.
Please note, these special metrics are ONLY available on 3 phase units. Some versions of the CDU
firmware provide a '3-Phase Load Out-of-Balance Threshold' setting and the results are displayed on
the 'istat' menu. None of this information is provided via SNMP. Thresholds are required for either
of these computed metrics. Unlike the display in 'istat' only the out-of-balance infeed(s) will be
displayed, not infeeds across the entire tower. I used a basic 3 phase motor load phase imbalance
equation to generate the imbalance percentages for both Current and Voltage:
Percent imbalance = maximum deviation from average / average of three phases * 100
When an infeed is queried for either voltage or current imbalance, the plugin determines which tower
the infeed is a part of. All infeed values (voltage or current) for that tower are then averaged
together. The deviation from the average is then determined for this particular infeed, accommodating
either a negative or positive delta from the average. This is then divided by the average and
multiplied by 100 to determine the percent imbalance. This equation was pulled from the following
document:
http://support.fluke.com/educators/download/asset/2161031_b_w.pdf
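The per-infeed calculation described above can be sketched in Python as follows. This is an illustration of the formula, not the plugin's Perl code; the function name imbalance_percent is hypothetical, and returning 0 for a tower whose average is zero is an assumption to avoid dividing by zero.

```python
def imbalance_percent(tower_values, infeed_id):
    """Percent deviation of one infeed from the average of its tower.

    tower_values maps infeed ID -> load (A) or voltage (V) for every
    infeed on the same tower.
    """
    average = sum(tower_values.values()) / len(tower_values)
    if average == 0:
        return 0.0  # assumption: an unloaded tower counts as balanced
    deviation = abs(tower_values[infeed_id] - average)
    return deviation / average * 100


# Load values in amps for the three infeeds of one tower.
tower_a = {"AA": 4.07, "AB": 3.21, "AC": 1.62}
for infeed in tower_a:
    print(infeed, round(imbalance_percent(tower_a, infeed), 2))
```

With these sample loads the average is about 2.97A, so infeed AC at 1.62A deviates by roughly 45 percent while AB stays under 10 percent.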
An example invocation of this check would look like:
$ check_cdu.pl -H 192.168.0.1 -C public --infeed LoadImbalance --warning 20 --critical 30
CRITICAL - BLDG_ROOM_RACK, TowerA_InfeedA(AA) LoadImbalance: 39.07%, TowerA_InfeedC(AC)
LoadImbalance: 50.46%, TowerB_InfeedA(BA) LoadImbalance: 33.54%, TowerB_InfeedC(BC) LoadImbalance: 34.54%
(Generate a WARNING if the load imbalance of any infeed exceeds 20%, and a CRITICAL if the imbalance
exceeds 30%. Clearly, this is not a well balanced rack! Hence the need for such a check)
The same can be done for voltage, however the margins should be much, much smaller than load.
This can be useful to detect bad incoming power conditions. Unfortunately this only evaluates
an imbalance across the phases of a single tower. A more useful approach would be to judge
imbalance between two separate towers, and hence two separate feeds/circuits which could be
coming from two separate sources (ie. UPS/utility). Currently that functionality does not exist.
Here is an example:
$ check_cdu.pl -H 192.168.0.1 -C public --infeed VoltageImbalance --warning .5 --critical 2
WARNING - BLDG_ROOM_RACK, TowerB_InfeedB(BB) VoltageImbalance: 0.65%
(Generate a WARNING if the imbalance between voltages per infeed is greater than .5% and a
CRITICAL if the imbalance is greater than 2%)
Cords (Sentry4 Products)
Cord state and statistics are checked using the --cord directive. If specified with no arguments,
only the Status and State metrics of the cord(s) are checked. The ability to query a specific cord
does not exist at this time. If either object reports anything other than its normal state, a WARNING is
generated.
There are other status objects that can be queried in addition to Status and State. Check any number
of these objects by passing a comma separated list to the --cord directive. They do not accept
thresholds. The "normal" state of each metric is hard-coded (usually either "normal" or "on"). Here
is the full list of available "State" metrics:
State
Status
ActivePowerStatus
ApparentPowerStatus
PowerFactorStatus
OutOfBalanceStatus
Other non-state metrics can be queried in the same way, but require a threshold. At this time these
"metered" metrics do not honor any of the built-in thresholds available on the CDU. If you look in
the code, I am collecting any available Warning/Alarm metrics, but I have not coded in the ability to
use them. This is planned in a future version, I hope. If a metric is not available for some reason,
the plugin will identify this. Here are the cord metrics:
PowerCapacity
ActivePower
ApparentPower
PowerUtilized
PowerFactor
Energy
Frequency
OutOfBalance
It is very important to note that the 'Status' checks are largely skipped when querying any of these
metrics. If the 'noComm' state is encountered, the metric(s) are skipped and a state UNKNOWN is returned.
Given this, to fully utilize the features of this plugin one should ALWAYS have a service check using
just '--cord', or by specifying the "Status" checks explicitly.
The naming convention of the cords is very similar (identical) to how all the other resources are
identified in the system.
Thresholds are applied the same way as in other checks. ORDER DOES MATTER. The order in which the
metrics are listed is the order in which the thresholds should be "chained". See the THRESHOLDS section
for specifics.
Here are some examples:
$ check_cdu.pl -H 192.168.0.1 -C public --cord
OK - BLDG_ROOM_RACK, Master_Cord_A(AA) Status: normal(0) State: on(1), Link1_Cord_A(BA)
Status: normal(0) State: on(1)
$ check_cdu.pl -H 192.168.0.1 -C public --cord ActivePowerStatus,OutOfBalanceStatus
OK - BLDG_ROOM_RACK, Master_Cord_A(AA) ActivePowerStatus: normal(0), Master_Cord_A(AA)
OutOfBalanceStatus: normal(0), Link1_Cord_A(BA) ActivePowerStatus: normal(0), Link1_Cord_A(BA)
OutOfBalanceStatus: normal(0)
$ check_cdu.pl -H 192.168.0.1 -C public --cord ActivePower,PowerUtilized --warning 2500,20 --critical 4000,50
OK - BLDG_ROOM_RACK, Master_Cord_A(AA) ActivePower: 1442W, Master_Cord_A(AA) PowerUtilized: 8.1%,
Link1_Cord_A(BA) ActivePower: 1511W, Link1_Cord_A(BA) PowerUtilized: 8.2%
$ check_cdu.pl -H 192.168.0.1 -C public --cord PowerCapacity
WARNING - BLDG_ROOM_RACK, Link1_Cord_A(BA) PowerCapacity: -1VA
(This is a very bizarre but interesting scenario. I included PowerCapacity because it was there, but
who would logically check a static value such as the capacity of a cord? Well, it turns out that
this particular unit is slightly broken and the Capacity is -1. This should just provide some ideas
on why it may be useful to monitor things that otherwise wouldn't make sense)
Lines (Sentry4 Products)
Line state and statistics are checked using the --line directive. If specified with no arguments,
only the Status and State metrics of the line(s) are checked. The ability to query a specific line
does not exist at this time. If either object reports anything other than its normal state, a WARNING is
generated.
There are other status objects that can be queried in addition to Status and State. Check any number
of these objects by passing a comma separated list to the --line directive. They do not accept
thresholds. The "normal" state of each metric is hard-coded (usually either "normal" or "on"). Here
is the full list of available "State" metrics:
State
Status
CurrentStatus
Other non-state metrics can be queried in the same way, but require a threshold. At this time these
"metered" metrics do not honor any of the built-in thresholds available on the CDU. If you look in
the code, I am collecting any available Warning/Alarm metrics, but I have not coded in the ability to
use them. This is planned in a future version, I hope. If a metric is not available for some reason,
the plugin will identify this. Here are the line metrics:
CurrentCapacity
Current
CurrentUtilized
It is very important to note that the 'Status' checks are largely skipped when querying any of these
metrics. If the 'noComm' state is encountered, the metric(s) are skipped and a state UNKNOWN is returned.
Given this, to fully utilize the features of this plugin one should ALWAYS have a service check using
just '--line', or by specifying the "Status" checks explicitly.
The naming convention of the lines is very similar (identical) to how all the other resources are
identified in the system.
Thresholds are applied the same way as in other checks. ORDER DOES MATTER. The order in which the
metrics are listed is the order in which the thresholds should be "chained". See the THRESHOLDS section
for specifics.
Here are some examples:
$ check_cdu.pl -H 192.168.0.1 -C public --line
OK - BLDG_ROOM_RACK, AA:L1(AA1) Status: normal(0) State: on(1), AA:L2(AA2) Status: normal(0)
State: on(1), AA:L3(AA3) Status: normal(0) State: on(1), AA:N(AA4) Status: normal(0) State: on(1),
BA:L1(BA1) Status: normal(0) State: on(1), BA:L2(BA2) Status: normal(0) State: on(1), BA:L3(BA3)
Status: normal(0) State: on(1), BA:N(BA4) Status: normal(0) State: on(1)
$ check_cdu.pl -H 192.168.0.1 -C public --line CurrentStatus
OK - BLDG_ROOM_RACK, AA:L1(AA1) CurrentStatus: normal(0), AA:L2(AA2) CurrentStatus: normal(0),
AA:L3(AA3) CurrentStatus: normal(0), AA:N(AA4) CurrentStatus: normal(0), BA:L1(BA1) CurrentStatus:
normal(0), BA:L2(BA2) CurrentStatus: normal(0), BA:L3(BA3) CurrentStatus: normal(0), BA:N(BA4)
CurrentStatus: normal(0)
$ check_cdu.pl -H 192.168.0.1 -C public --line Current,CurrentUtilized --warning 5,40 --critical 10,95
OK - BLDG_ROOM_RACK, AA:L1(AA1) Current: 3.06A, AA:L1(AA1) CurrentUtilized: 9.5%, AA:L2(AA2)
Current: 2.23A, AA:L2(AA2) CurrentUtilized: 6.9%, AA:L3(AA3) Current: 2.1A, AA:L3(AA3)
CurrentUtilized: 6.5%, AA:N(AA4) Current: 1.05A, AA:N(AA4) CurrentUtilized: 3.2%, BA:L1(BA1)
Current: 3.18A, BA:L1(BA1) CurrentUtilized: 9.9%, BA:L2(BA2) Current: 2.36A, BA:L2(BA2)
CurrentUtilized: 7.3%, BA:L3(BA3) Current: 2.07A, BA:L3(BA3) CurrentUtilized: 6.4%, BA:N(BA4)
Current: 1.1A, BA:N(BA4) CurrentUtilized: 3.4%
Phases (Sentry4 Products)
Read the documentation for Cords and Lines. Phases are handled the same way.
Available "State" metrics:
State
Status
VoltageStatus
PowerFactorStatus
Reactance
Metered Metrics:
Voltage
VoltageDeviation
Current
CrestFactor
ActivePower
ApparentPower
PowerFactor
Energy
NOTE: Reactance is evaluated in terms of the following states:
unknown
capacitive
inductive
resistive
I opted to choose "capacitive" as the "OK" state. This may not work well in every environment. YMMV
Branches (Sentry4 Products)
Read the documentation for Cords and Lines. Branches are handled the same way.
Available "State" metrics:
State
Status
CurrentStatus
Metered Metrics:
CurrentCapacity
Current
CurrentUtilized
Contact Sensors
Contact Closure sensors (Dry Contacts) are available when the EMCU-1-1B unit is used. Each firmware
version and even each CDU type can enumerate the sensors differently, so the IDs have been "simplified"
for use in this plugin. Do not use E1, C1, etc. as the ID. Just use 1-4. The plugin figures the rest
out automagically. A state/status of "normal(0)" returns an OK. Anything else returns a WARNING. I
didn't bother to make this configurable, but you can hack the code yourself to change this if you want.
If you don't explicitly specify which IDs to query, the script looks at all four of them.
$ check_cdu.pl -H 192.168.0.1 -C public --contact 1,2
OK - BLDG_ROOM_RACK, FRONT_DOOR(B1): normal(0), REAR_DOOR(B2): normal(0)
Plugin Termination
Numerous scenarios exist where the plugin will exit abnormally. This could be due to user input error,
or failure to retrieve required SNMP data, etc. In all identifiable cases, the plugin will exit with an
UNKNOWN state and a descriptive message indicating the failure. Users should be aware that if all SNMP
calls fail, monitoring of the CDU may be effectively rendered useless if UNKNOWN states are not reported
(this is common). This is dissimilar to plugins like check_nrpe that exit CRITICAL if an SSL negotiation
error occurs!
Throughout the workflow of the plugin, metrics are evaluated against thresholds and the results are placed
into various 'buckets' reflecting OK, WARNING, CRITICAL and UNKNOWN states. At the end of the workflow,
reporting is done based upon the presence or absence of these buckets. If both CRITICAL and WARNING
conditions exist, they are BOTH reported in the plugin_output text, however the state is reported as
CRITICAL. An example of this can be seen in the following output:
$ check_cdu.pl -H 192.168.0.1 -C public --temp --humid --ths A1 --warning 16,30 --critical 20,40
CRITICAL - BLDG_ROOM_RACK, Bottom-Rack-Inlet_F31(A1): 43%, WARNING - Bottom-Rack-Inlet_F31(A1): 17C
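The bucket logic can be pictured with a short Python sketch. It is only an approximation of the behavior described above, not the plugin's Perl code. The documentation confirms that CRITICAL outranks WARNING and that both are printed; the position of UNKNOWN and OK in the ordering, and the function and variable names, are assumptions.

```python
EXIT_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}

# Assumption: only "CRITICAL before WARNING" is confirmed by the text;
# where UNKNOWN and OK fall in this ordering is a guess for the sketch.
PRECEDENCE = ("CRITICAL", "WARNING", "UNKNOWN", "OK")


def summarize(location, buckets):
    """Build (exit_code, plugin_output) from per-state buckets.

    buckets maps a state name to a list of "NAME(ID): VALUE" strings.
    """
    present = [state for state in PRECEDENCE if buckets.get(state)]
    if not present:
        raise ValueError("no results to report")
    parts = []
    for index, state in enumerate(present):
        items = ", ".join(buckets[state])
        if index == 0:
            # The location is printed once, after the leading state.
            parts.append(f"{state} - {location}, {items}")
        else:
            parts.append(f"{state} - {items}")
    return EXIT_CODES[present[0]], ", ".join(parts)
```

Fed one CRITICAL humidity reading and one WARNING temperature reading for sensor A1, this reproduces the output line shown above and returns exit code 2.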
Some options end up producing a large amount of output, and this could easily exceed what Nagios can
accept, or exceed character limits on various notification devices (maybe you're tweeting your
CDU status, for instance ;P). The '--oksummary' option exists to summarize the output for any type of
check being done. If all metrics being checked are in state 'OK', the output suppresses the specifics
of these metrics and simply reports 'N metrics are OK'. The version and location are also displayed
in the plugin_output.
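As a rough illustration of the summarizing idea, the Python sketch below collapses an all-OK result set into a single line. It is not the plugin's code, and the exact placement of the version and location in the real output is not shown in this documentation, so the string layout and the function name here are hypothetical.

```python
def format_ok_summary(location, version, states):
    """Return a one-line summary when every metric is OK, else None.

    states: list of per-metric state names, e.g. ["OK", "OK", "WARNING"].
    """
    if not states or any(state != "OK" for state in states):
        return None  # the summary only applies when everything is OK
    # Hypothetical layout: the documentation does not show the real string.
    return f"OK - {location}, {len(states)} metrics are OK (check_cdu {version})"
```

A caller would fall back to the normal detailed output whenever this returns None, so problem states are never hidden by the summary.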
INCOMPATIBILITIES
None. See Bugs.
BUGS AND LIMITATIONS
None.
If you experience any problems please contact me. (eric.schoeller coloradoDOTedu)
AUTHOR
Eric Schoeller (eric.schoeller coloradoDOTedu)
LICENCE AND COPYRIGHT
Copyright (c) 2013 Eric Schoeller (eric.schoeller coloradoDOTedu).
All rights reserved.
This module is free software; you can redistribute it and/or
modify it under the terms of the GNU General Public License.
See L.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.