Check for multiple devices (composite check)?
Posted: Wed Sep 17, 2014 3:22 pm
Hi all, been using Nagios since version 2, now using Nagios Core 4.0.7. I have 4.0.8 downloaded and plan to get that upgrade installed soon.
I have a need to do something I'm not sure I can describe clearly, so hopefully someone can interpret.
I have a number of hosts that I check, all in the hostgroup "internet". I'm not checking services on these per se; all I care about is whether I can reach (ping) them. But they go down from time to time, and a single host going down is not convincing evidence that my internet connection is in trouble. What would convince me would be if more than 80% of the hosts in the group became unreachable.
So I need a "composite" host, or perhaps a service, one that is composed of these hosts. It should go to a warning state if more than, say, 75% of them become unreachable according to the thresholds and timeouts already configured, and to critical if more than 80% of them go beyond the thresholds for response time and dropped packets.
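To make the thresholds concrete, here is a minimal Python sketch of the warning/critical decision described above. The function name and the message format are made up for illustration; only the 75% and 80% figures come from the post, and the numeric codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def composite_state(down, total, warn_pct=75.0, crit_pct=80.0):
    """Return (exit_code, message) for `down` unreachable hosts out of `total`.

    Hypothetical helper: "more than" is taken literally, so exactly 80%
    down is still only a warning.
    """
    if total <= 0:
        return UNKNOWN, "UNKNOWN: hostgroup has no members"
    pct = 100.0 * down / total
    if pct > crit_pct:
        code, label = CRITICAL, "CRITICAL"
    elif pct > warn_pct:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    message = "%s: %d of %d hosts unreachable (%.0f%%)" % (label, down, total, pct)
    return code, message
```

A plugin built on this would print the message and exit with the code, for example via `sys.exit(code)`.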
And here's why: all our notification is done by sending email to e.g. [email protected]. And if the internet connection is down, well, you get the idea.
Ideally, I'd like to manage what hosts constitute this check just by putting them in the magic hostgroup, without having to maintain some external list. We don't use any service groups at present.
What I can do, and have tested, is create a "call file" which describes an outbound telephone call: a call which dials a number and plays a recording. I drop the call file on our Asterisk (open source PBX) server and poof! It calls the support cell phones and plays the recording, stating that we think the internet connection might be in trouble and needs checking.
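For anyone unfamiliar with call files, a rough Python sketch of generating and dropping one follows. The channel string, the recording name, the retry values and both function names are placeholders, not my real setup; the keys (`Channel`, `MaxRetries`, `RetryTime`, `WaitTime`, `Application`, `Data`) are the usual Asterisk call-file fields. The file is written in a staging directory first and then renamed into the spool directory, so Asterisk never picks up a half-written file.

```python
import os
import tempfile


def build_call_file(channel, recording, max_retries=2, retry_time=60, wait_time=30):
    """Return the text of an Asterisk call file that plays `recording`."""
    lines = [
        "Channel: %s" % channel,          # who to dial, e.g. a trunk plus number
        "MaxRetries: %d" % max_retries,   # extra attempts after the first one
        "RetryTime: %d" % retry_time,     # seconds between attempts
        "WaitTime: %d" % wait_time,       # seconds to wait for an answer
        "Application: Playback",          # dialplan application to run on answer
        "Data: %s" % recording,           # sound file name, without extension
    ]
    return "\n".join(lines) + "\n"


def drop_call_file(text, spool_dir, staging_dir):
    """Write the call file in staging_dir, then rename it into spool_dir.

    Both directories are assumed to be on the same filesystem, because
    os.rename is only atomic (and only works) within one filesystem.
    """
    fd, tmp_path = tempfile.mkstemp(suffix=".call", dir=staging_dir)
    with os.fdopen(fd, "w") as handle:
        handle.write(text)
    final_path = os.path.join(spool_dir, os.path.basename(tmp_path))
    os.rename(tmp_path, final_path)
    return final_path
```

On a typical install the spool directory would be something like `/var/spool/asterisk/outgoing`, but that path is an assumption here and depends on how Asterisk was built.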
But I haven't figured out how to accomplish the composite check, with warning and critical thresholds.
What I did do was write a Perl script that parses the hosts.cfg file and creates a flat file listing all hosts that are in the specified hostgroup. I had some idea of writing a check that reads the flat file and, um, that's where I run out of ideas. What does it do then? Somehow ask Nagios for the status of all those hosts as of the last check? Or check them all itself, maybe by calling plug-ins in /usr/local/nagios/etc/libexec?
I did look at the command API and wrote a few scripts to write to the Nagios command pipe. I can do things like disable/enable notifications for all hosts in a hostgroup. But there's seemingly no way to use that interface to retrieve all members of a hostgroup for processing external to Nagios.
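One possible shape for the "ask Nagios for the last known status" idea, as an untested-in-production Python sketch: Nagios Core writes an object cache and a status file, and both are plain text. The sketch assumes the object cache contains `define hostgroup {` blocks with `hostgroup_name` and `members` lines, and that the status file contains `hoststatus {` blocks with `host_name=` and `current_state=` lines (0 is UP, anything else is counted as down). All function names are made up, and the file locations are left to the caller because they are set by `object_cache_file` and `status_file` in nagios.cfg.

```python
def hostgroup_members(objects_cache_text, group):
    """Return the member host names of `group` from object cache text."""
    members, block, in_block = [], {}, False
    for raw in objects_cache_text.splitlines():
        line = raw.strip()
        if line.startswith("define hostgroup"):
            in_block, block = True, {}
        elif in_block and line == "}":
            if block.get("hostgroup_name") == group:
                members = [m for m in block.get("members", "").split(",") if m]
            in_block = False
        elif in_block and line:
            parts = line.split(None, 1)  # directive name, then its value
            if len(parts) == 2:
                block[parts[0]] = parts[1].strip()
    return members


def host_states(status_text):
    """Return {host_name: current_state} from status file text."""
    states, block, in_block = {}, {}, False
    for raw in status_text.splitlines():
        line = raw.strip()
        if line == "hoststatus {":
            in_block, block = True, {}
        elif in_block and line == "}":
            if "host_name" in block and "current_state" in block:
                states[block["host_name"]] = int(block["current_state"])
            in_block = False
        elif in_block and "=" in line:
            key, _, value = line.partition("=")
            block[key] = value
    return states


def count_down(objects_cache_text, status_text, group):
    """Return (down, total) for the hostgroup; hosts with no status count as up."""
    members = hostgroup_members(objects_cache_text, group)
    states = host_states(status_text)
    down = sum(1 for m in members if states.get(m, 0) != 0)
    return down, len(members)
```

This would keep membership managed purely through the hostgroup, with no separate flat file, at the cost of depending on the on-disk format of two files that Nagios does not document as a stable interface.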
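For reference, the kind of thing those scripts do looks roughly like this in Python (mine are not in Python, so treat this as an illustration only). `ENABLE_HOSTGROUP_HOST_NOTIFICATIONS` and `DISABLE_HOSTGROUP_HOST_NOTIFICATIONS` are real external commands; the function names are made up, and the pipe path is passed in because it depends on `command_file` in nagios.cfg. Note that the pipe is write-only, which is exactly why it cannot be used to read hostgroup membership back out.

```python
import time


def hostgroup_notification_command(group, enable, now=None):
    """Format an external command line toggling host notifications for a hostgroup."""
    if enable:
        name = "ENABLE_HOSTGROUP_HOST_NOTIFICATIONS"
    else:
        name = "DISABLE_HOSTGROUP_HOST_NOTIFICATIONS"
    stamp = int(time.time() if now is None else now)
    # External command format: [epoch_seconds] COMMAND;argument
    return "[%d] %s;%s\n" % (stamp, name, group)


def send_command(line, pipe_path):
    """Write one command line to the Nagios command pipe at `pipe_path`."""
    with open(pipe_path, "w") as pipe:
        pipe.write(line)
```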
There has to be an easier way.
Thanks in advance for any help.