re-check service after “down” machine comes “up”?

Locked
hymie
Posts: 5
Joined: Fri Jan 29, 2016 10:51 am

Post by hymie »

Greetings.

Let's say I have a nagios client. It has a scheduled downtime from 8am Monday through 8am Tuesday, and the machine is turned off for the entire duration.

There is a service check it performs once per day (check_interval 1440).

Let's say the check happens to be scheduled for 8pm. So at 8pm Monday, the check happens, and it fails, and nagios does nothing because of the scheduled downtime.

Tuesday at 7:50am. The machine comes back on. But that one service will remain in the CRITICAL state until 8pm when the next scheduled check happens.

Is there some way that I can tell nagios "Maintain the check interval 1440; but, if you see the machine go down and come back up, then force a re-check regardless of the interval" ?
rkennedy
Posts: 6579
Joined: Mon Oct 05, 2015 11:45 am

Post by rkennedy »

I believe what you're looking for is retry_interval. See the description below, from https://assets.nagios.com/downloads/nag ... tions.html:
retry_interval: This directive is used to define the number of "time units" to wait before scheduling a re-check of the hosts. Hosts are rescheduled at the retry interval when they have changed to a non-UP state. Once the host has been retried max_check_attempts times without a change in its status, it will revert to being scheduled at its "normal" rate as defined by the check_interval value. Unless you've changed the interval_length directive from the default value of 60, this number will mean minutes. More information on this value can be found in the check scheduling documentation.
Former Nagios Employee
hymie
Posts: 5
Joined: Fri Jan 29, 2016 10:51 am

Post by hymie »

I humbly believe you are mistaken.

As I understand it, retry_interval applies when the host or service first goes down ("changed to a non-UP state"): nagios checks again after that many minutes to see whether it is still down. For example, I have a check for the last "puppet" run. Puppet only runs every 30 minutes, successful or not. So if (say) puppet runs at 9am and 9:30am, and nagios detects a problem at 9:05am, there is no point in checking again at 9:10am or 9:15am. I would set the retry_interval to 30 so nagios will try again at 9:35am.

I want the opposite. The machine has just changed to the "UP" state. Now I want it to force a check of a particular service (or "of all services" would be acceptable), even if the normal check_interval has not yet expired.
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Post by tmcdonald »

hymie wrote: Is there some way that I can tell nagios "Maintain the check interval 1440; but, if you see the machine go down and come back up, then force a re-check regardless of the interval" ?
No, because by its very nature Nagios will not have any way to know the host went down and then came back up if it is not checking. You would need to be checking the machine much more frequently than every 24 hours if you want to catch this situation.
Former Nagios employee
hymie
Posts: 5
Joined: Fri Jan 29, 2016 10:51 am

Post by hymie »

tmcdonald wrote:
hymie wrote: Is there some way that I can tell nagios "Maintain the check interval 1440; but, if you see the machine go down and come back up, then force a re-check regardless of the interval" ?
No, because by its very nature Nagios will not have any way to know the host went down and then came back up if it is not checking. You would need to be checking the machine much more frequently than every 24 hours if you want to catch this situation.
It checks the machine regularly at whatever default schedule nagios sets (every 5 minutes, from what I can see). It is just this one service that is checked once every 24 hours.
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Post by ssax »

The only thing I can think of is to write your own event handler for that host. It would compare the host's current and previous states and, if they match your criteria, submit a forced check using one of the external commands below. The top one is likely what you'd use, but I posted them all just in case:

http://old.nagios.org/developerinfo/ext ... and_id=129
http://old.nagios.org/developerinfo/ext ... and_id=130
http://old.nagios.org/developerinfo/ext ... and_id=128
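
A minimal sketch of such an event handler, for illustration only. It assumes the handler is passed the standard $HOSTSTATE$, $LASTHOSTSTATE$ and $HOSTNAME$ macros as arguments. The command name, script path, SERVICE_DESC variable and function name are hypothetical, and the command file path is the one used elsewhere in this thread, so adjust it for your install.

```shell
#!/bin/sh
# Hypothetical host event handler: force a service re-check when the host recovers.
# Assumed wiring in the Nagios config (names and script path are illustrative):
#   define command {
#       command_name  recheck-on-recovery
#       command_line  /path/to/recheck_on_recovery.sh "$HOSTSTATE$" "$LASTHOSTSTATE$" "$HOSTNAME$"
#   }
#   and in the host definition:  event_handler  recheck-on-recovery

# External command file; this default is an assumption, override as needed.
NAGIOS_CMD_FILE="${NAGIOS_CMD_FILE:-/usr/local/nagios/var/rw/nagios.cmd}"
# Placeholder service description; replace with the real one.
SERVICE_DESC="${SERVICE_DESC:-<your_servicename_here>}"

recheck_on_recovery() {
    state=$1
    last_state=$2
    host=$3
    # Act only on a transition from a non-UP state back to UP.
    if [ "$state" = "UP" ] && [ "$last_state" != "UP" ]; then
        now=$(date +%s)
        printf '[%s] SCHEDULE_FORCED_SVC_CHECK;%s;%s;%s\n' \
            "$now" "$host" "$SERVICE_DESC" "$now" > "$NAGIOS_CMD_FILE"
    fi
}

# Run only when called with the three expected arguments.
if [ "$#" -eq 3 ]; then
    recheck_on_recovery "$1" "$2" "$3"
fi
```

If re-checking every service on the host is acceptable, the SCHEDULE_FORCED_HOST_SVC_CHECKS external command (arguments: host name, check time) could presumably be substituted for the single-service command.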
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Post by mcapra »

You might be able to hack together an event handler to say "if host recovers, schedule this service check":
https://assets.nagios.com/downloads/nag ... dlers.html

You could then have the script run by your event handler do something like this:

Code:

NAGIOS_CMD_SOCKET='/usr/local/nagios/var/rw/nagios.cmd'
# current Unix time, used as both the command timestamp and the check time
now=$(date +%s)
# schedule a forced service check for right now
/usr/bin/printf "[%lu] SCHEDULE_FORCED_SVC_CHECK;%s;%s;%s\n" \
                  "$now" \
                  "<your_hostname_here>" \
                  "<your_servicename_here>" \
                  "$now" > "$NAGIOS_CMD_SOCKET"
The above schedules an immediate service check. If your service takes a few minutes to get running, you might need to adjust the script accordingly.
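
One possible adjustment for a slow-starting service, as a rough sketch: push the check time (the last field of the command) a few minutes into the future instead of using the current time. The function name forced_check_cmd and the 300-second delay are assumptions for illustration only.

```shell
#!/bin/sh
# Sketch: build a SCHEDULE_FORCED_SVC_CHECK line whose check time is delayed,
# so a service that starts slowly has time to come up before it is checked.

# Arguments: host, service description, current epoch time, delay in seconds.
forced_check_cmd() {
    host=$1
    service=$2
    now=$3
    delay=$4
    check_time=$((now + delay))
    printf '[%s] SCHEDULE_FORCED_SVC_CHECK;%s;%s;%s\n' \
        "$now" "$host" "$service" "$check_time"
}

# Example use (hypothetical five-minute delay; command file path as in the script above):
# forced_check_cmd "<your_hostname_here>" "<your_servicename_here>" "$(date +%s)" 300 \
#     > /usr/local/nagios/var/rw/nagios.cmd
```

The first field stays at the current time because it is the timestamp of the command itself. Only the final field controls when the check runs.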
Former Nagios employee
https://www.mcapra.com/
hymie
Posts: 5
Joined: Fri Jan 29, 2016 10:51 am

Post by hymie »

mcapra wrote: You might be able to hack together an event handler to say "if host recovers, schedule this service check":
https://assets.nagios.com/downloads/nag ... dlers.html

An event handler is exactly what I needed and worked perfectly. Thank you both!
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Post by mcapra »

Awesome! Is it alright if we lock this thread and mark the issue as resolved?
Former Nagios employee
https://www.mcapra.com/