re-check service after “down” machine comes “up”?

hymie · Post by **hymie** » Thu Jul 21, 2016 9:12 am

Greetings.

Let's say I have a nagios client. It has a scheduled downtime from 8am Monday through 8am Tuesday, and the machine is turned off for the entire duration.

There is a service check it performs once per day (check_interval 1440).

Let's say the check happens to be scheduled for 8pm. So at 8pm Monday, the check happens and fails, and nagios does nothing because of the scheduled downtime.

Tuesday at 7:50am. The machine comes back on. But that one service will remain in the CRITICAL state until 8pm when the next scheduled check happens.

Is there some way that I can tell nagios "Maintain the check interval 1440; but, if you see the machine go down and come back up, then force a re-check regardless of the interval" ?

rkennedy · Post by **rkennedy** » Thu Jul 21, 2016 9:52 am

I believe what you're looking for is retry_interval. See the below description from https://assets.nagios.com/downloads/nag ... tions.html

retry_interval: This directive is used to define the number of "time units" to wait before scheduling a re-check of the hosts. Hosts are rescheduled at the retry interval when they have changed to a non-UP state. Once the host has been retried max_check_attempts times without a change in its status, it will revert to being scheduled at its "normal" rate as defined by the check_interval value. Unless you've changed the interval_length directive from the default value of 60, this number will mean minutes. More information on this value can be found in the check scheduling documentation.

hymie · Post by **hymie** » Thu Jul 21, 2016 10:05 am

I humbly believe you are mistaken.

As I understand it, retry_interval applies when the host or service first goes down ("changed to a non-UP state"): nagios checks again after that many minutes to see whether it is still down. For example, I have a check for the last "puppet" run. Puppet only runs every 30 minutes, successful or not; so if (say) puppet runs at 9am and 9:30am, and nagios detects a problem at 9:05am, there is no point in checking again at 9:10am or 9:15am. I would set the retry_interval to 30 so nagios will try again at 9:35am.

I want the opposite. The machine has just changed to the "UP" state. Now I want it to force a check of a particular service (or "of all services" would be acceptable), even if the normal check_interval has not yet expired.

tmcdonald · Post by **tmcdonald** » Thu Jul 21, 2016 10:16 am

hymie wrote:Is there some way that I can tell nagios "Maintain the check interval 1440; but, if you see the machine go down and come back up, then force a re-check regardless of the interval" ?

No, because by its very nature Nagios will not have any way to know the host went down and then came back up if it is not checking. You would need to be checking the machine much more frequently than every 24 hours if you want to catch this situation.

hymie · Post by **hymie** » Thu Jul 21, 2016 10:32 am

tmcdonald wrote:
hymie wrote:Is there some way that I can tell nagios "Maintain the check interval 1440; but, if you see the machine go down and come back up, then force a re-check regardless of the interval" ?
No, because by its very nature Nagios will not have any way to know the host went down and then came back up if it is not checking. You would need to be checking the machine much more frequently than every 24 hours if you want to catch this situation.

It checks the machine regularly at whatever default schedule nagios sets (every 5 minutes, from what I can see). It is just this one service that is checked once every 24 hours.

ssax · Post by **ssax** » Thu Jul 21, 2016 12:25 pm

The only thing that I can think of is for you to write your own event handler for that host that checks the current and previous status and, if they match your criteria, submits the forced check:
- The top one is likely what you'd use but I posted them all just in case.

http://old.nagios.org/developerinfo/ext ... and_id=129
http://old.nagios.org/developerinfo/ext ... and_id=130
http://old.nagios.org/developerinfo/ext ... and_id=128
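A minimal sketch of what such a handler might look like, assuming the host's event handler command passes `$HOSTSTATE$`, `$HOSTSTATETYPE$`, `$HOSTNAME$` and, optionally, a service description as arguments. The argument order, the function names, and the `RECHECK_CMD_FILE` override are all hypothetical choices for illustration. It also assumes the command file lives at the default source-install path. With no service description it falls back to `SCHEDULE_FORCED_HOST_SVC_CHECKS`, which asks for a forced check of every service on the host.

```shell
#!/bin/sh
# Hypothetical host event handler; assumed invocation:
#   handler.sh "$HOSTSTATE$" "$HOSTSTATETYPE$" "$HOSTNAME$" ["service description"]

# External command file (a named pipe). RECHECK_CMD_FILE is a made-up override.
CMD_FILE="${RECHECK_CMD_FILE:-/usr/local/nagios/var/rw/nagios.cmd}"

# True only for a confirmed (HARD) transition to UP.
is_hard_recovery() {
    # $1 = host state (UP, DOWN, UNREACHABLE), $2 = state type (SOFT, HARD)
    [ "$1" = "UP" ] && [ "$2" = "HARD" ]
}

# Print one external command line that forces a check at time $1.
forced_check_line() {
    # $1 = epoch seconds, $2 = host name, $3 = service description (may be empty)
    if [ -n "$3" ]; then
        printf '[%s] SCHEDULE_FORCED_SVC_CHECK;%s;%s;%s\n' "$1" "$2" "$3" "$1"
    else
        # No service given: force a check of all services on the host.
        printf '[%s] SCHEDULE_FORCED_HOST_SVC_CHECKS;%s;%s\n' "$1" "$2" "$1"
    fi
}

# Submit the forced check only when the host has just recovered.
handle_event() {
    # $1 = host state, $2 = state type, $3 = host name, $4 = optional service
    if is_hard_recovery "$1" "$2"; then
        forced_check_line "$(date +%s)" "$3" "${4:-}" >> "$CMD_FILE"
    fi
}

# Run only when invoked with handler arguments.
if [ "$#" -ge 3 ]; then
    handle_event "$@"
fi
```

Filtering on the HARD state type is a design choice in this sketch: it skips brief SOFT blips, while a recovery from a confirmed DOWN state (such as the powered-off machine described above) still triggers the re-check.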

Post by **mcapra** » Thu Jul 21, 2016 12:27 pm

You might be able to hack together an event handler to say "if host recovers, schedule this service check":
https://assets.nagios.com/downloads/nag ... dlers.html

You could then have the script run by your event handler do something like this:

# Path to the Nagios external command file (a named pipe)
NAGIOS_CMD_SOCKET='/usr/local/nagios/var/rw/nagios.cmd'
# schedule a forced service check: [now] COMMAND;host;service;check_time
/usr/bin/printf "[%lu] SCHEDULE_FORCED_SVC_CHECK;%s;%s;%s\n" \
                  "$(date +%s)" \
                  "<your_hostname_here>" \
                  "<your_servicename_here>" \
                  "$(date +%s)" | tee -a "$NAGIOS_CMD_SOCKET"

The above schedules an immediate service check. If your service takes a few minutes to get running, you might need to adjust the script accordingly.
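If the service does need time to start, one possible adjustment is to set the check time (the last field of the command) some seconds into the future while leaving the bracketed submission timestamp at "now". The sketch below assumes a five-minute delay. `DELAY_SECONDS`, `NAGIOS_CMD_FILE` and `delayed_check_line` are hypothetical names introduced here, not part of Nagios.

```shell
#!/bin/sh
# Sketch only: DELAY_SECONDS and delayed_check_line are hypothetical names.
DELAY_SECONDS="${DELAY_SECONDS:-300}"   # assumed: wait 5 minutes after recovery
NAGIOS_CMD_FILE='/usr/local/nagios/var/rw/nagios.cmd'

# Print a forced-check command whose check time lies $4 seconds after $3.
delayed_check_line() {
    # $1 = host name, $2 = service description,
    # $3 = current time (epoch seconds), $4 = delay in seconds
    printf '[%s] SCHEDULE_FORCED_SVC_CHECK;%s;%s;%s\n' \
        "$3" "$1" "$2" "$(( $3 + $4 ))"
}

# Submit only when given a host and a service and the command pipe exists.
if [ "$#" -eq 2 ] && [ -p "$NAGIOS_CMD_FILE" ]; then
    delayed_check_line "$1" "$2" "$(date +%s)" "$DELAY_SECONDS" >> "$NAGIOS_CMD_FILE"
fi
```

The right delay depends on how long the service takes to come up after boot, so treat the 300 seconds as a placeholder to tune.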

hymie · Post by **hymie** » Fri Jul 22, 2016 6:56 am

mcapra wrote:You might be able to hack together an event handler to say "if host recovers, schedule this service check":
https://assets.nagios.com/downloads/nag ... dlers.html

You could then have the script run by your event handler do something like this:
NAGIOS_CMD_SOCKET='/usr/local/nagios/var/rw/nagios.cmd'
# schedule a service check
/usr/bin/printf "[%lu] SCHEDULE_FORCED_SVC_CHECK;%s;%s;%s\n" \
                  $(date +%s) \
                  "<your_hostname_here>" \
                  "<your_servicename_here>" \
                  $(date +%s) | tee -a $NAGIOS_CMD_SOCKET
The above schedules an immediate service check. If your service takes a few minutes to get running, you might need to adjust the script accordingly.

An event handler is exactly what I needed and worked perfectly. Thank you both!

Post by **mcapra** » Fri Jul 22, 2016 9:13 am

Awesome! Is it alright if we lock this thread and mark the issue as resolved?

Nagios Support Forum
