Greetings.
Let's say I have a nagios client. It has a scheduled downtime from 8am Monday through 8am Tuesday, and the machine is turned off for the entire duration.
There is a service check it performs once per day (check_interval 1440).
Let's say the check happens to be scheduled for 8pm. So at 8pm Monday, the check happens, and it fails, and nagios does nothing because of the scheduled downtime.
Tuesday at 7:50am. The machine comes back on. But that one service will remain in the CRITICAL state until 8pm when the next scheduled check happens.
Is there some way that I can tell nagios "Maintain the check interval 1440; but, if you see the machine go down and come back up, then force a re-check regardless of the interval" ?
re-check service after “down” machine comes “up”?
Re: re-check service after “down” machine comes “up”?
I believe what you're looking for is retry_interval. See the below description from https://assets.nagios.com/downloads/nag ... tions.html
retry_interval: This directive is used to define the number of "time units" to wait before scheduling a re-check of the hosts. Hosts are rescheduled at the retry interval when they have changed to a non-UP state. Once the host has been retried max_check_attempts times without a change in its status, it will revert to being scheduled at its "normal" rate as defined by the check_interval value. Unless you've changed the interval_length directive from the default value of 60, this number will mean minutes. More information on this value can be found in the check scheduling documentation.
Former Nagios Employee
Re: re-check service after “down” machine comes “up”?
I humbly believe you are mistaken.
retry_interval, from my understanding, is when the host or service first goes down ("changed to a non-UP state"), it will check again after that many minutes to see if it is still down. For example, I have a check for the last "puppet" run. Puppet only runs every 30 minutes, successful or not; so if (say) puppet runs at 9am and 9:30am, and nagios detects a problem at 9:05am, there is no point in checking again at 9:10am or 9:15am. I would set the retry_interval to 30 so nagios will try again at 9:35am.
I want the opposite. The machine has just changed to the "UP" state. Now I want it to force a check of a particular service (or "of all services" would be acceptable), even if the normal check_interval has not yet expired.
Re: re-check service after “down” machine comes “up”?
hymie wrote: Is there some way that I can tell nagios "Maintain the check interval 1440; but, if you see the machine go down and come back up, then force a re-check regardless of the interval"?
No, because by its very nature Nagios will not have any way to know the host went down and then came back up if it is not checking. You would need to be checking the machine much more frequently than every 24 hours if you want to catch this situation.
Former Nagios employee
Re: re-check service after “down” machine comes “up”?
tmcdonald wrote: No, because by its very nature Nagios will not have any way to know the host went down and then came back up if it is not checking. You would need to be checking the machine much more frequently than every 24 hours if you want to catch this situation.
It checks the machine regularly at whatever default schedule nagios sets (every 5 minutes, from what I can see). It is just this one service that is checked once every 24 hours.
Re: re-check service after “down” machine comes “up”?
The only thing that I can think of is for you to write your own event handler for that host that checks the current and previous status and, if they match your criteria, submits the forced check:
- The top one is likely what you'd use but I posted them all just in case.
http://old.nagios.org/developerinfo/ext ... and_id=129
http://old.nagios.org/developerinfo/ext ... and_id=130
http://old.nagios.org/developerinfo/ext ... and_id=128
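The "check the current and previous status" part could look roughly like the sketch below. It assumes the event handler's command definition passes the $HOSTSTATE$, $LASTHOSTSTATE$ and $HOSTSTATETYPE$ macros as arguments; the function name is_host_recovery is made up for illustration, and the exact criteria are a guess at what "matches your criteria" would mean here.

```shell
#!/bin/sh
# Hypothetical helper: succeed only when a host event looks like a recovery.
# Assumed arguments (from the command definition):
#   $1 = $HOSTSTATE$      current state (UP, DOWN, UNREACHABLE)
#   $2 = $LASTHOSTSTATE$  previous state
#   $3 = $HOSTSTATETYPE$  SOFT or HARD
is_host_recovery() {
    current_state=$1
    previous_state=$2
    state_type=$3

    # Ignore SOFT states, which are still being retried.
    [ "$state_type" = "HARD" ] || return 1
    # The host must be up now...
    [ "$current_state" = "UP" ] || return 1
    # ...and must have been in a problem state just before.
    case "$previous_state" in
        DOWN|UNREACHABLE) return 0 ;;
        *) return 1 ;;
    esac
}
```

The handler script would call is_host_recovery "$1" "$2" "$3" and only submit the forced check when it succeeds.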
Re: re-check service after “down” machine comes “up”?
You might be able to hack together an event handler to say "if host recovers, schedule this service check":
https://assets.nagios.com/downloads/nag ... dlers.html
You could then have the script run by your event handler do something like this:
Code:
NAGIOS_CMD_SOCKET='/usr/local/nagios/var/rw/nagios.cmd'
# schedule a forced service check
# (first timestamp: when the command is submitted; last field: requested check time)
/usr/bin/printf "[%lu] SCHEDULE_FORCED_SVC_CHECK;%s;%s;%s\n" \
"$(date +%s)" \
"<your_hostname_here>" \
"<your_servicename_here>" \
"$(date +%s)" | tee -a "$NAGIOS_CMD_SOCKET"
The above schedules an immediate service check. If your service takes a few minutes to get running, you might need to adjust the script accordingly.
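If the service needs a few minutes to start after the host comes back, one possible adjustment is to push the requested check time into the future instead of using the current time. The sketch below is only an illustration of that idea: build_forced_check, its delay argument and the overridable "now" argument are hypothetical names, not anything Nagios provides.

```shell
#!/bin/sh
# Hypothetical helper: print a SCHEDULE_FORCED_SVC_CHECK command line
# whose check time is "now + delay" seconds.
#   $1 = host name
#   $2 = service description
#   $3 = delay in seconds (optional, defaults to 0)
#   $4 = current epoch time (optional, defaults to date +%s)
build_forced_check() {
    host=$1
    service=$2
    delay=${3:-0}
    now=${4:-$(date +%s)}

    check_time=$((now + delay))
    printf '[%s] SCHEDULE_FORCED_SVC_CHECK;%s;%s;%s\n' \
        "$now" "$host" "$service" "$check_time"
}

# Example (not executed here): ask for a check five minutes from now.
# build_forced_check "<your_hostname_here>" "<your_servicename_here>" 300 \
#     > /usr/local/nagios/var/rw/nagios.cmd
```

Keeping the formatting in a function that only prints makes it easy to inspect the generated line before pointing it at the real command file.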
Former Nagios employee
https://www.mcapra.com/
Re: re-check service after “down” machine comes “up”?
mcapra wrote: You might be able to hack together an event handler to say "if host recovers, schedule this service check"
An event handler is exactly what I needed, and it worked perfectly. Thank you both!
Re: re-check service after “down” machine comes “up”?
Awesome! Is it alright if we lock this thread and mark the issue as resolved?
Former Nagios employee
https://www.mcapra.com/