a question about services.cfg
Posted: Mon Sep 29, 2014 4:31 pm
by smcracraft
I still don't understand how best to use
check_interval
retry_interval
max_check_attempts
to measure something over a period so that it doesn't
massively alarm unnecessarily.
Can someone explain it in plain English non-geek-speak?
Re: a question about services.cfg
Posted: Mon Sep 29, 2014 4:38 pm
by tmcdonald
check_interval:
Alright, so I have this thing I want to monitor. As long as it is checking out alright, I only want to check it every 5 minutes. So I'll set "check_interval 5" in that service's config file.
retry_interval:
Sometimes things don't check out alright, maybe there is a problem that needs to be looked into. If I check this thing and there *is* a problem, I want to re-check it every minute since it's important now, so I'll set "retry_interval 1".
max_check_attempts:
Just in case the thing turns out to be a temporary problem, I want to re-check it a few times before deciding it really is something to worry about and starting to send alert emails. 3 times should be enough (remembering that it will re-check every 1 minute, as per retry_interval) so I will set "max_check_attempts 3".
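To put numbers on those three settings, here is a rough sketch (not from the posts in this thread) of how long it takes for a failure to turn into an alert. It assumes the default interval_length of 60 seconds, so the values are minutes, that the first failed check counts as attempt 1, and that alerts go out as soon as the last attempt fails. The function name is made up for illustration.

```python
def minutes_until_alert(check_interval, retry_interval, max_check_attempts):
    """Return (after_first_failure, worst_case) in minutes.

    Hypothetical helper, assuming interval_length = 60 (values are minutes)
    and that the first failed check counts as attempt 1.
    """
    # Attempt 1 is the failed regular check; the remaining attempts
    # are re-checks spaced retry_interval apart.
    after_first_failure = (max_check_attempts - 1) * retry_interval
    # The problem may start right after a good check, so it can go
    # unnoticed for up to one full check_interval before attempt 1.
    worst_case = check_interval + after_first_failure
    return after_first_failure, worst_case


print(minutes_until_alert(5, 1, 3))  # (2, 7)
```

With the values above, the alert comes 2 minutes after the first failed check, and at most about 7 minutes after the thing actually broke. Raising max_check_attempts or retry_interval makes alerts slower but less jumpy.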
Re: a question about services.cfg
Posted: Mon Sep 29, 2014 5:13 pm
by eloyd
I give a Nagios training presentation that talks about the "Nagios Timeline." Basically, take what Trevor just said and turn it into a lot of text and one picture:
If a service is OK it is in a HARD state.
In normal OK state, checks are performed at check_interval intervals.
If it then becomes non-OK, it is in a SOFT state.
Further checks are made at a decreased interval (retry_interval)
Services remain in a SOFT state until they have had max_check_attempts successive non-OK attempts.
At that point, they are in a HARD state equal to the last status (HARD WARNING or HARD CRITICAL).
Further checks are made at the normal interval from now on (check_interval).
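Services that are in a non-OK HARD state that then become OK go straight to a HARD OK state (a hard recovery).
Services that become OK while still in a SOFT non-OK state are placed into a SOFT OK state (a soft recovery). The next check will result in a HARD OK if it is an OK status, or a SOFT version of a non-OK state, and this process repeats.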
This allows for “instantaneous” outages that don’t immediately trigger notifications or event handlers.
The picture shows the normal check_interval until something goes wrong, then a max_check_attempts of 3 being checked at the reduced retry_interval until the state goes HARD, then it changes back to check_interval until it goes OK.

- Nagios Timeline.png
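For anyone who can't see the picture, the same timeline can be approximated with a small simulation. This is only a sketch under stated assumptions, not how Nagios itself schedules checks: times are in minutes, checks always run exactly on schedule, and the function name and the list of results are invented for the example. Recoveries are simply shown as OK, without a SOFT or HARD label.

```python
def timeline(results, check_interval=5, retry_interval=1, max_check_attempts=3):
    """Return a list of (minute, result, state) rows, one per check.

    Hypothetical sketch: `results` is the sequence of check outcomes in
    order; non-OK rows are labeled SOFT or HARD with the attempt number.
    """
    rows = []
    minute = 0
    attempt = 0  # consecutive non-OK results so far
    for result in results:
        if result == "OK":
            attempt = 0
            state = "OK"
        else:
            attempt = min(attempt + 1, max_check_attempts)
            kind = "HARD" if attempt == max_check_attempts else "SOFT"
            state = f"{kind} {attempt}/{max_check_attempts}"
        rows.append((minute, result, state))
        # Only a SOFT problem is re-checked at the shorter retry_interval.
        soft = state.startswith("SOFT")
        minute += retry_interval if soft else check_interval
    return rows


checks = ["OK", "OK", "CRITICAL", "CRITICAL", "CRITICAL", "CRITICAL", "OK"]
for row in timeline(checks):
    print(row)
```

With the defaults, the checks land at minutes 0, 5, 10, 11, 12, 17 and 22: two normal checks, a first failure at minute 10 (SOFT 1/3), two quick re-checks, HARD 3/3 at minute 12, and then back to the 5-minute spacing until the service is OK again.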
Re: a question about services.cfg
Posted: Tue Sep 30, 2014 10:24 am
by slansing
Hopefully this helps answer the OP's question. Thanks for the tips, guys. Let us know if you have further questions, smcracraft!