a question about services.cfg

Post by **smcracraft** » Mon Sep 29, 2014 4:31 pm

I still don't understand how best to use

check_interval
retry_interval
max_check_attempts

to measure something over a period so that it doesn't
massively alarm unnecessarily.

Can someone explain it in plain English non-geek-speak?

Post by **tmcdonald** » Mon Sep 29, 2014 4:38 pm

check_interval:

Alright, so I have this thing I want to monitor. As long as it is checking out alright, I only want to check it every 5 minutes. So I'll set "check_interval 5" in that service's config file.

retry_interval:

Sometimes things don't check out alright, maybe there is a problem that needs to be looked into. If I check this thing and there *is* a problem, I want to re-check it every minute since it's important now, so I'll set "retry_interval 1".

max_check_attempts:

Just in case the problem turns out to be temporary, I want to re-check it a few times before deciding it really is something to worry about and sending alert emails. Three checks in total should be enough: the first failed check counts as attempt 1, followed by two re-checks one minute apart (as per retry_interval). So I will set "max_check_attempts 3".
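To put numbers on the three settings, here is a rough back-of-the-envelope sketch in Python. It is not anything Nagios itself runs. The function names are made up for illustration, and it assumes the default interval_length of 60 seconds (so the intervals are in minutes), that the first failed check counts as attempt 1, and that check run time and scheduling delays are negligible.

```python
def minutes_to_hard_state(retry_interval, max_check_attempts):
    """Minutes from the first failed check to the check that confirms the problem."""
    # The first failed check is attempt 1, so only
    # (max_check_attempts - 1) re-checks follow it.
    return (max_check_attempts - 1) * retry_interval


def worst_case_minutes_to_alert(check_interval, retry_interval, max_check_attempts):
    """Upper bound if the problem starts just after a successful check."""
    # Up to one full check_interval can pass before the first failed check.
    return check_interval + minutes_to_hard_state(retry_interval, max_check_attempts)


# Values from the post above: check_interval 5, retry_interval 1, max_check_attempts 3.
print(minutes_to_hard_state(1, 3))           # 2
print(worst_case_minutes_to_alert(5, 1, 3))  # 7
```

So with these values a real outage is confirmed about 2 minutes after it is first noticed, and at most roughly 7 minutes after it begins. Raising max_check_attempts or retry_interval makes alerts less jumpy but slower.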

Post by **eloyd** » Mon Sep 29, 2014 5:13 pm

I give a Nagios training presentation that talks about the "Nagios Timeline." Basically, take what Trevor just said and turn it into a lot of text and one picture:

If a service is OK it is in a HARD state.
In normal OK state, checks are performed at check_interval intervals.
If it then becomes non-OK, it is in a SOFT state.
  Further checks are made at a decreased interval (retry_interval).
Services remain in a SOFT state until they have had max_check_attempts successive non-OK attempts.
  At that point, they are in a HARD state equal to the last status (HARD WARNING or HARD CRITICAL).
  Further checks are made at the normal interval from now on (check_interval).
Services that are in a non-OK HARD state that then become OK go straight to a HARD OK state (a hard recovery).
  Services that become OK while still in a SOFT non-OK state go to a SOFT OK state (a soft recovery); the next check results in a HARD OK if it is OK, or a SOFT non-OK state otherwise, and this process repeats.

This allows for “instantaneous” outages that don’t immediately trigger notifications or event handlers.

The picture shows checks running at the normal check_interval until something goes wrong, then a max_check_attempts of 3 being checked at the reduced retry_interval until the problem becomes HARD, then checks returning to check_interval until the service goes OK.

[Attachment: Nagios Timeline.png]
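For anyone who can't see the picture, the same timeline can be replayed with a small Python sketch. This is only an illustration of the rules described above, not Nagios code. The `simulate` function and its row layout are made up for this example, intervals are assumed to be in minutes, and it only labels problem states as SOFT or HARD (an OK result is simply labeled OK).

```python
def simulate(results, check_interval=5, retry_interval=1, max_check_attempts=3):
    """Replay a list of check results.

    Returns one row per check: (minute, status, state, notify).
    """
    rows = []
    minute = 0
    attempt = 0         # consecutive non-OK results so far
    confirmed = False   # True once the problem has reached a HARD state
    for status in results:
        notify = False
        if status == "OK":
            attempt = 0
            confirmed = False
            state = "OK"
            delay = check_interval
        else:
            attempt = min(attempt + 1, max_check_attempts)
            if attempt < max_check_attempts:
                state = "SOFT"
                delay = retry_interval   # re-check sooner while unsure
            else:
                state = "HARD"
                delay = check_interval   # back to the normal schedule
                notify = not confirmed   # flag only the first HARD result
                confirmed = True
        rows.append((minute, status, state, notify))
        minute += delay
    return rows


checks = ["OK", "CRITICAL", "CRITICAL", "CRITICAL", "CRITICAL", "OK"]
for row in simulate(checks):
    print(row)
```

With the defaults used here (5, 1, 3), the checks land at minutes 0, 5, 6, 7, 12 and 17, and the only row flagged for notification is the third CRITICAL at minute 7. A blip such as `["OK", "CRITICAL", "OK"]` never leaves the SOFT state, so nothing is flagged.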

Post by **slansing** » Tue Sep 30, 2014 10:24 am

Hopefully this helps answer the OP's question. Thanks for the tips, guys. Let us know if you have further questions, smcracraft!
