
Some service checks to run over 1 minutes

Posted: Mon Sep 19, 2016 9:49 am
by dlukinski
Hello XI support

Need help with this one. Please explain (in detail if possible) how we could let some service checks run well over 1 minute before they time out.
We have an existing environment with over 500 hosts and 6.5k+ checks.

However, after integration with SELENIUM, the Q/A programmers are telling us that checks under 1 minute are completely unrealistic (they have to be a lot longer, if not double digits).
- So is there a way to let only SOME service checks run longer than 1 minute (so that they would not time out)?


If not, and this is a global variable only, what should we take into consideration when changing it?
Of course we have many retries configured to happen every minute or every two minutes (some are w/o templates).
Even if every 1 minute could be changed to every 2, many checks run every 3-5 min and have to stay that way.
Therefore we are really unsure how to approach global variable changes, if they are required.
------------------------------------------------------------------------------------------------
Maybe we should make a ticket out of it?

Would this approach be correct?
- https://deadlockprocess.wordpress.com/2 ... tosrhel-5/

Re: Some service checks to run over 1 minutes

Posted: Mon Sep 19, 2016 10:35 am
by gormank
Look at the service definitions and see what command runs them. If it's check_nrpe, for example, you can create a check_nrpe_long (or whatever) command and use a longer timeout, or make the timeout part of the list of arguments.

I'd guess the timeout needs to be less than the check interval, or there will be problems.
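
As a rough way to sanity-check that guess, here is a small shell sketch. It assumes check intervals are whole numbers of time units and that one unit is 60 seconds (the usual interval_length); the function name timeout_fits_interval is made up for illustration.

```shell
# Succeed (exit 0) when a check's timeout is shorter than its schedule interval.
# Arguments: timeout in seconds, check interval in time units,
# seconds per time unit (assumed to be 60 when omitted).
timeout_fits_interval() {
    timeout_s=$1
    check_interval=$2
    interval_length=${3:-60}
    [ "$timeout_s" -lt $((check_interval * interval_length)) ]
}

# Hypothetical numbers: a 300-second timeout on a check scheduled every 5 units.
if timeout_fits_interval 300 5; then
    echo "ok"
else
    echo "timeout too long for interval"
fi
# prints "timeout too long for interval" (300 is not less than 5 * 60)
```

The same call with a 10-unit interval succeeds, since 300 is less than 600.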

Re: Some service checks to run over 1 minutes

Posted: Mon Sep 19, 2016 10:39 am
by dlukinski
gormank wrote:
Look at the service definitions and see what command runs them. If it's check_nrpe, for example, you can create a check_nrpe_long (or whatever) command and use a longer timeout, or make the timeout part of the list of arguments.

I'd guess the timeout needs to be less than the check interval, or there will be problems.

This does not work so far:

(Service check timed out after 60.01 seconds) with $USER1$/check_selenium -t 300 --script=$USER1$/$ARG1$

Re: Some service checks to run over 1 minutes

Posted: Mon Sep 19, 2016 10:43 am
by gormank
# grep 60 /usr/local/nagios/etc/nagios.cfg
host_freshness_check_interval=60
interval_length=60
max_check_result_file_age=3600
retention_update_interval=60
service_check_timeout=60
service_freshness_check_interval=60

Re: Some service checks to run over 1 minutes

Posted: Mon Sep 19, 2016 10:51 am
by dlukinski
gormank wrote:# grep 60 /usr/local/nagios/etc/nagios.cfg
host_freshness_check_interval=60
interval_length=60
max_check_result_file_age=3600
retention_update_interval=60
service_check_timeout=60
service_freshness_check_interval=60

Which would impact all service checks. Is that OK?
- Does it mean we have to specify timeouts manually for the rest of them to avoid the default values?

Re: Some service checks to run over 1 minutes

Posted: Mon Sep 19, 2016 11:03 am
by gormank
You need to look at your service definitions, as I suggested earlier, to answer those questions...
What timeouts are defined?

Re: Some service checks to run over 1 minutes

Posted: Mon Sep 19, 2016 11:09 am
by tgriep
Most plugins should have a default timeout of their own, so increasing the system-wide service timeout value will not affect them.
You would have to check the nagios.log file for the 60-second service timeouts that you are currently getting, then edit those checks and add a timeout to them.
Then, when you increase the system-wide timeout, those checks will not take the longer time to time out.
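
A minimal sketch of that log check follows. It assumes the log lives at /usr/local/nagios/var/nagios.log and that timeout messages contain the words "timed out"; both are assumptions to adjust for your install, and the function names are made up for illustration.

```shell
# Print every log line that mentions a timeout.
# Argument: path to the Nagios log (the default path is an assumption).
timed_out_checks() {
    log_file=${1:-/usr/local/nagios/var/nagios.log}
    grep -i 'timed out' "$log_file"
}

# Print only the number of such lines.
count_timed_out_checks() {
    log_file=${1:-/usr/local/nagios/var/nagios.log}
    grep -ci 'timed out' "$log_file"
}
```

Running timed_out_checks with no argument reads the assumed default log; pass a path to read a rotated or archived log instead.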

Re: Some service checks to run over 1 minutes

Posted: Mon Sep 19, 2016 11:11 am
by dlukinski
gormank wrote:
You need to look at your service definitions, as I suggested earlier, to answer those questions...
What timeouts are defined?

In production it is 60 sec.

What we are trying to understand is the impact on all other checks with 1-2 min retries and 5-10 min run frequency, where specifically and only SELENIUM may require 10-30 min timeouts.

Re: Some service checks to run over 1 minutes

Posted: Mon Sep 19, 2016 11:20 am
by dlukinski
tgriep wrote:
Most plugins should have a default timeout of their own, so increasing the system-wide service timeout value will not affect them.
You would have to check the nagios.log file for the 60-second service timeouts that you are currently getting, then edit those checks and add a timeout to them.
Then, when you increase the system-wide timeout, those checks will not take the longer time to time out.

So if I get this right,

1. We increase the system-wide value to whatever we need.
2. Monitor logs: pretty much 1 script, check_selenium (we may add more of them), where long timeouts would be required.
3. All other checks should not take longer to time out after the increase is made; what about all the subsequent checks created post-increase?

Re: Some service checks to run over 1 minutes

Posted: Mon Sep 19, 2016 11:37 am
by tgriep
None of the other checks will take longer to run. The system-wide timeout setting is there in case someone writes a plugin that doesn't have a timeout built in; it keeps such plugins from running too long.
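
To put numbers on that, here is a small shell sketch. It assumes a check is stopped by whichever limit is reached first, the plugin's own timeout or service_check_timeout, which matches the 60.01-second timeout reported earlier in this thread despite -t 300. The function name effective_timeout, the 1800-second global value, and the 10-second plugin default are assumptions for illustration.

```shell
# Print the timeout that actually applies to a check: the smaller of the
# plugin's own timeout and the system-wide service_check_timeout (seconds).
effective_timeout() {
    plugin_timeout=$1
    service_check_timeout=$2
    if [ "$plugin_timeout" -lt "$service_check_timeout" ]; then
        echo "$plugin_timeout"
    else
        echo "$service_check_timeout"
    fi
}

effective_timeout 300 60     # check_selenium -t 300 under the current global 60: prints 60
effective_timeout 300 1800   # same check after raising the global value: prints 300
effective_timeout 10 1800    # a plugin with a 10-second timeout is unaffected: prints 10
```

Under that assumption, only checks whose plugin timeout exceeds the old global value behave differently after the increase.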