Interpreting the change service_check_timeout

Posted: Sun Oct 18, 2015 10:45 pm
by nagmoto
Hi
As seen in the attached screenshot, around 10:30 am today service_check_timeout was increased from the default 60 seconds to 200.
After this change, the number of active service checks dropped sharply.
Can you help me understand why?

Re: Interpreting the change service_check_timeout

Posted: Mon Oct 19, 2015 12:18 am
by Box293
How is this graph being generated? Is this a service for localhost? If so, what plugin is used?

Re: Interpreting the change service_check_timeout

Posted: Mon Oct 19, 2015 5:14 am
by nagmoto
How is this graph being generated?

MRTG graphs the output of the /usr/bin/nagiostats command.

[root@nagios ~]# egrep -v '^#|^$'  /etc/nagios/mrtg.cfg
WorkDir: /usr/share/nagios/html/stats
<snipped>
Target[nagios-j]: `/usr/bin/nagiostats --mrtg --data=NUMSACTSVCCHECKS5M,NUMOACTSVCCHECKS5M,PROGRUNTIME,NAGIOSVERPID`
MaxBytes[nagios-j]: 7000
Title[nagios-j]: Active Service Checks
PageTop[nagios-j]: <H1>Active Service Checks</H1>
Options[nagios-j]: growright,gauge,nopercent
YLegend[nagios-j]: Checks
ShortLegend[nagios-j]:  
LegendI[nagios-j]:  Scheduled Checks:
LegendO[nagios-j]:  On-Demand Checks:
<snipped>
[root@nagios ~]#

Is this a service for localhost?

Yes, on localhost. The system is CentOS 6.x, using the vendor-packaged Nagios 3.5.1.

If so, what plugin is used?

I followed page 275 of Tom Ryder's book, "Nagios Core Administration Cookbook".
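
For anyone who wants to see the raw numbers behind the graph, here is a minimal Python sketch that runs the same nagiostats command as the Target[nagios-j] line above and labels each value. It assumes that in --mrtg mode nagiostats prints one value per line, in the order requested. The function names are made up for illustration.

```python
import subprocess

# Variable names copied from the Target[nagios-j] line in mrtg.cfg.
MRTG_VARS = ["NUMSACTSVCCHECKS5M", "NUMOACTSVCCHECKS5M", "PROGRUNTIME", "NAGIOSVERPID"]


def parse_mrtg_output(text, names=MRTG_VARS):
    """Pair each output line with the variable requested in the same position.

    Assumption: nagiostats --mrtg emits one value per line, in request order.
    """
    values = [line.strip() for line in text.splitlines()]
    return dict(zip(names, values))


def read_nagiostats(binary="/usr/bin/nagiostats"):
    """Run nagiostats the way the MRTG target does and return labelled values."""
    cmd = [binary, "--mrtg", "--data=" + ",".join(MRTG_VARS)]
    output = subprocess.check_output(cmd).decode()
    return parse_mrtg_output(output)


# Usage on the Nagios host (not executed here):
#   for name, value in read_nagiostats().items():
#       print(name, value)
```

The first two values are what MRTG plots as "Scheduled Checks" and "On-Demand Checks", so running this by hand a few times is a quick way to confirm the graph is reading what you expect.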

Re: Interpreting the change service_check_timeout

Posted: Mon Oct 19, 2015 11:00 am
by nagmoto
It looks like many service checks needed between 60 and 200 seconds to run.
Once the service timeout was set to 200, those checks were able to complete instead of timing out and being rescheduled for the next run.
If this is the case, can I look up those timeout records in the nagios.log* files?

Re: Interpreting the change service_check_timeout

Posted: Mon Oct 19, 2015 3:39 pm
by tmcdonald
There's a cool script here that can analyze your check times:

https://exchange.nagios.org/directory/P ... me/details

You will need to modify the line that points to your log file, but otherwise it usually works right out of the box. Not sure what you are looking for specifically, but the output should be related.

Re: Interpreting the change service_check_timeout

Posted: Tue Oct 20, 2015 6:09 am
by nagmoto
Hi Tom

1. Thanks for the pointer to the script. I am able to see a list of checks and their time spent.
2. Still, I don't understand why changing service_check_timeout from 60 to 200 seconds in /etc/nagios/nagios.cfg would reduce the number of scheduled checks from 1420 to 150. The profiling Perl script I ran shows that very few checks take longer than 60 seconds.
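
One possible cross-check on the "very few checks are over 60 seconds" finding is to read the last recorded execution time of every service from the status file. The sketch below is only illustrative: it assumes the status file uses `servicestatus { ... }` blocks with `host_name`, `service_description` and `check_execution_time` keys, and the path in the usage comment is an assumption that should be replaced with the status_file value from nagios.cfg. It shows only the most recent run of each check, not history.

```python
def execution_times(lines):
    """Yield (host, service, seconds) for each servicestatus block.

    Assumed block layout: 'servicestatus {', then key=value lines, then '}'.
    """
    block = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("servicestatus {"):
            block = {}
        elif line == "}":
            if block is not None and "check_execution_time" in block:
                yield (
                    block.get("host_name", ""),
                    block.get("service_description", ""),
                    float(block["check_execution_time"]),
                )
            block = None
        elif block is not None and "=" in line:
            key, _, value = line.partition("=")
            block[key] = value


def slow_checks(lines, threshold=60.0):
    """Return the services whose last execution time exceeded the threshold."""
    return [entry for entry in execution_times(lines) if entry[2] > threshold]


# Usage on the Nagios host (path is an assumption, not executed here):
#   with open("/var/log/nagios/status.dat") as handle:
#       for host, service, seconds in slow_checks(handle):
#           print(host, service, seconds)
```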

Re: Interpreting the change service_check_timeout

Posted: Tue Oct 20, 2015 3:12 pm
by jdalrymple
If you're not seeing many checks run up against the timeout, then I am not sure what is going on here. If you look at the Nagios process statistics in the UI, do they correlate well with what your MRTG graph indicates? I am a bit put off by how ridiculously flat your graph is; it seems overly ideal.
nagmoto wrote: Once the service timeout was set to 200, those checks were able to complete instead of timing out and being rescheduled for the next run. If this is the case, can I look up those timeout records in the nagios.log* files?
Once logged, it's logged and doesn't change. You should be able to grep the log archives for timeouts so you can compare the period before the change with the current one. I'm not sure that will explain your situation, though.
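
A rough way to do that comparison is to count timeout messages per day, sketched here in Python. It relies on Nagios log lines starting with `[epoch]`, and it assumes (without having checked the exact wording for 3.5.1) that timeout warnings contain the phrase "timed out". The archive path in the usage comment is also an assumption, so adjust both to match your logs.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Nagios log lines begin with a Unix timestamp in square brackets.
LINE_RE = re.compile(r"^\[(\d+)\]\s+(.*)$")


def timeouts_per_day(lines):
    """Count log lines mentioning a timeout, grouped by UTC date.

    Assumption: timeout warnings contain the phrase 'timed out'.
    """
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue
        epoch, message = match.groups()
        if "timed out" in message.lower():
            day = datetime.fromtimestamp(int(epoch), tz=timezone.utc).date()
            counts[day.isoformat()] += 1
    return counts


# Usage on the Nagios host (paths are assumptions, not executed here):
#   import glob
#   for path in sorted(glob.glob("/var/log/nagios/archives/nagios-*.log")):
#       with open(path) as handle:
#           print(path, dict(timeouts_per_day(handle)))
```

If the daily count does not drop after the timeout was raised, timeouts are probably not what changed the graph.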

Let us know what you find.