too many timeouts in a row cause nagios to bail

Post by **stucky** » Wed Aug 05, 2015 1:31 pm

Gurus,
Recently we have been seeing an issue: when the CommVault backups kick in around 7, a lot of VMs start timing out, and at some point there appear to be too many timeouts for Nagios to handle. That is when it starts killing processes and I get thousands of these:

Aug 4 19:56:40 prdmgtnag01 nagios: SERVICE ALERT: xxxxx;check load;CRITICAL;SOFT;1;CRITICAL - Plugin timed out after 10 seconds
Aug 4 19:56:40 prdmgtnag01 nagios: SERVICE ALERT: xxxxx;check uptime;CRITICAL;SOFT;1;CRITICAL - Plugin timed out after 10 seconds
Aug 4 19:56:40 prdmgtnag01 nagios: SERVICE ALERT: xxxxx;check load;CRITICAL;SOFT;1;CRITICAL - Plugin timed out after 10 seconds
Aug 4 19:56:41 prdmgtnag01 nagios: SERVICE ALERT: xxxxx;du /;CRITICAL;SOFT;1;CRITICAL - Plugin timed out after 10 seconds
Aug 4 19:56:42 prdmgtnag01 nagios: wproc: Core Worker 14940: job 23984 (pid=20521) timed out. Killing it
Aug 4 19:56:42 prdmgtnag01 nagios: wproc: CHECK job 23984 from worker Core Worker 14940 timed out after 60.01s
Aug 4 19:56:42 prdmgtnag01 nagios: wproc: host=xxxx; service=check load;
Aug 4 19:56:42 prdmgtnag01 nagios: wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
Aug 4 19:56:42 prdmgtnag01 nagios: Warning: Check of service 'check load' on host 'xxxxxx' timed out after 60.009s!
Aug 4 19:56:42 prdmgtnag01 nagios: SERVICE ALERT: xxxxx;check load;CRITICAL;SOFT;1;(Service check timed out after 60.01 seconds)
Aug 4 19:56:42 prdmgtnag01 nagios: wproc: Core Worker 14940: job 23984 (pid=20521): Dormant child reaped
Aug 4 19:56:42 prdmgtnag01 nagios: SERVICE ALERT: xxxxxx;check load;OK;SOFT;2;OK - load average: 0.95, 1.60, 1.04
Aug 4 19:57:29 prdmgtnag01 nagios: wproc: Core Worker 14938: job 24084 (pid=21354) timed out. Killing it
Aug 4 19:57:29 prdmgtnag01 nagios: wproc: GLOBAL SERVICE EVENTHANDLER job 24084 from worker Core Worker 14938 timed out after 51.18s
Aug 4 19:57:29 prdmgtnag01 nagios: wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
Aug 4 19:57:29 prdmgtnag01 nagios: wproc: stderr line 01: No entry for terminal type "unknown";
Aug 4 19:57:29 prdmgtnag01 nagios: wproc: stderr line 02: using dumb terminal settings.
Aug 4 19:57:29 prdmgtnag01 nagios: Warning: Global service event handler command '/usr/bin/php /usr/local/nagiosxi/scripts/handle_nagioscore_event.php --handler-type=service --host="xxxxxxxxxxx" --service="du /usr" --hostaddress="xxxxxxxxxxx" --hoststate=UP --hoststateid=0 --hosteventid=271852 --hostproblemid=0 --servicestate=CRITICAL --servicestateid=2 --lastservicestate=OK --lastservicestateid=0 --servicestatetype=SOFT --currentattempt=1 --maxattempts=5 --serviceeventid=285391 --serviceproblemid=132684 --serviceoutput="CRITICAL - Plugin timed out after 10 seconds" --longserviceoutput="" --servicedowntime=0' timed out after 0.00 seconds

After that, ALL plugins time out for a few minutes and I get 900 alerts. The load goes to 275, so that explains a lot. I have 8 GB of RAM and 2 CPUs. Normally I don't use much:

[astuck@prdmgtnag01 ~]$ free -m
             total       used       free     shared    buffers     cached
Mem:          7873       2221       5651         27        210       1023
-/+ buffers/cache:        987       6885
Swap:         4095        113       3982
[astuck@prdmgtnag01 ~]$

Then it all recovers and I get 900 recovery emails.

Can we tune Nagios to "survive" these spikes? All the check settings are the shipped defaults.

Post by **tmcdonald** » Wed Aug 05, 2015 4:23 pm

Honestly I think the better way to do this is to implement scheduled downtime:

https://assets.nagios.com/downloads/nag ... ntime.html

The link can be found on the Home page, Scheduled Downtime.

Post by **jdalrymple** » Wed Aug 05, 2015 4:32 pm

We'll probably need to start by identifying the bottleneck, or the source of the massive load increase. It was likely disk/swap, and the resolution may be to add memory, faster disks, or a combination of the two. The thing about Nagios is that when things go bad in your environment (if bad enough), they spiral into oblivion for the Nagios host. Event handlers start kicking off, notifications start going out, environmental macros (if you have them enabled) consume MASSIVE amounts of memory, and of course on-demand checks start firing.

Look at your sar reports and see whether it's disk or CPU causing the load spike.
Also, if there is potential to rearrange your parent/child relationships and NOT alert on unreachables, that could be huge. Service dependencies would further reduce the tendency to spiral out of control, as in "If my VMware host is dead, there is no sense in me monitoring services or hosts that live on that VMware host."
Lastly, it can be helpful to start offloading the things that can be offloaded: the databases, the workers (mod_gearman), etc.

Start with finding the bottleneck though.
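
A rough sketch of how the sar reports could be pulled for the window around the spike. It assumes sysstat is installed and keeps its daily files under /var/log/sa (the usual location on RHEL/CentOS, but an assumption here); the day file sa04 and the time window are only guesses based on the Aug 4 19:56 log lines above, and sar_window is a helper name made up for this example.

```shell
# Assumptions: sysstat data lives in /var/log/sa, and day 04 is the day of interest.
SA_FILE="/var/log/sa/sa04"
START="19:30:00"
END="20:15:00"

# Hypothetical helper: run one sar report restricted to the window above.
sar_window() {
    sar "$1" -f "$SA_FILE" -s "$START" -e "$END"
}

if command -v sar >/dev/null 2>&1 && [ -r "$SA_FILE" ]; then
    sar_window -q   # run queue length and load averages
    sar_window -u   # CPU usage; a high %iowait points at disk rather than CPU
    sar_window -d   # per-device I/O (only if disk statistics are being collected)
    sar_window -W   # pages swapped in and out per second
else
    echo "sar or $SA_FILE not available; adjust SA_FILE for your system"
fi
```

If %iowait and swap activity climb while %user and %system stay modest, that would point toward disk/swap rather than CPU as the bottleneck.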

Post by **stucky** » Wed Aug 05, 2015 5:10 pm

jda,
What you describe is pretty much what happens once a night. I had already turned off alerts on UNKNOWN, but on closer inspection I noticed that the check_by_ssh service timeouts are actually causing CRITICALs instead of UNKNOWNs:

Aug 5 06:15:29 prdmgtnag01 nagios: SERVICE ALERT: xxxxxx;du /usr;CRITICAL;SOFT;3;CRITICAL - Plugin timed out after 10 seconds

whereas the check_wmi timeouts cause an UNKNOWN, as expected.

All timeouts should cause UNKNOWN, and that would cut back on alerts massively. It appears to be undesired behaviour in check_by_ssh.

Post by **Box293** » Wed Aug 05, 2015 7:19 pm

In nagios-plugins 2.1.x, you can define what the timeout state is.

-t, --timeout=INTEGER:<timeout state>
Seconds before connection times out (default: 10)
Optional ":<timeout state>" can be a state integer (0,1,2,3) or a state STRING

In your case you would add -t 10:3 to your command (state 3 is UNKNOWN).

It was released just last week I think.

You should be able to:
Type cd /tmp and press Enter
Type wget http://nagios-plugins.org/download/nagi ... 1.1.tar.gz and press Enter
Type tar zxf nagios-plugins-2.1.1.tar.gz and press Enter
Type cd nagios-plugins-2.1.1 and press Enter
Type ./configure --with-nagios-user=nagios --with-nagios-group=nagios and press Enter
Wait for the configure command to complete
Type make and press Enter
Wait for the make command to complete
Type make install and press Enter
Wait for the make install command to complete

That should be it. Test and then adjust your check_by_ssh commands.
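
A quick way to test from the command line might look like the sketch below. The plugin path is the usual default for a source install but is an assumption, and the host name and remote command are placeholders made up for this example; only the -t 10:3 part comes from the help text above.

```shell
# Assumed install path; adjust if your plugins live elsewhere.
PLUGIN="/usr/local/nagios/libexec/check_by_ssh"
# Hypothetical target host and remote check, for illustration only.
TARGET="host.example.com"
REMOTE_CMD="/usr/local/nagios/libexec/check_load -w 5,4,3 -c 10,8,6"
# 10 second timeout, returning state 3 (UNKNOWN) when it is hit.
TIMEOUT_ARG="10:3"

if [ -x "$PLUGIN" ]; then
    rc=0
    "$PLUGIN" -H "$TARGET" -t "$TIMEOUT_ARG" -C "$REMOTE_CMD" || rc=$?
    # Nagios return codes: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN
    echo "exit code: $rc"
else
    echo "check_by_ssh not found at $PLUGIN"
fi
```

If the timeout fires and the exit code is still 2, the plugin that ran is probably an older build without the timeout state option.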
