too many timeouts in a row cause nagios to bail
Posted: Wed Aug 05, 2015 1:31 pm
Gurus
Recently we have been seeing an issue: when the CommVault backups kick in around 7, a lot of VMs start timing out, and at some point there appear to be too many timeouts for Nagios to handle. That is when it starts killing procs and I get 1000s of these:
Aug 4 19:56:40 prdmgtnag01 nagios: SERVICE ALERT: xxxxx;check load;CRITICAL;SOFT;1;CRITICAL - Plugin timed out after 10 seconds
Aug 4 19:56:40 prdmgtnag01 nagios: SERVICE ALERT: xxxxx;check uptime;CRITICAL;SOFT;1;CRITICAL - Plugin timed out after 10 seconds
Aug 4 19:56:40 prdmgtnag01 nagios: SERVICE ALERT: xxxxx;check load;CRITICAL;SOFT;1;CRITICAL - Plugin timed out after 10 seconds
Aug 4 19:56:41 prdmgtnag01 nagios: SERVICE ALERT: xxxxx;du /;CRITICAL;SOFT;1;CRITICAL - Plugin timed out after 10 seconds
Aug 4 19:56:42 prdmgtnag01 nagios: wproc: Core Worker 14940: job 23984 (pid=20521) timed out. Killing it
Aug 4 19:56:42 prdmgtnag01 nagios: wproc: CHECK job 23984 from worker Core Worker 14940 timed out after 60.01s
Aug 4 19:56:42 prdmgtnag01 nagios: wproc: host=xxxx; service=check load;
Aug 4 19:56:42 prdmgtnag01 nagios: wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
Aug 4 19:56:42 prdmgtnag01 nagios: Warning: Check of service 'check load' on host 'xxxxxx' timed out after 60.009s!
Aug 4 19:56:42 prdmgtnag01 nagios: SERVICE ALERT: xxxxx;check load;CRITICAL;SOFT;1;(Service check timed out after 60.01 seconds)
Aug 4 19:56:42 prdmgtnag01 nagios: wproc: Core Worker 14940: job 23984 (pid=20521): Dormant child reaped
Aug 4 19:56:42 prdmgtnag01 nagios: SERVICE ALERT: xxxxxx;check load;OK;SOFT;2;OK - load average: 0.95, 1.60, 1.04
Aug 4 19:57:29 prdmgtnag01 nagios: wproc: Core Worker 14938: job 24084 (pid=21354) timed out. Killing it
Aug 4 19:57:29 prdmgtnag01 nagios: wproc: GLOBAL SERVICE EVENTHANDLER job 24084 from worker Core Worker 14938 timed out after 51.18s
Aug 4 19:57:29 prdmgtnag01 nagios: wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
Aug 4 19:57:29 prdmgtnag01 nagios: wproc: stderr line 01: No entry for terminal type "unknown";
Aug 4 19:57:29 prdmgtnag01 nagios: wproc: stderr line 02: using dumb terminal settings.
Aug 4 19:57:29 prdmgtnag01 nagios: Warning: Global service event handler command '/usr/bin/php /usr/local/nagiosxi/scripts/handle_nagioscore_event.php --handler-type=service --host="xxxxxxxxxxx" --service="du /usr" --hostaddress="xxxxxxxxxxx" --hoststate=UP --hoststateid=0 --hosteventid=271852 --hostproblemid=0 --servicestate=CRITICAL --servicestateid=2 --lastservicestate=OK --lastservicestateid=0 --servicestatetype=SOFT --currentattempt=1 --maxattempts=5 --serviceeventid=285391 --serviceproblemid=132684 --serviceoutput="CRITICAL - Plugin timed out after 10 seconds" --longserviceoutput="" --servicedowntime=0' timed out after 0.00 seconds
After that ALL plugins time out for a few minutes and I get 900 alerts. The load goes to 275, so that explains a lot. I have 8 GB of RAM and 2 CPUs. Normally I don't use much:
[astuck@prdmgtnag01 ~]$ free -m
             total       used       free     shared    buffers     cached
Mem:          7873       2221       5651         27        210       1023
-/+ buffers/cache:        987       6885
Swap:         4095        113       3982
[astuck@prdmgtnag01 ~]$
Then it all recovers and I get 900 recovery emails.
Can we tune Nagios to "survive" these spikes? All the check settings are as shipped by default.
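In case it helps to quantify the spike, here is a rough sketch of how the timeout alerts could be tallied per minute from the syslog lines above. It is only an illustration: the function name `count_timeouts_per_minute` and the regex are made up for this post, and it assumes every line follows the `SERVICE ALERT: host;service;state;type;attempt;output` layout shown in the excerpt.

```python
import re
from collections import Counter

# Assumed line layout, taken from the excerpt above:
# Aug 4 19:56:40 prdmgtnag01 nagios: SERVICE ALERT: host;service;CRITICAL;SOFT;1;output
ALERT_RE = re.compile(
    r"^(?P<minute>\w{3}\s+\d+\s+\d{2}:\d{2}):\d{2}\s+\S+\s+nagios:\s+"
    r"SERVICE ALERT:\s+(?P<host>[^;]*);(?P<service>[^;]*);"
    r"(?P<state>[^;]*);(?P<type>[^;]*);(?P<attempt>\d+);(?P<output>.*)$"
)


def count_timeouts_per_minute(lines):
    """Count SERVICE ALERT lines whose plugin output mentions a timeout.

    Returns a Counter keyed by the timestamp truncated to the minute,
    e.g. "Aug 4 19:56". Lines that are not service alerts (wproc,
    Warning, event handler messages) are skipped.
    """
    counts = Counter()
    for line in lines:
        match = ALERT_RE.match(line.strip())
        if match and "timed out" in match.group("output"):
            counts[match.group("minute")] += 1
    return counts
```

Passing it an open log file, for example `count_timeouts_per_minute(open(path))` with `path` pointing at wherever syslog writes the nagios messages on your system, would show how quickly the timeouts pile up once the backups start.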