Abnormal load on Nagios server
Posted: Fri Jun 20, 2014 3:00 am
The Nagios server suddenly experienced abnormal load, then returned to normal within 3 minutes. The logs show the following messages:
Jun 19 17:33:04 reg-nagios nagios: Warning: Host performance data file processing command '/bin/mv /usr/local/nagios/var/host-perfdata /usr/local/nagios/var/spool/xidpe/1403220778.perfdata.host' timed out after 5 seconds
Jun 19 17:33:10 reg-nagios nagios: Warning: Service performance data file processing command '/bin/mv /usr/local/nagios/var/service-perfdata /usr/local/nagios/var/spool/xidpe/1403220784.perfdata.service' timed out after 5 seconds
Jun 19 17:33:20 reg-nagios nagios: Warning: Host performance data file processing command '/bin/mv /usr/local/nagios/var/host-perfdata /usr/local/nagios/var/spool/xidpe/1403220794.perfdata.host' timed out after 5 seconds
Jun 19 17:33:26 reg-nagios nagios: Warning: Service performance data file processing command '/bin/mv /usr/local/nagios/var/service-perfdata /usr/local/nagios/var/spool/xidpe/1403220800.perfdata.service' timed out after 5 seconds
Jun 19 17:33:26 reg-nagios nagios: wproc: Core Worker 31012: job 6508 (pid=13539) timed out. Killing it
Jun 19 17:33:26 reg-nagios nagios: wproc: CHECK job 6508 from worker Core Worker 31012 timed out after 30.02s
Jun 19 17:33:26 reg-nagios nagios: wproc: host=reg-glb18.viterra.com; service=(null);
Jun 19 17:33:26 reg-nagios nagios: wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
Jun 19 17:33:26 reg-nagios nagios: Warning: Check of host 'reg-glb18.viterra.com' timed out after 30.02 seconds
Jun 19 17:33:26 reg-nagios nagios: HOST ALERT: reg-glb18.viterra.com;DOWN;SOFT;1;(Host check timed out after 30.02 seconds)
Jun 19 17:33:26 reg-nagios nagios: wproc: Core Worker 31012: tv.tv_sec is currently 1403220804
Jun 19 17:33:26 reg-nagios nagios: wproc: Core Worker 31012: Failed to reap child with pid 13539. Next attempt @ 1403220809.534442
Jun 19 17:33:27 reg-nagios nagios: wproc: Core Worker 31039: job 6508 (pid=13564) timed out. Killing it
Jun 19 17:33:27 reg-nagios nagios: wproc: CHECK job 6508 from worker Core Worker 31039 timed out after 30.01s
Jun 19 17:33:27 reg-nagios nagios: wproc: host=reg-dut-01.viterra.com; service=(null);
Jun 19 17:33:27 reg-nagios nagios: wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
Nothing looks wrong with the disk either. What might be the reason?
Also, the monitoring engine queue shows 1000+ events at the same time. Previously they were evenly distributed.