Abnormal load on Nagios server

Posted: Fri Jun 20, 2014 3:00 am
by narayanamoorthys
The Nagios server suddenly experienced abnormal load and returned to normal within 3 minutes. The logs show the following messages:

Jun 19 17:33:04 reg-nagios nagios: Warning: Host performance data file processing command '/bin/mv /usr/local/nagios/var/host-perfdata /usr/local/nagios/var/spool/xidpe/1403220778.perfdata.host' timed out after 5 seconds
Jun 19 17:33:10 reg-nagios nagios: Warning: Service performance data file processing command '/bin/mv /usr/local/nagios/var/service-perfdata /usr/local/nagios/var/spool/xidpe/1403220784.perfdata.service' timed out after 5 seconds
Jun 19 17:33:20 reg-nagios nagios: Warning: Host performance data file processing command '/bin/mv /usr/local/nagios/var/host-perfdata /usr/local/nagios/var/spool/xidpe/1403220794.perfdata.host' timed out after 5 seconds
Jun 19 17:33:26 reg-nagios nagios: Warning: Service performance data file processing command '/bin/mv /usr/local/nagios/var/service-perfdata /usr/local/nagios/var/spool/xidpe/1403220800.perfdata.service' timed out after 5 seconds
Jun 19 17:33:26 reg-nagios nagios: wproc: Core Worker 31012: job 6508 (pid=13539) timed out. Killing it
Jun 19 17:33:26 reg-nagios nagios: wproc: CHECK job 6508 from worker Core Worker 31012 timed out after 30.02s
Jun 19 17:33:26 reg-nagios nagios: wproc: host=reg-glb18.viterra.com; service=(null);
Jun 19 17:33:26 reg-nagios nagios: wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
Jun 19 17:33:26 reg-nagios nagios: Warning: Check of host 'reg-glb18.viterra.com' timed out after 30.02 seconds
Jun 19 17:33:26 reg-nagios nagios: HOST ALERT: reg-glb18.viterra.com;DOWN;SOFT;1;(Host check timed out after 30.02 seconds)
Jun 19 17:33:26 reg-nagios nagios: wproc: Core Worker 31012: tv.tv_sec is currently 1403220804
Jun 19 17:33:26 reg-nagios nagios: wproc: Core Worker 31012: Failed to reap child with pid 13539. Next attempt @ 1403220809.534442
Jun 19 17:33:27 reg-nagios nagios: wproc: Core Worker 31039: job 6508 (pid=13564) timed out. Killing it
Jun 19 17:33:27 reg-nagios nagios: wproc: CHECK job 6508 from worker Core Worker 31039 timed out after 30.01s
Jun 19 17:33:27 reg-nagios nagios: wproc: host=reg-dut-01.viterra.com; service=(null);
Jun 19 17:33:27 reg-nagios nagios: wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;

Nothing looks wrong with the disk either. What might be the reason?

Also, the monitoring engine queue showed 1000+ events scheduled at the same time. Before this, the checks were evenly distributed.
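
For anyone triaging a similar burst, a small helper like the one below can tally how many host checks timed out per host, based on the "Check of host ... timed out" warnings shown above. This is only a sketch: the function name is made up for illustration, it handles host-check warnings only, and the log location in the usage note (/var/log/messages) is an assumption that depends on the syslog setup.

```shell
# Hypothetical helper: reads syslog lines on stdin and prints
# "<count> <host>" for every host whose check timed out,
# most frequent first.
count_host_timeouts() {
  # Keep only the host name from the host-check timeout warnings.
  sed -n "s/.*Warning: Check of host '\([^']*\)' timed out.*/\1/p" |
    # Tally occurrences per host.
    awk '{ n[$0]++ } END { for (h in n) print n[h], h }' |
    sort -rn
}
```

Possible usage, assuming the Nagios messages go to /var/log/messages: count_host_timeouts < /var/log/messages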

Re: Abnormal load on Nagios server

Posted: Fri Jun 20, 2014 1:01 pm
by lmiltchev
"The Nagios server suddenly experienced abnormal load and returned to normal within 3 minutes."
Did this happen only once, or does it keep happening at regular intervals? Which Nagios XI version are you currently using? Did you make any changes to the system before this issue appeared? Are you using Mod Gearman?

Re: Abnormal load on Nagios server

Posted: Sat Jun 21, 2014 8:52 am
by narayanamoorthys
It has happened only once so far, and no changes were made.

Version: 2014R1.0

We are not using Mod Gearman.

Re: Abnormal load on Nagios server

Posted: Mon Jun 23, 2014 9:36 am
by slansing
Are you running any additional modules?

cat /usr/local/nagios/etc/nagios.cfg | grep 'broker'
"Also the monitoring engine queue shows 1000+ events at same time. Before it was evenly distributed."
Are you saying this happened only once? Did it resolve itself?

Re: Abnormal load on Nagios server

Posted: Mon Jun 23, 2014 9:52 am
by narayanamoorthys
Please find the output below:

[root@reg-nagios libexec]# cat /usr/local/nagios/etc/nagios.cfg | grep 'broker'
broker_module=/usr/local/nagios/bin/ndomod.o config_file=/usr/local/nagios/etc/ndomod.cfg
event_broker_options=-1

The monitoring engine queue now shows 420+ checks scheduled at a single point in time, with only a few checks distributed around it.
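
To put numbers on that clustering outside the web interface, one option is to bucket the next_check timestamps from the status file by minute. The following is a rough sketch under assumptions: the function name is hypothetical, and the path in the usage note assumes the status file is at /usr/local/nagios/var/status.dat (check status_file in nagios.cfg).

```shell
# Hypothetical helper: reads status.dat content on stdin and prints
# "<minute-start-epoch> <number of checks due in that minute>",
# ordered by time.
bucket_next_checks() {
  awk -F= '
    /^[[:space:]]*next_check=/ {
      bucket = $2 - ($2 % 60)   # round down to the start of the minute
      count[bucket]++
    }
    END { for (b in count) print b, count[b] }
  ' | sort -n
}
```

Possible usage: bucket_next_checks < /usr/local/nagios/var/status.dat
A single minute holding hundreds of checks while neighbouring minutes hold very few would match the spike described here.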

Re: Abnormal load on Nagios server

Posted: Mon Jun 23, 2014 5:01 pm
by tmcdonald
Are you running a lot of ESX, WMI, or check_ifoperstatus checks? Those can be quite CPU and memory intensive and having many run at once (or get stuck) can cause this behavior. How many hosts/services are you checking overall? Are they all on a 5-minute timer or are there some that run more often?
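
One way to answer the "how many of each check" question is to tally the check_command entries in the status file. This is a sketch only: the function name is invented for illustration, it counts host and service checks together, and the path in the usage note assumes the status file is at /usr/local/nagios/var/status.dat.

```shell
# Hypothetical helper: reads status.dat content on stdin and prints
# "<count> <command name>" with arguments (everything after the first "!")
# stripped, most frequent first.
count_check_commands() {
  sed -n 's/^[[:space:]]*check_command=\([^!]*\).*/\1/p' |
    awk '{ n[$0]++ } END { for (c in n) print n[c], c }' |
    sort -rn
}
```

Possible usage: count_check_commands < /usr/local/nagios/var/status.dat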

Re: Abnormal load on Nagios server

Posted: Tue Jun 24, 2014 2:35 am
by narayanamoorthys
We don't have any ESX checks. We monitor around 150 Unix servers, and they all use the default check interval (5 minutes).

Re: Abnormal load on Nagios server

Posted: Tue Jun 24, 2014 2:12 pm
by tmcdonald
Can I get a copy of your profile? In the XI web interface, go to Admin -> System Profile and click the blue "Download Profile" button. Then PM that profile.zip file to me.