Abnormal load on Nagios server

Post by **narayanamoorthys** » Fri Jun 20, 2014 3:00 am

The Nagios server suddenly experienced abnormal load and returned to normal within 3 minutes. The logs show the following messages:

Jun 19 17:33:04 reg-nagios nagios: Warning: Host performance data file processing command '/bin/mv /usr/local/nagios/var/host-perfdata /usr/local/nagios/var/spool/xidpe/1403220778.perfdata.host' timed out after 5 seconds
Jun 19 17:33:10 reg-nagios nagios: Warning: Service performance data file processing command '/bin/mv /usr/local/nagios/var/service-perfdata /usr/local/nagios/var/spool/xidpe/1403220784.perfdata.service' timed out after 5 seconds
Jun 19 17:33:20 reg-nagios nagios: Warning: Host performance data file processing command '/bin/mv /usr/local/nagios/var/host-perfdata /usr/local/nagios/var/spool/xidpe/1403220794.perfdata.host' timed out after 5 seconds
Jun 19 17:33:26 reg-nagios nagios: Warning: Service performance data file processing command '/bin/mv /usr/local/nagios/var/service-perfdata /usr/local/nagios/var/spool/xidpe/1403220800.perfdata.service' timed out after 5 seconds
Jun 19 17:33:26 reg-nagios nagios: wproc: Core Worker 31012: job 6508 (pid=13539) timed out. Killing it
Jun 19 17:33:26 reg-nagios nagios: wproc: CHECK job 6508 from worker Core Worker 31012 timed out after 30.02s
Jun 19 17:33:26 reg-nagios nagios: wproc: host=reg-glb18.viterra.com; service=(null);
Jun 19 17:33:26 reg-nagios nagios: wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
Jun 19 17:33:26 reg-nagios nagios: Warning: Check of host 'reg-glb18.viterra.com' timed out after 30.02 seconds
Jun 19 17:33:26 reg-nagios nagios: HOST ALERT: reg-glb18.viterra.com;DOWN;SOFT;1;(Host check timed out after 30.02 seconds)
Jun 19 17:33:26 reg-nagios nagios: wproc: Core Worker 31012: tv.tv_sec is currently 1403220804
Jun 19 17:33:26 reg-nagios nagios: wproc: Core Worker 31012: Failed to reap child with pid 13539. Next attempt @ 1403220809.534442
Jun 19 17:33:27 reg-nagios nagios: wproc: Core Worker 31039: job 6508 (pid=13564) timed out. Killing it
Jun 19 17:33:27 reg-nagios nagios: wproc: CHECK job 6508 from worker Core Worker 31039 timed out after 30.01s
Jun 19 17:33:27 reg-nagios nagios: wproc: host=reg-dut-01.viterra.com; service=(null);
Jun 19 17:33:27 reg-nagios nagios: wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;

Nothing looks wrong with the disk, either. What might be the cause?

Also, the monitoring engine queue showed 1000+ events scheduled at the same time. Previously they were evenly distributed.
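To see whether the timeouts in a log excerpt like the one above are confined to a short window, the "timed out" messages can be counted per minute. This is a minimal sketch, not an official Nagios tool: the function name `timeouts_per_minute` is made up for illustration, and the log path in the usage comment is an assumption that depends on the syslog setup.

```shell
#!/bin/sh
# Count Nagios "timed out" messages per minute.
# Reads syslog-style lines on stdin, e.g.
#   "Jun 19 17:33:04 reg-nagios nagios: Warning: ... timed out after 5 seconds"
# and prints "Mon DD HH:MM count", in order of first appearance.
timeouts_per_minute() {
    awk '/nagios:/ && /timed out/ {
             key = $1 " " $2 " " substr($3, 1, 5)   # month, day, HH:MM
             if (!(key in count)) order[++n] = key  # remember first-seen order
             count[key]++
         }
         END { for (i = 1; i <= n; i++) print order[i], count[order[i]] }'
}

# Usage (log path is an assumption; adjust for your system):
#   timeouts_per_minute < /var/log/messages
```

A single short burst suggests a one-off stall on the server, while bursts that repeat at a fixed spacing would point toward something scheduled.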

Post by **lmiltchev** » Fri Jun 20, 2014 1:01 pm

The Nagios server suddenly experienced abnormal load and returned to normal within 3 minutes.

Did this happen only once, or does it keep happening at regular intervals? Which Nagios XI version are you currently using? Did you make any changes to the system before this issue appeared? Are you using Mod Gearman?

Post by **narayanamoorthys** » Sat Jun 21, 2014 8:52 am

It has happened only once so far, and no changes were made.

Version: 2014R1.0

We are not using Mod Gearman.

Post by **slansing** » Mon Jun 23, 2014 9:36 am

Are you running additional modules?

cat /usr/local/nagios/etc/nagios.cfg | grep 'broker'

Also, the monitoring engine queue showed 1000+ events scheduled at the same time. Previously they were evenly distributed.

Are you saying this happened only once? Did it resolve itself?

Post by **narayanamoorthys** » Mon Jun 23, 2014 9:52 am

Please find the output below:

[root@reg-nagios libexec]# cat /usr/local/nagios/etc/nagios.cfg | grep 'broker'
broker_module=/usr/local/nagios/bin/ndomod.o config_file=/usr/local/nagios/etc/ndomod.cfg
event_broker_options=-1

Now the monitoring queue shows 420+ checks scheduled at a single point, with only a few checks distributed elsewhere.
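The clustering described here can also be measured outside the web interface by bucketing the `next_check` timestamps that Nagios Core writes to its status file. The sketch below makes several assumptions: the status file is at `/usr/local/nagios/var/status.dat` (a common default, not confirmed in this thread), each `next_check=` entry sits on its own line, and the function name `next_check_histogram` is purely illustrative.

```shell
#!/bin/sh
# Histogram of scheduled checks per minute.
# Reads status.dat-style "key=value" lines on stdin and prints
# "<epoch of minute start> <number of checks due in that minute>".
next_check_histogram() {
    awk '{ n = split($1, kv, "=") }
         n == 2 && kv[1] == "next_check" && kv[2] > 0 {
             bucket[int(kv[2] / 60) * 60]++      # round down to the minute
         }
         END { for (t in bucket) printf "%d %d\n", t, bucket[t] }' | sort -n
}

# Usage (status file path is an assumption):
#   next_check_histogram < /usr/local/nagios/var/status.dat | sort -k2,2nr | head
```

If one minute holds several hundred checks while its neighbours hold only a handful, the schedule is bunched rather than spread out.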

Post by **tmcdonald** » Mon Jun 23, 2014 5:01 pm

Are you running a lot of ESX, WMI, or check_ifoperstatus checks? Those can be quite CPU and memory intensive and having many run at once (or get stuck) can cause this behavior. How many hosts/services are you checking overall? Are they all on a 5-minute timer or are there some that run more often?

Post by **narayanamoorthys** » Tue Jun 24, 2014 2:35 am

We don't have any ESX checks. We monitor around 150 Unix servers, and they are all at the default check interval (5 min).
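Questions like "how many hosts and services" and "which kinds of checks" can be answered with exact numbers from the status file. This is a rough sketch under the same assumptions as before: the path `/usr/local/nagios/var/status.dat` is a guess at the default location, the file uses `hoststatus {` / `servicestatus {` blocks with `check_command=` lines, and `check_summary` is a hypothetical helper name.

```shell
#!/bin/sh
# Summarise a status.dat-style file read from stdin:
#   "command <name> <count>" for each check command in use,
#   plus "hosts <n>" and "services <n>" totals.
check_summary() {
    awk '$1 == "hoststatus"    { hosts++ }
         $1 == "servicestatus" { services++ }
         { n = split($1, kv, "=") }
         n >= 2 && kv[1] == "check_command" {
             split(kv[2], cmd, "!")              # drop the !arguments
             used[cmd[1]]++
         }
         END {
             for (c in used) print "command", c, used[c]
             print "hosts", hosts + 0
             print "services", services + 0
         }' | LC_ALL=C sort
}

# Usage (status file path is an assumption):
#   check_summary < /usr/local/nagios/var/status.dat
```

Any WMI or `check_ifoperstatus` commands would show up by name in the `command` lines.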

Post by **tmcdonald** » Tue Jun 24, 2014 2:12 pm

Can I get a copy of your profile? In the XI web interface, go to Admin -> System Profile and click the blue "Download Profile" button. Then PM that profile.zip file to me.

Nagios Support Forum
