Monitoring Engine Event Queue anomaly?
Hi,
just upgraded from 2011R3.1 to 2012R1.7. All seemed to work OK.
Current Setup:
1 x central Host (which is the one that was upgraded to 2012R1.7)
4 x down stream nodes running a mix of Nagios Core & XI (2011R3.1)
Service checks are forwarded via NSCA
The central host is receiving the checks OK.
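As a sketch of the forwarding described above: NSCA takes passive results as tab-delimited lines. The hostname, service name, and config path below are placeholders for illustration, not taken from this thread.

```shell
# Passive check results are submitted to NSCA as tab-delimited lines:
# host<TAB>service<TAB>return_code<TAB>plugin_output
result=$(printf 'web01\tHTTP\t0\tHTTP OK - 200 in 0.012s')
echo "$result"
# On a real downstream node you would pipe this into send_nsca, e.g.:
#   printf '%s\n' "$result" | /usr/local/nagios/bin/send_nsca -H <central> -c /etc/nagios/send_nsca.cfg
```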
The main issue I'm seeing is that there's a large number of events indicated in the "Scheduled Events over time" portlet, and it's static. We can see that there are events flowing through in the portlet.
There doesn't seem to be any impact to the central host other than the indication that there are more than 980 scheduled events waiting in the queue at "Now". Load and performance seem reasonable.
I've already attempted the regular MySQL and PostgreSQL DB maintenance tasks.
Appreciate any advice.
thanks
Lincoln
Re: Monitoring Engine Event Queue anomaly?
Are most of the checks returning to the central server passive or active? Is there a buildup in the checkresults folder?
Is the date/time on the system correct? Go to Admin --> System Profile, copy the Date/Time section, and post it here.
Code: Select all
ls /usr/local/nagios/var/spool/checkresults/ | wc -l
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
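To keep an eye on the spool suggested above, here is a minimal helper around that same count. The default path is the stock checkresults location; pass a different directory as the first argument to check another spool.

```shell
# Count files in the checkresults spool; a steadily growing number suggests
# the engine is not reaping results fast enough.
count_checkresults() {
  dir=${1:-/usr/local/nagios/var/spool/checkresults}
  ls "$dir" 2>/dev/null | wc -l
}
count_checkresults "$@"
```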
Re: Monitoring Engine Event Queue anomaly?
Thanks for the response.
The majority of checks are passive, using NSCA (which seems to be working OK).
Output from the ls command: 6 (the file contents look like results from some old testing).
System Profile Date/Time output:
Date/Time
PHP Timezone: Not set
PHP Time: Mon, 15 Apr 2013 09:50:46 +1000
System Time: Mon, 15 Apr 2013 09:50:46 +1000
Regards
Lincoln
slansing
Re: Monitoring Engine Event Queue anomaly?
How many passive checks are you sending to the central host from the others? Say, in a five-minute range? Let's also do the following, as quoted from Mike Guthrie:
Does the "Dashlet" refresh if you refresh the page? You are saying the "Now" time is always between 984 and 1230? It would be worth running the MySQL repair procedure:
http://assets.nagios.com/downloads/nagi ... tabase.pdf
As well as the vacuum commands on PostgreSQL:
http://support.nagios.com/wiki/index.ph ... .22_in_log
There are a few possible reasons for this:
- Lots of disk activity can cause things to get backed up because the system is waiting to write to disk. We see this sometimes on VMs because of a shared physical disk.
- Long-running checks or event handlers; these will block the main Nagios loop and hold up the check schedule.
- A big spike in CPU usage could cause this, but usually that comes with a consistently high load.
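A rough way to hunt for the long-running checks mentioned above is to time a plugin invocation by hand; the check_ping example in the comment is only illustrative.

```shell
# Wall-clock timing for a single plugin run; anything that runs for many
# seconds can hold up the main Nagios loop and back up the check schedule.
time_plugin() {
  start=$(date +%s)
  "$@" >/dev/null 2>&1
  end=$(date +%s)
  echo "runtime: $((end - start))s"
}
# Example: time_plugin /usr/local/nagios/libexec/check_ping -H 127.0.0.1 -w 100,20% -c 200,50%
```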
Re: Monitoring Engine Event Queue anomaly?
Hi,
According to the performance dashlet, we're doing under 1000 passive checks over 5 minutes.
Tried the DB maintenance tasks, with no luck.
Refreshing the page has no impact (it still shows a large number of checks occurring "now").
And yes, I am saying the "Now" time is indicating a high number of checks scheduled, although it has come down slightly since the initial upgrade. Appreciate the help.
Thanks
Lincoln
slansing
Re: Monitoring Engine Event Queue anomaly?
It seems the "Now" count is more consistent with your 5-minute check count as you noted, which is strange. I'm not sure if this is by design or a flaw. I'm going to ask one of our leads to take a look at this and see if we can get it sorted out.
scottwilkerson (DevOps Engineer, Nagios Enterprises)
Re: Monitoring Engine Event Queue anomaly?
Check to make sure that the time is synced on all servers; that can also throw off that chart.
If your latency were high, then your bottleneck would probably be disk I/O, but that doesn't sound like the problem.
We may want to also tail the syslog to see if we are getting any NSCA errors:
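Instead of watching the live tail, you can also filter the log for NSCA entries after the fact. This is a small sketch; pass a different log file as the first argument if yours lives elsewhere.

```shell
# Show the most recent NSCA-related syslog lines (case-insensitive match).
nsca_lines() {
  grep -i 'nsca' "${1:-/var/log/messages}" 2>/dev/null | tail -n 20
}
nsca_lines "$@"
```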
Code: Select all
tail -f /var/log/messages
Re: Monitoring Engine Event Queue anomaly?
Hi,
I can confirm that all Nagios hosts are synchronising time OK with a local time source.
The interesting thing is that we were getting high I/O and high load in the past, which seemed to be due to logging. Through some troubleshooting we turned off log_on_success in xinetd.conf, as NSCA was logging all successful connections. That brought the I/O and load down considerably.
We also implemented a ramdisk and rrdcached per the standard instructions to assist with the previous performance issues we had.
So at the moment /var/log/messages is fairly sparse. The latest events in there are to do with ntp, since I restarted the ntp service:
Apr 17 10:38:43 host ntpd[17774]: ntpd [email protected] Tue Oct 25 12:54:17 UTC 2011 (1)
Apr 17 10:38:43 host ntpd[17775]: precision = 1.000 usec
Apr 17 10:38:43 host ntpd[17775]: Listening on interface wildcard, 0.0.0.0#123 Disabled
Apr 17 10:38:43 host ntpd[17775]: Listening on interface wildcard, ::#123 Disabled
Apr 17 10:38:43 host ntpd[17775]: Listening on interface lo, ::1#123 Enabled
Apr 17 10:38:43 host ntpd[17775]: Listening on interface virbr0, fe80::200:ff:fe00:0#123 Enabled
Apr 17 10:38:43 host ntpd[17775]: Listening on interface eth0, fe80::250:56ff:feb2:196f#123 Enabled
Apr 17 10:38:43 host ntpd[17775]: Listening on interface lo, 127.0.0.1#123 Enabled
Apr 17 10:38:43 host ntpd[17775]: Listening on interface eth0, x.x.x.x#123 Enabled
Apr 17 10:38:43 host ntpd[17775]: Listening on interface virbr0, 192.168.122.1#123 Enabled
Apr 17 10:38:43 host ntpd[17775]: kernel time sync status 0040
Apr 17 10:38:43 host ntpd[17775]: frequency initialized 18.108 PPM from /var/lib/ntp/drift
Apr 17 10:43:04 host ntpd[17775]: synchronized to x.x.x.x, stratum 4
Apr 17 10:43:04 host ntpd[17775]: kernel time sync enabled 0001
The odd thing is, there seems to be no impact:
- Passive checks are being received OK
- Active checks are being performed OK
- Event handlers are firing OK
- There don't seem to be any I/O or load issues
Appreciate your efforts
Regards
Lincoln
scottwilkerson
Re: Monitoring Engine Event Queue anomaly?
Could you send your latest configuration snapshot to [email protected] along with a link to this thread?
Thanks.
Re: Monitoring Engine Event Queue anomaly?
Sent as requested. Thanks.