Monitoring Engine Event Queue anomaly?
Posted: Thu Apr 11, 2013 9:02 pm
by lance
Hi,
just upgraded from 2011R3.1 to 2012R1.7. All seemed to work OK.
Current Setup:
1 x central Host (which is the one that was upgraded to 2012R1.7)
4 x down stream nodes running a mix of Nagios Core & XI (2011R3.1)
Service checks are forwarded via NSCA
The central host is receiving the checks OK.
The main issue I'm seeing is that there's a large number of events indicated in the "Scheduled Events Over Time" portlet, and that number stays static, even though we can see events flowing through in the portlet.
There doesn't seem to be any impact on the central host other than the indication that there are more than 980 scheduled events waiting in the queue at "Now". Load and performance seem reasonable.
ScreenHunter_01 Apr. 12 11.43.gif
ScreenHunter_02 Apr. 12 11.44.gif
ScreenHunter_03 Apr. 12 11.44.gif
I've already attempted the regular MySQL and PostgreSQL DB maintenance tasks.
Appreciate any advice..
thanks
Lincoln
Re: Monitoring Engine Event Queue anomaly?
Posted: Fri Apr 12, 2013 3:21 pm
by abrist
Are most of the checks returning to the central server passive or active? Is there a buildup in the checkresults folder?
Code:
ls /usr/local/nagios/var/spool/checkresults/ | wc -l
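A raw file count does not show whether the files are fresh or left over. As a rough sketch (not an official Nagios tool), the helper below reports both the total and the number of files older than a threshold. The function name `count_spool` and the 5-minute default are assumptions made for illustration, and `-maxdepth`/`-mmin` rely on GNU or BSD `find`.

```shell
#!/bin/sh
# Sketch: count all files and "stale" files in a check-results spool directory.
# count_spool and the 5-minute default are hypothetical, chosen for illustration.
count_spool() {
    dir=$1                  # spool directory to inspect
    max_age_min=${2:-5}     # files older than this many minutes count as stale
    total=$(find "$dir" -maxdepth 1 -type f | wc -l)
    stale=$(find "$dir" -maxdepth 1 -type f -mmin +"$max_age_min" | wc -l)
    # Arithmetic expansion strips any padding that wc may add.
    echo "total=$((total)) stale=$((stale))"
}

# Example call (path taken from the command above):
#   count_spool /usr/local/nagios/var/spool/checkresults 5
```

A small total made up entirely of stale files would point to leftovers rather than an active backlog.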
Is the date/time on the system correct? Go to --> Admin --> System Profile. Copy the Date/Time section and post it here.
Re: Monitoring Engine Event Queue anomaly?
Posted: Sun Apr 14, 2013 6:55 pm
by lance
Thanks for the response.
The majority of checks are passive - using nsca (seems to be working OK).
output from ls command - 6 (file content looks like the result from some old testing)
System Profile date/Time output:
Date/Time
PHP Timezone: Not set
PHP Time: Mon, 15 Apr 2013 09:50:46 +1000
System Time: Mon, 15 Apr 2013 09:50:46 +1000
Regards
Lincoln
Re: Monitoring Engine Event Queue anomaly?
Posted: Mon Apr 15, 2013 9:31 am
by slansing
How many passive checks are you sending to the central host from the others, say, in a five-minute range? Let's also do the following, as quoted from Mike Guthrie:
It would be worth running the mysql repair procedure:
http://assets.nagios.com/downloads/nagi ... tabase.pdf
As well as the vacuum commands on postgresql:
http://support.nagios.com/wiki/index.ph ... .22_in_log
There are a few possible reasons for this:
- Lots of disk activity causes things to get backed up because the system is waiting to write to disk. We see this sometimes on VMs because of a shared physical disk.
- Long-running checks or event handlers. These will block the main Nagios loop and hold up the check schedule.
- A big spike in CPU usage could cause this, but usually this would be a consistently high load...
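To get a first impression of the disk and CPU causes listed above, the Linux `/proc` counters can be read directly. This is only a sketch: the function names `load_ratio` and `iowait_pct` are made up for illustration, and the iowait figure is an average since boot, not a live reading (tools such as `vmstat` or `iostat` give interval samples).

```shell
#!/bin/sh
# Sketch: two quick indicators read from Linux /proc files.
# Function names are hypothetical; file arguments exist so the logic can be tested.

# 1-minute load average divided by the number of CPUs.
load_ratio() {
    loadavg_file=${1:-/proc/loadavg}
    cpus=${2:-$(getconf _NPROCESSORS_ONLN)}
    awk -v n="$cpus" '{ printf "%.2f\n", $1 / n }' "$loadavg_file"
}

# Percentage of CPU time spent in iowait since boot.
# The aggregate "cpu" line holds: user nice system idle iowait irq softirq steal ...
iowait_pct() {
    stat_file=${1:-/proc/stat}
    awk '$1 == "cpu" {
        total = 0
        for (i = 2; i <= 9; i++) total += $i   # user .. steal
        printf "%.1f\n", 100 * $6 / total      # field 6 is iowait
        exit
    }' "$stat_file"
}

# Example calls on a live system:
#   load_ratio
#   iowait_pct
```

A load ratio well above 1 or a high iowait share would support the disk or CPU explanations; low values make the long-running check explanation more likely.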
Does the "Dashlet" refresh if you refresh the page? You are saying the "Now" time is always between 984 and 1230?
Re: Monitoring Engine Event Queue anomaly?
Posted: Mon Apr 15, 2013 8:24 pm
by lance
Hi,
According to the performance dashlet, we're doing under 1000 passive checks over 5 mins:
ScreenHunter_01 Apr. 16 11.07.gif
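Under 1000 passive checks in 5 minutes works out to roughly 3 per second or fewer. As a cross-check that does not depend on the dashlet, the count can be taken from the Nagios log, assuming `log_passive_checks=1` is set so that `PASSIVE SERVICE CHECK` and `PASSIVE HOST CHECK` entries are written. The function name `count_passive` and the log path in the example are assumptions, not something confirmed in this thread.

```shell
#!/bin/sh
# Sketch: count passive check entries in a Nagios log within a time window.
# Assumes log lines of the form "[epoch] PASSIVE SERVICE CHECK: host;svc;state;output".
# count_passive is a hypothetical helper name.
count_passive() {
    logfile=$1          # Nagios log file to scan
    now=$2              # end of the window, as a Unix epoch
    window=${3:-300}    # window length in seconds (default: 5 minutes)
    awk -v now="$now" -v win="$window" '
        / PASSIVE (SERVICE|HOST) CHECK: / {
            ts = substr($1, 2, length($1) - 2) + 0   # strip the [ ] around the epoch
            if (ts > now - win && ts <= now) n++
        }
        END { print n + 0 }' "$logfile"
}

# Example call (log path is an assumption for a default install):
#   count_passive /usr/local/nagios/var/nagios.log "$(date +%s)" 300
```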
Tried the DB maintenance tasks, with no luck.
Refreshing the page has no impact (still shows a large number of checks occurring "now").
And yep, I'm saying the "Now" time is indicating a high number of checks scheduled, although it has come down slightly since the initial upgrade:
ScreenHunter_02 Apr. 16 11.22.gif
Appreciate the help
Thanks
Lincoln
Re: Monitoring Engine Event Queue anomaly?
Posted: Tue Apr 16, 2013 9:29 am
by slansing
It seems the "Now" time is more consistent with your 5 minute check count as you noted, which is strange. I'm not sure if this is by design or a flaw. I am going to ask one of our leads to take a look at this and see if we can get it sorted out.
Re: Monitoring Engine Event Queue anomaly?
Posted: Tue Apr 16, 2013 9:45 am
by scottwilkerson
Check to make sure that the time is synced on all servers. That can also throw off that chart.
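One simple way to compare clocks, sketched here under the assumption that the downstream nodes are reachable over SSH, is to collect `date +%s` from each host and look at the difference. The helper name `clock_skew` and the host name in the example are hypothetical.

```shell
#!/bin/sh
# Sketch: absolute difference in seconds between two Unix epoch timestamps.
# clock_skew is a hypothetical helper name.
clock_skew() {
    a=$1
    b=$2
    d=$((a - b))
    if [ "$d" -lt 0 ]; then
        d=$((-d))
    fi
    echo "$d"
}

# Example usage (node1 is a placeholder host name):
#   here=$(date +%s)
#   there=$(ssh node1 date +%s)
#   clock_skew "$here" "$there"
```

The SSH round trip adds a little delay of its own, so a difference of a second or two is not meaningful; larger gaps are worth investigating.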
If your latency were high, then your bottleneck would probably be disk I/O, but that doesn't sound like the problem.
We may also want to tail the syslog to see if we are getting any NSCA errors.
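A rough sketch of that syslog check follows. It assumes that NSCA-related entries contain the string "nsca" and that syslog lands in `/var/log/messages`. Both may differ per system, and the function name `nsca_log_lines` is made up for illustration.

```shell
#!/bin/sh
# Sketch: show the most recent syslog lines that mention NSCA.
# Assumes entries contain "nsca" (case-insensitive); nsca_log_lines is a hypothetical name.
nsca_log_lines() {
    logfile=${1:-/var/log/messages}   # syslog file to scan
    count=${2:-20}                    # how many matching lines to show
    grep -i 'nsca' "$logfile" | tail -n "$count"
}

# Example calls:
#   nsca_log_lines                       # last 20 matches in /var/log/messages
#   nsca_log_lines /var/log/messages 50  # last 50 matches
```

For live monitoring while checks arrive, `tail -f` on the same file piped through `grep -i nsca` serves the same purpose.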
Re: Monitoring Engine Event Queue anomaly?
Posted: Tue Apr 16, 2013 8:14 pm
by lance
Hi,
can confirm that all nagios hosts are synchronising time OK with a local time source.
Interesting thing is that we were getting High I/O & High load in the past, which seemed to be due to logging. So through some troubleshooting we turned off log_on_success in xinetd.conf as nsca was logging all successful connections. That brought the I/O & Load down considerably.
We also implemented ramdisk & rrdcached per the standard instructions to assist with the previous performance issues we had.
So at the moment /var/log/messages is fairly sparse. The latest events in there are to do with ntp, since I restarted the ntp service:
Apr 17 10:38:43 host ntpd[17774]: ntpd
[email protected] Tue Oct 25 12:54:17 UTC 2011 (1)
Apr 17 10:38:43 host ntpd[17775]: precision = 1.000 usec
Apr 17 10:38:43 host ntpd[17775]: Listening on interface wildcard, 0.0.0.0#123 Disabled
Apr 17 10:38:43 host ntpd[17775]: Listening on interface wildcard, ::#123 Disabled
Apr 17 10:38:43 host ntpd[17775]: Listening on interface lo, ::1#123 Enabled
Apr 17 10:38:43 host ntpd[17775]: Listening on interface virbr0, fe80::200:ff:fe00:0#123 Enabled
Apr 17 10:38:43 host ntpd[17775]: Listening on interface eth0, fe80::250:56ff:feb2:196f#123 Enabled
Apr 17 10:38:43 host ntpd[17775]: Listening on interface lo, 127.0.0.1#123 Enabled
Apr 17 10:38:43 host ntpd[17775]: Listening on interface eth0, x.x.x.x#123 Enabled
Apr 17 10:38:43 host ntpd[17775]: Listening on interface virbr0, 192.168.122.1#123 Enabled
Apr 17 10:38:43 host ntpd[17775]: kernel time sync status 0040
Apr 17 10:38:43 host ntpd[17775]: frequency initialized 18.108 PPM from /var/lib/ntp/drift
Apr 17 10:43:04 host ntpd[17775]: synchronized to x.x.x.x, stratum 4
Apr 17 10:43:04 host ntpd[17775]: kernel time sync enabled 0001
Odd thing is, there seems to be no impact:
- Passive checks are being received OK
- Active checks are being performed OK
- Event handlers are firing OK
- Doesn't seem to be any I/O or load issues
Appreciate your efforts
Regards
Lincoln
Re: Monitoring Engine Event Queue anomaly?
Posted: Wed Apr 17, 2013 9:38 am
by scottwilkerson
Could you send your latest configuration snapshot to [email protected] along with a link to this thread.
Thanks.
Re: Monitoring Engine Event Queue anomaly?
Posted: Wed Apr 17, 2013 11:56 pm
by lance
Sent as requested..thanks