Monitoring Engine Event Queue anomaly?
Hi,
just upgraded from 2011R3.1 to 2012R1.7. All seemed to work OK.
Current Setup:
1 x central Host (which is the one that was upgraded to 2012R1.7)
4 x down stream nodes running a mix of Nagios Core & XI (2011R3.1)
Service checks are forwarded via NSCA
The central host is receiving the checks OK.
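As a sketch of the forwarding described above: NSCA takes passive results as tab-delimited lines. The hostname, service name, and config path below are placeholders for illustration, not taken from this thread.

```shell
# Passive check results are submitted to NSCA as tab-delimited lines:
# host<TAB>service<TAB>return_code<TAB>plugin_output
result=$(printf 'web01\tHTTP\t0\tHTTP OK - 200 in 0.012s')
echo "$result"
# On a real downstream node you would pipe this into send_nsca, e.g.:
#   printf '%s\n' "$result" | /usr/local/nagios/bin/send_nsca -H <central> -c /etc/nagios/send_nsca.cfg
```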
The main issue I'm seeing is that there's a large number of events indicated in the "Scheduled Events over time" portlet, and it's static. We can see that there are events flowing through in the portlet.
There doesn't seem to be any impact to the central host other than the indication that there are more than 980 scheduled events waiting in the queue at "Now". Load and performance seem reasonable.
I've already attempted the regular MySQL and PostgreSQL DB maintenance tasks.
Appreciate any advice.
thanks
Lincoln
Re: Monitoring Engine Event Queue anomaly?
Are most of the checks returning to the central server passive or active? Is there a buildup in the checkresults folder?
Is the date/time on the system correct? Go to Admin --> System Profile, copy the Date/Time section, and post it here.
Code: Select all
ls /usr/local/nagios/var/spool/checkresults/ | wc -l
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
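To keep an eye on the spool suggested above, here is a minimal helper around that same count. The default path is the stock checkresults location; pass a different directory as the first argument to check another spool.

```shell
# Count files in the checkresults spool; a steadily growing number suggests
# the engine is not reaping results fast enough.
count_checkresults() {
  dir=${1:-/usr/local/nagios/var/spool/checkresults}
  ls "$dir" 2>/dev/null | wc -l
}
count_checkresults "$@"
```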
Re: Monitoring Engine Event Queue anomaly?
Thanks for the response.
The majority of checks are passive, using NSCA (which seems to be working OK).
Output from the ls command: 6 (the file contents look like results from some old testing).
System Profile Date/Time output:
Date/Time
PHP Timezone: Not set
PHP Time: Mon, 15 Apr 2013 09:50:46 +1000
System Time: Mon, 15 Apr 2013 09:50:46 +1000
Regards
Lincoln
slansing
Re: Monitoring Engine Event Queue anomaly?
How many passive checks are you sending to the central host from the others? Say, in a five-minute range? Let's also do the following, as quoted from Mike Guthrie:
Does the "Dashlet" refresh if you refresh the page? You are saying the "Now" time is always between 984 and 1230? It would be worth running the MySQL repair procedure:
http://assets.nagios.com/downloads/nagi ... tabase.pdf
As well as the vacuum commands on PostgreSQL:
http://support.nagios.com/wiki/index.ph ... .22_in_log
There are a few possible reasons for this:
- Lots of disk activity can cause things to get backed up because the system is waiting to write to disk. We see this sometimes on VMs because of a shared physical disk.
- Long-running checks or event handlers; these will block the main Nagios loop and hold up the check schedule.
- A big spike in CPU usage could cause this, but usually that comes with a consistently high load.
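A rough way to hunt for the long-running checks mentioned above is to time a plugin invocation by hand; the check_ping example in the comment is only illustrative.

```shell
# Wall-clock timing for a single plugin run; anything that runs for many
# seconds can hold up the main Nagios loop and back up the check schedule.
time_plugin() {
  start=$(date +%s)
  "$@" >/dev/null 2>&1
  end=$(date +%s)
  echo "runtime: $((end - start))s"
}
# Example: time_plugin /usr/local/nagios/libexec/check_ping -H 127.0.0.1 -w 100,20% -c 200,50%
```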
Re: Monitoring Engine Event Queue anomaly?
Hi,
According to the performance dashlet, we're doing under 1000 passive checks over 5 minutes.
Tried the DB maintenance tasks, with no luck.
Refreshing the page has no impact (it still shows a large number of checks occurring "now").
And yes, I am saying the "Now" time is indicating a high number of checks scheduled, although it has come down slightly since the initial upgrade. Appreciate the help.
Thanks
Lincoln
slansing
Re: Monitoring Engine Event Queue anomaly?
It seems the "Now" count is more consistent with your 5-minute check count as you noted, which is strange. I'm not sure if this is by design or a flaw. I'm going to ask one of our leads to take a look at this and see if we can get it sorted out.
scottwilkerson (DevOps Engineer, Nagios Enterprises)
Re: Monitoring Engine Event Queue anomaly?
Check to make sure that the time is synced on all servers; that can also throw off that chart.
If your latency were high, then your bottleneck would probably be disk I/O, but that doesn't sound like the problem.
We may want to also tail the syslog to see if we are getting any NSCA errors:
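Instead of watching the live tail, you can also filter the log for NSCA entries after the fact. This is a small sketch; pass a different log file as the first argument if yours lives elsewhere.

```shell
# Show the most recent NSCA-related syslog lines (case-insensitive match).
nsca_lines() {
  grep -i 'nsca' "${1:-/var/log/messages}" 2>/dev/null | tail -n 20
}
nsca_lines "$@"
```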
Code: Select all
tail -f /var/log/messages
Re: Monitoring Engine Event Queue anomaly?
Hi,
I can confirm that all Nagios hosts are synchronising time OK with a local time source.
The interesting thing is that we were getting high I/O and high load in the past, which seemed to be due to logging. Through some troubleshooting we turned off log_on_success in xinetd.conf, as NSCA was logging all successful connections. That brought the I/O and load down considerably.
We also implemented a ramdisk and rrdcached per the standard instructions to assist with the previous performance issues we had.
So at the moment /var/log/messages is fairly sparse. The latest events in there are to do with ntp, since I restarted the ntp service:
Apr 17 10:38:43 host ntpd[17774]: ntpd [email protected] Tue Oct 25 12:54:17 UTC 2011 (1)
Apr 17 10:38:43 host ntpd[17775]: precision = 1.000 usec
Apr 17 10:38:43 host ntpd[17775]: Listening on interface wildcard, 0.0.0.0#123 Disabled
Apr 17 10:38:43 host ntpd[17775]: Listening on interface wildcard, ::#123 Disabled
Apr 17 10:38:43 host ntpd[17775]: Listening on interface lo, ::1#123 Enabled
Apr 17 10:38:43 host ntpd[17775]: Listening on interface virbr0, fe80::200:ff:fe00:0#123 Enabled
Apr 17 10:38:43 host ntpd[17775]: Listening on interface eth0, fe80::250:56ff:feb2:196f#123 Enabled
Apr 17 10:38:43 host ntpd[17775]: Listening on interface lo, 127.0.0.1#123 Enabled
Apr 17 10:38:43 host ntpd[17775]: Listening on interface eth0, x.x.x.x#123 Enabled
Apr 17 10:38:43 host ntpd[17775]: Listening on interface virbr0, 192.168.122.1#123 Enabled
Apr 17 10:38:43 host ntpd[17775]: kernel time sync status 0040
Apr 17 10:38:43 host ntpd[17775]: frequency initialized 18.108 PPM from /var/lib/ntp/drift
Apr 17 10:43:04 host ntpd[17775]: synchronized to x.x.x.x, stratum 4
Apr 17 10:43:04 host ntpd[17775]: kernel time sync enabled 0001
The odd thing is, there seems to be no impact:
- Passive checks are being received OK
- Active checks are being performed OK
- Event handlers are firing OK
- There don't seem to be any I/O or load issues
Appreciate your efforts
Regards
Lincoln
scottwilkerson
Re: Monitoring Engine Event Queue anomaly?
Could you send your latest configuration snapshot to [email protected] along with a link to this thread?
Thanks.
Re: Monitoring Engine Event Queue anomaly?
Sent as requested. Thanks.