Re: Monitoring Engine Event Queue anomaly?
Posted: Wed Jun 12, 2013 2:20 pm
by lmiltchev
We are still trying to pinpoint the issue. There is also another customer experiencing a similar problem. As soon as we have a possible solution, we will let you know.
Re: Monitoring Engine Event Queue anomaly?
Posted: Wed Jun 12, 2013 4:23 pm
by Box293
No problems. Let me know if there is any other information you require, or if you want to establish a remote session to have a look at our server.
Re: Monitoring Engine Event Queue anomaly?
Posted: Wed Jun 12, 2013 4:44 pm
by lmiltchev
Run the following commands and show us the output:
Code:
tail -n 200 /var/log/messages
tail -n 200 /var/log/mysqld.log
Re: Monitoring Engine Event Queue anomaly?
Posted: Wed Jun 12, 2013 7:18 pm
by scottwilkerson
Troy,
Searching through the forum, I found that you had a very similar post about 6 months ago that was related to time syncing:
http://support.nagios.com/forum/viewtop ... =10#p40970
Could this problem be creeping back up?
Re: Monitoring Engine Event Queue anomaly?
Posted: Thu Jun 13, 2013 1:56 am
by Box293
I've just run the two commands and have sent you a private message with the output (there is some client-related data that I would prefer not to post publicly).
In relation to the time syncing, I have checked and the VM does NOT have VMware Tools syncing time with the ESXi host. I have CentOS configured to use NTP, and when I ran the date command the correct date and time were displayed.
Re: Monitoring Engine Event Queue anomaly?
Posted: Thu Jun 13, 2013 9:15 am
by slansing
Hmm, we are hoping to get into a remote session today with another client experiencing this issue. We will let you know what we dig up!
Re: Monitoring Engine Event Queue anomaly?
Posted: Thu Jun 13, 2013 12:12 pm
by lmiltchev
Troy, do you remember if your issues started around 04/04/2013?
Run the following command and send me the output via PM:
Code:
tail -n 200 /usr/local/nagiosxi/var/dbmaint.log
Re: Monitoring Engine Event Queue anomaly?
Posted: Thu Jun 13, 2013 3:05 pm
by slansing
Scott resolved the issue with the other client. The problem was an offloaded MySQL database whose time was not synced with the XI server's time (off by a minute, give or take), plus a huge backlog of check results in:
Code:
/usr/local/nagios/var/spool/checkresults/
Please let us know if either of these applies to you, and whether a check results pile-up like that is what created the giant stack in the event queue you are experiencing.
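For anyone following along, the two conditions described above (clock drift between the database host and the XI server, and a backlog in the spool directory) could be checked with something like the sketch below. The helper names clock_drift and count_stale_files, the DBHOST placeholder, and the 60-minute threshold are illustrative assumptions, not part of Nagios XI. It relies on the -maxdepth and -mmin options of GNU find, as shipped with CentOS.

```shell
#!/bin/sh
# Sketch only: helper names and thresholds are illustrative assumptions.

# Print the absolute difference, in seconds, between two epoch timestamps.
clock_drift() {
    a=$1
    b=$2
    if [ "$a" -ge "$b" ]; then
        echo $((a - b))
    else
        echo $((b - a))
    fi
}

# Print how many regular files directly inside directory $1 were last
# modified more than $2 minutes ago.
count_stale_files() {
    find "$1" -maxdepth 1 -type f -mmin +"$2" | wc -l | tr -d ' '
}

# Example usage (DBHOST is a placeholder; add -u/-p credentials as needed):
#   db_now=$(mysql -h DBHOST -N -s -e 'SELECT UNIX_TIMESTAMP()')
#   clock_drift "$(date +%s)" "$db_now"
#   count_stale_files /usr/local/nagios/var/spool/checkresults/ 60
```

A drift of more than a few seconds, or a stale-file count that keeps growing, would point toward the same situation as the other client's. Neither check changes anything on the server.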
Re: Monitoring Engine Event Queue anomaly?
Posted: Thu Jun 13, 2013 4:56 pm
by Box293
Two very good questions.
When I looked at what was happening around 2013/04/04, I found the following. This was a week after we had relocated our environment to a new datacenter. One of the iSCSI switches in a stack of two had a problem and rebooted, so all VMs hung for about 10-30 seconds while the SAN controllers transitioned to the other switch. The Nagios XI VM was running on one of the SANs connected to these iSCSI switches. Not sure if it is related or not, but before the relocation I disabled (in CCM) a lot of services and hosts that would no longer exist in the new datacenter. The configuration applied OK, but these old services and hosts were left in the database for about four weeks afterwards.
As for the checkresults folder, there were about 1400 files in it when I had a look. Interestingly, 1070 of those files were created in 2012, 2011 and 2010. When I watched the folder, files created in 2013 were being processed correctly and disappeared soon afterwards. So I've just deleted those 1070 files and I'm waiting for the database maintenance task to complete.
I'll PM the results of the commands you requested and I'll get back to you in a couple of hours to see if the problem has been resolved.
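The breakdown of old files by year mentioned above can be reproduced with something like the following sketch. The function names files_by_year and list_older_than_days are made up for illustration, and the -printf option assumes GNU find (the default on CentOS). It only lists files and deletes nothing.

```shell
#!/bin/sh
# Sketch only: function names are illustrative assumptions.

# Print "year count" lines for the regular files directly inside
# directory $1, grouped by the year of their last modification.
files_by_year() {
    find "$1" -maxdepth 1 -type f -printf '%TY\n' | sort | uniq -c | awk '{print $2, $1}'
}

# List the regular files directly inside directory $1 that were last
# modified more than $2 days ago, for review before any cleanup.
list_older_than_days() {
    find "$1" -maxdepth 1 -type f -mtime +"$2" -print
}

# Example usage:
#   files_by_year /usr/local/nagios/var/spool/checkresults/
#   list_older_than_days /usr/local/nagios/var/spool/checkresults/ 30
```

Reviewing the list first makes it easier to confirm that only stale files from previous years would be removed, and that current check results are left alone.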
Re: Monitoring Engine Event Queue anomaly?
Posted: Thu Jun 13, 2013 6:24 pm
by Box293
Deleting those files has not made a difference.
The database maintenance job has run twice since then and nothing seems to have changed. Scheduled events over time went up to 4000+ as the job ran over a 20 minute window.