Monitoring Engine Event Queue anomaly?

Post by **Box293** » Wed Jun 05, 2013 7:00 pm

Was there an answer to this?

Every hour our events climb to about 4000 and then drop back to about 300. From what I understand, this coincides with the hourly DB scripts.

CPU usage on the XI host goes up when this happens. This screenshot shows the past day's CPU usage for this XI host; you can see the hourly job in it.

CPU 1 day summary.png
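One way to check whether the hourly spikes really line up with the DB scripts is to split queue-length samples into those taken inside and outside the maintenance window. Below is a minimal Python sketch. It assumes you have exported (minute-of-day, queue length) samples yourself. The window start minute and window length are hypothetical parameters, not values taken from Nagios XI.

```python
from statistics import mean


def split_by_window(samples, window_start_min, window_len_min):
    """Split (minute_of_day, queue_len) samples into in-window and out-of-window lists.

    Assumes (hypothetically) that the hourly job starts at the same minute past
    every hour (window_start_min) and runs for window_len_min minutes.
    """
    inside, outside = [], []
    for minute, queue_len in samples:
        # Minutes elapsed since the most recent job start, wrapping every hour.
        offset = (minute - window_start_min) % 60
        if offset < window_len_min:
            inside.append(queue_len)
        else:
            outside.append(queue_len)
    return inside, outside


def spike_ratio(samples, window_start_min, window_len_min):
    """Mean queue length inside the window divided by the mean outside it."""
    inside, outside = split_by_window(samples, window_start_min, window_len_min)
    if not inside or not outside:
        raise ValueError("need samples both inside and outside the window")
    return mean(inside) / mean(outside)
```

A ratio well above 1 across many hours supports the correlation. A ratio near 1 would point somewhere other than the hourly job.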

We only have active service checks, no passive.

Nagios XI 2012R1.7 VM running on ESXi 5.1.

Post by **lmiltchev** » Thu Jun 06, 2013 12:48 pm

> Was there an answer to this?
> ...
> We only have active service checks, no passive.

We haven't received a response from the customer since 04/21/2013, so I am not sure whether it was resolved. His case was different, though: the majority of his checks were passive checks, sent at similar times.

Post by **Box293** » Fri Jun 07, 2013 12:22 am

I'll put some notes together about this and come back to you with some helpful information.

I have a suggestion (that requires some explaining first).

We use the Nagios XI Server Monitoring Wizard on a test server to monitor our production server. This lets us know if the production server goes down or there is something wrong (it's really handy). In particular I am talking about the "Nagios XI Jobs" service.

Normally the status is "All jobs are running ok". However, if the scheduled events over time build up WHILE a backup of the VM is running, the Nagios XI Jobs service reports "Nagios XI Jobs;UNKNOWN;SOFT;3;Database Maintenance (dbmaint) stale (1203 seconds old), Database Maintenance (dbmaint) stale (1203 seconds old)".
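The message above looks like a simple staleness test: a job is reported as stale when its last recorded run is older than some maximum age. Below is a minimal Python sketch of that logic. It assumes a hypothetical mapping of job labels to last-run timestamps and a hypothetical `max_age` threshold. It is not the actual Nagios XI wizard code.

```python
import time

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_jobs(last_runs, max_age, now=None):
    """Return (exit_code, message) for a set of background jobs.

    last_runs: hypothetical mapping of job label -> epoch seconds of the last run.
    max_age:   hypothetical threshold in seconds before a job counts as stale.
    """
    now = time.time() if now is None else now
    stale = [
        f"{label} stale ({int(now - last)} seconds old)"
        for label, last in last_runs.items()
        if now - last > max_age
    ]
    if stale:
        # Mirrors the UNKNOWN state seen in the service output above.
        return UNKNOWN, ", ".join(stale)
    return OK, "All jobs are running ok"
```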

My suggestion is that you add performance data to this service so we can observe over time how long the database maintenance jobs take to run, or define a new service in the wizard that tracks the database maintenance durations so we can see them in pretty graphs.
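Nagios plugins report performance data after a `|` in their output, in the form `'label'=value[UOM];warn;crit;min;max`, and XI graphs those values. Below is a minimal sketch of how job durations could be appended to the service output. It assumes a hypothetical mapping of job labels to run durations in seconds. The labels and thresholds are illustrative and are not part of Nagios XI.

```python
def with_perfdata(status_text, durations, warn=None, crit=None):
    """Append Nagios-style performance data to a plugin status line.

    durations: hypothetical mapping of job label -> duration of last run in seconds.
    warn/crit: optional thresholds in seconds (left empty when None).
    """
    def field(value):
        return "" if value is None else str(value)

    perf = " ".join(
        # Format: 'label'=value[UOM];warn;crit;min  (max omitted)
        f"'{label}'={seconds}s;{field(warn)};{field(crit)};0"
        for label, seconds in durations.items()
    )
    return f"{status_text} | {perf}" if perf else status_text
```

The status text stays unchanged for humans. The graphing side only reads what follows the `|`.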

I hope this makes sense.

Post by **lmiltchev** » Fri Jun 07, 2013 10:10 am

It makes sense, and I think it's a good idea. Please post a feature request on our bug tracker so that it won't "fall through the cracks".

Post by **Box293** » Tue Jun 11, 2013 4:15 pm

OK so here's some more information.

As quoted from another post:

> It would be worth running the MySQL repair procedure:
> http://assets.nagios.com/downloads/nagi ... tabase.pdf
>
> As well as the vacuum commands on PostgreSQL:
> http://support.nagios.com/wiki/index.ph ... .22_in_log
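The standard client tools can perform a MySQL table repair and a PostgreSQL vacuum. The Python sketch below prints the commands and only runs them if `--run` is passed. It uses `mysqlcheck --repair` and `vacuumdb --analyze` as stand-ins and is not the linked procedures. The database names (`nagios`, `nagiosql`, `nagiosxi`) and user names are assumptions about a typical XI install, so check them against your own system.

```python
import shlex
import subprocess
import sys

# ASSUMED names for a typical XI install; verify before use.
MYSQL_DATABASES = ["nagios", "nagiosql"]
PG_DATABASE = "nagiosxi"
PG_USER = "nagiosxi"


def maintenance_commands():
    """Build argv lists for a MySQL repair and a PostgreSQL vacuum/analyze."""
    return [
        # --password with no value makes mysqlcheck prompt for it.
        ["mysqlcheck", "--repair", "--user=root", "--password",
         "--databases", *MYSQL_DATABASES],
        ["vacuumdb", "--analyze", "--verbose",
         f"--dbname={PG_DATABASE}", f"--username={PG_USER}"],
    ]


def main(argv):
    run = "--run" in argv
    for cmd in maintenance_commands():
        print(shlex.join(cmd))  # dry run: show the command line
        if run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    main(sys.argv[1:])
```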

Here is the log file with the output from these commands.

putty-2013-06-12.log

Also here is a better screenshot of the problem when it occurs.

Monitoring Engine Status.png

One other observation: when the scheduled events over time build up like this, things like dashlets do not display properly. It's almost as if the dashlet cannot access the database to get the service object (or something like that).

I'll post back here in a couple of hours to report whether these two repair / cleanup procedures have resolved the problem.

Post by **Box293** » Tue Jun 11, 2013 4:31 pm

FYI here is the bug tracker link http://tracker.nagios.com/view.php?id=412

Post by **lmiltchev** » Tue Jun 11, 2013 4:49 pm

> FYI here is the bug tracker link http://tracker.nagios.com/view.php?id=412

Thanks for the post, Troy!

Post by **Box293** » Tue Jun 11, 2013 4:57 pm

OK so the two repair / cleanup procedures did not solve the problem.

The database maintenance job just ran and did the same thing: scheduled events over time went up to 4000+ as the job ran over a 20-minute window.

Post by **scottwilkerson** » Tue Jun 11, 2013 6:31 pm

Troy,

There were some bugs introduced in 2012R1.7 that could be causing the issue you are seeing. They affected how we query objects in the DB and led us to release 2012R1.8 shortly afterward.

I also like the feature request; when I get some time I'll try to get that in there.

Post by **Box293** » Tue Jun 11, 2013 10:16 pm

Thanks for that Scott.

A couple of hours ago I upgraded to 2012R2.2; however, the problem has not been resolved.

The database maintenance job has run twice since then and nothing seems to have changed: scheduled events over time went up to 4000+ as the job ran over a 20-minute window.

Nagios Support Forum