
Re: Monitoring Engine Event Queue bottlenecks occasionally

Posted: Tue Dec 18, 2012 11:52 am
by paul.jobb
I'm not exactly sure of the version of our ESXi environment; I would guess 4.1, though. We also decided to install VMware Tools on our Nagios servers and use the tools for clock synchronization in place of NTP.

Re: Monitoring Engine Event Queue bottlenecks occasionally

Posted: Tue Dec 18, 2012 5:45 pm
by scottwilkerson
I have a feeling that using VMware Tools for the time sync could be causing the issue. Would you be willing to change that back to NTP?

Re: Monitoring Engine Event Queue bottlenecks occasionally

Posted: Tue Dec 18, 2012 6:50 pm
by Box293
High five paul.jobb,
Problem appears to be solved.

I changed my database log entries and state history to be kept for 182 days. Now the problem no longer occurs. When I look at the VM performance, disk I/O and CPU usage have dropped dramatically at the times when the hourly DB optimization task runs.

I'm thinking I might look at implementing some MySQL monitoring checks for Nagios so I can get information like the duration of DB optimization jobs into some nice pretty graphs.
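For anyone wanting to try the graphing idea, here is a rough sketch of what such a check could look like. It only covers the Nagios side: turning a job duration into a status line with performance data that can be graphed. How the duration is actually measured is left open, and the function name, the `db_optimize_time` label, and the thresholds are all made up for illustration.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_duration(seconds, warn, crit, label="db_optimize_time"):
    """Return (exit_code, status_line) for a job duration in seconds.

    The label and thresholds are hypothetical; measuring the duration
    itself is assumed to happen elsewhere.
    """
    if seconds is None or seconds < 0:
        return UNKNOWN, "UNKNOWN - no duration available"
    if seconds >= crit:
        code, word = CRITICAL, "CRITICAL"
    elif seconds >= warn:
        code, word = WARNING, "WARNING"
    else:
        code, word = OK, "OK"
    # Perfdata format: label=value[UOM];warn;crit;min
    perfdata = f"{label}={seconds:.1f}s;{warn};{crit};0"
    return code, f"{word} - DB optimization took {seconds:.1f}s | {perfdata}"


def main(argv):
    """Hypothetical CLI: <seconds> <warn> <crit>. Returns the exit code."""
    if len(argv) != 3:
        print("UNKNOWN - usage: <seconds> <warn> <crit>")
        return UNKNOWN
    code, line = check_duration(float(argv[0]), float(argv[1]), float(argv[2]))
    print(line)
    return code
```

Used as a real plugin, the script would finish with `sys.exit(main(sys.argv[1:]))` so that Nagios sees the exit code; the part after the `|` is what the graphing side picks up.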

FYI #1 The night before, I tried removing one vCPU so I only had two. That just made things really bad. So instead I added another one, so my total vCPU count was 4. This did not fix the problem, but it did mask the issue: scheduled events would only get up to about 1500. CPU ready time also increased for this VM.

FYI #2 One side effect of this problem is that it was somehow causing MRTG to lose data, and I ended up with gaps in my graphs. Since I made the changes last night, the graphs are looking great. The screenshot shows what I mean: the ones I am pointing to are from yesterday, and the data on the right is from today.
Effects on MRTG.png

Re: Monitoring Engine Event Queue bottlenecks occasionally

Posted: Wed Dec 19, 2012 12:03 pm
by paul.jobb
That's good to hear. That seems similar to the behavior I was having with the DB optimization process.

Regarding NTP, I was using NTP until recently. When we stepped down from 4 vCPUs to 2 vCPUs, we also installed VMware Tools, disabled NTP, and enabled the VMware Tools time sync. It seems likely that my VM environment is under-resourced at certain times, so the thought was that the tools might be better at smoothing the clock during those times. A few weeks ago I saw the message below in my Nagios log file; we have also had some issues with latent checks and monitoring stopping at certain points. All my checks are farmed off to Gearman workers, so it's just a matter of keeping them scheduled.

[1354423407] Warning: A system time change of 0d 0h 36m 15s (forwards in time) has been detected. Compensating...
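To put that log line in context, here is a small sketch that pulls out when the jump was logged and how large it was. The regular expression is written against the line quoted above; the "backwards" variant and the function name `parse_time_change` are assumptions, not something taken from the log.

```python
import re
from datetime import datetime, timezone

# Matches the time-change warning quoted above. The "backwards" branch
# is an assumption about how a jump in the other direction is worded.
PATTERN = re.compile(
    r"\[(\d+)\] Warning: A system time change of "
    r"(\d+)d (\d+)h (\d+)m (\d+)s \((forwards|backwards) in time\)"
)


def parse_time_change(line):
    """Return (utc_datetime, jump_seconds) or None if the line does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    epoch, days, hours, minutes, seconds = (int(g) for g in m.groups()[:5])
    jump = days * 86400 + hours * 3600 + minutes * 60 + seconds
    if m.group(6) == "backwards":
        jump = -jump
    # The bracketed value is a Unix epoch timestamp.
    when = datetime.fromtimestamp(epoch, tz=timezone.utc)
    return when, jump


line = ("[1354423407] Warning: A system time change of 0d 0h 36m 15s "
        "(forwards in time) has been detected. Compensating...")
when, jump = parse_time_change(line)
print(when.isoformat(), jump)
```

For the line above this gives 2012-12-02 04:43:27 UTC and a forward jump of 2175 seconds, i.e. the clock moved ahead by a little over 36 minutes in one step.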