Monitoring Engine Event Queue bottlenecks occasionally
- Box293
- Too Basu
- Posts: 5126
- Joined: Sun Feb 07, 2010 10:55 pm
- Location: Deniliquin, Australia
- Contact:
I've been observing the Monitoring Engine Event Queue dashlet that shows me the Scheduled Events Over Time. What seems to happen on a regular basis is that the queue becomes bottlenecked by some process ... and then eventually the bottleneck disappears.
This seems to have occurred since we upgraded to XI 2012. Usually we don't need to do anything and it corrects itself.
The problem is that sometimes the "Now" scheduled events get up to 3000+ and we need to restart the monitoring engine.
Is anyone else experiencing this behaviour?
Nagios XI 2012R1.2 VM CentOS 32bit on ESXi 5.0 Update 1.
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
Re: Monitoring Engine Event Queue bottlenecks occasionally
It would be worth running the MySQL repair procedure:
http://assets.nagios.com/downloads/nagi ... tabase.pdf
As well as the vacuum commands on PostgreSQL:
http://support.nagios.com/wiki/index.ph ... .22_in_log
There are a few possible reasons for this:
- Lots of disk activity causes things to get backed up because the system is waiting to write to disk. We see this sometimes on VMs because of a shared physical disk.
- Long-running checks or event handlers; these will block the main Nagios loop and hold up the check schedule.
- A big spike in CPU usage could cause this, but usually this would show as a consistently high load...
Re: Monitoring Engine Event Queue bottlenecks occasionally
I have spent considerable time trying to work toward concrete solutions for this problem. I have worked with a number of organizations that have a large number of checks, with a large difference in the times at which those checks occur. I am not making recommendations beyond being very careful, documenting what you do, and evaluating over time. Here is some of my research, which basically tells you to be careful with changes! This information is not complete, and every situation has variables which make outcomes different, so keep that in mind.
I am interested in getting feedback from others working to solve this issue for XI so feel free to contact me. We work with this issue in the Advanced Nagios Training because it seems to come up frequently with larger installs.
One thing to note, this condition will often occur if you have checks that do not run 24x7 as they build up and dump all at once.
inter_check_delay_method
This setting is used to space the checks in order to equalize the load on the Nagios server as well as on the hosts that it is monitoring. By default this setting uses “smart” to calculate the spread. It is important to understand how “smart” works, as there are some situations in which this setting is not optimal.
n = no delay - schedule all service checks to run immediately
d = use a "dumb" delay of 1 second between service checks
s = use a "smart" delay calculation to spread service checks out evenly (default)
x.xx = use a user-supplied inter-check delay of x.xx seconds
Here are the defaults (both using smart):
host_inter_check_delay_method=s
service_inter_check_delay_method=s
smart
This setting uses the following formula to decide on the spread between checks:
inter check delay = (average check interval)/(total number of services)
Here is an example. The Nagios server has 10,000 checks with an average check interval of 5 minutes. This means that the formula then looks like this:
inter check delay = 5 / 10000
This comes out to 0.0005 minutes or 0.03 seconds. This means that the initial spread for the checks is going to have to be around every 0.03 seconds before another check will need to occur, or 33 checks will need to occur every second.
Let's do the math again, this time using an average of 6.5 minutes.
inter check delay = 6.5 / 10000
This comes out to 0.00065 minutes or 0.039 seconds. This means that the initial spread for the checks is going to have to be around every 0.039 seconds before another check will need to occur, or 25 checks will need to occur every second.
Obviously, the greater the average check interval, the fewer checks per second need to occur in order to balance the checks.
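For reference, the delay arithmetic in the two examples above can be reproduced with a short script (a standalone sketch, not part of Nagios):

```python
# Reproduce the "smart" inter-check delay math from the examples above.
def inter_check_delay(avg_interval_min, total_services):
    """Smart delay in seconds: (average check interval) / (number of services)."""
    delay_min = avg_interval_min / total_services
    return delay_min * 60  # minutes -> seconds

for avg in (5.0, 6.5):
    delay_s = inter_check_delay(avg, 10000)
    print(f"avg interval {avg} min -> delay {delay_s:.3f} s "
          f"(~{int(1 / delay_s)} checks/second)")
```

Running it prints the 0.03 s / 33-per-second and 0.039 s / 25-per-second figures worked through above.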
Now consider a more realistic scenario with these checks:
15% of checks at 15 seconds
25% of checks at 60 seconds
50% of checks at 300 seconds
10% of checks at 600 seconds
There are several very interesting aspects here. First, there is a wide spread from 15 seconds to 10 minutes, which will certainly distort the averages. So let's take a look again using 10,000 checks (check-seconds per class = number of checks x interval):
15% of checks at 15 seconds = 1,500 checks x 15 s = 22,500
25% of checks at 60 seconds = 2,500 checks x 60 s = 150,000
50% of checks at 300 seconds = 5,000 checks x 300 s = 1,500,000
10% of checks at 600 seconds = 1,000 checks x 600 s = 600,000
average = 2,272,500 / 10,000 = 227.25 seconds, or 3.7875 minutes
inter check delay = 3.7875 / 10000
This comes out to 0.00037875 minutes or about 0.0227 seconds. This means that the initial spread for the checks is going to have to be around one check every 0.0227 seconds, or about 44 checks will need to occur every second.
One of the issues with this example is that the spread of check times is quite wide and will, in time, probably twist this into a terrible mess. This is the first limitation of “smart”: it assumes the spread in check times will be fairly consistent.
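The weighted-average arithmetic above is easy to sanity-check with a short script (again a standalone sketch, fractions and intervals taken from the example):

```python
# Recompute the weighted-average check interval and "smart" delay for the
# mixed-interval example above.
mix = [  # (fraction of checks, check interval in seconds)
    (0.15, 15),
    (0.25, 60),
    (0.50, 300),
    (0.10, 600),
]
total_checks = 10000

# Total check-seconds across all classes, then the per-check average.
total_check_seconds = sum(frac * total_checks * interval for frac, interval in mix)
avg_interval_s = total_check_seconds / total_checks

# smart: delay = average interval / number of services (units cancel).
delay_s = avg_interval_s / total_checks

print(f"average interval: {avg_interval_s:.2f} s ({avg_interval_s / 60:.4f} min)")
print(f"smart delay: {delay_s:.4f} s -> ~{int(1 / delay_s)} checks/second")
```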
Service Interleave
This setting is used to equalize the load on the hosts that are monitored. In other words, the checks are spread out on the remote host so as not to load the remote host beyond what it can do. Remember that in Nagios 3 parallelization is possible so that multiple checks can be run at the same time.
Here are the defaults (using smart):
service_interleave_factor=s
If you use “smart”, this is how it works.
service interleave factor = ceil((total services) / (total hosts))
ceil means that the result is rounded up to the nearest integer.
So in the example above with 10,000 checks and 500 hosts you get 20.
Service interleave = 10,000 / 500
So what this means is that Nagios will schedule 1 check for a host, then skip the next 19 checks for that host until all hosts have the first check, then schedule the next check for that host.
Another way of doing this is to change to this setting:
service_interleave_factor=1
By using a “1” for this option it is basically disabled and all of the checks for one host will be set up and then all of the checks will be scheduled for the second host, etc.
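To make the ordering concrete, here is a small standalone sketch of what different interleave factors do to the scheduling order (toy host/service numbering for illustration, not Nagios internals):

```python
# Illustrate the scheduling order produced by service_interleave_factor.
def schedule_order(hosts, services_per_host, interleave):
    """Return the order in which (host, service) checks are scheduled."""
    # Flat list grouped by host, as Nagios sees its service list.
    services = [(h, s) for h in range(hosts) for s in range(services_per_host)]
    order = []
    # With factor k, schedule every k-th service, then start over at the
    # next offset, so consecutive checks land on different hosts.
    for offset in range(interleave):
        order.extend(services[offset::interleave])
    return order

# interleave = 1: all of host 0's checks, then all of host 1's checks.
print(schedule_order(2, 3, 1))
# "smart" with 3 services/host on 2 hosts -> ceil(6/2) = 3: hosts alternate.
print(schedule_order(2, 3, 3))
```

With a factor of 1 every check for the first host is scheduled before the second host is touched; with the “smart” factor the hosts alternate, spreading the load across them.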
This is one of those settings that you will need to experiment with to get the most out of it. The problem lies in the fact that there are a lot of variables that influence this setting. One major factor is the number of checks per host. If that average is consistent you will more likely be able to use smart. For example if most of the hosts have 10 checks, smart will probably do OK.
On the other hand, if you have some hosts with 20 checks and others with 4 checks, this will distort the average considerably and you will more likely find better results by testing the service_interleave_factor between 8 and 12, again trying to find an average that works. Remember, if 80% of hosts have 4 checks this will also distort the averages, so take all of that into consideration.
Adjustments
Adjustments must be thought of in terms of chemistry: altering one factor will certainly alter other elements of the system. When making these adjustments, be careful to monitor closely and document all changes so that the previous settings can be restored. With that said, most of the recommendations are in several stages in order to assess the impact on the system.
Inter Check Delay
In high performance situations, you will probably need to move away from “smart”, especially if your check settings have a wide range of check times. For hosts, use “n” to signify that there is no delay or a very small delay. For services use a small delay.
host_inter_check_delay_method=0.01
service_inter_check_delay_method=0.01
Service Interleave
This setting, as mentioned above, is impacted by the average of the number of services divided by the number of hosts. If you take services/hosts and come up with an integer, you may need to adjust that number downward if you have a wide range of difference in the number of service checks on each host. For example, in one test, smart indicated that 12 was a good number to use, but it was found that much better performance was attained using some trial and error with the number 6. Test and then verify.
Maximum Concurrent Checks
No change is recommended for maximum concurrent checks; allow Nagios to run as many as possible.
max_concurrent_checks=0
Reaper Frequency
The first stage is to move the reaper frequency to the high performance recommendations of Nagios:
check_result_reaper_frequency=3
max_check_result_reaper_time=10
The second stage will be to move the frequency lower.
check_result_reaper_frequency=2
Sleep Time
Adjustments in sleep time should be to reduce the time Nagios sleeps. This should be done in stages in order to evaluate the impact on the system. The first stage should change the setting to:
sleep_time=0.1
If that setting indicates no problems then move to stage two which is to basically eliminate sleep time.
sleep_time=0.01
Mike Weber
Nagios Training/Consulting
- Box293
- Too Basu
- Posts: 5126
- Joined: Sun Feb 07, 2010 10:55 pm
- Location: Deniliquin, Australia
- Contact:
Re: Monitoring Engine Event Queue bottlenecks occasionally
mguthrie,
Thanks for those steps. I ran through the steps you mentioned.
MySQL did output some stuff like:
/usr/local/nagiosxi/scripts/repairmysql.sh nagios
DATABASE: nagios
TABLE:
/var/lib/mysql/nagios ~
Warning: option 'key_buffer_size': unsigned value 18446744073709551615 adjusted to 4294963200
Warning: option 'read_buffer_size': unsigned value 18446744073709551615 adjusted to 4294967295
Warning: option 'write_buffer_size': unsigned value 18446744073709551615 adjusted to 4294967295
Warning: option 'sort_buffer_size': unsigned value 18446744073709551615 adjusted to 4294967295
But nothing about crashed tables.
Postgres also did output some stuff:
psql postgres postgres
postgres=# VACUUM;
NOTICE: number of page slots needed (27584) exceeds max_fsm_pages (20000)
HINT: Consider increasing the configuration parameter "max_fsm_pages" to a value over 27584.
VACUUM
postgres=#
When I was looking at the log files in the pg_log folder there were two different entries:
ERROR: relation "xi_notifications" does not exist
LOG: unexpected EOF on client connection
I'll see how it performs and get back to you.
You are dead right about the disk activity with VMs, I'll also be keeping an eye on this.
Re: Monitoring Engine Event Queue bottlenecks occasionally
mikew,
Thanks very much for the detailed information.
Because I am taking holidays and won't be back in the office until February, I won't start going through these steps until then.
mikew wrote: I am not making recommendations besides being very careful and documenting what you do and evaluating over time.
However I am keen to work through these as this seems like a common problem you've come across.
Cheers
Troy
Re: Monitoring Engine Event Queue bottlenecks occasionally
One thing I noted with my installation running on a VM is that CPU scheduling with VMware seemed to really impact my monitoring latencies negatively. I'm not sure this applies in your case, but I thought I would mention it. Essentially we had our Nagios server configured with 4 vCPUs, but moving it back to 2 really improved the monitoring engine performance; weekends were a problem when VM resources were being utilized for system backups and other similar tasks. Also, I was saving too much history in my MySQL database (i.e. log entries, state history), which resulted in the hourly db optimization task taking quite a while to complete and seemed to halt the engine while it ran, so I slashed those values and that improved things greatly.
Re: Monitoring Engine Event Queue bottlenecks occasionally
Thanks everybody for the feedback on this, this is good info!
Re: Monitoring Engine Event Queue bottlenecks occasionally
paul.jobb,
You are correct about VMs and CPU scheduling.
One thing that really affects this is the hypervisor version and the CPU version. I'm pretty sure I saw massive improvements when we upgraded to ESXi 4.1; there were some major kernel changes that helped CPU scheduling (we're now on 5.0 U1). In addition to this, each Intel CPU family that comes out always has virtualisation improvements.
However, as per your suggestion I'm going to cut my ESXi server back to 2 vCPUs. I was on 3 because we were running a bunch of checks that used the VMware PowerCLI and its CPU usage was outrageous. I disabled all those checks but forgot to remove one of the CPUs.
I am keen to offload MySQL to a separate box as I suspect this is one of the key contributors to the issue. Disk I/O goes up whenever the issue occurs, and it only seems to do this when MySQL is consuming heaps of CPU. I'll be doing this next year when I get back from holidays.
How do I change how much history is stored in the MySQL database?
How do I purge anything older than 12 months?
-
scottwilkerson
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
- Contact:
Re: Monitoring Engine Event Queue bottlenecks occasionally
Box293 wrote: How do I change how much history is stored in the MySQL database? How do I purge anything older than 12 months?
Both of these can be done in Admin -> Performance Settings -> Database Tab.
Re: Monitoring Engine Event Queue bottlenecks occasionally
I have phpMyAdmin installed on my MySQL server, so I was able to see that the optimize db process was running for an extended amount of time and blocking other database processes, specifically when sorting the nagios_logentries and nagios_notifications tables. This resulted in an hourly CPU spike as well as a pause in monitoring during that time period. Those two tables had millions of records each in my case, so I trimmed them to save 3 days' worth of data instead, and that appeared to address my issues.