Monitoring Engine Event Queue bottlenecks occasionally
- Box293
- Too Basu
- Posts: 5126
- Joined: Sun Feb 07, 2010 10:55 pm
- Location: Deniliquin, Australia
- Contact:
I've been observing the Monitoring Engine Event Queue dashlet that shows me the Scheduled Events Over Time. What seems to happen on a regular basis is that the queue becomes bottlenecked by some process ... and then eventually the bottleneck disappears.
This seems to have occurred since we upgraded to XI 2012. Usually we don't need to do anything and it corrects itself.
The problem is that sometimes the "Now" scheduled events get up to 3000+ and we need to restart the monitoring engine.
Is anyone else experiencing this behaviour?
Nagios XI 2012R1.2 VM CentOS 32bit on ESXi 5.0 Update 1.
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
Re: Monitoring Engine Event Queue bottlenecks occasionally
It would be worth running the MySQL repair procedure:
http://assets.nagios.com/downloads/nagi ... tabase.pdf
As well as the vacuum commands on PostgreSQL:
http://support.nagios.com/wiki/index.ph ... .22_in_log
There are a few possible reasons for this:
- Lots of disk activity causes things to get backed up because the system is waiting to write to disk. We see this sometimes on VMs because of a shared physical disk.
- Long-running checks or event handlers; these will block the main Nagios loop and hold up the check schedule.
- A big spike in CPU usage could cause this, but usually this would show as a consistently high load...
Re: Monitoring Engine Event Queue bottlenecks occasionally
I have spent considerable time trying to work toward concrete solutions for this problem. I have worked with a number of organizations that have a large number of checks, with a large difference in the times at which those checks occur. I am not making recommendations beyond being very careful, documenting what you do, and evaluating over time. Here is some of my research, which basically tells you to be careful with changes! This information is not complete, and every situation has variables which make outcomes different, so keep that in mind.
I am interested in getting feedback from others working to solve this issue for XI so feel free to contact me. We work with this issue in the Advanced Nagios Training because it seems to come up frequently with larger installs.
One thing to note, this condition will often occur if you have checks that do not run 24x7 as they build up and dump all at once.
inter_check_delay_method
This setting is used to space the checks in order to equalize the load on the Nagios server as well as on the hosts that it is monitoring. By default this setting uses “smart” to calculate the spread. It is important to understand how “smart” works, as there are some situations in which this setting is not optimal.
n = no delay - schedule all service checks to run immediately
d = use a "dumb" delay of 1 second between service checks
s = use a "smart" delay calculation to spread service checks out evenly (default)
x.xx = use a user-supplied inter-check delay of x.xx seconds
Here are the defaults (both using smart):
host_inter_check_delay_method=s
service_inter_check_delay_method=s
smart
This setting uses the following formula to decide on the spread between checks:
inter check delay = (average check interval)/(total number of services)
Here is an example. The Nagios server has 10,000 checks with an average check interval of 5 minutes. This means that the formula then looks like this:
inter check delay = 5 / 10000
This comes out to 0.0005 minutes or 0.03 seconds. This means that the initial spread for the checks is going to have to be around every 0.03 seconds before another check will need to occur, or 33 checks will need to occur every second.
Let's do the math again, this time using an average of 6.5 minutes.
inter check delay = 6.5 / 10000
This comes out to 0.00065 minutes or 0.039 seconds. This means that the initial spread for the checks is going to have to be around every 0.039 seconds before another check will need to occur, or 25 checks will need to occur every second.
Obviously, the greater the average check interval, the fewer checks per second need to occur in order to balance the checks.
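For reference, the delay arithmetic in the two examples above can be reproduced with a short script (a standalone sketch, not part of Nagios):

```python
# Reproduce the "smart" inter-check delay math from the examples above.
def inter_check_delay(avg_interval_min, total_services):
    """Smart delay in seconds: (average check interval) / (number of services)."""
    delay_min = avg_interval_min / total_services
    return delay_min * 60  # minutes -> seconds

for avg in (5.0, 6.5):
    delay_s = inter_check_delay(avg, 10000)
    print(f"avg interval {avg} min -> delay {delay_s:.3f} s "
          f"(~{int(1 / delay_s)} checks/second)")
```

Running it prints the 0.03 s / 33-per-second and 0.039 s / 25-per-second figures worked through above.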
Now consider a more realistic scenario with these checks:
15% of checks at 15 seconds
25% of checks at 60 seconds
50% of checks at 300 seconds
10% of checks at 600 seconds
There are several very interesting aspects here. First, there is a wide spread from 15 seconds to 10 minutes, which will certainly distort the averages. So let's take a look again using 10,000 checks (check-seconds per class = number of checks x interval):
15% of checks at 15 seconds = 1,500 checks x 15 s = 22,500
25% of checks at 60 seconds = 2,500 checks x 60 s = 150,000
50% of checks at 300 seconds = 5,000 checks x 300 s = 1,500,000
10% of checks at 600 seconds = 1,000 checks x 600 s = 600,000
average = 2,272,500 / 10,000 = 227.25 seconds, or 3.7875 minutes
inter check delay = 3.7875 / 10000
This comes out to 0.00037875 minutes or about 0.0227 seconds. This means that the initial spread for the checks is going to have to be around one check every 0.0227 seconds, or about 44 checks will need to occur every second.
One of the issues with this example is that the spread of check times is quite wide and will, in time, probably twist this into a terrible mess. This is the first limitation of “smart”: it assumes the spread in check times will be fairly consistent.
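The weighted-average arithmetic above is easy to sanity-check with a short script (again a standalone sketch, fractions and intervals taken from the example):

```python
# Recompute the weighted-average check interval and "smart" delay for the
# mixed-interval example above.
mix = [  # (fraction of checks, check interval in seconds)
    (0.15, 15),
    (0.25, 60),
    (0.50, 300),
    (0.10, 600),
]
total_checks = 10000

# Total check-seconds across all classes, then the per-check average.
total_check_seconds = sum(frac * total_checks * interval for frac, interval in mix)
avg_interval_s = total_check_seconds / total_checks

# smart: delay = average interval / number of services (units cancel).
delay_s = avg_interval_s / total_checks

print(f"average interval: {avg_interval_s:.2f} s ({avg_interval_s / 60:.4f} min)")
print(f"smart delay: {delay_s:.4f} s -> ~{int(1 / delay_s)} checks/second")
```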
Service Interleave
This setting is used to equalize the load on the hosts that are monitored. In other words, the checks are spread out on the remote host so as not to load the remote host beyond what it can do. Remember that in Nagios 3 parallelization is possible so that multiple checks can be run at the same time.
Here are the defaults (using smart):
service_interleave_factor=s
If you use “smart”, this is how it works.
service interleave factor = ceil((total services) / (total hosts))
ceil means that the result is rounded up to the nearest integer.
So in the example above with 10,000 checks and 500 hosts you get 20.
Service interleave = 10,000 / 500
So what this means is that Nagios will schedule 1 check for a host, then skip the next 19 checks for that host until all hosts have the first check, then schedule the next check for that host.
Another way of doing this is to change to this setting:
service_interleave_factor=1
By using a “1” for this option it is basically disabled and all of the checks for one host will be set up and then all of the checks will be scheduled for the second host, etc.
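To make the ordering concrete, here is a small standalone sketch of what different interleave factors do to the scheduling order (toy host/service numbering for illustration, not Nagios internals):

```python
# Illustrate the scheduling order produced by service_interleave_factor.
def schedule_order(hosts, services_per_host, interleave):
    """Return the order in which (host, service) checks are scheduled."""
    # Flat list grouped by host, as Nagios sees its service list.
    services = [(h, s) for h in range(hosts) for s in range(services_per_host)]
    order = []
    # With factor k, schedule every k-th service, then start over at the
    # next offset, so consecutive checks land on different hosts.
    for offset in range(interleave):
        order.extend(services[offset::interleave])
    return order

# interleave = 1: all of host 0's checks, then all of host 1's checks.
print(schedule_order(2, 3, 1))
# "smart" with 3 services/host on 2 hosts -> ceil(6/2) = 3: hosts alternate.
print(schedule_order(2, 3, 3))
```

With a factor of 1 every check for the first host is scheduled before the second host is touched; with the “smart” factor the hosts alternate, spreading the load across them.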
This is one of those settings that you will need to experiment with to get the most out of it. The problem lies in the fact that there are a lot of variables that influence this setting. One major factor is the number of checks per host. If that average is consistent you will more likely be able to use smart. For example if most of the hosts have 10 checks, smart will probably do OK.
On the other hand, if you have some hosts with 20 checks and others with 4 checks, this will distort the average considerably and you will more likely find better results by testing the service_interleave_factor between 8 and 12, again trying to find an average that works. Remember, if 80% of hosts have 4 checks this will also distort the averages, so take all of that into consideration.
Adjustments
Adjustments must be thought of in terms of chemistry: altering one factor will certainly alter other elements of the system. When making these adjustments, be careful to monitor closely and document all changes so that the previous settings can be restored. With that said, most of the recommendations are in several stages in order to assess the impact on the system.
Inter Check Delay
In high performance situations, you will probably need to move away from “smart”, especially if your check settings have a wide range of check times. For hosts, use “n” to signify that there is no delay or a very small delay. For services use a small delay.
host_inter_check_delay_method=0.01
service_inter_check_delay_method=0.01
Service Interleave
This setting, as mentioned above, is impacted by the average of the number of services divided by the number of hosts. If you take services/hosts and come up with an integer, you may need to adjust that number downward if you have a wide range of difference in the number of service checks on each host. For example, in one test, smart indicated that 12 was a good number to use, but it was found that much better performance was attained using some trial and error with the number 6. Test and then verify.
Maximum Concurrent Checks
No change is recommended for maximum concurrent checks; allow Nagios to run as many as possible.
max_concurrent_checks=0
Reaper Frequency
The first stage is to move the reaper frequency to the high performance recommendations of Nagios:
check_result_reaper_frequency=3
max_check_result_reaper_time=10
The second stage will be to move the frequency lower.
check_result_reaper_frequency=2
Sleep Time
Adjustments in sleep time should be to reduce the time Nagios sleeps. This should be done in stages in order to evaluate the impact on the system. The first stage should change the setting to:
sleep_time=0.1
If that setting indicates no problems then move to stage two which is to basically eliminate sleep time.
sleep_time=0.01
Mike Weber
Nagios Training/Consulting
- Box293
- Too Basu
- Posts: 5126
- Joined: Sun Feb 07, 2010 10:55 pm
- Location: Deniliquin, Australia
- Contact:
Re: Monitoring Engine Event Queue bottlenecks occasionally
mguthrie,
Thanks for those steps. I ran through the steps you mentioned.
MySQL did output some stuff like:
/usr/local/nagiosxi/scripts/repairmysql.sh nagios
DATABASE: nagios
TABLE:
/var/lib/mysql/nagios ~
Warning: option 'key_buffer_size': unsigned value 18446744073709551615 adjusted to 4294963200
Warning: option 'read_buffer_size': unsigned value 18446744073709551615 adjusted to 4294967295
Warning: option 'write_buffer_size': unsigned value 18446744073709551615 adjusted to 4294967295
Warning: option 'sort_buffer_size': unsigned value 18446744073709551615 adjusted to 4294967295
But nothing about crashed tables.
Postgres also did output some stuff:
psql postgres postgres
postgres=# VACUUM;
NOTICE: number of page slots needed (27584) exceeds max_fsm_pages (20000)
HINT: Consider increasing the configuration parameter "max_fsm_pages" to a value over 27584.
VACUUM
postgres=#
When I was looking at the log files in the pg_log folder there were two different entries:
ERROR: relation "xi_notifications" does not exist
LOG: unexpected EOF on client connection
I'll see how it performs and get back to you.
You are dead right about the disk activity with VMs, I'll also be keeping an eye on this.
Re: Monitoring Engine Event Queue bottlenecks occasionally
mikew,
Thanks very much for the detailed information.
Because I am taking holidays and won't be back in the office until February, I won't start going through these steps until then.
mikew wrote: I am not making recommendations besides being very careful and documenting what you do and evaluating over time.
However I am keen to work through these as this seems like a common problem you've come across.
Cheers
Troy
Re: Monitoring Engine Event Queue bottlenecks occasionally
One thing I noted with my installation running on a VM is that CPU scheduling with VMware seemed to really impact my monitoring latencies negatively. I'm not sure this applies in your case, but I thought I would mention it. Essentially we had our Nagios server configured with 4 vCPUs, but moving it back to 2 really improved the monitoring engine performance; weekends were a problem when VM resources were being utilized for system backups and other similar tasks. Also, I was saving too much history in my MySQL database (i.e. log entries, state history), which resulted in the hourly db optimization task taking quite a while to complete and seemed to halt the engine while it ran, so I slashed those values and that improved things greatly.
Re: Monitoring Engine Event Queue bottlenecks occasionally
Thanks everybody for the feedback on this, this is good info!
Re: Monitoring Engine Event Queue bottlenecks occasionally
paul.jobb,
You are correct about VMs and CPU scheduling.
One thing that really affects this is the hypervisor version and the CPU version. I'm pretty sure I saw massive improvements when we upgraded to ESXi 4.1; there were some major kernel changes that helped CPU scheduling (we're now on 5.0 U1). In addition to this, each Intel CPU family that comes out always has virtualisation improvements.
However, as per your suggestion I'm going to cut my ESXi server back to 2 vCPUs. I was on 3 because we were running a bunch of checks that used the VMware PowerCLI and its CPU usage was outrageous. I disabled all those checks but forgot to remove one of the CPUs.
I am keen to offload MySQL to a separate box as I suspect this is one of the key contributors to the issue. Disk I/O goes up whenever the issue occurs, and it only seems to do this when MySQL is consuming heaps of CPU. I'll be doing this next year when I get back from holidays.
How do I change how much history is stored in the MySQL database?
How do I purge anything older than 12 months?
-
scottwilkerson
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
- Contact:
Re: Monitoring Engine Event Queue bottlenecks occasionally
Box293 wrote: How do I change how much history is stored in the MySQL database? How do I purge anything older than 12 months?
Both of these can be done in Admin -> Performance Settings -> Database Tab.
Re: Monitoring Engine Event Queue bottlenecks occasionally
I have phpMyAdmin installed on my MySQL server, so I was able to see that the optimize db process was running for an extended amount of time and blocking other database processes, specifically when sorting the nagios_logentries and nagios_notifications tables. This resulted in an hourly CPU spike as well as a pause in monitoring during that time period. Those two tables had millions of records each in my case, so I trimmed them to save 3 days' worth of data instead, and that appeared to address my issues.