
High Service Check Latency

Posted: Thu Oct 24, 2013 6:06 am
by petronagios
Hi, we are experiencing high Service Check Latency on one of our distributed Nagios Core servers. The Nagios configuration is quite small (see the tactical overview output below), but before I restarted Nagios the service check latency figures were in the 1000s!

Service Check Execution Time: 0.01 / 6.52 / 1.295 sec
Service Check Latency: 0.62 / 223.07 / 118.967 sec
Host Check Execution Time: 4.00 / 4.22 / 4.080 sec
Host Check Latency: 0.01 / 306.80 / 135.256 sec
# Active Host / Service Checks: 50 / 488
# Passive Host / Service Checks: 0 / 0

The server is a small VM with 4 CPUs and 4 GB of memory, and the load average is consistently close to zero (see the top snapshot below).

top - 20:19:44 up 87 days, 1:34, 2 users, load average: 0.32, 0.14, 0.05
Tasks: 155 total, 1 running, 154 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.3%us, 0.2%sy, 0.0%ni, 99.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 4043664k total, 1353964k used, 2689700k free, 184248k buffers
Swap: 2097144k total, 96k used, 2097048k free, 587180k cached

I know I could try altering the following parameters

max_concurrent_checks=0
max_check_result_reaper_time=30
check_result_reaper_frequency=10

but is that required for such a small number of hosts/services? Is there something fundamental I've missed in the basic configuration?

Thanks
Steve.
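
Before changing any of the three directives named above, it can help to confirm what the running configuration actually contains. The following is a minimal sketch, not an official Nagios tool: `read_directives` is a hypothetical helper, and the sample text simply repeats the values quoted in this post rather than a real `nagios.cfg`.

```python
def read_directives(cfg_text, names):
    """Return {name: value} for the requested directives found in cfg_text."""
    found = {}
    for raw in cfg_text.splitlines():
        line = raw.strip()
        # Skip blank lines and '#' comment lines.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep and key.strip() in names:
            found[key.strip()] = value.strip()
    return found


# Assumption: sample text standing in for the main configuration file.
SAMPLE = """\
# values quoted in the post above
max_concurrent_checks=0
max_check_result_reaper_time=30
check_result_reaper_frequency=10
"""

WANTED = {
    "max_concurrent_checks",
    "max_check_result_reaper_time",
    "check_result_reaper_frequency",
}

settings = read_directives(SAMPLE, WANTED)
for name in sorted(settings):
    print(name, "=", settings[name])
```

To use it against a real server, read the contents of the main configuration file into a string and pass that in place of `SAMPLE`; the file location depends on the installation.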

Re: High Service Check Latency

Posted: Thu Oct 24, 2013 10:22 am
by abrist
What type of checks are you running? With latency that high, I would assume you have some checks that are taking a very long time to complete (or are not completing at all). Run the following command and then post ps.txt as an attachment:

ps aux > /tmp/ps.txt
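
Once the `ps aux` output is saved, it can be scanned for check processes that are still hanging around. This is a rough sketch that assumes the usual eleven-column `ps aux` layout (USER through COMMAND); `find_processes` is a hypothetical helper, and the sample listing, including the port, host, and feature name, is made up for illustration.

```python
def find_processes(ps_text, needle):
    """Return the processes whose COMMAND column contains needle."""
    matches = []
    for line in ps_text.splitlines()[1:]:  # first line is the column header
        # Split into at most 11 fields so COMMAND keeps its spaces.
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        if needle in fields[10]:
            matches.append({
                "pid": int(fields[1]),
                "start": fields[8],
                "cpu_time": fields[9],
                "command": fields[10],
            })
    return matches


# Assumption: invented sample output, not taken from the poster's server.
SAMPLE_PS = """\
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
nagios    2101  0.0  0.1  10000  1200 ?        Ss   Jul29   0:42 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios    9001  0.0  0.0   5000   800 ?        S    20:19   0:00 lmutil lmstat -c 27000@licsrv -f FEATURE_A
"""

hits = find_processes(SAMPLE_PS, "lmutil")
for hit in hits:
    print(hit["pid"], hit["start"], hit["command"])
```

Many matches with an old START value would suggest checks that are not finishing.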

Re: High Service Check Latency

Posted: Fri Oct 25, 2013 9:46 am
by petronagios
Hi, here's the output from ps aux. I'll post an update about the checks later today. Thanks.

Re: High Service Check Latency

Posted: Fri Oct 25, 2013 1:56 pm
by yancy
petronagios,
The issue, as abrist points out, is probably due to the type of checks you are running. For example, an active check over a very low-bandwidth connection, or one that runs custom scripts on the remote end that take a long time to complete, would cause this.

-Yancy

Re: High Service Check Latency

Posted: Mon Oct 28, 2013 10:29 am
by petronagios
OK, I’ve had a look at the type of checks we are running on the server. They are all active checks, and 80 of the 488 are license manager checks running the following command at 2-minute intervals:

lmutil lmstat -c $PORT@$HOST -f $FEATURE

In my tests, each lmutil command takes four to five seconds to complete! By comparison, the usual Nagios plugins run via NRPE (check_load, disk, mem, etc.) complete in less than a second.
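
A timing test like the one described above can be repeated for any check command with a small wrapper. This is only a sketch: `time_command` is a hypothetical helper, and the demo times a trivial Python invocation as a stand-in, since the real `lmutil` arguments depend on the license server.

```python
import subprocess
import sys
import time


def time_command(argv, timeout=60):
    """Run argv once and return (elapsed_seconds, exit_code)."""
    start = time.perf_counter()
    completed = subprocess.run(
        argv,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout,  # raises subprocess.TimeoutExpired if exceeded
    )
    return time.perf_counter() - start, completed.returncode


# Stand-in command so the sketch runs anywhere; swap in the real check,
# e.g. ["lmutil", "lmstat", "-c", "<port>@<host>", "-f", "<feature>"].
elapsed, code = time_command([sys.executable, "-c", "pass"])
print(f"exit code {code}, took {elapsed:.2f} s")
```

Running it several times gives a better picture than a single measurement, because one run can be skewed by caching or a busy license server.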

Do you think that’s what could be causing the high Service Check Latency?

Re: High Service Check Latency

Posted: Mon Oct 28, 2013 10:54 am
by abrist
Those could definitely add latency. If you change their interval to 5 minutes, does the latency decrease?
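
As a back-of-the-envelope estimate of what the interval change does, the figures from this thread (80 license checks, four to five seconds each) can be turned into an average number of those checks running at any moment. The 4.5 second run time is an assumed midpoint, and the model ignores scheduling details, so treat the result as a rough guide only.

```python
def average_in_flight(num_checks, run_seconds, interval_seconds):
    """Average number of checks executing at once, assuming even spacing."""
    return num_checks * run_seconds / interval_seconds


# 80 checks, assumed 4.5 s each (midpoint of the 4-5 s reported above).
at_2_min = average_in_flight(80, 4.5, 120)
at_5_min = average_in_flight(80, 4.5, 300)

print(f"2-minute interval: about {at_2_min:.1f} license checks running at once")
print(f"5-minute interval: about {at_5_min:.1f} license checks running at once")
```

Under these assumptions the average drops from about 3 concurrent license checks to about 1.2.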

Re: High Service Check Latency

Posted: Tue Oct 29, 2013 3:50 am
by petronagios
Thanks, abrist and yancy, for your replies. I changed the license manager checks to run every 5 minutes instead of 2, and the average Service Check Latency has come down:

Service Check Execution Time: 0.01 / 8.41 / 1.338 sec
Service Check Latency: 2.27 / 237.61 / 104.679 sec
Host Check Execution Time: 4.00 / 4.21 / 4.079 sec
Host Check Latency: 0.00 / 366.47 / 192.863 sec
# Active Host / Service Checks: 50 / 488
# Passive Host / Service Checks: 0 / 0

I didn’t realise these checks were taking so long to complete. I’ll see whether all the license feature checks are required; maybe I can reduce their number or stagger them to help improve performance.
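
The two tactical overview snapshots in this thread can be compared directly. The sketch below parses the "min / max / avg" lines and computes the percentage change; `parse_stat` and `percent_change` are hypothetical helpers, and the two input lines are the Service Check Latency figures posted before and after the interval change. The snapshots were taken at different times, so the comparison is indicative rather than exact.

```python
import re


def parse_stat(line):
    """Split 'Label: min / max / avg sec' into (label, stats dict)."""
    label, _, values = line.partition(":")
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", values)]
    low, high, avg = numbers  # expects exactly three figures
    return label.strip(), {"min": low, "max": high, "avg": avg}


def percent_change(before, after):
    """Signed change from before to after, as a percentage of before."""
    return (after - before) / before * 100.0


_, before = parse_stat("Service Check Latency: 0.62 / 223.07 / 118.967 sec")
_, after = parse_stat("Service Check Latency: 2.27 / 237.61 / 104.679 sec")

avg_change = percent_change(before["avg"], after["avg"])
max_change = percent_change(before["max"], after["max"])
print(f"average: {avg_change:+.1f}%  maximum: {max_change:+.1f}%")
```

On these figures the average latency fell by roughly 12% while the maximum rose by roughly 6.5%, so there is an improvement but latency is still far from low.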

Re: High Service Check Latency

Posted: Tue Oct 29, 2013 10:54 am
by lmiltchev
Sounds good. Let us know if you have any more issues.