service check latency 5k and climbing

Posted: Tue Jul 17, 2012 11:57 am
by benhank
My service check latency used to hover around 243, but it is now at 5k+. It has gotten as high as 100k+.

We do check 5k+ services.
The odd thing is that this started after we installed mod_gearman.

Re: service check latency 5k and climbing

Posted: Tue Jul 17, 2012 12:04 pm
by scottwilkerson
Are you talking about 100k secs?

Re: service check latency 5k and climbing

Posted: Tue Jul 17, 2012 12:16 pm
by benhank
Your Nagios XI installation is up to date.

Latest Available Version: 2011R3.2
Installed Version: 2011R3.2
Last Update Check: 07/17/2012 11:46:01
Last Updated: 07/17/2012 13:05:07

Monitoring Engine Event Queue
Scheduled Events Over Time
Monitoring Engine Check Statistics
Metric Value

Active Host Checks
1-min 386
5-min 2,434
15-min 3,141

Passive Host Checks
1-min 0
5-min 0
15-min 0

Active Service Checks
1-min 502
5-min 4,311
15-min 5,920

Passive Service Checks
1-min 0
5-min 0
15-min 0
Last Updated: 07/17/2012 13:06:29

Monitoring Engine Performance
Metric Value

Host Check Latency
Min 0.00 sec
Max 46.28 sec
Avg 1.01 sec

Host Check Execution Time
Min 0.00 sec
Max 31.00 sec
Avg 0.49 sec

Service Check Latency
Min 0.00 sec
Max 6,464.34 sec <--- This was at 100k+ as of 10:00 am this morning, before I rebooted the server. I had been gone since last Thursday, so it grew to 100k+ while I was away. After the reboot, this is where it stands now. All other values are normal, as seen here.
Avg 45.29 sec

Service Check Execution Time
Min 0.00 sec
Max 61.01 sec
Avg 1.04 sec

Last Updated: 07/17/2012 13:06:29

Re: service check latency 5k and climbing

Posted: Tue Jul 17, 2012 12:36 pm
by scottwilkerson
My guess is that something (maybe the mod_gearman daemon) stopped working correctly and the latency kept growing. Note that the number you highlighted is the Max, not the current latency.

Looking at the results you posted, the system seems to have processed almost 7,000 checks in the last 5 minutes, so it is likely on track.

I would watch it closely to see whether it starts keeping up.

Also, I would look at any logs you are getting from mod_gearman, both on the server and on the clients, as well as the syslog, to see if you can spot any problems.

Re: service check latency 5k and climbing

Posted: Tue Jul 17, 2012 12:54 pm
by benhank
I think I got it. I had a lot of warnings for services that had no check times defined. I added the check times and now I am down to 800+.
How do I fix warnings like these?
Warning: Duplicate definition found for service 'Ping' on host 'wkendsvp01.healthone.org' (config file '/usr/local/nagios/etc/services/windows-servers.cfg', starting on line 101)
Warning: Duplicate definition found for service 'CPU Usage for VMHost' on host 'vkenesxt01' (config file '/usr/local/nagios/etc/services/vmware-servers.cfg', starting on line 14)
There are more like these. I think if I clean those out I should be good.
Thanks in advance, Scott.

Re: service check latency 5k and climbing

Posted: Tue Jul 17, 2012 1:31 pm
by scottwilkerson
benhank wrote:How do I fix the
Warning: Duplicate definition found for service 'Ping' on host 'wkendsvp01.healthone.org' (config file '/usr/local/nagios/etc/services/windows-servers.cfg', starting on line 101)
Warning: Duplicate definition found for service 'CPU Usage for VMHost' on host 'vkenesxt01' (config file '/usr/local/nagios/etc/services/vmware-servers.cfg', starting on line 14)
OK, this is usually caused by having, for example, Ping set up in two places for the same host, or by having a hostgroup added to a service.

We will use the first as an example.

If you go to CCM -> Services and select windows-servers.cfg from the Config name filter, you will see a Ping service.

I am guessing that if you open this service to modify it, you will find that you have added either multiple hosts or a hostgroup. This is fine; however, you likely also have a Ping service defined under the host's own config, which you can see if you go to CCM -> Services and select wkendsvp01.healthone.org.cfg.

When you add a hostgroup to a service, that service will be defined for ALL hosts in the hostgroup.

So, to fix this you need to remove one of the Ping services so you do not have more than one.
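
If you would rather hunt for these outside of CCM, here is a rough Python sketch of the idea. It assumes the standard Nagios object syntax (define service { ... } blocks with host_name and service_description directives). The function names and the sample config contents are made up for illustration, and it does not expand hostgroups or templates, so treat it as a starting point rather than a complete checker.

```python
import re
from collections import defaultdict

# Matches one "define service { ... }" block. Simplification: assumes no
# directive value inside the block contains a closing brace.
SERVICE_BLOCK = re.compile(r"define\s+service\s*\{(.*?)\}", re.DOTALL)


def parse_services(text):
    """Yield (host_name, service_description) pairs found in config text."""
    for block in SERVICE_BLOCK.finditer(text):
        fields = {}
        for line in block.group(1).splitlines():
            line = line.split(";", 1)[0].strip()  # drop inline ';' comments
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)  # directive name, then its value
            if len(parts) == 2:
                fields[parts[0]] = parts[1].strip()
        description = fields.get("service_description")
        # host_name may hold a comma-separated list of hosts
        for host in fields.get("host_name", "").split(","):
            host = host.strip()
            if host and description:
                yield host, description


def find_duplicates(configs):
    """configs maps a file name to its text.

    Returns {(host, service): [files...]} for pairs defined more than once.
    """
    seen = defaultdict(list)
    for name, text in configs.items():
        for pair in parse_services(text):
            seen[pair].append(name)
    return {pair: files for pair, files in seen.items() if len(files) > 1}


# Hypothetical file contents, only to show the shape of the problem.
sample = {
    "windows-servers.cfg": """
define service {
    host_name            wkendsvp01.healthone.org,example-host-02
    service_description  Ping
}
""",
    "wkendsvp01.healthone.org.cfg": """
define service {
    host_name            wkendsvp01.healthone.org
    service_description  Ping
}
""",
}

for (host, service), files in find_duplicates(sample).items():
    print(f"'{service}' on '{host}' is defined in: {', '.join(files)}")
```

On a real system you would read the files under /usr/local/nagios/etc/services/ into the dictionary instead of using the inline sample.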

Re: service check latency 5k and climbing

Posted: Tue Jul 17, 2012 1:45 pm
by benhank
Thanks!
By the way, I spoke too soon. When I posted that, my latency was at 800. It is now up to 3k and rising.

Since my other latency results are so low, can you tell me how this may affect my system? Is it a cause to worry?

Re: service check latency 5k and climbing

Posted: Tue Jul 17, 2012 3:03 pm
by mguthrie
Yes, latency is bad. If you want a check to run every 5 minutes and you have a latency of 6,000, the check will run 6,000 seconds behind schedule, which is 100 minutes late. Do you have a batch of checks that take a long time to run? Maybe a whole bunch of checks that are taking the full max_execution_time to run?
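
To make that arithmetic concrete, here is a tiny Python sketch. The function names are just illustrative and not part of Nagios; the numbers are the ones from the example above.

```python
def minutes_late(latency_seconds):
    """Convert a check latency in seconds to minutes behind schedule."""
    return latency_seconds / 60


def intervals_missed(latency_seconds, check_interval_minutes):
    """Number of scheduled check intervals that fit inside the latency."""
    return latency_seconds / (check_interval_minutes * 60)


print(minutes_late(6000))         # 100.0
print(intervals_missed(6000, 5))  # 20.0
```

In other words, with a 5-minute interval, a check that is 6,000 seconds late is roughly 20 intervals behind where it should be.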

Do you get these results with Mod Gearman turned off? If not, then I'd start digging there, because something is probably timing out with a LOT of checks. 5,000 checks isn't enough to cause that kind of latency on any piece of hardware, so something is probably timing out somewhere...

Re: service check latency 5k and climbing

Posted: Tue Jul 17, 2012 3:04 pm
by scottwilkerson
benhank wrote:Since my other latency results are so low, can you tell me how this may affect my system? Is it a cause to worry?
I would still check your logs. Something doesn't seem right. I'm not sure whether it is the mod_gearman configuration, missing plugins, or something else...

Re: service check latency 5k and climbing

Posted: Wed Jul 18, 2012 9:21 am
by benhank
Scott, when I check the logs for mod_gearman, what should I look for?