service check latency 5k and climbing

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
benhank
Posts: 1264
Joined: Tue Apr 12, 2011 12:29 pm

service check latency 5k and climbing

Post by benhank »

My service check latency used to hover around 243, but it is now at 5k+. It has gotten as high as 100k+.

We check 5k+ services.
The odd thing is that this started after we installed mod_gearman.
Proudly running:
NagiosXI 5.4.12 2 node Prod Env 2500 hosts, 13,000 services
Nagiosxi 5.5.7(test env) 2500 hosts, 13,000 services
Nagios Logserver 2 node Prod Env 500 objects sending
Nagios Network Analyser
Nagios Fusion
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: service check latency 5k and climbing

Post by scottwilkerson »

Are you talking about 100k secs?
Former Nagios employee
Creator:
Human Design Website
Get Your Human Design Chart
benhank
Posts: 1264
Joined: Tue Apr 12, 2011 12:29 pm

Re: service check latency 5k and climbing

Post by benhank »

Your Nagios XI installation is up to date.

Latest Available Version: 2011R3.2
Installed Version: 2011R3.2
Last Update Check: 07/17/2012 11:46:01
Last Updated: 07/17/2012 13:05:07

Monitoring Engine Event Queue
Scheduled Events Over Time
Monitoring Engine Check Statistics

Metric                   1-min   5-min   15-min
Active Host Checks         386   2,434    3,141
Passive Host Checks          0       0        0
Active Service Checks      502   4,311    5,920
Passive Service Checks       0       0        0

Last Updated: 07/17/2012 13:06:29

Monitoring Engine Performance

Metric                         Min        Max            Avg
Host Check Latency             0.00 sec   46.28 sec      1.01 sec
Host Check Execution Time      0.00 sec   31.00 sec      0.49 sec
Service Check Latency          0.00 sec   6,464.34 sec   45.29 sec
Service Check Execution Time   0.00 sec   61.01 sec      1.04 sec

Last Updated: 07/17/2012 13:06:29

About the Service Check Latency Max: it was at 100k+ this morning at 10:00 am, before I rebooted the server. I had been away since last Thursday, so it grew to 100k+ between Thursday and this morning. I rebooted the server, and the table shows where it is now. All other values are normal, as seen here.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: service check latency 5k and climbing

Post by scottwilkerson »

My guess is that something (maybe the mod_gearman daemon) stopped working correctly and the latency kept growing. Note that the number you highlighted is the Max, not the current latency.

Looking at the results you posted, the system seems to have processed almost 7,000 checks in the last 5 minutes, so it is likely on track.

I would watch it closely to see if it appears to start keeping up.

Also, I would look at any logs you might be getting from mod_gearman, both on the server and on the clients, as well as the syslog, to see if you can spot any problems.
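
For anyone checking the "almost 7,000 checks in 5 minutes" figure, here is a minimal sketch of the arithmetic. The two counts are the 5-min active host and service check values from the statistics posted above; the function name is just illustrative.

```python
# 5-min counts copied from the Monitoring Engine Check Statistics posted above.
active_host_checks_5min = 2434
active_service_checks_5min = 4311


def checks_per_second(total_checks: int, window_seconds: int) -> float:
    """Average check throughput over the given window."""
    return total_checks / window_seconds


total_5min = active_host_checks_5min + active_service_checks_5min
rate = checks_per_second(total_5min, 5 * 60)

print(f"{total_5min} checks in 5 min = {rate:.1f} checks/sec")
# prints: 6745 checks in 5 min = 22.5 checks/sec
```

This only shows how many checks completed in the window. It does not say whether that rate is enough to catch up on a backlog, which is why watching the trend over time is still the useful test.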
benhank
Posts: 1264
Joined: Tue Apr 12, 2011 12:29 pm

Re: service check latency 5k and climbing

Post by benhank »

I think I got it. I had a lot of warnings for services that had no check times defined. I added the check times and now I am down to 800+.
How do I fix these?
Warning: Duplicate definition found for service 'Ping' on host 'wkendsvp01.healthone.org' (config file '/usr/local/nagios/etc/services/windows-servers.cfg', starting on line 101)
Warning: Duplicate definition found for service 'CPU Usage for VMHost' on host 'vkenesxt01' (config file '/usr/local/nagios/etc/services/vmware-servers.cfg', starting on line 14)
There are more warnings like these. I think if I clean those out I should be good.
Thanks in advance, Scott.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: service check latency 5k and climbing

Post by scottwilkerson »

benhank wrote: How do I fix the
Warning: Duplicate definition found for service 'Ping' on host 'wkendsvp01.healthone.org' (config file '/usr/local/nagios/etc/services/windows-servers.cfg', starting on line 101)
Warning: Duplicate definition found for service 'CPU Usage for VMHost' on host 'vkenesxt01' (config file '/usr/local/nagios/etc/services/vmware-servers.cfg', starting on line 14)
OK, this is usually caused by having, for example, Ping set up in two places for the same host, or by having a hostgroup added to a service.

We will use the first as an example.

If you go to CCM -> Services and select windows-servers.cfg from the Config name filter, you will see a service named Ping.

I am guessing that if you open this service to modify it, you will find that either multiple hosts or a hostgroup have been added to it. This is fine; however, you likely also have a Ping service defined if you go to CCM -> Services and select wkendsvp01.healthone.org.cfg.

When you add a hostgroup to a service, that service will be defined for ALL hosts in the hostgroup.

So, to fix this, you need to remove one of the Ping services so that you do not have more than one for the same host.
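
If there are many of these warnings, a rough way to list candidates is to scan the service config text for the same host and service_description pair appearing more than once. The sketch below is only an illustration, not a Nagios tool: `parse_services` and `find_duplicates` are hypothetical helper names, the hosts `web01` and `web02` are made up, and it only looks at explicit `host_name` entries. It does not expand `hostgroup_name`, follow templates, or handle a `}` inside a value, so hostgroup-based duplicates like the one described above still need to be checked in CCM.

```python
import re
from collections import defaultdict

# Matches the body of each "define service { ... }" block (simplified).
DEFINE_SERVICE = re.compile(r"define\s+service\s*\{(.*?)\}", re.DOTALL)


def parse_services(text):
    """Yield (host_names, service_description) for each service block."""
    for block in DEFINE_SERVICE.finditer(text):
        fields = {}
        for line in block.group(1).splitlines():
            line = line.split(";", 1)[0].strip()  # drop inline ';' comments
            if not line or line.startswith("#"):
                continue
            parts = line.split(None, 1)  # directive, then the rest as value
            if len(parts) == 2:
                fields[parts[0]] = parts[1].strip()
        hosts = [h.strip() for h in fields.get("host_name", "").split(",") if h.strip()]
        desc = fields.get("service_description")
        if desc:
            yield hosts, desc


def find_duplicates(files):
    """files maps a file name to its text; returns pairs defined more than once."""
    seen = defaultdict(list)
    for name, text in files.items():
        for hosts, desc in parse_services(text):
            for host in hosts:
                seen[(host, desc)].append(name)
    return {pair: names for pair, names in seen.items() if len(names) > 1}


# Made-up sample data: Ping is defined for web01 in two different files.
sample = {
    "group.cfg": "define service {\n host_name web01,web02\n service_description Ping\n}\n",
    "web01.cfg": "define service {\n host_name web01\n service_description Ping ; per-host copy\n}\n",
}
for (host, desc), names in find_duplicates(sample).items():
    print(f"{desc!r} on {host!r} is defined in: {', '.join(names)}")
```

To run it against real files, build the `files` dictionary from the `.cfg` files under the services directory named in the warnings. The script only reads the files; the actual removal should still be done through CCM as described above.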
benhank
Posts: 1264
Joined: Tue Apr 12, 2011 12:29 pm

Re: service check latency 5k and climbing

Post by benhank »

THANKS!
Btw, I spoke too soon. When I posted that, my latency was at 800. Now it is up to 3k... and rising.

Since my other latency results are so low, can you tell me how this may affect my system? Is it a cause to worry?
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: service check latency 5k and climbing

Post by mguthrie »

Yes, latency is bad. If you want a check to run every 5 minutes and you have a latency of 6,000, that means the check will run 6,000 seconds behind schedule, which is 100 minutes late. Do you have a batch of checks that take a long time to run? Maybe a whole bunch of checks that are taking the full max_execution_time to run?

Do you get these results with Mod Gearman turned off? If not, then I'd start digging there, because something is probably timing out with a LOT of checks. 5,000 checks is not so many that you should be seeing that kind of latency on any piece of hardware, so something is probably timing out somewhere...
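
To make the latency arithmetic concrete, here is a small sketch. The 6,000 second latency and 5 minute interval are the example values used in this post; `latency_impact` is just an illustrative helper name.

```python
def latency_impact(latency_seconds: float, check_interval_minutes: float):
    """Return (minutes late, number of check intervals behind schedule)."""
    minutes_late = latency_seconds / 60
    intervals_behind = minutes_late / check_interval_minutes
    return minutes_late, intervals_behind


minutes_late, intervals_behind = latency_impact(6000, 5)
print(f"{minutes_late:.0f} minutes late, {intervals_behind:.0f} intervals behind")
# prints: 100 minutes late, 20 intervals behind
```

In other words, with those numbers a check that should have run 20 times in that period has effectively run once, so a problem could go unnoticed for well over an hour.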
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: service check latency 5k and climbing

Post by scottwilkerson »

benhank wrote: Since my other latency results are so low, can you tell me how this may affect my system? Is it a cause to worry?
I would still check your logs. Something doesn't seem right. I'm not sure whether it is the mod_gearman configuration, missing plugins, or something else...
benhank
Posts: 1264
Joined: Tue Apr 12, 2011 12:29 pm

Re: service check latency 5k and climbing

Post by benhank »

Scott, when I check the logs for mod_gearman, what should I look out for?