Host Checks / Latency

OptimusB · Post by **OptimusB** » Wed Apr 20, 2016 7:44 pm

I've noticed latency spikes on some of our hosts, with rtmax reaching up to 150ms. After investigating, it looks like the Nagios XI server itself is causing these "spikes". Ping times measured from other servers are quite stable, while a ping session run from the Nagios XI server at the same time shows fluctuating times. I suspect this is load related: adding more CPU and RAM did seem to help, but the results still aren't solid.

So my question is: how reliable is the host check data for rta and rtmax? Is it derived from what the Nagios XI server itself measures, and therefore affected by the load on the XI server? Does anyone else have this issue? I'm just trying to get more reliable data.

Attachment: latency.jpg
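One way to put numbers on the comparison described above is to take the perfdata from the same ping check run on the XI server and on another server, then look at the difference in rta and rtmax. The sketch below is only an illustration: the function names `parse_perfdata` and `rtt_gap` are hypothetical, and the sample strings are made up in the usual `label=value[UOM];warn;crit;min;max` perfdata layout rather than taken from this system.

```python
import re

# One perfdata item looks like label=value[UOM];warn;crit;min;max.
# Only the label and the numeric value are captured; thresholds are ignored.
_PERF_ITEM = re.compile(r"(?P<label>[^\s=]+)=(?P<value>-?\d+(?:\.\d+)?)")


def parse_perfdata(perfdata: str) -> dict:
    """Return {label: value} for every item in a perfdata string."""
    values = {}
    for item in perfdata.split():
        match = _PERF_ITEM.match(item)
        if match:
            values[match.group("label")] = float(match.group("value"))
    return values


def rtt_gap(xi_perfdata: str, other_perfdata: str, key: str = "rtmax") -> float:
    """Difference for one metric between the XI server and another vantage point."""
    return parse_perfdata(xi_perfdata)[key] - parse_perfdata(other_perfdata)[key]


# Illustrative sample values only (assumed, not measured on the poster's system).
xi_sample = "rta=48.210ms;3000.000;5000.000;0; pl=0%;80;100;; rtmax=150.000ms;;;; rtmin=0.900ms;;;;"
other_sample = "rta=1.100ms;3000.000;5000.000;0; pl=0%;80;100;; rtmax=1.900ms;;;; rtmin=0.800ms;;;;"

if __name__ == "__main__":
    print("rtmax gap (ms):", round(rtt_gap(xi_sample, other_sample), 3))
    print("rta gap (ms):", round(rtt_gap(xi_sample, other_sample, key="rta"), 3))
```

A consistently large gap for the same target would suggest the extra latency is added on the monitoring side rather than on the network path.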

rkennedy · Post by **rkennedy** » Thu Apr 21, 2016 11:51 am

Well, if the XI server is under pretty heavy load, then you could see an increased delay in responses, so I feel this all lines up proportionately.

To help us get a better idea - what kind of resources did you have before / after, and how many host / service checks are you running?

OptimusB · Post by **OptimusB** » Thu Apr 21, 2016 12:04 pm

This instance has one gearman worker. We had 2CPU and 8GB of RAM for the server before. It is now on 4CPU with 16GB of RAM. According to CCM, we have 221 hosts and 4176 services. Not too much.

We have another instance with about 1500 hosts and about 24065 services that's using 6CPU with 20GB of RAM, but that one also has 3 mod gearman workers. I am seeing some latency there too, but it is much lower. I guess the best way to handle this is to spin up an additional gearman worker node and have it handle only host checks to get better latency results.

rkennedy · Post by **rkennedy** » Thu Apr 21, 2016 3:26 pm

Not much at all.

Gearman may be the solution. When the system only had 2CPU/8G, were you noticing any bottlenecks anywhere (CPU/RAM)? Feel free to PM a profile over, and I can take a look to see if anything stands out that would cause this. I may not be able to find much though, since you've already added the resources.

Another option to throw at you is to use check_by_ssh or check_nrpe as agents. Run these on a 'cloud' service, and you'd be able to get additional latency information from an external source.
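As a rough sketch of the check_by_ssh idea, the helper below builds the argument list for running check_ping on a remote vantage host, so the round-trip times come from that host rather than from the XI server. Everything specific here is an assumption for illustration: the function name `build_remote_ping_check`, the plugin directory `/usr/local/nagios/libexec`, the `nagios` SSH user, the host names, and the thresholds. It only constructs the command and does not run it.

```python
import shlex


def build_remote_ping_check(
    vantage_host,
    target,
    ssh_user="nagios",                       # assumed SSH account on the vantage host
    plugin_dir="/usr/local/nagios/libexec",  # assumed plugin location on both machines
    warn="100.0,20%",                        # assumed warning threshold: rta ms, packet loss
    crit="500.0,60%",                        # assumed critical threshold: rta ms, packet loss
    packets=5,
):
    """Return argv for check_by_ssh that runs check_ping on vantage_host."""
    # Command executed on the remote side, quoted safely as a single string.
    remote_cmd = shlex.join([
        f"{plugin_dir}/check_ping",
        "-H", target,
        "-w", warn,
        "-c", crit,
        "-p", str(packets),
    ])
    # check_by_ssh: -H remote host, -l login name, -C command to run remotely.
    return [
        f"{plugin_dir}/check_by_ssh",
        "-H", vantage_host,
        "-l", ssh_user,
        "-C", remote_cmd,
    ]


if __name__ == "__main__":
    # Hypothetical host names, for illustration only.
    argv = build_remote_ping_check("vantage.example.com", "10.0.0.5")
    print(shlex.join(argv))
```

The resulting list could be passed to `subprocess.run` for a manual test, or the equivalent command line could be adapted into a command definition. Comparing its rta and rtmax with the values from the XI server's own host check would show how much of the spike comes from the XI server.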

Nagios Support Forum
