Nagios distributed monitoring

klajosh2 · Post by **klajosh2** » Mon Sep 14, 2015 10:45 am

1. OK, I will do so. Regarding this: I have workers in different time zones, with proper TZ settings on the servers. So far this has not caused any trouble.

2. OK, I will do so. One more interesting thing: I think I know what causes the high load on the worker. I examined nagios.log once more and I see orphaned checks there, from this specific worker only. They appear only in nagios.log, not in the NEB module's log. I think the cause of the high CPU utilization is that Nagios fires the checks like crazy on the worker. This sends the CPU utilization on the worker sky-high. What I am not sure about is why Nagios sees those checks as orphaned.
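For anyone chasing the same symptom, here is a rough sketch for counting orphaned-check warnings per host in nagios.log, which can help confirm that they all belong to hosts served by one worker. It assumes the warning lines contain the phrase "looks like it was orphaned" together with host '<name>', which is how Nagios Core normally words them; the default log path and the function name are only placeholders, so adjust them to your install.

```shell
# Count "orphaned check" warnings per host in a Nagios log.
# Assumption: each warning contains "looks like it was orphaned" and names
# the host as  host '<name>'  (service and host check warnings alike).
orphans_per_host() {
    # $1: path to nagios.log (hypothetical default, adjust to your install)
    log="${1:-/usr/local/nagios/var/nagios.log}"
    grep 'looks like it was orphaned' "$log" |
        sed -n "s/.*host '\([^']*\)'.*/\1/p" |
        sort | uniq -c | sort -rn
}
```

Calling `orphans_per_host /path/to/nagios.log` prints one line per host with the number of warnings, highest first.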

So in summary I have 2 distinct problems.

I. The check_results queue occasionally gets "full" (no checks are processed). This happens very rarely.
II. The current one: the checks from one of my workers are lost... at least Nagios sees them as orphaned, so it reschedules the checks and the load grows to 60 or above.


jdalrymple · Post by **jdalrymple** » Mon Sep 14, 2015 11:42 am

klajosh2 wrote: And I think the cause of the high CPU utilization is that Nagios fires the checks like crazy on the worker. This sends the CPU utilization on the worker sky-high.

I typically suggest keeping max_workers on the Nagios box set to the minimum number possible to run all the checks needed. More often than not this number is somewhere between 0 and 2 inclusive.

klajosh2 · Post by **klajosh2** » Tue Sep 15, 2015 5:43 am

OK, I think we found the solution for issue II (the checks from one of my workers are lost; Nagios sees them as orphaned, reschedules them, and the load grows to 60 or above).
The problem was that the time on the worker was not set properly. By that I mean the worker's clock was producing ntpd "spike_detect" events,
e.g.: "ntpd[2365]: 0.0.0.0 0613 03 spike_detect +192.594008 s". A little background: the worker is a virtual machine on VMware ESX. It has VMware Tools installed, and VMware Tools also tries to sync
the time and date. So the solution was to stop syncing the time via VMware Tools and leave it to NTP:

vmware-toolbox-cmd timesync status <--- to check if it is enabled
If the output is Enabled, then issue the following command:
vmware-toolbox-cmd timesync disable
There is no need to restart NTP after disabling the timesync.
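The diagnosis and the fix above can be wrapped in two small shell functions. This is only a sketch: it assumes the ntpd log lines look like the one quoted above and that the status command prints exactly "Enabled" when the sync is on. The function names, the default syslog path, and the TOOLBOX_CMD override are made up for illustration.

```shell
# Print the offset (in seconds) of every ntpd "spike_detect" event in a log.
# Assumes lines shaped like the one quoted in this post:
#   ntpd[2365]: 0.0.0.0 0613 03 spike_detect +192.594008 s
ntp_spikes() {
    # $1: syslog path (hypothetical default, varies by distribution)
    awk '{ for (i = 1; i < NF; i++) if ($i == "spike_detect") print $(i + 1) }' \
        "${1:-/var/log/messages}"
}

# Disable the VMware Tools time sync only if it is currently enabled.
# TOOLBOX_CMD is a hypothetical override so the function can be tried
# against a stub; by default the real vmware-toolbox-cmd is used.
disable_vmware_timesync() {
    cmd="${TOOLBOX_CMD:-vmware-toolbox-cmd}"
    status=$("$cmd" timesync status)
    if [ "$status" = "Enabled" ]; then
        "$cmd" timesync disable
    else
        echo "timesync already off (status: $status)"
    fi
}
```

If `ntp_spikes` keeps printing large offsets after the sync has been disabled, something other than VMware Tools is still stepping the clock.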

Regarding issue I (when the check_results queue has many checks):
I set up gearmand with --verbose=INFO, and when the problem strikes again I will check the logs to see what happens.
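Alongside the gearmand log, the queue itself can be watched. A minimal sketch, assuming `gearadmin --status` prints tab-separated lines of function name, total jobs, running jobs and available workers, and that the total includes the running jobs. The function name and the GEARADMIN_CMD override are made up for illustration.

```shell
# Print the number of jobs waiting (total minus running) in one Gearman queue.
# Assumption: `gearadmin --status` prints tab-separated lines of
#   function  total  running  available_workers
# The queue name defaults to "check_results", as discussed in this thread.
# GEARADMIN_CMD is a hypothetical override so a stub can stand in for gearadmin.
queue_backlog() {
    queue="${1:-check_results}"
    "${GEARADMIN_CMD:-gearadmin}" --status |
        awk -F '\t' -v q="$queue" '$1 == q { print $2 - $3 }'
}
```

Running `queue_backlog` from cron every minute and logging the number would show whether the backlog builds up slowly or jumps all at once when the problem strikes.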

Thanks for the help,

klajosh2

Post by **eloyd** » Tue Sep 15, 2015 7:23 am

Our VMs never trust their host hardware clock. They all sync NTP to the appropriate pool.ntp.org servers. Glad you got it working.

hsmith · Post by **hsmith** » Tue Sep 15, 2015 9:09 am

Glad to hear it is working. I am going to close this and mark it as resolved, as the original poster had his issue resolved already, and it sounds like yours is as well. Please let us know if you need any more help!

Nagios Support Forum
