Nagios distributed monitoring

klajosh2
Posts: 38
Joined: Thu Jan 16, 2014 5:22 am

Re: Nagios distributed monitoring

Post by klajosh2 »

1. OK, I will do so. Regarding this: I have workers in different time zones, with the proper TZ settings on each server. So far this has not caused any trouble.

2. OK, I will do so. One more interesting thing: I think I know what causes the high load on the worker. I examined nagios.log once more and
I see orphaned checks there, all from that one specific worker. They appear only in nagios.log, not in the NEB module's log. I think the high
CPU utilization comes from Nagios firing checks at the worker like crazy, which sends the worker's CPU utilization sky-high. What I am not sure
about is why Nagios sees those checks as orphaned.

So, in summary, I have two distinct problems:

I. The check_results queue occasionally gets "full" (no checks are processed). This happens very rarely.
II. The current one: the checks from one of my workers are lost. At least, Nagios sees them as orphaned, so it reschedules them and the load grows to 60 or above :(.
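
To confirm that the orphaned checks really come from a single worker's hosts, the warnings in nagios.log can be tallied per host. The following is a minimal sketch, not a definitive tool. It assumes the service-check warning contains the text "on host '<name>' looks like it was orphaned", so check that wording against your own log first. The function name and the log path in the usage note are hypothetical.

```shell
#!/bin/sh
# Count orphaned service-check warnings per host in a Nagios log.
# Assumption: each warning contains
#   on host '<name>' looks like it was orphaned
# Lines that do not match this pattern (e.g. host-check warnings) are skipped.
count_orphans_by_host() {
    log_file="$1"
    grep "looks like it was orphaned" "$log_file" |
        sed -n "s/.* on host '\([^']*\)' looks like it was orphaned.*/\1/p" |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

Usage would look something like `count_orphans_by_host /usr/local/nagios/var/nagios.log`, which prints one "host count" pair per line. If nearly every line belongs to hosts served by one worker, that points at the worker rather than at Nagios itself.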
jdalrymple
Skynet Drone
Posts: 2620
Joined: Wed Feb 11, 2015 1:56 pm

Re: Nagios distributed monitoring

Post by jdalrymple »

klajosh2 wrote: And I think the cause of the high
cpu utilization is that nagios fires the checks like crazy on the worker. This puts the cpu utilization to the sky on the worker.
I typically suggest keeping max_workers on the Nagios box set to the minimum number possible to run all the checks needed. More often than not this number is somewhere between 0 and 2 inclusive.
klajosh2
Posts: 38
Joined: Thu Jan 16, 2014 5:22 am

Re: Nagios distributed monitoring

Post by klajosh2 »

OK, I think we found the solution for issue II (the checks from one of my workers are lost; Nagios sees them as orphaned, reschedules them, and the load grows to 60 or above).
The problem was that the time on the worker was not set properly. By that I mean that ntpd on the worker kept logging "spike_detect" events, e.g.:
ntpd[2365]: 0.0.0.0 0613 03 spike_detect +192.594008 s
A little background: the worker is a virtual machine on VMware ESX. It has VMware Tools installed, and VMware Tools also tries to sync
the time and date. So the solution was to stop syncing the time via VMware Tools and leave it to NTP:

vmware-toolbox-cmd timesync status <--- to check if it is enabled
If the output is Enabled, then issue the following command:
vmware-toolbox-cmd timesync disable
There is no need to restart NTP after disabling the timesync.
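
The two commands above can be wrapped so that the disable step runs only when the status output is exactly "Enabled". This is a rough sketch under that assumption. `TOOLBOX_CMD` is a hypothetical override variable added here so the function can be tried against a stub, and the messages it prints are made up for illustration.

```shell
#!/bin/sh
# Disable VMware Tools time sync only if it is currently enabled.
# TOOLBOX_CMD is a hypothetical override (defaults to vmware-toolbox-cmd).
disable_vmware_timesync() {
    toolbox="${TOOLBOX_CMD:-vmware-toolbox-cmd}"
    # Rely on the printed status text, not on the exit code.
    status=$("$toolbox" timesync status 2>/dev/null) || true
    if [ "$status" = "Enabled" ]; then
        "$toolbox" timesync disable >/dev/null && echo "timesync disabled"
    else
        echo "timesync not enabled (status: ${status:-unknown})"
    fi
}
```

Run `disable_vmware_timesync` as root on the affected VM. After that, ntpd should be the only thing adjusting the clock.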

Regarding issue I (the check_results queue holding many checks):
I set up gearmand with --verbose=INFO, and when the problem strikes again I will check the logs to see what happens.
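
Alongside the verbose log, the queue depth itself can be watched. The sketch below filters `gearadmin --status` style output for queues with a backlog. It assumes the usual four tab-separated columns (function name, total jobs, running jobs, available workers) followed by a terminating "." line. The function name and the threshold are hypothetical.

```shell
#!/bin/sh
# Print queues whose waiting jobs (total - running) exceed a threshold.
# Reads `gearadmin --status` style lines on stdin:
#   name<TAB>total<TAB>running<TAB>available_workers
# Lines without exactly four fields (such as the final ".") are ignored.
backlog_queues() {
    threshold="${1:-0}"
    awk -v max="$threshold" 'NF == 4 && ($2 - $3) > max { print $1, $2 - $3 }'
}
```

For example, `gearadmin --status | backlog_queues 100` run from cron would print a line for check_results only when more than 100 results are waiting, which gives a timestamp to line up against the gearmand log.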

Thanks for the help,

klajosh2
eloyd
Cool Title Here
Posts: 2129
Joined: Thu Sep 27, 2012 9:14 am
Location: Rochester, NY

Re: Nagios distributed monitoring

Post by eloyd »

Our VMs never trust their host hardware clock. They all sync NTP to the appropriate pool.ntp.org servers. Glad you got it working.
Eric Loyd • http://everwatch.global • 844.240.EVER • @EricLoyd
I'm a Nagios Fanatic!
hsmith
Agent Smith
Posts: 3539
Joined: Thu Jul 30, 2015 11:09 am
Location: 127.0.0.1

Re: Nagios distributed monitoring

Post by hsmith »

Glad to hear it is working. I am going to close this and mark it as resolved, as the original poster had his issue resolved already, and it sounds like yours is as well. Please let us know if you need any more help!
Former Nagios Employee.
me.
Locked