Mod-Gearman questions

Posted: Thu Nov 02, 2017 6:02 pm
by cbeattie-unitrends
I've set up Mod-Gearman to distribute the load on one of my Nagios XI hosts, but I have two problems with it.

I'm running Nagios XI 5.4.10 on CentOS 7 (64-bit, manual install), with Mod-Gearman installed from the instructions at https://assets.nagios.com/downloads/nag ... ios_XI.pdf and two workers (also CentOS 7). This is a clone of my production Nagios XI host, so downtime isn't a concern.

The first problem is that performance is abysmal: checks are going stale by hours. Mod-Gearman seems to work for everyone else, so I'm sure I'm overlooking something. :lol: My production Nagios XI instance handles over 33,000 checks every 15 minutes, most of them on a 10- or 15-minute schedule. With Mod-Gearman added, I'm getting only about 1,800. My gut tells me the server isn't feeding the workers fast enough. I see check processes pop up occasionally in top on the workers, but they run quickly and disappear. gearman_top2 on the Nagios XI host looks like this:

Code:

2017-11-02 16:13:33  -  localhost:4730  -  v0.33

 Queue Name          | Worker Available | Jobs Waiting | Jobs Running
----------------------------------------------------------------------
 check_results       |               1  |           0  |           0
 eventhandler        |              27  |           0  |           0
 host                |              27  |           0  |           3
 service             |              27  |           0  |           2
 worker_den-gearman1 |               1  |           0  |           0
 worker_den-gearman2 |               1  |           0  |           0
----------------------------------------------------------------------
Jobs Waiting occasionally climbs to 20-60, but the workers pick those jobs up right away.
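
For scale, the arithmetic behind those numbers can be sketched in a few lines of shell. The 33,000 and 1,800 figures come from the post; treating 1,800 as a count over the same 15-minute window is an assumption, not something the post states.

```shell
#!/bin/sh
# Rough throughput comparison using the figures quoted in the post.
checks=33000      # checks per window on production (from the post)
observed=1800     # checks completed with Mod-Gearman (from the post)
window_min=15     # window length in minutes; ASSUMED to apply to both figures

window_sec=$(( window_min * 60 ))

# Integer division, so results are rounded down.
required_per_sec=$(( checks / window_sec ))
observed_per_sec=$(( observed / window_sec ))

echo "required: about ${required_per_sec} checks/s"
echo "observed: about ${observed_per_sec} checks/s"
```

If that assumption holds, the clone is completing checks at a small fraction of the rate production sustains, while the queues above stay empty. That would fit the suspicion that jobs are not being submitted fast enough rather than the workers falling behind.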

The other problem I have is executing scripts on the monitored targets with check_by_ssh. Checks using SNMP or check_http (for example) work fine, but checks over SSH return "CRITICAL: Return code of 255 is out of bounds. (worker: den-gearman2)". I exchanged SSH keys between the workers and the monitored hosts, so I can log in as root without a password, but that didn't help. I suspect that either I need to do that for another user, or that I have a problem quoting the command argument.

Thanks!

Re: Mod-Gearman questions

Posted: Fri Nov 03, 2017 5:46 am
by tacolover101
it looks like your workers are performing properly, which leads me to think it may be an issue with the checks themselves. a few things to check:

- if you're using any plugins, ensure they're copied over to the gearman worker machines.
- depending on what the checks are, make sure there isn't any "local XI" dependency (such as storage, rrd files, etc.), since the XI server and the workers do not share storage. one check in particular, networking, depends on the RRD files on the local XI machine, so it is not a good check to offload.
- what amount of resources do your workers have?

Re: Mod-Gearman questions

Posted: Fri Nov 03, 2017 11:24 am
by tgriep
The check_by_ssh checks require public SSH key authentication to work. You probably have it set up for the XI server but did not set up the workers for that plugin.
If you want the workers to run that plugin, they have to be set up using this document:
https://assets.nagios.com/downloads/nag ... ng_SSH.pdf

Also, with that many checks, you should increase the following options in the worker.conf file:

Code:

max-worker=50
max-jobs=1000
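That lets each worker host spawn more worker processes (max-worker) and lets each process handle more jobs before it exits and is replaced (max-jobs), so checks should get through faster.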
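
As a rough sanity check on a value like max-worker=50, the number of checks that need to be in flight at once is about the check rate multiplied by the average check duration. The sketch below uses the 33,000 checks per 15 minutes and the two workers from the first post; the 2-second average duration is a made-up placeholder, so substitute a measured value.

```shell
#!/bin/sh
# Back-of-the-envelope sizing: concurrent slots needed vs. slots configured.
checks=33000        # checks per window (from the first post)
window_sec=900      # 15 minutes (from the first post)
avg_check_sec=2     # HYPOTHETICAL average check duration; measure your own
worker_hosts=2      # worker machines (from the first post)
max_worker=50       # max-worker value suggested above

# Slots needed = rate * duration, rounded up with integer arithmetic.
needed_slots=$(( (checks * avg_check_sec + window_sec - 1) / window_sec ))
available_slots=$(( worker_hosts * max_worker ))

echo "needed:    ${needed_slots} concurrent checks"
echo "available: ${available_slots} worker processes"
```

If needed_slots comes out above available_slots for a realistic duration, the workers are undersized; if it is well below, the bottleneck is more likely elsewhere.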

Re: Mod-Gearman questions

Posted: Fri Nov 03, 2017 5:20 pm
by cbeattie-unitrends
tacolover101 wrote: it looks like your workers are performing properly, which leads me to think it may be an issue with the checks themselves. a few things to check:

- if you're using any plugins, ensure they're copied over to the gearman worker machines.
- depending on what the checks are, make sure there isn't any "local XI" dependency (such as storage, rrd files, etc.), since the XI server and the workers do not share storage. one check in particular, networking, depends on the RRD files on the local XI machine, so it is not a good check to offload.
- what amount of resources do your workers have?

I copied the plugins to the workers. I had trouble with them at first because they weren't in the same path as on the Nagios XI host, but I figured that out. None of the plugins depend on XI. The workers have 2 CPUs and 2 GB of memory each. I added four more workers for a total of six. Their load averages are between 0 and 0.2, so they're not busy.

tgriep wrote: The check_by_ssh checks require public SSH key authentication to work. You probably have it set up for the XI server but did not set up the workers for that plugin.
If you want the workers to run that plugin, they have to be set up using this document.

I did set up different public keys for all the workers. I can log in as root without a password. I can run the checks that use check_by_ssh manually from the worker, and they work. However, they fail when Nagios runs them through Mod-Gearman. The same checks work properly from the production Nagios XI instance.

Here's the output from Nagios XI and Mod-Gearman:

Code:

CRITICAL: Return code of 255 is out of bounds. (worker: den-gearman5)
UNKNOWN - check_by_ssh: Remote command '/usr/local/nagios/libexec/check_load.sh -w 1.00 -c 1.10' returned status 255
Here's the same command run manually as root on that worker:

Code:

[root@den-gearman5 ~]# /usr/local/nagios/libexec/check_by_ssh -H den-03ltr007 -C "/usr/local/nagios/libexec/check_load.sh -w 1.00 -c 1.10" -l root --skip-stderr
1m: .65 5m: .66, 15m: .68|ala1=.65;1.00;1.10 ala5=.66;1.00;1.10 ala15=.68;1.00;1.10
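
For context on the "out of bounds" message: Nagios plugins are expected to exit with 0 to 3, and anything else is reported the way the first output line shows. Separately, ssh itself exits with 255 when it hits an error of its own, such as a failed connection or authentication. The small mapping below is only an illustration of the convention; the function name is made up.

```shell
#!/bin/sh
# Map a plugin exit status to the state name Nagios uses for it.
# plugin_state is a hypothetical helper, not part of Nagios or Mod-Gearman.
plugin_state() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        3) echo "UNKNOWN" ;;
        *) echo "out of bounds" ;;   # e.g. 255 from a failed ssh session
    esac
}

plugin_state 255
```

So a 255 here is consistent with the SSH session failing for whichever account the worker runs under, although it does not prove that by itself.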

Re: Mod-Gearman questions

Posted: Mon Nov 06, 2017 11:47 am
by tgriep
The Mod-Gearman worker is probably running as the nagios user account, so check_by_ssh has to be set up to run as the nagios user.
Try reconfiguring the public keys for the nagios user account and retest as that user.
After that, it should work for you.
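
One way to retest as that user is to wrap the earlier manual command in sudo. This is a sketch that assumes sudo is installed on the worker and that the worker really does run as nagios; the helper name check_as_nagios is made up, while the plugin path and flags are copied from the command posted above.

```shell
#!/bin/sh
# Run check_by_ssh the way the worker presumably does: as nagios, not root.
# check_as_nagios is a hypothetical helper; the account name is an assumption.
check_as_nagios() {
    host=$1         # monitored host to connect to
    remote_cmd=$2   # command to execute on the monitored host
    sudo -u nagios /usr/local/nagios/libexec/check_by_ssh \
        -H "$host" -l root --skip-stderr -C "$remote_cmd"
    # Report the plugin's exit status (0-3 expected, 255 points at ssh itself).
    echo "exit status: $?"
}
```

Calling it as check_as_nagios den-03ltr007 "/usr/local/nagios/libexec/check_load.sh -w 1.00 -c 1.10" from a root shell on the worker should reproduce what Mod-Gearman sees. If it asks for a password or for host key confirmation, the nagios account's keys or known_hosts entries are the likely gap.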