Mod-Gearman questions
Posted: Thu Nov 02, 2017 6:02 pm
I've set up Mod-Gearman to distribute the load on one of my Nagios XI hosts, but I have two problems with it.
I'm running Nagios XI 5.4.10 on CentOS 7, 64-bit, manual install, with Mod-Gearman installed from the instructions at https://assets.nagios.com/downloads/nag ... ios_XI.pdf and two workers (also CentOS 7). This is a clone of my production Nagios XI host, so downtime isn't a concern.
The first problem I have is that the performance is abysmal. Checks are going stale by hours. Mod-Gearman seems to work for everyone else, so I'm sure I'm overlooking something.
My production Nagios XI instance handles over 33,000 checks every 15 minutes, although most of them are on a 10- or 15-minute schedule. On the clone with Mod-Gearman, I'm getting only about 1,800. My gut tells me that the server isn't feeding the workers fast enough. I see the check processes pop up occasionally in top on the workers, but they finish quickly and disappear. gearman_top2 on the Nagios XI host looks like this:
2017-11-02 16:13:33 - localhost:4730 - v0.33
Queue Name          | Worker Available | Jobs Waiting | Jobs Running
----------------------------------------------------------------------
check_results       |                1 |            0 |            0
eventhandler        |               27 |            0 |            0
host                |               27 |            0 |            3
service             |               27 |            0 |            2
worker_den-gearman1 |                1 |            0 |            0
worker_den-gearman2 |                1 |            0 |            0
----------------------------------------------------------------------
Jobs Waiting occasionally shows a few jobs (20-60), but the workers pick them up right away.
The other problem I have is executing scripts on the monitored targets with check_by_ssh. Checks using SNMP or check_http (for example) work fine, but checks over SSH return "CRITICAL: Return code of 255 is out of bounds. (worker: den-gearman2)". I exchanged SSH keys between the workers and the monitored hosts, so I can log in as root without a password, but that didn't help. I suspect that I either need to do the same for another user or have a problem quoting the command argument.
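One way to test the "another user" theory would be to run a non-interactive SSH login from a worker as the account the Mod-Gearman worker runs under. The sketch below is only that, a sketch: the account name and the target host are placeholders (assumptions, not values from the setup described above), and the helper names are made up for illustration. It leans on two things that should hold generally: Nagios plugins are expected to exit with 0-3, and the ssh client exits with 255 when ssh itself fails, which would fit the "out of bounds" message.

```python
import subprocess
from typing import List

# Valid Nagios plugin exit codes; anything else is "out of bounds".
PLUGIN_STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def describe_exit_code(code: int) -> str:
    """Explain an exit code the way a Nagios-style check result is read."""
    if code in PLUGIN_STATES:
        return PLUGIN_STATES[code]
    if code == 255:
        # ssh exits with 255 when the client itself hits an error
        # (host key, authentication, connection), so this is the likely cause.
        return "out of bounds: ssh likely failed (keys, known_hosts, connection)"
    return "out of bounds: not a valid plugin return code"


def build_ssh_test(worker_user: str, target: str) -> List[str]:
    """Build a non-interactive ssh test run as the worker's account.

    worker_user and target are hypothetical placeholders.
    """
    return [
        "sudo", "-u", worker_user,
        "ssh",
        "-o", "BatchMode=yes",      # fail instead of prompting for a password
        "-o", "ConnectTimeout=10",  # give up after 10 seconds
        target,
        "true",                     # harmless remote command, exits 0 on success
    ]


def run_ssh_test(worker_user: str, target: str) -> str:
    """Run the test on a worker and describe the resulting exit code."""
    result = subprocess.run(build_ssh_test(worker_user, target))
    return describe_exit_code(result.returncode)
```

Run on a worker, something like `run_ssh_test("nagios", "target.example.com")` coming back as "OK" would suggest key-based login works for that account. A 255 result would point at keys or known_hosts for that user rather than at the quoting of the command argument.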
Thanks!