Distributed Nagios Architecture

Post by **BanditBBS** » Mon Jun 09, 2014 11:22 am

Smark wrote: So I have everything working, sort of. Looking at the hostgroups and servicegroups options in mod_gearman_worker.conf, you can specify which hostgroups and which servicegroups should be executed by which workers.

In our environment we have servers dispersed around the world, so it makes more sense to say "any services on hosts in this hostgroup should be checked by this worker". Does that functionality exist?

Yep, that's how I do it! Need more details?

Post by **Smark** » Mon Jun 09, 2014 11:37 am

BanditBBS wrote: Yep, that's how I do it! Need more details?

Yes! I'm looking at gearman_top and I see my hostgroup_Name queues:

Code:

2014-06-09 09:33:01  -  localhost:4730   -  v0.25

 Queue Name                | Worker Available | Jobs Waiting | Jobs Running
----------------------------------------------------------------------------
 check_results             |               1  |           0  |           0
 eventhandler              |              56  |           0  |           0
 host                      |              56  |           0  |           0
 hostgroup_Flagstaff Hosts |              13  |           0  |           0
 hostgroup_Phoenix Hosts   |              43  |           0  |           0
 hostgroup_Sunnyvale Hosts |              13  |           0  |           0
 service                   |              56  |           0  |          11
 worker_flgnagiosgmdv1     |               1  |           0  |           0
 worker_phxnagiosgmdv1     |               1  |           0  |           0
----------------------------------------------------------------------------

I'm currently at about 2000 service checks across 375 hosts. I can see jobs running in the ~40-60 range under service, but only one or two at a time under the specific hostgroups. I assume all the service checks are staying in the service queue and only the host checks are going to the hostgroup_Name queues. I read somewhere that you can specify a host variable that determines which queue a check goes into.

I found it:

queue_custom_variable
Can be used to define the target queue by a custom variable in addition to host/servicegroups. When set for ex. to WORKER you then could define a _WORKER custom variable for your hosts and services to directly set the worker queue. The host queue is inherited unless overwritten by a service custom variable. Set the value of your custom variable to local to bypass Mod-Gearman (Same behaviour as in localhostgroups/localservicegroups).

queue_custom_variable=WORKER

I'll try this out and report back unless you have something to add.
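For anyone reading later, here is a rough sketch of how the option quoted above might be wired up. The host name is made up, and the assumption that the variable's value is the full queue name as it appears in gearman_top (for example hostgroup_Flagstaff) comes from my reading of the Mod-Gearman documentation, so check it against your version before relying on it.

```cfg
# mod_gearman_neb.conf on the Nagios server
queue_custom_variable=WORKER

# Nagios host definition (hypothetical host, other directives omitted)
define host {
    host_name   example-flagstaff-host
    # Assumed: the value names the target queue directly
    _WORKER     hostgroup_Flagstaff
}
```

Services on that host should inherit the queue unless they set their own _WORKER value, and a worker still has to be listening on the named queue for the jobs to run.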

I hope the mods don't mind this conversation continuing in this thread. I think this will serve as a good reference to people in the future.

Thanks again for everyone's continued help.

Post by **BanditBBS** » Mon Jun 09, 2014 11:42 am

mod_gearman_worker.conf file on one of the remote workers:

hosts=no
services=no
hostgroups=<comma-separated list of hostgroups for this worker>

Make sure those settings are in place and then restart the mod_gearman_worker service on that worker. Any host (and all services on that host) in the specified hostgroups will go to that worker. Any host or service not in those hostgroups will not be sent to that worker and will instead be split up between the workers that have hosts and services set to yes.

Post by **Smark** » Mon Jun 09, 2014 12:14 pm

Ok, so I tried that here below:

Code:

2014-06-09 10:11:14  -  localhost:4730   -  v0.25

 Queue Name            | Worker Available | Jobs Waiting | Jobs Running
------------------------------------------------------------------------
 check_results         |               1  |           0  |           0
 eventhandler          |              35  |           0  |           0
 host                  |              25  |           0  |           0
 hostgroup_Flagstaff   |              30  |           0  |           0
 hostgroup_Phoenix     |              30  |           0  |           0
 hostgroup_Sunnyvale   |              30  |           0  |           0
 service               |              25  |          82  |          25
 worker_flgnagiosgmdv1 |               1  |           0  |           0
 worker_phxnagiosdv1   |               1  |           0  |           0
 worker_phxnagiosgmdv1 |               1  |           0  |           0
------------------------------------------------------------------------

When I look at the log on phxnagiosdv1 (the one running gearmand and serving the default service and host queues) I can see "got host job" and "got service job" scrolling by. I never see the other queues starting anything.

Example (flgnagiosgmdv1):

Code:

# defines if the module should distribute execution of
# service checks.
services=no

# defines if the module should distribute execution of
# host checks.
hosts=no

# sets a list of hostgroups which will go into separate
# queues. Either specify a comma separated list or use
# multiple lines.
hostgroups=Flagstaff,Sunnyvale

The Phoenix one is exactly the same.

In Nagios XI Enterprise 2014R1.1, hostgroup membership is defined on the hostgroup itself, not on the host (the hostgroup lists the hosts that belong to it).

Any idea what the problem could be here? I'll keep investigating.

Edit: Yes, I restarted the services after making these changes.

Post by **BanditBBS** » Mon Jun 09, 2014 12:21 pm

Wondering if you need to restart nagios as well. Just for grins, restart gearmand and nagios on the main server and see if either of those resolves your issue. I can tell you without a doubt that I use it the way you show and it works great!

EDIT: Stupid question time: You do have hosts belonging to those hostgroups, right? If you look at the hostgroup summary, does it show numbers next to them?

EDIT #2: On the main server, in the mod_gearman_neb.conf file, you do have those hostgroups listed in the hostgroups setting, right?

Example:

Code:

# sets a list of hostgroups which will go into separate
# queues. Either specify a comma separated list or use
# multiple lines.
#hostgroups=name2,name3
hostgroups=pci,dmz,win_corp,wdd

After making changes to this, either nagios or gearmand needs to be restarted; I can't remember which.
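To show how the two files relate, here is a minimal sketch of a matching pair. The hostgroup names are borrowed from the worker example and gearman_top output earlier in this thread; which worker serves which hostgroup is only illustrative.

```cfg
# mod_gearman_neb.conf on the main Nagios server:
# every hostgroup that should get its own queue must be listed here
hostgroups=Flagstaff,Sunnyvale,Phoenix

# mod_gearman_worker.conf on one remote worker:
# do not take jobs from the generic host/service queues,
# only from the queues of the listed hostgroups
hosts=no
services=no
hostgroups=Flagstaff,Sunnyvale
```

If a hostgroup is listed only on the worker side, the worker registers the queue but nothing is ever routed into it, so the checks stay in the generic host and service queues.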

Post by **Smark** » Mon Jun 09, 2014 12:31 pm

Huzzah! I had already made changes to [...]_neb.conf as a troubleshooting step. Looks like I had to restart nagios. To make it easy, I rebooted the Nagios server.

Looks good to me!

Code:

2014-06-09 10:29:33  -  localhost:4730   -  v0.25

 Queue Name            | Worker Available | Jobs Waiting | Jobs Running
------------------------------------------------------------------------
 check_results         |               1  |           0  |           0
 eventhandler          |              89  |           0  |           0
 host                  |               0  |          14  |           0
 hostgroup_Flagstaff   |              46  |           1  |          34
 hostgroup_Phoenix     |              43  |           0  |          30
 hostgroup_Sunnyvale   |              46  |           0  |           5
 service               |               0  |          10  |           0
 worker_flgnagiosgmdv1 |               1  |           0  |           0
 worker_phxnagiosgmdv1 |               1  |           0  |           0
------------------------------------------------------------------------

BanditBBS, how do you keep your libexec folder (the one with all the check scripts) in sync between nodes? I was planning on something like an rsync script unless you had something more elegant in mind. I was thinking about an NFS export, but since these will be distributed and not in the same DC, it would be best for everything to be local.

Post by **BanditBBS** » Mon Jun 09, 2014 12:37 pm

I just do it manually, believe it or not... it's ugly!

Post by **sreinhardt** » Mon Jun 09, 2014 3:50 pm

My suggestion would be rsync, pulled from or pushed by a central read-only system. Keep bandit's crazy manual replication away from your systems.
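As a rough sketch of the pull variant, a cron entry on each worker could look something like this. The file name, the central host name, the schedule, and the libexec path are all assumptions; it also assumes the nagios user has key-based SSH access to the central system.

```cfg
# /etc/cron.d/nagios-libexec-sync on each worker (hypothetical file name)
# Pull the plugin directory from the central copy every 30 minutes.
# --delete removes local plugins that no longer exist on the central system.
*/30 * * * * nagios rsync -az --delete central.example.com:/usr/local/nagios/libexec/ /usr/local/nagios/libexec/
```

Pulling keeps everything local on each worker, which fits the point above about the workers not sharing a data center.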
