host check orphaned

jdalrymple
Skynet Drone
Posts: 2620
Joined: Wed Feb 11, 2015 1:56 pm

Re: host check orphaned

Post by jdalrymple »

Glad to hear we're AT LEAST making progress :)

Do you know if you're calling mod_gearman_worker with any arguments?

Please run on dce1:

ps -ef | grep gearman
The reason I ask is that I can't spot in your configs where the servicegroup queue that IS PROCESSING checks is coming from. I may also be able to work it out from the trace files - I'll review those as I'm able to.
bosecorp
Posts: 929
Joined: Thu Jun 26, 2014 1:00 pm

Post by bosecorp »

here is the output

root@gearmandce1:(03-18 17:20): /root
# ps -ef | grep gearman
nagios 419 31031 0 17:18 ? 00:00:00 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf --pidfile=/var/mod_gearman/mod_gearman_worker.pid
nagios 478 31031 0 17:18 ? 00:00:00 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf --pidfile=/var/mod_gearman/mod_gearman_worker.pid
nagios 481 31031 0 17:18 ? 00:00:00 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf --pidfile=/var/mod_gearman/mod_gearman_worker.pid
nagios 672 31031 0 17:18 ? 00:00:00 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf --pidfile=/var/mod_gearman/mod_gearman_worker.pid
nagios 673 31031 0 17:18 ? 00:00:00 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf --pidfile=/var/mod_gearman/mod_gearman_worker.pid
avahi 1741 1 0 Mar15 ? 00:00:00 avahi-daemon: running [gearmandce1.local]
nagios 2865 2863 0 17:20 ? 00:00:00 /bin/sh -c /usr/local/nrdp/clients/nrds/nrds.pl -H 'gearmandce1' 2>&1
nagios 2866 2865 0 17:20 ? 00:00:00 /usr/bin/perl -w /usr/local/nrdp/clients/nrds/nrds.pl -H gearmandce1
nagios 3217 31031 0 17:20 ? 00:00:00 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf --pidfile=/var/mod_gearman/mod_gearman_worker.pid
root 3921 3253 0 17:20 pts/0 00:00:00 grep gearman
nagios 25528 31031 0 17:15 ? 00:00:00 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf --pidfile=/var/mod_gearman/mod_gearman_worker.pid
nagios 28924 31031 0 17:16 ? 00:00:00 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf --pidfile=/var/mod_gearman/mod_gearman_worker.pid
nagios 30922 31031 0 17:17 ? 00:00:00 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf --pidfile=/var/mod_gearman/mod_gearman_worker.pid
nagios 30924 31031 0 17:17 ? 00:00:00 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf --pidfile=/var/mod_gearman/mod_gearman_worker.pid
nagios 31031 1 0 15:45 ? 00:00:00 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf --pidfile=/var/mod_gearman/mod_gearman_worker.pid
nagios 32730 31031 0 17:18 ? 00:00:00 /usr/bin/mod_gearman_worker -d --config=/etc/mod_gearman/mod_gearman_worker.conf --pidfile=/var/mod_gearman/mod_gearman_worker.pid

As I said before, the number of orphans is now below 80. At times I see it go down to 30, but then it goes back up to 70 or 80.

Post by jdalrymple »

To bring things back together - here are the 3 most relevant gearman_tops that you've posted and their associated state:

When we started:

Queue Name             | Worker Available | Jobs Waiting | Jobs Running
-------------------------------------------------------------------------
check_results          |               1  |           0  |           0
eventhandler           |               6  |           0  |           0
host                   |              13  |           0  |           2
hostgroup_gearman_dce1 |               5  |           0  |           0
hostgroup_gearman_dcn1 |               5  |           0  |           0
service                |              13  |           0  |           0
worker_gearmandce1     |               1  |           0  |           0
worker_gearmandcn1     |               1  |           0  |           0
worker_nagmonus1       |               1  |           0  |           0
worker_nagmonus2       |               1  |           0  |           0
-------------------------------------------------------------------------

After modifying NEB config

Queue Name                | Worker Available | Jobs Waiting | Jobs Running
----------------------------------------------------------------------------
check_results             |               2  |         103  |           2
eventhandler              |              34  |           0  |           0
host                      |              54  |           0  |           0
hostgroup_gearman_dce1    |               7  |           0  |           2
hostgroup_gearman_dcn1    |               7  |           0  |           3
service                   |              54  |           0  |          40
servicegroup_gearman_dce1 |               7  |           0  |           0
servicegroup_gearman_dcn1 |               7  |           0  |           0
worker_gearmandce1        |               1  |           0  |           0
worker_gearmandcn1        |               1  |           0  |           0
worker_nagmonus1          |               1  |           0  |           0
worker_nagmonus2          |               1  |           0  |           0
----------------------------------------------------------------------------

After modifying the worker configs

Queue Name                | Worker Available | Jobs Waiting | Jobs Running
----------------------------------------------------------------------------
check_results             |               4  |           0  |           1
eventhandler              |              51  |           0  |           0
host                      |              62  |           0  |           1
hostgroup_gearman_dce1    |               7  |           0  |           0
hostgroup_gearman_dcn1    |               5  |           0  |           0
service                   |              62  |           0  |          39
servicegroup_gearman_dce1 |               7  |           0  |           2
servicegroup_gearman_dcn1 |               5  |           0  |           0
worker_gearmandce1        |               1  |           0  |           0
worker_gearmandcn1        |               1  |           0  |           0
worker_nagmonus1          |               1  |           0  |           0
worker_nagmonus2          |               1  |           0  |           0
----------------------------------------------------------------------------
I think we've definitely made ground. Things I can't understand are as follows, and to clear things up I may need you to send me your updated configs:

1) Why did servicegroup_gearman_dce1 show up and start doing work? We never uncommented your servicegroup lines. Based upon the way you're distributing work I wouldn't expect us to want to use servicegroups - just hostgroups.
2) Why isn't the number of workers on your hostgroup_gearman queues increasing while the host and service queues are? Either way, it's great, the remote workers are doing work, but I feel like they're probably not doing the right work.
3) Not sure why in the middle output the check_results queue was so filled up, this is something handled only on the job server - what we did with the worker configurations shouldn't have affected that.

Have you made any configuration changes outside the ones suggested? If so - no problem, I'm just trying to get caught up on where things stand.
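
For comparing gearman_top snapshots like the three above without eyeballing them, a small parser can help. This is only a minimal sketch in Python, assuming the pipe-delimited layout shown in those outputs; the function names are hypothetical and not part of mod_gearman.

```python
def parse_gearman_top(text):
    """Parse a gearman_top table into {queue: (workers, waiting, running)}."""
    queues = {}
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        # Data rows have four columns with three integer fields;
        # the header row and the dashed rulers are skipped.
        if len(parts) != 4 or not all(p.isdigit() for p in parts[1:]):
            continue
        queues[parts[0]] = tuple(int(p) for p in parts[1:])
    return queues


def backlogged(queues):
    """Return only the queues that have jobs waiting."""
    return {name: stats for name, stats in queues.items() if stats[1] > 0}


# Two rows copied from the "After modifying NEB config" snapshot.
SNAPSHOT = """\
Queue Name                | Worker Available | Jobs Waiting | Jobs Running
----------------------------------------------------------------------------
check_results             |               2  |         103  |           2
service                   |              54  |           0  |          40
"""

if __name__ == "__main__":
    print(backlogged(parse_gearman_top(SNAPSHOT)))
```

Run against the middle snapshot, this singles out check_results as the only queue with waiting jobs, which is the oddity raised in point 3.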

Post by bosecorp »

1) Why did servicegroup_gearman_dce1 show up and start doing work? We never uncommented your servicegroup lines. Based upon the way you're distributing work I wouldn't expect us to want to use servicegroups - just hostgroups.
I made that change; I enabled the servicegroups. It sounds like you don't recommend that. When you asked me to uncomment the hostgroups, I assumed I should also add the servicegroups.
2) Why isn't the number of workers on your hostgroup_gearman queues increasing while the host and service queues are? Either way, it's great, the remote workers are doing work, but I feel like they're probably not doing the right work.
I checked and this hostgroup does not have any devices. I don't know why this group was created.
3) Not sure why in the middle output the check_results queue was so filled up, this is something handled only on the job server - what we did with the worker configurations shouldn't have affected that.
No comment.

Post by jdalrymple »

Are you trying to distribute the work based upon geographical location? I'm guessing that just based upon the hostgroup names. If so what you'll want to do is add the hosts in the appropriate locations to the appropriate hostgroups, then the work should get distributed properly.

You'd generally only distribute servicegroups based upon a worker host's ability to carry out a check; it generally wouldn't have to do with geography. For instance, you might have to add some custom plugins and configs to handle your Microsoft environment, so you could have an MS servicegroup intended to carry out those checks and not have to distribute those configs to all of your worker servers.

Circling all the way back to the beginning, I think there is just misconfiguration in the whole gearman environment. What I recommend is going over the documentation at http://labs.consol.de/nagios/mod-gearman/ and reworking your configurations to distribute the work properly. What was originally happening was simply that your workers weren't involving themselves in the queues. I'm still not 100% certain why that is, but something along the way has changed that so now they are participating.
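
To illustrate why a hostgroup with no members leaves its queue idle, here is a deliberately simplified model of hostgroup-based routing. It is a sketch under stated assumptions, not mod_gearman's actual code: it only considers hostgroups (the real module has further settings, such as servicegroups, that also influence the choice), and the function name is hypothetical. The queue names follow the hostgroup_<name> pattern visible in the gearman_top output.

```python
def target_queue(host_groups, routed_hostgroups, check_type="service"):
    """Pick a queue name for a check (simplified model, hostgroups only).

    host_groups       -- set of hostgroup names the checked host belongs to
    routed_hostgroups -- hostgroups given a dedicated queue in the config
    check_type        -- "host" or "service", used as the generic fallback queue
    """
    for group in routed_hostgroups:
        if group in host_groups:
            return f"hostgroup_{group}"
    # No routed hostgroup matches: the check lands in the generic queue.
    return check_type


ROUTED = ["gearman_dce1", "gearman_dcn1"]

if __name__ == "__main__":
    # A host that was added to gearman_dce1 is routed to that location's queue.
    print(target_queue({"gearman_dce1"}, ROUTED))
    # A host that belongs to none of the routed hostgroups falls back to "service".
    print(target_queue(set(), ROUTED))
```

Under this model, until hosts are actually added to gearman_dce1 and gearman_dcn1, every check falls through to the generic host and service queues, which would be consistent with the hostgroup queues showing workers but no jobs.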

Post by bosecorp »

Hi

We were trying to distribute the load across multiple workers because we have around 3000 devices. Do you still recommend having the job server do the service checks?

I still see orphans, but fewer now. I think things are working better, but something is still causing the orphans.

I will PM you my conf files again

here is the output again from my gearman_top

2015-03-19 15:27:54 - 10.100.30.111:4730 - v0.33

Queue Name                | Worker Available | Jobs Waiting | Jobs Running
----------------------------------------------------------------------------
check_results             |               2  |           0  |           0
eventhandler              |             283  |           0  |           0
host                      |             283  |           0  |           0
hostgroup_gearman_dce1    |              12  |           0  |           0
hostgroup_gearman_dcn1    |               9  |           0  |           0
hostgroup_gearman_fdc     |             255  |           0  |           0
service                   |             283  |           0  |           1
servicegroup_gearman_dce1 |              12  |           0  |           0
servicegroup_gearman_dcn1 |               9  |           0  |           0
servicegroup_gearman_fdc  |             255  |           0  |           0
worker_gearmandce1        |               1  |           0  |           0
worker_gearmandcn1        |               1  |           0  |           0
worker_nagmonus1          |               1  |           0  |           0
worker_nagmonus2          |               1  |           0  |           0
----------------------------------------------------------------------------


And checks are still happening very slowly.

Post by jdalrymple »

What's not clear is if there is any reason to structure what events go to what queues. Are all of your worker servers able to process all checks, and by that I mean do they all have the full suite of plugins and meta-configs that the Nagios server itself has?

Is there any geographical separation, or from a network standpoint are all of the hosts/services to be checked the same distance from each worker? If they are all the same distance, the default gearman configs are set up to properly distribute load without any hostgroup or servicegroup definitions whatsoever. We only need to define hostgroups and servicegroups if we want to specify exactly where we want checks to be processed.

Post by bosecorp »

servicegroups

Yes, the devices are in different geographic locations.

We placed a worker per location.

By the way, I see two orphans now.

Things continue to get better, but checks still seem to be delayed. It takes 10 or 15 minutes for a check to run again, even though I have the checks configured to run every 5 minutes.

Could IO wait be the issue?

I also continue to see these errors in the nagios.log file:

[1426798630] Warning: The check of service 'Port 11145 Bandwidth' on host 'uswb-idf-21-470' looks like it was orphaned (results never came back; last_check=1426797059; next_check=1426797359). I'm scheduling an immediate check of the service...
[1426798630] Warning: The check of service 'Port 11633 Bandwidth' on host 'uswb-idf-21-470' looks like it was orphaned (results never came back; last_check=1426797280; next_check=1426797340). I'm scheduling an immediate check of the service...
[1426798630] Warning: The check of service 'Port 11635 Bandwidth' on host 'uswb-idf-21-470' looks like it was orphaned (results never came back; last_check=1426797083; next_check=1426797383). I'm scheduling an immediate check of the service...
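
To put a number on how late these results are, the epoch timestamps in each warning can be subtracted from the time the warning was logged. A minimal sketch in Python, assuming the log line format shown above; `orphan_delays` is a hypothetical helper, not a Nagios tool.

```python
import re

# Matches the orphaned-service warnings as they appear in nagios.log above.
ORPHAN_RE = re.compile(
    r"^\[(?P<logged>\d+)\] Warning: The check of service '(?P<service>[^']*)' "
    r"on host '(?P<host>[^']*)' looks like it was orphaned "
    r"\(results never came back; last_check=(?P<last>\d+); next_check=(?P<next>\d+)\)"
)


def orphan_delays(lines):
    """Yield (host, service, seconds since last_check, seconds past next_check)."""
    for line in lines:
        m = ORPHAN_RE.match(line)
        if m:
            logged = int(m["logged"])
            yield (
                m["host"],
                m["service"],
                logged - int(m["last"]),
                logged - int(m["next"]),
            )


# First warning from the log excerpt above.
SAMPLE = [
    "[1426798630] Warning: The check of service 'Port 11145 Bandwidth' on host "
    "'uswb-idf-21-470' looks like it was orphaned (results never came back; "
    "last_check=1426797059; next_check=1426797359). I'm scheduling an immediate "
    "check of the service...",
]

if __name__ == "__main__":
    for host, service, since_last, overdue in orphan_delays(SAMPLE):
        print(f"{host} / {service}: {since_last}s since last check, {overdue}s overdue")
```

For the first warning this gives 1571 seconds since last_check (about 26 minutes) and 1271 seconds past next_check.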

Post by jdalrymple »

bosecorp wrote: Yes, the devices are in different geographic locations

we placed a worker per location.

could IO wait be the issue?
IO wait is potentially caused by the issue, not causing the issue.

I would remove (comment out) all the servicegroups from all of your configs, both NEB and worker, and use just the hostgroups. This is typically what makes the most sense when you're dealing with geographic separation. Also make sure your hostgroups are set up properly so that the checks get distributed to the right workers.

I'm sorry, I also saw a new gearman_fdc pop up. That is new it seems. Is it another location? It looks like it's getting quite a large portion of the work now, or maybe you just have minimum workers configured very high for it. Either way - what is it and where did it come from? Is it necessary?

Post by bosecorp »

Hi jdalrymple

I have removed the servicegroups from my configs.

Yes, fdc is a new location.

I checked my worker config, and everything seems to be in order