host check orphaned
Re: host check orphaned
Ah! Yes, I did use the IP.
root@nagmonus1:(03-16 15:37): /root
# check_gearman -H 10.100.30.111 -q worker_`hostname`
check_gearman OK - 0 jobs running and 0 jobs waiting. Version: 0.33|'worker_nagmonus1_waiting'=0;10;100;0 'worker_nagmonus1_running'=0 'worker_nagmonus1_worker'=1;25;50;0
-
jdalrymple
- Skynet Drone
- Posts: 2620
- Joined: Wed Feb 11, 2015 1:56 pm
Re: host check orphaned
This appears to have been run from the job server, bosecorp. Can you try running it from one of the worker servers that is giving you problems?
Re: host check orphaned
My understanding, from the person who implemented the Nagios environment, is that nagmonus1 also performs monitoring-related activities. Therefore mod_gearman is also running on this server, as is gearmand.
Anyway, here is the output from one of my gearman servers that is also giving me problems:
# check_gearman -H 10.10.32.80 -q worker_`hostname`
check_gearman WARNING - failed to connect to 10.10.32.80:4730 - Connection refused
Queue worker_gearmandce1 not found
root@gearmandce1:(03-16 15:55): /root
# check_gearman -H gearmandce1 -q worker_`hostname`
check_gearman WARNING - failed to connect to gearmandce1:4730 - Connection refused
Queue worker_gearmandce1 not found
-
jdalrymple
Re: host check orphaned
There need only be 1 job server, but you have multiple workers. In this case what I need is for you to run the command as follows, from the worker server:
Code: Select all
check_gearman -H 10.100.30.111 -q worker_gearmandce
You are right that there are workers on your main Nagios server, and that it is also acting as your job server. The problem is the remote workers: these are workers on the remote systems that must report back to the job server. This is inherently the nature of your problem as well: your job server (the one running gearmand) is waiting for results to come in for the checks it's issuing to the workers, but those results are never coming back.
It might be helpful to review the following to better understand how gearmand and workers interact:
http://labs.consol.de/nagios/mod-gearma ... es_it_work
--EDIT--
I just noticed that gearmandce is not a host listed in your gearman_top in the post on page 1. It is important that we understand which hosts are which, which IPs are which, which are working, which aren't, etc. It seems at this point like a lot of the problem may just be configuration confusion.
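When comparing replies from several workers, it can help to pull a single number out of each check_gearman status line. The following is a minimal sketch, assuming the perfdata layout matches the output quoted earlier in this thread (`'<queue>_worker'=N;...`). The helper name `worker_count` is made up for illustration and is not part of mod_gearman.

```shell
#!/bin/sh
# Sketch: extract the "<queue>_worker" value from one check_gearman status line.
# worker_count is a hypothetical helper; the perfdata layout is assumed to
# match the output quoted in this thread.
worker_count() {
    # $1 = queue name; the status line is read from stdin
    sed -n "s/.*'${1}_worker'=\([0-9][0-9]*\).*/\1/p"
}

# Sample line copied from the nagmonus1 output earlier in the thread.
sample="check_gearman OK - 0 jobs running and 0 jobs waiting. Version: 0.33|'worker_nagmonus1_waiting'=0;10;100;0 'worker_nagmonus1_running'=0 'worker_nagmonus1_worker'=1;25;50;0"

count=$(printf '%s\n' "$sample" | worker_count worker_nagmonus1)
echo "worker_nagmonus1: ${count:-unknown} worker(s) reported"
```

On a worker, the real command output could be piped into the helper instead of the sample line. An empty result would suggest the queue was not present in the reply at all.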
Re: host check orphaned
Here is another output of the command:
root@gearmandce1:(03-16 15:55): /etc/mod_gearman
# check_gearman -H 10.100.30.111 -q worker_`hostname`
check_gearman OK - 0 jobs running and 0 jobs waiting. Version: 0.33|'worker_gearmandce1_waiting'=0;10;100;0 'worker_gearmandce1_running'=0 'worker_gearmandce1_worker'=1;25;50;0
I noticed that the mod_gearman config file on gearmandce1 contains the IP of my nagmonus1, so I re-ran the command against that IP. Maybe that is why it gave me the previous error.
I hope that makes sense.
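Rather than guessing which address to pass to `-H`, the job server can be read from the worker's own config. This is only a sketch: it assumes the config contains a line of the form `server=<host>:<port>`, and `job_server` is a hypothetical helper. It runs against a throwaway file here, since the real config path on these machines is not shown in the thread.

```shell
#!/bin/sh
# Sketch: find the job server a worker is configured to use.
# Assumes a "server=<host>:<port>" line; job_server is a hypothetical helper.
job_server() {
    # $1 = path to the worker config; prints the host part of the first server= line
    sed -n 's/^[[:space:]]*server[[:space:]]*=[[:space:]]*\([^:[:space:]]*\).*/\1/p' "$1" | head -n 1
}

# Demo on a temporary file standing in for the real worker config.
conf=$(mktemp)
printf '%s\n' '# sample worker config' 'server=10.100.30.111:4730' > "$conf"
js=$(job_server "$conf")
rm -f "$conf"

# Print the command that would query that job server for this host's queue.
echo "check_gearman -H $js -q worker_$(hostname)"
```

Querying the address found this way avoids the "Connection refused" seen earlier, which is what one would expect when no gearmand is listening on the worker itself.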
Here is the IP information you requested:
gearmandce1 10.10.32.80
gearmandcn1 10.10.32.81
nagmonus1 10.100.30.111
nagmonus2 10.100.30.113
nagfusionus1 10.100.30.110
And here is the output of the gearman_top command:
Queue Name             | Worker Available | Jobs Waiting | Jobs Running
-----------------------------------------------------------------------
check_results          |                1 |            0 |            0
eventhandler           |               10 |            0 |            0
host                   |               10 |            0 |            0
hostgroup_gearman_dce1 |                5 |            0 |            0
hostgroup_gearman_dcn1 |                5 |            0 |            0
service                |               10 |            0 |            0
worker_gearmandce1     |                1 |            0 |            0
worker_gearmandcn1     |                1 |            0 |            0
worker_nagmonus1       |                1 |            0 |            0
worker_nagmonus2       |                1 |            0 |            0
-----------------------------------------------------------------------
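With this many queues it is easy to overlook one that has lost its workers or is building a backlog. The sketch below scans gearman_top-style text for both conditions. It assumes the pipe-separated column order shown in the table above, and `flag_queues` is a hypothetical helper. The `demo_` rows are invented purely to show the output format.

```shell
#!/bin/sh
# Sketch: flag queues that have no workers or have jobs waiting.
# Assumes columns: name | workers available | jobs waiting | jobs running.
flag_queues() {
    awk -F'|' '
        NF == 4 && $2 ~ /^ *[0-9]+ *$/ {
            name = $1; gsub(/ /, "", name)
            workers = $2 + 0; waiting = $3 + 0
            if (workers == 0)     print name ": no workers"
            else if (waiting > 0) print name ": " waiting " jobs waiting"
        }'
}

# Three rows from the thread, followed by two invented rows (demo_*).
sample='Queue Name | Worker Available | Jobs Waiting | Jobs Running
check_results | 1 | 0 | 0
host | 10 | 0 | 0
worker_gearmandce1 | 1 | 0 | 0
demo_idle | 0 | 3 | 0
demo_busy | 2 | 7 | 0'

printf '%s\n' "$sample" | flag_queues
```

None of the rows posted in this thread would be flagged, which is consistent with every queue showing workers and an empty backlog at the moment the snapshot was taken.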
-
jdalrymple
Re: host check orphaned
bosecorp wrote:
[2015-03-15 15:58:09][4468][INFO ] no checks in 2minutes, restarting all workers
[2015-03-15 16:15:29][4468][INFO ] mod_gearman worker exited
[2015-03-15 16:18:27][2842][INFO ] mod_gearman worker daemon started with pid 2842
[2015-03-15 16:18:28][2842][INFO ] no checks in 2minutes, restarting all workers
[2015-03-15 16:20:29][2842][INFO ] no checks in 2minutes, restarting all workers
[2015-03-15 16:21:19][2842][INFO ] mod_gearman worker exited
[2015-03-15 16:22:19][4142][INFO ] mod_gearman worker daemon started with pid 4142
[2015-03-15 16:22:20][4142][INFO ] no checks in 2minutes, restarting all workers
[2015-03-16 01:51:37][4142][INFO ] no checks in 2minutes, restarting all workers
What happened between 15:58-16:15 and 16:22-01:51 - at that point the gearman worker was taking on checks but the rest of the time there are none?
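The quiet stretches in a worker log like this one can be listed mechanically instead of by eye. The sketch below prints every gap between consecutive log lines that is at least a given number of seconds long. It assumes each line starts with `[YYYY-MM-DD HH:MM:SS]` as in the excerpt, and `log_gaps` is a hypothetical helper. Whether a quiet log really means the worker was busy with checks is an interpretation, not something the script can confirm.

```shell
#!/bin/sh
# Sketch: list gaps between consecutive lines of a mod_gearman worker log.
# log_gaps is a hypothetical helper; timestamps are assumed to look like
# "[2015-03-15 15:58:09]" at the start of each line.
log_gaps() {
    # $1 = minimum gap in seconds; log lines are read from stdin
    awk -v min="$1" '
        # Days since 1970-01-01 for a Gregorian calendar date.
        function days(y, m, d,    era, yoe, doy) {
            if (m <= 2) y--
            era = int(y / 400); yoe = y - era * 400
            doy = int((153 * (m > 2 ? m - 3 : m + 9) + 2) / 5) + d - 1
            return era * 146097 + yoe * 365 + int(yoe / 4) - int(yoe / 100) + doy - 719468
        }
        /^\[[0-9][0-9][0-9][0-9]-/ {
            stamp = substr($0, 2, 19)
            secs = substr($0, 13, 2) * 3600 + substr($0, 16, 2) * 60 + substr($0, 19, 2)
            now = days(substr($0, 2, 4) + 0, substr($0, 7, 2) + 0, substr($0, 10, 2) + 0) * 86400 + secs
            if (seen && now - prev >= min)
                print prev_stamp " -> " stamp ": " (now - prev) " s"
            prev = now; prev_stamp = stamp; seen = 1
        }'
}

# The excerpt quoted above, fed in unchanged.
excerpt='[2015-03-15 15:58:09][4468][INFO ] no checks in 2minutes, restarting all workers
[2015-03-15 16:15:29][4468][INFO ] mod_gearman worker exited
[2015-03-15 16:18:27][2842][INFO ] mod_gearman worker daemon started with pid 2842
[2015-03-15 16:18:28][2842][INFO ] no checks in 2minutes, restarting all workers
[2015-03-15 16:20:29][2842][INFO ] no checks in 2minutes, restarting all workers
[2015-03-15 16:21:19][2842][INFO ] mod_gearman worker exited
[2015-03-15 16:22:19][4142][INFO ] mod_gearman worker daemon started with pid 4142
[2015-03-15 16:22:20][4142][INFO ] no checks in 2minutes, restarting all workers
[2015-03-16 01:51:37][4142][INFO ] no checks in 2minutes, restarting all workers'

# Report gaps of five minutes or more.
printf '%s\n' "$excerpt" | log_gaps 300
```

With a five-minute threshold, the excerpt yields two gaps, 15:58:09 to 16:15:29 and 16:22:20 to 01:51:37 the next day, which are the same two windows asked about here.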
I feel like we're going in circles, but it's also not clear at all what is going on and when it's going on.
The problem is that host checks are timing out when run on a remote gearman server
The gearman servers are all reporting in OK it appears
The logs indicate that the checks are timing out
Sometimes the gearman workers are getting work and other times there is none (this is very confusing)
The job server log doesn't at all agree with the worker logs
Some host checks finish but others fail
Host checks perform fine from the gearman server (I'd like to verify that this is true - did you actually run that check_icmp command from the gearman server or was there some confusion on that?)
bosecorp wrote:
it works from the command line
# /usr/local/nagios/libexec/check_icmp -H 10.103.120.12 -p 5
OK - 10.103.120.12: rta 2.395ms, lost 0%|rta=2.395ms;200.000;500.000;0; pl=0%;40;80;;
Maybe it would be best to do that again and also run hostname directly thereafter?
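One way to remove any doubt about where a check was executed is to wrap it so that the hostname is printed right after the plugin output. This is a minimal sketch. `run_tagged` is a hypothetical helper, and the `echo` call below is only a stand-in so the snippet runs on any machine.

```shell
#!/bin/sh
# Sketch: run a command, then report its exit status and the executing host.
# run_tagged is a hypothetical helper, not a Nagios or mod_gearman tool.
run_tagged() {
    "$@"
    status=$?
    echo "exit status $status, run on $(hostname)"
    return "$status"
}

# Stand-in command; on a gearman worker the arguments would instead be
#   /usr/local/nagios/libexec/check_icmp -H 10.103.120.12 -p 5
run_tagged echo "OK - placeholder check output"
```

Pasting the wrapper's two output lines into the thread would show both the plugin result and the machine that produced it.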
Nothing adds up. It may be desirable for you to package up some of your gearman conf files and PM them to us to look at.
Re: host check orphaned
I did run the command, but here it is again:
root@nagmonus1:(03-16 16:55): /root
# /usr/local/nagios/libexec/check_icmp -H 10.103.120.12 -p 5
OK - 10.103.120.12: rta 2.381ms, lost 0%|rta=2.381ms;200.000;500.000;0; pl=0%;40;80;;
The problem happens at all times. Regardless of which server is doing the monitoring for a particular device, that device will eventually end up orphaned. I have tried moving some of the devices to a different gearman server, and the problem follows them. In addition, when I force the check, the devices come back green and later go back to orphaned again. The number of orphaned devices is different all the time.
I will PM you the gearmand.conf files.
-
jdalrymple
Re: host check orphaned
bosecorp wrote:
root@nagmonus1:(03-16 16:55): /root
# /usr/local/nagios/libexec/check_icmp -H 10.103.120.12 -p 5
OK - 10.103.120.12: rta 2.381ms, lost 0%|rta=2.381ms;200.000;500.000;0; pl=0%;40;80;;
The problem is you're running the check from the Nagios host/job server. We need to see this run from the gearman server.
Do you understand that the gearman server is the computer that actually runs the process `check_icmp`? That is the idea behind gearman: the Nagios host/job server just collects the results, and the checks are actually processed remotely. For this reason it doesn't matter whether your Nagios server can run check_icmp properly; we need to see what happens when the process is run on the gearman server.
Thanks for your patience with this. I will try my best to help you understand the architecture as we go along. I think the biggest problem with your environment is just understanding how it all interacts.
Re: host check orphaned
Hi, here is what you requested:
root@gearmandce1:(03-16 17:15): /root
# /usr/local/nagios/libexec/check_icmp -H 10.103.120.12 -p 5
OK - 10.103.120.12: rta 2.119ms, lost 0%|rta=2.119ms;200.000;500.000;0; pl=0%;40;80;; rtmax=2.849ms;;;; rtmin=1.897ms;;;;
I do understand how it works. The reason I ran it from nagmonus1 is that the person who set this up said this particular device is being monitored from the nagmonus1 server. That person is no longer available, so I cannot reach out to him.
Which server does the monitoring is controlled by hostgroups; this is the way it was set up. In this case the device doesn't belong to any of the gearman hostgroups that we have defined, which means that nagmonus1 will do the monitoring.
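The routing rule described here can be written down as a small model. This is a sketch of the described behaviour, not mod_gearman's actual code. The queue names `hostgroup_gearman_dce1`, `hostgroup_gearman_dcn1` and `host` come from the gearman_top output in this thread, while `queue_for_host` and the example hostgroup `linux-servers` are invented. The fallback to the shared `host` queue is an assumption, and which workers serve that queue depends on worker configs that are not shown here.

```shell
#!/bin/sh
# Sketch: model of hostgroup-based queue selection for host checks.
# $1 = comma-separated hostgroups that have a dedicated queue
# $2 = comma-separated hostgroups the host is a member of
queue_for_host() {
    old_ifs=$IFS; IFS=,
    for routed in $1; do
        for member in $2; do
            if [ "$routed" = "$member" ]; then
                IFS=$old_ifs
                echo "hostgroup_$routed"
                return 0
            fi
        done
    done
    IFS=$old_ifs
    echo "host"   # assumed fallback: the shared host queue
}

routed="gearman_dce1,gearman_dcn1"
queue_for_host "$routed" "linux-servers,gearman_dce1"   # member of a gearman hostgroup
queue_for_host "$routed" "linux-servers"                # member of none
```

Under this model, a device outside both gearman hostgroups lands in the shared `host` queue, so any worker subscribed to that queue could pick up its checks, not necessarily one on nagmonus1.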
I sent you the config files you requested.
Sorry if I seem unfamiliar, but as I said, the person who set this up is no longer available and there is no documentation on how everything is configured, so we have been learning as we go.
-
jdalrymple
Re: host check orphaned
I see ... so the problem isn't restricted to remote gearman servers; you're having timeouts even on hosts that are defined to be checked by the local gearman workers? I apologize if I missed that information before. If that's the case, I think it might be advisable to turn the debug level to 1 (for now, maybe 2 later) in the config files mod_gearman_neb.conf and mod_gearman_worker.conf and see if anything pops out in those. Make sure to reload both daemons after you modify the configs.
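The config change could be scripted along the following lines. This is a sketch only: it assumes each file already contains a `debug=<n>` line, `set_debug` is a hypothetical helper, and the demo runs against a temporary file rather than the real configs. How the two daemons are reloaded depends on the installation and is left out.

```shell
#!/bin/sh
# Sketch: set "debug=<level>" in a mod_gearman config file, keeping a .bak copy.
# set_debug is a hypothetical helper. If the file has no debug= line,
# the file is left unchanged.
set_debug() {
    # $1 = config file, $2 = debug level
    cp "$1" "$1.bak" &&
    sed "s/^[[:space:]]*debug[[:space:]]*=.*/debug=$2/" "$1.bak" > "$1"
}

# Demo on a temporary file; the real targets named above are
# mod_gearman_neb.conf and mod_gearman_worker.conf.
conf=$(mktemp)
printf '%s\n' 'server=10.100.30.111:4730' 'debug=0' > "$conf"
set_debug "$conf" 1
grep '^debug=' "$conf"
rm -f "$conf" "$conf.bak"
```

Keeping the `.bak` copy makes it easy to return to the previous level once the debug output has been collected.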