host check orphaned
-
CFT6Server
- Posts: 506
- Joined: Wed Apr 15, 2015 4:21 pm
host check orphaned
I noticed that we are getting a lot of host-down messages with "host check orphaned, is the mod-gearman worker on queue 'host' running?". Yesterday I restarted the Nagios service and the mod-gearman worker/gearmand as documented to try to fix this issue. That seems to have worked temporarily, but now there are 315 hosts "down" out of 461. Any idea what is causing this? Thanks.
Also, this seems to affect only the host checks and not the service checks.
-
jdalrymple
- Skynet Drone
- Posts: 2620
- Joined: Wed Feb 11, 2015 1:56 pm
Re: host check orphaned
Is it broken now? If so can you post the output of gearman_top?
Thanks
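For readers without gearman_top at hand, the same queue counters can be read from gearmand's text admin protocol. The sketch below assumes gearmand listens on its default port 4730 on localhost and answers the `status` command with tab-separated lines (function, total jobs, running jobs, available workers) terminated by a single `.`. The helper names and the sample numbers are hypothetical, and "waiting" is computed as total minus running, which is how the gearman_top columns appear to relate.

```python
import socket
from typing import List, NamedTuple


class QueueStatus(NamedTuple):
    name: str
    waiting: int  # jobs queued but not yet picked up (total - running)
    running: int
    workers: int  # workers currently registered for this queue


def parse_status(text: str) -> List[QueueStatus]:
    """Parse the reply to gearmand's text-protocol `status` command."""
    queues = []
    for raw in text.splitlines():
        line = raw.strip()
        if line == ".":  # a lone dot terminates the reply
            break
        fields = line.split("\t")
        if len(fields) != 4:
            continue  # skip blank or unexpected lines
        name, total, running, workers = fields
        queues.append(
            QueueStatus(name, int(total) - int(running), int(running), int(workers))
        )
    return queues


def starved_queues(queues: List[QueueStatus]) -> List[str]:
    """Queues that have jobs waiting but no worker registered to take them."""
    return [q.name for q in queues if q.waiting > 0 and q.workers == 0]


def fetch_status(host: str = "localhost", port: int = 4730, timeout: float = 5.0) -> str:
    """Ask a running gearmand for its status (host and port are assumptions)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"status\n")
        data = b""
        while not data.endswith(b".\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    return data.decode()


if __name__ == "__main__":
    for queue in parse_status(fetch_status()):
        print(queue)
```

A 'host' queue with jobs waiting and zero workers would match the wording of the orphaned-check message; a queue that has workers but a steadily growing waiting count points at capacity instead.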
-
CFT6Server
- Posts: 506
- Joined: Wed Apr 15, 2015 4:21 pm
Re: host check orphaned
It is still broken... here's gearman_top
-
jdalrymple
- Skynet Drone
- Posts: 2620
- Joined: Wed Feb 11, 2015 1:56 pm
Re: host check orphaned
Everything looks proper there. Do you have your hosts sorted at all into hostgroups that relate to the worker queues, or do you just pool them all?
Is it any specific worker giving issues or are you rebooting all of the workers to get things rolling again? If it's specific to a single worker I think that we should have you turn on debugging on that worker. If not the other place to look would just be in nagios.log.
In other installs I've seen "host check orphaned" be the result of an overly busy XI server. It seems the job server is much more sensitive to a heavy load than the workers are.
-
CFT6Server
- Posts: 506
- Joined: Wed Apr 15, 2015 4:21 pm
Re: host check orphaned
I suspect that the host could be busy. However, due to the nature of all the network bandwidth checks and MRTG, I've created the host group Network_ALL and the service group All_Network_Bandwidth. Other than that, there are no other host/service configurations in the mod-gearman config. How do I know when the checks are too much for the XI host / worker?
If I reboot the XI / master, everything is OK temporarily. I have debugging on for all workers, and they seem OK. However, the main XI box, which has a worker on it, has tons of checks going to it (all network-related checks). For configuration, I've set min workers to 25 and max to 1000 (probably won't hit it?).
-
jdalrymple
- Skynet Drone
- Posts: 2620
- Joined: Wed Feb 11, 2015 1:56 pm
Re: host check orphaned
CFT6Server wrote:How do I know when the checks are too much for the host XI / worker?
In my experience the workers aren't at all sensitive to a high load - just the job server, a.k.a. the XI box. What does the load look like on it? Unless you removed it, we shipped XI to you with a built-in load monitor. Otherwise sar will give us some good history to look at.
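As a rough way to answer the "too much load" question on the job server, the load averages can be normalised by core count. This is only a sketch: the function names are hypothetical, and the threshold of 1.0 runnable processes per core is a common rule of thumb rather than anything specific to XI or mod-gearman.

```python
import os
from typing import Tuple


def load_per_core(load_avg: Tuple[float, float, float], cores: int) -> Tuple[float, ...]:
    """Divide the 1-, 5- and 15-minute load averages by the number of cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return tuple(round(value / cores, 2) for value in load_avg)


def is_overloaded(load_avg: Tuple[float, float, float], cores: int,
                  threshold: float = 1.0) -> bool:
    """Flag sustained load: both the 5- and 15-minute values exceed the threshold.

    The 1-minute value is ignored so that a short burst of checks
    does not count as overload.
    """
    _, five, fifteen = load_per_core(load_avg, cores)
    return five > threshold and fifteen > threshold


if __name__ == "__main__":
    cores = os.cpu_count() or 1
    current = os.getloadavg()  # available on Linux and other Unix systems
    print("load per core:", load_per_core(current, cores))
    print("overloaded:", is_overloaded(current, cores))
```

This only gives a point-in-time reading; sar history, as suggested above, is the better source for spotting when the load climbs between restarts.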
-
CFT6Server
- Posts: 506
- Joined: Wed Apr 15, 2015 4:21 pm
Re: host check orphaned
Here are the performance stats. The load seems pretty heavy? I have a service check for the XI cpu load.
-
CFT6Server
- Posts: 506
- Joined: Wed Apr 15, 2015 4:21 pm
Re: host check orphaned
One more .... Server Stats
-
jdalrymple
- Skynet Drone
- Posts: 2620
- Joined: Wed Feb 11, 2015 1:56 pm
Re: host check orphaned
Very calm box - not at all what I expected to see knowing the size of your environment. (BTW jolson and I are still very actively digging into the slow NNA queries.)
CFT6Server wrote:However the main XI box, which has a worker on it has tons of checks going to it (which are all network related checks).
Does the worker debug log on that box indicate anything unusual? Maybe it would be best to exclude this worker from all of the queues except those it explicitly needs to be serving?
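To see which queues a worker would serve, its config can be mapped to mod-gearman queue names. The sketch below assumes the usual naming (`host`, `service`, `eventhandler`, `hostgroup_<name>`, `servicegroup_<name>`) and treats a missing `hosts`/`services`/`eventhandler` key as "no"; the parser is deliberately simple (one value per key) and the function names are hypothetical. The sample uses the group names mentioned earlier in the thread.

```python
from typing import Dict, List


def parse_worker_conf(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, ignoring blanks and # comments.

    Simplification: a repeated key overwrites the earlier value.
    """
    conf = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        conf[key.strip()] = value.strip()
    return conf


def subscribed_queues(conf: Dict[str, str]) -> List[str]:
    """List the queue names a worker with this config would register for."""
    queues = []
    # Generic queues: any host/service/eventhandler job not routed to a group queue.
    for key, queue in (("hosts", "host"), ("services", "service"),
                       ("eventhandler", "eventhandler")):
        if conf.get(key, "no").lower() == "yes":
            queues.append(queue)
    # Group queues: one per listed hostgroup or servicegroup.
    for key, prefix in (("hostgroups", "hostgroup_"), ("servicegroups", "servicegroup_")):
        for name in conf.get(key, "").split(","):
            if name.strip():
                queues.append(prefix + name.strip())
    return queues


if __name__ == "__main__":
    sample = """
    hosts=yes
    services=yes
    hostgroups=Network_ALL
    servicegroups=All_Network_Bandwidth
    """
    print(subscribed_queues(parse_worker_conf(sample)))
```

Under these assumptions, a worker with `hosts=yes` also registers for the generic `host` queue, so it would pick up host checks outside its hostgroups; setting `hosts=no` and `services=no` would leave only the group queues.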
-
CFT6Server
- Posts: 506
- Joined: Wed Apr 15, 2015 4:21 pm
Re: host check orphaned
Currently the main XI box has these hosts/services configured:
Code:
services=yes
hosts=yes
hostgroups=Network_ALL
servicegroups=ALL_Network_Bandwidth
localhostgroups=localhost
Looks like host checks are still running on this box even though they aren't part of the groups defined. Perhaps I am not configuring this properly?
Debug logs are looking clean so far. I turned on tracing and then ran a forced check on a host that had the message, but it came back green:
Code:
host_name=kdcpisbx02
command_line=/usr/local/nagios/libexec/check_icmp -H kdcpisbx02 -w 3000.0,80% -c 5000.0,100% -p 5
[2015-07-09 11:20:01][9414][DEBUG] got host job: kdcpisbx02
[2015-07-09 11:20:01][9414][TRACE] command: /usr/local/nagios/libexec/check_icmp -H kdcpisbx02 -w 3000.0,80% -c 5000.0,100% -p 5
host_name=kdcpisbx02
output=OK - kdcpisbx02: rta 0.541ms, lost 0%|rta=0.541ms;3000.000;5000.000;0; pl=0%;80;100;; rtmax=0.646ms;;;; rtmin=0.438ms;;;; \n
[2015-07-09 11:20:01][9414][TRACE] 310 --->host_name=kdcpisbx02
output=OK - kdcpisbx02: rta 0.541ms, lost 0%|rta=0.541ms;3000.000;5000.000;0; pl=0%;80;100;; rtmax=0.646ms;;;; rtmin=0.438ms;;;; \n
I had to restart the worker nodes to enable tracing, so I am wondering whether the restart itself helped and the issue will come back over time. I will monitor these.
Looks like I cannot do a mass forced check of all host checks?
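One possible way to force checks on many hosts at once is to write Nagios external commands to the command file. The sketch below builds one `SCHEDULE_FORCED_HOST_CHECK;<host_name>;<check_time>` line per host. The command-file path is the usual source-install default and is an assumption here, the function names are hypothetical, and the host list has to come from somewhere else (for example, the hosts currently showing the orphaned message).

```python
import time
from typing import Iterable, List, Optional


def forced_host_check_commands(hosts: Iterable[str],
                               now: Optional[int] = None) -> List[str]:
    """Build one SCHEDULE_FORCED_HOST_CHECK external command per host.

    Format: [entry_time] SCHEDULE_FORCED_HOST_CHECK;<host_name>;<check_time>
    The check time is set to "now" so each check runs as soon as possible.
    """
    ts = int(time.time()) if now is None else int(now)
    return ["[{0}] SCHEDULE_FORCED_HOST_CHECK;{1};{0}".format(ts, host) for host in hosts]


def submit(commands: Iterable[str],
           command_file: str = "/usr/local/nagios/var/rw/nagios.cmd") -> None:
    """Write the commands to the Nagios command pipe (path is an assumption).

    Must run as a user allowed to write to the pipe, with Nagios running.
    """
    with open(command_file, "w") as pipe:
        for command in commands:
            pipe.write(command + "\n")


if __name__ == "__main__":
    # Hypothetical usage: print the commands instead of submitting them.
    for line in forced_host_check_commands(["kdcpisbx02"]):
        print(line)
```

Forcing a few hundred host checks at once adds a burst of jobs to the host queue, so it may be worth submitting them in smaller batches on a busy system.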