host check orphaned

CFT6Server · Post by **CFT6Server** » Tue Jul 07, 2015 3:02 pm

I noticed that we are getting a lot of Host down messages with "host check orphaned, is the mod-gearman worker on queue 'host' running?". Yesterday I restarted the Nagios service and the mod-gearman worker/gearmand as documented to try to fix this issue. That seems to have worked only temporarily: there are now 315 hosts "down" out of 461. Any idea what is causing this? Thanks.

Also, this seems to affect only the host checks and not the service checks.

jdalrymple · Post by **jdalrymple** » Tue Jul 07, 2015 3:51 pm

Is it broken now? If so can you post the output of gearman_top?

Thanks

CFT6Server · Post by **CFT6Server** » Tue Jul 07, 2015 4:33 pm

It is still broken. Here's the gearman_top output:

[Attachment: gearmantop.jpg]
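Since a screenshot is hard to search or compare later, here is a minimal sketch of pulling the same numbers as text. It assumes gearmand is listening on the default port 4730 and answers the plain-text `status` admin command with tab-separated lines (function name, total jobs, running jobs, available workers) ending in a single `.` line. The function names `parse_status` and `fetch_status` are made up for illustration, and "waiting" is computed here as total minus running, which is an assumption about how the counters relate.

```python
import socket


def parse_status(text):
    """Parse the text reply of gearmand's 'status' admin command."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line == ".":
            continue  # skip blanks and the terminating "." line
        parts = line.split("\t")
        if len(parts) != 4:
            continue  # ignore anything that is not a 4-column status row
        name, total, running, workers = parts
        rows.append({
            "queue": name,
            "waiting": int(total) - int(running),  # assumption: total includes running
            "running": int(running),
            "workers": int(workers),
        })
    return rows


def fetch_status(host="localhost", port=4730, timeout=5.0):
    """Ask a gearmand instance for its queue status (host/port are assumptions)."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"status\n")
        data = b""
        while not data.endswith(b".\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    return data.decode()
```

Calling `parse_status(fetch_status())` on the XI box and looking for a `host` row with zero workers or a steadily growing waiting count would be one way to catch the orphaned state as it develops.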

jdalrymple · Post by **jdalrymple** » Tue Jul 07, 2015 4:51 pm

Everything looks proper there. Do you have your hosts sorted at all into hostgroups that relate to the worker queues, or do you just pool them all?

Is any specific worker giving issues, or are you rebooting all of the workers to get things rolling again? If it's specific to a single worker, I think we should have you turn on debugging on that worker. If not, the other place to look would be nagios.log.

In other installs I've seen "host check orphaned" result from an overly busy XI server; the job server seems to be much more sensitive to a heavy load than the workers are.

CFT6Server · Post by **CFT6Server** » Tue Jul 07, 2015 6:14 pm

I suspect that the host could be busy. However, because of all the network bandwidth checks and MRTG, I've created the host group Network_ALL and the service group All_Network_Bandwidth. Other than that, there are no other host/service configurations in the mod-gearman config. How do I know when the checks are too much for the host XI / worker?

If I reboot the XI / Master, then everything is OK temporarily. I have debugging on for all workers, and they seem OK. However the main XI box, which has a worker on it has tons of checks going to it (which are all network related checks). For configuration, I've set min workers to 25 and max to 1000 (probably won't hit it?).

jdalrymple · Post by **jdalrymple** » Wed Jul 08, 2015 9:02 am

CFT6Server wrote:How do I know when the checks are too much for the host XI / worker?

In my experience the workers aren't at all sensitive to a high load, just the job server, a.k.a. the XI box. What does the load look like on it? Unless you removed it, XI shipped with a built-in load monitor. Otherwise, sar will give us some good history to look at.
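As a rough way to put a number on "what does the load look like", here is a small sketch that normalizes the 1/5/15-minute load averages by core count. The function names are hypothetical, and the threshold of 1.0 per core is only a common rule of thumb, not a Nagios XI or Mod-Gearman limit.

```python
import os


def summarize_load(loadavg, cores, threshold=1.0):
    """Return (per-core load values, busy flag) for 1/5/15-minute averages."""
    per_core = [round(value / cores, 2) for value in loadavg]
    # Use the 15-minute figure so a short spike does not count as "busy".
    busy = per_core[2] > threshold
    return per_core, busy


def current_load_summary():
    """Read the live values on a Unix-like system such as the XI box."""
    return summarize_load(os.getloadavg(), os.cpu_count() or 1)
```

Running `current_load_summary()` periodically on the job server would give a simple trend to compare against the times the orphaned messages appear.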

CFT6Server · Post by **CFT6Server** » Thu Jul 09, 2015 11:55 am

Here are the performance stats. The load seems pretty heavy? I have a service check for the XI cpu load.

CFT6Server · Post by **CFT6Server** » Thu Jul 09, 2015 12:23 pm

One more .... Server Stats

jdalrymple · Post by **jdalrymple** » Thu Jul 09, 2015 12:32 pm

Very calm box, not at all what I expected to see knowing the size of your environment. (BTW, jolson and I are still very actively digging into the slow NNA queries.)

CFT6Server wrote:However the main XI box, which has a worker on it has tons of checks going to it (which are all network related checks).

Does the worker debug log on that box indicate anything unusual? Maybe it would be best to exclude this worker from all of the queues except those it explicitly needs to be performing?

CFT6Server · Post by **CFT6Server** » Thu Jul 09, 2015 1:22 pm

Currently the main XI box has these hosts/services configured....

Code:

services=yes
hosts=yes
hostgroups=Network_ALL
servicegroups=ALL_Network_Bandwidth
localhostgroups=localhost

It looks like host checks are still running on this box even though they aren't part of the groups defined. Perhaps I am not configuring this properly?

Debug logs are looking clean so far. I turned on tracing and then ran a forced check on a host that had the message, but then it came back green...
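A minimal sketch of how these worker options are usually understood to translate into Gearman queue subscriptions, which may explain what is being seen: `hosts=yes` subscribes the worker to the generic `host` queue (the same queue named in the orphaned message), independently of any `hostgroups=` entries. The function `worker_queues` is hypothetical, and the queue naming (`host`, `service`, `eventhandler`, `hostgroup_<name>`, `servicegroup_<name>`) follows the Mod-Gearman documentation as I understand it. `localhostgroups` is ignored here on the assumption that it keeps checks local to Nagios rather than naming a worker queue.

```python
def worker_queues(config_text):
    """List the queue names a worker would subscribe to for a given config."""
    opts = {}
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        opts.setdefault(key.strip(), []).append(value.strip())

    def enabled(key):
        return any(v.lower() in ("yes", "on", "1") for v in opts.get(key, []))

    def names(key):
        # Options may be repeated or comma-separated.
        return [n.strip() for v in opts.get(key, []) for n in v.split(",") if n.strip()]

    queues = []
    if enabled("hosts"):
        queues.append("host")          # generic queue: any host check not routed elsewhere
    if enabled("services"):
        queues.append("service")       # generic queue for service checks
    if enabled("eventhandler"):
        queues.append("eventhandler")
    queues += ["hostgroup_" + n for n in names("hostgroups")]
    queues += ["servicegroup_" + n for n in names("servicegroups")]
    return queues
```

Under these assumptions, the config above would yield `host`, `service`, `hostgroup_Network_ALL` and `servicegroup_ALL_Network_Bandwidth`, so setting `hosts=no` and `services=no` on this box would be the way to limit it to the two group queues.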

Code:

host_name=kdcpisbx02
command_line=/usr/local/nagios/libexec/check_icmp -H kdcpisbx02 -w 3000.0,80% -c 5000.0,100% -p 5
[2015-07-09 11:20:01][9414][DEBUG] got host job: kdcpisbx02
[2015-07-09 11:20:01][9414][TRACE] command: /usr/local/nagios/libexec/check_icmp -H kdcpisbx02 -w 3000.0,80% -c 5000.0,100% -p 5
host_name=kdcpisbx02
output=OK - kdcpisbx02: rta 0.541ms, lost 0%|rta=0.541ms;3000.000;5000.000;0; pl=0%;80;100;; rtmax=0.646ms;;;; rtmin=0.438ms;;;; \n
[2015-07-09 11:20:01][9414][TRACE] 310 --->host_name=kdcpisbx02
output=OK - kdcpisbx02: rta 0.541ms, lost 0%|rta=0.541ms;3000.000;5000.000;0; pl=0%;80;100;; rtmax=0.646ms;;;; rtmin=0.438ms;;;; \n

I had to restart the worker nodes to enable tracing, so I am wondering whether the restart itself helped or whether the issue will come back over time. I will monitor these.

It looks like I cannot do a mass forced check of all host checks?
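For watching this over time, a small sketch that counts how often each host shows up in the worker debug log. It relies only on the `[timestamp][pid][LEVEL] got host job: <host>` line shape visible in the excerpt above; the function name is hypothetical, and any other message formats are simply skipped.

```python
import re
from collections import Counter

# Matches lines like: [2015-07-09 11:20:01][9414][DEBUG] got host job: kdcpisbx02
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\]\[(?P<pid>\d+)\]\[(?P<level>[A-Z]+)\]\s*(?P<msg>.*)$")
HOST_JOB_RE = re.compile(r"got host job: (\S+)")


def count_host_jobs(lines):
    """Count 'got host job' entries per host name in worker log lines."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # continuation lines such as 'host_name=...' are ignored
        job = HOST_JOB_RE.search(match.group("msg"))
        if job:
            counts[job.group(1)] += 1
    return counts
```

Feeding it the lines of the worker log (for example via `open(path)`, where the path depends on the worker's logfile setting) and checking whether hosts flagged as orphaned ever appear in the counts would show if their jobs are reaching this worker at all.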
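One possible workaround, sketched under assumptions: Nagios Core accepts `SCHEDULE_FORCED_HOST_CHECK;<host_name>;<check_time>` through its external command file, so a forced check for many hosts can be queued by writing one line per host. The function names are hypothetical, and the default path `/usr/local/nagios/var/rw/nagios.cmd` is the usual location on source-style installs but should be confirmed against `command_file` in nagios.cfg.

```python
import time


def forced_host_check_lines(host_names, when=None):
    """Build external-command lines that force a host check for each host."""
    when = int(time.time()) if when is None else int(when)
    return [
        "[%d] SCHEDULE_FORCED_HOST_CHECK;%s;%d\n" % (when, host, when)
        for host in host_names
    ]


def submit(lines, command_file="/usr/local/nagios/var/rw/nagios.cmd"):
    """Write the commands to the Nagios command pipe (path is an assumption)."""
    with open(command_file, "w") as pipe:
        pipe.writelines(lines)
```

For example, `submit(forced_host_check_lines(["kdcpisbx02"]))` would queue an immediate forced check for the host from the trace above; the list of all host names would have to come from your own inventory or object configuration.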
