host check orphaned

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

host check orphaned

Post by CFT6Server »

I noticed that we are getting a lot of host down messages with "host check orphaned, is the mod-gearman worker on queue 'host' running?". Yesterday I restarted the Nagios service and the mod gearman worker/gearmand as documented to try to fix this issue. That seems to have worked temporarily. Now there are 315 hosts "down" out of 461. Any idea what is causing this? Thanks.

Also, this seems to affect only the host checks and not the service checks...
jdalrymple
Skynet Drone
Posts: 2620
Joined: Wed Feb 11, 2015 1:56 pm

Re: host check orphaned

Post by jdalrymple »

Is it broken now? If so can you post the output of gearman_top?

Thanks
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Re: host check orphaned

Post by CFT6Server »

It is still broken... here's gearman_top
gearmantop.jpg
jdalrymple
Skynet Drone
Posts: 2620
Joined: Wed Feb 11, 2015 1:56 pm

Re: host check orphaned

Post by jdalrymple »

Everything looks proper there. Do you have your hosts sorted at all into hostgroups that relate to the worker queues, or do you just pool them all?

Is any specific worker giving issues, or are you rebooting all of the workers to get things rolling again? If it's specific to a single worker, I think we should have you turn on debugging on that worker. If not, the other place to look would be nagios.log.
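
As a rough way to see which hosts are affected and how often, the sketch below counts HOST ALERT entries in nagios.log whose output contains the orphaned message. It assumes the standard Nagios `HOST ALERT: host;state;state_type;attempt;output` log line layout; the function name and the sample lines are made up for illustration and are not taken from this environment.

```python
import re
from collections import Counter

# Standard Nagios alert line: [epoch] HOST ALERT: host;state;state_type;attempt;output
HOST_ALERT = re.compile(r"^\[(\d+)\] HOST ALERT: ([^;]+);([^;]+);([^;]+);(\d+);(.*)$")


def count_orphaned_hosts(lines, needle="host check orphaned"):
    """Count HOST ALERT entries per host whose plugin output contains `needle`."""
    counts = Counter()
    for line in lines:
        match = HOST_ALERT.match(line.strip())
        if match and needle in match.group(6):
            counts[match.group(2)] += 1
    return counts


# Made-up sample lines for illustration only; not from the poster's log.
sample = [
    "[1436462401] HOST ALERT: host-a;DOWN;SOFT;1;host check orphaned, is the mod-gearman worker on queue 'host' running?",
    "[1436462461] HOST ALERT: host-a;DOWN;HARD;5;host check orphaned, is the mod-gearman worker on queue 'host' running?",
    "[1436462470] HOST ALERT: host-b;UP;HARD;1;OK - host-b: rta 0.541ms, lost 0%",
]
print(count_orphaned_hosts(sample).most_common())
```

To run it against a real log, pass an open file object instead of `sample`. The log is commonly at /usr/local/nagios/var/nagios.log, but that path is an assumption; the `log_file` setting in nagios.cfg is authoritative.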

In other installs I've seen "host check orphaned" result from an overly busy XI server; the job server seems to be much more sensitive to a heavy load than the workers are.
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Re: host check orphaned

Post by CFT6Server »

I suspect that the host could be busy. However, due to the nature of all the network bandwidth checks and MRTG, I've created the host group Network_ALL and the service group All_Network_Bandwidth. Other than that, there are no other host/service configurations in the mod gearman config. How do I know when the checks are too much for the host XI / worker?

If I reboot the XI / Master, then everything is OK temporarily. I have debugging on for all workers, and they seem OK. However the main XI box, which has a worker on it has tons of checks going to it (which are all network related checks). For configuration, I've set min workers to 25 and max to 1000 (probably won't hit it?)
jdalrymple
Skynet Drone
Posts: 2620
Joined: Wed Feb 11, 2015 1:56 pm

Re: host check orphaned

Post by jdalrymple »

CFT6Server wrote:How do I know when the checks are too much for the host XI / worker?
In my experience the workers aren't at all sensitive to a high load, just the job server, a.k.a. the XI box. What does the load look like on it? Unless you removed it, XI shipped with a built-in load monitor. Otherwise, sar will give us some good history to look at.
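
For a quick spot check alongside the built-in monitor or sar, one option is to compare the load averages with the number of CPU cores. The sketch below is only an illustration: `load_per_core` is a hypothetical helper, and treating values that stay well above 1.0 per core as "busy" is a general rule of thumb, not a Nagios XI threshold.

```python
import os


def load_per_core(load_averages, cpu_count):
    """Divide the 1/5/15-minute load averages by the number of CPU cores."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return tuple(round(value / cpu_count, 2) for value in load_averages)


if __name__ == "__main__":
    # os.getloadavg() is only available on Unix-like systems.
    cores = os.cpu_count() or 1
    print(load_per_core(os.getloadavg(), cores))
```

This only shows the current moment; the history from sar is still what would reveal load spikes around the times the orphaned messages appear.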
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Re: host check orphaned

Post by CFT6Server »

Here are the performance stats. The load seems pretty heavy? I have a service check for the XI cpu load.
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Re: host check orphaned

Post by CFT6Server »

One more .... Server Stats
jdalrymple
Skynet Drone
Posts: 2620
Joined: Wed Feb 11, 2015 1:56 pm

Re: host check orphaned

Post by jdalrymple »

Very calm box, not at all what I expected to see knowing the size of your environment. (BTW, jolson and I are still very actively digging into the slow NNA queries.)
CFT6Server wrote:However the main XI box, which has a worker on it has tons of checks going to it (which are all network related checks).
Does the worker debug log on that box indicate anything unusual? Maybe it would be best to exclude this worker from all of the queues except those it explicitly needs to be performing?
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Re: host check orphaned

Post by CFT6Server »

Currently the main XI box has these hosts/services configured....

Code:

services=yes
hosts=yes
hostgroups=Network_ALL
servicegroups=ALL_Network_Bandwidth
localhostgroups=localhost
Looks like host checks are still running on this box even though they aren't part of the groups defined. Perhaps I am not configuring this properly?

Debug logs are looking clean so far. I turned on tracing and then ran a forced check on a host that had the message, but then it came back green...

Code:

host_name=kdcpisbx02
command_line=/usr/local/nagios/libexec/check_icmp -H kdcpisbx02 -w 3000.0,80% -c 5000.0,100% -p 5
[2015-07-09 11:20:01][9414][DEBUG] got host job: kdcpisbx02
[2015-07-09 11:20:01][9414][TRACE] command: /usr/local/nagios/libexec/check_icmp -H kdcpisbx02 -w 3000.0,80% -c 5000.0,100% -p 5
host_name=kdcpisbx02
output=OK - kdcpisbx02: rta 0.541ms, lost 0%|rta=0.541ms;3000.000;5000.000;0; pl=0%;80;100;; rtmax=0.646ms;;;; rtmin=0.438ms;;;; \n
[2015-07-09 11:20:01][9414][TRACE] 310 --->host_name=kdcpisbx02
output=OK - kdcpisbx02: rta 0.541ms, lost 0%|rta=0.541ms;3000.000;5000.000;0; pl=0%;80;100;; rtmax=0.646ms;;;; rtmin=0.438ms;;;; \n
I had to restart the worker nodes to enable tracing, so I am wondering if the restart is what helped, or if the issue will come back over time. I will monitor these....

Looks like I cannot do a mass forced check of all host checks?
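
One way this could be done outside the web interface is through the Nagios external command file, using the standard `SCHEDULE_FORCED_HOST_CHECK;<host_name>;<check_time>` command once per host. The sketch below is a minimal illustration, not a tested XI procedure: the function names are hypothetical, the command file path is an assumption, and the host list has to come from your own inventory.

```python
import time


def forced_host_check_commands(host_names, now=None):
    """Build one SCHEDULE_FORCED_HOST_CHECK external command line per host."""
    timestamp = int(time.time() if now is None else now)
    return [
        f"[{timestamp}] SCHEDULE_FORCED_HOST_CHECK;{name};{timestamp}\n"
        for name in host_names
    ]


def submit(commands, command_file="/usr/local/nagios/var/rw/nagios.cmd"):
    """Write the commands to the Nagios command pipe.

    The default path is an assumption; check command_file in nagios.cfg.
    """
    with open(command_file, "w") as pipe:
        pipe.writelines(commands)


# kdcpisbx02 appears in the trace above; "another-host" is a placeholder.
print("".join(forced_host_check_commands(["kdcpisbx02", "another-host"], now=1436466001)), end="")
```

Calling `submit(...)` needs write permission on the command pipe, so it would normally be run as the nagios user or a member of the command group. Forcing several hundred host checks at once also adds a burst of jobs to the queues, so it may be worth doing in batches.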