host check orphaned

Posted: Tue Mar 10, 2015 8:56 am
by bosecorp
For about a week I have been getting this error:

(host check orphaned, is the mod-gearman worker on queue 'host' running?)

I tried rebooting my server, but that didn't help.

I am running Nagios 4. The gearman version is 1.4 because that is the only one supported by Nagios 4.

I am running RHEL 6, 64-bit. The servers are running in VMware. This was a manual installation.

The database server is a separate server.

Re: host check orphaned

Posted: Tue Mar 10, 2015 9:10 am
by jolson
Please run the following on the problem host and report the results:

Code:

gearman_top
tail -n20 /var/log/gearmand.log

Re: host check orphaned

Posted: Tue Mar 10, 2015 10:32 am
by bosecorp

Code:

Queue Name             | Worker Available | Jobs Waiting | Jobs Running
-------------------------------------------------------------------------
 check_results          |               1  |           0  |           0
 eventhandler           |               6  |           0  |           0
 host                   |              13  |           0  |           2
 hostgroup_gearman_dce1 |               5  |           0  |           0
 hostgroup_gearman_dcn1 |               5  |           0  |           0
 service                |              13  |           0  |           0
 worker_gearmandce1     |               1  |           0  |           0
 worker_gearmandcn1     |               1  |           0  |           0
 worker_nagmonus1       |               1  |           0  |           0
 worker_nagmonus2       |               1  |           0  |           0
-------------------------------------------------------------------------

root@nagmonus1:(03-10 11:32): /root
# tail -n20 /var/log/gearmand/gearmand.log 
  ERROR 2015-02-08 20:26:28.000000 [     2 ] recv(Connection timed out) -> libgearman-server/io.cc:105
  ERROR 2015-02-08 20:26:28.000000 [     2 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2015-02-08 20:26:28.000000 [     3 ] recv(Connection timed out) -> libgearman-server/io.cc:105
  ERROR 2015-02-08 20:26:28.000000 [     3 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2015-02-08 20:26:28.000000 [     3 ] recv(Connection timed out) -> libgearman-server/io.cc:105
  ERROR 2015-02-08 20:26:28.000000 [     3 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2015-02-08 20:26:28.000000 [     4 ] recv(Connection timed out) -> libgearman-server/io.cc:105
  ERROR 2015-02-08 20:26:28.000000 [     4 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2015-02-08 20:26:28.000000 [     4 ] recv(Connection timed out) -> libgearman-server/io.cc:105
  ERROR 2015-02-08 20:26:28.000000 [     4 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2015-02-08 20:26:28.000000 [     3 ] recv(Connection timed out) -> libgearman-server/io.cc:105
  ERROR 2015-02-08 20:26:28.000000 [     3 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2015-02-08 20:26:28.000000 [     2 ] recv(Connection timed out) -> libgearman-server/io.cc:105
  ERROR 2015-02-08 20:26:28.000000 [     2 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2015-02-09 03:10:49.000000 [     3 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2015-02-09 03:10:49.000000 [     3 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2015-02-09 14:48:56.000000 [     4 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2015-02-09 14:48:56.000000 [     4 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
  ERROR 2015-02-10 13:42:29.000000 [     4 ] lost connection to client recv(EPIPE || ECONNRESET || EHOSTDOWN)(Connection reset by peer) -> libgearman-server/io.cc:100
  ERROR 2015-02-10 13:42:29.000000 [     4 ] closing connection due to previous errno error -> libgearman-server/io.cc:109
root@nagmonus1:(03-10 11:32): /root
#
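
When reading a gearman_top table like the one above, the rows that usually deserve a second look are queues with no workers available or with jobs waiting. The following is a minimal sketch, assuming the pipe-separated layout shown in this post; the function name `flag_queues` is made up for illustration and is not part of mod_gearman.

```shell
# Read gearman_top-style rows on stdin and print queues that look unhealthy:
# either no workers are available, or jobs are waiting.
# Header and separator lines are skipped because their second column is not numeric.
flag_queues() {
    awk -F'|' '
        NF == 4 && $2 ~ /^[ \t]*[0-9]+[ \t]*$/ {
            name = $1
            gsub(/[ \t]/, "", name)      # trim padding around the queue name
            workers = $2 + 0
            waiting = $3 + 0
            if (workers == 0)
                print name ": no workers"
            else if (waiting > 0)
                print name ": " waiting " jobs waiting"
        }'
}
```

Fed the table pasted above, this would print nothing: every queue has at least one worker and no jobs are waiting, so the queue layout itself does not look like the problem.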

Re: host check orphaned

Posted: Tue Mar 10, 2015 12:58 pm
by lmiltchev
It's possible that the timeout on some of your host check plugins (depending on what you are using) is higher than the Nagios host check timeout.

Code:

grep host_check_timeout /usr/local/nagios/etc/nagios.cfg

You can try increasing the "host_check_timeout" value a bit, then restarting nagios, gearmand, and the workers. Let us know if this resolves your issue.
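
To make the comparison concrete, here is a minimal shell sketch. It assumes the config path used in this thread as a default. The plugin timeout is a hypothetical number you would take from your own host check command definition (for example a `-t` argument); it is not something shown in the thread. If the directive appears more than once, the sketch prints the last one, and it does not check which one Nagios actually honors.

```shell
# Print the host_check_timeout value from a Nagios main config file.
# Defaults to the path used in this thread; pass another path as $1.
get_host_check_timeout() {
    cfg="${1:-/usr/local/nagios/etc/nagios.cfg}"
    # Match only uncommented definitions and keep the numeric value.
    sed -n 's/^[[:space:]]*host_check_timeout[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$cfg" | tail -n 1
}

# Succeed (exit 0) if the plugin's own timeout is shorter than the Nagios limit.
# Usage: timeout_fits PLUGIN_TIMEOUT NAGIOS_TIMEOUT
timeout_fits() {
    [ "$1" -lt "$2" ]
}
```

For example, `timeout_fits 45 "$(get_host_check_timeout)"` would fail with a limit of 30, which is the situation described above: the plugin is still waiting when Nagios gives up on it.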

Re: host check orphaned

Posted: Tue Mar 10, 2015 1:08 pm
by bosecorp
This is the value:

# grep host_check_timeout /usr/local/nagios/etc/nagios.cfg
host_check_timeout=30

I tried increasing the value and it did not work.

Re: host check orphaned

Posted: Tue Mar 10, 2015 1:49 pm
by lmiltchev
Run the following commands and show us the output in code wraps:

Code:

/usr/local/nagios/bin/nagios | head -2
/usr/local/nagios/bin/ndo2db | head -2
grep broker /usr/local/nagios/etc/nagios.cfg
rpm -qa | grep gearman

Re: host check orphaned

Posted: Tue Mar 10, 2015 1:57 pm
by bosecorp

Code:

# /usr/local/nagios/bin/nagios | head -2

Nagios Core 4.0.8
You have mail in /var/spool/mail/root
root@nagmonus1:(03-10 13:51): /usr/local/nagios/var
# /usr/local/nagios/bin/ndo2db | head -2
grep broker /usr/local/nagios/etc/nagios.cfg

NDO2DB 2.0.0
root@nagmonus1:(03-10 13:51): /usr/local/nagios/var
# grep broker /usr/local/nagios/etc/nagios.cfg
broker_module=/usr/local/nagios/bin/ndomod.o config_file=/usr/local/nagios/etc/ndomod.cfg
broker_module=/usr/lib64/mod_gearman/mod_gearman.o config=/etc/mod_gearman/mod_gearman_neb.conf
event_broker_options=-1
root@nagmonus1:(03-10 13:51): /usr/local/nagios/var
# rpm -qa | grep gearman
gearmand-0.25-1.x86_64
mod_gearman-1.4_nagios4-1.el6.x86_64
libgearman-1.1.8-2.el6.x86_64
gearmand-server-0.33-2.x86_64
gearmand-devel-0.25-1.x86_64
root@nagmonus1:(03-10 13:51): /usr/local/nagios/var

What are the implications of increasing host_check_timeout? Why do we need to increase it, and what is the logic behind it? I am just trying to understand.

--------

I just increased the value all the way up to 290 and it is now working fine.

But my previous question still stands: what are the implications of setting the value that high? What is this value for?

Edit: take that back. it started again. it only worked for few

Re: host check orphaned

Posted: Tue Mar 10, 2015 2:36 pm
by jolson
Regarding host_check_timeout
This is the maximum number of seconds that Nagios will allow host checks to run. If checks exceed this limit, they are killed and a CRITICAL state is returned and the host will be assumed to be DOWN. A timeout error will also be logged.

There is often widespread confusion as to what this option really does. It is meant to be used as a last ditch mechanism to kill off plugins which are misbehaving and not exiting in a timely manner. It should be set to something high (like 60 seconds or more), so that each host check normally finishes executing within this time limit. If a host check runs longer than this limit, Nagios will kill it off thinking it is a runaway process.
http://nagios.sourceforge.net/docs/3_0/configmain.html
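
The behavior described above can be imitated from a shell to get a feel for it. This is only an illustration built on the coreutils `timeout` command, not how Nagios implements the limit internally; the function name `run_with_limit` is made up for this sketch.

```shell
# Run a command with a hard time limit, loosely imitating host_check_timeout.
# Usage: run_with_limit SECONDS COMMAND [ARGS...]
run_with_limit() {
    limit="$1"
    shift
    rc=0
    timeout "$limit" "$@" || rc=$?
    if [ "$rc" -eq 124 ]; then
        # timeout exits with 124 when it had to kill the command.
        echo "check timed out after ${limit}s"
        return 2    # 2 is the plugin exit code for CRITICAL
    fi
    return "$rc"
}
```

A check that finishes inside the limit passes its own exit code through untouched. Only a check that overruns gets killed and reported as CRITICAL, which is why the limit is meant as a safety net rather than a tuning knob.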

Re: host check orphaned

Posted: Tue Mar 10, 2015 2:47 pm
by bosecorp
Thanks for the explanation.

Unfortunately, the problem came back.

Re: host check orphaned

Posted: Tue Mar 10, 2015 4:07 pm
by abrist
Are the orphaned host checks only related to hosts that are not responding?
If you manually run the host check from the gearman server, do you get a timeout?
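
One way to run that manual test is to wrap the check command so that its duration and exit code are printed after its normal output. This is a rough sketch; `time_check` is a made-up helper name, and the plugin path and address in the usage comment are placeholders, not values from this thread.

```shell
# Run a check by hand, then report how long it took and how it exited.
# Usage: time_check COMMAND [ARGS...]
# Hypothetical example (path and address are placeholders):
#   time_check /usr/local/nagios/libexec/check_ping -H 192.0.2.10 -w 3000.0,80% -c 5000.0,100% -p 5
time_check() {
    start=$(date +%s)
    rc=0
    "$@" || rc=$?
    end=$(date +%s)
    # Summary is printed last, after whatever the plugin itself printed.
    echo "elapsed=$((end - start))s exit=$rc"
}
```

If the elapsed time for an unresponsive host comes close to or exceeds the configured host_check_timeout, that points at the plugin's own timeout rather than at the workers.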