Re: host check orphaned, is the mod-gearman worker on queue
Posted: Fri Apr 13, 2018 6:42 am
by rtsupport
Your reply seems strange to me. Mod-Gearman was not developed by Nagios, but there are many documents written by Nagios that explain how to integrate with it, and they give the Nagios support forum as the point of contact. So it is reasonable to expect that Nagios will also support issues related to Mod-Gearman.
Here are a few links that I referred to as well.
https://assets.nagios.com/downloads/gen ... utions.pdf
https://assets.nagios.com/downloads/nag ... ios_XI.pdf
https://support.nagios.com/kb/article.php?id=484
https://assets.nagios.com/downloads/nag ... buted.html
Re: host check orphaned, is the mod-gearman worker on queue
Posted: Fri Apr 13, 2018 10:28 am
by tmcdonald
While it is true that we have documentation for Mod_Gearman and we recommend it for distributed checks, the fact remains that we did not write the software so we cannot guarantee its functionality. The distinction is that we recommend using it and will support certain setups, but we cannot guarantee it will always work since we did not write it. At a certain point it takes familiarity with the source code to explain why something does or does not work, and we just aren't familiar with it on that level.
The same is true for just about any other project we do not maintain (Windows, MySQL, Python, etc). We definitely use or integrate with those technologies, but certain problems with them are just out of scope. Generally with third-party software we take a "best effort" approach and try to help as much as we can - I think we can all agree that is what happened here, but as @tgriep mentioned it seems we have hit the point where asking the original author is the best step.
Re: host check orphaned, is the mod-gearman worker on queue
Posted: Mon Apr 16, 2018 8:10 am
by rtsupport
OK, let's skip this...
@tgriep
As per our last discussion, we commented out our test server, and we are now running with one Gearman server and one worker server. If we stop the worker service, then all servers/services on the Nagios server should go critical, or we should get the error "host check orphaned, is the mod-gearman worker on queue". To test this I stopped the worker service, yet all servers/services on the Nagios server are still working fine. Any suggestions?
Re: host check orphaned, is the mod-gearman worker on queue
Posted: Mon Apr 16, 2018 3:20 pm
by tgriep
Can you login as root on the server that is running the Gearman Server, run the following commands and post the /tmp/info.txt file here?
Code:
ps -ef --cols=300 >/tmp/info.txt
netstat -an >>/tmp/info.txt
gearman_top2 -b 1 -v >>/tmp/info.txt
Thanks
Re: host check orphaned, is the mod-gearman worker on queue
Posted: Tue Apr 17, 2018 6:07 am
by rtsupport
The details have been shared with you via PM, as advised.
Re: host check orphaned, is the mod-gearman worker on queue
Posted: Tue Apr 17, 2018 9:24 am
by tgriep
Thanks for the file. Everything looks like it should work, so let's test it again by shutting off the Gearman Worker.
Then on the Gearman server, check the gearmand.log file and see if the server detects that the worker is down, and note any errors that happen when the checks are submitted.
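One way to see whether the server has noticed the missing worker is to look at the reply to gearmand's plain-text "status" admin command, which lists each queue with its total jobs, running jobs, and available workers. The following is only a sketch: the function name `queues_without_workers` is made up for illustration, and it assumes the usual tab-separated reply that ends with a line containing a single ".".

```shell
#!/bin/sh
# Sketch: list Gearman queues that currently have no workers attached.
# Reads the reply of gearmand's "status" admin command on stdin, one queue
# per line (assumed layout):
#   <queue name>\t<jobs total>\t<jobs running>\t<available workers>
# The reply is terminated by a line containing a single ".".
# The function name is hypothetical, not part of any Gearman tool.
queues_without_workers() {
    awk -F '\t' '$1 != "." && NF >= 4 && $4 == 0 { print $1 }'
}
```

Assuming `nc` is installed and gearmand listens on the local port 4730 seen in the log, it could be fed with `printf 'status\n' | nc -w 2 127.0.0.1 4730 | queues_without_workers`. With the worker stopped, the queues it served would be expected to show up in the output.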
Re: host check orphaned, is the mod-gearman worker on queue
Posted: Tue Apr 17, 2018 10:52 am
by rtsupport
I have shared the log output with you via PM.
test.nagioslog.txt -- /usr/local/nagios/var/nagios.log
test.gearmandlog.txt -- /var/log/gearman/gearmand.log
Observations:
After stopping the worker service:
# All server and service checks still work for servers in the hostgroup (nagios_infrastructure).
# Server and service checks stopped for servers in any group other than (nagios_infrastructure).
# Server and service checks also stopped for servers that were configured using the configuration wizard.
# No error related to the stopped worker.
Re: host check orphaned, is the mod-gearman worker on queue
Posted: Wed Apr 18, 2018 8:30 am
by tgriep
How long did you shut down the Worker for? I tested out the Orphan settings and I did receive the Orphan message after 30 minutes.
(host check orphaned, is the mod-gearman worker on queue 'hostgroup_Centos7_HostGroup' running?)
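As far as the Mod-Gearman documentation describes it, this orphan message is produced by the NEB module and is controlled by the `orphan_host_checks` and `orphan_service_checks` options in the module configuration. A quick way to confirm what is currently set is sketched below; the function name is hypothetical and the config file location is left to the caller because it differs between installs.

```shell
#!/bin/sh
# Sketch: print the active orphan-related settings from a Mod-Gearman
# module config file. The path is passed in by the caller; no default
# location is assumed. The function name is made up for illustration.
show_orphan_settings() {
    conf=$1
    # Match uncommented lines only, allowing spaces around the "=".
    grep -E '^[[:space:]]*orphan_(host|service)_checks[[:space:]]*=' "$conf"
}
```

For example, `show_orphan_settings /etc/mod_gearman2/module.conf` (the path here is an assumption). If nothing is printed, the options are not set explicitly and the module defaults apply.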
Re: host check orphaned, is the mod-gearman worker on queue
Posted: Wed Apr 18, 2018 11:22 am
by rtsupport
Today I stopped it again for more than 2 hours, but there was no error.
[nagios@usa****** ~]$ /etc/init.d/mod_gearman_worker stop ; date
Stopping mod_gearman2_worker..................OK
Wed Apr 18 10:18:57 EDT 2018
[nagios@usa******* ~]$ /etc/init.d/mod_gearman_worker start ; date
Starting mod_gearman2_worker...OK
Wed Apr 18 12:20:22 EDT 2018
[nagios@usa****** ~]$ tail -n 10 /var/log/gearman/gearmand.log
ERROR 2017-04-06 00:18:07.000000 [ 1 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-04-06 00:18:07.000000 [ 2 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-04-06 00:18:07.000000 [ 1 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-04-06 00:18:07.000000 [ 1 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-04-06 00:18:07.000000 [ 3 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-04-06 00:18:07.000000 [ 2 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2017-04-06 00:26:02.000000 [ 4 ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
ERROR 2018-02-06 14:43:31.000000 [ main ] Timeout occured when calling bind() for 0.0.0.0:4730 -> libgearman-server/gearmand.cc:688
ERROR 2018-04-04 12:40:10.000000 [ proc ] GEARMAND_WAKEUP_RUN(Bad file descriptor) -> libgearman-server/gearmand_thread.cc:399
ERROR 2018-04-13 13:06:03.000000 [ main ] GEARMAND_WAKEUP_RUN(Bad file descriptor) -> libgearman-server/gearmand_thread.cc:399
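Note that the newest entry in this tail output is dated 2018-04-13, so gearmand does not appear to have logged anything during the test window on 2018-04-18. To check a specific day without scrolling through old entries, a small filter like the following could be used. It is a sketch that assumes the "LEVEL date time ..." line layout shown above; the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: print only the gearmand.log lines written on a given date.
# Assumes each line starts with "<LEVEL> <YYYY-MM-DD> <time> ...", as in
# the output above. The function name is made up for illustration.
log_lines_on_date() {
    logfile=$1
    day=$2
    # Field 2 is the date; print the line when it matches exactly.
    awk -v day="$day" '$2 == day' "$logfile"
}
```

For example, `log_lines_on_date /var/log/gearman/gearmand.log 2018-04-18` prints only that day's entries, and prints nothing if gearmand logged nothing that day.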
Re: host check orphaned, is the mod-gearman worker on queue
Posted: Wed Apr 18, 2018 11:41 am
by rtsupport
Adding to this:
When we start the worker service, we suddenly start getting critical alerts, followed by recovery alerts, for servers and services that are not members of the hostgroup "nagios_infrastructure".
The errors are random. Below are all the errors we were getting before the recovery alerts.
Info:CRITICAL: Return code of 255 is out of bounds. (worker: usa0***)
Info:CHECK_NRPE: Socket timeout after 30 seconds.
Info:CHECK_NRPE: Error - Could not complete SSL handshake.