gearman troubles

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
WillemDH
Posts: 2320
Joined: Wed Mar 20, 2013 5:49 am
Location: Ghent

Post by WillemDH »

Hello,

We just set up a new CentOS 7 Gearman worker. Hosts and services assigned to this worker node show a weird issue: their next check time is in the past. Is there a known issue associated with this? Date and hwclock are correct on both the Nagios server and the Gearman workers.
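One way to put a number on "the clocks are correct" is to compare epoch seconds from both machines, which sidesteps time zone differences. The following is only a minimal sketch; the SSH host name in the usage comment is a placeholder and not something taken from this setup.

```shell
# Print the absolute difference in seconds between two epoch timestamps.
clock_skew() {
    diff=$(( $1 - $2 ))
    # Strip a leading minus sign to get the absolute value.
    echo "${diff#-}"
}

# Hypothetical usage from the Nagios XI server ("gearman-worker" is a placeholder):
# skew=$(clock_skew "$(date +%s)" "$(ssh gearman-worker date +%s)")
# [ "$skew" -le 5 ] || echo "clocks differ by ${skew}s"
```

A skew of a few seconds would not by itself explain a next check time in the past, so this only helps rule the clocks in or out.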

On the Nagios XI server:

Code: Select all

yum list installed | grep gearman
gearmand.x86_64                    1:0.33-2                            @/gearmand-0.33-2.rhel6.x86_64
gearmand-devel.x86_64              1:0.33-2                            @/gearmand-devel-0.33-2.rhel6.x86_64
gearmand-server.x86_64             1:0.33-2                            @/gearmand-server-0.33-2.rhel6.x86_64
mod_gearman2.x86_64                2.1.1-1.el6                         @/mod_gearman2-2.1.1-1.rhel6.x86_64
On the new Gearman server:

Code: Select all

yum list installed | grep gearman  [17-01-05 10:40:58]
gearmand.x86_64                         1:0.33-2                       installed
gearmand-debuginfo.x86_64               1:0.33-2                       installed
gearmand-devel.x86_64                   1:0.33-2                       installed
mod_gearman2.x86_64                     2.1.1-1.el7.centos             installed
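The two listings above differ in a few places (for example, gearmand-server appears only on the XI server, and the mod_gearman2 builds are el6 versus el7.centos). A small sketch for comparing such listings side by side follows; the file names in the usage comment are hypothetical, and the function simply drops the repo column.

```shell
# Reduce "yum list installed" lines to "name version" and sort them,
# dropping the repo column and any trailing prompt residue.
pkg_versions() {
    awk '$1 ~ /gearman/ { print $1, $2 }' | sort
}

# Hypothetical usage, with each node's listing saved to a file first:
# pkg_versions < xi-server.txt > server.pkgs
# pkg_versions < worker.txt > worker.pkgs
# diff server.pkgs worker.pkgs
```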
We sometimes also seem to get the error "(host check orphaned, is the mod-gearman worker on queue 'hostgroup_hg_gearman_"

While the service seems to be running on this Gearman worker:

Code: Select all

systemctl status mod-gearman2-worker  [17-01-05 10:43:24]
● mod-gearman2-worker.service - Mod-Gearman Worker
   Loaded: loaded (/usr/lib/systemd/system/mod-gearman2-worker.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2017-01-05 09:13:31 CET; 1h 30min ago
     Docs: http://mod-gearman.org/docs.html
  Process: 1162 ExecStart=/usr/bin/mod_gearman2_worker -d --config=/etc/mod_gearman2/worker.conf --pidfile=/var/mod_gearman2/mod_gearman_worker.pid (code=exited, status=0/SUCCESS)
 Main PID: 1171 (mod_gearman2_wo)
   CGroup: /system.slice/mod-gearman2-worker.service
           ├─1171 /usr/bin/mod_gearman2_worker -d --config=/etc/mod_gearman2/worker.conf --pidfile=/var/mod_gearman2/mod_gearman_worker.pid
           ├─6828 /usr/bin/mod_gearman2_worker -d --config=/etc/mod_gearman2/worker.conf --pidfile=/var/mod_gearman2/mod_gearman_worker.pid
           ├─6830 /usr/bin/mod_gearman2_worker -d --config=/etc/mod_gearman2/worker.conf --pidfile=/var/mod_gearman2/mod_gearman_worker.pid
           ├─6831 /usr/bin/mod_gearman2_worker -d --config=/etc/mod_gearman2/worker.conf --pidfile=/var/mod_gearman2/mod_gearman_worker.pid
           ├─6832 /usr/bin/mod_gearman2_worker -d --config=/etc/mod_gearman2/worker.conf --pidfile=/var/mod_gearman2/mod_gearman_worker.pid
           ├─6833 /usr/bin/mod_gearman2_worker -d --config=/etc/mod_gearman2/worker.conf --pidfile=/var/mod_gearman2/mod_gearman_worker.pid
           └─6834 /usr/bin/mod_gearman2_worker -d --config=/etc/mod_gearman2/worker.conf --pidfile=/var/mod_gearman2/mod_gearman_worker.pid
The actual time when the screenshot was taken was 10:48. I have another Gearman 2 worker on CentOS 6 where everything seems to be working fine. The config files are identical, except for the hostgroup, which defines the checks that have to run on the worker.

Code: Select all

###############################################################################
#
#  Mod-Gearman - distribute checks with gearman
#
#  Copyright (c) 2010 Sven Nierlein
#
#  Worker Module Config
#
###############################################################################

# Identifier, hostname will be used if undefined
#identifier=hostname

# use debug to increase the verbosity of the module.
# Possible values are:
#     0 = only errors
#     1 = debug messages
#     2 = trace messages
#     3 = trace and all gearman related logs are going to stdout.
# Default is 0.
debug=1

# Path to the logfile.
logfile=/var/log/mod_gearman2/mod_gearman_worker.log

# sets the address of your gearman job server. Can be specified
# more than once to add more servers.
server=10.10.10.10:4730


# sets the address of your 2nd (duplicate) gearman job server. Can
# be specified more than once to add more servers.
#dupserver=<host>:<port>


# defines if the worker should execute eventhandlers.
eventhandler=no


# defines if the worker should execute
# service checks.
services=no


# defines if the worker should execute
# host checks.
hosts=no


# sets a list of hostgroups which this worker will work
# on. Either specify a comma separated list or use
# multiple lines.
#hostgroups=name1
#hostgroups=name2,name3
hostgroups=hg_gearman_2

# sets a list of servicegroups which this worker will
# work on.
#servicegroups=name1,name2,name3

# enables or disables encryption. It is strongly
# advised to not disable encryption. Anybody will be
# able to inject packages to your worker.
# Encryption is enabled by default and you have to
# explicitly disable it.
# When using encryption, you will either have to
# specify a shared password with key=... or a
# keyfile with keyfile=...
# Default is On.
encryption=yes


# A shared password which will be used for
# encryption of data packets. Should be at least 8
# bytes long. Maximum length is 32 characters.
key=key


# The shared password will be read from this file.
# Use either key or keyfile. Only the first 32
# characters will be used.
#keyfile=/path/to/secret.file

# Path to the pidfile. Usually set by the init script
#pidfile=/var/mod_gearman2/mod_gearman_worker.pid

# Default job timeout in seconds. Currently this value is only used for
# eventhandler. The worker will use the values from the core for host and
# service checks.
job_timeout=60

# Minimum number of worker processes which should
# run at any time.
min-worker=5

# Maximum number of worker processes which should
# run at any time. You may set this equal to
# min-worker setting to disable dynamic starting of
# workers. When setting this to 1, all services from
# this worker will be executed one after another.
max-worker=50

# Time after which an idling worker exits
# This parameter controls how fast your waiting workers will
# exit if there are no jobs waiting.
idle-timeout=30

# Controls the amount of jobs a worker will do before he exits
# Use this to control how fast the amount of workers will go down
# after high load times
max-jobs=1000

# max-age is the threshold for discarding too old jobs. When a new job is older
# than this amount of seconds it will not be executed and just discarded. Set to
# zero to disable this check.
#max-age=0

# defines the rate of spawned worker per second as long
# as there are jobs waiting
spawn-rate=1

# Use this option to disable an extra fork for each plugin execution. Disabling
# this option will reduce the load on the worker host but can lead to problems with
# unclean plugin. Default: yes
fork_on_exec=no

# Set a limit based on the 1min load average. When exceeding the load limit,
# no new worker will be started until the current load is below the limit.
# No limit will be used when set to 0.
load_limit1=0

# Same as load_limit1 but for the 5min load average.
load_limit5=0

# Same as load_limit1 but for the 15min load average.
load_limit15=0

# Use this option to show stderr output of plugins too.
# Default: yes
show_error_output=yes

# Use dup_results_are_passive to set if the duplicate result send to the dupserver
# will be passive or active.
# Default is yes (passive).
#dup_results_are_passive=yes

# When embedded perl has been compiled in, you can use this
# switch to enable or disable the embedded perl interpreter.
enable_embedded_perl=on

# Default value used when the perl script does not have a
# "nagios: +epn" or "nagios: -epn" set.
# Perl scripts not written for epn support usually fail with epn,
# so it's better to set the default to off.
use_embedded_perl_implicitly=off

# Cache compiled perl scripts. This makes the worker process a little
# bit bigger but makes execution of perl scripts even faster.
# When turned off, Mod-Gearman will still use the embedded perl
# interpreter, but will not cache the compiled script.
use_perl_cache=on

# path to p1 file which is used to execute and cache the
# perl scripts run by the embedded perl interpreter
p1_file=/usr/share/mod_gearman2/mod_gearman_p1.pl


# Security
# restrict_path allows you to restrict this worker to only execute plugins
# from these particular folders. Can be used multiple times to specify more
# than one folder.
# Note that when this restriction is active, no shell will be spawned and
# no shell characters ($`'"()|) are allowed in the command line itself.
#restrict_path=/usr/local/plugins/

# Workarounds

# workaround for rc 25 bug
# duplicate jobs from gearmand result in exit code 25 of plugins
# because they are executed twice and get killed because of using
# the same resource.
# Sending results (when exit code is 25 ) will be skipped with this
# enabled.
workaround_rc_25=off
Thanks for any help solving this.
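To confirm that two worker configs really differ only in the hostgroup line, it can help to compare just the active settings. This is a minimal sketch: the path /etc/mod_gearman2/worker.conf comes from the systemctl output above, while the location of the CentOS 6 copy is an assumption.

```shell
# Print only the active settings of a config file: no comments, no blank lines.
active_settings() {
    grep -Ev '^[[:space:]]*(#|$)' "$1" | sort
}

# Hypothetical usage, with the CentOS 6 worker's file copied to /tmp first:
# active_settings /etc/mod_gearman2/worker.conf > el7.settings
# active_settings /tmp/worker-el6.conf > el6.settings
# diff el6.settings el7.settings
```

Sorting makes the comparison independent of the order in which options appear in each file.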
Nagios XI 5.8.1
https://outsideit.net
rkennedy
Posts: 6579
Joined: Mon Oct 05, 2015 11:45 am

Re: gearman troubles

Post by rkennedy »

From the XI machine, can you show us the full output of gearman_top2?

As for the worker side, the configuration looks good. I just compared it with what my test boxes are using and it's identical. Is there a firewall running on this machine?
Former Nagios Employee

Re: gearman troubles

Post by WillemDH »

Yes, firewalld is running on this machine, but sometimes it's working and sometimes it isn't, which would imply it's not a firewall issue. Check the attached graph, where it's clear when it's working and when it isn't.
I have no clue what might be causing this for now.
The gearman_top2 output seems OK. I can sometimes see running jobs on the problematic worker node.
SteveBeauchemin
Posts: 524
Joined: Mon Oct 14, 2013 7:19 pm

Re: gearman troubles

Post by SteveBeauchemin »

I just had a Gearman issue - finally resolved.

We have 6 workers and a core host. One test kept flapping. The flapping test was touching an AWS RDS instance doing MySQL tests. Finally someone logged in to each remote gearman worker and ran the test manually. We ran check_tcp against port 3306. One of the remote workers was unable to connect to the RDS instance; the others had no problem. When gearman was farming out the tests, it went to any of the workers randomly, so sometimes it worked, but when it landed on the bad host, it failed. This caused a 'flapping' state on the host in Nagios. This was in place for months before we finally got an actual alert and had to dig in to find the solution.

It turns out that an AWS setting had most but not all of the IPs allowed to communicate with the RDS instance.

Basically, a firewall issue, but not a firewall. More like a screening router, or hosts.allow thing. AWS flavored.
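The per-worker test described above can be approximated without the Nagios plugins. What follows is a rough sketch and not the check_tcp plugin itself: it relies on bash's /dev/tcp pseudo-device and the coreutils timeout command, and the RDS host name in the usage comment is a placeholder.

```shell
# Return 0 if a TCP connection to host:port succeeds within the timeout.
# Uses bash's /dev/tcp pseudo-device, so bash is invoked explicitly.
port_open() {
    host=$1 port=$2 secs=${3:-5}
    timeout "$secs" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Hypothetical usage, run on each worker in turn:
# port_open rds.example.internal 3306 || echo "$(hostname): cannot reach port 3306"
# The same function can test the worker's path to the job server from the
# config posted earlier in the thread:
# port_open 10.10.10.10 4730 || echo "$(hostname): cannot reach gearmand"
```

Running it from every worker, rather than only from the core host, is the point: a single worker with a blocked path is enough to produce intermittent failures.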

I hope that you are not having a similar issue, because it was quite a pain to find. But admit it. It is fun to chase this stuff down. You love it right?

Good Luck

Steve B
XI 5.7.3 / Core 4.4.6 / NagVis 1.9.8 / LiveStatus 1.5.0p11 / RRDCached 1.7.0 / Redis 3.2.8 /
SNMPTT / Gearman 0.33-7 / Mod_Gearman 3.0.7 / NLS 2.0.8 / NNA 2.3.1 /
NSClient 0.5.0 / NRPE Solaris 3.2.1 Linux 3.2.1 HPUX 3.2.1
dwhitfield
Former Nagios Staff
Posts: 4583
Joined: Wed Sep 21, 2016 10:29 am
Location: NoLo, Minneapolis, MN

Re: gearman troubles

Post by dwhitfield »

Thanks @SteveBeauchemin!

@WillemDH, was Steve's answer useful for you?

Re: gearman troubles

Post by WillemDH »

"I hope that you are not having a similar issue, because it was quite a pain to find. But admit it. It is fun to chase this stuff down. You love it right?"
Of course I love it.. if I have the time.. :)

But I'm quite sure Steve's issue is not my issue. I remember you saying you had issues with Gearman and CentOS 7 in the past. Is CentOS 7 a supported OS for a Gearman worker node in the meantime?

Are any of your 6 workers running CentOS 7, Steve?

The worker having troubles was set up by a colleague who wants to use it for checking stuff from outside our network. At the moment it has only one host with one service running on it, which checks whether a website is reachable through curl.

Re: gearman troubles

Post by SteveBeauchemin »

Willem,

When I set up my new system, I went to the Mod-Gearman source and found that they now have a repo for Red Hat 7:

Code: Select all

rpm -Uvh "https://labs.consol.de/repo/stable/rhel7/i386/labs-consol-stable.rhel7.noarch.rpm"
After I installed the repo I was able to yum install the rpms and ended up with these.

Code: Select all

yum list installed | grep gearman
gearmand.x86_64               1:0.33-5                @labs_consol_stable
gearmand-server.x86_64        1:0.33-5                @labs_consol_stable
mod_gearman.x86_64            3.0.0-1.el7.centos      @labs_consol_stable
This is running for me now. The core server is running Red Hat 7. It also runs gearmand and a mod_gearman worker. My 6 remote gearman systems are still Red Hat 6 but will be getting migrated slowly over the next couple weeks. Basically I have 7 workers, one is just closer to home. I hope when I get them all upgraded that I don't have similar problems to what you are seeing. So far so good.

Steve B
lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: gearman troubles

Post by lmiltchev »

Willem,
Let us know if Steve's solution works out for you.
Be sure to check out our Knowledgebase for helpful articles and solutions!

Re: gearman troubles

Post by WillemDH »

Hello,

We updated the gearmand packages:

Code: Select all

yum list installed | grep gearman  [17-01-17 16:04:32]
gearmand.x86_64                       1:0.33-5                        @labs_consol_stable
gearmand-debuginfo.x86_64             1:0.33-2                        installed
gearmand-devel.x86_64                 1:0.33-5                        @labs_consol_stable
mod_gearman2.x86_64                   2.1.1-1.el7.centos              installed
But we were unable to update mod_gearman2 from 2.1.1-1.el7.centos to 3.0.0-1.el7.centos as in Steve's listing.

Code: Select all

yum update mod_gearman2.x86_64  [17-01-17 16:12:44]
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
 * base: ftp.hosteurope.de
 * epel: mirror.i3d.net
 * extras: ftp.hosteurope.de
 * updates: ftp.hosteurope.de
No packages marked for update
What else is needed to update mod_gearman?

Another question, I saw this in the Nagios Core 4.3 roadmap:
Native remote workers

Is this something that will replace mod_gearman in time for XI?

Willem

Re: gearman troubles

Post by rkennedy »

Willem,

Our currently supported version, which we have tested, is:

Code: Select all

gearman_folder="v2.1.1"
gearmand_version="0.33-2"
I have been testing gearman3 lately, though, to see how stable it is. I haven't seen any issues so far, but again, it is still in testing on my end.

What is the output of yum repolist? Here's what I have running on my end with gearman3 installed:

Code: Select all

[root@centos6x64 ~]# yum list installed | grep gearman
gearmand.x86_64                       1:0.33-5                         @labs_consol_stable
libgearman.x86_64                     1.1.8-2.el6                      @epel
mod_gearman.x86_64                    3.0.0-1.el6                      @labs_consol_stable
mod_gearman-debuginfo.x86_64          3.0.0-1.el6                      installed

Code: Select all

[root@centos6x64 ~]# yum repolist
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
 * base: centos.chi.host-engine.com
 * epel: mirror.steadfast.net
 * extras: repos.forethought.net
 * updates: mirror.nexcess.net
repo id                                               repo name                                                                         status
base                                                  CentOS-6 - Base                                                                    6,696
cr                                                    CentOS-6 - CR                                                                          0
epel                                                  Extra Packages for Enterprise Linux 6 - x86_64                                    12,215
extras                                                CentOS-6 - Extras                                                                     62
labs_consol_stable                                    labs_consol_stable                                                                    31
nagios-base                                           Nagios                                                                               155
nagiosxi-deps                                         Nagios XI Dependencies                                                                31
updates                                               CentOS-6 - Updates                                                                   772
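A quick way to see whether the labs_consol_stable repo is enabled on the worker is to filter the repolist output. The helper below is only a sketch; has_repo is a made-up name, not a yum feature. Note also that in the listings in this thread the 3.0.0 package appears as mod_gearman while 2.1.1 appears as mod_gearman2, so "yum update mod_gearman2" may simply find nothing newer under that name. That is an inference from the listings, not something confirmed here.

```shell
# Succeed if the given repo id appears in "yum repolist" output read from stdin.
# Repo ids may carry a "/arch" suffix or a leading marker, so both are stripped.
has_repo() {
    awk -v id="$1" '
        { name = $1; sub(/\/.*/, "", name); sub(/^[!*]/, "", name) }
        name == id { found = 1 }
        END { exit !found }
    '
}

# Usage on the worker:
# yum repolist | has_repo labs_consol_stable || echo "labs_consol_stable not enabled"
# yum list available | grep gearman
```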
"Another question, I saw this in the Nagios Core 4.3 roadmap:
Native remote workers

Is this something that will replace mod_gearman in time for XI?"
Yes, we are working on our own version of gearman in-house.