Re: orphaned check and force an imediate check doesn't work
Posted: Tue Jun 01, 2021 1:09 am
by jweijters
Hi Ben,
We have a new orphaned check
Capture8.JPG
Because the logs are large, I grepped for this host only.
I forced an immediate check of this service at 07:57, and at 07:59 I forced an immediate host check of this host.
On the Nagios server we see:
Code: Select all
# tail -f mod_gearman_neb.log |grep -i dsom-nl-am4-esx59
[2021-06-01 07:56:56][183647][DEBUG] received job for queue hostgroup_hg_worker_dsom: dsom-nl-am4-esx59-idrac - CMOS Battery, check_options: 0
[2021-06-01 07:56:56][183647][DEBUG] service: 'dsom-nl-am4-esx59-idrac' - 'CMOS Battery', next_check is at 2021-06-01 07:56:56, latency so far: 0
[2021-06-01 07:56:56][183647][DEBUG] service job completed: dsom-nl-am4-esx59-idrac CMOS Battery: exit 0, latency: 0.652, exec_time: 0.168
[2021-06-01 07:57:05][183647][DEBUG] received job for queue hostgroup_hg_worker_dsom: dsom-nl-am4-esx59 - VMware: Host Switch Status, check_options: 1
[2021-06-01 07:57:05][183647][DEBUG] service: 'dsom-nl-am4-esx59' - 'VMware: Host Switch Status', next_check is at 2021-06-01 07:56:50, latency so far: 15
[2021-06-01 07:59:11][183647][DEBUG] received job for queue hostgroup_hg_worker_dsom: dsom-nl-am4-esx59, check_options: 1
[2021-06-01 07:59:11][183647][DEBUG] host: 'dsom-nl-am4-esx59', next_check is at 2021-06-01 07:59:07, latency so far: 4
[2021-06-01 07:59:15][183647][DEBUG] host job completed: dsom-nl-am4-esx59: exit 0, latency: 4.786, exec_time: 4.004
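It looks like the "VMware: Host Switch Status" check is not reaching the worker: in these excerpts the NEB log shows the job being received at 07:57:05, but no completion line follows and the worker log has no entry for it.
This is the relevant part of the log on the worker: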
Code: Select all
# tail -f mod_gearman_worker.log |grep dsom-nl-am4-esx59
[2021-06-01 07:56:56][5566][DEBUG] got service job: dsom-nl-am4-esx59-idrac - CMOS Battery
[2021-06-01 07:59:11][7595][DEBUG] got host job: dsom-nl-am4-esx59
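The comparison between the two logs above can be done mechanically. The following is a minimal sketch, assuming the log line formats are exactly as shown in this post ("received job for queue ..." in mod_gearman_neb.log, "got service job:" / "got host job:" in mod_gearman_worker.log). The function name `unreached_jobs` is hypothetical, and the comparison ignores timestamps, so a job the worker picked up on an earlier run would not be reported.

```shell
#!/bin/sh
# Hypothetical helper: list jobs that the NEB module logged as received
# but that never appear as "got ... job" in the worker log.
# Usage: unreached_jobs NEB_LOG WORKER_LOG
unreached_jobs() {
    neb_log=$1
    worker_log=$2
    tmpdir=$(mktemp -d) || return 1

    # Job name = text between "<queue>: " and ", check_options: N".
    sed -n 's/.*received job for queue [^:]*: \(.*\), check_options: [0-9]*$/\1/p' \
        "$neb_log" | LC_ALL=C sort -u > "$tmpdir/sent"

    # Job name = everything after "got service job: " or "got host job: ".
    sed -n -e 's/.*got service job: //p' -e 's/.*got host job: //p' \
        "$worker_log" | LC_ALL=C sort -u > "$tmpdir/got"

    # Lines only in the first file: sent by the server, never seen by the worker.
    LC_ALL=C comm -23 "$tmpdir/sent" "$tmpdir/got"
    rm -rf "$tmpdir"
}
```

Both logs should be cut to the same time window (for example with grep on the timestamp prefix) before being passed in, since the function matches on job names only.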
Re: orphaned check and force an imediate check doesn't work
Posted: Tue Jun 01, 2021 2:22 am
by jweijters
Hi Ben,
I have done some more investigation. For the dsom, I have 8 orphaned checks. See the picture below.
Capture9.JPG
These checks stay in the gearman queue:
Code: Select all
# gearadmin --show-unique-jobs |grep -i dsom
dsom-nl-am4-esx61-VMware: Host OS Name Version
dsom-nl-am4-esx51-VMware: Host Switch Status
dsom-nl-am4-esx23-VMware: Host CPU Info
dsom-nl-am4-esx53-VMware: Host OS Name Version
dsom-nl-am4-esx21-VMware: Host OS Name Version
dsom-nl-am4-esx63-VMware: Host CPU Info
dsom-nl-am4-esx19-VMware: Host Switch Status
dsom-nl-am4-esx55-VMware: Host CPU Info
kind regards,
Joris Weijters
Re: orphaned check and force an imediate check doesn't work
Posted: Tue Jun 01, 2021 4:40 pm
by ssax
Was that gearadmin command output from the XI server or the worker?
Please attach this file from your dsom worker:
Attach these from your XI server:
Code: Select all
/etc/mod_gearman/module.conf
/etc/mod_gearman/worker.conf
Re: orphaned check and force an imediate check doesn't work
Posted: Tue Jun 01, 2021 11:49 pm
by jweijters
Hi,
This gearadmin command was run on the Nagios XI host.
On the worker there is no gearmand-server process running; only the mod_gearman_worker processes run there. Can you query this?
I have included a zip file with my Nagios XI and worker configs.
Kind regards,
Joris Weijters
Re: orphaned check and force an imediate check doesn't work
Posted: Wed Jun 02, 2021 4:03 pm
by ssax
I don't see anything unusual in them.
Please enable debug logging in your dsom worker.conf:
Then restart the gearman worker service and wait for the orphan to show up, then zip up and PM me this file:
Code: Select all
/var/log/mod_gearman/mod_gearman_worker.log
Re: orphaned check and force an imediate check doesn't work
Posted: Thu Jun 03, 2021 4:26 am
by jweijters
Hi,
I have some orphaned host and service checks. I put the worker in debug=2 logging mode.
Capture10.JPG
Capture11.JPG
On the Nagios XI server:
Code: Select all
# gearadmin --show-unique-jobs |grep -i dsom
dsom-nl-am4-esx61
dsom-nl-am7-esx26-idrac-FANs
dsom-nl-am7-esx16-VMware: Host pNIC Usage
dsom-nl-am7-esx62-idrac-FANs
dsom-nl-am7-esx62-idrac-Memory
dsom-nl-am7-esx60
dsom-nl-am7-esx12
I uploaded the mod_gearman_worker.log
Kind regards,
Joris Weijters
Re: orphaned check and force an imediate check doesn't work
Posted: Thu Jun 03, 2021 3:47 pm
by ssax
I do not even see those checks listed in there at those times.
I'm wondering if you need to update gearmand to 1.1.19-1 as well. You likely need all the workers and the job server to match version-wise:
Code: Select all
gearmand 1.1.18
mod_gearman_worker: version 3.3.0 running on libgearman 1.1.19.1
Here's what I have:
Code: Select all
[root@xig ~]# rpm -qa |grep gearman
gearmand-server-1.1.19-1.el7.x86_64
mod_gearman-3.3.0-1.el7.x86_64
gearmand-1.1.19-1.el7.x86_64
I would back up your gearman configs from /etc/mod_gearman and then run through the server upgrade section of this:
https://assets.nagios.com/downloads/nag ... ios_XI.pdf
Then replace the /etc/mod_gearman files with your old ones and restart the gearman services.
Re: orphaned check and force an imediate check doesn't work
Posted: Mon Jun 07, 2021 12:47 am
by jweijters
Hi ssax,
I plan to update gearmand-server-1.1.19-1.el7.x86_64, mod_gearman-3.3.0-1.el7.x86_64, and gearmand-1.1.19-1.el7.x86_64 to the latest version this week.
As you can see, it is a lot of work to get this updated on all workers and the Nagios server.
kind regards,
Joris Weijters
Re: orphaned check and force an imediate check doesn't work
Posted: Mon Jun 07, 2021 1:16 pm
by ssax
Unfortunately, we don't control the gearman codebase; the gearman developers do, and all the components need to match version-wise to work properly.
Upgrade gearman on the XI server to match following the guide and then see if that resolves the issue.
If it doesn't, what OS/version is the worker running?
Code: Select all
uname -a
cat /etc/*release
rpm -qa | grep -i gear
We're seeing this:
Code: Select all
ERROR 2021-05-28 09:08:58.000000 [ main ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
This apparently indicates a lost connection from the gearman worker.
The new gearmand has the ability to log debugging data.
Here is the option and its argument list. The arguments are case-sensitive, so they must be upper case.
Code: Select all
--verbose arg (=ERROR) Set verbose level (FATAL, ALERT,
CRITICAL, ERROR, WARNING, NOTICE, INFO,
DEBUG).
To set the verbose output to debug, edit the following file
Change this line from:
Code: Select all
OPTIONS="--log-file=/var/log/gearmand/gearmand.log"
to
Code: Select all
OPTIONS="--log-file=/var/log/gearmand/gearmand.log --verbose=DEBUG"
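The change from the first OPTIONS line to the second can also be scripted. This is a minimal sketch, assuming the options file contains a single double-quoted `OPTIONS="..."` line as shown above; the function name `add_verbose_debug` is hypothetical, and the path of the file is passed in as an argument. The function only prints the modified content, so the result can be reviewed before the file is replaced.

```shell
#!/bin/sh
# Hypothetical helper: print the given options file with --verbose=DEBUG
# appended inside the quoted OPTIONS value. Lines that already carry a
# --verbose option are left unchanged.
# Usage: add_verbose_debug OPTIONS_FILE > OPTIONS_FILE.new
add_verbose_debug() {
    sed '/^OPTIONS=.*--verbose/!s/^\(OPTIONS="[^"]*\)"/\1 --verbose=DEBUG"/' "$1"
}
```

Writing to a separate file first and comparing it with the original (for example with diff) keeps the edit reversible.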
Save the change and restart gearmand to load the change:
Code: Select all
systemctl stop nagios
systemctl stop gearmand
systemctl start gearmand
systemctl start nagios
Then force the check to orphan and then send us the gearmand logs from the XI server.
Monitor the /var/log/gearmand/gearmand.log file for any errors.
Also, keep an eye on the size of the log file; it will become quite large over time.
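Because the log grows quickly in DEBUG mode, a small size check can help. This is a minimal sketch; the function name `log_exceeds_mb` and the threshold in the usage line are hypothetical and should be adapted to the available disk space.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) if FILE is larger than LIMIT_MB
# mebibytes, fail (exit 1) if not, exit 2 if the file does not exist.
# Usage: log_exceeds_mb FILE LIMIT_MB
log_exceeds_mb() {
    [ -f "$1" ] || return 2
    # Arithmetic expansion strips any padding some wc versions print.
    size_bytes=$(( $(wc -c < "$1") ))
    [ "$size_bytes" -gt $(( $2 * 1024 * 1024 )) ]
}

# Example (not executed here):
# log_exceeds_mb /var/log/gearmand/gearmand.log 1024 && echo "gearmand.log is over 1 GiB"
```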
Re: orphaned check and force an imediate check doesn't work
Posted: Thu Jun 10, 2021 12:58 am
by jweijters
Hi, I upgraded gearman to the latest level.
I still get orphaned checks. When I reconfigured gearmand to run in DEBUG log mode, it produced 3 GB of log in 5 minutes.
After restarting gearmand, the orphaned checks are removed from the queue and rechecked, so they are no longer orphaned.
I don't know how long I have to collect data before checks get orphaned again, so running gearmand in debug mode is not really a solution.
However, I have asked our storage team for an extra disk.
We have a special small Nagios instance monitoring our big Nagios environments.
Is it possible to set up a monitor for orphaned checks, so that if they exist for, let's say, more than 15 minutes, I can restart gearmand?
It is possible to restart gearmand successfully. Nagios no longer has to be stopped and started; we have already tested this. (Sometimes gearman just hangs after a network issue at one of our customers, and the gearman workers don't run.)
That is this issue:
https://github.com/gearman/gearmand/issues/301
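One possible way to approximate such a monitor is to take two snapshots of `gearadmin --show-unique-jobs` some minutes apart and report the job keys present in both. This is a minimal sketch, not a tested Nagios plugin: the function name `persistent_jobs` and the snapshot paths are hypothetical, and a lone `.` line is dropped as a precaution in case the output contains a terminator line. A check that happens to be queued again at the second snapshot would also be reported, so the result is only a hint that a job may be stuck.

```shell
#!/bin/sh
# Hypothetical helper: print job keys that appear in both snapshot files
# (one unique job key per line, as printed by gearadmin --show-unique-jobs).
# Usage: persistent_jobs OLD_SNAPSHOT NEW_SNAPSHOT
persistent_jobs() {
    tmpdir=$(mktemp -d) || return 1
    sed '/^\.$/d' "$1" | LC_ALL=C sort -u > "$tmpdir/old"
    sed '/^\.$/d' "$2" | LC_ALL=C sort -u > "$tmpdir/new"
    # Lines common to both snapshots: jobs that were still queued later.
    LC_ALL=C comm -12 "$tmpdir/old" "$tmpdir/new"
    rm -rf "$tmpdir"
}

# Example (not executed here), with snapshots taken 15 minutes apart:
# gearadmin --show-unique-jobs > /tmp/jobs.old
# sleep 900
# gearadmin --show-unique-jobs > /tmp/jobs.new
# [ -n "$(persistent_jobs /tmp/jobs.old /tmp/jobs.new)" ] && systemctl restart gearmand
```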
kind regards,
Joris Weijters