Nagios Support Forum

Posted: **Tue May 25, 2021 3:10 am**

We use Gearman as a distributed engine for our monitoring environment. We have ~25 hostgroups with ~50 Gearman workers.
We run:
Nagios XI 5.8.1

gearmand 1.1.18
mod_gearman_worker: version 3.3.0 running on libgearman 1.1.19.1

Sometimes only one service on a host is orphaned, while the other services on that host are checked and return data just fine.

Capture2.JPG

When I try to recheck this service by forcing an immediate check, it looks like the check isn't rescheduled.

The last check timestamp doesn't update, nor does the attempt information.

Capture.JPG

Where can I see the log of the immediate check? How can I fix this?

Posted: **Tue May 25, 2021 7:04 am**

Doing further investigation, I found that when a service is orphaned the check isn't rescheduled at all, although the parameters are set:

check_for_orphaned_hosts=1
check_for_orphaned_services=1
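
A minimal sketch for confirming which values of these two options are actually present in the main configuration file. It parses `key=value` text and ignores comment lines. The sample text is a stand-in; on a real system you would read the main Nagios configuration file instead (its path is an assumption here, commonly /usr/local/nagios/etc/nagios.cfg).

```python
ORPHAN_OPTIONS = ("check_for_orphaned_services", "check_for_orphaned_hosts")

def read_nagios_options(cfg_text, wanted):
    """Return {option: value} for the wanted option names found in cfg_text."""
    found = {}
    for raw in cfg_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        if key in wanted:
            # In this sketch a later duplicate simply overwrites an earlier one.
            found[key] = value.strip()
    return found

# Stand-in for the file contents (assumption: same key=value syntax).
sample = """\
# excerpt in nagios.cfg style
check_for_orphaned_hosts=1
check_for_orphaned_services=1
"""

options = read_nagios_options(sample, ORPHAN_OPTIONS)
for name in ORPHAN_OPTIONS:
    print(name, "=", options.get(name, "<not set>"))
```

If an option prints as `<not set>`, it is either commented out or missing from the text that was parsed.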

Posted: **Wed May 26, 2021 9:34 am**

Hi,

I would recommend restarting the Gearman server and worker, then trying once more to see if you get the same behavior. There is a specific order that must be followed when restarting; the steps can be found on page 8 of the guide below.

Integrating Mod-Gearman With Nagios XI

Also, if the error occurs again, force an immediate check and let me know if there are any discrepancies in the check results between the XI and Core interfaces. To view the Core interface, go to:

http://<IP address>/nagios

Lastly, what is the output of the gearman_top command? Please also send us a system profile. Thanks, Benjamin

To send us your system profile:
1. Log in to the Nagios XI GUI using a web browser.
2. Click the "Admin" > "System Profile" menu.
3. Click the "Download Profile" button.

Posted: **Thu May 27, 2021 12:09 am**

Hi,

There are no discrepancies in the check.
As you can see, all services for this host are OK and checked regularly, except for one service, which has now been orphaned for 2 days.

Capture3.JPG

I forced a recheck of this service, but without success.

Capture4.JPG

In the previous example I also set the worker log to level 3 and followed the log for approximately 30 minutes. The recheck of the service never appeared in the log, so it looks like it doesn't get rescheduled.

I will send a system profile by PM.
Here is the output of gearman_top:

 gearman_top -b
2021-05-27 06:52:42  -  localhost:4730  -  v1.1.18

 Queue Name                            | Worker Available | Jobs Waiting | Jobs Running
----------------------------------------------------------------------------------------
 check_results                         |               1  |           0  |           0
 hostgroup_hg_worker_avr-dvn           |              17  |           0  |           0
 hostgroup_hg_worker_avr-rzb           |              27  |           0  |           0
 hostgroup_hg_worker_bran              |               5  |           0  |           0
 hostgroup_hg_worker_dock-ka           |               5  |           0  |           0
 hostgroup_hg_worker_dock-pa           |               5  |           0  |           0
 hostgroup_hg_worker_dro               |              37  |           0  |           3
 hostgroup_hg_worker_dsom              |              66  |           0  |          13
 hostgroup_hg_worker_finq              |               5  |           0  |           0
 hostgroup_hg_worker_flmc              |              11  |           0  |           0
 hostgroup_hg_worker_game              |              10  |           0  |           0
 hostgroup_hg_worker_ggn               |              36  |           1  |           6
 hostgroup_hg_worker_jzhz              |               5  |           0  |           0
 hostgroup_hg_worker_kivo              |              10  |           0  |           0
 hostgroup_hg_worker_lek               |              26  |           0  |          16
 hostgroup_hg_worker_oenr              |              25  |           0  |           6
 hostgroup_hg_worker_sdb               |              32  |           0  |           1
 hostgroup_hg_worker_sdzb              |              93  |           0  |          23
 hostgroup_hg_worker_snms              |             232  |           0  |          54
 hostgroup_hg_worker_sviz              |              29  |           0  |           2
 hostgroup_hg_worker_wnf               |              11  |           0  |           1
 worker_JZHZDCBTCSS-001                |               1  |           0  |           0
 worker_KC-MON-P01.kivo.tm             |               1  |           0  |           0
 worker_OENRDCBTMON01                  |               1  |           0  |           0
 worker_ZORG-WAATNMW01                 |               1  |           0  |           0
 worker_ZORG-WAATNMW02                 |               1  |           0  |           0
 worker_bmrdhgtcssm02.brandmr.local    |               0  |           0  |           0
 worker_bu-amf-vma01.bu-amf.local      |               1  |           0  |           0
 worker_dsbwaa01pmgw02                 |               1  |           0  |           0
 worker_dsom-nagt02.desom.mgmt         |               5  |           0  |           0
 worker_flmc-gropnag01.flamco.local    |               1  |           0  |           0
 worker_gdr01dcbmgw02                  |               1  |           0  |           0
 worker_monxisltn-vms.dockaas.nl       |               1  |           0  |           0
 worker_sbhptsssm013.sltn-beheer.local |               1  |           0  |           0
 worker_sbhptsssm014.sltn-beheer.local |               1  |           0  |           0
 worker_sbhptsssm015.sltn-beheer.local |               1  |           0  |           0
 worker_sdb-waatcssm01.dbij.local      |               1  |           0  |           0
 worker_sr_monsltn_pa.dockaas.nl       |               1  |           0  |           0
 worker_svavrdmont01.durable.local     |               1  |           0  |           0
 worker_svavrrmont01.durable.local     |               1  |           0  |           0
 worker_svizdcbpnagi02.vivium.local    |               1  |           0  |           0
 worker_svr-lnxngs-002                 |               1  |           0  |           0
 worker_svr-mgw200                     |               1  |           0  |           0
 worker_wnf-s-mgw01.wnf.local          |               3  |           0  |           0
----------------------------------------------------------------------------------------
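
With this many queues, a small parser can make the interesting rows easier to spot. The sketch below reads text in the `gearman_top -b` layout shown above and lists queues that have no worker available or that have jobs waiting. The sample rows are copied from the output above; treating those two conditions as "worth a look" is an assumption of this sketch, not a rule from Mod-Gearman.

```python
from typing import NamedTuple

class QueueStat(NamedTuple):
    name: str
    workers: int
    waiting: int
    running: int

def parse_gearman_top(text):
    """Parse the table rows of `gearman_top -b` output into QueueStat tuples."""
    stats = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 4:
            continue  # timestamp line, separator lines, blank lines
        name, *counts = parts
        if not all(c.isdigit() for c in counts):
            continue  # the column-title row
        stats.append(QueueStat(name, *map(int, counts)))
    return stats

def worth_a_look(stats):
    """Queues with no worker available, or with jobs waiting."""
    return [s for s in stats if s.workers == 0 or s.waiting > 0]

sample = """\
 Queue Name                            | Worker Available | Jobs Waiting | Jobs Running
----------------------------------------------------------------------------------------
 check_results                         |               1  |           0  |           0
 hostgroup_hg_worker_dsom              |              66  |           0  |          13
 hostgroup_hg_worker_ggn               |              36  |           1  |           6
 worker_bmrdhgtcssm02.brandmr.local    |               0  |           0  |           0
----------------------------------------------------------------------------------------
"""

stats = parse_gearman_top(sample)
for s in worth_a_look(stats):
    print(f"{s.name}: workers={s.workers} waiting={s.waiting} running={s.running}")
```

A single waiting job in one snapshot is not necessarily a problem; the filter is only meant to shorten the list that has to be read by eye.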

Posted: **Thu May 27, 2021 12:40 am**

Hi benjaminsmith,

It looks like I can't download the system profile.
I get an empty page in Nagios XI. I see this empty page on all our Nagios XI 5.8.1 installations, and also on my dev system running 5.8.2.
I checked with two browsers: Firefox 78.9.0 and Chrome 89.0.4389.90.

In my browser I get a status 500:

Capture5.JPG

and in the ssl_access log at my Nagios server:

ssl_access_log:10.128.20.105 - - [27/May/2021:08:08:23 +0200] "GET /nagiosxi/includes/components/profile/profile.php HTTP/1.1" 500 2

kind regards,

Joris Weijters

Posted: **Thu May 27, 2021 6:25 am**

Hi benjaminsmith,

I've been debugging the code of profile.php.
It requires /../../configwizards.inc.php.
I debugged that file as well; it pulls in the include files of the configuration wizards.

During the include of nagiostats.inc.php something fails, and as a result configwizards.inc.php fails.
I hadn't noticed before, but the configuration wizard also doesn't work when the "nagiostat wizard" directory is present in /usr/local/nagiosxi/html/includes/configwizards/

I get this error in the error_log:

[Thu May 27 12:55:22.959586 2021] [php7:error] [pid 59880] [client 192.168.10.1:62045] PHP Fatal error: Cannot redeclare val() (previously declared in /usr/local/nagiosxi/html/includes/components/ccm/includes/common_functions.inc.php:230) in /usr/local/nagiosxi/html/includes/configwizards/nagiostats/nagiostats.inc.php on line 38, referer: http://192.168.10.128/nagiosxi/admin/
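
To find other name collisions like this one, the PHP sources can be scanned for functions declared in more than one file. The sketch below is a rough heuristic based on a regular expression, not a PHP parser: it will also pick up class methods written without a visibility keyword, and it cannot tell whether two files are ever loaded together. The file contents in `sources` are hypothetical and only illustrate the shape of the input.

```python
import re
from collections import defaultdict

# Matches lines that start a declaration such as "function val(".
FUNC_RE = re.compile(r"^\s*function\s+&?\s*([A-Za-z_]\w*)\s*\(", re.MULTILINE)

def duplicate_functions(sources):
    """sources: {filename: php_source}.

    Return {function_name: [filenames]} for names declared in more than
    one file. PHP function names are case-insensitive, so names are
    compared in lower case.
    """
    declared = defaultdict(list)
    for filename, text in sources.items():
        for name in {m.lower() for m in FUNC_RE.findall(text)}:
            declared[name].append(filename)
    return {n: sorted(files) for n, files in declared.items() if len(files) > 1}

# Hypothetical file contents, for illustration only.
sources = {
    "common_functions.inc.php": "<?php\nfunction val($x) { return $x; }\n",
    "nagiostats.inc.php": "<?php\nfunction val($x) { return $x; }\nfunction other() {}\n",
}

print(duplicate_functions(sources))
```

To run it against a real tree, `sources` could be built by reading every file returned by `pathlib.Path(root).rglob("*.php")`, where `root` is the directory to scan.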

I'm running
Nagios XI 5.8.2
and
PHP 7.2.34 (cli) (built: Feb 3 2021 09:23:21) ( NTS )
Copyright (c) 1997-2018 The PHP Group
Zend Engine v3.2.0, Copyright (c) 1998-2018 Zend Technologies
with Zend OPcache v7.2.34, Copyright (c) 1999-2018, by Zend Technologies

During my investigation I did a clean install of Nagios XI 5.8.2 on a CentOS 7.9 system running PHP 5.4.16-48, and on that system everything seems to work.

Kind regards

Joris Weijters

Posted: **Thu May 27, 2021 5:21 pm**

Hi Joris,

We would like to know the hostname and the hostgroup the host is in, so we know which worker the check is supposed to run on. Then, enable debugging on the Nagios server.

Edit /etc/mod_gearman/module.conf and change this line from

debug=0
to
debug=1

Save the change and restart Nagios. If you see the orphan message again, get this file from the Nagios server and post it:

/var/log/mod_gearman/mod_gearman_neb.log

Also, please retrieve this log file from the Nagios server and post it:

/var/log/gearmand/gearmand.log

The profile download failure is odd. Both PHP files declare the val() function, but they should not be loaded at the same time when the profile is downloaded.

Try running it from the command line as well.

rm -rf /usr/local/nagiosxi/var/components/profile.zip
/usr/local/nagiosxi/scripts/components/getprofile.sh SUPPORT

Then send me the resulting /usr/local/nagiosxi/var/components/profile.zip file. Thanks, Ben

Posted: **Fri May 28, 2021 12:39 am**

Hi Ben,

Can we split off the problem with the "profile page"?

Posted: **Fri May 28, 2021 1:45 am**

Hi Ben,

[1622181710] SERVICE NOTIFICATION: alert_topdesk_7x24_desom_email;dsom-med-mer1-fw01;Fortigate: high availability active-passive;CRITICAL;xi_service_notification_handler_email;(service check orphaned, is the mod-gearman worker on queue 'hostgroup_hg_worker_dsom' running?)
[1622181710] SERVICE NOTIFICATION: alert_topdesk_7x24;dsom-med-mer1-fw01;Fortigate: high availability active-passive;CRITICAL;xi_service_notification_handler_topdesk;(service check orphaned, is the mod-gearman worker on queue 'hostgroup_hg_worker_dsom' running?)
[1622181710] SERVICE ALERT: dsom-med-mer1-fw01;Fortigate: high availability active-passive;CRITICAL;HARD;10;(service check orphaned, is the mod-gearman worker on queue 'hostgroup_hg_worker_dsom' running?)
[1622182901] Warning: The check of service 'Fortigate: high availability active-passive' on host 'dsom-med-mer1-fw01' looks like it was orphaned (results never came back; last_check=1622181701; next_check=1622182010). I'm scheduling an immediate check of the service...
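
The bracketed numbers in these log lines are Unix epoch seconds, so the timing of the orphan detection can be worked out directly. The sketch below parses the orphan warning above and reports how long after `last_check` and `next_check` it was logged. The regular expression is written against this one line and is an assumption about the message format in general.

```python
import re
from datetime import datetime, timezone

ORPHAN_RE = re.compile(
    r"^\[(?P<logged>\d+)\] Warning: The check of service '(?P<service>.+?)' "
    r"on host '(?P<host>.+?)' looks like it was orphaned "
    r"\(results never came back; last_check=(?P<last>\d+); next_check=(?P<next>\d+)\)"
)

def parse_orphan_warning(line):
    """Return timing details for an orphan warning, or None if the line differs."""
    m = ORPHAN_RE.match(line)
    if m is None:
        return None
    logged, last, nxt = int(m["logged"]), int(m["last"]), int(m["next"])
    return {
        "host": m["host"],
        "service": m["service"],
        # Epoch seconds converted to an ISO timestamp in UTC.
        "logged_utc": datetime.fromtimestamp(logged, tz=timezone.utc).isoformat(),
        "seconds_since_last_check": logged - last,
        "seconds_past_next_check": logged - nxt,
    }

line = (
    "[1622182901] Warning: The check of service 'Fortigate: high availability "
    "active-passive' on host 'dsom-med-mer1-fw01' looks like it was orphaned "
    "(results never came back; last_check=1622181701; next_check=1622182010). "
    "I'm scheduling an immediate check of the service..."
)

info = parse_orphan_warning(line)
for key, value in info.items():
    print(f"{key}: {value}")
```

For this line the warning was logged 1200 seconds (20 minutes) after `last_check` and 891 seconds after the scheduled `next_check`.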

I reissued the immediate check at ~08:32.
I will upload the log file via PM.

Capture7.JPG

 gearman_top -b
2021-05-28 08:38:34  -  localhost:4730  -  v1.1.18

 Queue Name                            | Worker Available | Jobs Waiting | Jobs Running
----------------------------------------------------------------------------------------
 check_results                         |               1  |           0  |           0
 hostgroup_hg_worker_avr-dvn           |               8  |           2  |           8
 hostgroup_hg_worker_avr-rzb           |              21  |           0  |          10
 hostgroup_hg_worker_bran              |               5  |           0  |           0
 hostgroup_hg_worker_dock-ka           |               5  |           0  |           0
 hostgroup_hg_worker_dock-pa           |               5  |           0  |           0
 hostgroup_hg_worker_dro               |              32  |           0  |           3
 hostgroup_hg_worker_dsom              |              60  |           0  |           1
 hostgroup_hg_worker_finq              |               5  |           0  |           0
 hostgroup_hg_worker_flmc              |               9  |           0  |           1
 hostgroup_hg_worker_game              |              10  |           0  |           1
 hostgroup_hg_worker_ggn               |              43  |           0  |           4
 hostgroup_hg_worker_jzhz              |               5  |           0  |           0
 hostgroup_hg_worker_kivo              |              10  |           0  |           0
 hostgroup_hg_worker_lek               |              45  |           0  |           3
 hostgroup_hg_worker_oenr              |              25  |           0  |           7
 hostgroup_hg_worker_sdb               |              30  |           0  |           0
 hostgroup_hg_worker_sdzb              |              68  |           0  |          35
 hostgroup_hg_worker_snms              |             226  |           0  |         128
 hostgroup_hg_worker_sviz              |              48  |           0  |           1
 hostgroup_hg_worker_wnf               |              10  |           0  |           1
 worker_JZHZDCBTCSS-001                |               1  |           0  |           0
 worker_KC-MON-P01.kivo.tm             |               0  |           0  |           0
 worker_OENRDCBTMON01                  |               1  |           0  |           0
 worker_ZORG-WAATNMW01                 |               1  |           0  |           0
 worker_ZORG-WAATNMW02                 |               0  |           0  |           0
 worker_bmrdhgtcssm02.brandmr.local    |               1  |           0  |           0
 worker_bu-amf-vma01.bu-amf.local      |               1  |           0  |           0
 worker_dsbwaa01pmgw02                 |               1  |           0  |           0
 worker_dsom-nagt02.desom.mgmt         |               4  |           0  |           0
 worker_flmc-gropnag01.flamco.local    |               1  |           0  |           0
 worker_gdr01dcbmgw02                  |               1  |           0  |           0
 worker_monxisltn-vms.dockaas.nl       |               1  |           0  |           0
 worker_sbhptsssm013.sltn-beheer.local |               1  |           0  |           0
 worker_sbhptsssm014.sltn-beheer.local |               1  |           0  |           0
 worker_sbhptsssm015.sltn-beheer.local |               1  |           0  |           0
 worker_sdb-waatcssm01.dbij.local      |               1  |           0  |           0
 worker_sr_monsltn_pa.dockaas.nl       |               1  |           0  |           0
 worker_svavrdmont01.durable.local     |               1  |           0  |           0
 worker_svavrrmont01.durable.local     |               1  |           0  |           0
 worker_svizdcbpnagi02.vivium.local    |               1  |           0  |           0
 worker_svr-lnxngs-002                 |               1  |           0  |           0
 worker_svr-mgw200                     |               1  |           0  |           0
 worker_wnf-s-mgw01.wnf.local          |               1  |           0  |           0
----------------------------------------------------------------------------------------

Posted: **Fri May 28, 2021 4:48 pm**

Hi @jweijters,

Thanks for sending over the profile and the Gearman log. Based on the logs, this is a Gearman worker and/or network issue (not directly related to Nagios XI).

ERROR 2021-05-28 09:08:58.000000 [ main ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
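
To see whether these timeouts cluster around particular times, the gearmand log can be summarised per hour. The sketch below counts ERROR lines containing "Connection timed out". It assumes every relevant line follows the layout of the single line quoted above (level, timestamp with fractional seconds, bracketed thread name, message); lines in any other layout are ignored.

```python
import re
from collections import Counter

LINE_RE = re.compile(
    r"^\s*(?P<level>[A-Z]+)\s+"
    r"(?P<day>\d{4}-\d{2}-\d{2}) (?P<hour>\d{2}):\d{2}:\d{2}\.\d+\s+"
    r"\[[^\]]*\]\s+(?P<msg>.*)$"
)

def timeouts_per_hour(lines):
    """Count 'Connection timed out' ERROR lines, grouped by day and hour."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m and m["level"] == "ERROR" and "Connection timed out" in m["msg"]:
            counts[f"{m['day']} {m['hour']}:00"] += 1
    return counts

# The one line quoted in this thread; a real run would iterate over the log file.
sample = [
    "ERROR 2021-05-28 09:08:58.000000 [ main ] closing connection due to previous "
    "errno error(Connection timed out) -> libgearman-server/io.cc:218",
]

for hour, count in sorted(timeouts_per_hour(sample).items()):
    print(hour, count)
```

Comparing the busy hours with the times at which services were reported as orphaned may help narrow down whether a specific worker or network path is involved.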

To help troubleshoot this, log in to the Gearman worker for the hg_worker_dsom hostgroup and enable debugging. Edit /etc/mod_gearman/worker.conf and change this line from

debug=0
to
debug=1

Save the change and restart the worker.
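
If the same change has to be made on many workers, the edit can be scripted. The sketch below rewrites the first uncommented `key=value` line in a block of configuration text and appends the line if the key is missing. It works on a string so it can be tried safely; the sample content is hypothetical, and writing the result back to the file and restarting the service are left out.

```python
import re

def set_option(conf_text, key, value):
    """Return conf_text with the first uncommented `key=...` line set to value.

    Commented-out lines (starting with '#') are left alone. If no active
    line for the key exists, `key=value` is appended at the end.
    """
    pattern = re.compile(rf"^([ \t]*{re.escape(key)}[ \t]*=).*$", re.MULTILINE)
    new_text, count = pattern.subn(
        lambda m: f"{m.group(1)}{value}", conf_text, count=1
    )
    if count == 0:
        if new_text and not new_text.endswith("\n"):
            new_text += "\n"
        new_text += f"{key}={value}\n"
    return new_text

# Hypothetical excerpt, for illustration only.
sample = "# hypothetical excerpt of worker.conf\ndebug=0\n"

print(set_option(sample, "debug", 1), end="")
```

The same function could be used to set `debug` back to 0 once the logs have been collected.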

If you see the orphaned message again, retrieve this log from the worker and post it to the thread:

/var/log/mod_gearman/mod_gearman_worker.log

Thanks, Benjamin

orphaned check and force an immediate check doesn't work
