Page 1 of 3

orphaned check and force an immediate check doesn't work

Posted: Tue May 25, 2021 3:10 am
by jweijters
We use gearman as a distributed engine for our monitoring environment. We have ~25 hostgroups with ~50 gearman workers.
We run:
Nagios XI 5.8.1
gearmand 1.1.18
mod_gearman_worker version 3.3.0, running on libgearman 1.1.19.1


Sometimes we see that only one service on a host is orphaned, while the other services are checking and returning data just fine.
Capture2.JPG
When I try to recheck this service by forcing an immediate check, it looks like it isn't rescheduled.

The last check timestamp doesn't update, nor does the attempt information.
Capture.JPG
Where can I see the log of the immediate check? How can I fix this?

Re: orphaned message and force an immediate check

Posted: Tue May 25, 2021 7:04 am
by jweijters
Doing further investigation, I found that when a service is orphaned, the check isn't rescheduled at all, even though these parameters are set:

check_for_orphaned_hosts=1
check_for_orphaned_services=1
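For reference, the quickest way to confirm those flags are active in the main config is a grep against nagios.cfg (the stock path is usually /usr/local/nagios/etc/nagios.cfg, but it may differ per install; the sketch below runs the same grep against an inline sample so the command itself can be previewed anywhere):

```shell
# Sketch: confirm both orphan-check flags are enabled.
# On a real server, replace the here-doc with the path to nagios.cfg,
# e.g.: grep -E '^check_for_orphaned_(hosts|services)=' /usr/local/nagios/etc/nagios.cfg
grep -E '^check_for_orphaned_(hosts|services)=' <<'EOF'
check_for_orphaned_hosts=1
check_for_orphaned_services=1
EOF
```

Both lines should come back with a value of 1; a missing line means the directive is falling back to its default.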

Re: orphaned check and force an immediate check doesn't work

Posted: Wed May 26, 2021 9:34 am
by benjaminsmith
Hi,

I would recommend restarting the Gearman server and workers, then trying again to see if you get the same behavior. There is a specific order that must be followed when restarting; the steps can be found on page 8 of the guide below.

Integrating Mod-Gearman With Nagios XI

Also, if the error occurs again, force an immediate check and let me know if there are any discrepancies in the check results between the XI and Core interfaces. To view the Core interface, go to:

Code: Select all

http://<IP address>/nagios
Lastly, what is the output of the gearman_top command? Please also send us a system profile. Thanks, Benjamin

To send us your system profile:
Log in to the Nagios XI GUI using a web browser.
Click the "Admin" > "System Profile" menu.
Click the "Download Profile" button.

Re: orphaned check and force an immediate check doesn't work

Posted: Thu May 27, 2021 12:09 am
by jweijters
Hi,

There are no discrepancies in the checks.
You can see that for this host all services are OK and regularly checked, except for one service, which has now been orphaned for 2 days.
Capture3.JPG
I rechecked this service, but without success.
Capture4.JPG
In the previous example I also set the worker log level to 3 and followed the log for approximately 30 minutes. The recheck of the service never appeared in the log, so it looks like it doesn't get rescheduled.

I sent a system profile by PM.
Here is the output of gearman_top:

Code: Select all

 gearman_top -b
2021-05-27 06:52:42  -  localhost:4730  -  v1.1.18

 Queue Name                            | Worker Available | Jobs Waiting | Jobs Running
----------------------------------------------------------------------------------------
 check_results                         |               1  |           0  |           0
 hostgroup_hg_worker_avr-dvn           |              17  |           0  |           0
 hostgroup_hg_worker_avr-rzb           |              27  |           0  |           0
 hostgroup_hg_worker_bran              |               5  |           0  |           0
 hostgroup_hg_worker_dock-ka           |               5  |           0  |           0
 hostgroup_hg_worker_dock-pa           |               5  |           0  |           0
 hostgroup_hg_worker_dro               |              37  |           0  |           3
 hostgroup_hg_worker_dsom              |              66  |           0  |          13
 hostgroup_hg_worker_finq              |               5  |           0  |           0
 hostgroup_hg_worker_flmc              |              11  |           0  |           0
 hostgroup_hg_worker_game              |              10  |           0  |           0
 hostgroup_hg_worker_ggn               |              36  |           1  |           6
 hostgroup_hg_worker_jzhz              |               5  |           0  |           0
 hostgroup_hg_worker_kivo              |              10  |           0  |           0
 hostgroup_hg_worker_lek               |              26  |           0  |          16
 hostgroup_hg_worker_oenr              |              25  |           0  |           6
 hostgroup_hg_worker_sdb               |              32  |           0  |           1
 hostgroup_hg_worker_sdzb              |              93  |           0  |          23
 hostgroup_hg_worker_snms              |             232  |           0  |          54
 hostgroup_hg_worker_sviz              |              29  |           0  |           2
 hostgroup_hg_worker_wnf               |              11  |           0  |           1
 worker_JZHZDCBTCSS-001                |               1  |           0  |           0
 worker_KC-MON-P01.kivo.tm             |               1  |           0  |           0
 worker_OENRDCBTMON01                  |               1  |           0  |           0
 worker_ZORG-WAATNMW01                 |               1  |           0  |           0
 worker_ZORG-WAATNMW02                 |               1  |           0  |           0
 worker_bmrdhgtcssm02.brandmr.local    |               0  |           0  |           0
 worker_bu-amf-vma01.bu-amf.local      |               1  |           0  |           0
 worker_dsbwaa01pmgw02                 |               1  |           0  |           0
 worker_dsom-nagt02.desom.mgmt         |               5  |           0  |           0
 worker_flmc-gropnag01.flamco.local    |               1  |           0  |           0
 worker_gdr01dcbmgw02                  |               1  |           0  |           0
 worker_monxisltn-vms.dockaas.nl       |               1  |           0  |           0
 worker_sbhptsssm013.sltn-beheer.local |               1  |           0  |           0
 worker_sbhptsssm014.sltn-beheer.local |               1  |           0  |           0
 worker_sbhptsssm015.sltn-beheer.local |               1  |           0  |           0
 worker_sdb-waatcssm01.dbij.local      |               1  |           0  |           0
 worker_sr_monsltn_pa.dockaas.nl       |               1  |           0  |           0
 worker_svavrdmont01.durable.local     |               1  |           0  |           0
 worker_svavrrmont01.durable.local     |               1  |           0  |           0
 worker_svizdcbpnagi02.vivium.local    |               1  |           0  |           0
 worker_svr-lnxngs-002                 |               1  |           0  |           0
 worker_svr-mgw200                     |               1  |           0  |           0
 worker_wnf-s-mgw01.wnf.local          |               3  |           0  |           0
----------------------------------------------------------------------------------------

Re: orphaned check and force an immediate check doesn't work

Posted: Thu May 27, 2021 12:40 am
by jweijters
Hi benjaminsmith,

It looks like I can't download the system profile; I get an empty page. I see this empty page in all our Nagios XI 5.8.1 installations, and also on my dev system running 5.8.2.
I checked with Firefox 78.9.0 and Chrome 89.0.4389.90.

In my browser I get an HTTP 500 status.
Capture5.JPG
and in the ssl_access_log on my Nagios server:

ssl_access_log:10.128.20.105 - - [27/May/2021:08:08:23 +0200] "GET /nagiosxi/includes/components/profile/profile.php HTTP/1.1" 500 2

kind regards,

Joris Weijters

Re: orphaned check and force an immediate check doesn't work

Posted: Thu May 27, 2021 6:25 am
by jweijters
Hi benjaminsmith,

I've been doing some debugging on the code of profile.php.
It requires /../../configwizards.inc.php, which in turn includes the include files of the configuration wizards.

During the include of nagiostats.inc.php something fails, and configwizards.inc.php fails with it.
I hadn't noticed before, but the configuration wizard also doesn't work when the "nagiostats" wizard directory is present in /usr/local/nagiosxi/html/includes/configwizards/.

I get this error in the error_log:

[Thu May 27 12:55:22.959586 2021] [php7:error] [pid 59880] [client 192.168.10.1:62045] PHP Fatal error: Cannot redeclare val() (previously declared in /usr/local/nagiosxi/html/includes/components/ccm/includes/common_functions.inc.php:230) in /usr/local/nagiosxi/html/includes/configwizards/nagiostats/nagiostats.inc.php on line 38, referer: http://192.168.10.128/nagiosxi/admin/
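A quick way to confirm the clash is to list every file that declares a val() function. The sketch below demonstrates the grep on a scratch tree so it runs anywhere; on a live system, point it at /usr/local/nagiosxi/html/includes/ (the path taken from the error above):

```shell
# Sketch: count the files declaring a PHP function val().
# Demonstrated on a temporary tree with two colliding declarations;
# on a real server run:
#   grep -rl 'function val(' /usr/local/nagiosxi/html/includes/
dir=$(mktemp -d)
printf '<?php function val($x) { return $x; }\n' > "$dir/common_functions.inc.php"
printf '<?php function val($x) { return $x; }\n' > "$dir/nagiostats.inc.php"
grep -rl 'function val(' "$dir" | wc -l
rm -rf "$dir"
```

More than one hit means a fatal "Cannot redeclare" error whenever both files end up included in the same request, which matches the error_log entry above.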


I'm running
Nagios XI 5.8.2
and
PHP 7.2.34 (cli) (built: Feb 3 2021 09:23:21) ( NTS )
Copyright (c) 1997-2018 The PHP Group
Zend Engine v3.2.0, Copyright (c) 1998-2018 Zend Technologies
with Zend OPcache v7.2.34, Copyright (c) 1999-2018, by Zend Technologies

During my investigation I did a clean install of Nagios XI 5.8.2 on CentOS 7.9 running PHP 5.4.16-48, and on that system everything seems to work.


Kind regards

Joris Weijters

Re: orphaned check and force an immediate check doesn't work

Posted: Thu May 27, 2021 5:21 pm
by benjaminsmith
Hi Joris,

We would like to know the hostgroup the host is in and the hostname, so we know which worker it is supposed to run on. Then, enable debugging on the Nagios server.

Edit /etc/mod_gearman/module.conf and change this line from

Code: Select all

debug=0
to
debug=1
Save the change and restart Nagios. If you see the orphan message again, get this file from the Nagios server and post it.

Code: Select all

/var/log/mod_gearman/mod_gearman_neb.log
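The debug toggle above can also be scripted. This is a minimal sketch using sed, demonstrated on a scratch copy so it can be previewed safely; on the server, run the sed against /etc/mod_gearman/module.conf and then restart Nagios:

```shell
# Sketch: flip debug=0 to debug=1 non-interactively.
# Demonstrated on a temporary file; on a real server:
#   sed -i 's/^debug=0$/debug=1/' /etc/mod_gearman/module.conf
cfg=$(mktemp)
printf 'debug=0\nlogfile=/var/log/mod_gearman/mod_gearman_neb.log\n' > "$cfg"
sed -i 's/^debug=0$/debug=1/' "$cfg"
grep '^debug=' "$cfg"
rm -f "$cfg"
```

Anchoring the pattern with ^ and $ keeps the substitution from touching other lines that merely contain the string.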
Also, please retrieve this log file from the Nagios server and post it.

Code: Select all

/var/log/gearmand/gearmand.log
The profile download issue is strange. Both PHP files do declare the val() function, but they should not be loaded at the same time when the profile is downloaded.

Try running it from the command line as well.

Code: Select all

rm -rf /usr/local/nagiosxi/var/components/profile.zip
/usr/local/nagiosxi/scripts/components/getprofile.sh SUPPORT
Then send me the resulting /usr/local/nagiosxi/var/components/profile.zip file. Thanks, Ben

Re: orphaned check and force an immediate check doesn't work

Posted: Fri May 28, 2021 12:39 am
by jweijters
Hi Ben,

Can we split off the "profile page" problem into a separate topic?

Re: orphaned check and force an immediate check doesn't work

Posted: Fri May 28, 2021 1:45 am
by jweijters
Hi Ben,


[1622181710] SERVICE NOTIFICATION: alert_topdesk_7x24_desom_email;dsom-med-mer1-fw01;Fortigate: high availability active-passive;CRITICAL;xi_service_notification_handler_email;(service check orphaned, is the mod-gearman worker on queue 'hostgroup_hg_worker_dsom' running?)
[1622181710] SERVICE NOTIFICATION: alert_topdesk_7x24;dsom-med-mer1-fw01;Fortigate: high availability active-passive;CRITICAL;xi_service_notification_handler_topdesk;(service check orphaned, is the mod-gearman worker on queue 'hostgroup_hg_worker_dsom' running?)
[1622181710] SERVICE ALERT: dsom-med-mer1-fw01;Fortigate: high availability active-passive;CRITICAL;HARD;10;(service check orphaned, is the mod-gearman worker on queue 'hostgroup_hg_worker_dsom' running?)
[1622182901] Warning: The check of service 'Fortigate: high availability active-passive' on host 'dsom-med-mer1-fw01' looks like it was orphaned (results never came back; last_check=1622181701; next_check=1622182010). I'm scheduling an immediate check of the service...

I reissued the immediate check at ~08:32.
I will upload the log file via PM.
Capture7.JPG

Code: Select all

 gearman_top -b
2021-05-28 08:38:34  -  localhost:4730  -  v1.1.18

 Queue Name                            | Worker Available | Jobs Waiting | Jobs Running
----------------------------------------------------------------------------------------
 check_results                         |               1  |           0  |           0
 hostgroup_hg_worker_avr-dvn           |               8  |           2  |           8
 hostgroup_hg_worker_avr-rzb           |              21  |           0  |          10
 hostgroup_hg_worker_bran              |               5  |           0  |           0
 hostgroup_hg_worker_dock-ka           |               5  |           0  |           0
 hostgroup_hg_worker_dock-pa           |               5  |           0  |           0
 hostgroup_hg_worker_dro               |              32  |           0  |           3
 hostgroup_hg_worker_dsom              |              60  |           0  |           1
 hostgroup_hg_worker_finq              |               5  |           0  |           0
 hostgroup_hg_worker_flmc              |               9  |           0  |           1
 hostgroup_hg_worker_game              |              10  |           0  |           1
 hostgroup_hg_worker_ggn               |              43  |           0  |           4
 hostgroup_hg_worker_jzhz              |               5  |           0  |           0
 hostgroup_hg_worker_kivo              |              10  |           0  |           0
 hostgroup_hg_worker_lek               |              45  |           0  |           3
 hostgroup_hg_worker_oenr              |              25  |           0  |           7
 hostgroup_hg_worker_sdb               |              30  |           0  |           0
 hostgroup_hg_worker_sdzb              |              68  |           0  |          35
 hostgroup_hg_worker_snms              |             226  |           0  |         128
 hostgroup_hg_worker_sviz              |              48  |           0  |           1
 hostgroup_hg_worker_wnf               |              10  |           0  |           1
 worker_JZHZDCBTCSS-001                |               1  |           0  |           0
 worker_KC-MON-P01.kivo.tm             |               0  |           0  |           0
 worker_OENRDCBTMON01                  |               1  |           0  |           0
 worker_ZORG-WAATNMW01                 |               1  |           0  |           0
 worker_ZORG-WAATNMW02                 |               0  |           0  |           0
 worker_bmrdhgtcssm02.brandmr.local    |               1  |           0  |           0
 worker_bu-amf-vma01.bu-amf.local      |               1  |           0  |           0
 worker_dsbwaa01pmgw02                 |               1  |           0  |           0
 worker_dsom-nagt02.desom.mgmt         |               4  |           0  |           0
 worker_flmc-gropnag01.flamco.local    |               1  |           0  |           0
 worker_gdr01dcbmgw02                  |               1  |           0  |           0
 worker_monxisltn-vms.dockaas.nl       |               1  |           0  |           0
 worker_sbhptsssm013.sltn-beheer.local |               1  |           0  |           0
 worker_sbhptsssm014.sltn-beheer.local |               1  |           0  |           0
 worker_sbhptsssm015.sltn-beheer.local |               1  |           0  |           0
 worker_sdb-waatcssm01.dbij.local      |               1  |           0  |           0
 worker_sr_monsltn_pa.dockaas.nl       |               1  |           0  |           0
 worker_svavrdmont01.durable.local     |               1  |           0  |           0
 worker_svavrrmont01.durable.local     |               1  |           0  |           0
 worker_svizdcbpnagi02.vivium.local    |               1  |           0  |           0
 worker_svr-lnxngs-002                 |               1  |           0  |           0
 worker_svr-mgw200                     |               1  |           0  |           0
 worker_wnf-s-mgw01.wnf.local          |               1  |           0  |           0
----------------------------------------------------------------------------------------

Re: orphaned check and force an immediate check doesn't work

Posted: Fri May 28, 2021 4:48 pm
by benjaminsmith
Hi @jweijters,

Thanks for sending over the profile and the gearman log. Based on the logs, this is a gearman worker and/or network issue (not directly related to Nagios XI).
ERROR 2021-05-28 09:08:58.000000 [ main ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
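Since the timeout points at the network path, a first sanity check is to probe TCP reachability from the affected worker back to gearmand. This is a sketch; GEARMAND_HOST is a placeholder for your gearmand server, 4730 is gearmand's default port, and bash's /dev/tcp redirection is used so no extra tooling is needed on the worker:

```shell
# Sketch: probe gearmand reachability from a worker host.
# GEARMAND_HOST is a placeholder; set it to the real gearmand server.
GEARMAND_HOST=${GEARMAND_HOST:-localhost}
if timeout 3 bash -c "exec 3<>/dev/tcp/$GEARMAND_HOST/4730" 2>/dev/null; then
    echo "gearmand reachable"
else
    echo "gearmand NOT reachable"
fi
```

If the probe fails intermittently from only one worker, that would line up with the "Connection timed out" errors in gearmand.log.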
To help troubleshoot this, log into the gearman worker for the hg_worker_dsom hostgroup and enable debugging. Edit /etc/mod_gearman/worker.conf and change this line from

Code: Select all

debug=0
to
debug=1
Save the change and restart the worker.

If you see the orphaned message again, retrieve the log below from the worker and post it to the thread. Thanks, Benjamin

Code: Select all

/var/log/mod_gearman/mod_gearman_worker.log