orphaned check and force an immediate check doesn't work

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
jweijters
Posts: 63
Joined: Thu Feb 06, 2020 3:50 am

orphaned check and force an immediate check doesn't work

Post by jweijters »

We use Gearman as a distributed engine for our monitoring environment. We have ~25 hostgroups with ~50 Gearman workers.
We run:
Nagios XI 5.8.1

gearmand 1.1.18
mod_gearman_worker: version 3.3.0 running on libgearman 1.1.19.1


Sometimes we see that only one service on a host is orphaned, while the host's other services are checked and return data just fine.
Capture2.JPG
When I try to recheck this service with "Force an immediate check", it looks like it isn't rescheduled.

The last check timestamp doesn't update, nor does the attempt information.
Capture.JPG
Where can I see the log of the immediate check? How can I fix this?
Last edited by jweijters on Tue May 25, 2021 11:48 pm, edited 1 time in total.
jweijters
Posts: 63
Joined: Thu Feb 06, 2020 3:50 am

Re: orphaned message and force an immediate check

Post by jweijters »

Doing further investigation, I found that when a service is orphaned, the check isn't rescheduled at all, even though these parameters are set:

check_for_orphaned_hosts=1
check_for_orphaned_services=1
benjaminsmith
Posts: 5324
Joined: Wed Aug 22, 2018 4:39 pm
Location: saint paul

Re: orphaned check and force an immediate check doesn't work

Post by benjaminsmith »

Hi,

I would recommend restarting the Gearman server and worker, then trying once more to see if you get the same behavior. There is a specific order that must be followed when restarting; the steps can be found on page 8 of the guide below.

Integrating Mod-Gearman With Nagios XI

Also, if the error occurs again, force an immediate check and let me know if there are any discrepancies in the check results between the XI and Core interfaces. To view the Core interface, go to:

http://<IP address>/nagios
Lastly, what is the output of the gearman_top command? Please also send us a system profile. Thanks, Benjamin

To send us your system profile:
Log in to the Nagios XI GUI using a web browser.
Click the "Admin" > "System Profile" menu.
Click the "Download Profile" button.
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.

Be sure to check out our Knowledgebase for helpful articles and solutions!
jweijters
Posts: 63
Joined: Thu Feb 06, 2020 3:50 am

Re: orphaned check and force an immediate check doesn't work

Post by jweijters »

Hi,

There are no discrepancies in the check.
You can see that for this host all services are OK and regularly checked, except for one service, which has now been orphaned for 2 days.
Capture3.JPG
I rechecked this service, but without success.
Capture4.JPG
In the previous example I also set the worker log to level 3 and followed the log for approximately 30 minutes. The recheck of the service never appeared in the log, so it looks like it doesn't get rescheduled.

I will send a system profile by PM.
Here is the output of gearman_top:

 gearman_top -b
2021-05-27 06:52:42  -  localhost:4730  -  v1.1.18

 Queue Name                            | Worker Available | Jobs Waiting | Jobs Running
----------------------------------------------------------------------------------------
 check_results                         |               1  |           0  |           0
 hostgroup_hg_worker_avr-dvn           |              17  |           0  |           0
 hostgroup_hg_worker_avr-rzb           |              27  |           0  |           0
 hostgroup_hg_worker_bran              |               5  |           0  |           0
 hostgroup_hg_worker_dock-ka           |               5  |           0  |           0
 hostgroup_hg_worker_dock-pa           |               5  |           0  |           0
 hostgroup_hg_worker_dro               |              37  |           0  |           3
 hostgroup_hg_worker_dsom              |              66  |           0  |          13
 hostgroup_hg_worker_finq              |               5  |           0  |           0
 hostgroup_hg_worker_flmc              |              11  |           0  |           0
 hostgroup_hg_worker_game              |              10  |           0  |           0
 hostgroup_hg_worker_ggn               |              36  |           1  |           6
 hostgroup_hg_worker_jzhz              |               5  |           0  |           0
 hostgroup_hg_worker_kivo              |              10  |           0  |           0
 hostgroup_hg_worker_lek               |              26  |           0  |          16
 hostgroup_hg_worker_oenr              |              25  |           0  |           6
 hostgroup_hg_worker_sdb               |              32  |           0  |           1
 hostgroup_hg_worker_sdzb              |              93  |           0  |          23
 hostgroup_hg_worker_snms              |             232  |           0  |          54
 hostgroup_hg_worker_sviz              |              29  |           0  |           2
 hostgroup_hg_worker_wnf               |              11  |           0  |           1
 worker_JZHZDCBTCSS-001                |               1  |           0  |           0
 worker_KC-MON-P01.kivo.tm             |               1  |           0  |           0
 worker_OENRDCBTMON01                  |               1  |           0  |           0
 worker_ZORG-WAATNMW01                 |               1  |           0  |           0
 worker_ZORG-WAATNMW02                 |               1  |           0  |           0
 worker_bmrdhgtcssm02.brandmr.local    |               0  |           0  |           0
 worker_bu-amf-vma01.bu-amf.local      |               1  |           0  |           0
 worker_dsbwaa01pmgw02                 |               1  |           0  |           0
 worker_dsom-nagt02.desom.mgmt         |               5  |           0  |           0
 worker_flmc-gropnag01.flamco.local    |               1  |           0  |           0
 worker_gdr01dcbmgw02                  |               1  |           0  |           0
 worker_monxisltn-vms.dockaas.nl       |               1  |           0  |           0
 worker_sbhptsssm013.sltn-beheer.local |               1  |           0  |           0
 worker_sbhptsssm014.sltn-beheer.local |               1  |           0  |           0
 worker_sbhptsssm015.sltn-beheer.local |               1  |           0  |           0
 worker_sdb-waatcssm01.dbij.local      |               1  |           0  |           0
 worker_sr_monsltn_pa.dockaas.nl       |               1  |           0  |           0
 worker_svavrdmont01.durable.local     |               1  |           0  |           0
 worker_svavrrmont01.durable.local     |               1  |           0  |           0
 worker_svizdcbpnagi02.vivium.local    |               1  |           0  |           0
 worker_svr-lnxngs-002                 |               1  |           0  |           0
 worker_svr-mgw200                     |               1  |           0  |           0
 worker_wnf-s-mgw01.wnf.local          |               3  |           0  |           0
----------------------------------------------------------------------------------------
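
A table this long is easy to misread, so a small script can help pick out the queues worth a closer look. The following is only a sketch, assuming the four-column layout printed by gearman_top -b above; the names parse_gearman_top and suspicious_queues are hypothetical helpers, not part of mod_gearman, and SAMPLE is just three rows copied from the table.

```python
from typing import List, NamedTuple


class QueueRow(NamedTuple):
    name: str
    workers_available: int
    jobs_waiting: int
    jobs_running: int


def parse_gearman_top(text: str) -> List[QueueRow]:
    """Parse the four-column table printed by `gearman_top -b` (layout as in this thread)."""
    rows = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        # Data rows have exactly four cells with numeric counters; this skips
        # the timestamp line, the header row and the dashed separators.
        if len(parts) != 4 or not all(p.isdigit() for p in parts[1:]):
            continue
        rows.append(QueueRow(parts[0], int(parts[1]), int(parts[2]), int(parts[3])))
    return rows


def suspicious_queues(rows: List[QueueRow]) -> List[QueueRow]:
    """Return queues that have no worker attached or have jobs waiting."""
    return [r for r in rows if r.workers_available == 0 or r.jobs_waiting > 0]


# Three rows copied from the output above, plus header and separators.
SAMPLE = """\
 Queue Name                            | Worker Available | Jobs Waiting | Jobs Running
----------------------------------------------------------------------------------------
 hostgroup_hg_worker_dsom              |              66  |           0  |          13
 hostgroup_hg_worker_ggn               |              36  |           1  |           6
 worker_bmrdhgtcssm02.brandmr.local    |               0  |           0  |           0
----------------------------------------------------------------------------------------
"""

if __name__ == "__main__":
    for row in suspicious_queues(parse_gearman_top(SAMPLE)):
        print(f"{row.name}: workers={row.workers_available} waiting={row.jobs_waiting}")
```

Run against the full output above, this would flag hostgroup_hg_worker_ggn (one job waiting) and worker_bmrdhgtcssm02.brandmr.local (no worker available). Neither is the queue of the orphaned service, so at this moment the table alone does not point to the cause.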
jweijters
Posts: 63
Joined: Thu Feb 06, 2020 3:50 am

Re: orphaned check and force an immediate check doesn't work

Post by jweijters »

Hi benjaminsmith,

It looks like I can't download the system profile.
I get an empty page in Nagios XI. I see this empty page on all our Nagios XI 5.8.1 installations, and also on my dev system running 5.8.2.
I checked with two browsers: Firefox 78.9.0 and Chrome 89.0.4389.90.

In my browser I get a status 500:
Capture5.JPG
and in the ssl_access_log on my Nagios server:

ssl_access_log:10.128.20.105 - - [27/May/2021:08:08:23 +0200] "GET /nagiosxi/includes/components/profile/profile.php HTTP/1.1" 500 2

kind regards,

Joris Weijters
jweijters
Posts: 63
Joined: Thu Feb 06, 2020 3:50 am

Re: orphaned check and force an immediate check doesn't work

Post by jweijters »

Hi benjaminsmith,

I've been doing some debugging on the code for profile.php.
It requires /../../configwizards.inc.php.
I debugged that file as well; it pulls in the include files of the configuration wizards.

During the include of nagiostats.inc.php something fails, and so configwizards.inc.php fails.
I hadn't noticed before, but the configuration wizards also don't work when the "nagiostat wizard" directory is in /usr/local/nagiosxi/html/includes/configwizards/.

I get this error in the error_log:

[Thu May 27 12:55:22.959586 2021] [php7:error] [pid 59880] [client 192.168.10.1:62045] PHP Fatal error: Cannot redeclare val() (previously declared in /usr/local/nagiosxi/html/includes/components/ccm/includes/common_functions.inc.php:230) in /usr/local/nagiosxi/html/includes/configwizards/nagiostats/nagiostats.inc.php on line 38, referer: http://192.168.10.128/nagiosxi/admin/


I'm running
Nagios XI 5.8.2
and
PHP 7.2.34 (cli) (built: Feb 3 2021 09:23:21) ( NTS )
Copyright (c) 1997-2018 The PHP Group
Zend Engine v3.2.0, Copyright (c) 1998-2018 Zend Technologies
with Zend OPcache v7.2.34, Copyright (c) 1999-2018, by Zend Technologies

During my investigation I did a clean install of Nagios XI 5.8.2 on CentOS 7.9 running PHP 5.4.16-48, and on that system everything seems to work.


Kind regards

Joris Weijters
benjaminsmith
Posts: 5324
Joined: Wed Aug 22, 2018 4:39 pm
Location: saint paul

Re: orphaned check and force an immediate check doesn't work

Post by benjaminsmith »

Hi Joris,

We would like to know the hostgroup the host is in and the hostname, so we know which worker is supposed to run the check. Then, enable debugging on the Nagios server.

Edit /etc/mod_gearman/module.conf and change this line from

debug=0
to
debug=1
Save the change and restart Nagios. If you see the orphan message again, get this file from the Nagios server and post it.

/var/log/mod_gearman/mod_gearman_neb.log
Also, please retrieve this log file from the Nagios server and post it.

/var/log/gearmand/gearmand.log
The profile download is weird. Both of the PHP files do declare the val function, but they should not be loaded at the same time when the profile is downloaded.
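
To see which files declare the same function name, a quick scan of the include tree can help. This is a rough sketch, not an XI tool: find_declarations and duplicates are hypothetical helper names, the default directory is taken from the paths in the error message above, and the regex is a heuristic (it only catches lines that begin with "function name(", so it misses methods written with a visibility keyword and is no substitute for a PHP parser).

```python
import os
import re
import sys
from collections import defaultdict

# Heuristic: a line that starts with "function <name>(" (PHP function names
# are case-insensitive, so names are lower-cased before comparing).
DECL = re.compile(
    r"^[ \t]*function\s+&?\s*([A-Za-z_][A-Za-z0-9_]*)\s*\(",
    re.IGNORECASE | re.MULTILINE,
)


def find_declarations(root):
    """Map each function name to a list of (path, line number) declarations."""
    found = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for fname in files:
            if not fname.endswith(".php"):  # also covers *.inc.php
                continue
            path = os.path.join(dirpath, fname)
            with open(path, encoding="utf-8", errors="replace") as fh:
                text = fh.read()
            for m in DECL.finditer(text):
                line_no = text.count("\n", 0, m.start(1)) + 1
                found[m.group(1).lower()].append((path, line_no))
    return found


def duplicates(found):
    """Keep only the names that are declared in more than one place."""
    return {name: sites for name, sites in found.items() if len(sites) > 1}


if __name__ == "__main__":
    # Default directory assumed from the paths in the error message.
    root = sys.argv[1] if len(sys.argv) > 1 else "/usr/local/nagiosxi/html/includes"
    for name, sites in sorted(duplicates(find_declarations(root)).items()):
        print(name)
        for path, line_no in sites:
            print(f"  {path}:{line_no}")
```

A name showing up twice is not an error by itself; PHP only raises "Cannot redeclare" when both files end up loaded in the same request, which is what the error_log entry suggests happens here.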

Try running it from the command line as well.

rm -rf /usr/local/nagiosxi/var/components/profile.zip
/usr/local/nagiosxi/scripts/components/getprofile.sh SUPPORT
Then send me the resulting /usr/local/nagiosxi/var/components/profile.zip file. Thanks, Ben
jweijters
Posts: 63
Joined: Thu Feb 06, 2020 3:50 am

Re: orphaned check and force an immediate check doesn't work

Post by jweijters »

Hi Ben,

Can we split off the problem with the "profile page"?
jweijters
Posts: 63
Joined: Thu Feb 06, 2020 3:50 am

Re: orphaned check and force an immediate check doesn't work

Post by jweijters »

Hi Ben,


[1622181710] SERVICE NOTIFICATION: alert_topdesk_7x24_desom_email;dsom-med-mer1-fw01;Fortigate: high availability active-passive;CRITICAL;xi_service_notification_handler_email;(service check orphaned, is the mod-gearman worker on queue 'hostgroup_hg_worker_dsom' running?)
[1622181710] SERVICE NOTIFICATION: alert_topdesk_7x24;dsom-med-mer1-fw01;Fortigate: high availability active-passive;CRITICAL;xi_service_notification_handler_topdesk;(service check orphaned, is the mod-gearman worker on queue 'hostgroup_hg_worker_dsom' running?)
[1622181710] SERVICE ALERT: dsom-med-mer1-fw01;Fortigate: high availability active-passive;CRITICAL;HARD;10;(service check orphaned, is the mod-gearman worker on queue 'hostgroup_hg_worker_dsom' running?)
[1622182901] Warning: The check of service 'Fortigate: high availability active-passive' on host 'dsom-med-mer1-fw01' looks like it was orphaned (results never came back; last_check=1622181701; next_check=1622182010). I'm scheduling an immediate check of the service...
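
The bracketed prefix and the last_check/next_check fields in these lines are Unix epoch seconds, which makes the timing hard to read at a glance. A minimal sketch for converting them to UTC, assuming the log format shown above (annotate and to_utc are hypothetical helper names, and the sample line is shortened with "..." only to keep the listing narrow):

```python
import re
from datetime import datetime, timezone

EPOCH_PREFIX = re.compile(r"^\[(\d{10})\]")
EPOCH_FIELD = re.compile(r"\b(last_check|next_check)=(\d{10})\b")


def to_utc(epoch):
    """Format epoch seconds as a UTC timestamp string."""
    return datetime.fromtimestamp(int(epoch), tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


def annotate(line):
    """Return the readable UTC times found in one log line."""
    times = {}
    m = EPOCH_PREFIX.match(line)
    if m:
        times["logged"] = to_utc(m.group(1))
    for key, value in EPOCH_FIELD.findall(line):
        times[key] = to_utc(value)
    return times


if __name__ == "__main__":
    line = (
        "[1622182901] Warning: The check of service ... looks like it was orphaned "
        "(results never came back; last_check=1622181701; next_check=1622182010)."
    )
    for key, value in annotate(line).items():
        print(f"{key:>10}: {value} UTC")
```

For the warning above this gives a log time of 06:21:41 UTC, with last_check at 06:01:41 and next_check at 06:06:50, so the orphan warning was raised 20 minutes after the last check was dispatched. If the server runs at +0200, as the access log entry earlier in the thread suggests, that is 08:21:41 local time.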

I reissued the immediate check at ~08:32.
I will upload the log file via PM.
Capture7.JPG

 gearman_top -b
2021-05-28 08:38:34  -  localhost:4730  -  v1.1.18

 Queue Name                            | Worker Available | Jobs Waiting | Jobs Running
----------------------------------------------------------------------------------------
 check_results                         |               1  |           0  |           0
 hostgroup_hg_worker_avr-dvn           |               8  |           2  |           8
 hostgroup_hg_worker_avr-rzb           |              21  |           0  |          10
 hostgroup_hg_worker_bran              |               5  |           0  |           0
 hostgroup_hg_worker_dock-ka           |               5  |           0  |           0
 hostgroup_hg_worker_dock-pa           |               5  |           0  |           0
 hostgroup_hg_worker_dro               |              32  |           0  |           3
 hostgroup_hg_worker_dsom              |              60  |           0  |           1
 hostgroup_hg_worker_finq              |               5  |           0  |           0
 hostgroup_hg_worker_flmc              |               9  |           0  |           1
 hostgroup_hg_worker_game              |              10  |           0  |           1
 hostgroup_hg_worker_ggn               |              43  |           0  |           4
 hostgroup_hg_worker_jzhz              |               5  |           0  |           0
 hostgroup_hg_worker_kivo              |              10  |           0  |           0
 hostgroup_hg_worker_lek               |              45  |           0  |           3
 hostgroup_hg_worker_oenr              |              25  |           0  |           7
 hostgroup_hg_worker_sdb               |              30  |           0  |           0
 hostgroup_hg_worker_sdzb              |              68  |           0  |          35
 hostgroup_hg_worker_snms              |             226  |           0  |         128
 hostgroup_hg_worker_sviz              |              48  |           0  |           1
 hostgroup_hg_worker_wnf               |              10  |           0  |           1
 worker_JZHZDCBTCSS-001                |               1  |           0  |           0
 worker_KC-MON-P01.kivo.tm             |               0  |           0  |           0
 worker_OENRDCBTMON01                  |               1  |           0  |           0
 worker_ZORG-WAATNMW01                 |               1  |           0  |           0
 worker_ZORG-WAATNMW02                 |               0  |           0  |           0
 worker_bmrdhgtcssm02.brandmr.local    |               1  |           0  |           0
 worker_bu-amf-vma01.bu-amf.local      |               1  |           0  |           0
 worker_dsbwaa01pmgw02                 |               1  |           0  |           0
 worker_dsom-nagt02.desom.mgmt         |               4  |           0  |           0
 worker_flmc-gropnag01.flamco.local    |               1  |           0  |           0
 worker_gdr01dcbmgw02                  |               1  |           0  |           0
 worker_monxisltn-vms.dockaas.nl       |               1  |           0  |           0
 worker_sbhptsssm013.sltn-beheer.local |               1  |           0  |           0
 worker_sbhptsssm014.sltn-beheer.local |               1  |           0  |           0
 worker_sbhptsssm015.sltn-beheer.local |               1  |           0  |           0
 worker_sdb-waatcssm01.dbij.local      |               1  |           0  |           0
 worker_sr_monsltn_pa.dockaas.nl       |               1  |           0  |           0
 worker_svavrdmont01.durable.local     |               1  |           0  |           0
 worker_svavrrmont01.durable.local     |               1  |           0  |           0
 worker_svizdcbpnagi02.vivium.local    |               1  |           0  |           0
 worker_svr-lnxngs-002                 |               1  |           0  |           0
 worker_svr-mgw200                     |               1  |           0  |           0
 worker_wnf-s-mgw01.wnf.local          |               1  |           0  |           0
----------------------------------------------------------------------------------------
benjaminsmith
Posts: 5324
Joined: Wed Aug 22, 2018 4:39 pm
Location: saint paul

Re: orphaned check and force an immediate check doesn't work

Post by benjaminsmith »

Hi @jweijters,

Thanks for sending the profile over and the gearman log. Based on the logs, this is a Gearman worker and/or network issue (not directly related to Nagios XI).
ERROR 2021-05-28 09:08:58.000000 [ main ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
To help troubleshoot this, log in to the Gearman worker for the hg_worker_dsom host group and enable debugging. Edit /etc/mod_gearman/worker.conf and change this line from

debug=0
to
debug=1
Save the change and restart the worker.
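
With roughly 50 workers, making this edit by hand on each one gets tedious. The sketch below shows one way the debug= line could be rewritten programmatically; set_debug is a hypothetical helper, it works on the file's text rather than on the file itself, and the sample configuration in the demo is made up for illustration.

```python
import re

# An uncommented "debug=<number>" line, with optional spaces around "=".
DEBUG_LINE = re.compile(r"^([ \t]*debug[ \t]*=[ \t]*)\d+[ \t]*$", re.MULTILINE)


def set_debug(conf_text, level=1):
    """Return conf_text with the first uncommented debug= line set to `level`.

    Raises ValueError if the text has no such line, so a missing setting
    is noticed instead of being silently ignored.
    """
    new_text, count = DEBUG_LINE.subn(
        lambda m: f"{m.group(1)}{level}", conf_text, count=1
    )
    if count == 0:
        raise ValueError("no uncommented debug= line found")
    return new_text


if __name__ == "__main__":
    # Made-up sample text; a real worker.conf contains many more settings.
    sample = "# example worker settings\ndebug=0\n"
    print(set_debug(sample, 1), end="")
```

Reading /etc/mod_gearman/worker.conf, passing its contents through set_debug and writing the result back would cover the edit described above; the worker still has to be restarted afterwards for the change to take effect.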

If you see the orphaned message again, then retrieve the log from the worker and post that to the thread. Thanks, Benjamin

/var/log/mod_gearman/mod_gearman_worker.log