orphaned check and force an immediate check doesn't work
Posted: Tue May 25, 2021 3:10 am
by jweijters
We use Gearman as a distributed engine for our monitoring environment. We have ~25 hostgroups with ~50 Gearman workers.
We run:
Nagios XI 5.8.1
gearmand 1.1.18
mod_gearman_worker: version 3.3.0 running on libgearman 1.1.19.1
Sometimes we see that only one service on a host is orphaned, while the other services are checked and return data just fine.
Capture2.JPG
When I try to recheck this service by forcing an immediate check, it looks like it isn't rescheduled.
The last check timestamp doesn't update, nor does the attempt information.
Capture.JPG
Where can I see the log of the immediate check? How can I fix this?
Re: orphaned message and force an immediate check
Posted: Tue May 25, 2021 7:04 am
by jweijters
Doing further investigation, I found that when a service is orphaned the check isn't rescheduled at all, although the parameters are set:
check_for_orphaned_hosts=1
check_for_orphaned_services=1
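A quick way to confirm that these two directives are really active (not commented out or missing) is to read them back from the main config file. The sketch below is only an illustration: the helper name `read_directives` is made up, and the sample text stands in for the real file, which on a Nagios XI server is typically /usr/local/nagios/etc/nagios.cfg (an assumption, adjust for your install).

```python
def read_directives(cfg_text, names):
    """Return {name: value} for the requested directives in nagios.cfg-style text."""
    found = {}
    for raw in cfg_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        key = key.strip()
        if key in names:
            # In this sketch a later occurrence simply overwrites an earlier one.
            found[key] = value.strip()
    return found


# Sample text standing in for the real config file (assumption).
sample = """\
# orphan detection
check_for_orphaned_hosts=1
check_for_orphaned_services=1
"""

wanted = ("check_for_orphaned_hosts", "check_for_orphaned_services")
settings = read_directives(sample, wanted)
for name in wanted:
    print(name, "=", settings.get(name, "<not set>"))
```

To run it against a live system, replace `sample` with the contents of the actual config file.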
Re: orphaned check and force an immediate check doesn't work
Posted: Wed May 26, 2021 9:34 am
by benjaminsmith
Hi,
I would recommend restarting the Gearman server and worker, then trying once more to see if you get the same behavior. There is a specific order that must be followed when restarting; the steps can be found on page 8 of the guide below.
Integrating Mod-Gearman With Nagios XI
Also, if the error occurs again, force an immediate check and let me know if there are any discrepancies in the check results between the XI and Core interfaces. To view the Core interface, go to:
Lastly, what is the output of the gearman_top command? Please also send us a system profile. Thanks, Benjamin
To send us your system profile:
Log in to the Nagios XI GUI using a web browser.
Click the "Admin" > "System Profile" menu.
Click the "Download Profile" button.
Re: orphaned check and force an immediate check doesn't work
Posted: Thu May 27, 2021 12:09 am
by jweijters
Hi,
There are no discrepancies in the check.
You can see that for this host all services are OK and regularly checked, except for one service, which has now been orphaned for 2 days.
Capture3.JPG
I forced a recheck of this service, but that had no effect.
Capture4.JPG
In the previous example I also set the worker log to debug level 3 and followed the log for approximately 30 minutes. The recheck of the service never showed up in the log, so it looks like it doesn't get rescheduled.
I sent a system profile by PM.
Here is the output of gearman_top:
gearman_top -b
2021-05-27 06:52:42 - localhost:4730 - v1.1.18
Queue Name | Worker Available | Jobs Waiting | Jobs Running
----------------------------------------------------------------------------------------
check_results | 1 | 0 | 0
hostgroup_hg_worker_avr-dvn | 17 | 0 | 0
hostgroup_hg_worker_avr-rzb | 27 | 0 | 0
hostgroup_hg_worker_bran | 5 | 0 | 0
hostgroup_hg_worker_dock-ka | 5 | 0 | 0
hostgroup_hg_worker_dock-pa | 5 | 0 | 0
hostgroup_hg_worker_dro | 37 | 0 | 3
hostgroup_hg_worker_dsom | 66 | 0 | 13
hostgroup_hg_worker_finq | 5 | 0 | 0
hostgroup_hg_worker_flmc | 11 | 0 | 0
hostgroup_hg_worker_game | 10 | 0 | 0
hostgroup_hg_worker_ggn | 36 | 1 | 6
hostgroup_hg_worker_jzhz | 5 | 0 | 0
hostgroup_hg_worker_kivo | 10 | 0 | 0
hostgroup_hg_worker_lek | 26 | 0 | 16
hostgroup_hg_worker_oenr | 25 | 0 | 6
hostgroup_hg_worker_sdb | 32 | 0 | 1
hostgroup_hg_worker_sdzb | 93 | 0 | 23
hostgroup_hg_worker_snms | 232 | 0 | 54
hostgroup_hg_worker_sviz | 29 | 0 | 2
hostgroup_hg_worker_wnf | 11 | 0 | 1
worker_JZHZDCBTCSS-001 | 1 | 0 | 0
worker_KC-MON-P01.kivo.tm | 1 | 0 | 0
worker_OENRDCBTMON01 | 1 | 0 | 0
worker_ZORG-WAATNMW01 | 1 | 0 | 0
worker_ZORG-WAATNMW02 | 1 | 0 | 0
worker_bmrdhgtcssm02.brandmr.local | 0 | 0 | 0
worker_bu-amf-vma01.bu-amf.local | 1 | 0 | 0
worker_dsbwaa01pmgw02 | 1 | 0 | 0
worker_dsom-nagt02.desom.mgmt | 5 | 0 | 0
worker_flmc-gropnag01.flamco.local | 1 | 0 | 0
worker_gdr01dcbmgw02 | 1 | 0 | 0
worker_monxisltn-vms.dockaas.nl | 1 | 0 | 0
worker_sbhptsssm013.sltn-beheer.local | 1 | 0 | 0
worker_sbhptsssm014.sltn-beheer.local | 1 | 0 | 0
worker_sbhptsssm015.sltn-beheer.local | 1 | 0 | 0
worker_sdb-waatcssm01.dbij.local | 1 | 0 | 0
worker_sr_monsltn_pa.dockaas.nl | 1 | 0 | 0
worker_svavrdmont01.durable.local | 1 | 0 | 0
worker_svavrrmont01.durable.local | 1 | 0 | 0
worker_svizdcbpnagi02.vivium.local | 1 | 0 | 0
worker_svr-lnxngs-002 | 1 | 0 | 0
worker_svr-mgw200 | 1 | 0 | 0
worker_wnf-s-mgw01.wnf.local | 3 | 0 | 0
----------------------------------------------------------------------------------------
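With this many queues, it is easy to miss the few rows that deserve a second look. The sketch below parses `gearman_top -b` batch output of the shape shown above and lists queues that have no worker available or that have jobs waiting. The function names are made up for illustration, and the sample rows are copied from the output above. A flagged queue is not necessarily broken; it is only a hint about where to look first.

```python
def parse_gearman_top(text):
    """Parse 'name | workers | waiting | running' rows; skip headers and rulers."""
    rows = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        # Data rows have exactly four fields and three numeric columns.
        if len(parts) != 4 or not all(p.isdigit() for p in parts[1:]):
            continue
        name, workers, waiting, running = parts
        rows.append((name, int(workers), int(waiting), int(running)))
    return rows


def suspicious(rows):
    """Queues with zero workers available or with at least one job waiting."""
    return [r for r in rows if r[1] == 0 or r[2] > 0]


# A few rows copied from the gearman_top output above.
sample = """\
2021-05-27 06:52:42 - localhost:4730 - v1.1.18
Queue Name | Worker Available | Jobs Waiting | Jobs Running
check_results | 1 | 0 | 0
hostgroup_hg_worker_dsom | 66 | 0 | 13
hostgroup_hg_worker_ggn | 36 | 1 | 6
worker_bmrdhgtcssm02.brandmr.local | 0 | 0 | 0
"""

flagged = suspicious(parse_gearman_top(sample))
for name, workers, waiting, running in flagged:
    print(f"{name}: workers={workers} waiting={waiting} running={running}")
```

On the full first snapshot this would flag hostgroup_hg_worker_ggn (one job waiting) and worker_bmrdhgtcssm02.brandmr.local (no worker available).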
Re: orphaned check and force an immediate check doesn't work
Posted: Thu May 27, 2021 12:40 am
by jweijters
Hi benjaminsmith,
It looks like I can't download the system profile.
I get an empty page in Nagios XI. I see this empty page in all our Nagios XI 5.8.1 installations, and also on my dev system running 5.8.2.
I checked with two browsers: Firefox 78.9.0 and Chrome 89.0.4389.90.
In my browser I get a status 500:
Capture5.JPG
and in the ssl_access_log on my Nagios server:
ssl_access_log:10.128.20.105 - - [27/May/2021:08:08:23 +0200] "GET /nagiosxi/includes/components/profile/profile.php HTTP/1.1" 500 2
kind regards,
Joris Weijters
Re: orphaned check and force an immediate check doesn't work
Posted: Thu May 27, 2021 6:25 am
by jweijters
Hi benjaminsmith,
I've been doing some debugging on the code of profile.php.
It requires /../../configwizards.inc.php.
I debugged that file as well; it pulls in the include files of the configuration wizards.
During the include of nagiostats.inc.php something fails, and as a result configwizards.inc.php fails.
I hadn't noticed before, but the configuration wizard page also doesn't work when the "nagiostats" wizard directory is present in /usr/local/nagiosxi/html/includes/configwizards/.
I get this error in the error_log:
[Thu May 27 12:55:22.959586 2021] [php7:error] [pid 59880] [client 192.168.10.1:62045] PHP Fatal error: Cannot redeclare val() (previously declared in /usr/local/nagiosxi/html/includes/components/ccm/includes/common_functions.inc.php:230) in /usr/local/nagiosxi/html/includes/configwizards/nagiostats/nagiostats.inc.php on line 38, referer: http://192.168.10.128/nagiosxi/admin/
I'm running Nagios XI 5.8.2 with:
PHP 7.2.34 (cli) (built: Feb 3 2021 09:23:21) ( NTS )
Copyright (c) 1997-2018 The PHP Group
Zend Engine v3.2.0, Copyright (c) 1998-2018 Zend Technologies
with Zend OPcache v7.2.34, Copyright (c) 1999-2018, by Zend Technologies
During my investigation I did a clean install of Nagios XI 5.8.2 on CentOS 7.9 running PHP 5.4.16-48, and on that system everything seems to work.
Kind regards
Joris Weijters
Re: orphaned check and force an immediate check doesn't work
Posted: Thu May 27, 2021 5:21 pm
by benjaminsmith
Hi Joris,
We would like to know the hostgroup the host is in and the hostname, so we know which worker is supposed to run the check. Then, enable debugging on the Nagios server.
Edit /etc/mod_gearman/module.conf and change this line from
Save the change and restart Nagios. If you see the orphan message again, get this file from the Nagios server and post it.
/var/log/mod_gearman/mod_gearman_neb.log
Also, please retrieve the log file from the Nagios server and post it.
The profile download is weird. Both of the PHP files do declare the val function, but they should not be loaded at the same time when the profile is downloaded.
Try running it from the command line as well.
rm -rf /usr/local/nagiosxi/var/components/profile.zip
/usr/local/nagiosxi/scripts/components/getprofile.sh SUPPORT
Then send me the resulting /usr/local/nagiosxi/var/components/profile.zip file. Thanks, Ben
Re: orphaned check and force an immediate check doesn't work
Posted: Fri May 28, 2021 12:39 am
by jweijters
Hi Ben,
Can we split off the problem with the "profile page" into a separate thread?
Re: orphaned check and force an immediate check doesn't work
Posted: Fri May 28, 2021 1:45 am
by jweijters
Hi Ben,
[1622181710] SERVICE NOTIFICATION: alert_topdesk_7x24_desom_email;dsom-med-mer1-fw01;Fortigate: high availability active-passive;CRITICAL;xi_service_notification_handler_email;(service check orphaned, is the mod-gearman worker on queue 'hostgroup_hg_worker_dsom' running?)
[1622181710] SERVICE NOTIFICATION: alert_topdesk_7x24;dsom-med-mer1-fw01;Fortigate: high availability active-passive;CRITICAL;xi_service_notification_handler_topdesk;(service check orphaned, is the mod-gearman worker on queue 'hostgroup_hg_worker_dsom' running?)
[1622181710] SERVICE ALERT: dsom-med-mer1-fw01;Fortigate: high availability active-passive;CRITICAL;HARD;10;(service check orphaned, is the mod-gearman worker on queue 'hostgroup_hg_worker_dsom' running?)
[1622182901] Warning: The check of service 'Fortigate: high availability active-passive' on host 'dsom-med-mer1-fw01' looks like it was orphaned (results never came back; last_check=1622181701; next_check=1622182010). I'm scheduling an immediate check of the service...
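The bracketed numbers in these log lines are Unix epoch seconds, which makes it hard to line them up with the wall-clock times mentioned in the thread. The sketch below pulls the host, service, and the three timestamps out of the orphan warning above and converts them to UTC. The regular expression is written against the exact wording of that one line and is an assumption about the general format; the function names are made up for illustration.

```python
import re
from datetime import datetime, timezone

# Pattern modelled on the orphan warning quoted above (assumed format).
ORPHAN_RE = re.compile(
    r"^\[(?P<logged>\d+)\] Warning: The check of service '(?P<service>.+?)' "
    r"on host '(?P<host>.+?)' looks like it was orphaned "
    r"\(results never came back; last_check=(?P<last>\d+); next_check=(?P<next>\d+)\)"
)


def to_utc(epoch):
    """Convert epoch seconds (string or int) to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(int(epoch), tz=timezone.utc)


def parse_orphan_warning(line):
    """Return a dict of fields from an orphan warning line, or None if no match."""
    m = ORPHAN_RE.match(line)
    if m is None:
        return None
    return {
        "host": m["host"],
        "service": m["service"],
        "logged": to_utc(m["logged"]),
        "last_check": to_utc(m["last"]),
        "next_check": to_utc(m["next"]),
    }


line = (
    "[1622182901] Warning: The check of service 'Fortigate: high availability "
    "active-passive' on host 'dsom-med-mer1-fw01' looks like it was orphaned "
    "(results never came back; last_check=1622181701; next_check=1622182010). "
    "I'm scheduling an immediate check of the service..."
)

info = parse_orphan_warning(line)
print(info["host"], "/", info["service"])
print("last check :", info["last_check"].isoformat())
print("warning at :", info["logged"].isoformat())
print("gap        :", info["logged"] - info["last_check"])
```

The times are printed in UTC; the server in this thread logs in +0200 (see the access log entry earlier), so add two hours to compare them with the local times quoted in the posts.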
I reissued the immediate check at ~08:32.
I will upload the log file via PM.
Capture7.JPG
gearman_top -b
2021-05-28 08:38:34 - localhost:4730 - v1.1.18
Queue Name | Worker Available | Jobs Waiting | Jobs Running
----------------------------------------------------------------------------------------
check_results | 1 | 0 | 0
hostgroup_hg_worker_avr-dvn | 8 | 2 | 8
hostgroup_hg_worker_avr-rzb | 21 | 0 | 10
hostgroup_hg_worker_bran | 5 | 0 | 0
hostgroup_hg_worker_dock-ka | 5 | 0 | 0
hostgroup_hg_worker_dock-pa | 5 | 0 | 0
hostgroup_hg_worker_dro | 32 | 0 | 3
hostgroup_hg_worker_dsom | 60 | 0 | 1
hostgroup_hg_worker_finq | 5 | 0 | 0
hostgroup_hg_worker_flmc | 9 | 0 | 1
hostgroup_hg_worker_game | 10 | 0 | 1
hostgroup_hg_worker_ggn | 43 | 0 | 4
hostgroup_hg_worker_jzhz | 5 | 0 | 0
hostgroup_hg_worker_kivo | 10 | 0 | 0
hostgroup_hg_worker_lek | 45 | 0 | 3
hostgroup_hg_worker_oenr | 25 | 0 | 7
hostgroup_hg_worker_sdb | 30 | 0 | 0
hostgroup_hg_worker_sdzb | 68 | 0 | 35
hostgroup_hg_worker_snms | 226 | 0 | 128
hostgroup_hg_worker_sviz | 48 | 0 | 1
hostgroup_hg_worker_wnf | 10 | 0 | 1
worker_JZHZDCBTCSS-001 | 1 | 0 | 0
worker_KC-MON-P01.kivo.tm | 0 | 0 | 0
worker_OENRDCBTMON01 | 1 | 0 | 0
worker_ZORG-WAATNMW01 | 1 | 0 | 0
worker_ZORG-WAATNMW02 | 0 | 0 | 0
worker_bmrdhgtcssm02.brandmr.local | 1 | 0 | 0
worker_bu-amf-vma01.bu-amf.local | 1 | 0 | 0
worker_dsbwaa01pmgw02 | 1 | 0 | 0
worker_dsom-nagt02.desom.mgmt | 4 | 0 | 0
worker_flmc-gropnag01.flamco.local | 1 | 0 | 0
worker_gdr01dcbmgw02 | 1 | 0 | 0
worker_monxisltn-vms.dockaas.nl | 1 | 0 | 0
worker_sbhptsssm013.sltn-beheer.local | 1 | 0 | 0
worker_sbhptsssm014.sltn-beheer.local | 1 | 0 | 0
worker_sbhptsssm015.sltn-beheer.local | 1 | 0 | 0
worker_sdb-waatcssm01.dbij.local | 1 | 0 | 0
worker_sr_monsltn_pa.dockaas.nl | 1 | 0 | 0
worker_svavrdmont01.durable.local | 1 | 0 | 0
worker_svavrrmont01.durable.local | 1 | 0 | 0
worker_svizdcbpnagi02.vivium.local | 1 | 0 | 0
worker_svr-lnxngs-002 | 1 | 0 | 0
worker_svr-mgw200 | 1 | 0 | 0
worker_wnf-s-mgw01.wnf.local | 1 | 0 | 0
----------------------------------------------------------------------------------------
Re: orphaned check and force an immediate check doesn't work
Posted: Fri May 28, 2021 4:48 pm
by benjaminsmith
Hi @jweijters,
Thanks for sending over the profile and the Gearman log. Based on the logs, this is a Gearman worker and/or network issue (not directly related to Nagios XI).
ERROR 2021-05-28 09:08:58.000000 [ main ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
To help troubleshoot this, log in to the Gearman worker for the hg_worker_dsom hostgroup and enable debugging. Edit /etc/mod_gearman/worker.conf and change this line from
Save the change and restart the worker.
If you see the orphaned message again, retrieve the log from the worker and post it to the thread. Thanks, Benjamin
/var/log/mod_gearman/mod_gearman_worker.log