orphaned check and force an immediate check doesn't work
We use gearman as a distributed engine for our monitoring environment. We have ~25 hostgroups with ~50 gearman workers.
We run:
Nagios XI 5.8.1
gearmand 1.1.18
mod_gearman_worker: version 3.3.0 running on libgearman 1.1.19.1
Sometimes we see that only one service on a host is orphaned, while the host's other services are checked and return data just fine.
When I try to recheck this service by forcing an immediate check, it looks like the check isn't rescheduled.
The last check timestamp doesn't update, nor does the attempt information. Where can I see the log of the forced immediate check? How can I fix this?
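For reference, a forced check can also be submitted from the command line through the Nagios Core external command file, which makes it easier to see whether Core accepted it. The sketch below is not taken from this thread: the command file and log paths are assumed defaults for a Nagios XI install, the helper name forced_check_cmd is made up for illustration, and the log search only finds entries if external command logging is enabled.

```shell
#!/bin/sh
# Assumed default paths for a Nagios XI install; adjust to your system.
NAGIOS_CMD_FILE="${NAGIOS_CMD_FILE:-/usr/local/nagios/var/rw/nagios.cmd}"
NAGIOS_LOG="${NAGIOS_LOG:-/usr/local/nagios/var/nagios.log}"

# Build the external command line that asks Nagios Core for a forced service check.
# Arguments: host name, service description, epoch time (used both as the
# submission time and as the requested check time).
forced_check_cmd() {
    printf '[%s] SCHEDULE_FORCED_SVC_CHECK;%s;%s;%s\n' "$3" "$1" "$2" "$3"
}

# Usage (on the Nagios server, as a user allowed to write to the command file):
#   now=$(date +%s)
#   forced_check_cmd "<host_name>" "<service_description>" "$now" > "$NAGIOS_CMD_FILE"
#   grep 'EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK' "$NAGIOS_LOG" | tail -n 5
```

If the EXTERNAL COMMAND line shows up in nagios.log but the last check time still does not change, Core accepted the request, so the problem is more likely further along the path (gearmand or the worker).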
Last edited by jweijters on Tue May 25, 2021 11:48 pm, edited 1 time in total.
Re: orphaned message and force an immediate check
Doing further investigation, I found that when a service is orphaned the check isn't rescheduled at all, although the parameters are set:
check_for_orphaned_hosts=1
check_for_orphaned_services=1
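To confirm what Nagios Core actually has configured, the directives can be read back from the main configuration file. A small sketch, assuming the default Nagios XI location /usr/local/nagios/etc/nagios.cfg (the function name is hypothetical):

```shell
#!/bin/sh
# Print the active (non-commented) orphan-check directives from a nagios.cfg-style file.
# Argument: path to the configuration file.
show_orphan_settings() {
    grep -E '^[[:space:]]*check_for_orphaned_(hosts|services)[[:space:]]*=' "$1"
}

# Usage (path is an assumed default):
#   show_orphan_settings /usr/local/nagios/etc/nagios.cfg
```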
Re: orphaned check and force an immediate check doesn't work
Hi,
I would recommend restarting the Gearman server and workers, then checking whether you get the same behavior. There is a specific order that must be followed when restarting; the steps can be found on page 8 of the guide below.
Integrating Mod-Gearman With Nagios XI
Also, if the error occurs again, force an immediate check and let me know if there are any discrepancies in the check results between the XI and Core interfaces. To view the Core interface, go to:
Code: Select all
http://<IP address>/nagios
Lastly, what is the output of the gearman_top command? Please also send us a system profile. Thanks, Benjamin
To send us your system profile:
Log in to the Nagios XI GUI using a web browser.
Click the "Admin" > "System Profile" menu.
Click the "Download Profile" button.
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
Be sure to check out our Knowledgebase for helpful articles and solutions!
Re: orphaned check and force an immediate check doesn't work
Hi,
there are no discrepancies in the check results.
You can see that for this host all services are OK and regularly checked, except for one service, which has now been orphaned for 2 days. I forced a recheck of this service, but without success. In the previous example I also set the worker log to level 3 and followed the log for approximately 30 minutes. The recheck of the service never appeared in the log, so it looks like it doesn't get rescheduled.
I sent a system profile by PM.
Here is the output of gearman_top:
Code: Select all
gearman_top -b
2021-05-27 06:52:42 - localhost:4730 - v1.1.18
Queue Name | Worker Available | Jobs Waiting | Jobs Running
----------------------------------------------------------------------------------------
check_results | 1 | 0 | 0
hostgroup_hg_worker_avr-dvn | 17 | 0 | 0
hostgroup_hg_worker_avr-rzb | 27 | 0 | 0
hostgroup_hg_worker_bran | 5 | 0 | 0
hostgroup_hg_worker_dock-ka | 5 | 0 | 0
hostgroup_hg_worker_dock-pa | 5 | 0 | 0
hostgroup_hg_worker_dro | 37 | 0 | 3
hostgroup_hg_worker_dsom | 66 | 0 | 13
hostgroup_hg_worker_finq | 5 | 0 | 0
hostgroup_hg_worker_flmc | 11 | 0 | 0
hostgroup_hg_worker_game | 10 | 0 | 0
hostgroup_hg_worker_ggn | 36 | 1 | 6
hostgroup_hg_worker_jzhz | 5 | 0 | 0
hostgroup_hg_worker_kivo | 10 | 0 | 0
hostgroup_hg_worker_lek | 26 | 0 | 16
hostgroup_hg_worker_oenr | 25 | 0 | 6
hostgroup_hg_worker_sdb | 32 | 0 | 1
hostgroup_hg_worker_sdzb | 93 | 0 | 23
hostgroup_hg_worker_snms | 232 | 0 | 54
hostgroup_hg_worker_sviz | 29 | 0 | 2
hostgroup_hg_worker_wnf | 11 | 0 | 1
worker_JZHZDCBTCSS-001 | 1 | 0 | 0
worker_KC-MON-P01.kivo.tm | 1 | 0 | 0
worker_OENRDCBTMON01 | 1 | 0 | 0
worker_ZORG-WAATNMW01 | 1 | 0 | 0
worker_ZORG-WAATNMW02 | 1 | 0 | 0
worker_bmrdhgtcssm02.brandmr.local | 0 | 0 | 0
worker_bu-amf-vma01.bu-amf.local | 1 | 0 | 0
worker_dsbwaa01pmgw02 | 1 | 0 | 0
worker_dsom-nagt02.desom.mgmt | 5 | 0 | 0
worker_flmc-gropnag01.flamco.local | 1 | 0 | 0
worker_gdr01dcbmgw02 | 1 | 0 | 0
worker_monxisltn-vms.dockaas.nl | 1 | 0 | 0
worker_sbhptsssm013.sltn-beheer.local | 1 | 0 | 0
worker_sbhptsssm014.sltn-beheer.local | 1 | 0 | 0
worker_sbhptsssm015.sltn-beheer.local | 1 | 0 | 0
worker_sdb-waatcssm01.dbij.local | 1 | 0 | 0
worker_sr_monsltn_pa.dockaas.nl | 1 | 0 | 0
worker_svavrdmont01.durable.local | 1 | 0 | 0
worker_svavrrmont01.durable.local | 1 | 0 | 0
worker_svizdcbpnagi02.vivium.local | 1 | 0 | 0
worker_svr-lnxngs-002 | 1 | 0 | 0
worker_svr-mgw200 | 1 | 0 | 0
worker_wnf-s-mgw01.wnf.local | 3 | 0 | 0
----------------------------------------------------------------------------------------
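With this many queues, it can help to filter the table down to the rows that look suspicious. The following is only a sketch, not something posted in the thread: it assumes the pipe-separated layout shown above, and the function name flag_queues is made up for illustration.

```shell
#!/bin/sh
# Read `gearman_top -b` output on stdin and print queues that either have
# no worker available or have jobs waiting. Header and separator lines are
# skipped because their second column is not a number.
flag_queues() {
    awk -F'|' '
        NF == 4 && $2 ~ /^[ \t]*[0-9]+[ \t]*$/ {
            name = $1
            gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim the queue name
            workers = $2 + 0
            waiting = $3 + 0
            if (workers == 0) print name ": no worker available"
            if (waiting > 0)  print name ": " waiting " job(s) waiting"
        }'
}

# Usage:
#   gearman_top -b | flag_queues
```

Applied to the output above, it would report hostgroup_hg_worker_ggn (1 job waiting) and worker_bmrdhgtcssm02.brandmr.local (no worker available). Whether either is related to the orphaned service is not established here.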
Re: orphaned check and force an immediate check doesn't work
Hi benjaminsmith,
It looks like I can't download the system profile.
I get an empty page in Nagios XI. I see this empty page in all our Nagios XI 5.8.1 installations, and also on my 5.8.2 dev system.
I checked with two browsers: Firefox 78.9.0 and Chrome 89.0.4389.90.
In my browser I get a status 500, and in the ssl_access_log on my Nagios server I see:
ssl_access_log:10.128.20.105 - - [27/May/2021:08:08:23 +0200] "GET /nagiosxi/includes/components/profile/profile.php HTTP/1.1" 500 2
Kind regards,
Joris Weijters
Re: orphaned check and force an immediate check doesn't work
Hi benjaminsmith,
I've been doing some debugging on the code of profile.php.
It requires /../../configwizards.inc.php.
I did some debugging on that file; it in turn includes the include files of the configuration wizards.
During the include of nagiostats.inc.php something fails, and so configwizards.inc.php fails.
I hadn't noticed before, but the configuration wizards also don't work when the "nagiostats" wizard directory is in /usr/local/nagiosxi/html/includes/configwizards/.
I get this error in the error_log:
[Thu May 27 12:55:22.959586 2021] [php7:error] [pid 59880] [client 192.168.10.1:62045] PHP Fatal error: Cannot redeclare val() (previously declared in /usr/local/nagiosxi/html/includes/components/ccm/includes/common_functions.inc.php:230) in /usr/local/nagiosxi/html/includes/configwizards/nagiostats/nagiostats.inc.php on line 38, referer: http://192.168.10.128/nagiosxi/admin/
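One way to see every file that declares the conflicting function is a recursive search over the include tree. This is a sketch under assumptions: it relies on GNU grep's --include option, the function name find_function_decls is hypothetical, and it only finds plain `function name(` declarations.

```shell
#!/bin/sh
# List every PHP file under a directory that declares a function with the given name.
# Arguments: directory to search, function name.
# Output format is grep's usual file:line:matching-text.
find_function_decls() {
    grep -rnE --include='*.php' "function[[:space:]]+$2[[:space:]]*\(" "$1"
}

# Usage (directory taken from the error message above):
#   find_function_decls /usr/local/nagiosxi/html/includes val
```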
I'm running
Nagios XI 5.8.2
and
PHP 7.2.34 (cli) (built: Feb 3 2021 09:23:21) ( NTS )
Copyright (c) 1997-2018 The PHP Group
Zend Engine v3.2.0, Copyright (c) 1998-2018 Zend Technologies
with Zend OPcache v7.2.34, Copyright (c) 1999-2018, by Zend Technologies
During my investigation I did a clean install of Nagios XI 5.8.2 on CentOS 7.9 running PHP 5.4.16-48, and on that system everything seems to work.
Kind regards
Joris Weijters
Re: orphaned check and force an immediate check doesn't work
Hi Joris,
We would like to know the hostgroup the host is in and the hostname, so we know which worker is supposed to run the check. Then enable debugging on the Nagios server.
Edit /etc/mod_gearman/module.conf and change this line from
Code: Select all
debug=0
to
debug=1
Save the change and restart Nagios. If you see the orphan message again, get this file from the Nagios server and post it:
Code: Select all
/var/log/mod_gearman/mod_gearman_neb.log
Also, please retrieve this log file from the Nagios server and post it:
Code: Select all
/var/log/gearmand/gearmand.log
The profile download is weird. Both PHP files do declare the val function, but they should not be loaded at the same time when the profile is downloaded.
Try running it from the command line as well.
Code: Select all
rm -rf /usr/local/nagiosxi/var/components/profile.zip
/usr/local/nagiosxi/scripts/components/getprofile.sh SUPPORT
Then send me the resulting /usr/local/nagiosxi/var/components/profile.zip file. Thanks, Ben
Re: orphaned check and force an immediate check doesn't work
Hi Ben,
Can we split off the problem with the "profile page"?
Re: orphaned check and force an immediate check doesn't work
Hi Ben,
[1622181710] SERVICE NOTIFICATION: alert_topdesk_7x24_desom_email;dsom-med-mer1-fw01;Fortigate: high availability active-passive;CRITICAL;xi_service_notification_handler_email;(service check orphaned, is the mod-gearman worker on queue 'hostgroup_hg_worker_dsom' running?)
[1622181710] SERVICE NOTIFICATION: alert_topdesk_7x24;dsom-med-mer1-fw01;Fortigate: high availability active-passive;CRITICAL;xi_service_notification_handler_topdesk;(service check orphaned, is the mod-gearman worker on queue 'hostgroup_hg_worker_dsom' running?)
[1622181710] SERVICE ALERT: dsom-med-mer1-fw01;Fortigate: high availability active-passive;CRITICAL;HARD;10;(service check orphaned, is the mod-gearman worker on queue 'hostgroup_hg_worker_dsom' running?)
[1622182901] Warning: The check of service 'Fortigate: high availability active-passive' on host 'dsom-med-mer1-fw01' looks like it was orphaned (results never came back; last_check=1622181701; next_check=1622182010). I'm scheduling an immediate check of the service...
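The bracketed numbers in these log lines are Unix epoch seconds. A small sketch for making them readable, assuming GNU date is available (the helper names are made up for illustration):

```shell
#!/bin/sh
# Convert a Unix epoch timestamp to a UTC date string (uses GNU date's -d @EPOCH form).
epoch_to_utc() {
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S'
}

# Seconds elapsed between two epoch timestamps (earlier first).
seconds_between() {
    echo $(( $2 - $1 ))
}

# Usage:
#   epoch_to_utc 1622181710                  # time of the SERVICE ALERT
#   seconds_between 1622181701 1622182901    # last_check to the orphan warning
```

For the lines above this gives 2021-05-28 06:01:50 UTC for the alert, and 1200 seconds between last_check and the orphan warning.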
I reissued the immediate check at ~08:32.
I will upload the log file via PM.
Code: Select all
gearman_top -b
2021-05-28 08:38:34 - localhost:4730 - v1.1.18
Queue Name | Worker Available | Jobs Waiting | Jobs Running
----------------------------------------------------------------------------------------
check_results | 1 | 0 | 0
hostgroup_hg_worker_avr-dvn | 8 | 2 | 8
hostgroup_hg_worker_avr-rzb | 21 | 0 | 10
hostgroup_hg_worker_bran | 5 | 0 | 0
hostgroup_hg_worker_dock-ka | 5 | 0 | 0
hostgroup_hg_worker_dock-pa | 5 | 0 | 0
hostgroup_hg_worker_dro | 32 | 0 | 3
hostgroup_hg_worker_dsom | 60 | 0 | 1
hostgroup_hg_worker_finq | 5 | 0 | 0
hostgroup_hg_worker_flmc | 9 | 0 | 1
hostgroup_hg_worker_game | 10 | 0 | 1
hostgroup_hg_worker_ggn | 43 | 0 | 4
hostgroup_hg_worker_jzhz | 5 | 0 | 0
hostgroup_hg_worker_kivo | 10 | 0 | 0
hostgroup_hg_worker_lek | 45 | 0 | 3
hostgroup_hg_worker_oenr | 25 | 0 | 7
hostgroup_hg_worker_sdb | 30 | 0 | 0
hostgroup_hg_worker_sdzb | 68 | 0 | 35
hostgroup_hg_worker_snms | 226 | 0 | 128
hostgroup_hg_worker_sviz | 48 | 0 | 1
hostgroup_hg_worker_wnf | 10 | 0 | 1
worker_JZHZDCBTCSS-001 | 1 | 0 | 0
worker_KC-MON-P01.kivo.tm | 0 | 0 | 0
worker_OENRDCBTMON01 | 1 | 0 | 0
worker_ZORG-WAATNMW01 | 1 | 0 | 0
worker_ZORG-WAATNMW02 | 0 | 0 | 0
worker_bmrdhgtcssm02.brandmr.local | 1 | 0 | 0
worker_bu-amf-vma01.bu-amf.local | 1 | 0 | 0
worker_dsbwaa01pmgw02 | 1 | 0 | 0
worker_dsom-nagt02.desom.mgmt | 4 | 0 | 0
worker_flmc-gropnag01.flamco.local | 1 | 0 | 0
worker_gdr01dcbmgw02 | 1 | 0 | 0
worker_monxisltn-vms.dockaas.nl | 1 | 0 | 0
worker_sbhptsssm013.sltn-beheer.local | 1 | 0 | 0
worker_sbhptsssm014.sltn-beheer.local | 1 | 0 | 0
worker_sbhptsssm015.sltn-beheer.local | 1 | 0 | 0
worker_sdb-waatcssm01.dbij.local | 1 | 0 | 0
worker_sr_monsltn_pa.dockaas.nl | 1 | 0 | 0
worker_svavrdmont01.durable.local | 1 | 0 | 0
worker_svavrrmont01.durable.local | 1 | 0 | 0
worker_svizdcbpnagi02.vivium.local | 1 | 0 | 0
worker_svr-lnxngs-002 | 1 | 0 | 0
worker_svr-mgw200 | 1 | 0 | 0
worker_wnf-s-mgw01.wnf.local | 1 | 0 | 0
----------------------------------------------------------------------------------------
Re: orphaned check and force an immediate check doesn't work
Hi @jweijters,
Thanks for sending over the profile and the gearman log. Based on the logs, this is a gearman worker and/or network issue (not directly related to Nagios XI):
ERROR 2021-05-28 09:08:58.000000 [ main ] closing connection due to previous errno error(Connection timed out) -> libgearman-server/io.cc:218
To help troubleshoot this, log into the gearman worker for the hg_worker_dsom host group and enable debugging. Edit /etc/mod_gearman/worker.conf and change this line from
Code: Select all
debug=0
to
debug=1
Save the change and restart the worker.
If you see the orphaned message again, retrieve this log from the worker and post it to the thread. Thanks, Benjamin
Code: Select all
/var/log/mod_gearman/mod_gearman_worker.log
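The debug change can also be scripted, which is handy when several workers need it. This is a minimal sketch, not part of the original advice: the function name enable_debug is hypothetical, it assumes the file contains a literal debug=0 line, and restarting the worker afterwards is still required.

```shell
#!/bin/sh
# Switch a mod_gearman config file from debug=0 to debug=1.
# Argument: path to the config file. A backup is kept next to it as <file>.bak.
enable_debug() {
    sed -i.bak 's/^debug=0[[:space:]]*$/debug=1/' "$1"
}

# Usage (path from the post above):
#   enable_debug /etc/mod_gearman/worker.conf
#   grep '^debug=' /etc/mod_gearman/worker.conf
```

The same function could be pointed at /etc/mod_gearman/module.conf on the Nagios server.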