host check orphaned
Re: host check orphaned
Done.
How long do you want me to wait before sending you the logs?
This problem started just recently as we started to add more devices. That is the only thing that we have done recently. We also removed some legacy host groups and added new ones. Other than that, nothing has changed.
Here is the log from mod_gearman_neb.log:
# tail -f mod_gearman_neb.log
[2015-03-17 09:11:54][17732][DEBUG] host: 'usmaibas1.bose.com', next_check is at 2015-03-17 09:11:40, latency so far: 14
[2015-03-17 09:11:54][17732][DEBUG] service job completed: uswb-idf-14-470 Port 12151 Status: 2
[2015-03-17 09:11:54][17732][DEBUG] service job completed: uswb-mdf-5510.bose.com Port 10127 Status: 2
[2015-03-17 09:11:54][17732][DEBUG] service job completed: uswb-idf-22-470 Port 12110 Status: 2
[2015-03-17 09:11:54][17732][DEBUG] service job completed: uswb-idf-25.bose.com Port 13638 Status: 2
[2015-03-17 09:11:54][17732][DEBUG] service job completed: uswb-ocg-lab.bose.com Port 10113 Status: 0
[2015-03-17 09:11:54][17732][DEBUG] service job completed: usst-idf-g-2960.bose.com Port 11647 Status: 0
[2015-03-17 09:11:54][17732][DEBUG] service job completed: usst-idf-b-2960 Port 10614 Status: 0
[2015-03-17 09:11:54][17732][DEBUG] service job completed: uswb-idf-25.bose.com Port 12606 Status: 2
[2015-03-17 09:11:54][17732][DEBUG] service job completed: dce-ngb Port 964 Status: 0
[2015-03-17 09:12:08][17732][DEBUG] received job for queue service: usrd-6880-vss-1.bose.com - unrouted-VLAN-727 Status
[2015-03-17 09:12:08][17732][DEBUG] service: 'usrd-6880-vss-1.bose.com' - 'unrouted-VLAN-727 Status', next_check is at 2015-03-17 09:11:55, latency so far: 13
[2015-03-17 09:12:08][17732][DEBUG] received job for queue service: uswb-ocg-lab.bose.com - Port 10120 Status
[2015-03-17 09:12:08][17732][DEBUG] service: 'uswb-ocg-lab.bose.com' - 'Port 10120 Status', next_check is at 2015-03-17 09:11:55, latency so far: 13
[2015-03-17 09:12:08][17732][DEBUG] received job for queue service: usfrrd-idf-o-2960.bose.com - Port 11110 Status
[2015-03-17 09:12:08][17732][DEBUG] service: 'usfrrd-idf-o-2960.bose.com' - 'Port 11110 Status', next_check is at 2015-03-17 09:11:55, latency so far: 13
[2015-03-17 09:12:08][17732][DEBUG] received job for queue service: usst-idf-g-2960.bose.com - Port 10608 Status
[2015-03-17 09:12:08][17732][DEBUG] service: 'usst-idf-g-2960.bose.com' - 'Port 10608 Status', next_check is at 2015-03-17 09:11:55, latency so far: 13
[2015-03-17 09:12:08][17732][DEBUG] received job for queue service: usst-idf-g-2960.bose.com - Port 10121 Status
[2015-03-17 09:12:08][17732][DEBUG] service: 'usst-idf-g-2960.bose.com' - 'Port 10121 Status', next_check is at 2015-03-17 09:11:55, latency so far: 13
[2015-03-17 09:12:08][17732][DEBUG] received job for queue service: usfrrd-idf-j-2960.bose.com - Port 11641 Status
[2015-03-17 09:12:08][17732][DEBUG] service: 'usfrrd-idf-j-2960.bose.com' - 'Port 11641 Status', next_check is at 2015-03-17 09:11:55, latency so far: 13
[2015-03-17 09:12:08][17732][DEBUG] received job for queue service: us-res-6880-vss-1.bose.com - Vlan3365 Bandwidth
[2015-03-17 09:12:08][17732][DEBUG] service: 'us-res-6880-vss-1.bose.com' - 'Vlan3365 Bandwidth', next_check is at 2015-03-17 09:11:55, latency so far: 13
[2015-03-17 09:12:08][17732][DEBUG] service job completed: usst-idf-g-2960.bose.com Port 10608 Status: 0
[2015-03-17 09:12:08][17732][DEBUG] service job completed: usst-idf-g-2960.bose.com Port 10121 Status: 2
[2015-03-17 09:12:08][17732][DEBUG] service job completed: uswb-ocg-lab.bose.com Port 10120 Status: 0
[2015-03-17 09:12:08][17732][DEBUG] service job completed: usrd-6880-vss-1.bose.com unrouted-VLAN-727 Status: 0
[2015-03-17 09:12:09][17732][DEBUG] service job completed: usfrrd-idf-j-2960.bose.com Port 11641 Status: 2
and from mod_gearman_worker
root@nagmonus1:(03-17 09:09): /var/log/mod_gearman
# tail -f mod_gearman_worker.log
[2015-03-17 09:12:53][1405][DEBUG] got service job: dcn-ngb.bose.com - Port 135 Bandwidth
[2015-03-17 09:12:53][17435][DEBUG] got service job: dce-nga.bose.com - Port 709 Bandwidth
[2015-03-17 09:12:53][26976][DEBUG] got host job: dce-r06c05-ser3.bose.com
[2015-03-17 09:13:07][17435][DEBUG] got host job: dce-7220.bose.com
[2015-03-17 09:13:07][7690][DEBUG] got service job: uswb-ocg-lab.bose.com - Port 10604 Status
[2015-03-17 09:13:07][32728][DEBUG] got service job: usst-idf-e-2960.bose.com - Port 11131 Status
[2015-03-17 09:13:07][26976][DEBUG] got service job: usfrrd-idf-o-2960.bose.com - Port 11148 Status
[2015-03-17 09:13:07][17219][DEBUG] got service job: usfrrd-idf-d-2960.bose.com - Port 11144 Status
[2015-03-17 09:13:07][814][DEBUG] got service job: usrd-6880-vss-2.bose.com - unrouted-VLAN-714 Bandwidth
[2015-03-17 09:13:08][4162][DEBUG] child started with pid: 4162
[2015-03-17 09:13:22][7690][DEBUG] got service job: dcnfc1.bose.com - Netapp RAID States
[2015-03-17 09:13:22][1405][DEBUG] got service job: us-fdc-6880-a.bose.com - unrouted-VLAN-746 Status
[2015-03-17 09:13:22][17435][DEBUG] got service job: usrd-6880-vss-2.bose.com - Te1/5/6- IDF-C Status
[2015-03-17 09:13:22][814][DEBUG] got service job: uswb-idf-21-470 - Port 12121 Status
[2015-03-17 09:13:22][17219][DEBUG] got service job: usfrrd-idf-r-2960.bose.com - Port 10603 Status
[2015-03-17 09:13:22][32728][DEBUG] got service job: usfm-fdc-mlx-b.bose.com - fdc-550m-02_Mgt Bandwidth
[2015-03-17 09:13:36][17435][DEBUG] got host job: dcn-r02C09-pdu1.bose.com
[2015-03-17 09:13:36][7690][DEBUG] got service job: dce-ga.bose.com - GigabitEthernet13/22 Status
[2015-03-17 09:13:36][32728][DEBUG] got service job: dce-ga.bose.com - fab-b-102-cp0 Status
[2015-03-17 09:13:36][1405][DEBUG] got service job: dce-nga.bose.com - Port 525 Status
[2015-03-17 09:13:36][26976][DEBUG] got service job: uscl-pa500-2.bose.com - Port 4 Bandwidth
[2015-03-17 09:13:36][814][DEBUG] got service job: usfrrd-idf-i-2960.bose.com - Port 10106 Status
[2015-03-17 09:13:36][17219][DEBUG] got service job: usst-idf-g-2960.bose.com - Port 11616 Bandwidth
[2015-03-17 09:13:36][17435][DEBUG] got service job: localhost - HTTP
[2015-03-17 09:13:39][4395][DEBUG] child started with pid: 4395
-
jdalrymple
- Skynet Drone
- Posts: 2620
- Joined: Wed Feb 11, 2015 1:56 pm
Re: host check orphaned
Everything looks very normal here, bosecorp.
You can actually turn the logging on the job server back down to normal; that would be in mod_gearman_neb.conf.
If everything was OK during the time you scraped this log, we need to gather some data from when it wasn't.
If there were host checks timing out during the time you scraped this log, then we need to turn logging on the worker up one more notch. Change mod_gearman_worker.conf so that the debug line reads:
debug=2
Then restart the worker. Be aware this has VERY verbose output, so make sure you have enough disk space available in the /var/log filesystem so that we don't crash your system. I'm still not totally clear: you do have host checks in hostgroups defined on the Nagios/Job server failing, correct? I want to make sure that we're analyzing a worker that we can expect some failures to show up on.
Re: host check orphaned
Thanks.
I have changed to level 2.
The orphan issue happens every day, at all times, consistently.
I will wait some time and then send you the logs.
-
jdalrymple
- Skynet Drone
- Posts: 2620
- Joined: Wed Feb 11, 2015 1:56 pm
Re: host check orphaned
Those debug logs are going to be very substantial in size. You may have to pare them down a bit to show just the obviously broken parts. I don't know how we could not spot the problem in these logs with the debug turned up this high.
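A minimal sketch of one way to cut a large debug log down, assuming the lines keep the `[timestamp][pid][LEVEL] message` layout shown in the excerpts earlier in this thread. The function name `filter_log` and the keyword list are hypothetical. The exact wording of timeout messages at debug level 2 is not shown in this thread, so the keywords need to be adjusted to whatever the log actually contains.

```python
import re

# Layout taken from the log excerpts in this thread:
# [2015-03-17 09:11:54][17732][DEBUG] message text
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[(?P<pid>\d+)\]\[(?P<level>[A-Z]+)\] (?P<msg>.*)$"
)

def filter_log(lines, keywords):
    """Yield (timestamp, pid, message) for lines whose message contains any keyword."""
    lowered = [k.lower() for k in keywords]
    for line in lines:
        match = LINE_RE.match(line.rstrip("\n"))
        if match is None:
            continue  # skip lines that do not follow the expected layout
        msg = match.group("msg")
        if any(k in msg.lower() for k in lowered):
            yield match.group("ts"), match.group("pid"), msg
```

For example, passing an open handle on mod_gearman_worker.log together with `["host job"]` keeps only the host check lines, which is usually a much smaller file to attach or PM.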
Re: host check orphaned
Small FYI: I too had orphaned host checks today on hosts whose checks are run on a gearman worker node. I had to do killall nagios and restart the nagios service.
Nagios XI 5.8.1
https://outsideit.net
Re: host check orphaned
Right, thanks for the information. I have tried that but it does not seem to help
Thank you for your input
Re: host check orphaned
Hi jdalrymple
I just tried to PM you the logs. I have been having orphan issues all morning. Like I said before, it just won't stop whatever I do.
As you said, it did not allow me to PM you the files because they are too big.
Here are the links where you can get the files:
--removed--
and
--removed--
-
jdalrymple
- Skynet Drone
- Posts: 2620
- Joined: Wed Feb 11, 2015 1:56 pm
Re: host check orphaned
I've received your logs and am reviewing them. In the meantime please return your debugging levels to 0 in all configs so that we don't fill your server's disk unnecessarily.
Thank you
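Since debug logging can fill the /var/log filesystem, here is a small sketch of a free-space check that could be run before raising the debug level and again while it is raised. The function names and the 10% threshold are assumptions chosen for illustration, not values from Nagios or mod_gearman.

```python
import shutil

def free_fraction(path="/var/log"):
    """Return the fraction (0.0 to 1.0) of the filesystem holding `path` that is free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total

def is_low_on_space(path="/var/log", minimum_free=0.10):
    """True when less than `minimum_free` of the filesystem is still available."""
    return free_fraction(path) < minimum_free
```

If `is_low_on_space()` returns True, it is probably safer to set the debug levels back to 0 before collecting anything else.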
Re: host check orphaned
Done.
let me know what you find in the logs.
Thanks
-
jdalrymple
- Skynet Drone
- Posts: 2620
- Joined: Wed Feb 11, 2015 1:56 pm
Re: host check orphaned
So bosecorp,
In looking at the logs I found 31 host checks timing out, successfully. What I mean by that is that nothing is wrong with the way Nagios is working, but 31 hosts are truly unpingable by the gearman worker server, which is in this case also the gearman job server.
Is it possible that in rearranging your hostgroups you've inadvertently placed hosts on servers that are unable to reach specific network segments?
In order to help I've PM'd the IPs of all the hosts that are timing out in the log you sent me.
Also, we only have 132 seconds of trace log, which is plenty, but if there are 31 hosts failing in 132 seconds it's possible that many more hosts are failing outside of this log window.
Please compare the IPs I PM'd to you with your expectation of what hosts this gearman worker should be monitoring and let us know if this sheds any light on the problem.
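The comparison described above can be sketched roughly as follows, assuming you can write down the network segments this worker is expected to reach in CIDR form. The function name is hypothetical, and the addresses in the usage note are documentation-range placeholders, not the IPs that were PM'd.

```python
import ipaddress

def unexpected_hosts(timed_out_ips, expected_networks):
    """Return the timed-out IPs that fall outside every expected network."""
    networks = [ipaddress.ip_network(n) for n in expected_networks]
    outside = []
    for ip_text in timed_out_ips:
        ip = ipaddress.ip_address(ip_text)
        # An IP that belongs to none of the expected segments is a candidate
        # for a host that ended up in the wrong hostgroup.
        if not any(ip in net for net in networks):
            outside.append(ip_text)
    return outside
```

For instance, `unexpected_hosts(["192.0.2.10", "203.0.113.7"], ["192.0.2.0/24", "198.51.100.0/24"])` reports only `203.0.113.7`. Any IP returned this way would be worth checking against the hostgroup changes made recently.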