host check orphaned
-
jdalrymple
- Skynet Drone
- Posts: 2620
- Joined: Wed Feb 11, 2015 1:56 pm
Re: host check orphaned
If you remove hosts=yes and services=yes, that worker will stop processing jobs from those queues, which I believe would be a better fit for your XI box.
Make sense?
Re: host check orphaned
You can do a forced check of all down hosts. Go to Mass Acknowledge, change the Command Type to Schedule Immediate Check, select the hosts, and submit the command. That will run an immediate check of the selected hosts.
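If you would rather script it than click through the UI, one possible approach is to build Nagios Core's `SCHEDULE_FORCED_HOST_CHECK` external command for each host. This is only a sketch: the host names are hypothetical placeholders, and the command file path in the comment is the usual default for a source install, which is an assumption about your system.

```python
import time


def forced_host_check_lines(hosts, when=None):
    """Return one SCHEDULE_FORCED_HOST_CHECK external command line per host."""
    ts = int(when if when is not None else time.time())
    # Format: [entry_time] SCHEDULE_FORCED_HOST_CHECK;<host_name>;<check_time>
    return [f"[{ts}] SCHEDULE_FORCED_HOST_CHECK;{host};{ts}" for host in hosts]


if __name__ == "__main__":
    # Hypothetical host names; replace with the hosts that are showing as down.
    down_hosts = ["host-a", "host-b"]
    for line in forced_host_check_lines(down_hosts):
        print(line)
    # To submit, append each line plus a newline to the Nagios command file,
    # commonly /usr/local/nagios/var/rw/nagios.cmd (assumed default path).
```

The script only prints the lines, so you can review them before writing anything to the command file.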
Be sure to check out our Knowledgebase for helpful articles and solutions!
-
CFT6Server
- Posts: 506
- Joined: Wed Apr 15, 2015 4:21 pm
Re: host check orphaned
If I remove those, will they still process the specified service and host groups?
-
jdalrymple
- Skynet Drone
- Posts: 2620
- Joined: Wed Feb 11, 2015 1:56 pm
Re: host check orphaned
If you remove those from any worker, that worker will stop processing all the unmatched hostgroup/servicegroup queues. So in your case, it would only process stuff that fell into that network queue.
The remaining workers will do everything else.
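To make that concrete, here is a simplified model of which queues a worker would pull from for a given set of options. It is not Mod-Gearman's own code, just an illustration. It assumes queue names follow the pattern visible in the gearman_top output posted in this thread (`host`, `service`, `eventhandler`, `hostgroup_<name>`, `servicegroup_<name>`) and it ignores the per-worker status queue.

```python
def worker_queues(hosts=False, services=False, eventhandler=False,
                  hostgroups=(), servicegroups=()):
    """Return the job queues a worker with these options would process."""
    queues = []
    if hosts:
        queues.append("host")          # unmatched host checks
    if services:
        queues.append("service")       # unmatched service checks
    if eventhandler:
        queues.append("eventhandler")
    queues.extend(f"hostgroup_{name}" for name in hostgroups)
    queues.extend(f"servicegroup_{name}" for name in servicegroups)
    return queues


if __name__ == "__main__":
    # Worker with hosts=yes and services=yes removed: only the group queues remain.
    print(worker_queues(hostgroups=["Network_ALL"],
                        servicegroups=["ALL_Network_Bandwidth"]))
    # A general-purpose worker that handles everything unmatched.
    print(worker_queues(hosts=True, services=True, eventhandler=True))
```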
-
CFT6Server
- Posts: 506
- Joined: Wed Apr 15, 2015 4:21 pm
Re: host check orphaned
Thanks. I've done this, and it looks like the host downs are still happening. Perhaps I will have to turn on a higher debug level and try to catch the bad checks? The main XI server is now only doing the specified service checks.
Re: host check orphaned
When you received the host down alerts, were you still getting the "host check orphaned" error?
Can you check the gearman logs and see if they are still processing the host checks that are failing?
-
CFT6Server
- Posts: 506
- Joined: Wed Apr 15, 2015 4:21 pm
Re: host check orphaned
Right now we have 462 hosts, and it is reporting that 337 of them are down while 110 are up. I am not sure why some are good and some are bad. I am watching the logs to see if they produce anything, but nothing yet. This morning, when I clicked on a host and forced a check, it turned green. I am looking at the hosts that are "orphaned" and checking the last check date...
While I was typing this, I thought I'd give a couple of things a try:
1. Set use_uniq_jobs=off in mod_gearman_neb.conf
2. Set host_check_timeout to 60 from 30 in nagios.cfg
3. Added time out (-t 20) to check-host-alive command in case of timeouts
Restarted nagios / mod_gearman. I am seeing the numbers going down dramatically. I will keep an eye on it....
-
CFT6Server
- Posts: 506
- Joined: Wed Apr 15, 2015 4:21 pm
Re: host check orphaned
Looks like the host downs are piling up again... I looked at the worker nodes and it seems like the check results are coming in, but perhaps they are taking too long, so XI or the gearman master thinks the check has timed out somehow?
Code: Select all
[2015-07-13 13:02:15][30125][TRACE] got new job H:kdcnagxi01:150541
[2015-07-13 13:02:15][30125][TRACE] 364 +++>
qHC0K4WA+ODMQoyAP4STDlqbfXyJSX5OWxIVH8wIPdobXqEyluAU5n714EEgN67pbCLt641r19C32okaaq2YUbzwRDPl/vDaiB4gm6brrukdorXFuHWNFZR5SDfVIOYcc+ROcV/HNS9ntjmORriM8rGUVHZupGMl7mXzjfLrGDucSj+c26QGlXf1xvDADPq75Gttt9lz5tIufO6QmePgy0Eritd8cisFUscRmHZn12YvMNJLl5bIWoUD6JB+nxEpkXtOwYVg/nXLIP61Kkn7/o2cJfKKlNQ+xDZtB+FKXCEV/F2NB8KHxH/JbXp6eq6WawIyUI9sa1ipm4jV00SVpgoz1WqNEFeTuKvsk1jLcic=
<+++
[2015-07-13 13:02:15][30125][TRACE] 259 --->
type=host
result_queue=check_results
host_name=kdcnap10c02mgt
start_time=1436817737.0
next_check=1436817737.0
timeout=60
core_time=1436817737.140629
command_line=/usr/local/nagios/libexec/check_icmp -H kdcnap10c02mgt -w 3000.0,80% -c 5000.0,100% -p 5 -t 20
<---
[2015-07-13 13:02:15][30125][TRACE] do_exec_job()
[2015-07-13 13:02:15][30125][DEBUG] got host job: kdcnap10c02mgt
[2015-07-13 13:02:15][30125][TRACE] timeout: 60, core latency: -2
[2015-07-13 13:02:15][30125][TRACE] command: /usr/local/nagios/libexec/check_icmp -H kdcnap10c02mgt -w 3000.0,80% -c 5000.0,100% -p 5 -t 20
[2015-07-13 13:02:15][30125][TRACE] execute_safe_command()
[2015-07-13 13:02:15][30125][TRACE] using execvp, no shell characters found
[2015-07-13 13:02:15][30125][TRACE] send_result_back()
[2015-07-13 13:02:15][30125][TRACE] queue: check_results
[2015-07-13 13:02:15][30125][TRACE] data:
host_name=kdcnap10c02mgt
core_start_time=1436817737.0
start_time=1436817735.194620
finish_time=1436817735.199065
return_code=0
exited_ok=1
source=Mod-Gearman Worker @ kdcnaggm01
output=OK - kdcnap10c02mgt: rta 0.589ms, lost 0%|rta=0.589ms;3000.000;5000.000;0; pl=0%;80;100;; rtmax=0.807ms;;;; rtmin=0.513ms;;;; \n
[2015-07-13 13:02:15][30125][TRACE] add_job_to_queue(check_results, (null), 2, 1, 1, 1)
[2015-07-13 13:02:15][30125][TRACE] 318 --->host_name=kdcnap10c02mgt
core_start_time=1436817737.0
start_time=1436817735.194620
finish_time=1436817735.199065
return_code=0
exited_ok=1
source=Mod-Gearman Worker @ kdcnaggm01
output=OK - kdcnap10c02mgt: rta 0.589ms, lost 0%|rta=0.589ms;3000.000;5000.000;0; pl=0%;80;100;; rtmax=0.807ms;;;; rtmin=0.513ms;;;; \n
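For anyone reading a trace like this, a small script can pull the timing fields out of the result data and show how long the check ran and how the worker's start time compares with the core's timestamp. This is a rough sketch: the function names are made up for illustration, and the sample is copied from the trace above.

```python
def parse_result(text):
    """Parse key=value lines from a Mod-Gearman result block into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")  # split on the first "=" only
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def timing(fields):
    """Return check duration and worker start time relative to the core time."""
    start = float(fields["start_time"])
    finish = float(fields["finish_time"])
    core = float(fields["core_start_time"])
    return {"duration": finish - start, "start_minus_core": start - core}


SAMPLE = """host_name=kdcnap10c02mgt
core_start_time=1436817737.0
start_time=1436817735.194620
finish_time=1436817735.199065
return_code=0
exited_ok=1
"""

if __name__ == "__main__":
    info = timing(parse_result(SAMPLE))
    print(f"duration: {info['duration']:.6f}s")
    print(f"worker start minus core time: {info['start_minus_core']:.3f}s")
```

For this sample the check itself finishes in a few milliseconds, and the worker's start time is about 1.8 seconds earlier than the core timestamp, which lines up with the "core latency: -2" line in the trace. That may point to a small clock difference between the Nagios host and the worker rather than a slow check, but it is only one data point.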
-
jdalrymple
- Skynet Drone
- Posts: 2620
- Joined: Wed Feb 11, 2015 1:56 pm
Re: host check orphaned
I feel somewhat like we're chasing our tails. The worker doesn't indicate anything wrong, and nothing jumps out in Nagios' logs. The next stop would be the gearmand logs. Can you turn debugging up to 2 in /etc/mod_gearman/mod_gearman_neb.conf, and we'll keep an eye out there. I guess we need to correlate what's going on across all 3 processes to see where the breakdown is: Nagios, gearmand and the gearman worker.
-
CFT6Server
- Posts: 506
- Joined: Wed Apr 15, 2015 4:21 pm
Re: host check orphaned
One more update. Working on the assumption that either timeout is happening or the workers are busy but not showing it on gearman_top, I have done the following:
1. In mod_gearman_neb.conf, I've increased the result_worker value to 5.
2. In mod_gearman_worker.conf, I've increased timeout from 60 to 120 and increased min worker to 100.
So far it is ok. But one thing I noticed, when restarting nagios and gearman (in order stated in documentation), my result_worker resets to 1 while it is configured to 5 in the neb.conf file. I am wondering if there's something else that needs to be restarted?
Code: Select all
Queue Name | Worker Available | Jobs Waiting | Jobs Running
-------------------------------------------------------------------------------------
check_results | 1 | 0 | 0
eventhandler | 371 | 0 | 0
host | 271 | 0 | 0
hostgroup_Network_ALL | 100 | 0 | 0
service | 271 | 0 | 0
servicegroup_ALL_Network_Bandwidth | 100 | 0 | 0
worker_kdcnaggm01 | 1 | 0 | 0
worker_kdcnaggm02 | 1 | 0 | 0
worker_kdcnaggm03 | 1 | 0 | 0
worker_kdcnagxi01 | 1 | 0 | 0
-------------------------------------------------------------------------------------
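A quick way to keep an eye on this between restarts is to parse the table and look up the worker count for a given queue. The sketch below assumes the pipe-separated layout shown above. The function names are hypothetical, and the sample rows are copied from the output.

```python
def parse_gearman_top(text):
    """Parse gearman_top table rows into a list of dicts, skipping header and rules."""
    rows = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 4 or not parts[1].isdigit():
            continue  # header line, separator line, or anything unexpected
        name, workers, waiting, running = parts
        rows.append({"queue": name, "workers": int(workers),
                     "waiting": int(waiting), "running": int(running)})
    return rows


def workers_for(rows, queue):
    """Return the number of available workers for a queue, or None if absent."""
    for row in rows:
        if row["queue"] == queue:
            return row["workers"]
    return None


SAMPLE = """Queue Name | Worker Available | Jobs Waiting | Jobs Running
-------------------------------------------------------------
check_results | 1 | 0 | 0
host | 271 | 0 | 0
service | 271 | 0 | 0
"""

if __name__ == "__main__":
    rows = parse_gearman_top(SAMPLE)
    print("check_results workers:", workers_for(rows, "check_results"))
    print("queues with waiting jobs:", [r["queue"] for r in rows if r["waiting"] > 0])
```

Comparing the check_results worker count against the value set in the neb.conf file after each restart would show whether the setting is actually being picked up.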
I am not sure if this is related, but after restarting Nagios and Mod_Gearman, when I click on hosts the services are missing, and it looks like there are system problems. I am not sure where it might have broken, since I cannot see any hosts or services.