Re: host check orphaned
Posted: Tue Mar 17, 2015 2:42 pm
by bosecorp
I have PM'd you some of the IPs that are orphaned.
The IPs that you PM'd me are actually IPs that are down. That is good, and I am aware of that. It's the orphaned ones that I am concerned about.
The logs I sent, I believe, were for more than 5 minutes.
Let me know if you need me to turn debugging back on.
Re: host check orphaned
Posted: Tue Mar 17, 2015 3:41 pm
by jdalrymple
Everything in the logs looks OK, even the checks appear to be returning results to the job server for those hosts you mentioned in PM:
Code:
[2015-03-17 12:07:31][30318][TRACE] command: /usr/local/nagios/libexec/check_icmp -H <IPADDR> -w 3000.0,80% -c 5000.0,100% -p 5
[2015-03-17 12:07:31][30318][TRACE] data:
host_name=<hostname>
core_start_time=1426608445.0
start_time=1426608451.156580
finish_time=1426608451.162433
return_code=0
exited_ok=1
source=Mod-Gearman Worker @ nagmonus1
output=OK - <IPADDR>: rta 0.703ms, lost 0%|rta=0.703ms;3000.000;5000.000;0; pl=0%;80;100;; \n
[2015-03-17 12:07:31][30318][TRACE] add_job_to_queue(check_results, (null), 2, 1, 1, 1)
[2015-03-17 12:07:31][30318][TRACE] 281 --->host_name=<hostname>
core_start_time=1426608445.0
start_time=1426608451.156580
finish_time=1426608451.162433
return_code=0
exited_ok=1
source=Mod-Gearman Worker @ nagmonus1
output=OK - <IPADDR>: rta 0.703ms, lost 0%|rta=0.703ms;3000.000;5000.000;0; pl=0%;80;100;; \n
[2015-03-17 12:07:31][30318][TRACE] add_job_to_queue() finished successfully: 0 0
[2015-03-17 12:07:31][30318][TRACE] send_result_back() finished successfully
[2015-03-17 12:07:31][30318][TRACE] send_result_back() has no duplicate servers to send to.
[2015-03-17 12:07:31][30318][TRACE] set_state(1)
[2015-03-17 12:07:31][30318][TRACE] set_state(0)
bosecorp wrote:
Question: do I need to update gearmand as well? I only updated mod_gearman.
At this point I'm tempted to say yes.
The only other thing that may help (maybe try this beforehand) would be to increase the number of workers. At present I see 11 workers (right?), but I'm seeing about 8540 service checks in about 130 seconds - that's pretty aggressive.
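To put rough numbers on "pretty aggressive", here is a minimal sketch that uses only the figures quoted in this post and in the TRACE excerpt. Reading `core_start_time` as the moment the core handed the check off is an assumption based on the field name, and the worker count of 11 is the estimate from the logs, not a confirmed value.

```python
# Figures copied from the TRACE excerpt and the estimate in this post.
core_start_time = 1426608445.0    # assumed: when the core handed off the check
start_time = 1426608451.156580    # when the worker started the plugin
finish_time = 1426608451.162433   # when the plugin finished

queue_latency = start_time - core_start_time   # time spent waiting for a worker
execution_time = finish_time - start_time      # time spent running check_icmp

checks = 8540          # service checks seen in the log window
window_seconds = 130   # approximate length of that window
workers = 11           # worker count as estimated from the logs

checks_per_second = checks / window_seconds
checks_per_worker_per_second = checks_per_second / workers

print(f"queue latency:     {queue_latency:.3f} s")
print(f"execution time:    {execution_time:.6f} s")
print(f"checks per second: {checks_per_second:.1f}")
print(f"per worker:        {checks_per_worker_per_second:.2f} checks/s")
```

On these numbers the check itself takes a few milliseconds, while it waited roughly six seconds before a worker picked it up.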
Re: host check orphaned
Posted: Tue Mar 17, 2015 4:10 pm
by bosecorp
I upgraded the gearman servers as well last night. It did not make any difference.
Are you saying that this could be a performance issue and that we might therefore need to increase the number of workers?
I only have 4 workers:
2015-03-17 17:09:43 - 10.100.30.111:4730 - v0.33
Queue Name | Worker Available | Jobs Waiting | Jobs Running
-------------------------------------------------------------------------
check_results | 1 | 0 | 0
eventhandler | 20 | 0 | 0
host | 20 | 0 | 0
hostgroup_gearman_dce1 | 0 | 0 | 0
hostgroup_gearman_dcn1 | 5 | 0 | 0
service | 20 | 0 | 0
worker_gearmandce1 | 1 | 0 | 0
worker_gearmandcn1 | 1 | 0 | 0
worker_nagmonus1 | 1 | 0 | 0
worker_nagmonus2 | 1 | 0 | 0
-------------------------------------------------------------------------
And lastly, how do I verify that the workers, in this case gearmandce1 and gearmandcn1, are actually doing monitoring work as well? I am starting to think that maybe nagmonus1 is doing all the work.
Re: host check orphaned
Posted: Tue Mar 17, 2015 4:36 pm
by jdalrymple
Your gearman_top should tell you, especially with the massive number of checks you have running. To be honest, I'm baffled by the output that you're showing us from your gearman_top. Does the number of jobs waiting/running ever change from 0? Based on the log output, I would expect both of those columns to be double-digit numbers for AT LEAST 1 worker.
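If you want to check this over many samples rather than by eye, something like the following could work. It is only a sketch: it assumes the pipe-separated layout exactly as pasted earlier in this thread, and the helper names (`parse_gearman_top`, `busy_queues`, `unserved_queues`) are made up for illustration, not part of Mod-Gearman.

```python
def parse_gearman_top(text):
    """Return {queue_name: (workers_available, jobs_waiting, jobs_running)}."""
    queues = {}
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        # Data rows have four columns, the last three being integer counters.
        if len(parts) != 4 or not all(p.isdigit() for p in parts[1:]):
            continue
        queues[parts[0]] = tuple(int(p) for p in parts[1:])
    return queues


def busy_queues(queues):
    """Queues with at least one job waiting or running."""
    return {name: q for name, q in queues.items() if q[1] > 0 or q[2] > 0}


def unserved_queues(queues):
    """Queues that no worker has registered for."""
    return [name for name, q in queues.items() if q[0] == 0]


# Snapshot copied from the earlier post in this thread.
SNAPSHOT = """\
2015-03-17 17:09:43 - 10.100.30.111:4730 - v0.33
Queue Name | Worker Available | Jobs Waiting | Jobs Running
-------------------------------------------------------------------------
check_results | 1 | 0 | 0
eventhandler | 20 | 0 | 0
host | 20 | 0 | 0
hostgroup_gearman_dce1 | 0 | 0 | 0
hostgroup_gearman_dcn1 | 5 | 0 | 0
service | 20 | 0 | 0
worker_gearmandce1 | 1 | 0 | 0
worker_gearmandcn1 | 1 | 0 | 0
worker_nagmonus1 | 1 | 0 | 0
worker_nagmonus2 | 1 | 0 | 0
-------------------------------------------------------------------------
"""

queues = parse_gearman_top(SNAPSHOT)
print("busy queues:", busy_queues(queues))
print("queues without workers:", unserved_queues(queues))
```

For that snapshot no queue is busy, and hostgroup_gearman_dce1 is the only queue with no worker attached.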
Re: host check orphaned
Posted: Tue Mar 17, 2015 5:57 pm
by Box293
bosecorp wrote: and lastly, how do I verify that the workers, in this case gearmandce1 and gearmandcn1 are actually doing the monitoring activities as well. I am starting to think that maybe nagmonus1 is doing all the work.
Run gearman_top.
Then, on one of the workers, stop the worker service. You should see the queues build up. Stopping all the workers should shed some light on how the queues are working.
Re: host check orphaned
Posted: Tue Mar 17, 2015 6:03 pm
by bosecorp
We might be onto something.
I don't see Jobs Running/Waiting ever going above 1; in fact, I don't remember either ever being 1.
I have stopped one of the workers and I don't see anything building up.
This is after I stopped gearmandce1:
2015-03-17 19:02:48 - 10.100.30.111:4730 - v0.33
Queue Name | Worker Available | Jobs Waiting | Jobs Running
-------------------------------------------------------------------------
check_results | 1 | 0 | 0
eventhandler | 10 | 0 | 0
host | 10 | 0 | 0
hostgroup_gearman_dce1 | 0 | 0 | 0
hostgroup_gearman_dcn1 | 5 | 0 | 0
service | 10 | 0 | 0
worker_gearmandce1 | 0 | 0 | 0
worker_gearmandcn1 | 1 | 0 | 0
worker_nagmonus1 | 1 | 0 | 0
worker_nagmonus2 | 1 | 0 | 0
-------------------------------------------------------------------------
Could this be an issue with configuration?
I think I said this before: the way I control who does the monitoring is by hostgroups. The hostgroups I have are gearman_no, gearman_dce1 and gearman_dcn1.
I have PM'd you the config files of my workers.
This is what I am also seeing in the nagios.log file:
[1426688576] Warning: The check of service 'Port 13137 Status' on host 'uswb-idf-25.bose.com' looks like it was orphaned (results never came back; last_check=1426686135; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 13602 Bandwidth' on host 'uswb-idf-25.bose.com' looks like it was orphaned (results never came back; last_check=1426686135; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 13621 Status' on host 'uswb-idf-25.bose.com' looks like it was orphaned (results never came back; last_check=1426686153; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 13630 Status' on host 'uswb-idf-25.bose.com' looks like it was orphaned (results never came back; last_check=1426686117; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 13633 Status' on host 'uswb-idf-25.bose.com' looks like it was orphaned (results never came back; last_check=1426686135; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 13634 Status' on host 'uswb-idf-25.bose.com' looks like it was orphaned (results never came back; last_check=1426686136; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 13647 Status' on host 'uswb-idf-25.bose.com' looks like it was orphaned (results never came back; last_check=1426686093; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 13649 Status' on host 'uswb-idf-25.bose.com' looks like it was orphaned (results never came back; last_check=1426685398; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 13652 Status' on host 'uswb-idf-25.bose.com' looks like it was orphaned (results never came back; last_check=1426685442; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 5001 Bandwidth' on host 'uswb-idf-25.bose.com' looks like it was orphaned (results never came back; last_check=1426685634; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 5182 Bandwidth' on host 'uswb-idf-25.bose.com' looks like it was orphaned (results never came back; last_check=1426685670; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 5188 Bandwidth' on host 'uswb-idf-25.bose.com' looks like it was orphaned (results never came back; last_check=1426686341; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10110 Status' on host 'uswb-mdf-5510.bose.com' looks like it was orphaned (results never came back; last_check=1426686093; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10114 Status' on host 'uswb-mdf-5510.bose.com' looks like it was orphaned (results never came back; last_check=1426686153; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10116 Bandwidth' on host 'uswb-mdf-5510.bose.com' looks like it was orphaned (results never came back; last_check=1426686153; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10121 Status' on host 'uswb-mdf-5510.bose.com' looks like it was orphaned (results never came back; last_check=1426686117; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10604 Bandwidth' on host 'uswb-mdf-5510.bose.com' looks like it was orphaned (results never came back; last_check=1426686116; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10702 Status' on host 'uswb-mdf-5510.bose.com' looks like it was orphaned (results never came back; last_check=1426686093; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10138 Bandwidth' on host 'uswb-ocg-lab.bose.com' looks like it was orphaned (results never came back; last_check=1426685652; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10202 Status' on host 'uswb-ocg-lab.bose.com' looks like it was orphaned (results never came back; last_check=1426686093; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10602 Status' on host 'uswb-ocg-lab.bose.com' looks like it was orphaned (results never came back; last_check=1426685398; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10611 Bandwidth' on host 'uswb-ocg-lab.bose.com' looks like it was orphaned (results never came back; last_check=1426686117; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10619 Status' on host 'uswb-ocg-lab.bose.com' looks like it was orphaned (results never came back; last_check=1426686153; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10623 Bandwidth' on host 'uswb-ocg-lab.bose.com' looks like it was orphaned (results never came back; last_check=1426686093; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10624 Status' on host 'uswb-ocg-lab.bose.com' looks like it was orphaned (results never came back; last_check=1426686093; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10626 Bandwidth' on host 'uswb-ocg-lab.bose.com' looks like it was orphaned (results never came back; last_check=1426685396; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10627 Bandwidth' on host 'uswb-ocg-lab.bose.com' looks like it was orphaned (results never came back; last_check=1426685396; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10629 Status' on host 'uswb-ocg-lab.bose.com' looks like it was orphaned (results never came back; last_check=1426686153; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10630 Bandwidth' on host 'uswb-ocg-lab.bose.com' looks like it was orphaned (results never came back; last_check=1426685443; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10631 Status' on host 'uswb-ocg-lab.bose.com' looks like it was orphaned (results never came back; last_check=1426686135; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10636 Bandwidth' on host 'uswb-ocg-lab.bose.com' looks like it was orphaned (results never came back; last_check=1426686117; next_check=1426687855). I'm scheduling an immediate check of the service.
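For anyone sifting through a wall of warnings like this, a small script can summarise them per host and show how stale each check was when Nagios gave up on it. This is a sketch only: the regular expression is written against the message wording seen in the log above, the two sample lines are copied from it, and `parse_orphans` is a hypothetical helper name.

```python
import re
from collections import Counter

# Matches the orphaned-service warning as it appears in the log above.
ORPHAN_RE = re.compile(
    r"\[(?P<logged>\d+)\] Warning: The check of service '(?P<service>[^']+)' "
    r"on host '(?P<host>[^']+)' looks like it was orphaned "
    r"\(results never came back; last_check=(?P<last>\d+); "
    r"next_check=(?P<next>\d+)\)"
)


def parse_orphans(log_text):
    """Return one dict per orphan warning found in log_text."""
    rows = []
    for m in ORPHAN_RE.finditer(log_text):
        logged = int(m["logged"])
        rows.append({
            "host": m["host"],
            "service": m["service"],
            # Seconds since the last result actually arrived.
            "age_seconds": logged - int(m["last"]),
            # Seconds past the scheduled next check.
            "overdue_seconds": logged - int(m["next"]),
        })
    return rows


# Two lines copied verbatim from the nagios.log excerpt above.
SAMPLE = """\
[1426688576] Warning: The check of service 'Port 13137 Status' on host 'uswb-idf-25.bose.com' looks like it was orphaned (results never came back; last_check=1426686135; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10110 Status' on host 'uswb-mdf-5510.bose.com' looks like it was orphaned (results never came back; last_check=1426686093; next_check=1426687855). I'm scheduling an immediate check of the service...
"""

rows = parse_orphans(SAMPLE)
per_host = Counter(r["host"] for r in rows)

for r in rows:
    print(r["host"], r["service"], r["age_seconds"], r["overdue_seconds"])
print(per_host)
```

In the two sample lines the last results were 2441 and 2483 seconds old (about 40 minutes), and both checks were 721 seconds past their scheduled next_check.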
Re: host check orphaned
Posted: Wed Mar 18, 2015 12:23 pm
by jdalrymple
Why do you have the hostgroups commented out in your neb config? That is definitely causing some of the confusion:
Code:
# sets a list of hostgroups which will go into seperate
# queues. Either specify a comma seperated list or use
# multiple lines.
#hostgroups=name1
#hostgroups=name2,name3
#hostgroups=gearman_a
#hostgroups=gearman_b
#hostgroups=gearman_c
Remove the comment marks from the hostgroups= lines for the gearman hostgroups you expect to receive work, then restart Nagios.
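A quick way to confirm which hostgroups= lines are live and which are still commented out is sketched below. The first sample is the snippet quoted in this post. The second is purely illustrative: it assumes gearman_dce1 and gearman_dcn1 (the hostgroup names mentioned earlier in the thread) are the ones you want enabled, and `hostgroup_lines` is a hypothetical helper, not a Mod-Gearman tool.

```python
def hostgroup_lines(config_text):
    """Split hostgroups= settings into (active, commented_out) name lists."""
    active, commented = [], []
    for raw in config_text.splitlines():
        line = raw.strip()
        target = active
        if line.startswith("#"):
            # Strip the comment marker and remember the setting is disabled.
            line = line.lstrip("#").strip()
            target = commented
        if line.startswith("hostgroups="):
            names = line.split("=", 1)[1].split(",")
            target.extend(n.strip() for n in names if n.strip())
    return active, commented


# The snippet quoted in this post: every hostgroups= line is commented out.
BEFORE = """\
# sets a list of hostgroups which will go into seperate
# queues. Either specify a comma seperated list or use
# multiple lines.
#hostgroups=name1
#hostgroups=name2,name3
#hostgroups=gearman_a
#hostgroups=gearman_b
#hostgroups=gearman_c
"""

# Illustrative only: assumes these two hostgroups should get their own queues.
AFTER = """\
#hostgroups=name1
hostgroups=gearman_dce1,gearman_dcn1
"""

print(hostgroup_lines(BEFORE))
print(hostgroup_lines(AFTER))
```

With the quoted snippet the active list comes back empty, which matches the point above that no hostgroup is being routed to its own queue by the NEB module.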
Re: host check orphaned
Posted: Wed Mar 18, 2015 12:41 pm
by bosecorp
Which of the NEB configs? All 3 of them?
Re: host check orphaned
Posted: Wed Mar 18, 2015 12:50 pm
by jdalrymple
Just the one on the host from which you're running gearman_top, the job server.
Re: host check orphaned
Posted: Wed Mar 18, 2015 12:56 pm
by bosecorp
Done.
Question: why not on the worker servers?