host check orphaned


Post by bosecorp »

I have PMed you some of the IPs that are orphaned.

The IPs that you PMed me are actually IPs that are down. That is good, and I am aware of that. It's the orphaned ones that I am concerned about.

The logs I sent were, I believe, for more than 5 minutes.

Let me know if you need me to turn debugging back on.

Post by jdalrymple »

Everything in the logs looks OK, even the checks appear to be returning results to the job server for those hosts you mentioned in PM:

[2015-03-17 12:07:31][30318][TRACE] command: /usr/local/nagios/libexec/check_icmp -H <IPADDR> -w 3000.0,80% -c 5000.0,100% -p 5
[2015-03-17 12:07:31][30318][TRACE] data:
host_name=<hostname>
core_start_time=1426608445.0
start_time=1426608451.156580
finish_time=1426608451.162433
return_code=0
exited_ok=1
source=Mod-Gearman Worker @ nagmonus1
output=OK - <IPADDR>: rta 0.703ms, lost 0%|rta=0.703ms;3000.000;5000.000;0; pl=0%;80;100;; \n
[2015-03-17 12:07:31][30318][TRACE] add_job_to_queue(check_results, (null), 2, 1, 1, 1)
[2015-03-17 12:07:31][30318][TRACE] 281 --->host_name=<hostname>
core_start_time=1426608445.0
start_time=1426608451.156580
finish_time=1426608451.162433
return_code=0
exited_ok=1
source=Mod-Gearman Worker @ nagmonus1
output=OK - <IPADDR>: rta 0.703ms, lost 0%|rta=0.703ms;3000.000;5000.000;0; pl=0%;80;100;; \n
[2015-03-17 12:07:31][30318][TRACE] add_job_to_queue() finished successfully: 0 0
[2015-03-17 12:07:31][30318][TRACE] send_result_back() finished successfully
[2015-03-17 12:07:31][30318][TRACE] send_result_back() has no duplicate servers to send to.
[2015-03-17 12:07:31][30318][TRACE] set_state(1)
[2015-03-17 12:07:31][30318][TRACE] set_state(0)
bosecorp wrote: Question: do I need to update gearmand as well? I only updated mod_gearman.
At this point I'm tempted to say yes.

The only other thing that may help (maybe try this beforehand) would be to increase the number of workers. At present I see 11 workers (right?), but I'm seeing about 8540 service checks in about 130 seconds - that's pretty aggressive.
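
For a rough sense of scale, here is a minimal sketch of the arithmetic behind those figures. It assumes the checks are spread evenly across the workers, which may not hold in practice, and the function name is made up for illustration.

```python
def checks_per_second(total_checks, window_seconds):
    """Average check rate over the observed window."""
    return total_checks / window_seconds


# Figures quoted above: about 8540 service checks in about 130 seconds.
rate = checks_per_second(8540, 130)

# Assumption: the load is spread evenly over 11 workers (the count quoted above).
per_worker = rate / 11

print(f"{rate:.1f} checks/s overall, {per_worker:.2f} checks/s per worker")
```

Swap in a different worker count to see how the per-worker rate changes.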

Post by bosecorp »

I upgraded the gearman servers as well last night. It did not make any difference.

Are you saying that this could be a performance issue, and that we might therefore need to increase the number of workers?

I only have 4 workers.


2015-03-17 17:09:43 - 10.100.30.111:4730 - v0.33

Queue Name             | Worker Available | Jobs Waiting | Jobs Running
-------------------------------------------------------------------------
check_results          |                1 |            0 |            0
eventhandler           |               20 |            0 |            0
host                   |               20 |            0 |            0
hostgroup_gearman_dce1 |                0 |            0 |            0
hostgroup_gearman_dcn1 |                5 |            0 |            0
service                |               20 |            0 |            0
worker_gearmandce1     |                1 |            0 |            0
worker_gearmandcn1     |                1 |            0 |            0
worker_nagmonus1       |                1 |            0 |            0
worker_nagmonus2       |                1 |            0 |            0
-------------------------------------------------------------------------

And lastly, how do I verify that the workers, in this case gearmandce1 and gearmandcn1, are actually doing monitoring work as well? I am starting to think that maybe nagmonus1 is doing all the work.

Post by jdalrymple »

Your gearman_top should tell you, especially with the massive number of checks you have running. To be honest, I'm baffled by the gearman_top output you're showing us. Does the number of jobs waiting/running ever change from 0? Based on the log output, I would expect both of those columns to be double-digit numbers for AT LEAST 1 worker.

Post by Box293 »

bosecorp wrote: And lastly, how do I verify that the workers, in this case gearmandce1 and gearmandcn1, are actually doing monitoring work as well? I am starting to think that maybe nagmonus1 is doing all the work.
Run gearman_top.

Then, on one of the workers, stop the worker service. You should see the queues build up. Stopping all the workers should shed some light on how the queues are working.
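
If you would rather compare snapshots than eyeball them, the following is a minimal sketch that parses a pasted gearman_top table and reports which queues have no workers and whether any jobs are waiting or running. The function names are hypothetical, the sample is the snapshot posted earlier in this thread, and the script only reads text you paste in; it does not talk to gearmand.

```python
def parse_gearman_top(text):
    """Map queue name -> (workers, jobs_waiting, jobs_running)."""
    queues = {}
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 4:
            continue  # timestamp, separator, or blank line
        name, *counts = parts
        if not all(c.isdigit() for c in counts):
            continue  # header row
        queues[name] = tuple(int(c) for c in counts)
    return queues


def summarize(queues):
    """Return (all_idle, queues_without_workers)."""
    all_idle = all(w == 0 and r == 0 for _, w, r in queues.values())
    no_workers = sorted(q for q, (avail, _, _) in queues.items() if avail == 0)
    return all_idle, no_workers


# Snapshot copied from the gearman_top output posted earlier in the thread.
SAMPLE = """\
Queue Name | Worker Available | Jobs Waiting | Jobs Running
-------------------------------------------------------------------------
check_results | 1 | 0 | 0
eventhandler | 20 | 0 | 0
host | 20 | 0 | 0
hostgroup_gearman_dce1 | 0 | 0 | 0
hostgroup_gearman_dcn1 | 5 | 0 | 0
service | 20 | 0 | 0
worker_gearmandce1 | 1 | 0 | 0
worker_gearmandcn1 | 1 | 0 | 0
worker_nagmonus1 | 1 | 0 | 0
worker_nagmonus2 | 1 | 0 | 0
"""

queues = parse_gearman_top(SAMPLE)
all_idle, no_workers = summarize(queues)
print("all queues idle:", all_idle)
print("queues with no workers:", no_workers)
```

Running it against a snapshot taken before and after stopping a worker should show whether the worker counts drop and whether anything starts queuing.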

Post by bosecorp »

We might be onto something.

I don't see Jobs Running/Waiting ever going above 1; in fact, I don't remember them ever being 1.

I stopped one of the workers and I don't see anything building up.

This is after I stopped gearmandce1:

2015-03-17 19:02:48 - 10.100.30.111:4730 - v0.33

Queue Name             | Worker Available | Jobs Waiting | Jobs Running
-------------------------------------------------------------------------
check_results          |                1 |            0 |            0
eventhandler           |               10 |            0 |            0
host                   |               10 |            0 |            0
hostgroup_gearman_dce1 |                0 |            0 |            0
hostgroup_gearman_dcn1 |                5 |            0 |            0
service                |               10 |            0 |            0
worker_gearmandce1     |                0 |            0 |            0
worker_gearmandcn1     |                1 |            0 |            0
worker_nagmonus1       |                1 |            0 |            0
worker_nagmonus2       |                1 |            0 |            0
-------------------------------------------------------------------------


Could this be an issue with configuration?

I think I said this before: the way I control who does the monitoring is by hostgroups. The hostgroups I have are gearman_no, gearman_dce1, and gearman_dcn1.

I have PMed you the config files of my workers.

This is what I am also seeing in the nagios.log file:


[1426688576] Warning: The check of service 'Port 13137 Status' on host 'uswb-idf-25.bose.com' looks like it was orphaned (results never came back; last_check=1426686135; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 13602 Bandwidth' on host 'uswb-idf-25.bose.com' looks like it was orphaned (results never came back; last_check=1426686135; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 13621 Status' on host 'uswb-idf-25.bose.com' looks like it was orphaned (results never came back; last_check=1426686153; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 13630 Status' on host 'uswb-idf-25.bose.com' looks like it was orphaned (results never came back; last_check=1426686117; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 13633 Status' on host 'uswb-idf-25.bose.com' looks like it was orphaned (results never came back; last_check=1426686135; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 13634 Status' on host 'uswb-idf-25.bose.com' looks like it was orphaned (results never came back; last_check=1426686136; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 13647 Status' on host 'uswb-idf-25.bose.com' looks like it was orphaned (results never came back; last_check=1426686093; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 13649 Status' on host 'uswb-idf-25.bose.com' looks like it was orphaned (results never came back; last_check=1426685398; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 13652 Status' on host 'uswb-idf-25.bose.com' looks like it was orphaned (results never came back; last_check=1426685442; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 5001 Bandwidth' on host 'uswb-idf-25.bose.com' looks like it was orphaned (results never came back; last_check=1426685634; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 5182 Bandwidth' on host 'uswb-idf-25.bose.com' looks like it was orphaned (results never came back; last_check=1426685670; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 5188 Bandwidth' on host 'uswb-idf-25.bose.com' looks like it was orphaned (results never came back; last_check=1426686341; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10110 Status' on host 'uswb-mdf-5510.bose.com' looks like it was orphaned (results never came back; last_check=1426686093; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10114 Status' on host 'uswb-mdf-5510.bose.com' looks like it was orphaned (results never came back; last_check=1426686153; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10116 Bandwidth' on host 'uswb-mdf-5510.bose.com' looks like it was orphaned (results never came back; last_check=1426686153; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10121 Status' on host 'uswb-mdf-5510.bose.com' looks like it was orphaned (results never came back; last_check=1426686117; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10604 Bandwidth' on host 'uswb-mdf-5510.bose.com' looks like it was orphaned (results never came back; last_check=1426686116; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10702 Status' on host 'uswb-mdf-5510.bose.com' looks like it was orphaned (results never came back; last_check=1426686093; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10138 Bandwidth' on host 'uswb-ocg-lab.bose.com' looks like it was orphaned (results never came back; last_check=1426685652; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10202 Status' on host 'uswb-ocg-lab.bose.com' looks like it was orphaned (results never came back; last_check=1426686093; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10602 Status' on host 'uswb-ocg-lab.bose.com' looks like it was orphaned (results never came back; last_check=1426685398; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10611 Bandwidth' on host 'uswb-ocg-lab.bose.com' looks like it was orphaned (results never came back; last_check=1426686117; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10619 Status' on host 'uswb-ocg-lab.bose.com' looks like it was orphaned (results never came back; last_check=1426686153; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10623 Bandwidth' on host 'uswb-ocg-lab.bose.com' looks like it was orphaned (results never came back; last_check=1426686093; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10624 Status' on host 'uswb-ocg-lab.bose.com' looks like it was orphaned (results never came back; last_check=1426686093; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10626 Bandwidth' on host 'uswb-ocg-lab.bose.com' looks like it was orphaned (results never came back; last_check=1426685396; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10627 Bandwidth' on host 'uswb-ocg-lab.bose.com' looks like it was orphaned (results never came back; last_check=1426685396; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10629 Status' on host 'uswb-ocg-lab.bose.com' looks like it was orphaned (results never came back; last_check=1426686153; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10630 Bandwidth' on host 'uswb-ocg-lab.bose.com' looks like it was orphaned (results never came back; last_check=1426685443; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10631 Status' on host 'uswb-ocg-lab.bose.com' looks like it was orphaned (results never came back; last_check=1426686135; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10636 Bandwidth' on host 'uswb-ocg-lab.bose.com' looks like it was orphaned (results never came back; last_check=1426686117; next_check=1426687855). I'm scheduling an immediate check of the service.
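
To get a per-host picture out of a long run of warnings like these, a minimal sketch along the following lines can count orphaned service checks per host and work out how stale each one was when it was logged. The regular expression is written against the service-check message format shown above only (orphaned host checks would need their own pattern), the function name is hypothetical, and the three sample lines are copied from the log above.

```python
import re
from collections import Counter

# Matches the orphaned *service* check warning format shown in the log above.
ORPHAN_RE = re.compile(
    r"\[(?P<logged>\d+)\] Warning: The check of service '(?P<service>[^']*)' "
    r"on host '(?P<host>[^']*)' looks like it was orphaned "
    r"\(results never came back; last_check=(?P<last_check>\d+); "
    r"next_check=(?P<next_check>\d+)\)"
)


def parse_orphans(log_text):
    """Return one dict per orphan warning found in the given log text."""
    orphans = []
    for line in log_text.splitlines():
        m = ORPHAN_RE.search(line)
        if m is None:
            continue
        orphans.append({
            "host": m.group("host"),
            "service": m.group("service"),
            # Seconds between the last check and the time the warning was logged.
            "age": int(m.group("logged")) - int(m.group("last_check")),
        })
    return orphans


SAMPLE = """\
[1426688576] Warning: The check of service 'Port 13137 Status' on host 'uswb-idf-25.bose.com' looks like it was orphaned (results never came back; last_check=1426686135; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 13649 Status' on host 'uswb-idf-25.bose.com' looks like it was orphaned (results never came back; last_check=1426685398; next_check=1426687855). I'm scheduling an immediate check of the service...
[1426688576] Warning: The check of service 'Port 10110 Status' on host 'uswb-mdf-5510.bose.com' looks like it was orphaned (results never came back; last_check=1426686093; next_check=1426687855). I'm scheduling an immediate check of the service...
"""

orphans = parse_orphans(SAMPLE)
per_host = Counter(o["host"] for o in orphans)
oldest = max(o["age"] for o in orphans)

for host, count in per_host.most_common():
    print(f"{host}: {count} orphaned service checks")
print(f"oldest last_check was {oldest} seconds before the warning")
```

In practice you would read the text from your nagios.log rather than from the embedded sample.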

Post by jdalrymple »

Why do you have the hostgroups commented out in your neb config? That is definitely causing some of the confusion:

# sets a list of hostgroups which will go into seperate
# queues. Either specify a comma seperated list or use
# multiple lines.
#hostgroups=name1
#hostgroups=name2,name3
#hostgroups=gearman_a
#hostgroups=gearman_b
#hostgroups=gearman_c
Remove the comment marks from the hostgroups lines for the gearmans you expect to have work allocated to them, then restart Nagios.
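
As a quick sanity check before and after editing, here is a minimal sketch that lists which hostgroups entries in a NEB config are active and which are still commented out. It assumes only the `hostgroups=` syntax shown in the snippet above, the function name is hypothetical, and the sample text is that same snippet; in practice you would feed it the contents of your own config file.

```python
def hostgroup_entries(config_text):
    """Return (active, commented) hostgroup names from hostgroups= lines."""
    active, commented = [], []
    for raw in config_text.splitlines():
        line = raw.strip()
        is_commented = line.startswith("#")
        line = line.lstrip("#").strip()
        if not line.startswith("hostgroups="):
            continue  # ordinary comment or unrelated setting
        value = line.split("=", 1)[1]
        names = [n.strip() for n in value.split(",") if n.strip()]
        (commented if is_commented else active).extend(names)
    return active, commented


# Snippet copied from the NEB config quoted above.
SAMPLE = """\
# sets a list of hostgroups which will go into seperate
# queues. Either specify a comma seperated list or use
# multiple lines.
#hostgroups=name1
#hostgroups=name2,name3
#hostgroups=gearman_a
#hostgroups=gearman_b
#hostgroups=gearman_c
"""

active, commented = hostgroup_entries(SAMPLE)
print("active hostgroups:", active)
print("commented-out hostgroups:", commented)
```

An empty active list means no hostgroup queues are configured in that file.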

Post by bosecorp »

Which of the NEB configs? All 3 of them?

Post by jdalrymple »

Just the one on the host from which you're running gearman_top, the job server.

Post by bosecorp »

Done.

Question: why not on the worker servers?