host checks running uncontrolled
Posted: Thu Sep 06, 2012 5:24 pm
by Mitchell
I have a very weird issue going on with my XI installation. The installation is small (550 hosts and 900 services, with MySQL offloaded to another VM). I am also using the mod_gearman server and client, installed on the same machine.
The host checks are running with the following configuration (confirmed for all hosts using the backend API):
check_interval 3 min
retry_interval 3 min
max_check_attempts 2
The number of host checks (per the 'Monitoring Engine Check Statistics' dashlet), the load on the server, the number of mod_gearman workers (around 10), and the 'Monitoring Engine Event Queue' all look normal after a Nagios process restart.
ChecksAfterRestart.png
EventQueueAfterRestart.png
Then I see a steady increase in the 'Monitoring Engine Event Queue', the load on the server, the number of concurrent mod_gearman worker processes, and 'Monitoring Engine Check Statistics'. It keeps growing until I restart, which I end up doing within 4-5 days.
uncontrolledChecks.png
Can you please explain what could be happening here? How do I track what is wrong?
Regards
Ashish
Re: host checks running uncontrolled
Posted: Fri Sep 07, 2012 7:37 am
by scottwilkerson
My guess would be that a filter in mod_gearman is set up incorrectly and that some of the checks cannot be processed by any of the workers.
You may be able to get a clue looking at the mod_gearman logs.
Re: host checks running uncontrolled
Posted: Fri Sep 07, 2012 12:15 pm
by Mitchell
We are using a filter only for localhostgroup (which contains only localhost, to monitor gearman itself).
I think the issue is extra checks that should not be running. I am not sure whether bad filters would result in extra checks.
Regards
Ashish
Re: host checks running uncontrolled
Posted: Fri Sep 07, 2012 12:24 pm
by scottwilkerson
Bad filters could prevent some checks from being processed at all, so they stay in the gearman queue. For example, if a gearman worker is set up to process only items from a certain group and the checks are not in that group, each check will sit in the queue waiting for a worker to accept it.
Re: host checks running uncontrolled
Posted: Fri Sep 07, 2012 12:35 pm
by Mitchell
2012-09-07 10:32:10 - localhost:4730 - v0.25
Queue Name | Worker Available | Jobs Waiting | Jobs Running
-----------------------------------------------------------------------------------
check_results | 1 | 0 | 0
eventhandler | 24 | 0 | 0
host | 24 | 0 | 0
service | 24 | 0 | 0
worker_pnagios03lxv.mitchell.com | 1 | 0 | 0
-----------------------------------------------------------------------------------
Okay. I am not seeing any checks staying in the queue on gearman. I also had orphan checks enabled, with a max age timeout, to detect that situation.
What I am actually seeing is a lot of host checks being executed in the 'Monitoring Engine Check Statistics' dashlet.
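For anyone who wants to check a queue table like the one above mechanically, here is a minimal Python sketch. It assumes the pipe-delimited layout shown in this post, and it treats "jobs waiting with zero workers available" as the stuck condition described earlier in the thread. That definition, and the names `parse_queue_table` and `stuck_queues`, are illustrative assumptions and not part of mod_gearman.

```python
from typing import NamedTuple


class QueueStatus(NamedTuple):
    name: str
    workers: int
    waiting: int
    running: int


# Queue table copied from the post above.
SAMPLE = """\
Queue Name | Worker Available | Jobs Waiting | Jobs Running
-----------------------------------------------------------------------------------
check_results | 1 | 0 | 0
eventhandler | 24 | 0 | 0
host | 24 | 0 | 0
service | 24 | 0 | 0
worker_pnagios03lxv.mitchell.com | 1 | 0 | 0
-----------------------------------------------------------------------------------
"""


def parse_queue_table(text: str) -> list[QueueStatus]:
    """Parse pipe-delimited queue rows, skipping header and separator lines."""
    rows = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        if len(parts) != 4:
            continue  # separator or banner line
        name, *counts = parts
        if not all(c.isdigit() for c in counts):
            continue  # header row
        rows.append(QueueStatus(name, *map(int, counts)))
    return rows


def stuck_queues(rows: list[QueueStatus]) -> list[QueueStatus]:
    """Assumed definition: jobs are waiting but no worker serves the queue."""
    return [r for r in rows if r.waiting > 0 and r.workers == 0]


if __name__ == "__main__":
    for q in stuck_queues(parse_queue_table(SAMPLE)):
        print(f"{q.name}: {q.waiting} waiting, no workers")
```

Run against the table in this post it reports nothing, which matches the observation that no jobs are sitting in the queue.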
Re: host checks running uncontrolled
Posted: Fri Sep 07, 2012 1:24 pm
by scottwilkerson
How many hosts are on the system, and what is the average frequency of checks?
Re: host checks running uncontrolled
Posted: Fri Sep 07, 2012 3:32 pm
by Mitchell
542 hosts
check_interval 3 min
retry_interval 3 min
max_check_attempts 2
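As a rough baseline for comparing against the dashlet, the scheduled host check rate implied by these numbers can be worked out directly. This is a sketch that assumes every host is actively checked once per check_interval and ignores on-demand host checks and retries; the function name `expected_host_checks` is made up for illustration.

```python
def expected_host_checks(hosts: int, check_interval_min: float, window_min: float) -> float:
    """Scheduled active host checks expected within a window of window_min minutes."""
    return hosts * window_min / check_interval_min


if __name__ == "__main__":
    # 542 hosts, check_interval of 3 minutes, as stated in this post.
    for window in (1, 5, 15):
        count = expected_host_checks(542, 3, window)
        print(f"{window:>2} min window: about {count:.0f} scheduled host checks")
```

If the dashlet shows counts far above this baseline for the same window, that would be consistent with extra checks being run rather than with checks piling up in a queue.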
Re: host checks running uncontrolled
Posted: Mon Sep 10, 2012 9:25 am
by mguthrie
Could your machine handle turning off mod_gearman for a couple of hours, to see if the issue still shows up without the event broker?
Re: host checks running uncontrolled
Posted: Mon Sep 10, 2012 6:45 pm
by Mitchell
Okay. I actually turned off mod_gearman on Friday, and Saturday and Sunday went fine.
It seems that version 1.3.6 had issues with duplicate data for checks that time out, which are fixed in version 1.3.8.
I upgraded to version 1.3.8 and am watching to see whether it resolves the issue. So far it is looking good. I will confirm in a few days.
Thanks
Ashish
Re: host checks running uncontrolled
Posted: Tue Sep 11, 2012 9:25 am
by mguthrie
Good to know. Let us know what you find out.