host checks running uncontrolled

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
Mitchell
Posts: 130
Joined: Thu Jan 05, 2012 2:33 am

Post by Mitchell »

I have a very weird issue with my XI installation. The installation is small (550 hosts and 900 services, with MySQL offloaded to another VM). I am also using the mod_gearman server and client, both installed on the same machine.
The host checks run with the following configuration (I confirmed the configuration of all hosts using the backend API):
check_interval 3 min
retry_interval 3 min
max_check_attempts 2

After a Nagios process restart, the number of host checks (per the 'Monitoring Engine Check Statistics' dashlet), the load on the server, the number of mod_gearman workers (around 10), and the 'Monitoring Engine Event Queue' all look normal.
ChecksAfterRestart.png
EventQueueAfterRestart.png

Then I see a steady increase in the 'Monitoring Engine Event Queue', the load on the server, the number of concurrent mod_gearman worker processes, and the 'Monitoring Engine Check Statistics'. They keep growing until I restart, which I have to do within 4-5 days.
uncontrolledChecks.png
Can you please explain what could be happening here? How do I track what is wrong?

Regards
Ashish
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: host checks running uncontrolled

Post by scottwilkerson »

My guess is that a filter in mod_gearman is set up incorrectly, so some of the checks cannot be processed by any of the workers.

You may be able to get a clue by looking at the mod_gearman logs.
Former Nagios employee
Mitchell
Posts: 130
Joined: Thu Jan 05, 2012 2:33 am

Re: host checks running uncontrolled

Post by Mitchell »

We are using a filter only for localhostgroup (which contains only localhost, to monitor gearman itself).

I think the issue is extra checks that should not run at all. Would bad filters result in extra checks?

Regards
Ashish
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: host checks running uncontrolled

Post by scottwilkerson »

Bad filters could make it so that none of the checks get processed and they stay in the gearman queue. For example, if a gearman worker is set up to process only items from a certain group and the checks are not in that group, they will sit in the queue waiting for one of the workers to accept them.
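
To illustrate the point, here is a minimal Python sketch of the routing idea. It is not mod_gearman's actual code. It assumes that checks for a filtered hostgroup go to a queue named `hostgroup_<name>` and that a worker only pulls jobs from the queues it registered for; `stuck_queues`, `hostgroup_dmz`, and the sample data are hypothetical.

```python
from collections import defaultdict

def stuck_queues(jobs, workers):
    """Return {queue: waiting_job_count} for queues that no worker serves.

    jobs    -- iterable of queue names, one entry per submitted job
    workers -- iterable of sets, each holding the queues one worker registered for
    """
    served = set()
    for queues in workers:
        served |= set(queues)

    waiting = defaultdict(int)
    for queue in jobs:
        if queue not in served:
            waiting[queue] += 1
    return dict(waiting)

# Hypothetical setup: two host checks are routed to a hostgroup queue,
# but the only worker registered for the generic queues.
jobs = ["host", "service", "hostgroup_dmz", "hostgroup_dmz"]
workers = [{"host", "service", "eventhandler"}]
print(stuck_queues(jobs, workers))
```

In this made-up setup the two `hostgroup_dmz` jobs are never picked up, which would show up as a growing "Jobs Waiting" count for that queue.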
Mitchell
Posts: 130
Joined: Thu Jan 05, 2012 2:33 am

Re: host checks running uncontrolled

Post by Mitchell »

2012-09-07 10:32:10 - localhost:4730 - v0.25

Queue Name                       | Worker Available | Jobs Waiting | Jobs Running
---------------------------------------------------------------------------------
check_results                    |                1 |            0 |            0
eventhandler                     |               24 |            0 |            0
host                             |               24 |            0 |            0
service                          |               24 |            0 |            0
worker_pnagios03lxv.mitchell.com |                1 |            0 |            0
---------------------------------------------------------------------------------

Okay. I am not seeing any checks staying in the queue on gearman. I also had orphan checks enabled, with a max age timeout, to detect that situation.

What I am actually seeing is a lot of host checks being executed, according to the 'Monitoring Engine Check Statistics' dashlet.
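
For watching this over several days rather than eyeballing one snapshot, a queue table like the one above can be checked with a small script. This is only a sketch: it assumes the output has been captured as text in the pipe-separated layout shown in this post, and `parse_gearman_top` and `backlog` are hypothetical helper names.

```python
def parse_gearman_top(text):
    """Parse a pipe-separated queue table into a list of dicts."""
    rows = []
    for line in text.splitlines():
        parts = [p.strip() for p in line.split("|")]
        # Data rows have exactly four fields, the last three being counters.
        # The header, the timestamp line, and the dashed rules are skipped.
        if len(parts) != 4 or not all(p.isdigit() for p in parts[1:]):
            continue
        name, workers, waiting, running = parts
        rows.append({
            "queue": name,
            "workers": int(workers),
            "waiting": int(waiting),
            "running": int(running),
        })
    return rows

def backlog(rows):
    """Return the names of queues that have jobs waiting."""
    return [r["queue"] for r in rows if r["waiting"] > 0]

sample = """\
Queue Name | Worker Available | Jobs Waiting | Jobs Running
check_results | 1 | 0 | 0
host | 24 | 0 | 0
service | 24 | 0 | 0
"""
rows = parse_gearman_top(sample)
print(backlog(rows))
```

An empty list means no queue has jobs waiting, which matches the snapshot in this post.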
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: host checks running uncontrolled

Post by scottwilkerson »

How many hosts are on the system, and what is the average frequency of checks?
Mitchell
Posts: 130
Joined: Thu Jan 05, 2012 2:33 am

Re: host checks running uncontrolled

Post by Mitchell »

542 hosts
check_interval 3 min
retry_interval 3 min
max_check_attempts 2
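
As a rough baseline for "uncontrolled", the scheduled host check volume implied by these numbers can be estimated. The sketch below assumes every host is checked exactly once per interval, that the interval unit is one minute, and that there are no on-demand checks; since retry_interval equals check_interval here, retries should not change the rate. `expected_checks` is a hypothetical helper.

```python
def expected_checks(hosts, interval_min, window_min):
    """Scheduled active checks expected in a time window, assuming each
    host is checked exactly once per interval (no on-demand checks)."""
    return hosts * window_min / interval_min

hosts = 542        # host count from this post
interval_min = 3   # check_interval from this post

for window in (1, 5, 15):
    estimate = expected_checks(hosts, interval_min, window)
    print(f"{window:>2} min: ~{estimate:.0f} host checks")
```

Under those assumptions the baseline is about 181 host checks per minute, or 2710 per 15 minutes. Figures in the 'Monitoring Engine Check Statistics' dashlet that keep climbing well above that would point to checks beyond the schedule.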
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: host checks running uncontrolled

Post by mguthrie »

Could your machine handle turning off mod_gearman for a couple of hours, to see whether the issue still shows up without the event broker?
Mitchell
Posts: 130
Joined: Thu Jan 05, 2012 2:33 am

Re: host checks running uncontrolled

Post by Mitchell »

Okay. I actually turned off mod_gearman on Friday, and Saturday and Sunday went fine.
It seems that version 1.3.6 had an issue with duplicate data for checks that time out, which is fixed in version 1.3.8.
I have upgraded to version 1.3.8 and am watching to see whether it resolves the issue. So far it looks good. I will confirm in a few days.

Thanks
Ashish
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: host checks running uncontrolled

Post by mguthrie »

Good to know. Let us know what you find out.