Nagios scheduling queue

Post by **klajosh2** » Wed Apr 08, 2015 6:04 am

Hi,

My setup is Nagios 3.5.1 with the latest mod_gearman:

on Redhat 6.5: gearmand, mod_gearman_neb + nagios + mod_gearman worker
on Debian 7.8: mod_gearman worker

There is a check that should run quite frequently, every 5 minutes. It checks the interfaces of the network devices. I am using check_multi
to collect all the interface checks per device. When I looked at the Nagios Scheduling Queue, I noticed the following (an example):
the interface check ran at 12:19, and the next check is scheduled for 12:44. Is this related to mod_gearman, or is it a Nagios bug or a configuration error?
I noticed the problem when I was checking the age of the rrd files: they had not been updated for more than 40 minutes (which is not good).

I attach the following 2 pictures as an example:

int.jpg:
This is a snippet from the scheduling queue. This check should run every 5 minutes. Why did Nagios schedule it 25 minutes away?
What could be the problem?

rrd-chk.jpg:
This check monitors how often certain rrd files are updated.

Can anybody help?

Thank you in advance,

klajosh

Post by **jdalrymple** » Wed Apr 08, 2015 11:39 am

The first thing I'd take a look at is the clock skew across the gearman workers and the Nagios box. I've seen cases where results come back from a gearman worker whose clock is off, and that affects the next run time for a host/service check.
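
A rough way to check for skew, sketched in Python: collect the Unix epoch time from each machine at roughly the same moment (for example with `date +%s`) and compare it to the Nagios server's value. Epoch seconds are counted in UTC, so they should not depend on a machine's time zone setting. The host names, the numbers, and the 5-second tolerance below are made up for illustration.

```python
from typing import Dict, List


def clock_skews(reference_epoch: float, worker_epochs: Dict[str, float]) -> Dict[str, float]:
    """Return each worker's clock offset in seconds relative to the reference host."""
    return {name: epoch - reference_epoch for name, epoch in worker_epochs.items()}


def skewed_workers(skews: Dict[str, float], tolerance_s: float = 5.0) -> List[str]:
    """List workers whose absolute offset exceeds the tolerance (an assumed threshold)."""
    return sorted(name for name, skew in skews.items() if abs(skew) > tolerance_s)


if __name__ == "__main__":
    # Hypothetical values, as if gathered with `date +%s` on each host.
    nagios_epoch = 1428487440.0
    workers = {"worker-a": 1428487441.0, "worker-b": 1428488940.0}
    skews = clock_skews(nagios_epoch, workers)
    for name in skewed_workers(skews):
        print(f"{name} is off by {skews[name]:.0f} seconds")
```

If every worker is within a few seconds of the server, clock skew is probably not the cause.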

Post by **klajosh2** » Thu Apr 09, 2015 3:23 am

That could be a good idea, but I have to monitor devices in different geographical locations. I solved this with one main Nagios server
and several mod_gearman collectors. Those collectors do not just poll the devices but also do other things (like running an internal web server). In short: some
of the collectors are in different time zones, and the time zone setting cannot be changed on those machines.

On the other hand, this problem also happens on collectors that are in the same time zone as the main Nagios server and have the same time settings.

Post by **tgriep** » Thu Apr 09, 2015 4:53 pm

Could you post your mod_gearman worker and server config files and the service check that is having problems?

Post by **klajosh2** » Fri Apr 24, 2015 10:40 am

(Sorry for the late answer, I have been quite busy lately.)
It turned out that I cannot narrow the problem down to a specific service. Some services are randomly abandoned by the Nagios scheduler
(instead of being checked every 3 minutes, the gap between two checks is 30 minutes).
My guess is that the root of the problem is that I have too many checks running too often, and Nagios core with service_inter_check_delay_method=s
cannot handle that. I mean that Nagios core sees the whole monitoring environment as a single-server environment, but I currently have 7 pollers/collectors (call them whatever you want) in 5 different
locations doing checks for one main server, and Nagios core wants to protect this server from load.
This is just an idea; I do not know whether it is true.
So I think the problem is not in mod_gearman but in Nagios core (3.5.1), which does not schedule the checks properly.
What do you think?
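
To put rough numbers on that idea: as far as I understand the Nagios 3 documentation on check scheduling, the "smart" method spaces initial service checks by (average check interval) / (total number of services). The Python sketch below only does that arithmetic. The service count of 3000 is purely hypothetical (I have not stated the real number in this thread), and the function names are my own.

```python
def smart_inter_check_delay(total_services: int, avg_check_interval_s: float) -> float:
    """Delay in seconds between consecutive service checks under the assumed 'smart' formula."""
    if total_services <= 0:
        raise ValueError("total_services must be positive")
    return avg_check_interval_s / total_services


def required_checks_per_second(total_services: int, avg_check_interval_s: float) -> float:
    """Average check rate needed to keep every service on schedule."""
    if avg_check_interval_s <= 0:
        raise ValueError("avg_check_interval_s must be positive")
    return total_services / avg_check_interval_s


if __name__ == "__main__":
    services = 3000      # hypothetical service count
    interval_s = 180.0   # 3-minute check interval, as mentioned above
    print(f"inter-check delay: {smart_inter_check_delay(services, interval_s):.3f} s")
    print(f"required rate:     {required_checks_per_second(services, interval_s):.1f} checks/s")
```

If the required rate is higher than what the core and the workers can sustain together, checks would be expected to fall behind schedule.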

Post by **jolson** » Fri Apr 24, 2015 12:17 pm

If you view the extended details of one of your affected services, what do you see? Please post a screenshot similar to the following:

I am interested in the check latency/duration - perhaps the check is taking a long time to execute? It's also possible that the latency is high.

Is there any excessive load on the server - are the resources being starved?

Code:

top
free -m

Post by **klajosh2** » Mon Apr 27, 2015 9:20 am

Hi,

the machine is definitely overloaded:

Code:

# w
 15:34:17 up 25 days,  1:48,  1 user,  load average: 4.01, 3.48, 3.22

It has 4 CPUs.
Please check the attachments.

klajosh
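
For context, a small Python sketch that normalizes the load averages above by the CPU count. A value near 1.0 per CPU suggests the machine is roughly saturated. The parsing assumes the usual English `w`/`uptime` output format with a dot as the decimal separator; the function names are my own.

```python
import re
from typing import Tuple


def parse_load_averages(uptime_line: str) -> Tuple[float, float, float]:
    """Extract the 1-, 5- and 15-minute load averages from a `w` or `uptime` header line."""
    match = re.search(r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", uptime_line)
    if match is None:
        raise ValueError("no load averages found in line")
    one, five, fifteen = (float(value) for value in match.groups())
    return one, five, fifteen


def load_per_cpu(loads: Tuple[float, float, float], cpu_count: int) -> Tuple[float, float, float]:
    """Divide each load average by the number of CPUs."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    one, five, fifteen = loads
    return one / cpu_count, five / cpu_count, fifteen / cpu_count


if __name__ == "__main__":
    line = " 15:34:17 up 25 days,  1:48,  1 user,  load average: 4.01, 3.48, 3.22"
    per_cpu = load_per_cpu(parse_load_averages(line), cpu_count=4)
    print("load per CPU (1/5/15 min): " + ", ".join(f"{value:.2f}" for value in per_cpu))
```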

Post by **klajosh2** » Mon Apr 27, 2015 9:42 am

Same service:

Code:

Check Latency / Duration: 1.823 / 34.001 seconds
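
To relate these two figures to the schedule, here is a small Python sketch that parses the line above and expresses the execution time as a share of the check interval. The 180-second interval is the 3-minute interval mentioned earlier in the thread; the function names are my own. With these values neither the latency nor the duration alone seems to add up to a gap of 25 to 30 minutes.

```python
import re
from typing import Tuple


def parse_latency_duration(line: str) -> Tuple[float, float]:
    """Return (latency_s, duration_s) from a 'Check Latency / Duration' status line."""
    match = re.search(r"([\d.]+)\s*/\s*([\d.]+)\s*seconds", line)
    if match is None:
        raise ValueError("no latency/duration values found in line")
    return float(match.group(1)), float(match.group(2))


def duration_share_of_interval(duration_s: float, check_interval_s: float) -> float:
    """Fraction of the check interval that one execution of the check takes."""
    if check_interval_s <= 0:
        raise ValueError("check_interval_s must be positive")
    return duration_s / check_interval_s


if __name__ == "__main__":
    latency, duration = parse_latency_duration("Check Latency / Duration: 1.823 / 34.001 seconds")
    share = duration_share_of_interval(duration, check_interval_s=180.0)
    print(f"latency {latency} s, duration {duration} s, {share:.0%} of a 3-minute interval")
```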

Post by **jolson** » Mon Apr 27, 2015 1:49 pm

It's possible your issues are being caused by the performance of your box.
Upgrade to Nagios 4.x - Nagios 4.x is much faster than Nagios 3.x - this is mostly due to the introduction of 'Core Workers'. You can read more about the enhancements of 4 here: http://labs.nagios.com/2013/09/20/nagio ... available/
Some other performance tweaks: http://nagios.sourceforge.net/docs/nagi ... uning.html

While I do think that service_inter_check_delay_method=s could have something to do with the issues described here, I think that upgrading to 4.x has the possibility of helping you out the most.

As for why the Nagios server is scheduling the checks so far out, could you please post a service configuration of one of the affected services?

Post by **klajosh2** » Tue Apr 28, 2015 10:02 am

There are a few things holding me back from upgrading to Nagios 4.0.8:

- PNP4Nagios Broker Module npcdmod.o is not compatible with Nagios Core 4.x
and
- Mod-Gearman works best since version 3.2.2 up to the latest stable Nagios 3.5.1. Nagios 4 is not fully tested yet,

and my environment relies heavily on these 2 broker modules.

So I think my hands are tied here.

I attach a graph that visualizes host/service latency based on nagiostats.

Nagios Support Forum
