Nagios scheduling queue
Hi,
I am using the following setup, with Nagios 3.5.1 and the latest mod_gearman:
on Redhat 6.5: gearmand, mod_gearman_neb + nagios + mod_gearman worker
on Debian 7.8: mod_gearman worker
There is a check which should run quite frequently, every 5 minutes. It checks the interfaces of the network devices. I am using check_multi
to collect all the interface checks per device. What I noticed when I looked at the Scheduling Queue of Nagios is the following (an example):
the interface check ran at 12:19, but the next check is scheduled for 12:44. Is this related to mod_gearman, or is it a Nagios bug or a configuration bug?
I noticed the problem when I was checking the age of the rrd files: they had not been updated for more than 40 minutes (which is not good).
I attach the following 2 pictures as an example:
int.jpg:
this is the snippet from the scheduling queue. This check should happen every 5 minutes. Why did Nagios schedule it 25 minutes away?
What can be the problem?
rrd-chk.jpg:
this check monitors how often certain rrd files are updated.
Can anybody help?
Thank you in advance,
klajosh
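To put a number on the gap described in the post above, here is a minimal sketch that compares the scheduled gap with the configured check interval. The function name `scheduling_gap` is hypothetical, and it assumes both times fall on the same day and are given as HH:MM, as in the scheduling queue example.

```python
from datetime import datetime, timedelta


def scheduling_gap(last_check: str, next_check: str, interval_minutes: int):
    """Return (gap, excess) for two HH:MM times on the same day.

    gap    = time between the last check and the scheduled next check
    excess = how much longer that gap is than the configured interval
    """
    fmt = "%H:%M"
    gap = datetime.strptime(next_check, fmt) - datetime.strptime(last_check, fmt)
    excess = gap - timedelta(minutes=interval_minutes)
    return gap, excess


# Values taken from the post: last run 12:19, next run 12:44, 5-minute interval.
gap, excess = scheduling_gap("12:19", "12:44", interval_minutes=5)
print("gap:", gap, "excess over interval:", excess)
```

A non-zero excess on many services at once would point at the scheduler or the result pipeline rather than at one misconfigured service.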
Re: Nagios scheduling queue
The first thing I'd take a look at is the clock skew across the gearman workers and the Nagios box. I've seen where results come back from a gearman worker whose clock is askew and that impacts the next run time for a host/service check.
Re: Nagios scheduling queue
This could be a good idea, but the thing is that I have to monitor devices in different geographical locations. I solved this with one main Nagios server
and several mod_gearman collectors. Those collectors not only poll the devices but also do other things (like running an internal webserver). So in short: some
of the collectors are in different time zones, and the time zone setting cannot be changed on those machines.
On the other hand, this problem also happens on collectors that are in the same time zone as the main Nagios server and have the same time settings.
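On the time zone question: clock skew and time zone are separate things. A minimal sketch, assuming the timestamps being compared are Unix epoch seconds (which do not depend on the local time zone); the date, the UTC-5 offset and the 90-second drift are made-up illustration values, and `skew_seconds` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone


def skew_seconds(server_epoch: float, worker_epoch: float) -> float:
    """Positive result means the worker's clock is ahead of the server's."""
    return worker_epoch - server_epoch


# The same instant, expressed in UTC and in a zone five hours behind UTC.
server_time = datetime(2015, 3, 1, 12, 19, 0, tzinfo=timezone.utc)
worker_time = server_time.astimezone(timezone(timedelta(hours=-5)))

# Different wall-clock hours, but the epoch value is identical: no skew.
zone_only_skew = skew_seconds(server_time.timestamp(), worker_time.timestamp())

# A worker whose clock really runs 90 seconds fast does show up as skew.
drifting_skew = skew_seconds(server_time.timestamp(), worker_time.timestamp() + 90)

print(zone_only_skew, drifting_skew)
```

So collectors in other time zones are not a problem by themselves; what matters is whether each machine's clock is kept in sync (for example via NTP).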
Re: Nagios scheduling queue
Could you post your mod gearman worker and server config files and the service check that is having problems?
Re: Nagios scheduling queue
(Sorry for the late answer, I have been quite busy lately.)
It turned out that I cannot narrow the problem down to a specific service. There are services that are randomly abandoned by the Nagios scheduler
(instead of being checked every 3 minutes, the gap between two checks is 30 minutes).
My guess is that the root of the problem is that I have too many checks running too often, and Nagios core with service_inter_check_delay_method=s
cannot handle that. I mean, Nagios core sees the whole monitoring environment as a single-server environment, but I currently have 7 pollers/collectors (call them whatever you want) in 5 different
locations doing checks for one main server, and Nagios core wants to protect this server from load.
This is just an idea; I do not know whether it is true.
So I think that the problem is not in mod_gearman but in Nagios core (3.5.1), which does not schedule the checks properly.
What do you think?
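For context on service_inter_check_delay_method=s: as far as the Nagios 3 documentation describes it, the "smart" method divides the average check interval by the total number of services, and the result is only used to spread out the initial checks after a (re)start. The sketch below reproduces that formula; the service counts and intervals are made-up illustration values, not figures from this setup, and `smart_inter_check_delay` is a hypothetical name.

```python
def smart_inter_check_delay(check_intervals_seconds):
    """Documented 'smart' formula: average check interval / number of services."""
    total_services = len(check_intervals_seconds)
    average_interval = sum(check_intervals_seconds) / total_services
    return average_interval / total_services


# Hypothetical environment: 2000 services at 5 minutes, 1000 at 3 minutes.
intervals = [300.0] * 2000 + [180.0] * 1000
delay = smart_inter_check_delay(intervals)
print(f"initial spacing between service checks: {delay:.4f} s")
```

If that reading of the documentation is right, this setting alone would not explain a 30-minute gap between later runs of a service that is already in the queue.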
Re: Nagios scheduling queue
If you view the extended details of one of your affected services, what do you see? Please post a screenshot similar to the following:
I am interested in the check latency/duration - perhaps the check is taking a long time to execute? It's also possible that the latency is high.
Is there any excessive load on the server - are the resources being starved?
Code:
top
free -m
Re: Nagios scheduling queue
Hi,
the machine is definitely overloaded (it has 4 CPUs):
Code:
# w
15:34:17 up 25 days, 1:48, 1 user, load average: 4.01, 3.48, 3.22
Please check the attachments.
klajosh
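One way to read those numbers is to divide the load averages by the CPU count. A minimal sketch that parses the `w` line quoted above; `load_per_cpu` is a hypothetical helper, and the CPU count of 4 comes from the post.

```python
import re


def load_per_cpu(uptime_line: str, cpu_count: int):
    """Return the 1, 5 and 15 minute load averages divided by cpu_count."""
    match = re.search(
        r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", uptime_line
    )
    if match is None:
        raise ValueError("no load average found in line")
    return tuple(float(value) / cpu_count for value in match.groups())


line = "15:34:17 up 25 days, 1:48, 1 user, load average: 4.01, 3.48, 3.22"
one_min, five_min, fifteen_min = load_per_cpu(line, cpu_count=4)
print(one_min, five_min, fifteen_min)
```

A value around 1.0 per CPU suggests the cores are roughly fully used on average. On Linux the load average also counts processes waiting on disk I/O, so `top` is still needed to see whether CPU or I/O is the bottleneck.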
Re: Nagios scheduling queue
same service:
Code:
Check Latency / Duration: 1.823 / 34.001 seconds
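A 34-second duration matters because each running check occupies a worker slot for that long. As a rough back-of-the-envelope sketch (Little's law), the number of checks in flight at any moment is about services × duration / interval. The service count of 500 and the 3-minute interval below are assumptions for illustration only, and `workers_needed` is a hypothetical name.

```python
import math


def workers_needed(service_count: int, avg_duration_s: float, interval_s: float) -> int:
    """Estimate how many checks run concurrently on average (rounded up)."""
    return math.ceil(service_count * avg_duration_s / interval_s)


# Duration taken from the post; service count and interval are assumed.
slots = workers_needed(service_count=500, avg_duration_s=34.001, interval_s=180)
print("worker slots busy on average:", slots)
```

If the combined worker limit across all collectors is below such an estimate, results would arrive late even though the scheduler itself is doing the right thing.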
Re: Nagios scheduling queue
It's possible your issues are being caused by the performance of your box.
Upgrade to Nagios 4.x - Nagios 4.x is much faster than Nagios 3.x - this is mostly due to the introduction of 'Core Workers'. You can read more about the enhancements of 4 here: http://labs.nagios.com/2013/09/20/nagio ... available/
Some other performance tweaks: http://nagios.sourceforge.net/docs/nagi ... uning.html
While I do think that service_inter_check_delay_method=s could have something to do with the issues described here, I think that upgrading to 4.x has the possibility of helping you out the most.
As for why the Nagios server is scheduling the checks so far out, could you please post a service configuration of one of the affected services?
Re: Nagios scheduling queue
There are a few things holding me back from upgrading to Nagios 4.0.8:
- PNP4Nagios Broker Module npcdmod.o is not compatible with Nagios Core 4.x
and
- Mod-Gearman works best since version 3.2.2 up to the latest stable Nagios 3.5.1. Nagios 4 is not fully tested yet,
and my environment relies heavily on these 2 broker modules.
So I think my hands are tied here.
I attach a graph that visualizes host/service latency based on nagiostats.