Re: [Nagios-devel] Max concurrent checks - spreading the next_time

Post by **Guest** » Mon Jun 15, 2009 11:03 am

On 13 Jun 2009, at 10:29, Hiren Patel wrote:

> Ton Voon wrote:
>> This is the test case:
>> * set max_concurrent_checks=1 in nagios.cfg
>> * create a host with 3 services with a check_interval of 1 minute
>> * restart nagios
>> * go to the host page and schedule a check for all services on
>> the host (this makes all the services run at the same time)
>> * tail nagios.log. Should see "Max concurrent service checks (1)
>> has been reached"
>> * on the host page, notice the last run time. Only one will be
>> updated after 1 minute. All services get scheduled for the next
>> time at the same time, and after the next minute, only one of
>> those will have the last check time changed
>>
> Yes, exactly the behavior you describe. I set up a standalone machine
> running the default checks against itself, and the queue shows them
> all scheduled for the same time in the next minute. The log
> entries also appear as you describe.

Thanks for testing.

>> I've just committed a patch into CVS HEAD. This nudges the time
>> ahead by 5 + random(10) seconds. I've also included a test case
>> which ensures that the nudge factor is added in these cases.
>> nagios.log will also have an entry which lists the affected
>> service. If you get this message a lot on a regular system, then
>> you need to consider increasing the max_concurrent_checks value.
>> I'd be grateful if you could try this out.
>>
> With the patch, I see the checks spread out in the queue now, and all
> the services are checked sooner than without the patch, at least as
> far as I noticed. There is one odd behavior: with the default tests
> running, one check kept getting nudged and, as a result, wasn't run
> for a while. Attached is the nagios.log; the first two restarts are
> without the patch, the rest with the patch. For the entire time I ran
> with the patch, the "current users" check was never run. Am I doing
> something wrong in testing this, though?

That is the correct current behaviour: a service can keep getting
nudged if, each time it comes up, it is scheduled at the same time as
something else that sits ahead of it in the queue.

I think this is poor behaviour, but it is a side effect of how this
currently works.
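
To make the side effect concrete, here is a toy model of the nudge, not
the Nagios code itself. It assumes "random(10)" yields 0 to 9, that a
running check holds its slot for a fixed number of seconds, and that
events due at the same time run in queue order. The names simulate and
nudged_time are made up for this sketch.

```python
import heapq
import random


def nudged_time(when, rng):
    # 5 + random(10) seconds ahead; assumed here to mean 5..14 inclusive.
    return when + 5 + rng.randrange(10)


def simulate(services, interval=60, duration=20, max_concurrent=1,
             horizon=600, seed=1):
    """Count how often each service runs and how often it is nudged."""
    rng = random.Random(seed)
    # Queue entries are (time, insertion order, name), so entries due at
    # the same time come out in the order they were queued.
    queue = [(0, i, name) for i, name in enumerate(services)]
    heapq.heapify(queue)
    seq = len(services)
    running_until = []  # finish times of checks still executing
    runs = {name: 0 for name in services}
    nudges = {name: 0 for name in services}
    while queue and queue[0][0] < horizon:
        now, _, name = heapq.heappop(queue)
        running_until = [t for t in running_until if t > now]
        if len(running_until) >= max_concurrent:
            # Limit reached: push this check ahead instead of running it.
            nudges[name] += 1
            next_time = nudged_time(now, rng)
        else:
            runs[name] += 1
            running_until.append(now + duration)
            next_time = now + interval
        heapq.heappush(queue, (next_time, seq, name))
        seq += 1
    return runs, nudges


runs, nudges = simulate(["ping", "load", "current users"])
print(runs, nudges)
```

Whichever service is behind another one in the queue at the moment the
slot is busy gets pushed ahead again, and nothing guarantees it will
find the slot free on its next attempt.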

>> Thinking some more, setting the next check time ahead doesn't
>> really make sense, because the latency value does not reflect the
>> fact that this active service's check time was delayed. Maybe this
>> should be implemented as a remove of the event from the queue, and
>> then re-added with a nudged event run time but the old
>> service->next_check time.
>> Anyhow, this should be better than it was.
> Agreed about the latency, although it does log the incident, so
> users should be able to work out why their checks are running a
> little late. I'm not sure about the event queue and how it works
> yet; I haven't looked at this part of nagios.
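
The latency point can be shown with a small sketch. It assumes latency
is simply the actual start time minus next_check, which is a
simplification of what Nagios does. Service, latency, nudge_next_check
and nudge_event_only are names invented for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Service:
    next_check: int  # when the check was meant to run


def latency(svc, started_at):
    # Simplified: delay between the intended and the actual start.
    return started_at - svc.next_check


def nudge_next_check(svc, delay):
    # What the patch does: move next_check itself ahead.
    svc.next_check += delay
    return svc.next_check  # the event runs at the new next_check


def nudge_event_only(svc, delay):
    # The alternative: delay the event but leave next_check alone.
    return svc.next_check + delay


a = Service(next_check=100)
print(latency(a, nudge_next_check(a, 8)))   # delay is hidden
b = Service(next_check=100)
print(latency(b, nudge_event_only(b, 8)))   # delay shows up as latency
```

In the first case the delay never appears in the latency figure; in the
second it does.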

I've been thinking a lot about this problem and I think this
functionality is poorly implemented. However I'd like some consensus
before making a major change to how this part works.

My thinking is that:
* if the limit is reached, add the "service reaper" event to the top
of the event queue
* nagios will then loop between this new service reaper event and
trying to execute the next service
* latency will go through the roof, but that is what you'd expect
if you said "only 1 service check is executing at a time"
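
A rough sketch of that idea, again as a toy model and not a patch. It
assumes every check takes a fixed time and collapses the "loop between
the reaper and the next service" into a jump to the moment the earliest
running check finishes. run_with_reaper is a hypothetical name.

```python
from collections import deque


def run_with_reaper(services, duration=20, max_concurrent=1):
    """Return the latency of each service when all are due at time 0."""
    now = 0
    running_until = []          # finish times of executing checks
    pending = deque(services)   # checks waiting to run, in queue order
    latency = {}
    while pending:
        if len(running_until) >= max_concurrent:
            # Limit reached: the reaper goes to the top of the queue.
            # Modelled as waiting for the earliest check to finish and
            # freeing its slot, then retrying the same service.
            earliest = min(running_until)
            now = max(now, earliest)
            running_until.remove(earliest)
            continue
        name = pending.popleft()
        latency[name] = now     # due at 0, actually started at `now`
        running_until.append(now + duration)
    return latency


print(run_with_reaper(["ping", "load", "current users"]))
```

No check is skipped or overtaken here; each one just starts later than
the one before it, and the growing latency reports that honestly.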

Ton

This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]