[Nagios-devel] Max concurrent checks - spreading the next_time

Guest · Post by **Guest** » Tue Jun 09, 2009 8:32 pm

Hi!

We've seen situations where this appears in the nagios.log:

Max concurrent service checks (50) has been reached. Delaying further
checks until previous checks are complete...

When switching on debugging, what we noticed is that services are
invoked all around the same time. I guess this happens when you have
selected a host and say "force check all services on this host".

Looking at the event code (base/events.c), it seems that when
max_concurrent_checks is reached, the service check is skipped and
rescheduled with a next check time of its old next_check plus the
regular check interval (or the retry interval for a non-OK soft state).
But every deferred check is shifted by the same kind of fixed offset, so
services that were due at the same moment are due at the same moment
again on the next round.

    /* reschedule the check if we can't run it now */
    if(run_event==FALSE){
        /* remove the service check from the event queue and reschedule it for a later time */
        /* 12/20/05 since event was not executed, it needs to be remove()'ed to maintain sync with event broker modules */
        temp_event=event_list_low;
        remove_event(temp_event,&event_list_low,&event_list_low_tail);
        if(temp_service->state_type==SOFT_STATE && temp_service->current_state!=STATE_OK)
            temp_service->next_check=(time_t)(temp_service->next_check+(temp_service->retry_interval*interval_length));
        else
            temp_service->next_check=(time_t)(temp_service->next_check+(temp_service->check_interval*interval_length));
        temp_event->run_time=temp_service->next_check;
        reschedule_event(temp_event,&event_list_low,&event_list_low_tail);
        update_service_status(temp_service,FALSE);
        run_event=FALSE;
    }

I propose that instead of setting next_time = next_time +
check_interval, a random factor is added, maybe something like:

next_time = now + max(5, min(int(rand(15)),
int(rand(retry_interval*interval_length))))

This means that the next check is moved at least 5 seconds away from
now (to ride out the temporary load from the number of concurrent
service checks), and at most about 15 seconds away (or less if the
retry_interval is lower).
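
To make the formula above concrete, here is a minimal C sketch of it. It is only an illustration and not a patch against base/events.c: rand_below(), jittered_delay() and jittered_next_check() are hypothetical helpers that do not exist in Nagios, and the parameter types (a double retry_interval and an int interval_length) are an assumption about the 3.x structures.

```c
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: pseudo-random integer in [0, upper).
 * Returns 0 when upper <= 0 so the caller never divides by zero. */
static long rand_below(long upper)
{
    if (upper <= 0)
        return 0;
    return (long)(rand() % upper);
}

/* Hypothetical helper implementing
 *   max(5, min(int(rand(15)), int(rand(retry_interval*interval_length))))
 * The result is a delay in seconds, always between 5 and 14 inclusive. */
static long jittered_delay(double retry_interval, int interval_length)
{
    long a = rand_below(15);
    long b = rand_below((long)(retry_interval * interval_length));
    long delay = (a < b) ? a : b;   /* min of the two random draws */

    return (delay < 5) ? 5 : delay; /* never sooner than 5 seconds */
}

/* Hypothetical helper: next check time measured from "now" rather than
 * from the old next_check value. */
static time_t jittered_next_check(time_t now, double retry_interval,
                                  int interval_length)
{
    return now + (time_t)jittered_delay(retry_interval, interval_length);
}
```

In the rescheduling branch this would be used roughly as temp_service->next_check = jittered_next_check(time(NULL), temp_service->retry_interval, interval_length), with srand() called once at startup. One thing the sketch makes visible: because everything below 5 is clamped up to 5, a fair share of the deferred checks still land on exactly now + 5, so the spread is narrower than the 5 to 15 second window suggests.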

Thoughts?

Ton

This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]