Re: [Nagios-devel] [PATCH] Re: alternative scheduler
Posted: Thu Dec 02, 2010 9:47 am
On 12/02/2010 10:03 AM, Jochen Bern wrote:
> On 12/01/2010 08:55 PM, Adam Augustine wrote:
>> While DNX and mod_gearman do implement that specific functionality,
>> they are still subject to the scheduler/reaper bottlenecks. We (the
>> institution that started the DNX project) have played around with the
>> check scheduling parameters quite a bit over the years and even with
>> our best scheduling parameters and DNX actually executing the plugins,
>> we still see checks scheduled such that we have a large number of
>> checks scheduled to execute in a single second with several seconds
>> (3-5) of nothing scheduled to execute between.
>
> Agreed. That's also the reason why I don't use either so far; I don't
> have a problem (yet ...) with the short-term scheduling (scheduling "due
> now" checks onto executors), but I see unnecessary churn in the mid-term
> scheduling (schedule next due time of checks just completed).
>
> Unless I *really* need new glasses, there are only three different kinds
> of such rescheduling code in the 3.2.x Nagios core:
>
> 1. Reschedule *exactly* check_interval / retry_interval from last due
> time (iff check_period allows this) - e.g., base/checks.c::1301ff :
>
> if(reschedule_check==TRUE)
>     next_service_check=(time_t)(temp_service->last_check
>         +(temp_service->check_interval*interval_length));
>
This could be changed trivially by adding a random component to the
scheduled time and shifting the check backwards by half of that random
flex window, so that the average interval stays the same. That shouldn't
really be necessary, though. See below.
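A minimal sketch of that idea, assuming a hypothetical helper
jittered_next_check() and a flex window given in seconds (neither exists
in the Nagios source):

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/*
 * Hypothetical helper, not part of Nagios: shift base_time by a random
 * offset taken from a window of `flex` seconds that is roughly centred on
 * base_time, so the long-run average interval is unchanged.
 */
static time_t jittered_next_check(time_t base_time, int flex)
{
	assert(flex >= 0 && flex < RAND_MAX);

	/* rand() % (flex + 1) yields 0..flex; subtracting flex / 2 centres it. */
	int offset = rand() % (flex + 1) - flex / 2;

	return base_time + offset;
}
```

With flex set to 60, a check nominally due at T would land somewhere in
T-30 .. T+30.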
> 2. Reschedule to the *very first second* permitted by check_period -
> e.g., base/checks.c::278ff :
>
> /* make sure we rescheduled the next service check at a valid time */
> get_next_valid_time(preferred_time,
> &next_valid_time,svc->check_period_ptr);
> [...]
> svc->next_check=next_valid_time;
>
Here we could do a similar tweak, adding a random number of seconds
between 0 and 60 to the scheduled time. It wouldn't be perfect, but it
would be better than the current scheme, and with a half-decent PRNG it
would mean checks would stay smoothed out for the duration of Nagios'
lifespan.
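As a rough sketch, assuming a hypothetical helper
spread_after_period_start() that is applied to the result of
get_next_valid_time() (the helper and its max_spread parameter are
illustrative, not Nagios code):

```c
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/*
 * Hypothetical helper, not part of Nagios: move a check that would land on
 * the first valid second of its check_period forward by 0..max_spread
 * seconds, so that checks sharing a period do not all start at once.
 */
static time_t spread_after_period_start(time_t next_valid_time, int max_spread)
{
	assert(max_spread >= 0 && max_spread < RAND_MAX);

	return next_valid_time + rand() % (max_spread + 1);
}
```

The caller would then assign something like
spread_after_period_start(next_valid_time, 60) to svc->next_check. The
sketch does not verify that the shifted time is still inside the
check_period.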
> 3. Special (error) cases falling back to some hardcoded "check interval"
> (five minutes, one week, ...).
>
These would benefit from just being rescheduled the normal way and pushed
forward by check_interval number of seconds each time they're supposed to
run.
> None of these cases even *looks* at the list of already-scheduled check
> executions around the target time, much less does any smoothing.
>
> (For sake of completeness: A smoothing algorithm IMHO should:
> Case 1: *Decrease* next_check for at most a certain percentage of
> check_interval/retry_interval, so as to avoid consecutive faults in
> freshness checks and performance data processing (in the case of RRDs,
> violation of xff);
Not a percentage. A fixed time would be easier to implement and would
also behave better, since it would be a lot less surprising to users.
> Case 2: *Increase* next_check so as to stay within the check_period, but
> determining a max increment which simultaneously smoothes out the
> (potentially MANY) affected checks and avoids pushing the chain of
> subsequent processing (retry_interval / max_check_attempts if found
> non-OK, running event handlers, ...) *beyond* the valid timeframe is
> definitely nontrivial.)
>
Not really. The simple way of doing it is like so:
    struct scheduled_thingie sched_queue[1024];
    uint lowest = maxuint;
    for (i = scheduled_time; i [...] when > 1023) {
        add_expensively_in_linked_list(sched_item);
    } else {
        sched_item->next = sched_queue[lowest].list;
    }
    sched_queue[lowest].list = sched_item;
When running checks, one simply has to grab the items in
sched_queue[sched_last_when].list and run the events there until a
time is encountered that doesn't match time(NULL), and then w
...[email truncated]...
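The fragment above is incomplete in the archive, so the following is only
a sketch of the general bucket idea, not a reconstruction of the original
code. All names (sched_item, sched_slot, sched_add, sched_take_due) and
the per-slot counter are assumptions. It also indexes slots by time modulo
the array size instead of using the separate linked list for far-future
items that the fragment hints at.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define SCHED_SLOTS 1024

struct sched_item {
	time_t when;              /* second at which the item should run */
	struct sched_item *next;
};

struct sched_slot {
	struct sched_item *list;  /* items hashed to this slot */
	unsigned int count;       /* number of items currently in the list */
};

static struct sched_slot sched_queue[SCHED_SLOTS];

static struct sched_slot *slot_for(time_t t)
{
	return &sched_queue[(size_t)(t % SCHED_SLOTS)];
}

/*
 * Queue item in the least-loaded second within [wanted, wanted + flex].
 * Ties go to the earliest second. Returns the second that was chosen.
 */
static time_t sched_add(struct sched_item *item, time_t wanted, unsigned int flex)
{
	assert(wanted >= 0 && flex < SCHED_SLOTS);

	time_t best = wanted;
	for (time_t t = wanted; t <= wanted + (time_t)flex; t++) {
		if (slot_for(t)->count < slot_for(best)->count)
			best = t;
	}

	struct sched_slot *slot = slot_for(best);
	item->when = best;
	item->next = slot->list;
	slot->list = item;
	slot->count++;
	return best;
}

/*
 * Detach and return every item due exactly at `now`. Items that share the
 * slot but belong to a later wrap of the ring stay queued.
 */
static struct sched_item *sched_take_due(time_t now)
{
	struct sched_slot *slot = slot_for(now);
	struct sched_item *due = NULL;
	struct sched_item **pp = &slot->list;

	while (*pp != NULL) {
		struct sched_item *it = *pp;
		if (it->when == now) {
			*pp = it->next;   /* unlink from the slot */
			it->next = due;   /* push onto the result list */
			due = it;
			slot->count--;
		} else {
			pp = &it->next;
		}
	}
	return due;
}
```

Because the counter covers everything hashed to a slot, including items
from a later wrap, the load estimate is only approximate once checks are
scheduled more than 1024 seconds ahead.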
This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]