Re: [Nagios-devel] [PATCH] Re: alternative scheduler

Post by **Guest** » Thu Dec 02, 2010 1:11 pm

On 12/02/2010 12:36 PM, Jochen Bern wrote:
> On 12/02/2010 10:46 AM, Andreas Ericsson wrote:
>> On 12/02/2010 10:03 AM, Jochen Bern wrote:
>>> Unless I *really* need new glasses, there are only three different kinds
>>> of such rescheduling code in the 3.2.x Nagios core:
>>> 1. Reschedule *exactly* check_interval / retry_interval from last due
>>> time (iff check_period allows this) - e.g., base/checks.c::1301ff :
>> This could trivially be changed by the simple expedient of scheduling the
>> checks with a random component and offsetting the check backwards in time
>> by half the random flex component.
>
> (Which is what I've hacked into the core right now - as I mentioned, a
> random offset of -7..0 seconds, typically every check_interval = 5
> minutes, takes ~6h to undo the peak-building of the nightly logfile
> rotation.)
>

If you use -15..+15 seconds it will spread a lot faster.
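
To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes each reschedule adds an independent, uniformly distributed whole-second offset, so the spread of a bunch of initially aligned checks grows like a random walk. The function names and the 20-second target spread are made up for illustration; nothing here is taken from the Nagios source.

```python
import math


def offset_variance(lo: int, hi: int) -> float:
    """Variance of a uniform integer offset drawn from lo..hi inclusive."""
    n = hi - lo + 1
    return (n * n - 1) / 12.0


def cycles_to_spread(lo: int, hi: int, target_stddev: float) -> int:
    """Reschedules needed until the accumulated offsets reach target_stddev.

    Assumes independent offsets, so variances add up cycle by cycle.
    """
    return math.ceil(target_stddev ** 2 / offset_variance(lo, hi))


# Hypothetical target: checks count as "spread" at a 20 s standard deviation.
narrow = cycles_to_spread(-7, 0, 20.0)    # the -7..0 hack
wide = cycles_to_spread(-15, 15, 20.0)    # the -15..+15 suggestion
print(narrow, wide)
```

With check_interval = 5 minutes and that assumed target, the -7..0 range needs 77 cycles (a bit over 6 hours) while -15..+15 needs 5 cycles (25 minutes). The symmetric range also has zero mean, so it does not drag the average check time earlier the way -7..0 does.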

>>> 2. Reschedule to the *very first second* permitted by check_period -
>>> e.g., base/checks.c::278ff :
>> Here we could do a similar tweak, adding a random number between 0 and 60
>> to the scheduler. It wouldn't be perfect, but it would be better than the
>> current scheme, and with a half-decent PRNG it would mean checks would
>> stay smoothed out for the duration of Nagios' lifespan.
>
> Where "smoothed out" is defined as "randomly distributed in the first
> minute of a valid timeframe, spreading further due to check_interval
> randomization for as long as the timeframe runs, and losing all the
> latter randomization as they skip over the next *in*valid timeframe".
>

The "losing all the randomization" part isn't necessary if the checks
are stepped forward by whatever recheck interval is currently in use,
instead of being set to the first second of the next valid timeframe.
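
A minimal sketch of the stepping idea, not actual core code. The `is_valid` predicate stands in for the real timeperiod lookup, and `step_into_period`, `in_window` and the 02:00-04:00 window are hypothetical names and values picked for the example.

```python
from typing import Callable, Optional


def step_into_period(next_check: int, interval: int,
                     is_valid: Callable[[int], bool],
                     max_steps: int = 100_000) -> Optional[int]:
    """Advance next_check in whole intervals until it lands in a valid time.

    Returns None if no valid time is hit within max_steps; that can happen
    when the interval is longer than the valid window and keeps skipping it.
    """
    t = next_check
    for _ in range(max_steps):
        if is_valid(t):
            return t
        t += interval
    return None


def in_window(t: int) -> bool:
    """Hypothetical check_period: 02:00-04:00 every day (seconds since midnight)."""
    return 7200 <= t % 86400 < 14400


# Two checks that fall just outside the window, 190 seconds apart.
a = step_into_period(14690, 300, in_window)
b = step_into_period(14500, 300, in_window)
print(a, b, a - b)
```

Both checks come back the next day still 190 seconds apart, instead of both being moved to 02:00:00 sharp. The obvious weak spot is an interval longer than the valid window, in which case a fallback is still needed.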

>>> Case 2: *Increase* next_check so as to stay within the check_period, but
>>> determining a max increment which simultaneously smooths out the
>>> (potentially MANY) affected checks and avoids pushing the chain of
>>> subsequent processing (retry_interval / max_check_attempts if found
>>> non-OK, running event handlers, ...) *beyond* the valid timeframe is
>>> definitely nontrivial.)
>> Not really.
>
> Let me play devil's advocate for a second and sketch my (so far)
> worst-case thought scenario:
>
> 1. A *very* expensive check which should be done only once per day
> during a low-load period, as long as the result is OK.
> --> check_period approximately == low-load period, check_interval larger
> than the length of the check_period's "valid" timeframe.
>
> 2. In cases where the test returns non-OK, a certain (low) number of
> rechecks shall be done to guard against secondary influences (say, temp
> LAN hiccups).
> --> max_check_attempts and retry_interval such that their product is
> still reasonably lower than the length of the "valid" timeframe.
>
> 3. As soon as the service turns HARD non-OK (rather random choice, the
> formulae would change if we'd instead use the last SOFT non-OK result,
> but the problem stays pretty much the same), an event handler triggers
> some corrective action (try to fix the problem within the low-load
> period). This action needs some time to complete - let's assume it
> doesn't agree well with the retry_interval. Once it's completed, we want
> a last-ditch check.
> Since we already set "too high" a check_interval in step 1, we need the
> event handler to trigger the action, make an educated guess whether it
> might succeed, and if yes, schedule the last-ditch check through the
> external command interface (to be executed X seconds later).
>
> 4. Now let's do the math: In order to make sure that the last-ditch
> check will still fall into the check_period, and not taking any
> retry_interval randomization into account, we need the *first* check to
> get scheduled between period_begin and
> period_end - (max_check_attempts-1)*retry_interval - X
> - [some time for event handler latency&exec]
> where X is a substantial delay programmed into the event handler,
> nowhere to be found in the data available to Nagios itself.
>
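
For reference, the arithmetic in point 4 can be written down as a small helper. This is only a sketch of the quoted formula: the parameter names are mine, `max_attempts` stands for the configured number of check attempts, and `handler_delay_x` and `handler_overhead` are exactly the values Nagios itself has no way of knowing.

```python
from typing import Optional, Tuple


def first_check_window(period_begin: int, period_end: int,
                       max_attempts: int, retry_interval: int,
                       handler_delay_x: int,
                       handler_overhead: int) -> Optional[Tuple[int, int]]:
    """Return (earliest, latest) start times for the first check, in seconds.

    latest = period_end - (max_attempts - 1) * retry_interval
             - handler_delay_x - handler_overhead
    Returns None if the whole chain cannot fit into the period at all.
    """
    latest = (period_end
              - (max_attempts - 1) * retry_interval
              - handler_delay_x
              - handler_overhead)
    if latest < period_begin:
        return None
    return (period_begin, latest)


# Made-up numbers: 02:00-04:00 window, 3 attempts, 10 min retries,
# X = 30 min, 60 s allowed for event handler latency and execution.
window = first_check_window(7200, 14400, 3, 600, 1800, 60)
print(window)
```

With those made-up numbers the first check has to start no later than 11340 seconds after midnight (03:09:00), which leaves well under the full two hours for any smoothing.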

Or we can

...[email truncated]...

This post was automatically imported from historical nagios-devel mailing list archives