Re: [Nagios-devel] [PATCH] Re: alternative scheduler

On 12/02/2010 10:46 AM, Andreas Ericsson wrote:
> On 12/02/2010 10:03 AM, Jochen Bern wrote:
>> Unless I *really* need new glasses, there's only three different kinds
>> of such rescheduling code in the 3.2.x Nagios core:
>> 1. Reschedule *exactly* check_interval / retry_interval from last due
>> time (iff check_period allows this) - e.g., base/checks.c::1301ff :
> This could trivially be changed by the simple expedient of scheduling the
> checks with a random component and offsetting the check backwards in time
> by half the random flex component.

(Which is what I've hacked into the core right now - as I mentioned, a
random offset of -7..0 seconds, typically every check_interval = 5
minutes, takes ~6h to undo the peak-building of the nightly logfile
rotation.)
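
To get a feel for how slowly such a small offset spreads a peak, here is a minimal sketch in Python (not the actual core patch). It assumes every check starts on the same second and is rescheduled exactly check_interval plus a uniform -7..0 second offset after its last due time. The function names and the figure of 2000 checks are made up for illustration.

```python
import math
import random

def accumulated_jitter_std(rounds, flex=7.0):
    """Standard deviation (seconds) of the sum of `rounds` independent
    uniform(-flex, 0) offsets: sqrt(rounds) * flex / sqrt(12)."""
    return math.sqrt(rounds) * flex / math.sqrt(12.0)

def simulate_spread(checks=2000, rounds=72, interval=300, flex=7, seed=1):
    """All checks start on the same second (the 'peak'). Return the standard
    deviation of their due times after `rounds` reschedules."""
    rng = random.Random(seed)
    due = [0.0] * checks
    for _ in range(rounds):
        # Reschedule from the last due time, pulled back by 0..flex seconds.
        due = [t + interval + rng.uniform(-flex, 0) for t in due]
    mean = sum(due) / checks
    return math.sqrt(sum((t - mean) ** 2 for t in due) / checks)

if __name__ == "__main__":
    # 6 hours of 5-minute intervals = 72 reschedules.
    print(round(accumulated_jitter_std(72), 2))
    print(round(simulate_spread(), 2))
```

Under these assumptions the spread only grows with the square root of the number of reschedules, which fits the observation that a peak takes hours rather than minutes to dissolve. How much spread counts as "undone" depends on the original peak and on the load the box tolerates.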

>> 2. Reschedule to the *very first second* permitted by check_period -
>> e.g., base/checks.c::278ff :
> Here we could do a similar tweak, adding a random number between 0 and 60
> to the scheduler. It wouldn't be perfect, but it would be better than the
> current scheme, and with a half-decent PRNG it would mean checks would
> stay smoothed out for the duration of Nagios' lifespan.

Where "smoothed out" is defined as "randomly distributed in the first
minute of a valid timeframe, spreading further due to check_interval
randomization for as long as the timeframe runs, and losing all the
latter randomization as they skip over the next *in*valid timeframe".
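
A small Python sketch of that behaviour, under simplifying assumptions: one valid window [begin, end) per day, given in seconds since midnight, the -7..0 second offset from case 1, and the proposed 0..60 second offset on entering a window. All names here are hypothetical; none of this is Nagios code.

```python
import random

DAY = 86400  # seconds per day

def next_valid_start(t, begin):
    """Start of the first daily window beginning at or after time t."""
    day_start = t - t % DAY
    if t % DAY <= begin:
        return day_start + begin
    return day_start + DAY + begin

def reschedule(due, interval, begin, end, rng, flex=7, first_minute=60):
    """Case 1 while inside the window, case 2 when the next slot falls outside."""
    nxt = due + interval + rng.uniform(-flex, 0)
    if begin <= nxt % DAY < end:
        return nxt
    # Outside check_period: jump to the next window, somewhere in its first minute.
    return next_valid_start(nxt, begin) + rng.uniform(0, first_minute)

def first_check_next_day(start, interval, begin, end, seed=0):
    """Run one check through day 0 and return its first due time on day 1."""
    rng = random.Random(seed)
    t = start
    while t < DAY:
        t = reschedule(t, interval, begin, end, rng)
    return t

if __name__ == "__main__":
    # Window 01:00-05:00, check_interval 5 minutes.
    t = first_check_next_day(3600, 300, 3600, 18000)
    print(t - (DAY + 3600))  # offset into the new window, always within 0..60
```

Whatever drift a check accumulated during the window, its first due time in the next window lands in that window's first 60 seconds again.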

>> Case 2: *Increase* next_check so as to stay within the check_period, but
>> determining a max increment which simultaneously smoothes out the
>> (potentially MANY) affected checks and avoids pushing the chain of
>> subsequent processing (retry_interval / max_check_attempts if found
>> non-OK, running event handlers, ...) *beyond* the valid timeframe is
>> definitely nontrivial.)
> Not really.

Let me play devil's advocate for a second and sketch my (so far)
worst-case thought scenario:

1. A *very* expensive check which should be done only once per day
during a low-load period, as long as the result is OK.
--> check_period approximately == low-load period, check_interval larger
than the length of the check_period's "valid" timeframe.

2. In cases where the test returns non-OK, a certain (low) number of
rechecks shall be done to guard against secondary influences (say, temp
LAN hiccups).
--> max_check_retries and retry_interval such that their product is
still reasonably lower than the length of the "valid" timeframe.

3. As soon as the service turns HARD non-OK (rather random choice, the
formulae would change if we'd instead use the last SOFT non-OK result,
but the problem stays pretty much the same), an event handler triggers
some corrective action (try to fix the problem within the low-load
period). This action needs some time to complete - let's assume it
doesn't agree well with the retry_interval. Once it's completed, we want
a last-ditch check.
Since we already set "too high" a check_period in step 1, we need the
event handler to trigger the action, make an educated guess whether it
might succeed, and if yes, schedule the last-ditch check through the
external command interface (to be executed X seconds later).

4. Now let's do the math: In order to make sure that the last-ditch
check will still fall into the check_period, and not taking any
retry_interval randomization into account, we need the *first* check to
get scheduled between period_begin and
period_end - (max_check_retries-1)*retry_interval - X
- [some time for event handler latency&exec]
where X is a substantial delay programmed into the event handler,
nowhere to be found in the data available to Nagios itself.
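
The bound from step 4 can be written as a short Python sketch. The parameter names follow the formula above, handler_slack stands in for "[some time for event handler latency&exec]", and the numbers in the example are invented. It assumes max_check_retries counts the total number of check attempts, as the (max_check_retries-1) term implies.

```python
def latest_first_check(period_end, max_check_retries, retry_interval, x,
                       handler_slack=0):
    """Latest start (seconds) of the first check so that all rechecks, the
    event handler and the last-ditch check X seconds later still fit in
    before period_end."""
    return (period_end
            - (max_check_retries - 1) * retry_interval
            - x
            - handler_slack)

def first_check_window(period_begin, period_end, max_check_retries,
                       retry_interval, x, handler_slack=0):
    """Return (earliest, latest) for the first check, or None when the whole
    chain cannot fit into the valid timeframe at all."""
    latest = latest_first_check(period_end, max_check_retries,
                                retry_interval, x, handler_slack)
    if latest < period_begin:
        return None
    return (period_begin, latest)

if __name__ == "__main__":
    # Hypothetical numbers: window 01:00-05:00, 3 attempts 10 minutes apart,
    # X = 90 minutes, 2 minutes of handler latency.
    print(first_check_window(3600, 18000, 3, 600, 5400, 120))
```

The catch is the x argument: a scheduler inside Nagios could evaluate everything else from its own configuration, but X lives only in the event handler.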

Kind regards,
J. Bern
-- 
Jochen Bern, Systemingenieur --- LINworks GmbH
Postfach 100121, 64201 Darmstadt | Robert-Koch-Str. 9, 64331 Weiterstadt
PGP (1024D/4096g) FP = D18B 41B1 16C0 11BA 7F8C DCF7 E1D5 FAF4 444E 1C27
Tel. +49 6151 9067-231, Zentr. -0, Fax -299 - Amtsg. Darmstadt HRB 85202
Unternehmenssitz Weiterstadt, Geschäftsführer Metin Dogan, Oliver Michel





This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]