Re: [Nagios-devel] alternative scheduler

Guest · Post by **Guest** » Thu Dec 02, 2010 2:08 pm

The problem with any smoothing or readjustment of time intervals comes
in when performance metrics are being collected along with state - not
having a stable interval between checks throws off intervals between
data points in metric databases.

Some amount of jitter in intervals can be accounted for when
inserting data points into metric databases with some fairly simple
math (truncating intervals to the nearest minute, for example). But if
intervals are not pretty accurate, then using metrics over time for
trending and comparison gets much trickier and requires a lot of
mathematical adjustment at view time, even if we are only looking at
trend lines for, say, 10 or 20 elements at once. This scales very
poorly when you want to view hundreds or thousands of metric lines at
once, even if they are aggregated first (which is usually done in some
fashion with huge numbers of metrics).

We have mitigated this issue a bit by adding truncation code before
inserting metrics into our long-term trending data warehouse. That
means that what goes in falls on even minute intervals, making
graphing a cheap operation even over many data points.
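
As a rough illustration of that kind of truncation (not our actual
warehouse code), here is a minimal Python sketch. The function names,
the 60-second bucket, and the "later sample wins" rule for collisions
are all assumptions made for the example.

```python
def truncate_to_bucket(timestamp, bucket_seconds=60):
    """Floor a Unix timestamp (in seconds) to the start of its bucket."""
    ts = int(timestamp)
    return ts - (ts % bucket_seconds)


def align_samples(samples, bucket_seconds=60):
    """Map (timestamp, value) pairs onto even bucket boundaries.

    Assumption for this sketch: if two samples land in the same
    bucket, the later one overwrites the earlier one.
    """
    aligned = {}
    for timestamp, value in sorted(samples):
        aligned[truncate_to_bucket(timestamp, bucket_seconds)] = value
    return sorted(aligned.items())


# Three checks that ran 17-19 seconds past the minute.
samples = [(1291298897, 0.42), (1291298959, 0.40), (1291299018, 0.45)]
aligned = align_samples(samples)
```

Once every series sits on the same minute boundaries, lines can be
compared or aggregated point by point without interpolating at view
time.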

Our longer-term resolution to this will be to decouple fault
management tests from metrics collection, as the metrics force us to
watch service latency and intervals very closely for SNMP delta
metric collection. It is a PITA. We plan on having an agent on every
system that focuses on streaming metrics to collectors, thereby
freeing the polling-based tests from having to be locked into very
accurate check intervals.
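
To show why delta metrics are so sensitive to the interval, here is a
small Python sketch of a counter-to-rate calculation. It is only an
illustration: the helper name is hypothetical, and it assumes a 32-bit
counter that wraps at most once between two samples.

```python
def counter_rate(prev, curr, counter_bits=32):
    """Per-second rate between two (timestamp, counter) samples.

    Divides by the elapsed time actually measured between the two
    samples rather than by the nominal check interval. Assumes the
    counter wrapped at most once between samples.
    """
    (t0, c0), (t1, c1) = prev, curr
    elapsed = t1 - t0
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    # Modulo arithmetic absorbs a single counter wrap.
    delta = (c1 - c0) % (2 ** counter_bits)
    return delta / elapsed


# A check scheduled every 300 s that actually ran 10 s late.
late = counter_rate((0, 1000), (310, 4100))
# Dividing the same delta by the nominal interval overstates the rate.
naive = (4100 - 1000) / 300
```

The delta only turns into a correct rate if the real elapsed time is
known, which is why latency and interval drift matter so much more for
this kind of metric than for plain state checks.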

Max

On 12/2/10, Andreas Ericsson wrote:
> On 12/02/2010 12:36 PM, Jochen Bern wrote:
>> On 12/02/2010 10:46 AM, Andreas Ericsson wrote:
>>> On 12/02/2010 10:03 AM, Jochen Bern wrote:
>>>> Unless I *really* need new glasses, there's only three different kinds
>>>> of such rescheduling code in the 3.2.x Nagios core:
>>>> 1. Reschedule *exactly* check_interval / retry_interval from last due
>>>> time (iff check_period allows this) - e.g., base/checks.c::1301ff :
>>> This could trivially be changed by the simple expedient of scheduling the
>>> checks with a random component and offsetting the check backwards in time
>>> by half the random flex component.
>>
>> (Which is what I've hacked into the core right now - as I mentioned, a
>> random offset of -7..0 seconds, typically every check_interval = 5
>> minutes, takes ~6h to undo the peak-building of the nightly logfile
>> rotation.)
>>
>
> If you use -15..+15 seconds it will spread a lot faster.
>
>>>> 2. Reschedule to the *very first second* permitted by check_period -
>>>> e.g., base/checks.c::278ff :
>>> Here we could do a similar tweak, adding a random number between 0 and 60
>>> to the scheduler. It wouldn't be perfect, but it would be better than the
>>> current scheme, and with a half-decent PRNG it would mean checks would
>>> stay smoothed out for the duration of Nagios' lifespan.
>>
>> Where "smoothed out" is defined as "randomly distributed in the first
>> minute of a valid timeframe, spreading further due to check_interval
>> randomization for as long as the timeframe runs, and losing all the
>> latter randomization as they skip over the next *in*valid timeframe".
>>
>
> The "losing all the randomization" won't be necessary if the checks
> were to be stepped by whatever recheck interval we're currently using
> instead of set fixedly to the first second of the next valid timeframe.
>
>>>> Case 2: *Increase* next_check so as to stay within the check_period, but
>>>> determining a max increment which simultaneously smoothes out the
>>>> (potentially MANY) affected checks and avoids pushing the chain of
>>>> subsequent processing (retry_interval / max_check_attempts if found
>>>> non-OK, running event handlers, ...) *beyond* the valid timeframe is
>>>> definitely nontrivial.)
>>> Not really.
>>
>> Let me play devil's advocate for a second and sketch my (so far)
>> worst-case thought scenario:
>>
>> 1. A *very* expensive check which should be done only once per day
>> during a low-load period, as long as the result is OK.
>> --> chec

...[email truncated]...

This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]