Re: [Nagios-devel] alternative scheduler
Posted: Wed Nov 24, 2010 9:23 am
On 11/23/2010 09:08 PM, Fredrik Thulin wrote:
> On Tue, 2010-11-23 at 20:43 +0100, Jochen Bern wrote:
>> On 11/23/2010 01:59 PM, Fredrik Thulin wrote:
>>> I was able to write a brand new scheduler that works MUCH better - 1160
>>> checks per minute, compared to ~60. Any plans to do something drastic
>>> about the Nagios service check scheduler?
>> One question, for sake of clarification: Does your definition of "check
>> scheduling" include the mid-term planning (i.e., "check returned OK,
>> should be repeated after the configured check_interval, if check_period
>> permits" and the likes), or only the short-term scheduling of "due"
>> checks onto the resources for actual execution (in the style of a
>> (distributed) batch queue)?
> The proof of concept is super simple - it was all done in less than six
> hours time.
> You load it with in my case ~6000 checks, and say that you want them
> started in N seconds (in my case 300 seconds).
[...]
> Improving the scheduler to support different check_intervals etc. would
> not be difficult, but is something I've never utilized with Nagios to
> date.
I see. I should probably explain why I'm asking, then (everyone else,
please excuse the wall of text):
Given a Nagios configuration (number of active checks, their
check_period, check_interval, retry_interval, and max_check_attempts), a
distribution of state changes, and (I hope) a bunch of Queueing Theory
formulae, one can determine the average rate X/min at which checks
*ought* to be scheduled and executed. In evaluating a new check
scheduler, the first thing I'd be interested in would be its
*correctness*, from the detail (single host/service) up to the global
level (yielding a rate of X/min, not less, nor more - hence my confusion
about your "the more checks per minute, the better!" stance).
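As a rough illustration of the rate X/min described above, the sketch below estimates it as a time-weighted sum over all checks. It is not taken from Nagios: the `Check` class, the `soft_fraction` parameter (an assumed share of time a check spends retrying), and the simplification that every check_period is 24x7 are assumptions made for this example only.

```python
from dataclasses import dataclass


@dataclass
class Check:
    check_interval: float        # minutes between checks in normal operation
    retry_interval: float        # minutes between checks while retrying
    soft_fraction: float = 0.0   # ASSUMPTION: share of time spent retrying


def expected_rate_per_minute(checks):
    """Long-run number of checks per minute the configuration asks for.

    Each check contributes 1/check_interval while in normal operation and
    1/retry_interval while retrying, weighted by the time spent in each mode.
    check_period restrictions are ignored (assumed 24x7).
    """
    total = 0.0
    for c in checks:
        total += (1.0 - c.soft_fraction) / c.check_interval
        total += c.soft_fraction / c.retry_interval
    return total


if __name__ == "__main__":
    checks = [Check(check_interval=5, retry_interval=1) for _ in range(6000)]
    print(f"{expected_rate_per_minute(checks):.1f} checks/min")
```

With 6000 checks on a 5-minute interval and no retries, this comes to about 1200 checks per minute, which is the same load as ~6000 checks started within 300 seconds. A correct scheduler should hit that figure, neither more nor less.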
Once correctness has been established, one can go on to check whether
it's a "good" scheduler. However, there's more than one definition of
quality that one may use. One possibility is to measure the *maximum*
sustainable rate of checks that can be (scheduled and) executed. Another
gauge is that, if the scheduler goes to work on a handcrafted, badly
distributed initial schedule, it will smooth out the load within Y
cycles with a max deviation of Z % from the {check,retry}_interval.
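The "max deviation of Z %" gauge mentioned above could be measured along these lines. This is only a sketch; the function name and the input format (a list of observed start times for one check, in seconds) are assumptions for illustration.

```python
def max_deviation_percent(run_times, interval):
    """Largest deviation of any gap between consecutive runs from the
    configured interval, as a percentage of that interval."""
    gaps = [later - earlier for earlier, later in zip(run_times, run_times[1:])]
    if not gaps:
        raise ValueError("need at least two run times")
    return max(abs(gap - interval) for gap in gaps) / interval * 100.0


if __name__ == "__main__":
    # One check with a 300 s interval; the second gap is 7 s short.
    print(max_deviation_percent([0, 300, 593, 893], 300))
```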
Which brings us to the current Nagios code. In some installations,
random influences make the scheduled check times "flow together" into
peaks of workload (see the attached graph for what happens to my
scheduling every midnight when Nagios rotates the log). Nagios (3.2.x)
does *not* fix such peaks unless you do a restart with *complete*
rescheduling (I hacked a random -7..0 seconds offset into the code,
which smooths out my midnight-induced peaks over the course of ~6 hours).
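To show how a small random offset can dissolve such a peak, here is a toy simulation. It is not the actual patch to the Nagios source: the function names, the 300-second interval, and the worst-case starting point (every check due in the same second) are assumptions for illustration. Only the -7..0 seconds range comes from the description above.

```python
import random
from collections import Counter

MAX_JITTER = 7  # seconds; the -7..0 range mentioned above


def next_check_time(last_run, check_interval, rng):
    """Next run time, pulled forward by a random 0..MAX_JITTER seconds."""
    return last_run + check_interval - rng.randint(0, MAX_JITTER)


def peak_load(times, check_interval):
    """Largest number of checks sharing one second of the interval."""
    phases = Counter(t % check_interval for t in times)
    return max(phases.values())


def simulate(num_checks=1000, check_interval=300, cycles=72, seed=1):
    """Return (peak before, peak after) for checks that all start at t=0."""
    rng = random.Random(seed)
    times = [0] * num_checks  # worst case: everything due in the same second
    before = peak_load(times, check_interval)
    for _ in range(cycles):
        times = [next_check_time(t, check_interval, rng) for t in times]
    return before, peak_load(times, check_interval)


if __name__ == "__main__":
    before, after = simulate()
    print(f"peak checks per second: {before} before, {after} after")
```

Note that an offset which is only ever negative shortens the effective interval slightly, so each check runs a little more often than configured.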
Anyone who has to work with check_periods a lot has even more of a
problem. If the {check,retry}_interval would place the next check
outside the check_period, Nagios will schedule the next check for the
*very first second* of the upcoming in-period timeframe - *ALL* of them.
In a case reported by another colleague, that made for a fireball of
20,000 checks in the same second - which blew a redundant pair of Nagios
servers clear out of the water.
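The contrast between placing every deferred check on the first in-period second and spreading them over a window can be sketched as follows. The second function is a hypothetical alternative, not something Nagios does; the names, the 300-second window, and the simplification that the period stays open after `period_start` are assumptions.

```python
import random
from collections import Counter


def clamp_to_period_start(proposed, period_start):
    """Behaviour described above: a check that falls before the period
    opens is moved to the very first in-period second."""
    return max(proposed, period_start)


def spread_after_period_start(proposed, period_start, window, rng):
    """Hypothetical alternative: move such a check to a random second
    within `window` seconds after the period opens."""
    if proposed >= period_start:
        return proposed
    return period_start + rng.randrange(window)


def peak_per_second(times):
    """Largest number of checks scheduled for any single second."""
    return max(Counter(times).values())


if __name__ == "__main__":
    rng = random.Random(1)
    period_start = 28800                                   # e.g. 08:00, in seconds
    proposed = [rng.randrange(period_start) for _ in range(20000)]
    clamped = [clamp_to_period_start(t, period_start) for t in proposed]
    spread = [spread_after_period_start(t, period_start, 300, rng) for t in proposed]
    print(peak_per_second(clamped), peak_per_second(spread))
```

With 20,000 out-of-period checks, clamping puts all of them into one second, while the spread variant averages roughly 67 per second over a 300-second window.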
Kind regards,
J. Bern
-- 
Jochen Bern, Systemingenieur --- LINworks GmbH
Postfach 100121, 64201 Darmstadt | Robert-Koch-Str. 9, 64331 Weiterstadt
PGP (1024D/4096g) FP = D18B 41B1 16C0 11BA 7F8C DCF7 E1D5 FAF4 444E 1C27
Tel. +49 6151 9067-231, Zentr. -0, Fax -299 - Amtsg. Darmstadt HRB 85202
Unternehmenssitz Weiterstadt, Geschäftsführer Metin Dogan, Oliver Michel
[Attachment: SchedGraph.2010-11-23-23:55.png (scheduling graph; image data not preserved in the archive)]
This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]