Re: [Nagios-devel] [PATCH] Re: alternative scheduler
Posted: Fri Dec 03, 2010 4:40 pm
On 12/03/2010 03:24 PM, Fredrik Thulin wrote:
> On Fri, 2010-12-03 at 14:28 +0100, Andreas Ericsson wrote:
>> ...
>>> I meant to say that N is calculated when the list of checks is
>>> (re)loaded. As I don't even try to have retry_intervals and such, a
>>> steady tick interval works great as long as I can finish initiating
>>> another service check in between ticks.
>>>
>>
>> Ah, right. And initiating a check is quite cheap until they start
>> piling up when the network goes bad, which you sort of avoid by using
>> a constant stream of executing checks, so you always know there'll be
>> constant load on the system you're monitoring from.
>
> Right, but initiating checks doesn't get more expensive just because the
> checks require more CPU cycles to complete (because of retries). Other
> resources might suffer though - I guess the first one to be depleted
> would be file descriptors.
>
If you produce more ticks the more checks you run, then each check becomes
more expensive to run. The number of ticks should be constant and the
number of checks to start at each tick should be variable. Producing a
tick has overhead too. So does looping over a list of checks to run at
each tick, but I guarantee you that that overhead is smaller than
producing a tick.
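To make the point concrete, here is a minimal sketch of that design in Python (hypothetical names, not code from Nagios or the Erlang scheduler): the tick count per period is fixed, and only the batch size per tick grows with the check list.

```python
def run_scheduler(checks, tick_interval=1.0, total_period=60.0, start_check=print):
    """Fire a constant number of ticks per period; the batch started at
    each tick scales with the length of the check list, so the per-tick
    overhead stays fixed no matter how many checks are configured."""
    ticks_per_period = int(total_period / tick_interval)
    per_tick = -(-len(checks) // ticks_per_period)  # ceiling division
    for i in range(ticks_per_period):
        for check in checks[i * per_tick:(i + 1) * per_tick]:
            start_check(check)  # initiating a check is the cheap part
        # a real loop would sleep(tick_interval) here between batches
```

With 1000 checks and 60 one-second ticks per minute, each tick starts 17 checks; doubling the check count doubles the batch, not the tick rate.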
>> I'm wondering if
>> that doesn't sort of solve the problem in the wrong direction though,
>> since the monitoring system is supposed to serve the other systems and
>> endure the inconveniences it suffers itself as best it can. Sort of.
>
> Hmm. The goal here is to scale sideways as you put it. To evolve to more
> cores and more schedulers thus reaching higher number of checks possible
> per time unit, per server.
>
> If a given server can only take on 1000 checks per time unit and you
> typically run it around 900, nothing good will come out of
> retry_interval suddenly trying to get the server to do 1100 checks per
> minute. That is over-subscription and the result is undefined at best.
>
> I would rather dynamically figure out that I'm very probable to be able
> to run 1000 checks per time unit, and then either
>
> * use my current approach of always doing 950 and not having
> retry_interval and similar, or
> * do 800 per time unit, and allow retry_interval etc. to push it up to
> 900-1000 but never more
>
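The second option above can be sketched as a simple admission policy (hypothetical helper names, not from either scheduler): schedule below the measured capacity and let retries consume only the reserved headroom, never pushing past the ceiling.

```python
def plan_capacity(max_per_minute, headroom_fraction=0.2):
    """Reserve a fraction of measured capacity for retries.
    Returns (base load to schedule, hard ceiling)."""
    base = int(max_per_minute * (1 - headroom_fraction))
    return base, max_per_minute

def admit_retries(base, ceiling, retry_requests):
    """Admit at most (ceiling - base) extra retry checks this minute;
    anything beyond that is deferred rather than oversubscribed."""
    return min(retry_requests, ceiling - base)
```

For a server measured at 1000 checks per minute, this schedules 800 and lets retry pressure raise the total to at most 1000, which avoids the undefined-behaviour case of asking for 1100.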
Skipping the retry_interval is retarded at best and moronic at worst. Or
possibly the other way around. If you do that you might as well just make
sure you've always got X checks running and let them complete when they
complete. That's an even simpler way of dropping monitoring precision in
favour of imaginary scalability.
Just so we understand each other here: it's quite cool that you wrote a
scheduler in erlang. I don't speak erlang myself, but I find it inspiring
when people get off their arses and solve a problem rather than moping
about it. However, the precision regression in your scheduler makes it
clearly unsuitable for real-world monitoring. Its merits for the sake of
the server doing the monitoring leave food for thought when implementing
a new scheduler, but IMNSHO you've aimed for the secondary goal of not
overloading the monitoring server rather than checking things with the
original precision or better. The fact that you have apparently succeeded
doesn't change the fact that what you've created is somewhat akin to an
airplane that can't fly, but has very comfortable chairs.
>>>> That's still "doing more than you did before", on a system level, so the
>>>> previous implementation must have been buggy somehow. Perhaps erlang
>>>> blocked a few signals when the signal handler was already running, or
>>>> perhaps you didn't start enough checks per tick?
>>>
>>> I agree it is more work for the scheduler, but that is better than
>>> having under-utilized additional CPUs/cores, right?
>>>
>>
>> So long as the net effect is that you can run more checks with it, yes, but
>> an exponential algorithm will always beat a non-exponential one, so with a
>> large enough number of checks you'll run into the reverse situation, where
>> the sch
...[email truncated]...
This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]