Re: [Nagios-devel] [PATCH] Re: alternative scheduler

On Wed, 2010-12-01 at 15:14 +0100, Andreas Ericsson wrote:
...
> > Host checks were still being scheduled, and every time a host check was
> > found at the front of event_list_low, Nagios would log "We're not
> > executing host checks right now, so we'll skip this event." and then
> > sleep for sleep_time seconds (0.25 was my setting, based on (Ubuntu)
> > defaults) (!!!).
>
>
> This should only happen if you've set a check_interval for hosts but
> have disabled them globally, either via nagios.cfg or via an external
> command. It seems weird that we run usleep() instead of just issuing
> a sched_yield() or something though, which would be a virtual noop
> unless other processes are waiting to run.

Guilty of setting a check_interval for hosts, even on slave servers,
yes.

IMNSHO, if that is an unsupported configuration in combination with
execute_host_checks=0, Nagios should refuse to load the configuration.

> > I made the attached minimalistic patch to not sleep if the next event in
> > the event list is already due.
> >
>
> Seems sensible, but I think it can be improved, such as issuing either
> a sched_yield() or, if sched_yield() is not available, running usleep(10)
> every 100 skipped items or so. That would avoid pinning the cpu but would
> still be a lot faster than what we have today.

What is sched_yield? I can't find that function anywhere in the source
code. Feel free to improve the patch - as I've previously said, C isn't
my game.

> > This removed the total lack of performance in my installation, but
> > service reaping is still killing me slowly on my virtual development
> > server.
>
> How come?

I currently reap every 10 seconds, and crude empirical observations made
by tailing the log file say that reaping takes 3-4 seconds on my
virtual machine ( ...

> ... Still though, reaping more frequently means the cache
> would more often be hot and reaping will run a lot faster.

Which cache do you mean would be hotter with more frequent reaping? The
files are on a RAM disk already.

> > The scheduler really needs much more work (like sub-second precision for
> > when to start checks - that gave me roughly 25% additional performance
> > in my Erlang based scheduler),
>
> That's not possible. With subsecond precision the program has to do
> more work, not less. You're looking at the wrong bottleneck here and
> you most certainly botched the implementation the first time around if
> adding subsecond precision made such a large improvement for you.

We should have a beer and talk about scheduling sometime, since we're
both in Stockholm (?).

My first scheduler ticked once per second and *BAM* started 30+ checks.

A lot of the time, a significant number of these checks were exactly
the same check (but different target hosts), so my theory is they all
requested the very same resources around the same millisecond. When I
changed the scheduler to start one check every 50 ms instead, I saw that
I could start around 25% more checks every second. Other theories are
welcome, but that was my observation.
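
The difference between the two start policies described above can be
sketched in C as follows. This is only an illustration of the idea, not
code from either scheduler: the function names and the SPACING_MS constant
are hypothetical, and counting checks that share a start time is just a
crude stand-in for contention. It does not attempt to reproduce the 25%
figure.

```c
#include <stddef.h>

#define SPACING_MS 50   /* from the post: one check every 50 ms */

/* Burst policy: every check due in this tick starts at offset 0 ms. */
void schedule_burst(long *start_ms, size_t n)
{
    for (size_t i = 0; i < n; i++)
        start_ms[i] = 0;
}

/* Staggered policy: check i starts i * spacing_ms after the first one. */
void schedule_staggered(long *start_ms, size_t n, long spacing_ms)
{
    for (size_t i = 0; i < n; i++)
        start_ms[i] = (long)i * spacing_ms;
}

/* Largest number of checks that share the same start offset. */
size_t max_simultaneous(const long *start_ms, size_t n)
{
    size_t worst = 0;
    for (size_t i = 0; i < n; i++) {
        size_t same = 0;
        for (size_t j = 0; j < n; j++)
            if (start_ms[j] == start_ms[i])
                same++;
        if (same > worst)
            worst = same;
    }
    return worst;
}
```

With 30 checks, the burst policy has all 30 starting in the same
millisecond, while the staggered policy never starts more than one at a
time, at the cost of spreading the starts over 1450 ms.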

> Try removing check_interval and retry_interval from your hosts instead,
> and set should_be_scheduled=0 in your retention file before restarting.
> execute_host_checks is about actually running the checks, whereas you
> want to skip even scheduling them.

I'll think about doing that, or just throwing hardware at the problem
now that my Nagios check servers perform reasonably well.

/Fredrik

This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]