Re: [Nagios-devel] [PATCH] Re: alternative scheduler

Guest · Post by **Guest** » Wed Dec 01, 2010 2:40 pm

On Wed, 2010-12-01 at 15:14 +0100, Andreas Ericsson wrote:
...
> > Host checks were still being scheduled, and every time a host check was
> > found at the front of event_list_low, Nagios would log "We're not
> > executing host checks right now, so we'll skip this event." and then
> > sleep for sleep_time seconds (0.25 was my setting, based on (Ubuntu)
> > defaults) (!!!).
>
>
> This should only happen if you've set a check_interval for hosts but
> have disabled them globally, either via nagios.cfg or via an external
> command. It seems weird that we run usleep() instead of just issuing
> a sched_yield() or something though, which would be a virtual noop
> unless other processes are waiting to run.

Guilty of setting a check_interval for hosts, even on slave servers,
yes.

IMNSHO, if that is an unsupported configuration in combination with
execute_host_checks=0, Nagios should refuse to load the configuration.
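As a rough illustration of what such a refusal could look like, here is a minimal sketch in C. It is not Nagios code: the struct host_cfg type and the count_scheduled_but_disabled() function are hypothetical names made up for this example, and the only assumption carried over from the discussion is that a check_interval above zero means the host would be scheduled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified view of a host definition (not the Nagios struct). */
struct host_cfg {
    const char *name;
    double check_interval;   /* > 0 means the host check would be scheduled */
};

/*
 * Hypothetical pre-flight check: count hosts that would still be scheduled
 * although host checks are globally disabled. A loader could refuse to start
 * (or at least warn) when the result is non-zero.
 */
static size_t count_scheduled_but_disabled(int execute_host_checks,
                                           const struct host_cfg *hosts,
                                           size_t count)
{
    size_t conflicts = 0;

    assert(hosts != NULL || count == 0);
    if (execute_host_checks)
        return 0;                      /* checks are enabled, nothing to flag */

    for (size_t i = 0; i < count; i++) {
        if (hosts[i].check_interval > 0.0)
            conflicts++;
    }
    return conflicts;
}
```

Whether the right reaction is a hard failure or only a warning is a policy question the sketch does not try to answer.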

> > I made the attached minimalistic patch to not sleep if the next event in
> > the event list is already due.
> >
>
> Seems sensible, but I think it can be improved, such as issuing either
> a sched_yield() or, if sched_yield() is not available, running usleep(10)
> every 100 skipped items or so. That would avoid pinning the cpu but would
> still be a lot faster than what we have today.

What is sched_yield? I can't find that function anywhere in the source
code. Feel free to improve the patch - as I've previously said C isn't
my game.
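For reference, sched_yield() is a POSIX call declared in <sched.h>, not a Nagios function, which is why it does not show up in the source tree. The sketch below shows one way the suggestion above might look. It is not the attached patch: next_event_is_due() and pause_after_skip() are hypothetical helpers, the constants come from the numbers mentioned in the quoted text, and pausing only once per 100 skips in both branches is one reading of that suggestion.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE          /* assumption: glibc, needed to expose usleep() */
#endif
#include <assert.h>
#include <sched.h>               /* sched_yield() */
#include <time.h>                /* time_t */
#include <unistd.h>              /* usleep(), _POSIX_PRIORITY_SCHEDULING */

#define SKIPS_PER_PAUSE   100    /* "every 100 skipped items or so" */
#define FALLBACK_SLEEP_US 10     /* usleep(10) when sched_yield() is missing */

/* Hypothetical: the idea behind the patch, i.e. do not sleep if this is true. */
static int next_event_is_due(time_t next_run, time_t now)
{
    return next_run <= now;
}

/*
 * Hypothetical helper, called once per skipped event.
 * Returns 1 if it gave up the CPU on this call, 0 otherwise.
 */
static int pause_after_skip(unsigned long *skipped)
{
    assert(skipped != NULL);

    if (++(*skipped) % SKIPS_PER_PAUSE != 0)
        return 0;                /* keep draining the event list */

#if defined(_POSIX_PRIORITY_SCHEDULING) && _POSIX_PRIORITY_SCHEDULING > 0
    sched_yield();               /* near no-op unless another process is waiting */
#else
    usleep(FALLBACK_SLEEP_US);   /* short sleep so the CPU is not pinned */
#endif
    return 1;
}
```

In an event loop, the caller would skip the sleep_time sleep whenever next_event_is_due() is true and call pause_after_skip() for each event it discards.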

> > This removed the total lack of performance in my installation, but
> > service reaping is still killing me slowly on my virtual development
> > server.
>
> How come?

I currently reap every 10 seconds, and crude empirical observations made
by tailing the log file say that reaping takes 3-4 seconds on my
virtual machine ( ...

> ... Still though, reaping more frequently means the cache
> would more often be hot and reaping will run a lot faster.

Which cache do you mean would be hotter from reaping more frequently?
The files are on a RAM disk already.

> > The scheduler really needs much more work (like sub-second precision for
> > when to start checks - that gave me roughly 25% additional performance
> > in my Erlang based scheduler),
>
> That's not possible. With subsecond precision the program has to do
> more work, not less. You're looking at the wrong bottleneck here and
> you most certainly botched the implementation the first time around if
> adding subsecond precision made such a large improvement for you.

We should have a beer and talk about scheduling sometime, since we're
both in Stockholm (?).

My first scheduler ticked once per second and *BAM* started 30+ checks.

A lot of the time, a significant number of these checks were exactly
the same check (but against different target hosts), so my theory is that
they all requested the very same resources within the same millisecond. When I
changed the scheduler to start one check every 50 ms instead, I saw that
I could start around 25% more checks every second. Other theories are
welcome, but that was my observation.
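The difference between the two approaches can be sketched as below. This is C rather than the Erlang of the scheduler described above, and stagger_checks() is a hypothetical helper written for this example: it only computes at which offset from the start of a tick each check would be launched, using a fixed spacing such as the 50 ms mentioned.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: fill offsets_ms[i] with the delay, in milliseconds
 * from the start of the tick, at which check number i should be started.
 * A spacing of 0 reproduces the "start everything at once" behaviour.
 */
static void stagger_checks(long *offsets_ms, size_t count, long spacing_ms)
{
    assert(offsets_ms != NULL || count == 0);
    assert(spacing_ms >= 0);

    for (size_t i = 0; i < count; i++)
        offsets_ms[i] = (long)i * spacing_ms;
}
```

With a spacing of 50 ms the offsets are 0, 50, 100, 150 and so on, so identical checks no longer hit the same resources at the same instant. Thirty checks at that spacing span 1450 ms, so a real scheduler would have to handle batches that run into the next tick.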

> Try removing check_interval and retry_interval from your hosts instead,
> and set should_be_scheduled=0 in your retention file before restarting.
> execute_host_checks is about actually running the checks, whereas you
> want to skip even scheduling them.

I'll think about doing that, or just throwing hardware at the problem
now that my Nagios check servers perform reasonably well.

/Fredrik

This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]