Re: [Nagios-devel] [PATCH] Re: alternative scheduler


Post by Guest »

Sorry for the long delay. It seems I was half asleep when I scrolled by
this mail earlier.

On 12/01/2010 03:40 PM, Fredrik Thulin wrote:
> On Wed, 2010-12-01 at 15:14 +0100, Andreas Ericsson wrote:
> ...
>>> Host checks were still being scheduled, and every time a host check was
>>> found at the front of event_list_low, Nagios would log "We're not
>>> executing host checks right now, so we'll skip this event." and then
>>> sleep for sleep_time seconds (0.25 was my setting, based on (Ubuntu)
>>> defaults) (!!!).
>>
>>
>> This should only happen if you've set a check_interval for hosts but
>> have disabled them globally, either via nagios.cfg or via an external
>> command. It seems weird that we run usleep() instead of just issuing
>> a sched_yield() or something though, which would be a virtual noop
>> unless other processes are waiting to run.
>
> Guilty of setting a check_interval for hosts, even on slave servers,
> yes.
>
> IMNSHO, if that is an unsupported configuration in combination with
> execute_host_checks=0, Nagios should refuse to load the configuration.
>

It isn't. It just uncovers the issue you experienced.
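For reference, the combination being discussed is something like the following (illustrative values only, not a recommended setup): host checks disabled globally in nagios.cfg while a host object still carries an active check interval, so checks keep being scheduled and then skipped.

```
# nagios.cfg -- host checks disabled globally on the slave
execute_host_checks=0

# hosts.cfg -- but a host still has an active check interval,
# so host checks keep landing at the front of event_list_low
define host {
    host_name       db-server1    ; hypothetical name
    check_interval  5
    ; other required host directives omitted for brevity
}
```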

>>> I made the attached minimalistic patch to not sleep if the next event in
>>> the event list is already due.
>>>
>>
>> Seems sensible, but I think it can be improved, such as issuing either
>> a sched_yield() or, if sched_yield() is not available, running usleep(10)
>> every 100 skipped items or so. That would avoid pinning the cpu but would
>> still be a lot faster than what we have today.
>
> What is sched_yield? I can't find that function anywhere in the source
> code. Feel free to improve the patch - as I've previously said C isn't
> my game.
>

sched_yield() causes the kernel to check through its scheduling queue and
see if there are other processes waiting to run. If there are, those other
processes will run. If not, the current process will continue running.
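Putting the two suggestions together, the loop change might look roughly like the sketch below. This is not the actual Nagios event loop, just an illustration of the idea: don't sleep at all when the next event is already due, and yield the CPU only every 100 skipped events (falling back to a short usleep() where sched_yield() is unavailable) so we don't pin the CPU. The function name and parameters are hypothetical.

```c
#define _DEFAULT_SOURCE
#include <time.h>
#include <sched.h>    /* sched_yield(), POSIX */
#include <unistd.h>   /* usleep() */

/* Sketch only: decide whether to sleep before handling the next event.
 * If the event is already due, return immediately instead of sleeping,
 * but be polite to other processes once every 100 skipped events. */
static void wait_for_event(time_t next_event_time, time_t now,
                           double sleep_time, unsigned long *skipped)
{
    if (next_event_time <= now) {
        if (++*skipped % 100 == 0) {
#ifdef _POSIX_PRIORITY_SCHEDULING
            sched_yield();   /* near-noop unless others are runnable */
#else
            usleep(10);      /* fallback suggested in this thread */
#endif
        }
        return;              /* event is due: run it now, no sleep */
    }
    *skipped = 0;
    /* Nothing due yet: sleep as the current code does. */
    usleep((useconds_t)(sleep_time * 1000000.0));
}
```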

>>> This removed the total lack of performance in my installation, but
>>> service reaping is still killing me slowly on my virtual development
>>> server.
>>
>> How come?
>
> I currently reap every 10 seconds, and crude empirical observations made
> by tailing the log file says that reaping takes 3-4 seconds on my
> virtual machine ( the following things on RAM disk :
>
> object_cache_file
> precached_object_file
> status_file
> temp_file
> temp_path
> check_result_path
> state_retention_file
> debug_file
>

A RAM disk doesn't mean much on virtual servers, because it's quite
likely that the host OS is still backing that memory with a swap file.
In general, performance-testing anything in a virtual server is a bad
idea: the I/O performance is utterly crap, and one can never be
really sure that what appears to be stored in memory isn't stored on
disk by the host OS.

> and with the tiniest C program that appends results to a file as
> ocsp_command.
>

Use Nagios' own native perfdata writing instead and use a same-partition
"mv" command to move the perfdata file to the reaper spool directory.
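The reason the "mv" must be same-partition is that a rename within one filesystem is atomic, so the reaper can never pick up a half-written perfdata file. A minimal C illustration of that point (paths and function name are hypothetical, not Nagios code):

```c
#include <stdio.h>   /* rename(), perror() */

/* Move a finished perfdata file into the reaper's spool directory.
 * rename() is atomic when both paths are on the same filesystem;
 * it fails with EXDEV across filesystems, which is exactly the case
 * a cross-partition "mv" would hide by silently falling back to a
 * non-atomic copy-and-delete. */
static int publish_perfdata(const char *tmpfile, const char *spoolfile)
{
    if (rename(tmpfile, spoolfile) != 0) {
        perror("rename");
        return -1;
    }
    return 0;
}
```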

> I'll try changing reaping interval to every 2 seconds as per your
> advice, but I guess it will still take 30-40% of the total time.
>

On virtual machines, yes. On your physical server it's less than 10%.
How much less, one can only guess, but it should be very little if
you're using ramdisks.

>> ... Still though, reaping more frequently means the cache
>> would more often be hot and reaping will run a lot faster.
>
> Which cache would be hotter by reaping more frequently do you mean? The
> files are on RAM disk already.
>

I didn't know that. In that case, it won't matter more than a minuscule
amount.

>>> The scheduler really needs much more work (like sub-second precision for
>>> when to start checks - that gave me roughly 25% additional performance
>>> in my Erlang based scheduler),
>>
>> That's not possible. With subsecond precision the program has to do
>> more work, not less. You're looking at the wrong bottleneck here and
>> you most certainly bot

...[email truncated]...


This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]