Re: [Nagios-devel] OCSP affecting Nagios behavior


What's your service_reaper_frequency directive set to in the main
config file? You might try lowering or raising it to see whether
that improves things.

It sounds like the child processes (the ones that run the plugins) might
be blocking when they try to write to the pipe to the main process.
If the pipe is full, child processes will block until they can write
plugin execution results back to the parent process. The system()
call (to run the OCSP command) happens in the parent process, so it's
actually slowing the parent process down a bit. I would think this
would hurt things rather than help. Maybe the system() call gives
child processes enough time to write to the pipe before the parent
reads from it? I'm not sure, and it doesn't make a lot of sense to
me.
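
To make the blocking behavior concrete, here is a minimal Python sketch
(not Nagios code). It assumes a POSIX system, uses a thread as a
stand-in for a plugin child process, and uses a made-up 512-byte
message in place of a real check result. It fills a pipe, shows that
one more write stalls, and shows that the write completes once the
reader drains the pipe.

```python
import os
import threading

# Hypothetical stand-in for one check result (about 512 bytes).
MESSAGE = b"r" * 512


def fill_pipe(write_fd):
    """Write messages without blocking until the kernel refuses more."""
    os.set_blocking(write_fd, False)
    written = 0
    try:
        while True:
            written += os.write(write_fd, MESSAGE)
    except BlockingIOError:
        pass  # the pipe buffer is full
    os.set_blocking(write_fd, True)
    return written


def demo():
    read_fd, write_fd = os.pipe()
    pending = fill_pipe(write_fd)

    # This thread plays the role of a child reporting a result.
    writer = threading.Thread(target=os.write, args=(write_fd, MESSAGE))
    writer.start()
    writer.join(timeout=0.2)
    blocked_while_full = writer.is_alive()

    # The "parent" reaps results, which frees space for the writer.
    remaining = pending + len(MESSAGE)
    while remaining > 0:
        remaining -= len(os.read(read_fd, 65536))
    writer.join(timeout=5)
    finished_after_read = not writer.is_alive()

    os.close(read_fd)
    os.close(write_fd)
    return blocked_while_full, finished_after_read


if __name__ == "__main__":
    print(demo())
```

If the parent is slow to read, every child that finishes in the
meantime ends up parked in the same way as the writer thread here.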

Also note that you're executing at least 1.5 checks per second on
average. The result that each child writes back to the main process
is around 512 bytes. Some systems have small limits on the pipe
buffer size (4K or less), which means that 8 messages will fill up the
pipe buffer and cause other children to block until the parent reads
from the pipe and frees some space.
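
The arithmetic is easy to check, and you can also probe what your own
system allows. The following Python sketch is an illustration only (the
function names are made up and this is not how Nagios does it). It
assumes a POSIX system and the 512-byte result size quoted above; the
measured capacity will vary by platform.

```python
import os

MESSAGE_SIZE = 512  # approximate result size quoted above


def messages_until_full(buffer_bytes, message_bytes=MESSAGE_SIZE):
    """Number of whole messages that fit before a writer would block."""
    return buffer_bytes // message_bytes


def measure_pipe_capacity(chunk=MESSAGE_SIZE):
    """Fill a fresh pipe with non-blocking writes; return bytes accepted."""
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)
    total = 0
    try:
        while True:
            try:
                total += os.write(write_fd, b"x" * chunk)
            except BlockingIOError:
                return total  # the kernel refused the next message
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    # A 4K buffer holds 4096 // 512 = 8 messages.
    print(messages_until_full(4096))
    capacity = measure_pipe_capacity()
    print(capacity, messages_until_full(capacity))
```

At 1.5 results per second, a buffer that holds only 8 messages fills in
a little over 5 seconds if the parent isn't reading in the meantime.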

That said, I don't really know what solution there might be if
changing the service_reaper_frequency directive doesn't help. :-(
Anyone else?


On 18 Nov 2002 at 17:31, Russell Scibetti wrote:

> I emailed about this once before, but I never received much response.
> Unfortunately, until I can figure this issue out, I can't fully rely on
> Nagios, which I would really like to do...
>
> It seems that turning on obsess_over_services somehow affects Nagios's
> process management behavior. I noticed this because my instance of
> nagios (without obsess_over_services turned on) was having problems.
> This was a 1.0b6 install on Linux RedHat 7.2 (after an upgrade from
> 1.0b3) with about 700 service checks. There are actually 5 other
> instances on this box, but this is the only occurrence of this problem.
> The other instances have fewer checks at a less frequent interval, which
> may be why the problem doesn't occur in those instances.
>
> For the problem instance, there were way too many nagios processes open,
> service checks were getting over 1/2 hr behind schedule
> (normal_check_interval was set to 5 minutes for 500 of the checks), and
> the box was swapping like crazy.
>
> I wrote a small script to log every check result to see if I
> could find out how it was falling behind. I made this the ocsp_command
> and turned on obsess_over_services. I restarted Nagios, and now the
> problem was gone. The process count stayed low, the checks stayed on
> schedule, and the box didn't use swap at all.
>
> I have repeated this test multiple times to make sure it wasn't a fluke,
> and it's not. As soon as I turn off obsess_over_services, within 10
> minutes the same problems reappear. They disappear when I turn
> obsess... back on.
>
> I looked through the code and can't seem to find the problem, or at
> least how the ocsp command section would cause this behavior change.
> The only thing I could see is that the ocsp_command section uses the
> my_system function in utils.c. This has some process
> management in it, but I don't know how it could affect the service check
> processes.
>
> Please, if anyone has any idea why this could be occurring, or is
> familiar enough with the code to look at it and at least know how this
> behavior could be occurring, email the list back. I am really stuck
> at this point, and I can't rely on Nagios until I solve this problem.
> Thanks.
>
> -Russell Scibetti
>
> --
> Russell Scibetti
> Quadrix Solutions, Inc.
> http://www.quadrix.com
> (732) 235-2335, ext. 7038
>

...[email truncated]...


This post was automatically imported from historical nagios-devel mailing list archives
Original poster: nagios@nagios.org