Re: [Nagios-devel] OCSP affecting Nagios behavior

Good to hear this solution worked. I just added a FAQ on it. I hope
to get around this issue in 2.0 by having a separate thread dedicated
to just reading messages from the pipe, so child processes don't get
hung up. Same thing goes for the external command file, where NSCA
processes can block if it isn't checked frequently enough.
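
For anyone who wants to check how close their own box is to the pipe
limit described in the quoted message below, here is a rough Python
sketch. It is not Nagios code, and the function names are made up for
illustration. It assumes a Unix-like system and uses the roughly
512-byte result size mentioned in the thread. It fills a pipe that
nobody is reading and counts how many messages fit before a writer
would block.

```python
import os

# Approximate size of one check result, as quoted in the thread.
MESSAGE_SIZE = 512


def pipe_capacity_in_messages(buffer_bytes, message_size=MESSAGE_SIZE):
    """Arithmetic only: how many whole messages fit in a buffer of this size."""
    return buffer_bytes // message_size


def messages_until_full(message_size=MESSAGE_SIZE):
    """Write fixed-size messages into an unread pipe until a write would block.

    The write end is put in non-blocking mode so that a full pipe raises
    BlockingIOError instead of hanging, which is what a blocking child
    process would experience as a stall.
    """
    read_fd, write_fd = os.pipe()
    try:
        os.set_blocking(write_fd, False)
        payload = b"x" * message_size
        count = 0
        while True:
            try:
                os.write(write_fd, payload)
            except BlockingIOError:
                # Pipe is full: a blocking writer would wait here until
                # the reader drains some data.
                return count
            count += 1
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    print("4K buffer holds", pipe_capacity_in_messages(4096), "messages")
    print("this system's pipe held", messages_until_full(), "messages")
```

With a 4K buffer, pipe_capacity_in_messages(4096) gives 8, matching the
figure in the quoted message. messages_until_full() reports whatever
your own kernel allows, which may well be larger.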


On 20 Nov 2002 at 14:33, Russell Scibetti wrote:

>
> You were pretty on the money about this. I changed the
> service_reaper_frequency from 10 to 5 for the problem instance and
> turned off obsess_over_services, and the problem went away. All the
> checks are executing at the right times and the box isn't swapping.
>
> I still don't really understand how the obsess_over_services being on
> made the problem go away, unless it really was just slowing the
> parent enough to allow the children to finish writing to the pipe.
>
> Anyway, this might want to go into the FAQ somewhere as something to
> try changing in nagios.cfg if the user has a problem with checks
> falling behind.
>
> Thanks for the help!
>
> -Russell
>
> Ethan Galstad wrote:
> What's your service_reaper_frequency directive set at in the main
> config file? You might try lowering it or raising it and seeing if
> that affects things for the better.
>
> It sounds like child processes (the ones that run the plugins) might
> be blocking when they try and write to the pipe to the main process.
> If the pipe is full, child processes will block until they can write
> plugin execution results back to the parent process. The system()
> call (to run the OCSP command) happens in the parent process, so it's
> actually slowing the parent process down a bit. I would think this
> would hurt things rather than help. Maybe the system() call gives
> child processes enough time to write to the pipe before the parent
> reads from it? I'm not sure and it doesn't make a lot of sense to
> me.
>
> Also note that you're executing at least 1.5 checks per second on
> average. The result that each child writes back to the main process
> is around 512 bytes. Some systems have small limits on what the pipe
> buffer is (4K or less), which means that 8 messages will fill up the
> pipe buffer and cause other children to block until the parent reads
> from the pipe and frees some space.
>
> That said, I don't really know what solution there might be if
> changing the service_reaper_frequency directive doesn't help. :-(
> Anyone else?
>
>
> On 18 Nov 2002 at 17:31, Russell Scibetti wrote:
>
> I emailed about this once before, but I never received much response.
> Unfortunately, until I can figure this issue out, I can't fully rely on
> Nagios, which I would really like to do...
>
> It seems that turning on obsess_over_services somehow affects Nagios's
> process management behavior. I noticed this because my instance of
> nagios (without obsess_over_services turned on) was having problems.
> This was a 1.0b6 install on Linux RedHat 7.2 (after an upgrade from
> 1.0b3) with about 700 service checks. There are actually 5 other
> instances on this box, but this is the only occurrence of this problem.
> The other instances have fewer checks at a less frequent interval, which
> may be why the problem doesn't occur in those instances.
>
> For the problem instance, there were way too many nagios processes open,
> service checks were getting over 1/2 hr behind schedule
> (normal_check_interval was set to 5 minutes for 500 of the checks), and
> the box was swapping like crazy.
>
> I wrote a small script to log every check result to see if I
> could find out how it was falling behind. I made this the ocsp_command
> and turned on obsess_over_services. I restarted Nagios, and now the
> problem was gone. The process count stayed low, the checks stayed in
> schedule, and the box didn't use swap at all.
>
> I have repeated this test multiple times to make sure it wasn't

...[email truncated]...


This post was automatically imported from historical nagios-devel mailing list archives
Original poster: nagios@nagios.org