Re: [Nagios-devel] OCSP affecting Nagios behavior

You were pretty much on the money about this. I changed the
service_reaper_frequency from 10 to 5 for the problem instance and
turned off obsess_over_services, and the problem went away. All the
checks are executing at the right times and the box isn't swapping.

I still don't really understand how having obsess_over_services turned on
made the problem go away, unless it really was just slowing the parent
enough to allow the children to finish writing to the pipe.

Anyway, this might want to go into the FAQ somewhere as something to try
changing in nagios.cfg if the user has a problem with checks falling behind.
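
For an FAQ entry, the two nagios.cfg changes described above would look
roughly like the fragment below. The values are the ones used on this
instance (service_reaper_frequency lowered from 10 to 5, obsess_over_services
off); treat them as a starting point to experiment with, not as recommended
settings for every install.

```
# nagios.cfg (main config file)
# Seconds between service "reaper" events; this was 10 on the problem instance.
service_reaper_frequency=5

# 0 = do not run the ocsp_command after each service check.
obsess_over_services=0
```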

Thanks for the help!

-Russell

Ethan Galstad wrote:

>What's your service_reaper_frequency directive set at in the main
>config file? You might try lowering it or raising it and seeing if
>that affects things for the better.
>
>It sounds like child processes (the ones that run the plugins) might
>be blocking when they try and write to the pipe to the main process.
>If the pipe is full, child processes will block until they can write
>plugin execution results back to the parent process. The system()
>call (to run the OCSP command) happens in the parent process, so it's
>actually slowing the parent process down a bit. I would think this
>would hurt things rather than help. Maybe the system() call gives
>child processes enough time to write to the pipe before the parent
>reads from it? I'm not sure and it doesn't make a lot of sense to
>me.
>
>Also note that you're executing at least 1.5 checks per second on
>average. The result that each child writes back to the main process
>is around 512 bytes. Some systems have small limits on what the pipe
>buffer is (4K or less), which means that 8 messages will fill up the
>pipe buffer and cause other children to block until the parent reads
>from the pipe and frees some space.
>
>That said, I don't really know what solution there might be if
>changing the service_reaper_frequency directive doesn't help. :-(
>Anyone else?
>
>
>On 18 Nov 2002 at 17:31, Russell Scibetti wrote:
>
>>I emailed about this once before, but I never received much response.
>>Unfortunately, until I can figure this issue out, I can't fully rely on
>>Nagios, which I would really like to do...
>>
>>It seems that turning on obsess_over_services somehow affects Nagios's
>>process management behavior. I noticed this because my instance of
>>nagios (without obsess_over_services turned on) was having problems.
>>This was a 1.0b6 install on Linux RedHat 7.2 (after an upgrade from
>>1.0b3) with about 700 service checks. There are actually 5 other
>>instances on this box, but this is the only occurrence of this problem.
>>The other instances have fewer checks at a less frequent interval, which
>>may be why the problem doesn't occur in those instances.
>>
>>For the problem instance, there were way too many nagios processes open,
>>service checks were getting over 1/2 hr behind schedule
>>(normal_check_interval was set to 5 minutes for 500 of the checks), and
>>the box was swapping like crazy.
>>
>>I wrote a small script to do logging of every check result to see if I
>>could find out how it was falling behind. I made this the ocsp_command
>>and turned on obsess_over_services. I restarted Nagios, and now the
>>problem was gone. The process count stayed low, the checks stayed on
>>schedule, and the box didn't use swap at all.
>>
>>I have repeated this test multiple times to make sure it wasn't a fluke,
>>and it's not. As soon as I turn off obsess_over_services, within 10
>>minutes the same problems reappear. They disappear when I turn
>>obsess... back on.
>>
>>I looked through the code and can't seem to find the problem, or at
>>least how the ocsp command section would cause this behavior change.
>>The only thing I could see is that the ocsp_command section uses the
>>my_system function in the utils.c file. This has some process
>>management in it, but I don't know how it could

...[email truncated]...
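
The pipe arithmetic in Ethan's message can be checked with a short sketch.
This is only an illustration in Python, not Nagios code: the function names
are made up for this example, the 512-byte result size and 1.5 checks per
second are the rough figures quoted above, and it assumes that results pile
up in the pipe between reaper runs. The measurement part needs a Unix-like
system.

```python
import os

RESULT_BYTES = 512  # rough size of one check result, per the thread


def messages_until_full(buffer_bytes, message_bytes=RESULT_BYTES):
    """How many whole messages fit in a pipe buffer of the given size."""
    return buffer_bytes // message_bytes


def results_between_reaps(checks_per_second, reaper_seconds):
    """Results produced between two reaper runs (assumes a steady rate)."""
    return checks_per_second * reaper_seconds


def measure_pipe_capacity(message_bytes=RESULT_BYTES):
    """Count the writes an unread pipe accepts before a writer would block."""
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)  # fail instead of blocking when full
    payload = b"x" * message_bytes
    count = 0
    try:
        while True:
            try:
                written = os.write(write_fd, payload)
            except BlockingIOError:
                break  # pipe is full; a blocking writer would stall here
            if written < message_bytes:
                break  # partial write: no room left for a whole message
            count += 1
    finally:
        os.close(write_fd)
        os.close(read_fd)
    return count


if __name__ == "__main__":
    print(messages_until_full(4096))        # 8
    print(results_between_reaps(1.5, 10))   # 15.0
    print(results_between_reaps(1.5, 5))    # 7.5
    print(measure_pipe_capacity())          # depends on the local pipe size
```

With a 4K pipe only 8 results fit, while roughly 15 arrive during a
10-second reaper interval and roughly 7.5 during a 5-second one. That is at
least consistent with the change from 10 to 5 helping; it does not explain
the obsess_over_services effect.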


This post was automatically imported from historical nagios-devel mailing list archives
Original poster: russell@quadrix.com