[Nagios-devel] OCSP affecting Nagios behavior

I emailed about this once before, but I never received much response.
Unfortunately, until I can figure this issue out, I can't fully rely on
Nagios, which I would really like to do...

It seems that turning on obsess_over_services somehow affects Nagios's
process management behavior. I noticed this because my instance of
Nagios (with obsess_over_services turned off) was having problems.
This was a 1.0b6 install on Red Hat Linux 7.2 (after an upgrade from
1.0b3) with about 700 service checks. There are actually 5 other
instances on this box, but this is the only occurrence of the problem.
The other instances have fewer checks, run at less frequent intervals,
which may be why the problem doesn't occur in those instances.

For the problem instance, there were far too many nagios processes
running, service checks were getting more than half an hour behind
schedule (normal_check_interval was set to 5 minutes for 500 of the
checks), and the box was swapping like crazy.

I wrote a small script to log every check result, to see if I could
find out how it was falling behind. I made this the ocsp_command and
turned on obsess_over_services. I restarted Nagios, and the problem
was gone. The process count stayed low, the checks stayed on
schedule, and the box didn't use swap at all.
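
For readers who want to reproduce the test, a logging script of the kind
described above might look roughly like the following. This is only a
sketch, not the script actually used: the log path, the function name
log_result, and the assumption that the command definition passes host
name, service description, state, and plugin output as four
command-line arguments are all hypothetical.

```python
#!/usr/bin/env python3
"""Append one line per service check result to a log file."""
import sys
import time

LOG_PATH = "/tmp/ocsp_results.log"  # hypothetical location


def format_record(host, service, state, output, now=None):
    """Build a single tab-separated log line for one check result."""
    stamp = int(time.time() if now is None else now)
    # Collapse newlines and repeated whitespace so each result stays on one line.
    clean = " ".join(output.split())
    return f"{stamp}\t{host}\t{service}\t{state}\t{clean}\n"


def log_result(host, service, state, output, path=LOG_PATH):
    """Append the formatted record to the log file."""
    with open(path, "a", encoding="utf-8") as log:
        log.write(format_record(host, service, state, output))


# Assumed calling convention: HOST SERVICE STATE OUTPUT.
# Calls with any other number of arguments are silently ignored.
if __name__ == "__main__" and len(sys.argv) == 5:
    log_result(*sys.argv[1:5])
```

Comparing the timestamps of consecutive records for the same service
against normal_check_interval shows how far behind schedule the checks
are running.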

I have repeated this test multiple times to make sure it wasn't a fluke,
and it's not. As soon as I turn off obsess_over_services, the same
problems reappear within 10 minutes. They disappear when I turn
obsess_over_services back on.
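
To put numbers on the comparison between runs, the process count can be
sampled at intervals, for example once a minute from cron. The sketch
below is an assumption-laden illustration, not part of Nagios: it
assumes a Linux /proc filesystem whose per-process status file begins
with a "Name:" line, and the function name count_processes is made up.

```python
import os
import time


def count_processes(name, proc_root="/proc"):
    """Count running processes whose command name equals `name`."""
    count = 0
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return 0  # no procfs available on this system
    for entry in entries:
        if not entry.isdigit():
            continue  # only numeric directory names are PIDs
        status_path = os.path.join(proc_root, entry, "status")
        try:
            with open(status_path, errors="replace") as fh:
                first = fh.readline()
        except OSError:
            continue  # process exited while we were scanning
        # The first line is expected to look like "Name:\tnagios".
        if first.startswith("Name:") and first.split(":", 1)[1].strip() == name:
            count += 1
    return count


if __name__ == "__main__":
    # One sample per invocation: unix time, then the process count.
    print(int(time.time()), count_processes("nagios"))
```

Collecting one series with obsess_over_services on and one with it off
gives a direct record of when the process count starts to climb.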

I looked through the code and can't seem to find the problem, or at
least how the ocsp command section would cause this behavior change.
The only thing I could see is that the ocsp_command section uses the
my_system function in utils.c. This has some process management in
it, but I don't know how it could affect the service check
processes.

Please, if anyone has any idea why this could be occurring, or is
familiar enough with the code to look at it and at least see how this
behavior could arise, email the list back. I am really stuck at this
point, and I can't rely on Nagios until I solve this problem.
Thanks.

-Russell Scibetti

--
Russell Scibetti
Quadrix Solutions, Inc.
http://www.quadrix.com
(732) 235-2335, ext. 7038

This post was automatically imported from historical nagios-devel mailing list archives
Original poster: russell@quadrix.com