[Nagios-devel] BUG/PATCH: Runaway processes under Linux (and others)



This relates to a number of issues that people have seen with Nagios and
Nsca running under Linux: many copies of these daemons pile up, the
machine eventually runs out of memory, and it frequently crashes. This
post attempts to summarise the problems for those searching the
archives. If you have an OS/distribution/libraries that are susceptible
to this problem, here is a short summary:

You're screwed.

The problem at heart is that Nagios, and Nsca, use function calls after
forking that are either susceptible to race conditions with other
children, have the possibility of blocking, or cancel pending alarm()s.

Depending on your OS/distribution/libraries, usage of such functions
within a fork()ed child may well mean that the alarm timeouts set simply
do not arrive. The child process will sit in an unknown state for a very
long time.
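
One way a timeout can go missing is sketched below. A process has a
single alarm() timer, so any later alarm() call replaces the pending one,
and POSIX permits sleep() to be implemented with SIGALRM on some systems.
The post does not name the specific calls involved, so this only
illustrates the general mechanism; the function name is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <unistd.h>

/*
 * Arms a timeout, then makes a second alarm() call of the kind a helper
 * routine might make. Returns the number of seconds that were still
 * pending on the first alarm when the second call replaced it.
 */
unsigned int clobbered_alarm_seconds(unsigned int outer_timeout,
                                     unsigned int inner_timeout)
{
    alarm(outer_timeout);                      /* timeout the caller relies on */
    unsigned int lost = alarm(inner_timeout);  /* replaces it; returns time left */
    alarm(0);                                  /* leave nothing pending */
    return lost;
}
```

A non-zero return value means the outer timeout was discarded before it
could fire; from then on only the inner alarm, if any, protects the
process.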

In the case of Nagios, this has a high chance of occurring after it has
fork()ed twice in base/checks.c->run_service_checks(). The main Nagios
process does not know the PID of the grandchild, and has no checks in
place to kill it after a timeout has elapsed. Thus, if the (grand)child
process just sits around, it will not be cleaned up by Nagios.
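
To make the double-fork problem concrete, here is a minimal sketch of one
way a parent could learn the grandchild's PID: the grandchild writes it
down a pipe before starting work. This is not how Nagios or the attached
patch does it; the function names and the pipe hand-off are assumptions
made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Double-forks; returns the grandchild's PID as reported over a pipe, or -1. */
pid_t spawn_grandchild_reporting_pid(unsigned int work_seconds)
{
    int fds[2];
    if (pipe(fds) < 0)
        return -1;

    pid_t child = fork();
    if (child < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (child == 0) {
        close(fds[0]);
        pid_t grandchild = fork();
        if (grandchild == 0) {
            pid_t self = getpid();
            if (write(fds[1], &self, sizeof self) != (ssize_t)sizeof self)
                _exit(1);
            close(fds[1]);
            sleep(work_seconds);          /* stand-in for the service check */
            _exit(0);
        }
        _exit(grandchild < 0 ? 1 : 0);    /* intermediate child exits at once */
    }

    close(fds[1]);
    pid_t reported = -1;
    ssize_t n = read(fds[0], &reported, sizeof reported);
    close(fds[0]);
    waitpid(child, NULL, 0);              /* reap the intermediate child */
    return n == (ssize_t)sizeof reported ? reported : -1;
}

/* Parent-side backstop once the caller decides the grandchild is overdue. */
int kill_overdue_grandchild(pid_t grandchild)
{
    if (grandchild <= 0)                  /* never pass -1 or 0 to kill() */
        return -1;
    return kill(grandchild, SIGKILL);
}
```

With the PID in hand, the parent no longer depends on the grandchild's
own alarm() arriving; it can enforce the timeout itself.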

In Nsca, there is no timeout set by default, and no reaping of child
processes. Thus, the child process can happily sit in an unknown state
for as long as the parent daemon exists. This happens more often when
Nsca is running but Nagios is not, as the contention for the opening
of the dump file, rather than the command pipe, more often results in
blocking.
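
For reference, reaping children in a forking daemon is commonly done with
a SIGCHLD handler along the following lines. This is a generic sketch,
not code from Nsca or from the attached patch; all names are
hypothetical. Note that it only collects children that have exited, so a
child stuck in a blocking call still needs a timeout of its own.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static volatile sig_atomic_t reaped_children = 0;

/* Collect every child that has already exited, without blocking. */
static void reap_children(int signo)
{
    (void)signo;
    int saved_errno = errno;              /* waitpid() may overwrite errno */
    while (waitpid(-1, NULL, WNOHANG) > 0)
        reaped_children++;
    errno = saved_errno;
}

int install_sigchld_reaper(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = reap_children;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART | SA_NOCLDSTOP;
    return sigaction(SIGCHLD, &sa, NULL);
}

/* Demo: fork `count` short-lived children; return how many were reaped. */
int reap_demo(int count)
{
    reaped_children = 0;
    if (install_sigchld_reaper() != 0)
        return -1;
    for (int i = 0; i < count; i++) {
        pid_t pid = fork();
        if (pid < 0)
            return -1;
        if (pid == 0)
            _exit(0);                     /* child: nothing to do */
    }
    struct timespec tick = { 0, 10 * 1000 * 1000 };   /* 10 ms */
    for (int i = 0; i < 500 && reaped_children < count; i++)
        nanosleep(&tick, NULL);           /* wait up to about 5 s */
    return (int)reaped_children;
}
```

The loop inside the handler matters: several children exiting close
together may produce only one SIGCHLD, so a single waitpid() per signal
would leave zombies behind.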

In practical terms, these two cases manifest themselves as a high number
of Nagios and/or Nsca processes, which are being created at a rate
slightly lower than the frequency of service checks being run/incoming
result submission. Eventually, this will cause a crash, as very few
memory management schemes properly deal with the death-by-tiny-bites
situation.

Since my normal solution of installing a, shall we say, more
POSIX-compliant OS on the monitoring systems isn't valid in this
particular Fedora-loving Linux camp, some other solutions need to be
found.

In the short term, the Nsca issue can be avoided by invoking
'/etc/init.d/nsca restart' from Cron every 5 minutes. A dropped result
every 5 minutes is a comparatively small price to pay. The nsca patch
attached sets up a timeout just after the fork for a new connection, which
solves some of the issues.
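
The idea of arming a timeout right after the fork can be sketched as
follows. The patch itself is truncated in the archive, so this is not its
code; the function names are hypothetical and the hanging handler is a
stand-in for a blocked connection.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Stand-in for a connection handler stuck on a blocking call. */
static void handle_connection_that_hangs(void)
{
    for (;;)
        pause();
}

/*
 * Forks a handler child that arms alarm(timeout_seconds) before doing
 * anything else. Returns 1 if the child was terminated by SIGALRM,
 * 0 if it ended some other way, -1 on error.
 */
int forked_handler_times_out(unsigned int timeout_seconds)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        signal(SIGALRM, SIG_DFL);         /* default action: terminate */
        alarm(timeout_seconds);           /* first thing after the fork */
        handle_connection_that_hangs();
        _exit(0);                         /* not reached */
    }

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGALRM;
}
```

This only helps where the alarm is actually delivered and nothing in the
child replaces or cancels it, which matches the "solves some of the
issues" caveat above.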

On some systems, a rarer problem shows itself, making the solution to the
Nagios issue somewhat harder. This problem is when a child process,
inheriting the parent's signal handlers, receives a signal (usually
SIGCHLD, sometimes SIGTERM) and then exits, taking out the parent's
lock/pid file. Thus, one no longer knows which process is the legitimate
parent process.

Tracking down this rare problem (which happens all too often to suit me)
led me to creating the attached Nagios patch, which turns off daemon_mode
right away after forking (so the lock file doesn't get deleted if a stray
signal comes in), resets the signal handlers a bit earlier in the children
(so the parent's signal handlers aren't triggered) and reinstates the
alarm before talking to the parent (rather than no timeout). Overall, I'd
much rather have missing test results (and Nagios trying the service
check again) than have my machines nibbled to death.
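
The handler-reset part of that change can be illustrated with a small
sketch. It does not reproduce the Nagios patch: the cleanup handler here
merely exits with a marker status where the real daemon would remove its
lock file, and every name is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define PARENT_CLEANUP_EXIT 42

/* Stand-in for the daemon's handler, which would delete the lock file. */
static void parent_cleanup_handler(int signo)
{
    (void)signo;
    _exit(PARENT_CLEANUP_EXIT);
}

/* Called in the child right after fork(): drop the inherited handlers. */
static void reset_inherited_handlers(void)
{
    static const int signals[] = { SIGTERM, SIGCHLD, SIGHUP, SIGINT };
    for (size_t i = 0; i < sizeof signals / sizeof signals[0]; i++)
        signal(signals[i], SIG_DFL);
}

/*
 * Returns 1 if the child ran the parent's cleanup handler, 0 if it was
 * killed by the default SIGTERM action instead, -1 on error.
 */
int child_runs_parent_cleanup(int reset_first)
{
    signal(SIGTERM, parent_cleanup_handler);
    pid_t pid = fork();
    if (pid == 0) {
        if (reset_first)
            reset_inherited_handlers();
        raise(SIGTERM);                   /* the "stray" signal */
        _exit(0);
    }
    signal(SIGTERM, SIG_DFL);             /* restore the caller's default */
    if (pid < 0)
        return -1;

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (WIFEXITED(status) && WEXITSTATUS(status) == PARENT_CLEANUP_EXIT)
        return 1;
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGTERM)
        return 0;
    return -1;
}
```

Calling child_runs_parent_cleanup(0) shows the failure mode (the child
runs the parent's cleanup), while child_runs_parent_cleanup(1) shows the
child dying quietly without touching the parent's state.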

With these patches on, the rate of stray process creation has dropped, but
I am still seeing occasional orphaned processes around; i.e., I've fixed
some of the symptoms, but not the actual cause. That will take some more
rewrites.

--==--
Bruce.

[Attachment: nsca-2.5.patch (TEXT/PLAIN, BASE64-encoded)]
...[email truncated]...


This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]