Re: [Nagios-devel] BUG/PATCH: Runaway processes under Linux (and

Posted: Thu Apr 27, 2006 1:49 am
by Guest
On Thu, 27 Apr 2006, Andreas Ericsson wrote:

> bruce wrote:

> Anyways, this:
>
>
> + /* exit with a dirty feeling */
> + static void signal_exit( void ){
> + _exit(1);
> + }
> +
>
> is wrong. The prototype for signal handlers must be
>
> void signal_exit(int signum);
>
> The static keyword is of course optional and valid.
>
> Otherwise it looks like a good patch.

Ah. I bow to your greater C-Fu ;). Duly edited and applied on my working
copy.
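
A minimal sketch of the corrected handler, assuming it is installed with
sigaction(). install_signal_exit() is a hypothetical helper used only for
illustration; it is not part of the patch.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* exit with a dirty feeling */
/* A signal handler takes exactly one int argument: the signal number. */
static void signal_exit(int signum)
{
    (void)signum;   /* unused, but required by the handler prototype */
    _exit(1);       /* _exit() is async-signal-safe; exit() is not */
}

/* Hypothetical helper (not in the patch): install signal_exit() for one
 * signal. Returns 0 on success, -1 on failure, as sigaction() does. */
static int install_signal_exit(int signum)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = signal_exit;
    sigemptyset(&sa.sa_mask);
    return sigaction(signum, &sa, NULL);
}
```

With the original `void signal_exit(void)` prototype, assigning the function
to sa_handler is a type mismatch that the compiler should at least warn about.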

>> On some systems, a rarer problem shows itself, making the solution to the
>> Nagios issue somewhat harder. This problem is when a child process,
>> inheriting the parent's signal handlers, receives a signal (usually
>> SIGCHLD, sometimes SIGTERM) and then exits, taking out the parent's
>> lock/pid file. Thus, one no longer knows which process is the legitimate
>> parent process.
>
> If nagios' grandchildren (the ones that popen() commands) receive SIGCHLD
> from anything but the check they're running, something is very, very wrong
> with the system you're using. Are you perhaps using the old and deprecated
> NGPT-library?

The grandchild occurs in run_system_checks(), and I haven't caught child
processes created from that segment of code removing the lock file,
although this may be unwillingness on my part to fully match up the debug
output ;). (For the record, the thread library in use, according to
'getconf GNU_LIBPTHREAD_VERSION', is 'NPTL 2.3.6'.)

The lock removal instead seems to be occurring in the child process
created in my_system(), which sometimes stalls at a point before the
signal handlers get reset (or they don't get reset at all; my debugging
statements weren't fine-grained enough to tell). If the parent sends a
TERM signal to the child while it is in this state (because of a timeout),
the child runs the signal handlers inherited from the parent, removing the
lock file.
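
One way to close that window, sketched under assumptions: block the signals
around fork() and reset the child's dispositions before unblocking them.
fork_with_default_signals() is a hypothetical name, not a function in the
Nagios source, and the choice of SIGTERM and SIGCHLD simply follows the
signals mentioned in this thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wrapper: fork() so that the child can never run the
 * parent's SIGTERM/SIGCHLD handlers. Returns what fork() returns. */
static pid_t fork_with_default_signals(void)
{
    sigset_t block, saved;
    pid_t pid;
    int saved_errno;

    sigemptyset(&block);
    sigaddset(&block, SIGTERM);
    sigaddset(&block, SIGCHLD);

    /* Hold both signals so neither can be delivered between fork()
     * and the reset below. */
    if (sigprocmask(SIG_BLOCK, &block, &saved) != 0)
        return -1;

    pid = fork();
    saved_errno = errno;

    if (pid == 0) {
        /* Child: drop the inherited handlers while signals are blocked. */
        struct sigaction dfl;

        memset(&dfl, 0, sizeof(dfl));
        dfl.sa_handler = SIG_DFL;
        sigemptyset(&dfl.sa_mask);
        sigaction(SIGTERM, &dfl, NULL);
        sigaction(SIGCHLD, &dfl, NULL);
    }

    /* Parent and child both restore the previous mask. A TERM that
     * arrived in the child meanwhile now gets the default action. */
    sigprocmask(SIG_SETMASK, &saved, NULL);
    errno = saved_errno;
    return pid;
}
```

A threaded caller would want pthread_sigmask() rather than sigprocmask().
A complementary guard is to have the parent's cleanup handler compare
getpid() against the pid that created the lock file before removing it.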

>> With these patches on, the rate of stray process creation has dropped, but
>> I am still seeing occasional orphaned processes around;

Overnight, I had one machine fail due to the death-by-nibbles problem,
which, given its location and sudden lack of a boot sector, will be a
two-banana fix. As an interim fix, the remaining machines are now
restarting Nagios every two hours from cron, although this smacks of
inelegance.

>> i.e., I've fixed some
>> of the symptoms, but not the actual cause. That will take some more
>> rewrites.
>
> Yup. The choice of a FIFO pipe for passing check-results back to the master
> process was unfortunately a bad one which is now irrevocable without major
> code-surgery.

Yes. It has scaling issues which do not show themselves in small
installations (say, under 100 service checks).

--
Bruce Campbell





This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]