Re: [Nagios-devel] BUG/PATCH: Runaway processes under Linux (and


On Thu, 27 Apr 2006, Andreas Ericsson wrote:

> bruce wrote:

> Anyways, this:
>
>
> + /* exit with a dirty feeling */
> + static void signal_exit( void ){
> + _exit(1);
> + }
> +
>
> is wrong. The prototype for signal handlers must be
>
> void signal_exit(int signum);
>
> The static keyword is of course optional and valid.
>
> Otherwise it looks like a good patch.

Ah. I bow to your greater C-Fu ;). Duly edited and applied on my working
copy.
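
For anyone following along, the corrected handler might look roughly like
the sketch below. This is not the exact text of my working copy:
install_exit_handler() is a hypothetical helper added only to show the
handler being registered through sigaction(), and the choice of signal is
left to the caller.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* exit with a dirty feeling; the signature matches what sigaction()
 * expects for sa_handler: void (*)(int) */
static void signal_exit(int signum)
{
    (void)signum;   /* unused, named only to satisfy the prototype */
    _exit(1);       /* _exit() is async-signal-safe, exit() is not */
}

/* Hypothetical helper: register signal_exit() for one signal.
 * Returns 0 on success, -1 on error (as sigaction() does). */
static int install_exit_handler(int signum)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = signal_exit;
    sigemptyset(&sa.sa_mask);
    return sigaction(signum, &sa, NULL);
}
```

With the int parameter in place the compiler can check the handler's type
when it is assigned to sa_handler, which is the point Andreas raised.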

>> On some systems, a rarer problem shows itself, making the solution to the
>> Nagios issue somewhat harder. This problem is when a child process,
>> inheriting the parent's signal handlers, receives a signal (usually
>> SIGCHLD, sometimes SIGTERM) and then exits, taking out the parent's
>> lock/pid file. Thus, one no longer knows which process is the legitimate
>> parent process.
>
> If nagios' grandchildren (the ones that popen() commands) receive SIGCHLD
> from anything but the check they're running, something is very, very wrong with
> the system you're using. Are you perhaps using the old and deprecated
> NGPT-library?

The grandchild occurs in run_system_checks(), and I haven't caught child
processes created from that segment of code removing the lock file,
although this may be unwillingness on my part to fully match up the debug
output ;). (For the record, the thread library in use, according to
'getconf GNU_LIBPTHREAD_VERSION', is 'NPTL 2.3.6'.)

The lock removal instead seems to be occurring with the child process
created in my_system(), which sometimes stalls at a point before the
signal handlers get reset (or they don't get reset, my debugging
statements weren't fine-grained enough). When the parent sends a TERM
signal to the child when it is in this state (due to timeout), the child
runs the signal handlers inherited from the parent, removing the lock
file.

>> With these patches on, the rate of stray process creation has dropped, but
>> I am still seeing occasional orphaned processes around;

Overnight, I had one machine fail due to the death-by-nibbles problem,
which, due to its location and sudden lack of boot sector, will be a
two-banana fix. As an interim fix, the remaining machines are now
restarting Nagios every two hours from cron, although this smacks of
inelegance.

>> i.e., I've fixed some
>> of the symptoms, but not the actual cause. That will take some more
>> rewrites.
>
> Yup. The choice of a FIFO pipe for passing check-results back to the master
> process was unfortunately a bad one which is now irrevocable without major
> code-surgery.

Yes. It has scaling issues which do not show themselves in small
installations (say, under 100 service checks).

--
Bruce Campbell





This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]