Re: [Nagios-devel] Nagios 3.0 hanging (10/19 CVS)
Posted: Mon Oct 22, 2007 7:00 am
Shad L. Lords wrote:
> I've had a few instances where nagios will be running but will fail to run
> checks or process anything. I noticed it this morning and did a quick
> strace of the process to see what it was trying to do (see below). I hope
> this will be of use to someone.
>
It is indeed. Thanks a lot.
> open("/var/spool/nagios", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = -1
> EMFILE (Too many open files)
> open("/var/log/nagios/nagios.log", O_RDWR|O_CREAT|O_APPEND|O_LARGEFILE,
> 0666) = -1 EMFILE (Too many open files)
Here is the primary symptom of the problem, methinks. EMFILE is a pretty
unusual error. There are probably some (or a lot of) code paths in Nagios
where the check result files aren't closed properly, leading to all
sorts of weird errors ...
> clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
> child_tidptr=0xb7fe3708) = -1 ENOMEM (Cannot allocate memory)
... and eventually it runs into the good ole ENOMEM. I'm guessing this
happens because the scheduling queue keeps filling up more or less
indefinitely, and the child processes keep stacking up as well.
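To illustrate the suspected failure mode: a process that keeps opening descriptors without closing them will eventually get EMFILE from every open(), as in the strace output above. The following is a minimal Python sketch, not Nagios code; the function name leak_until_emfile and the limit of 64 are made up for the demonstration. It lowers the soft RLIMIT_NOFILE, leaks descriptors on /dev/null until open() fails, then closes them and restores the limit.

```python
import errno
import os
import resource


def leak_until_emfile(limit=64):
    """Leak descriptors until open() fails; return (errno, leaked count).

    Hypothetical demo helper: 'limit' is an arbitrary small soft limit
    chosen so the failure shows up quickly.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # The soft limit may never exceed the hard limit.
    new_soft = limit if hard == resource.RLIM_INFINITY else min(limit, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))

    leaked = []
    err = None
    try:
        while True:
            # The bug pattern: open a file and never close it.
            leaked.append(os.open(os.devnull, os.O_RDONLY))
    except OSError as exc:
        err = exc.errno
    finally:
        # Clean up so the demo does not break the calling process.
        for fd in leaked:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
    return err, len(leaked)


if __name__ == "__main__":
    err, count = leak_until_emfile()
    print(errno.errorcode.get(err), "after leaking", count, "descriptors")
```

On a live Linux system, counting the entries under /proc/<pid>/fd for the nagios process over time would show whether the descriptor count is in fact growing.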
Personally, I think the only sane thing to do when you get ENOMEM is, in
the absence of garbage collectors to run, to just die as gracefully as
possible with a loud, loud error message in the logs, and possibly
leaving a core dump. kill(0, SIGSEGV) can accomplish that last thing.
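A rough sketch of that policy in Python, under stated assumptions: the names spawn_or_die, fork, and die are hypothetical, and os.abort() (SIGABRT) is swapped in for the SIGSEGV suggestion, since either signal leaves a core dump when core dumps are enabled. Note that kill() with a pid of 0 signals every process in the caller's process group, not only the caller.

```python
import errno
import logging
import os

log = logging.getLogger("scheduler")  # hypothetical logger name


def spawn_or_die(fork=os.fork, die=os.abort):
    """Fork a child; on ENOMEM, log loudly and terminate.

    'fork' and 'die' are injectable only so the policy can be exercised
    without really forking or aborting.
    """
    try:
        return fork()
    except OSError as exc:
        if exc.errno == errno.ENOMEM:
            # Loud message first, then die with a core dump.
            log.critical("fork failed: %s; aborting", exc)
            die()  # os.abort() does not return
        # Other errors (or a stubbed 'die') propagate to the caller.
        raise
```

Injecting the two callables keeps the default behaviour (a real fork, a real abort) while letting the error path be tested in isolation.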
I won't have time to dig into this until tomorrow, but with Ethan
blazing through the codebase he'll probably have it fixed before me
anyway.
--
Andreas Ericsson [email protected]
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]