Just to add our two cents to this: it sounds very much like the problems
we have been experiencing - the latest thread is "3.0b5: External commands
are not turned into passive checks after a while" from the 15th of October.
As described in that mail, we are also seeing irregular memory usage, and
if Nagios runs long enough it will exhaust all available file descriptors,
leaving us with:
root@:/usr/local/nagios/bin# /bin/echo "test"
bash: fork: Resource temporarily unavailable
- until we can squeeze in a "pkill nagios" or similar.
We are also running with embedded Perl. We have just compiled a new
version without it and will try that one out (latest beta, 3.0b5, not
SVN).
Best regards,
Steffen Poulsen
> -----Original message-----
> From: [email protected]
> [mailto:[email protected]] On behalf
> of Andreas Ericsson
> Sent: 22 October 2007 17:01
> To: Nagios Developers List
> Subject: Re: [Nagios-devel] Nagios 3.0 hanging (10/19 CVS)
>
> Shad L. Lords wrote:
> > I've had a few instances where nagios will be running but
> > will fail to run checks or process anything. I noticed it
> > this morning and did a quick strace of the process to see
> > what it was trying to do (see below). I hope this will be
> > of use to someone.
>
> It is indeed. Thanks a lot.
>
> > open("/var/spool/nagios", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = -1 EMFILE (Too many open files)
> > open("/var/log/nagios/nagios.log", O_RDWR|O_CREAT|O_APPEND|O_LARGEFILE, 0666) = -1 EMFILE (Too many open files)
>
>
> Here is the primary symptom of the problem, methinks. EMFILE
> is a pretty unusual error. There are probably some (or a lot)
> of codepaths in Nagios where the check result files aren't
> closed properly, leading to all sorts of weird errors ...
>
> > clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
> > child_tidptr=0xb7fe3708) = -1 ENOMEM (Cannot allocate memory)
>
> ... and eventually it runs into the good ole ENOMEM. I'm
> guessing this happens because the scheduling queue keeps
> filling up more or less indefinitely, and the child processes
> keep stacking up as well.
>
> Personally, I think the only sane thing to do when you get
> ENOMEM is, in the absence of garbage collectors to run, to
> just die as gracefully as possible with a loud, loud error
> message in the logs, and possibly leave a core dump.
> kill(0, SIGSEGV) can accomplish that last part.
>
> I won't have time to dig into this until tomorrow, but with
> Ethan blazing through the codebase he'd probably have it
> fixed before me anyway.
>
> --
> Andreas Ericsson [email protected]
> OP5 AB www.op5.se
> Tel: +46 8-230225 Fax: +46 8-230231
>
This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]