Re: [Nagios-devel] Unexplained nagios crashes
Posted: Tue Aug 21, 2007 5:54 am
> -----Original message-----
> From: [email protected]
> [mailto:[email protected]] On behalf of
> Andreas Ericsson
> Sent: 21 August 2007 10:45
> To: Nagios Developers List
> Subject: Re: [Nagios-devel] Unexplained nagios crashes
>
> What thread library is the customer using (make, model, version,
> everything...)?
> What's the uname -a output?
> If Linux, which scheduler is being used in the kernel?
>
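For anyone wanting to collect the first two answers from inside a program rather than by hand, here is a minimal sketch in C. The function names describe_kernel and describe_threadlib are made up for illustration. The thread-library query relies on the glibc-specific confstr name _CS_GNU_LIBPTHREAD_VERSION, so on other C libraries it may simply report failure. It does not attempt to answer the kernel scheduler question.

```c
#include <stdio.h>
#include <stddef.h>
#include <sys/utsname.h>
#include <unistd.h>

/* Hypothetical helper: write the fields that uname(2) reports
 * (system name, node name, release, version, machine) into buf.
 * Returns 0 on success, -1 on error or if buf is too small. */
int describe_kernel(char *buf, size_t len)
{
    struct utsname u;
    int n;

    if (uname(&u) != 0)
        return -1;
    n = snprintf(buf, len, "%s %s %s %s %s",
                 u.sysname, u.nodename, u.release, u.version, u.machine);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Hypothetical helper: ask the C library which pthread implementation
 * it provides. On glibc this is a string such as "NPTL <version>".
 * Returns 0 on success, -1 if the query is unsupported or buf is
 * too small. */
int describe_threadlib(char *buf, size_t len)
{
#ifdef _CS_GNU_LIBPTHREAD_VERSION
    size_t n = confstr(_CS_GNU_LIBPTHREAD_VERSION, buf, len);
    return (n == 0 || n > len) ? -1 : 0;
#else
    (void)buf;
    (void)len;
    return -1;
#endif
}
```

Calling both with a buffer of a few hundred bytes and printing the results gives something that can be pasted straight into a reply on the list.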
> Duncan Ferguson wrote:
> > Hiya Ethan, list.
> >
> > We are hoping someone may be able to help diagnose what is going on
> > with an obscure problem we have. After going cross-eyed from looking
> > at this over the last few weeks I thought it best to see if anyone
> > else has seen/experienced the same thing.
> >
> > We have a single customer that has been suffering sporadic nagios
> > daemon crashes since June - nothing is unique about their setup that
> > we have been able to find, and other customers have the exact same
> > binaries (and distributed setup with same number of slaves) on the
> > same OS and have had no crashes in the same period of time.
> >
> > Salient points:
> > * this is using a patched nagios 2.8 binary, a patched 1.4b2 ndoutils
> >   broker module and an in-house broker module
> > * the crashes are intermittent and irregular, at no fixed time of day.
> >   Might have three crashes one day, then nothing for two days, then one
> >   crash a day for four days
> > * Studying the core dump, the code bombs out in
> >   commands.c:process_passive_service_checks while traversing the
> >   passive_check_result_list linked list
> >
> > We have added in a bit of extra code to print out the entire
> > passive_check_result_list structure before the fork, and from what we
> > can see in the core dump the list is corrupted midway through - the
> > last readable record has a 'next' pointing to what looks like a valid
> > area of memory, but nothing is there. However,
> > passive_check_result_list_tail has a valid entry, which implies
> > everything was added into the list OK in the first place.
> >
> > So between being added into the linked list and being read from the
> > linked list a record is removed. The list has well below the maximum
> > number of buffer slots so lack of memory isn't the problem (else the
> > tail entry would also be screwed).
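One way to narrow down when the list goes bad is to run a structural check on it at several points (after each insert, before the fork, before traversal) and log the first failure. Below is a minimal sketch in C, assuming a singly linked list with separate head and tail pointers as described above. The struct pcr_node and the function pcr_list_check are hypothetical stand-ins and do not match the real definitions in commands.c.

```c
#include <stddef.h>

/* Hypothetical stand-in for a passive check result record; the real
 * structure in Nagios carries more fields and may be laid out
 * differently. */
struct pcr_node {
    int id;
    struct pcr_node *next;
};

/* Walk the list from head and verify that:
 *   - the walk terminates within max_nodes steps (no cycle or runaway),
 *   - the last node reached is the recorded tail.
 * Returns the number of nodes on success, -1 if the list is
 * inconsistent. This only checks structure: it cannot detect a 'next'
 * pointer that refers to freed or unmapped memory without
 * dereferencing it. */
int pcr_list_check(const struct pcr_node *head,
                   const struct pcr_node *tail,
                   int max_nodes)
{
    const struct pcr_node *cur = head;
    const struct pcr_node *last = NULL;
    int count = 0;

    if (head == NULL)
        return tail == NULL ? 0 : -1;

    while (cur != NULL) {
        if (count >= max_nodes)
            return -1;          /* longer than allowed, or a cycle */
        last = cur;
        cur = cur->next;
        count++;
    }
    return last == tail ? count : -1;
}
```

A check like this would flag the case where the tail is valid but no longer reachable from the head, which is the symptom reported in the core dump. It says nothing about what modified the list in between.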
> >
> > We have been unable to find any code that would cause this behavior
> > (especially when the list is confined to commands.c), particularly as
> > this section is called and used as often as it is and the crashes are
> > few and far between (in comparison).
> >
> > The nagios binary has been compiled with "-ggdb -O0" for debugging
> > purposes and is running on Debian Etch i386 with 4x Intel Xeon 1.86GHz
> > CPUs and 4GB of memory. The core dump, nagios binary and commands.c
> > are available at http://resources.opsview.org/nagios_crash.tar.gz
> >
> > Any insight or help would be appreciated.
> >
> > Duncs
>
> --
> Andreas Ericsson [email protected]
> OP5 AB www.op5.se
> Tel: +46 8-230225 Fax: +46 8-230231
...[email truncated]...
This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]