Re: [Nagios-devel] Unexplained nagios crashes

Which thread library is the customer using (make, model, version, everything...)?
What's the uname -a output?
If Linux, which scheduler is being used in the kernel?
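
One reason these details could matter: if a linked list is appended to by one thread and drained by another without a lock, a node can go missing between the two operations. The following is only a minimal sketch of guarding such a list with a POSIX mutex. The names result_node, list_append and list_detach_all are hypothetical and are not taken from the Nagios source.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical node type, for illustration only. */
struct result_node {
    int id;
    struct result_node *next;
};

static struct result_node *list_head = NULL;
static struct result_node *list_tail = NULL;
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Append a caller-owned node; head and tail only change under the lock. */
static void list_append(struct result_node *n)
{
    n->next = NULL;
    pthread_mutex_lock(&list_lock);
    if (list_tail != NULL)
        list_tail->next = n;
    else
        list_head = n;
    list_tail = n;
    pthread_mutex_unlock(&list_lock);
}

/* Detach the whole list in one step so the caller can walk it
   privately while other threads keep appending to a fresh list. */
static struct result_node *list_detach_all(void)
{
    struct result_node *detached;

    pthread_mutex_lock(&list_lock);
    detached = list_head;
    list_head = NULL;
    list_tail = NULL;
    pthread_mutex_unlock(&list_lock);
    return detached;
}
```

Detaching the list before walking it keeps the traversal outside the critical section, so the lock is held only for a few pointer assignments.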



Duncan Ferguson wrote:
> Hiya Ethan, list.
>
> We are hoping someone may be able to help diagnose what is going on
> with an obscure problem we have. After going cross-eyed from looking
> at this over the last few weeks I thought it best to see if anyone
> else has seen/experienced the same thing.
>
> We have a single customer that has been suffering sporadic nagios
> daemon crashes since June - nothing is unique about their setup that
> we have been able to find and other customers have the exact same
> binaries (and distributed setup with same number of slaves) on the
> same OS and have had no crashes in the same period of time.
>
> Salient points:
> * this is using a patched nagios 2.8 binary, a patched 1.4b2 ndoutils
> broker module and an in-house broker module
> * the crashes are intermittent and irregular, at no fixed time of
> day. Might have three crashes one day, then nothing for two days,
> then one crash a day for four days
> * Studying the core dump, the code bombs out in
> commands.c:process_passive_service_checks while traversing the
> passive_check_result_list linked list
>
> We have added a bit of extra code to print out the entire
> passive_check_result_list structure before the fork, and from what we
> can see in the core dump the list is corrupted midway through: the
> last readable record has a 'next' pointer to what looks like a valid
> area of memory, but nothing is there. However,
> passive_check_result_list_tail has a valid entry, which implies
> everything was added to the list OK in the first place.
>
> So between being added to the linked list and being read from the
> linked list, a record is removed. The list has well below the maximum
> number of buffer slots, so lack of memory isn't the problem (else the
> tail entry would also be screwed).
>
> We have been unable to find any code that would cause this behavior
> (the list is confined to commands.c), especially given that this
> section is called as often as it is and the crashes are few and far
> between (in comparison).
>
> The nagios binary has been compiled with "-ggdb -O0" for debugging
> purposes and is running on Debian Etch i386 with 4x Intel Xeon 1.86GHz
> CPUs and 4GB of memory. The core dump, nagios binary and commands.c
> are available at http://resources.opsview.org/nagios_crash.tar.gz
>
> Any insight or help would be appreciated.
>
> Duncs
>
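
The symptom described above (a chain that stops being readable partway through while the tail pointer still looks valid) is the kind of thing a small consistency check can flag before the crash. What follows is a rough sketch only: pcr_node, pcr_list, pcr_append and pcr_check are hypothetical names, and the count field is an assumption that is not part of the Nagios structures. The check catches a chain that ends early, a chain longer than the recorded count (for example a cycle), and a last node that is not the tail. It cannot safely detect a 'next' that points at freed memory, since following such a pointer is undefined behaviour.

```c
#include <stddef.h>

/* Hypothetical types, for illustration only. */
struct pcr_node {
    int id;
    struct pcr_node *next;
};

struct pcr_list {
    struct pcr_node *head;
    struct pcr_node *tail;
    size_t count;               /* incremented on every append */
};

/* Append a caller-owned node to the end of the list. */
static void pcr_append(struct pcr_list *l, struct pcr_node *n)
{
    n->next = NULL;
    if (l->tail != NULL)
        l->tail->next = n;
    else
        l->head = n;
    l->tail = n;
    l->count++;
}

/* Walk the list and compare what is reachable with what was recorded.
   Returns  0 if the list is consistent,
           -1 if the chain ends before 'count' nodes were seen,
           -2 if the chain holds more than 'count' nodes (possible cycle),
           -3 if the last reachable node is not the recorded tail. */
static int pcr_check(const struct pcr_list *l)
{
    const struct pcr_node *n = l->head;
    const struct pcr_node *last = NULL;
    size_t seen = 0;

    while (n != NULL) {
        if (seen == l->count)
            return -2;          /* bounded walk: never loop forever */
        last = n;
        n = n->next;
        seen++;
    }
    if (seen != l->count)
        return -1;
    if (last != l->tail)
        return -3;
    return 0;
}
```

Calling a check like this immediately after each append and again just before the traversal would narrow down the window in which the record disappears.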


--
Andreas Ericsson [email protected]
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231





This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]