[Nagios-devel] Unexplained nagios crashes


Hiya Ethan, list.

We are hoping someone may be able to help diagnose what is going on
with an obscure problem we have. After going cross-eyed from looking
at this over the last few weeks I thought it best to see if anyone
else has seen/experienced the same thing.

We have a single customer that has been suffering sporadic nagios
daemon crashes since June. Nothing about their setup is unique as far
as we have been able to find: other customers have the exact same
binaries (and a distributed setup with the same number of slaves) on
the same OS and have had no crashes in the same period of time.

Salient points:
* this is using a patched nagios 2.8 binary, a patched 1.4b2 ndoutils
broker module and an in-house broker module
* the crashes are intermittent and irregular, at no fixed time of
day. Might have three crashes one day, then nothing for two days,
then one crash a day for four days
* According to the core dump, the code bombs out in
commands.c:process_passive_service_checks while traversing the
passive_check_result_list linked list

We have added a bit of extra code to print out the entire
passive_check_result_list structure before the fork. From what we
can see in the core dump, the list is corrupted midway through: the
last readable record has a 'next' pointer to what looks like a valid
area of memory, but nothing is there. However,
passive_check_result_list_tail has a valid entry, which implies
everything was added to the list OK in the first place.

So somewhere between being added to the linked list and being read
from it, a record appears to have been removed. The list holds well
below the maximum number of buffer slots, so lack of memory isn't the
problem (else the tail entry would also be screwed).
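
To illustrate the kind of check described above, here is a minimal
sketch in C of a consistency walk that could be run over a list before
the fork. The check_node struct, the check_list function and the status
names are hypothetical stand-ins, not the actual Nagios 2.8 definitions.
It only reports conditions that can be detected safely (the chain ending
before the tail is reached, the tail not being the last node, or an
overlong walk that suggests a cycle). It cannot detect a 'next' pointer
into unmapped memory without faulting itself.

```c
#include <stddef.h>

/* Hypothetical stand-in for a passive check result record; the real
 * Nagios structure carries more fields than this. */
typedef struct check_node {
    int id;
    struct check_node *next;
} check_node;

typedef enum {
    LIST_OK = 0,
    LIST_TAIL_UNREACHABLE, /* walk hit NULL before reaching the tail */
    LIST_TAIL_NOT_LAST,    /* tail was reached but tail->next is not NULL */
    LIST_TOO_LONG          /* more than max_nodes visited: cycle or runaway */
} list_status;

/* Walk from head and verify that tail is reachable and is the last node.
 * max_nodes bounds the walk so a cycle cannot loop forever.
 * If count_out is non-NULL it receives the number of nodes visited. */
list_status check_list(const check_node *head, const check_node *tail,
                       size_t max_nodes, size_t *count_out)
{
    size_t count = 0;
    const check_node *cur;

    if (count_out != NULL)
        *count_out = 0;

    /* An empty list is only consistent if the tail is also NULL. */
    if (head == NULL)
        return (tail == NULL) ? LIST_OK : LIST_TAIL_UNREACHABLE;

    for (cur = head; cur != NULL; cur = cur->next) {
        count++;
        if (count_out != NULL)
            *count_out = count;
        if (count > max_nodes)
            return LIST_TOO_LONG;
        if (cur == tail)
            return (cur->next == NULL) ? LIST_OK : LIST_TAIL_NOT_LAST;
    }

    /* Reached the end of the chain without ever seeing the tail. */
    return LIST_TAIL_UNREACHABLE;
}
```

A result of LIST_TAIL_UNREACHABLE would correspond to the situation
described here, where the tail entry looks valid but the chain from the
head no longer leads to it. Logging the status and node count on every
pass might help narrow down when the list goes bad.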

We have been unable to find any code that would cause this behavior,
especially as the list is confined to commands.c. This section is also
called and used very often, yet the crashes are few and far between
in comparison.

The nagios binary has been compiled with "-ggdb -O0" for debugging
purposes and is running on Debian Etch i386 with 4x Intel Xeon 1.86GHz
CPUs and 4GB of memory. The core dump, nagios binary and commands.c
are available at http://resources.opsview.org/nagios_crash.tar.gz

Any insight or help would be appreciated.

Duncs

This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]