Re: [Nagios-devel] coredumps in wobbly networks


Not sure where this is actually happening. It looks like malloc() is
to blame - not sure why. The only malloc() in the
service_result_worker_thread() routine occurs at line 4736 in
base/utils.c, which looks ok to me.
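
For reference, here is a stripped-down sketch of the pattern in question: a
thread started with a NULL argument that allocates a 512-byte buffer, checks
the result and hands it back. This is not the code from base/utils.c; the
names demo_result_worker, run_demo_worker and RESULT_BUFFER_SIZE are made up
for illustration, and the buffer contents are just the state string from the
logs. The point is only that the pattern is harmless on its own, which is why
a crash inside malloc() often suggests memory that was corrupted earlier
somewhere else rather than a problem at the call site.

```c
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define RESULT_BUFFER_SIZE 512  /* same size as bytes=512 in the backtrace */

/* Hypothetical stand-in for a result worker; NOT the Nagios routine. */
static void *demo_result_worker(void *arg)
{
    (void)arg;  /* NULL here, like arg=0x0 in the backtrace */

    char *buffer = malloc(RESULT_BUFFER_SIZE);
    if (buffer == NULL)
        return NULL;  /* report allocation failure instead of dereferencing */

    /* Bounded write so the heap metadata around the buffer stays intact. */
    snprintf(buffer, RESULT_BUFFER_SIZE, "%s", "foo-host;PING;OK;SOFT;2");
    return buffer;  /* ownership passes to whoever joins the thread */
}

/* Starts one worker, joins it, and returns 0 if the buffer came back intact. */
int run_demo_worker(void)
{
    pthread_t tid;
    void *result = NULL;

    if (pthread_create(&tid, NULL, demo_result_worker, NULL) != 0)
        return -1;
    if (pthread_join(tid, &result) != 0)
        return -1;
    if (result == NULL)
        return -1;

    int ok = strcmp((char *)result, "foo-host;PING;OK;SOFT;2") == 0;
    free(result);
    return ok ? 0 : -1;
}
```

Calling run_demo_worker() from a small main() and building with -pthread is
enough to try it. Running the same program under a memory checker is one way
to see whether an allocation like this is the real culprit or only where the
damage shows up.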

Anyone else have any ideas as to what might be happening?



On 24 Mar 2005 at 12:32, Andreas Ericsson wrote:

> Ahoy.
>
> I've observed a series of most unfortunate SIGSEGV's in Nagios.
> It appears to happen when service checks pop back to OK on the second
> attempt and then something happens (see logs below).
>
> Here are two separate sets of log entries leading up to the crash.
> They are taken from two separate Nagios instances on separate
> machines and, as you can see from the timestamps, the crashes
> occurred at different times (the naglog program used to get
> human-readable time is available at http://oss.op5.se/nagios/naglog.c)
>
> [ crash 1, on primary server ]
> 2005-03-20 22:11:57: Auto-save of retention data completed successfully.
> 2005-03-20 22:25:56: SERVICE ALERT: foo-host;PING;WARNING;SOFT;1;WARNING - x.x.x.x: rta 107 ms, lost 0%
> 2005-03-20 22:26:56: SERVICE ALERT: foo-host;PING;OK;SOFT;2;OK - x.x.x.x: rta 1.82 ms, lost 0%
>
> [ crash 2, on secondary server ]
> 2005-03-21 06:19:41: Auto-save of retention data completed successfully.
> 2005-03-21 06:28:11: SERVICE ALERT: foo-host;PING;WARNING;SOFT;1;WARNING - x.x.x.x: rta 234.926ms, lost 0%
> 2005-03-21 06:29:11: SERVICE ALERT: foo-host;PING;OK;SOFT;2;OK - x.x.x.x: rta 0.150ms, lost 0%
>
>
> Note the "PING;OK;SOFT;2" part. These are the last two log entries
> before the crash on both servers (it's the same host both times,
> actually). The host check command is standard and there are no
> problems with it.
>
> It's worth pointing out that this isn't the latest CVS, but rather
> whichever version was latest on Jan 19 2005. I haven't seen a checkin
> that touches this section of the code though, so I believe the bug
> might still be lurking in there somewhere.
>
> The coredumps for these crashes are largely useless. The backtrace
> points to __libc_malloc() called from the thread started by
> pthread_create(). The thread function is called with a NULL argument,
> and the crash actually takes place at address 0x0.
>
> Here's some of the gdb output (I still have binaries and several
> core-files in case anyone's interested in running more commands).
>
> [ gdb session, core1 ]
> Program terminated with signal 11, Segmentation fault.
> Reading symbols from /lib/libm.so.6...done.
> Loaded symbols for /lib/libm.so.6
> Reading symbols from /lib/libnsl.so.1...done.
> Loaded symbols for /lib/libnsl.so.1
> Reading symbols from /lib/libpthread.so.0...done.
> Loaded symbols for /lib/libpthread.so.0
> Reading symbols from /lib/libc.so.6...done.
> Loaded symbols for /lib/libc.so.6
> Reading symbols from /lib/ld-linux.so.2...done.
> Loaded symbols for /lib/ld-linux.so.2
> Reading symbols from /lib/libnss_files.so.2...done.
> Loaded symbols for /lib/libnss_files.so.2
> #0 0x00000000 in ?? ()
> (gdb) bt
> #0 0x00000000 in ?? ()
> #1 0x001c100b in __libc_malloc (bytes=512) at malloc.c:2695
> #2 0x080612fe in service_result_worker_thread (arg=0x0) at utils.c:4692
> #3 0x00162de2 in pthread_start_thread (arg=0xbf5ffe40) at manager.c:241
> #4 0x0020f70a in thread_start () from /lib/libc.so.6
> (gdb)
> [ end gdb session, core1 ]
>
> The gdb session for core2 is identical.
>
> I'll investigate some more during the holidays and see if I can come
> up with a patch for this or at least some means of debugging it a bit
> more easily.
>
> --
> Andreas Ericsson [email protected]
> OP5 AB www.op5.se
> Lead Developer
>
>


...[email truncated]...


This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]