Re: [Nagios-devel] [Nagios-users] external commands and segfault --




Andreas Ericsson wrote:
> [email protected] wrote:
>> Hey Fellow Nagios-ites:
>>
>> I've been having this *exact* same segfault problem for the last couple o'
>> months.
>>
>> And, after looking at David's stack trace output, it is segfaulting for
>> him in the exact same way/place as it is for me.
>>
>> Here's what I've found:
>>
>> The core dumps that I've examined are all segfaulting when handling the
>> expiration of a scheduled downtime.
>>
>> Since David's stack trace looks identical to mine, I don't think the
>> problem is in the external command processing, as he believes; I think it
>> is in the downtime expiration handling for him as well.
>>
>> Having examined about a dozen of these identical core dumps, I see that it
>> is a corruption of the entire scheduled_downtime structure that is being
>> passed into the handle_scheduled_downtime() function.
>>
>> The handle_scheduled_downtime() function is being invoked by the high
>> priority event processing logic in the event_execution_loop(). So it
>> pulls an EVENT_SCHEDULED_DOWNTIME timed_event structure off of the high
>> priority event list, and then hands it to handle_timed_event(), which in
>> turn invokes the handle_scheduled_downtime() routine to handle the
>> expiration of the specified downtime event.
>>
>> The problem is, the scheduled_downtime structure is already corrupted
>> while sitting in the high_priority list - well before it is dequeued by
>> the event_execution_loop() logic.
>>
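
The dispatch path described above can be sketched roughly as follows. This is
a simplified illustration, not the Nagios source: the struct layouts, the
numeric value of EVENT_SCHEDULED_DOWNTIME, the "handled" field and the helper
run_one_high_priority_event() are all assumptions made up for the example.
Only the function and type names mentioned in the post are taken from it.

```c
#include <stddef.h>

/* Illustrative value only; the real constant lives in the Nagios headers. */
#define EVENT_SCHEDULED_DOWNTIME 1

/* Hypothetical, trimmed-down stand-in for the real structure. */
typedef struct scheduled_downtime {
    unsigned long downtime_id;
    int handled;               /* assumption: set once expiration is processed */
} scheduled_downtime;

/* Hypothetical, trimmed-down timed event: a type tag plus an untyped payload. */
typedef struct timed_event {
    int event_type;
    void *event_data;          /* points at a scheduled_downtime for downtime events */
    struct timed_event *next;
} timed_event;

/* Stand-in for the routine that processes a downtime expiration. */
static int handle_scheduled_downtime(scheduled_downtime *dt) {
    if (dt == NULL)
        return -1;
    dt->handled = 1;           /* a corrupted dt would be dereferenced here */
    return 0;
}

/* Dispatch on the event type, casting the payload to the matching struct. */
static int handle_timed_event(timed_event *ev) {
    if (ev == NULL)
        return -1;
    switch (ev->event_type) {
    case EVENT_SCHEDULED_DOWNTIME:
        return handle_scheduled_downtime((scheduled_downtime *)ev->event_data);
    default:
        return -1;             /* other event types are omitted from this sketch */
    }
}

/* Hypothetical helper: dequeue the head of the high priority list and dispatch it. */
static int run_one_high_priority_event(timed_event **list) {
    timed_event *ev = *list;
    if (ev == NULL)
        return -1;             /* nothing queued */
    *list = ev->next;
    ev->next = NULL;
    return handle_timed_event(ev);
}
```

The point of the sketch is that the payload is only interpreted at dispatch
time, so a structure that was damaged while it sat in the list would not be
noticed until the handler dereferences it.
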
>> I've walked the high priority list in memory with gdb to examine other
>> timed_event structures and have noticed that only the scheduled_downtime
>> structures associated with EVENT_SCHEDULED_DOWNTIME timed events are
>> affected by the memory corruption. In fact, one time, I found nine
>> scheduled downtime expiration events sequentially listed in the high
>> priority list and the first three had their scheduled_downtime structures
>> corrupted and the remaining six were in pristine condition.
>>
>>
>> So, I've narrowed it down to a couple of possibilities (feel free to add
>> your own!):
>>
>> 1. The scheduled_downtime structure is already corrupted when it is being
>> added to the high priority timed event scheduling list, or
>>
>>
>> 2. The scheduled_downtime structure is OK when it is added to the high
>> priority list, but perhaps a bad pointer access is overwriting it with
>> garbage at some other point in the program. This might be somewhat
>> painful to track down.
>>
>>
>> Of the two, I suspect that the second one is the more likely candidate.
>>
>
> I think the first, as it only happens with scheduled downtime stuff.
> Otherwise you'd see it on other high-prio events as well (unless you're
> extremely unlucky each time the crash happens).
>
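
One way to tell the two possibilities apart might be to checksum each
structure at the moment it is queued and verify it again just before it is
handled. The sketch below is only an illustration of that idea under stated
assumptions: downtime_sketch, guarded_downtime, guard_seal() and
guard_intact() are hypothetical names that do not exist in Nagios, and the
fields are a made-up subset of what a downtime entry would carry.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for the real scheduled_downtime struct. */
typedef struct downtime_sketch {
    unsigned long downtime_id;
    long start_time;
    long end_time;
} downtime_sketch;

/* Hypothetical wrapper pairing the data with a checksum taken at enqueue time. */
typedef struct guarded_downtime {
    downtime_sketch data;
    uint32_t checksum;
} guarded_downtime;

/* Fold len bytes into a running 32-bit FNV-1a hash. */
static uint32_t fnv1a_update(uint32_t h, const void *buf, size_t len) {
    const unsigned char *p = (const unsigned char *)buf;
    size_t i;
    for (i = 0; i < len; i++) {
        h ^= p[i];
        h *= 16777619u;        /* FNV-1a 32-bit prime */
    }
    return h;
}

/* Hash field by field so that struct padding never enters the checksum. */
static uint32_t downtime_checksum(const downtime_sketch *d) {
    uint32_t h = 2166136261u;  /* FNV-1a 32-bit offset basis */
    h = fnv1a_update(h, &d->downtime_id, sizeof d->downtime_id);
    h = fnv1a_update(h, &d->start_time, sizeof d->start_time);
    h = fnv1a_update(h, &d->end_time, sizeof d->end_time);
    return h;
}

/* Call when the entry is added to the event list. */
static void guard_seal(guarded_downtime *g) {
    g->checksum = downtime_checksum(&g->data);
}

/* Call before handling: returns 1 if the data still matches, 0 otherwise. */
static int guard_intact(const guarded_downtime *g) {
    return g->checksum == downtime_checksum(&g->data);
}
```

If the data already looks wrong when it is sealed, that points at the first
possibility. If the checksum matches at enqueue time but fails later, that
points at the second, and a gdb watchpoint on the affected address would be
the natural next step.
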
>> Some other notes:
>>
>> 1. The timed event expirations that segfault Nagios seem to be "randomly"
>> chosen.
>>
>> We have some regularly submitted (via cron) scheduled downtimes that will
>> work fine for weeks, and then one of them will come up for expiration and
>> trigger this scheduled-downtime-expiration bug. I've also seen it happen
>> with ad-hoc scheduled downtime submissions via the CGI interface.
>>
>> I've seen it happen with "regular" scheduled downtimes as well as the new
>> "triggered" scheduled downtime. We thought it might have been related to
>> the new triggered downtime, since that was one of the first events causing
>> a segfault. But then after eliminating the use of triggered downtimes
>> altogether, the segfaults still occur with the regular scheduled downtime
>> expirations.
>>
>> 2. I've had this problem with Nagios 2.4, 2.5 and 2.6. So, "upgrading"
>> hasn't gotten rid of it.
>>
>> 3. We are currently running Nagios 2.6 on a 64-bit Linux platform: SLES-9
>> x86-64, Kernel 2.6.5-7.267-smp
>>
>
> This is the culprit, I guess. As this isn't a widespread problem, I
> wouldn't be surprised if it's related to 64-bit archs (kernel-2.6.5 is
> fairly ancient too, but that shouldn't matter as this is the only app
> you're seeing it in).
>
> I'm guessing this actually is an SMP-system and that SuSE doesn't
> i

...[email truncated]...


This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]