Re: [Nagios-devel] [Nagios-users] external commands and segfault --
Posted: Mon Jan 08, 2007 9:40 am
[email protected] wrote:
> Hey Fellow Nagios-ites:
>
> I've been having this *exact* same segfault problem for the last couple o'
> months.
>
> And, after looking at David's stack trace output, it is segfaulting for
> him in the exact same way/place as it is for me.
>
> Here's what I've found:
>
> The core dumps that I've examined are all segfaulting when handling the
> expiration of a scheduled downtime.
>
> Since David's stack trace looks identical to mine, I don't think it is in
> the external command processing, as he believes, but it is in the downtime
> expiration handling, as well.
>
> Having examined about a dozen of these identical core dumps, I see that it
> is a corruption of the entire scheduled_downtime structure that is being
> passed into the handle_scheduled_downtime() function.
>
> The handle_scheduled_downtime() function is being invoked by the high
> priority event processing logic in the event_execution_loop(). So it
> pulls an EVENT_SCHEDULED_DOWNTIME timed_event structure off of the high
> priority event list, and then hands it to handle_timed_event(), which in
> turn invokes the handle_scheduled_downtime() routine to handle the
> expiration of the specified downtime event.
>
> The problem is, the scheduled_downtime structure is already corrupted
> while sitting in the high_priority list - well before it is dequeued by
> the event_execution_loop() logic.
>
> I've walked the high priority list in memory with gdb to examine other
> timed_event structures and have noticed that only the scheduled_downtime
> structures associated with EVENT_SCHEDULED_DOWNTIME timed events are
> affected by the memory corruption. In fact, one time, I found nine
> scheduled downtime expiration events sequentially listed in the high
> priority list and the first three had their scheduled_downtime structures
> corrupted and the remaining six were in pristine condition.
>
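For anyone who wants to repeat that list walk in code rather than by hand in gdb, here is a minimal sketch. It is not Nagios source: demo_event, the DEMO_EVENT_* codes, demo_first_bad_downtime() and demo_id_nonzero() are made-up names for illustration, and the "id must be nonzero" test is only a placeholder for whatever sanity check fits the real structure.

```c
#include <stddef.h>

/* Hypothetical event type codes, not the values Nagios uses. */
#define DEMO_EVENT_SCHEDULED_DOWNTIME 1
#define DEMO_EVENT_OTHER 2

/* Hypothetical stand-in for a timed event on a singly linked list. */
typedef struct demo_event {
    int event_type;
    void *event_data;          /* payload; a downtime record for downtime events */
    struct demo_event *next;
} demo_event;

/* A payload check returns 1 if the payload looks sane, 0 if not. */
typedef int (*demo_check_fn)(const void *data);

/* Placeholder check: treats the payload as starting with an unsigned long
 * id and calls it bad if the pointer is NULL or the id is 0. */
int demo_id_nonzero(const void *data) {
    return data != NULL && *(const unsigned long *)data != 0UL;
}

/* Walks the list and returns the 1-based position of the first downtime
 * event whose payload fails the check, or 0 if every downtime payload
 * passes. Events of other types are skipped without being checked. */
int demo_first_bad_downtime(const demo_event *head, demo_check_fn check) {
    int pos = 1;
    for (const demo_event *ev = head; ev != NULL; ev = ev->next, pos++) {
        if (ev->event_type != DEMO_EVENT_SCHEDULED_DOWNTIME)
            continue;
        if (!check(ev->event_data))
            return pos;
    }
    return 0;
}
```

Calling something like this each time an event is added to the list, and again just before one is dequeued, would show roughly when a payload goes bad.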
>
> So, I've narrowed it down to a couple of possibilities (feel free to add
> your own!):
>
> 1. The scheduled_downtime structure is already corrupted when it is being
> added to the high priority timed event scheduling list, or
>
>
> 2. The scheduled_downtime structure is OK when it is added to the high
> priority list, but perhaps a bad pointer access is overwriting it with
> garbage at some other point in the program. This might be somewhat
> painful to track down.
>
>
> Of the two, I suspect that the second one is the more likely candidate.
>
I think the first, as it only happens with scheduled downtime stuff.
Otherwise you'd see it on other high-prio events as well (unless you're
extremely unlucky each time the crash happens).
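One way to tell the two cases apart would be to put guard words around the downtime data and check them both when the event is queued and when it is handled. Below is a rough sketch of that idea, assuming a made-up demo_downtime structure, an arbitrary DT_MAGIC value, and hypothetical helper names; none of it is taken from the Nagios source.

```c
#include <stddef.h>
#include <string.h>

#define DT_MAGIC 0x44544D45UL  /* arbitrary marker value, chosen for this sketch */

/* Hypothetical stand-in for a downtime record, with a guard word at each end. */
typedef struct demo_downtime {
    unsigned long magic_head;
    unsigned long downtime_id;
    char host_name[32];
    unsigned long magic_tail;
} demo_downtime;

/* Fills in the record and sets both guard words. */
void demo_downtime_init(demo_downtime *dt, unsigned long id, const char *host) {
    dt->magic_head = DT_MAGIC;
    dt->downtime_id = id;
    strncpy(dt->host_name, host, sizeof(dt->host_name) - 1);
    dt->host_name[sizeof(dt->host_name) - 1] = '\0';
    dt->magic_tail = DT_MAGIC;
}

/* Returns 1 if the pointer is non-NULL and both guard words are intact,
 * 0 otherwise. */
int demo_downtime_ok(const demo_downtime *dt) {
    return dt != NULL
        && dt->magic_head == DT_MAGIC
        && dt->magic_tail == DT_MAGIC;
}
```

If the check already fails at the point where the event is added, that points at the first possibility; if it passes there and fails at expiration, something overwrote the record in between.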
>
> Some other notes:
>
> 1. The timed event expirations that segfault Nagios seem to be "randomly"
> chosen.
>
> We have some regularly submitted (via cron) scheduled downtimes that will
> work fine for weeks, and then one of them will come up for expiration and
> trigger this scheduled-downtime-expiration bug. I've also seen it happen
> with ad-hoc scheduled downtime submissions via the CGI interface.
>
> I've seen it happen with "regular" scheduled downtimes as well as the new
> "triggered" scheduled downtime. We thought it might have been related to
> the new triggered downtime, since that was one of the first events causing
> a segfault. But then after eliminating the use of triggered downtimes
> altogether, the segfaults still occur with the regular scheduled downtime
> expirations.
>
> 2. I've had this problem with Nagios 2.4, 2.5 and 2.6. So, "upgrading"
> hasn't gotten rid of it.
>
> 3. We are currently running Nagios 2.6 on a 64-bit Linux platform: SLES-9
> x86-64, Kernel 2.6.5-7.267-smp
>
This is the culprit, I guess. As this isn't a widespread problem, I
wouldn't be surprised if it's related to 64-bit archs (kernel-2.6.5 is
fairly ancient too, but that shouldn't matter as this is the only app
you're seeing it in).
I'm guessing this actually is an SMP-system and that SuSE doesn't
install SMP kernels on all systems, correct? If so, this could also be a
source of problems for you. Nagios doesn't follow the
...[email truncated]...
This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]