Re: [Nagios-devel] Bug report: downtimes beyond 2038 cause event

Support forum for Nagios Core, Nagios Plugins, NCPA, NRPE, NSCA, NDOUtils and more. Engage with the community of users including those using the open source solutions.
Locked
Guest

Re: [Nagios-devel] Bug report: downtimes beyond 2038 cause event

Post by Guest »

On 04/08/2013 02:12 PM, Ton Voon wrote:
>
> On 4 Apr 2013, at 22:55, Andreas Ericsson wrote:
>>> This fails on CentOS 5 64bit, though appears to work on Debian Squeeze 32bit, so it maybe a 64 bit only issue.
>>>
>>> We think this is an issue when the event is scheduled via squeue_add(). We've managed to get the test-squeue to fail by changing the time value to be greater than 2038 with the following:
>>>
>>> Index: test-squeue.c
>>> ===================================================================
>>> --- test-squeue.c (revision 2716)
>>> +++ test-squeue.c (working copy)
>>> @@ -116,7 +116,7 @@
>>> sq_test_random(sq);
>>> t(squeue_size(sq) == 0, "Size should be 0 after first sq_test_random");
>>>
>>> - t((a.evt = squeue_add(sq, time(NULL) + 9, &a)) != NULL);
>>> + t((a.evt = squeue_add(sq, time(NULL)*2, &a)) != NULL);
>>> t(squeue_size(sq) == 1);
>>> t((b.evt = squeue_add(sq, time(NULL) + 3, &b)) != NULL);
>>> t(squeue_size(sq) == 2);
>>>
>>> This gives the test result of:
>>>
>>> ### squeue tests
>>> FAIL max >> FAIL x == &b @test-squeue.c:133
>>> FAIL x->id == b.id @test-squeue.c:134
>>> FAIL x == &c @test-squeue.c:141
>>> about to fail pretty fucking hard...
>>> ea: 0xbfe065e0; &b: 0xbfe065d8; &c: 0xbfe065d0; ed: 0xbfe065c8; x: 0xbfde9b80
>>> FAIL x == &b @test-squeue.c:152
>>> FAIL x->id == b.id @test-squeue.c:153
>>> FAIL x == &b @test-squeue.c:160
>>> FAIL x->id == b.id @test-squeue.c:161
>>> FAIL x == &c @test-squeue.c:166
>>> FAIL x->id == c.id @test-squeue.c:167
>>> Test results: 390637 passed, 10 failed
>>>
>>> Changing to a factor of 1.1 instead of 2 passes:
>>>
>>
>> I'm not surprised. 1.1 would mean it's still within the unix timeframe.
>>
>> What's the size of time_t, long and struct timeval on systems where it
>> fails?
>> What's the sizes on systems where it succeeds?
>
> With the recreation steps, Nagios 4 works fine on rhel5 32bit, but fails on rhel5 64bit.
>
> sizes.c:
>
> #include
> #include
> #include
> #include
> #include
> #include
> #include
> #include "pqueue.h"
>
> int main(int argc, char **argv)
> {
> struct timeval tv;
>
> printf("long = %d\n", sizeof(long));
> printf("time_t = %d\n", sizeof(time_t));
> printf("tv = %d\n", sizeof(tv));
> printf("pqueue_pri_t = %d\n", sizeof(pqueue_pri_t));
> return 0;
>
> }
>
> RHEL5 32 bit:
> long = 4
> time_t = 4
> tv = 8
> pqueue_pri_t = 8
>
>
> RHEL5 64 bit:
> long = 8
> time_t = 8
> tv = 16
> pqueue_pri_t = 8
>

Ah. Obviously. We're truncating 16 bytes down to 8 for priority,
checking, but retain the 16 bytes to check if it's time to actually
run the event yet, so we hang in the event execution loop when
we come across such events.

I've committed a fix which will allow events to be scheduled
until year 141166 for 64-bit systems (43 bits). It's not
possible to extend the time for 32-bit systems, so if you think
you managed to schedule things beyond year 2038, you were wrong.

That last point is sort of important though. It's still possible
to schedule things that appear to be far into the future that
actually run straight away. We can't do anything about that, because
in lib/squeue we can't differ between "far in the past" or "far into
the future", since the value is already wrapped when we get it there.

That means it's considered buggy user-behaviour to try to schedule
things beyond year 2038 on 32-bit systems. We won't support it, ever,
because we can't. It's also considered a buggy user-behaviour to try
to schedule things beyond year 141166 on 64-bit systems. Both actions
will result in "undefined behaviour", which means we'll run the event
whenever the truncated value happens to suggest it should be run (so
it's actually pretty well-defined behaviour but will be hard for users
to verify).

We will not, however, stop running events anymore.

>>
>> Well, modifying the evt_compute_pri() algorithm to discard
>> everything but the 21 least significant bits of the tv->tv_usec


...[email truncated]...


This post was automatically imported from historical nagios-devel mailing list archives
Original poster: ae@op5.se
Locked