
Re: [Nagios-devel] Bug report: downtimes beyond 2038 cause event

Posted: Mon Apr 08, 2013 11:13 am
by Guest

On 4 Apr 2013, at 22:55, Andreas Ericsson wrote:
>> This fails on CentOS 5 64bit, though it appears to work on Debian Squeeze 32bit, so it may be a 64-bit-only issue.
>>
>> We think this is an issue when the event is scheduled via squeue_add(). We've managed to make test-squeue fail by changing the time value to one beyond 2038 with the following:
>>
>> Index: test-squeue.c
>> ===================================================================
>> --- test-squeue.c (revision 2716)
>> +++ test-squeue.c (working copy)
>> @@ -116,7 +116,7 @@
>> sq_test_random(sq);
>> t(squeue_size(sq) == 0, "Size should be 0 after first sq_test_random");
>>
>> - t((a.evt = squeue_add(sq, time(NULL) + 9, &a)) != NULL);
>> + t((a.evt = squeue_add(sq, time(NULL)*2, &a)) != NULL);
>> t(squeue_size(sq) == 1);
>> t((b.evt = squeue_add(sq, time(NULL) + 3, &b)) != NULL);
>> t(squeue_size(sq) == 2);
>>
>> This gives the test result of:
>>
>> ### squeue tests
>> FAIL max > FAIL x == &b @test-squeue.c:133
>> FAIL x->id == b.id @test-squeue.c:134
>> FAIL x == &c @test-squeue.c:141
>> about to fail pretty fucking hard...
>> ea: 0xbfe065e0; &b: 0xbfe065d8; &c: 0xbfe065d0; ed: 0xbfe065c8; x: 0xbfde9b80
>> FAIL x == &b @test-squeue.c:152
>> FAIL x->id == b.id @test-squeue.c:153
>> FAIL x == &b @test-squeue.c:160
>> FAIL x->id == b.id @test-squeue.c:161
>> FAIL x == &c @test-squeue.c:166
>> FAIL x->id == c.id @test-squeue.c:167
>> Test results: 390637 passed, 10 failed
>>
>> Changing to a factor of 1.1 instead of 2 passes:
>>
>
> I'm not surprised. 1.1 would mean it's still within the Unix timeframe.
>
> What's the size of time_t, long and struct timeval on systems where it fails?
> What are the sizes on systems where it succeeds?

Using the reproduction steps above, Nagios 4 works fine on RHEL5 32-bit but fails on RHEL5 64-bit.

sizes.c:

#include <stdio.h>
#include <sys/time.h>
#include <time.h>
#include "pqueue.h"	/* provides pqueue_pri_t */

int main(void)
{
	struct timeval tv;

	/* sizeof yields a size_t, so print it with %zu rather than %d */
	printf("long = %zu\n", sizeof(long));
	printf("time_t = %zu\n", sizeof(time_t));
	printf("tv = %zu\n", sizeof(tv));
	printf("pqueue_pri_t = %zu\n", sizeof(pqueue_pri_t));
	return 0;
}

RHEL5 32-bit:
long = 4
time_t = 4
tv = 8
pqueue_pri_t = 8

RHEL5 64-bit:
long = 8
time_t = 8
tv = 16
pqueue_pri_t = 8

> Does time_t differ in signedness on them?

Not sure how to check this.

> I think a runtime check based on those sizes should work just fine, and
> also be optimized away so we don't actually have to pay for it, but I'm
> curious to see where it actually goes wrong. If it's before we get to
> see the number in squeue.c, we're pretty much fscked, as the only option
> then is a macro which does voodoo-casting so the squeue API sees the
> right number.
>
>> ### squeue tests
>> Test results: 390647 passed, 0 failed
>>
>> This worked in Nagios 3, so we're guessing that the change to use the squeue library for events is probably where this limitation has come in.
>>
>> Any thoughts?
>>
>
> Well, modifying the evt_compute_pri() algorithm to discard
> everything but the 21 least significant bits of the tv->tv_usec
> would allow us to use 43 bits for the seconds value. That would
> land us somewhere in the year 141234 before we run out of seconds.
> It's not a real fix though, since we could live with discarding
> events that are patently absurd, but blocking the entire scheduler
> because we get a bogus date is just plain wrong.

I've changed the code so it now looks like this:

static pqueue_pri_t evt_compute_pri(struct timeval *tv)
{
	pqueue_pri_t ret;

	/* keep weird compilers on 32-bit systems from doing wrong */
	if(sizeof(pqueue_pri_t) tv_sec;
	ret += !!tv->tv_

...[email truncated]...


This post was automatically imported from historical nagios-devel mailing list archives
Original poster: ton.voon@opsview.com