[Nagios-devel] Antwort: Re: Check becomes unplanned


Hi Bernd,
hi Andreas,

> To alleviate your issue, you should be running an ntp daemon
> on the Nagios server which slews the clock into its right
> time rather than sets it (slew = make it go slightly faster
> or slower until it matches the correct time). Are you running
> ntpdate via a cronjob or something?
>
> I'm not sure how one would go about debugging this, as the
> time required to run a single test is prohibitive for rapid
> repeated testing.

I have run into this problem before and started debugging it,
so I'll share what I know so far. Sadly, I haven't yet had the time
to pin down a solution and produce a patch.
I'm not a big fan of C ;)

How to reproduce it:

- define a check "freaky_check" with a limited check_period (let's
call it 7to11) and a check_interval of 3
- produce steady backward time shifts (nagios running in a VM, anyone?)

What happens:

1. It's 11pm; nagios schedules freaky_check for 7am according to its
check_period.
2. Every X minutes the time shifts by -1 sec (jittering time source).
3. nagios tries to compensate and adjusts _all_ checks by the timeshift
(next_check = next_check - timeshift).
4. Time goes by from 11pm to 6am, with the clock shifted back by, let's
say, 8 minutes in total.
5. freaky_check is now scheduled for 6:52am because of the timeshifts.
6. It's 6:52am and nagios tries to run freaky_check according to the
schedule.
7. The sanity check says: ERROR: check outside check_period.
8. nagios tries to compensate with a strange logic: next_check =
next_check + check_interval, and just hopes it will fit.
9. nagios reruns the sanity check: FATAL ERROR: check still outside
check_period - I have no clue what to do: rescheduling freaky_check:
next_check = next_check + 1year.
10. The user is puzzled and nagios thinks it's all cool.
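
The steps above can be played through with plain numbers. The following is
only a rough sketch and not Nagios code: in_check_period() and
next_check_after_drift() are made-up names, the "7to11" period is assumed to
mean 07:00 to 23:00, and times are simple seconds since midnight instead of
real time_t values. It follows the numbered steps literally, with exactly one
retry of one check_interval; how the real scheduler counts intervals is not
modeled here.

```c
#include <assert.h>

#define MINUTE 60L
#define HOUR   (60L * MINUTE)
#define DAY    (24L * HOUR)
#define YEAR   (365L * DAY)

/* Hypothetical check_period "7to11": 07:00 inclusive to 23:00 exclusive.
   t is seconds since midnight of day 0 (an assumption of this sketch). */
static int in_check_period(long t)
{
    long time_of_day = t % DAY;
    return time_of_day >= 7L * HOUR && time_of_day < 23L * HOUR;
}

/* Plays through steps 3 to 9 of the list for one check.
   drift is the accumulated backward timeshift in seconds. */
long next_check_after_drift(long next_check, long drift, long check_interval)
{
    assert(drift >= 0 && check_interval > 0);

    /* steps 3-5: the schedule is moved back by the accumulated timeshift */
    next_check -= drift;
    if (in_check_period(next_check))
        return next_check;              /* still inside the period, no problem */

    /* step 8: add one check_interval and hope it fits */
    next_check += check_interval;
    if (in_check_period(next_check))
        return next_check;

    /* step 9: still outside the period, push the check one year ahead */
    return next_check + YEAR;
}
```

With a check planned for 07:00, 8 minutes of drift and a 3 minute interval,
the sketch ends up at 06:55 plus one year; with only 2 minutes of drift the
single retry lands at 07:01 and the check survives.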

Conclusion:

This behaviour turns up when the following criteria are met:

- check has a reduced check_period
- time is shifting back
- the timeshift outside the check_period is greater than 2 times the
check_interval

You can look it up in base/checks.c, for example within the
run_scheduled_service_check(service *svc, int check_options, double latency)
function.

After some basic checks this will be run:

/* attempt to run the check */
result=run_async_service_check(svc,check_options,latency,TRUE,TRUE,&time_is_valid,&preferred_time);

which in turn ends up with:

/* is the service check viable at this time? */
if(check_service_check_viability(svc,check_options,time_is_valid,preferred_time)==ERROR)
    return ERROR;

No: since nagios shifted it outside its check_period, the time is NOT
valid.

Back in run_scheduled_service_check we now enter the if(result==ERROR)
branch:

/* get current time */
time(&current_time);

/* determine next time we should check the service if needed */
/* if service has no check interval, schedule it again for 5 minutes from now */
if(current_time>=preferred_time)
    preferred_time=current_time+((svc->check_interval<=0)?300:(svc->check_interval*interval_length));

COMMENT: nagios added the check_interval to preferred_time

/* make sure we rescheduled the next service check at a valid time */
get_next_valid_time(preferred_time,&next_valid_time,svc->check_period_ptr);

COMMENT: No, it didn't, as adding check_interval was not enough to
compensate for the backward shift in time

/* the service could not be rescheduled properly - set the next check time
for next year, but don't actually reschedule it */
if(time_is_valid==FALSE && next_valid_time==preferred_time){

COMMENT: nagios is bailing out here and just adds 1 year to
preferred_time to get the scheduler running again

svc->next_check=3D(time_t)(next_valid_time+(60*60*24*365));
svc->should_be_scheduled=3DFALSE;

log_debug_info(DEBUGL_CHECKS,1,"Unable to find any valid times to reschedule the next service check!\n");
}
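
To experiment with the branch quoted so far without a running Nagios, it can
be modeled in isolation. This is a sketch under assumptions, not the real
function: reschedule_outcome, reschedule_after_error() and the two lookup
helpers are hypothetical names, and the timeperiod lookup is passed in as a
callback because the real get_next_valid_time() needs Nagios' timeperiod
structures. The "else" path (use the time that was found and keep the check
scheduled) is also an assumption about the part of the code not shown here.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical result type: the two service fields the quoted code touches. */
typedef struct {
    time_t next_check;
    int should_be_scheduled;
} reschedule_outcome;

/* Hypothetical stand-in for get_next_valid_time(): returns the next valid
   time at or after preferred_time for some check_period. */
typedef time_t (*next_valid_time_fn)(time_t preferred_time);

reschedule_outcome reschedule_after_error(time_t current_time,
                                          time_t preferred_time,
                                          int time_is_valid,
                                          double check_interval,
                                          int interval_length,
                                          next_valid_time_fn lookup)
{
    reschedule_outcome out;
    time_t next_valid_time;

    assert(lookup != NULL);

    /* no check interval: try again in 5 minutes, else one interval from now */
    if (current_time >= preferred_time)
        preferred_time = current_time +
            (time_t)((check_interval <= 0) ? 300 : (check_interval * interval_length));

    next_valid_time = lookup(preferred_time);

    if (time_is_valid == 0 && next_valid_time == preferred_time) {
        /* the bail-out from the post: one year ahead, not rescheduled */
        out.next_check = next_valid_time + (time_t)(60 * 60 * 24 * 365);
        out.should_be_scheduled = 0;
    } else {
        /* ASSUMPTION: otherwise the time that was found is simply used */
        out.next_check = next_valid_time;
        out.should_be_scheduled = 1;
    }
    return out;
}

/* Hypothetical lookup that finds nothing better and hands the input back. */
time_t lookup_returns_input(time_t preferred_time)
{
    return preferred_time;
}

/* Hypothetical lookup for a period that opens at 07:00 (25200 s) on day 0. */
time_t lookup_opens_at_seven(time_t preferred_time)
{
    const time_t seven_am = 7 * 60 * 60;
    return preferred_time < seven_am ? seven_am : preferred_time;
}
```

Calling it with current_time and preferred_time both at 06:52 (24720 s),
time_is_valid set to 0, a check_interval of 3 and an interval_length of 60
makes the difference visible: with lookup_returns_input the check lands one
year ahead and is no longer scheduled, while with lookup_opens_at_seven it
would be moved to 07:00.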

/* this service could be

...[email truncated]...


This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]