Re: [Nagios-devel] Race condition in freshness checking

Post by **Guest** » Mon Sep 24, 2007 11:47 am

Ton Voon wrote:
> Hi!
>
> We found a bug in the calculation of the latency for a passive check. This has
> highlighted a possible race condition re: freshness checking. We wanted to get
> some ideas on what is the best approach to fix this.
>
> Background:
>
> We have a master/slave arrangement, with freshness checking
> (freshness_threshold=0) of slave services on the master.
>
> Looking in the NDO db, we realised that the latency values for passive results
> were incorrectly calculated - sometimes latency values could be 1000x out. The
> patch is attached. However, since using this patch, we've seen occasional race
> conditions.
>
> Problem:
>
> Within checks.c:check_service_result_freshness, if a service has passed its
> expiration_time, it is marked as is_being_freshened and a forced service check
> is scheduled. However, if a passive result for this service is processed before
> this forced check is run, then the service is marked as stale and the state is
> inconsistent between master and slave.
>
> Possible solutions:
>
> - If a check result is processed with is_being_freshened set for the service,
> then remove the forced check from the schedule if it exists.

Sounds like a good solution, since the service will be marked as 'is_being_checked'
when the check actually runs, in which case it's pointless to update the status
as it will be overwritten by the master's own active check anyway.
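
A rough sketch of that first option, in case it helps the discussion. The
struct and function names here (sketch_service, forced_check_pending,
sketch_handle_passive_result) are made up for illustration and are not the real
Nagios structures or scheduler API; the real code would have to find and remove
the event from the scheduling queue rather than flip a flag.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical, simplified stand-in for a service; not the Nagios struct. */
typedef struct sketch_service {
    int    is_being_freshened;   /* set when the freshness check found it stale */
    int    forced_check_pending; /* assumed flag: a forced check is queued      */
    time_t last_check;           /* time of the most recent result              */
    int    current_state;        /* last reported state                         */
} sketch_service;

/*
 * Process a passive result. If the service was waiting on a forced
 * freshness check, drop that check, since the passive result makes it
 * redundant. Returns 1 if a pending forced check was cancelled, else 0.
 */
int sketch_handle_passive_result(sketch_service *svc, int state, time_t when)
{
    int cancelled = 0;

    assert(svc != NULL);

    if (svc->is_being_freshened && svc->forced_check_pending) {
        svc->forced_check_pending = 0; /* stands in for unscheduling the event */
        cancelled = 1;
    }

    /* The passive result is now the authoritative state. */
    svc->is_being_freshened = 0;
    svc->current_state = state;
    svc->last_check = when;

    return cancelled;
}
```

The point of clearing is_being_freshened in the same place is that the flag and
the queued check never disagree once a result has been accepted.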

> - Change is_being_freshened to stale_time (0 if not stale). On running the
> forced check, if stale_time is less than last_check_time (+ latency?), break out
> of running the forced check.
>

This I didn't quite get. You mean the passive check should alter the figure passed
in is_being_freshened? If so, what if stale_time is exactly 1? How can Nagios then
determine that it's actually received a result rather than just being updated by
the passive check-result coming in?

I'm sure you thought of it, but the simplest way should be to re-check the time
since the last check result arrived when the forced check is about to run, and
cancel it if the result is fresh enough by then. That way you'll keep the change
to a single spot in the code and it'll be quite maintainable, provided a comment
is added that explains the anomaly in the code-path.
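
Something along these lines is what I have in mind. This is only a sketch under
assumptions: fresh_service and sketch_should_run_forced_check are hypothetical
names, and the threshold is taken as an explicit number of seconds rather than
however Nagios derives it internally.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical, simplified stand-in for a service; not the Nagios struct. */
typedef struct fresh_service {
    time_t last_check;          /* time the most recent result arrived    */
    int    freshness_threshold; /* assumed: explicit threshold in seconds */
    int    is_being_freshened;  /* set when the freshness check fired     */
} fresh_service;

/*
 * Called just before a forced check would execute.
 * Returns 1 if the check should run, 0 if it should be skipped.
 *
 * NOTE: a passive result may arrive between the freshness check queuing
 * this forced check and the check actually running. If that happened,
 * the service is fresh again and the forced check must not overwrite it.
 */
int sketch_should_run_forced_check(fresh_service *svc, time_t now)
{
    assert(svc != NULL);

    if (!svc->is_being_freshened)
        return 1; /* not a freshness-triggered check; run as usual */

    if (svc->last_check + svc->freshness_threshold >= now) {
        /* A result came in while we were queued; nothing left to freshen. */
        svc->is_being_freshened = 0;
        return 0;
    }

    return 1; /* still stale, go ahead */
}
```

With last_check = 1000 and a 60 second threshold, a forced check due at 1030
would be skipped, while one due at 1100 would still run.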

--
Andreas Ericsson [email protected]
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231

This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]