[Nagios-devel] Race condition in freshness checking

Guest · Post by **Guest** » Mon Sep 24, 2007 9:56 am

Hi!

We found a bug in the calculation of the latency for a passive check.
This has highlighted a possible race condition re: freshness
checking. We wanted to get some ideas on what is the best approach to
fix this.

Background:

We have a master/slave arrangement, with freshness checking
(freshness_threshold=0) of slave services on the master.

Looking in the NDO db, we realised that the latency values for
passive results were incorrectly calculated - sometimes latency values
could be 1000x out. The patch is attached. However, since using this
patch, we've seen occasional race conditions.

Problem:

Within checks.c:check_service_result_freshness, if a service has passed
its expiration_time, it is marked as is_being_freshened and a forced
service check is scheduled. However, if a passive result for this
service is processed before this forced check is run, then the
service is still marked as stale when the forced check runs, and the
state is inconsistent between master and slave.
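
To make the ordering concrete, here is a minimal toy model of the sequence described above. It is only a sketch: the toy_service struct, the toy_* functions and the TOY_* state codes are made up for illustration and are not the real Nagios structures or the code in checks.c.

```c
#include <stdbool.h>
#include <time.h>

/* Toy state codes; TOY_STALE stands in for whatever the forced check reports. */
enum { TOY_OK = 0, TOY_STALE = 2 };

/* Hypothetical, stripped-down service record (not the real Nagios struct). */
typedef struct {
    int    current_state;
    time_t last_check;
    bool   is_being_freshened;
    bool   forced_check_pending;
} toy_service;

/* Freshness pass: past expiration_time, flag the service and queue a forced check. */
static void toy_check_freshness(toy_service *svc, time_t now, time_t expiration_time)
{
    if (now > expiration_time && !svc->is_being_freshened) {
        svc->is_being_freshened = true;
        svc->forced_check_pending = true;
    }
}

/* A passive result from the slave arrives and is applied. */
static void toy_process_passive_result(toy_service *svc, int state, time_t when)
{
    svc->current_state = state;
    svc->last_check = when;
    /* Nothing here cancels the pending forced check: this is the gap. */
}

/* The queued forced check runs and reports the service as stale. */
static void toy_run_forced_check(toy_service *svc, time_t now)
{
    if (!svc->forced_check_pending)
        return;
    svc->forced_check_pending = false;
    svc->is_being_freshened = false;
    svc->current_state = TOY_STALE;
    svc->last_check = now;
}

/* Replays the problematic ordering and returns the master's final state. */
static int toy_race_demo(void)
{
    toy_service svc = { TOY_OK, 100, false, false };
    toy_check_freshness(&svc, 200, 160);            /* expired: forced check queued */
    toy_process_passive_result(&svc, TOY_OK, 201);  /* slave reports OK             */
    toy_run_forced_check(&svc, 202);                /* overwrites the fresh result  */
    return svc.current_state;
}
```

In this model the master ends up in TOY_STALE although the slave last reported TOY_OK, which is the inconsistency we are seeing.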

Possible solutions:

- If a check result is processed with is_being_freshened set for
  the service, then remove the forced check from the schedule if it exists.
- Change is_being_freshened to stale_time (0 if not stale). On
  running the forced check, if stale_time is less than last_check_time
  (+ latency?), break out of running the forced check.
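
Both options can be sketched roughly as below. This is an illustration under assumptions, not a patch: sketch_service, the sketch_* functions and the SK_* codes are hypothetical names, and the real scheduling queue is reduced to a single boolean flag.

```c
#include <stdbool.h>
#include <time.h>

enum { SK_OK = 0, SK_STALE = 2 };

/* Hypothetical service record; stale_time replaces is_being_freshened. */
typedef struct {
    int    current_state;
    time_t last_check;              /* time of the most recent processed result */
    time_t stale_time;              /* when staleness was detected; 0 if not stale */
    bool   forced_check_scheduled;  /* stand-in for an entry in the event queue */
} sketch_service;

/* Freshness pass: record when staleness was detected and schedule a forced check. */
static void sketch_mark_stale(sketch_service *svc, time_t now)
{
    svc->stale_time = now;
    svc->forced_check_scheduled = true;
}

/* Option 1: a result for a service being freshened cancels the forced check. */
static void sketch_process_result_cancel(sketch_service *svc, int state, time_t when)
{
    svc->current_state = state;
    svc->last_check = when;
    if (svc->stale_time != 0) {
        svc->forced_check_scheduled = false;
        svc->stale_time = 0;
    }
}

/* Option 2, part one: result processing leaves the schedule untouched. */
static void sketch_process_result_plain(sketch_service *svc, int state, time_t when)
{
    svc->current_state = state;
    svc->last_check = when;
}

/* Option 2, part two: the forced check bails out if a result arrived after
 * staleness was detected. Returns true only if the check actually ran. */
static bool sketch_run_forced_check(sketch_service *svc, time_t now)
{
    if (!svc->forced_check_scheduled)
        return false;
    svc->forced_check_scheduled = false;
    if (svc->stale_time != 0 && svc->stale_time < svc->last_check) {
        svc->stale_time = 0;        /* a newer result exists: skip */
        return false;
    }
    svc->current_state = SK_STALE;
    svc->last_check = now;
    svc->stale_time = 0;
    return true;
}
```

With the strict comparison, a result processed in the same second that staleness was detected does not suppress the forced check, which is where the "(+ latency?)" question comes in.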

Neither of these sounds particularly appealing to us. Are there other
possible solutions? Any opinions?

Ton

http://www.altinity.com
T: +44 (0)870 787 9243
F: +44 (0)845 280 1725
Skype: tonvoon

...[email truncated]...

This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]