[Nagios-devel] Race condition in freshness checking
-
Guest
Hi!
We found a bug in the calculation of the latency for a passive check.
This has highlighted a possible race condition in freshness
checking. We would like some ideas on the best approach to
fix this.
Background:
We have a master/slave arrangement, with freshness checking
(freshness_threshold=0) of slave services on the master.
Looking in the NDO db, we realised that the latency values for
passive results were incorrectly calculated: sometimes latency values
could be 1000x out. The patch is attached. However, since using this
patch, we've seen occasional race conditions.
Problem:
Within checks.c:check_service_result_freshness, if a service has passed
its expiration_time, it is marked as is_being_freshened and a forced
service check is scheduled. However, if a passive result for this
service is processed before this forced check is run, the forced check
still runs afterwards, so the service is marked as stale and the state
is inconsistent between master and slave.
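To make the ordering concrete, here is a small stand-alone model of the sequence described above. It is only a sketch under assumptions, not the actual Nagios code: the struct and the model_* functions are hypothetical, and only the names expiration_time and is_being_freshened come from checks.c as described in this mail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Simplified stand-in for a service; NOT the real Nagios service struct. */
struct model_service {
    time_t last_check;         /* time of the most recent result */
    time_t expiration_time;    /* result counts as expired after this */
    bool is_being_freshened;   /* set once a forced check has been queued */
    bool forced_check_queued;  /* hypothetical: models the scheduled event */
    bool stale;                /* hypothetical: outcome of the forced check */
};

/* Models the freshness check: queue a forced check once the result expires. */
static void model_check_freshness(struct model_service *s, time_t now)
{
    assert(s != NULL);
    if (!s->is_being_freshened && now > s->expiration_time) {
        s->is_being_freshened = true;
        s->forced_check_queued = true;
    }
}

/* Models a passive result arriving from the slave. */
static void model_process_passive(struct model_service *s, time_t now,
                                  time_t threshold)
{
    assert(s != NULL);
    s->last_check = now;
    s->expiration_time = now + threshold;
    s->stale = false;
    /* The queued forced check is left in place; that is the problem. */
}

/* Models the forced check firing: it reports stale unconditionally. */
static void model_run_forced_check(struct model_service *s)
{
    assert(s != NULL);
    if (s->forced_check_queued) {
        s->forced_check_queued = false;
        s->is_being_freshened = false;
        s->stale = true;
    }
}

/* Replays the problematic ordering and returns the final stale flag. */
static bool model_replay_race(void)
{
    struct model_service s = { .last_check = 100, .expiration_time = 160 };

    model_check_freshness(&s, 161);      /* expired: forced check queued */
    model_process_passive(&s, 162, 60);  /* fresh result arrives */
    model_run_forced_check(&s);          /* forced check still runs */
    return s.stale;                      /* true, although the data is fresh */
}
```

In this model the master ends up reporting the service as stale even though the slave delivered a result one second earlier, which is the inconsistency we are seeing.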
Possible solutions:
- If a check result is processed with is_being_freshened set for
  the service, then remove the forced check from the schedule if it exists.
- Change is_being_freshened to stale_time (0 if not stale). On
  running the forced check, if stale_time is less than last_check_time
  (+ latency?), break out of running the forced check.
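A rough sketch of the second option, to show what the guard might look like. This is not a patch against checks.c: the struct and the guarded_* function names are made up for illustration, last_check stands in for last_check_time, and the "(+ latency?)" adjustment is left out because it is still an open question.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical, reduced service record; NOT the real Nagios struct. */
struct guarded_service {
    time_t last_check;  /* time the most recent result was processed */
    time_t stale_time;  /* when the service was found stale; 0 if not stale */
};

/* Record when staleness was detected instead of setting a boolean flag. */
static void guarded_mark_stale(struct guarded_service *s, time_t now)
{
    assert(s != NULL);
    if (s->stale_time == 0)
        s->stale_time = now;
}

/* Returns true if the forced check should still run. */
static bool guarded_should_run_forced_check(struct guarded_service *s)
{
    assert(s != NULL);
    if (s->stale_time == 0)
        return false;              /* nothing was marked stale */
    if (s->stale_time < s->last_check) {
        s->stale_time = 0;         /* a result arrived after the marking */
        return false;              /* break out of the forced check */
    }
    return true;                   /* no newer result: run the check */
}
```

With stale_time = 161 and last_check = 162 the forced check is skipped; with last_check = 100 it runs. A result processed in the same second as the marking is still treated as stale here, since the comparison is strictly "less than".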
Neither of these sounds particularly appealing to us. Are there other
possible solutions? Any opinions?
Ton
http://www.altinity.com
T: +44 (0)870 787 9243
F: +44 (0)845 280 1725
Skype: tonvoon
...[email truncated]...
This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]