[Nagios-devel] Race condition in freshness checking





Hi!

We found a bug in the calculation of the latency for a passive check.
This has highlighted a possible race condition re: freshness
checking. We wanted to get some ideas on what is the best approach to
fix this.

Background:

We have a master/slave arrangement, with freshness checking
(freshness_threshold=0) of slave services on the master.
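
For readers unfamiliar with the setup, a master-side service definition of this kind might look roughly like the fragment below. This is only a sketch: the host name, service description, template name and stale-check command are placeholders, not taken from our configuration. As far as the Nagios documentation describes it, a freshness_threshold of 0 asks Nagios to work out the threshold itself from the check intervals.

```
# Hypothetical master-side definition; results arrive passively from the slave.
define service {
    use                     generic-service      ; placeholder template supplying the other required directives
    host_name               slave-host-01        ; placeholder
    service_description     Example passive service
    active_checks_enabled   0                    ; master does not poll this service itself
    passive_checks_enabled  1                    ; results are submitted by the slave
    check_freshness         1                    ; enable freshness checking
    freshness_threshold     0                    ; 0 = threshold determined automatically
    check_command           service-is-stale     ; placeholder command run by the forced check
}
```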

Looking in the NDO db, we realised that the latency values for
passive results were incorrectly calculated: sometimes latency values
could be 1000x out. The patch is attached. However, since using this
patch, we've seen occasional race conditions.

Problem:

Within checks.c:check_service_result_freshness, if a service has passed
its expiration_time, it is marked as is_being_freshened and a forced
service check is scheduled. However, if a passive result for this
service is processed before this forced check is run, then the
service is still marked as stale and the state is inconsistent between
master and slave.
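
The sequence can be walked through with a small model. This is a minimal sketch of one reading of the problem, not the actual code in checks.c: the struct, its fields (other than is_being_freshened and expiration_time) and all sketch_* function names are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical, simplified stand-in for a Nagios service record. */
typedef struct {
    time_t last_check;           /* time of the most recent result */
    time_t expiration_time;      /* last_check plus the freshness threshold */
    bool is_being_freshened;     /* set once a forced check has been queued */
    bool forced_check_pending;   /* stands in for the scheduled event */
    bool marked_stale;           /* outcome of the forced check on the master */
} sketch_service;

/* Freshness pass: flag an expired service and queue a forced check. */
void sketch_check_freshness(sketch_service *svc, time_t now) {
    assert(svc != NULL);
    if (!svc->is_being_freshened && now > svc->expiration_time) {
        svc->is_being_freshened = true;
        svc->forced_check_pending = true;
    }
}

/* A passive result arrives from the slave and refreshes the service.
 * Nothing here cancels the forced check that was queued earlier. */
void sketch_process_passive_result(sketch_service *svc, time_t now,
                                   time_t threshold) {
    assert(svc != NULL);
    svc->last_check = now;
    svc->expiration_time = now + threshold;
}

/* The queued forced check runs and reports the service as stale,
 * even though a fresh passive result has already been processed. */
void sketch_run_forced_check(sketch_service *svc) {
    assert(svc != NULL);
    if (svc->forced_check_pending) {
        svc->forced_check_pending = false;
        svc->is_being_freshened = false;
        svc->marked_stale = true;
    }
}
```

Calling sketch_check_freshness just after expiry, then sketch_process_passive_result, then sketch_run_forced_check leaves marked_stale set on the master while the slave holds a perfectly good result.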

Possible solutions:

- If a check result is processed with is_being_freshened set for
the service, then remove forced check from schedule if it exists.
- Change is_being_freshened to stale_time (0 if not stale). On
running the forced check, if stale_time is less than last_check_time
(+ latency?), break out of running the forced check.
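
To make the two options concrete, here is a rough sketch of each against a simplified service record. The struct and the fix1_/fix2_ function names are hypothetical, the "(+ latency?)" adjustment is left out, and clearing is_being_freshened in the first option is an assumption rather than something settled above.

```c
#include <assert.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical, simplified service record; not the real Nagios struct. */
typedef struct {
    time_t last_check_time;      /* when the latest result was processed */
    bool is_being_freshened;     /* used by option 1 */
    bool forced_check_pending;   /* stands in for the scheduled event */
    time_t stale_time;           /* used by option 2; 0 if not stale */
    bool marked_stale;
} fix_service;

/* Option 1: a result processed while the service is being freshened
 * removes the forced check from the schedule. */
void fix1_process_result(fix_service *svc, time_t now) {
    assert(svc != NULL);
    svc->last_check_time = now;
    if (svc->is_being_freshened) {
        svc->forced_check_pending = false;
        svc->is_being_freshened = false;   /* assumption: flag is reset here */
    }
}

/* Option 2: the forced check breaks out if a result was processed after
 * the service was flagged stale. Returns true if the check actually ran. */
bool fix2_run_forced_check(fix_service *svc) {
    assert(svc != NULL);
    bool superseded = svc->stale_time < svc->last_check_time;
    svc->stale_time = 0;
    svc->forced_check_pending = false;
    if (superseded) {
        return false;                      /* newer result wins; skip */
    }
    svc->marked_stale = true;
    return true;
}
```

The first option touches the event schedule from the result-processing path; the second leaves the schedule alone and makes the forced check itself harmless when it has been overtaken.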

Neither of these sounds particularly appealing to us. Are there other
possible solutions? Any opinions?

Ton

http://www.altinity.com
T: +44 (0)870 787 9243
F: +44 (0)845 280 1725
Skype: tonvoon


...[email truncated]...


This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]
Locked