Re: [Nagios-devel] RFC: Downtime and flapping

Guest · Post by **Guest** » Sun Feb 06, 2011 8:59 pm

On 4 Feb 2011, at 10:30, Jochen Bern wrote:

> On 02/03/2011 11:59 PM, Andreas Ericsson wrote:
>> On 02/03/2011 07:53 PM, Ton Voon wrote:
>>> From the code, I can see that Nagios does not record any soft
>>> non-OK states in this state history. Any objections if I add "host
>>> or service in downtime" to that exception?
>> None at all. In fact, +1 on doing so. This way, downtime makes all
>> effects of statechanges void and null
>
> Umh, not quite, I'm afraid. It means that hosts/services will emerge
> from downtime with the history they had when they entered downtime
> way-back-when - which may well be the non-OK or FLAPPING which prompted
> you to schedule urgent repairs in the first place.
>
> It IIUC also means that during the downtime, the CGI-bins will keep
> displaying the *historic* flapping state, along with the *current*
> host/service state.
>
> Downtime disables notifications anyway, and there already is logic to
> trigger actions when downtime ends (*). IMHO, the proper way to provide
> a clean slate after a downtime would be to flush (**) the entire history
> at that point.
>
> (*) Notification type "s" - BTW,
> http://nagios.sourceforge.net/docs/3_0/ ... ml#contact
> lists services-"s" in the Definition Format but not in the Directive
> Descriptions.
>
> (**) Whether the bins should be reset to OK, PENDING,
> last-before-downtime or the current post-downtime $*STATE$ (if one is
> already available) is up for discussion ...

I think your main objection is that the flapping calculation could be
based on "very old states" and thus "inaccurate" and "unintuitive". I'm
happy with making a more radical change if it makes sense.

Stepping back, the purpose of flap detection is to disable notifications
temporarily, but since scheduling downtime already disables
notifications, does it make any sense to have flapping during downtimes?

So if we agree that downtime and flapping for the same object make no
sense when overlapping, I propose:
* if an object is flapping at the time of a downtime start, a flapping
stop is sent (this would need documenting: an object can get a flapping
stop because a downtime is starting. If a user has downtime
notifications, they'll get two notifications in this case)
* when an object goes into downtime, the state history is erased (I'm
assuming the state history is only used for flap detection) and new
states coming in during this downtime are not recorded. When the object
comes out of downtime, state history recording starts again

During a downtime, the flapping percent will always be 0, and then it's
an education/documentation issue that flap detection does not take
effect in this period.
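
To make the proposal concrete, here is a rough sketch of the intended
behaviour. It is written in Python rather than Nagios's C purely for
brevity, and it is not the actual Nagios code: the class
MonitoredObject, its fields and its method names are all hypothetical,
the 21-slot history length is taken from my reading of the flap
detection documentation, and the percentage is a simplified, unweighted
count of state changes.

```python
from dataclasses import dataclass, field
from typing import List

# Assumption: the flap detection history holds the last 21 check results.
HISTORY_SLOTS = 21


@dataclass
class MonitoredObject:
    """Hypothetical stand-in for a host or service; not a Nagios struct."""
    in_downtime: bool = False
    is_flapping: bool = False
    state_history: List[int] = field(default_factory=list)
    notifications: List[str] = field(default_factory=list)

    def percent_state_change(self) -> float:
        """Simplified, unweighted share of adjacent entries that differ."""
        if len(self.state_history) < 2:
            return 0.0
        pairs = zip(self.state_history, self.state_history[1:])
        changes = sum(1 for older, newer in pairs if older != newer)
        return 100.0 * changes / (HISTORY_SLOTS - 1)

    def record_state(self, state: int) -> None:
        """Store a new state, unless the object is in downtime."""
        if self.in_downtime:
            return  # proposal: states seen during downtime are not recorded
        self.state_history.append(state)
        # Keep only the newest HISTORY_SLOTS entries.
        del self.state_history[:-HISTORY_SLOTS]

    def start_downtime(self) -> None:
        """Send a flapping stop if needed, then erase the history."""
        if self.is_flapping:
            self.is_flapping = False
            self.notifications.append("FLAPPINGSTOP")
        self.notifications.append("DOWNTIMESTART")
        self.state_history.clear()
        self.in_downtime = True

    def end_downtime(self) -> None:
        """Leave downtime; recording resumes from an empty history."""
        self.in_downtime = False
        self.notifications.append("DOWNTIMEEND")
```

With this, an object that was flapping when the downtime starts
produces a FLAPPINGSTOP followed by a DOWNTIMESTART, and
percent_state_change() stays at 0 until the downtime ends and new
states are recorded again.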

Would that be better?

Ton

This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]