Re: [Nagios-devel] RFC: Downtime and flapping


On 4 Feb 2011, at 10:30, Jochen Bern wrote:

> On 02/03/2011 11:59 PM, Andreas Ericsson wrote:
>> On 02/03/2011 07:53 PM, Ton Voon wrote:
>>> From the code, I can see that Nagios does not record any soft
>>> non-OK states in this state history. Any objections if I add "host
>>> or service in downtime" to that exception?
>> None at all. In fact, +1 on doing so. This way, downtime makes all
>> effects of statechanges void and null
>
> Umh, not quite, I'm afraid. It means that hosts/services will emerge
> from downtime with the history they had when they entered downtime
> way-back-when - which may well be the non-OK or FLAPPING which prompted
> you to schedule urgent repairs in the first place.
>
> It IIUC also means that during the downtime, the CGI-bins will keep
> displaying the *historic* flapping state, along with the *current*
> host/service state.
>
> Downtime disables notifications anyway, and there already is logic to
> trigger actions when downtime ends (*). IMHO, the proper way to provide
> a clean slate after a downtime would be to flush (**) the entire history
> at that point.
>
> (*) Notification type "s" - BTW,
> http://nagios.sourceforge.net/docs/3_0/ ... ml#contact
> lists services-"s" in the Definition Format but not in the Directive
> Descriptions.
>
> (**) Whether the bins should be reset to OK, PENDING,
> last-before-downtime or the current post-downtime $*STATE$ (if one is
> already available) is up for discussion ...

I think your main objection is that the flapping calculation could be
based on "very old states" and thus "inaccurate" and "unintuitive". I'm
happy with making a more radical change if it makes sense.

Stepping back, the purpose of flap detection is to disable notifications
temporarily, but since scheduling downtime already disables
notifications, does it make any sense to have flapping during downtimes?

So if we agree that downtime and flapping make no sense when they
overlap for the same object, I propose:
* if an object is flapping when a downtime starts, a flapping stop is
  sent (the documentation would need to say that an object can get a
  flapping stop because a downtime started; a user who has downtime
  notifications enabled will get two notifications in this case)
* when an object goes into downtime, the state history is erased (I'm
  assuming the state history is only used for flap detection) and new
  states coming in during the downtime are not recorded. When the object
  comes out of downtime, state history recording starts again

During a downtime, the flapping percent will always be 0, and then it's
an education/documentation issue that flap detection does not take
effect in this period.
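
To make the proposal above concrete, here is a rough sketch in Python
rather than in Nagios' own C. It is not the Nagios implementation:
MonitoredObject, its fields and its methods are hypothetical names made
up for illustration, the 21-slot history is my understanding of what
Nagios keeps for flap detection, and the percentage is a simplified,
unweighted state-change count.

```python
from dataclasses import dataclass, field
from typing import List

# Assumption: Nagios keeps the last 21 check states for flap detection.
HISTORY_SLOTS = 21


@dataclass
class MonitoredObject:
    """Hypothetical stand-in for a Nagios host or service."""
    name: str
    in_downtime: bool = False
    is_flapping: bool = False
    state_history: List[int] = field(default_factory=list)
    notifications: List[str] = field(default_factory=list)

    def record_state(self, state: int) -> None:
        # Proposed rule: states seen during downtime are not recorded.
        if self.in_downtime:
            return
        self.state_history.append(state)
        # Keep only the newest HISTORY_SLOTS entries.
        del self.state_history[:-HISTORY_SLOTS]

    def flapping_percent(self) -> float:
        # Simplified: unweighted share of state changes among the
        # HISTORY_SLOTS - 1 possible transitions.
        h = self.state_history
        if len(h) < 2:
            return 0.0
        changes = sum(1 for a, b in zip(h, h[1:]) if a != b)
        return 100.0 * changes / (HISTORY_SLOTS - 1)

    def start_downtime(self) -> None:
        # Proposed rule: a flapping object gets a flapping stop first,
        # so the user may see two notifications.
        if self.is_flapping:
            self.is_flapping = False
            self.notifications.append("FLAPPINGSTOP")
        self.notifications.append("DOWNTIMESTART")
        # Proposed rule: erase the state history on entering downtime.
        self.state_history.clear()
        self.in_downtime = True

    def end_downtime(self) -> None:
        # State history recording resumes from an empty slate.
        self.in_downtime = False
        self.notifications.append("DOWNTIMEEND")
```

With this sketch, a service that alternates between OK and CRITICAL six
times, is marked flapping and then enters downtime ends up with an
empty history, a FLAPPINGSTOP followed by a DOWNTIMESTART, and a
flapping percent of 0 until the downtime ends.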

Would that be better?

Ton

This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]