State changes while status is critical

Posted: Tue Oct 01, 2013 7:43 am
by IceOner
Hello,

I have a question about some behaviour I can't explain.
We monitor our company's central storage with Nagios.
This is a Nexenta cluster.

We monitor the triggers Nexenta has in place.
One of those triggers is a disk that needs replacing.
If a disk fails, we get alerted and the check state goes to critical.
After we replaced the disk and started the resilver, the server was under quite some stress.
In fact, so much stress that the check occasionally times out, causing an Unknown status.

The resilvering process of a disk on Nexenta can take up to 48 hours to complete.
We had acknowledged the Critical status of the trigger.
However, when the check of the trigger times out due to the system load, Nagios seems to treat that as a state change.
As a result, the next check that doesn't time out changes the state back to Critical,
which is in fact the same state it had before the Unknown status.

We do however start receiving SMS notifications again after the state is 'back' to Critical.
The service then has to be Acknowledged again to stop the SMS notifications coming in.

Can anyone tell me how we can avoid these CRITICAL > UNKNOWN > CRITICAL state changes?
The example is for our storage, but it applies to all state changes of this type.

Thanks for reading, and I hope someone can answer this for me.

Kind regards!

Re: State changes while status is critical

Posted: Tue Oct 01, 2013 10:37 am
by slansing
Have you considered using Downtime for this? http://nagios.sourceforge.net/docs/3_0/downtime.html

Also, how rapidly are you seeing these state changes? Is it quite fast? You could also try adding flapping detection to your hosts/services.
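
For reference, flap detection is switched on globally in nagios.cfg and then per service. The following is only a minimal sketch: the host name, service description, check command and the generic-service template are placeholders I made up, and the threshold values are illustrative, not a recommendation for this setup.

```
# nagios.cfg: global switch for flap detection
enable_flap_detection=1

# Service definition (all names below are hypothetical placeholders)
define service {
    use                     generic-service        ; assumed template
    host_name               nexenta-storage
    service_description     Nexenta disk trigger
    check_command           check_nexenta_trigger
    flap_detection_enabled  1
    low_flap_threshold      5.0                    ; percent state change, illustrative
    high_flap_threshold     20.0                   ; percent state change, illustrative
    flap_detection_options  o,w,c,u                ; states counted for flapping
}
```

Whether this helps depends on how often the state actually changes; occasional timeouts spread over many hours may never reach the high threshold.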

Re: State changes while status is critical

Posted: Wed Oct 02, 2013 4:04 am
by IceOner
Scheduling downtime for the service could be an option, but we'd like to be notified as the service recovers.
The state changes are not frequent enough for flap detection to kick in :(

Is this CRITICAL > UNKNOWN > CRITICAL state change (due to timeouts) handled this way by design?
The real state hasn't changed; the check just could not be performed.
It was critical, and it still is critical even if one of the checks times out in between.
The timeout shouldn't change the actual critical state and cause us to start receiving notifications again.

Re: State changes while status is critical

Posted: Wed Oct 02, 2013 9:35 am
by scottwilkerson
If you want to see every state change, you can set is_volatile to 1 in the service definition.

http://nagios.sourceforge.net/docs/3_0/ ... vices.html
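
A minimal sketch of where that directive goes, assuming a service that inherits from a generic-service template. The host name, service description and check command are hypothetical placeholders, not taken from the original poster's configuration.

```
# Service definition (all names below are hypothetical placeholders)
define service {
    use                     generic-service        ; assumed template
    host_name               nexenta-storage
    service_description     Nexenta disk trigger
    check_command           check_nexenta_trigger
    is_volatile             1                      ; treat the service as volatile
}
```

Nagios needs to reload its configuration before the change takes effect. It is worth reading the linked page on volatile services first, since the setting changes how logging and notifications behave for every non-OK result.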