State changes while status is critical

IceOner
Posts: 2
Joined: Tue Oct 01, 2013 7:18 am

Post by IceOner »

Hello,

I have a question about some behaviour I can't explain.
We monitor our company's central storage with Nagios.
This is a Nexenta cluster.

We monitor the triggers Nexenta has in place.
One of those triggers is a disk that needs replacing.
If a disk fails, we get alerted and the check state goes to critical.
After we replace the disk and start resilvering it, the server is under considerable load.
So much load, in fact, that the check occasionally times out, causing an Unknown status.

The resilvering process of a disk on Nexenta can take up to 48 hours to complete.
We had acknowledged the Critical status of the trigger.
However, when the check of the trigger times out due to the system load, Nagios seems to treat that as a state change.
The next check that doesn't time out then changes the state back to Critical,
which is in fact the same state it had before the Unknown status.

We do however start receiving SMS notifications again after the state is 'back' to Critical.
The service then has to be Acknowledged again to stop the SMS notifications coming in.

Can anyone tell me how we can avoid these CRITICAL > UNKNOWN > CRITICAL state changes?
The example is for our storage, but is applicable to all these types of state changes.

Thanks for reading, and I hope someone can answer this for me.

Kind regards!
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: State changes while status is critical

Post by slansing »

Have you considered using Downtime for this? http://nagios.sourceforge.net/docs/3_0/downtime.html

Also, how rapidly are you seeing these state changes? Is it quite fast? You could also try adding flapping detection to your hosts/services.
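
As a rough sketch only, enabling flap detection on a single service in a Nagios Core 3 object definition could look something like the fragment below. The host name, service description, check command and the `generic-service` template are placeholders, not taken from this thread, and the thresholds are illustrative values rather than recommendations. Flap detection also has to be enabled globally (`enable_flap_detection=1` in nagios.cfg) before the per-service setting has any effect.

```
define service {
    use                     generic-service        ; assumed template name
    host_name               nexenta-storage        ; hypothetical host
    service_description     Nexenta Disk Trigger   ; hypothetical service
    check_command           check_nexenta_trigger  ; hypothetical command
    flap_detection_enabled  1                      ; turn flap detection on for this service
    flap_detection_options  c,u                    ; count CRITICAL and UNKNOWN states
    low_flap_threshold      5.0                    ; illustrative: stop flapping below 5% state change
    high_flap_threshold     20.0                   ; illustrative: start flapping above 20% state change
}
```

Whether this helps depends on how often the timeouts occur: flap detection works from the percentage of state changes across recent checks, so infrequent changes may never reach the high threshold.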
IceOner
Posts: 2
Joined: Tue Oct 01, 2013 7:18 am

Re: State changes while status is critical

Post by IceOner »

Scheduling downtime for the service could be an option, but we'd like to be notified as the service recovers.
The state changes are not frequent enough for flap detection to catch them :(

Is the CRITICAL > UNKNOWN > CRITICAL state change (due to timeouts) handled this way by design?
The real state hasn't changed; the check just could not be performed.
It was critical, and it is still critical even if one of the checks times out in between.
The timeout shouldn't change the actual critical state and cause us to start receiving notifications again.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: State changes while status is critical

Post by scottwilkerson »

If you want to see every state change, you can set is_volatile to 1

http://nagios.sourceforge.net/docs/3_0/ ... vices.html
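
For illustration, a minimal sketch of where `is_volatile` sits in a service definition. Everything except the `is_volatile` directive itself is a placeholder (template, host name, service description and check command are hypothetical).

```
define service {
    use                     generic-service        ; assumed template name
    host_name               nexenta-storage        ; hypothetical host
    service_description     Nexenta Disk Trigger   ; hypothetical service
    check_command           check_nexenta_trigger  ; hypothetical command
    is_volatile             1                      ; treat this service as volatile
}
```

As far as the Nagios Core 3 documentation describes it, a volatile service in a hard non-OK state is logged and notified on every non-OK check result, even when the state has not changed. This setting therefore surfaces more events rather than suppressing them.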