Hello,
I have a question about some Nagios behaviour, and I am having trouble finding out why it happens.
We monitor our company's central storage with Nagios.
This is a Nexenta cluster.
We monitor the triggers Nexenta has in place.
One of those triggers is a disk that needs replacing.
If a disk fails, we get alerted and the check state goes to critical.
After we replaced the disk and started the resilver, the server was under considerable load.
In fact, so much load that the check occasionally times out, causing an Unknown status.
The resilvering process of a disk on Nexenta can take up to 48 hours to complete.
We had acknowledged the Critical status of the trigger.
However, when the check of the trigger times out due to the system load, Nagios seems to treat that as a state change.
The next check that doesn't time out then changes the state back to Critical,
which is in fact the same state it had before the Unknown status.
We do however start receiving SMS notifications again after the state is 'back' to Critical.
The service then has to be Acknowledged again to stop the SMS notifications coming in.
Can anyone tell me how we can avoid these CRITICAL > UNKNOWN > CRITICAL state changes?
The example is for our storage, but is applicable to all these types of state changes.
Thanks for reading, and I hope someone can answer this for me.
Kind regards!
State changes while status is critical
-
slansing
- Posts: 7698
- Joined: Mon Apr 23, 2012 4:28 pm
- Location: Travelling through time and space...
Re: State changes while status is critical
Have you considered using Downtime for this? http://nagios.sourceforge.net/docs/3_0/downtime.html
Also, how rapidly are you seeing these state changes? Is it quite fast? You could also try adding flapping detection to your hosts/services.
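As a rough sketch of the flap detection suggestion, a service definition could look like the fragment below. The host name, service description, and threshold values are placeholders picked for illustration, not values from this thread, and flap detection also has to be switched on globally with enable_flap_detection=1 in nagios.cfg.

```
# Hypothetical service definition: host_name, service_description and
# the threshold values are assumptions for illustration only.
define service {
    host_name               nexenta-storage       ; assumed name
    service_description     Nexenta disk trigger  ; assumed name
    flap_detection_enabled  1                     ; enable flap detection for this service
    flap_detection_options  o,w,c,u               ; states the flap logic takes into account
    low_flap_threshold      5.0                   ; percent state change below which flapping stops
    high_flap_threshold     20.0                  ; percent state change above which flapping starts
    ; remaining required directives (check_command, contacts, ...) omitted
}
```

While a service is considered flapping, Nagios suppresses its problem and recovery notifications, so this only helps if the state changes happen often enough to cross the high threshold.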
Re: State changes while status is critical
Scheduling downtime for the service could be an option, but we'd like to be notified as the service recovers.
The state changes are not that fast, so flapping detection would not catch them.
Is the CRITICAL > UNKNOWN > CRITICAL state change (due to timeouts) handled this way by design?
The actual state hasn't changed; the check just could not be performed.
It was critical, and it still is critical even if one of the checks times out in between.
The timeout shouldn't change the actual critical state and cause us to start receiving notifications again.
-
scottwilkerson
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
- Contact:
Re: State changes while status is critical
If you want to see every state change, you can set is_volatile to 1 in the service definition.
http://nagios.sourceforge.net/docs/3_0/ ... vices.html
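A minimal sketch of what that might look like, assuming a service definition similar to the one discussed in this thread (the host name and service description are made-up placeholders):

```
# Hypothetical service definition: host_name and service_description
# are assumptions for illustration only.
define service {
    host_name            nexenta-storage       ; assumed name
    service_description  Nexenta disk trigger  ; assumed name
    is_volatile          1                     ; mark the service as volatile
    ; remaining required directives (check_command, contacts, ...) omitted
}
```

According to the volatile services documentation, a volatile service is logged and notified on every check that returns a hard non-OK state, so this setting is likely to produce more notifications during the resilver rather than fewer.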