State changes while status is critical
Posted: Tue Oct 01, 2013 7:43 am
Hello,
I have a question about some behaviour that I am having trouble explaining.
We monitor our company's central storage with Nagios.
This is a Nexenta cluster.
We monitor the triggers Nexenta has in place.
One of those triggers is a disk that needs replacing.
If a disk fails, we get alerted and the check state goes to critical.
After we replaced the disk and started the resilver, the server came under considerable load.
So much, in fact, that the check occasionally times out, causing an Unknown status.
The resilvering process of a disk on Nexenta can take up to 48 hours to complete.
We had acknowledged the Critical status of the trigger.
However, when the check of the trigger times out due to the system load, Nagios seems to treat that as a state change.
The next check that doesn't time out then changes the state back to Critical,
which is in fact the same state it had before the Unknown status.
However, we start receiving SMS notifications again once the state is 'back' to Critical.
The service then has to be Acknowledged again to stop the SMS notifications coming in.
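To make the sequence concrete, here is a minimal Python sketch of what we think we are seeing. It is a simplified model under stated assumptions, not Nagios's actual logic: the function name `simulate` and the two rules in it (any change in the observed state clears the acknowledgement, and an unacknowledged Critical state sends an SMS) are hypothetical and only chosen to match the behaviour described above.

```python
# Simplified model of the observed behaviour; NOT Nagios source code.
# Assumptions (hypothetical, chosen to match what we observe):
#   1. Any change in the observed state clears the acknowledgement.
#   2. An SMS is sent for every check that is CRITICAL and unacknowledged.

def simulate(results, acknowledged=True, previous="CRITICAL"):
    """Return one (state, acknowledged, sms_sent) tuple per check result."""
    timeline = []
    for state in results:
        if state != previous:
            acknowledged = False  # a state change drops the acknowledgement
        sms_sent = state == "CRITICAL" and not acknowledged
        timeline.append((state, acknowledged, sms_sent))
        previous = state
    return timeline


# One timed-out check (UNKNOWN) in the middle of an acknowledged CRITICAL.
checks = ["CRITICAL", "UNKNOWN", "CRITICAL", "CRITICAL"]
timeline = simulate(checks)

for state, acked, sms in timeline:
    print(f"{state:8} acknowledged={acked!s:5} sms={sms}")
```

In this model a single timeout is enough to lose the acknowledgement, and the SMS notifications continue until the service is acknowledged again.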
Can anyone tell me how we can avoid these CRITICAL > UNKNOWN > CRITICAL state changes?
The example is from our storage, but the question applies to any service that goes through this kind of state change.
Thanks for reading, and I hope someone can answer this for me.
Kind regards!