Handling response checks with frequent spikes
Posted: Wed Feb 26, 2014 12:13 am
Hi,
I am monitoring REST and SOAP services with response time checks. The response times are volatile and often spike above 300 ms, but not for sustained periods. Essentially, I'd like a check that only alerts if the average response time over a certain period is greater than 300 ms. I think I can roughly simulate this by playing around with max check attempts and retry intervals:
- check interval = 3 minutes
- retry interval = 1 minute
- max check attempts = 6
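For comparison, the averaging behavior I am actually after could be sketched like this in Python. This is only an illustration of the idea, not an existing Nagios plugin; the class name `RollingAverageCheck`, the window of 3 samples, and the sample values are all made up for the example.

```python
from collections import deque


class RollingAverageCheck:
    """Alert only when the mean of the last `window` samples exceeds a threshold."""

    def __init__(self, threshold_ms=300.0, window=3):
        self.threshold_ms = threshold_ms
        # A bounded deque drops the oldest sample automatically.
        self.samples = deque(maxlen=window)

    def add(self, response_ms):
        """Record one response time; return True if the check should alert."""
        self.samples.append(response_ms)
        # Stay quiet until the window is full, so an early spike cannot trip it.
        if len(self.samples) < self.samples.maxlen:
            return False
        return sum(self.samples) / len(self.samples) > self.threshold_ms


check = RollingAverageCheck(threshold_ms=300.0, window=3)
# Hypothetical response times in ms: isolated spikes, then a sustained slowdown.
results = [check.add(ms) for ms in [120, 350, 110, 130, 340, 400, 380]]
print(results)
```

With these made-up samples, the isolated 350 ms spike does not alert; only the last sample does, once all three values in the window are slow.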
Assuming a service in a hard OK state, is the following correct?
- The first time Nagios gets a non-OK, it sets current check count to 1 (soft error)
- Every check after the first non-OK increments current check count by 1
- If an OK state is returned (soft recovery) before current check count = max check attempts:
  - Current check count is reset to 1
  - Service remains in hard state of OK
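To make the rules above concrete, here is a minimal Python sketch of the counter as I understand it. It models only the rules listed in this post, not Nagios's actual code, and it ignores scheduling (check interval vs. retry interval). The function name `simulate` and the labels "OK", "SOFT", and "HARD" are made up for the example.

```python
def simulate(results, max_check_attempts=6):
    """Return (check count, state label) after each check result.

    `results` is a sequence of booleans: True for OK, False for non-OK.
    The service is assumed to start in a hard OK state.
    """
    attempt = 1          # "current check count" in the rules above
    in_problem = False   # True once a non-OK has been seen since the last OK
    history = []
    for ok in results:
        if ok:
            # Recovery: the counter goes back to 1.
            attempt = 1
            in_problem = False
            history.append((attempt, "OK"))
            continue
        if in_problem:
            # Every non-OK after the first increments the counter.
            attempt = min(attempt + 1, max_check_attempts)
        else:
            # First non-OK after an OK: counter is set to 1 (soft error).
            attempt = 1
            in_problem = True
        state = "HARD" if attempt >= max_check_attempts else "SOFT"
        history.append((attempt, state))
    return history


# Three non-OK results, one OK, then six non-OK results in a row.
for step in simulate([False] * 3 + [True] + [False] * 6):
    print(step)
```

Under these assumed rules, the OK in the middle resets the counter, and the sixth consecutive non-OK result afterwards is the one that reaches the hard state.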
If so, the service could stay in a hard OK state through something like:
- 10 instances of failure (in a worst-case scenario, meaning 10 minutes of 300+ ms response time)
- 4 unchecked/unknown instances (in the minutes skipped between check intervals)
- only 3 verifiable instances of OK
Code:
STATE  O - - n n n O - - n n n n n n O
MINUTE 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5

O = OK state
n = non-okay
- = unchecked
Is this correct?
Thanks,
Mike