Handling response checks with frequent spikes

Posted: Wed Feb 26, 2014 12:13 am
by gurkakrieg
Hi,

I am monitoring REST and SOAP services with response time checks that are volatile: they spike to 300+ ms quite often, but not for sustained periods of time. Essentially, I'd like a check that only alerts if the average response time over a certain period is greater than 300 ms. I think I can roughly simulate this by playing around with max check attempts and retry intervals:
  • check interval = 3 minutes
  • retry interval = 1 minute
  • max check attempts = 6
Am I correct in thinking that any OK state during those 6 check attempts will keep us from going to a hard critical state, essentially giving us a 6 minute buffer to get at least 1 ok return?
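For comparison, the averaging behaviour described at the top of this post (alert only when the average over a period exceeds 300 ms) could look roughly like the sketch below. This is not a Nagios feature or plugin; `RollingAverageCheck`, `threshold_ms`, and `window` are hypothetical names chosen for illustration, and the window is counted in samples rather than minutes.

```python
from collections import deque


class RollingAverageCheck:
    """Alert only when the mean of the last `window` samples exceeds a threshold.

    Hypothetical helper for illustration; not part of Nagios.
    """

    def __init__(self, threshold_ms=300.0, window=6):
        self.threshold_ms = threshold_ms
        # deque(maxlen=...) silently drops the oldest sample once it is full.
        self.samples = deque(maxlen=window)

    def add(self, response_ms):
        """Record one sample; return True if the window average is too high."""
        self.samples.append(response_ms)
        # Wait for a full window so an early spike cannot alert on its own.
        if len(self.samples) < self.samples.maxlen:
            return False
        average = sum(self.samples) / len(self.samples)
        return average > self.threshold_ms
```

Unlike the retry approach, a single very large spike can still push the average over the threshold here, so the window size would need tuning against real data.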

Assuming a service in a hard OK state, is the following correct?
  1. The first time Nagios gets a non-OK, it sets current check count to 1 (soft error)
  2. Every check after the first non-OK increments current check count by 1
  3. If an OK state is returned (soft recovery) before current check count = max check attempts:
    1. Current check count is reset to 1
    2. Service remains in hard state of OK
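The counting rules in the list above can be sketched as a small replay function. This is only a model of the steps as I've written them, not Nagios source code; `replay` and its labels are hypothetical, and it assumes a hard problem state is entered once the count reaches max check attempts.

```python
def replay(results, max_check_attempts=6):
    """Replay check results (True = OK, False = non-OK) from a hard OK state.

    Returns one (check_count, label) tuple per result, where label is
    "OK", "SOFT", or "HARD".
    """
    attempt = 1
    previous_ok = True  # assumption: the service starts in a hard OK state
    history = []
    for ok in results:
        if ok:
            # Step 3: an OK result resets the count to 1.
            attempt = 1
            label = "OK"
        else:
            # Step 1: the first non-OK sets the count to 1.
            # Step 2: each later non-OK adds 1, capped at the maximum.
            attempt = 1 if previous_ok else min(attempt + 1, max_check_attempts)
            # Assumption: reaching the maximum means a hard problem state.
            label = "HARD" if attempt >= max_check_attempts else "SOFT"
        history.append((attempt, label))
        previous_ok = ok
    return history
```

For example, `replay([False, False, False, True])` walks through three soft errors and then a reset, which is the pattern I'm hoping to tolerate.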
Now that I look at things, however, I'm not sure this is the result I want. Over a 16 minute period with the settings I've used, I could have
  • 9 instances of failure (in a worst-case scenario meaning 9 minutes of 300+ ms response time)
  • 4 unchecked/unknown instances (in the minutes skipped between check intervals)
  • only 3 verifiable instances of OK
This model hopefully illustrates this:

STATE  O - - n n n O - - n n n n n n O
MINUTE 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
O = OK state
n = non-okay
- = unchecked

Is this correct?
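To double-check a model like this by machine rather than by eye, a short tally along these lines might help. `TIMELINE` is just the STATE row above with the spaces removed; `tally` and `longest_run` are hypothetical helper names.

```python
from collections import Counter
from itertools import groupby

# STATE row of the model, one character per minute (O, n, or -).
TIMELINE = "O--nnnO--nnnnnnO"


def tally(timeline):
    """Count how many minutes fall into each state."""
    return Counter(timeline)


def longest_run(timeline, symbol="n"):
    """Length of the longest unbroken run of `symbol` (0 if it never occurs)."""
    runs = [len(list(group)) for key, group in groupby(timeline) if key == symbol]
    return max(runs, default=0)
```

The value of `longest_run(TIMELINE)` is the one worth comparing against max check attempts, since a run that reaches that number would presumably no longer be a soft error.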

Thanks,

Mike

Re: Handling response checks with frequent spikes

Posted: Wed Feb 26, 2014 12:27 pm
by tmcdonald
That should work as you described it; however, you can expect some false alerts if your checks deviate from that pattern. This is normal and there is not a lot you can do to mitigate it.

Now if you really want to over-engineer the solution, I wrote a document about a bit of software called bischeck, which is useful for monitoring averages and trends over time. It's a bit complex at first, but once you learn the ins and outs you can do some pretty cool stuff with it. The doc is not completed as of right now, but I can give you the pre-final draft if you're interested.

Re: Handling response checks with frequent spikes

Posted: Wed Feb 26, 2014 3:51 pm
by gurkakrieg
I'm always up for a bit of over-engineering :D

I'd appreciate a peek at your bischeck doc. In a former life I was a technical writer -- I'm used to reading lots of drafts.

Re: Handling response checks with frequent spikes

Posted: Wed Feb 26, 2014 4:28 pm
by abrist
I PM'ed you Trevor's newest bischeck draft.