Timeperiod based intervals for hosts/services
Posted: Tue Feb 07, 2012 10:01 am
by ericusf
I don't see an easy way around it, but here's what we wanted to do:
We wanted to have different settings for different time periods. For example, for a single host, during business hours:
- max_check_attempts = 2
- check_interval = 3
- retry_interval = 1
But say you wanted to avoid "transients" - issues like rebooting switches due to power brownouts. Nothing that a tech needed to be paged about right then and there; something that would resolve itself (usually within 3 minutes). So you might want to change that during off hours:
- max_check_attempts = 3
- check_interval = 5
- retry_interval = 2
Thing is, the only way I can see to do this requires two different "host_name"s.
I thought about doing this with hostescalations; however, we don't want a truly down piece of equipment to wait a full notification interval (15 minutes for critical equipment) before the tech is notified.
Any ideas? Comments?
Re: Timeperiod based intervals for hosts/services
Posted: Tue Feb 07, 2012 7:14 pm
by jsmurphy
While I can't suggest a solution specifically to the question you posed, I can suggest that it may be unnecessary to have the time_periods different. If you are worried about catching reboots, a better course of action might be to use an eventlog agent to search for unexpected reboots in the logs... using a slightly extended timeout period also goes a long way to reducing your chances of a false positive.
I've noticed that there tends to be an overzealous approach to monitoring during business hours under the guise of "If something critical goes down we need to know about it right now!". But the fact is, if something critical goes down you will know about it before your monitoring tells you, because your office will turn into meerkat manor as everyone starts peering over their cubicle walls and rushing to your desk. There's only a very small time difference between 2 + 1 + 1 (4 minutes max) and 5 + 2 + 2 (9 minutes max). In a best-case scenario that might give you a 5 minute head start, if by some miracle no one else notices that it's broken.
Re: Timeperiod based intervals for hosts/services
Posted: Thu Feb 09, 2012 8:02 am
by ericusf
Actually, given that we are a 40000+ student, multi-campus university with a large IT staff, Nagios tells us a lot more about what's up or down than people do, usually. Even with people knowing all the network engineers' (cell)phone numbers. Sometimes it takes the helpdesk a while to get around to asking us if something's broken ... and 50/50 it's either "we know" or "it's fixed" by the time they get to us.
We *do* look at logging to tell us when equipment is rebooting, but the issue for us is more one of power (which in several of our buildings has low reliability, and in older installs there isn't any reliable UPS equipment involved).
The Associate Director (and I agree) desired to go about it this way because we can ignore the odd page, but like to know that it's happening, when it happens. At least during business hours ...
We're also looking at completely revamping our settings so that a lot of stuff doesn't even page off hours ... which, once that's in place, will probably lessen the "transients" the on-call person is currently getting paged about (as most stuff that's critical has more solid power and hefty UPS equipment). Also, we're looking at setting it up so that paging only kicks in via [host|service]escalations, and by default email is used 24x7 for all equipment.
But, it was something he suggested I look into.
Re: Timeperiod based intervals for hosts/services
Posted: Thu Feb 09, 2012 5:22 pm
by jsmurphy
Fair enough, I was just trying to provide some food for thought. All too often people do this without questioning why they are doing it and what value ~5 minutes actually adds.
There is only one way I know of to implement this: create one time period for business hours and another for out of hours, then create two template definitions, one for inside business hours and one for outside. You would also need to duplicate the hosts/services, assigning one copy the inside-hours template and the other the out-of-hours template.
Unfortunately, if you were to inherit both templates on a single host, one would override the other, so the only way to achieve this is by duplicating the host/service and template definitions so that they only run in the desired time frames. Something about Nagios that does bite you in the ass from time to time is the lack of an internal decision-making engine for when and how to apply configurations in relation to what else exists.
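A rough sketch of what that duplication might look like, using the intervals from the first post. The time ranges, template names, host names (switch01-bh, switch01-oh), address, contact group and check command are all placeholders assumed for illustration, not taken from anyone's real config. Note that each copy needs its own host_name.

```
# Assumed business hours (08:00-17:00, Mon-Fri); adjust to suit.
define timeperiod {
    timeperiod_name     business-hours
    alias               Business hours
    monday              08:00-17:00
    tuesday             08:00-17:00
    wednesday           08:00-17:00
    thursday            08:00-17:00
    friday              08:00-17:00
}

# Everything not covered by business-hours above.
define timeperiod {
    timeperiod_name     off-hours
    alias               Outside business hours
    monday              00:00-08:00,17:00-24:00
    tuesday             00:00-08:00,17:00-24:00
    wednesday           00:00-08:00,17:00-24:00
    thursday            00:00-08:00,17:00-24:00
    friday              00:00-08:00,17:00-24:00
    saturday            00:00-24:00
    sunday              00:00-24:00
}

# Template for the aggressive business-hours settings.
define host {
    name                    bh-host-template    ; hypothetical template name
    check_period            business-hours
    notification_period     business-hours
    max_check_attempts      2
    check_interval          3
    retry_interval          1
    notification_interval   15
    check_command           check-host-alive    ; assumed to be defined elsewhere
    contact_groups          admins              ; placeholder contact group
    register                0
}

# Template for the more tolerant off-hours settings.
define host {
    name                    oh-host-template    ; hypothetical template name
    check_period            off-hours
    notification_period     off-hours
    max_check_attempts      3
    check_interval          5
    retry_interval          2
    notification_interval   15
    check_command           check-host-alive
    contact_groups          admins
    register                0
}

# The same physical device, defined twice under different host_names.
define host {
    use         bh-host-template
    host_name   switch01-bh         ; placeholder name
    alias       Switch 01 (business hours)
    address     192.0.2.10          ; documentation address, placeholder
}

define host {
    use         oh-host-template
    host_name   switch01-oh         ; placeholder name
    alias       Switch 01 (off hours)
    address     192.0.2.10
}
```

One thing to watch with this approach: outside its check_period a host isn't actively checked on schedule, so each copy would likely just sit at its last known state until its period comes around again. Services would need the same doubling up, one set attached to each host_name.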
Re: Timeperiod based intervals for hosts/services
Posted: Fri Feb 10, 2012 8:56 am
by ericusf
Ah, but where you then get bitten, and quite hard, is you can't have two "define host {}" stanzas with the same host_name. I've tried that (or at least swear that I thought I did).
You actually have to have unique host_names, and that's the rub - I'd rather not have to do that.
Re: Timeperiod based intervals for hosts/services
Posted: Sun Feb 12, 2012 5:10 pm
by jsmurphy
Yep, you are 100% correct. I didn't say it was a good solution, just the only one I can think of that would work under those guidelines.