Can someone explain how I should set up the following? We have an "on duty" BlackBerry which only receives alerts from 07:00 to 23:00, as we do like to get some sleep at night.
Now the problem is that we do want to receive alerts on this BlackBerry if something happened at night and the problem is still there. At the moment we monitor everything with SCOM 2007, and there we execute a status reset of all our servers at 07:00, after which problems that still exist send a new alert. How should I do something like a status reset of all servers / services at 07:00? Or is there some other way to resend alerts for critical servers at 07:00?
The fourth host or service filter that must be passed is the time period test. Each host and service definition has a <notification_period> option that specifies which time period contains valid notification times for the host or service. If the time that the notification is being made does not fall within a valid time range in the specified time period, no one gets contacted. If it falls within a valid time range, the notification gets passed to the next filter... Note: If the time period filter is not passed, Nagios will reschedule the next notification for the host or service (if its in a non-OK state) for the next valid time present in the time period. This helps ensure that contacts are notified of problems as soon as possible when the next valid time in time period arrives.
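To make the quoted behaviour concrete, here is a minimal Python sketch of the time period test. It is only a model of the logic described above, not Nagios code: the 07:00 to 23:00 range, the function names and the end-exclusive boundary are all assumptions made for illustration.

```python
from datetime import datetime, time, timedelta

# Hypothetical notification_period: one (start, end) range applied to every day.
NOTIFICATION_PERIOD = [(time(7, 0), time(23, 0))]


def in_period(moment, ranges):
    """True if the time of day is inside any range (start inclusive, end exclusive)."""
    return any(start <= moment.time() < end for start, end in ranges)


def next_valid_time(moment, ranges):
    """Return moment itself if it is valid, else the next range start after it."""
    if in_period(moment, ranges):
        return moment
    candidates = []
    for day_offset in (0, 1):  # look at today and tomorrow
        day = moment.date() + timedelta(days=day_offset)
        for start, _end in ranges:
            candidate = datetime.combine(day, start)
            if candidate >= moment:
                candidates.append(candidate)
    return min(candidates)


def time_period_filter(moment, ranges):
    """Model of the filter: notify now, or reschedule for the next valid time."""
    if in_period(moment, ranges):
        return ("notify", moment)
    return ("reschedule", next_valid_time(moment, ranges))
```

With these assumed values, a problem notification attempted at 02:00 is rescheduled for 07:00 the same day, while one attempted at 12:00 is passed on to the next filter.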
Former Nagios employee
I haven't been able to test this, but if I read this correctly, it means that if a host or service is still down at the moment a notification period becomes active, a notification will be sent out. Does this also apply when there are several notification contacts, for example:
user a: 09:00 - 23:00
user b: always available
alert comes in at 02:00 => notification is sent out to user b
meanwhile the problem is still there, will user a also get a notification at 09:00 for this problem?
If they are using the same time period, or the notification period is the same on the host/service the above is correct. If the host/service is still down when user A's timeperiod comes online, and they are designated to receive alerts for the object, they will receive the notification that the object is still down.
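As a rough illustration of the per-contact check, the sketch below models which contacts would pass their own time period for a notification sent at a given time of day. The contact names and periods are taken from the example above; the function name and the end-exclusive boundary are assumptions, and the sketch says nothing about whether a notification is actually generated at that time, which depends on the host/service settings.

```python
from datetime import time

# Contact notification periods from the example (start inclusive, end exclusive).
# None stands for "always available".
CONTACT_PERIODS = {
    "user a": (time(9, 0), time(23, 0)),
    "user b": None,
}


def contacts_to_notify(at, periods=CONTACT_PERIODS):
    """Return the contacts whose own period covers the time of day `at`."""
    return sorted(
        name
        for name, period in periods.items()
        if period is None or period[0] <= at < period[1]
    )
```

In this model a notification sent at 02:00 reaches only user b, and one sent at 09:00 reaches both user a and user b.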
Strange, because this morning at 10 am I received an alert on my "on duty" BlackBerry that a critical web application had recovered, and no other alerts. This web application has a recurring downtime from 2:00 to 7:00 each day. Today, however, the maintenance tasks went wrong and, looking into the event log, I can see that it took 3 more hours to finish its jobs.
So the on duty user / contact has a notification period from 08:00 to 23:00. As you say, "If the host/service is still down when user A's timeperiod comes online, and they are designated to receive alerts for the object, they will receive the notification that the object is still down."
Why didn't I receive any email on my BlackBerry at 08:00 that my critical web application was still down? Maybe it's important to say that the hosts and services are configured to only send an email every 1440 minutes (24 hours) instead of the default of 60 minutes. Could this be the reason?
As we'd rather not send an email every 60 minutes to all contacts, is there any way to reset the health of all critical services at, for example, 08:00, or maybe to schedule a new check for all critical hosts / services, so that when this new check fails it sends an email to the available contacts?
Maybe it's important to say that the hosts and services are configured to only send an email every 1440 minutes (24 hours) instead of the default of 60 minutes. Could this be the reason?
Yes, this would be precisely the reason. You may want to decrease the notification interval for objects such as these, since during downtime they depend on another application updating, pruning, or whatever else it may be that causes them to go down.
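To see why the interval matters, here is a small Python sketch that models renotifications for an object that stays down. It is a simplified model, not Nagios itself. It assumes the first problem notification went out at 07:00 when the scheduled downtime ended, that the application recovered at 10:00, and that the on duty contact is reachable from 08:00 to 23:00. The date and the function names are made up for the example.

```python
from datetime import datetime, time, timedelta


def renotification_times(first, interval_minutes, until):
    """Times at which a still-failing object would notify, from `first` up to `until`."""
    times = []
    current = first
    while current <= until:
        times.append(current)
        current += timedelta(minutes=interval_minutes)
    return times


def first_delivery(times, start, end):
    """First notification time inside the contact's period, or None if there is none."""
    for moment in times:
        if start <= moment.time() < end:
            return moment
    return None


# Assumed scenario: first notification at 07:00, recovery at 10:00 (hypothetical date).
first_notification = datetime(2024, 1, 1, 7, 0)
recovery = datetime(2024, 1, 1, 10, 0)
on_duty = (time(8, 0), time(23, 0))

results = {}
for interval in (1440, 60):
    sent = renotification_times(first_notification, interval, recovery)
    results[interval] = first_delivery(sent, *on_duty)
    print(interval, results[interval])
```

Under these assumptions, the 1440 minute interval produces no problem notification inside the on duty period before the recovery, which matches receiving only the recovery alert. With a 60 minute interval the on duty contact would first be reached at 08:00.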