Trying to understand escalation



Post by jag7720

I have taken over an existing Nagios system and I am trying to understand escalations.

Here is an example of a PING check on one of our servers:


define service {
        host_name                       yuma-065wm
        service_description             PING
        is_volatile                     0
        check_command                   check_ping!200,20%!500,60%
        max_check_attempts              5
        normal_check_interval           5
        retry_check_interval            5
        passive_checks_enabled          1
        check_period                    24x7
        parallelize_check               1
        obsess_over_service             1
        check_freshness                 1
        event_handler_enabled           1
        flap_detection_enabled          1
        process_perf_data               0
        retain_status_information       1
        retain_nonstatus_information    1
        contact_groups                  novell-admins-Email
        active_checks_enabled           0
        notification_interval           120
        notification_period             24x7
        notification_options            w,c,u,r,f
        notifications_enabled           1
        register                        1
}

From my understanding, this means that a check of the service will occur every 5 minutes (normal_check_interval 5).

If a check changes from an OK state to a non-OK (soft) state, the checks will occur every 5 minutes (retry_check_interval 5), 5 more times (max_check_attempts 5).

If the service stays in a non-OK state (25 min later), an alert will be sent out, and the service will again be checked every 5 minutes because the state has changed again (retry_check_interval 5). If it stays in that hard non-OK state, the checks will revert to the normal_check_interval of every 5 minutes, and an alert will go out every 5 minutes.

Is that right?
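
One way to check the arithmetic above is a minimal sketch that lists the check attempts leading up to the hard state. It assumes the first non-OK result already counts as attempt 1 of max_check_attempts, that interval values are in minutes (the default interval_length of 60 seconds), and that checks really do arrive at the retry interval. The function name attempt_timeline is made up for illustration and is not part of Nagios.

```python
def attempt_timeline(max_check_attempts, retry_interval):
    """List (attempt, minutes_since_first_failure, state_type) tuples.

    Assumption: the first non-OK check result is attempt 1, and each
    retry happens retry_interval minutes after the previous attempt.
    """
    timeline = []
    for attempt in range(1, max_check_attempts + 1):
        minutes = (attempt - 1) * retry_interval
        # The state becomes HARD once the attempt count reaches the maximum.
        state_type = "HARD" if attempt == max_check_attempts else "SOFT"
        timeline.append((attempt, minutes, state_type))
    return timeline


# Values taken from the PING service definition above.
for attempt, minutes, state_type in attempt_timeline(5, 5):
    print(f"attempt {attempt}/5 at +{minutes} min: {state_type}")
```

Under those assumptions the hard state, and with it the first notification, would come 20 minutes after the first failed check rather than 25, because only four retries follow the initial failure.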

Then, according to my escalation rule


define serviceescalation {
        hostgroup_name                  Branch_Servers
        service_description             PING,LVM,Load,NSS,NTPD Syncing,RAID Status,San Storage
        contact_groups                  admins-SMS
        first_notification              2
        last_notification               6
        notification_interval           60
        escalation_period               Branch_Servers
        escalation_options              u,c,r
}

I will get an escalation alert on the second alert (which would be 10 minutes after the state change), then every 25 minutes after that, but only on the second and third alerts, at 25 min and 50 min.


So I'll get a level-one alert at 25 min and every 5 min until recovery.

And an escalation alert at 25 min and no others, because the notification_interval of the escalation is 60 and alerts 4, 5, and 6 come before the 60-minute notification_interval.

If that is incorrect, what does "notification_interval 60" mean? And what should it realistically be set to?
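
As a rough sketch of how first_notification and last_notification bound an escalation, the snippet below checks which notification numbers fall inside the window. It only models the numeric range, with last_notification 0 treated as "no upper limit"; escalation_period and escalation_options would narrow things further and are left out. The function name escalation_applies is hypothetical, not a Nagios API.

```python
def escalation_applies(notification_number, first_notification, last_notification):
    """Return True if the notification number falls in the escalation window.

    Assumption: only the first/last range is modelled here; the time
    period and state options of the escalation are ignored.
    """
    if notification_number < first_notification:
        return False
    # A last_notification of 0 means the escalation never expires by count.
    if last_notification != 0 and notification_number > last_notification:
        return False
    return True


# Values from the serviceescalation definition above: first 2, last 6.
covered = [n for n in range(1, 9) if escalation_applies(n, 2, 6)]
print(covered)  # [2, 3, 4, 5, 6]
```

Read this way, the window counts notifications for the problem, not minutes, so the escalation would cover the second through sixth notifications, whenever they happen to be sent.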

If I have any of this wrong, please correct me as necessary. I'm trying to wrap my head around this.


Thanks