Nagios service escalation not working as expected

Postby hssbc » Tue Mar 27, 2012 3:03 pm

I'm having issues with service escalations: they keep sending emails every 5 minutes instead of following the schedule they are set up for.
I want two notifications sent five minutes apart, then one 15 minutes later, one 30 minutes later and one 60 minutes later, then finally two more four hours apart, and then no more.

What actually happens is that I get them every five minutes whenever the output changes. For example, if CPU load triggers a warning at 2 and then changes to 2.1, that seems to reset the escalation, and I keep getting the five-minute alerts.
I would expect the escalations to be followed as I have them set up unless the state changes from Warning to Critical.

Here is what I have in my escalation rules.


# After notification 2, wait 15 minutes before the next one.
define serviceescalation{
    host_name               *
    service_description     *
    first_notification      2
    last_notification       2
    notification_interval   15
}

# After notification 3, wait 30 minutes.
define serviceescalation{
    host_name               *
    service_description     *
    first_notification      3
    last_notification       3
    notification_interval   30
}

# After notification 4, wait 60 minutes.
define serviceescalation{
    host_name               *
    service_description     *
    first_notification      4
    last_notification       4
    notification_interval   60
}

# For notifications 5 through 7, wait 240 minutes (four hours).
define serviceescalation{
    host_name               *
    service_description     *
    first_notification      5
    last_notification       7
    notification_interval   240
}


Let me know if anyone sees something wrong here.

Nagios server config:
Nagios core 3.3.1 on Linux ES5.7 64b VM


Thanks
hssbc
 
Posts: 3
Joined: Fri Mar 23, 2012 10:45 am

Re: Nagios service escalation not working as expected

Postby jsmurphy » Tue Mar 27, 2012 10:33 pm

Do you have state stalking enabled on that CPU service? I believe the default behaviour is to track only state changes (warning -> critical, etc.), but you can alter that with options such as state stalking, which tells Nagios to also care when the plugin output itself changes.
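
For reference, here is a minimal sketch of what stalking would look like if it were enabled, so you know what to search your object files for. The host and service names are placeholders, and the other required service directives are left out; only stalking_options is the point here. As far as I know, stalking is mainly about logging output changes, so treat this as something to rule out rather than a confirmed cause.

```cfg
# Fragment of a service definition, not complete on its own:
# check_command, contacts, check intervals, etc. are omitted.
define service{
    host_name               localhost       ; placeholder host
    service_description     Current Load    ; placeholder service name
    stalking_options        w,c             ; stalk WARNING and CRITICAL states
}
```

If neither the service nor any template it inherits from sets stalking_options, stalking should not be in play.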
jsmurphy
 
Posts: 932
Joined: Wed Aug 18, 2010 9:46 pm

Re: Nagios service escalation not working as expected

Postby hssbc » Wed Mar 28, 2012 11:09 am

I don't have state stalking enabled on any services.

I decided to run a more controlled test using check_local_users.
The warning threshold is 10 and the critical threshold is 13.

Here are the results:

From, Subject, Received (newest first):
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 6:50 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 6:45 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 6:40 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 6:35 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 6:30 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 6:25 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 6:20 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 6:16 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 6:10 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 6:05 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 6:00 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 5:55 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 5:50 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 5:45 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 5:40 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 5:35 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 5:30 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 5:25 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 5:20 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 5:15 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 5:10 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 5:05 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 5:00 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 4:55 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 12:55 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** Tue 8:55 PM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** Tue 4:55 PM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** Tue 3:55 PM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** Tue 3:25 PM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** Tue 3:10 PM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** Tue 3:05 PM

Okay, this worked as expected, except that after the last escalation at 4:55 AM, Nagios keeps sending alerts every 5 minutes. The user count did not change during this test, and I'm still getting alerts every 5 minutes as we speak.

Is my last escalation definition not set up correctly?

I would need to test the CPU load the same way, but that needs a dedicated server. I'll set one up and post the results.
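
On the last escalation question, here is one possible reading of the mail list above (a sketch, not verified against this setup). The escalations only cover notifications 2 through 7, and the mail at 4:55 AM looks like notification 8. From that point no escalation matches, so Nagios would fall back to the service's own notification_interval, which appears to be 5 minutes here. The Nagios documentation describes two special values that could end the series: last_notification 0 keeps an escalation in effect indefinitely, and notification_interval 0 sends the notification at which the escalation becomes valid and then suppresses further problem notifications. The contact_groups value below is a placeholder and an assumption, since the posted definitions do not show any contacts.

```cfg
# Possible replacement for the 5-7 escalation from the first post.

# Notifications 5 and 6: wait 240 minutes before the next one.
define serviceescalation{
    host_name               *
    service_description     *
    contact_groups          admins      ; placeholder group name
    first_notification      5
    last_notification       6
    notification_interval   240
}

# Notification 7 onward: send it, then stop re-notifying.
define serviceescalation{
    host_name               *
    service_description     *
    contact_groups          admins      ; placeholder group name
    first_notification      7
    last_notification       0           ; 0 = escalation never expires
    notification_interval   0           ; 0 = no further problem notifications
}
```

With this, notification 7 would be the last problem notification, which matches the "two, four hours apart, then no more" goal. If one more four-hour repeat is wanted, shift the boundary from 6/7 to 7/8.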
hssbc
 
Posts: 3
Joined: Fri Mar 23, 2012 10:45 am

Re: Nagios service escalation not working as expected

Postby hssbc » Wed Apr 18, 2012 10:26 am

I ran some controlled tests, and Nagios is indeed working as it should: escalations are sent as expected. The only explanation I can think of now is that the services may be flapping, or simply going back to an OK state and then to Warning/Critical again.
hssbc
 
Posts: 3
Joined: Fri Mar 23, 2012 10:45 am

Re: Nagios service escalation not working as expected

Postby jsmurphy » Wed Apr 18, 2012 7:44 pm

If the state was resetting, you should be able to see it return to an OK state in the logs. Run an availability report for that service and then click "show all log entries", which will show you soft states as well as hard ones, and see whether it is changing.
jsmurphy
 
Posts: 932
Joined: Wed Aug 18, 2010 9:46 pm

