Nagios service escalation not working as expected
Posted: Tue Mar 27, 2012 3:03 pm
by hssbc
I'm having issues with service escalations that keep sending emails every 5 minutes instead of following the schedule they are set up for.
I want two notifications five minutes apart, then one 15 minutes later, one 30 minutes later, one 60 minutes later, and finally two more four hours apart, then no more.
What actually happens is that I get one every five minutes whenever the output changes. For example, if CPU load triggers a warning at 2 and then changes to 2.1, that seems to reset the escalation, and I keep getting the five-minute alerts.
I would expect the escalations to be followed as I have set them up unless the state changes from Warning to Critical.
Here is what I have in my escalations rules.
define serviceescalation{
    host_name               *
    service_description     *
    first_notification      2
    last_notification       2
    notification_interval   15
}
define serviceescalation{
    host_name               *
    service_description     *
    first_notification      3
    last_notification       3
    notification_interval   30
}
define serviceescalation{
    host_name               *
    service_description     *
    first_notification      4
    last_notification       4
    notification_interval   60
}
define serviceescalation{
    host_name               *
    service_description     *
    first_notification      5
    last_notification       7
    notification_interval   240
}
Let me know if anyone sees something wrong here.
Nagios server config:
Nagios core 3.3.1 on Linux ES5.7 64b VM
Thanks
Re: Nagios service escalation not working as expected
Posted: Tue Mar 27, 2012 10:33 pm
by jsmurphy
Do you have state stalking enabled on that CPU service? I believe the default behaviour is to track only state changes (warning -> critical -> etc.), but you can alter that with options like state stalking, which tells Nagios to also care whether the actual description (output) text changes.
Re: Nagios service escalation not working as expected
Posted: Wed Mar 28, 2012 11:09 am
by hssbc
I don't have state stalking enabled on any services.
I decided to run a more controlled test using check_local_users so I have better control over the result.
The warning threshold is 10 and the critical threshold is 13.
Here are the results:
From Subject Received Size Categories
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 6:50 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 6:45 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 6:40 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 6:35 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 6:30 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 6:25 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 6:20 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 6:16 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 6:10 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 6:05 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 6:00 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 5:55 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 5:50 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 5:45 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 5:40 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 5:35 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 5:30 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 5:25 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 5:20 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 5:15 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 5:10 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 5:05 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 5:00 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 4:55 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** 12:55 AM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** Tue 8:55 PM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** Tue 4:55 PM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** Tue 3:55 PM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** Tue 3:25 PM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** Tue 3:10 PM
nagios_system_account ** PROBLEM Service Alert: localhost/Current Users is WARNING ** Tue 3:05 PM
Okay, this worked as expected, except that after the last escalation at 4:55 Nagios keeps sending alerts every 5 minutes. The user count did not change during this test, and I'm still getting alerts every 5 minutes as we speak.
Is my last escalation definition not set up correctly?
I would need to test the CPU load the same way, but that needs a dedicated server. I'll set one up and post the results.
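For what it's worth, the times in the mailbox listing above are consistent with a simple model of how the intervals are picked. The sketch below is only that model in Python, not Nagios code. It assumes that the service's own notification_interval is 5 minutes (inferred from the five-minute alerts, not shown in the config), that the wait after notification N comes from the escalation whose range contains N (falling back to the service interval when no range matches), and that an interval of 0 means no further problem notifications. The names interval_after and timeline are made up for this illustration.

```python
from datetime import datetime, timedelta

# Escalation ranges copied from the first post:
# (first_notification, last_notification, notification_interval in minutes).
ESCALATIONS = [(2, 2, 15), (3, 3, 30), (4, 4, 60), (5, 7, 240)]

# Assumption: the service's own notification_interval is 5 minutes.
SERVICE_INTERVAL = 5


def interval_after(n, escalations, default):
    """Minutes to wait after notification number n (simplified model)."""
    for first, last, interval in escalations:
        # A last_notification of 0 is treated as "this range never ends".
        if n >= first and (last == 0 or n <= last):
            return interval
    # No escalation covers notification n: use the service's own interval.
    return default


def timeline(start, count, escalations=ESCALATIONS, default=SERVICE_INTERVAL):
    """Return the send times of up to `count` notifications, starting at `start`."""
    times = [start]
    while len(times) < count:
        wait = interval_after(len(times), escalations, default)
        if wait == 0:
            # Modelled as "stop sending problem notifications".
            break
        times.append(times[-1] + timedelta(minutes=wait))
    return times


if __name__ == "__main__":
    # First alert in the listing: Tue 3:05 PM.
    for sent in timeline(datetime(2012, 3, 27, 15, 5), 10):
        print(sent.strftime("%a %I:%M %p"))
```

Under these assumptions, notification 7 still carries the 240-minute interval, so an eighth alert goes out four hours later, and from notification 8 onward no range matches, so the 5-minute service interval applies again. One thing that might be worth trying (untested here, so please check it against the serviceescalation documentation) is to end the 240-minute range at 6 and add a final escalation with first_notification 7, last_notification 0 and notification_interval 0. In the model that corresponds to passing escalations=[(2, 2, 15), (3, 3, 30), (4, 4, 60), (5, 6, 240), (7, 0, 0)] to timeline, which stops after the seventh notification.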
Re: Nagios service escalation not working as expected
Posted: Wed Apr 18, 2012 10:26 am
by hssbc
I ran some controlled tests, and Nagios is indeed working as it should: the escalations were sent as expected. The only thing I can think of now is that the services may be flapping and/or going back to an OK state and then to Warning/Critical again.
Re: Nagios service escalation not working as expected
Posted: Wed Apr 18, 2012 7:44 pm
by jsmurphy
You should be able to see it return to an OK state in the logs if the state was resetting. Run an availability report for that service and then click "show all log entries", which will show you soft states as well as hard ones, and see if it is changing.