Service Flapping And Notifications

toodaly · Post by **toodaly** » Mon Dec 21, 2015 2:55 pm

When flapping is detected for a service (e.g. CPU Utilization) and notifications are temporarily disabled, do other services on the same host get disabled as well (e.g. notifications for Disk Utilization)?

Thanks.

jolson · Post by **jolson** » Mon Dec 21, 2015 3:35 pm

It's worth giving this a read: https://assets.nagios.com/downloads/nag ... pping.html

When a service or host is first detected as flapping, Nagios will:

Log a message indicating that the service or host is flapping.
Add a non-persistent comment to the host or service indicating that it is flapping.
Send a "flapping start" notification for the host or service to appropriate contacts.
Suppress other notifications for the service or host (this is one of the filters in the notification logic).

When a service or host stops flapping, Nagios will:

Log a message indicating that the service or host has stopped flapping.
Delete the comment that was originally added to the service or host when it started flapping.
Send a "flapping stop" notification for the host or service to appropriate contacts.
Remove the block on notifications for the service or host (notifications will still be bound to the normal notification logic).

toodaly · Post by **toodaly** » Mon Dec 21, 2015 4:24 pm

Suppress other notifications for the service or host (this is one of the filters in the notification logic).

Yes, I had taken a look at the documentation. The line above is the one I was asking about. Does a flapping service put its host in a flapping state, which would in turn put the host's other services in a flapping state? What does "other" mean in "suppress other notifications"?

Thanks.

jolson · Post by **jolson** » Mon Dec 21, 2015 4:33 pm

No problem! A flapping host will not affect the services that are attached to that host - likewise a flapping service does not affect the host. I hope that helps!
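
To make the scoping concrete, here is a toy Python model of the behaviour described in this reply. It is only an illustration, not Nagios code: the `FlapState` class and its method names are hypothetical, and the only rule it models is that the flapping filter is keyed to one host or one service and is never inherited.

```python
class FlapState:
    """Toy model: flapping is tracked per object and never inherited."""

    def __init__(self):
        # Keys are ("host", host_name) or ("service", host_name, service_name).
        self._flapping = set()

    def set_flapping(self, key, flapping):
        if flapping:
            self._flapping.add(key)
        else:
            self._flapping.discard(key)

    def suppressed_by_flapping(self, key):
        # Only the object's own flapping state is consulted.
        return key in self._flapping


state = FlapState()
cpu = ("service", "linux_srv01", "CPU Utilization")
disk = ("service", "linux_srv01", "Disk Utilization")
host = ("host", "linux_srv01")

state.set_flapping(cpu, True)
print(state.suppressed_by_flapping(cpu))   # True
print(state.suppressed_by_flapping(disk))  # False
print(state.suppressed_by_flapping(host))  # False
```

Other notification filters (downtime, notification options, and so on) still apply independently and are deliberately left out of this sketch.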

toodaly · Post by **toodaly** » Wed Feb 24, 2016 11:10 am

I have another flapping-related question. Below are my current settings, followed by the log for a flapping service (Linux Load).
Current settings (default):
low_host_flap_threshold 5.0
high_host_flap_threshold 20.0
low_service_flap_threshold 5.0
high_service_flap_threshold 20.0

Feb 13 18:30:39 NAGIOS_SRV01 nagios: SERVICE ALERT: linux_srv01;Load;CRITICAL;HARD;1;CRITICAL - load average: 0.16, 0.26, 0.23
Feb 13 18:36:10 NAGIOS_SRV01 nagios: SERVICE ALERT: linux_srv01;Load;OK;HARD;1;OK - load average: 0.09, 0.26, 0.24
Feb 13 18:40:10 NAGIOS_SRV01 nagios: SERVICE ALERT: linux_srv01;Load;CRITICAL;HARD;1;CRITICAL - load average: 0.34, 0.28, 0.25
Feb 13 18:46:41 NAGIOS_SRV01 nagios: SERVICE ALERT: linux_srv01;Load;WARNING;HARD;1;WARNING - load average: 0.08, 0.18, 0.20
Feb 13 18:46:41 NAGIOS_SRV01 nagios: SERVICE FLAPPING ALERT: linux_srv01;Load;STARTED; Service appears to have started flapping (20.4% change >= 20.0% threshold)
Feb 13 18:54:44 NAGIOS_SRV01 nagios: SERVICE ALERT: linux_srv01;Load;CRITICAL;HARD;1;CRITICAL - load average: 0.30, 0.18, 0.18
Feb 13 18:56:45 NAGIOS_SRV01 nagios: SERVICE ALERT: linux_srv01;Load;WARNING;HARD;1;WARNING - load average: 0.12, 0.16, 0.17
Feb 13 18:58:46 NAGIOS_SRV01 nagios: SERVICE ALERT: linux_srv01;Load;CRITICAL;HARD;1;CRITICAL - load average: 0.26, 0.18, 0.17
Feb 13 18:59:24 NAGIOS_SRV01 nagios: SERVICE ALERT: linux_srv01;Load;WARNING;HARD;1;WARNING - load average: 0.13, 0.15, 0.17
Feb 13 18:59:47 NAGIOS_SRV01 nagios: SERVICE ALERT: linux_srv01;Load;CRITICAL;HARD;1;CRITICAL - load average: 0.39, 0.21, 0.18
Feb 13 19:00:24 NAGIOS_SRV01 nagios: SERVICE ALERT: linux_srv01;Load;WARNING;HARD;1;WARNING - load average: 0.20, 0.18, 0.17
Feb 13 19:00:47 NAGIOS_SRV01 nagios: SERVICE ALERT: linux_srv01;Load;CRITICAL;HARD;1;CRITICAL - load average: 0.37, 0.22, 0.19
Feb 13 19:01:45 NAGIOS_SRV01 nagios: SERVICE ALERT: linux_srv01;Load;WARNING;HARD;1;WARNING - load average: 0.18, 0.19, 0.18
Feb 13 19:04:41 NAGIOS_SRV01 nagios: SERVICE ALERT: linux_srv01;Load;OK;HARD;1;OK - load average: 0.07, 0.12, 0.15
Feb 13 19:19:54 NAGIOS_SRV01 nagios: SERVICE FLAPPING ALERT: linux_srv01;Load;STOPPED; Service appears to have stopped flapping (4.3% change < 5.0% threshold)

According to the documentation, state changes over the last 21 service checks are weighted between 1.2 (newest) and 0.8 (oldest). I'm assuming those checks cover the last 20 minutes, so the weight drops by 0.02 per service check (or per minute):
Feb 13 18:46:41 - 1.2 - Most current and most heavily weighted over the last 20 min
Feb 13 18:40:10 - 1.06 - 6 minutes 31 seconds previous = - 0.14 (assuming it goes down to the second and round up to the nearest 100th)
Feb 13 18:36:10 - 0.98 - 4 minutes previous = 4 * 0.02 = - 0.08
Feb 13 18:30:39 - 0.86 - 5 minutes 31 seconds previous = 5.5 * 0.02 = - 0.12
Total: 4.1 / 20 = 20.5% which is roughly close to the 20.4% in the log, so I somewhat understand how flapping is detected.

My question is about the 4.3%, which I can't figure out. The last 20 minutes should run from 18:59:54 to 19:19:54, which includes the service state changes at 19:00:24, 19:00:47, 19:01:45, and 19:04:41. The documentation doesn't explain (or I can't find) how Nagios determines that a host or service is no longer flapping, other than the state changes falling outside of the 20 minutes (21 service checks). Does "schedule an immediate check" factor into this?

Thanks

tmcdonald · Post by **tmcdonald** » Wed Feb 24, 2016 6:14 pm

"Schedule an Immediate Check" will absolutely disrupt the execution timing, so if you do that 21 times in a minute (to use an extreme example) it will even out the check history and most likely will not be flapping anymore. Does that clear it up? It's based off checks, not minutes, in other words.

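The point that the calculation is driven by checks rather than by the clock can be sketched in a few lines of Python. This is a rough model and not the Nagios source: the function name `weighted_percent_change` is hypothetical, and the 0.8 and 1.2 weights are taken from the reading of the documentation earlier in this thread, so the constants Nagios actually uses may differ. Each of the last 21 check results contributes one slot per consecutive pair (20 slots), and a slot counts only if the state changed between the two results.

```python
def weighted_percent_change(states, low_weight=0.8, high_weight=1.2):
    """Return the weighted percent state change for a list of check results.

    `states` is ordered oldest to newest. Each pair of consecutive results
    is one possible state change; the oldest pair gets `low_weight`, the
    newest pair gets `high_weight`, with a linear ramp in between.
    (The weights are assumptions taken from the thread, not verified.)
    """
    slots = len(states) - 1  # 21 results -> 20 possible state changes
    if slots < 2:
        raise ValueError("need at least three check results")
    step = (high_weight - low_weight) / (slots - 1)
    total = 0.0
    for i in range(slots):
        if states[i] != states[i + 1]:
            total += low_weight + i * step
    return total * 100.0 / slots


# 21 results with a single state change left in the 4th-oldest slot.
history = ["WARNING"] * 4 + ["OK"] * 17
print(round(weighted_percent_change(history), 1))  # 4.3
```

Under these assumed weights, a history in which only one old state change remains gives about 4.3%, which would be consistent with the STOPPED line in the log, although the log alone cannot confirm it. The `SERVICE ALERT` entries record state changes only, so unchanged results (including any from "Schedule an Immediate Check") still push older changes out of the 21-entry window without appearing in the log.
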
Nagios Support Forum
