Checks randomly not reaching hard state
Posted: Mon Aug 20, 2018 10:18 am
Since upgrading to Nagios Core 4.4.x (we are now on 4.4.2, the latest), we have seen a recurring and serious issue: checks randomly remain in a soft state even after they have reached max_check_attempts. As a result they never send a problem notification, although we do get recovery notifications. Here is an example service definition and the corresponding event log entries:
define service{
use generic-service
host_name devel.blahblah.com
service_description root filesystem
is_volatile 0
check_period 24x7
max_check_attempts 3
check_interval 2
retry_interval 1
contact_groups blahblah-sysadmins
notification_interval 240
notification_period HDhours ; this occurred within this defined time period
notification_options u,c,r
check_command check_nrpe!check_root
}
[08-20-2018 09:59:11] SERVICE ALERT: devel.blablah.com;root filesystem;OK;HARD;1;DISK OK - free space: / 957 MB (20% inode=79%):
[08-20-2018 09:59:11] SERVICE NOTIFICATION: admin1;devel.blablah.com;root filesystem;OK;notify-by-email;DISK OK - free space: / 957 MB (20% inode=79%):
[08-20-2018 09:58:11] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;3;Connection refused or timed out
[08-20-2018 09:57:05] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;devel.blablah.com;root filesystem;1534777024
[08-20-2018 09:56:16] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;3;Connection refused or timed out
[08-20-2018 09:55:13] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;3;Connection refused or timed out
[08-20-2018 09:54:10] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;3;Connection refused or timed out
[08-20-2018 09:54:07] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;devel.blablah.com;root filesystem;1534776846
[08-20-2018 09:54:02] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;3;Connection refused or timed out
[08-20-2018 09:53:48] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;2;Connection refused or timed out
[08-20-2018 09:53:45] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;devel.blablah.com;root filesystem;15347768241
[08-20-2018 09:53:08] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;1;Connection refused or timed out
We first noticed that we were not receiving any problem (non-recovery) alerts. My reading of the data above is that the service is set to go hard after 3 failed attempts but never does, so it never sends an alert. It does send a recovery, though. This Nagios Core instance has been in place and rock solid for years until v4.4. Please let me know if more information would help; I'll gladly provide it. Thanks.
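In case it helps anyone check their own logs for the same symptom, here is a rough Python sketch (not anything shipped with Nagios) that scans SERVICE ALERT lines and flags non-OK entries still marked SOFT at or past max_check_attempts. The function name find_stuck_soft_alerts and the regular expression are my own and assume the semicolon-separated layout shown in the log above; max_check_attempts is passed in by hand rather than read from the object config.

```python
import re

# Assumed layout, taken from the log lines in this post:
# [timestamp] SERVICE ALERT: host;service;state;state_type;attempt;output
ALERT_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] SERVICE ALERT: "
    r"(?P<host>[^;]+);(?P<service>[^;]+);(?P<state>[^;]+);"
    r"(?P<state_type>SOFT|HARD);(?P<attempt>\d+);(?P<output>.*)"
)


def find_stuck_soft_alerts(log_lines, max_check_attempts):
    """Return (timestamp, host, service, attempt) for every non-OK alert
    that is still SOFT although attempt >= max_check_attempts."""
    stuck = []
    for line in log_lines:
        m = ALERT_RE.search(line)  # search() tolerates text before the "["
        if m is None:
            continue  # not a SERVICE ALERT line (notification, command, ...)
        attempt = int(m["attempt"])
        if (m["state"] != "OK"
                and m["state_type"] == "SOFT"
                and attempt >= max_check_attempts):
            stuck.append((m["ts"], m["host"], m["service"], attempt))
    return stuck


# A few lines copied from the log above.
SAMPLE = [
    "[08-20-2018 09:59:11] SERVICE ALERT: devel.blablah.com;root filesystem;OK;HARD;1;DISK OK - free space: / 957 MB (20% inode=79%):",
    "[08-20-2018 09:54:02] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;3;Connection refused or timed out",
    "[08-20-2018 09:53:48] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;2;Connection refused or timed out",
    "[08-20-2018 09:53:45] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;devel.blablah.com;root filesystem;15347768241",
    "[08-20-2018 09:53:08] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;1;Connection refused or timed out",
]

if __name__ == "__main__":
    for ts, host, service, attempt in find_stuck_soft_alerts(SAMPLE, 3):
        print(f"{ts} {host} / {service}: SOFT at attempt {attempt}")
```

Run against the full log above with max_check_attempts set to 3, every CRITICAL;SOFT;3 line should be reported, which is exactly the behaviour I am describing.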