Checks randomly not reaching hard state

NSchoenbaechler · Post by **NSchoenbaechler** » Mon Aug 20, 2018 10:18 am

Since upgrading to Nagios Core 4.4.x (we are now on 4.4.2, the latest), we have seen a recurring and serious issue: checks randomly remain in a soft state even after they have reached their max check attempts. As a result they never send a problem notification, yet we do get recovery notifications. Here's an example service and the corresponding event log entries:

define service{
    use                     generic-service
    host_name               devel.blahblah.com
    service_description     root filesystem
    is_volatile             0
    check_period            24x7
    max_check_attempts      3
    check_interval          2
    retry_interval          1
    contact_groups          blahblah-sysadmins
    notification_interval   240
    notification_period     HDhours    ; this occurred within this defined time period
    notification_options    u,c,r
    check_command           check_nrpe!check_root
}

[08-20-2018 09:59:11] SERVICE ALERT: devel.blablah.com;root filesystem;OK;HARD;1;DISK OK - free space: / 957 MB (20% inode=79%):
[08-20-2018 09:59:11] SERVICE NOTIFICATION: admin1;devel.blablah.com;root filesystem;OK;notify-by-email;DISK OK - free space: / 957 MB (20% inode=79%):
[08-20-2018 09:58:11] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;3;Connection refused or timed out
[08-20-2018 09:57:05] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;devel.blablah.com;root filesystem;1534777024
[08-20-2018 09:56:16] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;3;Connection refused or timed out
[08-20-2018 09:55:13] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;3;Connection refused or timed out
[08-20-2018 09:54:10] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;3;Connection refused or timed out
[08-20-2018 09:54:07] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;devel.blablah.com;root filesystem;1534776846
[08-20-2018 09:54:02] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;3;Connection refused or timed out
[08-20-2018 09:53:48] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;2;Connection refused or timed out
[08-20-2018 09:53:45] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;devel.blablah.com;root filesystem;15347768241
[08-20-2018 09:53:08] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;1;Connection refused or timed out

We noticed that we were no longer receiving problem (non-recovery) alerts. The way I'm interpreting the data above, the service is set to go hard after 3 failed attempts but never does, and so never sends an alert. It does send a recovery, though. This Nagios Core instance has been in place and rock solid for years until v4.4. Please let me know if more information would be helpful; I'll gladly provide it. Thanks.
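For anyone wanting to check their own logs for the same symptom, here is a minimal Python sketch that flags SERVICE ALERT entries still marked SOFT at or beyond the configured max_check_attempts. It assumes the semicolon-separated line format shown in the log excerpt above; the function name, the regular expression, and the sample list are illustrative and not part of Nagios.

```python
import re

# Matches SERVICE ALERT lines in the format shown in the log excerpt, e.g.
# [08-20-2018 09:58:11] SERVICE ALERT: host;service;CRITICAL;SOFT;3;output
# search() is used so any text before the timestamp is ignored.
ALERT_RE = re.compile(
    r"\[(?P<when>[^\]]+)\] SERVICE ALERT: "
    r"(?P<host>[^;]*);(?P<service>[^;]*);(?P<state>[^;]*);"
    r"(?P<state_type>SOFT|HARD);(?P<attempt>\d+);(?P<output>.*)"
)


def find_stuck_soft_alerts(lines, max_check_attempts):
    """Return (timestamp, host, service, attempt) for every non-OK alert
    that is still SOFT although attempt >= max_check_attempts."""
    stuck = []
    for line in lines:
        m = ALERT_RE.search(line)
        if m is None:
            continue  # not a SERVICE ALERT line
        attempt = int(m["attempt"])
        if (m["state"] != "OK"
                and m["state_type"] == "SOFT"
                and attempt >= max_check_attempts):
            stuck.append((m["when"], m["host"], m["service"], attempt))
    return stuck


# A few lines taken from the excerpt above.
SAMPLE_LOG = [
    "[08-20-2018 09:53:08] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;1;Connection refused or timed out",
    "[08-20-2018 09:53:48] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;2;Connection refused or timed out",
    "[08-20-2018 09:54:02] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;3;Connection refused or timed out",
    "[08-20-2018 09:59:11] SERVICE ALERT: devel.blablah.com;root filesystem;OK;HARD;1;DISK OK - free space: / 957 MB (20% inode=79%):",
]

if __name__ == "__main__":
    for entry in find_stuck_soft_alerts(SAMPLE_LOG, max_check_attempts=3):
        print(entry)
```

The max_check_attempts value has to be supplied by hand here (3 for the service above), since the log lines do not carry it.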

scottwilkerson · Post by **scottwilkerson** » Mon Aug 20, 2018 10:25 am

We know this was an issue prior to 4.4.2 but this should have been resolved.

Is the host in an up or down state when this happens?

Any additional information you have would be useful because to fix the problem I will need to make sure I can re-create the issue.

Also can you show the results of the following?

ps -ef|grep nagios.cfg

NSchoenbaechler · Post by **NSchoenbaechler** » Mon Aug 20, 2018 10:47 am

Scott,

Thank you so much for the reply. I was really scratching my head and thinking I was crazy. In the example I provided, the host was in a down state at the time. I just shut the server down and let things play out (though I was forcing checks to speed things up).

Here's the output you requested:
[root@monitor ~]# ps -ef|grep nagios.cfg
root 18990 18786 0 10:44 pts/0 00:00:00 grep nagios.cfg
nagios 26094 1 0 10:01 ? 00:00:12 /usr/local/nagios/bin/nagios -ud /usr/local/nagios/etc/nagios.cfg
nagios 26100 26094 0 10:01 ? 00:00:00 /usr/local/nagios/bin/nagios -ud /usr/local/nagios/etc/nagios.cfg
[root@monitor ~]#

Here is another item you may find interesting. On a separate host, I shut the server down. The host had two service checks with max_check_attempts at 3. Same issue, never went to a hard state. However, I did notice that the host check itself had its max_check_attempts set to 10. Once the 10th host check failed, it alerted as you would expect (for the host check, still nothing on the services though). Please let me know if there's any more information I can provide. Thanks again.

scottwilkerson · Post by **scottwilkerson** » Mon Aug 20, 2018 11:19 am

Ok, if the host is in a down state you shouldn't get the service notification when it goes critical, but you also shouldn't get a recovery.

I just ran some tests and can confirm there is a bug where a recovery email is sent when it shouldn't be in this scenario.
I have logged the following bug report:
https://github.com/NagiosEnterprises/na ... issues/572
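To look for the stray-recovery symptom in an existing log, one rough approach is to scan SERVICE NOTIFICATION lines and report OK notifications that were not preceded by a problem notification for the same contact, host, and service. The sketch below assumes the line format shown in the original post and treats every non-OK state as a problem, which is a simplification. The function name and sample data are illustrative, and this is not how Nagios itself decides whether to notify.

```python
import re

# Matches SERVICE NOTIFICATION lines in the format shown in the original post:
# [timestamp] SERVICE NOTIFICATION: contact;host;service;state;command;output
NOTIFY_RE = re.compile(
    r"\[(?P<when>[^\]]+)\] SERVICE NOTIFICATION: "
    r"(?P<contact>[^;]*);(?P<host>[^;]*);(?P<service>[^;]*);"
    r"(?P<state>[^;]*);(?P<command>[^;]*);(?P<output>.*)"
)


def find_unmatched_recoveries(lines):
    """Return (timestamp, contact, host, service) for each OK notification
    that has no earlier problem notification in `lines`.

    `lines` must be in chronological order (oldest first).
    """
    open_problems = set()  # keys that have received a problem notification
    unmatched = []
    for line in lines:
        m = NOTIFY_RE.search(line)
        if m is None:
            continue  # not a SERVICE NOTIFICATION line
        key = (m["contact"], m["host"], m["service"])
        if m["state"] == "OK":
            if key not in open_problems:
                unmatched.append((m["when"],) + key)
            open_problems.discard(key)
        else:
            # Simplification: any non-OK state counts as a problem.
            open_problems.add(key)
    return unmatched


# The single notification line from the original post.
SAMPLE_NOTIFICATIONS = [
    "[08-20-2018 09:59:11] SERVICE NOTIFICATION: admin1;devel.blablah.com;root filesystem;OK;notify-by-email;DISK OK - free space: / 957 MB (20% inode=79%):",
]

if __name__ == "__main__":
    for entry in find_unmatched_recoveries(SAMPLE_NOTIFICATIONS):
        print(entry)
```

Two caveats: the excerpt in the original post is listed newest first, so such lines would need to be reversed before scanning, and a problem notification sent before the start of the scanned window would make a legitimate recovery look unmatched.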

I could not, however, re-create the constant soft state in 4.4.2.

NSchoenbaechler · Post by **NSchoenbaechler** » Mon Aug 20, 2018 11:22 am

Oh yeah I forgot about the host state stuff, so my testing methodology was bad. I'll check with a few instances where a service goes down but the host stays up. Thanks.

NSchoenbaechler · Post by **NSchoenbaechler** » Mon Aug 20, 2018 11:33 am

Re-tested the same service in my original post with the host still up. Worked and notified fine. Thanks for your help, and thanks for submitting the bug request.

scottwilkerson · Post by **scottwilkerson** » Mon Aug 20, 2018 12:42 pm

NSchoenbaechler wrote: Re-tested the same service in my original post with the host still up. Worked and notified fine. Thanks for your help, and thanks for submitting the bug request.

Glad that part is OK.

scottwilkerson · Post by **scottwilkerson** » Wed Aug 22, 2018 11:45 am

I believe this to be resolved in the maint branch of nagios on github
https://github.com/NagiosEnterprises/na ... tree/maint

burkm · Post by **burkm** » Wed Sep 05, 2018 12:29 pm

scottwilkerson wrote: I believe this to be resolved in the maint branch of nagios on github
https://github.com/NagiosEnterprises/na ... tree/maint

Hello,
I'm having the exact same problem as the OP on 4.4.1 and 4.4.2. I installed the referenced version from github, but the problem persists. I can't figure out the pattern for when it works and when it doesn't.

Thanks,
Michael

scottwilkerson · Post by **scottwilkerson** » Wed Sep 05, 2018 12:41 pm

Did you install the maint branch after Aug 23, 2018?

And to be clear, can you define the exact problem you are seeing? This thread has spanned several intermingled problems.
