Checks randomly not reaching hard state

NSchoenbaechler
Posts: 15
Joined: Fri Feb 02, 2018 10:10 am

Checks randomly not reaching hard state

Post by NSchoenbaechler »

Since upgrading to Nagios Core 4.4.x (we are now on 4.4.2, latest) we have seen a recurring and serious issue, where checks randomly remain in a soft state even when they have reached their max check attempts. Therefore they never notify, but we do get recovery notifications. Here's an example service and corresponding Event log entries:

define service{
    use                     generic-service
    host_name               devel.blahblah.com
    service_description     root filesystem
    is_volatile             0
    check_period            24x7
    max_check_attempts      3
    check_interval          2
    retry_interval          1
    contact_groups          blahblah-sysadmins
    notification_interval   240
    notification_period     HDhours    ; this occurred within this defined time period
    notification_options    u,c,r
    check_command           check_nrpe!check_root
}

[08-20-2018 09:59:11] SERVICE ALERT: devel.blablah.com;root filesystem;OK;HARD;1;DISK OK - free space: / 957 MB (20% inode=79%):
[08-20-2018 09:59:11] SERVICE NOTIFICATION: admin1;devel.blablah.com;root filesystem;OK;notify-by-email;DISK OK - free space: / 957 MB (20% inode=79%):
[08-20-2018 09:58:11] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;3;Connection refused or timed out
[08-20-2018 09:57:05] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;devel.blablah.com;root filesystem;1534777024
[08-20-2018 09:56:16] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;3;Connection refused or timed out
[08-20-2018 09:55:13] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;3;Connection refused or timed out
[08-20-2018 09:54:10] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;3;Connection refused or timed out
[08-20-2018 09:54:07] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;devel.blablah.com;root filesystem;1534776846
[08-20-2018 09:54:02] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;3;Connection refused or timed out
[08-20-2018 09:53:48] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;2;Connection refused or timed out
[08-20-2018 09:53:45] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;devel.blablah.com;root filesystem;15347768241
[08-20-2018 09:53:08] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;1;Connection refused or timed out

We started noticing that we were not receiving problem (non-recovery) alerts. The way I interpret the data above, the service is set to go hard after 3 failed attempts, but it never does and never sends an alert. It does send a recovery, though. This Nagios Core instance has been in place and rock solid for years until v4.4. Please let me know if more information would be helpful; I'll gladly provide it. Thanks.
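
For anyone wanting to check their own logs for the same pattern, here is a minimal Python sketch. It assumes alert lines in the format pasted above (a raw nagios.log may use a different timestamp format), and the function name find_stuck_soft is made up for this example, not part of Nagios. It counts non-OK SERVICE ALERT entries that are still logged as SOFT at or beyond max_check_attempts.

```python
import re
from collections import defaultdict

# Matches alert lines in the format shown in the post above, e.g.
# [08-20-2018 09:58:11] SERVICE ALERT: host;service;CRITICAL;SOFT;3;output
ALERT_RE = re.compile(
    r"\[(?P<when>[^\]]+)\] SERVICE ALERT: "
    r"(?P<host>[^;]+);(?P<service>[^;]+);(?P<state>[^;]+);"
    r"(?P<state_type>SOFT|HARD);(?P<attempt>\d+);(?P<output>.*)"
)


def find_stuck_soft(lines, max_check_attempts):
    """Count non-OK alerts logged as SOFT at or beyond max_check_attempts.

    Hypothetical helper: returns {(host, service): count}.
    """
    stuck = defaultdict(int)
    for line in lines:
        m = ALERT_RE.search(line)
        if m is None:
            continue  # not a SERVICE ALERT line (notification, external command, ...)
        if (m["state"] != "OK"
                and m["state_type"] == "SOFT"
                and int(m["attempt"]) >= max_check_attempts):
            stuck[(m["host"], m["service"])] += 1
    return dict(stuck)


sample = [
    "[08-20-2018 09:53:48] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;2;Connection refused or timed out",
    "[08-20-2018 09:54:02] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;3;Connection refused or timed out",
    "[08-20-2018 09:59:11] SERVICE ALERT: devel.blablah.com;root filesystem;OK;HARD;1;DISK OK - free space: / 957 MB (20% inode=79%):",
]
print(find_stuck_soft(sample, max_check_attempts=3))
```

The max_check_attempts value has to be passed in by hand because it lives in the object configuration, not in the log line.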
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Checks randomly not reaching hard state

Post by scottwilkerson »

We know this was an issue prior to 4.4.2 but this should have been resolved.

Is the host in an up or down state when this happens?

Any additional information you have would be useful because to fix the problem I will need to make sure I can re-create the issue.

Also can you show the results of the following?

ps -ef|grep nagios.cfg
Former Nagios employee
Creator:
ahumandesign.com
enneagrams.com
NSchoenbaechler

Re: Checks randomly not reaching hard state

Post by NSchoenbaechler »

Scott,

Thank you so much for the reply. I was really scratching my head and thinking I was crazy. In the example I provided, the host was in a down state at the time. I just shut the server down and let things play out (though I was forcing checks to speed things up).

Here's the output you requested:
[root@monitor ~]# ps -ef|grep nagios.cfg
root 18990 18786 0 10:44 pts/0 00:00:00 grep nagios.cfg
nagios 26094 1 0 10:01 ? 00:00:12 /usr/local/nagios/bin/nagios -ud /usr/local/nagios/etc/nagios.cfg
nagios 26100 26094 0 10:01 ? 00:00:00 /usr/local/nagios/bin/nagios -ud /usr/local/nagios/etc/nagios.cfg
[root@monitor ~]#

Here is another item you may find interesting. On a separate host, I shut the server down. The host had two service checks with max_check_attempts at 3. Same issue, never went to a hard state. However, I did notice that the host check itself had its max_check_attempts set to 10. Once the 10th host check failed, it alerted as you would expect (for the host check, still nothing on the services though). Please let me know if there's any more information I can provide. Thanks again.
scottwilkerson

Re: Checks randomly not reaching hard state

Post by scottwilkerson »

Ok, if the host is in a down state you shouldn't get the service notification when it goes critical, but you also shouldn't get a recovery.

I just ran some tests and can confirm there is a bug in this scenario: a recovery email is sent when it shouldn't be.
I have logged the following bug report:
https://github.com/NagiosEnterprises/na ... issues/572


I could not, however, re-create the constant soft state in 4.4.2.
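
As a rough illustration of the expected behaviour described above, here is a deliberately simplified Python sketch. It is not how Nagios Core implements this; the function name and its two parameters are assumptions made for the example, and it ignores notification_options, notification periods, downtime, and everything else that also filters notifications.

```python
def expect_service_notification(host_state, service_state_type):
    """Simplified model: should a service state change send a notification?

    host_state: "UP", "DOWN" or "UNREACHABLE"
    service_state_type: "SOFT" or "HARD"
    Hypothetical helper, not a Nagios API.
    """
    if host_state != "UP":
        # Host is down: neither the service problem nor its recovery should notify.
        return False
    # Host is up: only HARD state changes are expected to notify.
    return service_state_type == "HARD"


for host_state, state_type in [("UP", "SOFT"), ("UP", "HARD"), ("DOWN", "HARD")]:
    print(host_state, state_type, expect_service_notification(host_state, state_type))
```

Under this model, the recovery email in the original post would be the unexpected part, since the host was down at the time.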
NSchoenbaechler

Re: Checks randomly not reaching hard state

Post by NSchoenbaechler »

Oh yeah I forgot about the host state stuff, so my testing methodology was bad. I'll check with a few instances where a service goes down but the host stays up. Thanks.
NSchoenbaechler

Re: Checks randomly not reaching hard state

Post by NSchoenbaechler »

Re-tested the same service in my original post with the host still up. Worked and notified fine. Thanks for your help, and thanks for submitting the bug request.
scottwilkerson

Re: Checks randomly not reaching hard state

Post by scottwilkerson »

NSchoenbaechler wrote: Re-tested the same service in my original post with the host still up. Worked and notified fine. Thanks for your help, and thanks for submitting the bug request.
Glad that part is OK.
scottwilkerson

Re: Checks randomly not reaching hard state

Post by scottwilkerson »

I believe this to be resolved in the maint branch of nagios on github
https://github.com/NagiosEnterprises/na ... tree/maint
burkm
Posts: 31
Joined: Thu Jan 21, 2016 5:10 pm

Re: Checks randomly not reaching hard state

Post by burkm »

scottwilkerson wrote: I believe this to be resolved in the maint branch of nagios on github
https://github.com/NagiosEnterprises/na ... tree/maint
Hello,
I'm having the exact same problem as the OP on 4.4.1 and 4.4.2. I installed the referenced version from github, but the problem persists. I can't figure out the pattern for when it works and when it doesn't.

Thanks,
Michael
scottwilkerson

Re: Checks randomly not reaching hard state

Post by scottwilkerson »

Did you install the maint branch after Aug 23, 2018?

And to be clear, can you define the exact problem you are seeing? This thread has spanned several intermingled problems.