Checks randomly not reaching hard state
-
- Posts: 15
- Joined: Fri Feb 02, 2018 10:10 am
Checks randomly not reaching hard state
Since upgrading to Nagios Core 4.4.x (we are now on 4.4.2, the latest), we have seen a recurring and serious issue: checks randomly remain in a soft state even after they have reached their max check attempts. As a result they never send a problem notification, although we do get recovery notifications. Here's an example service and the corresponding event log entries:
define service{
    use                     generic-service
    host_name               devel.blahblah.com
    service_description     root filesystem
    is_volatile             0
    check_period            24x7
    max_check_attempts      3
    check_interval          2
    retry_interval          1
    contact_groups          blahblah-sysadmins
    notification_interval   240
    notification_period     HDhours    ; this occurred within this defined time period
    notification_options    u,c,r
    check_command           check_nrpe!check_root
}
[08-20-2018 09:59:11] SERVICE ALERT: devel.blablah.com;root filesystem;OK;HARD;1;DISK OK - free space: / 957 MB (20% inode=79%):
Service Notification[08-20-2018 09:59:11] SERVICE NOTIFICATION: admin1;devel.blablah.com;root filesystem;OK;notify-by-email;DISK OK - free space: / 957 MB (20% inode=79%):
Service Critical[08-20-2018 09:58:11] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;3;Connection refused or timed out
External Command[08-20-2018 09:57:05] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;devel.blablah.com;root filesystem;1534777024
Service Critical[08-20-2018 09:56:16] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;3;Connection refused or timed out
Service Critical[08-20-2018 09:55:13] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;3;Connection refused or timed out
Service Critical[08-20-2018 09:54:10] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;3;Connection refused or timed out
External Command[08-20-2018 09:54:07] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;devel.blablah.com;root filesystem;1534776846
Service Critical[08-20-2018 09:54:02] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;3;Connection refused or timed out
Service Critical[08-20-2018 09:53:48] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;2;Connection refused or timed out
External Command[08-20-2018 09:53:45] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;devel.blablah.com;root filesystem;15347768241
Service Critical[08-20-2018 09:53:08] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;1;Connection refused or timed out
We started noticing that we were not receiving non-recovery alerts. The way I'm interpreting the data above is that the service is set to go hard after 3 failed attempts, but it never does, and it never sends an alert. It does send a recovery, though. This is a Nagios Core instance that has been in place and rock solid for years until v4.4. Please let me know if more information would be helpful; I'll gladly provide it. Thanks.
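To make the symptom easier to spot in a longer log, here is a minimal Python sketch that flags SERVICE ALERT entries still marked SOFT at or past the configured attempt count. The function name `stuck_soft_alerts` is hypothetical, the line format is inferred only from the entries pasted above, and `max_check_attempts` is passed in by hand rather than read from the Nagios configuration.

```python
import re

# Matches SERVICE ALERT entries in the form shown above:
# [timestamp] SERVICE ALERT: host;service;state;state_type;attempt;output
# (format assumed from the pasted log lines, not from a Nagios specification)
ALERT_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] SERVICE ALERT: "
    r"(?P<host>[^;]+);(?P<service>[^;]+);(?P<state>[^;]+);"
    r"(?P<state_type>[^;]+);(?P<attempt>\d+);(?P<output>.*)"
)


def stuck_soft_alerts(lines, max_check_attempts):
    """Return (timestamp, host, service, attempt) for alerts that are
    still SOFT although the attempt number has reached max_check_attempts."""
    stuck = []
    for line in lines:
        m = ALERT_RE.search(line)  # search() tolerates a prefix before '['
        if m is None:
            continue  # not a SERVICE ALERT entry (e.g. EXTERNAL COMMAND)
        attempt = int(m["attempt"])
        if m["state_type"] == "SOFT" and attempt >= max_check_attempts:
            stuck.append((m["ts"], m["host"], m["service"], attempt))
    return stuck


sample = [
    "Service Critical[08-20-2018 09:53:48] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;2;Connection refused or timed out",
    "External Command[08-20-2018 09:54:07] EXTERNAL COMMAND: SCHEDULE_FORCED_SVC_CHECK;devel.blablah.com;root filesystem;1534776846",
    "Service Critical[08-20-2018 09:54:02] SERVICE ALERT: devel.blablah.com;root filesystem;CRITICAL;SOFT;3;Connection refused or timed out",
    "[08-20-2018 09:59:11] SERVICE ALERT: devel.blablah.com;root filesystem;OK;HARD;1;DISK OK - free space: / 957 MB (20% inode=79%):",
]

for hit in stuck_soft_alerts(sample, max_check_attempts=3):
    print(hit)
```

Running it over the full log file instead of `sample` would list every entry where the attempt counter reached the limit without a HARD state being logged.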
-
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
- Contact:
Re: Checks randomly not reaching hard state
We know this was an issue prior to 4.4.2, but it should have been resolved.
Is the host in an up or down state when this happens?
Any additional information you have would be useful, because to fix the problem I need to be able to re-create it.
Also, can you show the output of the following?
ps -ef|grep nagios.cfg
-
- Posts: 15
- Joined: Fri Feb 02, 2018 10:10 am
Re: Checks randomly not reaching hard state
Scott,
Thank you so much for the reply. I was really scratching my head and thinking I was crazy. In the example I provided, the host was in a down state at the time. I just shut the server down and let things play out (though I was forcing checks to speed things up).
Here's the output you requested:
[root@monitor ~]# ps -ef|grep nagios.cfg
root 18990 18786 0 10:44 pts/0 00:00:00 grep nagios.cfg
nagios 26094 1 0 10:01 ? 00:00:12 /usr/local/nagios/bin/nagios -ud /usr/local/nagios/etc/nagios.cfg
nagios 26100 26094 0 10:01 ? 00:00:00 /usr/local/nagios/bin/nagios -ud /usr/local/nagios/etc/nagios.cfg
[root@monitor ~]#
Here is another item you may find interesting. On a separate host, I shut the server down. The host had two service checks with max_check_attempts at 3. Same issue, never went to a hard state. However, I did notice that the host check itself had its max_check_attempts set to 10. Once the 10th host check failed, it alerted as you would expect (for the host check, still nothing on the services though). Please let me know if there's any more information I can provide. Thanks again.
-
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
- Contact:
Re: Checks randomly not reaching hard state
Ok, if the host is in a down state you shouldn't get the service notification when it goes critical, but you also shouldn't get a recovery.
I just ran some tests and can confirm there is a bug: a recovery email is sent in this scenario when it shouldn't be.
I have logged the following bug report:
https://github.com/NagiosEnterprises/na ... issues/572
However, I could not re-create the constant soft state in 4.4.2.
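As a rough illustration of the expected behaviour described in this reply, here is a deliberately simplified Python sketch. The function `should_notify_service` and its string arguments are hypothetical and are not Nagios Core's actual logic; it only encodes the rule stated here (no service notification, problem or recovery, while the host is down) plus the assumption that service notifications follow HARD states only.

```python
def should_notify_service(host_state, service_state_type):
    """Simplified rule based on the reply above; NOT Nagios Core's real code.

    host_state:         "UP" or "DOWN" (hypothetical string values)
    service_state_type: "SOFT" or "HARD"
    """
    # While the host is not UP, neither a problem nor a recovery
    # notification is expected for its services.
    if host_state != "UP":
        return False
    # Assumption: with the host UP, only HARD service states notify.
    return service_state_type == "HARD"


# Host shut down for testing: no service notification expected at all.
print(should_notify_service("DOWN", "HARD"))
# Host up and the service reached a HARD state: notification expected.
print(should_notify_service("UP", "HARD"))
```

Under this simplified rule, the recovery email seen in the original test with the host shut down is the unexpected part.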
-
- Posts: 15
- Joined: Fri Feb 02, 2018 10:10 am
Re: Checks randomly not reaching hard state
Oh yeah I forgot about the host state stuff, so my testing methodology was bad. I'll check with a few instances where a service goes down but the host stays up. Thanks.
-
- Posts: 15
- Joined: Fri Feb 02, 2018 10:10 am
Re: Checks randomly not reaching hard state
Re-tested the same service in my original post with the host still up. Worked and notified fine. Thanks for your help, and thanks for submitting the bug request.
-
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
- Contact:
Re: Checks randomly not reaching hard state
NSchoenbaechler wrote: Re-tested the same service in my original post with the host still up. Worked and notified fine. Thanks for your help, and thanks for submitting the bug request.
Glad that part is ok. Good.
-
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
- Contact:
Re: Checks randomly not reaching hard state
I believe this to be resolved in the maint branch of nagios on github
https://github.com/NagiosEnterprises/na ... tree/maint
Re: Checks randomly not reaching hard state
scottwilkerson wrote: I believe this to be resolved in the maint branch of nagios on github
https://github.com/NagiosEnterprises/na ... tree/maint
Hello,
I'm having the exact same problem as the OP on 4.4.1 and 4.4.2. I installed the referenced version from github, but the problem persists. I can't figure out the pattern for when it works and when it doesn't.
Thanks,
Michael
-
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
- Contact:
Re: Checks randomly not reaching hard state
Did you install the maint branch after Aug 23, 2018?
And to be clear, can you describe the exact problem you are seeing? This thread has covered several intermingled problems.