I have been noticing that when a check fails due to a socket timeout, the check doesn't retry, and instead immediately goes into a Hard failure state. Even more odd is that, in this instance, 5 disk checks failed but only 2 sent notifications.
For instance, I had a team reboot a server without putting it into downtime. The socket timeout sent a notification email on check attempt 1 of 5.
We are running Nagios XI 5.5.7 on Red Hat 7.6 64-bit VMs, with NRPE v3.2.1.
Socket Timeouts immediately going into a HARD state
-
npolovenko
- Support Tech
- Posts: 3457
- Joined: Mon May 15, 2017 5:00 pm
Re: Socket Timeouts immediately going into a HARD state
@hbouma, was the host in a Critical state when the services started going into hard states? This sounds like the issue discussed in this thread:
https://support.nagios.com/forum/viewto ... 16&t=52032
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
Re: Socket Timeouts immediately going into a HARD state
This is likely intended functionality, since the host was in a problem state.
The way it's supposed to work is that when a service check detects a problem, Nagios then checks the host. If the host is in a down state (hard or soft), the service goes straight into a hard problem state without going through the soft states. I'm referring to this specifically:
Taken from here: When a service check results in a non-OK state, Nagios will check the host that the service is associated with to determine whether or not is UP. If the host is not UP (i.e. it is either down or unreachable), Nagios will immediately put the service into a hard non-OK state and it will reset the current attempt number to 1. Since the service is in a hard non-OK state, the service check will be rescheduled at the normal frequency specified by the check_interval option instead of the retry_interval option.
https://assets.nagios.com/downloads/nag ... uling.html
One thing that you could do would be to add host_down_disable_service_checks=1 to your /usr/local/nagios/etc/nagios.cfg and then restart the nagios service:
Code:
service nagios restart
With that option set, the service checks are not even performed while the host is in a problem state (hard or soft).
So the functionality was actually broken in earlier versions, and it is working as intended now in XI 5.5+ with the upgraded Core backend.
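To make the rule quoted from the documentation easier to follow, here is a rough Python sketch of the soft/hard decision. It is only a simplified model for illustration, not how Nagios Core is actually implemented: the `Service` class, the `process_result` function, and the two-value status ("OK" / "CRITICAL") are all made up for this example, and only `max_check_attempts`, `check_interval` and `retry_interval` correspond to real Nagios option names.

```python
from dataclasses import dataclass


@dataclass
class Service:
    # Hypothetical, simplified stand-in for a Nagios service object.
    max_check_attempts: int = 5
    current_attempt: int = 1
    status: str = "OK"        # "OK" or "CRITICAL" in this simplified model
    state_type: str = "HARD"  # "SOFT" or "HARD"


def process_result(svc: Service, check_ok: bool, host_up: bool) -> str:
    """Update svc in place and return which interval schedules the next check."""
    if check_ok:
        svc.status, svc.state_type, svc.current_attempt = "OK", "HARD", 1
        return "check_interval"

    if not host_up:
        # Host is down or unreachable: skip the soft retries entirely
        # and reset the attempt counter to 1, as the quoted text describes.
        svc.status, svc.state_type, svc.current_attempt = "CRITICAL", "HARD", 1
        return "check_interval"

    if svc.status != "OK" and svc.state_type == "HARD":
        # Already in a hard problem state: nothing left to retry.
        return "check_interval"

    # Host is up: count this failure towards max_check_attempts.
    svc.current_attempt = 1 if svc.status == "OK" else svc.current_attempt + 1
    svc.status = "CRITICAL"
    if svc.current_attempt >= svc.max_check_attempts:
        svc.state_type = "HARD"
        return "check_interval"
    svc.state_type = "SOFT"
    return "retry_interval"
```

In this model, a timeout on a rebooting server corresponds to calling `process_result(svc, check_ok=False, host_up=False)`, which moves the service to HARD on the very first failure. The same failure with `host_up=True` stays SOFT and is retried at `retry_interval` until `max_check_attempts` is reached.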
Re: Socket Timeouts immediately going into a HARD state
Thank you.
This change has been added to our system and should resolve my issue.
-
npolovenko
Re: Socket Timeouts immediately going into a HARD state
@hbouma, please let us know if it's OK to close this thread as resolved.
Re: Socket Timeouts immediately going into a HARD state
Yes, you can consider this as resolved.