Nagios Support Forum

Posted: **Tue Sep 25, 2018 1:15 pm**

G'Day Nagios Gurus,

I have a certain condition that occurs once a month at the same time on multiple Linux hosts across multiple data centers that I need help understanding.

I am using a host template to make a check_by_ssh call that executes a command on the remote host. This gives me two pieces of information: first, that ssh is available and the server is responding; second, that my plugins directory exists on the remote host.

The specific command and template are not the issue; they work continuously with no problem except for a 10-minute window on the last Tuesday of the month.

Modified check_xi_by_ssh command:

Code:

define command {
       command_name                             check_xi_by_ssh_fips
       command_line                             $USER1$/check_by_ssh -t 30:1 -l patrol $ARG1$ -H $HOSTADDRESS$ $ARG2$
}

The template definition:

Code:

define host {
       name                                     TSO-UNIX_FIPS
       check_command                            check_xi_by_ssh_patrol_fips!-q -oConnectTimeout=3!-C "~/isLibExec"!!!!!!
       register                                 0

}

So each month I end up with a slew of hosts that exceed my normal SOFT alerts and show HOST DOWN HARD.
What I am trying to understand is which condition causes the Status Information field to show the message "Host check timed out after 30.01 seconds".
Since the failure only occurs once a month for a short 10-minute period, I need to reproduce it myself, but I cannot do that until I understand which conditions generate the error message.

Can anyone from Nagios Core development help me understand which condition produces this error message? The check_by_ssh binary does not appear to produce it, so my guess is that it originates within the Nagios Core program, but I do not know which result code from check_by_ssh triggers it.

Anybody have a suggestion?

Thanks for your attention,
Danny

Posted: **Tue Sep 25, 2018 2:42 pm**

Hello, @onegative. Full disclosure: I'm not a developer. But I believe the code that outputs this line is here:

https://github.com/NagiosEnterprises/na ... e/checks.c

What is your goal? To increase the timeout value from 30 seconds to a higher value?

Timeout is a combination of settings in the plugin as well as the following settings in the nagios.cfg file:

service_check_timeout=
host_check_timeout=
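To see how the plugin-side and nagios.cfg-side limits relate, a small Python sketch like the one below can compare them. This is only an illustration: the helper names `read_host_check_timeout` and `plugin_timeout` are hypothetical, the sample `cfg_text` values are made up for the example, and the parsing assumes that `-t 30:1` means "30 seconds, then an optional timeout state". The command line is the one from the first post.

```python
import re
import shlex


def read_host_check_timeout(cfg_text, default=30):
    """Return host_check_timeout (seconds) from nagios.cfg-style text.

    Hypothetical helper; falls back to `default` when the directive is absent.
    """
    match = re.search(
        r"^\s*host_check_timeout\s*=\s*(\d+)\s*$", cfg_text, re.MULTILINE
    )
    return int(match.group(1)) if match else default


def plugin_timeout(command_line):
    """Return the seconds passed to -t in a command_line, or None if absent."""
    tokens = shlex.split(command_line)
    for i, token in enumerate(tokens[:-1]):
        if token == "-t":
            # Assumption: "30:1" is seconds plus an optional timeout state,
            # so only the part before the colon is kept.
            return int(tokens[i + 1].split(":", 1)[0])
    return None


# Sample values for illustration only; read your real nagios.cfg instead.
cfg_text = "service_check_timeout=60\nhost_check_timeout=30\n"
command_line = (
    "$USER1$/check_by_ssh -t 30:1 -l patrol $ARG1$ -H $HOSTADDRESS$ $ARG2$"
)

core_limit = read_host_check_timeout(cfg_text)
plugin_limit = plugin_timeout(command_line)
# The plugin can only report its own timeout if its limit is the shorter one.
plugin_can_report_first = plugin_limit is not None and plugin_limit < core_limit
print(core_limit, plugin_limit, plugin_can_report_first)
```

With both limits at 30 seconds, the comparison comes out false, which suggests the plugin may not get the chance to return its own timeout result before the core-side limit is hit.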

Have you thought about scheduling a recurring downtime for this host and its services on the last Tuesday of each month?

Posted: **Tue Sep 25, 2018 3:32 pm**

Hey npolovenko,

Thanks for your help...I am just trying to understand the exact nature of the ssh timeouts that are occurring.

Danny

Posted: **Tue Sep 25, 2018 4:08 pm**

@onegative, Perhaps there is some scheduled maintenance running every once in a while on the remote server, causing a high system load and making the ssh service unresponsive? Something like an antivirus check or a backup? Next time Nagios sends a notification that it can't ssh, I recommend checking the load on the remote server. Also, take a look at /var/log/secure and /var/log/messages on the remote server and let us know if you see any errors.

Posted: **Wed Sep 26, 2018 9:01 am**

@npolovenko,

Yes, there is a problem, but no one seems to understand what is actually happening. I am seeing this on 50/60 hosts that use this particular method for the HOST CHECK. There are no jobs running on the main Nagios XI server or on the remote systems. I have a tcpdump from the Nagios server that shows the three-way handshake (SYN, SYN ACK, ACK), but then I see the Nagios server issue a RST on the connection, which appears to trigger the (cr->early_timeout) branch that produces "Host check timed out after 30.01 seconds" once check attempt 5/5 is reached. I am still unsure what condition is actually occurring during this ssh session timeout and what exactly sets early_timeout. Because this is a point-in-time issue that only occurs during that 15-minute window each month, it is hard to replicate.

If I understood exactly how this section of code is triggered and which result code initiates it, I would have a better idea of how to trigger it myself and what to focus on during the short window each month. As it stands, we get an event storm from those servers, which of course wakes up the on-call sysadmin, which then creates ripples throughout my world as well.

Danny

Code:

	/* did the check result have an early timeout? */
	if (cr->early_timeout) {

		my_free(hst->plugin_output);
		my_free(hst->long_plugin_output);
		my_free(hst->perf_data);

		logit(NSLOG_RUNTIME_WARNING, TRUE, "Warning: Check of host '%s' timed out after %.2lf seconds\n", hst->name, hst->execution_time);
		asprintf(&hst->plugin_output, "(Host check timed out after %.2lf seconds)", hst->execution_time);

		rc = HOST_UNREACHABLE;
	}

Posted: **Wed Sep 26, 2018 4:00 pm**

You would receive that message if the host check doesn't finish within the host_check_timeout setting in your /usr/local/nagios/etc/nagios.cfg.

By default it's set to 30 seconds so whenever you are seeing that message it means the host check did not complete within the time allotted by the host_check_timeout.
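As a rough illustration of that mechanism (not Nagios Core's actual code), the Python sketch below runs a child command under a deadline and, when the deadline passes, discards whatever the child would have reported and substitutes a timeout message modeled on the one in this thread. The function name `run_host_check`, the result dictionary keys, and the stand-in "plugins" are all assumptions made for the example.

```python
import subprocess
import sys
import time


def run_host_check(argv, host_check_timeout):
    """Run a plugin command; flag an early timeout if it overruns the deadline.

    Hypothetical helper for illustration only. It mimics the idea of a
    scheduler-side timeout, not Nagios Core's implementation.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            argv,
            capture_output=True,
            text=True,
            timeout=host_check_timeout,  # scheduler-side deadline, in seconds
        )
    except subprocess.TimeoutExpired:
        # The child was killed before it could report anything, so its own
        # output and exit code never reach the caller.
        elapsed = time.monotonic() - start
        return {
            "early_timeout": True,
            "return_code": None,
            "output": "(Host check timed out after %.2f seconds)" % elapsed,
        }
    first_line = proc.stdout.strip().splitlines()[:1]
    return {
        "early_timeout": False,
        "return_code": proc.returncode,
        "output": first_line[0] if first_line else "",
    }


# Stand-in "plugins": one hangs for 5 seconds, one answers immediately.
slow_plugin = [sys.executable, "-c", "import time; time.sleep(5)"]
fast_plugin = [sys.executable, "-c", "print('OK')"]

slow_result = run_host_check(slow_plugin, host_check_timeout=1)
fast_result = run_host_check(fast_plugin, host_check_timeout=5)
print(slow_result["output"])
print(fast_result["output"])
```

In this sketch the slow command never gets to say why it was slow; the caller only knows that the deadline passed. That matches the symptom described here, where the status text carries no detail from check_by_ssh itself.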

If you're seeing this on a recurring basis at a specific time, it may be related to monthly backups, VM backups/vMotion, high SAN/storage/network utilization, or something along those lines that causes enough IO, load, or latency that the servers cannot respond in that amount of time.

The only thing you can really do from the XI side would be to increase the host_check_timeout in your nagios.cfg and restart the nagios service for the changes to be picked up.

More than likely you'll need to investigate what the problem is during that time and talk to your backup admins, etc., to see if anything that occurs during that window could affect it.

You could also set up a recurring downtime, if it's happening on a regular schedule, to ignore the alerts during that time.

Posted: **Thu Sep 27, 2018 2:39 pm**

@ssax,

Yes, this helps. So that leads me to understand that the ssh session is obviously succeeding and the issue must be on all the VMs, most likely to do with a shared resource. I guess I was confused because I had set the check_by_ssh timeout to 30 seconds as well, which is probably why I thought it had something to do with the ssh session and not the fact that the remote host did not send a result back within the host_check_timeout value...

This helps,
Danny

request can be locked...thanks everyone...

Posted: **Fri Sep 28, 2018 8:36 am**

Great!

Locking

HOST ALERT using check_by_ssh