HOST ALERT using check_by_ssh
Posted: Tue Sep 25, 2018 1:15 pm
G'day Nagios Gurus,
Once a month, at the same time, a condition occurs on multiple Linux hosts across multiple data centers, and I need help understanding it.
I use a host template to make a check_by_ssh call that runs a command on the remote host. This tells me two things: first, that ssh is available and the server is responding; second, that my plugins directory exists on the remote host.
The command and template themselves are not the issue. They work continuously without problems, except for a 10-minute window on the last Tuesday of the month. Each month I end up with a slew of hosts that exceed my normal SOFT alerts and show HOST DOWN HARD.
Modified check_xi_by_ssh command:
Code:
define command {
    command_name    check_xi_by_ssh_fips
    command_line    $USER1$/check_by_ssh -t 30:1 -l patrol $ARG1$ -H $HOSTADDRESS$ $ARG2$
}
The template definition:
Code:
define host {
    name            TSO-UNIX_FIPS
    check_command   check_xi_by_ssh_patrol_fips!-q -oConnectTimeout=3!-C "~/isLibExec"!!!!!!
    register        0
}
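For readers who want to see the exact command line these two definitions produce, here is a minimal Python sketch of the macro expansion. It is not Nagios code. It assumes that the check_command in the template resolves to the command_line shown above (the two command names in the post differ slightly), that $USER1$ is /usr/local/nagios/libexec, and that the host address is 192.0.2.10; both values are hypothetical placeholders. Unknown or empty macros are replaced with an empty string, which is a simplification.

```python
import re

def expand_command(command_line, check_command, macros):
    """Expand $ARGn$ and other $NAME$ macros in a Nagios-style command line."""
    # The command name and its arguments are separated by "!".
    name, *args = check_command.split("!")
    values = dict(macros)
    for i, arg in enumerate(args, start=1):
        values[f"ARG{i}"] = arg

    def substitute(match):
        # Simplification: macros without a value expand to nothing.
        return values.get(match.group(1), "")

    return name, re.sub(r"\$([A-Z0-9_]+)\$", substitute, command_line)

# Values copied from the definitions above.
COMMAND_LINE = "$USER1$/check_by_ssh -t 30:1 -l patrol $ARG1$ -H $HOSTADDRESS$ $ARG2$"
CHECK_COMMAND = 'check_xi_by_ssh_patrol_fips!-q -oConnectTimeout=3!-C "~/isLibExec"!!!!!!'

# Hypothetical placeholder values, not taken from the post.
MACROS = {"USER1": "/usr/local/nagios/libexec", "HOSTADDRESS": "192.0.2.10"}

name, expanded = expand_command(COMMAND_LINE, CHECK_COMMAND, MACROS)
print(name)
print(expanded)
```

Running the printed command by hand from the monitoring server, as the same user the monitoring process runs as, is one way to time the check outside the scheduler.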
What I am trying to understand is under what condition the Status Information field would show the message "Host check timed out after 30.01 seconds".
Because the failure only occurs once a month, for a short 10-minute period, I need to reproduce it myself. But unless I understand the conditions under which the error message is generated, I cannot reproduce it.
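Since the window falls on the last Tuesday of the month, extra logging or a manual test can be scheduled ahead of time. A minimal sketch for computing that date follows; the function name last_tuesday is hypothetical and the code uses only the Python standard library.

```python
import calendar
from datetime import date

def last_tuesday(year, month):
    """Return the date of the last Tuesday in the given month."""
    last_day = calendar.monthrange(year, month)[1]
    end_of_month = date(year, month, last_day)
    # Days to step back from the end of the month to reach a Tuesday.
    offset = (end_of_month.weekday() - calendar.TUESDAY) % 7
    return end_of_month.replace(day=last_day - offset)

print(last_tuesday(2018, 9))
print(last_tuesday(2018, 10))
```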
Can anyone from Nagios Core development help me understand what condition produces this error message? The check_by_ssh binary does not appear to produce it, so my guess is that it originates within Nagios Core, but I do not know what result code from check_by_ssh would trigger it.
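To make that guess concrete, here is a minimal Python sketch of a parent process that enforces its own time limit on a child command. It is only an illustration of the general pattern, not Nagios Core's implementation: the function name run_with_outer_timeout is hypothetical, and the message text is copied from the Status Information field quoted above. Under this assumption no result code from the child is involved at all; the parent gives up waiting and writes the message itself, and the reported time lands slightly above the limit.

```python
import subprocess
import sys
import time

def run_with_outer_timeout(argv, timeout_s):
    """Run argv and return (return_code, text); return_code is None on timeout."""
    start = time.monotonic()
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The child never reported a result; the parent writes the message.
        elapsed = time.monotonic() - start
        return None, f"(Host check timed out after {elapsed:.2f} seconds)"
    return proc.returncode, proc.stdout.strip()

# A stand-in child that hangs for 5 seconds, with a short outer limit for the demo.
slow_child = [sys.executable, "-c", "import time; time.sleep(5)"]
print(run_with_outer_timeout(slow_child, 0.5))

# A stand-in child that answers promptly.
fast_child = [sys.executable, "-c", "print('ok')"]
print(run_with_outer_timeout(fast_child, 10))
```

If this pattern applies, anything that stalls the ssh session for longer than the outer limit (for example a remote host that accepts the connection but is slow to run the command) would be a candidate for reproducing the message.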
Anybody have a suggestion?
Thanks for your attention,
Danny