check_by_ssh timeout scenario
Posted: Wed Mar 24, 2021 1:19 pm
by dlovett
Trivial question but where are the log files for the check_by_ssh plugin? I am in the process of migrating our monitors and alerts from our legacy monitoring engine (SiteScope) to Nagios XI and I am having an issue with the check_by_ssh plugin throwing a timeout error.
The script on the remote host has been in place for a decade (or more) and executes in less than a second without problems. Both engines execute the same scripts via SSH, but Nagios throws a timeout error on the first execution 90% of the time. Nagios is able to execute the script successfully on the first retry most of the time, and sometimes only on the 2nd or 3rd retry. SiteScope executes the script over SSH flawlessly every time.
The remote script runs on an AIX host. SiteScope runs on a Windows host. Nagios XI runs on Linux.
I would love to see the log file of the Nagios plugin if it exists (ideally in debug mode).
Is there a log file for check_by_ssh?
Any help would be appreciated.
Re: check_by_ssh timeout scenario
Posted: Thu Mar 25, 2021 10:58 am
by vtrac
Hi,
If you ran this check_by_ssh script from the Nagios XI services, then all logs should be in:
/usr/local/nagios/var/nagios.log
However, if you run the plugin manually, I don't think any log is collected.
I also looked at the source code of the plugin and there is no log file defined:
https://github.com/nagios-plugins/nagio ... k_by_ssh.c
You can try adding "-t" to increase the timeout; the default is 10 seconds.
You can also add "-v" for verbose output if you like.
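As a rough sketch of what a manual run with both options could look like: the plugin path below is the usual Nagios XI location but is an assumption here, and the host name, login name and remote script path are placeholders, not values from this thread. The snippet only runs the plugin if it is installed; otherwise it prints the command it would have run.

```shell
#!/bin/sh
# All four values are placeholders (assumptions); adjust them to your setup.
PLUGIN="${PLUGIN:-/usr/local/nagios/libexec/check_by_ssh}"
REMOTE_HOST="${REMOTE_HOST:-aix-host.example.com}"
REMOTE_USER="${REMOTE_USER:-nagios}"
REMOTE_CMD="${REMOTE_CMD:-/path/to/check_script.sh}"

# -t 30 raises the plugin timeout from the default 10 seconds to 30 seconds,
# -v asks the plugin for verbose output, -C is the command to run remotely.
set -- -H "$REMOTE_HOST" -l "$REMOTE_USER" -t 30 -v -C "$REMOTE_CMD"

if [ -x "$PLUGIN" ]; then
    "$PLUGIN" "$@"
    echo "plugin exit status: $?"
else
    # Plugin not installed on this machine: show the command instead.
    echo "would run: $PLUGIN $*"
fi
```

Running it as the same user that Nagios uses for its checks keeps the SSH keys and known_hosts file consistent with what the scheduled check sees.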
Regards,
Vinh
Re: check_by_ssh timeout scenario
Posted: Thu Mar 25, 2021 12:53 pm
by dlovett
Thanks Vinh. The information in nagios.log hasn't been helpful. What we really need is to put the plugin in debug mode and look at the log file. We've already tried modifying the timeout in the config file and using the -t parameter. That didn't work.
The issue isn't the script timing out, as it only takes 1-2 seconds to run and has been in our production environment for roughly 10 years. The issue only occurs with Nagios and we can't seem to get a useful log file to triage it. Frustrating.
Re: check_by_ssh timeout scenario
Posted: Thu Mar 25, 2021 4:33 pm
by vtrac
Hi,
SSH sits on top of TCP.
When you get "connection timed out" errors, it means that the SSH client is not seeing any responses from the server (i.e. the TCP handshake is not completing), which almost always means the problem is not with SSH but at a lower level.
I would get your network administrator involved and maybe run something like:
Code:
tcpdump -n -i any src or dst XXX.XXX.XXX.XXX
Regards,
Vinh
Re: check_by_ssh timeout scenario
Posted: Mon Apr 19, 2021 1:08 pm
by dlovett
Very interesting development. Some of the scripts have a line that reads "set -x" in them. When I remove this line I DO NOT get the timeout issue. set -x looks to be a debug feature that prints executed commands and their arguments.
In addition, the issue occurs reliably with a script that creates an array via reading values from a file. The script then iterates through the array and applies business logic to determine a response. The file contains approx. 100 items. With the set -x line in the script, if I reduce the number of items in the file to 10-15 the timeout issue does not occur. Issue WILL resurface when the list is greater than 10-15. If I remove "set -x" then the issue does not occur.
Given the timeout issue ONLY occurs with the Nagios check_by_ssh plugin, I'm wondering if this is a buffer overflow/overrun issue. Is there a maximum length of output the plugin can handle? I ask because I believe the Nagios SSH plugin is written in C, whereas our other monitoring software (SiteScope) is written in Java and has no issue running these scripts via ssh.
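To put a number on how much extra output "set -x" produces, something like the following can be run locally. It is only a sketch: demo_script is a hypothetical stand-in for the real script (a traced loop over N items), not the production code. The trace goes to stderr, so the two streams are counted separately for 10 and for 100 items.

```shell
#!/bin/sh
# Hypothetical stand-in for the remote script: a traced loop over N items.
demo_script() {
    sh -c '
        set -x
        i=1
        while [ "$i" -le "$1" ]; do
            item="item$i"
            i=$((i + 1))
        done
        echo "OK - processed $1 items"
    ' demo "$1"
}

for n in 10 100; do
    # Count stdout bytes only (trace discarded).
    out_bytes=$(demo_script "$n" 2>/dev/null | wc -c | tr -d ' ')
    # Count stderr bytes only (normal output discarded).
    err_bytes=$(demo_script "$n" 2>&1 >/dev/null | wc -c | tr -d ' ')
    echo "items=$n stdout_bytes=$out_bytes stderr_bytes=$err_bytes"
done
```

The stdout size stays small, while the stderr size grows with the number of items. Whether that stderr volume is what triggers the plugin timeout is not confirmed here; appending 2>/dev/null to the remote command while leaving "set -x" in place would be one way to test that theory.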
Re: check_by_ssh timeout scenario
Posted: Mon Apr 19, 2021 4:52 pm
by vtrac
Hi,
Yes, "set -x" displays commands and their arguments as they are executed.
I am assuming that your "check_by_ssh" is called inside a "bash" script, right?
Usually, "set -x" is only called when a "DEBUG" mode is enabled.
Example:
Code:
# Trace commands only when _DEBUG is set to "on"
if [ "$_DEBUG" = "on" ]; then
    set -x
fi
Cmd1
Cmd2
if [ "$_DEBUG" = "on" ]; then
    set +x
fi
Try running your script without "debug" enabled.
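If the trace is still wanted for troubleshooting, one option is to send it to a file on the remote host instead of back over the SSH session. The sketch below assumes the trace on stderr is what upsets the check, which has not been confirmed in this thread. run_check, TRACE_FILE and the status line are hypothetical names used for illustration only.

```shell
#!/bin/sh
# Hypothetical check body. The ( ... ) function body runs in a subshell,
# so the redirection and tracing do not leak into the calling shell.
run_check() (
    if [ "$_DEBUG" = "on" ]; then
        exec 2>>"$TRACE_FILE"   # stderr, including the trace, goes to a file
        set -x
    fi
    status="OK"
    echo "$status - check completed"
)

TRACE_FILE=$(mktemp)

# Capture only what the caller would see on stderr with debugging enabled.
debug_err=$(_DEBUG=on; run_check 2>&1 >/dev/null)

echo "stderr seen by caller: '${debug_err}'"
echo "trace file size: $(wc -c < "$TRACE_FILE" | tr -d ' ') bytes"
```

With this layout the caller sees only the normal status line on stdout, and the trace can be read from the file afterwards.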
Can you upload your script, along with the whole command (with arguments) you used?
Best Regards,
Vinh