Transaction Timeout Anomalies for SSH Access

Post by **louissiong** » Thu Jan 23, 2020 3:17 am

Hi Nagios,

Lately, we have encountered some issues with passwordless SSH access from the Nagios XI
server to some of our clients. The SSH login and termination appear to occur at around
the same time.

Our check interval is set to 5 minutes, the retry interval to 1 minute, and max check attempts to 10.
Nonetheless, we are seeing SSH sessions being established every 10 seconds,
e.g. from the source Nagios XI server to the ABC server.

Is this normal behaviour for Nagios XI, or is further configuration required on our end
to fix the problem?
It is causing some of the apps to experience serious slowness.

Please advise. Thanks. See attached for screenshot.

Regards,
Louis

Post by **benjaminsmith** » Thu Jan 23, 2020 12:34 pm

Hello Louis,

Let's stop Nagios, kill off all of the processes and restart to rule out the chance that this may be caused by multiple Nagios processes running.

Code: Select all

systemctl stop nagios
systemctl stop ndo2db
pkill -9 -u nagios
systemctl start ndo2db
systemctl start nagios

If the issue persists, please send us your system profile to review. Thanks.

To send us your system profile:
Log in to the Nagios XI GUI using a web browser.
Click the "Admin" > "System Profile" menu.
Click the "Download Profile" button.
Save the profile.zip file, share it in a private message, and then reply to this post to bring it up in the queue.

Post by **louissiong** » Thu Jan 30, 2020 11:52 pm

Hi Ben,

Thanks for your reply. We tried restarting the services as suggested, but the issue remains.
Another thing to note: the IP address 172.30.154.53 is a VIP for 2 nodes.
When the monitoring checked the IIB session, the IIB took about 11 seconds to respond.
The question is why it took 11 seconds during that period.
Thanks for your help. Please see the attached system profile.

Regards,
Louis

Moderator's Note: The profile has been shared with the support team but has been removed from the public forum

Post by **mbellerue** » Fri Jan 31, 2020 2:06 pm

I think Nagios is doing roughly what is being asked of it here. There are 6 services related to IIB that use check_by_ssh. The check interval is 5 minutes, and retry is 1 minute with a max of 10.

5 minutes is 300 seconds
300 seconds divided by 6 service checks is 1 service check every 50 seconds

Which is a far cry from 1 service check every 10-20 seconds that we see in your screenshot, but if we throw in a timeout every now and again, that would cause Nagios to retry in 60 seconds.
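
The arithmetic above can be reproduced with a short sketch. The function name `avg_spacing` is made up for illustration, and the calculation assumes the checks are spread evenly across the interval, which is a simplification rather than a statement about how Nagios actually schedules them.

```shell
#!/usr/bin/env bash
# Hypothetical helper: average seconds between checks when n_checks services
# share one interval and are assumed to be spread evenly across it.
avg_spacing() {
    local interval_seconds=$1 n_checks=$2
    echo $(( interval_seconds / n_checks ))
}

avg_spacing 300 6    # normal 5-minute interval, 6 check_by_ssh services
avg_spacing 60 6     # hypothetical case: all 6 services in their 1-minute retry state
```

The first call prints 50 and the second prints 10, so the spacing only approaches the 10-second range if most of the services are retrying at the same time.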

You might also be missing some outages, depending on the setup here. For 5 of the 6 service checks, Nagios is monitoring the VIP, not the actual IP addresses of the servers. That means that if a check gets routed to Server A while Server B is having an issue, you won't find out about it until the next check, and even then only if that check happens to reach Server B.

Would it be possible to set up passive checks? Assuming that you can install an agent on these servers, that would be less overhead than an ssh connection every 50 seconds or less.

Another option may be to combine some of these checks into a single service check. It's usually not great to have a service check that checks multiple metrics, but it could be worthwhile here.

Post by **louissiong** » Mon Feb 03, 2020 10:13 pm

Thanks Bell for your reply.

Please correct me if I am wrong.
From the logs, session establishment and termination took only 1 second. But we configured the check interval as 5 minutes
with a retry interval of 1 minute. Why is there such a large discrepancy? Is there a way to reduce the number of SSH connections, or
is this normal behaviour? Thanks.

Regards,
Louis

Post by **mbellerue** » Tue Feb 04, 2020 4:17 pm

Each individual check is scheduled for 5 minutes, but there are (at least) 6 checks. Which means at least 1 check every 50 seconds. Sometimes more frequent depending on circumstances.

I can see two ways to alleviate this load. One is to make the checks passive: have an agent on the system execute the check and send the results back to Nagios.

Or, if you can't have an agent on the remote systems, you could try to combine some of the service checks so there are fewer checks total.

Post by **louissiong** » Tue Feb 04, 2020 9:56 pm

ServiceCheckExample1.jpg

Right. However, at the moment we are not allowed to use passive checks, because they would require opening
additional firewall ports. Is there a way to combine the service checks as suggested? We have a number
of Qmanagers and Channels in use, each with different arguments.

Please advise. Thanks.

Regards,
Louis

Post by **mbellerue** » Wed Feb 05, 2020 1:36 pm

Okay, if you have to combine service checks, the best way to do it is to write a wrapper script that combines two or more of the calls that are currently being handled by individual checks.

In your screenshot, you are running check_mq_channel.sh, with a couple of arguments. So your wrapper script would have this and another check command with its arguments in the script. You then have the wrapper script run those commands, process the exit code of each, and output the worst error code and the name of the command that threw that error code.

It's not elegant, but it's a pretty quick procedure, and it will cut down on the number of times Nagios will need to ssh into the boxes. The script can be made more robust after the current performance issues are resolved.
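
One detail to settle when picking the "worst" code: Nagios plugins exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN), so the numerically highest code is UNKNOWN, not CRITICAL. A minimal sketch of one way to rank them follows. The helper names `severity_rank` and `worse_of` are hypothetical, and letting CRITICAL outrank UNKNOWN is an assumption about what the NOC team would want, not a Nagios rule.

```shell
#!/usr/bin/env bash
# Nagios plugin exit codes: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.

# Hypothetical helper: map an exit code to a severity rank.
# Assumption: CRITICAL outranks UNKNOWN; any other code is ranked like UNKNOWN.
severity_rank() {
    case "$1" in
        0) echo 0 ;;
        1) echo 1 ;;
        2) echo 3 ;;
        *) echo 2 ;;
    esac
}

# Hypothetical helper: print whichever of two exit codes is more severe.
worse_of() {
    if [ "$(severity_rank "$2")" -gt "$(severity_rank "$1")" ]; then
        echo "$2"
    else
        echo "$1"
    fi
}

worse_of 1 3    # WARNING vs UNKNOWN
worse_of 3 2    # UNKNOWN vs CRITICAL
```

The first call prints 3 and the second prints 2. If a plain "highest number wins" rule is good enough, a simple numeric comparison does the same job with less code.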

Post by **louissiong** » Thu Feb 06, 2020 1:07 am

Sure, do you have any sample scripts that we can use or follow ?
By using this wrapper method, how will the outputs be generated ?
At present, we have a NOC Team to keep us informed of any service
failures or outages.

Please provide more details. Thanks.

Regards,
Louis

Post by **mbellerue** » Thu Feb 06, 2020 5:25 pm

I'm going to use an excerpt from this script,
https://exchange.nagios.org/directory/A ... er/details

The script does a lot more than just wrapping up another command. All we care about in this instance is this section,

Code: Select all

# Small safety check, this won't stop a kid.
# Might help a careless person though (yeah right)
dangerous_commands="rm rmdir dd del mv cp halt shutdown reboot init telinit kill killall pkill"
for x in $cmd; do
    for y in $dangerous_commands; do
        if [ "$x" == "$y" ]; then
            echo "DANGER: the $y command was found in the string given to execute under nsca_wrapper, aborting..."
            exit 3
        fi
    done
done

output="`$cmd 2>&1`"
result=$?
[ -z "$quiet_mode" ] && echo "$output"
output="`echo $output | sed 's/%/%%/g'`"

send_output=`printf "$host\t$service\t$result\t$output\n" | $send_nsca -H $nagios_server -c $send_nsca_config 2>&1`
send_result=$?
[ -z "$quiet_mode" ] && echo "Sending to NSCA daemon: $send_output"

if [ -n "$return_plugin_code" ]; then
    exit "$result"
else
    exit "$send_result"
fi

I'm including their "small safety check," because it is equal parts hilarious and awesome.

More importantly, however, are these two lines.

output="`$cmd 2>&1`"
result=$?

They show us, first, how to execute a command in a bash script and capture its output using the backtick character (`), and second, how to get the exit code of a command, which is $?.
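
As a small illustration of the same capture pattern, here is a sketch that uses `$(...)`, the modern equivalent of backticks. `fake_check` is a made-up stand-in for a real plugin, and the `|| result=$?` form is a variant of the two lines above, chosen so the script also survives being run under `set -e`.

```shell
#!/usr/bin/env bash
# Stand-in for a real plugin (made up for this example): prints one status
# line and returns 2, the Nagios exit code for CRITICAL.
fake_check() {
    echo "CRITICAL - channel not running"
    return 2
}

# $(...) captures stdout; 2>&1 folds stderr into the capture as well.
# "|| result=$?" records a non-zero exit code right away, before any other
# command can overwrite $?.
result=0
output="$(fake_check 2>&1)" || result=$?

echo "exit code: $result"
echo "output: $output"
```

The important point is that $? is read immediately after the command substitution; any command run in between would replace it.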

What you will need to do is build a bash script that runs the check commands one after another; if any of them returns an exit code other than 0, your wrapper script should exit with a non-zero code as well.
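
A minimal sketch of such a wrapper might look like the following. Every name in it is a placeholder: `check_a` and `check_b` stand in for the real plugin calls (for example check_mq_channel.sh with its arguments), and the labels and output wording are made up. It treats the numerically highest exit code as the worst one, which is a simplification.

```shell
#!/usr/bin/env bash
# Sketch of a combined check; every name here is a placeholder.

worst=0          # highest exit code seen so far (0 = OK)
worst_name=""    # label of the check that produced it
details=""       # output of every check, joined into one line

# run_check LABEL COMMAND [ARGS...]
# Runs one check, keeps its output, and tracks the highest exit code.
run_check() {
    local label=$1 output result=0
    shift
    output="$("$@" 2>&1)" || result=$?
    details="${details}[${label}] ${output} "
    if [ "$result" -gt "$worst" ]; then
        worst=$result
        worst_name=$label
    fi
}

# Stand-ins for the real plugins (e.g. check_mq_channel.sh plus its arguments).
check_a() { echo "OK - channel A running"; }
check_b() { echo "CRITICAL - channel B stopped"; return 2; }

combined_check() {
    run_check "channel_a" check_a
    run_check "channel_b" check_b
    if [ "$worst" -eq 0 ]; then
        echo "OK - all checks passed: $details"
    else
        echo "Worst exit code $worst from $worst_name: $details"
    fi
    return "$worst"
}

# Demo call; a real wrapper would end with: combined_check; exit $?
combined_check || echo "wrapper would exit with $?"
```

Nagios would then run this one script over a single ssh connection instead of one connection per check, and the status line names the check that failed so it still tells the NOC team where to look.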

Does this help?
