SSH timeout (Linux level)

Posted: Wed Sep 05, 2018 5:16 am
by nms
Hi,

We monitor our services over SSH. The check_by_ssh command has a timeout setting of "-t 120", which seems to work well.
However, during an incident today in which we lost connectivity to several of our nodes, ssh connections were still being initiated at the Linux level, which exhausted resources on the server.
As a result, Nagios XI stopped functioning properly, with problems in the monitoring engine and other crucial system components. It was effectively stuck.
To fix this, we had to kill the ssh sessions on the server until Nagios could perform well again.

We think this could be fixed by timing out and dropping the connection when no keepalive passes between the ssh endpoints.
But we need to be careful about what we change, since ALL of our monitoring is done via ssh (i.e., an ssh connection followed by running a script).

As an example, I have attached a screenshot showing the ssh process still running (as you can see, without its child ID) after we stopped the monitoring from the Nagios GUI. This means that the ssh processes were still there.

Can you shed some light?

Rgds,

Re: SSH timeout (Linux level)

Posted: Wed Sep 05, 2018 1:28 pm
by ssax
You could try editing your check_by_ssh command to pass the ConnectTimeout option to see if that helps:

Code:

-o ConnectTimeout=60

Just in case, what is the output of these commands:

Code:

ps aux | grep nagios.cfg
ipcs -q

Re: SSH timeout (Linux level)

Posted: Thu Sep 06, 2018 1:46 am
by nms
Hi

Here's the output as requested:

Code:

ps aux | grep nagios.cfg
nagios    2605  3.3  0.1  48412 28036 ?        Ss   04:00   9:29 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios    3111  0.0  0.1  48280 16416 ?        S    04:00   0:01 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios    3427  0.0  0.0  49068  1484 ?        Ss   Aug14   0:05 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios    3521  0.0  0.0  48936   296 ?        S    Aug14   6:19 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
root      4675  0.0  0.0 103328   952 pts/0    S+   08:42   0:00 grep nagios.cfg
nagios   13199  0.0  0.0  50392  1488 ?        Ss   Aug20   4:44 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   13298  0.0  0.0  50260   292 ?        S    Aug20   4:05 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg

Code:

ipcs -q

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0x6b000002 9568256    nagios     600        0            0

Re: SSH timeout (Linux level)

Posted: Thu Sep 06, 2018 1:59 am
by nms
Does ConnectTimeout eventually terminate the ssh processes if they are idle?

Re: SSH timeout (Linux level)

Posted: Thu Sep 06, 2018 1:58 pm
by ssax
ConnectTimeout specifies the timeout (in seconds) used when connecting to the SSH server, instead of using the default system TCP timeout. This value is used only when the target is down or really unreachable, not when it refuses the connection. In other words, it applies only while the connection is being established, so it does not terminate sessions that are already open.
It looks like you have too many nagios processes running. You should only have two, and the extras may be why things are getting hung up. Please run these commands and see if that resolves your issue:

Code:

service nagios stop
service ndo2db stop
pkill -9 nagios
killall -9 nagios
for i in $(ipcs -q | grep nagios | awk '{print $2}'); do ipcrm -q "$i"; done
service ndo2db start
service nagios start

Re: SSH timeout (Linux level)

Posted: Thu Sep 06, 2018 1:59 pm
by tgriep
Here is a description of the ConnectTimeout option and what it does.

It specifies the timeout (in seconds) used when connecting to the SSH server, instead of using the default system TCP timeout.
This value is used only when the target is down or really unreachable, not when it refuses the connection.
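
Because ConnectTimeout covers only connection setup, a session that is already established and then loses its peer would need something else, such as the OpenSSH client keepalive options ServerAliveInterval and ServerAliveCountMax. These can be passed through check_by_ssh's -o option in the same way as ConnectTimeout. The following is only a sketch: the values 15 and 3 are assumptions, not taken from this thread, and <host> and <command> are placeholders.

```shell
# Sketch: build the -o options for check_by_ssh (all values below are assumptions).
CONNECT_TIMEOUT=60    # seconds allowed for setting up the connection
ALIVE_INTERVAL=15     # seconds of silence before the client sends a keepalive probe
ALIVE_COUNT_MAX=3     # unanswered probes tolerated before ssh disconnects

# An unresponsive peer is dropped after roughly interval * count seconds.
DEAD_AFTER=$((ALIVE_INTERVAL * ALIVE_COUNT_MAX))

SSH_OPTS="-o ConnectTimeout=${CONNECT_TIMEOUT} -o ServerAliveInterval=${ALIVE_INTERVAL} -o ServerAliveCountMax=${ALIVE_COUNT_MAX}"

# Print the resulting command line; <host> and <command> are placeholders.
echo "check_by_ssh -H <host> -t 120 ${SSH_OPTS} -C '<command>'"
echo "dead peer dropped after about ${DEAD_AFTER}s"
```

Keeping the keepalive window (about 45 seconds here) below the plugin timeout of 120 seconds should let ssh give up on a dead peer by itself before the plugin timeout is reached. Test this on a few checks first before applying it everywhere.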

In your ps output, I see that there are duplicate Nagios processes running, and that would contribute to the issue you were having.
The duplicates need to be stopped. To do that, run the following as root:

Code:

service nagios stop
killall -9 nagios
service nagios start
Let us know if you have any further questions.

Re: SSH timeout (Linux level)

Posted: Fri Sep 07, 2018 3:19 am
by nms
Thanks. I have performed the stop/start and can verify that I now have 2 processes:

Code:

nagios   16994 18.2  0.1  49600 19012 ?        Ss   09:55   0:01 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   17021  0.4  0.1  49468 17664 ?        S    09:55   0:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
Why should there be two processes running?

Re: SSH timeout (Linux level)

Posted: Fri Sep 07, 2018 11:30 am
by lmiltchev
It is normal to have two processes running: one is a "child" of the other. Look at the PID and PPID to make sure. In the example below, the second process's PPID (31801) is the first process's PID.

Example:
[root@main-nagios-xi ~]# ps -ef | grep nagios.cfg | grep -v grep
nagios 31801 1 0 10:36 ? 00:00:17 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 31815 31801 0 10:36 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
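
The PID/PPID check above can also be scripted. This is a minimal sketch, assuming the main daemon's parent is PID 1 as in the example (that may not hold in containers or under some service managers). The helper name count_top_level_nagios is hypothetical and not part of Nagios.

```shell
# Count nagios daemons whose parent is PID 1, reading "PID PPID ARGS" lines on stdin.
# count_top_level_nagios is a hypothetical helper name, not part of Nagios.
count_top_level_nagios() {
    awk '$2 == 1 && /nagios -d .*nagios\.cfg/ { n++ } END { print n + 0 }'
}

# Sample input modelled on the example output in this thread.
sample='31801 1 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
31815 31801 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg'

top_level=$(printf '%s\n' "$sample" | count_top_level_nagios)
echo "top-level nagios daemons: $top_level"
```

On a live system, the same function could be fed from `ps -eo pid,ppid,args | count_top_level_nagios`. A result greater than 1 would suggest duplicate daemons.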
Let us know if you have any further questions.

Re: SSH timeout (Linux level)

Posted: Mon Sep 10, 2018 10:04 am
by nms
Thanks. The ticket can be closed.