SSH timeout (Linux level)

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
nms
Posts: 222
Joined: Wed Sep 28, 2016 9:35 am

SSH timeout (Linux level)

Post by nms »

Hi,

We monitor our services over SSH. The check_by_ssh command is configured with a timeout of "-t 120", which seems to work well.
However, during an incident today in which we lost connectivity to several of our nodes, the ssh connections were still being initiated at the Linux level, which exhausted resources on the server.
As a result, Nagios XI stopped functioning properly: the monitoring engine and other crucial system components had problems, and the system was effectively stuck.
To fix this, we had to kill the ssh sessions on the server until Nagios could perform well again.

We think this could be fixed by timing out and dropping an ssh session when no keepalive is exchanged.
But we need to be careful about any change, since ALL of our monitoring is done via ssh (i.e. an ssh connection followed by running a script).

As an example, I have attached a screenshot showing the ssh processes still running (as you can see, without a child process) after we stopped monitoring from the Nagios GUI. In other words, the ssh processes were left behind.

Can you shed some light?

Rgds,
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: SSH timeout (Linux level)

Post by ssax »

You could try editing your check_by_ssh command to pass the ConnectTimeout option to see if that helps:

Code: Select all

-o ConnectTimeout=60
Just in case, what is the output of these commands:

Code: Select all

ps aux | grep nagios.cfg
ipcs -q
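
For reference, a minimal sketch of where the ConnectTimeout option would sit in a check_by_ssh command line. The plugin path, host name, and remote script below are assumptions for illustration and are not taken from the poster's configuration; -H, -t, -o, and -C are standard check_by_ssh options.

```shell
#!/bin/sh
# Assumed plugin location; adjust to match your install.
PLUGIN=/usr/local/nagios/libexec/check_by_ssh

# Print (not run) a check_by_ssh command line with both timeouts set.
# $1 = target host, $2 = remote command (hypothetical placeholders)
build_check_cmd() {
    printf '%s -H %s -t 120 -o ConnectTimeout=60 -C "%s"\n' "$PLUGIN" "$1" "$2"
}

# Example with made-up values; inspect the result before using it in a command definition.
build_check_cmd node01.example.com /usr/local/bin/check_something.sh
```

Here -t 120 is the plugin's own timeout from the original post, while -o ConnectTimeout=60 is handed through to the ssh client and only bounds the connection attempt.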
nms
Posts: 222
Joined: Wed Sep 28, 2016 9:35 am

Re: SSH timeout (Linux level)

Post by nms »

Hi

Here's the output as requested:

Code: Select all

 ps aux | grep nagios.cfg
nagios    2605  3.3  0.1  48412 28036 ?        Ss   04:00   9:29 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios    3111  0.0  0.1  48280 16416 ?        S    04:00   0:01 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios    3427  0.0  0.0  49068  1484 ?        Ss   Aug14   0:05 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios    3521  0.0  0.0  48936   296 ?        S    Aug14   6:19 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
root      4675  0.0  0.0 103328   952 pts/0    S+   08:42   0:00 grep nagios.cfg
nagios   13199  0.0  0.0  50392  1488 ?        Ss   Aug20   4:44 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   13298  0.0  0.0  50260   292 ?        S    Aug20   4:05 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg

Code: Select all

ipcs -q

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0x6b000002 9568256    nagios     600        0            0
nms
Posts: 222
Joined: Wed Sep 28, 2016 9:35 am

Re: SSH timeout (Linux level)

Post by nms »

Does the ConnectTimeout option eventually terminate the ssh processes if they are idle?
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: SSH timeout (Linux level)

Post by ssax »

ConnectTimeout: Specifies the timeout (in seconds) used when connecting to the SSH server, instead of using the default system TCP timeout. This value is used only when the target is down or really unreachable, not when it refuses the connection.
It looks like you have too many nagios processes running. You should only have two, and the extras may be why checks are getting hung up. Please run these commands and see if that resolves your issue:

Code: Select all

service nagios stop
service ndo2db stop
pkill -9 nagios
killall -9 nagios
for i in $(ipcs -q | grep nagios | awk '{print $2}'); do ipcrm -q "$i"; done
service ndo2db start
service nagios start
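
If you would rather see which queues the loop above will remove before running it, here is a small sketch that lists them. The function name is hypothetical, and it matches on the owner column rather than grepping the whole line, which is a slightly stricter filter than the one-liner uses.

```shell
#!/bin/sh
# Print the msqid of every message queue owned by the given user.
# Reads `ipcs -q` output on stdin; expected columns:
#   key msqid owner perms used-bytes messages
queue_ids_for_owner() {
    # Header and separator lines are skipped because their second
    # field is not purely numeric.
    awk -v owner="$1" '$3 == owner && $2 ~ /^[0-9]+$/ { print $2 }'
}

# Live usage (prints IDs only, removes nothing):
#   ipcs -q | queue_ids_for_owner nagios
```

With the ipcs output posted earlier in this thread, this would print the single queue ID 9568256.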
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: SSH timeout (Linux level)

Post by tgriep »

Here is a description of the ConnectTimeout option and what it does.

It specifies the timeout (in seconds) used when connecting to the SSH server, instead of using the default system TCP timeout.
This value is used only when the target is down or really unreachable, not when it refuses the connection.
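
Since ConnectTimeout only covers the connection attempt, sessions that go silent after connecting are usually handled with the OpenSSH client keepalive options ServerAliveInterval and ServerAliveCountMax. Nobody in this thread tested them, so treat the following as a sketch under that assumption; the helper names and the values 15 and 3 are purely illustrative.

```shell
#!/bin/sh
# Print ssh -o options that make the client give up on a silent peer.
# $1 = seconds between keepalive probes, $2 = unanswered probes tolerated
keepalive_opts() {
    printf '%s\n' "-o ServerAliveInterval=$1 -o ServerAliveCountMax=$2"
}

# Approximate seconds before an unresponsive session is dropped:
# interval multiplied by the number of tolerated unanswered probes.
keepalive_deadline() {
    echo $(( $1 * $2 ))
}

keepalive_opts 15 3
keepalive_deadline 15 3
```

The printed options could be appended to a check_by_ssh command in the same way as -o ConnectTimeout=60. Try it on one service first, since all checks here run over ssh.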

In your ps output, I see that there are duplicate Nagios processes running, which would contribute to the issue you were having.
The duplicates need to be stopped; to do that, run the following as root:

Code: Select all

service nagios stop
killall -9 nagios
service nagios start
Let us know if you have any further questions.
nms
Posts: 222
Joined: Wed Sep 28, 2016 9:35 am

Re: SSH timeout (Linux level)

Post by nms »

Thanks. I have performed the stop/start and can verify that I now have 2 processes:

Code: Select all

nagios   16994 18.2  0.1  49600 19012 ?        Ss   09:55   0:01 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios   17021  0.4  0.1  49468 17664 ?        S    09:55   0:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
Why should there be two processes running?
lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: SSH timeout (Linux level)

Post by lmiltchev »

It is normal to have two processes running - one is a "child" process. Look at the PID and PPID to make sure.

Example:
[root@main-nagios-xi ~]# ps -ef | grep nagios.cfg | grep -v grep
nagios 31801 1 0 10:36 ? 00:00:17 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 31815 31801 0 10:36 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
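
The PID/PPID check can also be scripted. Below is a rough sketch that counts top-level nagios daemons, assuming a healthy daemon's parent is PID 1 as in the example above; the function name is made up for illustration.

```shell
#!/bin/sh
# Count nagios daemons whose parent is PID 1 (top-level instances).
# Reads `ps -ef`-style lines on stdin; expected columns:
#   UID PID PPID C STIME TTY TIME CMD...
count_nagios_parents() {
    awk '$3 == 1 && /nagios\.cfg/ { n++ } END { print n + 0 }'
}

# Live usage:
#   ps -ef | count_nagios_parents
```

A result of 1 matches the normal parent-plus-child layout; anything higher suggests duplicate instances like the ones seen earlier in this thread.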
Let us know if you have any further questions.
nms
Posts: 222
Joined: Wed Sep 28, 2016 9:35 am

Re: SSH timeout (Linux level)

Post by nms »

Thanks. Ticket can be closed