SSH timeout (Linux level)
Posted: Wed Sep 05, 2018 5:16 am
by nms
Hi,
We monitor our services over ssh. The check_by_ssh command has a timeout setting of "-t 120", which seems to work well.
However, during an issue we experienced today, in which we lost connection to several of our nodes, ssh connections were still being initiated at the Linux level, which exhausted resources on the server.
As a result, Nagios XI stopped functioning properly: the monitoring engine and other crucial system components had problems, and the system was effectively stuck.
To fix this we had to kill the ssh sessions on the server until Nagios could perform well again.
We think this could be fixed with a timeout that drops the connection when there is no keepalive on the ssh session.
But we need to be careful about what we change, since ALL of our monitoring is done via ssh (i.e., an ssh connection followed by running a script).
As an example, I have attached a screenshot showing the ssh process still running (as you can see, without its child ID) after we stopped the monitoring from the Nagios GUI. This means that the ssh processes were still there.
Can you shed some light?
Rgds,
Re: SSH timeout (Linux level)
Posted: Wed Sep 05, 2018 1:28 pm
by ssax
You could try editing your check_by_ssh command to pass the ConnectTimeout option to see if that helps:
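As a rough sketch of what that could look like: check_by_ssh accepts -o to hand an option through to ssh, so ConnectTimeout can sit next to the existing -t 120. The plugin path, the 10-second value, the example host, the script path, and the helper name below are assumptions for illustration, not values from this thread. The helper only prints the command line so it can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper: prints a check_by_ssh command line instead of running it,
# so the options can be reviewed before the real command definition is edited.
# PLUGIN path and the ConnectTimeout value are assumptions; adjust for your install.
PLUGIN="${PLUGIN:-/usr/local/nagios/libexec/check_by_ssh}"

build_check_cmd() {
    host="$1"
    remote_cmd="$2"
    # -t 120 : overall plugin timeout (the value already in use in this thread)
    # -o     : handed through to ssh as "-o OPTION"
    printf '%s -H %s -t 120 -o ConnectTimeout=10 -C "%s"\n' "$PLUGIN" "$host" "$remote_cmd"
}

# Example host and script path are placeholders.
build_check_cmd 192.0.2.10 "/path/to/check_script.sh"
```

In the Nagios command definition, the same -o ConnectTimeout=10 would be added to the command_line that calls check_by_ssh.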
Just in case, what is the output of "ps aux | grep nagios.cfg" and "ipcs -q"?
Re: SSH timeout (Linux level)
Posted: Thu Sep 06, 2018 1:46 am
by nms
Hi
Here's the output as requested:
Code:
ps aux | grep nagios.cfg
nagios 2605 3.3 0.1 48412 28036 ? Ss 04:00 9:29 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 3111 0.0 0.1 48280 16416 ? S 04:00 0:01 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 3427 0.0 0.0 49068 1484 ? Ss Aug14 0:05 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 3521 0.0 0.0 48936 296 ? S Aug14 6:19 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
root 4675 0.0 0.0 103328 952 pts/0 S+ 08:42 0:00 grep nagios.cfg
nagios 13199 0.0 0.0 50392 1488 ? Ss Aug20 4:44 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 13298 0.0 0.0 50260 292 ? S Aug20 4:05 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
Code:
ipcs -q
------ Message Queues --------
key msqid owner perms used-bytes messages
0x6b000002 9568256 nagios 600 0 0
Re: SSH timeout (Linux level)
Posted: Thu Sep 06, 2018 1:59 am
by nms
Does ConnectTimeout eventually terminate the ssh processes if they are idle?
Re: SSH timeout (Linux level)
Posted: Thu Sep 06, 2018 1:58 pm
by ssax
ConnectTimeout: Specifies the timeout (in seconds) used when connecting to the SSH server, instead of using the default system TCP timeout. This value is used only when the target is down or really unreachable, not when it refuses the connection.
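Note that ConnectTimeout only bounds the connection attempt, so by itself it would not end a session that is already established and then goes silent. For that case the OpenSSH client has the ServerAliveInterval and ServerAliveCountMax options, which can also be passed to ssh with -o. A minimal sketch follows; the values 15 and 3 and the function names are assumptions for illustration only, and whether this clears the stuck processes seen here has not been tested in this thread.

```shell
#!/bin/sh
# Sketch: ssh client keepalive options for sessions that are already established.
# ServerAliveInterval / ServerAliveCountMax are OpenSSH client options; the values
# below and the function names are illustrative assumptions.
ALIVE_INTERVAL=15   # seconds of silence before ssh sends a keepalive probe
ALIVE_COUNT_MAX=3   # unanswered probes tolerated before ssh disconnects

# Print the extra -o options as they would be appended to an ssh command line.
keepalive_opts() {
    printf '%s\n' "-o ServerAliveInterval=$ALIVE_INTERVAL -o ServerAliveCountMax=$ALIVE_COUNT_MAX"
}

# Approximate number of seconds until ssh gives up on an unresponsive peer.
keepalive_budget() {
    echo $((ALIVE_INTERVAL * ALIVE_COUNT_MAX))
}

echo "extra options: $(keepalive_opts)"
echo "unresponsive peer dropped after roughly $(keepalive_budget)s"
```

The product of the two values should stay below the plugin timeout (120 seconds here) if ssh is meant to give up before the check itself times out.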
It looks like you have too many nagios processes running. You should only have two, and the extras may be why the checks are getting hung up. Please run these commands and see if that resolves your issue:
Code:
service nagios stop
service ndo2db stop
pkill -9 nagios
killall -9 nagios
for i in $(ipcs -q | grep nagios | awk '{print $2}'); do ipcrm -q "$i"; done
service ndo2db start
service nagios start
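Before running the ipcrm loop, the queue IDs it would remove can be previewed without deleting anything. This is only a sketch: the function name is hypothetical, and it matches the owner column exactly rather than grepping for "nagios" anywhere in the line, so its result could differ from the loop above on unusual output.

```shell
#!/bin/sh
# Sketch: list the message queue ids owned by a user from `ipcs -q` style output,
# without removing anything. Function name is hypothetical.
queue_ids_for_owner() {
    owner="$1"
    # In the ipcs -q layout shown earlier, column 2 is msqid and column 3 is owner.
    awk -v owner="$owner" '$3 == owner { print $2 }'
}

# Preview only; skipped quietly if ipcs is not installed.
if command -v ipcs >/dev/null 2>&1; then
    ipcs -q | queue_ids_for_owner nagios
fi
```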
Re: SSH timeout (Linux level)
Posted: Thu Sep 06, 2018 1:59 pm
by tgriep
Here is a description of the ConnectTimeout option and what it does:
It specifies the timeout (in seconds) used when connecting to the SSH server, instead of using the default system TCP timeout.
This value is used only when the target is down or really unreachable, not when it refuses the connection.
In your ps output, I see that there are duplicate Nagios processes running, and that would contribute to the issue you were having.
The duplicates need to be stopped. To do that, run the following as root:
Code:
service nagios stop
killall -9 nagios
service nagios start
Let us know if you have any further questions.
Re: SSH timeout (Linux level)
Posted: Fri Sep 07, 2018 3:19 am
by nms
Thanks. I have performed the stop/start and can verify that I now have 2 processes:
Code:
nagios 16994 18.2 0.1 49600 19012 ? Ss 09:55 0:01 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 17021 0.4 0.1 49468 17664 ? S 09:55 0:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
Why should there be two processes running?
Re: SSH timeout (Linux level)
Posted: Fri Sep 07, 2018 11:30 am
by lmiltchev
It is normal to have two processes running - one is a "child" process. Look at the PID and PPID to make sure.
Example:
[root@main-nagios-xi ~]# ps -ef | grep nagios.cfg | grep -v grep
nagios 31801 1 0 10:36 ? 00:00:17 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 31815 31801 0 10:36 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
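The PID/PPID check can also be scripted. The sketch below counts Nagios daemons whose parent is PID 1 in ps -ef style output: one such daemon plus its child is the normal case, while more than one would suggest duplicates. The function name is hypothetical, and the test assumes the daemon is started with nagios.cfg on its command line as in the output above.

```shell
#!/bin/sh
# Sketch: count top-level Nagios daemons (parent PID 1) in `ps -ef` style output.
# Function name is hypothetical.
count_top_level_nagios() {
    # ps -ef columns: UID PID PPID C STIME TTY TIME CMD
    awk '$3 == 1 && /nagios\.cfg/ { n++ } END { print n + 0 }'
}

# Prints the count for the local machine; skipped quietly if ps is unavailable.
if command -v ps >/dev/null 2>&1; then
    ps -ef | count_top_level_nagios
fi
```

A result greater than 1 would be a hint to repeat the stop, kill, and start sequence from earlier in the thread.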
Let us know if you have any further questions.
Re: SSH timeout (Linux level)
Posted: Mon Sep 10, 2018 10:04 am
by nms
Thanks. The ticket can be closed.