SSH timeout (Linux level)
Hi,
We monitor our services over ssh connections. The check_by_ssh command has a timeout setting of "-t 120", which seems to work well.
However, during an issue we experienced today, in which we lost connection to several of our nodes, we noticed that the sshd connection was still initiated at the Linux level, which caused a lack of resources on the server.
Because of this, Nagios XI was not functioning well: there were problems with the monitoring engine and other crucial system components. It was literally stuck.
To fix this we had to kill the ssh sessions on the server until Nagios could perform well again.
We think this could be fixed by timing out and dropping the connection when there is no keepalive between the ssh sessions.
But we need to be careful about what we change, since ALL of our monitoring is done via ssh (i.e., an ssh connection followed by running a script).
As an example, I have attached a screenshot showing the ssh process still running (as you can see, without its child id) after we stopped the monitoring from the Nagios GUI. This means that the ssh processes were still there.
Can you shed some light?
Rgds,
Re: SSH timeout (Linux level)
You could try editing your check_by_ssh command to pass the ConnectTimeout option to see if that helps:
Code: Select all
-o ConnectTimeout=60
Just in case, what is the output of these commands:
Code: Select all
ps aux | grep nagios.cfg
ipcs -q
Re: SSH timeout (Linux level)
Hi
Here's the output as requested:
Code: Select all
ps aux | grep nagios.cfg
nagios 2605 3.3 0.1 48412 28036 ? Ss 04:00 9:29 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 3111 0.0 0.1 48280 16416 ? S 04:00 0:01 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 3427 0.0 0.0 49068 1484 ? Ss Aug14 0:05 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 3521 0.0 0.0 48936 296 ? S Aug14 6:19 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
root 4675 0.0 0.0 103328 952 pts/0 S+ 08:42 0:00 grep nagios.cfg
nagios 13199 0.0 0.0 50392 1488 ? Ss Aug20 4:44 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 13298 0.0 0.0 50260 292 ? S Aug20 4:05 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
Code: Select all
ipcs -q
------ Message Queues --------
key msqid owner perms used-bytes messages
0x6b000002 9568256 nagios 600 0 0
Re: SSH timeout (Linux level)
Does the Connect timeout eventually terminate the ssh processes if they are idle?
Re: SSH timeout (Linux level)
ConnectTimeout: Specifies the timeout (in seconds) used when connecting to the SSH server, instead of using the default system TCP timeout. This value is used only when the target is down or really unreachable, not when it refuses the connection.
Looks like you have too many nagios processes running; you should only have two, and that may be why they are getting hung up. Please run these commands and see if it resolves your issue:
Code: Select all
service nagios stop
service ndo2db stop
pkill -9 nagios
killall -9 nagios
for i in $(ipcs -q | awk '/nagios/ {print $2}'); do ipcrm -q "$i"; done
service ndo2db start
service nagios start
Re: SSH timeout (Linux level)
Here is a description of the ConnectTimeout option and what it does.
It specifies the timeout (in seconds) used when connecting to the SSH server, instead of using the default system TCP timeout.
This value is used only when the target is down or really unreachable, not when it refuses the connection.
In your ps output, I see that there are duplicate Nagios processes running, and that would contribute to the issue you were having.
So the duplicates need to be stopped. To do that, run the following as root:
Code: Select all
service nagios stop
killall -9 nagios
service nagios start
Let us know if you have any further questions.
Be sure to check out our Knowledgebase for helpful articles and solutions!
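Since ConnectTimeout only bounds the connection phase, it would not by itself end a session that is already established and then goes silent. As a rough sketch of the keepalive idea raised earlier in the thread, the OpenSSH client options ServerAliveInterval and ServerAliveCountMax can be passed the same way as ConnectTimeout. The helper name ssh_check_opts, the default values, and the user@host and script path placeholders are assumptions for illustration, not part of any Nagios configuration shown here.

```shell
# Hypothetical helper: prints ssh client options for a monitoring check.
# Arguments (defaults are assumptions; tune them for your environment):
#   $1 = ConnectTimeout in seconds (connection phase only)
#   $2 = ServerAliveInterval in seconds (how often to probe a silent server)
#   $3 = ServerAliveCountMax (unanswered probes tolerated before disconnecting)
ssh_check_opts() {
    connect_timeout=${1:-60}
    alive_interval=${2:-15}
    alive_count=${3:-3}
    printf '%s' "-o ConnectTimeout=${connect_timeout} -o ServerAliveInterval=${alive_interval} -o ServerAliveCountMax=${alive_count}"
}

# Only prints the command line; user@host and the script path are placeholders.
# With 15 and 3, an unresponsive peer is dropped after roughly 15 * 3 = 45 seconds.
echo "ssh $(ssh_check_opts 60 15 3) user@host /path/to/check_script"
```

With check_by_ssh, each of these would be passed as its own -o argument, as with the ConnectTimeout suggestion above. Testing on one or two services first seems prudent, since all monitoring here runs over ssh.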
Re: SSH timeout (Linux level)
Thanks. I have performed the stop/start and can verify I now have 2 processes:
Code: Select all
nagios 16994 18.2 0.1 49600 19012 ? Ss 09:55 0:01 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 17021 0.4 0.1 49468 17664 ? S 09:55 0:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
Why should there be two processes running?
Re: SSH timeout (Linux level)
It is normal to have two processes running - one is a "child" process. Look at the PID and PPID to make sure.
Example:
[root@main-nagios-xi ~]# ps -ef | grep nagios.cfg | grep -v grep
nagios 31801 1 0 10:36 ? 00:00:17 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 31815 31801 0 10:36 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
Let us know if you have any further questions.
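The PID/PPID check described above can be scripted. This is a minimal sketch, assuming `ps -ef` column order (UID, PID, PPID) and that the main daemon is reparented to PID 1, as in the example output. The function name count_nagios_roots is made up for illustration.

```shell
# count_nagios_roots: reads `ps -ef`-style lines on stdin and prints how many
# nagios daemon processes have PPID 1, i.e. are not children of another nagios.
count_nagios_roots() {
    awk '/nagios\.cfg/ && !/grep/ && $3 == 1 { n++ } END { print n + 0 }'
}

# Sample input copied from the ps -ef output in this thread.
sample='nagios 31801 1 0 10:36 ? 00:00:17 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 31815 31801 0 10:36 ? 00:00:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg'

printf '%s\n' "$sample" | count_nagios_roots   # prints 1
```

On a live server the equivalent would be `ps -ef | count_nagios_roots`. A result above 1 would suggest duplicate daemons like those seen earlier. On systems where daemons are not reparented to PID 1, the `$3 == 1` test would need adjusting.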
Re: SSH timeout (Linux level)
Thanks. Ticket can be closed