check_by_ssh issue?

Posted: Thu Feb 12, 2015 12:20 am
by BanditBBS
Ok,

So I have about 1000 Linux-based hosts. We run a variety of checks: ports, websites, Oracle SQL connections, and on average 10 checks per server that use check_by_ssh to check disk, memory, load, and other OS-level metrics. They are all configured as separate active checks, not using the passive mode of check_by_ssh. My issue happens every so often (in the last 6 months, maybe once a month on average): for a 2-3 hour period I get thousands of alerts because quite a few ssh checks start failing. It's not a network issue, as all the other checks are working fine; it's just an ssh issue. When trying to connect manually via ssh, it either takes minutes to connect or the connection is just dropped by the remote server. It may last 2-3 hours on a given set of servers, or 10-15 minutes and then work fine and move on to a different group of servers. We can never find anything wrong on the machines, and after the total 3-hour window everything seems to just magically start working again.

Anyone else that heavily uses check_by_ssh see similar behavior? I know it's a long shot and more than likely something in our environment... but I have to ask.

Re: check_by_ssh issue?

Posted: Thu Feb 12, 2015 8:48 am
by mrochelle
We use check_by_ssh heavily with Linux and Solaris systems. The few times we have experienced a problem similar to yours, it could be isolated to a particular time period or particular group of hosts. Just to comment on some of the more difficult ssh connection issues:

We maxed out the default allowable ssh connections, which on the Solaris systems was 50. This option was configurable, but we did not find much information on it. I don't know if there is such a default on Linux.

We experienced another ssh connection problem for which I don't believe we ever found the root cause: during a certain window, between 2300 and 0100, ssh connections to a group of systems had problems. While I'm not recommending this, I creatively solved the problem by writing my own Perl version of check_by_ssh with configurable retry attempts and a performance measurement that reported the number of retries for a given check. We then found that most checks during this problem period cleared on the 1st retry.
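
For illustration only, here is a minimal sketch of that retry idea in Python (not the original Perl script, which is not shown in this thread). The names run_with_retries and format_result are hypothetical, and treating exit codes 3 (UNKNOWN) and 255 (ssh's own error code) as "retry-worthy" is an assumption that should be checked against how check_by_ssh behaves in your environment.

```python
import subprocess

# Assumption: UNKNOWN (3) or ssh's own failure code (255) indicates a
# connection-level problem worth retrying, not a real check result.
RETRYABLE = frozenset({3, 255})


def run_with_retries(cmd, max_retries=2, timeout=30, retryable=RETRYABLE):
    """Run a plugin command, retrying on connection-style failures.

    Returns (exit_code, first_line_of_output, retries_used).
    """
    retries = 0
    while True:
        try:
            proc = subprocess.run(
                cmd, capture_output=True, text=True, timeout=timeout
            )
            lines = proc.stdout.strip().splitlines() or [""]
            code, out = proc.returncode, lines[0]
        except subprocess.TimeoutExpired:
            code, out = 3, "UNKNOWN - check timed out"
        # Stop on a real result, or once the retry budget is spent.
        if code not in retryable or retries >= max_retries:
            return code, out, retries
        retries += 1


def format_result(out, retries):
    """Append a retries counter to the plugin output as performance data."""
    text, sep, perf = out.partition("|")
    perf = (perf.strip() + " " if sep else "") + f"retries={retries}"
    return f"{text.strip()} | {perf}"
```

A wrapper like this would be called with the full check_by_ssh command line as a list, print format_result(out, retries), and exit with the returned code, so the retry count can be graphed alongside the check's own performance data.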

My comments,
Marcus :geek:

Re: check_by_ssh issue?

Posted: Thu Feb 12, 2015 8:55 am
by BanditBBS
Check these links out:
http://unix.stackexchange.com/questions ... -parameter
http://unix.stackexchange.com/questions ... onnections

I am almost certain we are not hitting those limits, but it sure describes our problem well. So I think we may try bumping those up.

Re: check_by_ssh issue?

Posted: Thu Feb 12, 2015 10:09 am
by scottwilkerson
Default MaxSessions on a lot of systems is 10, and if you have 10 checks using that plus any other people / processes that are ssh'd in (think rsync, scp, etc. also) I could see you hitting the limit.
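
To compare limits across hosts, one option is to look at the effective sshd configuration, which `sshd -T` prints as lowercase keyword/value lines (it usually needs root). The sketch below is a hypothetical Python helper, parse_sshd_limits, that pulls two settings out of such output; the sample text is made up for illustration and is not taken from any host in this thread. Note that in OpenSSH, MaxSessions limits sessions per network connection, while MaxStartups limits concurrent unauthenticated connections, so both are worth checking.

```python
def parse_sshd_limits(sshd_t_output):
    """Extract maxsessions and maxstartups from `sshd -T` style output."""
    wanted = ("maxsessions", "maxstartups")
    limits = {}
    for line in sshd_t_output.splitlines():
        # Each line is "keyword value"; keywords are matched case-insensitively.
        key, _, value = line.strip().partition(" ")
        if key.lower() in wanted:
            limits[key.lower()] = value.strip()
    return limits


# Illustrative sample only, not real output from any server.
sample = """port 22
maxsessions 10
maxstartups 10:30:100
"""
limits = parse_sshd_limits(sample)
```

Feeding it the captured output from each monitored host would make it easy to spot servers whose limits are lower than the number of simultaneous checks hitting them.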

Re: check_by_ssh issue?

Posted: Thu Feb 12, 2015 10:27 am
by BanditBBS
scottwilkerson wrote: Default MaxSessions on a lot of systems is 10, and if you have 10 checks using that plus any other people / processes that are ssh'd in (think rsync, scp, etc. also) I could see you hitting the limit.
Very good point, Scott. Please leave this open for a day or two in case someone else has any input... but after that, lock her up!

Re: check_by_ssh issue?

Posted: Thu Feb 12, 2015 12:55 pm
by tmcdonald
Will do.