check_by_ssh issue?

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
Locked
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

check_by_ssh issue?

Post by BanditBBS »

Ok,

So I have about 1000 hosts that are Linux based. We have a variety of checks being performed: ports, websites, Oracle SQL connections, and on average 10 checks per server that use check_by_ssh to check disk, memory, load and other OS-level metrics. They are all configured as separate active checks, not using the passive mode of check_by_ssh.

My issue happens every so often (over the last 6 months, maybe once a month on average): for a 2-3 hour period I get thousands of alerts because quite a few ssh checks start failing. It's not a network issue, as all the other checks are working fine; it's just an ssh issue. When trying to connect manually via ssh, it either takes minutes to connect or the connection is just dropped by the remote server. It may last 2-3 hours on a given set of servers, or 10-15 minutes, after which they work fine and the problem moves on to a different group of servers. We can never find anything wrong on the machines, and after the total 3-hour window everything seems to just magically start working again.

Does anyone else who heavily uses check_by_ssh see similar behavior? I know it's a long shot and more than likely something in our environment, but I have to ask.
2 of XI5.6.14 Prod/DR/DEV - Nagios LogServer 2 Nodes
See my projects on the Exchange at BanditBBS - Also check out my Nagios stuff on my personal page at Bandit's Home and at github
mrochelle
Posts: 238
Joined: Fri May 04, 2012 11:20 am
Location: Heart of America

Re: check_by_ssh issue?

Post by mrochelle »

We use check_by_ssh heavily with Linux and Solaris systems. The few times we have experienced a problem similar to yours, it could be isolated to a particular time period or particular group of hosts. Just to comment on some of the more difficult ssh connection issues:

We maxed out the default number of allowable ssh connections, which on the Solaris systems was 50. This option was configurable, but we did not find a lot of available info on it. I don't know if there is such a default with Linux.

We experienced another ssh connection problem for which I don't believe we ever found the root cause: between 2300 and 0100, ssh connections to a particular group of systems had problems. While I'm not recommending this, I creatively solved the problem by writing my own Perl version of check_by_ssh with configurable retry attempts and a performance measurement that reported the number of retries for a given check. We then found out that most checks during this problem period cleared on the 1st retry.
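The retry approach described above can be sketched roughly as follows. This is not the Perl script from the post (which is not shown); it is a minimal Python illustration that wraps any plugin command line. The names `run_command` and `check_with_retries`, the defaults, and the choice to retry only on exit code 3 (UNKNOWN) are all assumptions made for the example, so adjust the retry condition to whatever a failed ssh connection returns in your setup.

```python
import subprocess
import time

UNKNOWN = 3  # standard Nagios plugin exit code for UNKNOWN


def run_command(argv, timeout=30):
    """Run one check attempt; return (exit_code, first line of stdout)."""
    try:
        proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout)
    except (subprocess.TimeoutExpired, OSError):
        # Treat a timeout or an unrunnable command as UNKNOWN.
        return UNKNOWN, "UNKNOWN - check did not complete"
    lines = proc.stdout.strip().splitlines()
    return proc.returncode, (lines[0] if lines else "")


def check_with_retries(argv, max_retries=2, delay=1.0, attempt=run_command):
    """Retry a check while it returns UNKNOWN; report retries as perfdata.

    `attempt` is injectable so the retry logic can be exercised without ssh.
    """
    retries = 0
    code, text = attempt(argv)
    while code == UNKNOWN and retries < max_retries:
        time.sleep(delay)
        retries += 1
        code, text = attempt(argv)
    # Append to existing perfdata if present, otherwise start a perfdata section.
    sep = " " if "|" in text else " | "
    return code, f"{text}{sep}retries={retries}"
```

A wrapper script would pass the full check_by_ssh command line as `argv`, print the returned text, and exit with the returned code. Graphing the `retries` value is what makes a pattern like "most checks clear on the 1st retry" visible.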

My comments,
Marcus :geek:
BanditBBS

Re: check_by_ssh issue?

Post by BanditBBS »

Check these links out:
http://unix.stackexchange.com/questions ... -parameter
http://unix.stackexchange.com/questions ... onnections

I am almost certain we are not hitting those limits, but it sure describes our problem well. So I think we may try bumping those up.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: check_by_ssh issue?

Post by scottwilkerson »

The default MaxSessions on a lot of systems is 10. If you have 10 checks using that, plus any other people or processes that are ssh'd in (think rsync, scp, etc.), I could see you hitting the limit.
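To see which limits a given server is actually running with, a small parser over the sshd configuration text can help. The sketch below assumes OpenSSH, whose sshd_config(5) documents a default of 10 for MaxSessions (sessions per network connection) and 10:30:100 for MaxStartups (concurrent unauthenticated connections). The function name `sshd_limits` is made up for the example, and it only reads the global section of a single file: it stops at the first `Match` block and does not follow `Include` directives.

```python
import re

# OpenSSH defaults per sshd_config(5); other SSH servers may differ.
DEFAULTS = {"maxsessions": "10", "maxstartups": "10:30:100"}


def sshd_limits(config_text):
    """Return the global MaxSessions/MaxStartups values from sshd_config text."""
    found = {}
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        parts = re.split(r"[\s=]+", line, maxsplit=1)
        if len(parts) != 2:
            continue
        key, value = parts[0].lower(), parts[1].strip()
        if key == "match":
            break  # everything after this is conditional, not global
        # Keywords are case-insensitive and the first value obtained wins.
        if key in DEFAULTS and key not in found:
            found[key] = value
    return {**DEFAULTS, **found}
```

Feeding it the contents of `/etc/ssh/sshd_config` from a few affected hosts gives a quick comparison across the fleet. On the server itself, `sshd -T` prints the effective configuration and is the more reliable source.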
Former Nagios employee
BanditBBS

Re: check_by_ssh issue?

Post by BanditBBS »

scottwilkerson wrote: Default MaxSessions on a lot of systems is 10, and if you have 10 checks using that plus any other people / processes that are ssh'd in (think rsync, scp, etc. also) I could see you hitting the limit.

Very good point, Scott. Please leave this open for a day or two in case someone else has any input, but after that, lock her up!
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: check_by_ssh issue?

Post by tmcdonald »

Will do.
Former Nagios employee