check_by_ssh issue?
Posted: Thu Feb 12, 2015 12:20 am
Ok,
So I have about 1000 hosts that are Linux based. We have a variety of checks being performed (ports, websites, Oracle SQL connections), and each server also has on average 10 checks that use check_by_ssh to check disk, memory, load and other OS-level metrics. They are all configured as separate active checks, not using the passive mode of check_by_ssh. My issue happens every so often (over the last 6 months, about once a month on average): for a 2-3 hour period I get thousands of alerts because quite a few ssh checks start failing. It's not a network issue, as all the other checks against the same hosts keep working fine; it's just ssh. When I try to connect manually via ssh during one of these periods, it either takes minutes to connect or the connection is simply dropped by the remote server. On a given group of servers it may last the full 2-3 hours, or only 10-15 minutes, after which those servers work fine and the problem moves on to a different group. We can never find anything wrong on the machines, and once the roughly 3-hour window is over everything seems to just magically start working again.
Has anyone else who heavily uses check_by_ssh seen similar behavior? I know it's a long shot and more than likely something in our environment... but I have to ask.
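One way to tell the "takes minutes to connect" case apart from the "connection is dropped" case is to time the TCP connect separately from the wait for the SSH version banner. Below is a minimal Python sketch of that idea. The function name `probe_ssh_banner`, the 10-second timeout, and the result layout are illustrative assumptions, not part of the setup described above, and it only reads the banner rather than authenticating.

```python
import socket
import sys
import time


def probe_ssh_banner(host, port=22, timeout=10.0):
    """Time the TCP connect and the wait for the SSH version banner.

    Hypothetical helper for illustration: it does not log in, it only
    reads the first bytes the server sends after the connection opens.
    """
    result = {"host": host, "connect_s": None, "banner_s": None,
              "banner": None, "error": None}
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # TCP handshake finished; a slow value here points at the network.
            result["connect_s"] = time.monotonic() - start
            sock.settimeout(timeout)
            data = sock.recv(256)
            # Time until sshd answered; slow here points at the remote sshd.
            result["banner_s"] = time.monotonic() - start
            if data:
                result["banner"] = data.decode("ascii", "replace").strip()
            else:
                # Empty read: the remote side closed without sending a banner.
                result["error"] = "closed before banner"
    except OSError as exc:
        # Covers timeouts, resets and refused connections.
        result["error"] = f"{type(exc).__name__}: {exc}"
    return result


if __name__ == "__main__":
    # Usage (assumed): python probe.py host1 host2 ...
    for name in sys.argv[1:]:
        print(probe_ssh_banner(name))
```

Run from the monitoring server against a handful of affected and unaffected hosts during one of the failure windows: a fast `connect_s` with a slow or missing banner would suggest the delay is on the sshd side rather than in the network path.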