Flapping - Critical: Socket timeout after 10 seconds
Posted: Wed Feb 04, 2015 10:39 pm
by McBinary
Hello all!
I was wondering if you could help me with a problem we've been having recently at work. We currently monitor a couple hundred servers in our datacenter, all of which are doing fine with the 10 second timeout - except one.
We've recently been burdened with the following email every few minutes from the same host:
State: CRITICAL
Additional Info: CRITICAL - Socket timeout after 10 seconds
The client service has been restarted and the server has been rebooted, yet it still continues to flap up/down every few minutes. This happens to random checks - CPU/Disks/HTTP/RDP. After rescheduling the next check they will recover. I am unable to increase the global timeout just for this one host, as it would affect the entire datacenter.
I'm talking 126 emails so far TODAY from this one host on false positives...
Is it possible to increase the socket timeout for a single host? If not, any ideas on how to address this issue?
Re: Flapping - Critical: Socket timeout after 10 seconds
Posted: Thu Feb 05, 2015 10:43 am
by abrist
McBinary wrote:
Is it possible to increase the socket timeout for a single host?
Not without creating a new service check config for this one host.
McBinary wrote: If not, any ideas on how to address this issue?
Alter the global flapping thresholds (I assume you do not want to do this) or suppress notifications on the service check?
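For anyone finding this later, a separate service check config for a single host could look roughly like the sketch below. It assumes the checks run through NRPE, which the thread does not confirm. The names `check_nrpe_slow`, `slow-host`, `generic-service` and `check_cpu` are placeholders, not taken from the poster's setup.

```cfg
# Hypothetical command: a normal check_nrpe call, but with the
# plugin's socket timeout raised from 10 to 30 seconds via -t.
define command {
    command_name    check_nrpe_slow
    command_line    $USER1$/check_nrpe -H $HOSTADDRESS$ -t 30 -c $ARG1$
}

# Only the problem host uses the slower command; every other host
# keeps its existing command and the 10 second timeout.
define service {
    use                     generic-service         ; assumed template name
    host_name               slow-host               ; placeholder host name
    service_description     CPU Load
    check_command           check_nrpe_slow!check_cpu   ; assumed remote command
}
```

The plugin timeout should stay below `service_check_timeout` in nagios.cfg, otherwise Nagios will kill the check first. Flap thresholds can also be set on an individual service with `low_flap_threshold` and `high_flap_threshold` if touching the global values is not an option.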
Re: Flapping - Critical: Socket timeout after 10 seconds
Posted: Thu Feb 05, 2015 9:35 pm
by McBinary
abrist wrote:
Alter the global flapping thresholds (I assume you do not want to do this) or suppress notifications on the service check?
I definitely wouldn't want to change that threshold just to accommodate this one host. However, I suppose suppressing just socket timeout notifications for this host wouldn't be a terrible idea, as long as it still left any other alert free to get through.
After restarting the server again last night, and subsequently not receiving these alerts for most of the day today, it appears that this may be resource related. We don't monitor memory on this one, but trending graphs show there were at most 182 users logged in yesterday, whereas we've only seen up to 40 today and have not received these timeouts. It may be time to nag the client to increase their memory, or set a logoff timer at least.
Re: Flapping - Critical: Socket timeout after 10 seconds
Posted: Fri Feb 06, 2015 10:09 am
by abrist
McBinary wrote: It may be time to nag the client to increase their memory, or set a logoff timer at least.
If this is the case, you may want to leave the check as it is - it sounds relevant to their environment. Maybe add a memory and cpu usage check as well?
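As a rough illustration of those extra checks, the sketch below assumes a Windows host monitored through NSClient++ and the `check_nt` plugin, following the layout of the stock Nagios sample Windows config. The thread does not say which agent is in use, and the host name, template name and thresholds are placeholders.

```cfg
# Assumed command definition, as in the stock sample config.
define command {
    command_name    check_nt
    command_line    $USER1$/check_nt -H $HOSTADDRESS$ -p 12489 -v $ARG1$ $ARG2$
}

# Memory usage: warn at 80% used, critical at 90% (example thresholds).
define service {
    use                     generic-service     ; assumed template name
    host_name               slow-host           ; placeholder host name
    service_description     Memory Usage
    check_command           check_nt!MEMUSE!-w 80 -c 90
}

# CPU load averaged over 5 minutes: warn at 80%, critical at 90%.
define service {
    use                     generic-service
    host_name               slow-host
    service_description     CPU Load
    check_command           check_nt!CPULOAD!-l 5,80,90
}
```

Graphing memory alongside the logged-in user count should make it easier to show the client whether the timeouts line up with memory pressure.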
Re: Flapping - Critical: Socket timeout after 10 seconds
Posted: Thu Feb 12, 2015 12:59 am
by McBinary
abrist wrote: If this is the case, you may want to leave the check as it is - it sounds relevant to their environment. Maybe add a memory and cpu usage check as well?
Just wanted to update this in case it comes up in the future.
I've set a task to reboot this server weekly, which seems to be holding the socket timeouts at bay by keeping disconnected users from amassing. Additional memory is definitely needed here, but I don't really have much say in this environment, so this seems to be an acceptable band-aid for now.
Re: Flapping - Critical: Socket timeout after 10 seconds
Posted: Thu Feb 12, 2015 11:56 am
by abrist
Fair enough. Let us know if you have further issues though.