Service Check time out

Posted: Sun Mar 11, 2018 1:41 pm
by saffer
Hi folks,

First post in many years that has me truly perplexed.

I am running Nagios 4.3.4 with 10,500+ service checks and some 400 hosts on a SLES 12 SP3 VMware host.

Every now and then, about 1000 service checks report a service check timeout at the same time, and occasionally we get host check timeouts as well.

The perplexing part is that there appears to be no resource issue: we have more than sufficient memory and CPU. I have experimented with the service check timeout and increased it from 60 seconds to 180. This helped reduce the problem, but it still occurs randomly. The kernel has been tuned, as have the TCP buffers, etc. The typical checks are not Perl; check_nrpe is the primary service check we use. Other service checks, such as check_tcp and check_ssh, also fail with a service check timeout.

I have many years of experience with Nagios, and have run environments with >20,000 service checks and 650+ hosts, and have never seen this issue.

So, any thoughts? Sounds like a kooky Nagios version.

cheers

Re: Service Check time out

Posted: Mon Mar 12, 2018 9:30 am
by eloyd
I'd be more interested in your VMware host resource usage. Can you look back at network/memory/processor utilization there and see if you were waiting for anything? Maybe another machine has higher resource priority?

Re: Service Check time out

Posted: Tue Mar 13, 2018 2:36 pm
by cdienger
Were you able to check the VMware resources suggested by @eloyd?

Another potential place to check would be any firewall devices that the traffic may go through. Perhaps the frequent ICMP and TCP connections are getting flagged as potentially malicious behavior and dropped?

Re: Service Check time out

Posted: Tue Apr 10, 2018 8:57 am
by saffer
Hi,

Apologies for the tardiness. My account was locked.

We checked the VMware host, and it was running very quietly. Since my first post, we have moved to a new server, and the problem has continued.

What seems to have quietened it down is doubling the memory. This is weird, as the server never showed memory constraints.
I have worked with Nagios since the early days, and have never seen this behaviour.

Re: Service Check time out

Posted: Tue Apr 10, 2018 4:49 pm
by scottwilkerson
You would really need to check the VMware resources while the problem is occurring. Specifically, I would be looking at CPU or disk I/O.
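
One way to catch this while it is happening is to sample the guest's own counters during an outage. What follows is only a rough Python sketch, assuming a Linux guest where the aggregate `cpu` line of /proc/stat uses the usual field order (user, nice, system, idle, iowait, irq, softirq, steal). The function names are made up for illustration, and this shows the guest's view only, so the VMware host metrics are still needed alongside it.

```python
import time

# Assumed field order of the aggregate "cpu" line in /proc/stat on Linux.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(stat_text):
    """Return the aggregate 'cpu' counters from /proc/stat text as a dict."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == "cpu":
            values = [int(v) for v in parts[1:1 + len(FIELDS)]]
            return dict(zip(FIELDS, values))
    raise ValueError("no aggregate 'cpu' line found")


def cpu_shares(before_text, after_text):
    """Fraction of CPU time spent in iowait and steal between two snapshots."""
    before = parse_cpu_line(before_text)
    after = parse_cpu_line(after_text)
    delta = {name: after.get(name, 0) - before[name] for name in before}
    total = sum(delta.values())
    if total <= 0:
        return {"iowait": 0.0, "steal": 0.0}
    return {
        "iowait": delta.get("iowait", 0) / total,
        "steal": delta.get("steal", 0) / total,
    }


def watch(samples, interval_seconds=5, path="/proc/stat"):
    """Print iowait/steal shares a fixed number of times (hypothetical helper)."""
    with open(path) as handle:
        previous = handle.read()
    for _ in range(samples):
        time.sleep(interval_seconds)
        with open(path) as handle:
            current = handle.read()
        shares = cpu_shares(previous, current)
        print("iowait=%.1f%% steal=%.1f%%"
              % (shares["iowait"] * 100, shares["steal"] * 100))
        previous = current
```

Calling `watch(12)` would print a reading every five seconds for a minute. A high iowait share points toward disk, while a high steal share suggests the hypervisor is giving CPU time to other guests.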

Re: Service Check time out

Posted: Sat May 05, 2018 2:58 pm
by saffer
Until tonight, my problem has been very quiet. The dreaded service check timed out issue returned with a vengeance.

I have now set specific settings in Nagios to report OK if such an issue occurs. I have also set the timeout value -T180:3 to report unknown, but this really is hit and miss. Some checks get reported as critical with "service check timed out", and some get marked as unknown.
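
For anyone following along, the idea behind a timeout that maps to a chosen state can be sketched as below. This is not how check_nrpe or Nagios implement it; it is a minimal Python illustration, with a made-up `run_check` helper, of running a plugin command and returning UNKNOWN (exit code 3) when it exceeds a time limit. The exit code values are the standard Nagios plugin ones.

```python
import subprocess
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def run_check(command, timeout_seconds, timeout_state=UNKNOWN):
    """Run a plugin command and return (exit_code, output).

    If the command runs longer than timeout_seconds it is killed, and
    timeout_state is returned in place of the plugin's own exit code.
    """
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        message = "check timed out after %s seconds" % timeout_seconds
        return timeout_state, message
    return result.returncode, result.stdout.strip()
```

With two separate timers in play (one in the plugin, one in the scheduler), the reported state would depend on which one fires first. That is only a guess at why the results could come out mixed.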

It seems to be a 50/50 split. What I am noticing is a lot of socket timeouts and the following errors in the log files. These errors only appear when I get this problem.

nagios: job 12517 (pid=5955): read() returned error 11
nagios: job 12518 (pid=5959): read() returned error 11
nagios: job 12517 (pid=5956): read() returned error 11
nagios: job 12517 (pid=5956): read() returned error 11
nagios: job 12517 (pid=5962): read() returned error 11
nagios: job 12517 (pid=5954): read() returned error 11
nagios: job 12517 (pid=5953): read() returned error 11
nagios: job 12517 (pid=5955): read() returned error 11
nagios: job 12518 (pid=5959): read() returned error 11
nagios: job 12517 (pid=5962): read() returned error 11
nagios: job 12517 (pid=5953): read() returned error 11
nagios: job 12517 (pid=5957): read() returned error 11
nagios: job 12516 (pid=5951): read() returned error 11
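
For reference, error 11 on Linux is EAGAIN ("Resource temporarily unavailable"), which usually means a read on a non-blocking descriptor found no data ready. The Python sketch below is only an illustration for tallying lines in the format shown above and decoding the error number. The regular expression and helper names are my own, and the errno names come from whatever platform runs the script.

```python
import errno
import os
import re
from collections import Counter

# Matches lines such as:
#   nagios: job 12517 (pid=5955): read() returned error 11
LINE_RE = re.compile(r"job (\d+) \(pid=(\d+)\): read\(\) returned error (\d+)")


def summarize(log_lines):
    """Count read() errors per (job id, error number) pair."""
    counts = Counter()
    for line in log_lines:
        match = LINE_RE.search(line)
        if match:
            job, _pid, err = match.groups()
            counts[(int(job), int(err))] += 1
    return counts


def describe_errno(number):
    """Return the symbolic name and message for an errno value on this platform."""
    name = errno.errorcode.get(number, "unknown")
    return "%s (%s)" % (name, os.strerror(number))
```

Feeding the log excerpt into `summarize` shows which jobs repeat most often, and `describe_errno(11)` gives the platform's own wording for the error.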

Re: Service Check time out

Posted: Sun May 06, 2018 8:40 am
by saffer
An update for everyone that has contributed.

This problem is being caused by storage. We saw 3.5-second latency during the last two outages.
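
In case it helps anyone chasing the same symptom, here is a rough Python sketch for spotting storage stalls from inside the guest. It is not what we used to find the problem, just a simple probe under my own assumptions: it times small write-plus-fsync cycles in a directory of your choosing, and the `fsync_latencies` name and the 4 KiB payload are arbitrary.

```python
import os
import tempfile
import time


def fsync_latencies(directory, samples=5, payload=b"x" * 4096):
    """Time small write+fsync cycles in `directory`.

    Returns a list with the elapsed seconds for each sample. The temporary
    file is removed afterwards.
    """
    results = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(samples):
            start = time.monotonic()
            os.write(fd, payload)
            os.fsync(fd)  # force the data down to the storage layer
            results.append(time.monotonic() - start)
    finally:
        os.close(fd)
        os.unlink(path)
    return results
```

Running it against the filesystem Nagios writes to, for example `max(fsync_latencies("/var/tmp", samples=20))`, gives a quick feel for worst-case write latency. The path here is just a placeholder, and values anywhere near a second would be worth taking to the storage team.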

Re: Service Check time out

Posted: Sun May 06, 2018 8:46 am
by eloyd
Funny. That's almost so obvious that no one thought to ask about it. :-)

Re: Service Check time out

Posted: Mon May 07, 2018 9:55 am
by tmcdonald
If the issue has been resolved, are we all good to close this up?