Service Check time out

Post by saffer » Sun Mar 11, 2018 1:41 pm

hi folks,

First post in many years that has me truly perplexed.

I am running Nagios 4.3.4, with 10,500+ service checks and some 400 hosts on a SLES 12 SP3 VMware host.

Every now and then we get service check timeouts from 1000+ service checks, and occasionally host check timeouts.

The perplexing part is that there appears to be no resource issue: we have more than sufficient memory and CPU. I have played with the service check timeout and increased it from 60 seconds to 180. This helped minimize the issue, but it still occurs randomly. The kernel has been tuned, as have the TCP buffers, etc. The typical checks are not Perl; check_nrpe is the primary service checker we use. Other service checks such as check_tcp and check_ssh also fail with service check timeouts.

I have many years of experience with Nagios, and have run environments with >20,000 service checks and 650+ hosts, and have never seen this issue.

So, any thoughts? Sounds like a kooky Nagios version.

cheers
saffer
 
Posts: 5
Joined: Tue Dec 10, 2013 2:12 am

Re: Service Check time out

Post by eloyd » Mon Mar 12, 2018 9:30 am

I'd be more interested in your VMware host resource usage. Can you look back at network/memory/processor utilization there and see if you were waiting for anything? Maybe another machine that has higher resource priority?
eloyd
Cool Title Here
 
Posts: 1981
Joined: Thu Sep 27, 2012 9:14 am
Location: Rochester, NY

Re: Service Check time out

Post by cdienger » Tue Mar 13, 2018 2:36 pm

Were you able to check the VMware resources suggested by @eloyd?

Another potential place to check would be any firewall devices that the traffic may go through. Perhaps the frequent ICMP and TCP connections are getting flagged as potentially malicious behavior and dropped?
cdienger
Support Tech
 
Posts: 2191
Joined: Tue Feb 07, 2017 11:26 am

Re: Service Check time out

Post by saffer » Tue Apr 10, 2018 8:57 am

Hi,

Apologies for the tardiness. My account was locked.

We checked the VMware host, and it was running very quiet. Since my first post, we have moved to a new server, and the problem continued.

What seems to have quietened it down is doubling the memory. This is weird, as the server never showed memory constraints.
I have worked with Nagios since the early days, and have never seen this behaviour.

Re: Service Check time out

Post by scottwilkerson » Tue Apr 10, 2018 4:49 pm

You would really need to check the VMware resources while the problem is occurring. Specifically, I would be looking at CPU or disk I/O.
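One rough way to watch for I/O stalls from inside the Linux guest while the timeouts are happening is to compare two snapshots of /proc/stat. The following is only a sketch under stated assumptions: the helper names cpu_times and iowait_percent are made up for this example, it relies on the standard field order of the aggregate "cpu" line (user, nice, system, idle, iowait, ...), and it says nothing about what the hypervisor itself sees.

```python
def cpu_times(stat_text):
    """Return (iowait, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            values = [int(v) for v in fields[1:]]
            # Field order: user nice system idle iowait irq softirq steal ...
            # The guest fields that follow are already included in user/nice,
            # so only the first eight columns are summed for the total.
            return values[4], sum(values[:8])
    raise ValueError("no aggregate cpu line found")


def iowait_percent(before_text, after_text):
    """Share of CPU time spent in iowait between two /proc/stat snapshots."""
    io_before, total_before = cpu_times(before_text)
    io_after, total_after = cpu_times(after_text)
    elapsed = total_after - total_before
    if elapsed <= 0:
        return 0.0
    return 100.0 * (io_after - io_before) / elapsed
```

Reading /proc/stat twice a few seconds apart and passing both texts to iowait_percent gives a number to compare between a quiet period and an outage. A high value only hints that processes were waiting on disk; the VMware-side counters are still needed to confirm it.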
scottwilkerson
DevOps Engineer
 
Posts: 12594
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Service Check time out

Post by saffer » Sat May 05, 2018 2:58 pm

Until tonight, my problem had been very quiet. Then the dreaded "service check timed out" issue returned with a vengeance.

I have now set specific settings in Nagios to report OK if such an issue occurs. I have also set the timeout value -T180:3 to report unknown, but this really is a hit-and-miss situation. Some checks get reported critical with "service check timed out", and some get marked as unknown.

It seems to be a 50/50 split. What I am noticing is a lot of socket timeouts and the following errors in the log files. These errors only happen when I get this problem.

nagios: job 12517 (pid=5955): read() returned error 11
nagios: job 12518 (pid=5959): read() returned error 11
nagios: job 12517 (pid=5956): read() returned error 11
nagios: job 12517 (pid=5956): read() returned error 11
nagios: job 12517 (pid=5962): read() returned error 11
nagios: job 12517 (pid=5954): read() returned error 11
nagios: job 12517 (pid=5953): read() returned error 11
nagios: job 12517 (pid=5955): read() returned error 11
nagios: job 12518 (pid=5959): read() returned error 11
nagios: job 12517 (pid=5962): read() returned error 11
nagios: job 12517 (pid=5953): read() returned error 11
nagios: job 12517 (pid=5957): read() returned error 11
nagios: job 12516 (pid=5951): read() returned error 11
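
A note for readers who see the same lines: on Linux, errno 11 is normally EAGAIN ("Resource temporarily unavailable"), which a non-blocking read() returns when no data is ready yet. The sketch below is not part of Nagios. The names summarize and describe are hypothetical, and the regular expression assumes the log format shown above. It tallies the messages per job and looks up the error name on whatever platform it runs on.

```python
import errno
import os
import re
from collections import Counter

# Matches lines such as:
#   nagios: job 12517 (pid=5955): read() returned error 11
LINE_RE = re.compile(r"job (\d+) \(pid=(\d+)\): read\(\) returned error (\d+)")


def summarize(log_lines):
    """Count 'read() returned error N' messages per (job, errno) pair."""
    counts = Counter()
    for line in log_lines:
        match = LINE_RE.search(line)
        if match:
            job = int(match.group(1))
            code = int(match.group(3))
            counts[(job, code)] += 1
    return counts


def describe(code):
    """Return the symbolic name and message for an errno on this platform."""
    name = errno.errorcode.get(code, "unknown")
    return f"{name}: {os.strerror(code)}"
```

Running summarize over the syslog lines from an outage and calling describe(11) on the Nagios host shows at a glance whether a few jobs account for most of the errors. Errno numbers differ between operating systems, so the lookup should be done on the same kind of machine that produced the log.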

Re: Service Check time out

Post by saffer » Sun May 06, 2018 8:40 am

An update for everyone that has contributed.

This problem is being caused by storage: we saw 3.5-second latency during the last two outages.
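
For anyone wanting to spot this kind of storage stall from the Nagios server itself, a small probe that times a write plus fsync can help. This is a minimal sketch, not a Nagios feature: the function name fsync_latencies and its parameters are invented for illustration, and the directory to probe is whichever one your install uses for check results and status files.

```python
import os
import tempfile
import time


def fsync_latencies(directory, samples=5, size=4096):
    """Time a small write followed by fsync, `samples` times, in `directory`.

    Returns a list of durations in seconds. Temporary files are removed.
    """
    payload = b"\0" * size
    results = []
    for _ in range(samples):
        fd, path = tempfile.mkstemp(dir=directory)
        try:
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # force the data out to the storage layer
            results.append(time.perf_counter() - start)
        finally:
            os.close(fd)
            os.unlink(path)
    return results
```

On healthy storage each sample is usually a few milliseconds; values approaching whole seconds, like the 3.5 seconds reported here, would line up with checks timing out. The probe measures only what the guest sees, so the hypervisor and array statistics remain the authoritative source.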

Re: Service Check time out

Post by eloyd » Sun May 06, 2018 8:46 am

Funny. That's almost so obvious that no one thought to ask about it. :-)

Re: Service Check time out

Post by tmcdonald » Mon May 07, 2018 9:55 am

If the issue has been resolved, are we all good to close this up?
Former Nagios employee
tmcdonald
 
Posts: 9118
Joined: Mon Sep 23, 2013 8:40 am

