Service Check time out

saffer
Posts: 5
Joined: Tue Dec 10, 2013 2:12 am

Service Check time out

Post by saffer »

hi folks,

First post in many years that has me truly perplexed.

I am running Nagios 4.3.4, with 10,500+ service checks and some 400 hosts, on a SLES 12 SP3 VM on a VMware host.

Every now and then we get a service check timeout from 1000+ service checks at once, and occasionally host check timeouts.

The perplexing part is that there appears to be no resource issue, as we have more than sufficient memory and CPU. I have played with the service check timeout and increased it from 60 seconds to 180. This helped minimize the issue, but it still occurs randomly. The kernel has been tuned, as well as TCP buffers etc. The typical checks are not Perl. check_nrpe is the primary service check we use, but other service checks such as check_tcp and check_ssh also fail with a service check timeout.
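To see whether the timeouts really arrive in bursts of 1000+ rather than trickling in, it can help to count them per minute from the Nagios log. The following is only a minimal sketch: it assumes each log line starts with a bracketed Unix timestamp (as nagios.log lines do) and that timeout messages contain the words "timed out". The function names are made up for illustration and are not part of Nagios.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Assumption: every relevant line looks like "[<unix epoch>] <message>"
# and a timeout message contains "timed out" (matched case-insensitively).
LINE_RE = re.compile(r"^\[(\d+)\]\s+(.*)$")


def timeouts_per_minute(lines):
    """Count log lines that mention a check timeout, grouped by UTC minute."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match or "timed out" not in match.group(2).lower():
            continue
        stamp = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
        counts[stamp.strftime("%Y-%m-%d %H:%M")] += 1
    return counts


def worst_minutes(lines, top=5):
    """Return the `top` minutes with the most timeout lines."""
    return timeouts_per_minute(lines).most_common(top)
```

Feeding it the lines of the log file (for example `worst_minutes(open(path))`, where `path` is wherever your installation writes nagios.log) gives a list of minutes to line up against hypervisor and storage graphs.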

I have many years of experience with Nagios, and have run environments with >20,000 service checks and 650+ hosts, and have never seen this issue.

So, any thoughts? Sounds like a kooky Nagios version.

cheers
eloyd
Cool Title Here
Posts: 2129
Joined: Thu Sep 27, 2012 9:14 am
Location: Rochester, NY

Re: Service Check time out

Post by eloyd »

I'd be more interested in your VMware host resource usage. Can you look back at network/memory/processor utilization there and see if you were waiting for anything? Maybe another machine that has higher resource priority?
Eric Loyd • http://everwatch.global • 844.240.EVER • @EricLoyd • I'm a Nagios Fanatic!
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: Service Check time out

Post by cdienger »

Were you able to check the VMware resources suggested by @eloyd?

Another potential place to check would be any firewall devices that the traffic may go through. Perhaps the frequent ICMP and TCP connections are getting flagged as potentially malicious behavior and dropped?
saffer
Posts: 5
Joined: Tue Dec 10, 2013 2:12 am

Re: Service Check time out

Post by saffer »

Hi,

Apologies for the tardiness. My account was locked.

We checked the VMware host, and it was running very quiet. Since my first post, we have moved to a new server, and the problem continued.

What seems to have quietened it down is doubling the memory. This is weird, as the server never showed memory constraints.
I have worked with Nagios since the early days, and have never seen this behaviour.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Service Check time out

Post by scottwilkerson »

You would really need to check the VMware resources while the problem is occurring. Specifically, I would be looking at CPU or disk I/O.
Former Nagios employee
saffer
Posts: 5
Joined: Tue Dec 10, 2013 2:12 am

Re: Service Check time out

Post by saffer »

Until tonight, my problem had been very quiet. The dreaded "service check timed out" issue has returned with a vengeance.

I have now set specific settings in Nagios to report OK if such an issue occurs. I have also set the timeout value -T180:3 to report UNKNOWN, but this really is hit and miss. Some checks get reported as CRITICAL with "service check timed out", and some get marked as UNKNOWN.

It seems to be a 50/50 split. What I am noticing is a lot of socket timeouts and the following errors in the log files. These errors only happen when I get this problem.

nagios: job 12517 (pid=5955): read() returned error 11
nagios: job 12518 (pid=5959): read() returned error 11
nagios: job 12517 (pid=5956): read() returned error 11
nagios: job 12517 (pid=5956): read() returned error 11
nagios: job 12517 (pid=5962): read() returned error 11
nagios: job 12517 (pid=5954): read() returned error 11
nagios: job 12517 (pid=5953): read() returned error 11
nagios: job 12517 (pid=5955): read() returned error 11
nagios: job 12518 (pid=5959): read() returned error 11
nagios: job 12517 (pid=5962): read() returned error 11
nagios: job 12517 (pid=5953): read() returned error 11
nagios: job 12517 (pid=5957): read() returned error 11
nagios: job 12516 (pid=5951): read() returned error 11
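For anyone reading these lines later: on Linux, errno 11 is EAGAIN ("Resource temporarily unavailable"), which a non-blocking read() returns when no data is ready yet. On its own that does not identify a cause. The sketch below, using three of the lines above as sample input, just tallies the messages per job and translates the error number; the regular expression and function name are illustrative assumptions, not anything shipped with Nagios.

```python
import os
import re
from collections import Counter

# Matches the worker messages quoted in the post above.
JOB_RE = re.compile(r"job (\d+) \(pid=(\d+)\): read\(\) returned error (\d+)")


def summarize(lines):
    """Return (messages per job id, messages per errno) for matching lines."""
    by_job = Counter()
    by_errno = Counter()
    for line in lines:
        match = JOB_RE.search(line)
        if match:
            by_job[int(match.group(1))] += 1
            by_errno[int(match.group(3))] += 1
    return by_job, by_errno


# Sample input copied from the log excerpt in this thread.
SAMPLE = [
    "nagios: job 12517 (pid=5955): read() returned error 11",
    "nagios: job 12518 (pid=5959): read() returned error 11",
    "nagios: job 12517 (pid=5956): read() returned error 11",
]

by_job, by_errno = summarize(SAMPLE)
for code, count in by_errno.items():
    # os.strerror gives the platform's text for the number; the EAGAIN
    # reading of 11 applies to Linux and may differ on other systems.
    print(f"errno {code} ({os.strerror(code)}): {count} message(s)")
for job, count in sorted(by_job.items()):
    print(f"job {job}: {count} message(s)")
```

A handful of job ids repeating across many pids, as in the excerpt, suggests the same few checks being retried rather than every check failing independently, though that is only a reading of the log and not something the messages state.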
saffer
Posts: 5
Joined: Tue Dec 10, 2013 2:12 am

Re: Service Check time out

Post by saffer »

An update for everyone that has contributed.

This problem is being caused by storage. We saw 3.5-second latency during the last two outages.
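Storage latency of this kind is easy to miss because CPU and memory graphs stay flat. A rough way to probe it from inside the guest is to time small synchronous writes on the filesystem Nagios uses. This is a minimal sketch, not how the latency in this thread was measured: the function names, the 4 KiB write size, and the 1-second threshold are arbitrary choices for illustration.

```python
import os
import tempfile
import time


def fsync_latencies(directory, samples=5, size=4096):
    """Append `size` bytes and fsync, `samples` times; return seconds per round."""
    payload = b"\0" * size
    results = []
    # The probe file is created in `directory` so the write hits that filesystem.
    fd, path = tempfile.mkstemp(dir=directory, prefix="latency_probe_")
    try:
        for _ in range(samples):
            start = time.monotonic()
            os.write(fd, payload)
            os.fsync(fd)  # force the data to stable storage before stopping the clock
            results.append(time.monotonic() - start)
    finally:
        os.close(fd)
        os.unlink(path)
    return results


def slow_samples(latencies, threshold=1.0):
    """Return the latencies (in seconds) that exceed `threshold`."""
    return [value for value in latencies if value > threshold]
```

Running `fsync_latencies()` against the directory that holds the Nagios status and check result files every minute or so, and logging anything `slow_samples()` returns, would give timestamps to compare with the timeout bursts.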
eloyd
Cool Title Here
Posts: 2129
Joined: Thu Sep 27, 2012 9:14 am
Location: Rochester, NY

Re: Service Check time out

Post by eloyd »

Funny. That's almost so obvious that no one thought to ask about it. :-)
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: Service Check time out

Post by tmcdonald »

If the issue has been resolved, are we all good to close this up?
Former Nagios employee