occasional "socket timeout after 10 seconds"

bolson

Re: occasional "socket timeout after 10 seconds"

Post by bolson »

Probably not, as "No data was received from host" and "could not fetch information from server" are not timeout related. You may want to run the following against a host that returns these random errors:

Code: Select all

ping ip_address > ping.txt
Let it run for several hours (stop with ^C), then examine the file to see whether packet drops coincide with these errors.
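
A capture that runs for hours is tedious to read by eye. The sketch below is one possible way to scan ping.txt for replies that never arrived. It assumes the Linux iputils output format, where each reply line carries an `icmp_seq=N` field; the function name `missing_sequences` is made up for illustration, and a very long capture may wrap the sequence counter, which this sketch does not handle.

```python
import re

# Assumption: Linux iputils ping output, e.g.
# "64 bytes from 10.0.0.1: icmp_seq=7 ttl=52 time=251 ms"
SEQ_RE = re.compile(r"icmp_seq=(\d+)")


def missing_sequences(lines):
    """Return the icmp_seq numbers between the first and last reply that never appear."""
    seen = set()
    for line in lines:
        match = SEQ_RE.search(line)
        if match:
            seen.add(int(match.group(1)))
    if not seen:
        return []
    expected = set(range(min(seen), max(seen) + 1))
    return sorted(expected - seen)


# Example use (ping.txt is the file produced by the command above):
# with open("ping.txt") as fh:
#     print(missing_sequences(fh))
```

Comparing the gaps against the times of the Nagios alerts shows whether the two line up.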
caterpillartce
Posts: 117
Joined: Mon Jul 11, 2016 11:22 am

Re: occasional "socket timeout after 10 seconds"

Post by caterpillartce »

ping did not seem to capture any issue - I am having "could not fetch information from server" alerts with one server today every few minutes but ping results have been fine. Anything else I can check?

Thanks
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: occasional "socket timeout after 10 seconds"

Post by tgriep »

Log in to that server and look in the nsclient.log file for errors logged around the time the Nagios server displayed the "could not fetch information from server" message.
One thing I have found to cause that error: if a firewall or router between the remote host and the Nagios server is NATing the IP address, the address can change, and you would then have to add the new address to the nsclient.ini file.
Either way, the nsclient.log file should contain the error.
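
Because ping only exercises ICMP, a clean ping does not rule out trouble on the TCP connection that the check itself uses. As a rough complement, the sketch below times a plain TCP connect from the Nagios server to the agent. The port 12489 is the usual NSClient++ check_nt default and is an assumption here (use whatever port your agent listens on); `tcp_connect_time` is a hypothetical helper name.

```python
import socket
import time


def tcp_connect_time(host, port=12489, timeout=10.0):
    """Return the seconds taken to open a TCP connection, or None if it fails.

    Assumption: 12489 is the agent's check_nt port; adjust if yours differs.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:  # covers refused connections and timeouts
        return None


# Example use: probe repeatedly and log the slow or failed attempts.
# while True:
#     elapsed = tcp_connect_time("ip_address")
#     print(time.strftime("%Y-%m-%d %H:%M:%S"), elapsed)
#     time.sleep(5)
```

Failed or slow connects at the same moments as the alerts would point at the network path rather than at the agent.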
Be sure to check out our Knowledgebase for helpful articles and solutions!
caterpillartce
Posts: 117
Joined: Mon Jul 11, 2016 11:22 am

Re: occasional "socket timeout after 10 seconds"

Post by caterpillartce »

The nsclient.log file is full of the errors below. There should be no firewall between the monitored server and the Nagios server, though they are far from each other: the Nagios server is in the US and the monitored server is in Singapore. We have quite a few servers in Singapore and only 4 are having this issue.

2017-08-26 02:07:47: error:c:\source\nscp\include\socket/connection.hpp:137: Failed to read data: The I/O operation has been aborted because of either a thread exit or an application request
2017-08-26 02:08:21: error:c:\source\nscp\include\socket/connection.hpp:137: Failed to read data: The I/O operation has been aborted because of either a thread exit or an application request
2017-08-26 02:08:45: error:c:\source\nscp\include\socket/connection.hpp:137: Failed to read data: The I/O operation has been aborted because of either a thread exit or an application request
2017-08-26 02:09:19: error:c:\source\nscp\include\socket/connection.hpp:137: Failed to read data: The I/O operation has been aborted because of either a thread exit or an application request
2017-08-26 02:10:17: error:c:\source\nscp\include\socket/connection.hpp:137: Failed to read data: The I/O operation has been aborted because of either a thread exit or an application request
2017-08-26 02:11:15: error:c:\source\nscp\include\socket/connection.hpp:137: Failed to read data: The I/O operation has been aborted because of either a thread exit or an application request
2017-08-26 02:11:46: error:c:\source\nscp\include\socket/connection.hpp:137: Failed to read data: The I/O operation has been aborted because of either a thread exit or an application request
2017-08-26 02:12:13: error:c:\source\nscp\include\socket/connection.hpp:137: Failed to read data: The I/O operation has been aborted because of either a thread exit or an application request
2017-08-26 02:12:45: error:c:\source\nscp\include\socket/connection.hpp:137: Failed to read data: The I/O operation has been aborted because of either a thread exit or an application request
2017-08-26 02:13:17: error:c:\source\nscp\include\socket/connection.hpp:243: Failed to establish secure connection: short read: 219
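
To see how often each error occurs and whether it clusters at particular times, a log like the one above can be tallied with a short script. This is only a sketch: it assumes every line follows the `timestamp: level:source:line: message` layout shown here, and `tally_errors` is a name invented for the example.

```python
import re
from collections import Counter

# Assumption: lines look like the excerpt above, i.e.
# "<YYYY-MM-DD HH:MM:SS>: <level>:<source file>:<line number>: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): "
    r"(?P<level>\w+):(?P<src>.*?):(?P<lineno>\d+): (?P<msg>.*)$"
)


def tally_errors(lines):
    """Count error messages per (hour, message) pair."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("level") == "error":
            hour = match.group("ts")[:13]  # "YYYY-MM-DD HH"
            counts[(hour, match.group("msg"))] += 1
    return counts


# Example use:
# with open("nsclient.log") as fh:
#     for (hour, msg), n in sorted(tally_errors(fh).items()):
#         print(hour, n, msg)
```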
dwhitfield
Former Nagios Staff
Posts: 4583
Joined: Wed Sep 21, 2016 10:29 am
Location: NoLo, Minneapolis, MN

Re: occasional "socket timeout after 10 seconds"

Post by dwhitfield »

As was already said, this could be a bottleneck issue on those servers that are failing. Perhaps the drives are starting to go bad.

That said, please attach your nsclient.ini file.

Also, if you are not using NSClient++ 0.4.4, you should uninstall what you have, remove the directory and any related configs left behind, and install https://github.com/mickem/nscp/releases ... 23-x64.msi (unless you need 32-bit, in which case use https://github.com/mickem/nscp/releases ... -Win32.msi)
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: occasional "socket timeout after 10 seconds"

Post by tgriep »

The first thing to try is to stop the NSClient++ agent on the Windows server, verify that it is not running, then start it again and see if that fixes the issue.
Second, remove the NSClient++ agent and install the latest stable 0.4.4.xx version of NSClient++. You can get that here:
https://nsclient.org/download/0.4.4/
When removing the agent, uninstall it, delete its folders on the C drive, and then install the 0.4.4.xx version.
Let us know how this works out.
caterpillartce
Posts: 117
Joined: Mon Jul 11, 2016 11:22 am

Re: occasional "socket timeout after 10 seconds"

Post by caterpillartce »

Hi, tgriep,

I tried the following per your suggestion; however, those servers are still giving intermittent but frequent alerts like "could not fetch information from server" and "No data was received from host!"

1. stopped and restarted nsclient
2. removed the old nsclient (was on version 0.4.4.19), verified nsclient folder was deleted, then installed the latest version 0.4.4.23

I will PM you nsclient.ini and the latest log file. Look forward to further help on this as these frequent alerts are filling up our inboxes! Thanks
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: occasional "socket timeout after 10 seconds"

Post by tgriep »

Profile received and shared with the fellow Nagios Techs.

Let's make some changes to the NSClient++ agent's ini file and see if you can get it to work consistently.

Edit the nsclient.ini file and under this section

Code: Select all

[/settings/default]
Add this line to increase the timeout to 120 seconds.

Code: Select all

timeout = 120

Then add this to the bottom of the ini file to enable debugging in the nsclient.log file.

Code: Select all

; Configure log properties.
[/settings/log]

; LOG LEVEL - Log level to use. Available levels are error,warning,info,debug,trace
level = debug

; FILENAME - The file to write log data to. Set this to none to disable log to file.
file name = nsclient.log

; DATEMASK - The date format used for timestamps written to the log file.
date format = %Y-%m-%d %H:%M:%S
Save and restart the NSClient++ agent.
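
Before restarting, it can be worth confirming that the edits landed in the sections you intended. The following is a minimal sketch using Python's standard `configparser`, assuming the file is plain INI as shown above; `read_agent_settings` is a hypothetical helper, and the real nsclient.ini may contain constructs this simple reader does not cover.

```python
import configparser


def read_agent_settings(ini_text):
    """Return the timeout and logging values from nsclient.ini text (None if unset)."""
    # interpolation=None: the date format contains literal % characters.
    # strict=False: tolerate duplicate sections or keys in a hand-edited file.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(ini_text)
    return {
        "timeout": parser.get("/settings/default", "timeout", fallback=None),
        "log level": parser.get("/settings/log", "level", fallback=None),
        "log file": parser.get("/settings/log", "file name", fallback=None),
    }


# Example use (the path is an assumption; point it at your own nsclient.ini):
# with open(r"C:\Program Files\NSClient++\nsclient.ini") as fh:
#     print(read_agent_settings(fh.read()))
```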

Then the next time you get the No data or Fetch error, look in the nsclient.log file to see what the error is, and resolve it if you can.

Most of the time when we have heard about this type of failure, it is because the network between the Nagios server and the Windows server is NATing the IP address, and to fix it we have had to add the IP address in the device that is NATing the traffic.

Can you describe the path between the Nagios server and the Windows Host?

If the path is going through a device, try adding its IP address to it and see if that helps.
caterpillartce
Posts: 117
Joined: Mon Jul 11, 2016 11:22 am

Re: occasional "socket timeout after 10 seconds"

Post by caterpillartce »

Thanks for the reply!

I updated the ini file and restarted the service, but the log file gives no indication of what could be wrong. It is just the two lines below, repeating:

2017-09-07 02:18:21: debug:c:\source\nscp\include\check_nt/server/protocol.hpp:61: Accepting connection from: (Nagios server IP)
2017-09-07 02:18:23: error:c:\source\nscp\include\socket/connection.hpp:137: Failed to read data: End of file

I then got hold of our network support, who checked and said there was no NATing at all. He also ran a test, sending traffic to one of the problematic servers, and said that out of more than 5500 packets sent only 1 was dropped.

I asked about the path between the Nagios server and the problematic servers (those servers are in Singapore). Here is his reply, and it does not sound feasible to add the Nagios server's IP to that many devices in between: "the nagios server sits behind a load balancer in the US. So it passes through the load balancer, dc switch, MAN Router, WAN router, Wan Circuit to Sing Wan router and then a few Sing lan switches before hitting the server".

Is there a switch in the ini file that we can turn on to show the real error? Or is there anything else we can do to troubleshoot?

Thanks
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: occasional "socket timeout after 10 seconds"

Post by tgriep »

The debugging settings that were added to the ini file should show all of the details that the agent can log.

Sorry if my explanation was confusing. I did not mean that you should add the XI server's IP address to all of the devices; I meant that you should add the IP addresses of the devices between the XI server and the Windows server to the allowed hosts section in the nsclient.ini file.
Try adding the load balancer's IP address to the nsclient.ini file, and see if the balancer can be configured to not distribute the network traffic to the XI server.
If the balancer changes the path in the middle of the transfer, I would guess that would cause the error you are seeing.
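
To decide which addresses belong in the allowed hosts list, it helps to know which source addresses the agent actually sees. With debug logging on, the agent writes "Accepting connection from:" lines like the one quoted earlier in this thread, so a rough sketch such as the one below can count them per source. The exact shape of the address after the colon is an assumption (it was redacted in the excerpt), and `connection_sources` is a name invented for this example.

```python
import re
from collections import Counter

# Assumption: the address follows the text below as one whitespace-free token.
ACCEPT_RE = re.compile(r"Accepting connection from: (\S+)")


def connection_sources(log_lines):
    """Count accepted connections per source address as logged by the agent."""
    sources = Counter()
    for line in log_lines:
        match = ACCEPT_RE.search(line)
        if match:
            sources[match.group(1)] += 1
    return sources


# Example use:
# with open("nsclient.log") as fh:
#     for address, n in connection_sources(fh).most_common():
#         print(address, n)
```

If more than one address shows up, or the address is the load balancer's rather than the XI server's, that is the one to compare against the allowed hosts entry.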