Connection reset
Posted: Mon Apr 15, 2013 9:27 am
Okay, so this is going to be a hard one to describe. It is very interesting and may not end up being a Nagios issue, but we just can't figure it out yet! So I figure I should try and post here and see if anyone else can offer some insight!
Our data center is split in two. Half is corporate and the other half is our web retail side. The two sides are separated by firewalls. Currently, the web side is monitored by a Core installation on the corp side, with holes in the firewall of course. I duplicated everything to my XI installation and now have the same checks being performed by a gearman worker located on the web side of the data center. Alerts are currently shut off on XI for these hosts/services.
On our app servers there are many checks. In Core they are configured to use either NRPE or a custom check, check_webmon, which makes an HTTP request on a specific port and looks for something on the page. In XI, the ones using NRPE have been changed to use the custom check.
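For anyone not familiar with this kind of check, here is a rough Python sketch of what a check like check_webmon does. This is not the actual plugin: the function names, the 10 second default timeout, and the message text are all made up for illustration, and only the general idea (fetch a page on a given port, look for a string, return a Nagios exit code) comes from our setup.

```python
import socket
import sys
import urllib.request

OK, CRITICAL = 0, 2  # standard Nagios plugin exit codes


def evaluate_body(body, needle):
    """Return (exit_code, message) depending on whether needle is in body."""
    if needle in body:
        return OK, "OK - found %r on page" % needle
    return CRITICAL, "CRITICAL - %r not found on page" % needle


def check_page(url, needle, timeout=10.0):
    """Fetch url and look for needle; names and defaults are hypothetical."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except socket.timeout:
        # Raised when the server accepts the connection but the read stalls.
        return CRITICAL, "CRITICAL - timed out after %.0fs" % timeout
    except OSError as exc:
        # Covers URLError/HTTPError, refused and reset connections.
        return CRITICAL, "CRITICAL - could not reach monitor: %s" % exc
    return evaluate_body(body, needle)


if __name__ == "__main__" and len(sys.argv) == 3:
    code, message = check_page(sys.argv[1], sys.argv[2])
    print(message)
    sys.exit(code)
```

Run it with a URL and the string to look for as the two arguments; the exit code is what Nagios would act on.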
Here is the problem:
On the Core server we get this in the logs:
With the NRPE check - Timeout while attempting connection
With the custom check - Could not reach monitor, exit code 28!
On the XI install, we get the same error with the custom check. The error code 28 is basically a timeout as well.
We ran tcpdump on both monitoring servers (XI and Core) and see the same issue on each. There is only one network switch between the XI install and the app servers, no firewalls at all.
In the tcpdump, we see the XI and/or Core server send a RST on the connection right after the request is made. The data is even returned, but it is not processed since the connection was already terminated. The servers are not under any type of load either. The Core install is a single server and does not use gearman.
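One thing that might help narrow this down is a bare-socket probe run from the monitoring server, to see whether a plain TCP client gets the response when the plugin does not. Below is a minimal Python sketch of that idea. The function name and the result labels are my own invention, and it only reports what the client side sees, so it would complement the tcpdump rather than replace it.

```python
import socket


def probe(host, port, payload, timeout=5.0):
    """Send payload over TCP and classify what happens to the connection."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(payload)
            data = sock.recv(4096)
    except socket.timeout:
        return "timeout"   # no connection or no reply within the timeout
    except ConnectionRefusedError:
        return "refused"   # nothing listening on that port
    except ConnectionResetError:
        return "reset"     # the peer answered with a RST
    except OSError as exc:
        return "error: %s" % exc
    # An empty read means the peer closed the connection without sending data.
    return "data" if data else "closed"
```

For example, `probe("appserver", 8080, b"GET / HTTP/1.0\r\n\r\n")` with a placeholder host and port. If that returns "data" while the check still times out, it would suggest the problem is in the client side of the check rather than on the app servers.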
Any thoughts?