Nagios Support Forum

Posted: **Mon Apr 15, 2013 9:27 am**

Okay, so this is going to be a hard one to describe. It is very interesting and may not end up being a Nagios issue, but we just can't figure it out yet! So I figure I should try and post here and see if anyone else can offer some insight!

Our data center is split in two: half is corporate and the other half is our web retail side. The two sides are separated by firewalls. Currently, the web side is monitored by a Core installation on the corp side, with holes in the firewall of course. I duplicated everything to my XI installation, and the same checks are now being performed by a gearman worker located on the web side of the data center. Alerts are currently shut off on XI for these hosts/services.

On our app servers there are many checks. In Core they are configured to use NRPE or a custom check, check_webmon, which makes an HTTP request on a specific port and looks for something on the page. In XI, the ones using NRPE have been changed to use the custom check.
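For anyone unfamiliar with this style of check, here is a minimal Python sketch of the idea: fetch a page on a given port and look for an expected string. This is not the actual check_webmon script; the function names, the timeout value, and the expected string are all hypothetical, and only the exit codes follow the usual Nagios plugin convention (0 = OK, 2 = CRITICAL).

```python
import urllib.request

# Standard Nagios plugin exit codes.
OK, CRITICAL = 0, 2


def evaluate_page(body, expected):
    """Return (exit_code, message) depending on whether `expected` is in the page."""
    if expected in body:
        return OK, "OK - found %r" % expected
    return CRITICAL, "CRITICAL - %r not found in response" % expected


def check_page(url, expected, timeout=10):
    """Fetch `url` and evaluate it; a network error or timeout is reported as CRITICAL."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError as exc:  # URLError and socket timeouts are OSError subclasses
        return CRITICAL, "CRITICAL - could not reach monitor: %s" % exc
    return evaluate_page(body, expected)
```

A wrapper script would call something like `check_page("http://appserver:8080/status", "ALL GOOD")` (a made-up URL and string), print the message, and exit with the returned code.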

Here is the problem:
On the Core server we get this in the logs:
With the NRPE check - Timeout while attempting connection
With the custom check - Could not reach monitor, exit code 28!

On the XI install, we get the same error with the custom check. Exit code 28 is curl's "operation timed out" error, so it is basically a timeout as well.
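To make the exit code easier to interpret, here is a rough Python sketch that runs curl and translates a few of its documented exit codes. The helper names and the `max_time` default are my own assumptions, and the table covers only a handful of codes related to connection problems.

```python
import subprocess

# A few curl exit codes relevant to connection problems (from curl's documented list).
CURL_EXIT_MEANINGS = {
    0: "success",
    6: "could not resolve host",
    7: "failed to connect to host",
    28: "operation timed out",
    52: "server returned nothing",
    56: "failure receiving network data",
}


def describe_curl_exit(code):
    """Translate a curl exit code into a short description."""
    return CURL_EXIT_MEANINGS.get(code, "other curl error (see the curl man page)")


def run_curl(url, max_time=10):
    """Run curl and return (exit_code, description). Assumes curl is on PATH."""
    result = subprocess.run(
        ["curl", "--silent", "--output", "/dev/null", "--max-time", str(max_time), url]
    )
    return result.returncode, describe_curl_exit(result.returncode)
```

Running `run_curl` by hand from both the Core and XI boxes against one of the app server URLs would show whether the same exit code appears outside of Nagios.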

We did a tcpdump on the servers (XI and Core) and see the same issue on both. There is only one network switch between the XI install and the servers, no firewalls at all.

In the tcpdump, we see the XI and/or Core server RST the connection right after the request is made. The data is even returned, but it is not processed since the connection was terminated. The servers are not under any kind of load either. The Core install is only the one server and does not use gearman.
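In case it helps anyone confirm which side is sending the resets, here is a small Python sketch that tallies RST packets per sending host from tcpdump's text output. It assumes the default one-line format of tcpdump 4.x (`IP src.port > dst.port: Flags [...]`); the function names and the sample addresses are made up for illustration.

```python
import re
from collections import Counter

# Matches tcpdump's default one-line TCP output, e.g.
# "09:27:01.100900 IP 10.0.0.5.45678 > 10.0.0.9.8080: Flags [R], seq 101, win 0, length 0"
LINE_RE = re.compile(r"\bIP6? (\S+) > (\S+): Flags \[([^\]]*)\]")


def host_of(endpoint):
    """Strip the trailing '.port' from a tcpdump 'address.port' endpoint."""
    return endpoint.rsplit(".", 1)[0]


def count_rst_senders(lines):
    """Count packets with the RST flag set, grouped by sending host."""
    senders = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and "R" in match.group(3):
            senders[host_of(match.group(1))] += 1
    return senders


# Made-up sample capture: 10.0.0.5 plays the monitoring box, 10.0.0.9 the app server.
SAMPLE_LINES = [
    "09:27:01.100000 IP 10.0.0.5.45678 > 10.0.0.9.8080: Flags [S], seq 100, win 29200, length 0",
    "09:27:01.100300 IP 10.0.0.9.8080 > 10.0.0.5.45678: Flags [S.], seq 200, ack 101, win 28960, length 0",
    "09:27:01.100900 IP 10.0.0.5.45678 > 10.0.0.9.8080: Flags [R], seq 101, win 0, length 0",
]
```

Feeding it the lines of a saved capture (for example the output of `tcpdump -n -r capture.pcap`, where the file name is hypothetical) gives a per-host count of who actually sent the resets.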

Any thoughts?

Posted: **Mon Apr 15, 2013 10:57 am**

Is this happening specifically when you make any connection to that server via NRPE? Or only when attempting to return data for these specific checks? I believe someone else in the office is replying as I type but let us know!

Posted: **Mon Apr 15, 2013 10:57 am**

Would the attached network diagram be a very basic layout of how it works, or is the XI instance on the web app side of things? Otherwise, considering the tcpdump information, are you sure that XI is sending the reset and not the firewall, especially with the NRPE checks from Core? Are you able to do a wget or curl request to any of the webpages from Core and XI?

Posted: **Mon Apr 15, 2013 11:01 am**

sreinhardt wrote: Would the attached network diagram be a very basic layout of how it works, or is the XI instance on the web app side of things? Otherwise, considering the tcpdump information, are you sure that XI is sending the reset and not the firewall, especially with the NRPE checks from Core? Are you able to do a wget or curl request to any of the webpages from Core and XI?

What attached diagram?
The XI instance has NO firewalls between it and the app servers. The only network devices are switches, which do not send RST packets. The source of the RST is the XI box and the Core box when the issue occurs.

The custom check is a curl command already.

It is happening randomly and, as stated, the source of the RST is the Nagios boxes.

Thanks

Posted: **Mon Apr 15, 2013 11:05 am**

sreinhardt wrote: Would the attached network diagram be a very basic layout of how it works, or is the XI instance on the web app side of things? Otherwise, considering the tcpdump information, are you sure that XI is sending the reset and not the firewall, especially with the NRPE checks from Core? Are you able to do a wget or curl request to any of the webpages from Core and XI?

Ok, let me clarify.

XI communicates with the gearman worker through a hole in the firewall. It is the gearman worker for the XI install that sends the RST. Everything else seems correct in the drawing. The Core install communicates through a hole in the firewall as well.

Posted: **Mon Apr 15, 2013 11:07 am**

Just out of curiosity, is a really short timeout set on the NRPE checks?

Posted: **Mon Apr 15, 2013 11:07 am**

abrist wrote: Just out of curiosity, is a really short timeout set on the NRPE checks?

60 seconds

Posted: **Mon Apr 15, 2013 11:28 am**

Have you tried increasing this to 120+ seconds for testing purposes? It would be unusual for them to take longer than 60 seconds, but it would be a worthwhile test.

Posted: **Mon Apr 15, 2013 11:30 am**

slansing wrote: Have you tried increasing this to 120+ seconds for testing purposes? It would be unusual for them to take longer than 60 seconds, but it would be a worthwhile test.

We have not, but I was considering doing it.

Posted: **Mon Apr 15, 2013 11:34 am**

One would hope 60 seconds was enough though. What type of checks are these? Are you firing off really heavy scripts?

Connection reset