Connection reset

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Connection reset

Post by BanditBBS »

Okay, so this is going to be a hard one to describe. It is very interesting and may not end up being a Nagios issue, but we just can't figure it out yet! So I figure I should try and post here and see if anyone else can offer some insight!

Our data center is split in two. Half is corporate and the other half is our web retail side. The two sides are separated by firewalls. Currently, the web side is being monitored by a Core installation on the corp side, with holes in the firewall, of course. I duplicated everything to my XI installation and now have the same checks being performed by a gearman worker located on the web side of the data center. Alerts are currently shut off on XI for these hosts/services.

On our app servers there are many checks. In Core they are configured to use NRPE or a custom check, check_webmon, which makes an HTTP request on a specific port and looks for something on the page. In XI, the ones using NRPE have been changed to use the custom check.
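For anyone not familiar with this style of check, here is a minimal sketch of the idea in Python. It is not the actual check_webmon (which is a curl command); the function names, the default timeout, and the message wording are assumptions made up for illustration. Only the Nagios plugin return codes (0 OK, 2 CRITICAL) are standard.

```python
import http.client
import urllib.request

# Standard Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def evaluate_body(body, expected):
    """Return (status, message) depending on whether `expected` is in the page."""
    if expected in body:
        return OK, f"OK - found {expected!r} in response"
    return CRITICAL, f"CRITICAL - {expected!r} not found in response"


def check_page(url, expected, timeout=10.0):
    """Fetch `url` and look for `expected` in the body (hypothetical helper).

    Connection failures, resets, and timeouts all surface as OSError
    subclasses in urllib, so they are reported as CRITICAL here.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except (OSError, http.client.HTTPException) as exc:
        return CRITICAL, f"CRITICAL - could not reach monitor: {exc}"
    return evaluate_body(body, expected)
```

A plugin wrapper would call `check_page("http://apphost:8080/status", "expected text")`, print the message, and exit with the status. The host, port, and path in that call are placeholders.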

Here is the problem:
On the Core server we get this in the logs:
With the NRPE check - Timeout while attempting connection
With the custom check - Could not reach monitor, exit code 28!

On the XI install, we get the same error with the custom check. Exit code 28 is curl's operation-timeout code, so this is basically a timeout as well.
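To make the exit code easier to read in check output, a wrapper could translate the common curl exit codes into text. This is only a sketch: the function name and the message format are assumptions modeled on the error quoted above, while the code meanings themselves come from the curl man page.

```python
# A few curl exit codes that tend to show up in network checks
# (meanings as documented in the curl man page).
CURL_EXIT_MEANINGS = {
    6: "could not resolve host",
    7: "failed to connect to host",
    28: "operation timed out",
    52: "empty reply from server",
    56: "failure receiving network data",
}

# Standard Nagios plugin return codes used below.
OK, CRITICAL = 0, 2


def describe_curl_exit(code):
    """Map a curl exit code to a (nagios_status, message) pair."""
    if code == 0:
        return OK, "OK - request completed"
    meaning = CURL_EXIT_MEANINGS.get(code, "see the curl man page")
    return CRITICAL, f"CRITICAL - Could not reach monitor, exit code {code} ({meaning})"
```

With this, `describe_curl_exit(28)` reports the timeout explicitly instead of leaving the reader to look the number up.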

We did a tcpdump on the servers (XI and Core) and see the same issue on both. There is only one network switch between the XI install and the servers, no firewalls at all.

In the tcpdump, we see the XI and/or Core server RST the connection right after the request is made. The data is even returned, but it is not processed since the connection was terminated. The servers are not under any kind of load either. The Core install is only the one server and does not use gearman.
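One way to confirm which host originates the resets is to tally RST packets by source address from a text capture. The sketch below assumes the default one-line-per-packet output of `tcpdump -nn` for TCP over IPv4; the function names and the sample lines (addresses, ports, sequence numbers) are made up for illustration and are not from the actual capture.

```python
import re
from collections import Counter

# Matches lines such as:
#   12:00:01.001500 IP 10.0.0.5.40112 > 10.0.0.9.8080: Flags [R], seq 180, win 0, length 0
LINE_RE = re.compile(
    r"IP\s+(?P<src>\S+)\s+>\s+(?P<dst>\S+):\s+Flags\s+\[(?P<flags>[^\]]*)\]"
)


def split_host_port(endpoint):
    """Split tcpdump's 'a.b.c.d.port' endpoint notation into (host, port)."""
    host, _, port = endpoint.rpartition(".")
    return host, port


def count_rst_sources(lines):
    """Count packets carrying the RST flag, keyed by source host."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match and "R" in match.group("flags"):
            host, _ = split_host_port(match.group("src"))
            counts[host] += 1
    return counts


# Hypothetical capture: the client (10.0.0.5) resets right after its request,
# and the server's reply arrives afterwards.
SAMPLE = [
    "12:00:01.000100 IP 10.0.0.5.40112 > 10.0.0.9.8080: Flags [S], seq 100, win 29200, length 0",
    "12:00:01.000300 IP 10.0.0.9.8080 > 10.0.0.5.40112: Flags [S.], seq 500, ack 101, win 28960, length 0",
    "12:00:01.000900 IP 10.0.0.5.40112 > 10.0.0.9.8080: Flags [P.], seq 101:180, ack 501, win 229, length 79",
    "12:00:01.001500 IP 10.0.0.5.40112 > 10.0.0.9.8080: Flags [R], seq 180, win 0, length 0",
    "12:00:01.002000 IP 10.0.0.9.8080 > 10.0.0.5.40112: Flags [P.], seq 501:900, ack 180, win 227, length 399",
]
```

Feeding a saved capture through `count_rst_sources(open("capture.txt"))` (the file name is a placeholder) gives a per-host RST count, which makes it easy to see whether the monitoring box or the app server is the one sending them.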

Any thoughts?
2 of XI5.6.14 Prod/DR/DEV - Nagios LogServer 2 Nodes
See my projects on the Exchange at BanditBBS - Also check out my Nagios stuff on my personal page at Bandit's Home and at github
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: Connection reset

Post by slansing »

Is this happening specifically when you make any connection to that server via NRPE? Or only when attempting to return data for these specific checks? I believe someone else in the office is replying as I type but let us know!
sreinhardt
-fno-stack-protector
Posts: 4366
Joined: Mon Nov 19, 2012 12:10 pm

Re: Connection reset

Post by sreinhardt »

Would the attached network diagram be a very basic layout of how it works, or is the XI instance on the web app side of things? Otherwise, considering the tcpdump information, are you sure that XI is sending the reset and not the firewall, especially with the NRPE checks from Core? Are you able to do a wget or curl request to any of the web pages from Core and XI?
Nagios-Plugins maintainer exclusively, unless you have other C language bugs with open-source nagios projects, then I am happy to help! Please pm or use other communication to alert me to issues as I no longer track the forum.
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Re: Connection reset

Post by BanditBBS »

sreinhardt wrote: Would the attached network diagram be a very basic layout of how it works, or is the XI instance on the web app side of things? Otherwise, considering the tcp dump information, are you sure that XI is sending the reset and not the firewall, especially with the nrpe checks from core? Are you able to do a wget or curl request to any of the webpages from core and XI?
What attached diagram?
The XI instance has NO firewalls between it and the app servers. The only network devices are switches, which do not send RST packets. The source of the RST is the XI box and the Core box when the issue occurs.

The custom check is already a curl command.

It is happening randomly and as stated, the source of the RST is the nagios boxes.

Thanks
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Re: Connection reset

Post by BanditBBS »

sreinhardt wrote: Would the attached network diagram be a very basic layout of how it works, or is the XI instance on the web app side of things? Otherwise, considering the tcp dump information, are you sure that XI is sending the reset and not the firewall, especially with the nrpe checks from core? Are you able to do a wget or curl request to any of the webpages from core and XI?
Ok, let me clarify.

XI communicates with the gearman worker through a hole in the firewall. It is the gearman worker for the XI install that sends the RST. Everything else in the drawing looks correct. The Core install communicates through a hole in the firewall as well.
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Connection reset

Post by abrist »

Just out of curiosity, is a really short timeout set on the NRPE checks?
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Re: Connection reset

Post by BanditBBS »

abrist wrote: Just out of curiosity, is a really short timeout set on the nrpe checks?
60 seconds
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: Connection reset

Post by slansing »

Have you tried increasing this to 120+ seconds for testing purposes? It would be unusual for them to take longer, but it would be a worthwhile test.
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Re: Connection reset

Post by BanditBBS »

slansing wrote: Have you tried increasing this to 120+ seconds for testing purposes? It would be unusual for them to take longer, but it would be a worthwhile test.
We have not, though I was considering it.
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Connection reset

Post by abrist »

One would hope 60 seconds was enough though. What type of checks are these? Are you firing off really heavy scripts?