Okay, so this is going to be a hard one to describe. It is very interesting and may not end up being a Nagios issue, but we just can't figure it out yet! So I figure I should try and post here and see if anyone else can offer some insight!
Our data center is split in two. Half is corporate and the other half is our web retail side. The two sides are separated by firewalls. Currently, the web side is monitored by a Core installation on the corp side, with holes in the firewall of course. I duplicated everything to my XI installation, and the same checks are now also being performed by a gearman worker located on the web side of the data center. Alerts are currently shut off on XI for these hosts/services.
On our app servers there are many checks. In Core they are configured to use NRPE or a custom check, check_webmon, which makes an HTTP request on a specific port and looks for something on the page. In XI, the ones using NRPE have been changed to use the custom check.
Here is the problem:
On the Core server we get this in the logs:
With the NRPE check - Timeout while attempting connection
With the custom check - Could not reach monitor, exit code 28!
On the XI install, we get the same error with the custom check. The error code 28 is basically a timeout as well.
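For anyone trying to reproduce this outside of Nagios, here is a minimal Python sketch of a check along the lines described above (HTTP request to a port, look for a string on the page). It is not the actual check_webmon script, which is curl-based; the name `check_page` and its parameters are hypothetical. It reports how long the request ran before failing, which may help tell a genuine timeout (what curl reports as exit code 28) apart from an immediate connection failure. Exit codes follow the standard Nagios plugin convention (0 = OK, 2 = CRITICAL).

```python
import http.client
import socket
import time

# Standard Nagios plugin exit codes used by this sketch.
OK = 0
CRITICAL = 2


def check_page(host, port, path, needle, timeout=10.0):
    """Fetch http://host:port/path and look for `needle` in the body.

    Returns (exit_code, message). All names here are illustrative and
    are not taken from the real check_webmon script.
    """
    start = time.monotonic()
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        body = conn.getresponse().read().decode("utf-8", "replace")
    except socket.timeout:
        # Connect or read exceeded `timeout` (roughly curl's exit code 28).
        elapsed = time.monotonic() - start
        return CRITICAL, f"CRITICAL - timed out after {elapsed:.1f}s"
    except (OSError, http.client.HTTPException) as exc:
        # Refused, reset, or malformed response: fails without waiting.
        elapsed = time.monotonic() - start
        return CRITICAL, f"CRITICAL - failed after {elapsed:.1f}s: {exc}"
    finally:
        conn.close()

    elapsed = time.monotonic() - start
    if needle in body:
        return OK, f"OK - found {needle!r} in {elapsed:.2f}s"
    return CRITICAL, f"CRITICAL - {needle!r} not found ({elapsed:.2f}s)"
```

Calling `check_page(host, port, "/", "some text", timeout=60)` in a loop from the Nagios box and logging the messages would show whether the failures really sit at the full timeout or fail much sooner.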
We did a tcpdump on the servers (XI and Core) and see the same issue on both. There is only one network switch between the XI install and the servers, no firewalls at all.
In the tcpdump, we see the XI and/or Core server RST the connection right after the request is made. The data is even returned, but is not processed since the connection was terminated. The servers are not under any kind of load either. The Core install is only the one server and does not use gearman.
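To double-check which host is originating the resets, a capture can be tallied by source address. The sketch below is only an illustration: it assumes tcpdump's default one-line IPv4 text output with name resolution disabled (`-nn`), where each TCP line contains `IP src.port > dst.port: Flags [...]`. The function name `count_resets` is made up for this example.

```python
import re
from collections import Counter

# Matches the address/flags part of a tcpdump text line, e.g.
#   IP 10.0.0.5.41234 > 10.0.0.9.8080: Flags [R], seq 1, win 0, length 0
LINE_RE = re.compile(
    r"IP (?P<src>\d+\.\d+\.\d+\.\d+)\.(?P<sport>\d+) > "
    r"(?P<dst>\d+\.\d+\.\d+\.\d+)\.(?P<dport>\d+): "
    r"Flags \[(?P<flags>[^\]]+)\]"
)


def count_resets(lines):
    """Count TCP segments with the RST flag set, keyed by source IP."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        # tcpdump prints RST as "R" (or "R." when ACK is also set).
        if match and "R" in match.group("flags"):
            counts[match.group("src")] += 1
    return counts
```

Feeding it the lines printed by `tcpdump -nn -r` on a saved capture gives a per-host RST count, which makes it easy to confirm whether the Nagios boxes are the only senders.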
Any thoughts?
Connection reset
2 of XI5.6.14 Prod/DR/DEV - Nagios LogServer 2 Nodes
See my projects on the Exchange at BanditBBS - Also check out my Nagios stuff on my personal page at Bandit's Home and at github
-
slansing
- Posts: 7698
- Joined: Mon Apr 23, 2012 4:28 pm
- Location: Travelling through time and space...
Re: Connection reset
Is this happening specifically when you make any connection to that server via NRPE? Or only when attempting to return data for these specific checks? I believe someone else in the office is replying as I type but let us know!
-
sreinhardt
- -fno-stack-protector
- Posts: 4366
- Joined: Mon Nov 19, 2012 12:10 pm
Re: Connection reset
Would the attached network diagram be a very basic layout of how it works, or is the XI instance on the web app side of things? Otherwise, considering the tcp dump information, are you sure that XI is sending the reset and not the firewall, especially with the nrpe checks from core? Are you able to do a wget or curl request to any of the webpages from core and XI?
Nagios-Plugins maintainer exclusively, unless you have other C language bugs with open-source nagios projects, then I am happy to help! Please pm or use other communication to alert me to issues as I no longer track the forum.
Re: Connection reset
sreinhardt wrote: Would the attached network diagram be a very basic layout of how it works, or is the XI instance on the web app side of things? Otherwise, considering the tcp dump information, are you sure that XI is sending the reset and not the firewall, especially with the nrpe checks from core? Are you able to do a wget or curl request to any of the webpages from core and XI?
What attached diagram?
The XI instance has NO firewalls between it and the app servers. The only network devices are switches, which do not send RST packets. The source of the RST is the XI box and the Core box when the issue occurs.
The custom check is already a curl command.
It is happening randomly and as stated, the source of the RST is the nagios boxes.
Thanks
Re: Connection reset
sreinhardt wrote: Would the attached network diagram be a very basic layout of how it works, or is the XI instance on the web app side of things? Otherwise, considering the tcp dump information, are you sure that XI is sending the reset and not the firewall, especially with the nrpe checks from core? Are you able to do a wget or curl request to any of the webpages from core and XI?
Ok, let me clarify.
XI communicates with the gearman worker through a hole in the firewall. It is the gearman worker for the XI install that sends the RST. Everything else in the drawing seems correct. The Core install communicates through a hole in the firewall as well.
Re: Connection reset
Just out of curiosity, is a really short timeout set on the nrpe checks?
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
Re: Connection reset
abrist wrote: Just out of curiosity, is a really short timeout set on the nrpe checks?
60 seconds
-
slansing
Re: Connection reset
Have you tried increasing this to 120+ seconds for testing purposes? It would be unusual for the checks to take longer than 60 seconds, but it would be a worthwhile test.
Re: Connection reset
slansing wrote: Have you tried increasing this to 120 + seconds for testing purposes? It would be unusual for them to take more but it would be a worth while test.
We have not, I was considering doing it though.
Re: Connection reset
One would hope 60 seconds was enough though. What type of checks are these? Are you firing off really heavy scripts?