TCP/UDP ports stop responding

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Re: TCP/UDP ports stop responding

Post by mcapra »

Just a quick update on this: it definitely looks like there's some weirdness happening with 10 nodes. My lsof counts are also incredibly high in a 10-node environment, even when the number of logs being sent is relatively low. We'll continue to monitor this and provide updates.
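For anyone who wants to track the same symptom over time, here is a minimal sketch of counting open file descriptors per Logstash process by reading /proc on Linux. It is only an approximation of what lsof reports (lsof also lists memory-mapped files and other entries), and the function names and the "logstash" match string are assumptions for illustration, not part of Nagios Log Server.

```python
import os


def count_open_fds(pid, proc_root="/proc"):
    """Return the number of open file descriptors for pid, or None if unreadable."""
    fd_dir = os.path.join(proc_root, str(pid), "fd")
    try:
        return len(os.listdir(fd_dir))
    except OSError:
        # Process exited, or we lack permission to inspect it.
        return None


def find_pids(name_fragment, proc_root="/proc"):
    """Return sorted PIDs whose command line contains name_fragment."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                # Arguments in cmdline are separated by NUL bytes.
                cmdline = fh.read().replace(b"\0", b" ").decode(errors="replace")
        except OSError:
            continue
        if name_fragment in cmdline:
            pids.append(int(entry))
    return sorted(pids)


if __name__ == "__main__":
    # "logstash" is an assumed match string; adjust it to your process name.
    for pid in find_pids("logstash"):
        print(pid, count_open_fds(pid))
```

Run it as root so the fd directories of other users' processes are readable. Logging the output every few minutes should show whether the count keeps climbing while log volume stays flat.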
Former Nagios employee
https://www.mcapra.com/
jspink
Posts: 43
Joined: Wed Nov 25, 2015 3:27 pm

Re: TCP/UDP ports stop responding

Post by jspink »

Dropped to 9 nodes. Sometime between 4pm Friday and 11am Sunday, 6 of the remaining nodes stopped responding altogether, so I had 3 nodes attempting to take in logs for everything.
I had to reboot all nodes, so lsof isn't going to be helpful, and since I have a ton of servers trying to catch up on 2 days of logs, I doubt a tail of the logstash log will help much either, but I wanted to let you know it happened.
Nagios Log Server: 10 Instances - 3,916,302,797 documents last check in 180 shards
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Re: TCP/UDP ports stop responding

Post by mcapra »

This looks like an issue with Logstash that traces back a few months:
https://github.com/elastic/logstash/issues/4815
https://github.com/elastic/logstash/issues/4225

Others have remedied the problem by scheduling the logstash service to restart at a regular interval (via cron). You could give that a shot, though I realize this is less than ideal. From what I gather, Logstash at times does not properly close connections, which creates a sort of blockage on the back-end.
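As a rough sketch of that workaround, a small script like the one below could be called from cron. The `service logstash restart` command, the script path, and the six-hour interval in the comment are all assumptions for illustration; check how the logstash service is managed on your nodes before using anything like this.

```python
import subprocess
import sys

# Hypothetical crontab entry (path and interval are assumptions):
#   0 */6 * * * /usr/bin/python3 /usr/local/bin/restart_logstash.py --apply


def build_restart_command(service="logstash"):
    """Return the command used to restart the service (assumes SysV-style `service`)."""
    return ["service", service, "restart"]


def restart(service="logstash", run=subprocess.run):
    """Restart the service and return True if the command exited with status 0."""
    result = run(build_restart_command(service), check=False)
    return result.returncode == 0


if __name__ == "__main__":
    if "--apply" in sys.argv[1:]:
        sys.exit(0 if restart() else 1)
    # Without --apply, only show what would be run.
    print("dry run:", " ".join(build_restart_command()))
```

Staggering the schedule across nodes, so that they do not all restart at the same minute, should keep some listeners available while the others come back up.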

I have filed an internal bug report for this issue (ID 9305).
jspink
Posts: 43
Joined: Wed Nov 25, 2015 3:27 pm

Re: TCP/UDP ports stop responding

Post by jspink »

The cron jobs are set. We had discussed doing this internally, but wanted to wait for your findings.

So with the bug report being entered, does this mean your devs will continue to look for a solution, or if the cron jobs resolve the issue, will it be left to stand?
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Re: TCP/UDP ports stop responding

Post by mcapra »

Since a bug report was submitted, the developers will address it one way or another.

Let us know how the cron jobs handle this. If restarting logstash regularly solves the problem, that is useful information for applying a proper fix at the Logstash level.
jspink
Posts: 43
Joined: Wed Nov 25, 2015 3:27 pm

Re: TCP/UDP ports stop responding

Post by jspink »

Just looking for a possible status update on this.

Scheduled reboots do seem to be helping, but we would like to get back to our 10-instance cluster.
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Re: TCP/UDP ports stop responding

Post by mcapra »

It doesn't look like either GitHub issue has received any updates. You could try bringing the 10th node back up with the same scheduled restarts as the other nodes. Unfortunately, we don't have a very good environment to test that sort of thing, so I can't promise that adding the 10th will not affect stability. My opinion is that the issue isn't specific to 10+ nodes, but I haven't spent much time diving into the Logstash back-end.
jspink
Posts: 43
Joined: Wed Nov 25, 2015 3:27 pm

Re: TCP/UDP ports stop responding

Post by jspink »

OK, thanks for the quick response.
I'll wait for some work on this topic (https://support.nagios.com/forum/viewto ... 38&t=40282) before re-adding the 10th node.
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Re: TCP/UDP ports stop responding

Post by mcapra »

Alrighty, we will continue the correspondence there.