TCP/UDP ports stop responding
Re: TCP/UDP ports stop responding
Just a quick update on this, definitely looks like there's some weirdness happening with 10 nodes. My lsof counts are also incredibly high in a 10 node environment even if the number of logs being sent is relatively low. We'll continue to monitor this and provide updates.
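For anyone wanting to compare lsof counts between nodes, here is a minimal sketch of one way to tally open files per process from lsof output. The function names are hypothetical, and it assumes the default lsof column layout (COMMAND first, PID second). Logstash runs on the JVM, so it will typically show up under the command name `java` rather than `logstash`.

```python
import subprocess
from collections import Counter


def count_open_files(lsof_output):
    """Tally lsof rows per (command, pid) pair.

    Assumes the default lsof layout: COMMAND in column 1, PID in column 2.
    """
    counts = Counter()
    for line in lsof_output.splitlines():
        fields = line.split()
        # Skip blank lines, malformed rows, and the header row.
        if len(fields) < 2 or fields[0] == "COMMAND":
            continue
        counts[(fields[0], fields[1])] += 1
    return counts


def snapshot():
    """Run lsof on this host and return the per-process tally.

    -n and -P skip host and port name lookups, which keeps the call fast.
    """
    result = subprocess.run(
        ["lsof", "-n", "-P"], capture_output=True, text=True, check=False
    )
    return count_open_files(result.stdout)
```

Calling `snapshot().most_common(5)` on each node at a fixed interval would show whether the count for the Logstash JVM keeps climbing while log volume stays flat.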
Former Nagios employee
https://www.mcapra.com/
Re: TCP/UDP ports stop responding
mcapra wrote:
> Just a quick update on this, definitely looks like there's some weirdness happening with 10 nodes. My lsof counts are also incredibly high in a 10 node environment even if the number of logs being sent is relatively low. We'll continue to monitor this and provide updates.
Dropped to 9 nodes. Sometime between 4pm Friday and 11am Sunday, 6 of the remaining nodes stopped responding altogether, so I had 3 nodes attempting to take in logs for everything.
Had to reboot all nodes, so lsof isn't going to be helpful, and since I have a ton of servers trying to catch up on 2 days of logs, I doubt a tail of the logstash log will help much either, but wanted to let you know it happened.
Nagios Log Server: 10 Instances - 3,916,302,797 documents last check in 180 shards
Re: TCP/UDP ports stop responding
This looks like an issue with Logstash that traces back a few months:
https://github.com/elastic/logstash/issues/4815
https://github.com/elastic/logstash/issues/4225
Others have remedied the problem by scheduling the logstash service to restart on a regular interval (via cron). You could give that a shot, though I realize this is less than ideal. From what I gather, logstash is at times not properly closing connections which creates a sort of blockage on the back-end.
I have filed an internal bug report for this issue (ID 9305).
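As a rough illustration of the cron workaround, here is a minimal sketch of a restart script that cron could call. It assumes Logstash is controlled with `service logstash restart`; adjust the command if your nodes use a different init system. The function name and the `runner` parameter are hypothetical and exist only so the script can be exercised without touching the service.

```python
import subprocess
from datetime import datetime

# Assumption: Logstash is managed through the `service` wrapper on each node.
RESTART_CMD = ["service", "logstash", "restart"]


def restart_logstash(dry_run=False, runner=subprocess.run):
    """Restart Logstash and return a one-line, timestamped log message.

    With dry_run=True nothing is executed, which is useful for checking
    the cron wiring before enabling real restarts.
    """
    stamp = datetime.now().isoformat(timespec="seconds")
    if dry_run:
        return f"{stamp} dry run: would execute {' '.join(RESTART_CMD)}"
    result = runner(RESTART_CMD, capture_output=True, text=True)
    if result.returncode == 0:
        status = "ok"
    else:
        status = f"failed (exit {result.returncode})"
    return f"{stamp} logstash restart {status}"
```

To use it, add `print(restart_logstash())` at the bottom of the file and schedule it from root's crontab, for example `0 */6 * * * /usr/bin/python3 /path/to/restart_logstash.py >> /var/log/logstash_restart.log 2>&1`. The six-hour interval and both paths are placeholders, not values from this thread.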
Former Nagios employee
https://www.mcapra.com/
Re: TCP/UDP ports stop responding
mcapra wrote:
> This looks like an issue with Logstash that traces back a few months:
> https://github.com/elastic/logstash/issues/4815
> https://github.com/elastic/logstash/issues/4225
> Others have remedied the problem by scheduling the logstash service to restart on a regular interval (via cron). You could give that a shot, though I realize this is less than ideal. From what I gather, logstash is at times not properly closing connections which creates a sort of blockage on the back-end.
> I have filed an internal bug report for this issue (ID 9305).
Cron jobs set. We had discussed doing this internally, but wanted to wait for your findings.
So with the bug report being entered, does this mean your devs will continue to look for a solution, or if the cron jobs resolve the issue, will it be left to stand?
Nagios Log Server: 10 Instances - 3,916,302,797 documents last check in 180 shards
Re: TCP/UDP ports stop responding
As a bug report was submitted, it's going to be addressed one way or another by them.
Let us know how the cron jobs handle this. If restarting logstash regularly solves the problem, then it's useful in terms of applying a proper fix at the logstash level.
Former Nagios employee
https://www.mcapra.com/
Re: TCP/UDP ports stop responding
Just looking for a possible status update on this.
Scheduled reboots do seem to be helping, but we would like to get back to our 10-instance cluster.
Nagios Log Server: 10 Instances - 3,916,302,797 documents last check in 180 shards
Re: TCP/UDP ports stop responding
It doesn't look like either GitHub issue has received any updates. You could try bringing the 10th back up with the same scheduled restarts as the other nodes, but unfortunately we don't have a very good environment to test that sort of thing against, so I can't promise adding the 10th will not affect stability. My opinion is that the issue isn't specific to 10+ nodes, but I haven't spent too much time diving into the logstash back-end.
Former Nagios employee
https://www.mcapra.com/
Re: TCP/UDP ports stop responding
mcapra wrote:
> It doesn't look like either github issue has received any updates. You could try bringing the 10th back up with the same scheduled restarts as the other nodes, but we don't have a very good environment to test against that sort of thing unfortunately so I can't promise adding the 10th will not affect stability. My opinion is that the issue isn't specific to 10+ nodes, but I haven't spent too much time diving into the logstash back-end.
OK, thanks for the quick response.
I'll wait for some work on this topic (https://support.nagios.com/forum/viewto ... 38&t=40282) before re-adding the 10th.
Nagios Log Server: 10 Instances - 3,916,302,797 documents last check in 180 shards
Re: TCP/UDP ports stop responding
Alrighty, will continue correspondence there
Former Nagios employee
https://www.mcapra.com/