Greetings all,
I've downloaded the latest Log Server demo and created a 3-server load-balanced cluster. Logs collect fine with no errors for a small number of clients, but ingestion eventually caps out at roughly 5M logs per 15 minutes, and the clients show reconnect messages in their nxlog log files. If I stop enough clients to bring the ingress rate below that 5M/15min mark, the client logs show no errors and the cluster shows the correct number of logs for each client.

I've tried load balancing with round-robin DNS and with an HAProxy server, increasing RAM in the cluster, and adding additional nodes to the cluster, all with no change. We're only collecting Windows Security logs on this server (the servers report everything related to Security, so there are a lot of logs), and they collect fine until we hit that mark.

In nxlog.log, everything is quiet at a low ingress rate. At a high rate, we start to see a lot of "ERROR om_tcp send failed; An existing connection was forcibly closed by the remote host." and "INFO reconnecting in 2 seconds".
Is there a tuning option that can be tweaked to help with this threshold?
Thanks!
Scott
Nagios Log Server Cluster dropping logs after ~5M/15min
Re: Nagios Log Server Cluster dropping logs after ~5M/15min
ScottMc wrote: Is there a tuning option that can be tweaked to help with this threshold?
Maybe, but first and foremost I think it's worth taking a look at the Logstash (/var/log/logstash) and Elasticsearch (/var/log/elasticsearch) logs.
Former Nagios employee
https://www.mcapra.com/
Re: Nagios Log Server Cluster dropping logs after ~5M/15min
The only thing I see in any of the logs that's indicative of this issue is in the elasticsearch logs:
[2018-08-06 13:15:08,590][INFO ][index.engine ] [f8086297-a769-4be2-b36e-d0e570209b2d] [logstash-2018.08.06][2] now throttling indexing: numMergesInFlight=6, maxNumMerges=5
[2018-08-06 13:15:09,365][INFO ][index.engine ] [f8086297-a769-4be2-b36e-d0e570209b2d] [logstash-2018.08.06][2] stop throttling indexing: numMergesInFlight=4, maxNumMerges=5
On some servers there are a lot of them going back and forth (every few seconds), on some this message only appears a couple of times in the past day.
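Those "now throttling indexing" messages mean Elasticsearch is pausing indexing because segment merges can't keep up. On the ES 1.x line that shipped with Log Server in this era, store throttling can also be inspected and adjusted at runtime through the cluster settings API, without editing the .yml files. A sketch (the 100mb value is an example, not a recommendation; verify the setting names against your Elasticsearch version):

```shell
# Check current transient/persistent cluster settings
curl 'localhost:9200/_cluster/settings?pretty'

# Example only: raise the merge I/O throttle ceiling (ES 1.x setting names)
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "indices.store.throttle.max_bytes_per_sec": "100mb" }
}'
```

A transient setting is lost on full-cluster restart, which makes it convenient for testing before committing a change to elasticsearch.yml.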
Re: Nagios Log Server Cluster dropping logs after ~5M/15min
How much RAM is on each of the machines, and how much did you try increasing it to? 64GB total is the recommended limit; anything above that can negatively impact performance.
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
Re: Nagios Log Server Cluster dropping logs after ~5M/15min
cdienger wrote: How much RAM is on each of the machines, and how much did you try increasing it to? 64GB total is the recommended limit; anything above that can negatively impact performance.
Initially 64 GB in all servers. After hitting this limit, we upped it to 128 GB, but it made little or no difference.
Re: Nagios Log Server Cluster dropping logs after ~5M/15min
Was this prior to or after adding the additional nodes? Again, above 64GB can actually hurt things. If you're still able to add nodes, I would try adding a couple more and make sure each node is maxed out at 64GB but doesn't exceed it.
What does heap.percent look like for each node if you run curl 'localhost:9200/_cat/nodes?v' ?
You mentioned you're only accepting Windows Security logs - are the other inputs disabled? Do you have any filters enabled that you could disable to see if that improves performance? If so, please provide a copy of the filters so we can see if they can be improved.
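A related thing worth checking while you're at it is the Elasticsearch JVM heap size on each node: the usual guidance is roughly half of physical RAM, and below ~31GB so compressed object pointers stay enabled. A sketch, assuming the ES 1.x RPM layout Log Server used (the file path and variable name may differ on your install):

```shell
# Hypothetical example - verify the path on your system.
# In /etc/sysconfig/elasticsearch, set roughly half of RAM, capped under ~31g:
#   ES_HEAP_SIZE=30g
# Then restart elasticsearch on that node and re-check heap pressure:
curl 'localhost:9200/_cat/nodes?v&h=host,heap.percent,ram.percent,load'
```

If heap.percent sits persistently high (say, above ~75%) under load, the node is likely spending significant time in garbage collection, which shows up as exactly this kind of ingestion stall.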
Re: Nagios Log Server Cluster dropping logs after ~5M/15min
Some upfront filtering within nxlog might be useful to reduce some of the noise from the Windows logs.
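As an illustration of that idea, nxlog's im_msvistalog input supports an Exec directive that can drop events on the client before they are ever sent. A minimal sketch; the event IDs below are hypothetical examples only - pick the high-volume IDs from your own environment that you are actually allowed to discard:

```
# Hypothetical nxlog input block - adjust module options and IDs to your setup.
<Input eventlog>
    Module  im_msvistalog
    # Drop noisy handle-audit events (4656 open, 4658 close) before shipping
    Exec    if $EventID == 4656 or $EventID == 4658 drop();
</Input>
```

Filtering at the source reduces both network traffic and the indexing load on the cluster, which is cheaper than absorbing the volume downstream.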
CP
--
Chris Paul
Rex Consulting, Inc
5652 Florence Terrace, Oakland, CA 94611
email: [email protected]
web: http://www.rexconsulting.net
phone, toll-free: +1 (888) 403-8996 ext 1
Re: Nagios Log Server Cluster dropping logs after ~5M/15min
cdienger wrote: Was this prior to or after adding the additional nodes? Again, above 64GB can actually hurt things. If you're still able to add nodes, I would try adding a couple more and make sure each node is maxed out at 64GB but doesn't exceed it.
What does heap.percent look like for each node if you run curl 'localhost:9200/_cat/nodes?v' ?
You mentioned you're only accepting Windows Security logs - are the other inputs disabled? Do you have any filters enabled that you could disable to see if that improves performance?
All other inputs are disabled. Only the Windows Event Log input is enabled, and only Security logs are being pushed from the clients. We have to collect the complete logs for security reasons, so we can't really reduce the noise from there.
Output from curl 'localhost:9200/_cat/nodes?v':
Code: Select all
host     ip           heap.percent ram.percent load  node.role master
nlshost2 10.0.201.202           14          54  4.93 d         *
nlshost3 10.0.201.203           41          54  8.28 d         m
nlshost4 10.0.201.204           53          54 17.14 d         m
nlshost5 10.0.201.205           53          54  4.97 d         m
I did add "indices.store.throttle.type: none" to the .yml files, and that seems to have helped a bit (it now maxes out at around 9M/15min), but we're still dropping a lot of logs.
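If disabling store throttling helped, the merge scheduler itself may be the next bottleneck. A hedged elasticsearch.yml sketch for the ES 1.x line, to be verified against the version actually installed before use:

```yaml
# Hypothetical elasticsearch.yml fragment - setting names are from the ES 1.x
# era; confirm against your version's docs before applying.
indices.store.throttle.type: none          # the change already applied above
index.merge.scheduler.max_thread_count: 1  # commonly advised for spinning disks
```

On spinning disks, a single merge thread avoids seek thrash; on SSDs the default (scaled to CPU count) is usually the better choice, so this knob only makes sense if these nodes are on rotational storage.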
Re: Nagios Log Server Cluster dropping logs after ~5M/15min
I'm leaning more towards this being something on the elasticsearch side, but I would still suggest following https://support.nagios.com/kb/article/n ... g-576.html to make sure these settings are not causing a bottleneck that impacts elasticsearch downstream.