Greetings all,
I've downloaded the latest Log Server demo and created a 3-server load-balanced cluster. Logs collect fine with no errors for a small number of clients, but ingestion eventually caps out at roughly 5M logs per 15 minutes, and the clients show reconnect messages in their nxlog log files. If I stop enough clients to bring the ingress rate below that 5M/15min mark, the client logs show no errors and the cluster shows the correct number of logs for each client.

I've tried load balancing with round-robin DNS and with an HAProxy server, increasing RAM in the cluster, and adding additional nodes to the cluster, all with no change. We're only collecting Windows Security logs on this server (the servers report everything related to Security, so there are a lot of logs), and they collect fine until we hit that mark.

In nxlog.log, everything is quiet at a low ingress rate. At a high rate, we start to see a lot of "ERROR om_tcp send failed; An existing connection was forcibly closed by the remote host." and "INFO reconnecting in 2 seconds".
Is there a tuning option that can be tweaked to help with this threshold?
Thanks!
Scott
Nagios Log Server Cluster dropping logs after ~5M/15min
Re: Nagios Log Server Cluster dropping logs after ~5M/15min
ScottMc wrote: Is there a tuning option that can be tweaked to help with this threshold?
Maybe, but first and foremost I think it's worth taking a look at the Logstash (/var/log/logstash) and Elasticsearch (/var/log/elasticsearch) logs.
Former Nagios employee
https://www.mcapra.com/
Re: Nagios Log Server Cluster dropping logs after ~5M/15min
The only thing I see in any of the logs that's indicative of this issue is in the elasticsearch logs:
[2018-08-06 13:15:08,590][INFO ][index.engine ] [f8086297-a769-4be2-b36e-d0e570209b2d] [logstash-2018.08.06][2] now throttling indexing: numMergesInFlight=6, maxNumMerges=5
[2018-08-06 13:15:09,365][INFO ][index.engine ] [f8086297-a769-4be2-b36e-d0e570209b2d] [logstash-2018.08.06][2] stop throttling indexing: numMergesInFlight=4, maxNumMerges=5
On some servers there are a lot of them going back and forth (every few seconds), on some this message only appears a couple of times in the past day.
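Those "now throttling indexing" messages mean Elasticsearch is pausing indexing because segment merges can't keep up. On the ES 1.x line that shipped with Log Server in this era, store throttling can also be inspected and adjusted at runtime through the cluster settings API, without editing the .yml files. A sketch (the 100mb value is an example, not a recommendation; verify the setting names against your Elasticsearch version):

```shell
# Check current transient/persistent cluster settings
curl 'localhost:9200/_cluster/settings?pretty'

# Example only: raise the merge I/O throttle ceiling (ES 1.x setting names)
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "indices.store.throttle.max_bytes_per_sec": "100mb" }
}'
```

A transient setting is lost on full-cluster restart, which makes it convenient for testing before committing a change to elasticsearch.yml.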
Re: Nagios Log Server Cluster dropping logs after ~5M/15min
How much RAM is on each of the machines, and how much did you try increasing it to? 64GB total is the recommended limit; anything above that can negatively impact performance.
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
Re: Nagios Log Server Cluster dropping logs after ~5M/15min
cdienger wrote: How much RAM is on each of the machines, and how much did you try increasing it to? 64GB total is the recommended limit; anything above that can negatively impact performance.
Initially 64 GB in all servers. After hitting this limit, we upped it to 128 GB, but it made little or no difference.
Re: Nagios Log Server Cluster dropping logs after ~5M/15min
Was this prior to or after adding the additional nodes? Again, above 64GB can actually hurt things. If you're still able to add nodes, I would try adding a couple more and make sure each node is maxed out at 64GB but doesn't exceed it.
What does heap.percent look like for each node if you run curl 'localhost:9200/_cat/nodes?v' ?
You mentioned you're only accepting Windows Security logs - are the other inputs disabled? Do you have any filters enabled that you could disable to see if that improves performance? If so, please provide a copy of the filters so we can see if they can be improved.
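A related thing worth checking while you're at it is the Elasticsearch JVM heap size on each node: the usual guidance is roughly half of physical RAM, and below ~31GB so compressed object pointers stay enabled. A sketch, assuming the ES 1.x RPM layout Log Server used (the file path and variable name may differ on your install):

```shell
# Hypothetical example - verify the path on your system.
# In /etc/sysconfig/elasticsearch, set roughly half of RAM, capped under ~31g:
#   ES_HEAP_SIZE=30g
# Then restart elasticsearch on that node and re-check heap pressure:
curl 'localhost:9200/_cat/nodes?v&h=host,heap.percent,ram.percent,load'
```

If heap.percent sits persistently high (say, above ~75%) under load, the node is likely spending significant time in garbage collection, which shows up as exactly this kind of ingestion stall.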
Re: Nagios Log Server Cluster dropping logs after ~5M/15min
Some upfront filtering within nxlog might be useful to reduce some of the noise from the Windows logs.
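As an illustration of that idea, nxlog's im_msvistalog input supports an Exec directive that can drop events on the client before they are ever sent. A minimal sketch; the event IDs below are hypothetical examples only - pick the high-volume IDs from your own environment that you are actually allowed to discard:

```
# Hypothetical nxlog input block - adjust module options and IDs to your setup.
<Input eventlog>
    Module  im_msvistalog
    # Drop noisy handle-audit events (4656 open, 4658 close) before shipping
    Exec    if $EventID == 4656 or $EventID == 4658 drop();
</Input>
```

Filtering at the source reduces both network traffic and the indexing load on the cluster, which is cheaper than absorbing the volume downstream.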
CP
--
Chris Paul
Rex Consulting, Inc
5652 Florence Terrace, Oakland, CA 94611
email: [email protected]
web: http://www.rexconsulting.net
phone, toll-free: +1 (888) 403-8996 ext 1
Re: Nagios Log Server Cluster dropping logs after ~5M/15min
cdienger wrote: Was this prior to or after adding the additional nodes? Again, above 64GB can actually hurt things. If you're still able to add nodes, I would try adding a couple more and make sure each node is maxed out at 64GB but doesn't exceed it.
What does heap.percent look like for each node if you run curl 'localhost:9200/_cat/nodes?v' ?
You mentioned you're only accepting Windows Security logs - are the other inputs disabled? Do you have any filters enabled that you could disable to see if that improves performance?
All other inputs are disabled. Only the Windows Event Log input is enabled, and only Security logs are being pushed from the clients. We have to collect the complete logs for security reasons, so we can't really reduce the noise from there.
Output from curl 'localhost:9200/_cat/nodes?v':
Code: Select all
host     ip           heap.percent ram.percent load  node.role master
nlshost2 10.0.201.202           14          54  4.93 d         *
nlshost3 10.0.201.203           41          54  8.28 d         m
nlshost4 10.0.201.204           53          54 17.14 d         m
nlshost5 10.0.201.205           53          54  4.97 d         m
I did add "indices.store.throttle.type: none" to the .yml files, and that seems to have helped a bit (it now maxes out at around 9M/15min), but we're still dropping a lot of logs.
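If disabling store throttling helped, the merge scheduler itself may be the next bottleneck. A hedged elasticsearch.yml sketch for the ES 1.x line, to be verified against the version actually installed before use:

```yaml
# Hypothetical elasticsearch.yml fragment - setting names are from the ES 1.x
# era; confirm against your version's docs before applying.
indices.store.throttle.type: none          # the change already applied above
index.merge.scheduler.max_thread_count: 1  # commonly advised for spinning disks
```

On spinning disks, a single merge thread avoids seek thrash; on SSDs the default (scaled to CPU count) is usually the better choice, so this knob only makes sense if these nodes are on rotational storage.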
Re: Nagios Log Server Cluster dropping logs after ~5M/15min
I'm leaning more towards this being something on the elasticsearch side, but I would still suggest following https://support.nagios.com/kb/article/n ... g-576.html to make sure these settings are not causing a bottleneck that impacts elasticsearch downstream.