Nagios Log Server listening port abruptly halts v2

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.

Post by cdienger »

16 GB may not be enough, depending on how many indices ES has open at any given time and how much data is coming in. Of the 16 GB total, ES is given half to load all open indices, run queries, handle maintenance, and so on. Run the following to get a list of open indices and their size:

Code: Select all

curl -XGET http://localhost:9200/_cat/shards | grep STARTED
ES should have at least enough memory allocated to it to load all of the indices listed.
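If you want to see how full the heap actually is, something like this should work against the stock _cat APIs (the exact column names are an assumption based on a default ES install, not anything specific to Log Server):

Code: Select all

# Show each node's heap usage as a percentage of its configured maximum
curl -XGET 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max'
# Show open indices along with their on-disk size
curl -XGET 'http://localhost:9200/_cat/indices?v&h=index,status,store.size'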

Post by james.liew »

cdienger wrote: 16 GB may not be enough, depending on how many indices ES has open at any given time and how much data is coming in. Of the 16 GB total, ES is given half to load all open indices, run queries, handle maintenance, and so on. Run the following to get a list of open indices and their size:

Code: Select all

curl -XGET http://localhost:9200/_cat/shards | grep STARTED
ES should have at least enough memory allocated to it to load all of the indices listed.
I have attached the output to this reply. :)

EDIT: I have indices open for the past 60 days; perhaps this is taking up too much RAM. I could cut it down to 45 or 30 days and see if performance improves.

Post by cdienger »

I would go with 30 days and see how that goes. You may be able to increase it from there, but you may also need to decrease it even further or install more RAM. The maximum recommended for Logstash is 32 GB, so 64 GB total on the system (Logstash will automatically take half of the total system memory).
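To double-check what the heap is actually set to and how much of it is in use, the standard node stats API should show it; this is stock ES, nothing Log Server specific:

Code: Select all

# Print each node's configured heap maximum and current usage percentage
curl -s -XGET 'http://localhost:9200/_nodes/stats/jvm?pretty' | grep -E 'heap_max_in_bytes|heap_used_percent'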

Post by james.liew »

I've set indices to close after 30 days and will monitor it for now.
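For anyone else following along, my understanding is that the close Log Server schedules boils down to the standard close-index call; the index name below is just an example:

Code: Select all

# Close an index so ES no longer keeps it loaded in heap (example index name)
curl -XPOST 'http://localhost:9200/logstash-2017.05.01/_close'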

Post by cdienger »

Thanks for the update. Keep us posted!

Post by james.liew »

It lasted two weeks before it started dying again. I'm hitting 70% RAM usage before it decides to quit and roll over.

Perhaps I should reduce indices to 20 days? Or less?
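In case it helps, I've been watching memory with a rough loop like this (standard Linux tooling, nothing Log Server specific):

Code: Select all

# Log overall memory usage once a minute to catch the climb before a crash
while true; do
    date
    free -m
    sleep 60
done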

Post by mcapra »

Perhaps. Unfortunately the Elasticsearch logs only go back 7 days, but this is definitely worth mentioning:

Code: Select all

[2017-06-28 03:07:06,592][WARN ][index.shard              ] [791cc6c8-f646-495e-9e58-1ec21a24b61c] [logstash-2017.06.28][4] Failed to perform scheduled engine optimize/merge
org.elasticsearch.index.engine.OptimizeFailedEngineException: [logstash-2017.06.28][4] force merge failed

...

Caused by: java.lang.OutOfMemoryError: unable to create new native thread
That lines up closely with when Logstash appears to have died:

Code: Select all

{:timestamp=>"2017-06-28T03:07:29.298000+0200", :message=>"Got error to send bulk of actions: Failed to deserialize exception response from stream", :level=>:error}
{:timestamp=>"2017-06-28T03:07:29.298000+0200", :message=>"Failed to flush outgoing items", :outgoing_count=>2, :exception=>org.elasticsearch.transport.TransportSerializationException: Failed to deserialize exception response from stream, [backtrace here], :level=>:warn}
Following the backtrace of the above Logstash message (it's very long), it looks like the JVM was simply unable to spawn new threads due to its heap being exhausted.

One big red flag to me is that this all happened during a "force merge", which in Nagios Log Server is tied to the "Optimize Indices Older Than" setting under Backup & Maintenance. I'd try setting that to 0 (which, as of 1.4.4, disables the action), since it seems to be negatively impacting your system's performance.
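For context, my understanding is that the optimize step in that maintenance job boils down to the standard force merge call; on the older ES versions this is the _optimize endpoint, and the index name below is just an example:

Code: Select all

# Merge an index down to a single segment (ES 1.x syntax; 2.x+ renamed this to _forcemerge)
curl -XPOST 'http://localhost:9200/logstash-2017.06.28/_optimize?max_num_segments=1'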

Post by cdienger »

As mcapra pointed out, the merge appears to be causing the memory issue. You can disable this option without losing any data, and performance shouldn't be impacted (aside from no longer crashing).

Post by james.liew »

I think I might have made a bit of a mess with the logs. Last week I turned optimization back on just to see what would happen, and it went kaput.

This log server is a bit different from my other clusters, as it is also collecting logs for another site nearby.

Perhaps this site requires 32 GB of RAM instead of the 16 GB I have everywhere else, which hasn't shown any problems.

Post by cdienger »

If the site is taking in more data than the others, then it's quite possible the merge operation has a better chance of causing problems on this instance. I would increase the memory if possible, and disable the merge option if the merge/memory messages pop up again in the logs.