Nagios Log Server listening port abruptly halts v2

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.

Post by cdienger »

16 GB may not be enough, depending on how many indices ES has open at any given time and how much data is coming in. Of the 16 GB total, ES is given half to load all open indices, run queries, handle maintenance, and so on. Run the following to get a list of open indices and their size:

Code: Select all

curl -XGET http://localhost:9200/_cat/shards | grep STARTED
ES should have at least enough memory allocated to it to load all of the indices listed.
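If you want to see how full the heap actually is, something like this should work against the stock _cat APIs (the exact column names are an assumption based on a default ES install, not anything specific to Log Server):

Code: Select all

# Show each node's heap usage as a percentage of its configured maximum
curl -XGET 'http://localhost:9200/_cat/nodes?v&h=name,heap.percent,heap.max'
# Show open indices along with their on-disk size
curl -XGET 'http://localhost:9200/_cat/indices?v&h=index,status,store.size'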

Post by james.liew »

cdienger wrote: 16 GB may not be enough, depending on how many indices ES has open at any given time and how much data is coming in. Of the 16 GB total, ES is given half to load all open indices, run queries, handle maintenance, and so on. Run the following to get a list of open indices and their size:

Code: Select all

curl -XGET http://localhost:9200/_cat/shards | grep STARTED
ES should have at least enough memory allocated to it to load all of the indices listed.
I have attached the output to this reply. :)

EDIT: I have indices open for the past 60 days; perhaps this is taking up too much RAM. I could cut it down to 45 or 30 days and see if performance improves.

Post by cdienger »

I would go with 30 days and see how that goes. You may be able to increase it from there, but you may also need to decrease it even further or install more RAM. The maximum recommended for Logstash is 32 GB, so 64 GB total on the system (Logstash will automatically take half of the total system memory).
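To double-check what the heap is actually set to and how much of it is in use, the standard node stats API should show it; this is stock ES, nothing Log Server specific:

Code: Select all

# Print each node's configured heap maximum and current usage percentage
curl -s -XGET 'http://localhost:9200/_nodes/stats/jvm?pretty' | grep -E 'heap_max_in_bytes|heap_used_percent'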

Post by james.liew »

I've set indices to close after 30 days and will monitor it for now.
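For anyone else following along, my understanding is that the close Log Server schedules boils down to the standard close-index call; the index name below is just an example:

Code: Select all

# Close an index so ES no longer keeps it loaded in heap (example index name)
curl -XPOST 'http://localhost:9200/logstash-2017.05.01/_close'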

Post by cdienger »

Thanks for the update. Keep us posted!

Post by james.liew »

It lasted two weeks before it started dying again. I'm hitting 70% RAM usage before it decides to quit and roll over.

Perhaps I should reduce indices to 20 days? Or less?
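In case it helps, I've been watching memory with a rough loop like this (standard Linux tooling, nothing Log Server specific):

Code: Select all

# Log overall memory usage once a minute to catch the climb before a crash
while true; do
    date
    free -m
    sleep 60
done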

Post by mcapra »

Perhaps. Unfortunately the Elasticsearch logs only go back 7 days, but this is definitely worth mentioning:

Code: Select all

[2017-06-28 03:07:06,592][WARN ][index.shard              ] [791cc6c8-f646-495e-9e58-1ec21a24b61c] [logstash-2017.06.28][4] Failed to perform scheduled engine optimize/merge
org.elasticsearch.index.engine.OptimizeFailedEngineException: [logstash-2017.06.28][4] force merge failed

...

Caused by: java.lang.OutOfMemoryError: unable to create new native thread
That lines up closely with when Logstash appears to have died:

Code: Select all

{:timestamp=>"2017-06-28T03:07:29.298000+0200", :message=>"Got error to send bulk of actions: Failed to deserialize exception response from stream", :level=>:error}
{:timestamp=>"2017-06-28T03:07:29.298000+0200", :message=>"Failed to flush outgoing items", :outgoing_count=>2, :exception=>org.elasticsearch.transport.TransportSerializationException: Failed to deserialize exception response from stream, [backtrace here], :level=>:warn}
Following the backtrace of the above Logstash message (it's very long), it looks like the JVM was simply unable to spawn new threads due to its heap being exhausted.

One big red flag to me is that this all happened during a "force merge", which in Nagios Log Server is tied to the "Optimize Indices Older Than" setting under Backup & Maintenance. I'd try setting that to 0 (which, as of 1.4.4, disables the action), since it seems to be negatively impacting your system's performance.
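For context, my understanding is that the optimize step in that maintenance job boils down to the standard force merge call; on the older ES versions this is the _optimize endpoint, and the index name below is just an example:

Code: Select all

# Merge an index down to a single segment (ES 1.x syntax; 2.x+ renamed this to _forcemerge)
curl -XPOST 'http://localhost:9200/logstash-2017.06.28/_optimize?max_num_segments=1'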

Post by cdienger »

As mcapra pointed out, the merge appears to be causing the memory issue. You can disable this option without losing any data, and performance shouldn't be impacted (aside from no longer crashing).

Post by james.liew »

I think I might have made a bit of a mess with the logs. Last week I turned optimization back on just to see what would happen, and it went kaput.

This log server is a bit different from my other clusters, as it is also collecting logs for another site nearby.

Perhaps this site requires 32 GB of RAM instead of the 16 GB I have everywhere else, which hasn't shown any problems.

Post by cdienger »

If the site is taking in more data than the others, then it's quite possible the merge operation has a better chance of causing problems on this instance. I would increase the memory if possible, and disable the merge option if the merge/memory messages pop up again in the logs.