NLS stopped working
scottwilkerson
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
- Contact:
Re: NLS stopped working
Awesome, definitely let us know if the current version helps...
Re: NLS stopped working
Scott,
It seems the update did not help after all.
This morning NLS opened fine. Five minutes ago, however, I logged in and went to dashboards; loading took a little long on my home dashboard, which is just a * query. So I went to a different dashboard and the GUI seemed to freeze again. I've been waiting for 5+ minutes now, so I guess the only thing left to do is restart the elasticsearch service again.
Extracts of the elasticsearch log, the logstash log, and a top of the server are below. I see things like [refresh failed][OutOfMemoryError[Java heap space]]].
Why would I suddenly be out of memory? I expanded memory from 2 GB to 4 GB last week...
This is an extract of the elasticsearch log:
Code: Select all
[2015-02-17 15:24:31,476][DEBUG][action.search.type ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] [logstash-2015.02.14][3], node[8a2YUZmdT5asS6nywulupg], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@7b456191] lastShard [true]
org.elasticsearch.search.query.QueryPhaseExecutionException: [logstash-2015.02.14][3]: query[ConstantScore(*:*)],from[0],size[0]: Query Failed [Failed to execute main query]
at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:162)
at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:261)
at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:206)
at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:203)
at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:517)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Java heap space
[2015-02-17 15:24:32,597][WARN ][cluster.action.shard ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] [logstash-2015.02.17][2] received shard failed for [logstash-2015.02.17][2], node[8a2YUZmdT5asS6nywulupg], [R], s[STARTED], indexUUID [eqiYscAiRg26DF8AZxBI1A], reason [Failed to perform [bulk/shard] on replica, message [NodeDisconnectedException[[c4d16075-9bc2-4095-9f00-e7de7f96930c][inet[/10.54.24.140:9300]][bulk/shard/replica] disconnected]]]
[2015-02-17 15:24:35,466][WARN ][cluster.action.shard ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] [logstash-2015.02.17][0] received shard failed for [logstash-2015.02.17][0], node[dBHw3nRDTQeDUGajtxsAkg], [P], s[STARTED], indexUUID [eqiYscAiRg26DF8AZxBI1A], reason [engine failure, message [out of memory][OutOfMemoryError[Java heap space]]]
[2015-02-17 15:24:36,599][WARN ][cluster.action.shard ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] [logstash-2015.02.17][2] received shard failed for [logstash-2015.02.17][2], node[8a2YUZmdT5asS6nywulupg], [R], s[STARTED], indexUUID [eqiYscAiRg26DF8AZxBI1A], reason [Failed to perform [bulk/shard] on replica, message [NodeDisconnectedException[[c4d16075-9bc2-4095-9f00-e7de7f96930c][inet[/10.54.24.140:9300]][bulk/shard/replica] disconnected]]]
[2015-02-17 15:24:38,155][WARN ][cluster.action.shard ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] [logstash-2015.02.17][0] received shard failed for [logstash-2015.02.17][0], node[8a2YUZmdT5asS6nywulupg], [R], s[STARTED], indexUUID [eqiYscAiRg26DF8AZxBI1A], reason [Failed to perform [bulk/shard] on replica, message [NodeDisconnectedException[[c4d16075-9bc2-4095-9f00-e7de7f96930c][inet[/10.54.24.140:9300]][bulk/shard/replica] disconnected]]]
[2015-02-17 15:25:00,646][WARN ][search.action ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] Failed to send release search context
org.elasticsearch.transport.SendRequestTransportException: [95f9ab14-da22-4144-bb0b-6bbc5662115c][inet[/10.54.24.141:9300]][search/freeContext]
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:220)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:190)
at org.elasticsearch.search.action.SearchServiceTransportAction.sendFreeContext(SearchServiceTransportAction.java:125)
at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.releaseIrrelevantSearchContexts(TransportSearchTypeAction.java:348)
at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.finishHim(TransportSearchQueryThenFetchAction.java:147)
at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.moveToSecondPhase(TransportSearchQueryThenFetchAction.java:90)
at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.innerMoveToSecondPhase(TransportSearchTypeAction.java:404)
at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.onFirstPhaseResult(TransportSearchTypeAction.java:198)
at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$1.onResult(TransportSearchTypeAction.java:174)
at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$1.onResult(TransportSearchTypeAction.java:171)
at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:526)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [95f9ab14-da22-4144-bb0b-6bbc5662115c][inet[/10.54.24.141:9300]] Node not connected
at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:874)
at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:556)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:206)
... 13 more
[2015-02-17 15:25:02,133][WARN ][cluster.action.shard ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] [logstash-2015.02.17][2] received shard failed for [logstash-2015.02.17][2], node[dBHw3nRDTQeDUGajtxsAkg], [P], s[STARTED], indexUUID [eqiYscAiRg26DF8AZxBI1A], reason [engine failure, message [refresh failed][OutOfMemoryError[Java heap space]]]
[2015-02-17 15:25:26,789][WARN ][transport.netty ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] Message not fully read (request) for [287020] and action [bulk], resetting
[2015-02-17 15:25:33,573][WARN ][cluster.action.shard ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] [logstash-2015.02.17][1] received shard failed for [logstash-2015.02.17][1], node[8a2YUZmdT5asS6nywulupg], [R], s[STARTED], indexUUID [eqiYscAiRg26DF8AZxBI1A], reason [Failed to perform [bulk/shard] on replica, message [SendRequestTransportException[[c4d16075-9bc2-4095-9f00-e7de7f96930c][inet[/10.54.24.140:9300]][bulk/shard/replica]]; nested: NodeNotConnectedException[[c4d16075-9bc2-4095-9f00-e7de7f96930c][inet[/10.54.24.140:9300]] Node not connected]; ]]
[root@srvnaglog01 ~]# tail -100 /var/log/elasticsearch/ee9e60a0-f4cb-41ec-a97f-8f17434b748e.log
[2015-02-17 15:22:56,766][WARN ][cluster.action.shard ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] [nagioslogserver][0] sending failed shard for [nagioslogserver][0], node[dBHw3nRDTQeDUGajtxsAkg], [R], s[STARTED], indexUUID [YbgZhXrHRzqYGUCT-9_q5Q], reason [Failed to perform [index] on replica, message [NodeDisconnectedException[[95f9ab14-da22-4144-bb0b-6bbc5662115c][inet[/10.54.24.141:9300]][index/replica] disconnected]]]
[2015-02-17 15:22:56,767][WARN ][cluster.action.shard ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] [nagioslogserver][0] received shard failed for [nagioslogserver][0], node[dBHw3nRDTQeDUGajtxsAkg], [R], s[STARTED], indexUUID [YbgZhXrHRzqYGUCT-9_q5Q], reason [Failed to perform [index] on replica, message [NodeDisconnectedException[[95f9ab14-da22-4144-bb0b-6bbc5662115c][inet[/10.54.24.141:9300]][index/replica] disconnected]]]
[2015-02-17 15:22:56,766][DEBUG][action.bulk ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] observer timed out. notifying listener. timeout setting [1m], time since start [4.4m]
[2015-02-17 15:22:56,767][DEBUG][action.bulk ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] observer timed out. notifying listener. timeout setting [1m], time since start [4.5m]
[2015-02-17 15:22:56,766][DEBUG][action.bulk ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] observer timed out. notifying listener. timeout setting [1m], time since start [4.3m]
[2015-02-17 15:22:56,771][DEBUG][action.admin.indices.status] [c4d16075-9bc2-4095-9f00-e7de7f96930c] [nagioslogserver][0], node[dBHw3nRDTQeDUGajtxsAkg], [R], s[STARTED]: failed to executed [org.elasticsearch.action.admin.indices.status.IndicesStatusRequest@39b94e07]
org.elasticsearch.transport.NodeDisconnectedException: [95f9ab14-da22-4144-bb0b-6bbc5662115c][inet[/10.54.24.141:9300]][indices/status/s] disconnected
[2015-02-17 15:22:56,767][WARN ][action.index ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] Failed to perform index on remote replica [95f9ab14-da22-4144-bb0b-6bbc5662115c][dBHw3nRDTQeDUGajtxsAkg][srvnaglog02][inet[/10.54.24.141:9300]]{max_local_storage_nodes=1}[nagioslogserver][0]
org.elasticsearch.transport.NodeDisconnectedException: [95f9ab14-da22-4144-bb0b-6bbc5662115c][inet[/10.54.24.141:9300]][index/replica] disconnected
[2015-02-17 15:22:56,788][WARN ][cluster.action.shard ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] [nagioslogserver][0] sending failed shard for [nagioslogserver][0], node[dBHw3nRDTQeDUGajtxsAkg], [R], s[STARTED], indexUUID [YbgZhXrHRzqYGUCT-9_q5Q], reason [Failed to perform [index] on replica, message [NodeDisconnectedException[[95f9ab14-da22-4144-bb0b-6bbc5662115c][inet[/10.54.24.141:9300]][index/replica] disconnected]]]
[2015-02-17 15:22:56,820][WARN ][cluster.action.shard ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] [nagioslogserver][0] received shard failed for [nagioslogserver][0], node[dBHw3nRDTQeDUGajtxsAkg], [R], s[STARTED], indexUUID [YbgZhXrHRzqYGUCT-9_q5Q], reason [Failed to perform [index] on replica, message [NodeDisconnectedException[[95f9ab14-da22-4144-bb0b-6bbc5662115c][inet[/10.54.24.141:9300]][index/replica] disconnected]]]
[2015-02-17 15:22:56,916][DEBUG][action.search.type ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] [logstash-2015.02.11][3], node[dBHw3nRDTQeDUGajtxsAkg], [R], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@64953665] lastShard [true]
org.elasticsearch.transport.NodeDisconnectedException: [95f9ab14-da22-4144-bb0b-6bbc5662115c][inet[/10.54.24.141:9300]][search/phase/query] disconnected
[2015-02-17 15:22:56,916][DEBUG][action.bulk ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] observer timed out. notifying listener. timeout setting [1m], time since start [4.2m]
[2015-02-17 15:22:56,944][DEBUG][action.bulk ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] observer timed out. notifying listener. timeout setting [1m], time since start [4.2m]
[2015-02-17 15:22:56,916][DEBUG][action.bulk ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] observer timed out. notifying listener. timeout setting [1m], time since start [4.2m]
[2015-02-17 15:22:56,916][DEBUG][action.search.type ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] [logstash-2015.02.12][2], node[dBHw3nRDTQeDUGajtxsAkg], [R], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@64953665] lastShard [true]
org.elasticsearch.transport.NodeDisconnectedException: [95f9ab14-da22-4144-bb0b-6bbc5662115c][inet[/10.54.24.141:9300]][search/phase/query] disconnected
[2015-02-17 15:22:56,916][DEBUG][action.bulk ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] observer timed out. notifying listener. timeout setting [1m], time since start [4.2m]
[2015-02-17 15:22:56,979][DEBUG][action.bulk ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] observer timed out. notifying listener. timeout setting [1m], time since start [4.2m]
[2015-02-17 15:23:57,548][WARN ][cluster.action.shard ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] [logstash-2015.02.17][2] received shard failed for [logstash-2015.02.17][2], node[8a2YUZmdT5asS6nywulupg], [R], s[STARTED], indexUUID [eqiYscAiRg26DF8AZxBI1A], reason [Failed to perform [bulk/shard] on replica, message [NodeDisconnectedException[[c4d16075-9bc2-4095-9f00-e7de7f96930c][inet[/10.54.24.140:9300]][bulk/shard/replica] disconnected]]]
[2015-02-17 15:24:03,804][DEBUG][action.search.type ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] [logstash-2015.02.17][1], node[8a2YUZmdT5asS6nywulupg], [R], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@64953665] lastShard [true]
org.elasticsearch.search.query.QueryPhaseExecutionException: [logstash-2015.02.17][1]: query[ConstantScore(*:*)],from[0],size[0]: Query Failed [Failed to execute main query]
at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:162)
at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:261)
at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:206)
at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:203)
at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:517)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.ElasticsearchException: java.lang.OutOfMemoryError: Java heap space
at org.elasticsearch.index.fielddata.AbstractIndexFieldData.load(AbstractIndexFieldData.java:79)
at org.elasticsearch.index.fielddata.plain.AbstractBytesIndexFieldData.load(AbstractBytesIndexFieldData.java:41)
at org.elasticsearch.search.facet.terms.strings.TermsStringOrdinalsFacetExecutor$Collector.setNextReader(TermsStringOrdinalsFacetExecutor.java:214)
at org.elasticsearch.common.lucene.search.FilteredCollector.setNextReader(FilteredCollector.java:67)
at org.elasticsearch.common.lucene.MultiCollector.setNextReader(MultiCollector.java:68)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:612)
at org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:175)
at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:309)
at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:116)
... 7 more
Caused by: org.elasticsearch.common.util.concurrent.ExecutionError: java.lang.OutOfMemoryError: Java heap space
at org.elasticsearch.common.cache.LocalCache$Segment.get(LocalCache.java:2199)
at org.elasticsearch.common.cache.LocalCache.get(LocalCache.java:3934)
at org.elasticsearch.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4736)
at org.elasticsearch.indices.fielddata.cache.IndicesFieldDataCache$IndexFieldCache.load(IndicesFieldDataCache.java:154)
at org.elasticsearch.index.fielddata.AbstractIndexFieldData.load(AbstractIndexFieldData.java:73)
... 15 more
Caused by: java.lang.OutOfMemoryError: Java heap space
[2015-02-17 15:24:17,262][DEBUG][action.search.type ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] [logstash-2015.02.13][1], node[8a2YUZmdT5asS6nywulupg], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@7b456191] lastShard [true]
org.elasticsearch.search.query.QueryPhaseExecutionException: [logstash-2015.02.13][1]: query[ConstantScore(*:*)],from[0],size[0]: Query Failed [Failed to execute main query]
at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:162)
at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:261)
at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:206)
at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:203)
at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:517)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Java heap space
[2015-02-17 15:24:31,476][DEBUG][action.search.type ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] [logstash-2015.02.14][3], node[8a2YUZmdT5asS6nywulupg], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@7b456191] lastShard [true]
org.elasticsearch.search.query.QueryPhaseExecutionException: [logstash-2015.02.14][3]: query[ConstantScore(*:*)],from[0],size[0]: Query Failed [Failed to execute main query]
at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:162)
at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:261)
at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:206)
at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:203)
at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:517)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Java heap space
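As an aside on the stack trace above: the OutOfMemoryError is raised while loading field data for a terms facet (TermsStringOrdinalsFacetExecutor via IndicesFieldDataCache). In this era of Elasticsearch the field data cache is unbounded by default, so a common mitigation, separate from raising the heap, is to cap it in elasticsearch.yml. A sketch with an illustrative value (a demo path is used so the snippet is runnable as-is):

```shell
# Sketch: cap the field data cache so facet-heavy dashboards cannot consume
# the whole heap. The 40% value is illustrative, not a recommendation from
# this thread; the real file is elasticsearch.yml in the ES config directory.
CONF=/tmp/elasticsearch.yml.demo

echo 'indices.fielddata.cache.size: 40%' >> "$CONF"
grep fielddata "$CONF"
# A restart of elasticsearch is needed for the setting to take effect.
```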
And an extract of the logstash log:
Code: Select all
tail -50 /var/log/logstash/logstash.log
{:timestamp=>"2015-02-14T15:49:56.419000+0100", :message=>"Using milestone 2 input plugin 'tcp'. This plugin should be stable, but if you see strange behavior, please let us know! For more information on plugin milestones, see http://logstash.net/docs/1.4.2/plugin-milestones", :level=>:warn}
{:timestamp=>"2015-02-14T15:49:56.658000+0100", :message=>"Using milestone 1 input plugin 'syslog'. This plugin should work, but would benefit from use by folks like you. Please let us know if you find bugs or have suggestions on how to improve this plugin. For more information on plugin milestones, see http://logstash.net/docs/1.4.2/plugin-milestones", :level=>:warn}
{:timestamp=>"2015-02-17T15:25:27.063000+0100", :message=>"Failed to flush outgoing items", :outgoing_count=>5000, :exception=>org.elasticsearch.common.util.concurrent.UncategorizedExecutionException: Failed execution, :backtrace=>["org.elasticsearch.action.support.AdapterActionFuture.rethrowExecutionException(org/elasticsearch/action/support/AdapterActionFuture.java:90)", "org.elasticsearch.action.support.AdapterActionFuture.actionGet(org/elasticsearch/action/support/AdapterActionFuture.java:50)", "java.lang.reflect.Method.invoke(java/lang/reflect/Method.java:606)", "LogStash::Outputs::Elasticsearch::Protocols::NodeClient.bulk(/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch/protocol.rb:207)", "LogStash::Outputs::Elasticsearch::Protocols::NodeClient.bulk(/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch/protocol.rb:207)", "LogStash::Outputs::ElasticSearch.flush(/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch.rb:315)", "LogStash::Outputs::ElasticSearch.flush(/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch.rb:315)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:219)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:219)", "org.jruby.RubyHash.each(org/jruby/RubyHash.java:1339)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:216)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:216)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:193)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:193)", 
"Stud::Buffer.buffer_receive(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:159)", "Stud::Buffer.buffer_receive(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:159)", "LogStash::Outputs::ElasticSearch.receive(/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch.rb:311)", "LogStash::Outputs::ElasticSearch.receive(/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch.rb:311)", "LogStash::Outputs::Base.handle(/usr/local/nagioslogserver/logstash/lib/logstash/outputs/base.rb:86)", "LogStash::Outputs::Base.handle(/usr/local/nagioslogserver/logstash/lib/logstash/outputs/base.rb:86)", "RUBY.worker_setup(/usr/local/nagioslogserver/logstash/lib/logstash/outputs/base.rb:78)", "java.lang.Thread.run(java/lang/Thread.java:745)"], :level=>:warn}
Code: Select all
{:timestamp=>"2015-02-17T15:39:30.169000+0100", :message=>"Failed to flush outgoing items", :outgoing_count=>2, :exception=>org.elasticsearch.client.transport.NoNodeAvailableException: No node available, :backtrace=>["org.elasticsearch.client.transport.TransportClientNodesService.execute(org/elasticsearch/client/transport/TransportClientNodesService.java:219)", "org.elasticsearch.client.transport.support.InternalTransportClient.execute(org/elasticsearch/client/transport/support/InternalTransportClient.java:106)", "org.elasticsearch.client.support.AbstractClient.bulk(org/elasticsearch/client/support/AbstractClient.java:147)", "org.elasticsearch.client.transport.TransportClient.bulk(org/elasticsearch/client/transport/TransportClient.java:360)", "org.elasticsearch.action.bulk.BulkRequestBuilder.doExecute(org/elasticsearch/action/bulk/BulkRequestBuilder.java:165)", "org.elasticsearch.action.ActionRequestBuilder.execute(org/elasticsearch/action/ActionRequestBuilder.java:85)", "org.elasticsearch.action.ActionRequestBuilder.execute(org/elasticsearch/action/ActionRequestBuilder.java:59)", "java.lang.reflect.Method.invoke(java/lang/reflect/Method.java:606)", "LogStash::Outputs::Elasticsearch::Protocols::NodeClient.bulk(/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch/protocol.rb:207)", "LogStash::Outputs::Elasticsearch::Protocols::NodeClient.bulk(/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch/protocol.rb:207)", "LogStash::Outputs::ElasticSearch.flush(/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch.rb:315)", "LogStash::Outputs::ElasticSearch.flush(/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch.rb:315)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:219)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:219)", 
"org.jruby.RubyHash.each(org/jruby/RubyHash.java:1339)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:216)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:216)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:193)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:193)", "RUBY.buffer_initialize(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:112)", "org.jruby.RubyKernel.loop(org/jruby/RubyKernel.java:1521)", "RUBY.buffer_initialize(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:110)", "java.lang.Thread.run(java/lang/Thread.java:745)"], :level=>:warn}
{:timestamp=>"2015-02-17T15:39:31.172000+0100", :message=>"Failed to flush outgoing items", :outgoing_count=>2, :exception=>org.elasticsearch.client.transport.NoNodeAvailableException: No node available, :backtrace=>["org.elasticsearch.client.transport.TransportClientNodesService.execute(org/elasticsearch/client/transport/TransportClientNodesService.java:219)", "org.elasticsearch.client.transport.support.InternalTransportClient.execute(org/elasticsearch/client/transport/support/InternalTransportClient.java:106)", "org.elasticsearch.client.support.AbstractClient.bulk(org/elasticsearch/client/support/AbstractClient.java:147)", "org.elasticsearch.client.transport.TransportClient.bulk(org/elasticsearch/client/transport/TransportClient.java:360)", "org.elasticsearch.action.bulk.BulkRequestBuilder.doExecute(org/elasticsearch/action/bulk/BulkRequestBuilder.java:165)", "org.elasticsearch.action.ActionRequestBuilder.execute(org/elasticsearch/action/ActionRequestBuilder.java:85)", "org.elasticsearch.action.ActionRequestBuilder.execute(org/elasticsearch/action/ActionRequestBuilder.java:59)", "java.lang.reflect.Method.invoke(java/lang/reflect/Method.java:606)", "LogStash::Outputs::Elasticsearch::Protocols::NodeClient.bulk(/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch/protocol.rb:207)", "LogStash::Outputs::Elasticsearch::Protocols::NodeClient.bulk(/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch/protocol.rb:207)", "LogStash::Outputs::ElasticSearch.flush(/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch.rb:315)", "LogStash::Outputs::ElasticSearch.flush(/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch.rb:315)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:219)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:219)", 
"org.jruby.RubyHash.each(org/jruby/RubyHash.java:1339)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:216)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:216)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:193)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:193)", "RUBY.buffer_initialize(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:112)", "org.jruby.RubyKernel.loop(org/jruby/RubyKernel.java:1521)", "RUBY.buffer_initialize(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:110)", "java.lang.Thread.run(java/lang/Thread.java:745)"], :level=>:warn}
And a top of the server at the moment:
Code: Select all
top - 15:27:36 up 2 days, 23:38, 1 user, load average: 1.10, 1.09, 0.97
Tasks: 178 total, 1 running, 177 sleeping, 0 stopped, 0 zombie
Cpu(s): 8.9%us, 0.6%sy, 2.9%ni, 87.4%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 4056456k total, 3926828k used, 129628k free, 123948k buffers
Swap: 2047992k total, 0k used, 2047992k free, 1488220k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1233 nagios 20 0 16.8g 1.3g 132m S 152.3 34.4 2371:22 java
1317 root 39 19 4092m 621m 6400 S 2.0 15.7 901:23.68 java
1 root 20 0 19232 1132 856 S 0.0 0.0 0:01.69 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
3 root RT 0 0 0 0 S 0.0 0.0 0:00.59 migration/0
I also set vm.swappiness to 1; see this thread http://support.nagios.com/forum/viewtop ... 38&t=31343 and this link about memory: http://www.elasticsearch.org/guide/en/e ... izing.html
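For what it's worth, a sketch of how that swappiness change can be made persistent across reboots (a demo path is used so the snippet runs as-is; on the real server the file is /etc/sysctl.conf and reloading needs root):

```shell
# Sketch: persist vm.swappiness=1 so it survives a reboot.
SYSCTL_CONF=/tmp/sysctl.conf.demo   # stand-in for /etc/sysctl.conf

echo 'vm.swappiness = 1' >> "$SYSCTL_CONF"
grep swappiness "$SYSCTL_CONF"
# On the real system (as root):
#   sysctl -p                      # reload /etc/sysctl.conf
#   cat /proc/sys/vm/swappiness    # verify the live value
```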
It seems my NLS cluster is also having issues (which I did not see before); check the screenshot.
Please advise how to continue making my NLS stable.
Thanks.
Willem
Last edited by WillemDH on Wed Feb 18, 2015 3:14 am, edited 1 time in total.
Nagios XI 5.8.1
https://outsideit.net
scottwilkerson
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
- Contact:
Re: NLS stopped working
this is definitely the problem:
[refresh failed][OutOfMemoryError[Java heap space]]]
First, let's get the cluster back online. I would like you to restart elasticsearch on each instance:
Code: Select all
service elasticsearch restart
Then, a couple of things. First off, depending on how much data you are pushing, 4 GB may not be nearly enough memory. More importantly, if you have not done so, change the HEAP size to use 1/2 the available memory on each instance in the cluster by following this post:
http://support.nagios.com/forum/viewtop ... 32#p120532
Then you will need to restart ES again after making changes.
Re: NLS stopped working
Scott,
I expanded the RAM from 4 to 8 GB on both nodes. Then I did
Code: Select all
export ES_HEAP_SIZE=4g
Then I restarted the elasticsearch service.
So when I do a printenv on both nodes, I can see the new ES_HEAP_SIZE setting:
Code: Select all
printenv ES_HEAP_SIZE
4g
I'll do some more tests and will let you know if things are more stable now.
Grtz
Willem
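One caveat on the `export` approach: an exported variable only lives in that shell session, and a service started from an init script at boot will not see it. On a RHEL/CentOS-style install the usual place to persist it is the service's sysconfig file (the exact path for this NLS install is an assumption; a demo path is used so the snippet runs as-is):

```shell
# Sketch: persist ES_HEAP_SIZE so the init-started service sees it after reboot.
SYSCONFIG=/tmp/sysconfig-elasticsearch.demo   # stand-in for /etc/sysconfig/elasticsearch

echo 'ES_HEAP_SIZE=4g' >> "$SYSCONFIG"
grep ES_HEAP_SIZE "$SYSCONFIG"
# service elasticsearch restart   # then restart so the JVM picks up the new heap
```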
Nagios XI 5.8.1
https://outsideit.net
Re: NLS stopped working
Scott,
Dashboards up to a day seem to load quickly and without issue. The problem is with dashboards with a time period of more than 7 days. Loading still takes 5+ minutes. (EDIT 1 => 30+ minutes and still loading)
I can understand that dashboards over longer periods can take long to generate and need a lot of system resources, but the big problem is that when I click any other tab, e.g. home or administration, or just another dashboard, the website freezes. It seems NLS is unable to properly stop dashlets that are still loading. The thing that makes it even worse is that the only solution seems to be a restart of the elasticsearch services from the CLI, as during such a freeze it is not even possible to log into NLS from another web page.
When I allow people other than myself to start using NLS, this is 100% certain to cause problems, as:
1) they will try out dashboards with a longer time period, e.g. 30 days (or more complex queries)
2) they will click on other tabs/dashboards or change settings while dashlets are still loading, as they will think something is wrong and try other settings
3) the website will freeze and they will just stop using it and complain
4) NLS stops processing logs, which could be a huge risk if logs from during such an issue are needed
EDIT 2 => As far as I have seen, it is always the histograms that keep loading and can't be killed.
EDIT 3 => As the dashboard I was talking about in EDIT 2 was still loading after 40 minutes, I had to restart the elasticsearch service again. After the restart on node 1, I went to cluster status and it seemed node 2 was not even available, see attached screenshot.
EDIT 4 => Attached another screenshot, where it is clear we have a gap of 40 minutes caused by the hanging dashlet in a dashboard with a time period of 30 days.
The above situation must be reproducible in your setup? Or is there some other setting that enables NLS to just kill loading dashlets instead of letting the complete NLS freeze when clicking another tab/dashboard while dashlets are still loading?
EDIT 5 => Bad news. I just had the same issue as above with a dashboard with a simple query and a time period of only 2 days. Checking the elasticsearch logs points to out-of-memory exceptions again...
Please advise how to continue. We really need some sort of fix for this issue, even if we are not able to expand the RAM to 16, 32 or 64 GB. We have no problem with dashboards loading slowly, but website freezes and log processing stopping are a major issue.
Grtz
Willem
Dashboards up to a day seem to load quickly and without issue. The problem is with dashboards with a timeperiod of more then 7 days. Loading still takes 5+ minutes. (EDIT 1 => 30+ minutes and still loading)
I can understand loading longer period dashboards can take long to generate and need a lot of system resources, but hte big problem is that when I click any other tab, eg home or administration, or just another dashboard, the website freezes. It seems the NLS is unable to properly stop dashlets which are still loading. The things that makes it even worse is that the only solution seems to be a restart of elasticsearch services from cli, as during such a freeze it is not even possible to log into NLS from another webpage.
When I will admit other people but me to start using NLS, this will 100 % certain cause problems, as
1) they will try out dashboards with a longer timeperiod, eg 30 days (or complexer queries)
2) they will click on other tabs/dashboards or change settings while dashlets are still loading, as they think somthing is wrong and will try other settings
3) the website will freeze and they will just stop using it and complain
4) NLS stops processing logs which could be a huge risk when logs during such an issue are needed
EDIT 2 => As far as I have seen it are always the histograms that keep loading and can't be killed.
EDIT 3 => As the dashboard of which I was talking in EDIT 2 was still losing after 40 minutes, I had to restart elasticsearch service again. After restart on node 1, I went to cluster status and it seemed node 2 was not even available, see attached screenshot.
EDIT 4 => Attached another screenshot, where it is clear we have a gap of 40 minutes caused by the hanging dashlet in a dashboard with a timeperiod of 30 days.
The above situation must be reproducable in your setup? Or is there some other setting which enables NLS to just kill loading dashlets instead of making the complete NLS freeze when clicking another tab/dashboard while dashlets are still loading?
EDIT 5 => Bad news, Just had the same issue as above with a dashboard with a simple query and a timeperiod of only 2 days. Checking elasticsearch logs points to outofmemory exceptions again...
Code: Select all
[2015-02-18 10:38:13,561][DEBUG][action.admin.cluster.node.stats] [95f9ab14-da22-4144-bb0b-6bbc5662115c] failed to execute on node [141qYQb1SqivehW4r14RcQ]
java.lang.OutOfMemoryError: Java heap space
[2015-02-18 10:38:13,561][DEBUG][action.search.type ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] [nagioslogserver][0], node[141qYQb1SqivehW4r14RcQ], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@79e8468a]
org.elasticsearch.ElasticsearchException: Java heap space
at org.elasticsearch.ExceptionsHelper.convertToRuntime(ExceptionsHelper.java:41)
at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:352)
at org.elasticsearch.search.action.SearchServiceTransportAction$11.call(SearchServiceTransportAction.java:308)
at org.elasticsearch.search.action.SearchServiceTransportAction$11.call(SearchServiceTransportAction.java:305)
at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:517)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Java heap space
[2015-02-18 10:38:54,649][DEBUG][action.search.type ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] All shards failed for phase: [query_fetch]
[2015-02-18 10:38:20,436][DEBUG][action.get ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] [nagioslogserver][0]: failed to execute [get [nagioslogserver][cf_option][license_key]: routing [null]]
java.lang.OutOfMemoryError: Java heap space
[2015-02-18 10:38:18,052][WARN ][index.engine.internal ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] [logstash-2015.02.18][3] failed engine [out of memory]
[2015-02-18 10:41:04,218][WARN ][index.engine.internal ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] [logstash-2015.02.18][0] failed engine [refresh failed]
[2015-02-18 10:41:27,950][DEBUG][action.admin.cluster.node.stats] [95f9ab14-da22-4144-bb0b-6bbc5662115c] failed to execute on node [141qYQb1SqivehW4r14RcQ]
java.lang.OutOfMemoryError: Java heap space
Grtz
Willem
scottwilkerson
Re: NLS stopped working
Willemdh,
Because this is run as a service, simply exporting the variable will not be enough; you need to add it to /etc/sysconfig/elasticsearch:
Edit /etc/sysconfig/elasticsearch
Uncomment ES_HEAP_SIZE=1g
and change it to
Code: Select all
ES_HEAP_SIZE=4g
Also, to get a gauge: what is the average size of your daily indexes?
Remind me, how many instances do you have set up?
What are the drive specs? SSD or spinning? If spinning, what RPM?
Re: NLS stopped working
Scott,
Ok, I edited /etc/sysconfig/elasticsearch, restarted elasticsearch on both nodes, and applied the configuration.
I'll do some more tests tomorrow to check whether it is more stable.
About my hardware: I have two NLS servers, both on 2015R1.3. I added two screenshots with more info on the indexes; it seems the average daily index size is ~2.4 GB.
CPU:
6 in each NLS
Disks: (Spinning AST)
Nearline SAS RAID 5 (7200rpm) + FC RAID 6 (15000rpm)
Memory:
8 GB in each NLS
Grtz
Willem
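(Editor's aside: a back-of-envelope check using the ~2.4 GB/day figure from the post above shows why a 30-day dashboard is so much heavier than a 1-day one. The numbers are illustrative, not measurements from this cluster:)

```shell
# Approximate index data a dashboard query must cover per time range,
# assuming ~2.4 GB of new index data per day (figure from the post above).
daily_gb=2.4
for days in 1 7 30; do
    awk -v d="$daily_gb" -v n="$days" \
        'BEGIN { printf "%2d day(s) -> ~%.0f GB of index data\n", n, d * n }'
done
```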
scottwilkerson
Re: NLS stopped working
You should also be able to see the Heap Committed on the Instance Status page for each node in the JVM section
Re: NLS stopped working
Hey Scott,
It seems more stable since changing ES_HEAP_SIZE in /etc/sysconfig/elasticsearch.
I was able to load 30-day histograms in about 10 seconds (better than 20+ minutes...) and have not seen any website freezes for two days.
Do you mean this (screenshot attached)? I don't really see a 'JVM' section though.
Grtz
Willem
scottwilkerson
Re: NLS stopped working
WillemDH wrote: I don't really see a 'JVM' section though?
If you click on the IP address of one of your instances in the list, you will get a whole page full of statistics.
Glad to hear it is loading much faster!
Scott