NLS stopped working

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises
Contact:

Re: NLS stopped working

Post by scottwilkerson »

Awesome, definitely let us know if the current version helps...
Former Nagios employee
Creator:
Human Design Website
Get Your Human Design Chart
WillemDH
Posts: 2320
Joined: Wed Mar 20, 2013 5:49 am
Location: Ghent
Contact:

Re: NLS stopped working

Post by WillemDH »

Scott,

It seems the update did not help after all.. :(
This morning NLS opened fine. Five minutes ago, however, I logged in, went to dashboards, and loading took a little long on my home dashboard, which is just a * query. So I went to a different dashboard and the GUI seemed to freeze again. I've been waiting for 5+ minutes now, so I guess the only thing left to do is restart the elasticsearch service again..

This is an extract of the elasticsearch log:

Code: Select all

[2015-02-17 15:24:31,476][DEBUG][action.search.type       ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] [logstash-2015.02.14][3], node[8a2YUZmdT5asS6nywulupg], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@7b456191] lastShard [true]
org.elasticsearch.search.query.QueryPhaseExecutionException: [logstash-2015.02.14][3]: query[ConstantScore(*:*)],from[0],size[0]: Query Failed [Failed to execute main query]
        at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:162)
        at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:261)
        at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:206)
        at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:203)
        at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:517)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Java heap space
[2015-02-17 15:24:32,597][WARN ][cluster.action.shard     ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] [logstash-2015.02.17][2] received shard failed for [logstash-2015.02.17][2], node[8a2YUZmdT5asS6nywulupg], [R], s[STARTED], indexUUID [eqiYscAiRg26DF8AZxBI1A], reason [Failed to perform [bulk/shard] on replica, message [NodeDisconnectedException[[c4d16075-9bc2-4095-9f00-e7de7f96930c][inet[/10.54.24.140:9300]][bulk/shard/replica] disconnected]]]
[2015-02-17 15:24:35,466][WARN ][cluster.action.shard     ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] [logstash-2015.02.17][0] received shard failed for [logstash-2015.02.17][0], node[dBHw3nRDTQeDUGajtxsAkg], [P], s[STARTED], indexUUID [eqiYscAiRg26DF8AZxBI1A], reason [engine failure, message [out of memory][OutOfMemoryError[Java heap space]]]
[2015-02-17 15:24:36,599][WARN ][cluster.action.shard     ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] [logstash-2015.02.17][2] received shard failed for [logstash-2015.02.17][2], node[8a2YUZmdT5asS6nywulupg], [R], s[STARTED], indexUUID [eqiYscAiRg26DF8AZxBI1A], reason [Failed to perform [bulk/shard] on replica, message [NodeDisconnectedException[[c4d16075-9bc2-4095-9f00-e7de7f96930c][inet[/10.54.24.140:9300]][bulk/shard/replica] disconnected]]]
[2015-02-17 15:24:38,155][WARN ][cluster.action.shard     ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] [logstash-2015.02.17][0] received shard failed for [logstash-2015.02.17][0], node[8a2YUZmdT5asS6nywulupg], [R], s[STARTED], indexUUID [eqiYscAiRg26DF8AZxBI1A], reason [Failed to perform [bulk/shard] on replica, message [NodeDisconnectedException[[c4d16075-9bc2-4095-9f00-e7de7f96930c][inet[/10.54.24.140:9300]][bulk/shard/replica] disconnected]]]
[2015-02-17 15:25:00,646][WARN ][search.action            ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] Failed to send release search context
org.elasticsearch.transport.SendRequestTransportException: [95f9ab14-da22-4144-bb0b-6bbc5662115c][inet[/10.54.24.141:9300]][search/freeContext]
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:220)
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:190)
        at org.elasticsearch.search.action.SearchServiceTransportAction.sendFreeContext(SearchServiceTransportAction.java:125)
        at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.releaseIrrelevantSearchContexts(TransportSearchTypeAction.java:348)
        at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.finishHim(TransportSearchQueryThenFetchAction.java:147)
        at org.elasticsearch.action.search.type.TransportSearchQueryThenFetchAction$AsyncAction.moveToSecondPhase(TransportSearchQueryThenFetchAction.java:90)
        at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.innerMoveToSecondPhase(TransportSearchTypeAction.java:404)
        at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.onFirstPhaseResult(TransportSearchTypeAction.java:198)
        at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$1.onResult(TransportSearchTypeAction.java:174)
        at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$1.onResult(TransportSearchTypeAction.java:171)
        at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:526)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [95f9ab14-da22-4144-bb0b-6bbc5662115c][inet[/10.54.24.141:9300]] Node not connected
        at org.elasticsearch.transport.netty.NettyTransport.nodeChannel(NettyTransport.java:874)
        at org.elasticsearch.transport.netty.NettyTransport.sendRequest(NettyTransport.java:556)
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:206)
        ... 13 more
[2015-02-17 15:25:02,133][WARN ][cluster.action.shard     ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] [logstash-2015.02.17][2] received shard failed for [logstash-2015.02.17][2], node[dBHw3nRDTQeDUGajtxsAkg], [P], s[STARTED], indexUUID [eqiYscAiRg26DF8AZxBI1A], reason [engine failure, message [refresh failed][OutOfMemoryError[Java heap space]]]
[2015-02-17 15:25:26,789][WARN ][transport.netty          ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] Message not fully read (request) for [287020] and action [bulk], resetting
[2015-02-17 15:25:33,573][WARN ][cluster.action.shard     ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] [logstash-2015.02.17][1] received shard failed for [logstash-2015.02.17][1], node[8a2YUZmdT5asS6nywulupg], [R], s[STARTED], indexUUID [eqiYscAiRg26DF8AZxBI1A], reason [Failed to perform [bulk/shard] on replica, message [SendRequestTransportException[[c4d16075-9bc2-4095-9f00-e7de7f96930c][inet[/10.54.24.140:9300]][bulk/shard/replica]]; nested: NodeNotConnectedException[[c4d16075-9bc2-4095-9f00-e7de7f96930c][inet[/10.54.24.140:9300]] Node not connected]; ]]
[root@srvnaglog01 ~]# tail -100 /var/log/elasticsearch/ee9e60a0-f4cb-41ec-a97f-8f17434b748e.log
[2015-02-17 15:22:56,766][WARN ][cluster.action.shard     ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] [nagioslogserver][0] sending failed shard for [nagioslogserver][0], node[dBHw3nRDTQeDUGajtxsAkg], [R], s[STARTED], indexUUID [YbgZhXrHRzqYGUCT-9_q5Q], reason [Failed to perform [index] on replica, message [NodeDisconnectedException[[95f9ab14-da22-4144-bb0b-6bbc5662115c][inet[/10.54.24.141:9300]][index/replica] disconnected]]]
[2015-02-17 15:22:56,767][WARN ][cluster.action.shard     ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] [nagioslogserver][0] received shard failed for [nagioslogserver][0], node[dBHw3nRDTQeDUGajtxsAkg], [R], s[STARTED], indexUUID [YbgZhXrHRzqYGUCT-9_q5Q], reason [Failed to perform [index] on replica, message [NodeDisconnectedException[[95f9ab14-da22-4144-bb0b-6bbc5662115c][inet[/10.54.24.141:9300]][index/replica] disconnected]]]
[2015-02-17 15:22:56,766][DEBUG][action.bulk              ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] observer timed out. notifying listener. timeout setting [1m], time since start [4.4m]
[2015-02-17 15:22:56,767][DEBUG][action.bulk              ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] observer timed out. notifying listener. timeout setting [1m], time since start [4.5m]
[2015-02-17 15:22:56,766][DEBUG][action.bulk              ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] observer timed out. notifying listener. timeout setting [1m], time since start [4.3m]
[2015-02-17 15:22:56,771][DEBUG][action.admin.indices.status] [c4d16075-9bc2-4095-9f00-e7de7f96930c] [nagioslogserver][0], node[dBHw3nRDTQeDUGajtxsAkg], [R], s[STARTED]: failed to executed [org.elasticsearch.action.admin.indices.status.IndicesStatusRequest@39b94e07]
org.elasticsearch.transport.NodeDisconnectedException: [95f9ab14-da22-4144-bb0b-6bbc5662115c][inet[/10.54.24.141:9300]][indices/status/s] disconnected
[2015-02-17 15:22:56,767][WARN ][action.index             ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] Failed to perform index on remote replica [95f9ab14-da22-4144-bb0b-6bbc5662115c][dBHw3nRDTQeDUGajtxsAkg][srvnaglog02][inet[/10.54.24.141:9300]]{max_local_storage_nodes=1}[nagioslogserver][0]
org.elasticsearch.transport.NodeDisconnectedException: [95f9ab14-da22-4144-bb0b-6bbc5662115c][inet[/10.54.24.141:9300]][index/replica] disconnected
[2015-02-17 15:22:56,788][WARN ][cluster.action.shard     ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] [nagioslogserver][0] sending failed shard for [nagioslogserver][0], node[dBHw3nRDTQeDUGajtxsAkg], [R], s[STARTED], indexUUID [YbgZhXrHRzqYGUCT-9_q5Q], reason [Failed to perform [index] on replica, message [NodeDisconnectedException[[95f9ab14-da22-4144-bb0b-6bbc5662115c][inet[/10.54.24.141:9300]][index/replica] disconnected]]]
[2015-02-17 15:22:56,820][WARN ][cluster.action.shard     ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] [nagioslogserver][0] received shard failed for [nagioslogserver][0], node[dBHw3nRDTQeDUGajtxsAkg], [R], s[STARTED], indexUUID [YbgZhXrHRzqYGUCT-9_q5Q], reason [Failed to perform [index] on replica, message [NodeDisconnectedException[[95f9ab14-da22-4144-bb0b-6bbc5662115c][inet[/10.54.24.141:9300]][index/replica] disconnected]]]
[2015-02-17 15:22:56,916][DEBUG][action.search.type       ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] [logstash-2015.02.11][3], node[dBHw3nRDTQeDUGajtxsAkg], [R], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@64953665] lastShard [true]
org.elasticsearch.transport.NodeDisconnectedException: [95f9ab14-da22-4144-bb0b-6bbc5662115c][inet[/10.54.24.141:9300]][search/phase/query] disconnected
[2015-02-17 15:22:56,916][DEBUG][action.bulk              ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] observer timed out. notifying listener. timeout setting [1m], time since start [4.2m]
[2015-02-17 15:22:56,944][DEBUG][action.bulk              ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] observer timed out. notifying listener. timeout setting [1m], time since start [4.2m]
[2015-02-17 15:22:56,916][DEBUG][action.bulk              ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] observer timed out. notifying listener. timeout setting [1m], time since start [4.2m]
[2015-02-17 15:22:56,916][DEBUG][action.search.type       ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] [logstash-2015.02.12][2], node[dBHw3nRDTQeDUGajtxsAkg], [R], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@64953665] lastShard [true]
org.elasticsearch.transport.NodeDisconnectedException: [95f9ab14-da22-4144-bb0b-6bbc5662115c][inet[/10.54.24.141:9300]][search/phase/query] disconnected
[2015-02-17 15:22:56,916][DEBUG][action.bulk              ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] observer timed out. notifying listener. timeout setting [1m], time since start [4.2m]
[2015-02-17 15:22:56,979][DEBUG][action.bulk              ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] observer timed out. notifying listener. timeout setting [1m], time since start [4.2m]
[2015-02-17 15:23:57,548][WARN ][cluster.action.shard     ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] [logstash-2015.02.17][2] received shard failed for [logstash-2015.02.17][2], node[8a2YUZmdT5asS6nywulupg], [R], s[STARTED], indexUUID [eqiYscAiRg26DF8AZxBI1A], reason [Failed to perform [bulk/shard] on replica, message [NodeDisconnectedException[[c4d16075-9bc2-4095-9f00-e7de7f96930c][inet[/10.54.24.140:9300]][bulk/shard/replica] disconnected]]]
[2015-02-17 15:24:03,804][DEBUG][action.search.type       ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] [logstash-2015.02.17][1], node[8a2YUZmdT5asS6nywulupg], [R], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@64953665] lastShard [true]
org.elasticsearch.search.query.QueryPhaseExecutionException: [logstash-2015.02.17][1]: query[ConstantScore(*:*)],from[0],size[0]: Query Failed [Failed to execute main query]
        at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:162)
        at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:261)
        at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:206)
        at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:203)
        at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:517)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.ElasticsearchException: java.lang.OutOfMemoryError: Java heap space
        at org.elasticsearch.index.fielddata.AbstractIndexFieldData.load(AbstractIndexFieldData.java:79)
        at org.elasticsearch.index.fielddata.plain.AbstractBytesIndexFieldData.load(AbstractBytesIndexFieldData.java:41)
        at org.elasticsearch.search.facet.terms.strings.TermsStringOrdinalsFacetExecutor$Collector.setNextReader(TermsStringOrdinalsFacetExecutor.java:214)
        at org.elasticsearch.common.lucene.search.FilteredCollector.setNextReader(FilteredCollector.java:67)
        at org.elasticsearch.common.lucene.MultiCollector.setNextReader(MultiCollector.java:68)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:612)
        at org.elasticsearch.search.internal.ContextIndexSearcher.search(ContextIndexSearcher.java:175)
        at org.apache.lucene.search.IndexSearcher.search(IndexSearcher.java:309)
        at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:116)
        ... 7 more
Caused by: org.elasticsearch.common.util.concurrent.ExecutionError: java.lang.OutOfMemoryError: Java heap space
        at org.elasticsearch.common.cache.LocalCache$Segment.get(LocalCache.java:2199)
        at org.elasticsearch.common.cache.LocalCache.get(LocalCache.java:3934)
        at org.elasticsearch.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4736)
        at org.elasticsearch.indices.fielddata.cache.IndicesFieldDataCache$IndexFieldCache.load(IndicesFieldDataCache.java:154)
        at org.elasticsearch.index.fielddata.AbstractIndexFieldData.load(AbstractIndexFieldData.java:73)
        ... 15 more
Caused by: java.lang.OutOfMemoryError: Java heap space
[2015-02-17 15:24:17,262][DEBUG][action.search.type       ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] [logstash-2015.02.13][1], node[8a2YUZmdT5asS6nywulupg], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@7b456191] lastShard [true]
org.elasticsearch.search.query.QueryPhaseExecutionException: [logstash-2015.02.13][1]: query[ConstantScore(*:*)],from[0],size[0]: Query Failed [Failed to execute main query]
        at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:162)
        at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:261)
        at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:206)
        at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:203)
        at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:517)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Java heap space
[2015-02-17 15:24:31,476][DEBUG][action.search.type       ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] [logstash-2015.02.14][3], node[8a2YUZmdT5asS6nywulupg], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@7b456191] lastShard [true]
org.elasticsearch.search.query.QueryPhaseExecutionException: [logstash-2015.02.14][3]: query[ConstantScore(*:*)],from[0],size[0]: Query Failed [Failed to execute main query]
        at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:162)
        at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:261)
        at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:206)
        at org.elasticsearch.search.action.SearchServiceTransportAction$5.call(SearchServiceTransportAction.java:203)
        at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:517)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Java heap space
And an extract of the logstash log:

Code: Select all

tail -50 /var/log/logstash/logstash.log
{:timestamp=>"2015-02-14T15:49:56.419000+0100", :message=>"Using milestone 2 input plugin 'tcp'. This plugin should be stable, but if you see strange behavior, please let us know! For more information on plugin milestones, see http://logstash.net/docs/1.4.2/plugin-milestones", :level=>:warn}
{:timestamp=>"2015-02-14T15:49:56.658000+0100", :message=>"Using milestone 1 input plugin 'syslog'. This plugin should work, but would benefit from use by folks like you. Please let us know if you find bugs or have suggestions on how to improve this plugin.  For more information on plugin milestones, see http://logstash.net/docs/1.4.2/plugin-milestones", :level=>:warn}
{:timestamp=>"2015-02-17T15:25:27.063000+0100", :message=>"Failed to flush outgoing items", :outgoing_count=>5000, :exception=>org.elasticsearch.common.util.concurrent.UncategorizedExecutionException: Failed execution, :backtrace=>["org.elasticsearch.action.support.AdapterActionFuture.rethrowExecutionException(org/elasticsearch/action/support/AdapterActionFuture.java:90)", "org.elasticsearch.action.support.AdapterActionFuture.actionGet(org/elasticsearch/action/support/AdapterActionFuture.java:50)", "java.lang.reflect.Method.invoke(java/lang/reflect/Method.java:606)", "LogStash::Outputs::Elasticsearch::Protocols::NodeClient.bulk(/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch/protocol.rb:207)", "LogStash::Outputs::Elasticsearch::Protocols::NodeClient.bulk(/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch/protocol.rb:207)", "LogStash::Outputs::ElasticSearch.flush(/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch.rb:315)", "LogStash::Outputs::ElasticSearch.flush(/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch.rb:315)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:219)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:219)", "org.jruby.RubyHash.each(org/jruby/RubyHash.java:1339)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:216)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:216)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:193)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:193)", 
"Stud::Buffer.buffer_receive(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:159)", "Stud::Buffer.buffer_receive(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:159)", "LogStash::Outputs::ElasticSearch.receive(/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch.rb:311)", "LogStash::Outputs::ElasticSearch.receive(/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch.rb:311)", "LogStash::Outputs::Base.handle(/usr/local/nagioslogserver/logstash/lib/logstash/outputs/base.rb:86)", "LogStash::Outputs::Base.handle(/usr/local/nagioslogserver/logstash/lib/logstash/outputs/base.rb:86)", "RUBY.worker_setup(/usr/local/nagioslogserver/logstash/lib/logstash/outputs/base.rb:78)", "java.lang.Thread.run(java/lang/Thread.java:745)"], :level=>:warn}
and

Code: Select all

{:timestamp=>"2015-02-17T15:39:30.169000+0100", :message=>"Failed to flush outgoing items", :outgoing_count=>2, :exception=>org.elasticsearch.client.transport.NoNodeAvailableException: No node available, :backtrace=>["org.elasticsearch.client.transport.TransportClientNodesService.execute(org/elasticsearch/client/transport/TransportClientNodesService.java:219)", "org.elasticsearch.client.transport.support.InternalTransportClient.execute(org/elasticsearch/client/transport/support/InternalTransportClient.java:106)", "org.elasticsearch.client.support.AbstractClient.bulk(org/elasticsearch/client/support/AbstractClient.java:147)", "org.elasticsearch.client.transport.TransportClient.bulk(org/elasticsearch/client/transport/TransportClient.java:360)", "org.elasticsearch.action.bulk.BulkRequestBuilder.doExecute(org/elasticsearch/action/bulk/BulkRequestBuilder.java:165)", "org.elasticsearch.action.ActionRequestBuilder.execute(org/elasticsearch/action/ActionRequestBuilder.java:85)", "org.elasticsearch.action.ActionRequestBuilder.execute(org/elasticsearch/action/ActionRequestBuilder.java:59)", "java.lang.reflect.Method.invoke(java/lang/reflect/Method.java:606)", "LogStash::Outputs::Elasticsearch::Protocols::NodeClient.bulk(/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch/protocol.rb:207)", "LogStash::Outputs::Elasticsearch::Protocols::NodeClient.bulk(/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch/protocol.rb:207)", "LogStash::Outputs::ElasticSearch.flush(/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch.rb:315)", "LogStash::Outputs::ElasticSearch.flush(/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch.rb:315)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:219)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:219)", 
"org.jruby.RubyHash.each(org/jruby/RubyHash.java:1339)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:216)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:216)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:193)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:193)", "RUBY.buffer_initialize(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:112)", "org.jruby.RubyKernel.loop(org/jruby/RubyKernel.java:1521)", "RUBY.buffer_initialize(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:110)", "java.lang.Thread.run(java/lang/Thread.java:745)"], :level=>:warn}
{:timestamp=>"2015-02-17T15:39:31.172000+0100", :message=>"Failed to flush outgoing items", :outgoing_count=>2, :exception=>org.elasticsearch.client.transport.NoNodeAvailableException: No node available, :backtrace=>["org.elasticsearch.client.transport.TransportClientNodesService.execute(org/elasticsearch/client/transport/TransportClientNodesService.java:219)", "org.elasticsearch.client.transport.support.InternalTransportClient.execute(org/elasticsearch/client/transport/support/InternalTransportClient.java:106)", "org.elasticsearch.client.support.AbstractClient.bulk(org/elasticsearch/client/support/AbstractClient.java:147)", "org.elasticsearch.client.transport.TransportClient.bulk(org/elasticsearch/client/transport/TransportClient.java:360)", "org.elasticsearch.action.bulk.BulkRequestBuilder.doExecute(org/elasticsearch/action/bulk/BulkRequestBuilder.java:165)", "org.elasticsearch.action.ActionRequestBuilder.execute(org/elasticsearch/action/ActionRequestBuilder.java:85)", "org.elasticsearch.action.ActionRequestBuilder.execute(org/elasticsearch/action/ActionRequestBuilder.java:59)", "java.lang.reflect.Method.invoke(java/lang/reflect/Method.java:606)", "LogStash::Outputs::Elasticsearch::Protocols::NodeClient.bulk(/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch/protocol.rb:207)", "LogStash::Outputs::Elasticsearch::Protocols::NodeClient.bulk(/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch/protocol.rb:207)", "LogStash::Outputs::ElasticSearch.flush(/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch.rb:315)", "LogStash::Outputs::ElasticSearch.flush(/usr/local/nagioslogserver/logstash/lib/logstash/outputs/elasticsearch.rb:315)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:219)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:219)", 
"org.jruby.RubyHash.each(org/jruby/RubyHash.java:1339)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:216)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:216)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:193)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:193)", "RUBY.buffer_initialize(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:112)", "org.jruby.RubyKernel.loop(org/jruby/RubyKernel.java:1521)", "RUBY.buffer_initialize(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:110)", "java.lang.Thread.run(java/lang/Thread.java:745)"], :level=>:warn}
I see things like [refresh failed][OutOfMemoryError[Java heap space]]]. This is the output of top on the server at the moment:

Code: Select all

top - 15:27:36 up 2 days, 23:38,  1 user,  load average: 1.10, 1.09, 0.97
Tasks: 178 total,   1 running, 177 sleeping,   0 stopped,   0 zombie
Cpu(s):  8.9%us,  0.6%sy,  2.9%ni, 87.4%id,  0.1%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   4056456k total,  3926828k used,   129628k free,   123948k buffers
Swap:  2047992k total,        0k used,  2047992k free,  1488220k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1233 nagios    20   0 16.8g 1.3g 132m S 152.3 34.4   2371:22 java
 1317 root      39  19 4092m 621m 6400 S  2.0 15.7 901:23.68 java
    1 root      20   0 19232 1132  856 S  0.0  0.0   0:01.69 init
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd
    3 root      RT   0     0    0    0 S  0.0  0.0   0:00.59 migration/0
Why would I suddenly be out of memory? I expanded memory from 2 GB to 4 GB last week...
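For what it's worth, one quick way to confirm which heap limit the elasticsearch JVM is actually running with (the pgrep pattern below is an assumption based on the usual ES 1.x bootstrap class name, and may differ on your install) is something like:

```shell
# Show only the heap flags (-Xms/-Xmx) from the elasticsearch java
# process command line. The pgrep pattern is an assumption.
ps -o args= -p "$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch | head -1)" \
  | tr ' ' '\n' \
  | grep -E '^-Xm[sx]'
```

If the printed -Xmx does not match what you configured, the service never picked the setting up.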

I also set vm.swappiness to 1 (see this thread http://support.nagios.com/forum/viewtop ... 38&t=31343 and this link about memory http://www.elasticsearch.org/guide/en/e ... izing.html).
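For reference, the way I made the swappiness change was roughly as follows (the sysctl.conf path is the stock RHEL/CentOS location; adjust if yours differs):

```shell
# Apply immediately without a reboot
sysctl -w vm.swappiness=1

# Persist the setting across reboots
echo 'vm.swappiness = 1' >> /etc/sysctl.conf
```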

It seems my NLS cluster is also having issues (which I had not seen before); check the attached screenshot.

Please advise on how to continue making my NLS stable..

Thanks.

Willem
Last edited by WillemDH on Wed Feb 18, 2015 3:14 am, edited 1 time in total.
Nagios XI 5.8.1
https://outsideit.net

Re: NLS stopped working

Post by scottwilkerson »

This is definitely the problem:
[refresh failed][OutOfMemoryError[Java heap space]]]
First, let's get the cluster back online. I would like you to restart elasticsearch on each instance:

Code: Select all

service elasticsearch restart
Then a couple of things. First, depending on how much data you are pushing, 4GB may not be nearly enough memory. More importantly, if you have not done so, change the HEAP size to use 1/2 the available memory on each instance in the cluster by following this post:
http://support.nagios.com/forum/viewtop ... 32#p120532

Then you will need to restart ES again after making changes.
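As a sketch of that sizing step (the sysconfig path below is an assumption and may differ on a Nagios Log Server install; the linked post has the authoritative steps), computing half of RAM for the heap could look like:

```shell
# Compute half of total RAM in megabytes from /proc/meminfo.
half_mb=$(awk '/^MemTotal:/ {print int($2 / 2 / 1024)}' /proc/meminfo)
echo "ES_HEAP_SIZE=${half_mb}m"

# Persist it where the init script can pick it up (path is an assumption),
# then restart elasticsearch so the new heap takes effect:
#   echo "export ES_HEAP_SIZE=${half_mb}m" >> /etc/sysconfig/elasticsearch
#   service elasticsearch restart
```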

Re: NLS stopped working

Post by WillemDH »

Scott,

I expanded the RAM from 4 GB to 8 GB on both nodes. Then I did:

Code: Select all

export ES_HEAP_SIZE=4g
Then I restarted the elasticsearch service.
When I do a printenv on both nodes, I can see the new ES_HEAP_SIZE setting:

Code: Select all

 printenv ES_HEAP_SIZE
4g
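One thing worth double-checking is whether the running JVM actually picked the value up, since an exported variable only affects processes started from that shell. Querying the node stats API (assuming the default HTTP port 9200; adjust if your install differs) should show the effective heap limit:

```shell
# Ask elasticsearch for its JVM heap maximum; 9200 is the default HTTP port.
curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty' | grep heap_max
```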
I'll do some more tests and will let you know if things are more stable now.

Grtz

Willem

Re: NLS stopped working

Post by WillemDH »

Scott,

Dashboards up to a day seem to load quickly and without issue. The problem is with dashboards with a time period of more than 7 days. Loading still takes 5+ minutes. (EDIT 1 => 30+ minutes and still loading)

I can understand that loading dashboards for longer periods can take a long time and needs a lot of system resources, but the big problem is that when I click any other tab, e.g. home or administration, or just another dashboard, the website freezes. It seems NLS is unable to properly stop dashlets which are still loading. The thing that makes it even worse is that the only solution seems to be a restart of the elasticsearch services from the CLI, as during such a freeze it is not even possible to log into NLS from another web page.


When I allow people other than myself to start using NLS, this will almost certainly cause problems, because:
1) they will try out dashboards with a longer time period, e.g. 30 days (or more complex queries)
2) they will click on other tabs/dashboards or change settings while dashlets are still loading, as they think something is wrong and will try other settings
3) the website will freeze and they will just stop using it and complain
4) NLS stops processing logs, which could be a huge risk when the logs from such an incident are needed

EDIT 2 => As far as I have seen, it is always the histograms that keep loading and can't be killed.

EDIT 3 => As the dashboard I was talking about in EDIT 2 was still loading after 40 minutes, I had to restart the elasticsearch service again. After the restart on node 1, I went to cluster status and it seemed node 2 was not even available; see attached screenshot.

EDIT 4 => Attached another screenshot, where it is clear we have a gap of 40 minutes caused by the hanging dashlet in a dashboard with a time period of 30 days.

The above situation should be reproducible in your setup, no? Or is there some other setting that enables NLS to just kill loading dashlets, instead of making the complete NLS freeze when clicking another tab/dashboard while dashlets are still loading?

EDIT 5 => Bad news: just had the same issue as above with a dashboard with a simple query and a time period of only 2 days. Checking the elasticsearch logs points to OutOfMemory exceptions again...

Code: Select all

[2015-02-18 10:38:13,561][DEBUG][action.admin.cluster.node.stats] [95f9ab14-da22-4144-bb0b-6bbc5662115c] failed to execute on node [141qYQb1SqivehW4r14RcQ]
java.lang.OutOfMemoryError: Java heap space
[2015-02-18 10:38:13,561][DEBUG][action.search.type       ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] [nagioslogserver][0], node[141qYQb1SqivehW4r14RcQ], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.search.SearchRequest@79e8468a]
org.elasticsearch.ElasticsearchException: Java heap space
        at org.elasticsearch.ExceptionsHelper.convertToRuntime(ExceptionsHelper.java:41)
        at org.elasticsearch.search.SearchService.executeFetchPhase(SearchService.java:352)
        at org.elasticsearch.search.action.SearchServiceTransportAction$11.call(SearchServiceTransportAction.java:308)
        at org.elasticsearch.search.action.SearchServiceTransportAction$11.call(SearchServiceTransportAction.java:305)
        at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:517)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.OutOfMemoryError: Java heap space
[2015-02-18 10:38:54,649][DEBUG][action.search.type       ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] All shards failed for phase: [query_fetch]
[2015-02-18 10:38:20,436][DEBUG][action.get               ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] [nagioslogserver][0]: failed to execute [get [nagioslogserver][cf_option][license_key]: routing [null]]
java.lang.OutOfMemoryError: Java heap space
[2015-02-18 10:38:18,052][WARN ][index.engine.internal    ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] [logstash-2015.02.18][3] failed engine [out of memory]
[2015-02-18 10:41:04,218][WARN ][index.engine.internal    ] [95f9ab14-da22-4144-bb0b-6bbc5662115c] [logstash-2015.02.18][0] failed engine [refresh failed]
[2015-02-18 10:41:27,950][DEBUG][action.admin.cluster.node.stats] [95f9ab14-da22-4144-bb0b-6bbc5662115c] failed to execute on node [141qYQb1SqivehW4r14RcQ]
java.lang.OutOfMemoryError: Java heap space
Please advise how to continue. We really need some sort of fix for this issue, even if we are not able to expand the RAM to 16, 32 or 64 GB. We have no problem with dashboards loading slowly, but the website freezing and log processing stopping is a major issue.

Grtz

Willem
You do not have the required permissions to view the files attached to this post.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises
Contact:

Re: NLS stopped working

Post by scottwilkerson »

Willemdh,

Because this is run as a service, simply exporting the variable will not be enough; you need to add it to /etc/sysconfig/elasticsearch

Edit /etc/sysconfig/elasticsearch, uncomment ES_HEAP_SIZE=1g, and change it to

Code: Select all

ES_HEAP_SIZE=4g
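If it helps, the whole edit can be scripted; here is a minimal sketch, shown against a sample file (the sed one-liner is my suggestion, not an official procedure — on the real node you would point it at /etc/sysconfig/elasticsearch with sudo and then restart the service):

```shell
# Sketch of the edit on a sample file. On a real node, run the sed
# against /etc/sysconfig/elasticsearch (with sudo), then:
#   sudo service elasticsearch restart
printf '#ES_HEAP_SIZE=1g\n' > /tmp/es_sysconfig_sample
# Replace the ES_HEAP_SIZE line, commented or not, with the new value
sed -i 's/^#\{0,1\}ES_HEAP_SIZE=.*/ES_HEAP_SIZE=4g/' /tmp/es_sysconfig_sample
grep ES_HEAP_SIZE /tmp/es_sysconfig_sample   # prints: ES_HEAP_SIZE=4g
```

Because the init script sources this file, the setting survives reboots, unlike an export in an interactive shell.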
Also, to give me a gauge, what is the average size of your daily indexes?

Remind me, how many instances do you have set up?

What are the drive specs? SSD or spinning? If spinning, RPM?

Re: NLS stopped working

Post by WillemDH »

Scott,

Ok, edited /etc/sysconfig/elasticsearch

Restarted elasticsearch on both nodes and applied configuration.

I'll do some more tests tomorrow to see if it is more stable.

About my hardware: I have two NLS servers, both on 2015R1.3. I added two screenshots for more info on the indexes. It seems the average daily index size is about 2.4 GB?

CPU:
6 in each NLS

Disks: (Spinning AST)
Nearline SAS RAID 5 (7200rpm) + FC RAID 6 (15000rpm)

Memory
8 GB in each NLS

Grtz

Willem
You do not have the required permissions to view the files attached to this post.

Re: NLS stopped working

Post by scottwilkerson »

You should also be able to see the Heap Committed on the Instance Status page for each node, in the JVM section.
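The same JVM heap numbers can also be pulled from the Elasticsearch node-stats API if the CLI is handier — a sketch, assuming ES is listening on localhost:9200 (the fallback echo just keeps the command harmless when it is not):

```shell
# Sketch: fetch JVM stats (including heap_committed_in_bytes) from the
# Elasticsearch node-stats API. Assumes ES on localhost:9200.
curl -s 'http://localhost:9200/_nodes/stats/jvm?pretty' \
  || echo "elasticsearch not reachable"
```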

Re: NLS stopped working

Post by WillemDH »

Hey Scott,

It seems more stable since changing ES_HEAP_SIZE in /etc/sysconfig/elasticsearch.

I was able to load 30-day histograms in about 10 seconds (better than 20+ minutes...) and have not seen any website freezes for two days.

Do you mean this with
You should also be able to see the Heap Committed on the Instance Status page for each node in the JVM section
I don't really see a 'JVM' section though?

Grtz

Willem

Re: NLS stopped working

Post by scottwilkerson »

WillemDH wrote: I don't really see a 'JVM' section though?
If you click on the IP address of one of your instances in the list, you will get a whole page full of statistics...

Glad to hear it is loading much faster!

Scott