
NLS Downs Unexpectedly

Posted: Mon Jun 12, 2017 1:12 pm
by carlos.mangini
Folks,

I have 2 Nagios Log Server nodes in a cluster, receiving log data from approximately 250 servers. However, the application goes down for no apparent reason. The only entries I found in the Logstash and Elasticsearch logs are below:



###Errors on Logstash
{:timestamp=>"2017-06-12T13:26:59.967000-0300", :message=>"Received an event that has a different character encoding than you configured.", :text=>"{\\\"EventReceivedTime\\\":\\\"2017-06-09 21:02:11\\\",\\\"SourceModuleName\\\":\\\"file1\\\",\\\"SourceModuleType\\\":\\\"im_file\\\",\\\"message\\\":\\\"2017-06-09 21:02:10 ERROR couldn't connect to tcp socket on AAA.BBB.CCC.DDD:3515; Nenhuma conex\\xE3o p\\xF4de ser feita porque a m\\xE1quina de destino as recusou ativamente. \\\"}\\r", :expected_charset=>"UTF-8", :level=>:warn}
{:timestamp=>"2017-06-12T13:27:01.691000-0300", :message=>"retrying failed action with response code: 429", :level=>:warn}
{:timestamp=>"2017-06-12T13:27:01.692000-0300", :message=>"retrying failed action with response code: 429", :level=>:warn}
{:timestamp=>"2017-06-12T13:27:01.692000-0300", :message=>"retrying failed action with response code: 429", :level=>:warn}
{:timestamp=>"2017-06-12T13:27:01.692000-0300", :message=>"retrying failed action with response code: 429", :level=>:warn}
{:timestamp=>"2017-06-12T13:27:01.692000-0300", :message=>"retrying failed action with response code: 429", :level=>:warn}






###After restarting Logstash and Elasticsearch
{:timestamp=>"2017-06-12T13:30:34.964000-0300", :message=>"retrying failed action with response code: 429", :level=>:warn}
{:timestamp=>"2017-06-12T13:30:34.964000-0300", :message=>"retrying failed action with response code: 429", :level=>:warn}
{:timestamp=>"2017-06-12T13:55:04.505000-0300", :message=>"Failed to install template: None of the configured nodes are available: []", :level=>:error}
{:timestamp=>"2017-06-12T13:55:04.768000-0300", :message=>"Failed to install template: None of the configured nodes are available: []", :level=>:error}
{:timestamp=>"2017-06-12T13:55:04.792000-0300", :message=>"Failed to install template: None of the configured nodes are available: []", :level=>:error}
{:timestamp=>"2017-06-12T13:55:04.950000-0300", :message=>"Failed to install template: None of the configured nodes are available: []", :level=>:error}
{:timestamp=>"2017-06-12T13:55:04.954000-0300", :message=>"Failed to install template: None of the configured nodes are available: []", :level=>:error}
{:timestamp=>"2017-06-12T13:55:06.625000-0300", :message=>"Got error to send bulk of actions: None of the configured nodes are available: []", :level=>:error}
{:timestamp=>"2017-06-12T13:55:06.628000-0300", :message=>"Failed to flush outgoing items", :outgoing_count=>219, :exception=>org.elasticsearch.client.transport.NoNodeAvailableException: None of the configured nodes are available: [], :backtrace=>["org.elasticsearch.client.transport.TransportClientNodesService.ensureNodesAreAvailable(org/elasticsearch/client/transport/TransportClientNodesService.java:279)", "org.elasticsearch.client.transport.TransportClientNodesService.execute(org/elasticsearch/client/transport/TransportClientNodesService.java:198)", "org.elasticsearch.client.transport.support.InternalTransportClient.execute(org/elasticsearch/client/transport/support/InternalTransportClient.java:106)", "org.elasticsearch.client.support.AbstractClient.bulk(org/elasticsearch/client/support/AbstractClient.java:163)", "org.elasticsearch.client.transport.TransportClient.bulk(org/elasticsearch/client/transport/TransportClient.java:356)", "org.elasticsearch.action.bulk.BulkRequestBuilder.doExecute(org/elasticsearch/action/bulk/BulkRequestBuilder.java:164)", "org.elasticsearch.action.ActionRequestBuilder.execute(org/elasticsearch/action/ActionRequestBuilder.java:91)", "org.elasticsearch.action.ActionRequestBuilder.execute(org/elasticsearch/action/ActionRequestBuilder.java:65)", "java.lang.reflect.Method.invoke(java/lang/reflect/Method.java:606)", "RUBY.bulk(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch/protocol.rb:224)", "RUBY.submit(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch.rb:466)", "RUBY.flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch.rb:490)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:219)", 
"Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:219)", "org.jruby.RubyHash.each(org/jruby/RubyHash.java:1341)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:216)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:216)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:193)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:193)", "Stud::Buffer.buffer_receive(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:159)", "Stud::Buffer.buffer_receive(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:159)", "LogStash::Outputs::ElasticSearch.receive(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch.rb:455)", "LogStash::Outputs::ElasticSearch.receive(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch.rb:455)", "LogStash::Outputs::Base.handle(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-1.5.1-java/lib/logstash/outputs/base.rb:88)", "LogStash::Outputs::Base.handle(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-1.5.1-java/lib/logstash/outputs/base.rb:88)", "RUBY.worker_setup(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-1.5.1-java/lib/logstash/outputs/base.rb:79)", "java.lang.Thread.run(java/lang/Thread.java:745)"], :level=>:warn}






###After opening the sockets to receive the logs
{:timestamp=>"2017-06-12T13:56:28.716000-0300", :message=>"Got error to send bulk of actions: None of the configured nodes are available: []", :level=>:error}
{:timestamp=>"2017-06-12T13:56:28.716000-0300", :message=>"Failed to flush outgoing items", :outgoing_count=>243, :exception=>org.elasticsearch.client.transport.NoNodeAvailableException: None of the configured nodes are available: [], :backtrace=>["org.elasticsearch.client.transport.TransportClientNodesService.ensureNodesAreAvailable(org/elasticsearch/client/transport/TransportClientNodesService.java:279)", "org.elasticsearch.client.transport.TransportClientNodesService.execute(org/elasticsearch/client/transport/TransportClientNodesService.java:198)", "org.elasticsearch.client.transport.support.InternalTransportClient.execute(org/elasticsearch/client/transport/support/InternalTransportClient.java:106)", "org.elasticsearch.client.support.AbstractClient.bulk(org/elasticsearch/client/support/AbstractClient.java:163)", "org.elasticsearch.client.transport.TransportClient.bulk(org/elasticsearch/client/transport/TransportClient.java:356)", "org.elasticsearch.action.bulk.BulkRequestBuilder.doExecute(org/elasticsearch/action/bulk/BulkRequestBuilder.java:164)", "org.elasticsearch.action.ActionRequestBuilder.execute(org/elasticsearch/action/ActionRequestBuilder.java:91)", "org.elasticsearch.action.ActionRequestBuilder.execute(org/elasticsearch/action/ActionRequestBuilder.java:65)", "java.lang.reflect.Method.invoke(java/lang/reflect/Method.java:606)", "LogStash::Outputs::Elasticsearch::Protocols::NodeClient.bulk(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch/protocol.rb:224)", "LogStash::Outputs::Elasticsearch::Protocols::NodeClient.bulk(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch/protocol.rb:224)", "LogStash::Outputs::ElasticSearch.submit(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch.rb:466)", 
"LogStash::Outputs::ElasticSearch.submit(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch.rb:466)", "LogStash::Outputs::ElasticSearch.submit(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch.rb:465)", "LogStash::Outputs::ElasticSearch.submit(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch.rb:465)", "LogStash::Outputs::ElasticSearch.flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch.rb:490)", "LogStash::Outputs::ElasticSearch.flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch.rb:490)", "LogStash::Outputs::ElasticSearch.flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch.rb:489)", "LogStash::Outputs::ElasticSearch.flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch.rb:489)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:219)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:219)", "org.jruby.RubyHash.each(org/jruby/RubyHash.java:1341)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:216)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:216)", 
"Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:193)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:193)", "RUBY.buffer_initialize(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:112)", "org.jruby.RubyKernel.loop(org/jruby/RubyKernel.java:1511)", "RUBY.buffer_initialize(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:110)", "java.lang.Thread.run(java/lang/Thread.java:745)"], :level=>:warn}




Is there any other place in the tool where I can find details of what is happening in the environment? Can this unexpected failure be corrected by some kind of tuning?

Thank you for your help! ;)

Re: NLS Downs Unexpectedly

Posted: Mon Jun 12, 2017 1:51 pm
by cdienger
The Elasticsearch and Logstash logs are usually the best place to look, and the error message shows a failure to establish a TCP connection to one of the machines. This could be due to load, a crash, etc.
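When a Logstash log is long, tallying the messages by level can make the pattern easier to see (for example, a run of 429 retries followed by "None of the configured nodes are available"). The following is only a minimal sketch, assuming the Ruby-hash line format shown in the first post; the function name `tally_logstash_messages` and the regular expressions are illustrative and not part of Nagios Log Server.

```python
import re
from collections import Counter

# Assumed line format: {:timestamp=>"...", :message=>"...", ..., :level=>:warn}
_MESSAGE_RE = re.compile(r':message=>"((?:[^"\\]|\\.)*)"')
_LEVEL_RE = re.compile(r':level=>:(\w+)\}\s*$')


def tally_logstash_messages(lines):
    """Count (level, message) pairs across Logstash log lines.

    Lines that do not match the assumed format are skipped.
    """
    counts = Counter()
    for line in lines:
        message = _MESSAGE_RE.search(line)
        level = _LEVEL_RE.search(line)
        if message and level:
            counts[(level.group(1), message.group(1))] += 1
    return counts


# Three lines copied from the log excerpt in the first post.
sample = [
    '{:timestamp=>"2017-06-12T13:27:01.691000-0300", :message=>"retrying failed action with response code: 429", :level=>:warn}',
    '{:timestamp=>"2017-06-12T13:27:01.692000-0300", :message=>"retrying failed action with response code: 429", :level=>:warn}',
    '{:timestamp=>"2017-06-12T13:55:04.505000-0300", :message=>"Failed to install template: None of the configured nodes are available: []", :level=>:error}',
]

counts = tally_logstash_messages(sample)
for (level, message), n in counts.most_common():
    print(n, level, message)
```

To run it against a real file, pass an open file object instead of `sample`; the log path depends on the installation.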

Please upload all elasticsearch and logstash logs from both machines as well as profiles from each machine to a location where we can download them for review. Profiles can be generated under Administration > System > System Status. If you'd like to password protect them, please PM the password.

Re: NLS Downs Unexpectedly

Posted: Tue Jun 13, 2017 5:16 pm
by carlos.mangini
OK. Below are the system profile files of each node.

Re: NLS Downs Unexpectedly

Posted: Wed Jun 14, 2017 10:29 am
by cdienger
The heap usage, total ram usage, and load on tivit2030.localdomain seem pretty high:

host                  ip        heap.percent ram.percent load  node.role master name
tivit2030.localdomain 127.0.0.1           81          87 10.08 d         *      0bc6b024-83cb-4ea6-bae9-8d2eb1a9fd95
tivit2031.localdomain 127.0.0.1           75          73  4.49 d         m      12229b60-24c4-4104-8a33-fdb40ae2e415

If the profile was taken during a time of normal operation, the node is likely seeing periods of much higher activity, with usage spiking even further.
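For anyone who wants to check these numbers regularly, the table above can be parsed with a few lines of Python. This is a rough sketch only: the thresholds (80% heap, 85% RAM) are assumptions chosen for illustration, not official Nagios or Elasticsearch limits, and the function names are hypothetical.

```python
# Node table as shown in the system profile (whitespace-separated columns).
CAT_NODES_OUTPUT = """\
host ip heap.percent ram.percent load node.role master name
tivit2030.localdomain 127.0.0.1 81 87 10.08 d * 0bc6b024-83cb-4ea6-bae9-8d2eb1a9fd95
tivit2031.localdomain 127.0.0.1 75 73 4.49 d m 12229b60-24c4-4104-8a33-fdb40ae2e415
"""


def parse_cat_nodes(text):
    """Turn the whitespace-separated table into a list of dicts keyed by header."""
    lines = [line for line in text.strip().splitlines() if line.strip()]
    header = lines[0].split()
    return [dict(zip(header, line.split())) for line in lines[1:]]


def nodes_over_threshold(rows, heap_limit=80, ram_limit=85):
    """Return hosts whose heap or RAM percentage meets or exceeds the limits.

    The default limits are illustrative assumptions, not vendor guidance.
    """
    flagged = []
    for row in rows:
        heap = int(row["heap.percent"])
        ram = int(row["ram.percent"])
        if heap >= heap_limit or ram >= ram_limit:
            flagged.append(row["host"])
    return flagged


rows = parse_cat_nodes(CAT_NODES_OUTPUT)
flagged = nodes_over_threshold(rows)
print(flagged)
```

With the values from the profile, only tivit2030.localdomain crosses the assumed limits, which matches the observation above.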

I would start by increasing the memory from 4 GB to 8 GB.

Re: NLS Downs Unexpectedly

Posted: Wed Jun 14, 2017 3:38 pm
by carlos.mangini
I upgraded to 16 GB of RAM and will monitor the servers over the next week. I hope this stabilizes the cluster.
If there is any tuning documentation or best-practice guide, please share it.

Thanks for helping.

Re: NLS Downs Unexpectedly

Posted: Thu Jun 15, 2017 9:41 am
by cdienger
https://assets.nagios.com/downloads/nag ... quirements covers the minimum and recommended requirements.

https://support.nagios.com/kb/article/n ... rview.html has some good information. The memory issue that Jesse mentions is explained more in https://www.elastic.co/guide/en/elastic ... izing.html. NLS/Elasticsearch keeps the open indexes in memory and you can see the size of indexes by going to Administration > System > Cluster Status.
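As a rough illustration of the heap-sizing guidance in the Elastic link: the usual rule of thumb is to give Elasticsearch no more than about half of the system RAM, and to stay below roughly 32 GB of heap. The sketch below just does that arithmetic. The function name and the 31 GB ceiling are assumptions for illustration, and how the heap is actually configured on an NLS node is not covered here.

```python
def suggested_heap_gb(total_ram_gb, ceiling_gb=31):
    """Rule-of-thumb heap size: half of RAM, capped below ~32 GB.

    ceiling_gb=31 is a conservative, assumed cap; the rest of the RAM is
    left to the operating system's file cache.
    """
    if total_ram_gb <= 0:
        raise ValueError("total_ram_gb must be positive")
    return min(total_ram_gb // 2, ceiling_gb)


# RAM sizes mentioned in this thread, plus one large machine for contrast.
for ram in (4, 8, 16, 128):
    print(ram, "GB RAM ->", suggested_heap_gb(ram), "GB heap")
```

By this rule, a 4 GB node leaves only about 2 GB of heap for all open indexes, which is consistent with the high heap percentages seen in the profile.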

The amount of memory on the system determines how much data is readily available. You can always close and store indexes and open them again if needed in the future. https://assets.nagios.com/downloads/nag ... enance.pdf covers backups and maintenance. I would recommend storing backups on a remote server to save on local disk space.
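To get a feel for which indexes could be closed, here is a small sketch that picks out daily indexes older than a retention window. It assumes index names of the form `logstash-YYYY.MM.DD` and a 30-day window; both are assumptions to adjust for your environment. It only computes a list of names and does not contact Elasticsearch or close anything.

```python
from datetime import date, datetime, timedelta


def indices_to_close(index_names, today, keep_days=30, prefix="logstash-"):
    """Return daily indexes whose date is older than the retention window.

    Names that do not match prefix + YYYY.MM.DD are ignored.
    """
    cutoff = today - timedelta(days=keep_days)
    old = []
    for name in index_names:
        if not name.startswith(prefix):
            continue
        try:
            day = datetime.strptime(name[len(prefix):], "%Y.%m.%d").date()
        except ValueError:
            continue  # not a daily index name, leave it alone
        if day < cutoff:
            old.append(name)
    return sorted(old)


# Hypothetical index list for illustration only.
names = [
    "logstash-2017.06.14",
    "logstash-2017.05.01",
    "kibana-int",
    "logstash-2017.04.20",
]
candidates = indices_to_close(names, today=date(2017, 6, 15))
print(candidates)
```

Closed indexes stay on disk but are no longer held in memory, so a shorter window trades searchable history for lower heap pressure.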

Re: NLS Downs Unexpectedly

Posted: Mon Jan 08, 2018 4:39 pm
by carlos.mangini
@cdienger

Thanks for the tip, problem solved. :D

Re: NLS Downs Unexpectedly

Posted: Mon Jan 08, 2018 4:44 pm
by cdienger
Glad to hear :)