NLS Downs Unexpectedly

carlos.mangini
Posts: 12
Joined: Thu Jun 01, 2017 9:33 am

NLS Downs Unexpectedly

Post by carlos.mangini »

Folks,

I have 2 Nagios Log Server nodes in a cluster, receiving log data from approximately 250 servers. However, the application goes down for no apparent reason. The only entries I found in the Logstash and Elasticsearch logs are below:



###Errors on Logstash
{:timestamp=>"2017-06-12T13:26:59.967000-0300", :message=>"Received an event that has a different character encoding than you configured.", :text=>"{\\\"EventReceivedTime\\\":\\\"2017-06-09 21:02:11\\\",\\\"SourceModuleName\\\":\\\"file1\\\",\\\"SourceModuleType\\\":\\\"im_file\\\",\\\"message\\\":\\\"2017-06-09 21:02:10 ERROR couldn't connect to tcp socket on AAA.BBB.CCC.DDD:3515; Nenhuma conex\\xE3o p\\xF4de ser feita porque a m\\xE1quina de destino as recusou ativamente. \\\"}\\r", :expected_charset=>"UTF-8", :level=>:warn}
{:timestamp=>"2017-06-12T13:27:01.691000-0300", :message=>"retrying failed action with response code: 429", :level=>:warn}
{:timestamp=>"2017-06-12T13:27:01.692000-0300", :message=>"retrying failed action with response code: 429", :level=>:warn}
{:timestamp=>"2017-06-12T13:27:01.692000-0300", :message=>"retrying failed action with response code: 429", :level=>:warn}
{:timestamp=>"2017-06-12T13:27:01.692000-0300", :message=>"retrying failed action with response code: 429", :level=>:warn}
{:timestamp=>"2017-06-12T13:27:01.692000-0300", :message=>"retrying failed action with response code: 429", :level=>:warn}

###After restarting Logstash and Elasticsearch
{:timestamp=>"2017-06-12T13:30:34.964000-0300", :message=>"retrying failed action with response code: 429", :level=>:warn}
{:timestamp=>"2017-06-12T13:30:34.964000-0300", :message=>"retrying failed action with response code: 429", :level=>:warn}
{:timestamp=>"2017-06-12T13:55:04.505000-0300", :message=>"Failed to install template: None of the configured nodes are available: []", :level=>:error}
{:timestamp=>"2017-06-12T13:55:04.768000-0300", :message=>"Failed to install template: None of the configured nodes are available: []", :level=>:error}
{:timestamp=>"2017-06-12T13:55:04.792000-0300", :message=>"Failed to install template: None of the configured nodes are available: []", :level=>:error}
{:timestamp=>"2017-06-12T13:55:04.950000-0300", :message=>"Failed to install template: None of the configured nodes are available: []", :level=>:error}
{:timestamp=>"2017-06-12T13:55:04.954000-0300", :message=>"Failed to install template: None of the configured nodes are available: []", :level=>:error}
{:timestamp=>"2017-06-12T13:55:06.625000-0300", :message=>"Got error to send bulk of actions: None of the configured nodes are available: []", :level=>:error}
{:timestamp=>"2017-06-12T13:55:06.628000-0300", :message=>"Failed to flush outgoing items", :outgoing_count=>219, :exception=>org.elasticsearch.client.transport.NoNodeAvailableException: None of the configured nodes are available: [], :backtrace=>["org.elasticsearch.client.transport.TransportClientNodesService.ensureNodesAreAvailable(org/elasticsearch/client/transport/TransportClientNodesService.java:279)", "org.elasticsearch.client.transport.TransportClientNodesService.execute(org/elasticsearch/client/transport/TransportClientNodesService.java:198)", "org.elasticsearch.client.transport.support.InternalTransportClient.execute(org/elasticsearch/client/transport/support/InternalTransportClient.java:106)", "org.elasticsearch.client.support.AbstractClient.bulk(org/elasticsearch/client/support/AbstractClient.java:163)", "org.elasticsearch.client.transport.TransportClient.bulk(org/elasticsearch/client/transport/TransportClient.java:356)", "org.elasticsearch.action.bulk.BulkRequestBuilder.doExecute(org/elasticsearch/action/bulk/BulkRequestBuilder.java:164)", "org.elasticsearch.action.ActionRequestBuilder.execute(org/elasticsearch/action/ActionRequestBuilder.java:91)", "org.elasticsearch.action.ActionRequestBuilder.execute(org/elasticsearch/action/ActionRequestBuilder.java:65)", "java.lang.reflect.Method.invoke(java/lang/reflect/Method.java:606)", "RUBY.bulk(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch/protocol.rb:224)", "RUBY.submit(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch.rb:466)", "RUBY.flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch.rb:490)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:219)", 
"Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:219)", "org.jruby.RubyHash.each(org/jruby/RubyHash.java:1341)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:216)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:216)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:193)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:193)", "Stud::Buffer.buffer_receive(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:159)", "Stud::Buffer.buffer_receive(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:159)", "LogStash::Outputs::ElasticSearch.receive(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch.rb:455)", "LogStash::Outputs::ElasticSearch.receive(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch.rb:455)", "LogStash::Outputs::Base.handle(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-1.5.1-java/lib/logstash/outputs/base.rb:88)", "LogStash::Outputs::Base.handle(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-1.5.1-java/lib/logstash/outputs/base.rb:88)", "RUBY.worker_setup(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-core-1.5.1-java/lib/logstash/outputs/base.rb:79)", "java.lang.Thread.run(java/lang/Thread.java:745)"], :level=>:warn}

###After opening the sockets to receive the logs
{:timestamp=>"2017-06-12T13:56:28.716000-0300", :message=>"Got error to send bulk of actions: None of the configured nodes are available: []", :level=>:error}
{:timestamp=>"2017-06-12T13:56:28.716000-0300", :message=>"Failed to flush outgoing items", :outgoing_count=>243, :exception=>org.elasticsearch.client.transport.NoNodeAvailableException: None of the configured nodes are available: [], :backtrace=>["org.elasticsearch.client.transport.TransportClientNodesService.ensureNodesAreAvailable(org/elasticsearch/client/transport/TransportClientNodesService.java:279)", "org.elasticsearch.client.transport.TransportClientNodesService.execute(org/elasticsearch/client/transport/TransportClientNodesService.java:198)", "org.elasticsearch.client.transport.support.InternalTransportClient.execute(org/elasticsearch/client/transport/support/InternalTransportClient.java:106)", "org.elasticsearch.client.support.AbstractClient.bulk(org/elasticsearch/client/support/AbstractClient.java:163)", "org.elasticsearch.client.transport.TransportClient.bulk(org/elasticsearch/client/transport/TransportClient.java:356)", "org.elasticsearch.action.bulk.BulkRequestBuilder.doExecute(org/elasticsearch/action/bulk/BulkRequestBuilder.java:164)", "org.elasticsearch.action.ActionRequestBuilder.execute(org/elasticsearch/action/ActionRequestBuilder.java:91)", "org.elasticsearch.action.ActionRequestBuilder.execute(org/elasticsearch/action/ActionRequestBuilder.java:65)", "java.lang.reflect.Method.invoke(java/lang/reflect/Method.java:606)", "LogStash::Outputs::Elasticsearch::Protocols::NodeClient.bulk(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch/protocol.rb:224)", "LogStash::Outputs::Elasticsearch::Protocols::NodeClient.bulk(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch/protocol.rb:224)", "LogStash::Outputs::ElasticSearch.submit(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch.rb:466)", 
"LogStash::Outputs::ElasticSearch.submit(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch.rb:466)", "LogStash::Outputs::ElasticSearch.submit(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch.rb:465)", "LogStash::Outputs::ElasticSearch.submit(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch.rb:465)", "LogStash::Outputs::ElasticSearch.flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch.rb:490)", "LogStash::Outputs::ElasticSearch.flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch.rb:490)", "LogStash::Outputs::ElasticSearch.flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch.rb:489)", "LogStash::Outputs::ElasticSearch.flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-output-elasticsearch-0.2.8-java/lib/logstash/outputs/elasticsearch.rb:489)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:219)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:219)", "org.jruby.RubyHash.each(org/jruby/RubyHash.java:1341)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:216)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:216)", 
"Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:193)", "Stud::Buffer.buffer_flush(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:193)", "RUBY.buffer_initialize(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:112)", "org.jruby.RubyKernel.loop(org/jruby/RubyKernel.java:1511)", "RUBY.buffer_initialize(/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.19/lib/stud/buffer.rb:110)", "java.lang.Thread.run(java/lang/Thread.java:745)"], :level=>:warn}




Is there anywhere else in the tool where I can find details of what is happening in the environment? Can this unexpected failure be corrected by tuning the environment in some way?

Thank you for your help! ;)
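
A note on the first warning in the Logstash log above: the escaped bytes \xE3, \xF4 and \xE1 in the event text are consistent with Portuguese text encoded as ISO-8859-1 (Latin-1) rather than UTF-8. That reading is an assumption based only on the bytes shown. A minimal Python sketch of the mismatch, where `decode_with_fallback` is a hypothetical helper and not part of Logstash or Nagios Log Server:

```python
# Bytes copied from the ":text" field of the warning; the choice of Latin-1
# as the fallback encoding is an assumption based on those byte values.
RAW = (b"Nenhuma conex\xe3o p\xf4de ser feita porque a m\xe1quina "
       b"de destino as recusou ativamente.")


def decode_with_fallback(data, primary="utf-8", fallback="latin-1"):
    """Try the expected charset first, then fall back to a second one.

    Returns the decoded text and the name of the charset that worked.
    """
    try:
        return data.decode(primary), primary
    except UnicodeDecodeError:
        return data.decode(fallback), fallback


if __name__ == "__main__":
    text, used = decode_with_fallback(RAW)
    print(used)
    print(text)
```

If the sender really does write Latin-1, the warning is probably unrelated to the outage itself, but it would explain why the accented characters are not indexed cleanly.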
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: NLS Downs Unexpectedly

Post by cdienger »

The Elasticsearch and Logstash logs are usually the best place to look, and the error message shows a failure to establish a TCP connection to one of the machines. This could be due to load, a crash, etc.

Please upload all elasticsearch and logstash logs from both machines as well as profiles from each machine to a location where we can download them for review. Profiles can be generated under Administration > System > System Status. If you'd like to password protect them, please PM the password.
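
While collecting the logs, it can help to see which messages dominate. The sketch below tallies Logstash log lines by level and message, assuming the Ruby-hash line format shown in the first post; `tally_messages` is a hypothetical helper, not a Nagios Log Server tool. For context, 429 is the HTTP "Too Many Requests" status, which in bulk indexing typically means Elasticsearch is rejecting work it cannot keep up with.

```python
import re
from collections import Counter

# Logstash 1.x prints each log entry as a Ruby hash, for example:
# {:timestamp=>"...", :message=>"...", :level=>:warn}
MESSAGE_RE = re.compile(r':message=>"((?:[^"\\]|\\.)*)"')
LEVEL_RE = re.compile(r':level=>:(\w+)')


def tally_messages(lines):
    """Count (level, message) pairs across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        message = MESSAGE_RE.search(line)
        level = LEVEL_RE.search(line)
        if message and level:
            counts[(level.group(1), message.group(1))] += 1
    return counts


if __name__ == "__main__":
    # Two lines copied from the log excerpt earlier in this thread.
    sample = [
        r'{:timestamp=>"2017-06-12T13:27:01.691000-0300", :message=>"retrying failed action with response code: 429", :level=>:warn}',
        r'{:timestamp=>"2017-06-12T13:55:04.505000-0300", :message=>"Failed to install template: None of the configured nodes are available: []", :level=>:error}',
    ]
    for (level, message), count in tally_messages(sample).most_common():
        print(f"{count:5d}  {level:5s}  {message}")
```

To run it against a real file, pass the open file object to `tally_messages` instead of the sample list.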
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
carlos.mangini
Posts: 12
Joined: Thu Jun 01, 2017 9:33 am

Re: NLS Downs Unexpectedly

Post by carlos.mangini »

OK. Below are the system profile files of each node.
You do not have the required permissions to view the files attached to this post.
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: NLS Downs Unexpectedly

Post by cdienger »

The heap usage, total ram usage, and load on tivit2030.localdomain seem pretty high:

host                  ip        heap.percent ram.percent load  node.role master name
tivit2030.localdomain 127.0.0.1 81           87          10.08 d         *      0bc6b024-83cb-4ea6-bae9-8d2eb1a9fd95
tivit2031.localdomain 127.0.0.1 75           73          4.49  d         m      12229b60-24c4-4104-8a33-fdb40ae2e415

If the profile was taken during normal operation, the node likely sees periods of much higher activity, with usage spiking even further.

I would start by increasing the memory from 4 GB to 8 GB.
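
The same check can be scripted against the node table from the profile. This is only a sketch: the thresholds (heap 80%, RAM 85%, load 8.0) are arbitrary illustrative values, not Nagios or Elasticsearch recommendations, and `parse_cat_nodes` and `flag_busy_nodes` are hypothetical helper names.

```python
# Node table as it appears in the profile (whitespace-separated columns).
SAMPLE = """\
host ip heap.percent ram.percent load node.role master name
tivit2030.localdomain 127.0.0.1 81 87 10.08 d * 0bc6b024-83cb-4ea6-bae9-8d2eb1a9fd95
tivit2031.localdomain 127.0.0.1 75 73 4.49 d m 12229b60-24c4-4104-8a33-fdb40ae2e415
"""


def parse_cat_nodes(text):
    """Turn the header row plus data rows into a list of dicts."""
    rows = [line.split() for line in text.strip().splitlines() if line.strip()]
    header, data = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in data]


def flag_busy_nodes(nodes, heap_limit=80, ram_limit=85, load_limit=8.0):
    """Return {host: [reasons]} for nodes at or above any threshold.

    The default limits are assumptions chosen for illustration only.
    """
    flagged = {}
    for node in nodes:
        reasons = []
        if int(node["heap.percent"]) >= heap_limit:
            reasons.append(f"heap {node['heap.percent']}%")
        if int(node["ram.percent"]) >= ram_limit:
            reasons.append(f"ram {node['ram.percent']}%")
        if float(node["load"]) >= load_limit:
            reasons.append(f"load {node['load']}")
        if reasons:
            flagged[node["host"]] = reasons
    return flagged


if __name__ == "__main__":
    for host, reasons in flag_busy_nodes(parse_cat_nodes(SAMPLE)).items():
        print(host, "->", ", ".join(reasons))
```

With these example limits only tivit2030.localdomain is flagged, which matches the reading above.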
carlos.mangini
Posts: 12
Joined: Thu Jun 01, 2017 9:33 am

Re: NLS Downs Unexpectedly

Post by carlos.mangini »

I upgraded to 16 GB of RAM and will monitor the servers over the next week. I hope this stabilizes the cluster.
If there is any tuning documentation or list of best practices, please share it.

Thanks for helping.
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: NLS Downs Unexpectedly

Post by cdienger »

https://assets.nagios.com/downloads/nag ... quirements covers minimum requirements and recommended requirements.

https://support.nagios.com/kb/article/n ... rview.html has some good information. The memory issue that Jesse mentions is explained more in https://www.elastic.co/guide/en/elastic ... izing.html. NLS/Elasticsearch keeps the open indexes in memory and you can see the size of indexes by going to Administration > System > Cluster Status.

The amount of memory on the system will determine how much data will be readily available. You can always close and store indexes and open them again if needed in the future. https://assets.nagios.com/downloads/nag ... enance.pdf covers backups and maintenance. I would recommend storing backups on a remote server to save on local disk space.
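
As a rough illustration of the heap sizing guidance in the Elasticsearch guide linked above (give about half of the RAM to the heap, leave the rest to the operating system file cache, and stay under roughly 32 GB), here is a small sketch. `suggested_heap_gb` and the 31 GB cap are assumptions for illustration; the function only does the arithmetic and changes no setting.

```python
def suggested_heap_gb(total_ram_gb, cap_gb=31):
    """Half of total RAM, capped just below 32 GB.

    Both the 50% rule and the cap follow the commonly cited Elasticsearch
    heap sizing guidance; treat them as rules of thumb, not hard limits.
    """
    return min(total_ram_gb / 2, cap_gb)


if __name__ == "__main__":
    # RAM sizes mentioned in this thread: the original 4 GB, the suggested
    # 8 GB, and the 16 GB the nodes were upgraded to.
    for ram in (4, 8, 16):
        print(f"{ram} GB RAM -> about {suggested_heap_gb(ram):g} GB heap")
```

Going from 4 GB to 16 GB of RAM therefore raises the heap available for open indexes from about 2 GB to about 8 GB under this rule of thumb.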
carlos.mangini
Posts: 12
Joined: Thu Jun 01, 2017 9:33 am

Re: NLS Downs Unexpectedly

Post by carlos.mangini »

@cdienger

Thanks for the tip, problem solved. :D
cdienger
Support Tech
Posts: 5045
Joined: Tue Feb 07, 2017 11:26 am

Re: NLS Downs Unexpectedly

Post by cdienger »

Glad to hear :)