Nagios Log Server listening port abruptly halts v2

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
avandemore
Posts: 1597
Joined: Tue Sep 27, 2016 4:57 pm

Re: Nagios Log Server listening port abruptly halts v2

Post by avandemore »

It would most likely be a better approach to solve the problem rather than mask it; just working around it could introduce worse issues.

The only way to do that effectively is, as @mcapra already suggested, with ALL the logs from the same time window.

You can try things like turning off certain inputs to isolate the problem, but that is more of a shot-in-the-dark approach. Logs are much more definitive.
Previous Nagios employee
james.liew
Posts: 59
Joined: Wed Feb 22, 2017 1:30 am

Re: Nagios Log Server listening port abruptly halts v2

Post by james.liew »

I'll grab them the next time we have an issue, likely Friday evening/Sat morning.
avandemore
Posts: 1597
Joined: Tue Sep 27, 2016 4:57 pm

Re: Nagios Log Server listening port abruptly halts v2

Post by avandemore »

Sounds good, our support hours are listed here:

https://www.nagios.com/contact/

However, you can post or PM your logs at any point.
Previous Nagios employee
james.liew
Posts: 59
Joined: Wed Feb 22, 2017 1:30 am

Re: Nagios Log Server listening port abruptly halts v2

Post by james.liew »

The cron job seems to have done its job; no alerts for Friday, Saturday, or even Sunday. Will monitor further for now.
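The actual cron job isn't shown in this thread, but a watchdog of the kind described usually just checks whether the listener port is still open and restarts Logstash if not. A purely illustrative sketch (port 5544 and the restart step are assumptions; adjust for your input config):

```shell
# Check whether the Logstash tcp input is still listening; a cron watchdog
# would restart the service when this check fails.
PORT=5544   # assumed tcp input port -- substitute your own
if ss -ltn 2>/dev/null | grep -q ":${PORT} "; then
  echo "listener up"
else
  echo "listener down - a cron watchdog would restart Logstash here"
fi
```

Run from cron (e.g. every 5 minutes), the `else` branch would call the service restart instead of echoing.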

EDIT: I will remove the cron job sometime this week. We have a support contract set up, and I'm waiting for my account access to the Customer section of the forum.

This thread might need to be moved later in the week!
avandemore
Posts: 1597
Joined: Tue Sep 27, 2016 4:57 pm

Re: Nagios Log Server listening port abruptly halts v2

Post by avandemore »

Sure we'll keep it open for now.
Previous Nagios employee
james.liew
Posts: 59
Joined: Wed Feb 22, 2017 1:30 am

Re: Nagios Log Server listening port abruptly halts v2

Post by james.liew »

Looks like the service died again just now.

Can this thread be moved to the customer support section? I've just received access to it.

I will upload all the necessary log files once I pull them down with WinSCP, and then restart the services.
james.liew
Posts: 59
Joined: Wed Feb 22, 2017 1:30 am

Re: Nagios Log Server listening port abruptly halts v2

Post by james.liew »

Hi all,

New logs attached. As previously requested, I've uploaded ALL the logs from both Elasticsearch and Logstash.
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Re: Nagios Log Server listening port abruptly halts v2

Post by mcapra »

The start of the chatter in the latest Logstash log is here:

Code: Select all

## timestamp=>"2017-06-06T09:31:32.317000+0200"
"org/jruby/RubyIO.java:2996:in `sysread'", 
"/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-tcp-0.1.5/lib/logstash/inputs/tcp.rb:164:in `read'", 
"/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-tcp-0.1.5/lib/logstash/inputs/tcp.rb:112:in `handle_socket'", 
"/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-tcp-0.1.5/lib/logstash/inputs/tcp.rb:147:in `client_thread'", 
"/usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/logstash-input-tcp-0.1.5/lib/logstash/inputs/tcp.rb:145:in `client_thread'"
Which I think means Logstash was unable to handle all the socket connections it was trying to maintain. This could be, as mentioned previously, a side-effect of the tcp input plugin not responsibly terminating connections.
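If it is the tcp input holding sockets open, a rough way to gauge that is to count established connections on the input's port while the problem builds. A sketch (port 5544 is an assumption; use whatever port your tcp input binds):

```shell
# Count established client connections held on the Logstash tcp input port.
PORT=5544
CONN_COUNT=$(ss -tan state established "( sport = :${PORT} )" | tail -n +2 | wc -l)
echo "Established connections on port ${PORT}: ${CONN_COUNT}"
```

Watching that number over time (or alongside the alert windows) would show whether connections accumulate without being closed.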

However, I noticed this activity occurring around the same time in the Elasticsearch logs:

Code: Select all

[2017-06-06 09:31:31,223][WARN ][index.engine             ] [791cc6c8-f646-495e-9e58-1ec21a24b61c] [logstash-2017.06.06][2] failed engine [out of memory]
java.lang.OutOfMemoryError: unable to create new native thread
Which leads me to believe that, rather than Logstash misbehaving internally, the available memory on this machine is being exhausted. Do you have performance data for this machine around those times? I apologize if memory as a root cause has already been examined, but I believe the correlation here is a strong one.
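Worth noting: "unable to create new native thread" can come either from exhausted memory or from the per-user process/thread limit. A few quick checks to capture around the failure times (the `pgrep` pattern is an assumption for a stock Elasticsearch install):

```shell
# Checks around java.lang.OutOfMemoryError: unable to create new native thread
free -m            # total / used / free memory and swap
ulimit -u          # max user processes (each JVM thread counts against this)
ES_PID=$(pgrep -f org.elasticsearch 2>/dev/null | head -n 1)
if [ -n "$ES_PID" ]; then
  # How many threads the Elasticsearch JVM currently holds
  echo "Elasticsearch threads: $(ls /proc/"$ES_PID"/task | wc -l)"
fi
```

If the thread count is near `ulimit -u` rather than memory being full, the fix is a limits change rather than more RAM.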

Looking back to the May 30th occurrence, I noticed this:

Code: Select all

{
	:timestamp => "2017-05-30T02:16:58.454000+0200",
	:message => "Failed to install template: None of the configured nodes are available: []",
	:level => :error
}
Unfortunately, the earliest our Elasticsearch logs go back to is 17:44:00 that day, and everything looks like it was OK by then:

Code: Select all

[2017-05-30 17:44:01,588][INFO ][KnapsackExportAction     ] start of export: {"mode":"export","started":"2017-05-30T15:44:01.586Z","path":"file:///store/backups/nagioslogserver/1496159041/nagioslogserver.tar.gz","node_name":"791cc6c8-f646-495e-9e58-1ec21a24b61c"}
Though if we had Elasticsearch logs to match up against our Logstash logs, my hunch is that we would see similar memory-related exceptions.
Former Nagios employee
https://www.mcapra.com/
avandemore
Posts: 1597
Joined: Tue Sep 27, 2016 4:57 pm

Re: Nagios Log Server listening port abruptly halts v2

Post by avandemore »

I would concur with @mcapra's assessment at this point, as ES issues tend to bubble up to LS. @james.liew, can you confirm this is the issue/resolution?
Previous Nagios employee
james.liew
Posts: 59
Joined: Wed Feb 22, 2017 1:30 am

Re: Nagios Log Server listening port abruptly halts v2

Post by james.liew »

Based on previous feedback, I've already allocated an additional 8 GB of RAM; I'm now at 16 GB on LOG-01.
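One thing to double-check after adding RAM: the Elasticsearch heap usually has to be raised to match, or the JVM keeps running with its old allocation. The common guidance is roughly half of physical RAM, kept under ~32 GB. A sketch for picking a value (the `ES_HEAP_SIZE` variable and its `/etc/sysconfig/elasticsearch` location are assumptions; verify against your install before applying):

```shell
# Suggest an Elasticsearch heap size: half of physical RAM, capped under 32 GB.
TOTAL_MB=$(awk '/MemTotal/ {print int($2/1024)}' /proc/meminfo)
HEAP_MB=$(( TOTAL_MB / 2 ))
[ "$HEAP_MB" -gt 31744 ] && HEAP_MB=31744   # stay under the compressed-oops ceiling
echo "Suggested setting: ES_HEAP_SIZE=${HEAP_MB}m"
```

On a 16 GB box this would suggest an 8 GB heap, leaving the rest for Logstash and the OS filesystem cache.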

Each "dip" indicates when the LS/ES service was restarted.
2017-06-08_16-30-11.png
Locked