Hard Crash

CFT6Server · Post by **CFT6Server** » Tue Aug 25, 2015 7:21 pm

I am not sure if this is related to our recent Log Server upgrade, but it seems the cluster hard crashed. All nodes were spitting out errors similar to the following:

Code:

Aug 25, 2015 5:07:34 PM org.elasticsearch.client.transport.TransportClientNodesService$SimpleNodeSampler doSample
INFO: [e63648a3-d912-4f5d-a867-1b99282a5e7c] failed to get node info for [#transport#-1][kdcnagls1n3.bchydro.bc.ca][inet[localhost/127.0.0.1:9300]], disconnecting...
org.elasticsearch.transport.ReceiveTimeoutTransportException: [][inet[localhost/127.0.0.1:9300]][cluster:monitor/nodes/info] request_id [14384] timed out after [5000ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Aug 25, 2015 5:07:34 PM org.elasticsearch.client.transport.TransportClientNodesService$SimpleNodeSampler doSample
INFO: [e63648a3-d912-4f5d-a867-1b99282a5e7c] failed to get node info for [#transport#-1][kdcnagls1n3.bchydro.bc.ca][inet[localhost/127.0.0.1:9300]], disconnecting...
org.elasticsearch.transport.ReceiveTimeoutTransportException: [][inet[localhost/127.0.0.1:9300]][cluster:monitor/nodes/info] request_id [14402] timed out after [5001ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Also, when I tried to send commands, the cluster did not respond and just kept spitting out errors. I managed to stop the elasticsearch and logstash services, but some threw this error while trying to stop elasticsearch.

Code:

Aug 25, 2015 5:10:26 PM org.elasticsearch.transport.netty.MessageChannelHandler messageReceived
WARNING: [4521585a-88af-47c9-81e5-c4d13cffb148] Message not fully read (response) for [32739] handler org.elasticsearch.action.TransportActionNodeProxy$1@18487912, error [true], resetting
Aug 25, 2015 5:10:27 PM org.elasticsearch.transport.netty.MessageChannelHandler messageReceived
WARNING: [4521585a-88af-47c9-81e5-c4d13cffb148] Message not fully read (response) for [32740] handler org.elasticsearch.action.TransportActionNodeProxy$1@687865a9, error [true], resetting
Aug 25, 2015 5:10:28 PM org.elasticsearch.transport.netty.MessageChannelHandler messageReceived
WARNING: [4521585a-88af-47c9-81e5-c4d13cffb148] Message not fully read (response) for [32741] handler org.elasticsearch.action.TransportActionNodeProxy$1@4d07163f, error [true], resetting
Aug 25, 2015 5:10:29 PM org.elasticsearch.transport.netty.MessageChannelHandler messageReceived

My only option was to shut off the nodes, bring them back up, and let Log Server sort itself out.

The last action I ran was a query over the past 7 days.

CFT6Server · Post by **CFT6Server** » Tue Aug 25, 2015 7:30 pm

It looks like either the upgrade or the crash reset my customizations to elasticsearch.yml, which significantly lowered the memory settings I had configured for the field data cache size and breaker limits.
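For anyone wanting to check whether an upgrade reset these values, a minimal Python sketch along the following lines might help. It scans the text of elasticsearch.yml for a few watched settings. The key names below are assumptions taken from the Elasticsearch 1.4+ documentation and may differ in the version bundled with Log Server. The function names are hypothetical, and the parser only handles flat `key: value` lines, not nested YAML.

```python
# Assumed setting names (Elasticsearch 1.4+ style); adjust for your version.
WATCHED_KEYS = (
    "indices.fielddata.cache.size",
    "indices.breaker.fielddata.limit",
)


def read_flat_settings(text, keys=WATCHED_KEYS):
    """Return {key: value} for watched keys written as flat 'key: value' lines."""
    found = {}
    for raw in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        key = key.strip()
        if key in keys:
            found[key] = value.strip()
    return found


def missing_settings(text, keys=WATCHED_KEYS):
    """List watched keys that are absent (or commented out) in the given text."""
    found = read_flat_settings(text, keys)
    return [key for key in keys if key not in found]
```

Passing the contents of elasticsearch.yml to `missing_settings` after an upgrade would list any watched keys that are no longer set, which is a quick hint that the file was replaced with defaults.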

jolson · Post by **jolson** » Wed Aug 26, 2015 10:42 am

After restoring your previous customizations, is everything back to normal, or are your problems still persisting? It's very likely that the elasticsearch upgrade cleared out your custom settings.

CFT6Server · Post by **CFT6Server** » Wed Aug 26, 2015 11:18 am

It has settled now after I reset the configurations and restarted the cluster.

jolson · Post by **jolson** » Wed Aug 26, 2015 1:19 pm

Understood. Is there anything I could help you with here?

CFT6Server · Post by **CFT6Server** » Wed Aug 26, 2015 4:51 pm

Another note on the upgrade: on top of losing customizations to elasticsearch.yml, it looks like the per-instance configurations for inputs are all gone as well. This sucks because I have to dig back to see if I have that backed up somewhere.

If configurations are removed during upgrades, then there should be a documented backup process.

tmcdonald · Post by **tmcdonald** » Thu Aug 27, 2015 11:29 am

I can't imagine why an upgrade would do this, but we will test this and bring it up with the developers.

jolson · Post by **jolson** » Thu Aug 27, 2015 11:41 am

I found the area in our upgrade script where Nagios Log Server is told to replace the 'elasticsearch.yml' file. Is there a list of particular files that you'd like to see preserved during future upgrades? Off the top of my head, I can think of the following:

Code:

/usr/local/nagioslogserver/elasticsearch/config/elasticsearch.yml
/usr/local/nagioslogserver/elasticsearch/config/logging.yml
/usr/local/nagioslogserver/logstash/patterns/
/etc/sysconfig/elasticsearch
/etc/sysconfig/logstash

CFT6Server · Post by **CFT6Server** » Thu Aug 27, 2015 11:48 am

I would say any local input/filter configurations.

/usr/local/nagioslogserver/logstash/etc/conf.d

So far, that's all I have.
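Until the upgrade script preserves these files, the paths listed in this thread could be archived by hand before upgrading. Below is a minimal Python sketch of that idea. It is not part of Nagios Log Server, and the function name and archive naming scheme are made up for illustration. The path list is simply the one collected above.

```python
import tarfile
import time
from pathlib import Path

# Paths mentioned in this thread; extend as needed for your install.
CONFIG_PATHS = [
    "/usr/local/nagioslogserver/elasticsearch/config/elasticsearch.yml",
    "/usr/local/nagioslogserver/elasticsearch/config/logging.yml",
    "/usr/local/nagioslogserver/logstash/patterns/",
    "/usr/local/nagioslogserver/logstash/etc/conf.d",
    "/etc/sysconfig/elasticsearch",
    "/etc/sysconfig/logstash",
]


def backup_configs(paths, dest_dir):
    """Archive every existing path into a timestamped .tar.gz under dest_dir.

    Returns (archive_path, skipped), where skipped lists paths that were
    not found and therefore not archived.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = dest / f"nls-config-{stamp}.tar.gz"
    skipped = []
    with tarfile.open(archive, "w:gz") as tar:
        for path in map(Path, paths):
            if path.exists():
                tar.add(str(path))  # directories are added recursively
            else:
                skipped.append(str(path))
    return archive, skipped
```

Running `backup_configs(CONFIG_PATHS, "/root/nls-backups")` before the upgrade (the destination directory here is only an example) would leave a single archive to restore from if the upgrade overwrites anything.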

jolson · Post by **jolson** » Thu Aug 27, 2015 3:22 pm

I am looking into the configuration file rewrite business. In the meantime, I filed a bug report (Task ID 6354) for the per-instance configuration deletion issue. I reproduced the problem on my lab box; my findings follow:

Steps to Reproduce:

1. Spin up the latest release of Nagios Log Server (NLS 2015R2.2).

2. Define a simple per-instance configuration (mine was a simple syslog input) and Save+Apply configuration. The input will be reflected in /usr/local/nagioslogserver/logstash/etc/conf.d/000_inputs.conf appropriately.

3. Run the 'upgrade' script on your instance (I upgraded my 2.2 instance to 2.2).

4. Go back to your per-instance configurations in the Web GUI and note that they have disappeared.

Additional Information:
-I noted that post-upgrade, the per-instance configuration is available via the command line, but does *NOT* appear in the Web GUI. Global Configurations are unaffected.

-If you run a single 'Apply Configuration', all of the per-instance configurations that exist on the command line are erased.
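To see exactly which files change at each stage of the reproduction above (before the upgrade, after the upgrade, and after 'Apply Configuration'), one option is to hash the contents of the conf.d directory at each point and compare. The Python sketch below is only an illustration of that approach. The helper names are hypothetical and nothing here comes from the Log Server code base.

```python
import hashlib
from pathlib import Path


def snapshot(root):
    """Map each file under root (relative path) to the SHA-256 of its contents."""
    root = Path(root)
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_snapshots(before, after):
    """Report which files were removed, added, or changed between two snapshots."""
    common = before.keys() & after.keys()
    return {
        "removed": sorted(before.keys() - after.keys()),
        "added": sorted(after.keys() - before.keys()),
        "changed": sorted(name for name in common if before[name] != after[name]),
    }
```

For example, taking `snapshot("/usr/local/nagioslogserver/logstash/etc/conf.d")` before and after each step and passing the pairs to `diff_snapshots` would show whether 000_inputs.conf was rewritten by the upgrade itself or only by the later 'Apply Configuration'.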

Nagios Support Forum
