Hard Crash

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Hard Crash

Post by CFT6Server »

I am not sure whether this is related to our recent Log Server upgrade, but it seems the cluster hard crashed. All nodes were spitting out errors similar to the following:

Code:

# Aug 25, 2015 5:07:34 PM org.elasticsearch.client.transport.TransportClientNodesService$SimpleNodeSampler doSample
INFO: [e63648a3-d912-4f5d-a867-1b99282a5e7c] failed to get node info for [#transport#-1][kdcnagls1n3.bchydro.bc.ca][inet[localhost/127.0.0.1:9300]], disconnecting...
org.elasticsearch.transport.ReceiveTimeoutTransportException: [][inet[localhost/127.0.0.1:9300]][cluster:monitor/nodes/info] request_id [14384] timed out after [5000ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Aug 25, 2015 5:07:34 PM org.elasticsearch.client.transport.TransportClientNodesService$SimpleNodeSampler doSample
INFO: [e63648a3-d912-4f5d-a867-1b99282a5e7c] failed to get node info for [#transport#-1][kdcnagls1n3.bchydro.bc.ca][inet[localhost/127.0.0.1:9300]], disconnecting...
org.elasticsearch.transport.ReceiveTimeoutTransportException: [][inet[localhost/127.0.0.1:9300]][cluster:monitor/nodes/info] request_id [14402] timed out after [5001ms]
        at org.elasticsearch.transport.TransportService$TimeoutHandler.run(TransportService.java:529)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Also, when I tried to send commands, the cluster wasn't responding and just continued to spit out errors. I managed to stop the elasticsearch and logstash services, but some nodes threw this error while stopping elasticsearch:

Code:

Aug 25, 2015 5:10:26 PM org.elasticsearch.transport.netty.MessageChannelHandler messageReceived
WARNING: [4521585a-88af-47c9-81e5-c4d13cffb148] Message not fully read (response) for [32739] handler org.elasticsearch.action.TransportActionNodeProxy$1@18487912, error [true], resetting
Aug 25, 2015 5:10:27 PM org.elasticsearch.transport.netty.MessageChannelHandler messageReceived
WARNING: [4521585a-88af-47c9-81e5-c4d13cffb148] Message not fully read (response) for [32740] handler org.elasticsearch.action.TransportActionNodeProxy$1@687865a9, error [true], resetting
Aug 25, 2015 5:10:28 PM org.elasticsearch.transport.netty.MessageChannelHandler messageReceived
WARNING: [4521585a-88af-47c9-81e5-c4d13cffb148] Message not fully read (response) for [32741] handler org.elasticsearch.action.TransportActionNodeProxy$1@4d07163f, error [true], resetting
Aug 25, 2015 5:10:29 PM org.elasticsearch.transport.netty.MessageChannelHandler messageReceived
My only option was to shut off the nodes, then bring them back up and let Log Server sort itself out.

The last action I ran was a query over the past 7 days.
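
For anyone triaging a similar flood of transport errors, here is a minimal Python sketch that pulls the action name, request ID, and timeout out of ReceiveTimeoutTransportException lines like the ones quoted above. The function name parse_timeouts and the regular expression are illustrative assumptions; they are matched only against the log format shown in this post, not against any official Elasticsearch log specification.

```python
import re

# Pattern written against the timeout lines quoted in this thread, e.g.
# "...[cluster:monitor/nodes/info] request_id [14384] timed out after [5000ms]"
TIMEOUT_RE = re.compile(
    r"ReceiveTimeoutTransportException: .*?"
    r"\[(?P<action>[^\[\]]+)\] request_id \[(?P<request_id>\d+)\] "
    r"timed out after \[(?P<millis>\d+)ms\]"
)

# Two lines copied from the first post, used as sample input.
SAMPLE_LINES = [
    "org.elasticsearch.transport.ReceiveTimeoutTransportException: "
    "[][inet[localhost/127.0.0.1:9300]][cluster:monitor/nodes/info] "
    "request_id [14384] timed out after [5000ms]",
    "org.elasticsearch.transport.ReceiveTimeoutTransportException: "
    "[][inet[localhost/127.0.0.1:9300]][cluster:monitor/nodes/info] "
    "request_id [14402] timed out after [5001ms]",
]


def parse_timeouts(lines):
    """Return (action, request_id, millis) for each transport timeout line."""
    results = []
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            results.append(
                (
                    match.group("action"),
                    int(match.group("request_id")),
                    int(match.group("millis")),
                )
            )
    return results


if __name__ == "__main__":
    for action, request_id, millis in parse_timeouts(SAMPLE_LINES):
        print(action, request_id, millis)
```

Feeding a whole log file through parse_timeouts and counting the results per minute would show roughly when the node stopped answering, which can then be lined up against the time the 7-day query was started.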
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Re: Hard Crash

Post by CFT6Server »

Looks like either the upgrade or the crash reset my customizations to elasticsearch.yml, which significantly lowered the memory settings I had configured for the field data cache size and breaker limits.
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Hard Crash

Post by jolson »

After restoring your previous customizations, is everything back to normal, or are your problems still persisting? It's very likely that the elasticsearch upgrade cleared out your custom settings.
Twits Blog
Show me a man who lives alone and has a perpetually clean kitchen, and 8 times out of 9 I'll show you a man with detestable spiritual qualities.
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Re: Hard Crash

Post by CFT6Server »

It has settled now after I reset the configurations and restarted the cluster.
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Hard Crash

Post by jolson »

Understood. Is there anything I could help you with here?
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Re: Hard Crash

Post by CFT6Server »

Another note on the upgrade: on top of losing the customizations to elasticsearch.yml, it looks like the per-instance configurations for inputs are all gone as well. This sucks because I have to dig back to see if I have them backed up somewhere.

If configurations are removed during upgrades, then there should be a documented backup process.
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: Hard Crash

Post by tmcdonald »

I can't imagine why an upgrade would do this, but we will test this and bring it up with the developers.
Former Nagios employee
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Hard Crash

Post by jolson »

I found the area in our upgrade script where Nagios Log Server is told to replace the 'elasticsearch.yml' file. Is there a list of particular files that you'd like to see preserved during future upgrades? Off the top of my head, I can think of the following:

Code:

/usr/local/nagioslogserver/elasticsearch/config/elasticsearch.yml
/usr/local/nagioslogserver/elasticsearch/config/logging.yml
/usr/local/nagioslogserver/logstash/patterns/
/etc/sysconfig/elasticsearch
/etc/sysconfig/logstash
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Re: Hard Crash

Post by CFT6Server »

I would say any local input/filter configurations.

/usr/local/nagioslogserver/logstash/etc/conf.d

So far, that's all I have.
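
Until the upgrade script preserves these files itself, a pre-upgrade backup can be scripted by hand. The following is a minimal Python sketch, not part of Nagios Log Server: the function name backup_paths and the timestamped layout are assumptions made for illustration, and the path list is simply the set of files and directories named in this thread.

```python
import shutil
import time
from pathlib import Path

# Files and directories mentioned in this thread as worth preserving.
PATHS_TO_PRESERVE = [
    "/usr/local/nagioslogserver/elasticsearch/config/elasticsearch.yml",
    "/usr/local/nagioslogserver/elasticsearch/config/logging.yml",
    "/usr/local/nagioslogserver/logstash/patterns/",
    "/usr/local/nagioslogserver/logstash/etc/conf.d",
    "/etc/sysconfig/elasticsearch",
    "/etc/sysconfig/logstash",
]


def backup_paths(paths, backup_root):
    """Copy each existing path into a timestamped folder under backup_root.

    The original directory layout is mirrored below the timestamped folder.
    Paths that do not exist are skipped. Returns the list of copies made.
    """
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest_root = Path(backup_root) / stamp
    copied = []
    for raw in paths:
        src = Path(raw)
        if not src.exists():
            continue  # nothing to back up on this node
        # Strip the leading "/" so the source path nests under dest_root.
        dest = dest_root / src.relative_to(src.anchor)
        dest.parent.mkdir(parents=True, exist_ok=True)
        if src.is_dir():
            shutil.copytree(src, dest)
        else:
            shutil.copy2(src, dest)  # copy2 also keeps timestamps
        copied.append(dest)
    return copied
```

Calling backup_paths(PATHS_TO_PRESERVE, "/root/nls-config-backup") before running the upgrade script would leave a dated copy to restore from; the destination directory here is only an example.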
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Hard Crash

Post by jolson »

I am looking into the configuration file rewrite issue. In the meantime, I filed a bug report ( Task ID 6354 ) for the per-instance configuration deletion issue. I reproduced the problem on my lab box; my findings are below.

Steps to Reproduce:

1. Spin up the latest release of Nagios Log Server (NLS 2015R2.2).

2. Define a simple per-instance configuration (mine was a simple syslog input) and Save+Apply configuration. The input will be reflected in /usr/local/nagioslogserver/logstash/etc/conf.d/000_inputs.conf appropriately.

3. Run the 'upgrade' script on your instance (I upgraded my 2.2 instance to 2.2).

4. Go back to your per-instance configurations in the Web GUI and note that they have disappeared.


Additional Information:
-I noted that post-upgrade, the per-instance configuration is available via the command line, but does *NOT* appear in the Web GUI. Global Configurations are unaffected.

-If you run a single 'Apply Configuration', all of the per-instance configurations that exist on the command line are erased.
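
To confirm this behavior on another system, the contents of the conf.d directory can be fingerprinted before the upgrade and again after 'Apply Configuration'. The sketch below is one possible way to do that in Python; the helper names snapshot and compare are hypothetical and not part of Nagios Log Server.

```python
import hashlib
from pathlib import Path


def snapshot(directory):
    """Map each file path (relative to directory) to the SHA-256 of its bytes."""
    base = Path(directory)
    return {
        str(path.relative_to(base)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(base.rglob("*"))
        if path.is_file()
    }


def compare(before, after):
    """Return (removed, changed, added) file names between two snapshots."""
    removed = sorted(set(before) - set(after))
    added = sorted(set(after) - set(before))
    changed = sorted(
        name for name in set(before) & set(after) if before[name] != after[name]
    )
    return removed, changed, added
```

Taking one snapshot of /usr/local/nagioslogserver/logstash/etc/conf.d before step 3 and another after applying the configuration, then passing both to compare, should list 000_inputs.conf as removed or changed if the per-instance input was erased.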