Log Server hang ups

Posted: Thu Aug 27, 2015 3:48 pm
by CFT6Server
I am trying to figure out why our cluster is constantly hanging up. I am trying to bump the breaker limits in an attempt to work around some of the out-of-memory issues. We have 3 nodes, and each has 16GB of RAM.

I am seeing the following errors while tailing the logs, but I'm not sure what they are or how to address them.

Code:

[2015-08-27 13:42:31,316][DEBUG][action.bulk] [4521585a-88af-47c9-81e5-c4d13cffb148] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-08-27 13:42:31,319][DEBUG][action.bulk] [4521585a-88af-47c9-81e5-c4d13cffb148] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-08-27 13:42:31,468][DEBUG][action.bulk] [4521585a-88af-47c9-81e5-c4d13cffb148] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]

The cluster tends to hang up every few days.

Re: Log Server hang ups

Posted: Thu Aug 27, 2015 4:10 pm
by jolson
Are your nodes physically far apart? What is the average ping time from node to node? We recommend keeping nodes in the same physical location so as to minimize ping time. The timeout you're experiencing is typically due to nodes being unreachable.

If your nodes are in the same datacenter, we could try bumping up the timeout interval - but I'm afraid that this measure would only be a 'bandaid' and not a true solution.
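
To put a number on node-to-node latency, something like the following could help. It is a minimal sketch that assumes the Linux iputils `ping` summary format ("rtt min/avg/max/mdev = ..."); the function name `avg_rtt` and the hostname in the usage note are made up for illustration.

```shell
# Print the average round-trip time (in ms) from ping's summary line.
# Assumption: the summary looks like
#   rtt min/avg/max/mdev = 0.045/0.052/0.061/0.007 ms
# Splitting on "/" puts the average in the fifth field.
avg_rtt() {
    awk -F'/' '/^(rtt|round-trip)/ { print $5 }'
}
```

Run from one node against each of the others, for example `ping -c 5 -q node2.example.com | avg_rtt` (replace the placeholder hostname with a real node). Nodes on the same subnet would normally show well under a millisecond.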

Does the following command produce any output?

Code:

find /usr/local/nagioslogserver/elasticsearch/data/*/nodes -name "*.recovering"

Re: Log Server hang ups

Posted: Thu Aug 27, 2015 4:16 pm
by CFT6Server
They are in the same datacenter and on the same subnet. Once I reboot all nodes and get them back up and running, this seems to subside. However, I think these are some of the errors that I see around the time the cluster falls over.

The only sign I usually get is visibly less activity on all nodes in XI. When I check the nodes, they and the web GUI just hang, and this is usually when I find the cluster at red or a node has fallen off.

Re: Log Server hang ups

Posted: Fri Aug 28, 2015 9:13 am
by jdalrymple
See any weird resource usage?

Code:

[root@localhost scripts]# sar -s 00:00:00 -e 01:00:00
Linux 2.6.32-504.el6.x86_64 (jrd-cent66-2)      08/28/2015      _x86_64_        (2 CPU)

12:00:01 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
12:10:01 AM     all      1.62      0.00      0.78      0.11      0.00     97.50
12:20:01 AM     all      1.53      0.00      0.61      0.05      0.00     97.81
12:30:01 AM     all      1.64      0.00      0.64      0.07      0.00     97.65
12:40:01 AM     all      1.55      0.00      0.61      0.06      0.00     97.78
12:50:01 AM     all      1.54      0.00      0.61      0.06      0.00     97.80
Average:        all      1.57      0.00      0.65      0.07      0.00     97.71
[root@localhost scripts]# sar -d -s 00:00:00 -e 01:00:00
Linux 2.6.32-504.el6.x86_64 (jrd-cent66-2)      08/28/2015      _x86_64_        (2 CPU)

12:00:01 AM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
12:10:01 AM    dev8-0     13.07      0.01    656.39     50.23      0.20     15.07      0.62      0.81
12:10:01 AM  dev253-0     82.05      0.01    656.39      8.00     11.41    139.00      0.10      0.81
12:10:01 AM  dev253-1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:20:01 AM    dev8-0     12.29      0.01    270.40     22.00      0.12     10.11      0.53      0.65
12:20:01 AM  dev253-0     33.80      0.01    270.40      8.00      0.17      4.98      0.19      0.65
12:20:01 AM  dev253-1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:30:01 AM    dev8-0     12.08      0.01    271.67     22.49      0.15     12.67      0.55      0.67
12:30:01 AM  dev253-0     33.96      0.01    271.67      8.00      0.22      6.43      0.20      0.67
12:30:01 AM  dev253-1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:40:01 AM    dev8-0     12.27      0.04    274.75     22.40      0.10      7.82      0.47      0.57
12:40:01 AM  dev253-0     34.35      0.04    274.75      8.00      0.13      3.75      0.17      0.57
12:40:01 AM  dev253-1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:50:01 AM    dev8-0     12.36      0.07    277.30     22.44      0.06      4.68      0.47      0.58
12:50:01 AM  dev253-0     34.67      0.07    277.30      8.00      0.11      3.19      0.17      0.58
12:50:01 AM  dev253-1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:       dev8-0     12.41      0.03    349.57     28.17      0.13     10.11      0.53      0.66
Average:     dev253-0     43.70      0.03    349.57      8.00      2.39     54.71      0.15      0.66
Average:     dev253-1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

You can look at past days (preferably just before a time of failure) by adding `-f /var/log/sa/saDD`, where DD is the day of the month.
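
As a small convenience, the archive path can be built from the day number. This is only a sketch: `sa_file` is a hypothetical helper, and it assumes the default RHEL/CentOS sysstat location `/var/log/sa`.

```shell
# Build the path to the sysstat archive for a given day of the month.
# Assumption: archives live in /var/log/sa (the RHEL/CentOS default).
sa_file() {
    # Drop one leading zero so "08" is not read as an invalid octal number.
    day=${1#0}
    printf '/var/log/sa/sa%02d\n' "$day"
}
```

For example, `sar -r -s 13:00:00 -e 14:00:00 -f "$(sa_file 27)"` would show memory usage for that hour on the 27th, provided sysstat was already collecting data on that day.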

Re: Log Server hang ups

Posted: Fri Aug 28, 2015 10:58 am
by CFT6Server
Looks like the appliance does not have sysstat installed. I will install it and keep watch.

Re: Log Server hang ups

Posted: Fri Aug 28, 2015 11:22 am
by jdalrymple
That's crummy (and surprising) - I'll suggest to the devs that it get put on there.

Also - just to be clear I'm not suggesting that there is a system problem, but we should try to isolate before we go digging too deep into what ES has going on. Maybe your vSphere performance tab can lend a hand?

Re: Log Server hang ups

Posted: Fri Aug 28, 2015 1:10 pm
by CFT6Server
Here's what I have for the time when the crash happened... the dip is when the cluster went down.

VMware
VMCPU.JPG
CPU Load from XI
CPULoad.JPG
Memory Usage from XI
Memory.JPG

Re: Log Server hang ups

Posted: Fri Aug 28, 2015 1:12 pm
by CFT6Server
Here's the network bandwidth. Eth0 is the prod network and Eth1 is the storage network, where the indices are held on NFS.
Eth.JPG

Re: Log Server hang ups

Posted: Fri Aug 28, 2015 1:14 pm
by jolson
Based on the memory dip, I'm guessing that elasticsearch was killed by the kernel - try running the following on your problem node.

Code:

grep -i 'out of memory' /var/log/messages

Any messages that indicate elasticsearch being killed? If so, you'll likely need to bump up the memory allocated to your instances.
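
A slightly narrower check might look like the sketch below. The function names are hypothetical, and it assumes the kernel logs the victim's process name in parentheses (as in "Killed process 1234 (java)") and that Elasticsearch runs under the JVM's process name; adjust the pattern if your logs differ.

```shell
# List kernel OOM-killer lines from a syslog file (default: /var/log/messages).
oom_events() {
    grep -i -E 'out of memory|killed process' "${1:-/var/log/messages}"
}

# Succeed only if one of those lines names a java or elasticsearch process.
# Assumption: the process name appears in parentheses in the kernel message.
es_was_oom_killed() {
    oom_events "$1" | grep -q -i -E '\((java|elasticsearch)\)'
}
```

Running `es_was_oom_killed && echo "OOM killer hit the JVM"` on the problem node prints the message only when a matching line exists; `oom_events` alone shows every OOM line so the timestamps can be compared with the time of the hang.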

Re: Log Server hang ups

Posted: Fri Aug 28, 2015 1:21 pm
by CFT6Server
I killed the process because I couldn't get the node/cluster going. The dip is when I took the node down for a reboot to get the cluster going.