Log Server hang ups

CFT6Server · Post by **CFT6Server** » Thu Aug 27, 2015 3:48 pm

I am trying to figure out why our cluster keeps hanging. I am trying to bump the breaker limits in an attempt to work around some of the out-of-memory issues. We have 3 nodes, each with 16GB of RAM.

I am seeing the following errors while tailing the logs, but I'm not sure what they are or how to address them.

Code:

[2015-08-27 13:42:31,316][DEBUG][action.bulk] [4521585a-88af-47c9-81e5-c4d13cffb148] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-08-27 13:42:31,319][DEBUG][action.bulk] [4521585a-88af-47c9-81e5-c4d13cffb148] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-08-27 13:42:31,468][DEBUG][action.bulk] [4521585a-88af-47c9-81e5-c4d13cffb148] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]

The cluster tends to hang up every few days.
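
To see whether these timeouts cluster around the times the cluster hangs, the log lines can be tallied per minute. This is only a rough sketch: `count_bulk_timeouts` is a hypothetical helper name, and it assumes the log lines keep the `[YYYY-MM-DD HH:MM:SS,mmm]` prefix shown above.

```shell
# count_bulk_timeouts: hypothetical helper, not part of Nagios Log Server.
# Reads Elasticsearch log lines on stdin and prints "<count> <date> <HH:MM>"
# for every minute that contains a bulk-action timeout notification.
count_bulk_timeouts() {
  grep -F '[action.bulk]' \
    | grep -F 'timeout notification from cluster service' \
    | cut -c2-17 \
    | sort \
    | uniq -c \
    | awk '{ print $1, $2, $3 }'
}

# Usage (the log path is an assumption; point it at your node's log file):
#   count_bulk_timeouts < "$ES_LOG"
```

`cut -c2-17` keeps only the date and `HH:MM` from the timestamp, so `uniq -c` groups the messages by minute.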

jolson · Post by **jolson** » Thu Aug 27, 2015 4:10 pm

Are your nodes physically far apart? What is the average ping time from node to node? We recommend keeping nodes in the same physical location so as to minimize ping time. The timeout you're experiencing is typically due to nodes being unreachable.
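
One rough way to collect the average ping time is sketched below. The function names are hypothetical, the addresses in the usage comment are placeholders, and the parsing assumes the usual `min/avg/max` summary line that `ping -q` prints on Linux.

```shell
# avg_rtt_ms: hypothetical helper. Reads `ping -q` output on stdin and
# prints the average round-trip time from the summary line, e.g.
#   rtt min/avg/max/mdev = 0.045/0.052/0.060/0.006 ms
avg_rtt_ms() {
  awk -F'/' '/^(rtt|round-trip) / { print $5 }'
}

# ping_nodes: hypothetical helper. Sends 5 pings to each host given as an
# argument and prints "<host> <average rtt in ms>".
ping_nodes() {
  for host in "$@"; do
    printf '%s %s\n' "$host" "$(ping -c 5 -q "$host" | avg_rtt_ms)"
  done
}

# Usage (placeholder addresses; substitute the other cluster nodes):
#   ping_nodes 10.0.0.2 10.0.0.3
```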

If your nodes are in the same datacenter, we could try bumping up the timeout interval - but I'm afraid that this measure would only be a 'bandaid' and not a true solution.

Does the following command produce any output?

Code:

find /usr/local/nagioslogserver/elasticsearch/data/*/nodes -name "*.recovering"

CFT6Server · Post by **CFT6Server** » Thu Aug 27, 2015 4:16 pm

They are in the same datacenter and on the same subnet. Once I reboot all nodes and get them back up and running, this seems to subside. However, I think these are some of the errors that I see around the time the cluster falls over.

Usually the only sign I get is seeing less activity on all nodes in XI. When I check the nodes, they just hang, as does the web GUI; this is usually when I find the cluster at red or a node has fallen off.

jdalrymple · Post by **jdalrymple** » Fri Aug 28, 2015 9:13 am

See any weird resource usage?

Code:

[root@localhost scripts]# sar -s 00:00:00 -e 01:00:00
Linux 2.6.32-504.el6.x86_64 (jrd-cent66-2)      08/28/2015      _x86_64_        (2 CPU)

12:00:01 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
12:10:01 AM     all      1.62      0.00      0.78      0.11      0.00     97.50
12:20:01 AM     all      1.53      0.00      0.61      0.05      0.00     97.81
12:30:01 AM     all      1.64      0.00      0.64      0.07      0.00     97.65
12:40:01 AM     all      1.55      0.00      0.61      0.06      0.00     97.78
12:50:01 AM     all      1.54      0.00      0.61      0.06      0.00     97.80
Average:        all      1.57      0.00      0.65      0.07      0.00     97.71
[root@localhost scripts]# sar -d -s 00:00:00 -e 01:00:00
Linux 2.6.32-504.el6.x86_64 (jrd-cent66-2)      08/28/2015      _x86_64_        (2 CPU)

12:00:01 AM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
12:10:01 AM    dev8-0     13.07      0.01    656.39     50.23      0.20     15.07      0.62      0.81
12:10:01 AM  dev253-0     82.05      0.01    656.39      8.00     11.41    139.00      0.10      0.81
12:10:01 AM  dev253-1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:20:01 AM    dev8-0     12.29      0.01    270.40     22.00      0.12     10.11      0.53      0.65
12:20:01 AM  dev253-0     33.80      0.01    270.40      8.00      0.17      4.98      0.19      0.65
12:20:01 AM  dev253-1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:30:01 AM    dev8-0     12.08      0.01    271.67     22.49      0.15     12.67      0.55      0.67
12:30:01 AM  dev253-0     33.96      0.01    271.67      8.00      0.22      6.43      0.20      0.67
12:30:01 AM  dev253-1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:40:01 AM    dev8-0     12.27      0.04    274.75     22.40      0.10      7.82      0.47      0.57
12:40:01 AM  dev253-0     34.35      0.04    274.75      8.00      0.13      3.75      0.17      0.57
12:40:01 AM  dev253-1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:50:01 AM    dev8-0     12.36      0.07    277.30     22.44      0.06      4.68      0.47      0.58
12:50:01 AM  dev253-0     34.67      0.07    277.30      8.00      0.11      3.19      0.17      0.58
12:50:01 AM  dev253-1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:       dev8-0     12.41      0.03    349.57     28.17      0.13     10.11      0.53      0.66
Average:     dev253-0     43.70      0.03    349.57      8.00      2.39     54.71      0.15      0.66
Average:     dev253-1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

You can look at past days (preferably just before a time of failure) by adding `-f /var/log/sa/saDD`, where DD is the two-digit day of the month.
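
As a sketch of how that might look in practice, the helpers below build the `saDD` path and run `sar` over a window on that day. Both function names are hypothetical, and the example window is only a guess based on the 13:42 timestamps in the log excerpt earlier in the thread.

```shell
# sa_file_for_day: hypothetical helper. Prints the sysstat data file path
# for a day of the month, zero-padded to two digits (7 -> sa07).
sa_file_for_day() {
  # Strip one leading zero so "08" is not misread as an octal number.
  printf '/var/log/sa/sa%02d\n' "${1#0}"
}

# sar_around: hypothetical helper.
#   $1 = day of month, $2 = start time, $3 = end time (HH:MM:SS)
#   any remaining arguments are passed through to sar (e.g. -d or -r)
sar_around() {
  day=$1; start=$2; end=$3
  shift 3
  sar "$@" -f "$(sa_file_for_day "$day")" -s "$start" -e "$end"
}

# Usage: CPU, then memory, for the hour around the timeouts on the 27th:
#   sar_around 27 13:00:00 14:00:00
#   sar_around 27 13:00:00 14:00:00 -r
```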

CFT6Server · Post by **CFT6Server** » Fri Aug 28, 2015 10:58 am

Looks like the appliance does not have sysstat installed. I will install it and keep watch.

jdalrymple · Post by **jdalrymple** » Fri Aug 28, 2015 11:22 am

That's crummy (and surprising) - I'll suggest to the devs that it get put on there.

Also - just to be clear I'm not suggesting that there is a system problem, but we should try to isolate before we go digging too deep into what ES has going on. Maybe your vSphere performance tab can lend a hand?

CFT6Server · Post by **CFT6Server** » Fri Aug 28, 2015 1:10 pm

Here's what I have for the time when the crash happened. The dip is when the cluster went down.

VMware

VMCPU.JPG

CPU Load from XI

CPULoad.JPG

Memory Usage from XI

Memory.JPG

CFT6Server · Post by **CFT6Server** » Fri Aug 28, 2015 1:12 pm

Here's the network bandwidth. Eth0 is the prod network and Eth1 is the storage network, where the indices are held on NFS.

Eth.JPG

jolson · Post by **jolson** » Fri Aug 28, 2015 1:14 pm

Based on the memory dip, I'm guessing that elasticsearch was killed by the kernel - try running the following on your problem node.

Code:

grep -i 'out of memory' /var/log/messages

Any messages that indicate elasticsearch being killed? If so, you'll likely need to bump up the memory allocated to your instances.
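
To narrow the search to a single process, something like the following sketch could help. `oom_kills_for` is a hypothetical helper, and it assumes the kernel logs the victim's name in parentheses, as in `Kill process 1234 (java)`. Since elasticsearch runs inside a JVM, the name to look for is most likely `java`.

```shell
# oom_kills_for: hypothetical helper.
#   $1 = process name as the kernel reports it (e.g. java)
#   $2 = log file to search (defaults to /var/log/messages)
# Prints OOM-killer lines that mention the given process name.
oom_kills_for() {
  grep -iE 'out of memory|killed process' "${2:-/var/log/messages}" \
    | grep -F "($1)"
}

# Usage:
#   oom_kills_for java
#   oom_kills_for java /var/log/messages
```

No output means either that the kernel did not kill the process or that the relevant messages have already been rotated out of the file being searched.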

CFT6Server · Post by **CFT6Server** » Fri Aug 28, 2015 1:21 pm

I killed the process because I couldn't get the node/cluster going. The dip is when I took the node down for a reboot to get the cluster going again.
