Log Server hang ups

CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Log Server hang ups

Post by CFT6Server »

I am trying to figure out why our cluster is constantly hanging up. I am trying to bump the breaker limits in an attempt to work around some of the out-of-memory issues. We have 3 nodes, and each has 16GB of RAM.

I am seeing the following errors while tailing the logs, but I'm not sure what they mean or how to address them.

Code:

[2015-08-27 13:42:31,316][DEBUG][action.bulk] [4521585a-88af-47c9-81e5-c4d13cffb148] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-08-27 13:42:31,319][DEBUG][action.bulk] [4521585a-88af-47c9-81e5-c4d13cffb148] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
[2015-08-27 13:42:31,468][DEBUG][action.bulk] [4521585a-88af-47c9-81e5-c4d13cffb148] observer: timeout notification from cluster service. timeout setting [1m], time since start [1m]
The cluster tends to hang up every few days.
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Log Server hang ups

Post by jolson »

Are your nodes physically far apart? What is the average ping time from node to node? We recommend keeping nodes in the same physical location so as to minimize ping time. The timeout you're experiencing is typically due to nodes being unreachable.
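
To put a number on the node-to-node ping time, here is a minimal sketch. It assumes the Linux `ping` summary line format (`rtt min/avg/max/mdev = ...`); the helper name `avg_rtt` and the hostname `node2` are made up for illustration.

```shell
#!/bin/sh
# avg_rtt: read `ping` output on stdin and print the average round-trip time (ms).
# Assumes the summary line looks like: rtt min/avg/max/mdev = a/b/c/d ms
avg_rtt() {
    # Splitting on "/" puts the average in field 5 of the summary line.
    awk -F'/' '/^(rtt|round-trip) / { print $5 }'
}

# Example (node2 is a placeholder hostname; replace it with a real cluster node):
# ping -c 10 node2 | avg_rtt
```

Running this from each node against the other two gives a quick picture of whether any one link stands out.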

If your nodes are in the same datacenter, we could try bumping up the timeout interval - but I'm afraid that this measure would only be a 'bandaid' and not a true solution.

Does the following command produce any output?

Code:

find /usr/local/nagioslogserver/elasticsearch/data/*/nodes -name "*.recovering"
Twits Blog
Show me a man who lives alone and has a perpetually clean kitchen, and 8 times out of 9 I'll show you a man with detestable spiritual qualities.
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Re: Log Server hang ups

Post by CFT6Server »

They are in the same datacenter and on the same subnet. Once I reboot all nodes and get them back up and running, this seems to subside. However, I think these are some of the errors that I see around the time the cluster falls over.

The only sign I usually get is seeing less activity on all nodes in XI. When I check the nodes, they hang, and so does the web GUI; that is usually when I find the cluster at red or a node has fallen off.
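
When the web GUI hangs, the cluster state can often still be read straight from Elasticsearch. A rough sketch, assuming Elasticsearch answers on `localhost:9200` on the node (the default port, not confirmed for this install) and using a hypothetical helper `cluster_status` to pull the `status` field out of the `_cluster/health` response:

```shell
#!/bin/sh
# cluster_status: read a _cluster/health JSON document on stdin and print
# the value of its "status" field (green, yellow, or red).
cluster_status() {
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([a-z]*\)".*/\1/p'
}

# Example (localhost:9200 is an assumption about this install):
# curl -s 'http://localhost:9200/_cluster/health' | cluster_status
```

Logging that value once a minute from cron would show when the cluster first leaves green, rather than when the drop in activity is noticed.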
jdalrymple
Skynet Drone
Posts: 2620
Joined: Wed Feb 11, 2015 1:56 pm

Re: Log Server hang ups

Post by jdalrymple »

See any weird resource usage?

Code:

[root@localhost scripts]# sar -s 00:00:00 -e 01:00:00
Linux 2.6.32-504.el6.x86_64 (jrd-cent66-2)      08/28/2015      _x86_64_        (2 CPU)

12:00:01 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
12:10:01 AM     all      1.62      0.00      0.78      0.11      0.00     97.50
12:20:01 AM     all      1.53      0.00      0.61      0.05      0.00     97.81
12:30:01 AM     all      1.64      0.00      0.64      0.07      0.00     97.65
12:40:01 AM     all      1.55      0.00      0.61      0.06      0.00     97.78
12:50:01 AM     all      1.54      0.00      0.61      0.06      0.00     97.80
Average:        all      1.57      0.00      0.65      0.07      0.00     97.71
[root@localhost scripts]# sar -d -s 00:00:00 -e 01:00:00
Linux 2.6.32-504.el6.x86_64 (jrd-cent66-2)      08/28/2015      _x86_64_        (2 CPU)

12:00:01 AM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
12:10:01 AM    dev8-0     13.07      0.01    656.39     50.23      0.20     15.07      0.62      0.81
12:10:01 AM  dev253-0     82.05      0.01    656.39      8.00     11.41    139.00      0.10      0.81
12:10:01 AM  dev253-1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:20:01 AM    dev8-0     12.29      0.01    270.40     22.00      0.12     10.11      0.53      0.65
12:20:01 AM  dev253-0     33.80      0.01    270.40      8.00      0.17      4.98      0.19      0.65
12:20:01 AM  dev253-1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:30:01 AM    dev8-0     12.08      0.01    271.67     22.49      0.15     12.67      0.55      0.67
12:30:01 AM  dev253-0     33.96      0.01    271.67      8.00      0.22      6.43      0.20      0.67
12:30:01 AM  dev253-1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:40:01 AM    dev8-0     12.27      0.04    274.75     22.40      0.10      7.82      0.47      0.57
12:40:01 AM  dev253-0     34.35      0.04    274.75      8.00      0.13      3.75      0.17      0.57
12:40:01 AM  dev253-1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:50:01 AM    dev8-0     12.36      0.07    277.30     22.44      0.06      4.68      0.47      0.58
12:50:01 AM  dev253-0     34.67      0.07    277.30      8.00      0.11      3.19      0.17      0.58
12:50:01 AM  dev253-1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:       dev8-0     12.41      0.03    349.57     28.17      0.13     10.11      0.53      0.66
Average:     dev253-0     43.70      0.03    349.57      8.00      2.39     54.71      0.15      0.66
Average:     dev253-1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
You can look at past days (preferably just before a time of failure) by adding `-f /var/log/sa/saDD`, where DD is the two-digit day of the month.
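
As a small illustration of the `-f` option, the sketch below builds the data file path for a given day. The helper name `sa_file` is hypothetical, and the 13:00 to 14:00 window is only an example chosen around the timestamps in the log excerpt earlier in the thread.

```shell
#!/bin/sh
# sa_file: print the sysstat data file path for a given day of the month.
# /var/log/sa/saDD is the path quoted above; DD is zero-padded (01-31).
sa_file() {
    day=${1#0}    # drop one leading zero so "08" is not read as octal by printf
    printf '/var/log/sa/sa%02d\n' "$day"
}

# Example: disk activity between 13:00 and 14:00 on the 27th
# sar -d -s 13:00:00 -e 14:00:00 -f "$(sa_file 27)"
```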
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Re: Log Server hang ups

Post by CFT6Server »

Looks like the appliance does not have sysstat installed. I will install it and keep watch.
jdalrymple
Skynet Drone
Posts: 2620
Joined: Wed Feb 11, 2015 1:56 pm

Re: Log Server hang ups

Post by jdalrymple »

That's crummy (and surprising) - I'll suggest to the devs that it get put on there.

Also - just to be clear I'm not suggesting that there is a system problem, but we should try to isolate before we go digging too deep into what ES has going on. Maybe your vSphere performance tab can lend a hand?
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Re: Log Server hang ups

Post by CFT6Server »

Here's what I have for the time when the crash happened. The dip is when the cluster went down.

VMware
VMCPU.JPG
CPU Load from XI
CPULoad.JPG
Memory Usage from XI
Memory.JPG
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Re: Log Server hang ups

Post by CFT6Server »

Here's the network bandwidth. Eth0 is the prod network and Eth1 is the storage network, where the indices are held on NFS.
Eth.JPG
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Log Server hang ups

Post by jolson »

Based on the memory dip, I'm guessing that elasticsearch was killed by the kernel - try running the following on your problem node.

Code:

grep -i 'out of memory' /var/log/messages
Any messages that indicate elasticsearch being killed? If so, you'll likely need to bump up the memory allocated to your instances.
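
To narrow the grep above down to Elasticsearch, one possible sketch counts only the out-of-memory lines that name a given process. Elasticsearch runs inside a JVM, so the process name to look for is `java`; the helper name `oom_kills` is made up for illustration.

```shell
#!/bin/sh
# oom_kills: count kernel out-of-memory lines in a log file that mention
# a given process name (matched case-insensitively).
oom_kills() {
    log=$1
    proc=$2
    grep -i 'out of memory' "$log" | grep -c -i -- "$proc"
}

# Example (/var/log/messages is the file from the command above):
# oom_kills /var/log/messages java
```

A result of 0 on every node would suggest the kernel did not kill Elasticsearch and that the hang has another cause.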
CFT6Server
Posts: 506
Joined: Wed Apr 15, 2015 4:21 pm

Re: Log Server hang ups

Post by CFT6Server »

I killed the process because I couldn't get the node/cluster going. The dip is when I took the node down for a reboot to get the cluster going again.