Logstash and Elasticsearch fail
Posted: Tue Sep 22, 2015 9:16 am
by teirekos
I have a 3 node cluster on ver. 2015R2.1
On my first node, which is not the master but is the server where logstash gathers the data (NagiosLogServer1), logstash suddenly fails, and after I restart it, elasticsearch follows!
On both logstash and elasticsearch I get the pid error below:
[root@NagiosLogServer1 elasticsearch]# service elasticsearch status
elasticsearch dead but pid file exists
I am attaching the latest logs from the server.
Thanks a lot.
Re: Logstash and Elasticsearch fail
Posted: Tue Sep 22, 2015 9:50 am
by jolson
What does your free memory look like?
This is almost certainly happening because elasticsearch is starting to take up too much memory. How much memory is in your instance right now? I'd recommend doubling it as a test.
Once you notice that elasticsearch has crashed, you may be able to catch the error with the following command:
Code:
grep -i 'out of memory' /var/log/messages
After you've doubled the memory in your box, you will need to restart elasticsearch:
Code:
service elasticsearch restart
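If doubling the RAM alone doesn't help, the elasticsearch heap ceiling usually needs to be raised too. A sketch, assuming the sysconfig file requested later in this thread (/etc/sysconfig/elasticsearch) supports the conventional ES_HEAP_SIZE variable from init scripts of this era - check your own file before relying on the name:

```shell
# /etc/sysconfig/elasticsearch (file path as used later in this thread)
# ES_HEAP_SIZE is an assumption here; verify it exists in your file.
# Roughly half of physical RAM is the usual starting point.
ES_HEAP_SIZE=16g
export ES_HEAP_SIZE
```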
Re: Logstash and Elasticsearch fail
Posted: Thu Sep 24, 2015 2:20 am
by teirekos
I have a 3-node cluster, so I've doubled the RAM to 32GB on all servers. (I have set discovery.zen.minimum_master_nodes: 2 on all servers.)
Now nodes 1 and 2 seem to be working fine, but the 3rd (NagiosLogServer3) always drops the elasticsearch service because it runs out of memory. You can see below how memory consumption increases until elasticsearch is down.
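For reference, the quorum rule behind that setting: minimum_master_nodes should be floor(master-eligible nodes / 2) + 1, which for a 3-node cluster is indeed 2. A sketch of the relevant elasticsearch.yml fragment:

```yaml
# elasticsearch.yml - quorum for a 3-node cluster:
# floor(3 / 2) + 1 = 2, which prevents a split-brain where two
# halves of the cluster each elect their own master.
discovery.zen.minimum_master_nodes: 2
```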
[root@NagiosLogServer3 ~]# free -m
             total       used       free     shared    buffers     cached
Mem:         32241      23118       9122          0          6        118
-/+ buffers/cache:      22994       9246
Swap:          255        123        132
[root@NagiosLogServer3 ~]# service elasticsearch status
elasticsearch (pid 31496) is running...
[root@NagiosLogServer3 ~]# free -m
             total       used       free     shared    buffers     cached
Mem:         32241      30671       1569          0          6        121
-/+ buffers/cache:      30543       1698
Swap:          255        121        134
[root@NagiosLogServer3 ~]# free -m
             total       used       free     shared    buffers     cached
Mem:         32241      30923       1317          0          6        122
-/+ buffers/cache:      30794       1446
Swap:          255        120        135
[root@NagiosLogServer3 ~]# free -m
             total       used       free     shared    buffers     cached
Mem:         32241      31513        727          0          6        111
-/+ buffers/cache:      31395        846
Swap:          255        116        139
[root@NagiosLogServer3 ~]# free -m
             total       used       free     shared    buffers     cached
Mem:         32241      31597        643          0          6        108
-/+ buffers/cache:      31483        758
Swap:          255        111        144
[root@NagiosLogServer3 ~]# free -m
             total       used       free     shared    buffers     cached
Mem:         32241      18278      13962          0          4         35
-/+ buffers/cache:      18239      14001
Swap:          255        158         97
[root@NagiosLogServer3 ~]# service elasticsearch status
elasticsearch dead but pid file exists
[root@NagiosLogServer3 ~]# grep -i 'out of memory' /var/log/messages
Sep 23 15:28:46 NagiosLogServer3 kernel: Out of memory: Kill process 1518 (java) score 517 or sacrifice child
Sep 23 15:48:08 NagiosLogServer3 kernel: Out of memory: Kill process 1487 (java) score 517 or sacrifice child
Sep 23 16:56:47 NagiosLogServer3 kernel: Out of memory: Kill process 1563 (java) score 520 or sacrifice child
Sep 24 09:39:53 NagiosLogServer3 kernel: Out of memory: Kill process 1554 (java) score 522 or sacrifice child
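The grep output above can be summarized to confirm what is being killed. A small sketch - in production you would pipe from /var/log/messages, but the log lines from this thread are inlined here so the pipeline can be tried anywhere:

```shell
# Summarize OOM-killer events: how many kills, and which process was the victim.
oom_lines='Sep 23 15:28:46 NagiosLogServer3 kernel: Out of memory: Kill process 1518 (java) score 517 or sacrifice child
Sep 23 15:48:08 NagiosLogServer3 kernel: Out of memory: Kill process 1487 (java) score 517 or sacrifice child
Sep 23 16:56:47 NagiosLogServer3 kernel: Out of memory: Kill process 1563 (java) score 520 or sacrifice child
Sep 24 09:39:53 NagiosLogServer3 kernel: Out of memory: Kill process 1554 (java) score 522 or sacrifice child'

# In these kernel messages, field 11 is the PID and field 12 the (process) name.
summary=$(printf '%s\n' "$oom_lines" |
  awk '/Out of memory: Kill process/ { kills++; victim=$12 } END { print kills, victim }')
echo "$summary"   # 4 (java)
```

Every victim is a java process, i.e. elasticsearch (or logstash), which matches the "dead but pid file exists" symptom: the kernel kills the JVM but the pid file is left behind.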
Thanks
[root@NagiosLogServer3 ~]# java -XX:+PrintFlagsFinal -version | grep -iE 'HeapSize|PermSize|ThreadStackSize'
    uintx AdaptivePermSizeWeight        = 20              {product}
     intx CompilerThreadStackSize       = 0               {pd product}
    uintx ErgoHeapSizeLimit             = 0               {product}
    uintx HeapSizePerGCThread           = 87241520        {product}
    uintx InitialHeapSize              := 528238528       {product}
    uintx LargePageHeapSizeThreshold    = 134217728       {product}
    uintx MaxHeapSize                  := 8453619712      {product}
    uintx MaxPermSize                   = 174063616       {pd product}
    uintx PermSize                      = 21757952        {pd product}
     intx ThreadStackSize               = 1024            {pd product}
     intx VMThreadStackSize             = 1024            {pd product}
java version "1.7.0_85"
OpenJDK Runtime Environment (rhel-2.6.1.3.el6_6-x86_64 u85-b01)
OpenJDK 64-Bit Server VM (build 24.85-b03, mixed mode)
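Decoding the flag dump above: values marked ":=" are what the JVM actually chose, in bytes. A quick sanity check on the MaxHeapSize value shown:

```shell
# MaxHeapSize reported by -XX:+PrintFlagsFinal above, in bytes.
max_heap=8453619712
# Convert to GB; the JVM defaulted to roughly 1/4 of the 32GB of RAM.
heap_gb=$(awk -v b="$max_heap" 'BEGIN { printf "%.1f", b / (1024 * 1024 * 1024) }')
echo "$heap_gb"   # 7.9
```

So this JVM's heap is capped near 8GB; the remaining consumption seen in free -m presumably comes from other processes (logstash is also a JVM) and off-heap usage.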
Re: Logstash and Elasticsearch fail
Posted: Thu Sep 24, 2015 9:22 am
by jolson
Is your third node taking in substantially more logs than your other nodes? It's possible that it's crashing due to the sheer volume of log input - though 32GB is generally a very capable amount of RAM.
I'd like to see a couple of elasticsearch configuration files, in addition to your elasticsearch logs.
Code:
cat /usr/local/nagioslogserver/elasticsearch/config/elasticsearch.yml
cat /etc/sysconfig/elasticsearch
cat /var/log/elasticsearch/*.log
Re: Logstash and Elasticsearch fail
Posted: Fri Sep 25, 2015 7:19 am
by teirekos
Currently only NagiosLogServer1 receives logs in logstash, and the same server is my NFS server (with a separate disk for the NFS filesystem /NLSBackup).
Also, as you can see from the disk space distribution below (all 3 nodes have the same disk space allocated), the data is evenly distributed.
[root@NagiosLogServer1 ~]# df -h
Filesystem             Size  Used Avail Use% Mounted on
rootfs                  99G   45G   53G  46% /
devtmpfs                16G  152K   16G   1% /dev
tmpfs                   16G     0   16G   0% /dev/shm
/dev/sda1               99G   45G   53G  46% /
/dev/sdb                99G   43G   51G  46% /NLSBackup

[root@NagiosLogServer2 ~]# df -h
Filesystem             Size  Used Avail Use% Mounted on
rootfs                  99G   43G   56G  44% /
devtmpfs                16G  148K   16G   1% /dev
tmpfs                   16G     0   16G   0% /dev/shm
/dev/sda1               99G   43G   56G  44% /
10.1.11.10:/NLSBackup   99G   43G   51G  46% /NLSBackup

[root@NagiosLogServer3 ~]# df -h
Filesystem             Size  Used Avail Use% Mounted on
rootfs                  99G   47G   52G  48% /
devtmpfs                16G  140K   16G   1% /dev
tmpfs                   16G     0   16G   0% /dev/shm
/dev/sda1               99G   47G   52G  48% /
10.1.11.10:/NLSBackup   99G   43G   51G  46% /NLSBackup
I am attaching the files you requested.
Thanks a lot.
Re: Logstash and Elasticsearch fail
Posted: Fri Sep 25, 2015 10:30 am
by jolson
I took a look at your logs, and nothing jumped out at me. Since elasticsearch's memory usage keeps increasing until it dies, would you be fine with upping the memory in that particular node until this stops happening?
I recommend trying 40GB, 48GB, and 60GB as benchmarks. Don't increase memory beyond 60GB on any particular node; it will cause the entire cluster to slow down considerably - it's always best to add another instance after hitting the RAM cap.
Let me know if elasticsearch stops acting up after increasing that node's memory.
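For what it's worth, the ~60GB per-node ceiling likely relates to the JVM: above roughly 31GB of heap, compressed object pointers are disabled and every object reference doubles in size, so very large heaps can actually perform worse. A sketch of the common sizing rule (half of RAM, capped at 31GB) - the function name is just for illustration:

```shell
# Rule of thumb for the elasticsearch heap: half of physical RAM,
# capped at 31GB so the JVM keeps using compressed object pointers.
recommend_heap_gb() {
  ram_gb=$1
  half=$(( ram_gb / 2 ))
  if [ "$half" -gt 31 ]; then half=31; fi
  echo "$half"
}
recommend_heap_gb 32   # -> 16
recommend_heap_gb 64   # -> 31
```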
Re: Logstash and Elasticsearch fail
Posted: Mon Oct 05, 2015 6:35 am
by teirekos
Unfortunately, we currently do not have any more resources to allocate on our VM servers.
Over the past few days I've noticed that the memory consumption beyond the allocated RAM (32GB + default swap size) does not always hit the same server, but rather any one of the 3 servers in the cluster - i.e. at some point I lose elasticsearch or logstash on one server (and not necessarily the master). The logging at the time of the high memory usage does not indicate any problems (as you also noticed). I do not know what causes these high memory spikes.
Anyway, please go ahead and close this thread.
Thanks for your help.
Re: Logstash and Elasticsearch fail
Posted: Mon Oct 05, 2015 10:49 am
by jolson
Sounds good, I'll close it up. The high memory usage could be caused by an influx of incoming logs, a 'heavy' query being run from the UI, or something like the backup process taking place. It's hard to pin down exactly, but more memory will certainly help. Feel free to start a new thread if you have any additional questions or comments. Thanks!