Explosive growth in Elasticsearch log

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
burkm
Posts: 31
Joined: Thu Jan 21, 2016 5:10 pm

Explosive growth in Elasticsearch log

Post by burkm »

Hello,

Lately the Elasticsearch log has been growing at a rate of up to 4GB/hr (this sounds like the other post about the logstash log).
It will log at a "normal" pace (which is still quite verbose) for a while, then suddenly start logging millions of exceptions per hour. This happens once or twice a day. When it does, I stop Elasticsearch and Logstash, delete the file, and restart. It then functions normally for a time.

Many of the errors are like this:
java.io.IOException: Too many open files
This seems odd, because based on advice I found while Googling, I had already raised the nofile limit for the nagios user in /etc/security/limits.conf:

Code: Select all

nagios 		hard   	nofile 		65536
nagios 		soft   	nofile 		65536
Are these limits not adequate? NLS worked fine for about 1.5 years without any explicit limits, so I'm not sure they're even necessary.
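For what it's worth, since limits.conf only applies to sessions started after the change, I also verified what limit the running Elasticsearch process actually has by reading /proc (this assumes the process is findable via pgrep, as it is on my NLS box):

```shell
# Find the Elasticsearch PID (assumes a single java process bootstrapped
# via org.elasticsearch.bootstrap.Elasticsearch, as in NLS)
ES_PID=$(pgrep -f org.elasticsearch.bootstrap.Elasticsearch)

# What nofile limit is the *running* process actually under?
grep 'Max open files' "/proc/$ES_PID/limits"

# How many descriptors does it currently hold?
ls "/proc/$ES_PID/fd" | wc -l
```

If the held-descriptor count is creeping up toward the limit between the logging storms, that would point at a descriptor leak rather than the limit simply being too low.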

Here's a sample of the logs, just grepping on Exception so it's not too verbose:

Code: Select all

java.io.IOException: Too many open files
java.io.IOException: Too many open files
org.elasticsearch.index.shard.IndexShardCreationException: [admin.cgi][3] failed to create shard
Caused by: java.io.IOException: directory '/data1/data/76900ee2-f769-413c-9948-850204a96b32/nodes/0/indices/admin.cgi/3/index' exists and is a directory, but cannot be listed: list() returned null
Caused by: java.io.IOException: directory '/data1/data/76900ee2-f769-413c-9948-850204a96b32/nodes/0/indices/admin.cgi/2/index' exists and is a directory, but cannot be listed: list() returned null
java.nio.file.FileSystemException: /data1/data/76900ee2-f769-413c-9948-850204a96b32/nodes/0/indices: Too many open files
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
java.io.IOException: Too many open files
java.io.IOException: Too many open files
java.io.IOException: Too many open files
java.io.IOException: Too many open files
java.io.IOException: Too many open files
java.io.IOException: Too many open files
org.elasticsearch.index.shard.IndexShardCreationException: [administrator.cgi][3] failed to create shard
Caused by: java.io.IOException: directory '/data1/data/76900ee2-f769-413c-9948-850204a96b32/nodes/0/indices/administrator.cgi/3/index' exists and is a directory, but cannot be listed: list() returned null
org.elasticsearch.index.shard.IndexShardCreationException: [administrator.cgi][2] failed to create shard
Caused by: java.io.IOException: directory '/data1/data/76900ee2-f769-413c-9948-850204a96b32/nodes/0/indices/administrator.cgi/2/index' exists and is a directory, but cannot be listed: list() returned null
java.nio.file.FileSystemException: /data1/data/76900ee2-f769-413c-9948-850204a96b32/nodes/0/indices: Too many open files
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
java.io.IOException: Too many open files
java.io.IOException: Too many open files
java.io.IOException: Too many open files
java.io.IOException: Too many open files
java.io.IOException: Too many open files
java.io.IOException: Too many open files
org.elasticsearch.index.shard.IndexShardCreationException: [admin.cgi][2] failed to create shard
Caused by: java.io.IOException: directory '/data1/data/76900ee2-f769-413c-9948-850204a96b32/nodes/0/indices/admin.cgi/2/index' exists and is a directory, but cannot be listed: list() returned null
org.elasticsearch.index.shard.IndexShardCreationException: [admin.cgi][3] failed to create shard
Caused by: java.io.IOException: directory '/data1/data/76900ee2-f769-413c-9948-850204a96b32/nodes/0/indices/admin.cgi/3/index' exists and is a directory, but cannot be listed: list() returned null
java.nio.file.FileSystemException: /data1/data/76900ee2-f769-413c-9948-850204a96b32/nodes/0/indices: Too many open files
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
java.io.IOException: Too many open files
java.io.IOException: Too many open files
java.io.IOException: Too many open files
java.io.IOException: Too many open files
java.io.IOException: Too many open files
java.io.IOException: Too many open files
java.io.IOException: Too many open files
java.io.IOException: Too many open files
org.elasticsearch.index.shard.IndexShardCreationException: [administrator.cgi][1] failed to create shard
Caused by: java.io.IOException: directory '/data1/data/76900ee2-f769-413c-9948-850204a96b32/nodes/0/indices/administrator.cgi/1/index' exists and is a directory, but cannot be listed: list() returned null
org.elasticsearch.index.shard.IndexShardCreationException: [admin.cgi][1] failed to create shard
Caused by: java.io.IOException: directory '/data1/data/76900ee2-f769-413c-9948-850204a96b32/nodes/0/indices/admin.cgi/1/index' exists and is a directory, but cannot be listed: list() returned null
java.nio.file.FileSystemException: /data1/data/76900ee2-f769-413c-9948-850204a96b32/nodes/0/indices: Too many open files
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
java.io.IOException: Too many open files
java.io.IOException: Too many open files
java.io.IOException: Too many open files
java.io.IOException: Too many open files
java.io.IOException: Too many open files
java.io.IOException: Too many open files
After this happened a couple of times on NLS 1.4.1, I upgraded to 1.4.2. I'm having a host of problems with the web interface now, but I'll save those for another post if they don't seem to be related.
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Re: Explosive growth in Elasticsearch log

Post by mcapra »

Can you show the output of the following commands:

Code: Select all

ps aux | grep java
df -h
free -m
There are a few messages in there about failing to create a shard that I'd like to investigate.
Former Nagios employee
https://www.mcapra.com/
burkm
Posts: 31
Joined: Thu Jan 21, 2016 5:10 pm

Re: Explosive growth in Elasticsearch log

Post by burkm »

It's behaving well so far today...

Code: Select all

burkm@calcium:~ $ ps aux |grep java
burkm    15351  0.0  0.0 112648   972 pts/0    S+   09:56   0:00 grep --color=auto java
nagios   17025 26.1 53.4 34944372 13076264 ?   SLl  Aug15 269:21 /bin/java -Xms11949m -Xmx11949m -Djava.awt.headless=true -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC -Dfile.encoding=UTF-8 -Des.cluster.name=76900ee2-f769-413c-9948-850204a96b32 -Des.node.name=707a55e2-e8cb-4073-a800-2fbc5bd4c6df -Des.discovery.zen.ping.unicast.hosts=10.30.216.56,10.30.216.20 -Des.path.repo=/ -Delasticsearch -Des.pidfile=/var/run/elasticsearch/elasticsearch.pid -Des.path.home=/usr/local/nagioslogserver/elasticsearch -cp :/usr/local/nagioslogserver/elasticsearch/lib/elasticsearch-1.6.0.jar:/usr/local/nagioslogserver/elasticsearch/lib/*:/usr/local/nagioslogserver/elasticsearch/lib/sigar/* -Des.default.path.home=/usr/local/nagioslogserver/elasticsearch -Des.default.path.logs=/var/log/elasticsearch -Des.default.path.data=/data1/data -Des.default.path.work=/usr/local/nagioslogserver/tmp/elasticsearch -Des.default.path.conf=/usr/local/nagioslogserver/elasticsearch/config org.elasticsearch.bootstrap.Elasticsearch
nagios   31697  2.5  3.6 6716476 893416 ?      SNsl Aug15  37:16 /bin/java -Djava.io.tmpdir=/usr/local/nagioslogserver/tmp -Djava.io.tmpdir=/usr/local/nagioslogserver/tmp -Xmx500m -Xss2048k -Djffi.boot.library.path=/usr/local/nagioslogserver/logstash/vendor/jruby/lib/jni -Djava.io.tmpdir=/usr/local/nagioslogserver/tmp -Djava.io.tmpdir=/usr/local/nagioslogserver/tmp -Xbootclasspath/a:/usr/local/nagioslogserver/logstash/vendor/jruby/lib/jruby.jar -classpath : -Djruby.home=/usr/local/nagioslogserver/logstash/vendor/jruby -Djruby.lib=/usr/local/nagioslogserver/logstash/vendor/jruby/lib -Djruby.script=jruby -Djruby.shell=/bin/sh org.jruby.Main --1.9 /usr/local/nagioslogserver/logstash/lib/bootstrap/environment.rb logstash/runner.rb agent -f /usr/local/nagioslogserver/logstash/etc/conf.d -l /var/log/logstash/logstash.log -w 4

Code: Select all

burkm@calcium:~ $ df -h
Filesystem                      Size  Used Avail Use% Mounted on
/dev/mapper/vg_calcium-lv_root  4.0G  569M  3.5G  14% /
devtmpfs                         12G     0   12G   0% /dev
tmpfs                            12G     0   12G   0% /dev/shm
tmpfs                            12G  153M   12G   2% /run
tmpfs                            12G     0   12G   0% /sys/fs/cgroup
/dev/mapper/vg_calcium-lv_usr    16G  2.1G   14G  13% /usr
/dev/sdb1                       558G   69G  490G  13% /data1
/dev/mapper/vg_calcium-lv_tmp   4.0G   94M  3.9G   3% /tmp
/dev/mapper/vg_calcium-lv_var    32G  629M   32G   2% /var
/dev/mapper/vg_calcium-lv_home   16G  529M   16G   4% /home
/dev/sda1                       253M  199M   55M  79% /boot
tmpfs                           2.4G     0  2.4G   0% /run/user/129
tmpfs                           2.4G     0  2.4G   0% /run/user/1000

Code: Select all

burkm@calcium:~ $ free -m
              total        used        free      shared  buff/cache   available
Mem:          23899       14272        1260         149        8366        8989
Swap:         24575           3       24572
Thanks,
Michael
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Re: Explosive growth in Elasticsearch log

Post by mcapra »

This might be related to the following GitHub issues I've been tracking:
https://github.com/elastic/logstash/issues/4815
https://github.com/elastic/logstash/issues/4225

From what I gather, Logstash sometimes fails to close connections properly, which creates a sort of blockage on the back end. Some users have remedied this by restarting the logstash service at a regular interval. You could give that a try, though I realize it's less than ideal.
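If you do want to try the periodic restart as a stopgap, a minimal sketch would be a cron entry along these lines (the service name, restart interval, and log path here are assumptions; adjust to your install):

```
# /etc/cron.d/restart-logstash -- sketch only; assumes a "logstash" service
# managed by the stock init script / unit, restarted every 6 hours
0 */6 * * * root service logstash restart >> /var/log/logstash-restart.log 2>&1
```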

I have already filed an internal bug report for this issue (ID 9305).
burkm
Posts: 31
Joined: Thu Jan 21, 2016 5:10 pm

Re: Explosive growth in Elasticsearch log

Post by burkm »

Quick update: I increased the nofile limit to 128K, and I haven't had any problems for a few days now.
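For anyone following along, 128K works out to 131072, so the limits.conf entries now look like this (remember the new limit only takes effect for sessions started afterward, so restart Elasticsearch and Logstash):

```
nagios    hard    nofile    131072
nagios    soft    nofile    131072
```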
mcapra
Posts: 3739
Joined: Thu May 05, 2016 3:54 pm

Re: Explosive growth in Elasticsearch log

Post by mcapra »

Good news! Let us know if there are any updates; this is something we're following pretty closely internally.