
Heap Space: OutOfMemoryError

Posted: Thu Jul 23, 2015 2:26 pm
by CFT6Server
We had some issues this morning with the infrastructure and the nodes had to be rebooted. Logstash is crashing on one of the nodes with the following message:

Code: Select all

 Exception in thread "Ruby-0-Thread-40: /usr/local/nagioslogserver/logstash/vendor/bundle/jruby/1.9/gems/stud-0.0.17/lib/stud/buffer.rb:92" java.lang.ArrayIndexOutOfBoundsException: -1
        at org.jruby.runtime.ThreadContext.popRubyClass(ThreadContext.java:697)
        at org.jruby.runtime.ThreadContext.postYield(ThreadContext.java:1257)
        at org.jruby.runtime.ContextAwareBlockBody.post(ContextAwareBlockBody.java:29)
        at org.jruby.runtime.Interpreted19Block.yield(Interpreted19Block.java:198)
        at org.jruby.runtime.Interpreted19Block.call(Interpreted19Block.java:125)
        at org.jruby.runtime.Block.call(Block.java:101)
        at org.jruby.RubyProc.call(RubyProc.java:290)
        at org.jruby.RubyProc.call(RubyProc.java:228)
        at org.jruby.internal.runtime.RubyRunnable.run(RubyRunnable.java:99)
        at java.lang.Thread.run(Thread.java:745)
Exception in thread "elasticsearch[30ab2b2c-439f-4bcc-977d-7c0e9a90f3a5][generic][T#1]" Exception in thread "elasticsearch[30ab2b2c-439f-4bcc-977d-7c0e9a90f3a5][generic][T#4]" java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
Error: Your application used more memory than the safety cap of 500M.
Specify -J-Xmx####m to increase it (#### = cap size in MB).
Specify -w for full OutOfMemoryError stack trace
The other nodes are fine. This is the JVM info on the crashing node:
JVM.JPG
JVM info on a node that's working:
JVM_good.JPG
I tried giving it more RAM so it could recover, but I continue to get errors on this node and Logstash keeps crashing.

Code: Select all

WARN: org.elasticsearch.transport.netty: [30ab2b2c-439f-4bcc-977d-7c0e9a90f3a5] exception caught on transport layer [[id: 0xc0622298, /127.0.0.1:48013 :> localhost/127.0.0.1:9300]], closing connection
java.io.StreamCorruptedException: invalid internal transport message format

Re: Heap Space: OutOfMemoryError

Posted: Thu Jul 23, 2015 3:02 pm
by jolson
We'll need to increase the HEAP_SIZE of Logstash itself. Give this a try.

Open up /etc/sysconfig/logstash

Change:
#LS_HEAP_SIZE="256m"

To:
LS_HEAP_SIZE="1024m"

And run a service logstash restart. Let me know if this helps with your heap problems. Thanks!
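The edit above can be scripted. Here it is demonstrated on a scratch copy of the file, since the exact contents of /etc/sysconfig/logstash can vary by install:

```shell
# Demonstrated on a scratch copy; on the real system, run the same sed
# against /etc/sysconfig/logstash as root, then: service logstash restart
printf '#LS_HEAP_SIZE="256m"\n' > /tmp/sysconfig-logstash
sed -i 's/^#*LS_HEAP_SIZE=.*/LS_HEAP_SIZE="1024m"/' /tmp/sysconfig-logstash
grep LS_HEAP_SIZE /tmp/sysconfig-logstash   # LS_HEAP_SIZE="1024m"
```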

Re: Heap Space: OutOfMemoryError

Posted: Mon Jul 27, 2015 1:47 pm
by CFT6Server
Not sure if this is related, but the nodes are now throwing indices.fielddata.breaker errors. It looks like we are tripping the fielddata limit. I would like to increase it to 70% instead. These are the current breaker stats.

Code: Select all

"cluster_name" : "80e9022e-f73f-429e-8927-f23d0d88dfd2",
  "nodes" : {
    "kcJDKIbyTUWwnXtbyJ9gpQ" : {
      "timestamp" : 1438022601405,
      "name" : "30ab2b2c-439f-4bcc-977d-7c0e9a90f3a5",
      "transport_address" : "inet[/10.242.102.108:9300]",
      "host" : "kdcnagls1n2.bchydro.bc.ca",
      "ip" : [ "inet[/10.242.102.108:9300]", "NONE" ],
      "attributes" : {
        "max_local_storage_nodes" : "1"
      },
      "fielddata_breaker" : {
        "maximum_size_in_bytes" : 5026951987,
        "maximum_size" : "4.6gb",
        "estimated_size_in_bytes" : 4876893485,
        "estimated_size" : "4.5gb",
        "overhead" : 1.03,
        "tripped" : 271
      }
    },
    "uZ8wjGAYQFykeK7MhxIhMQ" : {
      "timestamp" : 1438022601392,
      "name" : "e63648a3-d912-4f5d-a867-1b99282a5e7c",
      "transport_address" : "inet[/10.242.102.109:9300]",
      "host" : "kdcnagls1n3.bchydro.bc.ca",
      "ip" : [ "inet[/10.242.102.109:9300]", "NONE" ],
      "attributes" : {
        "max_local_storage_nodes" : "1"
      },
      "fielddata_breaker" : {
        "maximum_size_in_bytes" : 5026951987,
        "maximum_size" : "4.6gb",
        "estimated_size_in_bytes" : 4511806069,
        "estimated_size" : "4.2gb",
        "overhead" : 1.03,
        "tripped" : 424
      }
    },
    "Qc57wXjdTC-2LWeqy54XMw" : {
      "timestamp" : 1438022601409,
      "name" : "4521585a-88af-47c9-81e5-c4d13cffb148",
      "transport_address" : "inet[/10.242.102.107:9300]",
      "host" : "kdcnagls1n1.bchydro.bc.ca",
      "ip" : [ "inet[/10.242.102.107:9300]", "NONE" ],
      "attributes" : {
        "max_local_storage_nodes" : "1"
      },
      "fielddata_breaker" : {
        "maximum_size_in_bytes" : 5026951987,
        "maximum_size" : "4.6gb",
        "estimated_size_in_bytes" : 4891010595,
        "estimated_size" : "4.5gb",
        "overhead" : 1.03,
        "tripped" : 493
      }
    }
  }
}
Any recommendations that I can set on top of the limit? (i.e. cache.size)

Re: Heap Space: OutOfMemoryError

Posted: Mon Jul 27, 2015 2:38 pm
by jolson
You should definitely read through this page of documentation before adjusting the fielddata breaker setting: https://www.elastic.co/guide/en/elastic ... usage.html
This setting is a safeguard, not a solution for insufficient memory.
If you don’t have enough memory to keep your fielddata resident in memory, Elasticsearch will constantly have to reload data from disk, and evict other data to make space. Evictions cause heavy disk I/O and generate a large amount of garbage in memory, which must be garbage collected later on.
If possible, increase the amount of memory allocated to your Nagios Log Server node, and the fielddata limit will adjust automatically. If you'd like to cap fielddata manually instead, add this setting to the config/elasticsearch.yml file:

indices.fielddata.cache.size: 40%
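If a restart isn't convenient, the breaker limit itself can also be raised at runtime through the cluster settings API. This is a sketch, assuming Elasticsearch answers on localhost:9200 and the ES 1.0-1.3 setting name indices.fielddata.breaker.limit (it was renamed indices.breaker.fielddata.limit in ES 1.4); the payload is sanity-checked locally before sending:

```shell
# Build and validate the settings payload locally first.
cat > /tmp/breaker.json <<'EOF'
{ "persistent" : { "indices.fielddata.breaker.limit" : "60%" } }
EOF
python3 -m json.tool < /tmp/breaker.json
# Then apply it on a live node (uncomment):
# curl -XPUT 'http://localhost:9200/_cluster/settings' -d @/tmp/breaker.json
```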

Re: Heap Space: OutOfMemoryError

Posted: Mon Jul 27, 2015 3:16 pm
by CFT6Server
We already have 16GB allocated to these nodes. The current limit is derived from the JVM heap size, which is 50% of RAM (8GB), so about 4.6GB of the heap is allotted to fielddata. That's quite conservative, and we are definitely hitting the fielddata breaker limit, so we are looking at increasing it to accommodate our queries. In our case, I think the limit trips before Elasticsearch even gets to evict data from memory and go to disk. Have you guys run into this?

Re: Heap Space: OutOfMemoryError

Posted: Mon Jul 27, 2015 4:03 pm
by jolson
CFT6Server wrote:Have you guys run into this?
I have not run into this before, and I have a couple of thoughts.

The recommended way to address this would be increasing the amount of physical RAM in the box.

A few other ways to address this:

You could manually set the Elasticsearch HEAP_SIZE value to a higher number; the configuration file for this is located in /etc/sysconfig/elasticsearch. With 16GB of RAM, I would say that you could probably set it to ~8-10g.
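A sketch of that edit, shown on a scratch copy of the file. The variable name ES_HEAP_SIZE (as used by the ES 1.x sysconfig packaging) and the commented-out default are assumptions here:

```shell
# Scratch copy standing in for /etc/sysconfig/elasticsearch.
printf '#ES_HEAP_SIZE=2g\n' > /tmp/sysconfig-elasticsearch
sed -i 's/^#*ES_HEAP_SIZE=.*/ES_HEAP_SIZE=8g/' /tmp/sysconfig-elasticsearch
grep ES_HEAP_SIZE /tmp/sysconfig-elasticsearch   # ES_HEAP_SIZE=8g
# On a cluster, restart one node at a time: service elasticsearch restart
```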

You could implement a higher fielddata breaker limit as described in my previous post - I would start with something rather conservative (60%) and increase the value as necessary.

I don't have a lot of experience with this particular setting, but I spoke with a developer and his recommendations are in line with mine.

Thanks!

Jesse

Re: Heap Space: OutOfMemoryError

Posted: Mon Jul 27, 2015 5:22 pm
by CFT6Server
Thanks! Let me play around with these settings. In order to apply them, I have to do a rolling restart of the cluster, so it will take a while. I am still waiting for the unassigned shards to clear from the first node.

Re: Heap Space: OutOfMemoryError

Posted: Mon Jul 27, 2015 6:00 pm
by CFT6Server
Looks like the first node I rebooted is stuck allocating shards and nothing is happening when I watch the resource usage on that node. Nothing is showing in the logs. Could it be allocating the unassigned shards but not generating any resource usage?

Code: Select all

# curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
{
  "cluster_name" : "80e9022e-f73f-429e-8927-f23d0d88dfd2",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 86,
  "active_shards" : 136,
  "relocating_shards" : 0,
  "initializing_shards" : 6,
  "unass
igned_shards" : 30

Re: Heap Space: OutOfMemoryError

Posted: Mon Jul 27, 2015 10:01 pm
by jolson
CFT6Server wrote:Looks like the first node I rebooted is stuck allocating shards and nothing is happening when I watch the resource usage on that node. Nothing is showing in the logs. Could it be allocating the unassigned shards but not generating any resource usage?

Code: Select all

# curl -XGET 'http://localhost:9200/_cluster/health?pretty=true'
{
  "cluster_name" : "80e9022e-f73f-429e-8927-f23d0d88dfd2",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 86,
  "active_shards" : 136,
  "relocating_shards" : 0,
  "initializing_shards" : 6,
  "unass
igned_shards" : 30
It certainly could. Let me know if your shards are still stuck in any unassigned or initializing states.