
Java running at 100%

Posted: Wed May 11, 2016 1:34 pm
by vmwareguy
I have a fresh install of NLS with 45 hosts pushing logs to the server. Within 10 minutes of receiving logs, the java process running as the nagios user hits 100% CPU. I'm running CentOS 6.7 with a source install of NLS. Any thoughts?

Re: Java running at 100%

Posted: Wed May 11, 2016 2:22 pm
by hsmith
Some information that will help us:

How much memory do you have?

Code: Select all

free -m
How's your CPU doing?

Code: Select all

top -bn1 | head -n5
How much log data per day are you receiving? You can find this under the Administration tab by following the 'Index Status' link.
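If the web UI isn't handy, a similar per-day picture can be pulled straight from elasticsearch's cat API (a sketch, assuming elasticsearch is listening on its default localhost:9200):

```shell
# List the daily logstash-* indices with document counts and on-disk size
curl -s 'http://localhost:9200/_cat/indices/logstash-*?v'
```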

Anything in the logstash log that might be a hint?

Code: Select all

tail /var/log/logstash/logstash.log
How about the elasticsearch log?

Code: Select all

tail /var/log/elasticsearch/*.log

Re: Java running at 100%

Posted: Thu May 12, 2016 6:15 am
by vmwareguy
Free mem
Mem: Total 3832 Used 3694 Free 138

logstash has been throwing this error a lot:
{:timestamp=>"2016-05-12T07:16:50.680000-0400", :message=>"retrying failed action with response code: 503", :level=>:warn}

elasticsearch tail log

==> /var/log/elasticsearch/de670d53-cf3c-4c88-876a-d74e20244397.log <==
at sun.nio.ch.FileDispatcherImpl.write(FileDispatcherImpl.java:60)
at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93)
at sun.nio.ch.IOUtil.write(IOUtil.java:65)
at sun.nio.ch.FileChannelImpl.write(FileChannelImpl.java:205)
at org.elasticsearch.common.io.Channels.writeToChannel(Channels.java:193)
at org.elasticsearch.index.translog.fs.BufferingFsTranslogFile.flushBuffer(BufferingFsTranslogFile.java:116)
at org.elasticsearch.index.translog.fs.BufferingFsTranslogFile.add(BufferingFsTranslogFile.java:101)
at org.elasticsearch.index.translog.fs.FsTranslog.add(FsTranslog.java:379)
... 9 more
[2016-05-12 07:18:10,562][WARN ][cluster.action.shard ] [ad929a48-ccb5-4e5b-a5b7-8853e7aa30db] [logstash-2016.05.12][0] received shard failed for [logstash-2016.05.12][0], node[LGTW1UYZS-2XS6Xjc-Qnpg], [P], s[INITIALIZING], indexUUID [wJtFiyg3TfCYFb8g89XpPQ], reason [shard failure [failed recovery][IndexShardGatewayRecoveryException[[logstash-2016.05.12][0] failed to recover shard]; nested: TranslogException[[logstash-2016.05.12][0] Failed to write operation [org.elasticsearch.index.translog.Translog$Create@3f5e0b6d]]; nested: IOException[No space left on device]; ]]
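That last nested exception looks like the real problem: IOException[No space left on device]. A quick disk check (a sketch, assuming elasticsearch's default data path on this install) would show whether the filesystem has filled up:

```shell
# elasticsearch refuses writes (and logstash retries with 503s)
# when the filesystem holding its data is full
df -h

# Size of the elasticsearch data directory
# (default location; adjust if your install differs)
du -sh /var/lib/elasticsearch
```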


At the moment java is fine, but it's early in the morning. I will post CPU status once java starts acting up again. Thanks for the help.

Re: Java running at 100%

Posted: Thu May 12, 2016 6:17 am
by vmwareguy
I've also noticed that the logstash service keeps dying: "Logstash Daemon dead but pid file exists"

Re: Java running at 100%

Posted: Thu May 12, 2016 9:00 am
by eloyd
If you have the ability to increase physical memory, do so (I do not know if this is a virtual machine or a physical machine), as that will help immensely.

By default, NLS allocates 50% of available system RAM to elasticsearch. You can verify this in /etc/sysconfig/elasticsearch. It MAY be worthwhile to increase the ES_HEAP_SIZE to something bigger than 50% of RAM. You can test this out by changing the last line in this code:

Code: Select all

# Heap Size (defaults to 256m min, 1g max)
# Nagios Log Server Default to 0.5 physical Memory
ES_HEAP_SIZE=$(expr $(free -m|awk '/^Mem:/{print $2}') / 2 )m
to

Code: Select all

ES_HEAP_SIZE=5120m
which will allocate 5GB of RAM instead of ~4GB to elasticsearch's Java heap. This might work. You might be able to go as far as 6144m, which would be 6GB, but I wouldn't push it any further than that on a system with 8GB of RAM. Increasing swap will likely not gain you anything on this system, and in fact will degrade performance. Whatever memory is left over can be given to Logstash by changing /etc/sysconfig/logstash similarly:

Code: Select all

# Arguments to pass to java
#LS_HEAP_SIZE="256m"
LS_JAVA_OPTS="-Djava.io.tmpdir=$APP_DIR/tmp"
Change that middle line to be something like:

Code: Select all

LS_HEAP_SIZE="1024m"
(Note that you have to uncomment it.) Restart logstash and elasticsearch. You may need to play with how the memory is split between elasticsearch and logstash, but don't go below 50% of memory for elasticsearch. A good article on elasticsearch JVM tuning can be found at https://www.elastic.co/guide/en/elastic ... izing.html
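On CentOS 6 that restart is typically done through the SysV init scripts, and you can then confirm the new heap took effect by looking at the flags the running JVMs were started with (a sketch; service names and the -Xmx derivation from ES_HEAP_SIZE assume a standard NLS install):

```shell
# Restart both JVMs so the new heap settings are picked up
service elasticsearch restart
service logstash restart

# Confirm: the running java processes should show -Xms/-Xmx
# values matching the ES_HEAP_SIZE / LS_HEAP_SIZE you set
ps -o args= -C java | tr ' ' '\n' | grep -E '^-Xm[sx]'
```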

Note that, no matter what you do, 100% CPU is fine. There's nothing wrong with using all of your processor if it's doing actual work.

Re: Java running at 100%

Posted: Thu May 12, 2016 9:11 am
by vmwareguy
eloyd

thanks for the reply - I'm running as a VM and have upped memory to 8GB and added a second CPU. I made both of your recommended changes. I'll keep an eye on NLS and post again later.

Thanks for the advice.

Re: Java running at 100%

Posted: Thu May 12, 2016 9:20 am
by eloyd
No problem. Again, it's a tuning issue so what works for one person may not work for another based on the overall workload that your server is under. Also, make sure not to allocate 100% of memory to ES and LS. You need some leftover for the OS. :-)
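As a quick sanity check on that split, the arithmetic with the values suggested above on an 8GB box looks like this (a sketch, using the example heap sizes from earlier in the thread):

```shell
# Heap given to ES + LS must leave headroom for the OS
# and filesystem cache; with the example values above:
total=8192   # MB of physical RAM
es=5120      # ES_HEAP_SIZE
ls=1024      # LS_HEAP_SIZE
echo "left for OS and caches: $((total - es - ls)) MB"
```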

Re: Java running at 100%

Posted: Thu May 12, 2016 9:39 am
by vmwareguy
eloyd wrote:No problem. Again, it's a tuning issue so what works for one person may not work for another based on the overall workload that your server is under. Also, make sure not to allocate 100% of memory to ES and LS. You need some leftover for the OS. :-)

So far she is cooking along nicely. I'll let her run for the day and report back tomorrow.

Re: Java running at 100%

Posted: Thu May 12, 2016 9:59 am
by hsmith
Sounds good - let us know!

Re: Java running at 100%

Posted: Fri May 13, 2016 6:06 am
by vmwareguy
So after working with the memory settings, here is what I've found. I have a mix of 30 servers and workstations pushing logs to NLS without any problems. Problems start whenever I start pushing logs from 14 ESXi hosts to NLS. I know that ESXi spits out a lot of logs, but I'm surprised they send enough to overwhelm NLS. As soon as I open up port 1514, the NLS CPU hits the high 90s within 5 minutes.
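One way to see just how much the ESXi hosts are actually sending is to count packets hitting the syslog port over a short window (a sketch; assumes tcpdump is installed, eth0 is the receiving interface, and you run it as root):

```shell
# Count packets arriving on the NLS syslog port for 10 seconds,
# then scale up to estimate the per-day load from the ESXi hosts
timeout 10 tcpdump -i eth0 -nn -q port 1514 2>/dev/null | wc -l
```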