Recently downloaded trial Nagios LogServer crashes

DC6171 · Post by **DC6171** » Thu Jun 25, 2015 9:52 am

Hello,

We recently started a Nagios Log Server trial using the VMware OVF and have seven devices pushing about 2.8GB of data per day to the log server, with all data deleted after 7 days. Unfortunately, the appliance periodically dies for some reason and has to be powered off and back on again. Judging by the VMware performance graphs, the appliance does not appear to be overloaded, and it is not out of disk space. I've attached the tail end of the console output from when the problem occurs. Any suggestions on what to try would be appreciated.

Environment Info:
CPU: 4 cores
Disk: 100GB, eager-zeroed thick, ~25% full
Physical Host: HP ProLiant DL380 G9
Hypervisor: VMware ESXi, 6.0.0, 2494585
VMware Tools: Running, version: 9536
Storage: VMFS5
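
As a rough sanity check on the disk figures above, the retained volume can be estimated from the daily ingest rate and the retention window. This is only a sketch: it assumes the 2.8GB/day figure is raw log volume and ignores Elasticsearch index overhead, and `retained_gb` is a hypothetical helper name used for illustration.

```shell
#!/bin/sh
# Hypothetical helper: estimate how much data is retained, in GB.
# $1 = daily ingest in GB, $2 = retention period in days
retained_gb() {
    awk -v daily="$1" -v days="$2" 'BEGIN { printf "%.1f\n", daily * days }'
}

retained_gb 2.8 7   # prints 19.6
```

Roughly 20GB of raw data is in line with a 100GB disk that is about 25% full, which supports the observation that disk space is not the problem.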

jolson · Post by **jolson** » Thu Jun 25, 2015 10:14 am

First, I'd like to know about what's inside of the virtual machine:
-Memory
-Number of CPUs
-Storage
-What version of NLS? (I assume the latest - R1.4)

I'd like you to send us the elasticsearch log generated during the time of failure:

Code:

cat /var/log/elasticsearch/*.log

Some additional debug information:

Code:

cat /etc/sysconfig/elasticsearch
grep -i 'out of memory' /var/log/messages
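
The `grep` above only matches the phrase "out of memory". As a slightly broader sketch, the same check can also match the kernel's `oom-killer` lines and report a count. `oom_lines` is a hypothetical helper name, and the file to pass in (for example `/var/log/messages`) depends on the system's syslog configuration.

```shell
#!/bin/sh
# Hypothetical helper: count lines in a log file that look like
# kernel out-of-memory events.
# $1 = path to a syslog-style file
# Prints the number of matching lines; like grep, the exit status
# is non-zero when nothing matches.
oom_lines() {
    grep -i -c -E 'out of memory|oom-killer' "$1"
}

# Usage (path is an assumption):
#   oom_lines /var/log/messages
```

A count of 0 does not rule out memory pressure: if the whole guest hangs, the kernel may never get the chance to write these lines.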

DC6171 · Post by **DC6171** » Thu Jun 25, 2015 10:21 am

[root@logserver01 elasticsearch]# cat /var/log/elasticsearch/*.log
[2015-06-25 09:17:52,342][INFO ][node ] [581ddc65-44cc-48af-88ce-290f486c5695] version[1.3.2], pid[1389], build[dee175d/2014-08-13T14:29:30Z]
[2015-06-25 09:17:52,343][INFO ][node ] [581ddc65-44cc-48af-88ce-290f486c5695] initializing ...
[2015-06-25 09:17:52,427][INFO ][plugins ] [581ddc65-44cc-48af-88ce-290f486c5695] loaded [knapsack-1.3.2.0-d5501ef], sites []
[2015-06-25 09:18:00,281][INFO ][node ] [581ddc65-44cc-48af-88ce-290f486c5695] initialized
[2015-06-25 09:18:00,282][INFO ][node ] [581ddc65-44cc-48af-88ce-290f486c5695] starting ...
[2015-06-25 09:18:00,531][INFO ][transport ] [581ddc65-44cc-48af-88ce-290f486c5695] bound_address {inet[/0:0:0:0:0:0:0:0:9300]}, publish_address {inet[/172.22.12.38:9300]}
[2015-06-25 09:18:00,541][INFO ][discovery ] [581ddc65-44cc-48af-88ce-290f486c5695] fb3b397f-4380-4031-a93b-fcbe65d50872/ed6RA74XQA-Y_IaEZkGaSw
[2015-06-25 09:18:03,651][WARN ][transport.netty ] [581ddc65-44cc-48af-88ce-290f486c5695] exception caught on transport layer [[id: 0x4f3046df]], closing connection
java.net.NoRouteToHostException: No route to host
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.connect(NioClientBoss.java:150)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.processSelectedKeys(NioClientBoss.java:105)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:79)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2015-06-25 09:18:05,119][INFO ][cluster.service ] [581ddc65-44cc-48af-88ce-290f486c5695] new_master [581ddc65-44cc-48af-88ce-290f486c5695][ed6RA74XQA-Y_IaEZkGaSw][logserver01.udp.com][inet[/172.22.12.38:9300]]{max_local_storage_nodes=1}, reason: zen-disco-join (elected_as_master)
[2015-06-25 09:18:05,201][INFO ][http ] [581ddc65-44cc-48af-88ce-290f486c5695] bound_address {inet[/127.0.0.1:9200]}, publish_address {inet[localhost/127.0.0.1:9200]}
[2015-06-25 09:18:05,202][INFO ][node ] [581ddc65-44cc-48af-88ce-290f486c5695] started
[2015-06-25 09:18:06,650][WARN ][transport.netty ] [581ddc65-44cc-48af-88ce-290f486c5695] exception caught on transport layer [[id: 0x69318183]], closing connection
java.net.NoRouteToHostException: No route to host
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.connect(NioClientBoss.java:150)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.processSelectedKeys(NioClientBoss.java:105)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:79)
at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
[2015-06-25 09:18:07,164][INFO ][gateway ] [581ddc65-44cc-48af-88ce-290f486c5695] recovered [11] indices into cluster_state
[2015-06-25 09:18:07,955][DEBUG][action.search.type ] [581ddc65-44cc-48af-88ce-290f486c5695] All shards failed for phase: [query_fetch]
[2015-06-25 09:18:08,000][DEBUG][action.search.type ] [581ddc65-44cc-48af-88ce-290f486c5695] All shards failed for phase: [query_fetch]
[2015-06-25 09:18:13,015][DEBUG][action.search.type ] [581ddc65-44cc-48af-88ce-290f486c5695] All shards failed for phase: [query_fetch]
[2015-06-25 09:18:13,031][DEBUG][action.search.type ] [581ddc65-44cc-48af-88ce-290f486c5695] All shards failed for phase: [query_fetch]
[2015-06-25 09:18:18,079][DEBUG][action.search.type ] [581ddc65-44cc-48af-88ce-290f486c5695] All shards failed for phase: [query_fetch]
[2015-06-25 09:18:18,083][DEBUG][action.search.type ] [581ddc65-44cc-48af-88ce-290f486c5695] All shards failed for phase: [query_fetch]
[root@logserver01 elasticsearch]#

[root@logserver01 elasticsearch]# grep -i 'out of memory' /var/log/messages
[root@logserver01 elasticsearch]#

Thank you for any info.

DC6171 · Post by **DC6171** » Thu Jun 25, 2015 10:22 am

[root@logserver01 elasticsearch]# cat /etc/sysconfig/elasticsearch
# Directory where the Elasticsearch binary distribution resides
APP_DIR="/usr/local/nagioslogserver"
ES_HOME="$APP_DIR/elasticsearch"

# Heap Size (defaults to 256m min, 1g max)
# Nagios Log Server Default to 0.5 physical Memory
ES_HEAP_SIZE=$(expr $(free -m|awk '/^Mem:/{print $2}') / 2 )m

# Heap new generation
#ES_HEAP_NEWSIZE=

# max direct memory
#ES_DIRECT_SIZE=

# Additional Java OPTS
#ES_JAVA_OPTS=

# Maximum number of open files
MAX_OPEN_FILES=65535

# Maximum amount of locked memory
MAX_LOCKED_MEMORY=unlimited

# Maximum number of VMA (Virtual Memory Areas) a process can own
MAX_MAP_COUNT=262144

# Elasticsearch log directory
LOG_DIR=/var/log/elasticsearch

# Elasticsearch data directory
DATA_DIR="$ES_HOME/data"

# Elasticsearch work directory
WORK_DIR="$APP_DIR/tmp/elasticsearch"

# Elasticsearch conf directory
CONF_DIR="$ES_HOME/config"

# Elasticsearch configuration file (elasticsearch.yml)
CONF_FILE="$ES_HOME/config/elasticsearch.yml"

# User to run as, change this to a specific elasticsearch user if possible
# Also make sure, this user can write into the log directories in case you change them
# This setting only works for the init script, but has to be configured separately for systemd startup
ES_USER=nagios
ES_GROUP=nagios

# Configure restart on package upgrade (true, every other setting will lead to not restarting)
#RESTART_ON_UPGRADE=true

if [ "x$1" == "xstart" -o "x$1" == "xrestart" -o "x$1" == "xreload" -o "x$1" == "xforce-reload" ];then
    GET_ES_CONFIG_MESSAGE="$( php $APP_DIR/scripts/get_es_config.php )"
    GET_ES_CONFIG_RETURN=$?

    if [ "$GET_ES_CONFIG_RETURN" != "0" ]; then
        echo $GET_ES_CONFIG_MESSAGE
        exit 1
    else
        ES_JAVA_OPTS="$GET_ES_CONFIG_MESSAGE"
    fi
fi
[root@logserver01 elasticsearch]#

jolson · Post by **jolson** » Thu Jun 25, 2015 10:35 am

How much memory is in your server? I'd expect you to need between 8 and 16 GB of memory to handle a load of 2.8GB of logs per day.

When your server 'crashes', what are the symptoms?

Your elasticsearch log looks quite normal (or at least I don't see any obvious indication of a crash). Let's also see your logstash log:

Code:

cat /var/log/logstash/logstash.log

DC6171 · Post by **DC6171** » Thu Jun 25, 2015 10:44 am

The Log Server guest is assigned 6GB of RAM. We can assign more if needed; we just haven't seen any indication that it's necessary.

As for symptoms, the server console is stuck at the screen shown in the earlier attachment, and the Log Server guest is otherwise unresponsive.

Log result is:
[root@logserver01 elasticsearch]# cat /var/log/logstash/logstash.log
[root@logserver01 elasticsearch]#

jolson · Post by **jolson** » Thu Jun 25, 2015 10:57 am

I think it's a good idea to increase the amount of RAM allocated to the box as a test. Are you able to give the box 16GB of RAM? If so, please do so.

Once the RAM has been allocated and the box has been restarted, wait to see whether it crashes again. If it does, immediately collect the logs listed below:

Code:

cat /var/log/logstash/logstash.log
cat /var/log/elasticsearch/*.log

In the meantime, I'd like to take a look at some of your rotated logs. Please create some .tar.gz archives and send the resulting files to me:

Code:

tar zcf /tmp/elasticsearch.tar.gz /var/log/elasticsearch/*
tar zcf /tmp/logstash.tar.gz /var/log/logstash/*
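
Since the logs need to be collected right after each crash, it can help to give every archive a timestamp so that repeated collections do not overwrite each other. A minimal sketch, assuming GNU `tar` and `date` are available; `archive_logs` is a hypothetical helper, not a Nagios Log Server tool.

```shell
#!/bin/sh
# Hypothetical helper: archive one log directory into a dated tarball.
# $1 = log directory to archive, $2 = directory to write the archive to
# Prints the path of the archive on success.
archive_logs() {
    src=$1
    out="$2/$(basename "$src")-$(date +%Y%m%d-%H%M%S).tar.gz"
    # -C keeps paths inside the archive relative to the parent directory
    tar -czf "$out" -C "$(dirname "$src")" "$(basename "$src")" && echo "$out"
}

# Usage (paths taken from the commands above):
#   archive_logs /var/log/elasticsearch /tmp
#   archive_logs /var/log/logstash /tmp
```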

DC6171 · Post by **DC6171** » Thu Jun 25, 2015 11:09 am

I bumped the memory from 6GB to 16GB to see if it makes a difference. Requested logs updated. I will report back after a couple of days of running or after the next event. Thank you.
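
For context, the `ES_HEAP_SIZE` line in the `/etc/sysconfig/elasticsearch` file posted earlier sets the Elasticsearch heap to half of the memory reported by `free -m`, so the heap grows along with the guest's RAM. A minimal sketch of that arithmetic, assuming nominal totals of 6144 MB and 16384 MB (the total that `free -m` actually reports is typically a little lower than the assigned RAM); `heap_for_mb` is a hypothetical helper used only for illustration.

```shell
#!/bin/sh
# Hypothetical helper: mirror the "half of physical memory" rule
# from the ES_HEAP_SIZE line in /etc/sysconfig/elasticsearch.
# $1 = total memory in MB, as the Mem: line of `free -m` would report it
heap_for_mb() {
    echo "$(( $1 / 2 ))m"
}

heap_for_mb 6144    # prints 3072m
heap_for_mb 16384   # prints 8192m
```

Under these assumptions, moving from 6GB to 16GB raises the heap from about 3GB to about 8GB without any change to the configuration file.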

jolson · Post by **jolson** » Thu Jun 25, 2015 12:07 pm

Sounds good - let us know what you find out. Thanks!

DC6171 · Post by **DC6171** » Mon Jun 29, 2015 1:32 pm

It looks like insufficient memory was to blame. Since allocating more memory, we have not had a recurrence, whereas previously the appliance was failing nightly. Thank you for the help!

Nagios Support Forum
