Logstash crashing continuously

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
Locked
User avatar
WillemDH
Posts: 2320
Joined: Wed Mar 20, 2013 5:49 am
Location: Ghent
Contact:

Logstash crashing continuously

Post by WillemDH »

Hello,

The Logstash service on our NLS nodes is crashing continuously. I saw these errors go by:

Code: Select all

Oct 29, 2015 10:48:00 AM org.elasticsearch.transport.netty.NettyInternalESLogger warn
WARNING: Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.util.Collections$UnmodifiableCollection.iterator(Collections.java:1063)
        at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.processConnectTimeout(NioClientBoss.java:118)
        at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.process(NioClientBoss.java:83)
        at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:337)
        at org.elasticsearch.common.netty.channel.socket.nio.NioClientBoss.run(NioClientBoss.java:42)
        at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
        at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Error: Your application used more memory than the safety cap of 500M.
Specify -J-Xmx####m to increase it (#### = cap size in MB).
Specify -w for full OutOfMemoryError stack trace
I'm seeing the above message appear in the CLI. The nodes had 8 GB of RAM; I added 2 more GB. Do I need to increase the cap size?

Code: Select all

Specify -J-Xmx####m to increase it (#### = cap size in MB).
I tried:

Code: Select all

grep -i 'out of memory' /var/log/messages
But it returned no results. It seems my Logstash service keeps crashing about 30 seconds after the service is started again (only on one node).
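One hedged extra check beyond /var/log/messages: the kernel OOM killer is a different failure mode from the JVM's own "GC overhead limit exceeded", and it logs to the kernel ring buffer, so it may be worth counting OOM messages there too:

```shell
# Count kernel-level OOM messages in both the ring buffer and syslog.
# A JVM "GC overhead limit exceeded" comes from the JVM itself, so zero
# hits here just rules out the kernel OOM killer as the cause.
kernel_oom=$( (dmesg 2>/dev/null; cat /var/log/messages 2>/dev/null) \
              | grep -ciE 'killed process|out of memory' )
echo "kernel-level OOM messages found: ${kernel_oom:-0}"
```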

JVM settings:

Code: Select all

curl -XGET localhost:9200/_nodes/jvm?pretty
{
  "cluster_name" : "ee9e60a0-f4cb-41ec-a97f-8f17434b748e",
  "nodes" : {
    "JHyDfTFIT82MUiBkxpmGmw" : {
      "name" : "95f9ab14-da22-4144-bb0b-6bbc5662115c",
      "transport_address" : "inet[/10.54.24.141:9300]",
      "host" : "nls02",
      "ip" : "127.0.0.1",
      "version" : "1.6.0",
      "build" : "cdd3ac4",
      "http_address" : "inet[localhost/127.0.0.1:9200]",
      "attributes" : {
        "max_local_storage_nodes" : "1"
      },
      "jvm" : {
        "pid" : 32338,
        "version" : "1.7.0_85",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "24.85-b03",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1446106732260,
        "mem" : {
          "heap_init_in_bytes" : 4294967296,
          "heap_max_in_bytes" : 4260102144,
          "non_heap_init_in_bytes" : 24313856,
          "non_heap_max_in_bytes" : 224395264,
          "direct_max_in_bytes" : 4260102144
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ]
      }
    },
    "MpQpx1GcTOmWEoILA4sn7w" : {
      "name" : "c4d16075-9bc2-4095-9f00-e7de7f96930c",
      "transport_address" : "inet[/10.54.24.140:9300]",
      "host" : "nls01",
      "ip" : "127.0.0.1",
      "version" : "1.6.0",
      "build" : "cdd3ac4",
      "http_address" : "inet[localhost/127.0.0.1:9200]",
      "attributes" : {
        "max_local_storage_nodes" : "1"
      },
      "jvm" : {
        "pid" : 1587,
        "version" : "1.7.0_85",
        "vm_name" : "OpenJDK 64-Bit Server VM",
        "vm_version" : "24.85-b03",
        "vm_vendor" : "Oracle Corporation",
        "start_time_in_millis" : 1446112726670,
        "mem" : {
          "heap_init_in_bytes" : 4294967296,
          "heap_max_in_bytes" : 4242669568,
          "non_heap_init_in_bytes" : 24313856,
          "non_heap_max_in_bytes" : 224395264,
          "direct_max_in_bytes" : 4242669568
        },
        "gc_collectors" : [ "ParNew", "ConcurrentMarkSweep" ],
        "memory_pools" : [ "Code Cache", "Par Eden Space", "Par Survivor Space", "CMS Old Gen", "CMS Perm Gen" ]
      }
    }
  }
}

Code: Select all

free -m
             total       used       free     shared    buffers     cached
Mem:         10014       6530       3483          0         41       1597
-/+ buffers/cache:       4892       5122
Swap:         1999          0       1999
Logstash config:

Code: Select all

cat /etc/sysconfig/logstash
###############################
# Default settings for logstash
###############################

# Override Java location
#JAVACMD=/usr/bin/java

# Set a home directory
APP_DIR=/usr/local/nagioslogserver
LS_HOME="$APP_DIR/logstash"

# set ES_CLUSTER
ES_CLUSTER=$(cat $APP_DIR/var/cluster_uuid)

# Arguments to pass to java
#LS_HEAP_SIZE="256m"
LS_JAVA_OPTS="-Djava.io.tmpdir=$APP_DIR/tmp"

# Logstash filter worker threads
#LS_WORKER_THREADS=1

# pidfiles aren't used for upstart; this is for sysv users.
#LS_PIDFILE=/var/run/logstash.pid

# user id to be invoked as; for upstart: edit /etc/init/logstash.conf
LS_USER=root
LS_GROUP=nagios

# logstash logging
#LS_LOG_FILE=/var/log/logstash/logstash.log
#LS_USE_GC_LOGGING="true"

# logstash configuration directory
LS_CONF_DIR="$LS_HOME/etc/conf.d"

# Open file limit; cannot be overridden in upstart
#LS_OPEN_FILES=2048

# Nice level
#LS_NICE=0

# Increase Filter workers to 4 threads
LS_OPTS=" -w 4"

if [ "x$1" == "xstart" -o "x$1" == "xrestart" -o "x$1" == "xreload" ];then
        GET_LOGSTASH_CONFIG_MESSAGE=$( php /usr/local/nagioslogserver/scripts/get_logstash_config.php )
        GET_LOGSTASH_CONFIG_RETURN=$?
        if [ "$GET_LOGSTASH_CONFIG_RETURN" != "0" ]; then
                echo $GET_LOGSTASH_CONFIG_MESSAGE
                exit 1
        fi
fi
Elasticsearch config:

Code: Select all

cat /etc/sysconfig/elasticsearch
# Directory where the Elasticsearch binary distribution resides
APP_DIR="/usr/local/nagioslogserver"
ES_HOME="$APP_DIR/elasticsearch"

# Heap Size (defaults to 256m min, 1g max)
ES_HEAP_SIZE=4g

# Heap new generation
#ES_HEAP_NEWSIZE=

# max direct memory
#ES_DIRECT_SIZE=

# Additional Java OPTS
#ES_JAVA_OPTS=

# Maximum number of open files
MAX_OPEN_FILES=65535

# Maximum amount of locked memory
MAX_LOCKED_MEMORY=unlimited

# Maximum number of VMA (Virtual Memory Areas) a process can own
MAX_MAP_COUNT=262144

# Elasticsearch log directory
LOG_DIR=/var/log/elasticsearch

# Elasticsearch data directory
DATA_DIR="$ES_HOME/data,/mnt/data"

# Elasticsearch work directory
WORK_DIR="$APP_DIR/tmp/elasticsearch"

# Elasticsearch conf directory
CONF_DIR="$ES_HOME/config"

# Elasticsearch configuration file (elasticsearch.yml)
CONF_FILE="$ES_HOME/config/elasticsearch.yml"

# User to run as, change this to a specific elasticsearch user if possible
# Also make sure, this user can write into the log directories in case you change them
# This setting only works for the init script, but has to be configured separately for systemd startup
ES_USER=nagios
ES_GROUP=nagios

# Configure restart on package upgrade (true, every other setting will lead to not restarting)
#RESTART_ON_UPGRADE=true

if [ "x$1" == "xstart" -o "x$1" == "xrestart" -o "x$1" == "xreload" -o "x$1" == "xforce-reload" ];then
        GET_ES_CONFIG_MESSAGE="$( php $APP_DIR/scripts/get_es_config.php )"
        GET_ES_CONFIG_RETURN=$?

        if [ "$GET_ES_CONFIG_RETURN" != "0" ]; then
                echo $GET_ES_CONFIG_MESSAGE
                exit 1
        else
                ES_JAVA_OPTS="$GET_ES_CONFIG_MESSAGE"
        fi
fi
outputs:

Code: Select all

cat /usr/local/nagioslogserver/logstash/etc/conf.d/999_outputs.conf
#
# Logstash Configuration File
# Dynamically created by Nagios Log Server
#
# DO NOT EDIT THIS FILE. IT WILL BE OVERWRITTEN.
#
# Created Sat, 17 Oct 2015 17:16:13 +0200
#

#
# Required output for Nagios Log Server
#

output {
    elasticsearch {
        cluster => 'ee9e60a0-f4cb-41ec-a97f-8f17434b748e'
        host => 'localhost'
        document_type => '%{type}'
        node_name => 'c4d16075-9bc2-4095-9f00-e7de7f96930c'
        protocol => 'transport'
        workers => 4
    }
}

#
# Global outputs
#



#
# Local outputs
#

elasticsearch.yml
[root@srvnaglog01 ~]# cat /usr/local/nagioslogserver/elasticsearch/config/elasticsearch.yml
##################### Elasticsearch Configuration Example #####################

# This file contains an overview of various configuration settings,
# targeted at operations staff. Application developers should
# consult the guide at <http://elasticsearch.org/guide>.
#
# The installation procedure is covered at
# <http://elasticsearch.org/guide/en/elasticsearch/reference/current/setup.html>.
#
# Elasticsearch comes with reasonable defaults for most settings,
# so you can try it out without bothering with configuration.
#
# Most of the time, these defaults are just fine for running a production
# cluster. If you're fine-tuning your cluster, or wondering about the
# effect of certain configuration option, please _do ask_ on the
# mailing list or IRC channel [http://elasticsearch.org/community].

# Any element in the configuration can be replaced with environment variables
# by placing them in ${...} notation. For example:
#
# node.rack: ${RACK_ENV_VAR}

# For information on supported formats and syntax for the config file, see
# <http://elasticsearch.org/guide/en/elasticsearch/reference/current/setup-configuration.html>


################################### Cluster ###################################

# Cluster name identifies your cluster for auto-discovery. If you're running
# multiple clusters on the same network, make sure you're using unique names.
#
cluster.name: nagios_elasticsearch


#################################### Node #####################################

# Node names are generated dynamically on startup, so you're relieved
# from configuring them manually. You can tie this node to a specific name:
#
# node.name: "Franz Kafka"

# Every node can be configured to allow or deny being eligible as the master,
# and to allow or deny to store the data.
#
# Allow this node to be eligible as a master node (enabled by default):
#
# node.master: true
#
# Allow this node to store data (enabled by default):
#
# node.data: true

# You can exploit these settings to design advanced cluster topologies.
#
# 1. You want this node to never become a master node, only to hold data.
#    This will be the "workhorse" of your cluster.
#
# node.master: false
# node.data: true
#
# 2. You want this node to only serve as a master: to not store any data and
#    to have free resources. This will be the "coordinator" of your cluster.
#
# node.master: true
# node.data: false
#
# 3. You want this node to be neither master nor data node, but
#    to act as a "search load balancer" (fetching data from nodes,
#    aggregating results, etc.)
#
# node.master: false
# node.data: false

# Use the Cluster Health API [http://localhost:9200/_cluster/health], the
# Node Info API [http://localhost:9200/_nodes] or GUI tools
# such as <http://www.elasticsearch.org/overview/marvel/>,
# <http://github.com/karmi/elasticsearch-paramedic>,
# <http://github.com/lukas-vlcek/bigdesk> and
# <http://mobz.github.com/elasticsearch-head> to inspect the cluster state.

# A node can have generic attributes associated with it, which can later be used
# for customized shard allocation filtering, or allocation awareness. An attribute
# is a simple key value pair, similar to node.key: value, here is an example:
#
# node.rack: rack314

# By default, multiple nodes are allowed to start from the same installation location
# to disable it, set the following:
node.max_local_storage_nodes: 1


#################################### Index ####################################

# You can set a number of options (such as shard/replica options, mapping
# or analyzer definitions, translog settings, ...) for indices globally,
# in this file.
#
# Note, that it makes more sense to configure index settings specifically for
# a certain index, either when creating it or by using the index templates API.
#
# See <http://elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules.html> and
# <http://elasticsearch.org/guide/en/elasticsearch/reference/current/indices-create-index.html>
# for more information.

# Set the number of shards (splits) of an index (5 by default):
#
# index.number_of_shards: 5

# Set the number of replicas (additional copies) of an index (1 by default):
#
# index.number_of_replicas: 1

# Note, that for development on a local machine, with small indices, it usually
# makes sense to "disable" the distributed features:
#
# index.number_of_shards: 1
# index.number_of_replicas: 0

# These settings directly affect the performance of index and search operations
# in your cluster. Assuming you have enough machines to hold shards and
# replicas, the rule of thumb is:
#
# 1. Having more *shards* enhances the _indexing_ performance and allows to
#    _distribute_ a big index across machines.
# 2. Having more *replicas* enhances the _search_ performance and improves the
#    cluster _availability_.
#
# The "number_of_shards" is a one-time setting for an index.
#
# The "number_of_replicas" can be increased or decreased anytime,
# by using the Index Update Settings API.
#
# Elasticsearch takes care about load balancing, relocating, gathering the
# results from nodes, etc. Experiment with different settings to fine-tune
# your setup.

# Use the Index Status API (<http://localhost:9200/A/_status>) to inspect
# the index status.


#################################### Paths ####################################

# Path to directory containing configuration (this file and logging.yml):
#
# path.conf: /path/to/conf

# Path to directory where to store index data allocated for this node.
#
# path.data: /path/to/data
#
# Can optionally include more than one location, causing data to be striped across
# the locations (a la RAID 0) on a file level, favouring locations with most free
# space on creation. For example:
#
# path.data: /path/to/data1,/path/to/data2

# Path to temporary files:
#
# path.work: /path/to/work

# Path to log files:
#
# path.logs: /path/to/logs

# Path to where plugins are installed:
#
# path.plugins: /path/to/plugins


#################################### Plugin ###################################

# If a plugin listed here is not installed for current node, the node will not start.
#
# plugin.mandatory: mapper-attachments,lang-groovy


################################### Memory ####################################

# Elasticsearch performs poorly when JVM starts swapping: you should ensure that
# it _never_ swaps.
#
# Set this property to true to lock the memory:
#
bootstrap.mlockall: true

# Make sure that the ES_MIN_MEM and ES_MAX_MEM environment variables are set
# to the same value, and that the machine has enough memory to allocate
# for Elasticsearch, leaving enough memory for the operating system itself.
#
# You should also make sure that the Elasticsearch process is allowed to lock
# the memory, eg. by using `ulimit -l unlimited`.


############################## Network And HTTP ###############################

# Elasticsearch, by default, binds itself to the 0.0.0.0 address, and listens
# on port [9200-9300] for HTTP traffic and on port [9300-9400] for node-to-node
# communication. (the range means that if the port is busy, it will automatically
# try the next port).

# Set the bind address specifically (IPv4 or IPv6):
#
# network.bind_host: 192.168.0.1

# Set the address other nodes will use to communicate with this node. If not
# set, it is automatically derived. It must point to an actual IP address.
#
# network.publish_host: 192.168.0.1

# Set both 'bind_host' and 'publish_host':
#
# network.host: 192.168.0.1

# Set a custom port for the node to node communication (9300 by default):
#
# transport.tcp.port: 9300

# Enable compression for all communication between nodes (disabled by default):
#
transport.tcp.compress: true

# Set a custom port to listen for HTTP traffic:
#
# http.port: 9200

# Set a custom allowed content length:
#
# http.max_content_length: 100mb

# Disable HTTP completely:
#
# http.enabled: false

# Set the HTTP host to listen to
#
http.host: "localhost"

################################### Gateway ###################################

# The gateway allows for persisting the cluster state between full cluster
# restarts. Every change to the state (such as adding an index) will be stored
# in the gateway, and when the cluster starts up for the first time,
# it will read its state from the gateway.

# There are several types of gateway implementations. For more information, see
# <http://elasticsearch.org/guide/en/elasticsearch/reference/current/modules-gateway.html>.

# The default gateway type is the "local" gateway (recommended):
#
# gateway.type: local

# Settings below control how and when to start the initial recovery process on
# a full cluster restart (to reuse as much local data as possible when using shared
# gateway).

# Allow recovery process after N nodes in a cluster are up:
#
# gateway.recover_after_nodes: 1

# Set the timeout to initiate the recovery process, once the N nodes
# from previous setting are up (accepts time value):
#
# gateway.recover_after_time: 5m

# Set how many nodes are expected in this cluster. Once these N nodes
# are up (and recover_after_nodes is met), begin recovery process immediately
# (without waiting for recover_after_time to expire):
#
# gateway.expected_nodes: 2


############################# Recovery Throttling #############################

# These settings allow to control the process of shards allocation between
# nodes during initial recovery, replica allocation, rebalancing,
# or when adding and removing nodes.

# Set the number of concurrent recoveries happening on a node:
#
# 1. During the initial recovery
#
# cluster.routing.allocation.node_initial_primaries_recoveries: 4
#
# 2. During adding/removing nodes, rebalancing, etc
#
# cluster.routing.allocation.node_concurrent_recoveries: 2

# Set to throttle throughput when recovering (eg. 100mb, by default 20mb):
#
# indices.recovery.max_bytes_per_sec: 20mb

# Set to limit the number of open concurrent streams when
# recovering a shard from a peer:
#
# indices.recovery.concurrent_streams: 5


################################## Discovery ##################################

# Discovery infrastructure ensures nodes can be found within a cluster
# and master node is elected. Multicast discovery is the default.

# Set to ensure a node sees N other master eligible nodes to be considered
# operational within the cluster. Its recommended to set it to a higher value
# than 1 when running more than 2 nodes in the cluster.
#
# discovery.zen.minimum_master_nodes: 1

# Set the time to wait for ping responses from other nodes when discovering.
# Set this option to a higher value on a slow or congested network
# to minimize discovery failures:
#
# discovery.zen.ping.timeout: 3s

# For more information, see
# <http://elasticsearch.org/guide/en/elasticsearch/reference/current/modules-discovery-zen.html>

# Unicast discovery allows to explicitly control which nodes will be used
# to discover the cluster. It can be used when multicast is not present,
# or to restrict the cluster communication-wise.
#
# 1. Disable multicast discovery (enabled by default):
#
discovery.zen.ping.multicast.enabled: false
#
# 2. Configure an initial list of master nodes in the cluster
#    to perform discovery when new nodes (master or data) are started:
#
discovery.zen.ping.unicast.hosts: ["localhost"]

# EC2 discovery allows to use AWS EC2 API in order to perform discovery.
#
# You have to install the cloud-aws plugin for enabling the EC2 discovery.
#
# For more information, see
# <http://elasticsearch.org/guide/en/elasticsearch/reference/current/modules-discovery-ec2.html>
#
# See <http://elasticsearch.org/tutorials/elasticsearch-on-ec2/>
# for a step-by-step tutorial.

# GCE discovery allows to use Google Compute Engine API in order to perform discovery.
#
# You have to install the cloud-gce plugin for enabling the GCE discovery.
#
# For more information, see <https://github.com/elasticsearch/elasticsearch-cloud-gce>.

# Azure discovery allows to use Azure API in order to perform discovery.
#
# You have to install the cloud-azure plugin for enabling the Azure discovery.
#
# For more information, see <https://github.com/elasticsearch/elasticsearch-cloud-azure>.

################################## Slow Log ##################################

# Shard level query and fetch threshold logging.

#index.search.slowlog.threshold.query.warn: 10s
#index.search.slowlog.threshold.query.info: 5s
#index.search.slowlog.threshold.query.debug: 2s
#index.search.slowlog.threshold.query.trace: 500ms

#index.search.slowlog.threshold.fetch.warn: 1s
#index.search.slowlog.threshold.fetch.info: 800ms
#index.search.slowlog.threshold.fetch.debug: 500ms
#index.search.slowlog.threshold.fetch.trace: 200ms

#index.indexing.slowlog.threshold.index.warn: 10s
#index.indexing.slowlog.threshold.index.info: 5s
#index.indexing.slowlog.threshold.index.debug: 2s
#index.indexing.slowlog.threshold.index.trace: 500ms

################################## GC Logging ################################

#monitor.jvm.gc.young.warn: 1000ms
#monitor.jvm.gc.young.info: 700ms
#monitor.jvm.gc.young.debug: 400ms

#monitor.jvm.gc.old.warn: 10s
#monitor.jvm.gc.old.info: 5s
#monitor.jvm.gc.old.debug: 2s
I hope you have enough information to help pinpoint the problem.

EDIT1: It seems that about 10 seconds after I start Logstash, I get

Code: Select all

service logstash status
Logstash Daemon (pid  8892) is running...
[root@srvnaglog01 ~]# Error: Your application used more memory than the safety cap of 500M.
Specify -J-Xmx####m to increase it (#### = cap size in MB).
Specify -w for full OutOfMemoryError stack trace
service logstash status
Logstash Daemon dead but pid file exists
[root@srvnaglog01 ~]# service logstash start
Starting Logstash Daemon: WARNING: Default JAVA_OPTS will be overridden by the JAVA_OPTS defined in the environment. Environment JAVA_OPTS are -Djava.io.tmpdir=/usr/local/nagioslogserver/tmp
                                                           [  OK  ]
[root@srvnaglog01 ~]# Oct 29, 2015 11:31:03 AM org.elasticsearch.plugins.PluginsService <init>
INFO: [c4d16075-9bc2-4095-9f00-e7de7f96930c] loaded [], sites []
Oct 29, 2015 11:31:04 AM org.elasticsearch.plugins.PluginsService <init>
INFO: [c4d16075-9bc2-4095-9f00-e7de7f96930c] loaded [], sites []
Oct 29, 2015 11:31:04 AM org.elasticsearch.plugins.PluginsService <init>
INFO: [c4d16075-9bc2-4095-9f00-e7de7f96930c] loaded [], sites []
Oct 29, 2015 11:31:04 AM org.elasticsearch.plugins.PluginsService <init>
INFO: [c4d16075-9bc2-4095-9f00-e7de7f96930c] loaded [], sites []
Oct 29, 2015 11:31:04 AM org.elasticsearch.plugins.PluginsService <init>
INFO: [c4d16075-9bc2-4095-9f00-e7de7f96930c] loaded [], sites []
service logstash status
Logstash Daemon (pid  9767) is running...
[root@srvnaglog01 ~]# service logstash status
Logstash Daemon (pid  9767) is running...
[root@srvnaglog01 ~]# service logstash status
Logstash Daemon (pid  9767) is running...
[root@srvnaglog01 ~]# service logstash status
Logstash Daemon (pid  9767) is running...
[root@srvnaglog01 ~]# service logstash status
Logstash Daemon (pid  9767) is running...
[root@srvnaglog01 ~]# service logstash status
Logstash Daemon (pid  9767) is running...
[root@srvnaglog01 ~]# service logstash status
Logstash Daemon (pid  9767) is running...
[root@srvnaglog01 ~]# service logstash status
Logstash Daemon (pid  9767) is running...
[root@srvnaglog01 ~]# service logstash status
Logstash Daemon (pid  9767) is running...
[root@srvnaglog01 ~]# service logstash status
Logstash Daemon (pid  9767) is running...
[root@srvnaglog01 ~]# service logstash status
Logstash Daemon (pid  9767) is running...
[root@srvnaglog01 ~]# service logstash status
Logstash Daemon (pid  9767) is running...
[root@srvnaglog01 ~]# service logstash statusOct 29, 2015 11:32:10 AM org.elasticsearch.transport.netty.NettyInternalESLogger warn
WARNING: Unexpected exception in the selector loop.
java.lang.OutOfMemoryError: GC overhead limit exceeded

                                                                                                                                                               logstash start^Cstart
[root@srvnaglog01 ~]# service logstash status
Logstash Daemon dead but pid file exists
EDIT2:
I edited /etc/sysconfig/logstash, uncommenting LS_HEAP_SIZE and setting it to 1024m:

Code: Select all

# Arguments to pass to java
LS_HEAP_SIZE="1024m"
LS_JAVA_OPTS="-Djava.io.tmpdir=$APP_DIR/tmp"
EDIT3:
The Logstash service seems more stable since I changed LS_HEAP_SIZE. I also changed the Elasticsearch heap size in /etc/sysconfig/elasticsearch from 4g to 6g, per https://support.nagios.com/forum/viewto ... h&start=10:

Code: Select all

ES_HEAP_SIZE=6g
What would be the Nagios recommendation for these HEAP_SIZE settings for Logstash and Elasticsearch? I set it to 5, as Elastic seems to recommend giving half your memory to Lucene.

EDIT4: It seems I'm still having issues. The web GUI is super slow or unresponsive, and logs are not coming in.

EDIT5: It seems Logstash is no longer stopping after a few restarts. I guess Logstash needed to catch up.

Grtz
Last edited by WillemDH on Sun Nov 01, 2015 6:08 am, edited 1 time in total.
Nagios XI 5.8.1
https://outsideit.net
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Logstash crashing continuously

Post by jolson »

What would be the Nagios recommendation for these HEAP_SIZE settings for Logstash and Elasticsearch? I set it to 5, as Elastic seems to recommend giving half your memory to Lucene.
Typically the Logstash heap won't need to be higher than 512MB; it's a very sane default. The Elasticsearch heap should be set to about 50% of your total memory.
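To make that rule of thumb concrete, here is a hedged sketch (not a Nagios-supplied tool) that computes roughly 50% of total RAM from /proc/meminfo and prints the values you would put in /etc/sysconfig/elasticsearch and /etc/sysconfig/logstash:

```shell
# Sizing sketch only: prints suggested heap values rather than
# editing the sysconfig files itself.
total_mb=$(awk '/MemTotal/ {print int($2/1024)}' /proc/meminfo)
half_mb=$((total_mb / 2))
echo "suggested ES_HEAP_SIZE=${half_mb}m  (~50% of ${total_mb} MB total RAM)"
echo "suggested LS_HEAP_SIZE=512m        (Logstash default is usually enough)"
```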

The reasons that I most typically see Logstash crash are:
1. A memory leak in one of the Logstash plugins
2. Logstash can't process data fast enough, and backs up to the point of crashing (this also has to do with ES performance).
3. High latency between instances.
uncommented LS_HEAP_SIZE and set it to 1024m
In certain situations I've seen this work appropriately - if there is a memory leak, it's more of a band-aid than a solution since the memory leak will eventually stop the process from operating.

I noticed the following:

Code: Select all

LS_USER=root
I imagine you're listening on a privileged port? If so, there's one more step that needs to be taken:

Code: Select all

echo -e "\nsetcap 'cap_net_bind_service=+ep' \$(readlink -f \$(which java))" >> /etc/sysconfig/logstash
Now restart Logstash:

Code: Select all

service logstash restart
Does the above help at all? Without that last line, Logstash may encounter problems.
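If it helps to confirm the capability took effect after the restart, getcap (from libcap, assumed installed) on the resolved java binary should list cap_net_bind_service:

```shell
# Resolve the real java binary behind any /etc/alternatives symlinks and
# show its file capabilities; expect "cap_net_bind_service+ep" after setcap.
if java_path=$(command -v java); then
    getcap "$(readlink -f "$java_path")" \
        || echo "no capabilities set (or getcap unavailable)"
else
    echo "java not found on PATH"
fi
```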
Twits Blog
Show me a man who lives alone and has a perpetually clean kitchen, and 8 times out of 9 I'll show you a man with detestable spiritual qualities.
User avatar
WillemDH
Posts: 2320
Joined: Wed Mar 20, 2013 5:49 am
Location: Ghent
Contact:

Re: Logstash crashing continuously

Post by WillemDH »

Jesse,

Yes, our NLS nodes are indeed listening on a privileged port:

Code: Select all

syslog {
    type => 'syslog-esx'
    port => 514
}
This is necessary because of a bug with syslog on ESXi hosts sending syslog over TCP: when the Logstash service is restarted, the ESXi hosts stop sending until their syslog service is restarted. The workaround is to use UDP for syslog on ESXi hosts.
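For anyone else hitting the same ESXi TCP-syslog hang, the workaround as a Logstash input block might look like this (a sketch only; NLS normally generates inputs from the GUI, and the type label is just carried over from the syslog example above):

```text
input {
    # Receive ESXi syslog over UDP so a Logstash restart does not leave
    # the ESXi hosts stuck on a dead TCP connection.
    udp {
        type => 'syslog-esx'
        port => 514
    }
}
```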

Added

Code: Select all

setcap 'cap_net_bind_service=+ep' $(readlink -f $(which java))
to the Logstash config file. The documentation is a bit confusing:
the second option will preserve logstash running as the nagios user, however it should be pointed out that this method may be less
secure in some environments as it will allow any java process to listen on privileged ports. To use this method, run the following
commands
I thought that by 'the second option' Nagios meant we had to use either the first or the second option. Maybe it's better to use 'the second step'? The Logstash servers have been stable since I set the Logstash heap size to 1024m, though, so I will leave it like that.

Just wanted to mention one more thing I've been noticing. After loading some dashboards and running some queries, the CPU utilisation, which is usually between 5 and 15%, climbs to around 30-35%. That is normal while queries are running, but sometimes it never drops back to the usual level, even after logging out of all GUIs. It seems some Java processes are not being closed properly or never calm down; only when the elasticsearch service is restarted does NLS return to its normal CPU utilisation.
Did you ever see this behaviour? Check out the Nagios_LS_CPU graph, which demonstrates the CPU utilisation problem on my two nodes. You can see the three higher blocks of CPU usage starting after some queries in the GUI; it doesn't revert to the 'normal' situation until I restart the elasticsearch service.

The Nagios_LS_CPU2 screenshot clearly shows the CPU (blue) going down a lot, and memory (green) a little, after the restart of the elasticsearch service. This is something that really needs to be addressed, as having to restart the service every time we check a dashboard is a real waste of time and resources. I've been seeing this problem since the installation of NLS.
Nagios XI 5.8.1
https://outsideit.net
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Logstash crashing continuously

Post by jolson »

Just to clarify: after making a query, your CPU usage _never_ goes down? If so, that's certainly a problem, and one I can't claim to have seen before.
Maybe it's better to use 'the second step'?
Agreed, I will make a note to change our documentation appropriately.

I recommend upgrading your cluster to get on the latest version of Elasticsearch that we've released - is there any chance you could do so?
Twits Blog
Show me a man who lives alone and has a perpetually clean kitchen, and 8 times out of 9 I'll show you a man with detestable spiritual qualities.
User avatar
WillemDH
Posts: 2320
Joined: Wed Mar 20, 2013 5:49 am
Location: Ghent
Contact:

Re: Logstash crashing continuously

Post by WillemDH »

Jesse,

My NLS nodes are already at the latest version. Sorry, somehow the public IP of my enterprise network is considered spam, so I can only update my signature from home. (It's impossible to get it off the spam lists with 6k users.)
Just to show you and prove my point (about the CPU usage not going down), check out this screenshot, where you can clearly see I restarted the elasticsearch service on Friday and two more times in the course of October. It would be nice if this problem were solved. I'm pretty sure I can reproduce it: just load a few dashboards and make some queries. The question is of course which query or dashlet exactly is causing it... Any tips for troubleshooting this? I started monitoring the Java process with http://outsideit.net/check-lin-process/, which is already showing some interesting results. I'll post a screenshot later when the issue reappears.
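One hedged troubleshooting idea for the stuck CPU: the hot threads API in ES 1.x reports which threads on a node are burning CPU, so sampling it while the usage is pinned might name the query, merge, or GC activity that never calms down:

```shell
# Snapshot the five busiest threads on the local node; repeat while the
# CPU is stuck high and compare the stack traces across samples.
curl -s --max-time 5 'http://localhost:9200/_nodes/hot_threads?threads=5' \
    || echo "elasticsearch not reachable on localhost:9200"
```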

I looked around in the elasticsearch logs. Apart from this one, there seem to be no particular problems. Any idea what's causing this one to throw a WriteFailureException, by the way? (It's just a Windows server eventlog.) Invalid format: "3:08:00" is malformed at ":08:00" => What can I do to make NLS accept my date formats for all my Windows eventlogs?

Code: Select all

[2015-10-30 03:08:03,084][DEBUG][action.bulk              ] [c4d16075-9bc2-4095-9f00-e7de7f96930c] [logstash-2015.10.30][2] failed to execute bulk item (index) index {[logstash-2015.10.30][eventlog][AVC2gHP8LOr2EbcHX-sQ], source[{"PreviousDate":"30/10/2015","PreviousTime":"3:07:54","NewDate":"30/10/2015","NewTime":"3:08:00","message":"The system time was changed.\r\n\r\nSubject:\r\n\tSecurity ID:\t\tS-1-5-18\r\n\tAccount Name:\t\tserver$\r\n\tAccount Domain:\t\tGENTGRP\r\n\tLogon ID:\t\t0x3e7\r\n\r\nProcess Information:\r\n\tProcess ID:\t0x914\r\n\tName:\t\tC:\\Program Files\\VMware\\VMware Tools\\vmtoolsd.exe\r\n\r\nPrevious Time:\t\t3:07:54 30/10/2015\r\nNew Time:\t\t3:08:00 30/10/2015\r\n\r\nThis event is generated when the system time is changed. It is normal for the Windows Time Service, which runs with System privilege, to change the system time on a regular basis. Other system time changes may be indicative of attempts to tamper with the computer.","@version":"1","@timestamp":"2015-10-30T02:08:02.301Z","host":"xx.xx.139.120","type":"eventlog","category":"Security State Change","channel":"Security","eventid":4616,"hostname":"server.domain","keywords":-9214364837600034816,"opcode":"info","opcodevalue":0,"processid":672,"processname":"C:\\Program Files\\VMware\\VMware Tools\\vmtoolsd.exe","providerguid":"{54849625-5478-4994-A5BA-3E3B0328C30D}","recordnumber":3754887,"severity_label":"informational","severity":2,"sourcemodulename":"eventlog","sourcename":"Microsoft-Windows-Security-Auditing","subjectdomainname":"GENTGRP","subjectlogonid":"0x3e7","subjectusername":"server$","subjectusersid":"S-1-5-18","task":12288,"threadid":724,"version":0,"logsource":"server.domain"}]}
org.elasticsearch.action.WriteFailureException
        at org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:470)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardIndexOperation(TransportShardBulkAction.java:418)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:148)
        at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase.performOnPrimary(TransportShardReplicationOperationAction.java:574)
        at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$PrimaryPhase$1.doRun(TransportShardReplicationOperationAction.java:440)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:36)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.index.mapper.MapperParsingException: failed to parse [NewTime]
        at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:411)
        at org.elasticsearch.index.mapper.object.ObjectMapper.serializeValue(ObjectMapper.java:706)
        at org.elasticsearch.index.mapper.object.ObjectMapper.parse(ObjectMapper.java:497)
        at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:544)
        at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:493)
        at org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:466)
        ... 8 more
Caused by: org.elasticsearch.index.mapper.MapperParsingException: failed to parse date field [3:08:00], tried both date format [dateOptionalTime], and timestamp number with locale []
        at org.elasticsearch.index.mapper.core.DateFieldMapper.parseStringValue(DateFieldMapper.java:617)
        at org.elasticsearch.index.mapper.core.DateFieldMapper.innerParseCreateField(DateFieldMapper.java:535)
        at org.elasticsearch.index.mapper.core.NumberFieldMapper.parseCreateField(NumberFieldMapper.java:239)
        at org.elasticsearch.index.mapper.core.AbstractFieldMapper.parse(AbstractFieldMapper.java:401)
        ... 13 more
Caused by: java.lang.IllegalArgumentException: Invalid format: "3:08:00" is malformed at ":08:00"
        at org.elasticsearch.common.joda.time.format.DateTimeParserBucket.doParseMillis(DateTimeParserBucket.java:187)
        at org.elasticsearch.common.joda.time.format.DateTimeFormatter.parseMillis(DateTimeFormatter.java:780)
        at org.elasticsearch.index.mapper.core.DateFieldMapper.parseStringValue(DateFieldMapper.java:612)
        ... 16 more
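The last "Caused by" line points at the root cause: the field was mapped with Elasticsearch's default dateOptionalTime format, which expects an ISO-8601 date, so a bare time like "3:08:00" can never parse. The behaviour can be reproduced outside Elasticsearch; here is a sketch in Python (a stand-in for the Joda parser, using the NewDate/NewTime values from the event above):

```python
from datetime import datetime

# Elasticsearch's default dateOptionalTime mapping expects an ISO-8601
# date; a bare time like "3:08:00" has no date part, so parsing fails.
try:
    datetime.strptime("3:08:00", "%Y-%m-%dT%H:%M:%S")
    bare_time_parses = True
except ValueError:
    bare_time_parses = False

# Joining the event's NewDate ("30/10/2015") and NewTime ("3:08:00")
# fields and parsing with an explicit format works fine.
combined = datetime.strptime("30/10/2015 3:08:00", "%d/%m/%Y %H:%M:%S")

print(bare_time_parses)        # False
print(combined.isoformat())    # 2015-10-30T03:08:00
```

On the Logstash side, the usual fix along these lines would be to merge the two fields into one and hand it to a date filter with an explicit Joda pattern such as d/M/yyyy H:mm:ss - the exact filter wiring depends on your NLS configuration, so treat this as a direction rather than a recipe.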
EDIT: I was able to simulate the "cpu-not-dropping" problem very quickly. After loading my dashboard with a "last 30 days" setting, it took some time (30 sec) to load, which is normal. After it has loaded completely, and after I close the web GUI, there is still a java process running in top. This process won't go away or calm down until I restart the elasticsearch service.

Grtz
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Logstash crashing continuously

Post by jolson »

What I would like you to do is use 'top' to watch just the offending process.

Code: Select all

top -p <PID>
where <PID> is the PID of the java process consuming the most resources.

I'm also interested in your elasticsearch.log file:

Code: Select all

tail -n50 /var/log/elasticsearch/*.log
Does anything jump out at you after running the above commands?
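Since the original error was a heap cap being hit, it is also worth watching JVM heap pressure directly: Elasticsearch's _nodes/stats/jvm endpoint reports per-node heap usage. A sketch of pulling heap_used_percent out of that response - the payload below is an illustrative sample, not output from your node; in practice you would fetch http://localhost:9200/_nodes/stats/jvm (assuming the default HTTP port):

```python
import json

# Illustrative sample of an Elasticsearch _nodes/stats/jvm response;
# field names match the real API, values are made up for the sketch.
sample = json.loads("""
{
  "nodes": {
    "abc123": {
      "name": "nls-node-1",
      "jvm": {
        "mem": {
          "heap_used_in_bytes": 437000000,
          "heap_max_in_bytes": 518979584,
          "heap_used_percent": 84
        }
      }
    }
  }
}
""")

# Print heap usage per node; a value pinned in the 90s before a crash
# would corroborate the GC-overhead error from the first post.
for node in sample["nodes"].values():
    mem = node["jvm"]["mem"]
    print("%s: %d%% heap used" % (node["name"], mem["heap_used_percent"]))
```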
User avatar
WillemDH
Posts: 2320
Joined: Wed Mar 20, 2013 5:49 am
Location: Ghent
Contact:

Re: Logstash crashing continuously

Post by WillemDH »

Jesse,

Check out the screenshot of the 'top -p 61259' command. As you can see, this process has been running since the tests I did on Saturday.

Checking the elasticsearch logs did not show anything in particular.

Grtz
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Logstash crashing continuously

Post by jolson »

I'd like to set up a remote session to try and get to the bottom of this problem. Would you email customersupport@nagios.com and reference this thread please?
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Logstash crashing continuously

Post by jolson »

Thread locked because an internal ticket was opened.
Locked