On NLS client rsyslog/python/httpd processing stopped

GhostRider2110 · Post by **GhostRider2110** » Wed Mar 11, 2015 9:27 am

Nagios Log Server • 2015R1.3 (From VM Template)

We had a strange thing happen yesterday. Two systems, a primary and a failover, which are configured to send all logs (system and httpd) to NLS, pretty much stopped processing. They mainly run Python WSGI processing. We spent some time trying to figure out what was going on: when the primary stopped responding to the varnishd cache server, varnish would move over to the failover system, which would work for a few minutes and then bog down, and varnish would fail back to the primary since it was responding again... back and forth.

This went on for 40 minutes or so until I decided to start backing out "extra" processing. I first shut off all Nagios XI checks, then moved the NLS rsyslog config files and restarted rsyslogd on the primary system. When it came back as primary, everything was working as it should: no slowdown, no errors, etc. So on the failover system, I moved the NLS rsyslog config files, monitored the httpd access logs, and forced a failover. When we saw the system start to fail, I restarted rsyslog to implement the new config and all processing started back up. In the Apache error_log, there were several stack trace entries from the restart of rsyslogd.

Code: Select all

[Tue Mar 10 14:43:37 2015] [error] Traceback (most recent call last):
[Tue Mar 10 14:43:37 2015] [error]   File "/usr/lib64/python2.6/logging/handlers.py", line 799, in emit
[Tue Mar 10 14:43:37 2015] [error]     self._connect_unixsocket(self.address)
[Tue Mar 10 14:43:37 2015] [error]   File "/usr/lib64/python2.6/logging/handlers.py", line 731, in _connect_unixsocket
[Tue Mar 10 14:43:37 2015] [error]     self.socket.connect(address)
[Tue Mar 10 14:43:37 2015] [error]   File "<string>", line 1, in connect
[Tue Mar 10 14:43:37 2015] [error] error: [Errno 2] No such file or directory
[Tue Mar 10 14:43:37 2015] [error] Traceback (most recent call last):
[Tue Mar 10 14:43:37 2015] [error]   File "/usr/lib64/python2.6/logging/handlers.py", line 799, in emit
[Tue Mar 10 14:43:37 2015] [error]     self._connect_unixsocket(self.address)
[Tue Mar 10 14:43:37 2015] [error]   File "/usr/lib64/python2.6/logging/handlers.py", line 731, in _connect_unixsocket
[Tue Mar 10 14:43:37 2015] [error]     self.socket.connect(address)
[Tue Mar 10 14:43:37 2015] [error]   File "<string>", line 1, in connect
[Tue Mar 10 14:43:37 2015] [error] error: [Errno 2] No such file or directory
[Tue Mar 10 14:43:37 2015] [error] Traceback (most recent call last):
[Tue Mar 10 14:43:37 2015] [error]   File "/usr/lib64/python2.6/logging/handlers.py", line 799, in emit
[Tue Mar 10 14:43:37 2015] [error]     self._connect_unixsocket(self.address)
[Tue Mar 10 14:43:37 2015] [error]   File "/usr/lib64/python2.6/logging/handlers.py", line 731, in _connect_unixsocket
[Tue Mar 10 14:43:37 2015] [error]     self.socket.connect(address)
[Tue Mar 10 14:43:37 2015] [error]   File "<string>", line 1, in connect
[Tue Mar 10 14:43:37 2015] [error] error: [Errno 2] No such file or directory
[Tue Mar 10 14:43:37 2015] [error] Traceback (most recent call last):
[Tue Mar 10 14:43:37 2015] [error]   File "/usr/lib64/python2.6/logging/handlers.py", line 799, in emit
[Tue Mar 10 14:43:37 2015] [error]     self._connect_unixsocket(self.address)
[Tue Mar 10 14:43:37 2015] [error]   File "/usr/lib64/python2.6/logging/handlers.py", line 731, in _connect_unixsocket
[Tue Mar 10 14:43:37 2015] [error]     self.socket.connect(address)
[Tue Mar 10 14:43:37 2015] [error]   File "<string>", line 1, in connect
[Tue Mar 10 14:43:37 2015] [error] error: [Errno 2] No such file or directory

This configuration had been up and running since around Feb 25th so it surprised us when this cleared up the problems. I have attached all the rsyslog config files used for the configuration. After we got things back up and running, I did notice that the NLS web interface was very slow to respond, in fact I had to reboot the NLS to clear it up.
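As a side note on the stack traces: `[Errno 2] No such file or directory` on `self.socket.connect(address)` is what a Unix socket connect returns when the socket path does not exist. That would fit the window in which rsyslogd was stopped and its local log socket was absent, though the thread does not show which address the handler was using. The following is a minimal sketch of that one error, not a reproduction of the incident; `connect_errno` is a hypothetical helper and the temporary path is only a stand-in for the real log socket.

```python
import errno
import os
import socket
import tempfile


def connect_errno(path):
    """Try to connect a Unix datagram socket to `path`.

    Returns 0 on success, otherwise the errno of the failure.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        sock.connect(path)
        return 0
    except OSError as exc:
        return exc.errno
    finally:
        sock.close()


# Assumption: a path that does not exist stands in for the syslog socket
# while the syslog daemon is stopped.
missing = os.path.join(tempfile.mkdtemp(), "log")
result = connect_errno(missing)
print(errno.errorcode.get(result), os.strerror(result))
```

If this reading is right, the tracebacks are a symptom of the restart itself and not necessarily of the earlier slowdown.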

The system will only allow 3 attachments, so I will attach the other 3 in a reply.

The client systems run RHEL: one is 6.4 and one is 6.6.

If other logs from NLS are wanted, let me know.

Thanks
Mitch

GhostRider2110 · Post by **GhostRider2110** » Wed Mar 11, 2015 9:28 am

Other config files attached.

Thanks again
Mitch

GhostRider2110 · Post by **GhostRider2110** » Wed Mar 11, 2015 11:14 am

Noticing a few java processes with high mem and cpu usage. See attached screen shots.

topnls.png

Still baffled over the behavior of those systems, given that the only link is that they were configured to send logs to NLS. What condition could the NLS have been in to cause systems to seemingly lock up processing sockets?
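One mechanism that could produce this kind of stall, offered as an assumption and not something this thread confirms: a process writing to a local socket blocks once the reader stops draining it. If rsyslogd stopped reading its local log socket (for example while waiting on a slow remote target), every process that logs through syslog could stall even with no CPU or IO load. The sketch below only demonstrates the socket behaviour with a socket pair; `fill_until_blocked` is a hypothetical name, and a stream socket pair is used purely for illustration.

```python
import socket


def fill_until_blocked(max_sends=100000):
    """Write to a socket whose peer never reads, until the kernel buffer is full.

    Returns (chunks_sent, blocked). The writer is non-blocking so the point
    where a normal blocking writer would hang shows up as BlockingIOError.
    """
    writer, reader = socket.socketpair()  # the reader stands in for a stalled daemon
    writer.setblocking(False)
    sent = 0
    blocked = False
    try:
        for _ in range(max_sends):
            try:
                writer.send(b"x" * 1024)
                sent += 1
            except BlockingIOError:
                blocked = True  # a blocking writer would stall right here
                break
    finally:
        writer.close()
        reader.close()
    return sent, blocked


sent, blocked = fill_until_blocked()
print("chunks accepted before the writer would block:", sent, "blocked:", blocked)
```

The number of chunks accepted depends on kernel buffer sizes, so it differs between systems.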

Thanks
Mitch

jolson · Post by **jolson** » Wed Mar 11, 2015 11:39 am

I would like you to output the following information from NLS:

Code: Select all

cat /usr/local/nagioslogserver/elasticsearch/config/elasticsearch.yml
netstat -na|grep LISTEN

And for clarification - the NLS never stopped processing, it was the NLS clients in question that stopped processing - is that correct? Or did the NLS lock up, and an rsyslogd restart somehow resolved that?

To me, this feels like a case of rsyslogd crashing the client systems for some reason. I appreciate the good information that you have provided, and will continue to do research while you get the above information to us. Thank you very much.

GhostRider2110 · Post by **GhostRider2110** » Wed Mar 11, 2015 12:04 pm

jolson wrote: I would like you to output the following information from NLS:
Code: Select all
cat /usr/local/nagioslogserver/elasticsearch/config/elasticsearch.yml
netstat -na|grep LISTEN
And for clarification - the NLS never stopped processing, it was the NLS clients in question that stopped processing - is that correct? Or did the NLS lock up, and an rsyslogd restart somehow resolved that?

To me, this feels like a case of rsyslogd crashing the client systems for some reason. I appreciate the good information that you have provided, and will continue to do research while you get the above information to us. Thank you very much.

NLS never totally stopped processing, but it was in a state that was unusable, at least from the web interface. Removing the NLS configurations from rsyslogd and restarting rsyslogd cleared up the client workstations and allowed them to process again. Everything we noticed about the NLS server was after the fact, since we were concentrating on getting the production services back up and functioning.

I forgot to mention: besides delays and even timeouts when trying to log in to the client systems, they had no load, no excessive CPU activity, no excessive IO activity, etc. I wish I had thought to check open files and open sockets. When logged in to the clients, the only indication that something was wrong (besides the httpd/wsgi processes not working) was an occasional delay in processing commands, and this was not consistent. Things would work fine for a few minutes, then start to error out: login delays, file opening delays, command execution delays, etc. Then it would clear up.

We didn't notice any problems with other systems configured to use NLS, but those don't have the same httpd/wsgi processes running on them; most are httpd caching servers or DB servers. At one point we even rebooted one of the two clients that were having issues. Things cleared up for a few minutes, but then the errors resumed.

A couple of our developers are looking at the stack traces. There were a lot of them, mostly repeats of what I included earlier. All of them happened between rsyslogd being shut down and started back up. After the startup message in /var/log/messages, there were no more stack traces.
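To check that the stack traces really cluster around the rsyslogd restart, the error_log entries can be counted per timestamp. The following is a rough sketch that assumes the line format shown in the excerpt earlier in this thread; `tracebacks_per_second` is a hypothetical helper, and the log path in the comment is an assumption.

```python
import re
from collections import Counter

# Matches the first line of each traceback as it appears in the excerpt, e.g.
# [Tue Mar 10 14:43:37 2015] [error] Traceback (most recent call last):
TRACEBACK_RE = re.compile(
    r"^\[([^\]]+)\] \[error\] Traceback \(most recent call last\):"
)


def tracebacks_per_second(lines):
    """Count traceback headers per timestamp string."""
    counts = Counter()
    for line in lines:
        match = TRACEBACK_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Example use (the path is an assumption, adjust to the real error_log):
# with open("/var/log/httpd/error_log") as fh:
#     for stamp, count in sorted(tracebacks_per_second(fh).items()):
#         print(stamp, count)
```

Comparing the resulting timestamps with the rsyslogd shutdown and startup messages in /var/log/messages would show whether any traceback falls outside the restart window.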

Code: Select all

[root@IGAnagioslog ~]# cat /usr/local/nagioslogserver/elasticsearch/config/elasticsearch.yml
##################### Elasticsearch Configuration Example #####################

# This file contains an overview of various configuration settings,
# targeted at operations staff. Application developers should
# consult the guide at <http://elasticsearch.org/guide>.
#
# The installation procedure is covered at
# <http://elasticsearch.org/guide/en/elasticsearch/reference/current/setup.html>.
#
# Elasticsearch comes with reasonable defaults for most settings,
# so you can try it out without bothering with configuration.
#
# Most of the time, these defaults are just fine for running a production
# cluster. If you're fine-tuning your cluster, or wondering about the
# effect of certain configuration option, please _do ask_ on the
# mailing list or IRC channel [http://elasticsearch.org/community].

# Any element in the configuration can be replaced with environment variables
# by placing them in ${...} notation. For example:
#
# node.rack: ${RACK_ENV_VAR}

# For information on supported formats and syntax for the config file, see
# <http://elasticsearch.org/guide/en/elasticsearch/reference/current/setup-configuration.html>


################################### Cluster ###################################

# Cluster name identifies your cluster for auto-discovery. If you're running
# multiple clusters on the same network, make sure you're using unique names.
#
cluster.name: nagios_elasticsearch


#################################### Node #####################################

# Node names are generated dynamically on startup, so you're relieved
# from configuring them manually. You can tie this node to a specific name:
#
# node.name: "Franz Kafka"

# Every node can be configured to allow or deny being eligible as the master,
# and to allow or deny to store the data.
#
# Allow this node to be eligible as a master node (enabled by default):
#
# node.master: true
#
# Allow this node to store data (enabled by default):
#
# node.data: true

# You can exploit these settings to design advanced cluster topologies.
#
# 1. You want this node to never become a master node, only to hold data.
#    This will be the "workhorse" of your cluster.
#
# node.master: false
# node.data: true
#
# 2. You want this node to only serve as a master: to not store any data and
#    to have free resources. This will be the "coordinator" of your cluster.
#
# node.master: true
# node.data: false
#
# 3. You want this node to be neither master nor data node, but
#    to act as a "search load balancer" (fetching data from nodes,
#    aggregating results, etc.)
#
# node.master: false
# node.data: false

# Use the Cluster Health API [http://localhost:9200/_cluster/health], the
# Node Info API [http://localhost:9200/_nodes] or GUI tools
# such as <http://www.elasticsearch.org/overview/marvel/>,
# <http://github.com/karmi/elasticsearch-paramedic>,
# <http://github.com/lukas-vlcek/bigdesk> and
# <http://mobz.github.com/elasticsearch-head> to inspect the cluster state.

# A node can have generic attributes associated with it, which can later be used
# for customized shard allocation filtering, or allocation awareness. An attribute
# is a simple key value pair, similar to node.key: value, here is an example:
#
# node.rack: rack314

# By default, multiple nodes are allowed to start from the same installation location
# to disable it, set the following:
node.max_local_storage_nodes: 1


#################################### Index ####################################

# You can set a number of options (such as shard/replica options, mapping
# or analyzer definitions, translog settings, ...) for indices globally,
# in this file.
#
# Note, that it makes more sense to configure index settings specifically for
# a certain index, either when creating it or by using the index templates API.
#
# See <http://elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules.html> and
# <http://elasticsearch.org/guide/en/elasticsearch/reference/current/indices-create-index.html>
# for more information.

# Set the number of shards (splits) of an index (5 by default):
#
# index.number_of_shards: 5

# Set the number of replicas (additional copies) of an index (1 by default):
#
# index.number_of_replicas: 1

# Note, that for development on a local machine, with small indices, it usually
# makes sense to "disable" the distributed features:
#
# index.number_of_shards: 1
# index.number_of_replicas: 0

# These settings directly affect the performance of index and search operations
# in your cluster. Assuming you have enough machines to hold shards and
# replicas, the rule of thumb is:
#
# 1. Having more *shards* enhances the _indexing_ performance and allows to
#    _distribute_ a big index across machines.
# 2. Having more *replicas* enhances the _search_ performance and improves the
#    cluster _availability_.
#
# The "number_of_shards" is a one-time setting for an index.
#
# The "number_of_replicas" can be increased or decreased anytime,
# by using the Index Update Settings API.
#
# Elasticsearch takes care about load balancing, relocating, gathering the
# results from nodes, etc. Experiment with different settings to fine-tune
# your setup.

# Use the Index Status API (<http://localhost:9200/A/_status>) to inspect
# the index status.


#################################### Paths ####################################

# Path to directory containing configuration (this file and logging.yml):
#
# path.conf: /path/to/conf

# Path to directory where to store index data allocated for this node.
#
# path.data: /path/to/data
#
# Can optionally include more than one location, causing data to be striped across
# the locations (a la RAID 0) on a file level, favouring locations with most free
# space on creation. For example:
#
# path.data: /path/to/data1,/path/to/data2

# Path to temporary files:
#
# path.work: /path/to/work

# Path to log files:
#
# path.logs: /path/to/logs

# Path to where plugins are installed:
#
# path.plugins: /path/to/plugins


#################################### Plugin ###################################

# If a plugin listed here is not installed for current node, the node will not start.
#
# plugin.mandatory: mapper-attachments,lang-groovy


################################### Memory ####################################

# Elasticsearch performs poorly when JVM starts swapping: you should ensure that
# it _never_ swaps.
#
# Set this property to true to lock the memory:
#
bootstrap.mlockall: true

# Make sure that the ES_MIN_MEM and ES_MAX_MEM environment variables are set
# to the same value, and that the machine has enough memory to allocate
# for Elasticsearch, leaving enough memory for the operating system itself.
#
# You should also make sure that the Elasticsearch process is allowed to lock
# the memory, eg. by using `ulimit -l unlimited`.


############################## Network And HTTP ###############################

# Elasticsearch, by default, binds itself to the 0.0.0.0 address, and listens
# on port [9200-9300] for HTTP traffic and on port [9300-9400] for node-to-node
# communication. (the range means that if the port is busy, it will automatically
# try the next port).

# Set the bind address specifically (IPv4 or IPv6):
#
# network.bind_host: 192.168.0.1

# Set the address other nodes will use to communicate with this node. If not
# set, it is automatically derived. It must point to an actual IP address.
#
# network.publish_host: 192.168.0.1

# Set both 'bind_host' and 'publish_host':
#
# network.host: 192.168.0.1

# Set a custom port for the node to node communication (9300 by default):
#
# transport.tcp.port: 9300

# Enable compression for all communication between nodes (disabled by default):
#
transport.tcp.compress: true

# Set a custom port to listen for HTTP traffic:
#
# http.port: 9200

# Set a custom allowed content length:
#
# http.max_content_length: 100mb

# Disable HTTP completely:
#
# http.enabled: false

# Set the HTTP host to listen to
#
http.host: "localhost"

################################### Gateway ###################################

# The gateway allows for persisting the cluster state between full cluster
# restarts. Every change to the state (such as adding an index) will be stored
# in the gateway, and when the cluster starts up for the first time,
# it will read its state from the gateway.

# There are several types of gateway implementations. For more information, see
# <http://elasticsearch.org/guide/en/elasticsearch/reference/current/modules-gateway.html>.

# The default gateway type is the "local" gateway (recommended):
#
# gateway.type: local

# Settings below control how and when to start the initial recovery process on
# a full cluster restart (to reuse as much local data as possible when using shared
# gateway).

# Allow recovery process after N nodes in a cluster are up:
#
# gateway.recover_after_nodes: 1

# Set the timeout to initiate the recovery process, once the N nodes
# from previous setting are up (accepts time value):
#
# gateway.recover_after_time: 5m

# Set how many nodes are expected in this cluster. Once these N nodes
# are up (and recover_after_nodes is met), begin recovery process immediately
# (without waiting for recover_after_time to expire):
#
# gateway.expected_nodes: 2


############################# Recovery Throttling #############################

# These settings allow to control the process of shards allocation between
# nodes during initial recovery, replica allocation, rebalancing,
# or when adding and removing nodes.

# Set the number of concurrent recoveries happening on a node:
#
# 1. During the initial recovery
#
# cluster.routing.allocation.node_initial_primaries_recoveries: 4
#
# 2. During adding/removing nodes, rebalancing, etc
#
# cluster.routing.allocation.node_concurrent_recoveries: 2

# Set to throttle throughput when recovering (eg. 100mb, by default 20mb):
#
# indices.recovery.max_bytes_per_sec: 20mb

# Set to limit the number of open concurrent streams when
# recovering a shard from a peer:
#
# indices.recovery.concurrent_streams: 5


################################## Discovery ##################################

# Discovery infrastructure ensures nodes can be found within a cluster
# and master node is elected. Multicast discovery is the default.

# Set to ensure a node sees N other master eligible nodes to be considered
# operational within the cluster. Its recommended to set it to a higher value
# than 1 when running more than 2 nodes in the cluster.
#
# discovery.zen.minimum_master_nodes: 1

# Set the time to wait for ping responses from other nodes when discovering.
# Set this option to a higher value on a slow or congested network
# to minimize discovery failures:
#
# discovery.zen.ping.timeout: 3s

# For more information, see
# <http://elasticsearch.org/guide/en/elasticsearch/reference/current/modules-discovery-zen.html>

# Unicast discovery allows to explicitly control which nodes will be used
# to discover the cluster. It can be used when multicast is not present,
# or to restrict the cluster communication-wise.
#
# 1. Disable multicast discovery (enabled by default):
#
discovery.zen.ping.multicast.enabled: false
#
# 2. Configure an initial list of master nodes in the cluster
#    to perform discovery when new nodes (master or data) are started:
#
discovery.zen.ping.unicast.hosts: ["localhost"]

# EC2 discovery allows to use AWS EC2 API in order to perform discovery.
#
# You have to install the cloud-aws plugin for enabling the EC2 discovery.
#
# For more information, see
# <http://elasticsearch.org/guide/en/elasticsearch/reference/current/modules-discovery-ec2.html>
#
# See <http://elasticsearch.org/tutorials/elasticsearch-on-ec2/>
# for a step-by-step tutorial.

# GCE discovery allows to use Google Compute Engine API in order to perform discovery.
#
# You have to install the cloud-gce plugin for enabling the GCE discovery.
#
# For more information, see <https://github.com/elasticsearch/elasticsearch-cloud-gce>.

# Azure discovery allows to use Azure API in order to perform discovery.
#
# You have to install the cloud-azure plugin for enabling the Azure discovery.
#
# For more information, see <https://github.com/elasticsearch/elasticsearch-cloud-azure>.

################################## Slow Log ##################################

# Shard level query and fetch threshold logging.

#index.search.slowlog.threshold.query.warn: 10s
#index.search.slowlog.threshold.query.info: 5s
#index.search.slowlog.threshold.query.debug: 2s
#index.search.slowlog.threshold.query.trace: 500ms

#index.search.slowlog.threshold.fetch.warn: 1s
#index.search.slowlog.threshold.fetch.info: 800ms
#index.search.slowlog.threshold.fetch.debug: 500ms
#index.search.slowlog.threshold.fetch.trace: 200ms

#index.indexing.slowlog.threshold.index.warn: 10s
#index.indexing.slowlog.threshold.index.info: 5s
#index.indexing.slowlog.threshold.index.debug: 2s
#index.indexing.slowlog.threshold.index.trace: 500ms

################################## GC Logging ################################

#monitor.jvm.gc.young.warn: 1000ms
#monitor.jvm.gc.young.info: 700ms
#monitor.jvm.gc.young.debug: 400ms

#monitor.jvm.gc.old.warn: 10s
#monitor.jvm.gc.old.info: 5s
#monitor.jvm.gc.old.debug: 2s

Code: Select all

[root@IGAnagioslog ~]# netstat -na|grep LISTEN
tcp        0      0 0.0.0.0:22                  0.0.0.0:*                   LISTEN      
tcp        0      0 127.0.0.1:25                0.0.0.0:*                   LISTEN      
tcp        0      0 127.0.0.1:6010              0.0.0.0:*                   LISTEN      
tcp        0      0 :::2056                     :::*                        LISTEN      
tcp        0      0 :::5544                     :::*                        LISTEN      
tcp        0      0 :::2057                     :::*                        LISTEN      
tcp        0      0 ::ffff:127.0.0.1:9200       :::*                        LISTEN      
tcp        0      0 :::6544                     :::*                        LISTEN      
tcp        0      0 :::80                       :::*                        LISTEN      
tcp        0      0 :::9300                     :::*                        LISTEN      
tcp        0      0 :::22                       :::*                        LISTEN      
tcp        0      0 ::1:6010                    :::*                        LISTEN      
tcp        0      0 :::3515                     :::*                        LISTEN      
tcp        0      0 :::514                      :::*                        LISTEN      
tcp        0      0 :::5666                     :::*                        LISTEN      
tcp        0      0 :::5667                     :::*                        LISTEN      
unix  2      [ ACC ]     STREAM     LISTENING     7395   @/com/ubuntu/upstart
unix  2      [ ACC ]     STREAM     LISTENING     11640  @/var/run/hald/dbus-kQZnK4dGlo
unix  2      [ ACC ]     STREAM     LISTENING     11542  /var/run/dbus/system_bus_socket
unix  2      [ ACC ]     STREAM     LISTENING     11635  @/var/run/hald/dbus-qgRSGERR3n

jolson · Post by **jolson** » Wed Mar 11, 2015 12:37 pm

Thank you for all of the good information. Could you run this command for me as well? I meant to include it in my last post.

Code: Select all

cat /etc/sysconfig/elasticsearch

I ask because sometimes slowness in the NLS cluster can come from not having two properties set:

Code: Select all

ES_HEAP_SIZE=1g
-and-
MAX_LOCKED_MEMORY=unlimited

I highly recommend uncommenting ES_HEAP_SIZE and setting the value to half of the amount of RAM on all nodes.
I highly recommend uncommenting MAX_LOCKED_MEMORY and changing the value to unlimited on all nodes.
After the values have been changed, please restart elasticsearch on all nodes.

Code: Select all

service elasticsearch restart
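The "half of the amount of RAM" rule above can be worked out from /proc/meminfo. This is a small sketch under stated assumptions: `suggested_heap_line` is a hypothetical helper, it expects a Linux-style `MemTotal:` line in kB, and it rounds down to whole megabytes. The printed line still has to be placed in /etc/sysconfig/elasticsearch by hand.

```python
import os


def suggested_heap_line(meminfo_text):
    """Return an ES_HEAP_SIZE line equal to half of MemTotal, in whole megabytes."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            total_kib = int(line.split()[1])   # /proc/meminfo reports kB
            half_mib = total_kib // 2 // 1024  # half of RAM, rounded down
            return "ES_HEAP_SIZE=%dm" % half_mib
    raise ValueError("no MemTotal line found")


# Only runs where /proc/meminfo exists (Linux).
if os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as fh:
        print(suggested_heap_line(fh.read()))
```

For example, a node reporting `MemTotal: 4194304 kB` would get `ES_HEAP_SIZE=2048m`.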

In future releases, we plan to have these values defined automatically.

In a lot of cases, this will resolve sluggishness on the side of the Nagios Log Server - I have seen threads where it improved the GUI speed from the point of unusability to being usable once more.

The issue regarding the clients not sending logs is more of a grey area. It's hard to say that it is within our jurisdiction to troubleshoot the clients, since the NLS doesn't push data. We will do what we can, but cannot guarantee a fix. You might have better results posting on a python/rsyslog forum regarding the client problems.

Let me know how the above works out for you. Thanks!

Edit: for clarity

GhostRider2110 · Post by **GhostRider2110** » Wed Mar 11, 2015 1:16 pm

That is interesting, it says the max for ES_HEAP_SIZE is 1g, yet the commented out line has 2g. I'll do as you suggest.

Code: Select all

[root@IGAnagioslog ~]# cat /etc/sysconfig/elasticsearch
# Directory where the Elasticsearch binary distribution resides
APP_DIR="/usr/local/nagioslogserver"
ES_HOME="$APP_DIR/elasticsearch"

# Heap Size (defaults to 256m min, 1g max)
#ES_HEAP_SIZE=2g

# Heap new generation
#ES_HEAP_NEWSIZE=

# max direct memory
#ES_DIRECT_SIZE=

# Additional Java OPTS
#ES_JAVA_OPTS=

# Maximum number of open files
MAX_OPEN_FILES=65535

# Maximum amount of locked memory
#MAX_LOCKED_MEMORY=

# Maximum number of VMA (Virtual Memory Areas) a process can own
MAX_MAP_COUNT=262144

# Elasticsearch log directory
LOG_DIR=/var/log/elasticsearch

# Elasticsearch data directory
DATA_DIR="$ES_HOME/data"

# Elasticsearch work directory
WORK_DIR="$APP_DIR/tmp/elasticsearch"

# Elasticsearch conf directory
CONF_DIR="$ES_HOME/config"

# Elasticsearch configuration file (elasticsearch.yml)
CONF_FILE="$ES_HOME/config/elasticsearch.yml"

# User to run as, change this to a specific elasticsearch user if possible
# Also make sure, this user can write into the log directories in case you change them
# This setting only works for the init script, but has to be configured separately for systemd startup
ES_USER=nagios
ES_GROUP=nagios

# Configure restart on package upgrade (true, every other setting will lead to not restarting)
#RESTART_ON_UPGRADE=true

if [ "x$1" == "xstart" -o "x$1" == "xrestart" -o "x$1" == "xreload" -o "x$1" == "xforce-reload" ];then
	GET_ES_CONFIG_MESSAGE="$( php $APP_DIR/scripts/get_es_config.php )"
	GET_ES_CONFIG_RETURN=$?

	if [ "$GET_ES_CONFIG_RETURN" != "0" ]; then
		echo $GET_ES_CONFIG_MESSAGE
		exit 1
	else
		ES_JAVA_OPTS="$GET_ES_CONFIG_MESSAGE"
	fi
fi

I knew this would be a grey area, but since we didn't totally turn off rsyslogd, only stopped it from sending any logs to the NLS, it would seem related. I know that the rsyslog facility does have built-in "overflow" handling in case it can't talk to the NLS, but bringing a system to its knees... that hurts lol. Here is a look at the log flow into the NLS from one of the systems in question. It would have bursts of information.

rep03.png

Now granted, the system is the top log producer. Here is a look with the filter set to its IP for source and a time period of "2015-03-09T04:00:01.000Z" to "2015-03-11T04:00:01.000Z". But during the time we had the problems, it was not even at a real peak.

rep03normal.png

I was also just wondering if anyone else had run across this issue. I know it has to be related to python/rsyslog and sockets. I made the suggested changes and will let you know about response time improvements.
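On the Python side, one common way to keep a slow log destination from stalling request handling is to put a queue between the application and the real handler. The sketch below is an illustration only and was not tried on the systems in this thread: it needs Python 3.2 or later (the clients here run Python 2.6), `ListHandler` and `build_async_logger` are hypothetical names, and `ListHandler` merely stands in for a syslog handler.

```python
import logging
import logging.handlers
import queue


class ListHandler(logging.Handler):
    """Stand-in for a slow syslog handler; it only records formatted messages."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(self.format(record))


def build_async_logger(target_handler, name="wsgi_app"):
    """Attach a QueueHandler to the logger and drain it on a background thread."""
    log_queue = queue.Queue(-1)  # unbounded: callers never block, memory may grow
    listener = logging.handlers.QueueListener(log_queue, target_handler)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(logging.handlers.QueueHandler(log_queue))
    logger.propagate = False
    listener.start()
    return logger, listener


target = ListHandler()
logger, listener = build_async_logger(target)
logger.info("request handled")  # returns as soon as the record is queued
listener.stop()                 # drains the queue and joins the worker thread
print(target.messages)
```

In a real application the stand-in would be replaced by something like `logging.handlers.SysLogHandler`. The trade-off is that an unbounded queue grows in memory while the destination is stalled.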

See-ya
Mitch

GhostRider2110 · Post by **GhostRider2110** » Wed Mar 11, 2015 1:55 pm

I also wanted to have someone look at the config files used on the client to make sure I had not done something to induce the problem. --Mitch

jolson · Post by **jolson** » Wed Mar 11, 2015 2:42 pm

Are there any SELinux audit logs from around that time period? I used to work with VoIP systems, and I have seen rsyslog bring down servers when SELinux gets in the way. You could try setting enforcement to permissive when this issue is occurring to see if that helps. This is assuming that SELinux is enforcing to begin with, of course.

Code: Select all

setenforce 0

Code: Select all

tail -n100 /var/log/audit/audit*.log

I cannot see anything wrong with your rsyslog configuration. Was anything changed leading up to this event - perhaps security patches?

GhostRider2110 wrote: That is interesting, it says the max for ES_HEAP_SIZE is 1g, yet the commented out line has 2g. I'll do as you suggest.

The comment is a confusing one.

# Heap Size (defaults to 256m min, 1g max)

It is intended to read as 'if this variable is not set, the default heap size will allow between 256m - 1g of allocation.'

Reference bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=754455
https://bugzilla.redhat.com/show_bug.cgi?id=834316

Best Regards.

Jesse

GhostRider2110 · Post by **GhostRider2110** » Wed Mar 11, 2015 3:18 pm

Thanks, SElinux is set to disabled on those systems.

That was one of the things I had confirmed after the event... hoping it was something that simple. Oh well..

Going to try to set up the same logging config in one of the development environments to see if I can reproduce it. Will post what I find out.

See-ya
Mitch

Nagios Support Forum
