Hi,
I'm testing Nagios Log Server for my company, and I ran into a serious show stopper I can't figure out.
Problem 1:
After some period of uptime, RAM usage creeps up until it's nearly 100%, then the Events dashboard shows a slowdown, and eventually 0 events. I've addressed this by restarting elasticsearch, which frees up about 1GB of RAM; activity spikes up (and then levels out) and keeps working for a while. After it happened again, I rebooted the machine and am currently watching the RAM usage increase.
Problem 2:
After Nagios dies (Events = 0), all the machines that Nagios is reading from begin to slow down because of rsyslog. I can confirm this because when I turn rsyslog off, the machine becomes snappy again.
The setup:
Nagios Log Server 2015R1.3 is running on a dedicated machine - RHEL 6.6 (kernel 2.6.32-431.el6.x86_64, Java OpenJDK 1.7.0_75), Xeon E5504 (8 cores @ 2GHz), 16GB RAM, and the Nagios data directory is on a separate disk from the OS. It is reading a bit shy of 100 logs (total) from 8 other similarly specced machines. According to the Events chart, the activity volume normally ranges from 60k to 200k events per 10 min, with spikes up to 600k per 10 min. All Nagios settings are at their defaults (except the data directory).
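To put numbers on the creep, something like the following can log the elasticsearch process RSS over time (a throwaway sketch of my own, not part of Nagios Log Server; it assumes the java command line contains the string "elasticsearch"):

```shell
# Print the resident set size (KB) of the elasticsearch java process.
# parse_rss is split out so the awk logic can be checked against canned
# `ps` output; it sums the RSS column of any line mentioning elasticsearch.
parse_rss() {
  awk '/elasticsearch/ {sum += $1} END {print sum + 0}'
}

es_rss_kb() {
  ps -C java -o rss=,args= | parse_rss
}

# e.g. from cron, once a minute, to graph the growth later:
#   echo "$(date '+%F %T') $(es_rss_kb)" >> /var/log/es-rss.log
```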
Thanks for looking.
Runs out of RAM, then stops reading altogether.
polarbear1
- Posts: 73
- Joined: Mon Apr 13, 2015 4:26 pm
Re: Runs out of RAM, then stops reading altogether.
Have you set either the ES_HEAP_SIZE or MAX_LOCKED_MEMORY variables? If not, setting them may result in a noticeable performance boost.
Please run:
Code: Select all
cat /etc/sysconfig/elasticsearch
I suggest tuning ES_HEAP_SIZE to half of the amount of RAM in your box - in your case, 8GB seems like a good number. MAX_LOCKED_MEMORY can be set to 'unlimited' to allow elasticsearch to keep its memory locked in RAM, which boosts performance.
The variables of interest:
Code: Select all
# Heap Size
ES_HEAP_SIZE=

# Maximum amount of locked memory
MAX_LOCKED_MEMORY=
Examples:
Code: Select all
ES_HEAP_SIZE=8g
MAX_LOCKED_MEMORY=unlimited
After setting these variables, restart elasticsearch:
Code: Select all
service elasticsearch restart
I have seen the above significantly improve performance. I'm not sure what is causing your RAM bloat - my guess is that the culprit is elasticsearch. Does the elasticsearch process ever get reaped by the kernel's OOM killer?
200k per 10 minutes is a high volume of logs - any chance you could bump the machine up to 32GB, or is that not a possibility? If you do, be sure to set ES_HEAP_SIZE to 16g.
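The half-of-RAM rule of thumb can be sketched as a small helper (hypothetical, not shipped with Nagios Log Server; the 31g cap is there because the JVM disables compressed object pointers above roughly 32GB of heap):

```shell
# Suggest an ES_HEAP_SIZE value: half of physical RAM, capped at 31g so the
# JVM keeps using compressed object pointers. Takes total RAM in KB as an
# argument so the arithmetic can be tested without reading /proc/meminfo.
suggest_heap() {
  local half_gb=$(( $1 / 1024 / 1024 / 2 ))
  [ "$half_gb" -gt 31 ] && half_gb=31
  [ "$half_gb" -lt 1 ] && half_gb=1
  echo "${half_gb}g"
}

# On a live box:
#   suggest_heap "$(awk '/MemTotal/ {print $2}' /proc/meminfo)"
suggest_heap 16777216   # 16GB machine -> 8g
```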
Let me know if the above helps with your RAM issue.
Regarding your second problem, you may be interested in rsyslog queues: http://www.rsyslog.com/doc/queues.html
While I do some testing on my end to figure out what might be causing rsyslog to stall, could you please post the rsyslog configuration from one of the machines you see slowness on?
Code: Select all
cat /etc/rsyslog.conf
cat /etc/rsyslog.d/*
Thanks!
Jesse
Re: Runs out of RAM, then stops reading altogether.
Thanks for the quick reply. I will try the memory management settings you provided before throwing more hardware at the problem. Do you really think that for my activity spikes (200-600k/10min) the hardware may just not be enough?
For problem #1 - memory management seems to be key. I'm not sure if there's a leak somewhere or just sub-optimal settings.
For problem #2 - I figured it was the rsyslog queues; I guess I haven't worked with rsyslog long enough to know any of the tricks.
/etc/rsyslog.conf
Code: Select all
# rsyslog v5 configuration file
# For more information see /usr/share/doc/rsyslog-*/rsyslog_conf.html
# If you experience problems, see http://www.rsyslog.com/doc/troubleshoot.html
#### MODULES ####
$ModLoad imuxsock # provides support for local system logging (e.g. via logger command)
$ModLoad imklog # provides kernel logging support (previously done by rklogd)
#$ModLoad immark # provides --MARK-- message capability
# Provides UDP syslog reception
#$ModLoad imudp
#$UDPServerRun 514
# Provides TCP syslog reception
#$ModLoad imtcp
#$InputTCPServerRun 514
#### GLOBAL DIRECTIVES ####
# Use default timestamp format
$ActionFileDefaultTemplate RSYSLOG_TraditionalFileFormat
# File syncing capability is disabled by default. This feature is usually not required,
# not useful and an extreme performance hit
#$ActionFileEnableSync on
# Include all config files in /etc/rsyslog.d/
$IncludeConfig /etc/rsyslog.d/*.conf
#### RULES ####
# Log all kernel messages to the console.
# Logging much else clutters up the screen.
#kern.* /dev/console
# Log anything (except mail) of level info or higher.
# Don't log private authentication messages!
*.info;mail.none;authpriv.none;cron.none /var/log/messages
# The authpriv file has restricted access.
authpriv.* /var/log/secure
# Log all the mail messages in one place.
mail.* -/var/log/maillog
# Log cron stuff
cron.* /var/log/cron
# Everybody gets emergency messages
*.emerg *
# Save news errors of level crit and higher in a special file.
uucp,news.crit /var/log/spooler
# Save boot messages also to boot.log
local7.* /var/log/boot.log
# ### begin forwarding rule ###
# The statement between the begin ... end define a SINGLE forwarding
# rule. They belong together, do NOT split them. If you create multiple
# forwarding rules, duplicate the whole block!
# Remote Logging (we use TCP for reliable delivery)
#
# An on-disk queue is created for this action. If the remote host is
# down, messages are spooled to disk and sent when it is up again.
#$WorkDirectory /var/lib/rsyslog # where to place spool files
#$ActionQueueFileName fwdRule1 # unique name prefix for spool files
#$ActionQueueMaxDiskSpace 1g # 1gb space limit (use as much as possible)
#$ActionQueueSaveOnShutdown on # save messages to disk on shutdown
#$ActionQueueType LinkedList # run asynchronously
#$ActionResumeRetryCount -1 # infinite retries if host is down
# remote host is: name/ip:port, e.g. 192.168.0.1:514, port optional
#*.* @@remote-host:514
# ### end of the forwarding rule ###
# A template to for higher precision timestamps + severity logging
$template SpiceTmpl,"%TIMESTAMP%.%TIMESTAMP:::date-subseconds% %syslogtag% %syslogseverity-text%:%msg:::sp-if-no-1st-sp%%msg:::drop-last-lf%\n"
:programname, startswith, "spice-vdagent" /var/log/spice-vdagent.log;SpiceTmpl
/etc/rsyslog.d/*
Code: Select all
$ModLoad imfile
$InputFilePollInterval 10
$PrivDropToGroup adm
$WorkDirectory /var/lib/rsyslog
# Input for gvsi_news_mt
$InputFileName /home/wombat/feeds/mq_reader/log/gvsi_news_mt.log
$InputFileTag gvsi_news_mt:
$InputFileStateFile nls-state-home_wombat_feeds_mq_reader_log_gvsi_news_mt.log # Must be unique for each file being polled
# Uncomment the folowing line to override the default severity for messages
# from this file.
#$InputFileSeverity info
$InputFilePersistStateInterval 20000
$InputRunFileMonitor
# Forward to Nagios Log Server and then discard, otherwise these messages
# will end up in the syslog file (/var/log/messages) unless there are other
# overriding rules.
if $programname == 'gvsi_news_mt' then @@schpnag1:5544
if $programname == 'gvsi_news_mt' then ~
$ModLoad imfile
$InputFilePollInterval 10
$PrivDropToGroup adm
$WorkDirectory /var/lib/rsyslog
# Input for mqreader_dowjones_news_mt
$InputFileName /home/wombat/feeds/mq_reader/log/mqreader_dowjones_news_mt.log
$InputFileTag mqreader_dowjones_news_mt:
$InputFileStateFile nls-state-home_wombat_feeds_mq_reader_log_mqreader_dowjones_news_mt.log # Must be unique for each file being polled
# Uncomment the folowing line to override the default severity for messages
# from this file.
#$InputFileSeverity info
$InputFilePersistStateInterval 20000
$InputRunFileMonitor
# Forward to Nagios Log Server and then discard, otherwise these messages
# will end up in the syslog file (/var/log/messages) unless there are other
# overriding rules.
if $programname == 'mqreader_dowjones_news_mt' then @@schpnag1:5544
if $programname == 'mqreader_dowjones_news_mt' then ~
$ModLoad imfile
$InputFilePollInterval 10
$PrivDropToGroup adm
$WorkDirectory /var/lib/rsyslog
# Input for mqreader_platts_news_mt
$InputFileName /home/wombat/feeds/mq_reader/log/mqreader_platts_news_mt.log
$InputFileTag mqreader_platts_news_mt:
$InputFileStateFile nls-state-home_wombat_feeds_mq_reader_log_mqreader_platts_news_mt.log # Must be unique for each file being polled
# Uncomment the folowing line to override the default severity for messages
# from this file.
#$InputFileSeverity info
$InputFilePersistStateInterval 20000
$InputRunFileMonitor
# Forward to Nagios Log Server and then discard, otherwise these messages
# will end up in the syslog file (/var/log/messages) unless there are other
# overriding rules.
if $programname == 'mqreader_platts_news_mt' then @@schpnag1:5544
if $programname == 'mqreader_platts_news_mt' then ~
$ModLoad imfile
$InputFilePollInterval 10
$PrivDropToGroup adm
$WorkDirectory /var/lib/rsyslog
# Input for mqreader_platts_pricess_mt
$InputFileName /home/wombat/feeds/mq_reader/log/mqreader_platts_pricess_mt.log
$InputFileTag mqreader_platts_pricess_mt:
$InputFileStateFile nls-state-home_wombat_feeds_mq_reader_log_mqreader_platts_pricess_mt.log # Must be unique for each file being polled
# Uncomment the folowing line to override the default severity for messages
# from this file.
#$InputFileSeverity info
$InputFilePersistStateInterval 20000
$InputRunFileMonitor
# Forward to Nagios Log Server and then discard, otherwise these messages
# will end up in the syslog file (/var/log/messages) unless there are other
# overriding rules.
if $programname == 'mqreader_platts_pricess_mt' then @@schpnag1:5544
if $programname == 'mqreader_platts_pricess_mt' then ~
$ModLoad imfile
$InputFilePollInterval 10
$PrivDropToGroup adm
$WorkDirectory /var/lib/rsyslog
# Input for mqreader_proprietary_cop_mt
$InputFileName /home/wombat/feeds/mq_reader/log/mqreader_proprietary_cop_mt.log
$InputFileTag mqreader_proprietary_cop_mt:
$InputFileStateFile nls-state-home_wombat_feeds_mq_reader_log_mqreader_proprietary_cop_mt.log # Must be unique for each file being polled
# Uncomment the folowing line to override the default severity for messages
# from this file.
#$InputFileSeverity info
$InputFilePersistStateInterval 20000
$InputRunFileMonitor
# Forward to Nagios Log Server and then discard, otherwise these messages
# will end up in the syslog file (/var/log/messages) unless there are other
# overriding rules.
if $programname == 'mqreader_proprietary_cop_mt' then @@schpnag1:5544
if $programname == 'mqreader_proprietary_cop_mt' then ~
$ModLoad imfile
$InputFilePollInterval 10
$PrivDropToGroup adm
$WorkDirectory /var/lib/rsyslog
# Input for mqreader_proprietary_daily_mt
$InputFileName /home/wombat/feeds/mq_reader/log/mqreader_proprietary_daily_mt.log
$InputFileTag mqreader_proprietary_daily_mt:
$InputFileStateFile nls-state-home_wombat_feeds_mq_reader_log_mqreader_proprietary_daily_mt.log # Must be unique for each file being polled
# Uncomment the folowing line to override the default severity for messages
# from this file.
#$InputFileSeverity info
$InputFilePersistStateInterval 20000
$InputRunFileMonitor
# Forward to Nagios Log Server and then discard, otherwise these messages
# will end up in the syslog file (/var/log/messages) unless there are other
# overriding rules.
if $programname == 'mqreader_proprietary_daily_mt' then @@schpnag1:5544
if $programname == 'mqreader_proprietary_daily_mt' then ~
$ModLoad imfile
$InputFilePollInterval 10
$PrivDropToGroup adm
$WorkDirectory /var/lib/rsyslog
# Input for mqreader_proprietary_mt
$InputFileName /home/wombat/feeds/mq_reader/log/mqreader_proprietary_mt.log
$InputFileTag mqreader_proprietary_mt:
$InputFileStateFile nls-state-home_wombat_feeds_mq_reader_log_mqreader_proprietary_mt.log # Must be unique for each file being polled
# Uncomment the folowing line to override the default severity for messages
# from this file.
#$InputFileSeverity info
$InputFilePersistStateInterval 20000
$InputRunFileMonitor
# Forward to Nagios Log Server and then discard, otherwise these messages
# will end up in the syslog file (/var/log/messages) unless there are other
# overriding rules.
if $programname == 'mqreader_proprietary_mt' then @@schpnag1:5544
if $programname == 'mqreader_proprietary_mt' then ~
$ModLoad imfile
$InputFilePollInterval 10
$PrivDropToGroup adm
$WorkDirectory /var/lib/rsyslog
# Input for mqreader_rss_news_mt
$InputFileName /home/wombat/feeds/mq_reader/log/mqreader_rss_news_mt.log
$InputFileTag mqreader_rss_news_mt:
$InputFileStateFile nls-state-home_wombat_feeds_mq_reader_log_mqreader_rss_news_mt.log # Must be unique for each file being polled
# Uncomment the folowing line to override the default severity for messages
# from this file.
#$InputFileSeverity info
$InputFilePersistStateInterval 20000
$InputRunFileMonitor
# Forward to Nagios Log Server and then discard, otherwise these messages
# will end up in the syslog file (/var/log/messages) unless there are other
# overriding rules.
if $programname == 'mqreader_rss_news_mt' then @@schpnag1:5544
if $programname == 'mqreader_rss_news_mt' then ~
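For reference, the queues doc suggests giving each forwarding action its own disk-assisted queue so a dead remote host can't back up the whole daemon - something along these lines (an untested sketch in the same legacy v5 syntax as the config above; 'fwdNagios' and the limits are placeholders to tune):

```
# Untested sketch: put these directly before ONE forwarding action; each
# action needs its own unique $ActionQueueFileName.
$WorkDirectory /var/lib/rsyslog           # where spool files are placed
$ActionQueueType LinkedList               # run the action asynchronously
$ActionQueueFileName fwdNagios            # placeholder - make unique per action
$ActionQueueMaxDiskSpace 1g               # cap for on-disk spooling
$ActionQueueSaveOnShutdown on             # keep queued messages across restarts
$ActionQueueTimeoutEnqueue 0              # discard instead of blocking when full
$ActionResumeRetryCount -1                # retry forever while the host is down
if $programname == 'gvsi_news_mt' then @@schpnag1:5544
```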
Re: Runs out of RAM, then stops reading altogether.
polarbear1 wrote: Do you really think that for my activity spikes (200-600k/10min), the hardware may just not be enough?
It depends heavily on a lot of different variables, but for Nagios Log Server, RAM is almost always going to be the limiting factor.
I recommend you take a look at the 'Instance Status' page for information regarding hardware usage - please see the highlighted areas in the attached screenshot. Please let me know how setting ES_HEAP_SIZE goes; I look forward to your response.
polarbear1 wrote: For problem #2 - I figured it was the rsyslog queues, I guess I haven't worked with rsyslog long enough to know any of the tricks.
I am not extremely familiar with rsyslog either, but this is likely a queue issue. What happens to the servers that get bogged down - do they run low on memory? CPU? Disk space? Are there any logs that tell us what might be happening to them? Any additional troubleshooting information here could be useful - I tried to reproduce your issue in-house and could not.
Re: Runs out of RAM, then stops reading altogether.
This is a screenshot from this point in time, and everything is working correctly at the moment. It is very reflective of the usual state - RAM nearly used up, CPU usage fairly low. Granted, I set ES_HEAP_SIZE to 8GB and restarted the elasticsearch process only about 24 hours ago, so I sure expect it not to die on me this fast.
Still researching the rsyslog end of things. Will report back in a few days, or whenever it fails again.
Re: Runs out of RAM, then stops reading altogether.
Thanks polarbear1 - by all means it looks like a healthy setup. Let us know if you have any further difficulty - thank you.