On NLS client rsyslog/python/httpd processing stopped
-
GhostRider2110
- Posts: 193
- Joined: Thu Oct 30, 2014 8:04 am
- Location: Indiana
- Contact:
Re: On NLS client rsyslog/python/httpd processing stopped
Interesting data point. I just had a system slowdown with the symptoms I've described in this thread. I restarted rsyslog and got over 800 of the Traceback messages from python in httpd/error_log. Then everything cleared back up. I have opened a thread
http://support.nagios.com/forum/viewtop ... 38&t=32396
about the unusually high number of processes on the NLS at the same time this started happening. I am wondering if there is a two-fold problem: something is slowing the processing of the rsyslog stream from the client, which in turn is causing the python/wsgi log handler to back up.
See-ya
Mitch
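For context on the hypothesized chain: a mod_wsgi application that logs through Python's SysLogHandler holds a unix-domain socket to the local syslog daemon, so a stall or restart on the rsyslog side is felt directly by the handler. A minimal sketch, using a throwaway socket path standing in for /dev/log and a made-up logger name:

```python
import logging
import logging.handlers
import os
import socket
import tempfile

# Throwaway unix datagram socket playing the role that rsyslog's /dev/log
# socket plays in production.
sock_path = os.path.join(tempfile.mkdtemp(), "log.sock")
server = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
server.bind(sock_path)

# Wire a logger to it the way a mod_wsgi app typically logs to syslog.
logger = logging.getLogger("lrms-dev")  # logger name is illustrative
handler = logging.handlers.SysLogHandler(address=sock_path)
logger.addHandler(handler)
logger.warning("hello from wsgi")

# The datagram arrives on the "daemon" side. If the socket file vanished
# instead (rsyslog restarting), emit() would fail just like the Tracebacks.
msg, _ = server.recvfrom(4096)
print(b"hello from wsgi" in msg)  # True
```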
-
GhostRider2110
- Posts: 193
- Joined: Thu Oct 30, 2014 8:04 am
- Location: Indiana
- Contact:
Re: On NLS client rsyslog/python/httpd processing stopped
I finally had to restart elasticsearch on the NLS. I actually had to run the restart twice:
About 3-4 minutes after that, all the extra poller processes cleared up. You can see the reduction in log processing in the attached image from the dashboard home page on the NLS. (I also could not get the NLS web page to respond until after restarting elasticsearch and httpd.)
Before restarting elasticsearch and httpd, the client I was having problems with seemed to "clog" (for lack of a better term) twice more. Since the restart/reset it has been fine.
Code: Select all
[root@IGAnagioslog ~]# /etc/init.d/elasticsearch restart
Stopping elasticsearch: [FAILED]
Starting elasticsearch: [ OK ]
[root@IGAnagioslog ~]# /etc/init.d/elasticsearch restart
Stopping elasticsearch: [ OK ]
Starting elasticsearch: [ OK ]
See-ya
Mitch
Re: On NLS client rsyslog/python/httpd processing stopped
Mitch,
When you restarted elasticsearch on NLS, the clients started behaving normally - is that correct? In addition to this, the only thing out of the ordinary is the excessive jobs?
-
GhostRider2110
- Posts: 193
- Joined: Thu Oct 30, 2014 8:04 am
- Location: Indiana
- Contact:
Re: On NLS client rsyslog/python/httpd processing stopped
I still had to restart rsyslog on the client. Since I have only just had the "epiphany" about the relationship between the two problems, I can't say for sure that restarting elasticsearch alone will allow the client to start processing as it should. From the looks of the Traceback, I don't think it would. But the combination of restarting elasticsearch on the NLS and rsyslog on the client does seem to give a longer-lasting fix. Before restarting elasticsearch on the NLS, I had cleared the client at least 2, maybe 3, times by restarting rsyslog.
See-ya
Mitch
Re: On NLS client rsyslog/python/httpd processing stopped
How often does this occur? If you can reproduce this issue, could you strace rsyslog and provide us with the output? I'm wondering if the strace would help us out:
Code: Select all
strace -p <pid> -o output.txt
You could also strace when you expect the python exception to occur - which would likely be during a restart of rsyslog:
Code: Select all
strace -o output2.txt service rsyslog restart
Below are some resources I found. I don't know if you'll find them helpful, but I figured I'd include them.
https://docs.python.org/2/library/logging.handlers.html
https://lists.secondlife.com/pipermail/ ... 01095.html
-
GhostRider2110
- Posts: 193
- Joined: Thu Oct 30, 2014 8:04 am
- Location: Indiana
- Contact:
Re: On NLS client rsyslog/python/httpd processing stopped
I can't seem to reproduce the problem at will, but the next time I see it I will run the strace.
See-ya
--Mitch
Re: On NLS client rsyslog/python/httpd processing stopped
Sounds great - I look forward to hearing back. Thanks Mitch.
-
GhostRider2110
- Posts: 193
- Joined: Thu Oct 30, 2014 8:04 am
- Location: Indiana
- Contact:
Re: On NLS client rsyslog/python/httpd processing stopped
Ok, one of the redundant repository systems was slow to respond, showing the same symptoms as when there seems to be a problem with rsyslog and python.
I captured an strace of the restart of rsyslog and of the lrms-dev processes, which are just mod_wsgi processes from httpd. I attached to each of them in the background and restarted rsyslog.
One note: logstash died on me last night. I'm not sure if that is related, but I do know that when things are not functioning correctly on the NLS, I have seen this on the clients.
Restarting rsyslog got me about 1000 Traceback entries in httpd/error_log, all looking like this:
Code: Select all
[Thu May 28 09:20:39 2015] [error] Traceback (most recent call last):
[Thu May 28 09:20:39 2015] [error] File "/usr/lib64/python2.6/logging/handlers.py", line 803, in emit
[Thu May 28 09:20:39 2015] [error] self._connect_unixsocket(self.address)
[Thu May 28 09:20:39 2015] [error] File "/usr/lib64/python2.6/logging/handlers.py", line 737, in _connect_unixsocket
[Thu May 28 09:20:39 2015] [error] self.socket.connect(address)
[Thu May 28 09:20:39 2015] [error] File "<string>", line 1, in connect
[Thu May 28 09:20:39 2015] [error] error: [Errno 2] No such file or directory
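For what it's worth, the [Errno 2] there is ENOENT coming from the unix-socket connect() that SysLogHandler performs (the handlers.py frames above). A minimal sketch, using a made-up socket path, reproduces the same error whenever the daemon's socket file is missing, e.g. mid-restart of rsyslog:

```python
import errno
import os
import socket
import tempfile

# Made-up path standing in for /dev/log while rsyslog is down or restarting:
# the directory exists, the socket file does not.
missing = os.path.join(tempfile.mkdtemp(), "log.sock")

s = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
try:
    s.connect(missing)  # same call SysLogHandler._connect_unixsocket makes
except socket.error as e:
    print(e.errno == errno.ENOENT)  # True -> "[Errno 2] No such file or directory"
finally:
    s.close()
```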
The trace of the individual processes only produced one line:
Code: Select all
[root@igapubrep01 ~]# cat lrms-dev-3291-trace.txt
restart_syscall(<... resuming interrupted call ...> <unfinished ...>
[root@igapubrep01 ~]#
Only access/error logs for httpd and the default logs are configured to send to the NLS, and they were set up via the script provided by the NLS:
Code: Select all
[root@igapubrep01 rsyslog.d]# cat 90-nagioslogserver_var_log_httpd_access_log.conf
$ModLoad imfile
$InputFilePollInterval 10
$PrivDropToGroup adm
$WorkDirectory /var/lib/rsyslog
# Input for apache_access
$InputFileName /var/log/httpd/access_log
$InputFileTag apache_access:
$InputFileStateFile nls-state-var_log_httpd_access_log # Must be unique for each file being polled
# Uncomment the folowing line to override the default severity for messages
# from this file.
#$InputFileSeverity info
$InputFilePersistStateInterval 20000
$InputRunFileMonitor
# Forward to Nagios Log Server and then discard, otherwise these messages
# will end up in the syslog file (/var/log/messages) unless there are other
# overriding rules.
if $programname == 'apache_access' then @@iganagioslog.iga.local:5544
if $programname == 'apache_access' then ~
[root@igapubrep01 rsyslog.d]# cat 90-nagioslogserver_var_log_httpd_error_log.conf
$ModLoad imfile
$InputFilePollInterval 10
$PrivDropToGroup adm
$WorkDirectory /var/lib/rsyslog
# Input for apache_error
$InputFileName /var/log/httpd/error_log
$InputFileTag apache_error:
$InputFileStateFile nls-state-var_log_httpd_error_log # Must be unique for each file being polled
# Uncomment the folowing line to override the default severity for messages
# from this file.
#$InputFileSeverity info
$InputFilePersistStateInterval 20000
$InputRunFileMonitor
# Forward to Nagios Log Server and then discard, otherwise these messages
# will end up in the syslog file (/var/log/messages) unless there are other
# overriding rules.
if $programname == 'apache_error' then @@iganagioslog.iga.local:5544
if $programname == 'apache_error' then ~
[root@igapubrep01 rsyslog.d]# cat 99-nagioslogserver.conf
### Begin forwarding rule for Nagios Log Server NAGIOSLOGSERVER
$WorkDirectory /var/lib/rsyslog # Where spool files will live NAGIOSLOGSERVER
$ActionQueueFileName nlsFwdRule0 # Unique name prefix for spool files NAGIOSLOGSERVER
$ActionQueueMaxDiskSpace 1g # 1GB space limit (use as much as possible) NAGIOSLOGSERVER
$ActionQueueSaveOnShutdown on # Save messages to disk on shutdown NAGIOSLOGSERVER
$ActionQueueType LinkedList # Use asynchronous processing NAGIOSLOGSERVER
$ActionResumeRetryCount -1 # Infinite retries if host is down NAGIOSLOGSERVER
# Remote host is: name/ip:port, e.g. 192.168.0.1:514, port optional NAGIOSLOGSERVER
:msg, contains, "START: nrpe pid" ~
:msg, contains, "EXIT: nrpe status=0" ~
*.* @@iganagioslog.iga.local:5544 # NAGIOSLOGSERVER
### End of Nagios Log Server forwarding rule NAGIOSLOGSERVER
See-ya
Mitch
-
GhostRider2110
- Posts: 193
- Joined: Thu Oct 30, 2014 8:04 am
- Location: Indiana
- Contact:
Re: On NLS client rsyslog/python/httpd processing stopped
Sidenote: I forgot to add that after I restarted logstash and reset the client that was showing problems, it was still giving me sluggish responses. I went and looked at the NLS: logs were still not being processed, so I restarted elasticsearch and everything started working again.
See-ya
Mitch
Re: On NLS client rsyslog/python/httpd processing stopped
I almost wonder if you are hitting some sort of open-file limit on the sending machines. That could cause logs to queue up in memory in rsyslog (I believe it holds on to messages in the event that it cannot send them), and then something clears them up and rsyslog sends them en masse? Just a thought.
Let's see the output of this from your sending machines.
Code: Select all
ulimit -a
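Alongside ulimit, the same file-descriptor numbers can be read programmatically; a minimal Python sketch using the standard resource module (nothing here is NLS-specific):

```python
import resource

# Soft and hard limits on open file descriptors -- what `ulimit -n` reports.
# A process holding many spooled files/connections hits the soft limit first.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open files: soft=%d hard=%d" % (soft, hard))
```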
Former Nagios employee