On NLS client rsyslog/python/httpd processing stopped
-
GhostRider2110
- Posts: 193
- Joined: Thu Oct 30, 2014 8:04 am
- Location: Indiana
- Contact:
Re: On NLS client rsyslog/python/httpd processing stopped
Interesting data point. I just had a system slowdown with the symptoms I've described in this thread. I restarted rsyslog and got over 800 of the Traceback messages from python in httpd/error_log. Then everything cleared back up. I have opened a thread
http://support.nagios.com/forum/viewtop ... 38&t=32396
about the unusually high number of processes on the NLS at the same time this started happening. I am wondering if there is a two-fold problem: something is slowing the processing of the rsyslog stream from the client, which in turn is causing the python/wsgi log handler to back up.
See-ya
Mitch
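For context on the hypothesized chain: a mod_wsgi application that logs through Python's SysLogHandler holds a unix-domain socket to the local syslog daemon, so a stall or restart on the rsyslog side is felt directly by the handler. A minimal sketch, using a throwaway socket path standing in for /dev/log and a made-up logger name:

```python
import logging
import logging.handlers
import os
import socket
import tempfile

# Throwaway unix datagram socket playing the role that rsyslog's /dev/log
# socket plays in production.
sock_path = os.path.join(tempfile.mkdtemp(), "log.sock")
server = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
server.bind(sock_path)

# Wire a logger to it the way a mod_wsgi app typically logs to syslog.
logger = logging.getLogger("lrms-dev")  # logger name is illustrative
handler = logging.handlers.SysLogHandler(address=sock_path)
logger.addHandler(handler)
logger.warning("hello from wsgi")

# The datagram arrives on the "daemon" side. If the socket file vanished
# instead (rsyslog restarting), emit() would fail just like the Tracebacks.
msg, _ = server.recvfrom(4096)
print(b"hello from wsgi" in msg)  # True
```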
-
GhostRider2110
- Posts: 193
- Joined: Thu Oct 30, 2014 8:04 am
- Location: Indiana
- Contact:
Re: On NLS client rsyslog/python/httpd processing stopped
I finally had to restart elasticsearch on the NLS. I actually had to run the restart twice:
About 3-4 minutes after that, all the extra poller processes cleared up. You can see the reduction in log processing in the attached image from the dashboard home page on the NLS. (I also could not get the NLS web page to respond until after restarting elasticsearch and httpd.)
Before restarting elasticsearch and httpd, the client I was having problems with seemed to "clog" (for lack of a better term) twice more. Since the restart/reset it has been fine.
Code: Select all
[root@IGAnagioslog ~]# /etc/init.d/elasticsearch restart
Stopping elasticsearch: [FAILED]
Starting elasticsearch: [ OK ]
[root@IGAnagioslog ~]# /etc/init.d/elasticsearch restart
Stopping elasticsearch: [ OK ]
Starting elasticsearch: [ OK ]
See-ya
Mitch
Re: On NLS client rsyslog/python/httpd processing stopped
Mitch,
When you restarted elasticsearch on NLS, the clients started behaving normally - is that correct? In addition to this, the only thing out of the ordinary is the excessive jobs?
-
GhostRider2110
- Posts: 193
- Joined: Thu Oct 30, 2014 8:04 am
- Location: Indiana
- Contact:
Re: On NLS client rsyslog/python/httpd processing stopped
I still had to restart rsyslog on the client. Since I have only just had the "epiphany" about the relationship between the two problems, I can't say for sure that restarting elasticsearch alone will allow the client to start processing as it should. From the looks of the Traceback, I don't think it would. But the combination of restarting elasticsearch on the NLS and rsyslog on the client does seem to give a longer-lasting fix. Before restarting elasticsearch on the NLS, I had cleared the client at least 2, maybe 3, times by restarting rsyslog.
See-ya
Mitch
Re: On NLS client rsyslog/python/httpd processing stopped
How often does this occur? If you can reproduce this issue, could you strace rsyslog and provide us with the output? I'm wondering if the strace would help us out:
Code: Select all
strace -p <pid> -o output.txt
You could also strace when you expect the python exception to occur - which would likely be during a restart of rsyslog:
Code: Select all
strace -o output2.txt service rsyslog restart
Below are some resources I found. I don't know if you'll find them helpful, but I figured I'd include them.
https://docs.python.org/2/library/logging.handlers.html
https://lists.secondlife.com/pipermail/ ... 01095.html
-
GhostRider2110
- Posts: 193
- Joined: Thu Oct 30, 2014 8:04 am
- Location: Indiana
- Contact:
Re: On NLS client rsyslog/python/httpd processing stopped
I can't seem to reproduce the problem at will, but the next time I see it I will run the strace.
See-ya
--Mitch
Re: On NLS client rsyslog/python/httpd processing stopped
Sounds great - I look forward to hearing back. Thanks Mitch.
-
GhostRider2110
- Posts: 193
- Joined: Thu Oct 30, 2014 8:04 am
- Location: Indiana
- Contact:
Re: On NLS client rsyslog/python/httpd processing stopped
Ok, one of the redundant repository systems was slow to respond, showing the same symptoms as when there seems to be a problem with rsyslog and python.
I captured an strace of the restart of rsyslog and of the lrms-dev processes, which are just mod_wsgi processes from httpd. I attached to each of them in the background and restarted rsyslog.
One note: logstash died on me last night. I'm not sure if that is related, but I do know that when things are not functioning correctly on the NLS, I have seen this on the clients.
Restarting rsyslog got me about 1000 Traceback entries in httpd/error_log, all looking like this:
Code: Select all
[Thu May 28 09:20:39 2015] [error] Traceback (most recent call last):
[Thu May 28 09:20:39 2015] [error] File "/usr/lib64/python2.6/logging/handlers.py", line 803, in emit
[Thu May 28 09:20:39 2015] [error] self._connect_unixsocket(self.address)
[Thu May 28 09:20:39 2015] [error] File "/usr/lib64/python2.6/logging/handlers.py", line 737, in _connect_unixsocket
[Thu May 28 09:20:39 2015] [error] self.socket.connect(address)
[Thu May 28 09:20:39 2015] [error] File "<string>", line 1, in connect
[Thu May 28 09:20:39 2015] [error] error: [Errno 2] No such file or directory
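For what it's worth, the [Errno 2] there is ENOENT coming from the unix-socket connect() that SysLogHandler performs (the handlers.py frames above). A minimal sketch, using a made-up socket path, reproduces the same error whenever the daemon's socket file is missing, e.g. mid-restart of rsyslog:

```python
import errno
import os
import socket
import tempfile

# Made-up path standing in for /dev/log while rsyslog is down or restarting:
# the directory exists, the socket file does not.
missing = os.path.join(tempfile.mkdtemp(), "log.sock")

s = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
try:
    s.connect(missing)  # same call SysLogHandler._connect_unixsocket makes
except socket.error as e:
    print(e.errno == errno.ENOENT)  # True -> "[Errno 2] No such file or directory"
finally:
    s.close()
```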
The trace of the individual processes only produced one line:
Code: Select all
[root@igapubrep01 ~]# cat lrms-dev-3291-trace.txt
restart_syscall(<... resuming interrupted call ...> <unfinished ...>
[root@igapubrep01 ~]#
Only access/error logs for httpd and the default logs are configured to send to the NLS, and they were set up via the script provided by the NLS:
Code: Select all
[root@igapubrep01 rsyslog.d]# cat 90-nagioslogserver_var_log_httpd_access_log.conf
$ModLoad imfile
$InputFilePollInterval 10
$PrivDropToGroup adm
$WorkDirectory /var/lib/rsyslog
# Input for apache_access
$InputFileName /var/log/httpd/access_log
$InputFileTag apache_access:
$InputFileStateFile nls-state-var_log_httpd_access_log # Must be unique for each file being polled
# Uncomment the folowing line to override the default severity for messages
# from this file.
#$InputFileSeverity info
$InputFilePersistStateInterval 20000
$InputRunFileMonitor
# Forward to Nagios Log Server and then discard, otherwise these messages
# will end up in the syslog file (/var/log/messages) unless there are other
# overriding rules.
if $programname == 'apache_access' then @@iganagioslog.iga.local:5544
if $programname == 'apache_access' then ~
[root@igapubrep01 rsyslog.d]# cat 90-nagioslogserver_var_log_httpd_error_log.conf
$ModLoad imfile
$InputFilePollInterval 10
$PrivDropToGroup adm
$WorkDirectory /var/lib/rsyslog
# Input for apache_error
$InputFileName /var/log/httpd/error_log
$InputFileTag apache_error:
$InputFileStateFile nls-state-var_log_httpd_error_log # Must be unique for each file being polled
# Uncomment the folowing line to override the default severity for messages
# from this file.
#$InputFileSeverity info
$InputFilePersistStateInterval 20000
$InputRunFileMonitor
# Forward to Nagios Log Server and then discard, otherwise these messages
# will end up in the syslog file (/var/log/messages) unless there are other
# overriding rules.
if $programname == 'apache_error' then @@iganagioslog.iga.local:5544
if $programname == 'apache_error' then ~
[root@igapubrep01 rsyslog.d]# cat 99-nagioslogserver.conf
### Begin forwarding rule for Nagios Log Server NAGIOSLOGSERVER
$WorkDirectory /var/lib/rsyslog # Where spool files will live NAGIOSLOGSERVER
$ActionQueueFileName nlsFwdRule0 # Unique name prefix for spool files NAGIOSLOGSERVER
$ActionQueueMaxDiskSpace 1g # 1GB space limit (use as much as possible) NAGIOSLOGSERVER
$ActionQueueSaveOnShutdown on # Save messages to disk on shutdown NAGIOSLOGSERVER
$ActionQueueType LinkedList # Use asynchronous processing NAGIOSLOGSERVER
$ActionResumeRetryCount -1 # Infinite retries if host is down NAGIOSLOGSERVER
# Remote host is: name/ip:port, e.g. 192.168.0.1:514, port optional NAGIOSLOGSERVER
:msg, contains, "START: nrpe pid" ~
:msg, contains, "EXIT: nrpe status=0" ~
*.* @@iganagioslog.iga.local:5544 # NAGIOSLOGSERVER
### End of Nagios Log Server forwarding rule NAGIOSLOGSERVER
See-ya
Mitch
-
GhostRider2110
- Posts: 193
- Joined: Thu Oct 30, 2014 8:04 am
- Location: Indiana
- Contact:
Re: On NLS client rsyslog/python/httpd processing stopped
Sidenote: I forgot to add that after I restarted logstash and reset the client that was showing problems, it was still giving me sluggish responses. I went and looked at the NLS: logs were still not being processed, so I restarted elasticsearch and everything started working again.
See-ya
Mitch
Re: On NLS client rsyslog/python/httpd processing stopped
I almost wonder if you are hitting some sort of open-file limit on the sending machines. That could cause logs to queue up in memory in rsyslog (I believe it holds on to messages in the event that it cannot send them), and then something clears them up and rsyslog sends them en masse? Just a thought.
Let's see the output of this from your sending machines.
Code: Select all
ulimit -a
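Alongside ulimit, the same file-descriptor numbers can be read programmatically; a minimal Python sketch using the standard resource module (nothing here is NLS-specific):

```python
import resource

# Soft and hard limits on open file descriptors -- what `ulimit -n` reports.
# A process holding many spooled files/connections hits the soft limit first.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("open files: soft=%d hard=%d" % (soft, hard))
```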
Former Nagios employee