Rsyslog: Abandoned Spool Files

This support forum board is for support questions relating to Nagios Log Server, our solution for managing and monitoring critical log data.
polarbear1
Posts: 73
Joined: Mon Apr 13, 2015 4:26 pm

Rsyslog: Abandoned Spool Files

Post by polarbear1 »

Hi,

This is a pure rsyslog question, but maybe you ran into this problem before.

I have a disk-assisted queue configured that generally works well. When a connection is interrupted, a .qi file is created, and then numbered spool files (.00000001, .00000002, .00000003, ...) are created as space is needed. When we reconnect, the spool files are processed and removed from the $WorkDirectory, with two exceptions: the .qi file (which remains permanently by design, and that isn't a problem) and the last spool file (e.g. myqueue.00000531), which appears never to be processed. That is a problem, because those are messages that should have been sent to NLS but never made it there. Restarting the rsyslog process appears to clean it up, so in a pinch nightly rsyslog restarts would be a band-aid, but I'm hoping to find a better solution.
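For reference, the nightly-restart band-aid could be scheduled with a cron entry along these lines (a sketch only; the service name is what I'd expect on RHEL 6, and the time is arbitrary):

```
# /etc/cron.d/rsyslog-restart (hypothetical) - restart rsyslog nightly
# so any abandoned spool files are replayed on startup
30 3 * * * root /sbin/service rsyslog restart
```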

Red Hat version

Code:

[root@schtwb03 ~]# cat /etc/redhat-release && uname -rms
Red Hat Enterprise Linux Server release 6.6 (Santiago)
Linux 2.6.32-504.1.3.el6.x86_64 x86_64
Rsyslog version - not the newest, but one of the later 7-series builds and the latest officially shipped for my flavor of RHEL.

Code:

[root@schtwb03 ~]# rsyslogd -version
rsyslogd 7.4.10, compiled with:
        FEATURE_REGEXP:                         Yes
        FEATURE_LARGEFILE:                      No
        GSSAPI Kerberos 5 support:              Yes
        FEATURE_DEBUG (debug build, slow code): No
        32bit Atomic operations supported:      Yes
        64bit Atomic operations supported:      Yes
        Runtime Instrumentation (slow code):    No
        uuid support:                           Yes

It looks like the .qi file is updated after the last spool file is created. (I moved my $WorkDirectory here for partition-sizing reasons.)

Code:

[root@schtwb03 ~]# ls -l /home/logs | grep -v .log
total 884
-rw------- 1 root adm 486012 Jul 29 18:55 iceeumt.00000123
-rw------- 1 root adm    495 Jul 29 18:57 iceeumt.qi
Rsyslog config file (names/directories changed for privacy):

Code:

[root@schtwb03 ~]
$ModLoad imfile
$InputFilePollInterval 10
$PrivDropToGroup adm
$WorkDirectory /home/logs

# Input for ice_eu_mt
$InputFileName /home/log/ice_eu_mt.log
$InputFileTag ice_eu_mt:
$InputFileStateFile nls-state-home_log_ice_eu_mt.log # Must be unique for each file being polled
# Uncomment the following line to override the default severity for messages
# from this file.
#$InputFileSeverity info
$InputFilePersistStateInterval 20000
$InputRunFileMonitor

# Forward to Nagios Log Server and then discard, otherwise these messages
# will end up in the syslog file (/var/log/messages) unless there are other
# overriding rules.
#Buffer Settings
$ActionResumeInterval 10
$ActionQueueSize 100000
$ActionQueueDiscardMark 97500
$ActionQueueHighWaterMark 80000
$ActionQueueType LinkedList
$ActionQueueFileName iceeumt
$ActionQueueCheckpointInterval 100
$ActionQueueMaxDiskSpace 500m
$ActionResumeRetryCount -1
$ActionQueueSaveOnShutdown on
$ActionQueueTimeoutEnqueue 0
$ActionQueueDiscardSeverity 0
if $programname == 'ice_eu_mt' then @@nls:5544
if $programname == 'ice_eu_mt' then ~
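For what it's worth, I believe the same action and queue settings could be written in the newer RainerScript syntax available in rsyslog v7 (an untested sketch on my part, translating the legacy directives above one-for-one; "nls" and port 5544 are the same placeholders as before):

```
if $programname == 'ice_eu_mt' then {
    action(type="omfwd" target="nls" port="5544" protocol="tcp"
           action.resumeInterval="10" action.resumeRetryCount="-1"
           queue.type="LinkedList" queue.filename="iceeumt"
           queue.size="100000" queue.discardMark="97500"
           queue.highWatermark="80000" queue.checkpointInterval="100"
           queue.maxDiskSpace="500m" queue.saveOnShutdown="on"
           queue.timeoutEnqueue="0" queue.discardSeverity="0")
    stop
}
```

The trailing `stop` replaces the legacy discard action (`~`) so the messages don't also land in /var/log/messages.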
In my Googling, I found the recover_qi.pl script, but to my understanding it rebuilds the .qi file to work around an old rsyslog bug where the .qi file went missing. In my case, the .qi file is there, and when I tried the script it didn't do anything for me. I also went through the changelogs for later rsyslog versions (later 7 releases, but also 8 releases) looking for a hint that this was a fixed bug, and was unsuccessful there too.

So, short of nightly rsyslog restarts, any other ideas?

Thanks.
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Rsyslog: Abandoned Spool Files

Post by jolson »

Is this issue reproducible in any way? I would like to spin up a test environment and spend some time on this problem, but it's hard to troubleshoot from this end, and you've certainly done your research.

Is this happening on more than one of your endpoints?
Twits Blog
Show me a man who lives alone and has a perpetually clean kitchen, and 8 times out of 9 I'll show you a man with detestable spiritual qualities.
polarbear1
Posts: 73
Joined: Mon Apr 13, 2015 4:26 pm

Re: Rsyslog: Abandoned Spool Files

Post by polarbear1 »

Yes, it is. I have 8 Linux endpoints going, and at this time 4 have this problem on a queue. The first time I caught this being an issue was earlier this week, and I did an rsyslog restart on all the servers to address it - this was on the 27th around mid-day. I also set up a cron job to monitor the work directory: if it exceeds 1000 KB (with no spool files, the directory is ~500 KB on average), it sends me an email with the current size, just so I can monitor the status. The good (?) news is that the directory size isn't really growing... much.
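In case it helps anyone, a variant of that check that looks for stale spool files directly rather than watching directory size could look like this (a sketch; the directory, naming pattern, and 60-minute threshold are assumptions for my setup):

```shell
# find_stale_spool DIR AGE_MIN: list numbered rsyslog spool files
# (e.g. iceeumt.00000123) in DIR that are older than AGE_MIN minutes.
# The persistent .qi index file is excluded, since it is expected to stay.
find_stale_spool() {
    find "$1" -maxdepth 1 -type f -name '*.[0-9]*' ! -name '*.qi' -mmin +"$2"
}

# Example use from cron: alert if anything stale is found
# stale=$(find_stale_spool /home/logs 60)
# [ -n "$stale" ] && echo "Stale spool files: $stale" | mail -s "rsyslog spool alert" root
```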

It's not just that some of the queues don't empty; every queue that went to disk (i.e., a .qi file was created, which we don't expect to go away) also has an associated numbered spool file just sitting there. And it's not like I caught it in a situation where it just hasn't processed the queue yet: since my last rsyslog restart on the 27th, there is a spool file from later that day still sitting there.


As far as how to reproduce it, I have no idea. I didn't do anything crazy. I have a huge spike in logs at end of day, so I'm guessing that's when it goes to the queue because we're hitting some kind of bottleneck. Shortly afterward it processes the spool files, but leaves one (what I assume is the highest-numbered) spool file behind.
jolson
Attack Rabbit
Posts: 2560
Joined: Thu Feb 12, 2015 12:40 pm

Re: Rsyslog: Abandoned Spool Files

Post by jolson »

I've set up a couple of Linux hosts on my side and purposely failed Logstash on my NLS cluster - I will spin Logstash back up on Monday morning and see if all of the spool files are processed properly. Hopefully I'll be able to reproduce the problem you're experiencing and find a little-known config value or similar. Thanks!