The RAMDisk filled up again today, this time on the original server.
Code:
# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root
27G 8.2G 18G 32% /
tmpfs 1.9G 1.0M 1.9G 1% /dev/shm
/dev/mapper/VolGroup-lv_app
50G 6.4G 41G 14% /usr/local
/dev/sda1 477M 66M 386M 15% /boot
tmpfs 500M 500M 0 100% /var/nagiosramdisk
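A minimal sketch of how the same check could be scripted, so a full mount is flagged without reading the table by eye. The function name `full_mounts` and the default threshold of 90 are assumptions of mine, not part of Nagios XI. It expects `df -P` output, because `-P` keeps each filesystem on one line (unlike the wrapped LVM entries above), and it assumes mount points contain no spaces.

```shell
# Print "<mount point> <use%>" for every filesystem at or above a threshold.
# Reads `df -P` style output on stdin. The function name and the default
# threshold (90) are illustrative assumptions.
full_mounts() {
    threshold="${1:-90}"
    awk -v t="$threshold" '
        NR > 1 {                 # skip the header line
            use = $5
            sub(/%/, "", use)    # strip the trailing percent sign
            if (use + 0 >= t) print $6, $5
        }'
}

# Usage: df -P | full_mounts 95
```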
I followed the same process: identified the deleted files that were still open (this time I also copied them to /tmp) and restarted Nagios. The files were quite small and didn't add up to the total space used like the last time. I will PM you the files as requested.
Code:
# lsof | grep deleted
nagios 4140 nagios 14w REG 0,17 11856 199318204 /var/nagiosramdisk/spool/perfdata/1451951648.perfdata.host-PID-10992 (deleted)
nagios 4140 nagios 15w REG 0,17 71264 199318199 /var/nagiosramdisk/spool/perfdata/1451951648.perfdata.service-PID-10993 (deleted)
# ps -ef | grep 4140
nagios 4140 4071 0 Jan05 ? 00:00:17 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
# ls -l /proc/4140/fd/14
l-wx------ 1 root root 64 Jan 13 11:18 /proc/4140/fd/14 -> /var/nagiosramdisk/spool/perfdata/1451951648.perfdata.host-PID-10992 (deleted)
# ls -l /proc/4140/fd/15
l-wx------ 1 root root 64 Jan 13 11:18 /proc/4140/fd/15 -> /var/nagiosramdisk/spool/perfdata/1451951648.perfdata.service-PID-10993 (deleted)
# cp /proc/4140/fd/14 /tmp/1451951648.perfdata.host-PID-10992
# cp /proc/4140/fd/15 /tmp/1451951648.perfdata.service-PID-10993
# ls -la /tmp/1451951648.perfdata.*
-rw-r--r-- 1 root root 11856 Jan 13 11:21 /tmp/1451951648.perfdata.host-PID-10992
-rw-r--r-- 1 root root 71264 Jan 13 11:22 /tmp/1451951648.perfdata.service-PID-10993
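For comparing the deleted-but-open files against the space df reports as used, a small helper can total the SIZE/OFF column of the lsof output. This is a rough sketch: the name `sum_deleted_bytes` is made up, and it assumes the default lsof column layout shown above, where the size is the 7th field. Some lsof versions add a TID column, which would shift the fields. For the two files listed here the total is 11856 + 71264 = 83120 bytes, roughly 81 KiB, nowhere near the 500M the tmpfs reports as used.

```shell
# Sum the SIZE/OFF column (field 7) of lsof lines that end in "(deleted)".
# Reads lsof output on stdin. Assumes the default column layout:
# COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
sum_deleted_bytes() {
    awk '/\(deleted\)$/ { total += $7 } END { print total + 0 }'
}

# Usage: lsof -nP | sum_deleted_bytes
```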
However, the RAMDisk's used space did not go down. I checked the RAMDisk and no files were visible, deleted or active.
I ran ps to see if another Nagios process was running. There wasn't, but this time I found 261 cron-initiated /usr/local/nagiosxi/cron/recurringdowntime.pl processes running, some dating back to last year.
Sample below:
Code:
nagios 569 556 0 Jan01 ? 00:00:00 /bin/sh -c /usr/local/nagiosxi/cron/recurringdowntime.pl > /usr/local/nagiosxi/var/recurringdowntime.log 2>&1
nagios 576 569 0 Jan01 ? 00:00:00 /usr/bin/perl /usr/local/nagiosxi/cron/recurringdowntime.pl
nagios 612 602 0 2015 ? 00:00:00 /bin/sh -c /usr/local/nagiosxi/cron/recurringdowntime.pl > /usr/local/nagiosxi/var/recurringdowntime.log 2>&1
nagios 624 612 0 2015 ? 00:00:00 /usr/bin/perl /usr/local/nagiosxi/cron/recurringdowntime.pl
nagios 1076 1064 0 Jan03 ? 00:00:00 /bin/sh -c /usr/local/nagiosxi/cron/recurringdowntime.pl > /usr/local/nagiosxi/var/recurringdowntime.log 2>&1
nagios 1084 1076 0 Jan03 ? 00:00:00 /usr/bin/perl /usr/local/nagiosxi/cron/recurringdowntime.pl
nagios 1232 1221 0 2015 ? 00:00:00 /bin/sh -c /usr/local/nagiosxi/cron/recurringdowntime.pl > /usr/local/nagiosxi/var/recurringdowntime.log 2>&1
nagios 1239 1232 0 2015 ? 00:00:00 /usr/bin/perl /usr/local/nagiosxi/cron/recurringdowntime.pl
Restarting the crond service cleared the RAMDisk issue. I then manually killed all the CROND processes that had a parent PID of 1.
Tail of /usr/local/nagiosxi/var/recurringdowntime.log:
Code:
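Picking out the leftover processes by hand is tedious with a couple of hundred of them, so here is a sketch of one way to list them. The helper name `orphaned_pids` is hypothetical. It reads `ps -eo pid,ppid,args` output and prints the PID of every process whose parent is PID 1 and whose command line contains the given string.

```shell
# Print PIDs of processes whose parent is PID 1 and whose command line
# contains the given string. Reads `ps -eo pid,ppid,args` output on stdin.
# The function name is an illustrative assumption.
orphaned_pids() {
    pattern="$1"
    awk -v pat="$pattern" '
        NR > 1 && $2 == 1 && index($0, pat) { print $1 }'   # NR > 1 skips the header
}

# Usage: ps -eo pid,ppid,args | orphaned_pids recurringdowntime.pl
```

It only prints PIDs, so the list can be reviewed before anything is passed to kill.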
nd=13;nm=0;ny=116
Current candidate(dow): 19:00 on 13/1/2016
Checking days of week: days (0,1,2,3,4,5,6) are valid
Scheduling for day 3 (today is 3, looking at scheds for 3 and later)
nd=13;nm=0;ny=116
dow: 3
lst: 0
nd=13;nm=0;ny=116
Current candidate: 19:00 on 13/1/2016
Scheduling service XXX.YYY:Memory Used - Wintel
ERROR: Invalid service 1452675600 on host XXX.YYY!
So, not helpful in that the RAMDisk filled again, but with different symptoms this time.