RAMDISK full

Post by **Fred Kroeger** » Sun Jan 10, 2016 6:14 pm

The RAMDisk had also been increased to 500M back in December, when I changed the original server.

top - 07:07:04 up 25 days, 13:53,  1 user,  load average: 3.74, 3.74, 3.61
Tasks: 264 total,   4 running, 259 sleeping,   0 stopped,   1 zombie
Cpu(s): 29.9%us,  7.7%sy,  0.0%ni, 58.5%id,  2.1%wa,  0.2%hi,  1.5%si,  0.0%st
Mem:   8061552k total,  6361216k used,  1700336k free,    68788k buffers
Swap:  2359288k total,    51012k used,  2308276k free,  3465736k cached

Monitoring 862 Hosts & 6148 Services

Post by **ssax** » Mon Jan 11, 2016 4:42 pm

Is there any chance that you could grab a copy of that deleted file, zip it up, and PM it to us so that we can take a look at what is in there?

http://www.serverwatch.com/tutorials/ar ... h-lsof.htm

Post by **Fred Kroeger** » Mon Jan 11, 2016 5:18 pm

Will do - I've got a monitor set up for the RAMDisk, so I should know early enough the next time it happens.

regards.... Fred

Post by **lmiltchev** » Tue Jan 12, 2016 10:43 am

Sounds good, Fred! We will keep the thread open.

Post by **Fred Kroeger** » Tue Jan 12, 2016 10:24 pm

Got a RAMDisk full again today - this time on the original server.

Code: Select all

# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root
                       27G  8.2G   18G  32% /
tmpfs                 1.9G  1.0M  1.9G   1% /dev/shm
/dev/mapper/VolGroup-lv_app
                       50G  6.4G   41G  14% /usr/local
/dev/sda1             477M   66M  386M  15% /boot
tmpfs                 500M  500M     0 100% /var/nagiosramdisk
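A RAMDisk monitor like the one mentioned earlier in the thread could be approximated with a small threshold check on `df -P` output. This is only a sketch: the function name `check_usage` and the 80/95 thresholds in the usage line are made up for illustration and are not part of Nagios XI. The exit codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```shell
# check_usage MOUNT WARN CRIT
# Hypothetical helper: reads `df -P` output on stdin, finds the line whose
# mount point equals MOUNT, and reports a Nagios-style state for its Use%.
check_usage() {
    awk -v mnt="$1" -v warn="$2" -v crit="$3" '
        $6 == mnt {
            pct = $5
            sub(/%/, "", pct)       # "100%" -> "100"
            pct += 0                # force numeric comparison
            found = 1
            if (pct >= crit)      { state = "CRITICAL"; code = 2 }
            else if (pct >= warn) { state = "WARNING";  code = 1 }
            else                  { state = "OK";       code = 0 }
            printf "%s - %s is %d%% full\n", state, mnt, pct
        }
        END {
            if (!found) {
                print "UNKNOWN - " mnt " not found"
                exit 3
            }
            exit code
        }
    '
}
```

It could be called as `df -P /var/nagiosramdisk | check_usage /var/nagiosramdisk 80 95`; the `-P` flag keeps each filesystem on a single line, which the wrapped output above does not.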

Followed the same process - identified the deleted open files (copied them to /tmp as well this time) and restarted Nagios. The files were quite small and, unlike last time, didn't add up to the total space used. I will PM you the files as requested.

Code: Select all

# lsof | grep deleted
nagios     4140   nagios   14w      REG               0,17     11856  199318204 /var/nagiosramdisk/spool/perfdata/1451951648.perfdata.host-PID-10992 (deleted)
nagios     4140   nagios   15w      REG               0,17     71264  199318199 /var/nagiosramdisk/spool/perfdata/1451951648.perfdata.service-PID-10993 (deleted)

# ps -ef | grep 4140
nagios    4140  4071  0 Jan05 ?        00:00:17 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg

# ls -l /proc/4140/fd/14
l-wx------ 1 root root 64 Jan 13 11:18 /proc/4140/fd/14 -> /var/nagiosramdisk/spool/perfdata/1451951648.perfdata.host-PID-10992 (deleted)
# ls -l /proc/4140/fd/15
l-wx------ 1 root root 64 Jan 13 11:18 /proc/4140/fd/15 -> /var/nagiosramdisk/spool/perfdata/1451951648.perfdata.service-PID-10993 (deleted)

# cp /proc/4140/fd/14 /tmp/1451951648.perfdata.host-PID-10992
# cp /proc/4140/fd/15 /tmp/1451951648.perfdata.service-PID-10993

# ls -la /tmp/1451951648.perfdata.*
-rw-r--r-- 1 root root 11856 Jan 13 11:21 /tmp/1451951648.perfdata.host-PID-10992
-rw-r--r-- 1 root root 71264 Jan 13 11:22 /tmp/1451951648.perfdata.service-PID-10993
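To check whether the deleted-but-open files account for the used space, the size column of the `lsof` lines can be totalled and compared with the `df` figure. The helper below is a sketch only: `sum_deleted_bytes` is a hypothetical name, and it assumes the column layout shown in the output above (file size in the 7th field, path just before the `(deleted)` marker) and paths without spaces. Some lsof versions add a TID column, which would shift the fields.

```shell
# sum_deleted_bytes DIR
# Hypothetical helper: reads `lsof` output on stdin and prints the total
# size in bytes of regular files under DIR that are open but deleted.
sum_deleted_bytes() {
    awk -v dir="$1" '
        $NF == "(deleted)" && $5 == "REG" && index($(NF-1), dir) == 1 {
            total += $7             # SIZE/OFF column
        }
        END { print total + 0 }     # prints 0 when nothing matched
    '
}
```

Usage would be `lsof | sum_deleted_bytes /var/nagiosramdisk`. For the two files listed above the total is 83120 bytes, far short of the 500M reported as used.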

However, the RAMDisk used space did not go down. I checked the RAMDisk and no files were visible - deleted or active.
I ran ps to see if there was another Nagios process running - there wasn't, but this time I found 261 cron-initiated /usr/local/nagiosxi/cron/recurringdowntime.pl processes running - some dating back to last year.
Sample below:

Code: Select all

nagios     569   556  0 Jan01 ?        00:00:00 /bin/sh -c /usr/local/nagiosxi/cron/recurringdowntime.pl > /usr/local/nagiosxi/var/recurringdowntime.log 2>&1
nagios     576   569  0 Jan01 ?        00:00:00 /usr/bin/perl /usr/local/nagiosxi/cron/recurringdowntime.pl
nagios     612   602  0  2015 ?        00:00:00 /bin/sh -c /usr/local/nagiosxi/cron/recurringdowntime.pl > /usr/local/nagiosxi/var/recurringdowntime.log 2>&1
nagios     624   612  0  2015 ?        00:00:00 /usr/bin/perl /usr/local/nagiosxi/cron/recurringdowntime.pl
nagios    1076  1064  0 Jan03 ?        00:00:00 /bin/sh -c /usr/local/nagiosxi/cron/recurringdowntime.pl > /usr/local/nagiosxi/var/recurringdowntime.log 2>&1
nagios    1084  1076  0 Jan03 ?        00:00:00 /usr/bin/perl /usr/local/nagiosxi/cron/recurringdowntime.pl
nagios    1232  1221  0  2015 ?        00:00:00 /bin/sh -c /usr/local/nagiosxi/cron/recurringdowntime.pl > /usr/local/nagiosxi/var/recurringdowntime.log 2>&1
nagios    1239  1232  0  2015 ?        00:00:00 /usr/bin/perl /usr/local/nagiosxi/cron/recurringdowntime.pl
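Leftover processes like the ones in this sample could be listed by age rather than by eye. The following is a rough sketch, not a Nagios XI tool: `list_stale` is a hypothetical helper, and the one-hour threshold in the usage line is an arbitrary choice. It expects input in the form "PID PPID ELAPSED_SECONDS COMMAND", which procps-ng `ps` can produce with the `etimes` keyword; older `ps` builds may not support `etimes`.

```shell
# list_stale PATTERN MAX_AGE_SECONDS
# Hypothetical helper: reads "PID PPID ELAPSED_SECONDS COMMAND..." lines on
# stdin and prints the PID of every line that contains PATTERN and whose
# elapsed time is greater than MAX_AGE_SECONDS.
list_stale() {
    awk -v pat="$1" -v max="$2" '
        $3 + 0 > max + 0 && index($0, pat) > 0 { print $1 }
    '
}
```

A possible invocation is `ps -eo pid=,ppid=,etimes=,args= | list_stale recurringdowntime.pl 3600`. It is safer to review the resulting PIDs before passing any of them to `kill`.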

Restarting the crond service cleared up the RAMDisk issue. I then manually killed all the CROND processes that had a parent PID of 1.
Tail of /usr/local/nagiosxi/var/recurringdowntime.log:

Code: Select all

nd=13;nm=0;ny=116
Current candidate(dow): 19:00 on 13/1/2016
Checking days of week: days (0,1,2,3,4,5,6) are valid
Scheduling for day 3 (today is 3, looking at scheds for 3 and later)

nd=13;nm=0;ny=116

dow: 3
lst: 0

nd=13;nm=0;ny=116
Current candidate: 19:00 on 13/1/2016
Scheduling service XXX.YYY:Memory Used - Wintel
ERROR: Invalid service 1452675600 on host XXX.YYY!

So - not helpful in that the RAMDisk filled again but with different symptoms this time.

Post by **ssax** » Wed Jan 13, 2016 10:36 am

Hmm, I reviewed those files and they don't give any indication of the cause either, since they are pretty small.

Please post the output of these commands:

Code: Select all

grep "perfdata\|ramdisk" /usr/local/nagios/etc/nagios.cfg
grep perfdata /usr/local/nagios/etc/commands.cfg

Post by **Fred Kroeger** » Wed Jan 13, 2016 8:12 pm

These config files haven't changed, i.e. they are the same now as they were before this started happening.

# grep "perfdata\|ramdisk" /usr/local/nagios/etc/nagios.cfg

Code: Select all

#service_perfdata_file=/usr/local/nagios/var/service-perfdata
service_perfdata_file=/var/nagiosramdisk/service-perfdata
service_perfdata_file_template=DATATYPE::SERVICEPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tSERVICEDESC::$SERVICEDESC$\tSERVICEPERFDATA::$SERVICEPERFDATA$\tSERVICECHECKCOMMAND::$SERVICECHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tSERVICESTATE::$SERVICESTATE$\tSERVICESTATETYPE::$SERVICESTATETYPE$\tSERVICEOUTPUT::$SERVICEOUTPUT$
service_perfdata_file_mode=a
service_perfdata_file_processing_interval=15
service_perfdata_file_processing_command=process-service-perfdata-file-bulk
#host_perfdata_file=/usr/local/nagios/var/host-perfdata
host_perfdata_file=/var/nagiosramdisk/host-perfdata
host_perfdata_file_template=DATATYPE::HOSTPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tHOSTPERFDATA::$HOSTPERFDATA$\tHOSTCHECKCOMMAND::$HOSTCHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tHOSTOUTPUT::$HOSTOUTPUT$
host_perfdata_file_mode=a
host_perfdata_file_processing_interval=15
host_perfdata_file_processing_command=process-host-perfdata-file-bulk
check_result_path=/var/nagiosramdisk/spool/checkresults
object_cache_file=/var/nagiosramdisk/objects.cache
perfdata_timeout=5
status_file=/var/nagiosramdisk/status.dat
temp_path=/var/nagiosramdisk/tmp

# grep perfdata /usr/local/nagios/etc/commands.cfg

Code: Select all

       command_name                             launch_perfdata_process
       command_name                             process-host-perfdata
       command_line                             /usr/bin/printf "%b" "$LASTHOSTCHECK$\t$HOSTNAME$\t$HOSTSTATE$\t$HOSTATTEMPT$\t$HOSTSTATETYPE$\t$HOSTEXECUTIONTIME$\t$HOSTOUTPUT$\t$HOSTPERFDATA$\n" >> /usr/local/groundwork/nagios/var/host-perfdata.out
       command_name                             process-host-perfdata-file-bulk
       command_line                             /bin/mv /var/nagiosramdisk/host-perfdata /var/nagiosramdisk/spool/xidpe/$TIMET$.perfdata.host
       command_name                             process-host-perfdata-file-pnp-bulk
       command_line                             /bin/mv /var/nagiosramdisk/host-perfdata /usr/local/nagios/var/spool/perfdata/host-perfdata.$TIMET$
       command_name                             process-host-perfdata-pnp-normal
       command_line                             /usr/bin/perl /usr/local/nagios/libexec/process_perfdata.pl -d HOSTPERFDATA
       command_name                             process-service-perfdata-file-bulk
       command_line                             /bin/mv /var/nagiosramdisk/service-perfdata /var/nagiosramdisk/spool/xidpe/$TIMET$.perfdata.service
       command_name                             process-service-perfdata-file-pnp-bulk
       command_line                             /bin/mv /var/nagiosramdisk/service-perfdata /usr/local/nagios/var/spool/perfdata/service-perfdata.$TIMET$
       command_name                             process-service-perfdata-pnp-normal
       command_line                             /usr/bin/perl /usr/local/nagios/libexec/process_perfdata.pl
       command_name                             process_service_perfdata_file

Post by **ssax** » Thu Jan 14, 2016 1:45 pm

Looks good to me. What I was really looking for was everything that is being stored on the RAMDisk. Since the files from the last time it filled up were so small, some other file(s) must have been consuming all of the space in the RAMDisk.

What we need to do is get access to those very large files to see what they contain.

When it happens again, if you only see small files in the lsof list, then cd into the RAMDisk directory and run ls -lh in all directories until you find the offending file(s), and save a copy so we can look at them.
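As an alternative to walking every directory by hand, the largest files under the RAMDisk can be listed in one pass. This is a sketch under assumptions: `largest_files` is a hypothetical helper name, and it relies on GNU find's `-printf` option, which is not available in every find implementation.

```shell
# largest_files DIR [COUNT]
# Hypothetical helper: prints the COUNT largest regular files under DIR
# (default 10) as "SIZE_IN_BYTES<TAB>PATH", biggest first.
largest_files() {
    # -xdev keeps find on the filesystem that DIR lives on
    find "$1" -xdev -type f -printf '%s\t%p\n' | sort -nr | head -n "${2:-10}"
}
```

For example, `largest_files /var/nagiosramdisk 20` would show the twenty biggest files visible on the RAMDisk. Files that are deleted but still open will not appear here, so the lsof check is still needed alongside it.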

Thank you

Post by **Box293** » Thu Jan 14, 2016 5:26 pm

Can you also run df -i next time? I'm interested in seeing the inode usage as well.

Post by **Fred Kroeger** » Thu Jan 14, 2016 7:02 pm

Hi Troy - did that as well each time - inodes are 99% free - it's generally just been the 2 huge open & deleted perfdata files.
Of course this doesn't explain the last time when cron was the culprit.

regards... Fred

Nagios Support Forum
