Re: RAMDISK full
Posted: Sun Jan 10, 2016 6:14 pm
by Fred Kroeger
The RAMDisk had also been increased to 500M back in December, when I changed the original server.
Code: Select all
top - 07:07:04 up 25 days, 13:53, 1 user, load average: 3.74, 3.74, 3.61
Tasks: 264 total, 4 running, 259 sleeping, 0 stopped, 1 zombie
Cpu(s): 29.9%us, 7.7%sy, 0.0%ni, 58.5%id, 2.1%wa, 0.2%hi, 1.5%si, 0.0%st
Mem: 8061552k total, 6361216k used, 1700336k free, 68788k buffers
Swap: 2359288k total, 51012k used, 2308276k free, 3465736k cached
Monitoring 862 Hosts & 6148 Services
Re: RAMDISK full
Posted: Mon Jan 11, 2016 4:42 pm
by ssax
Is there any chance that you could grab a copy of that deleted file, zip it up, and PM it to us so that we can take a look at what is in there?
http://www.serverwatch.com/tutorials/ar ... h-lsof.htm
Re: RAMDISK full
Posted: Mon Jan 11, 2016 5:18 pm
by Fred Kroeger
Will do - I've got a monitor set up for the RAM Disk, so I should know early enough the next time it happens.
regards.... Fred
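A RAM disk monitor like the one mentioned here could look roughly like the sketch below. It is not Fred's actual check: the function names, the 80/90 thresholds and the Nagios-style exit codes are assumptions made for illustration, and only the mount point /var/nagiosramdisk comes from the thread.

```shell
#!/bin/sh
# Sketch only: function names and thresholds are hypothetical.

# Print the Use% of a mount point as a bare number, using POSIX-format
# df output (-P keeps each filesystem on a single line).
ramdisk_use_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Map a percentage to a Nagios-style status line and return code
# (0 = OK, 1 = WARNING, 2 = CRITICAL).
classify_usage() {
    pct=$1 warn=$2 crit=$3
    if [ "$pct" -ge "$crit" ]; then
        echo "CRITICAL - RAM disk ${pct}% used"
        return 2
    elif [ "$pct" -ge "$warn" ]; then
        echo "WARNING - RAM disk ${pct}% used"
        return 1
    fi
    echo "OK - RAM disk ${pct}% used"
    return 0
}

# Example call (assumed thresholds):
# classify_usage "$(ramdisk_use_pct /var/nagiosramdisk)" 80 90
```

Keeping the df parsing separate from the threshold logic means the classification can be exercised without a real RAM disk mounted.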
Re: RAMDISK full
Posted: Tue Jan 12, 2016 10:43 am
by lmiltchev
Sounds good, Fred! We will keep the thread open.
Re: RAMDISK full
Posted: Tue Jan 12, 2016 10:24 pm
by Fred Kroeger
Got a RAMDisk full again today - this time on the original server.
Code: Select all
# df -h
Filesystem Size Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root
27G 8.2G 18G 32% /
tmpfs 1.9G 1.0M 1.9G 1% /dev/shm
/dev/mapper/VolGroup-lv_app
50G 6.4G 41G 14% /usr/local
/dev/sda1 477M 66M 386M 15% /boot
tmpfs 500M 500M 0 100% /var/nagiosramdisk
Followed the same process - identified the deleted open files (copied them to /tmp as well this time) and restarted Nagios. The files were quite small and, unlike last time, didn't add up to the total space used. I will PM you the files as requested.
Code: Select all
# lsof | grep deleted
nagios 4140 nagios 14w REG 0,17 11856 199318204 /var/nagiosramdisk/spool/perfdata/1451951648.perfdata.host-PID-10992 (deleted)
nagios 4140 nagios 15w REG 0,17 71264 199318199 /var/nagiosramdisk/spool/perfdata/1451951648.perfdata.service-PID-10993 (deleted)
# ps -ef | grep 4140
nagios 4140 4071 0 Jan05 ? 00:00:17 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
# ls -l /proc/4140/fd/14
l-wx------ 1 root root 64 Jan 13 11:18 /proc/4140/fd/14 -> /var/nagiosramdisk/spool/perfdata/1451951648.perfdata.host-PID-10992 (deleted)
# ls -l /proc/4140/fd/15
l-wx------ 1 root root 64 Jan 13 11:18 /proc/4140/fd/15 -> /var/nagiosramdisk/spool/perfdata/1451951648.perfdata.service-PID-10993 (deleted)
# cp /proc/4140/fd/14 /tmp/1451951648.perfdata.host-PID-10992
# cp /proc/4140/fd/15 /tmp/1451951648.perfdata.service-PID-10993
# ls -la /tmp/1451951648.perfdata.*
-rw-r--r-- 1 root root 11856 Jan 13 11:21 /tmp/1451951648.perfdata.host-PID-10992
-rw-r--r-- 1 root root 71264 Jan 13 11:22 /tmp/1451951648.perfdata.service-PID-10993
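To check whether deleted-but-open files account for the space in use, their sizes can be totalled straight from the lsof output. The sketch below is a hypothetical helper (the name sum_deleted_bytes is made up) and assumes the nine-column lsof layout shown above, with the size in field 7 and the path in field 9. Some lsof versions print an extra TID column, which would shift the fields.

```shell
#!/bin/sh
# Sketch only: sum the SIZE/OFF column (field 7) of lsof lines that end in
# "(deleted)" and whose NAME (field 9) starts with the given path prefix.
sum_deleted_bytes() {
    awk -v prefix="$1" '
        $NF == "(deleted)" && index($9, prefix) == 1 { total += $7 }
        END { print total + 0 }
    '
}

# Example (run as root so all processes are visible):
# lsof | sum_deleted_bytes /var/nagiosramdisk
```

For the two entries listed above this gives 83120 bytes, far short of the 500M reported as used by df.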
However.... RAMDisk used space did not go down. Checked RAMDisk and no files visible - deleted or active.
I ran ps to see if there was another Nagios process running. There wasn't, but this time I found 261 cron-initiated /usr/local/nagiosxi/cron/recurringdowntime.pl processes running, some dating back to last year.
Sample below
Code: Select all
nagios 569 556 0 Jan01 ? 00:00:00 /bin/sh -c /usr/local/nagiosxi/cron/recurringdowntime.pl > /usr/local/nagiosxi/var/recurringdowntime.log 2>&1
nagios 576 569 0 Jan01 ? 00:00:00 /usr/bin/perl /usr/local/nagiosxi/cron/recurringdowntime.pl
nagios 612 602 0 2015 ? 00:00:00 /bin/sh -c /usr/local/nagiosxi/cron/recurringdowntime.pl > /usr/local/nagiosxi/var/recurringdowntime.log 2>&1
nagios 624 612 0 2015 ? 00:00:00 /usr/bin/perl /usr/local/nagiosxi/cron/recurringdowntime.pl
nagios 1076 1064 0 Jan03 ? 00:00:00 /bin/sh -c /usr/local/nagiosxi/cron/recurringdowntime.pl > /usr/local/nagiosxi/var/recurringdowntime.log 2>&1
nagios 1084 1076 0 Jan03 ? 00:00:00 /usr/bin/perl /usr/local/nagiosxi/cron/recurringdowntime.pl
nagios 1232 1221 0 2015 ? 00:00:00 /bin/sh -c /usr/local/nagiosxi/cron/recurringdowntime.pl > /usr/local/nagiosxi/var/recurringdowntime.log 2>&1
nagios 1239 1232 0 2015 ? 00:00:00 /usr/bin/perl /usr/local/nagiosxi/cron/recurringdowntime.pl
Restarting the crond service cleared the RAMDisk issue. I then manually killed all the CROND processes that had a parent ID of 1.
tail of /usr/local/nagiosxi/var/recurringdowntime.log
Code: Select all
nd=13;nm=0;ny=116
Current candidate(dow): 19:00 on 13/1/2016
Checking days of week: days (0,1,2,3,4,5,6) are valid
Scheduling for day 3 (today is 3, looking at scheds for 3 and later)
nd=13;nm=0;ny=116
dow: 3
lst: 0
nd=13;nm=0;ny=116
Current candidate: 19:00 on 13/1/2016
Scheduling service XXX.YYY:Memory Used - Wintel
ERROR: Invalid service 1452675600 on host XXX.YYY!
So - not helpful in that the RAMDisk filled again but with different symptoms this time.
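Long-running leftovers such as the recurringdowntime.pl processes described above can be picked out by age. The sketch below is one possible approach, not something from the thread: the helper name older_than is made up, and the etimes column (elapsed seconds) is an assumption about the installed ps, since older procps releases only offer etime.

```shell
#!/bin/sh
# Sketch only: read "PID ELAPSED_SECONDS COMMAND..." lines on stdin and print
# the PIDs whose command line contains the pattern and whose age exceeds
# min_age seconds.
older_than() {
    awk -v min_age="$1" -v pat="$2" '
        $2 + 0 > min_age + 0 && index($0, pat) > 0 { print $1 }
    '
}

# Example: recurringdowntime.pl runs older than one day (86400 seconds).
# ps -e -o pid= -o etimes= -o args= | older_than 86400 recurringdowntime.pl
```

The output is a plain list of PIDs, so it can be reviewed first and only then passed to kill.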
Re: RAMDISK full
Posted: Wed Jan 13, 2016 10:36 am
by ssax
Hmm, I reviewed those files and they don't give any indication of the cause either, since they are pretty small.
Please post the output of these commands:
Code: Select all
grep "perfdata\|ramdisk" /usr/local/nagios/etc/nagios.cfg
grep perfdata /usr/local/nagios/etc/commands.cfg
Re: RAMDISK full
Posted: Wed Jan 13, 2016 8:12 pm
by Fred Kroeger
These config files haven't changed - i.e. they are the same now as they were before this started happening.
# grep "perfdata\|ramdisk" /usr/local/nagios/etc/nagios.cfg
Code: Select all
#service_perfdata_file=/usr/local/nagios/var/service-perfdata
service_perfdata_file=/var/nagiosramdisk/service-perfdata
service_perfdata_file_template=DATATYPE::SERVICEPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tSERVICEDESC::$SERVICEDESC$\tSERVICEPERFDATA::$SERVICEPERFDATA$\tSERVICECHECKCOMMAND::$SERVICECHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tSERVICESTATE::$SERVICESTATE$\tSERVICESTATETYPE::$SERVICESTATETYPE$\tSERVICEOUTPUT::$SERVICEOUTPUT$
service_perfdata_file_mode=a
service_perfdata_file_processing_interval=15
service_perfdata_file_processing_command=process-service-perfdata-file-bulk
#host_perfdata_file=/usr/local/nagios/var/host-perfdata
host_perfdata_file=/var/nagiosramdisk/host-perfdata
host_perfdata_file_template=DATATYPE::HOSTPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tHOSTPERFDATA::$HOSTPERFDATA$\tHOSTCHECKCOMMAND::$HOSTCHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tHOSTOUTPUT::$HOSTOUTPUT$
host_perfdata_file_mode=a
host_perfdata_file_processing_interval=15
host_perfdata_file_processing_command=process-host-perfdata-file-bulk
check_result_path=/var/nagiosramdisk/spool/checkresults
object_cache_file=/var/nagiosramdisk/objects.cache
perfdata_timeout=5
status_file=/var/nagiosramdisk/status.dat
temp_path=/var/nagiosramdisk/tmp
# grep perfdata /usr/local/nagios/etc/commands.cfg
Code: Select all
command_name launch_perfdata_process
command_name process-host-perfdata
command_line /usr/bin/printf "%b" "$LASTHOSTCHECK$\t$HOSTNAME$\t$HOSTSTATE$\t$HOSTATTEMPT$\t$HOSTSTATETYPE$\t$HOSTEXECUTIONTIME$\t$HOSTOUTPUT$\t$HOSTPERFDATA$\n" >> /usr/local/groundwork/nagios/var/host-perfdata.out
command_name process-host-perfdata-file-bulk
command_line /bin/mv /var/nagiosramdisk/host-perfdata /var/nagiosramdisk/spool/xidpe/$TIMET$.perfdata.host
command_name process-host-perfdata-file-pnp-bulk
command_line /bin/mv /var/nagiosramdisk/host-perfdata /usr/local/nagios/var/spool/perfdata/host-perfdata.$TIMET$
command_name process-host-perfdata-pnp-normal
command_line /usr/bin/perl /usr/local/nagios/libexec/process_perfdata.pl -d HOSTPERFDATA
command_name process-service-perfdata-file-bulk
command_line /bin/mv /var/nagiosramdisk/service-perfdata /var/nagiosramdisk/spool/xidpe/$TIMET$.perfdata.service
command_name process-service-perfdata-file-pnp-bulk
command_line /bin/mv /var/nagiosramdisk/service-perfdata /usr/local/nagios/var/spool/perfdata/service-perfdata.$TIMET$
command_name process-service-perfdata-pnp-normal
command_line /usr/bin/perl /usr/local/nagios/libexec/process_perfdata.pl
command_name process_service_perfdata_file
Re: RAMDISK full
Posted: Thu Jan 14, 2016 1:45 pm
by ssax
Looks good to me. What I was really looking for was what was actually being stored on the RAMDisk. Since the files from the last time it filled up were so small, some other file(s) must have been consuming all of the space in the RAMDisk.
What we need to do is get access to those very large files to see what they contain.
When it happens again, if you only see small files in the lsof list, then cd into the RAMDisk directory and run ls -lh in all directories until you find the offending file(s), and save a copy so we can look at them.
Thank you
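The directory-by-directory ls -lh search described above can also be done in one pass. This is a minimal sketch, assuming GNU find (for -printf). The helper name largest_files is hypothetical.

```shell
#!/bin/sh
# Sketch only: list the N largest regular files under a directory, biggest
# first, as "SIZE_IN_BYTES<TAB>PATH". -xdev keeps find on the same filesystem.
largest_files() {
    find "$1" -xdev -type f -printf '%s\t%p\n' | sort -rn | head -n "${2:-10}"
}

# Example: the 20 largest files on the RAM disk.
# largest_files /var/nagiosramdisk 20
```

Files that have been deleted but are still held open will not appear here, so this complements the lsof check and does not replace it.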
Re: RAMDISK full
Posted: Thu Jan 14, 2016 5:26 pm
by Box293
Can you also run df -i next time? I'm interested in seeing the inode usage as well.
Re: RAMDISK full
Posted: Thu Jan 14, 2016 7:02 pm
by Fred Kroeger
Hi Troy - did that as well each time - inodes are 99% free - it's generally just been the 2 huge open & deleted perfdata files.
Of course this doesn't explain the last time when cron was the culprit.
regards... Fred