RAMDISK full

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
Fred Kroeger
Posts: 588
Joined: Wed Oct 19, 2011 11:36 pm
Location: Perth, Western Australia

Re: RAMDISK full

Post by Fred Kroeger »

The RAMDisk had also been increased to 500M back in December, when I changed the original server.

top - 07:07:04 up 25 days, 13:53,  1 user,  load average: 3.74, 3.74, 3.61
Tasks: 264 total,   4 running, 259 sleeping,   0 stopped,   1 zombie
Cpu(s): 29.9%us,  7.7%sy,  0.0%ni, 58.5%id,  2.1%wa,  0.2%hi,  1.5%si,  0.0%st
Mem:   8061552k total,  6361216k used,  1700336k free,    68788k buffers
Swap:  2359288k total,    51012k used,  2308276k free,  3465736k cached
Monitoring 862 Hosts & 6148 Services
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: RAMDISK full

Post by ssax »

Is there any chance that you could grab a copy of that deleted file, zip it up, and PM it to us so that we can take a look at what is in there?

http://www.serverwatch.com/tutorials/ar ... h-lsof.htm

Re: RAMDISK full

Post by Fred Kroeger »

Will do - I've got a monitor set up for the RAM Disk, so I should know early enough the next time it happens.

regards.... Fred
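
For anyone wanting a similar early warning, a RAMDisk monitor could look roughly like the sketch below. This is not the check Fred has in place (the thread does not show it). The function names `use_pct` and `ramdisk_status` and the 80/90 thresholds are made up for illustration, and the sample `df -P` line is derived from the 500M, 100% full tmpfs reported later in this thread.

```shell
#!/bin/sh
# Sketch only: function names and thresholds are illustrative assumptions.

# Print the Use% value (without the % sign) for mount point $1,
# reading `df -P` output on stdin.
use_pct() {
    awk -v mp="$1" '$NF == mp { sub(/%/, "", $(NF-1)); print $(NF-1) }'
}

# Map a usage percentage to a Nagios-style message and exit code
# (0 = OK, 1 = WARNING, 2 = CRITICAL).
ramdisk_status() {
    pct=$1
    warn=${2:-80}
    crit=${3:-90}
    if [ "$pct" -ge "$crit" ]; then
        echo "CRITICAL - RAMDisk ${pct}% used"
        return 2
    elif [ "$pct" -ge "$warn" ]; then
        echo "WARNING - RAMDisk ${pct}% used"
        return 1
    fi
    echo "OK - RAMDisk ${pct}% used"
    return 0
}

# Demo on a sample line (500M tmpfs, completely full).
pct=$(printf '%s\n' \
    'Filesystem 1024-blocks Used Available Capacity Mounted on' \
    'tmpfs 512000 512000 0 100% /var/nagiosramdisk' | use_pct /var/nagiosramdisk)
ramdisk_status "$pct" || echo "exit code: $?"
```

On a live system the percentage would come from `df -P /var/nagiosramdisk | use_pct /var/nagiosramdisk` instead of the sample text.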
lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: RAMDISK full

Post by lmiltchev »

Sounds good, Fred! We will keep the thread open.

Re: RAMDISK full

Post by Fred Kroeger »

Got a RAMDisk full again today - this time on the original server.

# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root
                       27G  8.2G   18G  32% /
tmpfs                 1.9G  1.0M  1.9G   1% /dev/shm
/dev/mapper/VolGroup-lv_app
                       50G  6.4G   41G  14% /usr/local
/dev/sda1             477M   66M  386M  15% /boot
tmpfs                 500M  500M     0 100% /var/nagiosramdisk

Followed the same process: identified the deleted open files (copied them to /tmp as well this time) and restarted Nagios. The files were quite small and, unlike last time, didn't add up to the total space used. I will PM you the files as requested.

# lsof | grep deleted
nagios     4140   nagios   14w      REG               0,17     11856  199318204 /var/nagiosramdisk/spool/perfdata/1451951648.perfdata.host-PID-10992 (deleted)
nagios     4140   nagios   15w      REG               0,17     71264  199318199 /var/nagiosramdisk/spool/perfdata/1451951648.perfdata.service-PID-10993 (deleted)

# ps -ef | grep 4140
nagios    4140  4071  0 Jan05 ?        00:00:17 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg

# ls -l /proc/4140/fd/14
l-wx------ 1 root root 64 Jan 13 11:18 /proc/4140/fd/14 -> /var/nagiosramdisk/spool/perfdata/1451951648.perfdata.host-PID-10992 (deleted)
# ls -l /proc/4140/fd/15
l-wx------ 1 root root 64 Jan 13 11:18 /proc/4140/fd/15 -> /var/nagiosramdisk/spool/perfdata/1451951648.perfdata.service-PID-10993 (deleted)

# cp /proc/4140/fd/14 /tmp/1451951648.perfdata.host-PID-10992
# cp /proc/4140/fd/15 /tmp/1451951648.perfdata.service-PID-10993

# ls -la /tmp/1451951648.perfdata.*
-rw-r--r-- 1 root root 11856 Jan 13 11:21 /tmp/1451951648.perfdata.host-PID-10992
-rw-r--r-- 1 root root 71264 Jan 13 11:22 /tmp/1451951648.perfdata.service-PID-10993
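
As a cross-check on the sizes above, the SIZE/OFF column of the lsof listing can be totalled and compared with what df reports. The sketch below is only an illustration: the function name `sum_deleted_bytes` is made up, and it assumes the column layout shown in this post (size in field 7, path second to last). Other lsof versions may print extra columns, in which case the field number needs adjusting.

```shell
#!/bin/sh
# Sketch only: assumes the lsof column layout shown in this post.

# Sum the sizes of deleted-but-open files whose path starts with $1,
# reading lsof output on stdin.
sum_deleted_bytes() {
    awk -v prefix="$1" '
        /\(deleted\)$/ && index($(NF-1), prefix) == 1 { total += $7 }
        END { print total + 0 }
    '
}

# Demo on the two lines from the listing above.
sum_deleted_bytes /var/nagiosramdisk <<'EOF'
nagios     4140   nagios   14w      REG               0,17     11856  199318204 /var/nagiosramdisk/spool/perfdata/1451951648.perfdata.host-PID-10992 (deleted)
nagios     4140   nagios   15w      REG               0,17     71264  199318199 /var/nagiosramdisk/spool/perfdata/1451951648.perfdata.service-PID-10993 (deleted)
EOF
```

For these two files the total is 83120 bytes, nowhere near the 500M that df shows as used. On a live system the input would be `lsof | sum_deleted_bytes /var/nagiosramdisk`, run as root.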

However, the RAMDisk used space did not go down. I checked the RAMDisk and no files were visible, deleted or active.
I ran ps to see if there was another Nagios process running. There wasn't, but this time I found 261 cron-initiated /usr/local/nagiosxi/cron/recurringdowntime.pl processes running, some dating back to last year.
Sample below:

nagios     569   556  0 Jan01 ?        00:00:00 /bin/sh -c /usr/local/nagiosxi/cron/recurringdowntime.pl > /usr/local/nagiosxi/var/recurringdowntime.log 2>&1
nagios     576   569  0 Jan01 ?        00:00:00 /usr/bin/perl /usr/local/nagiosxi/cron/recurringdowntime.pl
nagios     612   602  0  2015 ?        00:00:00 /bin/sh -c /usr/local/nagiosxi/cron/recurringdowntime.pl > /usr/local/nagiosxi/var/recurringdowntime.log 2>&1
nagios     624   612  0  2015 ?        00:00:00 /usr/bin/perl /usr/local/nagiosxi/cron/recurringdowntime.pl
nagios    1076  1064  0 Jan03 ?        00:00:00 /bin/sh -c /usr/local/nagiosxi/cron/recurringdowntime.pl > /usr/local/nagiosxi/var/recurringdowntime.log 2>&1
nagios    1084  1076  0 Jan03 ?        00:00:00 /usr/bin/perl /usr/local/nagiosxi/cron/recurringdowntime.pl
nagios    1232  1221  0  2015 ?        00:00:00 /bin/sh -c /usr/local/nagiosxi/cron/recurringdowntime.pl > /usr/local/nagiosxi/var/recurringdowntime.log 2>&1
nagios    1239  1232  0  2015 ?        00:00:00 /usr/bin/perl /usr/local/nagiosxi/cron/recurringdowntime.pl

Restarting the crond service cleared the RAMDisk issue. I then manually killed all the CROND processes that had a parent ID of 1.

Tail of /usr/local/nagiosxi/var/recurringdowntime.log:

nd=13;nm=0;ny=116
Current candidate(dow): 19:00 on 13/1/2016
Checking days of week: days (0,1,2,3,4,5,6) are valid
Scheduling for day 3 (today is 3, looking at scheds for 3 and later)

nd=13;nm=0;ny=116

dow: 3
lst: 0

nd=13;nm=0;ny=116
Current candidate: 19:00 on 13/1/2016
Scheduling service XXX.YYY:Memory Used - Wintel
ERROR: Invalid service 1452675600 on host XXX.YYY!

So, not helpful: the RAMDisk filled again, but with different symptoms this time.
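
The orphaned CROND processes mentioned above (parent ID of 1) can be picked out of a process listing before anything is killed. The sketch below is one way to do that, not the procedure actually used in this post. The function name `orphaned_pids` is made up, the input is assumed to be "PID PPID COMMAND" lines such as `ps -eo pid=,ppid=,args=` produces, and the PIDs in the sample are invented.

```shell
#!/bin/sh
# Sketch only: lists candidate PIDs, it does not kill anything.

# Print the PIDs of processes whose parent is PID 1 and whose line
# contains the string $1. Expects "PID PPID COMMAND..." lines on stdin.
orphaned_pids() {
    awk -v pat="$1" '$2 == 1 && index($0, pat) > 0 { print $1 }'
}

# Demo on sample lines (PIDs here are invented for illustration).
orphaned_pids CROND <<'EOF'
  601     1 CROND
  612   602 /bin/sh -c /usr/local/nagiosxi/cron/recurringdowntime.pl
 1063     1 CROND
 4140  4071 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
EOF
```

On a live system the input would be `ps -eo pid=,ppid=,args= | orphaned_pids CROND`. Reviewing the list first is safer than piping it straight into kill.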

Re: RAMDISK full

Post by ssax »

Hmm, I reviewed those files and they don't give any indication of the cause either, since they are pretty small.

Please post the output of these commands:

grep "perfdata\|ramdisk" /usr/local/nagios/etc/nagios.cfg
grep perfdata /usr/local/nagios/etc/commands.cfg

Re: RAMDISK full

Post by Fred Kroeger »

These config files haven't changed, i.e. they are the same now as they were before this started happening.

# grep "perfdata\|ramdisk" /usr/local/nagios/etc/nagios.cfg

#service_perfdata_file=/usr/local/nagios/var/service-perfdata
service_perfdata_file=/var/nagiosramdisk/service-perfdata
service_perfdata_file_template=DATATYPE::SERVICEPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tSERVICEDESC::$SERVICEDESC$\tSERVICEPERFDATA::$SERVICEPERFDATA$\tSERVICECHECKCOMMAND::$SERVICECHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tSERVICESTATE::$SERVICESTATE$\tSERVICESTATETYPE::$SERVICESTATETYPE$\tSERVICEOUTPUT::$SERVICEOUTPUT$
service_perfdata_file_mode=a
service_perfdata_file_processing_interval=15
service_perfdata_file_processing_command=process-service-perfdata-file-bulk
#host_perfdata_file=/usr/local/nagios/var/host-perfdata
host_perfdata_file=/var/nagiosramdisk/host-perfdata
host_perfdata_file_template=DATATYPE::HOSTPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tHOSTPERFDATA::$HOSTPERFDATA$\tHOSTCHECKCOMMAND::$HOSTCHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tHOSTOUTPUT::$HOSTOUTPUT$
host_perfdata_file_mode=a
host_perfdata_file_processing_interval=15
host_perfdata_file_processing_command=process-host-perfdata-file-bulk
check_result_path=/var/nagiosramdisk/spool/checkresults
object_cache_file=/var/nagiosramdisk/objects.cache
perfdata_timeout=5
status_file=/var/nagiosramdisk/status.dat
temp_path=/var/nagiosramdisk/tmp

# grep perfdata /usr/local/nagios/etc/commands.cfg

       command_name                             launch_perfdata_process
       command_name                             process-host-perfdata
       command_line                             /usr/bin/printf "%b" "$LASTHOSTCHECK$\t$HOSTNAME$\t$HOSTSTATE$\t$HOSTATTEMPT$\t$HOSTSTATETYPE$\t$HOSTEXECUTIONTIME$\t$HOSTOUTPUT$\t$HOSTPERFDATA$\n" >> /usr/local/groundwork/nagios/var/host-perfdata.out
       command_name                             process-host-perfdata-file-bulk
       command_line                             /bin/mv /var/nagiosramdisk/host-perfdata /var/nagiosramdisk/spool/xidpe/$TIMET$.perfdata.host
       command_name                             process-host-perfdata-file-pnp-bulk
       command_line                             /bin/mv /var/nagiosramdisk/host-perfdata /usr/local/nagios/var/spool/perfdata/host-perfdata.$TIMET$
       command_name                             process-host-perfdata-pnp-normal
       command_line                             /usr/bin/perl /usr/local/nagios/libexec/process_perfdata.pl -d HOSTPERFDATA
       command_name                             process-service-perfdata-file-bulk
       command_line                             /bin/mv /var/nagiosramdisk/service-perfdata /var/nagiosramdisk/spool/xidpe/$TIMET$.perfdata.service
       command_name                             process-service-perfdata-file-pnp-bulk
       command_line                             /bin/mv /var/nagiosramdisk/service-perfdata /usr/local/nagios/var/spool/perfdata/service-perfdata.$TIMET$
       command_name                             process-service-perfdata-pnp-normal
       command_line                             /usr/bin/perl /usr/local/nagios/libexec/process_perfdata.pl
       command_name                             process_service_perfdata_file

Re: RAMDISK full

Post by ssax »

Looks good to me. What I was really looking for was everything that is being stored on the RAMDisk. Since the files from the last time it filled up were so small, it must have been some other file(s) consuming all of the space in the RAMDisk.

What we need to do is get access to those very large files to see what they contain.

When it happens again, if you only see small files in the lsof list, then cd into the RAMDisk directory and run ls -lh in all directories until you find the offending file(s), and save a copy so we can look at them.

Thank you
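
Rather than running ls -lh directory by directory, the search can be done in one pass. The sketch below is an assumption about how that might look, not an official Nagios procedure. The function name `largest_files` is made up, and sizes are the allocated sizes in KiB as reported by du.

```shell
#!/bin/sh
# Sketch only: function name and defaults are illustrative assumptions.

# List the N largest regular files under a directory by allocated size
# in KiB, largest first, without crossing into other filesystems.
largest_files() {
    dir=${1:-/var/nagiosramdisk}
    n=${2:-10}
    find "$dir" -xdev -type f -exec du -k {} + 2>/dev/null | sort -rn | head -n "$n"
}

# Only run against the RAMDisk if it exists on this machine.
if [ -d /var/nagiosramdisk ]; then
    largest_files /var/nagiosramdisk 10
fi
```

Note that files which have been deleted but are still held open have no directory entry, so they will not show up here. Those still have to be found through lsof.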
Box293
Too Basu
Posts: 5126
Joined: Sun Feb 07, 2010 10:55 pm
Location: Deniliquin, Australia

Re: RAMDISK full

Post by Box293 »

Can you also run df -i next time? I'm interested in seeing the inode usage as well.

Re: RAMDISK full

Post by Fred Kroeger »

Hi Troy - did that as well each time - inodes are 99% free - it's generally just been the 2 huge open & deleted perfdata files.
Of course this doesn't explain the last time when cron was the culprit.

regards... Fred