Unable to write to checkresults, no space left on device

skynardo · Post by **skynardo** » Mon May 18, 2015 2:10 pm

I am following up on an issue from the weekend, when /var filled up and broke things on my Nagios XI system. Since my earlier post was locked, I am opening a new one to understand what caused this and how to keep it from happening again. It looks like /usr/local/nagios/var/spool/checkresults contained too many files, which caused a flood of fprintf warnings to be written to /var/log/httpd/ssl_error_log, which in turn filled up /var. Below are the first error and an example of the subsequent fprintf errors.
[Sat May 16 19:50:10 2015] [error] [client 10.204.242.101] PHP Warning: fopen(/usr/local/nagios/var/spool/checkresults/csJe3Q7.ok): failed to open stream: No space left on device in /usr/local/nrdp/server/plugins/nagioscorepassivecheck/nagioscorepassivecheck.inc.php on line 155
[Sat May 16 19:50:10 2015] [error] [client 10.204.242.101] PHP Warning: fclose() expects parameter 1 to be resource, boolean given in /usr/local/nrdp/server/plugins/nagioscorepassivecheck/nagioscorepassivecheck.inc.php on line 156
[Sat May 16 22:55:17 2015] [error] [client 10.204.1.187] PHP Warning: fopen(): Filename cannot be empty in /usr/local/nrdp/server/plugins/nagioscorepassivecheck/nagioscorepassivecheck.inc.php on line 132
[Sat May 16 22:55:17 2015] [error] [client 10.204.1.187] PHP Warning: fprintf() expects parameter 1 to be resource, boolean given in /usr/local/nrdp/server/plugins/nagioscorepassivecheck/nagioscorepassivecheck.inc.php on line 134
[Sat May 16 22:55:17 2015] [error] [client 10.204.1.187] PHP Warning: fprintf() expects parameter 1 to be resource, boolean given in /usr/local/nrdp/server/plugins/nagioscorepassivecheck/nagioscorepass

I currently have over 361,000 files in my /usr/local/nagios/var/spool/checkresults directory. What is supposed to clean these files up?
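
For anyone hitting the same "No space left on device" warning, a quick first look is how full the affected filesystem is and how many files are sitting in the spool directory. The sketch below is only an illustration: the helper names fs_usage_percent and count_files are made up here, and the paths in the usage comments are the ones mentioned in this thread.

```shell
# Hypothetical helpers (names are illustrative, not part of Nagios).

# Print the use% figure (without the % sign) for the filesystem holding $1.
fs_usage_percent() {
    # df -P forces the one-line-per-filesystem POSIX format; line 2 is the data row.
    df -P "$1" | awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# Print how many regular files live under directory $1.
count_files() {
    find "$1" -type f | wc -l | tr -d '[:space:]'
}

# Example usage:
#   fs_usage_percent /var
#   count_files /usr/local/nagios/var/spool/checkresults
```

The same error can also appear when a filesystem runs out of inodes rather than blocks; on Linux, df -i /var shows inode usage.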

jolson · Post by **jolson** » Mon May 18, 2015 2:24 pm

The old topic was Unable to login to nagiosxi after /var filesystem filled up - it sounds like that problem was resolved and "Unable to login" is no longer an issue that needs to be sorted. I think that it's proper for you to open a new thread for "Unable to write to checkresults" as I don't think that it and the unable to login problem are going to be solved using the same process. Yes, the original issue was the same, but the 2 new problems are separate.

First thing to check: is NPCD running?

Code: Select all

service npcd status

If not, I recommend starting it and seeing whether or not it begins parsing your perfdata.

The NPCD process typically combs through perfdata and processes it. This process will stop working if a server hits a certain load threshold. The question is whether or not you want to maintain this perfdata, and also what you should increase the threshold to (if at all). If you don't care about your historical perfdata, feel free to delete the entries and restart npcd.

Code: Select all

find /usr/local/nagios/var/spool/perfdata/ -type f -delete
service npcd restart

If you still see problems, refer to the following:

Please see our FAQ entry:
http://support.nagios.com/wiki/index.ph ... ta_Timeout

Bulk NPCD processing has a load threshold setting that is intended to halt performance processing if the system is under heavy load. Large installations will need this value increased and NPCD restarted.

Check the NPCD log for load warnings (if the log file does not exist, increase the log level, restart npcd, and wait 5 minutes before proceeding):

tail -50 /usr/local/nagios/var/npcd.log | grep "MAX load reached"

If any recent errors are found, increase load threshold by editing the file:

/usr/local/nagios/etc/pnp/npcd.cfg

Change:

load_threshold = 10.0

To:

load_threshold = 20.0

Save the file and restart NPCD:

service npcd restart

For really large installations, or servers with minimal resources, you may need to increase the npcd load_threshold and perfdata TIMEOUT even more than is suggested above.
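
To decide how far to raise load_threshold, it can help to see the highest load NPCD actually logged. The following is a rough sketch, assuming the warnings in npcd.log have the form "MAX load reached: load X/Y"; peak_npcd_load is a made-up helper name, not an NPCD tool.

```shell
# Hypothetical helper: print the highest load value found in NPCD
# "MAX load reached" warnings in log file $1 (0.00 if there are none).
# Assumes lines of the form:
#   [date time] NPCD: WARN: MAX load reached: load 378.990000/10.000000 at i=0
peak_npcd_load() {
    awk '
        /MAX load reached/ && match($0, /load [0-9.]+\//) {
            # Skip the 5 characters of "load " and drop the trailing "/".
            load = substr($0, RSTART + 5, RLENGTH - 6) + 0
            if (load > max) max = load
        }
        END { printf "%.2f\n", max + 0 }
    ' "$1"
}

# Example usage:
#   peak_npcd_load /usr/local/nagios/var/npcd.log
```

A spike logged during an outage says little about normal operation, so treat the number as a hint rather than as the value to configure.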

skynardo · Post by **skynardo** » Mon May 18, 2015 2:47 pm

The npcd service is running.
I don't have any files in the perfdata directory; I have 361,322 files with names like c006iUW and c006iUW.ok in the checkresults directory.
It looks like the npcd log shows something went haywire Saturday afternoon. Previously in the log there were a few MAX Load warnings, the highest of which was a load of 26.

[05-09-2015 14:49:29] NPCD: ERROR: Executed command exits with return code '7'
[05-09-2015 14:49:29] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1431198427.perfdata.service'
[05-09-2015 14:55:47] NPCD: ERROR: Executed command exits with return code '7'
[05-09-2015 14:55:47] NPCD: ERROR: Command line was '/usr/local/nagios/libexec/process_perfdata.pl -n -b /usr/local/nagios/var/spool/perfdata//1431198427.perfdata.host'
[05-09-2015 14:56:02] NPCD: WARN: MAX load reached: load 378.990000/10.000000 at i=0
[05-09-2015 14:56:17] NPCD: WARN: MAX load reached: load 406.560000/10.000000 at i=1
[05-09-2015 14:56:32] NPCD: WARN: MAX load reached: load 493.830000/10.000000 at i=1
[05-09-2015 14:56:47] NPCD: WARN: MAX load reached: load 549.100000/10.000000 at i=1
[05-09-2015 14:57:02] NPCD: WARN: MAX load reached: load 601.280000/10.000000 at i=1
[05-09-2015 14:57:17] NPCD: WARN: MAX load reached: load 632.220000/10.000000 at i=1

skynardo · Post by **skynardo** » Mon May 18, 2015 2:50 pm

Just noticed those entries were from the previous Saturday. We had a SAN outage that no doubt caused those high load numbers so I don't want to cloud the current issue.

jolson · Post by **jolson** » Mon May 18, 2015 4:20 pm

Which version of Nagios are you currently running?

The working diagnosis is that the nagios process died at some point. If you have a lot of passive results, they will pile up in the checkresults folder while nagios is down. If the disk is full, nagios will likely not be able to stat those files. We can go ahead and remove all of those passive results if you don't mind the historical information being erased. You would lose any perfdata and any check history associated with everything in the checkresults directory:

Code: Select all

find /usr/local/nagios/var/spool/checkresults -type f -delete

After doing so, restart nagios and ensure that the directory is staying clean:

Code: Select all

service nagios restart

Code: Select all

find /usr/local/nagios/var/spool/checkresults -type f | wc -l

Let us know if that data is important to you. If it is, we can come up with an alternate plan designed to save your historical information - though of course we'll have to jump through some more hoops for the sake of preservation.

skynardo · Post by **skynardo** » Tue May 19, 2015 9:33 am

I decided to just blast all of those files and restart nagios. Things seem to be processing normally now and the files are getting cleaned up. I'm still not sure what caused what, but I will be putting some more monitors on my monitoring server today. I guess if Nagios is broken, passive check results accumulate in this directory. When nagios comes back up it should process these? Maybe there were too many or they were too old to process after I resolved the filesystem and postgres issues on Monday. I also saw a doc with recommendations for ulimit changes but wasn't sure if it was referring to the nagios or root user.
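
One possible extra monitor for this is a small plugin-style check that watches the file count in the spool directory. This is only a sketch: check_spool_count and its thresholds are invented for illustration, and the exit codes follow the usual Nagios plugin convention (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```shell
# Hypothetical check: check_spool_count DIR WARN CRIT
# Prints a one-line status and returns a Nagios-style exit code.
check_spool_count() {
    dir=$1 warn=$2 crit=$3
    if [ ! -d "$dir" ]; then
        echo "UNKNOWN - $dir is not a directory"
        return 3
    fi
    # Count regular files; tr strips the padding some wc versions add.
    count=$(find "$dir" -type f | wc -l | tr -d '[:space:]')
    if [ "$count" -ge "$crit" ]; then
        echo "CRITICAL - $count files in $dir"
        return 2
    elif [ "$count" -ge "$warn" ]; then
        echo "WARNING - $count files in $dir"
        return 1
    fi
    echo "OK - $count files in $dir"
    return 0
}

# Example usage (threshold values are arbitrary):
#   check_spool_count /usr/local/nagios/var/spool/checkresults 5000 20000
```

Sensible thresholds depend on how many passive results the server normally receives between reaper runs, so the numbers above would need tuning.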

jolson · Post by **jolson** » Tue May 19, 2015 9:44 am

When nagios comes back up it should process these?

Nagios will do exactly this, but there are a few stipulations.

First and foremost, disk space. Since you were out of disk space, it's possible that Nagios couldn't stat the files located in the checkresults directory - meaning that Nagios couldn't process them.

Secondly, file age. If files are too old when Nagios comes back up, it will not attempt to process them. This behavior is controlled by the Max Check Result File Age setting (max_check_result_file_age in the main configuration file). You can read more about it here: http://nagios.sourceforge.net/docs/3_0/configmain.html
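
As a rough illustration of the file-age rule, the snippet below counts spool files whose modification time is older than a given age, which approximates the set of files that would be considered too old. It assumes the age is given in seconds; count_stale_results is a made-up helper, and -mmin is a GNU/BSD find extension rather than POSIX.

```shell
# Hypothetical helper: count_stale_results DIR MAX_AGE_SECONDS
# Counts regular files in DIR last modified more than MAX_AGE_SECONDS ago.
count_stale_results() {
    # find works in whole minutes here, so the age is rounded down.
    minutes=$(( $2 / 60 ))
    # -mmin +N matches files modified more than N minutes ago.
    find "$1" -type f -mmin +"$minutes" | wc -l | tr -d '[:space:]'
}

# Example usage (3600 is only an example value; check nagios.cfg for yours):
#   count_stale_results /usr/local/nagios/var/spool/checkresults 3600
```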

Maybe there were too many or they were too old to process after I resolved the filesystem and postgres issues on Monday.

Yup, spot on.

I also saw a doc with recommendations for ulimit changes but wasn't sure if it was referring to nagios or root user.

Which document are you referring to?

Overall - I'm glad you're up and running again. My guess is that the 'nagios' process died at some point, and when it came back up there were just too many passive results to deal with.

Best,

Jesse

skynardo · Post by **skynardo** » Tue May 19, 2015 2:17 pm

I can't find the exact link I was reading at the time, but these are the ulimit suggestions I was talking about. I found them by searching the Nagios support wiki for ulimit.
Try the following solutions:

Edit /etc/security/limits.conf

* hard memlock 128 #locked memory
* soft memlock 128
* soft nofile 4096 #open files
* hard nofile 4096
* hard nproc 4096 #max user processes
* soft nproc 4096
* hard stack 20480 #stack size
* soft stack 20480
and restart the server. Run

ulimit -a
to verify that the new settings are in place.
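
To compare what is configured with what a shell actually gets, something like the following could be used. It is a sketch only: limit_from_conf is a hypothetical helper that reads the simple four-column lines shown above and skips comment lines; it does not look at files under /etc/security/limits.d or handle any other limits.conf syntax.

```shell
# Hypothetical helper: limit_from_conf FILE DOMAIN TYPE ITEM
# Prints the last matching value, e.g. the soft nofile limit for domain "*".
limit_from_conf() {
    awk -v d="$2" -v t="$3" -v i="$4" '
        $1 ~ /^#/ { next }                       # skip full-line comments
        $1 == d && $2 == t && $3 == i { v = $4 } # later entries override earlier ones
        END { if (v != "") print v }
    ' "$1"
}

# Example usage:
#   limit_from_conf /etc/security/limits.conf '*' soft nofile
#   ulimit -Sn    # soft open-files limit of the current shell, for comparison
```

Note that, according to the limits.conf man page, wildcard (*) entries are not applied to the root user, so the values should be checked from a session running as the nagios user.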

ssax · Post by **ssax** » Tue May 19, 2015 2:46 pm

skynardo, how do your limits look compared to those from the wiki?
