Hundreds of Active Check result files in /tmp

Posted: Thu Oct 04, 2012 9:28 am
by jbennett
I'm not sure if this is normal.

I have hundreds (thousands?) of files like the following in my /tmp folder on my Nagios box:

Code: Select all

[root@nagiosxivm tmp]# vi check6ve7op
### Active Check Result File ###
file_time=1343197496

### Nagios Service Check Result ###
# Time: Wed Jul 25 06:24:56 2012
host_name=xxxxx
service_description=xxxxxxxxxx
check_type=0
check_options=4
scheduled_check=1
reschedule_check=1
latency=1.200000
start_time=1343197496.200880
Is this normal? If not, what is causing all of these files to collect in the /tmp folder?
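The `file_time=` line in each of these files is a Unix timestamp, so it shows when the result was written. The following is a minimal sketch for decoding it, assuming GNU `date` (for the `-d @EPOCH` form); the function name `check_file_time` is made up for illustration.

```shell
#!/bin/sh
# Print the file_time recorded in a check result file as a UTC date.
# Assumption: GNU date is available (the -d @EPOCH form is not POSIX).
check_file_time() {
    # Take the first line that looks like file_time=<digits>.
    epoch=$(sed -n 's/^file_time=\([0-9][0-9]*\)$/\1/p' "$1" | head -n 1)
    # Return non-zero if the file has no file_time line.
    [ -n "$epoch" ] || return 1
    date -u -d "@$epoch" '+%Y-%m-%d %H:%M:%S UTC'
}
```

For the file shown above, `check_file_time /tmp/check6ve7op` should print 2012-07-25 06:24:56 UTC, which agrees with the `# Time:` comment in the file and means this result was more than two months old at the time of posting.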

Re: Hundreds of Active Check result files in /tmp

Posted: Thu Oct 04, 2012 9:38 am
by scottwilkerson
It is normal for them to be there temporarily, but if they are more than a few minutes old, they were most likely left behind by multiple Nagios processes running at some point in the past.

It is safe to remove these:

Code: Select all

rm -f /tmp/check*
If they build back up and don't get removed on their own, we need to investigate further, as it could be related to the following wiki article:
http://support.nagios.com/wiki/index.ph ... g_Orphaned
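Since files from the last few minutes may still be in use, it can help to list only the old ones before removing anything. This is a minimal sketch, assuming GNU `find` (for `-maxdepth` and `-mmin`); the function name `list_stale_checks` and the 10-minute threshold are arbitrary choices for illustration, not values from Nagios.

```shell
#!/bin/sh
# List (not delete) check result files older than a given number of minutes.
# Usage: list_stale_checks DIR MINUTES
# Assumption: GNU find; -maxdepth and -mmin are not POSIX.
list_stale_checks() {
    dir=${1:-/tmp}
    minutes=${2:-10}
    # -mmin +N matches files last modified more than N minutes ago.
    find "$dir" -maxdepth 1 -type f -name 'check*' -mmin "+$minutes" -print
}

# Show check* files in /tmp that are more than 10 minutes old.
list_stale_checks /tmp 10
```

If the list looks right, the same `find` expression can be given `-delete` in place of `-print`.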

Re: Hundreds of Active Check result files in /tmp

Posted: Fri Oct 05, 2012 9:33 am
by jbennett
When I make these changes, gradually, every single host & service comes back with a critical check.

Change it back and everything goes back to normal.

Is this to be expected at first?

I should note that I did manage to update to 3.3.

Re: Hundreds of Active Check result files in /tmp

Posted: Fri Oct 05, 2012 9:57 am
by yancy
jbennett,

Can you be more specific about what you're changing that makes the checks return critical?

Did you try the following?

Code: Select all

 rm -f /tmp/check* 

Code: Select all

  killall -9 nagios 
Then restart Nagios from the Admin menu of the web interface.
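Before and after the restart, it may be worth checking how many Nagios daemons are actually running. Nagios forks short-lived children to run checks, so a plain process count is misleading; the sketch below counts only `nagios` processes whose parent is PID 1. That is an assumption about how the daemon was started (classic init script), and the function name `count_nagios_daemons` is made up for illustration.

```shell
#!/bin/sh
# Count top-level nagios daemons from "pid ppid comm" lines on stdin.
# Assumption: a daemonized nagios has parent PID 1; forked check
# children have the daemon as their parent and are not counted.
count_nagios_daemons() {
    awk '$3 == "nagios" && $2 == 1 { n++ } END { print n + 0 }'
}

# The trailing "=" on each column name suppresses the ps header line.
ps -eo pid=,ppid=,comm= | count_nagios_daemons
```

A result greater than 1 would be consistent with the multiple-instance explanation given earlier in the thread.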

Regards,

-Yancy

Re: Hundreds of Active Check result files in /tmp

Posted: Fri Oct 05, 2012 12:56 pm
by jbennett
The changes I'm making are the ones from the linked wiki page. Per the above link:

Code: Select all

Check Services Being Orphaned
Some users have encountered large numbers of warning messages that accumulate quickly that read as follows:

Warning: The check of service <Your Service> on host <Your Host> looks like it was orphaned (results never came back). I'm scheduling an immediate check of the service..

This is most likely caused by multiple instances of Nagios running. To fix this kill all instances of Nagios and then restart the process.

 killall -9 nagios
Then restart Nagios from the Admin menu of the web interface.

Related forum post can be read here.


If the issue continues to persist after reboots and restarts of the Nagios service, then the issue is most likely caused by either a memory leak in embedded perl, or system ulimit restrictions. Symptoms can include the /tmp directory filling up quickly with check* files, and the following errors in the nagios log.

 [1331905537] Warning: The check of service 'SERVICE' on host 'NAMESERVER' looks like it was 
 orphaned (results never came back). I'm scheduling an immediate check of the service...
 [1331755699] Warning: The check of service 'SWAP' on host 'nameserver' could not be performed due 
 to a fork() error: 'Resource temporarily unavailable'. The check will be rescheduled. 
Try the following solutions:

Edit /etc/security/limits.conf

 * hard memlock 128     #locked memory
 * soft memlock 128
 * soft nofile 4096      #open files
 * hard nofile 4096
 * hard nproc 4096     #max user processes
 * soft nproc 4096
 * hard stack 20480     #stack size
 * soft stack 20480
and restart the server. Run

 ulimit -a 
to verify that the new settings are in place.


And also update the settings in your nagios.cfg file to match the following:

 enable_embedded_perl=0
 use_embedded_perl_implicitly=0
I tried making the changes and restarting Nagios from the command line (service nagios stop / killall -9 nagios / service nagios start), and I also tried a full reboot of the server. Either way, I still end up with hundreds of hosts and services showing critical (for instance, ping times were exceptionally long, with over 50% packet loss).

Yes, I did clear out all of the orphaned checks per the suggestion.
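To confirm that the two embedded Perl settings quoted above are really present in the config file Nagios reads, something like the following could be used. It is a minimal sketch; the function name `show_epn_settings` is made up, and the path in the usage note is only the usual default location, so adjust it if this install differs.

```shell
#!/bin/sh
# Print the active (uncommented) embedded Perl settings from a nagios.cfg.
# Usage: show_epn_settings /path/to/nagios.cfg
show_epn_settings() {
    # Lines starting with '#' do not match, so commented-out settings are skipped.
    grep -E '^[[:space:]]*(enable_embedded_perl|use_embedded_perl_implicitly)[[:space:]]*=' "$1"
}
```

For example, `show_epn_settings /usr/local/nagios/etc/nagios.cfg` should print both settings with a value of 0 if the wiki changes were applied.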

Re: Hundreds of Active Check result files in /tmp

Posted: Fri Oct 05, 2012 2:33 pm
by mguthrie
How large is your install? (how many hosts + services?)

Also try increasing how long the init script waits for Nagios during a restart by updating the following line in /etc/init.d/nagios (line 157).

Change:

Code: Select all

                for i in 1 2 3 4 5 6 7 8 9 10 ; do
To:

Code: Select all

                for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 ; do
This will give the nagios process plenty of time to cleanly close off any remaining checks. If you have a lot of checks that take a long time to complete, you could potentially have some temp files leftover upon a restart of the process.
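The idea behind that loop can be sketched as a bounded wait: poll once per second and stop as soon as the process is gone or the limit is reached. This is not the actual init script code, only an illustration of the technique; the function name `wait_for_exit` is made up, and it polls with `kill -0` rather than whatever status check the real script uses.

```shell
#!/bin/sh
# Wait up to TIMEOUT seconds for process PID to exit.
# Usage: wait_for_exit PID TIMEOUT
# Returns 0 if the process is gone, non-zero if it is still running.
# Caveat: kill -0 also fails when we lack permission to signal the
# process, so run this as root or as the process owner.
wait_for_exit() {
    pid=$1
    timeout=$2
    i=0
    while [ "$i" -lt "$timeout" ]; do
        # kill -0 sends no signal; it only tests whether the PID exists.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        i=$((i + 1))
    done
    # Final check after the last sleep.
    ! kill -0 "$pid" 2>/dev/null
}
```

With a limit of 30, this corresponds to the longer `for` list above: up to 30 one-second polls instead of 10.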

Re: Hundreds of Active Check result files in /tmp

Posted: Mon Oct 08, 2012 9:06 am
by jbennett

Code: Select all

 Monitoring Performance
Service Check Execution Time:	0.00 / 12.72 / 1.457 sec
Service Check Latency:	0.00 / 9036.63 / 3097.635 sec
Host Check Execution Time:	0.00 / 12.39 / 1.553 sec
Host Check Latency:	0.00 / 9035.63 / 3370.838 sec
# Active Host / Service Checks:	2867 / 5613
# Passive Host / Service Checks:	0 / 1
This is immediately after rebooting the server this morning.

I came in to find about 300 hosts down with packet loss on ping. From my desktop, I'm able to ping them without issue. Once I reboot the system, the problem seems to resolve, for a little while anyway.

Re: Hundreds of Active Check result files in /tmp

Posted: Mon Oct 08, 2012 9:51 am
by mguthrie
Are you using DNX or Mod Gearman on your install?

Re: Hundreds of Active Check result files in /tmp

Posted: Mon Oct 08, 2012 10:57 am
by jbennett
I am not using a distributed setup. I have checked the plugin page, and do not see any plugins for either DNX or Gearman installed.

I took over this install from a previous admin, who did not leave any documentation or other information about how the system was set up, so I am having to figure it out as I go.

Re: Hundreds of Active Check result files in /tmp

Posted: Mon Oct 08, 2012 11:58 am
by lmiltchev
Can you post your nagios.cfg file? Thank you!