Could not stat() check result file

paulds
Posts: 8
Joined: Thu Aug 01, 2013 7:30 am

Could not stat() check result file

Post by paulds »

Running Nagios 3.5.0 on EL6, using packages from the EPEL repo. We monitor a few hundred hosts and services. Every few days we get a sporadic warning in our nagios.log like the following:

[1375153859] Warning: Could not stat() check result file '/var/log/nagios/spool/checkresults/cpkYQTf'.
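Since the log prefix is a Unix epoch timestamp, converting it makes it easier to line the warnings up with restarts, cron jobs, or load spikes. The following is only a rough sketch: the function name stat_warnings is made up for illustration, the log path in the usage comment is an assumption based on the spool location above, and it relies on GNU date (as shipped on EL6).

```shell
# Print each "Could not stat()" warning with a readable UTC timestamp.
# Usage (log path is an assumption): stat_warnings /var/log/nagios/nagios.log
stat_warnings() {
    grep -F 'Could not stat() check result file' "$1" |
    while IFS= read -r line; do
        epoch=${line#\[}        # drop the leading "["
        epoch=${epoch%%\]*}     # keep only the digits before the first "]"
        # GNU date: "-d @N" interprets N as seconds since the epoch
        printf '%s %s\n' "$(date -u -d "@$epoch" '+%Y-%m-%d %H:%M:%S')" "${line#*\] }"
    done
}
```

If the converted times cluster around a particular hour or around service restarts, that narrows down where to look next.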

Aside from these occasional warnings, everything appears to be functioning correctly. We've been trying to track down the cause of these warnings, with no success. FWIW, at any given time there appear to be on the order of a dozen or so check result files in that directory, all of them with very recent timestamps (less than 1 min), and they appear to be created and removed fairly rapidly.

Can anyone shed any light on the possible cause of these warnings, or suggest any diagnostic steps? Is there any reason to be concerned about this, or are these warnings utterly trivial and harmless?

Of possible significance is that this system was upgraded a while back from EL5, which had Nagios 2.12 in EPEL, and all of the old Nagios config files were migrated and then manually tweaked to get things working after the upgrade, since apparently a lot had changed. There was a lot of trial-and-error involved, but we eventually got everything working correctly, with only these sporadic warnings remaining. Just mentioning this in case it may shed any light on what may be going on here.
yancy
Posts: 523
Joined: Thu Oct 06, 2011 10:12 am

Re: Could not stat() check result file

Post by yancy »

paulds,

Is it always the same file (cpkYQTf)?

Can you check the permissions of the file in question?

-Yancy
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Could not stat() check result file

Post by abrist »

Could you check the following permissions:

Code: Select all

ls -lad /usr/local/nagios/var/spool/checkresults/
ls -la /usr/local/nagios/var/spool/checkresults/
Additionally, do you notice any open file limits errors in /var/log/messages ?

Code: Select all

grep "open files" /var/log/messages
Also check the groups:

Code: Select all

grep nag /etc/group
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
paulds
Posts: 8
Joined: Thu Aug 01, 2013 7:30 am

Re: Could not stat() check result file

Post by paulds »

yancy, No, it's a different random filename each time.

I can't report on the permissions of the file in question, because by the time we check, it is already gone. I doubt it is a permissions issue, however, since if that's what it was, I'd expect we'd be getting hundreds of these warnings every minute, instead of one every couple of days.
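One possible way to catch the permissions of such short-lived files is to snapshot the directory listing on a tight loop and look up the filename after the next warning appears. This is a sketch only: the function name snapshot_spool and the snapshot log path are hypothetical, and a one-second interval may still miss files that live for less than that.

```shell
# Append a timestamped "ls -la" of a directory to a log file.
# $1 = directory to watch, $2 = log file to append to
snapshot_spool() {
    now=$(date +%s)                       # epoch seconds, same format as nagios.log
    ls -la "$1" 2>/dev/null | sed "s/^/$now /" >> "$2"
}

# Example loop (paths are assumptions; stop it with Ctrl-C):
# while sleep 1; do
#     snapshot_spool /var/log/nagios/spool/checkresults /tmp/checkresults-snapshots.log
# done
```

After a warning shows up, grep the snapshot log for the filename it mentions to see the owner, mode, and size the file had shortly before it vanished.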

$ ls -lad /var/log/nagios/spool/checkresults/
drwxr-x--- 2 nagios nagios 20480 Aug 5 09:17 /var/log/nagios/spool/checkresults/

$ ls -la /var/log/nagios/spool/checkresults/
total 48
drwxr-x--- 2 nagios nagios 20480 Aug 5 09:15 .
drwxr-x--- 3 nagios nagios 4096 Apr 24 21:05 ..
-rw------- 1 nagios nagios 436 Aug 5 09:15 c0kUn3T
-rw------- 1 nagios nagios 0 Aug 5 09:15 c0kUn3T.ok
-rw------- 1 nagios nagios 445 Aug 5 09:15 c92Z7l6
-rw------- 1 nagios nagios 0 Aug 5 09:15 c92Z7l6.ok
-rw------- 1 nagios nagios 469 Aug 5 09:15 cMmGQdO
-rw------- 1 nagios nagios 0 Aug 5 09:15 cMmGQdO.ok
-rw------- 1 nagios nagios 445 Aug 5 09:15 cPGw7Ai
-rw------- 1 nagios nagios 0 Aug 5 09:15 cPGw7Ai.ok
-rw------- 1 nagios nagios 443 Aug 5 09:15 cQ6rcQu
-rw------- 1 nagios nagios 0 Aug 5 09:15 cQ6rcQu.ok
-rw------- 1 nagios nagios 403 Aug 5 09:15 cV8GlRH
-rw------- 1 nagios nagios 0 Aug 5 09:15 cV8GlRH.ok

$ grep "open files" /var/log/messages*
$
[i.e., no results]

Nagios runs as user "nagios", which is a member of group "nagios".

(Sorry for the delayed response. I expected to be notified when someone replied to my post, but I guess the forum doesn't do that. I'll try to remember to check back here at least daily.)
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Could not stat() check result file

Post by scottwilkerson »

If I had to guess, I would say this is caused by more than one nagios binary running at the same time.

Code: Select all

ps -ef|grep bin/nagios

Code: Select all

service nagios stop
killall -9 nagios
service nagios start
Former Nagios employee
Creator:
ahumandesign.com
enneagrams.com
paulds
Posts: 8
Joined: Thu Aug 01, 2013 7:30 am

Re: Could not stat() check result file

Post by paulds »

(Oh hey, it sent me an e-mail notification this time! Cool.)

scottwilkerson, Nope, only one instance running.

$ ps -ef|grep bin/nagios
nagios 24121 1 0 Jul22 ? 00:11:58 /usr/sbin/nagios -d /etc/nagios/nagios.cfg
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Could not stat() check result file

Post by scottwilkerson »

paulds wrote: Can anyone shed any light on the possible cause of these warnings, or suggest any diagnostic steps? Is there any reason to be concerned about this, or are these warnings utterly trivial and harmless?
Basically this error is saying that the file disappeared (it was processed by another thread) before it could be read. It really isn't much to be concerned about if you have verified that you do not have multiple instances running (which would be the most common cause of this).
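To illustrate the sequence described here (a name is seen in the directory, something else removes the file, and the later stat fails), a scratch-directory sketch follows. It is not how Nagios itself is implemented, just a demonstration of the race; the function name and the filename cABC123 are made up.

```shell
# Reproduce the "listed, then gone" race in a throwaway directory.
demo_vanished_file() {
    dir=$(mktemp -d)
    : > "$dir/cABC123"                 # stand-in for a check result file
    listing=$(ls "$dir")               # step 1: the reader lists the directory
    rm -f "$dir/cABC123"               # step 2: something else consumes the file
    for name in $listing; do           # step 3: the reader tries to stat each name
        if stat "$dir/$name" >/dev/null 2>&1; then
            echo "stat ok: $name"
        else
            echo "could not stat: $name"
        fi
    done
    rm -rf "$dir"
}
```

Running demo_vanished_file prints "could not stat: cABC123", which is the same kind of failure the warning reports.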
paulds
Posts: 8
Joined: Thu Aug 01, 2013 7:30 am

Re: Could not stat() check result file

Post by paulds »

Thanks for the additional details, Scott.

I really don't like ignoring mysteries like this, as they're often a subtle indication that something else is wrong.
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Could not stat() check result file

Post by abrist »

Do you see any "defunct" nagios processes that hang around for longer than a few moments?

Code: Select all

watch "ps -aef| grep defunct"
You may want to run the commands Scott suggested anyway:

Code: Select all

service nagios stop
killall -9 nagios
service nagios start
paulds
Posts: 8
Joined: Thu Aug 01, 2013 7:30 am

Re: Could not stat() check result file

Post by paulds »

abrist, Ok, running that via watch -n1, and I am seeing one or more defunct nagios processes periodically, maybe every 5-10 seconds (it's irregular). Occasionally there are a bunch all at once, maybe 8-10 of them. But none ever linger for more than a second.
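For anyone who would rather log a number than eyeball watch output, something like the following could count the defunct processes per sample. It is a sketch under assumptions: the helper name count_zombies is hypothetical, and it expects "state command" pairs such as procps prints for "ps -eo stat=,comm=".

```shell
# Count zombie (state Z) processes with a given command name.
# Reads "STAT COMMAND ..." lines on stdin; $1 = command name to match.
count_zombies() {
    awk -v name="$1" '$1 ~ /^Z/ && $2 == name { n++ } END { print n + 0 }'
}

# Example (one sample per second, prefixed with the epoch time):
# while sleep 1; do
#     echo "$(date +%s) $(ps -eo stat=,comm= | count_zombies nagios)"
# done
```

A count that drops back to 0 within a second or so matches what I described above; a count that stays above 0 across many samples would be the lingering case abrist asked about.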

I'll restart the service just for the heck of it, but I really don't think this is going to be very helpful, since we actually restart it probably once a week on average already, since we do a restart whenever we add/change/remove any systems or services. And these warnings have been ongoing for many months.