Could not stat() check result file
Running Nagios 3.5.0 on EL6, using packages from the EPEL repo. We monitor a few hundred hosts and services. Every few days we get a sporadic warning in our nagios.log like the following:
[1375153859] Warning: Could not stat() check result file '/var/log/nagios/spool/checkresults/cpkYQTf'.
Aside from these occasional warnings, everything appears to be functioning correctly. We've been trying to track down the cause of these warnings, with no success. FWIW, at any given time there appear to be on the order of a dozen or so check result files in that directory, all of them with very recent timestamps (less than 1 min), and they appear to be created and removed fairly rapidly.
Can anyone shed any light on the possible cause of these warnings, or suggest any diagnostic steps? Is there any reason to be concerned about this, or are these warnings utterly trivial and harmless?
Of possible significance is that this system was upgraded a while back from EL5, which had Nagios 2.12 in EPEL, and all of the old Nagios config files were migrated and then manually tweaked to get things working after the upgrade, since apparently a lot had changed. There was a lot of trial-and-error involved, but we eventually got everything working correctly, with only these sporadic warnings remaining. Just mentioning this in case it may shed any light on what may be going on here.
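As one possible diagnostic step, the warnings can be pulled out of nagios.log with readable timestamps so they can be lined up against cron jobs, restarts, or other scheduled activity. The following is a minimal sketch, assuming the log line format quoted above; the function name `parse_stat_warnings` and the script name in the usage comment are made up for illustration.

```python
import re
import sys
from datetime import datetime, timezone

# Matches the warning format quoted in this thread:
# [1375153859] Warning: Could not stat() check result file '/path/to/file'.
WARNING_RE = re.compile(
    r"^\[(\d+)\] Warning: Could not stat\(\) check result file '([^']+)'"
)


def parse_stat_warnings(lines):
    """Return (UTC datetime, path) pairs for each stat() warning in lines."""
    hits = []
    for line in lines:
        match = WARNING_RE.match(line.strip())
        if match:
            # The bracketed number is a Unix epoch timestamp in seconds.
            when = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
            hits.append((when, match.group(2)))
    return hits


if __name__ == "__main__":
    # Assumed usage (hypothetical script name):
    #   python3 stat_warnings.py /path/to/nagios.log
    for log_path in sys.argv[1:]:
        with open(log_path, errors="replace") as log:
            for when, path in parse_stat_warnings(log):
                print(when.isoformat(), path)
```

For the warning quoted above, the timestamp 1375153859 corresponds to 2013-07-30 03:10:59 UTC.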
Re: Could not stat() check result file
paulds,
Is it always the same file (cpkYQTf)?
Can you check the permissions of the file in question?
-Yancy
Re: Could not stat() check result file
Could you check the following permissions:
Code:
ls -lad /usr/local/nagios/var/spool/checkresults/
ls -la /usr/local/nagios/var/spool/checkresults/
Additionally, do you notice any open file limit errors in /var/log/messages?
Code:
grep "open files" /var/log/messages
Also check the groups:
Code:
grep nag /etc/group
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
Re: Could not stat() check result file
yancy, No, it's a different random filename each time.
I can't report on the permissions of the file in question, because by the time we check, it is already gone. I doubt it is a permissions issue, however, since if that's what it was, I'd expect we'd be getting hundreds of these warnings every minute, instead of one every couple of days.
$ ls -lad /var/log/nagios/spool/checkresults/
drwxr-x--- 2 nagios nagios 20480 Aug 5 09:17 /var/log/nagios/spool/checkresults/
$ ls -la /var/log/nagios/spool/checkresults/
total 48
drwxr-x--- 2 nagios nagios 20480 Aug 5 09:15 .
drwxr-x--- 3 nagios nagios 4096 Apr 24 21:05 ..
-rw------- 1 nagios nagios 436 Aug 5 09:15 c0kUn3T
-rw------- 1 nagios nagios 0 Aug 5 09:15 c0kUn3T.ok
-rw------- 1 nagios nagios 445 Aug 5 09:15 c92Z7l6
-rw------- 1 nagios nagios 0 Aug 5 09:15 c92Z7l6.ok
-rw------- 1 nagios nagios 469 Aug 5 09:15 cMmGQdO
-rw------- 1 nagios nagios 0 Aug 5 09:15 cMmGQdO.ok
-rw------- 1 nagios nagios 445 Aug 5 09:15 cPGw7Ai
-rw------- 1 nagios nagios 0 Aug 5 09:15 cPGw7Ai.ok
-rw------- 1 nagios nagios 443 Aug 5 09:15 cQ6rcQu
-rw------- 1 nagios nagios 0 Aug 5 09:15 cQ6rcQu.ok
-rw------- 1 nagios nagios 403 Aug 5 09:15 cV8GlRH
-rw------- 1 nagios nagios 0 Aug 5 09:15 cV8GlRH.ok
$ grep "open files" /var/log/messages*
$
[i.e., no results]
Nagios runs as user "nagios", which is a member of group "nagios".
(Sorry for the delayed response. I expected to be notified when someone replied to my post, but I guess the forum doesn't do that. I'll try to remember to check back here at least daily.)
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
Re: Could not stat() check result file
If I had to guess, I would say this is caused by more than one nagios binary running at the same time. Check with:
Code:
ps -ef|grep bin/nagios
If more than one is running, stop them all and start a single instance:
Code:
service nagios stop
killall -9 nagios
service nagios start
Re: Could not stat() check result file
(Oh hey, it sent me an e-mail notification this time! Cool.)
scottwilkerson, Nope, only one instance running.
$ ps -ef|grep bin/nagios
nagios 24121 1 0 Jul22 ? 00:11:58 /usr/sbin/nagios -d /etc/nagios/nagios.cfg
Re: Could not stat() check result file
paulds wrote: Can anyone shed any light on the possible cause of these warnings, or suggest any diagnostic steps? Is there any reason to be concerned about this, or are these warnings utterly trivial and harmless?
Basically, this error is saying that the file disappeared before it could be read (it was processed by another thread first). It really isn't much to be concerned about if you have verified that you do not have multiple instances running (which would be the most common cause of this).
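To illustrate the kind of race described here: a reader lists the spool directory, something else removes one of the files, and the reader's later stat() on that name fails. The sketch below is not Nagios code; it only reproduces the general list-then-stat pattern in a temporary directory, and the file names and function names are hypothetical.

```python
import os
import tempfile


def find_vanished(directory, names):
    """Return the names from an earlier listing that can no longer be stat()ed."""
    vanished = []
    for name in names:
        try:
            os.stat(os.path.join(directory, name))
        except FileNotFoundError:
            # The file existed when the directory was listed but was removed
            # before we got to it: the situation the warning describes.
            vanished.append(name)
    return vanished


def demo():
    """Simulate a second consumer removing a file between listing and stat()."""
    with tempfile.TemporaryDirectory() as spool:
        # Hypothetical file names, styled after the listing in this thread.
        for name in ("cAAAAAA", "cBBBBBB", "cCCCCCC"):
            with open(os.path.join(spool, name), "w") as handle:
                handle.write("placeholder\n")

        snapshot = sorted(os.listdir(spool))        # step 1: list the directory
        os.remove(os.path.join(spool, "cBBBBBB"))   # someone else removes a file
        return find_vanished(spool, snapshot)       # step 2: stat from stale list


if __name__ == "__main__":
    print(demo())
```

The point of the sketch is only that a stale directory listing is enough to produce a failed stat(); it says nothing about which process removed the file in the real system.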
Re: Could not stat() check result file
Thanks for the additional details, Scott.
I really don't like ignoring mysteries like this, as they're often a subtle indication that something else is wrong.
Re: Could not stat() check result file
Do you see any "defunct" nagios processes that hang around for longer than a few moments?
Code:
watch "ps -aef | grep defunct"
You may want to run the commands Scott suggested anyway:
Code:
service nagios stop
killall -9 nagios
service nagios start
Re: Could not stat() check result file
abrist, Ok, running that via watch -n1, and I am seeing one or more defunct nagios processes periodically, maybe every 5-10 seconds (it's irregular). Occasionally there are a bunch all at once, maybe 8-10 of them. But none ever linger for more than a second.
I'll restart the service just for the heck of it, but I really don't think it will help: we already restart it about once a week on average, whenever we add, change, or remove any systems or services. And these warnings have been ongoing for many months.
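Since watching by eye makes it hard to tell whether a defunct process ever lingers, the polling could be scripted so each sighting is logged with an epoch timestamp comparable to those in nagios.log. This is a rough sketch, assuming a procps-style `ps` that accepts `-eo pid,ppid,stat,comm`; the function names and the sample output in the usage note are illustrative, not taken from this system.

```python
import subprocess
import time


def find_defunct(ps_output):
    """Parse `ps -eo pid,ppid,stat,comm` text; return (pid, ppid, command) for zombies."""
    zombies = []
    for line in ps_output.splitlines():
        parts = line.split(None, 3)
        if len(parts) < 4 or not parts[0].isdigit():
            continue  # header row or malformed line
        pid, ppid, stat, command = parts
        if stat.startswith("Z"):  # process state Z means defunct (zombie)
            zombies.append((int(pid), int(ppid), command.strip()))
    return zombies


def snapshot():
    """Run ps once and return the defunct processes it reports."""
    result = subprocess.run(
        ["ps", "-eo", "pid,ppid,stat,comm"],
        capture_output=True, text=True, check=True,
    )
    return find_defunct(result.stdout)


def watch(interval=1.0, rounds=60):
    """Print an epoch-stamped line each time a poll sees defunct processes."""
    for _ in range(rounds):
        zombies = snapshot()
        if zombies:
            print(int(time.time()), zombies)
        time.sleep(interval)
```

Calling `watch()` polls once per second for a minute. A PID that shows up in many consecutive polls would be one that lingers; PIDs that appear once and vanish match the behavior described above.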