check_disk not timing out!
Hey all. I'm having some weird issues with the check_disk plugin. When it runs, it hangs and never exits; I just had to kill over 100 instances of it on a server, where they were causing a very high load. After exploring further, I found that even a plain df on the CLI hung and never finished, so something is definitely wrong with the server. My question, though, is: why is check_disk not timing out? I have -t 240 on the command line. Instead, check_by_ssh times out after 270 seconds and leaves check_disk running forever on the remote host.
2 of XI5.6.14 Prod/DR/DEV - Nagios LogServer 2 Nodes
See my projects on the Exchange at BanditBBS - Also check out my Nagios stuff on my personal page at Bandit's Home and at github
jdalrymple
- Skynet Drone
- Posts: 2620
- Joined: Wed Feb 11, 2015 1:56 pm
Re: check_disk not timing out!
Sounds like a stale NFS mount to me.
Re: check_disk not timing out!
jdalrymple wrote: Sounds like a stale NFS mount to me.
Oh, I completely agree, and I'm having it resolved. My question remains, though: is there a reason the check_disk plugin isn't timing out and closing like it should, and is instead staying open forever until I kill the process?
jdalrymple
Re: check_disk not timing out!
I've run into very few pieces of software in this world that *CAN* properly handle stale NFS mounts. I mean, if df or ls can't detect and bypass a stale mount, what will?
That said - it doesn't appear that there is an amazingly reliable cross-platform way of doing just that with stdio.h so that's probably why most of those binaries (including check_disk) are still broken.
http://stackoverflow.com/questions/1643 ... -nfs-mount
If it wasn't "why is check_disk not timing out!" it would be "Why won't nagios-plugins compile on my machine?"
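For what it's worth, one common workaround is to push the stat into a throwaway child process and stop waiting for it after a deadline. Below is a minimal Python sketch of that idea, not anything taken from check_disk itself; the function name probe_path and the 5-second default are made up for illustration.

```python
import subprocess
import sys


def probe_path(path, timeout_s=5.0):
    """Stat `path` in a child process; return 'ok', 'error', or 'timeout'."""
    # The stat runs in a separate interpreter, so a syscall that blocks
    # on a stale mount hangs the child rather than the caller.
    cmd = [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path]
    proc = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    try:
        rc = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # Best effort only: a child stuck in uninterruptible sleep may not
        # die right away, so we deliberately do not wait for it again.
        proc.kill()
        return "timeout"
    return "ok" if rc == 0 else "error"


if __name__ == "__main__":
    target = sys.argv[1] if len(sys.argv) > 1 else "."
    print(target, probe_path(target))
```

The caller gets an answer within the deadline even if the child itself stays stuck until the mount recovers, which is the trade-off this approach accepts.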
Re: check_disk not timing out!
Got a method for reproducing it?
Or could we set up a remote so I could rebuild check_disk a few times to find out where it is failing?
If you run it with "-vvv" where does it hang?
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
Re: check_disk not timing out!
It hangs here on a stale NFS:
Code: Select all
calling stat on /stage
You could recreate it by simply creating an NFS mount and then making it go stale somehow. I do have a server that it's happening on right now, but it is a PROD server, so I would rather not compile on it. But if you supply multiple 64-bit compiled versions, I would be willing to run them on the server.
Re: check_disk not timing out!
You may want to try adding one of these options to your check_disk command.
Code: Select all
-l, --local
   Only check local filesystems
-L, --stat-remote-fs
   Only check local filesystems against thresholds. Yet call stat on remote filesystems
   to test if they are accessible (e.g. to detect Stale NFS Handles)
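To illustrate the local/remote distinction these options rely on, here is a small Python sketch that splits mount points by filesystem type from /proc/mounts-style text. The REMOTE_FS_TYPES set and the split_mounts name are assumptions for illustration only; check_disk makes this decision in its own way.

```python
# Illustrative set of network filesystem types (an assumption, not
# the list check_disk uses internally).
REMOTE_FS_TYPES = {"nfs", "nfs4", "cifs", "smbfs"}


def split_mounts(mounts_text):
    """Return (local, remote) lists of mount points from /proc/mounts-style text."""
    local, remote = [], []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        mount_point, fs_type = fields[1], fields[2]
        if fs_type in REMOTE_FS_TYPES:
            remote.append(mount_point)
        else:
            local.append(mount_point)
    return local, remote


if __name__ == "__main__":
    # Linux-only: /proc/mounts lists device, mount point, and type per line.
    with open("/proc/mounts") as f:
        local, remote = split_mounts(f.read())
    print("local:", local)
    print("remote:", remote)
```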
Be sure to check out our Knowledgebase for helpful articles and solutions!
Re: check_disk not timing out!
Alright. Well, I want to know if this is actually the fault of stat, or something right after it. Attached is a bin with just a couple of lines added to let us know if the stats finish. If they do not, it should error out. If anything, the alarm should catch it. But if there is an issue with the fsusage.h or stat.h files, then we may find the issue with this bin. Do understand that you need to run the plugin with -vvv to generate the extra debug output, and you will need to run it while your share is stale (and the standard checks are freezing).
Re: check_disk not timing out!
Andy,
It still froze on the same line and is not proceeding past it.
tgriep,
We want to monitor the NFS mounts so -l and -L are not options for us unfortunately.
Re: check_disk not timing out!
Then stat() is definitely failing. I thought it was odd that the alarm was not triggering on timeout... well, there was no alarm.
So I added one. Can you test this new bin to see if it times out?
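For readers without access to the attachment, the general shape of an alarm-guarded stat looks roughly like this. It is a Python sketch of the technique only, not the actual change made to check_disk; stat_with_alarm and StatTimeout are hypothetical names. Whether SIGALRM can actually interrupt a stat blocked on a stale mount depends on the kernel and on the NFS mount options, so treat that as an assumption to test rather than a guarantee.

```python
import os
import signal


class StatTimeout(Exception):
    """Raised when the alarm fires before stat returns."""


def _on_alarm(signum, frame):
    raise StatTimeout()


def stat_with_alarm(path, timeout_s=10):
    """Call os.stat(path) with a SIGALRM deadline (Unix, main thread only)."""
    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(timeout_s)  # whole seconds; arms the timer
    try:
        return os.stat(path)
    finally:
        signal.alarm(0)  # disarm the timer on every exit path
        signal.signal(signal.SIGALRM, previous)  # restore the old handler


if __name__ == "__main__":
    try:
        print(stat_with_alarm(".", timeout_s=5))
    except StatTimeout:
        print("stat timed out")
```

If the alarm never fires on a truly stale mount, running the stat in a separate process and abandoning it after a deadline is the usual fallback.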