check_disk not timing out!
Posted: Tue May 19, 2015 9:37 am
by BanditBBS
Hey all. Having some weird issues here with the check_disk plugin. When it runs, it hangs and never exits; I just had to kill over 100 instances of it on a server, where they were causing a very high load. After further exploring, I found that even a plain df on the CLI hung and never finished, so something is definitely wrong with the server. My issue, though, is: why is check_disk not timing out? I have -t 240 on the command line. Instead, check_by_ssh times out after 270 seconds and leaves check_disk running forever on the remote host.
Re: check_disk not timing out!
Posted: Tue May 19, 2015 10:28 am
by jdalrymple
Sounds like a stale NFS mount to me.
Re: check_disk not timing out!
Posted: Tue May 19, 2015 10:31 am
by BanditBBS
jdalrymple wrote:Sounds like a stale NFS mount to me.
Oh, I completely agree, and I am having it resolved. My question remains, though: is there a reason the check_disk plugin isn't timing out and closing like it should, and is instead staying open forever until I kill the process?
Re: check_disk not timing out!
Posted: Tue May 19, 2015 10:43 am
by jdalrymple
I've run into very few pieces of software in this world that *CAN* properly handle stale NFS mounts. I mean, if df or ls can't detect and bypass a stale mount, what will?
That said - it doesn't appear that there is an amazingly reliable cross-platform way of doing just that with stdio.h so that's probably why most of those binaries (including check_disk) are still broken.
http://stackoverflow.com/questions/1643 ... -nfs-mount
If it wasn't "why is check_disk not timing out!" it would be "Why won't nagios-plugins compile on my machine?"
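For what it's worth, one workaround for the stale-mount problem described above is to do the stat() in a throwaway child process and only wait on it for a limited time, so the parent never blocks. This is a minimal sketch in Python, not anything check_disk does; `path_responds` and the 5-second default are made-up names and values for illustration.

```python
import subprocess
import sys


def path_responds(path, timeout_s=5.0):
    """Return True if a stat() of `path` completes in a child process within timeout_s.

    Hypothetical helper for illustration. A missing path also returns False,
    because the child exits with a non-zero status.
    """
    # The child does nothing but stat the path; if the mount is stale, only
    # the child blocks, not the caller.
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return child.wait(timeout=timeout_s) == 0
    except subprocess.TimeoutExpired:
        # Best effort: ask the kernel to kill the child, but do not wait for
        # it, since a process stuck on a dead mount may not go away promptly.
        child.kill()
        return False
```

Usage would look like `path_responds("/mnt/share")` before touching the mount. The trade-off is an extra process per check, and a stuck child may linger until the mount recovers.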
Re: check_disk not timing out!
Posted: Wed May 20, 2015 9:39 am
by abrist
Got a method for reproducing it?
Or could we set up a remote so I could rebuild check_disk a few times to find out where it is failing?
If you run it with "-vvv" where does it hang?
Re: check_disk not timing out!
Posted: Wed May 20, 2015 9:48 am
by BanditBBS
It hangs here on a stale NFS:
You could recreate it by simply creating an NFS mount and then making it go stale somehow. I do have a server that it's happening on right now, but it is a PROD server, so I would rather not compile on it. But if you supply multiple 64-bit compiled versions, I would be willing to run them on the server.
Re: check_disk not timing out!
Posted: Wed May 20, 2015 1:41 pm
by tgriep
You may want to try adding one of these options to your check_disk command.
 -l, --local
    Only check local filesystems
 -L, --stat-remote-fs
    Only check local filesystems against thresholds. Yet call stat on remote filesystems
    to test if they are accessible (e.g. to detect Stale NFS Handles)
Re: check_disk not timing out!
Posted: Wed May 20, 2015 1:56 pm
by abrist
Alright. Well, I want to know if this is actually the fault of stat, or something right after it. Attached is a bin with just a couple of lines added to let us know if the stats finish. If they do not, it should error out. If anything, the alarm should catch it. But if there is an issue with the fsusage.h or stat.h files, then we may find the issue with this bin. Do understand that you need to run the plugin with -vvv to generate the extra debug output, and you will need to run it while your share is stale (and the standard checks are freezing).
Re: check_disk not timing out!
Posted: Wed May 20, 2015 2:09 pm
by BanditBBS
Andy,
It still froze on the same line and did not proceed past it.
tgriep,
We want to monitor the NFS mounts so -l and -L are not options for us unfortunately.
Re: check_disk not timing out!
Posted: Wed May 20, 2015 2:54 pm
by abrist
Then stat() is definitely failing. I thought it was odd that the alarm was not triggering on timeout... well, there was no alarm.

So I added one. Can you test this new bin to see if it times out?
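For anyone following along, the idea of putting an alarm around a blocking call can be sketched roughly like this. It is a Python illustration of the general technique, not the actual patch to check_disk (which is C); `run_with_alarm` and `CheckTimeout` are hypothetical names. One caveat I'd assume applies: a call stuck in uninterruptible sleep on a hard-mounted NFS share may not be interrupted by the signal at all, so the alarm is not guaranteed to fire in time.

```python
import signal
import time


class CheckTimeout(Exception):
    """Raised when the wrapped call does not finish before the alarm fires."""


def _on_alarm(signum, frame):
    raise CheckTimeout()


def run_with_alarm(func, timeout_s, *args):
    """Call func(*args), raising CheckTimeout after timeout_s whole seconds.

    Hypothetical helper for illustration; must be called from the main thread
    on a Unix-like system, since it relies on SIGALRM.
    """
    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.alarm(timeout_s)  # arm the timer before the blocking call
    try:
        return func(*args)
    finally:
        signal.alarm(0)  # disarm the timer whether or not it fired
        signal.signal(signal.SIGALRM, previous)
```

Here `time.sleep` can stand in for the blocking stat: `run_with_alarm(time.sleep, 1, 5)` raises `CheckTimeout` after about one second, while a call that returns quickly passes its result through.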