check_disk not timing out!

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

check_disk not timing out!

Post by BanditBBS »

Hey all. I'm having some weird issues here with the check_disk plugin. When run, it hangs and never exits; I just had to kill over 100 instances of it on a server, where they were causing a very high load. After further exploring I found that even a plain df on the CLI hung and never finished, so something is definitely wrong with the server. My question, though, is: why is check_disk not timing out? I have -t 240 on the command line. Instead, check_by_ssh times out after 270 seconds and leaves check_disk running forever on the remote host.
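
For anyone landing here with the same pile of stuck checks, here is a minimal sketch, assuming a Linux host with /proc mounted, of how one might list check_disk processes together with their process state. A state of D (uninterruptible sleep) is typical of a process blocked on a dead mount. The helper names `parse_stat_line` and `find_processes` are made up for illustration and are not part of Nagios or the plugins.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    head, _, rest = line.partition("(")
    comm, _, tail = rest.rpartition(")")
    return int(head.strip()), comm, tail.split()[0]


def find_processes(name="check_disk", proc_root="/proc"):
    """List (pid, state) for every process whose comm matches name."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or entry was unreadable
        if comm == name:
            found.append((pid, state))
    return found
```

Calling `find_processes()` on the affected host should return one `(pid, state)` pair per running check_disk; what to do with the PIDs afterwards is left to the operator.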
2 of XI5.6.14 Prod/DR/DEV - Nagios LogServer 2 Nodes
See my projects on the Exchange at BanditBBS - Also check out my Nagios stuff on my personal page at Bandit's Home and at github
jdalrymple
Skynet Drone
Posts: 2620
Joined: Wed Feb 11, 2015 1:56 pm

Re: check_disk not timing out!

Post by jdalrymple »

Sounds like a stale NFS mount to me.
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Re: check_disk not timing out!

Post by BanditBBS »

jdalrymple wrote:Sounds like a stale NFS mount to me.
Oh, I completely agree, and I am having it resolved. My question remains, though: is there a reason the check_disk plugin isn't timing out and closing like it should, and instead stays open forever until I kill the process?
jdalrymple
Skynet Drone
Posts: 2620
Joined: Wed Feb 11, 2015 1:56 pm

Re: check_disk not timing out!

Post by jdalrymple »

I've run into very few pieces of software in this world that *CAN* properly handle stale NFS mounts. I mean, if df or ls can't detect and bypass a stale mount, what will?

That said - it doesn't appear that there is an amazingly reliable cross-platform way of doing just that with stdio.h so that's probably why most of those binaries (including check_disk) are still broken.

http://stackoverflow.com/questions/1643 ... -nfs-mount

If it wasn't "why is check_disk not timing out!" it would be "Why won't nagios-plugins compile on my machine?"
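
On the point about detecting a stale mount without hanging: one commonly suggested workaround is to do the stat in a throwaway child process and give up on it after a deadline, so the caller itself never blocks. Below is a minimal sketch of that idea. It is not how check_disk works, and `probe_path` and `_CHILD` are hypothetical names chosen for this example.

```python
import subprocess
import sys

# Tiny child program: stat the path given as argv[1]; exit 0 on success,
# 1 if stat raises (for example ENOENT or ESTALE).
_CHILD = (
    "import os, sys\n"
    "try:\n"
    "    os.stat(sys.argv[1])\n"
    "except OSError:\n"
    "    sys.exit(1)\n"
)


def probe_path(path, timeout=5.0):
    """Return 'ok', 'error', or 'timeout' without blocking the caller."""
    child = subprocess.Popen(
        [sys.executable, "-c", _CHILD, path],
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        rc = child.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Request a kill but deliberately do not wait for the child:
        # a process stuck in uninterruptible sleep may not exit promptly.
        child.kill()
        return "timeout"
    return "ok" if rc == 0 else "error"
```

The trade-off is that a hung child may linger until the mount recovers, but the monitoring side gets an answer within `timeout` seconds either way.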
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: check_disk not timing out!

Post by abrist »

Got a method for reproducing it?
Or could we set up a remote so I could rebuild check_disk a few times to find out where it is failing?
If you run it with "-vvv" where does it hang?
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Re: check_disk not timing out!

Post by BanditBBS »

It hangs here on a stale NFS:

calling stat on /stage
You could recreate it by simply creating an NFS mount and then making it go stale somehow. I do have a server that it's happening on right now, but it is a PROD server, so I would rather not compile on it. But if you supply multiple 64-bit compiled versions, I would be willing to run them on the server.
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: check_disk not timing out!

Post by tgriep »

You may want to try adding one of these options to your check_disk command.

 -l, --local
    Only check local filesystems
 -L, --stat-remote-fs
    Only check local filesystems against thresholds. Yet call stat on remote filesystems
    to test if they are accessible (e.g. to detect Stale NFS Handles)
Be sure to check out our Knowledgebase for helpful articles and solutions!
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: check_disk not timing out!

Post by abrist »

Alright. Well, I want to know if this is actually the fault of stat, or something right after it. Attached is a bin with just a couple of lines added to let us know if the stats finish. If they do not, it should error out. If anything, the alarm should catch it. But if there is an issue with the fsusage.h or stat.h files, then we may find the issue with this bin. Do understand that you need to run the plugin with -vvv to generate the extra debug output, and you will need to run it when your share is stale (and the standard checks are freezing).
BanditBBS
Posts: 2474
Joined: Tue May 31, 2011 12:57 pm
Location: Scio, OH

Re: check_disk not timing out!

Post by BanditBBS »

Andy,

It still froze on the same line and did not proceed past it.

tgriep,

We want to monitor the NFS mounts so -l and -L are not options for us unfortunately.
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: check_disk not timing out!

Post by abrist »

Then stat() is definitely failing. I thought it was odd that the alarm was not triggering on timeout. . .well, there was no alarm :P
So I added one. Can you test this new bin to see if it times out?
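
For readers without access to the attachment, the general "arm a timer before a blocking call" pattern can be sketched as follows. This is a Python analogue for illustration only, not the patch that was attached; `CheckTimeout` and `call_with_alarm` are hypothetical names. It assumes a Unix system and must run in the main thread. Whether the timer can actually interrupt a stat on a hard-mounted stale NFS share depends on the kernel and mount options, since a process in uninterruptible sleep may not see the signal until the call returns.

```python
import signal


class CheckTimeout(Exception):
    """Raised when the timer fires before the call returns."""


def _on_alarm(signum, frame):
    raise CheckTimeout()


def call_with_alarm(func, seconds, *args):
    """Run func(*args), raising CheckTimeout after `seconds` (Unix only)."""
    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)  # arm the timer
    try:
        return func(*args)
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel any pending timer
        signal.signal(signal.SIGALRM, previous)  # restore the old handler
```

A caller would wrap the risky call, for example `call_with_alarm(os.stat, 10, "/stage")`, and treat `CheckTimeout` as the signal to report the mount as unresponsive.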