Disk Health Checks

Post by **maxTim** » Thu Aug 05, 2021 11:35 pm

Hello, I'm pretty new to Nagios and am looking for a bit of advice.

I have a few servers with some old hardware and I'd like to check disk health*. I've installed smartmontools on my local machine to get to know it, but the output is pretty complex and I'm not really sure how to aggregate it. I'll be looking deeper into this tool, as it does seem quite comprehensive and informative.

I've looked at plugins such as check_smart_attributes and check_smartmon. One thing stuck out, however: smartctl requires sudo access. I'm not super stoked about giving sudo access to the nagios user or modifying the sudoers file. I'm also wondering which checks I should run and how to interpret the results. In my experience, just running a simple SMART health check and seeing 'passed' is not very proactive. In fact, I've seen failing drives 'pass' the SMART health check in BIOS or live rescue environments.
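
To make the "more than just 'passed'" idea concrete, here is a rough Python sketch of a check that reads the attribute table printed by `smartctl -A` and warns when certain raw values are non-zero. The watched attribute names, the "any non-zero raw value is a warning" rule, the function names, and the sample output are all assumptions for illustration only, not what either plugin above does.

```python
# Sketch only: flag SMART attributes whose raw value is non-zero.
# WATCHED and the "non-zero means WARNING" rule are illustrative assumptions.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable")


def parse_attributes(text):
    """Return {attribute_name: raw_value} from `smartctl -A` style table text."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row starts with "ID#" and is skipped by the isdigit() test.
        if len(fields) >= 10 and fields[0].isdigit():
            raw = fields[9]
            if raw.isdigit():
                attrs[fields[1]] = int(raw)
    return attrs


def evaluate(attrs):
    """Return (nagios_exit_code, message): 0 = OK, 1 = WARNING, 3 = UNKNOWN."""
    found = {name: attrs[name] for name in WATCHED if name in attrs}
    if not found:
        return 3, "UNKNOWN - none of the watched attributes were reported"
    bad = {name: value for name, value in found.items() if value > 0}
    if bad:
        details = ", ".join(f"{n}={v}" for n, v in sorted(bad.items()))
        return 1, "WARNING - " + details
    return 0, "OK - watched attributes are all zero"


# Made-up sample in the column layout that `smartctl -A` prints for ATA drives.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       21458
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""
```

Calling `evaluate(parse_attributes(SAMPLE))` gives an exit code and a status line in the shape Nagios expects from a plugin. The sketch deliberately parses text that has already been captured, so it does not address how smartctl gets the privileges it needs.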

Also, what if I'm running ZFS on Debian? I understand that it's a form of software RAID(?). Am I still just checking the individual disks in that case? And what about Windows hosts running NCPA?

Anyways, thanks for any advice.

*Note: Yes, I know - just proactively replace the drives. But that requires capital, which for some of these devices really isn't always available. These aren't mission-critical devices, but I would like to stay ahead of the curve should the drives begin to fail.