Disk checks

Post by **BanditBBS** » Fri Dec 11, 2015 3:27 pm

Ok, so currently we do one disk check per server and use the -e option to only show errors. Sometimes it can take weeks for a space issue to get fixed, and it stays in a warning or critical state that entire time. If I turn on volatile, it would log all non-OK states but also send a notification for every one. I swear there was also an option to only send an alert if the text of the plugin output changed (i.e. a new drive passed a threshold).

So, I'm being dumb right now, but for which reason? lol
1.) I don't know what option to enable
2.) No such option exists

Post by **ssax** » Mon Dec 14, 2015 10:11 am

I believe you may be referring to state stalking:

https://assets.nagios.com/downloads/nag ... lking.html

Let me know if that isn't what you're looking for.

Post by **BanditBBS** » Mon Dec 14, 2015 11:21 am

What I'm looking for is a mixture of Stalking and Volatile. I ideally want Volatile, but notifications only on output change.

i.e. this:

1.) 9:00 - Warning - /usr 90% (Send notification and log it)
2.) 9:05 - Warning - /usr 90% (nothing logged and no notifications)
3.) 9:10 - Warning - /usr 90% - Warning - /var 94% (Send notification and log it!)

Stalking would have just logged 1 and 3 and sent notification for 1, right?
Volatile would log all 3 and send out notifications for all three, right?

So neither is fully what I want, at least from what I read on them. If there is no option to do what I want, my only other option is to convert 1000 disk checks into 11,000 disk checks and well...ugh, lol
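The behaviour asked for above (alert once per newly affected filesystem, stay quiet while the output is unchanged) can be modelled outside Nagios in a few lines. This is only an illustrative sketch, not how Nagios implements stalking or volatile services. The `/mount NN%` output format is taken from the three example lines above, and the names `MOUNT_RE`, `mounts_in_output` and `new_mounts` are made up for the illustration.

```python
import re

# Assumed output format, copied from the example above: fragments like "/usr 90%".
MOUNT_RE = re.compile(r"(/\S*)\s+\d+%")


def mounts_in_output(output):
    """Return the set of mount points mentioned in one plugin output line."""
    return set(MOUNT_RE.findall(output))


def new_mounts(previous_output, current_output):
    """Return mounts present in the current output but not in the previous one."""
    return sorted(mounts_in_output(current_output) - mounts_in_output(previous_output))


# The three check results from the example above.
checks = [
    ("9:00", "Warning - /usr 90%"),
    ("9:05", "Warning - /usr 90%"),
    ("9:10", "Warning - /usr 90% - Warning - /var 94%"),
]

previous = ""
notified = []
for time, output in checks:
    added = new_mounts(previous, output)
    if added:  # only alert when a filesystem newly appears in the output
        notified.append((time, added))
    previous = output

print(notified)
```

Comparing the set of mount points rather than the raw text means a change from 90% to 91% on the same filesystem would not trigger a new alert; whether that is desirable depends on how the tickets are handled.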

Post by **WillemDH** » Mon Dec 14, 2015 12:05 pm

I'm having similar issues for multiple volumes on our NetApp, but slightly different as we acknowledge a problem once it arrives in open service problems. When "1.) 9:00 - Warning - /usr 90% (Send notification and log it)" happens, someone will acknowledge the problem. Am I understanding right that you don't ack the problem? Needing to acknowledge the problem (and create a support ticket) adds an extra layer of complexity. Making the ack non-sticky only helps to send an email, but as the problem is still acknowledged, a new support ticket is not made, which potentially causes issues.
For Windows and Linux servers I'm creating a service for each disk as I need to monitor disk load too, so I'm not having this issue there. This setup of course needs a lot more services (as you said), but it also requires some mechanism that checks at a regular interval (e.g. once a week) whether new disks were added to a server and adds the service to Nagios if so.

Post by **BanditBBS** » Mon Dec 14, 2015 1:39 pm

Willem,

We ACK also and our notification interval is set to 0 so only the one alert is sent. We open a task in our helpdesk app for every notification. My end goal here is to open a task when /usr crosses a threshold and then another when /var does, even if /usr is still in a bad state. It is looking more and more like I'll be adding 11,000 new checks.

EDIT: Now to figure out the best way to add 11,000 services...hmmmm...suggestions? Can I just use bulk cloning and add the services to already existing hosts?
Edit #2: DUH, the new API!!!

Post by **WillemDH** » Mon Dec 14, 2015 4:27 pm

I'm still using bulk host cloning for most of my bulk edits. It's proven and very fast, you just need to make a weekly inventory of all your Nagios hosts and export to Excel with a column for each type of hostgroup (os, location, infrastructure and role), filter the excel based on whatever you need. Paste in Bulk Host cloning. Select service, next, next finish and done. I hope your systems are up for the extra load! Grtz

Post by **BanditBBS** » Mon Dec 14, 2015 4:40 pm

WillemDH wrote:I'm still using bulk host cloning for most of my bulk edits. It's proven and very fast, you just need to make a weekly inventory of all your Nagios hosts and export to Excel with a column for each type of hostgroup (os, location, infrastructure and role), filter the excel based on whatever you need. Paste in Bulk Host cloning. Select service, next, next finish and done. I hope your systems are up for the extra load! Grtz

Willem - So how do you go about filling in those other fields every time you do your weekly export?

Edit: Nevermind, it wouldn't work for me as every 10 or so would be using a different service template and belong to a different servicegroup. That's what makes the API such a nice thing for me. I'm building a complex-as-ever spreadsheet where I just have to paste in the hostnames and bam...it'll spit out the 11,000 curl statements, then I just have to do cut-n-paste.

And yeah, I pray it can handle the load too....going from ~19000 services to ~30000 with this change.
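For anyone who would rather script the curl generation than keep it in a spreadsheet, here is a rough sketch in Python. Treat every specific as an assumption to check against your own Nagios XI API documentation: the `/nagiosxi/api/v1/config/service` path, the `apikey` and `pretty` query parameters, and the `host_name`, `service_description`, `use`, `servicegroups` and `force` fields. The host names, template names, servicegroup names and API key below are placeholders, not values from this thread.

```python
import shlex
from urllib.parse import urlencode


def build_curl(base_url, api_key, host, mount, template, servicegroup):
    """Build one curl command string that would create a per-mount disk service."""
    # Assumed endpoint and query parameters; verify against your XI version.
    url = f"{base_url}/nagiosxi/api/v1/config/service?" + urlencode(
        {"apikey": api_key, "pretty": 1}
    )
    # Assumed field names; the template is expected to supply the check command.
    fields = urlencode(
        {
            "host_name": host,
            "service_description": f"Disk {mount}",
            "use": template,
            "servicegroups": servicegroup,
            "force": 1,
        }
    )
    # shlex.quote keeps '&' and '?' from being interpreted by the shell.
    return f"curl -XPOST {shlex.quote(url)} -d {shlex.quote(fields)}"


# Placeholder inventory: (host, service template, servicegroup).
hosts = [
    ("web01", "disk-template-a", "disk-group-a"),
    ("db01", "disk-template-b", "disk-group-b"),
]
mounts = ["/", "/usr", "/var"]

commands = [
    build_curl("http://nagios.example.com", "YOURKEY", host, mount, template, group)
    for host, template, group in hosts
    for mount in mounts
]

for command in commands:
    print(command)
```

The commands are only printed, never executed, so the list can be reviewed (or trimmed to a handful of test hosts) before anything touches the config.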

Post by **WillemDH** » Mon Dec 14, 2015 5:06 pm

Haa, a little PowerShell script I wrote making use of the PSExcel module (http://ramblingcookiemonster.github.io/PSExcel-Intro/). Just get all hosts in a hostgroup with an Invoke-WebRequest JSON query. Then loop through all hosts and retrieve the hostgroups for each host. Put everything in a custom PowerShell object, e.g. $DemoData, and then

Code:

$DemoData | Export-XLSX -Path C:\temp\Demo.xlsx

Sry, I hope the above makes sense to you. I can't publish this on the Exchange as it would only work when your hostgroups are named exactly the same etc... It also contains some sensitive information and needs a big cleanup.

But I can highly advise creating something similar. If you (or a colleague) have a bit of PowerShell knowledge you should be able to get it working. Let me know where you are stuck if you ever give it a try. Every week I try to expand the script a bit. Nowadays it will compare the host's hostgroups with the actual os, location, infrastructure and role and update the hostgroups with Reactor if needed (https://github.com/willemdh/naf_nagios_ ... hostgroups)
And I'm halfway to also retrieving all kinds of info about the server, e.g. contact and address, serial number etc., and putting it in the free variables with https://github.com/willemdh/naf_nagios_ ... st_freevar
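To give an idea of the shape of such an inventory without the actual script, here is a small sketch, written in Python rather than PowerShell and writing CSV (which Excel opens) rather than XLSX. It assumes the host-to-hostgroup mapping has already been fetched, and it assumes a purely hypothetical naming convention where each hostgroup is called `<type>-<value>`; the real script depends on its own hostgroup names. All host and group names in the demo data are made up.

```python
import csv
import io

# Hypothetical convention: hostgroups are named "<type>-<value>", e.g. "os-linux".
COLUMNS = ["os", "location", "infrastructure", "role"]


def inventory_rows(host_to_groups):
    """Turn {host: [hostgroup, ...]} into one row per host, one column per type."""
    rows = []
    for host in sorted(host_to_groups):
        row = {"host": host, **{col: "" for col in COLUMNS}}
        for group in host_to_groups[host]:
            kind, _, value = group.partition("-")
            if kind in COLUMNS and value:  # ignore groups outside the convention
                row[kind] = value
        rows.append(row)
    return rows


def to_csv(rows):
    """Render the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["host"] + COLUMNS, lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up demo data standing in for the result of the JSON queries.
demo = {
    "web01": ["os-linux", "location-brussels", "role-web"],
    "db01": ["os-windows", "infrastructure-vmware", "all-servers"],
}

rows = inventory_rows(demo)
print(to_csv(rows))
```

With one column per hostgroup type, filtering in Excel before pasting into bulk cloning works the same way as described earlier in the thread.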

Grtz

Post by **BanditBBS** » Mon Dec 14, 2015 5:10 pm

One of these days I'll have to do a remote join.me session with you and show you my setup (and maybe pass presenter to you so you can do the same, if you can). That way we both might get good ideas.

Post by **WillemDH** » Mon Dec 14, 2015 5:13 pm

Well, I think that's a good idea. Let's try to do this somewhere mid January, as I have a two week holiday starting Friday.

Nagios Support Forum
