Good Afternoon,
This question is something of a follow-up to IT-OPS-SYS's question: "We are stuck on trying to parameterize this check so we don't have to create X copies of the different check for different Database SIDs."
I'm using NCPA.
Management decided that they want all six physical disk checks. Please advise on the best practices.
Scenario 0::HG's
6 x 146 x 1 (HG's = instances = services) = 876 services + 146 HG's
(checks x instances x HG's)
instances: [sda1, ..., ssd9]
services:
ex. 'disk/physical/sda1/read_bytes'
'disk/physical/instance_i/read_bytes'
'disk/physical/instance_i/read_count'
'disk/physical/instance_i/read_time'
'disk/physical/instance_i/write_bytes'
'disk/physical/instance_i/write_count'
'disk/physical/instance_i/write_time'
HG's:
[sda1-HG, ... , ssd9-HG]
Scenario 1::Hosts
6 x 146 x n (n = hosts; in this case 562) = 492,312 services
(checks x instances x hosts)
Services & instances stay the same; instead of one HG per service it's one host per service.
I get that it's 492,312 services in the backend, but in the GUI that becomes unwieldy. Further, that's only disk_physical; for disk_logical we have 634 unique mount points.
Please confirm: there's no other way of implementing this other than ending up with 780 hostgroups & probably 1,000+ services, or on the order of a million+ services, for the functionality of disk_logical & disk_physical? Generally I'd appeal to templates or cmd's, but each unique instance & cmd (ex. disk/logical/ssd9/free) has to be in the service for each host or hostgroup.
I'm open to suggestions.
Thanks,
Maxwell Ramirez
Nagios design
Re: Nagios design
You could consolidate a lot of that with some custom plugins. For example, Nagios could call a custom service check called check_disk_stats and leave it up to the NCPA client to run a script that is responsible for running the checks. Pseudo-code:
for instance_i in "list_of_instances" {
    request 'disk/physical/instance_i/read_bytes'
    request 'disk/physical/instance_i/read_count'
    request 'disk/physical/instance_i/read_time'
    request 'disk/physical/instance_i/write_bytes'
    request 'disk/physical/instance_i/write_count'
    request 'disk/physical/instance_i/write_time'
}
See https://support.nagios.com/kb/article/n ... a-722.html for some more details on running custom plugins with NCPA.
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
Maxwellb99
Posts: 97
Joined: Tue Jan 26, 2016 5:29 pm
Re: Nagios design
for instance_i in "list_of_instances"{
request 'disk/physical/instance_i/read_bytes'
request 'disk/physical/instance_i/read_count'
request 'disk/physical/instance_i/read_time'
request 'disk/physical/instance_i/write_bytes'
request 'disk/physical/instance_i/write_count'
request 'disk/physical/instance_i/write_time'
}
Please confirm: do you want me to write a script that sits on each remote host, or on the Nagios server itself? I would then use check_ncpa to query this script via /ncpa/plugins. From your pseudo-code I assume 'request' is a function call? The script would then return a Nagios status corresponding to the health of the set of list_of_instances?
The corresponding services on the Nagios instance would be:
host: host_i || hostgroup_i
service_description: host_i_physical_drives
check_cmd: /plugins/check_disk_stats.sh -h $HOSTADDRESS$ # The list of instances is semi-unique to each host, so I need the host to query the list of instances.
How does $HOSTADDRESS$ expand? Can I pass that to a shell script? If I'm using a hostgroup, would that become a list of hosts passed in as arguments?
If grouping the services is your official solution, that is not a viable option for us.
What happens if sda0 & sda1 are both in an alert state?
In theory, I think, I could collect up the instance_i & metric/check pairs that are in an alert state as perfdata & return that. Actually, can I return that as a macro? We need the metric (which currently is a known widget to the service) as I use it in the notification, for example: KMXXXX PROBLEM Service Alert: mpzoraqa1 disk_free home is CRITICAL (that's disk_logical, but you get the idea).
How do I maintain this?
I would need to hardcode the warning & critical thresholds for each instance in the "request" function, then push that out to the script on the remote hosts. We'd be maintaining thousands of configs.
Let me reiterate: the goal here is to minimize the number/view of services for maintainability. Please confirm that, given the off-the-shelf solution, the minimum number of services is a service for each instance; keeping in mind, the instances are things like sda0 or /usr/app/oracle followed by a metric (free, used_percent, total, etc.), plus a HG consisting of the hosts that have that drive instance. Is this correct? (Service := instance x metric x HG)
To the community: Nagios has been around for a long time. How are some of your orgs handling this? I mostly wrote about disk_physical, but disk_logical is the real crux. I have at least 634 unique disk_logical instances to handle; did y'all create 634 services multiplied by the number of checks/metrics that you're going for? Further, if you didn't use HG's, you'd need that service multiplied by the hosts it's applied to.
Thanks,
Maxwell Ramirez
Re: Nagios design
The requests would just be calls to the local API:
./check_ncpa.py -H 127.0.0.1 -t '<your token>' -M 'disk/physical/sda1/read_time'
./check_ncpa.py -H 127.0.0.1 -t '<your token>' -M 'disk/physical/sda1/read_bytes'
./check_ncpa.py -H 127.0.0.1 -t '<your token>' -M 'disk/physical/sda1/read_count'
./check_ncpa.py -H 127.0.0.1 -t '<your token>' -M 'disk/physical/sda1/write_time'
./check_ncpa.py -H 127.0.0.1 -t '<your token>' -M 'disk/physical/sda1/write_bytes'
./check_ncpa.py -H 127.0.0.1 -t '<your token>' -M 'disk/physical/sda1/write_count'
Where the instances are gathered by first running another call to the API:
./check_ncpa.py -H 127.0.0.1 -t '<your token>' -l -M disk
Using this method, the script would be deployed to each remote host and should accommodate hosts with differing instances. How return codes and messages are handled in various scenarios would be up to the logic of the script.
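To make that concrete, here is a minimal shell sketch of such a wrapper script. The structure, the `track` helper, and the hard-coded instance list are illustrative assumptions; a real script would discover instances with the `-l` call above and feed each check_ncpa.py exit code into the aggregation.

```shell
#!/bin/sh
# Sketch of a check_disk_stats-style wrapper for NCPA (hypothetical; adapt
# the token, host, and instance discovery to your environment).

worst=0   # running worst Nagios exit code: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN

# Fold a per-metric exit code into the overall result. CRITICAL (2) always
# wins; otherwise keep the numerically highest code seen so far.
track() {
  if [ "$1" -eq 2 ]; then
    worst=2
  elif [ "$worst" -ne 2 ] && [ "$1" -gt "$worst" ]; then
    worst=$1
  fi
}

# Real invocations would look like:
#   ./check_ncpa.py -H 127.0.0.1 -t "$TOKEN" -M "disk/physical/$inst/$metric"
#   track $?
for inst in sda1 sda2; do     # would come from: ./check_ncpa.py ... -l -M disk
  for metric in read_bytes read_count read_time write_bytes write_count write_time; do
    :  # placeholder for the check_ncpa.py call; pass its exit code to track
  done
done

echo "disk stats worst status: $worst"
# A real plugin would print a status line plus perfdata, then: exit "$worst"
```

Note the CRITICAL-wins rule: UNKNOWN (3) is numerically highest, but most shops treat CRITICAL as more severe than UNKNOWN, so a plain numeric max would be misleading.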
Without a custom script, the number of services required would be (the number of unique instances x 6). These services could then be applied to host or host groups as needed.
Services:
'disk/physical/sda1/read_bytes'
'disk/physical/sda1/read_count'
'disk/physical/sda1/read_time'
'disk/physical/sda1/write_bytes'
'disk/physical/sda1/write_count'
'disk/physical/sda1/write_time'
'disk/physical/sda2/read_bytes'
'disk/physical/sda2/read_count'
'disk/physical/sda2/read_time'
'disk/physical/sda2/write_bytes'
'disk/physical/sda2/write_count'
'disk/physical/sda2/write_time'
...
etc
If I'm following the initial post correctly, this would be 6 x 146 services. Each host does not need its own uniquely configured service, since you can apply the same service to different hosts. So that would be 876 services.
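As a sketch of the "same service, many hosts" point: one service object per instance/metric pair can be attached to a hostgroup instead of being duplicated per host. The hostgroup name, template, and check_ncpa command object below are illustrative assumptions, not your actual config.

```
define service {
    use                   generic-service
    hostgroup_name        sda1-HG
    service_description   disk_physical_sda1_read_bytes
    check_command         check_ncpa!-t '<your token>' -M 'disk/physical/sda1/read_bytes'
}
```

Every host in sda1-HG then gets this service automatically, so adding a host to the group is the only per-host change.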
Re: Nagios design
Hello,
Thank you, you told me what I needed to know.
A couple things:
- I didn't think it through in the beginning; the ~500k number is an upper bound. I added ~20k services almost by accident running my add_baseline-for-nprod-windows method using Python. I am looking to avoid that in the future.
- If I'm understanding correctly the script is essentially a decorator for the NCPA calls?
- Assuming that you're returning a legal Nagios response [0, 1, 2, 3], how are you accounting for multiple events? Put another way, how would you de-aggregate the checks? Given the example: sda0 & sda1 are both in a hard critical state, then sda0 drops back below threshold. I assume the check would still return critical, as at least one check is critical. As mentioned, I am mandated to send out notifications upon state change, including return-to-normal.
- Ha, our deployment teams would do terrible things to me if I told them that I wanted them to push out a script change every time I wanted to change a threshold. That's why we love NCPA: no plugins or configs to change, just throw it all in the check_cmd/service.
- The per-host implementation is a convenience thing. Teams often ask us to adjust their thresholds, and 1-1 service-host gives me the flexibility to make adjustments. But they're probably not the only team with servers in the |app|logs hostgroup/service, for example: change one, change them all.
OK, yep. We're in agreement; that's how I handled Windows.
I presented disk_physical because it's a bit more cut & dried, but disk_logical is the one I'm worried about.
We have thresholds on used_percent & used (think distributed-DB instances: 5% of 1 TB is different than a VM with 25 GB) and inodes free, & we'd like to collect the other metrics as well (inodes, inodes_used, free, total).
That's 634 x 7 = 4,438 services and 634 hostgroups.
Combine that with 6 x 146 = 876 services & 146 hostgroups.
That's 5,314 services to do disk. So, like a third to a half of the Nagios instance is consumed by that alone; I just wanted to make sure there wasn't a more efficient way staring me in the face.
Kind of a neat idea for a wizard. I'ma hit it with Python. Thanks for your responses.
Please leave this thread open, as I did reach out to the community requesting feedback. If nobody responds after an appropriate amount of time please feel free to close the thread.
Thanks,
Maxwell Ramirez
Re: Nagios design
Yeah, decorator is probably an appropriate description. It'd be a script that just executes NCPA API requests, parses the output, and returns a status code and items of interest. Configuring it to return status info for all instances each time it is run probably wouldn't be ideal, but having it return status info for the checks that detect a problem could be useful. As you've pointed out though, adjusting thresholds would be difficult. And handling perfdata would introduce a new set of hurdles. With the way the services have to be broken out to handle the combinations of disks and their metrics though, I don't see another way to reduce the number of services that would need to be configured on the XI side.
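A small sketch of the "report only the problem checks" idea, under the same assumptions as before: the `report` helper and hard-coded exit codes are hypothetical, standing in for `$?` captured after each check_ncpa.py call.

```shell
#!/bin/sh
# Sketch: build plugin output that names only the failing instance/metric
# pairs (hypothetical helper; real codes would come from check_ncpa.py calls).

problems=""

# Record an instance/metric pair if its per-metric exit code is non-OK.
report() {
  # $1 = instance, $2 = metric, $3 = exit code of that metric's check
  if [ "$3" -ne 0 ]; then
    problems="$problems $1/$2"
  fi
}

# Example codes; a real run would capture $? after each check_ncpa.py call.
report sda1 read_time 2
report sda2 write_bytes 0
report sda2 read_bytes 1

if [ -z "$problems" ]; then
  echo "OK: all disk instances within thresholds"
else
  echo "PROBLEM:$problems"   # prints: PROBLEM: sda1/read_time sda2/read_bytes
fi
```

This keeps healthy instances out of the status line, which addresses the "what if sda0 & sda1 both alert" question, though as noted it doesn't solve per-instance thresholds or per-metric notifications.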
We'll keep the thread open as requested. If someone from the community has tackled this and come up with a better solution, I'd like to hear about it as well.