
Extract a Portion of a Service Group Membership

Posted: Wed Jul 11, 2012 12:41 pm
by mfarrenk
Our primary purpose for Nagios is interface status checking. To that end, I'm trying to improve its efficiency by having it check all of the monitored interfaces on a switch or router in one pass, without having to run the check script once per interface. I only want it to return results for the configured interfaces so I don't end up with a lot of "Passive check result returned for xxx but no service is defined" messages. (And I'm not going to care about most of the interfaces on each device.)

My first option is to add all of the monitored interfaces across all of the devices to a servicegroup called interfaces but that entire list across all devices in the group gets passed to the script every time. I don't see a way to specify the host name in an on-demand macro (like $SERVICEGROUPMEMBERS:switch1:interfaces$) so that I just get the interface list from the device it is currently checking.

My second option is to define separate servicegroups for each device (switch1_interfaces, switch2_interfaces, router1_interfaces, etc.) and then configure each check specifically for that servicegroup. That seems a PITA; it would be helpful if I could pre-define a service template that could do a substitution like $SERVICEGROUPMEMBERS:$HOSTNAME$_interfaces$. But I'm not seeing that so I'll have to do it the long way. For each device, I can define a service and have a custom macro (like _INTERFACES $SERVICEGROUPMEMBERS:switch1_interfaces$) so that I can just reference _INTERFACES in the service template, but I still need to manually change the "switch1_interfaces" in each service. It just feels error-prone.

Unless anyone has any suggestions? Thanks!

Re: Extract a Portion of a Service Group Membership

Posted: Wed Jul 11, 2012 7:37 pm
by jsmurphy
I'm not 100% sure I actually understand how you have configured your current set up or what the end result you are expecting is... would you be able to explain your set-up a little further (How you are retrieving information, what you want it to display... or what you are trying to do with the service-groups). I'm just not sure I appropriately understand yet to offer any useful advice :D.

Re: Extract a Portion of a Service Group Membership

Posted: Wed Jul 11, 2012 11:16 pm
by mfarrenk
I am a network engineer for a large organization. When I say "network engineer", that's really my job -- I'm not like a jack-of-all-trades, some server admin/network admin/tech-type position. So monitoring servers (disk space, CPU utilization, etc.) is not really in scope for me. What I monitor are network devices, uplink ports, etc. And in this context, I'm looking at availability only; we have another system that monitors interface statistics (bits in/out, errors, discards, etc.).

I have Nagios configured with our hierarchical network design -- traditional core/distribution/access layers. Each interface that I care about is a separate service. Right now, whenever Nagios checks an interface, it performs an active check on that individual interface. So if I have a router AAAdist1 and interfaces Gi1/1, Gi1/2, Gi4/1, and Gi4/2 are all used as uplinks, then Nagios has services Gi1-1, Gi1-2, Gi4-1, and Gi4-2 on host AAAdist1. These are all active checks, so each interface gets scheduled separately. Each check executes the interface check command (the underlying script uses SNMP to check ifOperStatus and ifAdminStatus). So that is four checks that it has to schedule and four times it has to run the underlying script for one device.
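As a rough illustration of the per-interface check described above, here is a minimal Python sketch of how the two SNMP values might be turned into a Nagios state. The integer values are the standard IF-MIB ones; the mapping policy (admin-down reported as WARNING, anything else not up/up as CRITICAL) and the function name interface_state are assumptions made for illustration, not the actual script from this thread.

```python
# Standard IF-MIB integer values for ifAdminStatus and ifOperStatus.
IF_ADMIN_STATUS = {1: "up", 2: "down", 3: "testing"}
IF_OPER_STATUS = {
    1: "up", 2: "down", 3: "testing", 4: "unknown",
    5: "dormant", 6: "notPresent", 7: "lowerLayerDown",
}

# Nagios plugin return codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def interface_state(admin_status, oper_status):
    """Map raw SNMP integers to (return_code, plugin_output).

    The policy below is hypothetical; adjust it to local conventions.
    """
    admin = IF_ADMIN_STATUS.get(admin_status)
    oper = IF_OPER_STATUS.get(oper_status)
    if admin is None or oper is None:
        return UNKNOWN, f"unrecognised values admin={admin_status} oper={oper_status}"
    if admin == "up" and oper == "up":
        return OK, "admin up, oper up"
    if admin == "down":
        # Someone shut the port on purpose: flag it, but less loudly.
        return WARNING, f"administratively down (oper {oper})"
    return CRITICAL, f"admin {admin}, oper {oper}"
```

The SNMP query itself is left out on purpose, since the thread does not say which SNMP library or command the script uses.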

What I want to do is create a separate service (I've called it check_interfaces). I will redefine all of the interface services to be passive checks, then have check_interfaces as an active check. When check_interfaces gets run, it needs to pass a list of all of the interfaces (Gi1-1, Gi1-2, Gi4-1, Gi4-2) on the host (AAAdist1) to the script. The script will run once and pass all of the results back at one time, reducing the number of active checks Nagios has to make. I have freshness defined for each interface, so if the interface state does go stale, it has a way of executing a check for the individual interface.
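For reference, the "run once, report everything" step could look something like the sketch below. It formats one PROCESS_SERVICE_CHECK_RESULT external command per interface and writes them to the Nagios command file. The function names and the shape of the results dictionary are assumptions for illustration; the command file path is whatever command_file is set to in nagios.cfg, so it is passed in rather than hard-coded.

```python
import time


def passive_result_lines(host, results, now=None):
    """Build external-command lines for a set of passive service results.

    results maps service_description -> (return_code, plugin_output),
    e.g. {"Gi1-1": (0, "admin up, oper up")}.
    """
    timestamp = int(time.time() if now is None else now)
    return [
        f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{code};{output}"
        for service, (code, output) in results.items()
    ]


def submit(lines, command_file):
    """Write the lines to the Nagios external command file (a named pipe)."""
    with open(command_file, "w") as pipe:
        for line in lines:
            pipe.write(line + "\n")
```

The active check_interfaces service would still return its own status and output as usual; the lines above only update the passive per-interface services.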

So my problem is: how to configure Nagios so that it passes a list of only the interfaces for the specific host (AAAdist1) to the script.

The easiest solution, if it were possible, would be to add all interfaces to a servicegroup (such as a group called "interfaces"). If I try that, when I pass it to a script, I'll get the entire list of host names and services. With several thousand host/service combinations, that's not practical in my mind. I need to be able to filter it based on the host name associated with the service. Keeping everything in the same service group makes it easy to create a service template. For each interface, I define the service with the host name (AAAdist1) and the service name (Gi1-1). The service template would automatically add the new service to the "interfaces" service group. Then, when the check_interfaces service is executed for AAAdist1, it would get a list of the services defined for AAAdist1 (Gi1-1, Gi1-2, Gi4-1, Gi4-2). For the corresponding distribution router and its interfaces (let's say AAAdist2 and Gi3-7, Gi3-8, Gi7-12, Gi7-13), when the check_interfaces service is checked for AAAdist2, the script would only get the interfaces for AAAdist2.
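To make the filtering idea concrete: if the whole group were passed to the script anyway, the script itself could pick out the entries for one host. The sketch below assumes the macro expands to a flat comma-separated list of alternating host names and service descriptions (host1,service1,host2,service2,...); that format, and the function name services_for_host, are assumptions here. It does not avoid the cost of passing several thousand entries on the command line, which is the real objection above.

```python
def services_for_host(members, host):
    """Return the service descriptions in `members` that belong to `host`.

    `members` is assumed to look like "host1,svc1,host1,svc2,host2,svc3".
    Service descriptions containing commas would break this parsing.
    """
    fields = [field.strip() for field in members.split(",") if field.strip()]
    if len(fields) % 2:
        raise ValueError("expected host,service pairs but got an odd number of fields")
    pairs = zip(fields[0::2], fields[1::2])
    # Exact, case-sensitive comparison on the host name.
    return [service for member_host, service in pairs if member_host == host]
```

For example, services_for_host("AAAdist1,Gi1-1,AAAdist2,Gi3-7", "AAAdist2") keeps only the Gi3-7 entry.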

I was hoping to be able to define the check_command in the service template like:

Code:

check_command     check_ifstate!$SERVICEGROUPMEMBERS::interfaces$
But that's not the right format and, unsurprisingly, it doesn't work.

So if there's no way to automatically extract the list of services out of a group based on a host name as a key, then I have to define a separate service group per device. For AAAdist1, I would add Gi1-1, Gi1-2, Gi4-1, and Gi4-2 to a service group aaadist1_interfaces. For AAAdist2, I add Gi3-7, Gi3-8, Gi7-12, and Gi7-13 to service group aaadist2_interfaces. But this makes it more tedious to try to template-ize. Unfortunately, I can't do this:

Code:

check_command     check_ifstate!$SERVICEGROUPMEMBERS:$HOSTNAME$_interfaces$
One other idea I had was to define the service template with a custom variable:

Code:

check_command     check_ifstate!$_INTERFACE_LIST$
then define the device service so that it fills in _INTERFACE_LIST:

Code:

define service {
     service_description     check_interface
     host_name     aaadist1
     _INTERFACE_LIST     $SERVICEGROUPMEMBERS:aaadist1_interfaces$
}

define service {
     service_description     check_interface
     host_name     aaadist2
     _INTERFACE_LIST     $SERVICEGROUPMEMBERS:aaadist2_interfaces$
}
This is doable. However, it's tedious at best and prone to error: if I copy another device's definition as a starting point and forget to change aaadist1_interfaces to the right group (aaacore1_interfaces, for example), then the bulk check never covers the interfaces for AAAcore1, and they only get checked individually once their states go stale.

I know it's long but I hope you've got a better idea of what I'm trying to do. Thank you!

Re: Extract a Portion of a Service Group Membership

Posted: Thu Jul 12, 2012 7:11 pm
by jsmurphy
Ahhhhhh now I understand :D! I enjoy a good configuration design puzzle but can't think of an appropriate solution that will get you exactly what you want, especially one that doesn't cause an unnecessary administrative burden such as the one you've constructed there. We had a similar sort of discussion with the network team here... they also have their own tool that analyses performance data and we ultimately decided to take an Occam's razor approach to the interface monitoring situation.

We have about 2500 switches/routers, and monitoring every interface was going to be untenable, so we decided to actively monitor only the interfaces on the "can't-go-down-ever" links, each with its own service. For the other interfaces, we decided that SNMP traps, all dumped to a single passive "down interfaces" service (this service exists on every switch and router), were "good enough".

Initially this meant a lot of work, as a new config had to be tested and pushed out to all of the network devices. Since then, though, we've never had to do any maintenance as far as interfaces are concerned, because every new switch is automatically monitored when it inherits the service that the interface traps get dumped to. I know this isn't the answer you are looking for, but hopefully it can provide some food for thought.

Re: Extract a Portion of a Service Group Membership

Posted: Fri Jul 13, 2012 6:03 pm
by mfarrenk
I don't take no for an answer ( . . . usually . . .) and I didn't this time. ;)

I still haven't been able to find a way to extract a subset of a service group membership (new feature please???). However, I've come up with a way to do it that minimizes the human element. I have a script that creates all of the base configuration files from the name of the switch or router, including a new interfacecheck.cfg config file. The script automatically fills in the device name in the name of the device-specific service group.

As for _INTERFACE_LIST, this didn't work:

Code:

define service {
     service_description     check_interface
     host_name     aaadist1
     _INTERFACE_LIST     $SERVICEGROUPMEMBERS:aaadist1_interfaces$
}
_INTERFACE_LIST did not get the macro substitution it needed in order to work. However, when I put the macro in "notes" instead, Nagios performed the substitution, and $SERVICENOTES$ is automatically available on the command line.

Code:

define service {
     service_description     check_interface
     host_name     aaadist1
     notes     $SERVICEGROUPMEMBERS:aaadist1_interfaces$
}
The (potentially beneficial???) side effect now is that, if I look at the check_interface service for a device, its notes list all of the interfaces that it knows about.
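For anyone wanting to do something similar, a generator for the per-device definitions might look roughly like this Python sketch. It is not the script mentioned above; the function name render_interfacecheck and the service template name interface-check-template are hypothetical, and it only renders the service group plus the check_interface service with the "notes" trick.

```python
# Text of interfacecheck.cfg for one device. Doubled braces are literal
# braces in str.format(); the $...$ macro is left for Nagios to expand.
INTERFACECHECK_TEMPLATE = """\
define servicegroup {{
     servicegroup_name     {group}
     alias     Monitored interfaces on {host}
}}

define service {{
     use     {template}
     service_description     check_interface
     host_name     {host}
     notes     $SERVICEGROUPMEMBERS:{group}$
}}
"""


def render_interfacecheck(host, template="interface-check-template"):
    """Render the device-specific config so the group name always matches the host."""
    group = f"{host}_interfaces"
    return INTERFACECHECK_TEMPLATE.format(host=host, group=group, template=template)


if __name__ == "__main__":
    print(render_interfacecheck("aaadist1"))
```

Because the group name is derived from the host name in one place, the copy-and-forget mistake (leaving aaadist1_interfaces in another device's file) cannot happen.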

I do appreciate your reading my question and pondering it. As a side note, the only interfaces we're monitoring with Nagios are uplink interfaces throughout the network -- no UPS links, transfer switch links, etc. So my personal feeling is these are all "can't-go-down-ever" links.

So with the substitution working and with a way to reduce the element of human error, I think this will work for me. Thank you!

Re: Extract a Portion of a Service Group Membership

Posted: Sun Jul 15, 2012 7:00 pm
by jsmurphy
I usually avoid the "build your own" response as it can be rather off-putting to a lot of people but I'm glad you found a solution that works :D. I was actually pondering this problem a little more later in the day when I remembered an awesome little plugin project that might be worth checking out: http://exchange.nagios.org/directory/Pl ... 3t/details

It was a pretty neat little plugin that we considered using but it didn't support the majority of our devices at the time I investigated it.