Questions on setting up notifications in distributed system

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
mroter
Posts: 80
Joined: Sun Apr 29, 2012 12:43 pm

Questions on setting up notifications in distributed system

Post by mroter »

We have a big Nagios implementation (~500 hosts) with multiple Nagios XI (2012R1.0 32bit) servers in multiple sites.
Each site has a local Nagios XI server that performs the checks (mostly active) and reports to two central Nagios XI (Admin) servers. The Admin servers (primary & DR) have only passive checks.
Notifications are disabled on the site servers. The Admin servers are where the contact groups are defined and where notifications are sent from.

1. Since any service definition at the site servers is multiplied by all the hosts it is running on, we get many instances of the same service on the Admin machine. This is making an "apply" operation very slow/heavy. Any suggestion?
2. On the Admin servers the check interval of the services is set by default to 1m (regardless of the original definition) and the notification interval is also set to 1m (instead of 60m) -sending too many notifications... Any suggestion?
3. To overcome and simplify maintenance we left the contact/contact-groups empty (when defining the objects on the admin servers) and defined host/service escalations. Is this the right way?
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: Questions on setting up notifications in distributed sys

Post by mguthrie »

Notifications are disabled on the site servers. The Admin servers are where the contact groups are defined and where notifications are sent from.
With a model like this, I would consider putting all of the notification handling on the site servers themselves. It eliminates a point of failure for getting timely notifications, and it also frees up the central server to continue to scale upward. Obviously do what works best for your scenario, but notifications and event handling add a performance overhead to whatever machine they're on.
1. Since any service definition at the site servers is multiplied by all the hosts it is running on, we get many instances of the same service on the Admin machine. This is making an "apply" operation very slow/heavy. Any suggestion?
The Apply Configuration process pushes fresh configs from the Core Config Manager backend and writes them to physical files for the Core Engine to use. The more hosts you have, the more files will have to be created. However, it might be worth running the following commands to make sure that there aren't other delays in the Apply Configuration process. They run Apply Configuration from the command line and show logging output.

cd /usr/local/nagiosxi/scripts
./reconfigure_nagios.sh
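
To judge whether Apply Configuration itself is the slow part, it can help to time the run and keep the output for later review. The following is a minimal sketch, not part of Nagios XI: `timed_run` is a hypothetical helper, and the log path in the usage comment is an assumption.

```shell
# Run a command, save its combined output to a log file, and report
# the exit status and elapsed wall-clock seconds.
# Args: log_file command [args...]
timed_run() {
    log_file="$1"; shift
    start=$(date +%s)
    "$@" > "$log_file" 2>&1
    status=$?
    end=$(date +%s)
    echo "exit=$status elapsed=$((end - start))s log=$log_file"
    return "$status"
}

# Usage on the XI server (script path taken from the post above):
# cd /usr/local/nagiosxi/scripts && timed_run /tmp/reconfigure.log ./reconfigure_nagios.sh
```

Comparing the elapsed time before and after consolidating service definitions gives a rough measure of how much the object count contributes.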
2. On the Admin servers the check interval of the services is set by default to 1m (regardless of the original definition) and the notification interval is also set to 1m (instead of 60m) -sending too many notifications... Any suggestion?
Our 2012 enterprise edition has a Bulk Modification tool for the Core Config Manager that can be used to change config directives for a large list of configs at once. It could be handy for this scenario. The defaults are set like that to ensure that there aren't config errors from running the Unconfigured Objects wizard.
3. To overcome and simplify maintenance we left the contact/contact-groups empty (when defining the objects on the admin servers) and defined host/service escalations. Is this the right way?
That depends. Potentially that will just add another layer of complication to the configuration. Escalations are necessary when you need the notification list to change or escalate after a certain number of notifications. If you're just using them to associate your contacts, you'd actually be better off assigning contacts through templates or the host/service definitions themselves. If you've got a LOT of escalation definitions that will also add to your Apply Configuration time.
mroter
Posts: 80
Joined: Sun Apr 29, 2012 12:43 pm

Re: Questions on setting up notifications in distributed sys

Post by mroter »

Sending notifications from the site servers means we'll have to define and maintain the contacts, contact groups, and notifications per host/service multiple times, once for each site. We were trying to avoid this (3-4 sites & over 10 optional contacts/groups).

Regarding the service explosion on the Admin server(s), there is no workaround? - we can't collapse the same services to a single instance (e.g. CPU load) like in the site servers, right?
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: Questions on setting up notifications in distributed sys

Post by mguthrie »

Regarding the service explosion on the Admin server(s), there is no workaround? - we can't collapse the same services to a single instance (e.g. CPU load) like in the site servers, right?
When you say "single instance," do you mean collapsing the service definition into one definition for all of the CPU load services? If so, yes, you can most definitely do this on the admin server. You just need to apply the single definition either to the entire host list or to a common hostgroup.

Tell me a little bit more about the notification flood that you're getting from the admin server. Are you getting too many notifications for the same host:service?
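
As a rough illustration of the single-definition approach described above, the sketch below writes a sample service object to a scratch file and confirms that only one definition is needed. The hostgroup name `cpu-load-hosts` and the template name `passive-service-template` are hypothetical, and in Nagios XI the equivalent object would normally be created through the Core Config Manager rather than by hand.

```shell
# Write a sample object definition to a scratch file (not a live Nagios config).
# "cpu-load-hosts" and "passive-service-template" are hypothetical names.
sample_cfg="$(mktemp)"
cat > "$sample_cfg" <<'EOF'
define service {
    use                     passive-service-template
    hostgroup_name          cpu-load-hosts
    service_description     CPU load
    active_checks_enabled   0
    passive_checks_enabled  1
    notification_interval   60
}
EOF

# One definition covers every member of the hostgroup.
service_count=$(grep -c '^define service' "$sample_cfg")
echo "service definitions: $service_count"
```

Because the definition is attached to the hostgroup, adding a host to the group is enough to give it the service, and a directive such as `notification_interval` only has to be changed in one place.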
mroter
Posts: 80
Joined: Sun Apr 29, 2012 12:43 pm

Re: Questions on setting up notifications in distributed sys

Post by mroter »

OK, so to confirm: even though the "CPU load" service is a passive check reported from many hosts in different sites, I can collapse it to a single service definition applied to a host-group containing all the relevant hosts. This will not cause an identification problem when results for the service arrive from the remote hosts.

In the Admin we get a notification for each service every 1m rather than 60m. Do we need to use the bulk tool now to set all the properties correctly?
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: Questions on setting up notifications in distributed sys

Post by mguthrie »

OK, so to confirm: even though the "CPU load" service is a passive check reported from many hosts in different sites, I can collapse it to a single service definition applied to a host-group containing all the relevant hosts. This will not cause an identification problem when results for the service arrive from the remote hosts.
That's correct.
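
To see why identification is not a problem, it may help to look at the shape of a passive service result as Nagios Core processes it: each result names both the host and the service description. The sketch below formats results in the `PROCESS_SERVICE_CHECK_RESULT` external command syntax. The helper name `format_passive_result` and the host names are hypothetical, and how results actually reach the Admin servers (for example NSCA or NRDP) depends on your setup.

```shell
# Build a Nagios external command line for a passive service result.
# Args: timestamp host_name service_description return_code plugin_output
format_passive_result() {
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Two hosts share the "CPU load" description; host_name keeps the results apart.
# "site1-web01" and "site2-db01" are hypothetical host names.
format_passive_result 1351093920 site1-web01 "CPU load" 0 "OK - load average: 0.10"
format_passive_result 1351093920 site2-db01 "CPU load" 2 "CRITICAL - load average: 9.50"
```

Each result is matched on the host name and service description together, so a definition shared through a hostgroup still resolves to a distinct service per host.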
In the Admin we get a notification for each service every 1m rather than 60m. Do we need to use the bulk tool now to set all the properties correctly?
Yes. OR, when you move everything into a single service definition, just update it once in that particular definition.

How often is this passive check coming in? Do you have the is_volatile setting enabled? If so, you'll get a notification with every single non-OK check that comes in instead of on state changes. That might also be the problem there.
mroter
Posts: 80
Joined: Sun Apr 29, 2012 12:43 pm

Re: Questions on setting up notifications in distributed sys

Post by mroter »

Passive checks come in as soon as they are processed by the remote site server.
If 1/5 attempts fails on the remote site, will it generate a notification on the Admin immediately or wait for 5/5?
Since the number of attempts (e.g. 5) is NOT passed over, we'll have to manually redefine it on the Admin side, right?
Where do I find the "is_volatile" setting?
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: Questions on setting up notifications in distributed sys

Post by mguthrie »

It might be easier to send a config snapshot so I can see how you've got it set up. If Nagios is doing passive checks, there won't be any retries. If you access the Admin->Config Snapshots page, you can either email or PM a configuration snapshot tarball to us, and we can view the full config and have a better idea as to how things are set up.
mroter
Posts: 80
Joined: Sun Apr 29, 2012 12:43 pm

Re: Questions on setting up notifications in distributed sys

Post by mroter »

I'm uploading the Admin and one site server config snapshots.
Last edited by mguthrie on Wed Oct 24, 2012 3:52 pm, edited 1 time in total.
Reason: Removed files after reviewing for privacy reasons
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: Questions on setting up notifications in distributed sys

Post by mguthrie »

OK, so from reviewing your configs, it doesn't look like you have anything out of place other than the notification_interval being set to 1 for the bulk of your services. Once that's updated to 60 you should be in good shape.
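
A quick way to check which object files still carry a given directive, such as `notification_interval` or `is_volatile`, is to search the generated config files. This is a minimal sketch: `find_directive` is a hypothetical helper, and the directory in the usage comments is an assumption about where the service object files live on your install.

```shell
# List every line that sets a given directive under a config directory.
# Args: directive_name config_dir
find_directive() {
    grep -rnE "^[[:space:]]*$1[[:space:]]+" "$2"
}

# Usage (the directory is an assumption; adjust to where your object files live):
# find_directive notification_interval /usr/local/nagios/etc/services
# find_directive is_volatile /usr/local/nagios/etc/services
```

Running it before and after a bulk change shows whether any definitions were missed.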