[Nagios-devel] Re: [Nagios-users] Large scale network monitoring limits with nagios

Noah Leaman wrote:

> Using the concept of one service per up/down trap for each network
> interface, I tested a little by creating a very simple set of nagios
> configs, but with about 8000 PASSIVE service checks and no active
> service checks. Of course there was no problem in terms of scheduling
> issues, but the CGIs all slowed to a snail's pace. In my setup (Nagios
> 1.2, dual G4 first-gen Xserve) it takes about 30 secs to display the
> Status Summary page.
>
> ... So 9236 services all together, but this is really just a small
> subset of what I would like to be able to do. The plan is to throw
> hardware at it to spread out the real work being done (i.e. the active
> checks).
>
> But with just this setup, a single CGI takes up an entire CPU to run,
> often for a few minutes at a time... and the plan was to have a
> good handful of GUI users (5 ish at a time)... it's just about
> unusable with one GUI user.

I'm using a distributed environment of 4 servers to monitor 6200
services, so I'm not displaying quite as much as you, but I am close. My
designated central server that runs the CGIs is a dual AMD 2200 with
3 GB of RAM. I am not using 1.2; I am using 1.1 with a CGI patch
submitted to the devel list by David Parrish. Viewing the CGIs as an admin
user who has access to all services/hosts causes no problems for me. I
have not tested 1.2 because 1.1 works quite well for me and I have not
wanted any headaches.

The only complaint I have about the CGIs after the patch is that they
take up between 20-50% of a CPU every time someone loads them. If too
many people in the company are browsing around, things can get really
slow. I used to cache some of the pages every few minutes, but I just
didn't like the idea of caching the data.
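
For readers weighing that trade-off, here is a minimal Python sketch of time-based page caching. It is not the setup described above: `PageCache`, `ttl_seconds`, and the `render` callback are hypothetical names, and the 300-second lifetime is only an assumption standing in for "every few minutes".

```python
import time
from typing import Callable, Dict, Tuple


class PageCache:
    """Tiny time-based cache: re-render a page at most once per TTL."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the behaviour can be tested
        self._store: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str, render: Callable[[], str]) -> str:
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[0] < self.ttl:
            return hit[1]   # still fresh: skip the expensive render
        page = render()     # stale or missing: pay the CPU cost once
        self._store[key] = (now, page)
        return page


# Hypothetical usage: render() would run the expensive status page.
cache = PageCache(ttl_seconds=300)
page = cache.get("status-summary", lambda: "<html>rendered page</html>")
```

The cost is the one named above: every viewer may see status that is up to one TTL old, which can matter on a monitoring console.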

> How to monitor traps for hundreds of network hosts and tens of
> thousands different interfaces each of which could generate up/down
> traps along with other traps. I tried setting up a single "catch-all"
> trap service per host, but notification would need to occur when going
> from an OK to another OK (with a different output). Shouldn't this
> work with is_volatile on and stalking_options set to o,w,u,c? (Every
> test I've done to get this working from OK to OK doesn't work... but
> maybe I missed something.)

Mmmm, this is definitely a question for the users list. Personally, I do
not use the volatile option because we rely entirely on web interfaces
(no email notifications) to let us know what is going on. I have a
"trap server" running a "snmptrapd log watcher" program which watches the
snmptrapd log for events. If a failure on a device triggers a trap with
an OID that is recognized, it flags the service as critical until someone
acknowledges it in the web interface.
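
A rough Python sketch of such a log watcher follows. It is not the program described above, and several things are assumptions: the `<host> <oid> <text>` log line layout (snmptrapd's output format is configurable), the `WATCHED_OIDS` table, the "Link Status" service name, and all function names. The one piece taken from Nagios itself is the `PROCESS_SERVICE_CHECK_RESULT` external command, which submits a passive check result through the external command file.

```python
from typing import Dict, Iterable, Iterator, Optional, Tuple

# Hypothetical table: trap OID -> Nagios service description to flag.
WATCHED_OIDS: Dict[str, str] = {
    ".1.3.6.1.6.3.1.1.5.3": "Link Status",  # linkDown
}

CRITICAL = 2  # Nagios return code for CRITICAL


def parse_trap_line(line: str) -> Optional[Tuple[str, str, str]]:
    """Split an assumed '<host> <oid> <text>' line; None if malformed."""
    parts = line.strip().split(None, 2)
    if len(parts) < 2:
        return None
    text = parts[2] if len(parts) == 3 else ""
    return parts[0], parts[1], text


def passive_result(host: str, service: str, code: int,
                   output: str, now: float) -> str:
    """Format one passive service check result as an external command."""
    return (f"[{int(now)}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{code};{output}")


def commands_for(lines: Iterable[str], now: float) -> Iterator[str]:
    """Yield a CRITICAL result for every line with a recognized OID."""
    for line in lines:
        parsed = parse_trap_line(line)
        if parsed is None:
            continue
        host, oid, text = parsed
        service = WATCHED_OIDS.get(oid)
        if service is None:
            continue  # unrecognized trap: ignore it
        yield passive_result(host, service, CRITICAL, text or oid, now)


def submit(commands: Iterable[str], command_file: str) -> None:
    """Append commands to the Nagios external command file (path varies)."""
    with open(command_file, "a") as fh:
        for cmd in commands:
            fh.write(cmd + "\n")
```

Because nothing in this sketch ever submits an OK result, the service simply stays critical once flagged. How it gets cleared after someone acknowledges it is not covered here.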

Lots of people have other ways of accomplishing this.

> So the higher level question here is: am I in over my head in what I
> am trying to do with Nagios, or in how I am going about it? After
> tackling the network monitoring needs, the plan was to then start the
> server monitoring (around 1000 servers of many platforms).

If I ever migrate to 1.2, I'll be sure to let the list know if I have
CGI slowness.

Jason





This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]