Re: [Nagios-users] Performance issue

Sorry for making the leap from -users to -devel like this, but it was
suggested that I might get a better/quicker response from here...

If anyone has any idea what I've missed or fscked up, I'm all ears!!

PS. I've upgraded to the new beta, and it's still slow as heck at passive
stuff. Never skips a beat doing active checks, but I've got way too many
to do them all that way...

---
Jason Marshall, Unix Geek, Kelman Technologies, Inc., Calgary, AB, Canada.

From a Sun Microsystems bug report (#4102680):
"Workaround: don't pound on the mouse like a wild monkey."

"I have great faith in fools:
Self confidence my friends call it." -Edgar Allan Poe


---------- Forwarded message ----------
Date: Wed, 15 May 2002 14:55:26 -0600 (MDT)
From: Jason Marshall
To: Josh Larsen
Cc: nagios-users@lists.sourceforge.net
Subject: Re: [Nagios-users] Performance issue

> I'm experiencing an issue with passive check performance with Nagios
> 1.0b1. I've just finished building a distributed architecture to monitor
> about 900 services on about 600 hosts. I originally decided to move to a
> distributed setup a while back when my single Netsaint box couldn't keep
> up with the high number of service checks.

I have similar problems that I've been fighting with for the past three
weeks or so, since I started to run Netsaint, and then upgraded to Nagios.

I'm running the alpha of Nagios, but I don't think that's the problem.

I run 130 or so distributed "servers" -- each one checks various aspects
of its own operation, and reports back to the central server once every
five minutes, via NSCA.

In my case, though, I've added a bit of randomness to the checking-script
that runs on each distributed "server" so that the 1400 or so passive
services are checked and reported over a 4-minute period. This prevents
the nagios.cmd file from getting swamped, except in pathological cases,
and therefore prevents the nsca processes from blocking on writes to the
named pipe. This, along with bumping up the max connections/min in inetd,
allows all the services to be reported to Nagios on the server, in a
timely fashion.
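
For illustration only, here is a minimal Python sketch of the kind of
jitter described above. It is not the actual checking-script, which is not
shown in this message. The function names and the sample host/service
values are hypothetical; the tab-separated line is the input format that
send_nsca reads on stdin.

```python
import random

# The 4-minute reporting window described above (an assumption about how
# the window would be expressed in code).
REPORT_WINDOW_SECONDS = 240


def pick_delay(window=REPORT_WINDOW_SECONDS, rng=random):
    """Return a random delay between 0 and `window` seconds."""
    return rng.uniform(0, window)


def format_nsca_line(host, service, return_code, output):
    """Build one tab-separated result line for send_nsca's stdin."""
    return f"{host}\t{service}\t{return_code}\t{output}\n"


# Hypothetical usage on one distributed "server":
#   time.sleep(pick_delay())
#   ...then pipe format_nsca_line("web01", "disk", 0, "OK") into send_nsca
```

Each machine sleeping for its own random delay is what spreads the
submissions across the window instead of having them all arrive at once.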

What happens after that is a mystery. Nagios does NOT seem capable of
keeping up -- not even close, actually.

I'm using freshness checking, and after a service is 900 seconds stale,
Nagios just runs the check_dummy plugin with a value of 5, which reports
that service as "unknown". Out of the 1408 services that I monitor
passively in this manner, 1053 of them (right now) are stale.
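
As a rough sketch of the bookkeeping involved, the Python below counts
stale services under the simplifying assumption that "stale" just means no
result for more than 900 seconds. This is not Nagios's actual freshness
logic, and the function names and data layout are hypothetical.

```python
# Freshness threshold from the message above, in seconds.
FRESHNESS_THRESHOLD = 900


def is_stale(last_result_time, now, threshold=FRESHNESS_THRESHOLD):
    """True if the last passive result is older than the threshold."""
    return (now - last_result_time) > threshold


def count_stale(last_result_times, now, threshold=FRESHNESS_THRESHOLD):
    """Count stale entries in a {service_name: last_result_epoch} dict."""
    return sum(
        1 for t in last_result_times.values() if is_stale(t, now, threshold)
    )
```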

That's not very good! That means that my 333 MHz Ultra 5 is only able to
handle about 350 passive services in five minutes, even with the results
spread fairly evenly over the first four minutes of every five-minute
interval? I think not, especially when the machine is visibly idle -- the
CPU's not busy, the network's handling the load fine, very little disk I/O
is happening, no paging is happening, and the named pipe (nagios.cmd) is
almost never over 1/2 full.
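
The arithmetic behind that estimate can be sketched as follows, using only
the figures quoted above and the same reading of them (fresh services
counted as the work done in one five-minute interval). If the 900-second
threshold is what defines "fresh", the true rate could be lower still.

```python
# Figures quoted in the message above.
total_services = 1408
stale_services = 1053
interval_seconds = 5 * 60  # one five-minute reporting interval

# Services whose passive results are still fresh.
fresh_services = total_services - stale_services

# Results per second apparently being processed, versus the rate needed
# to keep every service fresh in each interval.
observed_rate = fresh_services / interval_seconds
required_rate = total_services / interval_seconds

print(f"fresh services: {fresh_services}")
print(f"observed rate:  {observed_rate:.2f} results/s")
print(f"required rate:  {required_rate:.2f} results/s")
```

That works out to roughly 1.2 results per second being handled, against
about 4.7 per second needed.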

I'm starting to get frustrated here. I've tried everything I can think
of, and still the bottleneck seems to be out of my grasp.

(I turned off the command_check_interval -- it's set to -1, which seems to
keep the named pipe nice and empty.)
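
For reference, each passive result reaches Nagios through that named pipe
as one line in the external command format. The Python below is a small
sketch of building such a line; the helper name and the sample values are
hypothetical, and it only formats the string rather than writing to
nagios.cmd.

```python
def format_external_command(timestamp, host, service, return_code, output):
    """Build one PROCESS_SERVICE_CHECK_RESULT line for the command file.

    `timestamp` is the submission time in seconds since the epoch.
    """
    return (
        f"[{int(timestamp)}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{output}\n"
    )


# Hypothetical example values, for illustration only.
line = format_external_command(1021496126, "web01", "disk", 0, "OK")
print(line, end="")
```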

Any ideas? Any luck?

---
Jason Marshall, Unix Geek, Kelman Technologies, Inc., Calgary, AB, Canada.

From a Sun Microsystems bug report (#4102680):
"Workaround: don't pound on the mouse like a wild monkey."

"I have great faith in fools:
Self confidence my friends call it." -Edgar Allan Poe








This post was automatically imported from historical nagios-devel mailing list archives
Original poster: jasonm@kelman.com