Re: [Nagios-devel] Instrumenting Nagios


Post by Guest »

Steve,

I would like to offer a possible clue from my experience.

On a Nagios 2.10 system with 4 GB RAM and 2 CPUs, I have been able to attain a latency of under 10 seconds with 50,000 *PASSIVE* service checks executed on a 5-minute cycle, using my event broker module, which inserts the check results directly into the service check result queue and bypasses the command pipe.

This, to me, implies that the issue you are seeing occurs BEFORE the check reaches the service check result queue, not after; it is therefore not within the event-processing logic proper, but rather in the active check execution logic in your case. That said, I would begin by confining my investigation to that realm.

And now I will say something incredibly obvious: have you attempted a shift to 3.0 to see what kind of improvements you may get?

Daniel.

On May 20, 2009, at 9:50 PM, eponymous alias wrote:

>
> To the extent that such delays may be partly
> due to general cost of computing, profiling the
> entire nagios binary would not be a bad idea.
> gprof is your friend.
>
> --- On Tue, 5/19/09, Steven D. Morrey wrote:
>
>> From: Steven D. Morrey
>> Subject: [Nagios-devel] Instrumenting Nagios
>> To: "[email protected]" =
> >
>> Date: Tuesday, May 19, 2009, 11:11 AM
>> Hi Everyone,
>>
>> We're trying to track down a high latency issue we're
>> having with our Nagios system and I'm hoping to get some
>> advice from folks.
>> Here's what's going on.
>>
>> We have a system running Nagios 2.12 and DNX 0.19 (latest)
>> This setup is comprised of 1 main nagios server and 3 DNX
>> "worker nodes".
>>
>> We have 29000+ service checks across about 2500 hosts. Over
>> the last year we average about 250 or more services alarming
>> at any given time. We also have on average about 10 hosts
>> down at any given time.
>>
>> My original thought was that perhaps DNX was slowing down,
>> maybe a leak or something, so I instrumented DNX by timing
>> from the moment it is handed a job until it posts the results
>> into the circular results buffer.
>> This figure holds steady at 3.5 s.
>>
>> I am pretty sure all checks are getting executed (at least,
>> all the ones that are enabled) eventually. Just more and
>> more slowly over time.
>> Clearly, some checks are being delayed by something or even
>> many things. What I'd like to do is to perhaps extend
>> nagiostats to gather information about why latency is at the
>> level it is, to see if we can't determine why Nagios is
>> waiting to run these checks.
>>
>> What should we be looking at, either in the event loop or
>> outside of it, to get a good overview of what Nagios is
>> doing, how, and why?
>>
>> We are thinking of adding counters to the different events
>> (both high and low) in an attempt to determine the source of
>> the latency in detail. For example, if the average check
>> latency is 100 seconds, we could show that 30 seconds of it
>> was spent doing notifications, 20 seconds doing service
>> reaping, and so on. That way we know where we need to make
>> optimizations.
>>
>> I'm thinking that if we can instrument the following events
>> we should have most of our bases covered (note some of these
>> may already be instrumented)...
>>
>> log file rotations,
>> external command checks,
>> service reaper events,
>> program shutdown,
>> program restart,
>> orphan check,
>> retention save,
>> status save,
>> service result freshness,
>> host result freshness,
>> expired downtime check,
>> check rescheduling,
>> expired comment check,
>> host check,
>> service check
>>
>> Is there anything else that could or should be instrumented
>> to give us a good view into what Nagios is doing that's
>> causing service checks to be executed further and further
>> from when they were scheduled?
>>
>> Are these complete? Do these make sense to instrument and
>> would they be useful in determining what is contributing to
>> check latency?
>>
>>
>> Tha

...[email truncated]...


This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]