Re: [Nagios-devel] Instrumenting Nagios

Post by **Guest** » Thu May 21, 2009 2:00 pm

gprof doesn't like Nagios.
It generates new profile data for each fork.
I have 30,000 service checks on 3,000 hosts that run each hour.
Even then it's OK for 30 minutes or an hour, but when you are trying to debug
something that takes 2 or 3 days to show, it becomes nearly impossible to
manage.
oprofile buggered the entire system on my development boxes (SLES 9 on VMware).
Hence the need to instrument just the important parts.
Unless you folks know of some switch or another I can pass in at compile time
to get the profile data to be manageable.
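
For illustration, instrumenting just the important parts could look roughly like the sketch below: a small accumulator that times one chosen section with a monotonic clock instead of profiling the whole binary. The names `section_timer`, `timer_begin`, `timer_end` and `timer_report` are hypothetical and are not part of Nagios or DNX. The sketch assumes a POSIX system with `clock_gettime` (older glibc versions may need `-lrt` at link time).

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical accumulator for one instrumented section (not a Nagios type). */
typedef struct {
    const char *name;        /* label used in the report */
    unsigned long calls;     /* how many times the section ran */
    double total_sec;        /* accumulated wall-clock time in seconds */
} section_timer;

/* Read the monotonic clock as seconds; unaffected by wall-clock adjustments. */
static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Call at the start of the section; keep the returned value for timer_end(). */
static double timer_begin(void) {
    return now_sec();
}

/* Call at the end of the section to add the elapsed time to the accumulator. */
static void timer_end(section_timer *t, double started) {
    assert(t != NULL);
    t->calls++;
    t->total_sec += now_sec() - started;
}

/* Print one summary line: call count, total time and average time per call. */
static void timer_report(const section_timer *t, FILE *out) {
    double avg = t->calls ? t->total_sec / (double)t->calls : 0.0;
    fprintf(out, "%s: calls=%lu total=%.6fs avg=%.6fs\n",
            t->name, t->calls, t->total_sec, avg);
}
```

A section would be wrapped as `double s = timer_begin(); /* work */ timer_end(&t, s);`. The counters live in the memory of the process that updates them, so anything timed inside a forked child never reaches the parent; a sketch like this only measures the main loop.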

Thanks!

Sincerely,
Steve

________________________________________
From: eponymous alias [[email protected]]
Sent: Wednesday, May 20, 2009 7:50 PM
To: Nagios Developers List
Subject: Re: [Nagios-devel] Instrumenting Nagios

To the extent that such delays may be partly
due to general cost of computing, profiling the
entire nagios binary would not be a bad idea.
gprof is your friend.

--- On Tue, 5/19/09, Steven D. Morrey wrote:

> From: Steven D. Morrey
> Subject: [Nagios-devel] Instrumenting Nagios
> To: "[email protected]"
> Date: Tuesday, May 19, 2009, 11:11 AM
> Hi Everyone,
>
> We're trying to track down a high latency issue we're
> having with our Nagios system and I'm hoping to get some
> advice from folks.
> Here's what's going on.
>
> We have a system running Nagios 2.12 and DNX 0.19 (latest).
> This setup consists of 1 main Nagios server and 3 DNX
> "worker nodes".
>
> We have 29000+ service checks across about 2500 hosts. Over
> the last year we average about 250 or more services alarming
> at any given time. We also have on average about 10 hosts
> down at any given time.
>
> My original thought was that perhaps DNX was slowing down,
> maybe a leak or something, so I instrumented DNX by timing
> from the moment it's handed a job until it posts the results
> into the circular results buffer.
> This figure holds steady at 3.5s.
>
> I am pretty sure all checks are getting executed (at least,
> all the ones that are enabled) eventually. Just more and
> more slowly over time.
> Clearly, some checks are being delayed by something or even
> many things. What I'd like to do is to perhaps extend
> nagiostats to gather information about why latency is at the
> level it is, to see if we can't determine why Nagios is
> waiting to run these checks.
>
> What should we be looking at, either in the event loop or
> outside of it, to get a good overview of what Nagios is
> doing, and how and why it's doing it?
>
> We are thinking of adding counters to the different events
> (both high and low) in an attempt to determine the source of
> the latency in detail. For example, if the average check
> latency is 100 seconds, being able to show that 30 of that
> was spent doing notifications, and 20 seconds spent doing
> service reaping, etc. That way we can know where we need to
> make optimizations.
>
> I'm thinking that if we can instrument the following events
> we should have most of our bases covered (note some of these
> may already be instrumented)...
>
> log file rotations,
> external command checks,
> service reaper events,
> program shutdown,
> program restart,
> orphan check,
> retention save,
> status save,
> service result freshness,
> host result freshness,
> expired downtime check,
> check rescheduling,
> expired comment check,
> host check,
> service check
>
> Is there anything else that could or should be instrumented
> that could give us a good view into what Nagios is doing that's
> causing service checks to be executed further and further
> away from when they were scheduled?
>
> Are these complete? Do these make sense to instrument and
> would they be useful in determining what is contributing to
> check latency?
>
>
> Thanks in advance!
>
> Sincerely,
> Steve

...[email truncated]...
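
As a rough illustration of the per-event counters proposed in the quoted message, the sketch below keeps a count and an accumulated busy time for each event kind in that list and reports each kind's share of the total. The enum and function names (`event_kind`, `event_stats_add`, and so on) are hypothetical and do not correspond to Nagios identifiers. The elapsed time for each event is assumed to be measured by the caller. Time spent inside events is only a proxy for check latency, not a direct decomposition of it.

```c
#include <assert.h>
#include <stdio.h>

/* Event kinds from the list in the quoted post; names are hypothetical. */
typedef enum {
    EV_LOG_ROTATION, EV_COMMAND_CHECK, EV_SERVICE_REAPER,
    EV_PROGRAM_SHUTDOWN, EV_PROGRAM_RESTART, EV_ORPHAN_CHECK,
    EV_RETENTION_SAVE, EV_STATUS_SAVE, EV_SERVICE_FRESHNESS,
    EV_HOST_FRESHNESS, EV_EXPIRED_DOWNTIME, EV_RESCHEDULE_CHECKS,
    EV_EXPIRED_COMMENT, EV_HOST_CHECK, EV_SERVICE_CHECK,
    EV_COUNT
} event_kind;

static const char *const event_names[EV_COUNT] = {
    "log file rotation", "external command check", "service reaper",
    "program shutdown", "program restart", "orphan check",
    "retention save", "status save", "service result freshness",
    "host result freshness", "expired downtime check", "check rescheduling",
    "expired comment check", "host check", "service check"
};

/* Per-kind totals; zero-initialize before use. */
typedef struct {
    unsigned long count[EV_COUNT];
    double busy_sec[EV_COUNT];
} event_stats;

/* Record one handled event of kind k that took elapsed_sec seconds. */
static void event_stats_add(event_stats *s, event_kind k, double elapsed_sec) {
    assert((int)k >= 0 && k < EV_COUNT);
    s->count[k]++;
    s->busy_sec[k] += elapsed_sec;
}

/* Total time spent across all event kinds. */
static double event_stats_total(const event_stats *s) {
    double total = 0.0;
    int k;
    for (k = 0; k < EV_COUNT; k++)
        total += s->busy_sec[k];
    return total;
}

/* Fraction (0..1) of the total time that was spent in kind k. */
static double event_stats_share(const event_stats *s, event_kind k) {
    double total = event_stats_total(s);
    return total > 0.0 ? s->busy_sec[k] / total : 0.0;
}

/* Print one line per event kind that occurred at least once. */
static void event_stats_report(const event_stats *s, FILE *out) {
    int k;
    for (k = 0; k < EV_COUNT; k++) {
        if (s->count[k] == 0)
            continue;
        fprintf(out, "%-26s n=%-8lu busy=%10.3fs share=%5.1f%%\n",
                event_names[k], s->count[k], s->busy_sec[k],
                100.0 * event_stats_share(s, (event_kind)k));
    }
}
```

With figures like those in the quoted example, 20 seconds recorded for service reaping out of 100 seconds total would show up as a 20% share, which gives a first hint about where optimization effort might pay off.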

This post was automatically imported from historical nagios-devel mailing list archives
Original poster: eponymous alias [[email protected]]