Re: [Nagios-devel] Instrumenting Nagios
Guest
To the extent that such delays may be partly due to general cost of
computing, profiling the entire nagios binary would not be a bad idea.
gprof is your friend.

--- On Tue, 5/19/09, Steven D. Morrey wrote:

> From: Steven D. Morrey
> Subject: [Nagios-devel] Instrumenting Nagios
> To: "[email protected]"
> Date: Tuesday, May 19, 2009, 11:11 AM
> Hi Everyone,
>
> We're trying to track down a high latency issue we're
> having with our Nagios system and I'm hoping to get some
> advice from folks.
> Here's what's going on.
>
> We have a system running Nagios 2.12 and DNX 0.19 (latest).
> This setup is comprised of 1 main nagios server and 3 DNX
> "worker nodes".
>
> We have 29000+ service checks across about 2500 hosts. Over
> the last year we average about 250 or more services alarming
> at any given time. We also have on average about 10 hosts
> down at any given time.
>
> My original thought was that perhaps DNX was slowing down,
> maybe a leak or something, so I instrumented DNX by timing
> from the moment it's handed a job until it posts the results
> into the circular results buffer.
> This figure holds steady at 3.5s.
>
> I am pretty sure all checks are getting executed (at least,
> all the ones that are enabled) eventually. Just more and
> more slowly over time.
> Clearly, some checks are being delayed by something or even
> many things. What I'd like to do is to perhaps extend
> nagiostats to gather information about why latency is at the
> level it is, to see if we can't determine why Nagios is
> waiting to run these checks.
>
> What should we be looking at, either in the event loop or
> outside of it, to get a good overview of how, what, and why
> nagios is doing what it's doing?
>
> We are thinking of adding counters to the different events
> (both high and low) in an attempt to determine the source of
> the latency in detail. For example, if the average check
> latency is 100 seconds, being able to show that 30 of that
> was spent doing notifications, and 20 seconds spent doing
> service reaping, etc. That way we can know where we need to
> make optimizations.
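
For what it's worth, the per-event counters described above could look roughly like the sketch below. This is only an illustration of the accounting idea, not code from Nagios or DNX: the enum values, `ev_begin`, `ev_end`, `ev_record` and `ev_share` are all made-up names, and it assumes a POSIX system where `clock_gettime(CLOCK_MONOTONIC, ...)` is available.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

/* Hypothetical event categories; these are NOT Nagios identifiers. */
enum ev_kind { EV_NOTIFICATION, EV_SERVICE_REAPER, EV_STATUS_SAVE, EV_KIND_COUNT };

static const char *ev_names[EV_KIND_COUNT] = {
    "notification", "service_reaper", "status_save"
};

/* One accumulator per category: how often it ran and for how long. */
struct ev_stat { unsigned long calls; double total_sec; };
static struct ev_stat ev_stats[EV_KIND_COUNT];

/* Monotonic clock, so wall-clock adjustments do not skew the totals. */
static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Charge an already-measured duration to a category. */
static void ev_record(enum ev_kind kind, double elapsed_sec)
{
    ev_stats[kind].calls++;
    ev_stats[kind].total_sec += elapsed_sec;
}

/* Bracket an event handler: t0 = ev_begin(); ...work...; ev_end(kind, t0); */
static double ev_begin(void) { return now_sec(); }
static void ev_end(enum ev_kind kind, double t0) { ev_record(kind, now_sec() - t0); }

/* Fraction of all measured time spent in one category (0 if nothing measured). */
static double ev_share(enum ev_kind kind)
{
    double total = 0.0;
    for (int i = 0; i < EV_KIND_COUNT; i++)
        total += ev_stats[i].total_sec;
    return total > 0.0 ? ev_stats[kind].total_sec / total : 0.0;
}

/* Dump one line per category, e.g. at shutdown or on a status save. */
static void ev_report(FILE *out)
{
    for (int i = 0; i < EV_KIND_COUNT; i++)
        fprintf(out, "%-16s calls=%lu total=%.3fs share=%.1f%%\n",
                ev_names[i], ev_stats[i].calls, ev_stats[i].total_sec,
                100.0 * ev_share((enum ev_kind)i));
}
```

Each handler in the event loop would be wrapped in an `ev_begin()`/`ev_end()` pair, and `ev_report(stderr)` called periodically. Note that this only attributes time spent inside handlers; time the loop spends sleeping or blocked elsewhere would need its own category.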
>
> I'm thinking that if we can instrument the following events
> we should have most of our bases covered (note some of these
> may already be instrumented)...
>
> log file rotations,
> external command checks,
> service reaper events,
> program shutdown,
> program restart,
> orphan check,
> retention save,
> status save,
> service result freshness,
> host result freshness,
> expired downtime check,
> check rescheduling,
> expired comment check,
> host check,
> service check
>
> Is there anything else that could or should be instrumented
> that could give us a good view into what nagios is doing that's
> causing service checks to be executed further and further
> away from when they were scheduled?
>
> Are these complete? Do these make sense to instrument and
> would they be useful in determining what is contributing to
> check latency?
>
>
> Thanks in advance!
>
> Sincerely,
> Steve
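
Since the question is about checks running further and further from their scheduled time, here is a rough sketch of tracking that gap directly. It assumes latency means "actual start time minus scheduled time", clamped at zero; `latency_stats`, `check_latency`, `latency_add` and `latency_avg` are hypothetical names for illustration, not existing Nagios functions.

```c
#include <stdio.h>
#include <time.h>

/* Running summary of observed check latencies. */
struct latency_stats {
    unsigned long samples;
    double total_sec;
    double max_sec;
};

/* Seconds between when a check was due and when it actually started.
 * A check that starts early is counted as zero latency (an assumption). */
static double check_latency(time_t scheduled, time_t started)
{
    double d = difftime(started, scheduled);
    return d > 0.0 ? d : 0.0;
}

/* Fold one observation into the summary. */
static void latency_add(struct latency_stats *s, double latency_sec)
{
    s->samples++;
    s->total_sec += latency_sec;
    if (latency_sec > s->max_sec)
        s->max_sec = latency_sec;
}

/* Mean latency so far, or 0 if nothing has been recorded. */
static double latency_avg(const struct latency_stats *s)
{
    return s->samples ? s->total_sec / (double)s->samples : 0.0;
}
```

Keeping one such summary per interval (say, per minute) rather than a single lifetime total would show whether the average is actually climbing over time, which a cumulative figure tends to hide.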
...[email truncated]...
This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]