Re: [Nagios-devel] Instrumenting Nagios




To the extent that such delays may be partly
due to general cost of computing, profiling the
entire nagios binary would not be a bad idea.
gprof is your friend.

--- On Tue, 5/19/09, Steven D. Morrey wrote:

> From: Steven D. Morrey
> Subject: [Nagios-devel] Instrumenting Nagios
> To: "[email protected]"
> Date: Tuesday, May 19, 2009, 11:11 AM
> Hi Everyone,
>
> We're trying to track down a high latency issue we're
> having with our Nagios system and I'm hoping to get some
> advice from folks.
> Here's what's going on.
>
> We have a system running Nagios 2.12 and DNX 0.19 (latest).
> This setup is comprised of 1 main nagios server and 3 DNX
> "worker nodes".
>
> We have 29000+ service checks across about 2500 hosts. Over
> the last year we average about 250 or more services alarming
> at any given time. We also have on average about 10 hosts
> down at any given time.
>
> My original thought was that perhaps DNX was slowing down,
> maybe a leak or something, so I instrumented DNX by timing
> from the moment it's handed a job until it posts the results
> into the circular results buffer.
> This figure holds steady at 3.5s.
>
> I am pretty sure all checks are getting executed (at least,
> all the ones that are enabled) eventually. Just more and
> more slowly over time.
> Clearly, some checks are being delayed by something or even
> many things. What I'd like to do is to perhaps extend
> nagiostats to gather information about why latency is at the
> level it is, to see if we can't determine why Nagios is
> waiting to run these checks.
>
> What should we be looking at, either in the event loop or
> outside of it, to get a good overview of how, what, and why
> nagios is doing what it's doing?
>
> We are thinking of adding counters to the different events
> (both high and low) in an attempt to determine the source of
> the latency in detail. For example, if the average check
> latency is 100 seconds, being able to show that 30 of that
> was spent doing notifications, and 20 seconds spent doing
> service reaping, etc. That way we can know where we need to
> make optimizations.
>
> I'm thinking that if we can instrument the following events
> we should have most of our bases covered (note some of these
> may already be instrumented)...
>
> log file rotations,
> external command checks,
> service reaper events,
> program shutdown,
> program restart,
> orphan check,
> retention save,
> status save,
> service result freshness,
> host result freshness,
> expired downtime check,
> check rescheduling,
> expired comment check,
> host check,
> service check
>
> Is there anything else that could or should be instrumented
> that could give us a good view into what nagios is doing that's
> causing service checks to be executed further and further
> away from when they were scheduled?
>
> Are these complete? Do these make sense to instrument and
> would they be useful in determining what is contributing to
> check latency?
>
>
> Thanks in advance!
>
> Sincerely,
> Steve

...[email truncated]...
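
The per-event counters proposed in the quoted message could be sketched roughly as follows. This is only an illustration of the accounting idea, not Nagios code: Nagios itself is written in C, Python is used here for brevity, and `EventTimer`, `measure`, and `report` are hypothetical names. The event labels are borrowed from the list in the quoted message, and the sleeps merely stand in for real handler work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class EventTimer:
    """Accumulates wall-clock time and call counts per event type."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock                      # injectable, which eases testing
        self.total_seconds = defaultdict(float)  # event type -> seconds spent
        self.calls = defaultdict(int)            # event type -> times handled

    @contextmanager
    def measure(self, event_type):
        """Time one handler invocation and charge it to event_type."""
        start = self._clock()
        try:
            yield
        finally:
            self.total_seconds[event_type] += self._clock() - start
            self.calls[event_type] += 1

    def report(self):
        """Return (event_type, seconds, calls, share) rows, largest first."""
        grand_total = sum(self.total_seconds.values())
        rows = []
        for event_type, seconds in self.total_seconds.items():
            share = seconds / grand_total if grand_total else 0.0
            rows.append((event_type, seconds, self.calls[event_type], share))
        return sorted(rows, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    timer = EventTimer()
    # Stand-ins for real event handlers; the sleeps only simulate work.
    with timer.measure("service reaper"):
        time.sleep(0.02)
    with timer.measure("status save"):
        time.sleep(0.01)
    for event_type, seconds, calls, share in timer.report():
        print(f"{event_type:20s} {seconds:8.4f}s {calls:4d} calls {share:6.1%}")
```

Wrapping each event handler in one such measurement would give the kind of breakdown the message asks for (for example, 30 seconds in notifications and 20 in service reaping), assuming the time spent inside handlers is what delays the scheduled checks. A profiler such as gprof, as suggested in the reply, would cover the rest of the binary.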


This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]