[Nagios-devel] Instrumenting Nagios

Guest · Post by **Guest** » Tue May 19, 2009 7:38 pm

Hi Everyone,

We're trying to track down a high latency issue we're having with our Nagios system and I'm hoping to get some advice from folks.
Here's what's going on.

We have a system running Nagios 2.12 and DNX 0.19 (latest).
This setup consists of one main Nagios server and three DNX "worker nodes".

We have 29000+ service checks across about 2500 hosts. Over the last year we have averaged about 250 or more services alarming at any given time. We also have, on average, about 10 hosts down at any given time.

My original thought was that perhaps DNX was slowing down, maybe a leak or something, so I instrumented DNX by timing each job from the moment DNX is handed the job until it posts the results into the circular results buffer.
This figure holds steady at 3.5s.
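A minimal sketch of that kind of hand-off-to-result timing, written in Python for brevity rather than in the C that DNX uses. The `JobTimer` class and its method names are hypothetical and are not part of DNX; the idea is only to stamp each job when it is received and compute the elapsed time when its result is posted.

```python
import time
from collections import deque


class JobTimer:
    """Tracks time from job hand-off to result posting (hypothetical helper)."""

    def __init__(self, window=1000, clock=time.monotonic):
        self._clock = clock
        self._started = {}                       # job_id -> hand-off timestamp
        self._durations = deque(maxlen=window)   # most recent turnaround times

    def job_received(self, job_id):
        """Call when the worker is handed a job."""
        self._started[job_id] = self._clock()

    def result_posted(self, job_id):
        """Call when the result is posted; returns elapsed seconds or None."""
        start = self._started.pop(job_id, None)
        if start is None:
            return None  # result for a job we never saw handed off
        elapsed = self._clock() - start
        self._durations.append(elapsed)
        return elapsed

    def average(self):
        """Mean turnaround over the most recent `window` jobs."""
        if not self._durations:
            return 0.0
        return sum(self._durations) / len(self._durations)
```

A monotonic clock is used so that system clock adjustments do not distort the measured durations, and the bounded window keeps the average focused on recent jobs.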

I am pretty sure all checks are getting executed (at least, all the ones that are enabled) eventually, just more and more slowly over time.
Clearly, some checks are being delayed by something, or even by many things. What I'd like to do is perhaps extend nagiostats to gather information about why latency is at the level it is, to see if we can determine why Nagios is waiting to run these checks.
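To make "more and more slowly over time" measurable, one option is to record, for each check, how long after its scheduled time it actually started, and then fit a trend line through those samples. The sketch below is in Python for brevity and is an assumption about how this could be done, not something nagiostats does; the function names and the sample format are hypothetical.

```python
def check_latency(scheduled_at, started_at):
    """Latency in seconds: how long after its scheduled time a check began."""
    return max(0.0, started_at - scheduled_at)


def latency_trend(samples):
    """Least-squares slope of latency over time.

    samples: list of (started_at, latency) pairs, both in seconds.
    Returns seconds of added latency per second of wall-clock time;
    a persistently positive value means checks are falling further behind.
    """
    n = len(samples)
    if n < 2:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_l = sum(l for _, l in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return 0.0  # all samples share one timestamp; no trend can be fitted
    cov = sum((t - mean_t) * (l - mean_l) for t, l in samples)
    return cov / var_t
```

A slope near zero with a high average would point at a steady backlog, while a positive slope would suggest the scheduler is losing ground continuously.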

What should we be looking at, either in the event loop or outside of it, to get a good overview of what Nagios is doing, how, and why?

We are thinking of adding counters to the different events (both high and low priority) in an attempt to determine the source of the latency in detail. For example, if the average check latency is 100 seconds, we would like to be able to show that 30 seconds of that was spent doing notifications, 20 seconds was spent doing service reaping, etc. That way we can know where we need to make optimizations.
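The per-event accounting described above could look roughly like the following. This is a Python sketch for illustration only; the real change would live in the C event loop, and `EventProfiler` and the event-type strings are hypothetical names, not Nagios APIs. Note also that time spent inside a handler is only a proxy for that handler's contribution to check latency.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class EventProfiler:
    """Accumulates call counts and wall-clock time per event type."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self.counts = defaultdict(int)     # event type -> number of runs
        self.seconds = defaultdict(float)  # event type -> total seconds spent

    @contextmanager
    def timed(self, event_type):
        """Wrap one event handler invocation and record its duration."""
        start = self._clock()
        try:
            yield
        finally:
            # Record the time even if the handler raises.
            self.counts[event_type] += 1
            self.seconds[event_type] += self._clock() - start

    def breakdown(self):
        """Return {event_type: share of total measured time}, largest first."""
        total = sum(self.seconds.values())
        if total == 0:
            return {}
        ranked = sorted(self.seconds.items(), key=lambda kv: kv[1], reverse=True)
        return {name: spent / total for name, spent in ranked}
```

Each handler call in the loop would be wrapped as `with profiler.timed("service_reaper"): ...`, and `breakdown()` could be dumped periodically to see which event types dominate.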

I'm thinking that if we can instrument the following events we should have most of our bases covered (note that some of these may already be instrumented):

log file rotations,
external command checks,
service reaper events,
program shutdown,
program restart,
orphan check,
retention save,
status save,
service result freshness,
host result freshness,
expired downtime check,
check rescheduling,
expired comment check,
host check,
service check

Is there anything else that could or should be instrumented that would give us a good view into what Nagios is doing that's causing service checks to be executed further and further away from when they were scheduled?

Is this list complete? Do these events make sense to instrument, and would they be useful in determining what is contributing to check latency?

Thanks in advance!

Sincerely,
Steve

This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]