[Nagios-devel] Request for comment: Overhaul of Performance Info


Post by Guest »

Hi all,

I'd like to propose an overhaul of the Performance Info page
(extinfo.cgi?&type=4).

Over the last few weeks I prepared a migration and upgrade from our old
2.9 install to a new physical machine running Nagios 3.0. During that time
I watched the Performance Info a lot, since performance was an issue for
us: the "migration machine" was running inside a VM on an ESX host. Sadly,
I came to the conclusion that the way the information is presented is
useless.

The reason is simple:

For example, I get the number and percentage of services that were
actively checked in the last 1/5/15/60 minutes. So far so good. But what
exactly does this information tell us? Right: nothing. I have no means to
interpret it, because I cannot determine whether the number of services
actively checked in the last minute (for example) is good or bad. What's
missing are numbers that compare the services actually checked with those
that _should_ have been checked in the last minute. In our scenario, I
have loads of services scheduled every minute (pings, disk, memory, etc.),
but I also have a lot of services that are only checked once per hour or
once per day.
So when Nagios tells me that 68% of my service checks were performed in
the last minute, I have no clue whether that means everything is all right
or not.

What I would like to see is comparable performance info, telling me:

x% of the active service checks that should have run in the last minute
have run.
x% of the active service checks that should have run in the last 15
minutes have run.
etc.

That way I can decide whether I am putting too much stress on the Nagios
server or not and, if so, whether the cause is, for example, too many
concurrent service checks that are lagging behind.
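
To make the proposed figure concrete, here is a rough Python sketch of the ratio described above. It is only an illustration, not Nagios code: the `Service` record, its field names, and the `coverage` function are hypothetical, and it assumes the simplification that a service is "due" in a window whenever its check interval is no longer than that window.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Service:
    # All fields are hypothetical stand-ins for whatever the scheduler tracks.
    name: str
    check_interval: float  # minutes between scheduled active checks
    last_check_age: float  # minutes since the last active check finished


def coverage(services: List[Service], window: float) -> Optional[float]:
    """Percent of the services due within `window` minutes that were checked in it.

    Assumption: a service is due if its check interval fits into the window.
    Returns None when no service is due, so no percentage can be given.
    """
    due = [s for s in services if s.check_interval <= window]
    if not due:
        return None
    done = sum(1 for s in due if s.last_check_age <= window)
    return 100.0 * done / len(due)


services = [
    Service("ping", check_interval=1, last_check_age=0.5),
    Service("disk", check_interval=1, last_check_age=3.0),     # lagging behind
    Service("backup", check_interval=60, last_check_age=40.0),
    Service("cert", check_interval=1440, last_check_age=300.0),
]

# Figure in the style of the current page: checked in the last minute / all services.
naive_1min = 100.0 * sum(1 for s in services if s.last_check_age <= 1) / len(services)

# Proposed figure: checked in the last minute / services due in the last minute.
due_1min = coverage(services, 1)
due_60min = coverage(services, 60)
```

With this made-up data the first figure is 25%, which says little because the hourly and daily services were never expected to run in that minute. The second figure is 50%, which directly shows that one of the two per-minute checks is late.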

I do know that latency and execution time are displayed too, but that
information is not really useful to me either. Which brings me to the next
point:

Check Execution Time needs some means to distinguish between checks that
timed out and those that just took long. For as long as I can remember,
the values displayed there have looked like this:

Check Execution Time: 0.01 sec 10.01 sec 0.494 sec

0.01 is the checks on localhost: they are the minimum.
10.01 is the checks that timed out, mainly remote sites where the VPN is
currently down, for example: they are the maximum.
0.5 is roughly the average at all times.

I think people wouldn't even notice if you hardcoded those numbers in the
CGI ;)
Information that is more or less static is not useful as a performance
counter. To reflect the real circumstances, timed-out checks need to be
filtered out, so that I have a means to see whether some checks take
longer than expected.
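
As a sketch of the filtering suggested above, the following Python snippet computes min/max/average only over checks that finished and counts the timeouts separately. The input shape (pairs of execution time and a timed-out flag) and the name `execution_stats` are assumptions made for illustration; they are not taken from Nagios itself.

```python
from statistics import mean
from typing import List, Optional, Tuple

# Hypothetical input: (execution time in seconds, whether the check timed out).
Check = Tuple[float, bool]


def execution_stats(checks: List[Check]) -> Tuple[Optional[Tuple[float, float, float]], int]:
    """Return ((min, max, average) over finished checks, number of timed-out checks).

    The statistics are None if every check timed out or the list is empty.
    """
    finished = [seconds for seconds, timed_out in checks if not timed_out]
    timeouts = len(checks) - len(finished)
    if not finished:
        return None, timeouts
    return (min(finished), max(finished), mean(finished)), timeouts


# Made-up sample: three checks that finished, two that hit a 10 second timeout.
checks = [(0.01, False), (0.35, False), (2.5, False), (10.01, True), (10.01, True)]

stats, timeouts = execution_stats(checks)
```

For this sample the maximum becomes 2.5 seconds instead of 10.01, so a check that is merely slow stands out, while the two timeouts are reported as their own number.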

/discuss

S

--
Sascha Runschke
Network and Systems Management
Phone: +49 (201) 102-1879 Mobile: +49 (173) 5419665 Fax: +49 (201) 102-1102105



GFKL Financial Services AG
Executive Board: Dr. Peter Jänsch (Chairman), Jürgen Baltes, Dr. Till Ergenzinger, Dr. Tom Haverkamp
Chairman of the Supervisory Board: Dr. Georg F. Thoma
Registered office: Limbecker Platz 1, 45127 Essen, Amtsgericht Essen, HRB 13522

This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]