[Nagios-devel] Request for comment: Overhaul of Performance Info

Posted: Wed Apr 02, 2008 4:44 am
by Guest

Hi all,

I'd like to propose an overhaul of the Performance Info
(extinfo.cgi?&type=4).

Over the last few weeks I prepared a migration and upgrade from our old 2.9
install to a new physical machine and Nagios 3.0. During that time I watched
the Performance Info a lot, since performance was an issue for us: the
"migration machine" was running inside a VM on an ESX host. Sadly, I came to
the conclusion that the way the info is presented seems to be useless.

The reason is simple:

For example, I get the number and percentage of services that were actively
checked in the last 1/5/15/60 minutes. So far so good. But what exactly does
this info tell us?
Right: nothing. I have no means to interpret this information, because I
cannot determine whether the number of actively checked services in the last
minute (for example) is good or bad. What's missing is a number to compare
against: how many services _should_ have been actively checked in the last
minute. In our scenario, I have loads of services scheduled every minute
(pings, disk, memory, etc.), but I also have a lot of services that are only
checked once per hour or once per day.
So when Nagios tells me that 68% of my service checks were performed in the
last minute, I have no clue whether that means everything is all right or not.

What I would like to see is a comparable performance info, telling me:

x% of your active service checks that should have been checked in the last
minute have been checked.
x% of your active service checks that should have been checked in the last
15 minutes have been checked.
etc.
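
To make the difference concrete, here is a minimal Python sketch of the two
numbers. It is not Nagios code: the Service record, both function names and the
sample data are made up for illustration, and it assumes (as a simplification)
that a service is "due" within a window if its check interval is no longer than
that window.

```python
from dataclasses import dataclass


@dataclass
class Service:
    # Hypothetical record, not a Nagios data structure.
    name: str
    check_interval_min: float  # how often the check is scheduled, in minutes
    last_check_age_min: float  # minutes since the last completed active check


def raw_coverage(services, window_min):
    """Percent of ALL services checked within the window."""
    checked = sum(1 for s in services if s.last_check_age_min <= window_min)
    return 100.0 * checked / len(services)


def due_coverage(services, window_min):
    """Percent of services DUE within the window that were actually checked.

    Assumption: "due" means check_interval_min <= window_min.
    Returns None when nothing was due.
    """
    due = [s for s in services if s.check_interval_min <= window_min]
    if not due:
        return None
    checked = sum(1 for s in due if s.last_check_age_min <= window_min)
    return 100.0 * checked / len(due)


# Scenario A: 68 one-minute checks, all on time, plus 32 hourly checks
# that last ran 30 minutes ago (which is fine for an hourly check).
healthy = (
    [Service(f"ping-{i}", 1, 0.5) for i in range(68)]
    + [Service(f"hourly-{i}", 60, 30) for i in range(32)]
)

# Scenario B: 100 one-minute checks, 32 of which are lagging behind.
lagging = (
    [Service(f"ping-{i}", 1, 0.5) for i in range(68)]
    + [Service(f"late-{i}", 1, 3.0) for i in range(32)]
)

for label, services in (("healthy", healthy), ("lagging", lagging)):
    print(label, raw_coverage(services, 1), due_coverage(services, 1))
```

Both scenarios give the same raw figure for the last minute, but the "due"
figure tells them apart, which is exactly the comparison the current page
does not offer.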

That way I can decide whether I am putting too much stress on the Nagios
server, and if so, whether the cause is, for example, too many concurrent
service checks that are lagging behind.

I do know that latency and execution time are displayed too, but that
information is not really useful to me either. Which brings me to the next
point:

Check Execution Time needs some means to distinguish between checks that
timed out and those that just took long. For as long as I can remember, the
values displayed there have looked like this:

Check Execution Time: 0.01 sec 10.01 sec 0.494 sec

0.01 is checks on localhost: they are the minimum.
10.01 is checks that timed out, mainly remote sites where the VPN is
currently down, for example: they are the maximum.
0.5 is roughly the average at all times.

I think people wouldn't even notice if you hardcoded those numbers in the
CGI ;)
Information that is more or less static is not useful as a performance
counter. To reflect the real circumstances, timed-out checks need to be
filtered out, so that I have a means to see whether some checks take longer
than expected.
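
A minimal Python sketch of that filtering, to show the effect on the numbers.
It is not taken from the Nagios source: the function name, the sample values
and the 10-second timeout are assumptions (the timeout is only inferred from
the 10.01 sec maximum above), and a real implementation would presumably use
whatever flag Nagios sets on a timed-out check instead of comparing durations.

```python
def execution_time_stats(times_sec, timeout_sec=10.0):
    """Min/max/avg over completed checks, with timed-out checks counted apart.

    Assumption: any check that ran for timeout_sec or longer timed out.
    """
    completed = [t for t in times_sec if t < timeout_sec]
    timed_out = len(times_sec) - len(completed)
    if not completed:
        return {"timed_out": timed_out, "min": None, "max": None, "avg": None}
    return {
        "timed_out": timed_out,
        "min": min(completed),
        "max": max(completed),
        "avg": sum(completed) / len(completed),
    }


# Made-up sample: three checks that completed, two that hit the timeout.
sample = [0.01, 0.25, 0.5, 10.01, 10.01]
stats = execution_time_stats(sample)
print(stats)
```

With the timeouts counted separately, the maximum reflects the slowest check
that actually completed instead of being pinned at the timeout value.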

/discuss

S

--
Sascha Runschke
Network and Systems Management
Phone : +49 (201) 102-1879 Mobile : +49 (173) 5419665 Fax : +49 (201) 102-1102105



GFKL Financial Services AG
Management Board: Dr. Peter Jänsch (Chair), Jürgen Baltes, Dr. Till Ergenzinger, Dr. Tom Haverkamp
Chairman of the Supervisory Board: Dr. Georg F. Thoma
Registered office: Limbecker Platz 1, 45127 Essen, Amtsgericht Essen, HRB 13522

This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]