Re: [Nagios-devel] Nagios and Gearman - huge environment performance problem

Support forum for Nagios Core, Nagios Plugins, NCPA, NRPE, NSCA, NDOUtils and more. Engage with the community of users including those using the open source solutions.
Locked
Guest

Re: [Nagios-devel] Nagios and Gearman - huge environment performance problem

Post by Guest »


I don't see anything obvious here, but that's not always the case. This kind of debugging isn't always cut and dried: each environment is different, and while the scheduling is complicated, that's also what gives it its power. I don't see a lot of macro processing, which I have noticed can hurt a lot in big environments (I stripped out all but the absolutely necessary ones). Another thing I've seen before: if you have a large number of service checks with long timeouts and they are actually timing out, that will throw off the scheduler, because it has to deal with those long delays. Maybe you could post the output of nagiostats and see if that lends any info? It sounds like the core daemon is busy doing something and schedules are getting pushed out, so it's a matter of finding what it's busy doing. I've also used strace in those cases to watch and debug what the daemon is doing, but that can produce a lot of data very fast.
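(As an aside, not from the original thread: the latency numbers nagiostats reports can be pulled in its MRTG mode, which prints one value per requested variable. MINACTSVCLAT/MAXACTSVCLAT/AVGACTSVCLAT are real nagiostats MRTG variables, reported in milliseconds; the helper function and the threshold below are illustrative assumptions.)

```python
def parse_latency(mrtg_output: str, threshold_ms: float = 60_000.0) -> dict:
    """Parse the output of
    `nagiostats --mrtg --data=MINACTSVCLAT,MAXACTSVCLAT,AVGACTSVCLAT`,
    which prints the min, max, and average active service check latency
    (in milliseconds), one value per line.

    threshold_ms is an arbitrary sanity limit, not a Nagios setting.
    """
    min_ms, max_ms, avg_ms = (float(x) for x in mrtg_output.split())
    return {
        "min_ms": min_ms,
        "max_ms": max_ms,
        "avg_ms": avg_ms,
        "overloaded": avg_ms > threshold_ms,
    }

# The 250-500 second latencies reported later in this thread would show
# up as hundreds of thousands of milliseconds here:
print(parse_latency("250000 500000 375000"))
```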

Dan

From: Rodney Ramos [mailto:rodneyra@gmail.com]
Sent: Tuesday, August 23, 2011 3:22 PM
To: Nagios Developers List
Subject: Re: [Nagios-devel] Nagios and Gearman - huge environment performance problem

Hi, everybody. Sorry for taking so long to reply, but I was testing what was suggested.

Well, I put all the files (status.dat, checkresults, nagios.tmp, nagios.log, etc.) on a RAM disk (/dev/shm). I also disabled all broker modules, leaving only the mod_gearman broker, of course. I disabled flapping detection, performance data processing, everything.
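(For readers following along: moving those files to the RAM disk is done with path directives like the following in nagios.cfg. The directive names are standard Nagios Core settings; the /dev/shm subdirectory layout is an assumption, and anything placed there is lost on reboot.)

```
# Illustrative nagios.cfg fragment: point the hot files at tmpfs.
status_file=/dev/shm/nagios/status.dat
check_result_path=/dev/shm/nagios/checkresults
temp_file=/dev/shm/nagios/nagios.tmp
temp_path=/dev/shm/nagios
```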

The result: absolutely nothing. No improvement. Nagios still sits at 100% CPU. Latency is still big, between 250 and 500 sec.

I've also tested the parameters "max_concurrent_checks", "check_result_reaper_frequency" and "max_check_result_reaper_time".

When I changed max_concurrent_checks from "0" to "200", the nagios process fell to 30-50% CPU. However, the latency increased a lot, going to more than 1000 sec!!

I also changed "check_result_reaper_frequency" and "max_check_result_reaper_time": the first from 10 to 5 sec, the second from 30 to 15 sec. No big difference.
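(Summarizing the combinations tested above as a nagios.cfg fragment; these are real Nagios Core directives, with the values from this message. Whether they help clearly depends on the environment.)

```
# Tuning knobs tried in this thread:
max_concurrent_checks=200          # was 0 (unlimited); capped CPU but raised latency
check_result_reaper_frequency=5    # was 10 seconds
max_check_result_reaper_time=15    # was 30 seconds
```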

I've enabled the Nagios debug log too. I had to increase the debug file size, as it gets full very, very fast. You can see some lines below.

My conclusion: I think Nagios is not able to run active checks against so many hosts and services. It is a limitation of the tool. It has to do so much processing, like scheduling and rescheduling, that all the active checks get delayed. And it is not Gearman's fault; on the contrary, gearman and mod_gearman do their jobs very well.

But, as Daniel said, there is one thing I can't understand: why is my CPU 87% idle? It's very weird. Is there something that would make the performance better? Some Nagios or operating system parameter?

Thank you very much.

===================
Debug output:
===================
[1314129294.322456] [032.0] [pid=31793] ** Service Notification Attempt ** Host: '139874', Service: 'Memoria', Type: 0, Options: 0, Current State: 2, Last Notification: Wed Dec 31 21:00:00 1969
[1314129294.322461] [001.0] [pid=31793] check_service_notification_viability()
[1314129294.322464] [001.0] [pid=31793] check_time_against_period()
[1314129294.322469] [032.1] [pid=31793] Notifications are temporarily disabled for this service, so we won't send one out.
[1314129294.322473] [032.0] [pid=31793] Notification viability test failed. No notification will be sent out.
[1314129294.322477] [016.1] [pid=31793] Rescheduling next check of service at Tue Aug 23 17:07:56 2011
[1314129294.322481] [001.0] [pid=31793] get_next_valid_time()
[1314129294.322484] [001.0] [pid=31793] check_time_against_period()
[1314129294.322493] [001.0] [pid=31793] schedule_service_check()
[1314129294.322498] [016.0] [pid=31793] Scheduling a non-forced, active check of

...[email truncated]...


This post was automatically imported from historical nagios-devel mailing list archives
Original poster: rodneyra@gmail.com
Locked