Re: [Nagios-devel] Nagios and Gearman - huge environment

Posted: Tue Aug 23, 2011 7:21 pm
by Guest

Hi, everybody. Sorry for taking so long to reply, but I was testing what was
suggested.

Well, I put all the files (status.dat, checkresults, nagios.tmp, nagios.log, etc.)
on a RAM disk (/dev/shm). I also disabled all broker modules, leaving only
the mod_gearman broker, of course. I disabled flap detection,
performance data processing, everything.

The result: absolutely nothing. No improvement. Nagios still sits at 100%
of CPU. Latency is still high, between 250 and 500 sec.
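
To compare latency between test runs, the figure can be read straight from status.dat. The following is only a minimal sketch, assuming the usual status.dat layout in which each "servicestatus { ... }" block carries a "check_latency=" line; the function name latency_summary is hypothetical and not part of Nagios.

```python
import re

def latency_summary(status_text):
    """Return (count, min, max, mean) of check_latency over servicestatus blocks."""
    latencies = []
    # Each block is assumed to look like "servicestatus {\n\tkey=value\n\t...\n\t}"
    for block in re.findall(r"servicestatus\s*\{(.*?)\n\s*\}", status_text, re.S):
        m = re.search(r"^\s*check_latency=([0-9.]+)\s*$", block, re.M)
        if m:
            latencies.append(float(m.group(1)))
    if not latencies:
        return (0, None, None, None)
    return (len(latencies), min(latencies), max(latencies),
            sum(latencies) / len(latencies))
```

Pass it the contents of status.dat (for example, open(path).read(), with path pointing at wherever the file lives on the RAM disk) and log the result after each configuration change.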

I've also tested the parameters "max_concurrent_checks",
"check_result_reaper_frequency" and "max_check_result_reaper_time".

When I changed max_concurrent_checks from "0" to "200", the nagios
process dropped to between 30% and 50% CPU. However, the latency increased a lot, going to
more than 1000 sec!
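
One way to reason about why a concurrency cap can raise latency: if at most N checks may be in flight and each takes t seconds on average, no more than N / t checks can finish per second. When the schedule demands more than that, the backlog (and so the latency) keeps growing. The sketch below only illustrates that arithmetic; the check count, interval and average check duration are hypothetical values, since none of them are given in this thread.

```python
def capacity_per_second(max_concurrent, avg_check_seconds):
    """Upper bound on completed checks per second under a concurrency cap."""
    return max_concurrent / avg_check_seconds

def demand_per_second(num_checks, interval_seconds):
    """Checks per second the schedule asks for."""
    return num_checks / interval_seconds

# Hypothetical numbers, NOT taken from the original post:
cap = capacity_per_second(200, 2.0)      # 200 in flight, 2 s per check
need = demand_per_second(60000, 300.0)   # 60000 checks every 5 minutes
print(cap, need, need > cap)             # demand above capacity means a growing backlog
```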

I also changed "check_result_reaper_frequency" and
"max_check_result_reaper_time": the first from 10 to 5 sec, the second from 30
to 15 sec. No big difference.

I've enabled the Nagios debug log too. I had to increase the debug file size, as
it fills up very, very fast. You can see some lines below.

The conclusion: I think that Nagios is not able to run active checks against so
many hosts and services. It is a limitation of the tool. It has to do so
much processing, such as scheduling and rescheduling, that all the active checks
get delayed. And it is not Gearman's fault. On the contrary, Gearman and
mod_gearman do their jobs very well.

But, as Daniel said, there is one thing that I can't understand: why is my idle
CPU at 87%? It's very weird. Is there something that would improve
performance? A Nagios or operating system parameter?
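
One possible reading of these two numbers, offered as an assumption rather than a diagnosis: the per-process figure is usually relative to a single core, and if the Nagios main loop runs in one process it can saturate one core while the others stay idle. The core count of this server is not stated in the thread; 8 below is a hypothetical value that happens to reproduce the 87% figure.

```python
def idle_percent(cores, busy_cores=1.0):
    """System-wide idle percentage when `busy_cores` cores are fully busy."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return 100.0 * (cores - busy_cores) / cores

# Hypothetical 8-core machine with one core saturated by a single process
print(idle_percent(8))  # 87.5
```

If that reading holds, adding cores or RAM disks would not help much, which would be consistent with the results above.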

Thank you very much.

===================
Debug output:
===================
[1314129294.322456] [032.0] [pid=31793] ** Service Notification Attempt **
Host: '139874', Service: 'Memoria', Type: 0, Options: 0, Current State: 2,
Last Notification: Wed Dec 31 21:00:00 1969
[1314129294.322461] [001.0] [pid=31793]
check_service_notification_viability()
[1314129294.322464] [001.0] [pid=31793] check_time_against_period()
[1314129294.322469] [032.1] [pid=31793] Notifications are temporarily
disabled for this service, so we won't send one out.
[1314129294.322473] [032.0] [pid=31793] Notification viability test failed.
No notification will be sent out.
[1314129294.322477] [016.1] [pid=31793] Rescheduling next check of service
at Tue Aug 23 17:07:56 2011
[1314129294.322481] [001.0] [pid=31793] get_next_valid_time()
[1314129294.322484] [001.0] [pid=31793] check_time_against_period()
[1314129294.322493] [001.0] [pid=31793] schedule_service_check()
[1314129294.322498] [016.0] [pid=31793] Scheduling a non-forced, active
check of service 'Memoria' on host 'mi139874' @ Tue Aug 23 17:07:56 2011
[1314129294.337171] [001.0] [pid=31793] reschedule_event()
[1314129294.337193] [001.0] [pid=31793] add_event()
[1314129294.337590] [064.1] [pid=31793] Making callbacks (type 8)...
[1314129294.337598] [064.1] [pid=31793] Making callbacks (type 20)...
[1314129294.337605] [064.1] [pid=31793] Making callbacks (type 13)...
[1314129294.337610] [064.1] [pid=31793] Making callbacks (type 20)...
[1314129294.337630] [016.1] [pid=31793] Deleted check result file '(null)'
[1314129294.337652] [016.1] [pid=31793] Handling check result for service
'Memoria' on host '167077'...
[1314129294.337656] [001.0] [pid=31793] handle_async_service_check_result()
[1314129294.337659] [016.0] [pid=31793] ** Handling check result for service
'Memoria' on host 'mi167077'...
[1314129294.337662] [016.1] [pid=31793] HOST: mi167077, SERVICE: Memoria,
CHECK TYPE: Active, OPTIONS: 0, SCHEDULED: Yes, RESCHEDULE: Yes, EXITED OK:
Yes, RETURN CODE: 0, OUTPUT: OK: physical memory: Total: 3.49G - Used: 914M
(25%) - Free: 2.6G (75%)|'physical memory'=25%;90;95; \n
[1314129294.337697]

...[email truncated]...
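
In the excerpt above, most consecutive entries are a few microseconds apart, but the step from "Scheduling a non-forced, active check" to reschedule_event() takes about 14.7 ms. Whether that is representative cannot be told from such a short sample. A minimal sketch for finding the largest gaps in a full debug file, assuming the line format shown above; the function name largest_gaps is hypothetical:

```python
import re

# Matches "[epoch.micros] [level.verbosity] [pid=N] message".
# "(?:3D)?" tolerates quoted-printable residue such as "pid=3D31793".
LINE_RE = re.compile(
    r"^\[(\d+)\.(\d{6})\] \[(\d{3})\.(\d)\] \[pid=(?:3D)?(\d+)\]\s*(.*)$"
)

def largest_gaps(lines, top=3):
    """Return the `top` largest gaps as (microseconds, previous_msg, next_msg)."""
    entries = []
    for line in lines:
        m = LINE_RE.match(line)
        if m:  # wrapped continuation lines do not match and are skipped
            # Integer microseconds avoid float rounding on epoch timestamps
            stamp = int(m.group(1)) * 1_000_000 + int(m.group(2))
            entries.append((stamp, m.group(6)))
    gaps = [
        (t1 - t0, msg0, msg1)
        for (t0, msg0), (t1, msg1) in zip(entries, entries[1:])
    ]
    gaps.sort(key=lambda g: g[0], reverse=True)
    return gaps[:top]
```

Feeding it the lines of the debug file (for example, open(path) with path set to the configured debug_file) gives a quick list of the steps where the main process spends the most wall-clock time.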


This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]