Re: [Nagios-devel] Possible bug in Nagios 2.12?


>>Steven D. Morrey wrote:
>>
>>> How? Using no delay at all between attempts would be
>>> rather devastating, since spinlocks eat CPU like mad.
>>
>> Stop thinking small. When you have many thousands of checks
>> to run, tiny delays persist and add up. A second here, a
>> second there, and pretty soon you're talking real time.
>>
>
>Well, stop thinking small yourself and use a distributed solution ;-)

Why distribute when a single box is more than capable of handling the load?
Which leads me back to the reason I came here, seeking your wisdom ;-)
Besides, we are already distributed using DNX.

>>
>>> There's no real design issue.
>>
>>
>>
>> The design issue is that delays build up and become very
>> observable.
>>
>>
>
>That's not a design issue, it's just a fact of life.

It's both.
Under the existing design, which IS on the whole a good one, delays can build up.
The best solution at the moment is to reduce the amount of time spent in sleep. Just as you said earlier, sched_yield does appear to be the best solution under the current design.

>> I've removed the sleep in my version of nagios and throughput difference is DRAMATIC.

>Do you have a lot of unparallelizable checks?

No, it turns out we don't have any. But we do have 28,000 checks, an average check latency of around 130 seconds, and very low CPU usage. We saw the sleep and thought it might explain the high latency. It turns out that the dramatic throughput increase we saw when we removed the sleep was very short-lived: after about an hour the latency began to increase again.
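
To put the 28,000-check figure in perspective, here is a rough back-of-the-envelope sketch. The 300-second check interval is purely an assumption for illustration (the post does not state the configured interval), and the function names are hypothetical. It shows how small the time budget per check launch becomes, which is why even short per-check delays can accumulate into visible latency.

```c
/* Rough arithmetic only; the 300 s interval used in the note below is
 * an ASSUMPTION, not a value taken from the post. */

/* Launch rate (checks per second) needed to run every check once per
 * interval. */
double required_rate(long checks, double interval_s)
{
    return (double)checks / interval_s;
}

/* Average time budget available per check launch, in milliseconds. */
double budget_ms(long checks, double interval_s)
{
    return 1000.0 * interval_s / (double)checks;
}
```

With `required_rate(28000, 300.0)` the scheduler would have to launch roughly 93 checks per second, and `budget_ms(28000, 300.0)` leaves a budget of just under 11 ms per launch. Under that assumption, any delay in the scheduling path that averages more than that per check would cause a backlog to build.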

>> That said other things are having a hard time running on the same machine.

>Including plugins and the reaper threads of Nagios ;-)

Plugins seem to be running just fine when they are actually run. The problem is they aren't being run often enough. As far as we know we aren't overloading the reaper threads; at least we aren't getting any "Warning: Overflow detected in service check result buffer - %ul message(s) lost." messages.

>> I'm going to sprinkle some yields where the sleeps are at and see if that helps, I'll keep you apprised.
>>

>That's a very good idea (replacing sleep(1) with sched_yield()). Just
>make sure it keeps on working on AIX and Solaris and stuff like that,
>where Nagios compiles and runs just fine today.

>I'd prefer if you did it with a helper function to make it easier to
>support various operating systems that need it without duplicating a
>lot of code.
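
For readers without the attachment, the kind of helper being requested might look like the minimal sketch below. This is not the content of thread_yield.patch: the function name `thread_yield_sketch`, the macro `HAVE_SCHED_YIELD_SKETCH`, and the 1 ms fallback are all assumptions made for illustration. The idea is to call `sched_yield()` where POSIX priority scheduling is advertised and fall back to a very short `nanosleep()` elsewhere, so callers never need their own platform checks.

```c
#define _POSIX_C_SOURCE 200112L

#include <time.h>
#include <unistd.h>

/* _POSIX_PRIORITY_SCHEDULING is defined by <unistd.h> on systems that
 * provide sched_yield(). */
#if defined(_POSIX_PRIORITY_SCHEDULING) && _POSIX_PRIORITY_SCHEDULING > 0
#include <sched.h>
#define HAVE_SCHED_YIELD_SKETCH 1 /* hypothetical macro name */
#endif

/* Hypothetical helper (NOT from the attached patch): give up the CPU
 * briefly instead of sleeping for a full second. Returns 0 on success. */
int thread_yield_sketch(void)
{
#ifdef HAVE_SCHED_YIELD_SKETCH
    /* Let other runnable threads or processes run, then resume. */
    return sched_yield();
#else
    /* Assumed fallback: sleep 1 ms where sched_yield() is unavailable. */
    struct timespec ts = { 0, 1000000L };
    return nanosleep(&ts, NULL);
#endif
}
```

A call site would then replace `sleep(1)` with `thread_yield_sketch()`, which keeps the platform-specific choice in one place.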

I've attached a patch and am seeking comments.
It won't cure cancer but if you do have non-parallel checks it may reduce your overall latency :)

Sincerely,
Steve





[Attachment: thread_yield.patch, 5099 bytes, dated Fri, 17 Apr 2009 14:32:57 GMT; base64-encoded content truncated in the archive]

...[email truncated]...


This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]