Re: [Nagios-devel] Possible bug in Nagios 2.12?

Guest · Post by **Guest** » Fri Apr 17, 2009 7:35 pm

>>Steven D. Morrey wrote:
>>
>>> How? Using no delay at all between attempts would be
>>> rather devastating, since spinlocks eat CPU like mad.
>>
>> Stop thinking small. When you have many thousands of checks
>> to run, tiny delays persist and add up. A second here, a
>> second there, and pretty soon you're talking real time.
>>
>
>Well, stop thinking small yourself and use a distributed solution

Why distribute when a single box is more than capable of handling the load?
Which leads me back to the reason I came here, seeking your wisdom.

Besides, we are already distributed using DNX.

>>
>>> There's no real design issue.
>>
>>
>>
>> The design issue is that delays build up and become very
>> observable.
>>
>>
>
>That's not a design issue, it's just a fact of life.

It's both.
Under the existing design, which IS on the whole a good one, delays can build up.
The best solution at the moment is to reduce the amount of time spent in sleep. Just like you said earlier, sched_yield does appear to be the best solution under the current design.

>> I've removed the sleep in my version of nagios and throughput difference is DRAMATIC.

>Do you have a lot of unparallelizable checks?

No, it turns out we don't have any. But we do have 28,000 checks, a check latency around 130 seconds on average, and very low CPU usage. We saw the sleep and thought it might explain the high latency. It turns out that the dramatic throughput increase we saw when we removed the sleep was very short-lived; after about an hour the latency began to increase again.

>> That said other things are having a hard time running on the same machine.

>Including plugins and the reaper threads of Nagios

Plugins seem to be running just fine when they are actually run. The problem is they aren't being run often enough. As far as we know we aren't overloading the reaper threads; at least we aren't getting any "Warning: Overflow detected in service check result buffer - %ul message(s) lost." messages.

>> I'm going to sprinkle some yields where the sleeps are at and see if that helps, I'll keep you apprised.
>>

>That's a very good idea (replacing sleep(1) with sched_yield()). Just
>make sure it keeps on working on AIX and Solaris and stuff like that,
>where Nagios compiles and runs just fine today.

>I'd prefer if you did it with a helper function to make it easier to
>support various operating systems that need it without duplicating a
>lot of code.

I've attached a patch and am seeking comments.
It won't cure cancer, but if you do have non-parallel checks it may reduce your overall latency.
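
For anyone reading this without the attachment, here is a rough sketch of the kind of helper function suggested above. It is NOT the contents of thread_yield.patch. The name yield_cpu(), the HAVE_SCHED_YIELD macro and the 1 ms fallback delay are all made up for illustration. It assumes a POSIX system: it calls sched_yield() where the platform advertises _POSIX_PRIORITY_SCHEDULING, and otherwise falls back to a very short nanosleep() so callers don't need per-OS #ifdefs.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* ask for POSIX declarations */
#endif
#include <errno.h>
#include <time.h>
#include <unistd.h>

/* Platforms that support the Priority Scheduling option define this
 * macro with a positive value in <unistd.h>. */
#if defined(_POSIX_PRIORITY_SCHEDULING) && (_POSIX_PRIORITY_SCHEDULING > 0)
#include <sched.h>
#define HAVE_SCHED_YIELD 1        /* hypothetical macro name */
#endif

/*
 * Hypothetical helper: give up the CPU without a fixed one-second sleep.
 * Returns 0 on success, -1 on failure (errno is set by the failing call).
 */
static int yield_cpu(void)
{
#ifdef HAVE_SCHED_YIELD
    /* Let other runnable threads go first, then continue. */
    return sched_yield();
#else
    /* Assumed fallback: sleep about 1 ms, resuming if a signal interrupts. */
    struct timespec ts = { 0, 1000000L };
    while (nanosleep(&ts, &ts) == -1) {
        if (errno != EINTR)
            return -1;
    }
    return 0;
#endif
}
```

A caller that currently does sleep(1) between attempts would call yield_cpu() instead. Whether the nanosleep fallback is the right choice on AIX or Solaris is an open question that would need testing on those systems.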

Sincerely,
Steve


[Attachment: thread_yield.patch (text/x-patch, 5099 bytes, dated Fri, 17 Apr 2009 14:32:57 GMT). The base64-encoded contents are cut off in the archive; the surviving portion is the start of a diff against nagios-2.12/base/events.c.]

...[email truncated]...

This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]