>>Steven D. Morrey wrote:
>>
>>> How? Using no delay at all between attempts would be
>>> rather devastating, since spinlocks eat CPU like mad.
>>
>> Stop thinking small. When you have many thousands of checks
>> to run, tiny delays persist and add up. A second here, a
>> second there, and pretty soon you're talking real time.
>>
>
>Well, stop thinking small yourself and use a distributed solution
Why distribute when a single box is more than capable of handling the load?
Which leads me back to the reason I came here, seeking your wisdom.
Besides, we are already distributed using DNX.
>>
>>> There's no real design issue.
>>
>>
>>
>> The design issue is that delays build up and become very
>> observable.
>>
>>
>
>That's not a design issue, it's just a fact of life.
It's both.
Under the existing design, which IS on the whole a good one, delays can build up.
The best solution at the moment is to reduce the amount of time spent in sleep. Just like you said earlier, sched_yield does appear to be the best solution under the current design.
>> I've removed the sleep in my version of nagios and throughput difference is DRAMATIC.
>Do you have a lot of unparallelizable checks?
No, it turns out we don't have any. But we do have 28,000 checks, a check latency around 130 seconds on average, and very low CPU usage. We saw the sleep and thought it might explain the high latency. It turns out that the dramatic throughput increase we saw when we removed the sleep was very short-lived; after about an hour the latency began to increase again.
>> That said, other things are having a hard time running on the same machine.
>Including plugins and the reaper threads of Nagios
Plugins seem to be running just fine when they are actually run. The problem is they aren't being run often enough. As far as we know we aren't overloading the reaper threads; at least we aren't getting any "Warning: Overflow detected in service check result buffer - %ul message(s) lost." messages.
>> I'm going to sprinkle some yields where the sleeps are at and see if that helps, I'll keep you apprised.
>>
>That's a very good idea (replacing sleep(1) with sched_yield()). Just
>make sure it keeps on working on AIX and Solaris and stuff like that,
>where Nagios compiles and runs just fine today.
>I'd prefer if you did it with a helper function to make it easier to
>support various operating systems that need it without duplicating a
>lot of code.
I've attached a patch and am seeking comments.
It won't cure cancer, but if you do have non-parallel checks it may reduce your overall latency.
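
For readers without the attachment, here is a minimal sketch of the kind of helper function suggested above, assuming a POSIX system. It is not the code from the attached patch: the names thread_yield_sketch and yield_n_times, and the 1 ms fallback delay, are hypothetical and chosen only for illustration. The idea is to keep the operating system check in one place, so callers never need their own #ifdef blocks.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif

#include <time.h>
#include <unistd.h>

/* unistd.h advertises sched_yield() support through this macro. */
#if defined(_POSIX_PRIORITY_SCHEDULING) && (_POSIX_PRIORITY_SCHEDULING > 0)
#include <sched.h>
#define HAVE_SCHED_YIELD_SKETCH 1
#endif

/*
 * Hypothetical helper (not from the attached patch).
 * Gives up the CPU without sleeping a full second.
 * Returns 0 on success, -1 on failure.
 */
int thread_yield_sketch(void)
{
#ifdef HAVE_SCHED_YIELD_SKETCH
    return sched_yield();
#else
    /* Assumed fallback: a very short sleep where sched_yield() is absent. */
    struct timespec ts = { 0, 1000000L }; /* 1 ms */
    return nanosleep(&ts, NULL);
#endif
}

/* Illustrative caller: yield n times and count the successful calls. */
int yield_n_times(int n)
{
    int ok = 0;
    for (int i = 0; i < n; i++) {
        if (thread_yield_sketch() == 0)
            ok++;
    }
    return ok;
}
```

A call site that currently does sleep(1) while waiting would call thread_yield_sketch() instead. Whether that actually helps depends on what the loop is waiting for, as the latency numbers above suggest.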
Sincerely,
Steve
Attachment: thread_yield.patch (size 5099 bytes; decoded from base64, truncated in the archive)

diff -wurpN nagios-2.12/base/events.c nagios-2.12-modified/base/events.c
--- nagios-2.12/base/events.c	2007-03-05 12:55:34.000000000 -0700
+++ nagios-2.12-modified/base/events.c	2009-04-17 10:51:51.000000000 -0600
@@ -30,7 +30,7 @@
 #include "../include/nagios.h"
 #include "../include/broker.h"
 #include "../include/sretention.h"
-
+#include "../include/threads.h"
 
 extern c
...[email truncated]...
This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]