Thomas Guyot-Sionnest wrote:
> Hi list,
>
> I'm running a large Nagios monitoring system with about a hundred
> remote passive checks reporting through NSCA. Lately, as I added more
> passive checks, I noticed that the number of "Failed" checks (no results
> received) increased (for most of the checks it's impossible to say whether
> they ran or not).
>
> I'm currently running NSCA in inetd mode using D. J. Bernstein's tcpserver
> program. Since most checks are run by Vixie Cron, and therefore run at
> the exact same time, my two guesses were that either:
>
> 1. I'm jamming up the monitoring server for more than 10 seconds with all
> the checks.
>
> Or
>
> 2. All the NSCA processes writing to the same command file trigger some
> obscure OS or Nagios bug.
>
> I have reasons to think it's not #1, so to test #2 I wanted to run NSCA in
> single-process daemon mode. When I do this, it gets the first passive check
> correctly and send_nsca fails on all other checks. Running strace, I see
> that it blocks on the poll syscall after processing the first check, and
> send_nsca times out after 10 seconds.
>
> I'm running Nagios 2.0b3 on Slackware 10.1.0, dual Athlon MP with 4 GB of
> RAM, NSCA version 2.6, official and unpatched.
>
> Compiled with Gcc:
> Configured with: ../gcc-3.3.4/configure --prefix=/usr --enable-shared
> --enable-threads=posix --enable-__cxa_atexit --disable-checking
> --with-gnu-ld --verbose --target=i486-slackware-linux
> --host=i486-slackware-linux
> Thread model: posix
> gcc version 3.3.4
>
> Any thoughts on what's going wrong here?
>
Nagios' command file is filling up. The pipe can only hold 4096 bytes
(a hard OS limit on most Unix-like systems), so with 100+ checks going off
at the same time you're lucky to get half of them written to the pipe
before it times out.
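
To make the arithmetic concrete, here is a rough Python sketch (not
anything from Nagios or NSCA). It measures how many bytes a pipe on the
local system accepts before a non-blocking write would block, and
estimates how many check results fit in a given capacity. The function
names and the 80-byte result line are assumptions for illustration only;
the real capacity depends on the OS and kernel version, and real result
lines vary in length.

```python
import os


def pipe_capacity(chunk=512):
    """Fill an unread pipe with non-blocking writes and return the bytes accepted."""
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)
    total = 0
    try:
        while True:
            # Nobody reads from read_fd, so the pipe eventually fills up.
            total += os.write(write_fd, b"x" * chunk)
    except BlockingIOError:
        # The pipe is full: a blocking writer would stall at this point.
        pass
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return total


def results_that_fit(capacity, line_len):
    """Number of whole result lines of line_len bytes that fit in capacity bytes."""
    return capacity // line_len


if __name__ == "__main__":
    # 80 bytes per passive check result is an assumed, illustrative length.
    print("local pipe capacity:", pipe_capacity(), "bytes")
    print("80-byte results in 4096 bytes:", results_that_fit(4096, 80))
```

With a 4096-byte pipe and 80-byte lines, only 51 results fit before
writers have to wait for Nagios to drain the pipe.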
--
Andreas Ericsson [email protected]
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]