Ben Miller wrote:
> Greetings,
> I am seeing a strange behavior with Nagios that appears to be a
> threading issue. I have troubleshot this enough to determine that it
> may be over my head and have to do with how threads are handled in
> Nagios or the libraries it uses. I believe this to be a code level
> issue so I am posting to the devel list vs the user list. Please
> forgive if this is the wrong place.
>
This is most certainly the right place, unless we find it to be a bug in
a library you use, but in that case it's sort of the right place anyway
since the program isn't behaving as per documentation.
> -Symptom
> When I run Nagios it takes about 30-60 seconds to load saved state
> information such as scheduled downtimes, etc. and it takes upwards of
> 60-120 seconds to process external commands. In addition, the check
> queue stacks up because it is only processing one check at a time. A ps
> shows ONLY the main Nagios process, a single child, and that child
> spawning the check command. It appears as if nothing else (external
> commands, notifications, etc) is being processed while the one child
> task is working.
>
I'm not sure, but it's most likely due to one of two reasons:
* A plugin that's being run is stuck in uninterruptible I/O. This can
happen when you're trying to check a partition residing on network-mounted
media where the network connection is down for some reason. It
can also happen under spurious circumstances where a process with higher
priority is holding a lock on some resource that the plugin is trying to
use.
* There's a bug in Nagios causing one of the parent's threads to hold a
mutex that isn't released before the child is spawned, so the child
inherits the locked mutex but has no way of releasing it. I know for a
fact that Nagios does things considered illegal for multithreaded
programs after fork()'ing, so this might be it. It should work well
under Linux with a reasonably up-to-date kernel and libraries, but...
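
To make the second case concrete, here's a minimal sketch of the general hazard. It's written in Python rather than C, it is not Nagios' actual code, and the function name is made up for the illustration. One thread holds a lock while the main thread forks; the child gets a copy of the locked lock but not the thread that could unlock it.

```python
import os
import threading
import warnings

def child_sees_lock_held():
    """Fork while another thread holds a lock; report what the child observes."""
    lock = threading.Lock()
    holding = threading.Event()
    release = threading.Event()

    def holder():
        with lock:
            holding.set()    # tell the main thread the lock is now held
            release.wait()   # keep holding it until told to let go

    worker = threading.Thread(target=holder)
    worker.start()
    holding.wait()           # fork only once the lock is definitely held

    with warnings.catch_warnings():
        # Newer Python versions may warn about forking a multithreaded
        # process. That is the very hazard shown here, so silence it.
        warnings.simplefilter("ignore")
        pid = os.fork()

    if pid == 0:
        # Child: only the forking thread exists here, so nothing can ever
        # release the inherited lock. A blocking acquire() would hang forever,
        # so probe it without blocking and report through the exit status.
        got_it = lock.acquire(blocking=False)
        os._exit(0 if got_it else 1)

    _, status = os.waitpid(pid, 0)
    release.set()
    worker.join()
    return os.WEXITSTATUS(status) == 1   # True: the child found the lock held

if __name__ == "__main__":
    print("child found the lock held:", child_sees_lock_held())
```

If the child did a normal blocking acquire instead of the non-blocking probe, it would sit there forever and the parent would be stuck waiting for it, which is roughly the picture you describe.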
> During troubleshooting, I ran Nagios in an strace to determine what it
> was blocking on and I can clearly see that it is stopping during a
> "wait4(" on the pid of the checking or alerting child.
>
What version of plugins are you running? Which check is running when it
hangs?
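
If it helps to answer that, here's a rough sketch for listing the children of the main Nagios process together with their process state, read from /proc/<pid>/stat on Linux. The helper names are mine and the script is only an illustration. A state of "D" means uninterruptible sleep, which would point at the first case above.

```python
import os

def parse_stat(stat_line):
    """Return (pid, comm, state, ppid) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    head, _, rest = stat_line.partition("(")
    comm, _, tail = rest.rpartition(")")
    fields = tail.split()
    return int(head), comm, fields[0], int(fields[1])

def children_of(parent_pid, proc_root="/proc"):
    """List (pid, comm, state) for the direct children of parent_pid."""
    found = []
    if not os.path.isdir(proc_root):
        return found          # no /proc here, nothing to report
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state, ppid = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue          # process vanished or the line was unreadable
        if ppid == parent_pid:
            found.append((pid, comm, state))
    return found
```

Call children_of() with the pid of the main nagios process while it appears stuck, and note the command name and state of whatever shows up.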
> I ran an strace -f on nagios to see the full thread flow of what was
> happening and Nagios performed perfectly. The problem went away and
> external checks were processed in a few seconds and ps shows a list of
> half a dozen or so check or alert child processes.
>
> In addition, when I compiled with all debugging turned on and ran Nagios
> by itself, the bad behavior was back. However, when I ran the debug
> executable through strace (with NO -f) the process started up
> excruciatingly slowly, but then ran properly with multiple child
> processes and handled external commands properly.
>
So in essence it always happens when you run Nagios, no matter how you
compiled it, but never when you're running it from strace?
> The problem occurs consistently and is easy to replicate. It occurs
> with versions 2.0b3 or rc2. I have tested both.
>
Have you tried this with 2.0rc1 or 2.0rc2?
Do you get any messages in nagios.log that look something like this?
    service_result_worker_thread: poll(): (text representation of errno)
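
A quick way to check is to scan the log for the fixed part of that message. This is only a sketch: the function names are mine, and the path you pass in depends on where your install keeps nagios.log.

```python
MARKER = "service_result_worker_thread: poll():"

def find_poll_errors(lines):
    """Return the lines that contain the worker thread's poll() error marker."""
    return [line.rstrip("\n") for line in lines if MARKER in line]

def scan_log(path):
    """Scan one log file; undecodable bytes are replaced rather than fatal."""
    with open(path, errors="replace") as fh:
        return find_poll_errors(fh)
```

Run scan_log() on your nagios.log and send along whatever it returns, including the errno text at the end of each line.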
> -Background
> I have been running Nagios with the same version on a different box with
> the exact same compile options and config files for months and
> everything is working fine. I am upgrading from an AMD 32-bit system
> (RedHat Enterprise v4) to a new box with dual 64-bit Opterons running
> RedHat Enterprise v4 64-bit.
>
Are you going to do this upgrade or have you already done it? Was the
kernel compiled with a 64-bit compiler? Were glibc and the thread library
compiled with a 64-bit compiler? What versions of kernel, glibc and
thread library are you using? What flavour of thread library are you
using (LinuxThreads or NPTL)?
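
Most of that can be collected in one go. Below is a small sketch, assuming a glibc-based Linux box with Python installed; the function name and dictionary keys are made up. It reports what the Python process itself sees, which I'm assuming is the same kernel and system glibc that Nagios runs against. The two confstr names are glibc extensions, so they are looked up defensively.

```python
import os
import platform

def threading_environment():
    """Collect kernel, architecture, glibc and thread-library details."""
    info = {
        "kernel": platform.release(),          # running kernel version
        "machine": platform.machine(),         # e.g. the CPU architecture name
        "interpreter_bits": platform.architecture()[0],
        "libc": " ".join(platform.libc_ver()).strip(),
    }
    names = getattr(os, "confstr_names", {})
    for key, name in (("glibc", "CS_GNU_LIBC_VERSION"),
                      ("threads", "CS_GNU_LIBPTHREAD_VERSION")):
        value = None
        if name in names:
            try:
                value = os.confstr(name)       # thread library flavour and version
            except (OSError, ValueError):
                value = None                   # not supported on this system
        info[key] = value
    return info

if __name__ == "__main__":
    for key, value in threading_environment().items():
        print(f"{key}: {value}")
```

The "threads" entry is the interesting one here, since it says whether the box is running LinuxThreads or NPTL.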
> I compile with: ./configure --prefix=/home/nagios/nagios
> --with-cgiurl=/nagios/cgi-bin --with-nagios-user=nagios
> --with-nagios-group=n
...[email truncated]...
This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]