RE: [Nagios-devel] Nagios blocking/stalling: Thread issue? v 2.0b3 or 2.0rc2
Posted: Wed Jan 11, 2006 8:25 pm
My latest tests and findings:
a) disable embedded perl and perl-cache
I did this and the results were exactly the same as before
b) move /home to local volume
Again, the results are the same as before, no improvement.
c) get CVS version of nagios and try it (with the above two changes in
place). If it works, I will reverse b then a and see where/if it breaks.
I downloaded the snapshot and the behavior is still the same as
originally described.
During these tests I observed the following behavior.
The threading seems to start up OK and I see the proper number of checks
occurring. A lot of them are snmp checks. When the first check_ping
process starts, I see the following process tree, and the other checking
threads slowly die off until only one thread remains. The remaining
thread is the check_ping thread. When it finally completes, only one
check at a time is performed from then on. This seems to support your
thought that a child process is somehow blocking the parent.
29637 pts/1 Sl+ 0:00 \_ ../bin/nagios nagios.cfg
30056 pts/1 S 0:00 \_ ../bin/nagios nagios.cfg
30057 pts/1 S 0:00 \_ /home/nagios/nagios/libexec/check_ping -p 10 -H 192.168.10.10 -w 100:60% -c 600:100%
30058 pts/1 S 0:00 \_ /bin/ping -n -U -w 16 -c 10 192.168.10.10
I upgraded to the latest plugins and this behavior remains. Somehow,
running under strace -f seems to get past the check_ping blockage and
lets the app behave properly.
I am out of ideas of what to test next. Does this evidence help? What
is the next step?
Thanks again,
Ben
-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Ben Miller
Sent: Wednesday, January 11, 2006 9:35 PM
To: Andreas Ericsson
Cc: [email protected]
Subject: RE: [Nagios-devel] Nagios blocking/stalling: Thread issue? v
2.0b3 or 2.0rc2
Andreas,
Thank you for your insight!
> I'm not sure, but it's most likely due to one of two reasons;
>
> * A plugin that's being run is stuck in uninterruptable IO. This can
> happen when you're trying to check a partition residing on a network
> mounted media where the network connection for some reason is down. It
> can also happen under spurious circumstances where a process with higher
> priority is holding a lock on some resource that the plugin is trying to
> use.
>
> * There's a bug in Nagios causing it to hold a mutex in one of the
> parents' threads that isn't released before the child is spawned, so the
> child inherits the mutex but has no way of releasing it. I know for a
> fact that Nagios does things considered illegal for multithreaded
> programs after fork()'ing, so this might be it. It should work well
> under Linux with reasonably up-to-date libraries and kernel though, but...
>
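The second scenario quoted above (a lock held by another thread at the
moment of fork()) can be reproduced in miniature. The sketch below is
only an illustration in Python on a POSIX system, not Nagios code;
Nagios itself is C, and the function name and timings here are made up
for the demo. It forks while a second thread holds a lock and asks the
child whether it can take that lock. The child gets a copy of the
locked lock but not the thread that would unlock it, so the attempt
should time out and the function should report False.

```python
import os
import threading
import time


def child_can_take_lock_after_fork(hold_seconds=0.5, child_timeout=0.2):
    """Fork while another thread holds a lock; report whether the child
    process managed to acquire its copy of that lock.

    Hypothetical demo helper (POSIX only, relies on os.fork).
    """
    lock = threading.Lock()
    held = threading.Event()

    def holder():
        # Take the lock and keep it across the fork below.
        with lock:
            held.set()
            time.sleep(hold_seconds)

    t = threading.Thread(target=holder)
    t.start()
    held.wait()  # make sure the lock is really held before forking

    pid = os.fork()
    if pid == 0:
        # Child: only the forking thread exists here. The holder thread
        # was not copied, but the lock was copied in its locked state.
        got_it = lock.acquire(timeout=child_timeout)
        os._exit(0 if got_it else 1)

    # Parent: collect the child's verdict, then let the holder finish.
    _, status = os.waitpid(pid, 0)
    t.join()
    return os.WEXITSTATUS(status) == 0


if __name__ == "__main__":
    print("child acquired inherited lock:", child_can_take_lock_after_fork())
```

Without the timeout the child would wait on that lock forever, which is
the kind of stall the quoted explanation describes.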
I did leave out a valuable bit of information. The /home directory
itself is nfs mounted on the box running nagios. The nagios binaries
reside on the mount itself. In light of your suggestion, my very next
test will be to copy /home locally and eliminate this variable.
However, I do not see any processes in the ps list that show as
uninterruptible or disk-wait.
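That check can be scripted rather than eyeballed. A minimal sketch,
assuming ps output in the "PID TTY STAT TIME COMMAND" column layout; the
function name is hypothetical, and the sample rows are adapted from the
process tree in this thread with the tree-drawing characters removed.
On Linux, a STAT value beginning with "D" marks uninterruptible sleep
(usually I/O).

```python
def uninterruptible_processes(ps_output):
    """Return (pid, stat, command) for rows whose STAT begins with 'D'."""
    hits = []
    for line in ps_output.splitlines():
        fields = line.split(None, 4)
        if len(fields) < 5 or not fields[0].isdigit():
            continue  # skip the header and any malformed rows
        pid, _tty, stat, _time, command = fields
        if stat.startswith("D"):
            hits.append((int(pid), stat, command))
    return hits


# Sample adapted from the process tree posted in this thread.
SAMPLE = """\
  PID TTY      STAT   TIME COMMAND
29637 pts/1    Sl+    0:00 ../bin/nagios nagios.cfg
30056 pts/1    S      0:00 ../bin/nagios nagios.cfg
30057 pts/1    S      0:00 /home/nagios/nagios/libexec/check_ping -p 10 -H 192.168.10.10 -w 100:60% -c 600:100%
30058 pts/1    S      0:00 /bin/ping -n -U -w 16 -c 10 192.168.10.10
"""

if __name__ == "__main__":
    print(uninterruptible_processes(SAMPLE))
```

For the sample rows the result is an empty list, since every process is
in state S (interruptible sleep).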
> What version of plugins are you running? Which check is running when it
> hangs?
Running plugins of: nagios-plugins-1.4
Typically the plugin that I see running is check_ping. However, due to
the high number of retries and packets I have check_ping set to make, it
takes a good 30 seconds or more of pinging before it returns failure.
The hosts I am trying to hit are behind a firewall that drops my pings,
so the host is seen as down. I have done the same tests from a system
that does have permission to ping the hosts, and the problem still
exists; it is just not as obvious. I wanted to work on a system that
showed the problem as obviously as possible when it was broken.
> So in essence it always happens when you run Nagios, no matter how you
> compiled it, but never when you're running it from strace?
The problem occurs no matter how I compile nagios, when running nagios
by itse
...[email truncated]...
This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]