Re: [Nagios-devel] Nagios blocking/stalling: Thread issue? v 2.0b3
Posted: Thu Jan 12, 2006 4:47 am
Ben Miller wrote:
> My latest tests and findings:
>
> a) disable embedded perl and perl-cache
>
> I did this and the results were exactly the same as before
>
> b) move /home to local volume
>
> Again, the results are the same as before, no improvement.
>
> c) get cvs version of nagios and try it (with the above two changes in
> place) If it works, I will reverse b then a and see where/if it breaks.
>
> I downloaded the snapshot and still the behavior is the same as
> originally described.
>
> During these tests I observed the following behavior.
> The threading seems to start up OK and I see the proper number of checks
> occurring. Many of them are SNMP checks. When the first
> check_ping process starts I see the following process tree, and slowly
> the other checking threads die off until only one thread remains. The
> remaining thread is the check_ping thread. When it finally completes,
> only one check at a time is performed from then on. This seems to
> support your thought that a child process is blocking the parent somehow.
>
> 29637 pts/1 Sl+ 0:00 \_ ../bin/nagios nagios.cfg
> 30056 pts/1 S   0:00     \_ ../bin/nagios nagios.cfg
> 30057 pts/1 S   0:00         \_ /home/nagios/nagios/libexec/check_ping -p 10 -H 192.168.10.10 -w 100:60% -c 600:100%
> 30058 pts/1 S   0:00             \_ /bin/ping -n -U -w 16 -c 10 192.168.10.10
>
Perhaps you can try using check_icmp instead? That way you would get rid
of one set of file-descriptors, and the Nagios process in charge of the
plugin will be the process' parent rather than its grandparent.
Unfortunately I accidentally firewalled myself out of oss.op5.se, or I
would have put up a package there for you to download with the latest
check_icmp inside. I can upload it tonight (+0100) when I get home from
work.
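
To illustrate why the extra process layer and its inherited file descriptors can matter, here is a minimal Python sketch. It is not Nagios code and does not claim to reproduce this bug; the helper name read_until_eof and the shell commands are made up for the example. It only shows the general POSIX behavior that a read on a pipe does not see EOF until every process holding the write end has exited, including a grandchild that outlives the direct child.

```python
import subprocess
import time


def read_until_eof(shell_command: str) -> tuple[str, float]:
    """Run a command through /bin/sh and read its stdout until EOF.

    Hypothetical helper for illustration only. Returns the captured
    output and the number of seconds spent waiting for EOF.
    """
    start = time.monotonic()
    child = subprocess.Popen(
        ["/bin/sh", "-c", shell_command],
        stdout=subprocess.PIPE,
        text=True,
    )
    # read() returns only once every copy of the pipe's write end is closed.
    output = child.stdout.read()
    child.wait()
    return output, time.monotonic() - start


# Direct child only: EOF arrives as soon as the child exits.
direct_output, direct_seconds = read_until_eof("echo ok")

# The child starts a background grandchild that inherits stdout.
# The child exits at once, but the read blocks until the grandchild exits.
nested_output, nested_seconds = read_until_eof("sleep 2 & echo ok")

print(f"direct: {direct_seconds:.2f}s, with grandchild: {nested_seconds:.2f}s")
```

Both calls capture the same output, but the second one waits for the background sleep to finish. A plugin that runs /bin/ping as its own child adds one such layer between the checking process and the program doing the work, while a plugin that sends the ICMP packets itself does not.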
> I upgraded to the latest plugins and this behavior remains. Somehow
> strace -f seems to handle the check_ping blockage and let the app behave
> properly.
This is to be expected since strace makes the program behave differently.
> I am out of ideas of what to test next. Does this evidence help? What
> is the next step?
>
Try replacing check_ping with check_icmp. If that doesn't work, this bug
needs to be found and fixed in Nagios, which is non-trivial to say the
least and near impossible without knowing what it is that breaks.
--
Andreas Ericsson [email protected]
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
This post was automatically imported from historical nagios-devel mailing list archives
Original poster: [email protected]