[Nagios-devel] check_ping hanging, causing Nagios itself to go to sleep.

Hi, :-)


I've recently set up Nagios to monitor some of the machines where I work
- the setup consists of 41 hosts, with a total of 152 active and 6
passive checks.

I have, on several occasions now, run into a problem for which I cannot
find any explanation (and therefore no solution either). When a
service check fails and the host check is tried, the ping process tree
(forked Nagios->check_ping->/bin/ping) sometimes (not always) hangs.
This seems a lot like the problem reported by Dan Rich some days ago.
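
For anyone trying to reproduce or contain this kind of hang outside of Nagios, here is a minimal sketch of running a plugin under a hard timeout and killing the whole process group when it expires, so that a grandchild such as /bin/ping does not linger. It is not how Nagios or check_ping handle timeouts internally; the function name `run_with_timeout` and its return shape are made up for illustration.

```python
import os
import signal
import subprocess
import sys


def run_with_timeout(argv, timeout_s):
    """Run argv in its own process group; kill the whole group on timeout.

    Hypothetical helper, not part of Nagios or the plugins package.
    Returns (exit_code, output, timed_out).
    """
    proc = subprocess.Popen(
        argv,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        start_new_session=True,  # child leads a new session and process group
    )
    try:
        out, _ = proc.communicate(timeout=timeout_s)
        return proc.returncode, out.decode(errors="replace"), False
    except subprocess.TimeoutExpired:
        # The child's PID is also its process group ID, so this signals
        # the plugin and anything it spawned (e.g. /bin/ping).
        os.killpg(proc.pid, signal.SIGKILL)
        out, _ = proc.communicate()  # reap the child, collect partial output
        return proc.returncode, out.decode(errors="replace"), True
```

Killing the group rather than only the direct child matters here because the stuck process in a tree like Nagios->check_ping->/bin/ping may be the grandchild.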

I'm using Nagios 1.0b5 (had the same problem with 1.0b4) and check_ping
from the Netsaint-plugins package, on a Linux 2.2.20 box (Debian
stable).

Since I could not find any reason for this in the source, I set up a
strace on Nagios, and set off to try and make it hang. I succeeded, but
my success was limited.

A typical ps-listing when Nagios hangs:

root 3007 1.1 0.9 5928 5100 pts/66 S 13:41 0:20 /usr/bin/strace -f -f -f -f -f -f -f -f -o nagios.out /bin/nagios -d /etc/nagios.cfg
nagios 3010 0.2 0.4 3668 2416 ? T 13:41 0:04 \_ /bin/nagios -d /etc/nagios.cfg
nagios 3041 0.0 0.4 3672 2412 ? S 13:41 0:00 \_ /bin/nagios -d /etc/nagios.cfg
nagios 3042 0.0 0.0 1276 488 ? S 13:41 0:00 | \_ /usr/lib/netsaint-plugins/libexec/check_ping -H 555.555.555.555 -w 3000.0,80% -c 5000.0,100% -p 5
nagios 3043 0.0 0.0 1352 492 ? S 13:41 0:00 | \_ /bin/ping -n -c 5 555.555.555.555
nagios 3237 0.0 0.4 3672 2412 ? S 13:42 0:00 \_ /bin/nagios -d /etc/nagios.cfg
nagios 3238 0.0 0.0 1276 488 ? S 13:42 0:00 | \_ /usr/lib/netsaint-plugins/libexec/check_ping -H 555.555.555.556 -w 3000.0,80% -c 5000.0,100% -p 5
nagios 3239 0.0 0.0 1352 492 ? S 13:42 0:00 | \_ /bin/ping -n -c 5 555.555.555.556
nagios 8855 0.0 0.4 3672 2416 ? S 14:01 0:00 \_ /bin/nagios -d /etc/nagios.cfg
nagios 8856 0.0 0.0 1276 488 ? S 14:01 0:00 \_ /usr/lib/netsaint-plugins/libexec/check_ping -H 555.555.555.557 -w 3000.0,80% -c 5000.0,100% -t 20
nagios 8858 0.0 0.0 1352 492 ? S 14:01 0:00 \_ /bin/ping -n -c 5 555.555.555.557
[ cut 10 more similar (but for other hosts) lines]

(The paths of the programs have been modified to make the list more
readable.) The first checks had, at the time of the `ps` command,
been running for 20 minutes. Nagios was started at 13:41. (Note that both
the check_ping service check and the check-host-alive host check appear
in the listing.)
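
Spotting such long-running checks can be automated. The sketch below assumes a `ps` that supports the `etimes` (elapsed seconds) output column, as in current procps-ng, and input produced by something like `ps -eo pid,etimes,args`; the ps on a 2.2-era system may not offer this. The function name `find_stale_checks` and its parameters are hypothetical.

```python
def find_stale_checks(ps_output, max_age_s, needle="check_ping"):
    """Return (pid, elapsed_seconds, args) for old processes matching needle.

    Assumes each line looks like "PID ELAPSED_SECONDS COMMAND...", for
    example the output of `ps -eo pid,etimes,args` (an assumption, see above).
    """
    stale = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        # Skip the header line and anything that does not parse cleanly.
        if len(parts) < 3 or not parts[0].isdigit() or not parts[1].isdigit():
            continue
        pid, elapsed, args = int(parts[0]), int(parts[1]), parts[2]
        if needle in args and elapsed > max_age_s:
            stale.append((pid, elapsed, args))
    return stale
```

With a 5-packet ping and a 20-second plugin timeout, a threshold of a minute or so would already flag processes like the ones listed above.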

The setup of the checks:

define command{
        command_name    check-host-alive
        command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w 3000.0,80% -c 5000.0,100% -t 20
        }
define command{
        command_name    check_ping
        command_line    $USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5
        }
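
To make the link between these definitions and the command lines in the ps listing explicit, here is a rough sketch of the macro substitution. It is an illustration only, not Nagios's actual implementation. It assumes the service's check_command is `check_ping!3000.0,80%!5000.0,100%` (the service definition is not shown here, but those values match the ps output) and that `$USER1$` points at the plugin directory seen in the listing. `expand_command` is a hypothetical name.

```python
def expand_command(command_line, check_command, macros):
    """Expand $ARGn$ and other macros in a command_line template.

    Illustrative only: check_command is "name!arg1!arg2...", and macros
    maps macro names such as "$HOSTADDRESS$" to their values.
    """
    _name, *args = check_command.split("!")
    values = dict(macros)
    for i, arg in enumerate(args, start=1):
        values[f"$ARG{i}$"] = arg
    for macro, value in values.items():
        command_line = command_line.replace(macro, value)
    return command_line


expanded = expand_command(
    "$USER1$/check_ping -H $HOSTADDRESS$ -w $ARG1$ -c $ARG2$ -p 5",
    "check_ping!3000.0,80%!5000.0,100%",  # assumed service check_command
    {
        "$USER1$": "/usr/lib/netsaint-plugins/libexec",  # assumed from ps output
        "$HOSTADDRESS$": "555.555.555.555",
    },
)
```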

A sample host:

define host{
        use                             generic-host
        host_name                       bogus.linpro.no
        alias                           bogus.linpro.no
        address                         555.555.555.555
        }
define host{
        name                            generic-host
        notifications_enabled           1
        event_handler_enabled           1
        flap_detection_enabled          1
        process_perf_data               1
        retain_status_information       1
        retain_nonstatus_information    1
        notification_interval           0
        check_command                   check-host-alive
        notification_period             24x7
        notification_options            d,u,r
        max_check_attempts              5
        register                        0
        }

As I said, I've managed to strace these procs, but my strace stops at a
vfork() (which it should - according to the vfork() man-page, I have to
patch my kernel to get strace to follow it). The strace is rather large,
some selections are pasted below (ask for more if you think it'll help).

I'll follow a sub-section of the straces downward. Here is the strace
of process 3041, as it was before I restarted Nagios:

---------------
rt_sigaction(SIGQUIT, {SIG_DFL}, {0x80589bc, [QUIT], SA_RESTART|0x4000000}, 8) = 0
rt_sigaction(SIGTERM, {SIG_DFL}, {0x80589bc, [TERM],

...[email truncated]...


This post was automatically imported from historical nagios-devel mailing list archives
Original poster: jo@linpro.no