Followup:
OK, we have established that we had a number of VPN failures last night. (Technically, the VPN didn't fail completely, but the test for a failure reported that it had, because the timeout was too short.) The system is set, upon VPN failure, to stop and restart the VPN link.
Unfortunately (or maybe fortunately as we've collected some data!!), this resulted in a number of "busy loop" worker processes.
As previously advised, I coded a script to run the requested gdb -p commands and a few ps's, write the output to a file, and then kill -1 the "busy" process so that monitoring could continue.
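For anyone wanting to collect the same data, here is a minimal sketch of such a capture script. It is for illustration only and is not the exact script used here. The names LOG_DIR, SAMPLES, INTERVAL and capture_busy_worker are hypothetical, and it assumes gdb is run non-interactively with -batch and -ex "bt" rather than being fed commands on stdin.

```shell
#!/bin/sh
# Illustrative sketch only: LOG_DIR, SAMPLES and INTERVAL are made-up names.
LOG_DIR="${LOG_DIR:-/tmp/nagios-busy-worker}"   # where the capture files go
SAMPLES="${SAMPLES:-4}"                         # number of ps snapshots
INTERVAL="${INTERVAL:-60}"                      # seconds between snapshots

capture_busy_worker() {
    cbw_pid="$1"
    cbw_out="$LOG_DIR/worker-$cbw_pid-$(date +%Y%m%d-%H%M%S).log"
    mkdir -p "$LOG_DIR" || return 1

    {
        echo "=== gdb backtrace for pid $cbw_pid ==="
        # -batch makes gdb exit once the -ex commands have run
        gdb -p "$cbw_pid" -batch -ex "bt" 2>&1

        cbw_i=1
        while [ "$cbw_i" -le "$SAMPLES" ]; do
            echo "=== ps sample $cbw_i of $SAMPLES ==="
            # In "ps axlf" output, column 3 is the PID and column 4 the PPID,
            # so this keeps the worker itself plus its direct children.
            ps axlf | awk -v p="$cbw_pid" '$3 == p || $4 == p'
            if [ "$cbw_i" -lt "$SAMPLES" ]; then
                sleep "$INTERVAL"
            fi
            cbw_i=$((cbw_i + 1))
        done
    } > "$cbw_out"

    # kill -1 sends SIGHUP to the busy worker so that monitoring can continue
    kill -1 "$cbw_pid"
    echo "$cbw_out"
}
```

Calling capture_busy_worker 1737 would write the backtrace and the ps snapshots to one file, send SIGHUP to the worker, and print the file name.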
In every case, the gdb output is identical (apart from addresses):
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-83.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "i686-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Attaching to process 1737
Reading symbols from /usr/local/nagios/bin/nagios...(no debugging symbols found)...done.
Reading symbols from /lib/libm.so.6...Reading symbols from /usr/lib/debug/lib/libm-2.12.so.debug...done.
done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /lib/libdl.so.2...Reading symbols from /usr/lib/debug/lib/libdl-2.12.so.debug...done.
done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libc.so.6...Reading symbols from /usr/lib/debug/lib/libc-2.12.so.debug...done.
done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux.so.2...Reading symbols from /usr/lib/debug/lib/ld-2.12.so.debug...done.
done.
Loaded symbols for /lib/ld-linux.so.2
0x0016d424 in __kernel_vsyscall ()
(gdb) #0 0x0016d424 in __kernel_vsyscall ()
#1 0x0041a8d3 in __read_nocancel () at ../sysdeps/unix/syscall-template.S:82
#2 0x080bd522 in ?? ()
#3 0x080bd680 in finish_job ()
#4 0x080bd4b5 in ?? ()
#5 0x080bd5c2 in ?? ()
#6 0x080bdcb3 in ?? ()
#7 0x080bad43 in iobroker_poll ()
#8 0x080be131 in enter_worker ()
#9 0x08058ec8 in main ()
(gdb) Detaching from program: /usr/local/nagios/bin/nagios, process 1737
The ps axlf output varies. For example, here are 4 ps samples separated by sleep 60:
0 500 1737 1726 20 0 3452 872 - R ? 3:14 \_ /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
0 500 1737 1726 20 0 3452 872 - R ? 4:14 \_ /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
0 500 1737 1726 20 0 3452 872 - R ? 5:14 \_ /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
0 500 1737 1726 20 0 3452 872 - R ? 6:14 \_ /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
In the example above there are no children, but in this one the worker has two defunct check_by_ssh children:
0 500 2423 2421 20 0 4044 1640 - R ? 46:03 \_ /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
0 500 6513 2423 20 0 0 0 - Z ? 0:00 | \_ [check_by_ssh] <defunct>
0 500 10537 2423 20 0 0 0 - Z ? 0:00 | \_ [check_by_ssh] <defunct>
0 500 2423 2421 20 0 4044 1640 - R ? 47:03 \_ /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
0 500 6513 2423 20 0 0 0 - Z ? 0:00 | \_ [check_by_ssh] <defunct>
0 500 10537 2423 20 0 0 0 - Z ? 0:00 | \_ [check_by_ssh] <defunct>
Apart from the fact that the CPU time increases by 1 minute after each sleep 60, there appears to be no pattern.
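A CPU time that grows by a full minute over a 60-second sleep means the worker is using roughly 100% of one CPU. As a rough sketch of that arithmetic, the helpers below convert two ps TIME values to seconds and work out the percentage. The function names to_seconds and cpu_percent are hypothetical, and the sketch assumes TIME is in the M:SS form shown in the samples above. Since ps reports whole seconds and sleep is not exact, the result is only approximate.

```shell
#!/bin/sh
# Illustrative helpers; to_seconds and cpu_percent are made-up names.

# Convert a ps TIME field in M:SS form (e.g. 46:03) to seconds.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 60 + $2 }'
}

# cpu_percent PREV_TIME CUR_TIME INTERVAL_SECONDS
# Prints the share of one CPU used between the two samples, as a percentage.
cpu_percent() {
    awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" -v t="$3" \
        'BEGIN { printf "%d\n", (b - a) * 100 / t }'
}
```

With the first two samples for pid 1737, cpu_percent 3:14 4:14 60 prints 100.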
Unfortunately, I didn't get the chance to include the
in the output... will get that added now....