Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

jfrickson · Post by **jfrickson** » Wed Feb 03, 2016 11:18 am

MalcolmPreen wrote:I'm a little confused, as send_nsca has an inbuilt timeout.... (which I have upped to 120s due to historic issues.... this can be reduced if needed).

Does it make any sense that send_nsca should go into a loop if the network it is using should fail as it is using it ? I would have thought the message would be lost, and the script would just continue after "timeout" seconds ??

Is there anything I can do to debug send_nsca ?
How should I best "kill" it, if the problem recurs in the same way ?

Hmm, very odd. Yes, it should time out.

I did some research, and so far I found two instances of alarm() being blocked. One is (or was) a bug in the NVidia proprietary driver, and one is (or was) a bug in tcp wrappers. So it is possible for another program to block signals in send_nsca.

Get the PID of send_nsca then do the following (replacing 9999 with the actual PID):
# cat /proc/9999/status | grep "^SigBlk"
And post the results.

MalcolmPreen · Post by **MalcolmPreen** » Wed Feb 03, 2016 11:40 am

Thanks... by way of interest... I've "caught" a live one (not hung.... but proof of concept), and see the following;

Code: Select all

PID=7761
  PID TTY          TIME CMD
 7761 ?        00:00:00 send_nsca
Name:   send_nsca
State:  S (sleeping)
Tgid:   7761
Pid:    7761
PPid:   7760
TracerPid:      0
Uid:    500     500     500     500
Gid:    500     500     500     500
Utrace: 0
FDSize: 32
Groups: 500
VmPeak:     2420 kB
VmSize:     2388 kB
VmLck:         0 kB
VmHWM:       660 kB
VmRSS:       660 kB
VmData:      184 kB
VmStk:        92 kB
VmExe:        16 kB
VmLib:      2032 kB
VmPTE:        32 kB
VmSwap:        0 kB
Threads:        1
SigQ:   2/29661
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000001000
SigCgt: 0000000000002000
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: ffffffffffffffff
Cpus_allowed:   ff
Cpus_allowed_list:      0-7
Mems_allowed:   1
Mems_allowed_list:      0
voluntary_ctxt_switches:        2
nonvoluntary_ctxt_switches:     1

Obviously, that's not a hung one.... and I assume the above is as expected...

Will re-work the "kill" script, to capture this output should it recur...

Malcolm

hsmith · Post by **hsmith** » Wed Feb 03, 2016 3:45 pm

Let us know how it goes with reworking the script.

MalcolmPreen · Post by **MalcolmPreen** » Thu Feb 04, 2016 5:21 am

Quick update:

Overnight we had an issue with the VPN set-up, which resulted in a number of the tablets losing their VPN connect and re-starting it....

This does appear to provoke the problem.... so it may be reproducible in a relatively controlled environment...

One interesting thing to note however, is that two of the tablets which are still running nagios 4.0.8 appeared to experience the same problem (one or more worker process goes busy... and nagios services at the "centre" start to go stale).

These are less available for "testing" but it is interesting (to me at least

) that the problem does not appear restricted to 4.1.2-Pre1

Unfortunately, no more details are available at present... but I figured the above info might help (a bit!).

Malcolm

PS Update: another thing to note... whilst on 4.0.8 we see a similar pattern... (stale statuses etc), the "busy loop" is not "as busy".

To try to explain; under 4.1.2-Pre1 - the loop is hard... so much so that every service goes stale (suggesting that no nsca reports are getting through)
whilst with 4.0.8 - one of the worker processes loops as per 4.1.2-Pre1 - but nagios itself seems to continue.... which means that the problem is less serious (only a few reports don't run....) but also that it is harder to detect. (all services going stale is a helpful flag)

I don't know which scenario is going to be better to investigate.... but I can "almost" reproduce the problem in an "almost" controlled environment.

MalcolmPreen · Post by **MalcolmPreen** » Thu Feb 04, 2016 11:20 am

Followup:

OK, we have established that we had a number of VPN failures last night (technically, the VPN didn't fail completely.... but the test for a failure reported that it had (the timeout was too short)) and the system is set to, upon vpn failure, to stop and re-start the vpn link.

Unfortunately (or maybe fortunately as we've collected some data!!), this resulted in a number of "busy loop" worker processes.

As previously advised, I coded a script to run the requested gdb -p commands, a few ps's out to a file.... and then kill -1 the "busy" process so that monitoring could continue.

In every case, the gdb output is identical (apart from addresses)

Code: Select all

GNU gdb (GDB) Red Hat Enterprise Linux (7.2-83.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i686-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Attaching to process 1737
Reading symbols from /usr/local/nagios/bin/nagios...(no debugging symbols found)...done.
Reading symbols from /lib/libm.so.6...Reading symbols from /usr/lib/debug/lib/libm-2.12.so.debug...done.
done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /lib/libdl.so.2...Reading symbols from /usr/lib/debug/lib/libdl-2.12.so.debug...done.
done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libc.so.6...Reading symbols from /usr/lib/debug/lib/libc-2.12.so.debug...done.
done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux.so.2...Reading symbols from /usr/lib/debug/lib/ld-2.12.so.debug...done.
done.
Loaded symbols for /lib/ld-linux.so.2
0x0016d424 in __kernel_vsyscall ()
(gdb) #0  0x0016d424 in __kernel_vsyscall ()
#1  0x0041a8d3 in __read_nocancel () at ../sysdeps/unix/syscall-template.S:82
#2  0x080bd522 in ?? ()
#3  0x080bd680 in finish_job ()
#4  0x080bd4b5 in ?? ()
#5  0x080bd5c2 in ?? ()
#6  0x080bdcb3 in ?? ()
#7  0x080bad43 in iobroker_poll ()
#8  0x080be131 in enter_worker ()
#9  0x08058ec8 in main ()
(gdb) Detaching from program: /usr/local/nagios/bin/nagios, process 1737

The ps axlf varies.... for example, 4 ps's separated by sleep 60;

Code: Select all

0   500  1737  1726  20   0   3452   872 -      R    ?          3:14  \_ /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
0   500  1737  1726  20   0   3452   872 -      R    ?          4:14  \_ /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
0   500  1737  1726  20   0   3452   872 -      R    ?          5:14  \_ /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
0   500  1737  1726  20   0   3452   872 -      R    ?          6:14  \_ /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh

In this example above.... there are no children.... but

Code: Select all

0   500  2423  2421  20   0   4044  1640 -      R    ?         46:03  \_ /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
0   500  6513  2423  20   0      0     0 -      Z    ?          0:00  |   \_ [check_by_ssh] <defunct>
0   500 10537  2423  20   0      0     0 -      Z    ?          0:00  |   \_ [check_by_ssh] <defunct>

0   500  2423  2421  20   0   4044  1640 -      R    ?         47:03  \_ /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
0   500  6513  2423  20   0      0     0 -      Z    ?          0:00  |   \_ [check_by_ssh] <defunct>
0   500 10537  2423  20   0      0     0 -      Z    ?          0:00  |   \_ [check_by_ssh] <defunct>

Apart from the fact that the CPU usages increases by 1 minute after a sleep 60, there appears to be no pattern.

Unfortunately, I didn't get the chance to include the

Code: Select all

cat /proc/XXXX/status

in the output... will get that added now....

MalcolmPreen · Post by **MalcolmPreen** » Thu Feb 04, 2016 11:31 am

Correction:

I did include the cat of the /proc/XXXX/status - but only if it found an active send_nsca process.... which only happened once.... and by the time the cat command ran, the process had terminated.

As it appears that send_nsca is no longer the cause of the problem - is there any mileage in collecting this output.... but for the "busy loop" process ?

Any advice, greatly appreciated, Malcolm

jfrickson · Post by **jfrickson** » Thu Feb 04, 2016 12:07 pm

MalcolmPreen wrote:Correction:

I did include the cat of the /proc/XXXX/status - but only if it found an active send_nsca process.... which only happened once.... and by the time the cat command ran, the process had terminated.

As it appears that send_nsca is no longer the cause of the problem - is there any mileage in collecting this output.... but for the "busy loop" process ?

A cat of /proc/XXXX/status would be interesting to see on the worker that's stuck in the loop. It sounds like any further checks on send_nsca will not help.

You have a great mystery going here. I'm going to lose sleep over it if we can't find out what's going on!

MalcolmPreen · Post by **MalcolmPreen** » Thu Feb 04, 2016 12:23 pm

the capture of /proc/$PID/status is now included...

If it survives overnight, I'll work on a "reproduction" in a controlled environment tomorrow assuming I get some time....

Thanks, Malcolm

hsmith · Post by **hsmith** » Thu Feb 04, 2016 5:16 pm

Let us know when you have a chance.

Thanks!

MalcolmPreen · Post by **MalcolmPreen** » Fri Feb 05, 2016 11:16 am

OK... controlled re-production in place.

In summary... if the VPN (openconnect) terminates... (or is deliberately terminated by me), there is code in place to detect this failure.... and re-start the link

After this (not sure at present if it is when the vpn fails... or when it recovers... or somewhere in between), I can see that one worker process and also the nagios core (nagios -d) are in a busy loop. Initially, the fourth column of ps -ef remains at 0.... until about 10-15 minutes have elapsed...

My detection script is looking for non-zero in the fourth column, so only kicks in after it goes non-zero (I'll work on improving this.... so it can detect the problem quicker... and thus have less affect on the customer).

The /proc/PID/status from the busy worker process is;

Code: Select all

Name:   nagios
State:  R (running)
Tgid:   1732
Pid:    1732
PPid:   1726
TracerPid:      0
Uid:    500     500     500     500
Gid:    500     500     500     500
Utrace: 0
FDSize: 256
Groups: 500
VmPeak:     4488 kB
VmSize:     3440 kB
VmLck:         0 kB
VmHWM:      1960 kB
VmRSS:      1040 kB
VmData:      812 kB
VmStk:        88 kB
VmExe:       596 kB
VmLib:      1896 kB
VmPTE:        24 kB
VmSwap:        0 kB
Threads:        1
SigQ:   2/29661
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000001000
SigCgt: 0000000000010000
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: ffffffffffffffff
Cpus_allowed:   ff
Cpus_allowed_list:      0-7
Mems_allowed:   1
Mems_allowed_list:      0
voluntary_ctxt_switches:        73974
nonvoluntary_ctxt_switches:     121764

And the ps axlf every minute for 4 minutes shows;

Code: Select all

$ grep "nagios -d" /tmp/*1732
1   500  1726     1  20   0  21364  3492 -      Rs   ?         26:26 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
1   500  1726     1  20   0  21364  3492 -      Rs   ?         27:26 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
1   500  1726     1  20   0  21364  3492 -      Rs   ?         28:26 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
1   500  1726     1  20   0  21364  3492 -      Rs   ?         29:26 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
$ grep 1732 /tmp/*1732
Attaching to process 1732
(gdb) Detaching from program: /usr/local/nagios/bin/nagios, process 1732
0   500  1732  1726  20   0   3440  1040 -      R    ?         19:19  \_ /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
0   500  1732  1726  20   0   3440  1040 -      R    ?         20:19  \_ /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
0   500  1732  1726  20   0   3440  1040 -      R    ?         21:19  \_ /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
0   500  1732  1726  20   0   3440  1040 -      R    ?         22:19  \_ /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh

By my interpretation 1 minute of CPU time per minute = busy loop !!

My next plan... is to check the status file BEFORE reproducing the error... and see if it changes... and also the same before/after for the nagios -d controller process.

The "easy" solution is to stop the VPN from failing !!

Any further suggestions ?

Malcolm

PS quick advanced notice - I will be on vacation from 12-22 February inclusive... so I won't be providing updates... during those dates.

Nagios Support Forum

Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")