Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

MalcolmPreen · Post by **MalcolmPreen** » Wed Jan 27, 2016 9:58 am

Quick(ish) background info:

Our set-up has a number of issues (the main one being "worker processes being choked"), which are fixed by code which will be included in Nagios 4.1.2 - which is not available for a couple of months.

The remedy to these issues is either to launch MANY (about 256 !!) worker processes... which is putting severe load onto the netbooks on which nagios is installed on the customer sites.... and possibly causing other issues.

OR

Upgrade to 4.1.2-Pre1

Due to the load issue, we opted to try 4.1.2-Pre1 on a couple of live, but "not quite as busy" netbooks.

Each of the netbooks on the customer sites, reports host/service data back to "the centre" using nsca messages.

Now to the problem

Since we upgraded the netbooks to 4.1.2-Pre1, we noticed that "sometimes" (normally at least 12 hours after nagios start), the services and hosts at the centre would start to be marked as STALE (basically, no nsca message received from the netbook).

The first occurrence of this was out of hours, so the decision was made to restart nagios on the netbook.... which restored the monitoring service... but doesn't help to diagnose the problem....

Last night, the situation arose again.... and what I saw was as follows;

ps -ef | grep "nagios -" (to show the -d process and all of the --workers)

This showed all of the expected process... but the "C" column, whilst normally almost 0, was about 40 for one of the worker processes.... and also for the "core" (nagios -d) process. This raised figure remained in place consistently... until I took remedial action (see below).

The other 15 worker processes didn't get involved - so I'm assuming that the core process was no longer dispatching - in effect the two processes seemed to be in a loop.

I didn't want to kill the core (nagios -d), in case the problem was recoverable... so I attempted to kill the worker process which was "looping".

As root, from the command line, I issued;

kill -1 PID
kill PID
kill -9 PID

Nothing happened until the kill -9, at which time the process changed from "normal" to;

nagios 19810 19803 2 Jan26 ? 00:49:18 [nagios] <defunct>

Shortly after this, the C column on this process... and the nagios -d process reduced.... not to 0.... but "close".... and the STALE statuses started to reduce (I assume as the core process was now freed to be able to dispatch the other worker processes).

This was about 21 hours ago now.... and apart from being 1 worker process down... the system appears to be running normally.

I suspect I can re-start nagios to recover "complete normality" if required.... but given that things are running normally, I have left the scenario in place, in case any diagnostic information can be gleaned from the current situation.

So... to the questions;
- should the problem recur, what SHOULD I do to collect information to assist debugging ? [whilst it is a production system.... we fully accept that we are running a pre-release version, so are happy to collect any required information to assist the resolution of this problem... - just let me know]

- is it possible to determine which service the nagios worker process is running at any time? (specifically when it loops!! - but anytime would be interesting)

Any advice greatly appreciated...

Thanks, Malcolm

jfrickson · Post by **jfrickson** » Wed Jan 27, 2016 12:25 pm

There are a couple of things you could try:

1) Turn on debugging in nagios.cfg. Set debug_level to -1 and debug_verbosity to 2. Note: this is guaranteed to generate huge amounts of logging.

2) When a process gets in a loop, check it using gdb as follows:

Code: Select all

# ps uw -C nagios
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
nagios    8828  0.0  0.0  18636  2860 ?        Ss   10:55   0:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios    8829  0.0  0.0  12932  2624 ?        S    10:55   0:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagi
nagios    8831 25.3  0.0  12932  2624 ?        S    10:55   0:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagi
nagios    8835  0.0  0.0  18120  1624 ?        S    10:55   0:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
# gdb -p 8831
     . . . a couple of pages of gdb stuff . . .
(gdb) bt
#0  0x00007f2f5ae63643 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1  0x0000000000472b80 in iobroker_poll (iobs=0x108c040, timeout=timeout@entry=30004)
    at iobroker.c:337
#2  0x000000000047595e in enter_worker (sd=sd@entry=3, cb=<optimized out>)
    at worker.c:830
#3  0x0000000000416166 in nagios_core_worker (
    path=0x7ffc391d2ed0 "/usr/local/nagios/var/rw/nagios.qh") at nagios.c:182
#4  main (argc=<optimized out>, argv=0x7ffc391d2c48) at nagios.c:318
(gdb) detach
Detaching from program: /usr/local/nagios/bin/nagios, process 8829
(gdb) q
#

Then post the results of the backtrace (bt command) here. That should give me a real good idea where to look for the problem.

If you're familiar with gdb, you could also do a little poking around to see what values variables have, what command is being processed, etc.

MalcolmPreen · Post by **MalcolmPreen** » Thu Jan 28, 2016 9:23 am

Thanks for the speedy response.... we've actually had two repeats this morning...

First time.... gdb wasn't yet installed

I issued a kill -1 (only), and the worker process went defunct.... monitoring resumed...

Second time... I have gdb output

Code: Select all

Script started on Thu 28 Jan 2016 14:05:13 GMT
root@tablet# gdb -p 19807
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-83.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i686-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Attaching to process 19807
Reading symbols from /usr/local/nagios/bin/nagios...(no debugging symbols found)...done.
Reading symbols from /lib/libm.so.6...Reading symbols from /usr/lib/debug/lib/libm-2.12.so.debug...done.
done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /lib/libdl.so.2...Reading symbols from /usr/lib/debug/lib/libdl-2.12.so.debug...done.
done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libc.so.6...Reading symbols from /usr/lib/debug/lib/libc-2.12.so.debug...done.
done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux.so.2...Reading symbols from /usr/lib/debug/lib/ld-2.12.so.debug...done.
done.
Loaded symbols for /lib/ld-linux.so.2
0x0034f424 in __kernel_vsyscall ()
(gdb) bt
#0  0x0034f424 in __kernel_vsyscall ()
#1  0x002288d3 in __read_nocancel () at ../sysdeps/unix/syscall-template.S:82
#2  0x080bd522 in ?? ()
#3  0x080bd680 in finish_job ()
#4  0x080bd4b5 in ?? ()
#5  0x080bd5c2 in ?? ()
#6  0x080bdcb3 in ?? ()
#7  0x080bad43 in iobroker_poll ()
#8  0x080be131 in enter_worker ()
#9  0x08058ec8 in main ()
(gdb) detach
Detaching from program: /usr/local/nagios/bin/nagios, process 19807
(gdb) quit
root@tablet:# kill -1 19807
root@tablet:# exit

Script done on Thu 28 Jan 2016 14:09:19 GMT

Hope that helps... if there are any other gdb commands needed.... just let me know...

I don't have the debug turned on.... its in the nagios.cfg... but I've not restarted yet....

I'm not yet familiar with gdb.... so will have an experiment when I get a chance....

In the meantime.... my plan... (in order to keep the live system as live as possible).... is to;

write a script...
to detect a problem process.... run gdb to collect the requested info out to a file.... kill -1 the process

Then I need to make sure I don't run out of workers...

Thanks for your help

Malcolm

jfrickson · Post by **jfrickson** » Thu Jan 28, 2016 10:29 am

Ok, it looks like you got bit by a previously reported bug. You have two options:

1) You can patch the program. Go into lib/worker.c and find the function gather_output. A few lines from the top of the function you will see this:

Code: Select all

		rd = read(io->fd, buf, sizeof(buf));
		if (rd < 0) {
			if (errno == EINTR || errno == EAGAIN)
				continue;
			if (!final && errno != EAGAIN)
				wlog("job %d (pid=%ld): Failed to read(): %s", cp->id, (long)cp->ei->pid, strerror(errno));
		}

Add one additional check so it looks like this:

Code: Select all

		rd = read(io->fd, buf, sizeof(buf));
		if (rd < 0) {
			if (errno == EAGAIN && !final)
				break;
			if (errno == EINTR || errno == EAGAIN)
				continue;
			if (!final && errno != EAGAIN)
				wlog("job %d (pid=%ld): Failed to read(): %s", cp->id, (long)cp->ei->pid, strerror(errno));
		}

Then recompile and install.

2) The other option is to download and install version 4.1.2Pre1. The repository is at https://github.com/NagiosEnterprises/na ... 4.1.2-Pre1 or you can download a zip file at https://github.com/NagiosEnterprises/na ... 2-Pre1.zip

Either option should fix your problem. Let us know if you need any help, and if you get it working properly.

MalcolmPreen · Post by **MalcolmPreen** » Thu Jan 28, 2016 12:18 pm

Ah.... but... I'm already using Nagios 4.1.2-Pre1 (ref Subject).

Just to be sure, I just checked lib/worker/c and I have;

Code: Select all

                rd = read(io->fd, buf, sizeof(buf));
                if (rd < 0) {
                        if (errno == EAGAIN && !final)
                                break;
                        if (errno == EINTR || errno == EAGAIN)
                                continue;
                        if (!final && errno != EAGAIN)
                                wlog("job %d (pid=%ld): Failed to read(): %s", cp->id, (long)cp->ei->pid, strerror(errno));
                }

So... if it is the same problem... it isn't fixed by this change...

Sorry, Malcolm

jfrickson · Post by **jfrickson** » Thu Jan 28, 2016 12:33 pm

MalcolmPreen wrote:Ah.... but... I'm already using Nagios 4.1.2-Pre1 (ref Subject).

Sorry, Malcolm

Oops!

I got so excited when I thought I found it, I forgot to double-check what you're running. Silly me!

Ok, when it happens again, do a ps -axlf to see what's running as a child of the busy worker process. Note the PID's. Wait a while and run the ps again. Eventually, you should end up with just a single child that doesn't go away. Let me know what you end up with.

MalcolmPreen · Post by **MalcolmPreen** » Fri Jan 29, 2016 10:48 am

No problem... thanks for your help so far.

Just as an update.... in preparation for the weekend I have set up a script which has the following logic;

is any worker process busy (ps -ef column 4 non-zero) ?
if not, do nothing...
if so, wait 30s and check again... has elapsed time increased (column 7)... if so this COULD be a looper (this may need tuning)

collect gdb -p PID... and ps -axlf... wait 60s... ps -axlf... wait 60s... ps -axlf.... wait 60s... ps -axlf.... kill -1 PID

all this output goes to /tmp/gdb.PID.txt

The script runs, as root, every 5 minutes from cron....

Hopefully, this will collect the required data.... whilst maintaining full monitoring performance...

Of course... since I implemented this.... (and turned debug on as specified previously) there have been no repeats (maybe this post will cause one!!)

If it survives over the weekend.... I plan to remove debug.... in case its a timing issue....

Thanks for your help so far, and hope the above makes sense

Malcolm

jfrickson · Post by **jfrickson** » Fri Jan 29, 2016 10:52 am

That sounds like it should work. /me crossing fingers and hoping for good output

MalcolmPreen · Post by **MalcolmPreen** » Tue Feb 02, 2016 12:06 pm

OK... a minor update...

It survived over the weekend.... so I've removed debug....

We had one recurrence yesterday.... from what I can see it was around the time of a VPN failure (the netbook reports the status of its checks to two central servers.... one is in the DMZ... and the other is internal... so requires a VPN connection), and it appeared to be running the "submit nsca report" script.... basically a wrapper around a number of send_nsca commands.

The script to "detect" and "tidy" the error worked, ish, in that I'd left it in "test mode" - so it didn't actually do the kill -1 !!!

I amended the script, and it kicked it....

I'm a little confused, as send_nsca has an inbuilt timeout.... (which I have upped to 120s due to historic issues.... this can be reduced if needed).

Does it make any sense that send_nsca should go into a loop if the network it is using should fail as it is using it ? I would have thought the message would be lost, and the script would just continue after "timeout" seconds ??

Is there anything I can do to debug send_nsca ?
How should I best "kill" it, if the problem recurs in the same way ?

Thanks, Malcolm

hsmith · Post by **hsmith** » Tue Feb 02, 2016 3:56 pm

I've reached out to jfrickson, we're currently having some inclement weather conditions in the Twin Cities, so I apologize for any potential delay in answer. I'll check back tomorrow afternoon.

Nagios Support Forum

Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")