Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

jfrickson

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Post by jfrickson »

Well, the output from cat /proc/XXX/status looks fine. I was kind of hoping it would show a problem.

Try this. Edit the file lib/worker.c. Add #include <syslog.h> at the top. Go to the function gather_output.

Change the code from this:

Code:

static void gather_output(child_process *cp, iobuf *io, int final)
{
	for (;;) {
		char buf[4096];
		int rd;

		rd = read(io->fd, buf, sizeof(buf));
		if (rd < 0) {
			if (errno == EAGAIN && !final)
				break;
			if (errno == EINTR || errno == EAGAIN)
				continue;
			if (!final && errno != EAGAIN)
				wlog("job %d (pid=%ld): Failed to read(): %s", cp->id, (long)cp->ei->pid, strerror(errno));
		}
to this:

Code:

static void gather_output(child_process *cp, iobuf *io, int final)
{
	int retry = 5;

	for (;;) {
		char buf[4096];
		int rd;

		rd = read(io->fd, buf, sizeof(buf));
		if (rd < 0) {
			if (errno == EAGAIN && !final)
				break;
			if (errno == EINTR || errno == EAGAIN) {
				char	buf[1024];
				if (--retry == 0) {
					snprintf(buf, sizeof(buf), "job %d (pid=%ld): Failed to read(): %s", cp->id, (long)cp->ei->pid, strerror(errno));
					syslog(LOG_ERR, "%s", buf);
					break;
				}
				snprintf(buf, sizeof(buf), "job %d (pid=%ld): read() returned error %d", cp->id, (long)cp->ei->pid, errno);
				syslog(LOG_ERR, "%s", buf);
				sleep(1);
				continue;
			}
			if (!final && errno != EAGAIN)
				wlog("job %d (pid=%ld): Failed to read(): %s", cp->id, (long)cp->ei->pid, strerror(errno));
		}
Then send the entries from syslog.

Only do this on your test system. It could introduce unacceptable delays in a production system. If it works, I'll do something more production-ready.
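
For reference, the bounded-retry idea in the change above can be sketched as a standalone helper. This is only an illustration under assumptions: read_with_retry, max_tries and delay_ms are hypothetical names, not part of Nagios, and the real gather_output does more than this (it also appends the data to the job's output buffer).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper (not a Nagios function): read from a non-blocking
 * descriptor, retrying a bounded number of times when read() reports a
 * transient error (EINTR or EAGAIN).
 *
 * Returns the byte count from read() (0 means end of file), or -1 with
 * errno set when a hard error occurs or the retry budget is used up.
 */
ssize_t read_with_retry(int fd, void *buf, size_t len, int max_tries, long delay_ms)
{
	assert(max_tries > 0);

	for (int attempt = 1; ; attempt++) {
		ssize_t rd = read(fd, buf, len);
		if (rd >= 0)
			return rd;		/* data or EOF */
		if (errno != EINTR && errno != EAGAIN)
			return -1;		/* hard error, do not retry */
		if (attempt >= max_tries)
			return -1;		/* give up instead of spinning forever */

		/* Back off briefly so the retry does not become a busy loop. */
		struct timespec ts = {
			.tv_sec = delay_ms / 1000,
			.tv_nsec = (delay_ms % 1000) * 1000000L
		};
		nanosleep(&ts, NULL);
	}
}
```

The point of the sketch is that every path out of the loop is bounded: either read() returns something, or the helper gives up after max_tries attempts with errno still describing the last failure.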
MalcolmPreen
Posts: 63
Joined: Wed Jan 25, 2012 9:21 am

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Post by MalcolmPreen »

OK, will try to schedule this in....

A couple of questions.... I can't make (as yet) the problem recur on my "live" test system.... I might be able to get it to go on my non-live system.... but I'll need to do a large amount of set-up first.... so it certainly won't be today...

After the edit of worker.c - can you confirm that I need to rebuild and install (make all / make install) as normal ?

When you say "send the entries from syslog" - is there some way I can filter the output ? (I assume you only want nagios syslog entries... ?)
I can guess... but I don't want to "over filter".

I'm also guessing that the output might be quite large... so that the best way will likely be "Upload attachment".

As I said... this will likely take some time to set up... so I'll try and do it in the next day or two...
jfrickson

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Post by jfrickson »

MalcolmPreen wrote:OK, will try to schedule this in....

A couple of questions.... I can't make (as yet) the problem recur on my "live" test system.... I might be able to get it to go on my non-live system.... but I'll need to do a large amount of set-up first.... so it certainly won't be today...
No problem. Whenever you can get to it.
MalcolmPreen wrote:After the edit of worker.c - can you confirm that I need to rebuild and install (make all / make install) as normal ?
Yes. make all && make install
MalcolmPreen wrote:When you say "send the entries from syslog" - is there some way I can filter the output ? (I assume you only want nagios syslog entries... ?)
I can guess... but I don't want to "over filter".

I'm also guessing that the output might be quite large... so that the best way will likely be "Upload attachment".
The ident will be nagios and the priority will be LOG_ERR, and each entry will include the text read(). It should only be about 6 lines, so you should be able to just paste it into a message as a quote or code.
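
If a programmatic filter is easier than eyeballing the log, the two criteria (ident nagios, text read()) could be checked with something like the sketch below. It assumes the traditional "host ident: message" syslog line layout with no PID after the ident; is_worker_read_line is a hypothetical name, not something Nagios provides.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: true when a syslog line was logged under the
 * "nagios" ident and its message mentions read().
 * Assumption: the line looks like "<timestamp> <host> nagios: <message>".
 */
bool is_worker_read_line(const char *line)
{
	const char *tag = strstr(line, " nagios: ");

	/* Only search for "read()" after the ident, i.e. in the message part. */
	return tag != NULL && strstr(tag, "read()") != NULL;
}
```

Filtering on both the ident and the message text keeps out unrelated Nagios entries as well as read() errors logged by other programs.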
MalcolmPreen

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Post by MalcolmPreen »

OK, back from vacation...

Needed to add the line;

#include <syslog.h>

to allow it to compile...

I added it after

#include <time.h>

Any problems with this ?

Hopefully will have some more news in the next couple of days.

Malcolm
jfrickson

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Post by jfrickson »

MalcolmPreen wrote:OK, back from vacation...

Needed to add the line;

#include <syslog.h>

to allow it to compile...

I added it after

#include <time.h>

Any problems with this ?
That should work just fine :)
MalcolmPreen

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Post by MalcolmPreen »

OK, first attempt done... no major analysis as it's home time...

As per normal... I detached the VPN connection... this time.. I got these errors... but no repeat of the hard loop...

Could the "retry" code be "fixing" the problem?

Code:

[root@tablet log]# tail -f messages|egrep -i "nagios.*job.*read"
Feb 24 17:16:48 tablet nagios: job 43 (pid=9338): read() returned error 11
Feb 24 17:16:52 tablet nagios: job 44 (pid=9567): read() returned error 11
Feb 24 17:17:08 tablet nagios: job 45 (pid=10442): read() returned error 11
Feb 24 17:22:51 tablet nagios: job 80 (pid=25519): read() returned error 11
Feb 24 17:22:53 tablet nagios: job 80 (pid=25672): read() returned error 11
Feb 24 17:22:53 tablet nagios: job 80 (pid=25710): read() returned error 11
Feb 24 17:22:56 tablet nagios: job 81 (pid=25946): read() returned error 11
Feb 24 17:22:57 tablet nagios: job 81 (pid=26095): read() returned error 11
Feb 24 17:22:58 tablet nagios: job 81 (pid=26095): read() returned error 11
Feb 24 17:22:59 tablet nagios: job 82 (pid=26207): read() returned error 11
Feb 24 17:22:59 tablet nagios: job 82 (pid=26238): read() returned error 11
Feb 24 17:23:00 tablet nagios: job 82 (pid=26267): read() returned error 11
Feb 24 17:23:02 tablet nagios: job 82 (pid=26384): read() returned error 11
Feb 24 17:23:03 tablet nagios: job 82 (pid=26480): read() returned error 11
Feb 24 17:24:00 tablet nagios: job 77 (pid=24703): read() returned error 11
Feb 24 17:24:01 tablet nagios: job 77 (pid=24703): read() returned error 11
Feb 24 17:24:02 tablet nagios: job 77 (pid=24703): read() returned error 11
Feb 24 17:24:03 tablet nagios: job 77 (pid=24703): read() returned error 11
Feb 24 17:24:04 tablet nagios: job 77 (pid=24703): Failed to read(): Resource temporarily unavailable
Feb 24 17:24:07 tablet nagios: job 89 (pid=2848): read() returned error 11
Feb 24 17:24:10 tablet nagios: job 89 (pid=2918): read() returned error 11
Feb 24 17:24:11 tablet nagios: job 90 (pid=3043): read() returned error 11
Feb 24 17:26:31 tablet nagios: job 98 (pid=9704): read() returned error 11
Feb 24 17:27:24 tablet nagios: job 102 (pid=12242): read() returned error 11
Feb 24 17:29:13 tablet nagios: job 110 (pid=18043): read() returned error 11
Feb 24 17:29:19 tablet nagios: job 110 (pid=18227): read() returned error 11
Feb 24 17:30:36 tablet nagios: job 114 (pid=21079): read() returned error 11
Since I cut and pasted the above, there have been a few more errors... but no sign as yet of a loop.

Does that help?

I'm not sure how to match up the "job" or "pid" in the output...

Do you need any other logfiles with the above?

Malcolm
hsmith
Agent Smith
Posts: 3539
Joined: Thu Jul 30, 2015 11:09 am
Location: 127.0.0.1

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Post by hsmith »

I believe John is already gone for the day, but I'll reach out to him tomorrow and see if he has any input.
Former Nagios Employee.
me.
MalcolmPreen

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Post by MalcolmPreen »

Thanks.... I've done a few tests this morning;

(a) following the forced VPN failure last night.... and the one " Resource temporarily unavailable " error appears to have prevented the hard loop. The monitoring continued overnight without any side effects.

(b) I "tweaked" the retry variable.... "just in case" it worked after a longer delay.... no change... except it takes a little longer to "fail"

(c) Without the forced failure, we get "some" EAGAIN (errno=11) errors.... but normally no more than 2 before the task succeeds.
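
As a side note on reading those entries: the test patch logs the raw errno number in one message and the strerror() text in the other, which is why the same condition shows up both as "error 11" and as "Resource temporarily unavailable". A small sketch that prints both together (describe_errno is a hypothetical helper, and the numeric value of EAGAIN is platform-specific, 11 being the Linux value):

```c
#include <errno.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format an errno value as "<number> (<text>)" so a
 * log line carries both the raw code and its human-readable meaning.
 * Returns what snprintf() returns: the length the full text needs.
 */
int describe_errno(int err, char *out, size_t outlen)
{
	return snprintf(out, outlen, "%d (%s)", err, strerror(err));
}
```

Calling describe_errno(EAGAIN, buf, sizeof(buf)) on the system that produced the log above would be expected to yield the number and the text seen in those two message forms.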

When the worker task "times out" and executes the

Code:

            if (--retry == 0) {
               snprintf(buf, sizeof(buf), "job %d (pid=%ld): Failed to read(): %s", cp->id, (long)cp->ei->pid, strerror(errno));
               syslog(LOG_ERR, "%s", buf);
               break;
            }
Does that "re-queue" the job ? Or is that task lost ?

On a positive note, it doesn't appear to affect the running of the tablet....

As yesterday, if there is anything more you need, I'll do what I can to collect it.

Malcolm
jfrickson

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Post by jfrickson »

MalcolmPreen wrote:(a) following the forced VPN failure last night.... and the one " Resource temporarily unavailable " error appears to have prevented the hard loop. The monitoring continued overnight without any side effects.

(b) I "tweaked" the retry variable.... "just in case" it worked after a longer delay.... no change... except it takes a little longer to "fail"

(c) Without the forced failure, we get "some" EAGAIN (errno=11) errors.... but normally no more than 2 before the task succeeds.
That's good news! It looks like we fixed the problem.
MalcolmPreen wrote:When the worker task "times out" and executes the

Code:

            if (--retry == 0) {
               snprintf(buf, sizeof(buf), "job %d (pid=%ld): Failed to read(): %s", cp->id, (long)cp->ei->pid, strerror(errno));
               syslog(LOG_ERR, "%s", buf);
               break;
            }
Does that "re-queue" the job ? Or is that task lost ?
If retrying five times doesn't work, then it's gone. You can bump up the number of retries if you start getting failures, but I would be real surprised if five seconds doesn't work but a higher number does.
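
To make the timing concrete: because the test patch checks --retry == 0 before it sleeps, a starting value of 5 gives four one-second sleeps, and the fifth failed read() is the one that gives up. The counter logic alone can be replayed like this (sleeps_before_giving_up is a hypothetical name used only for this illustration):

```c
/*
 * Replays only the counter logic of the test patch: on every transient
 * read() failure the patch does `if (--retry == 0) break;` and otherwise
 * sleeps for one second and tries again.
 *
 * Returns how many one-second sleeps happen before the loop gives up,
 * or -1 for a non-positive starting value (which the patch's check
 * would never catch, since the counter would already be past zero).
 */
int sleeps_before_giving_up(int retry)
{
	int sleeps = 0;

	if (retry <= 0)
		return -1;

	for (;;) {
		if (--retry == 0)
			break;		/* the "Failed to read()" path */
		sleeps++;		/* the "read() returned error" path, then sleep(1) */
	}
	return sleeps;
}
```

That matches the one failing job in the log: four "read() returned error 11" entries one second apart, followed by a single "Failed to read()" entry.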
MalcolmPreen wrote:On a positive note, it doesn't appear to affect the running of the tablet....

As yesterday, if there is anything more you need, I'll do what I can to collect it.

Malcolm
Thanks! I'll be going over this in more detail, and probably putting in a fix that is very close to what I had you do.

If you want, you can comment out the calls to syslog (or not) and deploy the program to your tablets.

Thanks for your help!
MalcolmPreen

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Post by MalcolmPreen »

OK, that's good news.

Will there be another "release" (4.1.2-Pre2?) before the official release ? Or is my only option to implement 4.1.2-Pre1 plus this fix.... (or wait patiently :-) )

If you have an "official fix" you want me to try, then please let me know.

Malcolm