Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")


Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Post by jfrickson » Fri Feb 05, 2016 12:52 pm

Well, the output from cat /proc/XXX/status looks fine. I was kind of hoping it would show a problem.

Try this. Edit the file lib/worker.c. Add #include <syslog.h> at the top. Go to the function gather_output.

Change the code from this:
Code:
static void gather_output(child_process *cp, iobuf *io, int final)
{
   for (;;) {
      char buf[4096];
      int rd;

      rd = read(io->fd, buf, sizeof(buf));
      if (rd < 0) {
         if (errno == EAGAIN && !final)
            break;
         if (errno == EINTR || errno == EAGAIN)
            continue;
         if (!final && errno != EAGAIN)
            wlog("job %d (pid=%ld): Failed to read(): %s", cp->id, (long)cp->ei->pid, strerror(errno));
      }


to this:
Code:
static void gather_output(child_process *cp, iobuf *io, int final)
{
   int retry = 5;

   for (;;) {
      char buf[4096];
      int rd;

      rd = read(io->fd, buf, sizeof(buf));
      if (rd < 0) {
         if (errno == EAGAIN && !final)
            break;
         if (errno == EINTR || errno == EAGAIN) {
            char   buf[1024];
            if (--retry == 0) {
               sprintf(buf, "job %d (pid=%ld): Failed to read(): %s", cp->id, (long)cp->ei->pid, strerror(errno));
               syslog(LOG_ERR, buf);
               break;
            }
            sprintf(buf, "job %d (pid=%ld): read() returned error %d", cp->id, (long)cp->ei->pid, errno);
            syslog(LOG_ERR, buf);
            sleep(1);
            continue;
         }
         if (!final && errno != EAGAIN)
            wlog("job %d (pid=%ld): Failed to read(): %s", cp->id, (long)cp->ei->pid, strerror(errno));
      }


Then send the entries from syslog.

Only do this on your test system. It could introduce unacceptable delays in a production system. If it works, I'll do something more production-ready.
Former Nagios Employee
jfrickson
Developer
 
Posts: 188
Joined: Wed Jul 22, 2015 9:58 am

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Post by MalcolmPreen » Mon Feb 08, 2016 3:48 am

OK, will try to schedule this in....

A couple of questions.... I can't make (as yet) the problem recur on my "live" test system.... I might be able to get it to go on my non-live system.... but I'll need to do a large amount of set-up first.... so it certainly won't be today...

After the edit of worker.c - can you confirm that I need to rebuild and install (make all / make install) as normal ?

When you say "send the entries from syslog" - is there some way I can filter the output ? (I assume you only want nagios syslog entries... ?)
I can guess... but I don't want to "over filter".

I'm also guessing that the output might be quite large... so that the best way will likely be "Upload attachment".

As I said... this will likely take some time to set up... so I'll try and do it in the next day or two...
MalcolmPreen
 
Posts: 63
Joined: Wed Jan 25, 2012 9:21 am

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Post by jfrickson » Mon Feb 08, 2016 10:08 am

MalcolmPreen wrote: OK, will try to schedule this in....

A couple of questions.... I can't make (as yet) the problem recur on my "live" test system.... I might be able to get it to go on my non-live system.... but I'll need to do a large amount of set-up first.... so it certainly won't be today...

No problem. Whenever you can get to it.

MalcolmPreen wrote: After the edit of worker.c - can you confirm that I need to rebuild and install (make all / make install) as normal ?

Yes. make all && make install

MalcolmPreen wrote: When you say "send the entries from syslog" - is there some way I can filter the output ? (I assume you only want nagios syslog entries... ?)
I can guess... but I don't want to "over filter".

I'm also guessing that the output might be quite large... so that the best way will likely be "Upload attachment".

The ident will be nagios, the priority will be LOG_ERR, and each entry will include the text read(). It should only be about 6 lines, so you should be able to just paste it into a message as a quote or code.

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Post by MalcolmPreen » Wed Feb 24, 2016 11:35 am

OK, back from vacation...

Needed to add the line;

#include <syslog.h>

to allow it to compile...

I added it after

#include <time.h>

Any problems with this ?

Hopefully will have some more news in the next couple of days.

Malcolm

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Post by jfrickson » Wed Feb 24, 2016 11:58 am

MalcolmPreen wrote: OK, back from vacation...

Needed to add the line;

#include <syslog.h>

to allow it to compile...

I added it after

#include <time.h>

Any problems with this ?


That should work just fine :)

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Post by MalcolmPreen » Wed Feb 24, 2016 12:40 pm

OK, first attempt done... no major analysis as it's home time...

As per normal... I detached the VPN connection... this time.. I got these errors... but no repeat of the hard loop...

Could the "retry" code be "fixing" the problem?

Code:
[root@tablet log]# tail -f messages|egrep -i "nagios.*job.*read"
Feb 24 17:16:48 tablet nagios: job 43 (pid=9338): read() returned error 11
Feb 24 17:16:52 tablet nagios: job 44 (pid=9567): read() returned error 11
Feb 24 17:17:08 tablet nagios: job 45 (pid=10442): read() returned error 11
Feb 24 17:22:51 tablet nagios: job 80 (pid=25519): read() returned error 11
Feb 24 17:22:53 tablet nagios: job 80 (pid=25672): read() returned error 11
Feb 24 17:22:53 tablet nagios: job 80 (pid=25710): read() returned error 11
Feb 24 17:22:56 tablet nagios: job 81 (pid=25946): read() returned error 11
Feb 24 17:22:57 tablet nagios: job 81 (pid=26095): read() returned error 11
Feb 24 17:22:58 tablet nagios: job 81 (pid=26095): read() returned error 11
Feb 24 17:22:59 tablet nagios: job 82 (pid=26207): read() returned error 11
Feb 24 17:22:59 tablet nagios: job 82 (pid=26238): read() returned error 11
Feb 24 17:23:00 tablet nagios: job 82 (pid=26267): read() returned error 11
Feb 24 17:23:02 tablet nagios: job 82 (pid=26384): read() returned error 11
Feb 24 17:23:03 tablet nagios: job 82 (pid=26480): read() returned error 11
Feb 24 17:24:00 tablet nagios: job 77 (pid=24703): read() returned error 11
Feb 24 17:24:01 tablet nagios: job 77 (pid=24703): read() returned error 11
Feb 24 17:24:02 tablet nagios: job 77 (pid=24703): read() returned error 11
Feb 24 17:24:03 tablet nagios: job 77 (pid=24703): read() returned error 11
Feb 24 17:24:04 tablet nagios: job 77 (pid=24703): Failed to read(): Resource temporarily unavailable
Feb 24 17:24:07 tablet nagios: job 89 (pid=2848): read() returned error 11
Feb 24 17:24:10 tablet nagios: job 89 (pid=2918): read() returned error 11
Feb 24 17:24:11 tablet nagios: job 90 (pid=3043): read() returned error 11
Feb 24 17:26:31 tablet nagios: job 98 (pid=9704): read() returned error 11
Feb 24 17:27:24 tablet nagios: job 102 (pid=12242): read() returned error 11
Feb 24 17:29:13 tablet nagios: job 110 (pid=18043): read() returned error 11
Feb 24 17:29:19 tablet nagios: job 110 (pid=18227): read() returned error 11
Feb 24 17:30:36 tablet nagios: job 114 (pid=21079): read() returned error 11


Since I cut and pasted the above, there have been a few more errors... but no sign as yet of a loop.

Does that help?

I'm not sure how to match up the "job" or "pid" in the output...

Do you need any other logfiles with the above?
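On matching up the "job" and "pid" fields: the log lines come straight from the format strings in the test patch, so they can be picked apart with sscanf(). The sketch below is only an illustration; the struct and function names are made up, and it assumes the syslog line layout shown in the output above. It also prints the error number next to its strerror() text. On this Linux system errno 11 is EAGAIN, which is why job 77 reports error 11 four times and then "Resource temporarily unavailable".

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Fields taken from one syslog line written by the test patch. */
struct read_error {
    int  job;   /* worker job id */
    long pid;   /* pid reported in the message */
    int  err;   /* errno value returned by read() */
};

/*
 * Hypothetical helper: returns 1 and fills *out if the line is one of the
 * "read() returned error N" entries, 0 for anything else (including the
 * final "Failed to read()" line).
 */
static int parse_read_error(const char *line, struct read_error *out)
{
    assert(line != NULL && out != NULL);

    /* Skip the timestamp and hostname that syslog puts in front. */
    const char *p = strstr(line, "nagios: job ");
    if (p == NULL)
        return 0;

    return sscanf(p, "nagios: job %d (pid=%ld): read() returned error %d",
                  &out->job, &out->pid, &out->err) == 3;
}

/* Print one parsed entry together with the text for its errno value. */
static void print_read_error(const struct read_error *e)
{
    printf("job %d pid %ld: errno %d (%s)\n",
           e->job, e->pid, e->err, strerror(e->err));
}
```

Feeding each line of the log through parse_read_error() and grouping on job and pid makes it easy to see which entries belong to the same read attempt sequence, for example the run of entries for job 77 (pid=24703).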

Malcolm

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Post by hsmith » Wed Feb 24, 2016 6:10 pm

I believe John is already gone for the day, but I'll reach out to him tomorrow and see if he has any input.
Former Nagios Employee.
me.
hsmith
Agent Smith
 
Posts: 3539
Joined: Thu Jul 30, 2015 11:09 am
Location: 127.0.0.1

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Post by MalcolmPreen » Thu Feb 25, 2016 6:16 am

Thanks.... I've done a few tests this morning;

(a) following the forced VPN failure last night.... and the one " Resource temporarily unavailable " error appears to have prevented the hard loop. The monitoring continued overnight without any side effects.

(b) I "tweaked" the retry variable.... "just in case" it worked after a longer delay.... no change... except it takes a little longer to "fail"

(c) Without the forced failure, we get "some" EAGAIN (errno=11) errors.... but normally no more than 2 before the task succeeds.

When the worker task "times out" and executes the
Code:
            if (--retry == 0) {
               sprintf(buf, "job %d (pid=%ld): Failed to read(): %s", cp->id, (long)cp->ei->pid, strerror(errno));
               syslog(LOG_ERR, buf);
               break;
            }


Does that "re-queue" the job ? Or is that task lost ?

On a positive note, it doesn't appear to affect the running of the tablet....

As yesterday, if there is anything more you need, I'll do what I can to collect it.

Malcolm

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Post by jfrickson » Thu Feb 25, 2016 5:02 pm

MalcolmPreen wrote: (a) following the forced VPN failure last night.... and the one " Resource temporarily unavailable " error appears to have prevented the hard loop. The monitoring continued overnight without any side effects.

(b) I "tweaked" the retry variable.... "just in case" it worked after a longer delay.... no change... except it takes a little longer to "fail"

(c) Without the forced failure, we get "some" EAGAIN (errno=11) errors.... but normally no more than 2 before the task succeeds.


That's good news! It looks like we fixed the problem.

MalcolmPreen wrote: When the worker task "times out" and executes the
Code:
            if (--retry == 0) {
               sprintf(buf, "job %d (pid=%ld): Failed to read(): %s", cp->id, (long)cp->ei->pid, strerror(errno));
               syslog(LOG_ERR, buf);
               break;
            }


Does that "re-queue" the job ? Or is that task lost ?


If five attempts don't work, then it's gone. You can bump up the number of retries if you start getting failures, but I would be really surprised if five attempts (roughly four seconds of waiting, given the one-second sleep between them) fail but a higher number works.

MalcolmPreen wrote: On a positive note, it doesn't appear to affect the running of the tablet....

As yesterday, if there is anything more you need, I'll do what I can to collect it.

Malcolm


Thanks! I'll be going over this in more detail, and probably putting in a fix that is very close to what I had you do.

If you want, you can comment out the calls to syslog (or not) and deploy the program to your tablets.

Thanks for your help!

Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")

Post by MalcolmPreen » Fri Feb 26, 2016 4:04 am

OK, that's good news.

Will there be another "release" (4.1.2-Pre2?) before the official release ? Or is my only option to implement 4.1.2-Pre1 plus this fix.... (or wait patiently :-) )

If you have an "official fix" you want me to try, then please let me know.

Malcolm
