Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")
-
- Posts: 63
- Joined: Wed Jan 25, 2012 9:21 am
Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")
Quick(ish) background info:
Our set-up has a number of issues (the main one being "worker processes being choked"), which are fixed by code which will be included in Nagios 4.1.2 - which is not available for a couple of months.
The remedy to these issues is either to launch MANY (about 256 !!) worker processes... which is putting severe load onto the netbooks on which nagios is installed on the customer sites.... and possibly causing other issues.
OR
Upgrade to 4.1.2-Pre1
Due to the load issue, we opted to try 4.1.2-Pre1 on a couple of live, but "not quite as busy" netbooks.
Each of the netbooks on the customer sites, reports host/service data back to "the centre" using nsca messages.
Now to the problem
Since we upgraded the netbooks to 4.1.2-Pre1, we noticed that "sometimes" (normally at least 12 hours after nagios start), the services and hosts at the centre would start to be marked as STALE (basically, no nsca message received from the netbook).
The first occurrence of this was out of hours, so the decision was made to restart nagios on the netbook.... which restored the monitoring service... but doesn't help to diagnose the problem....
Last night, the situation arose again.... and what I saw was as follows;
ps -ef | grep "nagios -" (to show the -d process and all of the --workers)
This showed all of the expected process... but the "C" column, whilst normally almost 0, was about 40 for one of the worker processes.... and also for the "core" (nagios -d) process. This raised figure remained in place consistently... until I took remedial action (see below).
The other 15 worker processes didn't get involved - so I'm assuming that the core process was no longer dispatching - in effect the two processes seemed to be in a loop.
I didn't want to kill the core (nagios -d), in case the problem was recoverable... so I attempted to kill the worker process which was "looping".
As root, from the command line, I issued;
kill -1 PID
kill PID
kill -9 PID
Nothing happened until the kill -9, at which time the process changed from "normal" to;
nagios 19810 19803 2 Jan26 ? 00:49:18 [nagios] <defunct>
Shortly after this, the C column on this process... and the nagios -d process reduced.... not to 0.... but "close".... and the STALE statuses started to reduce (I assume as the core process was now freed to be able to dispatch the other worker processes).
This was about 21 hours ago now.... and apart from being 1 worker process down... the system appears to be running normally.
I suspect I can re-start nagios to recover "complete normality" if required.... but given that things are running normally, I have left the scenario in place, in case any diagnostic information can be gleaned from the current situation.
So... to the questions;
- should the problem recur, what SHOULD I do to collect information to assist debugging ? [whilst it is a production system.... we fully accept that we are running a pre-release version, so are happy to collect any required information to assist the resolution of this problem... - just let me know]
- is it possible to determine which service the nagios worker process is running at any time? (specifically when it loops!! - but anytime would be interesting)
Any advice greatly appreciated...
Thanks, Malcolm
Our set-up has a number of issues (the main one being "worker processes being choked"), which are fixed by code which will be included in Nagios 4.1.2 - which is not available for a couple of months.
The remedy to these issues is either to launch MANY (about 256 !!) worker processes... which is putting severe load onto the netbooks on which nagios is installed on the customer sites.... and possibly causing other issues.
OR
Upgrade to 4.1.2-Pre1
Due to the load issue, we opted to try 4.1.2-Pre1 on a couple of live, but "not quite as busy" netbooks.
Each of the netbooks on the customer sites, reports host/service data back to "the centre" using nsca messages.
Now to the problem
Since we upgraded the netbooks to 4.1.2-Pre1, we noticed that "sometimes" (normally at least 12 hours after nagios start), the services and hosts at the centre would start to be marked as STALE (basically, no nsca message received from the netbook).
The first occurrence of this was out of hours, so the decision was made to restart nagios on the netbook.... which restored the monitoring service... but doesn't help to diagnose the problem....
Last night, the situation arose again.... and what I saw was as follows;
ps -ef | grep "nagios -" (to show the -d process and all of the --workers)
This showed all of the expected process... but the "C" column, whilst normally almost 0, was about 40 for one of the worker processes.... and also for the "core" (nagios -d) process. This raised figure remained in place consistently... until I took remedial action (see below).
The other 15 worker processes didn't get involved - so I'm assuming that the core process was no longer dispatching - in effect the two processes seemed to be in a loop.
I didn't want to kill the core (nagios -d), in case the problem was recoverable... so I attempted to kill the worker process which was "looping".
As root, from the command line, I issued;
kill -1 PID
kill PID
kill -9 PID
Nothing happened until the kill -9, at which time the process changed from "normal" to;
nagios 19810 19803 2 Jan26 ? 00:49:18 [nagios] <defunct>
Shortly after this, the C column on this process... and the nagios -d process reduced.... not to 0.... but "close".... and the STALE statuses started to reduce (I assume as the core process was now freed to be able to dispatch the other worker processes).
This was about 21 hours ago now.... and apart from being 1 worker process down... the system appears to be running normally.
I suspect I can re-start nagios to recover "complete normality" if required.... but given that things are running normally, I have left the scenario in place, in case any diagnostic information can be gleaned from the current situation.
So... to the questions;
- should the problem recur, what SHOULD I do to collect information to assist debugging ? [whilst it is a production system.... we fully accept that we are running a pre-release version, so are happy to collect any required information to assist the resolution of this problem... - just let me know]
- is it possible to determine which service the nagios worker process is running at any time? (specifically when it loops!! - but anytime would be interesting)
Any advice greatly appreciated...
Thanks, Malcolm
Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")
There are a couple of things you could try:
1) Turn on debugging in nagios.cfg. Set debug_level to -1 and debug_verbosity to 2. Note: this is guaranteed to generate huge amounts of logging.
2) When a process gets in a loop, check it using gdb as follows:
Then post the results of the backtrace (bt command) here. That should give me a real good idea where to look for the problem.
If you're familiar with gdb, you could also do a little poking around to see what values variables have, what command is being processed, etc.
1) Turn on debugging in nagios.cfg. Set debug_level to -1 and debug_verbosity to 2. Note: this is guaranteed to generate huge amounts of logging.
2) When a process gets in a loop, check it using gdb as follows:
Code: Select all
# ps uw -C nagios
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
nagios 8828 0.0 0.0 18636 2860 ? Ss 10:55 0:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
nagios 8829 0.0 0.0 12932 2624 ? S 10:55 0:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagi
nagios 8831 25.3 0.0 12932 2624 ? S 10:55 0:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagi
nagios 8835 0.0 0.0 18120 1624 ? S 10:55 0:00 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
# gdb -p 8831
. . . a couple of pages of gdb stuff . . .
(gdb) bt
#0 0x00007f2f5ae63643 in __epoll_wait_nocancel () from /lib64/libc.so.6
#1 0x0000000000472b80 in iobroker_poll (iobs=0x108c040, timeout=timeout@entry=30004)
at iobroker.c:337
#2 0x000000000047595e in enter_worker (sd=sd@entry=3, cb=<optimized out>)
at worker.c:830
#3 0x0000000000416166 in nagios_core_worker (
path=0x7ffc391d2ed0 "/usr/local/nagios/var/rw/nagios.qh") at nagios.c:182
#4 main (argc=<optimized out>, argv=0x7ffc391d2c48) at nagios.c:318
(gdb) detach
Detaching from program: /usr/local/nagios/bin/nagios, process 8829
(gdb) q
#
If you're familiar with gdb, you could also do a little poking around to see what values variables have, what command is being processed, etc.
-
- Posts: 63
- Joined: Wed Jan 25, 2012 9:21 am
Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")
Thanks for the speedy response.... we've actually had two repeats this morning...
First time.... gdb wasn't yet installed
I issued a kill -1 (only), and the worker process went defunct.... monitoring resumed...
Second time... I have gdb output
Hope that helps... if there are any other gdb commands needed.... just let me know...
I don't have the debug turned on.... its in the nagios.cfg... but I've not restarted yet....
I'm not yet familiar with gdb.... so will have an experiment when I get a chance....
In the meantime.... my plan... (in order to keep the live system as live as possible).... is to;
write a script...
to detect a problem process.... run gdb to collect the requested info out to a file.... kill -1 the process
Then I need to make sure I don't run out of workers...
Thanks for your help
Malcolm
First time.... gdb wasn't yet installed
I issued a kill -1 (only), and the worker process went defunct.... monitoring resumed...
Second time... I have gdb output
Code: Select all
Script started on Thu 28 Jan 2016 14:05:13 GMT
root@tablet# gdb -p 19807
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-83.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "i686-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Attaching to process 19807
Reading symbols from /usr/local/nagios/bin/nagios...(no debugging symbols found)...done.
Reading symbols from /lib/libm.so.6...Reading symbols from /usr/lib/debug/lib/libm-2.12.so.debug...done.
done.
Loaded symbols for /lib/libm.so.6
Reading symbols from /lib/libdl.so.2...Reading symbols from /usr/lib/debug/lib/libdl-2.12.so.debug...done.
done.
Loaded symbols for /lib/libdl.so.2
Reading symbols from /lib/libc.so.6...Reading symbols from /usr/lib/debug/lib/libc-2.12.so.debug...done.
done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux.so.2...Reading symbols from /usr/lib/debug/lib/ld-2.12.so.debug...done.
done.
Loaded symbols for /lib/ld-linux.so.2
0x0034f424 in __kernel_vsyscall ()
(gdb) bt
#0 0x0034f424 in __kernel_vsyscall ()
#1 0x002288d3 in __read_nocancel () at ../sysdeps/unix/syscall-template.S:82
#2 0x080bd522 in ?? ()
#3 0x080bd680 in finish_job ()
#4 0x080bd4b5 in ?? ()
#5 0x080bd5c2 in ?? ()
#6 0x080bdcb3 in ?? ()
#7 0x080bad43 in iobroker_poll ()
#8 0x080be131 in enter_worker ()
#9 0x08058ec8 in main ()
(gdb) detach
Detaching from program: /usr/local/nagios/bin/nagios, process 19807
(gdb) quit
root@tablet:# kill -1 19807
root@tablet:# exit
Script done on Thu 28 Jan 2016 14:09:19 GMT
I don't have the debug turned on.... its in the nagios.cfg... but I've not restarted yet....
I'm not yet familiar with gdb.... so will have an experiment when I get a chance....
In the meantime.... my plan... (in order to keep the live system as live as possible).... is to;
write a script...
to detect a problem process.... run gdb to collect the requested info out to a file.... kill -1 the process
Then I need to make sure I don't run out of workers...
Thanks for your help
Malcolm
Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")
Ok, it looks like you got bit by a previously reported bug. You have two options:
1) You can patch the program. Go into lib/worker.c and find the function gather_output. A few lines from the top of the function you will see this:
Add one additional check so it looks like this:
Then recompile and install.
2) The other option is to download and install version 4.1.2Pre1. The repository is at https://github.com/NagiosEnterprises/na ... 4.1.2-Pre1 or you can download a zip file at https://github.com/NagiosEnterprises/na ... 2-Pre1.zip
Either option should fix your problem. Let us know if you need any help, and if you get it working properly.
1) You can patch the program. Go into lib/worker.c and find the function gather_output. A few lines from the top of the function you will see this:
Code: Select all
rd = read(io->fd, buf, sizeof(buf));
if (rd < 0) {
if (errno == EINTR || errno == EAGAIN)
continue;
if (!final && errno != EAGAIN)
wlog("job %d (pid=%ld): Failed to read(): %s", cp->id, (long)cp->ei->pid, strerror(errno));
}
Code: Select all
rd = read(io->fd, buf, sizeof(buf));
if (rd < 0) {
if (errno == EAGAIN && !final)
break;
if (errno == EINTR || errno == EAGAIN)
continue;
if (!final && errno != EAGAIN)
wlog("job %d (pid=%ld): Failed to read(): %s", cp->id, (long)cp->ei->pid, strerror(errno));
}
2) The other option is to download and install version 4.1.2Pre1. The repository is at https://github.com/NagiosEnterprises/na ... 4.1.2-Pre1 or you can download a zip file at https://github.com/NagiosEnterprises/na ... 2-Pre1.zip
Either option should fix your problem. Let us know if you need any help, and if you get it working properly.
-
- Posts: 63
- Joined: Wed Jan 25, 2012 9:21 am
Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")
Ah.... but... I'm already using Nagios 4.1.2-Pre1 (ref Subject).
Just to be sure, I just checked lib/worker/c and I have;
So... if it is the same problem... it isn't fixed by this change...
Sorry, Malcolm
Just to be sure, I just checked lib/worker/c and I have;
Code: Select all
rd = read(io->fd, buf, sizeof(buf));
if (rd < 0) {
if (errno == EAGAIN && !final)
break;
if (errno == EINTR || errno == EAGAIN)
continue;
if (!final && errno != EAGAIN)
wlog("job %d (pid=%ld): Failed to read(): %s", cp->id, (long)cp->ei->pid, strerror(errno));
}
Sorry, Malcolm
Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")
Oops! I got so excited when I thought I found it, I forgot to double-check what you're running. Silly me!MalcolmPreen wrote:Ah.... but... I'm already using Nagios 4.1.2-Pre1 (ref Subject).
Sorry, Malcolm
Ok, when it happens again, do a ps -axlf to see what's running as a child of the busy worker process. Note the PID's. Wait a while and run the ps again. Eventually, you should end up with just a single child that doesn't go away. Let me know what you end up with.
-
- Posts: 63
- Joined: Wed Jan 25, 2012 9:21 am
Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")
No problem... thanks for your help so far.
Just as an update.... in preparation for the weekend I have set up a script which has the following logic;
is any worker process busy (ps -ef column 4 non-zero) ?
if not, do nothing...
if so, wait 30s and check again... has elapsed time increased (column 7)... if so this COULD be a looper (this may need tuning)
collect gdb -p PID... and ps -axlf... wait 60s... ps -axlf... wait 60s... ps -axlf.... wait 60s... ps -axlf.... kill -1 PID
all this output goes to /tmp/gdb.PID.txt
The script runs, as root, every 5 minutes from cron....
Hopefully, this will collect the required data.... whilst maintaining full monitoring performance...
Of course... since I implemented this.... (and turned debug on as specified previously) there have been no repeats (maybe this post will cause one!!)
If it survives over the weekend.... I plan to remove debug.... in case its a timing issue....
Thanks for your help so far, and hope the above makes sense
Malcolm
Just as an update.... in preparation for the weekend I have set up a script which has the following logic;
is any worker process busy (ps -ef column 4 non-zero) ?
if not, do nothing...
if so, wait 30s and check again... has elapsed time increased (column 7)... if so this COULD be a looper (this may need tuning)
collect gdb -p PID... and ps -axlf... wait 60s... ps -axlf... wait 60s... ps -axlf.... wait 60s... ps -axlf.... kill -1 PID
all this output goes to /tmp/gdb.PID.txt
The script runs, as root, every 5 minutes from cron....
Hopefully, this will collect the required data.... whilst maintaining full monitoring performance...
Of course... since I implemented this.... (and turned debug on as specified previously) there have been no repeats (maybe this post will cause one!!)
If it survives over the weekend.... I plan to remove debug.... in case its a timing issue....
Thanks for your help so far, and hope the above makes sense
Malcolm
Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")
That sounds like it should work. /me crossing fingers and hoping for good output
-
- Posts: 63
- Joined: Wed Jan 25, 2012 9:21 am
Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")
OK... a minor update...
It survived over the weekend.... so I've removed debug....
We had one recurrence yesterday.... from what I can see it was around the time of a VPN failure (the netbook reports the status of its checks to two central servers.... one is in the DMZ... and the other is internal... so requires a VPN connection), and it appeared to be running the "submit nsca report" script.... basically a wrapper around a number of send_nsca commands.
The script to "detect" and "tidy" the error worked, ish, in that I'd left it in "test mode" - so it didn't actually do the kill -1 !!!
I amended the script, and it kicked it....
I'm a little confused, as send_nsca has an inbuilt timeout.... (which I have upped to 120s due to historic issues.... this can be reduced if needed).
Does it make any sense that send_nsca should go into a loop if the network it is using should fail as it is using it ? I would have thought the message would be lost, and the script would just continue after "timeout" seconds ??
Is there anything I can do to debug send_nsca ?
How should I best "kill" it, if the problem recurs in the same way ?
Thanks, Malcolm
It survived over the weekend.... so I've removed debug....
We had one recurrence yesterday.... from what I can see it was around the time of a VPN failure (the netbook reports the status of its checks to two central servers.... one is in the DMZ... and the other is internal... so requires a VPN connection), and it appeared to be running the "submit nsca report" script.... basically a wrapper around a number of send_nsca commands.
The script to "detect" and "tidy" the error worked, ish, in that I'd left it in "test mode" - so it didn't actually do the kill -1 !!!
I amended the script, and it kicked it....
I'm a little confused, as send_nsca has an inbuilt timeout.... (which I have upped to 120s due to historic issues.... this can be reduced if needed).
Does it make any sense that send_nsca should go into a loop if the network it is using should fail as it is using it ? I would have thought the message would be lost, and the script would just continue after "timeout" seconds ??
Is there anything I can do to debug send_nsca ?
How should I best "kill" it, if the problem recurs in the same way ?
Thanks, Malcolm
Re: Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")
I've reached out to jfrickson, we're currently having some inclement weather conditions in the Twin Cities, so I apologize for any potential delay in answer. I'll check back tomorrow afternoon.
Former Nagios Employee.
me.
me.