Problem with Nagios 4.1.2-Pre1 (nagios "busy loop")
Posted: Wed Jan 27, 2016 9:58 am
Quick(ish) background info:
Our set-up has a number of issues (the main one being "worker processes being choked"), which are fixed by code that will be included in Nagios 4.1.2. That release will not be available for a couple of months.
The remedy to these issues is either to launch MANY (about 256 !!) worker processes, which puts severe load on the netbooks running nagios at the customer sites and possibly causes other issues, OR to upgrade to 4.1.2-Pre1.
Due to the load issue, we opted to try 4.1.2-Pre1 on a couple of live, but "not quite as busy" netbooks.
Each of the netbooks on the customer sites reports host/service data back to "the centre" using nsca messages.
Now to the problem
Since we upgraded the netbooks to 4.1.2-Pre1, we have noticed that "sometimes" (normally at least 12 hours after nagios starts), the services and hosts at the centre start to be marked as STALE (basically, no nsca message received from the netbook).
The first occurrence of this was out of hours, so the decision was made to restart nagios on the netbook. That restored the monitoring service, but did not help to diagnose the problem.
Last night, the situation arose again, and what I saw was as follows:
ps -ef | grep "nagios -" (to show the -d process and all of the --workers)
This showed all of the expected processes, but the "C" column (processor utilisation), whilst normally almost 0, was about 40 for one of the worker processes and also for the "core" (nagios -d) process. This raised figure remained in place consistently until I took remedial action (see below).
The other 15 worker processes didn't get involved - so I'm assuming that the core process was no longer dispatching - in effect the two processes seemed to be in a loop.
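For anyone wanting to spot the same symptom without eyeballing the listing, here is a rough sketch of a filter over the ps -ef output. It assumes the standard ps -ef column order (UID PID PPID C STIME TTY TIME CMD); the function name hot_nagios and the default threshold of 20 are my own inventions, not anything from Nagios.

```shell
#!/bin/sh
# Hypothetical helper: read "ps -ef"-style lines on stdin and print
# "PID PPID C" for nagios processes whose C column is at or above a threshold.
# Assumed column order: UID PID PPID C STIME TTY TIME CMD.
hot_nagios() {
    threshold=${1:-20}
    # index() matches the same literal text as the grep "nagios -" used above.
    # The header line is skipped because "C" + 0 evaluates to 0.
    awk -v t="$threshold" \
        'index($0, "nagios -") && $4 + 0 >= t { print $2, $3, $4 }'
}

# Usage: ps -ef | hot_nagios 20
```

Run from cron every few minutes, something like this could at least record when the loop started.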
I didn't want to kill the core (nagios -d), in case the problem was recoverable... so I attempted to kill the worker process which was "looping".
As root, from the command line, I issued:
kill -1 PID
kill PID
kill -9 PID
Nothing happened until the kill -9, at which point the process changed from "normal" to:
nagios 19810 19803 2 Jan26 ? 00:49:18 [nagios] <defunct>
Shortly after this, the C column for this process and for the nagios -d process dropped, not to 0 but "close", and the STALE statuses started to reduce (I assume because the core process was now free to dispatch to the other worker processes).
This was about 21 hours ago now.... and apart from being 1 worker process down... the system appears to be running normally.
I suspect I can re-start nagios to recover "complete normality" if required.... but given that things are running normally, I have left the scenario in place, in case any diagnostic information can be gleaned from the current situation.
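In case it is useful, the HUP / TERM / KILL escalation I did by hand could be scripted roughly as below. This is only a sketch: it assumes a procps-style ps that accepts -o stat= -p, and the function names and the grace period are made up. It treats a defunct (zombie) process as "gone", since that is the state my worker ended up in after the kill -9.

```shell
#!/bin/sh
# Hypothetical helper: succeed (return 0) only if the PID exists and is
# not a zombie. Assumes a procps-style ps.
is_running() {
    state=$(ps -o stat= -p "$1" 2>/dev/null | tr -d ' ')
    case "$state" in
        ""|Z*) return 1 ;;   # no such process, or defunct
        *)     return 0 ;;
    esac
}

# Hypothetical helper: send SIGHUP, then SIGTERM, then SIGKILL, waiting
# a grace period (in seconds, default 5) after each signal.
escalate_kill() {
    pid=$1
    grace=${2:-5}
    for sig in HUP TERM KILL; do
        is_running "$pid" || return 0
        echo "sending SIG$sig to $pid"
        kill -s "$sig" "$pid" 2>/dev/null
        sleep "$grace"
    done
    # Report failure if the process survived even SIGKILL.
    is_running "$pid" && return 1
    return 0
}

# Usage: escalate_kill 19810 5
```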
So... to the questions:
- should the problem recur, what SHOULD I do to collect information to assist debugging? [whilst it is a production system, we fully accept that we are running a pre-release version, so we are happy to collect any information required to help resolve this problem - just let me know]
- is it possible to determine which service the nagios worker process is running at any time? (specifically when it loops!! - but anytime would be interesting)
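To make the questions above concrete, this is the sort of snapshot I could take next time before touching anything. It is a rough sketch only: it assumes Linux (/proc) and a procps-style ps, and the function name and output directory are invented. I am also assuming that a worker's direct child processes are the plugins it is currently running, which I have not confirmed.

```shell
#!/bin/sh
# Hypothetical helper: save a snapshot of one process into a new directory
# and print that directory's path. Assumes Linux /proc and procps ps.
snapshot_pid() {
    pid=$1
    out=${2:-/tmp/nagios-debug}/$pid-$(date +%Y%m%d-%H%M%S)
    mkdir -p "$out" || return 1

    # State, CPU usage and command line of the process itself.
    ps -o pid,ppid,stat,pcpu,etime,args -p "$pid" > "$out/ps.txt" 2>&1

    # Direct children (assumption: for a worker, the checks it is running).
    ps -o pid,ppid,stat,etime,args --ppid "$pid" > "$out/children.txt" 2>&1

    # Kernel-side view: process status and open file descriptors.
    cat "/proc/$pid/status" > "$out/status.txt" 2>&1
    ls -l "/proc/$pid/fd" > "$out/fd.txt" 2>&1

    echo "$out"
}

# Usage: snapshot_pid 19810; snapshot_pid 19803
```

If strace or gdb are installed on the netbook, attaching one of them to the looping PID would presumably show more, but I would like to hear what the developers actually need before guessing.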
Any advice greatly appreciated...
Thanks, Malcolm