Problem with Nagios 4.1.2-Pre1 (worker processes dying)
Posted: Wed Jan 27, 2016 10:21 am
This problem occurred on Nagios 4.0.8 and 4.1.1, as well as on my current set-up, 4.1.2-Pre1.
Originally I was running CentOS 6.4 (Final), but upgrading to CentOS 6.7 (Final) made no difference.
One of our customer sites is monitored by a netbook, which reports statuses back to "the centre" using NSCA.
This particular customer has moved sites a number of times and has changed network details as well. The most recent change meant that a number of the hosts and services originally being monitored no longer exist.
As this customer was not my primary focus, I updated the IP addresses that had changed and left the others "down", to deal with at a later date.
Shortly after this, I noticed that every so often ALL of the services from this customer would go stale, and when I logged in I discovered that Nagios was no longer running on this netbook.
More detailed analysis showed that the worker processes appeared to be dying one by one. I don't have the EXACT text, but the nagios.log file contained something along the lines of:
wproc: Socket to worker Core Worker XXXXX broken, removing
And sure enough, PID XXXXX was gone following this message.
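For anyone wanting to check their own logs for the same symptom, here is a minimal Python sketch that pulls out those lines. It assumes the message wording quoted above is close to the real thing (it is from memory, so the pattern may need adjusting) and that each log line starts with the usual `[epoch-seconds]` timestamp. The function name `find_dead_workers` and the log path in the usage comment are only illustrative.

```python
import re
from datetime import datetime, timezone

# Assumed line shape: "[<epoch>] wproc: Socket to worker Core Worker <pid> broken, removing"
# The wording is taken from the approximate message quoted in the post.
PATTERN = re.compile(
    r"^\[(\d+)\]\s+wproc: Socket to worker Core Worker (\d+) broken, removing"
)

def find_dead_workers(lines):
    """Return a list of (UTC timestamp, worker PID) for each matching log line."""
    events = []
    for line in lines:
        match = PATTERN.match(line)
        if match:
            when = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
            events.append((when, int(match.group(2))))
    return events

# Usage (the path is an assumption; adjust it to wherever your nagios.log lives):
# with open("/usr/local/nagios/var/nagios.log") as log:
#     for when, pid in find_dead_workers(log):
#         print(when.isoformat(), pid)
```

Comparing the timestamps against the check results logged just before each one might help narrow down whether a particular host or service is involved, though that is untested here.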
Given that no other netbooks were doing this, and that I'd read something (on here, I think) about memory buffers causing workers to die, I put aside a day to "tidy up" the netbook.
After commenting out a significant portion of the "removed" systems and services (not all, but more than 50%), I noticed that we lost no more workers overnight.
I don't know which service was killing the worker processes (or whether it was just an overall load thing).
So, in summary: there was a problem, which is no longer affecting me, but I figured that my findings above MIGHT help someone in the future.
I can probably resurrect the problem if some specific debugging would be useful, but this will have to be scheduled, so I can't guarantee time scales.
Let me know
Malcolm