Nagios Support Forum

Posted: **Wed Jan 27, 2016 10:21 am**

This problem occurred previously on Nagios 4.0.8 and 4.1.1, and it also occurs on my current set-up, 4.1.2-Pre1.
Originally I was running CentOS 6.4 (Final), but upgrading to CentOS 6.7 (Final) made no difference.

One of our customer sites is monitored by a netbook... which reports statuses back to "the centre" using NSCA.

This particular customer has moved sites / locations a number of times, as well as changing network details... the most recent change resulted in a number of the hosts / services which were originally being monitored no longer existing.

As this customer was not my primary focus, I updated the IP addresses that had changed, and left the other hosts "down".... "to deal with at a later date".

Shortly after this, I noticed that every so often, ALL of the services from this customer would be stale.... and when logging in, I discovered that nagios was not running any more on this netbook.

More detailed analysis showed that, one by one, the worker processes appeared to be dying. I don't have the EXACT text, but the nagios.log file contained something along the lines of:

wproc: Socket to worker Core Worker XXXXX broken, removing

And sure enough PID XXXXX was gone following this message.
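For anyone wanting to check their own log for the same symptom, here is a minimal sketch that pulls the timestamp and worker PID out of lines like the one quoted above. The pattern is an assumption built from that from-memory wording plus the usual `[epoch]` prefix on nagios.log lines, so it may need adjusting to match the exact text in a real log. The function name `find_dead_workers` is made up for illustration.

```python
import re

# Assumed pattern: based on the message as quoted from memory in the post,
# with the "[epoch-seconds]" prefix that nagios.log lines normally carry.
# Adjust it if the real wording in your log differs.
WORKER_DEATH = re.compile(
    r"^\[(\d+)\]\s+wproc: Socket to worker Core Worker (\d+) broken, removing"
)


def find_dead_workers(lines):
    """Return a list of (timestamp, pid) tuples, one per matching log line."""
    dead = []
    for line in lines:
        match = WORKER_DEATH.match(line)
        if match:
            dead.append((int(match.group(1)), int(match.group(2))))
    return dead
```

To use it, open the log and pass the file object in, for example `find_dead_workers(open(path, errors="replace"))`, where `path` is wherever your nagios.log lives (the location varies by install). Comparing the returned timestamps against the times of checks on the "down" hosts might help narrow down whether one specific check is responsible.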

Given that no other netbooks were doing this.... and I'd read something (on here I think) about memory buffers causing workers to die.... I put aside a day to "tidy up" the netbook.

Having commented out a significant portion of the "removed" systems / services (not all.... but more than 50%), I noticed that overnight we had lost no more workers....

I don't know what service it was which killed the worker processes (or if it was just an overall load thing).

So, in summary.... there was a problem.... which is no longer affecting me.... but I figured that my findings above MIGHT help someone in the future.

I can probably resurrect the problem if some specific debugging can be done.... but this will have to be scheduled... so I can't guarantee time scales.

Let me know

Malcolm

Posted: **Wed Jan 27, 2016 4:37 pm**

Hi Malcolm,

John left some good debugging instructions in your other post. Check them out!

https://support.nagios.com/forum/viewto ... =7&t=36767

Posted: **Thu Jan 28, 2016 7:06 am**

Thanks... working on that now... gdb is now installed... all I have to do is get proficient in it!!

Malcolm

Posted: **Thu Jan 28, 2016 11:18 am**

Let us know if you run into anything we can help you with. Keep in mind that the support team hasn't really touched 4.1.2-Pre1 yet, so you may run into issues we have not come across before.

Posted: **Fri Feb 05, 2016 6:22 am**

Just to confirm, this particular system has been stable for over a week now.

It has only 4 worker processes, and they have racked up quite a bit of CPU time.... (but nothing excessive as per my other active post).

All four processes are still active... and the monitoring is working as expected.

So... the multitude of errors (or perhaps one specific error?) appears to have been causing "worker death".

As per the initial post.... I'll try and find time to investigate... but I can't promise when... so don't hold your breath...

Malcolm

Posted: **Fri Feb 05, 2016 1:17 pm**

MalcolmPreen wrote: so don't hold your breath...

I'll do my best.

Problem with Nagios 4.1.2-Pre1 (worker processes dying)
