Problem with Nagios 4.1.2-Pre1 (worker processes dying)

Posted: Wed Jan 27, 2016 10:21 am
by MalcolmPreen
This problem was happening previously on Nagios 4.0.8 and 4.1.1, as well as on my current set-up, 4.1.2-Pre1.
Originally I was running CentOS 6.4 (Final), but upgrading to CentOS 6.7 (Final) made no difference.

One of our customer sites is monitored by a netbook... which reports statuses back to "the centre" using NSCA.

This particular customer has moved sites / locations a number of times, as well as changing network details... the most recent change resulted in a number of the hosts / services which were originally being monitored no longer existing.

As this customer was not my primary focus, I updated the IP addresses where they had changed, and left the other hosts "down".... "to deal with at a later date".

Shortly after this, I noticed that every so often, ALL of the services from this customer would be stale.... and when logging in, I discovered that nagios was not running any more on this netbook.

More detailed analysis showed that, one by one, the worker processes appeared to be dying. I don't have the EXACT text, but the nagios.log file contained something along the lines of:

wproc: Socket to worker Core Worker XXXXX broken, removing

And sure enough PID XXXXX was gone following this message.
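
For anyone wanting to check their own logs for the same symptom, here is a minimal sketch that scans nagios.log lines for worker-removal messages and lists the PIDs involved. The pattern is an assumption built from the approximate message quoted above (the exact wording on a given version may differ), and the function names are purely illustrative.

```python
import re
from datetime import datetime, timezone

# ASSUMPTION: this pattern follows the approximate message quoted in the post;
# adjust it if your nagios.log words the message differently.
# The leading "[epoch]" timestamp is treated as optional.
WORKER_LOST = re.compile(
    r"^(?:\[(\d+)\]\s+)?wproc: Socket to worker Core Worker (\d+) broken, removing"
)


def find_dead_workers(lines):
    """Return (epoch_or_None, pid) for every worker-removal line found."""
    events = []
    for line in lines:
        match = WORKER_LOST.search(line)
        if match:
            epoch = int(match.group(1)) if match.group(1) else None
            events.append((epoch, int(match.group(2))))
    return events


def scan_log(path):
    """Read a log file (hypothetical helper) and print one line per lost worker."""
    with open(path, errors="replace") as handle:
        for epoch, pid in find_dead_workers(handle):
            when = (
                datetime.fromtimestamp(epoch, tz=timezone.utc).isoformat()
                if epoch is not None
                else "unknown time"
            )
            print(f"{when}: worker PID {pid} removed")
```

Calling scan_log() with the path to your own nagios.log (the location depends on how Nagios was installed) shows when each worker was lost, which may help line the deaths up with particular checks.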

Given that no other netbooks were doing this.... and I'd read something (on here I think) about memory buffers causing workers to die.... I put aside a day to "tidy up" the netbook.

Having commented out a significant portion of the "removed" systems / services (not all.... but more than 50%), I noticed that overnight we had lost no more workers....

I don't know what service it was which killed the worker processes (or if it was just an overall load thing).

So, in summary.... there was a problem.... which is no longer affecting me.... but I figured that my findings above MIGHT help someone in the future.

I can probably resurrect the problem if some specific debugging can be done.... but this will have to be scheduled... so I can't guarantee time scales.

Let me know

Malcolm

Re: Problem with Nagios 4.1.2-Pre1 (worker processes dying)

Posted: Wed Jan 27, 2016 4:37 pm
by hsmith
Hi Malcolm,

John left some good debugging instructions in your other post. Check them out!

https://support.nagios.com/forum/viewto ... =7&t=36767

Re: Problem with Nagios 4.1.2-Pre1 (worker processes dying)

Posted: Thu Jan 28, 2016 7:06 am
by MalcolmPreen
Thanks... working on that now... gdb is now installed... all I have to do is get proficient in it !!

Malcolm

Re: Problem with Nagios 4.1.2-Pre1 (worker processes dying)

Posted: Thu Jan 28, 2016 11:18 am
by hsmith
Let us know if you run into anything we can potentially help you with. Keep in mind 4.1.2-Pre1 hasn't really been touched by the support team yet, so you may run into issues we have not come across yet.

Re: Problem with Nagios 4.1.2-Pre1 (worker processes dying)

Posted: Fri Feb 05, 2016 6:22 am
by MalcolmPreen
Just to confirm, this particular system has been stable for over a week now.

It has only 4 worker processes, and they have racked up quite a bit of CPU time.... (but nothing excessive as per my other active post).

All four processes are still active... and the monitoring is working as expected.

So... the multitude of errors (or perhaps one specific error?) appears to have been causing "worker death".

As per the initial post.... I'll try and find time to investigate... but I can't promise when... so don't hold your breath...

Malcolm

Re: Problem with Nagios 4.1.2-Pre1 (worker processes dying)

Posted: Fri Feb 05, 2016 1:17 pm
by hsmith
MalcolmPreen wrote: so don't hold your breath...
I'll do my best.