Problem with Nagios 4.1.2-Pre1 (worker processes dying)

MalcolmPreen
Posts: 63
Joined: Wed Jan 25, 2012 9:21 am

Post by MalcolmPreen »

This problem occurred previously on Nagios 4.0.8 and 4.1.1, as well as on my current set-up running 4.1.2-Pre1.
Originally I was running CentOS 6.4 (Final), but upgrading to CentOS 6.7 (Final) made no difference.

One of our customer sites is monitored by a netbook, which reports statuses back to "the centre" using NSCA.

This particular customer has moved sites / locations a number of times, as well as changing network details. The most recent change meant that a number of the hosts / services originally being monitored no longer exist.

As this customer was not my primary focus, I updated the IP addresses that had changed and left the others "down", to deal with at a later date.

Shortly after this, I noticed that every so often ALL of the services from this customer would go stale, and when logging in I discovered that Nagios was no longer running on this netbook.

More detailed analysis showed that, one by one, the worker processes appeared to be dying. I don't have the EXACT text, but the nagios.log file contained something along the lines of:

wproc: Socket to worker Core Worker XXXXX broken, removing

And sure enough PID XXXXX was gone following this message.
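
For anyone wanting to check their own log for the same symptom, here is a minimal Python sketch that pulls the worker PIDs out of such messages. It assumes the message wording matches the paraphrase above (the exact text in a real nagios.log may differ) and that each line starts with the usual `[epoch]` timestamp; the function name `find_dead_workers` and the sample path in the comment are illustrative only.

```python
import re

# Pattern built from the message as paraphrased in the post above.
# ASSUMPTION: the real log wording may differ; adjust the regex if so.
BROKEN_WORKER = re.compile(
    r"^\[(?P<ts>\d+)\]\s+wproc: Socket to worker Core Worker "
    r"(?P<pid>\d+) broken, removing"
)


def find_dead_workers(lines):
    """Return (epoch_timestamp, pid) for each 'socket broken' worker message."""
    hits = []
    for line in lines:
        match = BROKEN_WORKER.match(line)
        if match:
            hits.append((int(match.group("ts")), int(match.group("pid"))))
    return hits


# Example use (the path is an assumption; point it at your own nagios.log):
#   with open("/usr/local/nagios/var/nagios.log", errors="replace") as fh:
#       for ts, pid in find_dead_workers(fh):
#           print(ts, pid)
```

Comparing the timestamps of the hits against the check results logged just before them might help narrow down which check was running when a worker went away.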

Given that no other netbooks were doing this, and I'd read something (on here, I think) about memory buffers causing workers to die, I set aside a day to "tidy up" the netbook.

Having commented out a significant portion of the "removed" systems / services (not all, but more than 50%), I noticed that overnight we had lost no more workers.

I don't know which service was killing the worker processes (or whether it was just an overall load issue).

So, in summary: there was a problem, which is no longer affecting me, but I figured that my findings above MIGHT help someone in the future.

I can probably resurrect the problem if some specific debugging can be done, but this will have to be scheduled, so I can't guarantee time scales.

Let me know

Malcolm
hsmith
Agent Smith
Posts: 3539
Joined: Thu Jul 30, 2015 11:09 am
Location: 127.0.0.1

Re: Problem with Nagios 4.1.2-Pre1 (worker processes dying)

Post by hsmith »

Hi Malcolm,

John left some good debugging instructions in your other post. Check them out!

https://support.nagios.com/forum/viewto ... =7&t=36767
Former Nagios Employee.
me.
MalcolmPreen
Posts: 63
Joined: Wed Jan 25, 2012 9:21 am

Re: Problem with Nagios 4.1.2-Pre1 (worker processes dying)

Post by MalcolmPreen »

Thanks, working on that now. gdb is now installed; all I have to do is get proficient in it!

Malcolm
hsmith
Agent Smith
Posts: 3539
Joined: Thu Jul 30, 2015 11:09 am
Location: 127.0.0.1

Re: Problem with Nagios 4.1.2-Pre1 (worker processes dying)

Post by hsmith »

Let us know if you run into anything we can potentially help you with. Keep in mind 4.1.2-Pre1 hasn't really been touched by the support team yet, so you may run into issues we have not come across yet.
MalcolmPreen
Posts: 63
Joined: Wed Jan 25, 2012 9:21 am

Re: Problem with Nagios 4.1.2-Pre1 (worker processes dying)

Post by MalcolmPreen »

Just to confirm, this particular system has been stable for over a week now.

It has only 4 worker processes, and they have racked up quite a bit of CPU time (but nothing excessive, as per my other active post).

All four processes are still active, and the monitoring is working as expected.

So the multitude of errors (or perhaps one specific error?) appears to have been causing the "worker death".

As per the initial post, I'll try to find time to investigate, but I can't promise when, so don't hold your breath...

Malcolm
hsmith
Agent Smith
Posts: 3539
Joined: Thu Jul 30, 2015 11:09 am
Location: 127.0.0.1

Re: Problem with Nagios 4.1.2-Pre1 (worker processes dying)

Post by hsmith »

MalcolmPreen wrote: so don't hold your breath...
I'll do my best.