Core Worker failed to reap child in 4.0.1

Rob
Posts: 5
Joined: Thu Oct 31, 2013 7:09 pm

Core Worker failed to reap child in 4.0.1

Post by Rob »

Hi All,

Yesterday I upgraded from something ancient and venerable to the latest stable release to date, 4.0.1. This is on a Red Hat 5 system. The build went fine. The config files needed some minor cleanup of obsolete options, but otherwise transferred well. The web interface is up and responsive. Checks are being made and are generating emails.

Only one problem. /var/log/messages is generating a lot of messages like:
Nov 1 08:01:58 acadmonitor nagios: wproc: Core Worker 23413: Failed to reap child with pid 27893. Next attempt @ 1383318118.386616
And by a lot of these, I mean approximately one every 10 microseconds:
Nov 1 08:01:58 acadmonitor nagios: wproc: Core Worker 23413: Failed to reap child with pid 27893. Next attempt @ 1383318118.386521
Nov 1 08:01:58 acadmonitor nagios: wproc: Core Worker 23413: Failed to reap child with pid 27893. Next attempt @ 1383318118.386532
Nov 1 08:01:58 acadmonitor nagios: wproc: Core Worker 23413: Failed to reap child with pid 27893. Next attempt @ 1383318118.386542
Nov 1 08:01:58 acadmonitor nagios: wproc: Core Worker 23413: Failed to reap child with pid 27893. Next attempt @ 1383318118.386552
Nov 1 08:01:58 acadmonitor nagios: wproc: Core Worker 23413: Failed to reap child with pid 27893. Next attempt @ 1383318118.386563
Nov 1 08:01:58 acadmonitor nagios: wproc: Core Worker 23413: Failed to reap child with pid 27893. Next attempt @ 1383318118.386573
Nov 1 08:01:58 acadmonitor nagios: wproc: Core Worker 23413: Failed to reap child with pid 27893. Next attempt @ 1383318118.386583
Nov 1 08:01:58 acadmonitor nagios: wproc: Core Worker 23413: Failed to reap child with pid 27893. Next attempt @ 1383318118.386595
Nov 1 08:01:58 acadmonitor nagios: wproc: Core Worker 23413: Failed to reap child with pid 27893. Next attempt @ 1383318118.386606
Nov 1 08:01:58 acadmonitor nagios: wproc: Core Worker 23413: Failed to reap child with pid 27893. Next attempt @ 1383318118.386616
I haven't yet managed to catch it in the act fast enough to see which child it's trying to reap, but as you can guess, the disk partition containing the log is filling up rather rapidly...

Any suggestions on how to get this to stop?
sreinhardt
-fno-stack-protector
Posts: 4366
Joined: Mon Nov 19, 2012 12:10 pm

Re: Core Worker failed to reap child in 4.0.1

Post by sreinhardt »

There doesn't appear to be a current tracker issue for this. Are you getting check results returned from any workers? Have you also verified that this process is actually running?

Code:

ps -ef | grep nag
ps -ef | grep 27893
Rob
Posts: 5
Joined: Thu Oct 31, 2013 7:09 pm

Re: Core Worker failed to reap child in 4.0.1

Post by Rob »

My problem is that the child pid it's trying to reap changes every couple of seconds, and by the time I can read it and search for it, there's generally nothing left to see. I do seem to be getting checks back for some workers, or at least I'm getting statuses for some of my services, and notifications where appropriate. There is a block of messages that seems to precede the lines I mentioned before (apologies for leaving it out; it got buried in the other messages):

Code:

Nov  1 15:17:42 acadmonitor nagios: wproc: Core Worker 2804: job 422 (pid=6828) timed out. Killing it
Nov  1 15:17:42 acadmonitor nagios: wproc: CHECK job 422 from worker Core Worker 2804 timed out after 30.01s
Nov  1 15:17:42 acadmonitor nagios: wproc:   command: /usr/local/nagios/libexec/check_ping -H 134.114.248.235 -w 3000.0,80% -c 5000.0,100% -p 5
Nov  1 15:17:42 acadmonitor nagios: wproc:   host=anthroprt5; service=(null);
Nov  1 15:17:42 acadmonitor nagios: wproc:   early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
Nov  1 15:17:42 acadmonitor nagios: Warning: Check of host 'anthroprt5' timed out after 30.01 seconds
I've had to throttle all logging from Nagios rather aggressively to keep from getting paged (by Nagios) several times a night about disk usage, but if you have suggestions for other things to try or look for, I can turn it back on to gather data.
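One way to catch a short-lived child before it disappears is to let a script run the lookup the moment a new pid shows up in the log. The following is a minimal sketch, assuming a POSIX shell, an awk that supports fflush(), a procps-style ps, and the message format quoted earlier in the thread. The names new_reap_pids and snapshot_children are hypothetical helpers, not part of Nagios. awk does the filtering so that ps runs only when the pid changes, not once per log line.

```shell
#!/bin/sh
# Sketch only: new_reap_pids and snapshot_children are hypothetical helper names.

# Read log lines on stdin; print a child pid each time it differs from the
# pid on the previous "Failed to reap child" line.
new_reap_pids() {
    awk '/Failed to reap child with pid/ {
        for (i = 1; i < NF; i++)
            if ($i == "pid") { pid = $(i + 1); sub(/\.$/, "", pid) }
        if (pid != last) { print pid; fflush(); last = pid }
    }'
}

# Read log lines on stdin and run ps once for each newly seen child pid.
snapshot_children() {
    new_reap_pids | while IFS= read -r pid; do
        echo "== child $pid seen at $(date '+%H:%M:%S') =="
        ps -o pid,ppid,stat,etime,args -p "$pid" || echo "pid $pid is already gone"
    done
}

# Assumed usage, with the log path taken from the first post:
#   tail -F /var/log/messages | snapshot_children
```

If the child is a zombie or is stuck in the kernel, the stat column in the ps output would show it, which is the kind of detail a bug report can use.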
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: Core Worker failed to reap child in 4.0.1

Post by slansing »

And just as a check, executing:

Code:

/usr/local/nagios/libexec/check_ping -H 134.114.248.235 -w 3000.0,80% -c 5000.0,100%
does not take an absurdly long time to run, correct?
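To report both the run time and the plugin's exit status in one go, a small wrapper can help. This is a rough sketch, assuming a POSIX shell and a date that supports +%s; timed_run is a hypothetical helper name. Nagios plugins conventionally exit with 0 (OK), 1 (WARNING), 2 (CRITICAL) or 3 (UNKNOWN).

```shell
#!/bin/sh
# Sketch only: timed_run is a hypothetical helper, not a Nagios tool.

# Run a command, then print its exit status and wall-clock duration in seconds.
timed_run() {
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    echo "exit status $status after $((end - start))s"
    return "$status"
}

# Assumed usage, with the command quoted above:
#   timed_run /usr/local/nagios/libexec/check_ping -H 134.114.248.235 -w 3000.0,80% -c 5000.0,100%
```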
Rob
Posts: 5
Joined: Thu Oct 31, 2013 7:09 pm

Re: Core Worker failed to reap child in 4.0.1

Post by Rob »

It takes about 30 seconds to time out; the machine genuinely is unreachable. The issue is not the timeout or the lack of response from the machine being checked. It's the thousands of log messages generated by that lack of response.
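To put a number on the flood without reading every line, the repeated messages can be collapsed into one count per worker and child. This is a sketch under the assumption that the log lines follow the format quoted in the first post; reap_summary is a hypothetical helper name, and the log path in the usage line comes from that post.

```shell
#!/bin/sh
# Sketch only: reap_summary is a hypothetical helper name.

# Count "Failed to reap child" messages per (worker pid, child pid) pair.
# Reads the files given as arguments, or stdin when none are given.
reap_summary() {
    awk '/Failed to reap child with pid/ {
        for (i = 1; i < NF; i++) {
            if ($i == "Worker") { w = $(i + 1); sub(/:$/, "", w) }
            if ($i == "pid")    { c = $(i + 1); sub(/\.$/, "", c) }
        }
        count[w " " c]++
    }
    END {
        for (k in count) {
            split(k, p, " ")
            printf "worker %s child %s: %d messages\n", p[1], p[2], count[k]
        }
    }' "$@" | sort
}

# Assumed usage:
#   reap_summary /var/log/messages
```

A summary like this also shows whether the messages come from a single worker or from several, which may help narrow down a bug report.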
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: Core Worker failed to reap child in 4.0.1

Post by slansing »

I'm going to post a bug report on our internal tracking system. Would you be willing to do the same at:

tracker.nagios.org

Thank you!
Rob
Posts: 5
Joined: Thu Oct 31, 2013 7:09 pm

Re: Core Worker failed to reap child in 4.0.1

Post by Rob »

Will do; thanks.
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Core Worker failed to reap child in 4.0.1

Post by abrist »

Great. Once the bug report is filed on the tracker, please post a link in this thread for future forum-goers and searchers.
Thanks.
Rob
Posts: 5
Joined: Thu Oct 31, 2013 7:09 pm

Re: Core Worker failed to reap child in 4.0.1

Post by Rob »
