Core Worker failed to reap child in 4.0.1

Rob · Post by **Rob** » Fri Nov 01, 2013 4:40 pm

Hi All,

I upgraded from something ancient and venerable yesterday to the latest stable release to date, 4.0.1. This is on a Red Hat 5 system. The build went fine. The config files needed some minor cleanup of obsolete options, but otherwise transferred well. The web interface is up and responsive. Checks are being made and are generating emails.

Only one problem. /var/log/messages is generating a lot of messages like:

Nov 1 08:01:58 acadmonitor nagios: wproc: Core Worker 23413: Failed to reap child with pid 27893. Next attempt @ 1383318118.386616

And by a lot of these, I mean approximately one every 10 microseconds:

Nov 1 08:01:58 acadmonitor nagios: wproc: Core Worker 23413: Failed to reap child with pid 27893. Next attempt @ 1383318118.386521
Nov 1 08:01:58 acadmonitor nagios: wproc: Core Worker 23413: Failed to reap child with pid 27893. Next attempt @ 1383318118.386532
Nov 1 08:01:58 acadmonitor nagios: wproc: Core Worker 23413: Failed to reap child with pid 27893. Next attempt @ 1383318118.386542
Nov 1 08:01:58 acadmonitor nagios: wproc: Core Worker 23413: Failed to reap child with pid 27893. Next attempt @ 1383318118.386552
Nov 1 08:01:58 acadmonitor nagios: wproc: Core Worker 23413: Failed to reap child with pid 27893. Next attempt @ 1383318118.386563
Nov 1 08:01:58 acadmonitor nagios: wproc: Core Worker 23413: Failed to reap child with pid 27893. Next attempt @ 1383318118.386573
Nov 1 08:01:58 acadmonitor nagios: wproc: Core Worker 23413: Failed to reap child with pid 27893. Next attempt @ 1383318118.386583
Nov 1 08:01:58 acadmonitor nagios: wproc: Core Worker 23413: Failed to reap child with pid 27893. Next attempt @ 1383318118.386595
Nov 1 08:01:58 acadmonitor nagios: wproc: Core Worker 23413: Failed to reap child with pid 27893. Next attempt @ 1383318118.386606
Nov 1 08:01:58 acadmonitor nagios: wproc: Core Worker 23413: Failed to reap child with pid 27893. Next attempt @ 1383318118.386616
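The "one every 10 microseconds" figure can be checked from the "Next attempt @" timestamps in the lines above. The following is a minimal sketch, not part of the original post: `attempt_deltas_us` is a hypothetical helper name, and it assumes the timestamp is the last field on the line and always carries six fractional digits, as in the excerpt. Note that it measures the spacing of the scheduled retry times, which only approximates the logging rate.

```shell
#!/bin/sh
# Hypothetical helper: read syslog lines on stdin and print the gap, in
# microseconds, between consecutive "Next attempt @" timestamps.
# Assumes the timestamp is the last field and has six fractional digits.
attempt_deltas_us() {
    awk '/Failed to reap child/ {
        split($NF, parts, ".")          # parts[1] = seconds, parts[2] = microseconds
        us = parts[2] + 0
        # Only compare timestamps that fall within the same second
        if (seen && parts[1] == sec) print us - prev
        sec = parts[1]; prev = us; seen = 1
    }'
}
```

Usage would be along the lines of `grep 'Core Worker' /var/log/messages | attempt_deltas_us | sort -n | uniq -c` to see how the gaps are distributed.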

I haven't yet managed to catch it in the act fast enough to see what the child is that it's trying to reap, but as you can guess, the disk partition containing the log is filling up rather rapidly...

Any suggestions on how to get this to stop?

sreinhardt · Post by **sreinhardt** » Mon Nov 04, 2013 11:27 am

There doesn't appear to be a current tracker issue for this. Are you getting check results returned from any workers? Have you also verified that this process is actually running?

Code: Select all

ps -ef | grep nag
ps -ef | grep 27893

Rob · Post by **Rob** » Mon Nov 04, 2013 5:54 pm

My problem is that the child pid that it's trying to reap changes every couple of seconds, and generally by the time I can read it and run a search for it, there's no longer anything there to see. I do seem to be getting checks back for some workers - or at least, I seem to be getting statuses for some of my services, and notifications where appropriate. There is a block of messages that seems to precede the lines I mentioned before (apologies for leaving it out - it got buried in the other messages):

Code: Select all

Nov  1 15:17:42 acadmonitor nagios: wproc: Core Worker 2804: job 422 (pid=6828) timed out. Killing it
Nov  1 15:17:42 acadmonitor nagios: wproc: CHECK job 422 from worker Core Worker 2804 timed out after 30.01s
Nov  1 15:17:42 acadmonitor nagios: wproc:   command: /usr/local/nagios/libexec/check_ping -H 134.114.248.235 -w 3000.0,80% -c 5000.0,100% -p 5
Nov  1 15:17:42 acadmonitor nagios: wproc:   host=anthroprt5; service=(null);
Nov  1 15:17:42 acadmonitor nagios: wproc:   early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
Nov  1 15:17:42 acadmonitor nagios: Warning: Check of host 'anthroprt5' timed out after 30.01 seconds

I've had to rather aggressively throttle any logging from nagios at all in order to keep from getting paged (by nagios) several times a night about disk usage, but if you have suggestions for other things to try or look for, I can turn it back on to gather data.
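Since the child pid changes every couple of seconds, one option is to let a script read the pid from the log and run `ps` on it straight away. This is only a sketch under stated assumptions: `extract_reap_pid` and `watch_reap_pids` are hypothetical helper names, the message format is taken from the log lines quoted in this thread, and `tail -F` and `ps -o ... -p` are assumed to be available (they are in GNU coreutils and procps).

```shell
#!/bin/sh
# Hypothetical helper: print the child pid from a "Failed to reap child"
# log line, or nothing if the line does not match.
extract_reap_pid() {
    printf '%s\n' "$1" |
        sed -n 's/.*Failed to reap child with pid \([0-9][0-9]*\)\..*/\1/p'
}

# Hypothetical helper: follow the log and inspect each newly reported pid once.
# The log path defaults to /var/log/messages, as in the original post.
watch_reap_pids() {
    last=""
    tail -F "${1:-/var/log/messages}" | while IFS= read -r line; do
        pid=$(extract_reap_pid "$line")
        [ -n "$pid" ] || continue            # not a reap message
        [ "$pid" = "$last" ] && continue     # already inspected this pid
        last=$pid
        # Show state, parent, and command line while the process still exists
        ps -o pid,ppid,stat,etime,args -p "$pid" || echo "pid $pid already gone"
    done
}
```

Running `watch_reap_pids /var/log/messages` with logging turned back on would show what each child is. A `Z` in the STAT column would indicate a zombie that has exited but not yet been reaped.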

slansing · Post by **slansing** » Tue Nov 05, 2013 2:17 pm

And just as a check, executing:

Code: Select all

/usr/local/nagios/libexec/check_ping -H 134.114.248.235 -w 3000.0,80% -c 5000.0,100%

does not take an absurdly long time to run, correct?
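To put a number on how long the command takes, it can be wrapped in a small timer. This is a rough sketch only: `time_check` is a hypothetical helper name, and it reports whole seconds using `date +%s` (a GNU date format, assumed to be available on this system).

```shell
#!/bin/sh
# Hypothetical helper: run a command, discard its output, and report its
# exit status and wall-clock duration in whole seconds.
time_check() {
    start=$(date +%s)
    rc=0
    "$@" >/dev/null 2>&1 || rc=$?
    end=$(date +%s)
    echo "exit=$rc elapsed=$((end - start))s"
}
```

For example: `time_check /usr/local/nagios/libexec/check_ping -H 134.114.248.235 -w 3000.0,80% -c 5000.0,100% -p 5`, using the same arguments that appear in the log excerpt above.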

Rob · Post by **Rob** » Tue Nov 05, 2013 3:27 pm

It takes about 30 seconds to time out - the machine genuinely is unreachable. The issue is not the timeout or the lack of response from the machine being checked; it's the thousands of log messages generated by the lack of response.

slansing · Post by **slansing** » Wed Nov 06, 2013 12:00 pm

I'm going to post a bug report on our internal tracking system, would you be willing to do the same at:

tracker.nagios.org

Thank you!

Rob · Post by **Rob** » Wed Nov 06, 2013 3:04 pm

Will do; thanks.

abrist · Post by **abrist** » Wed Nov 06, 2013 3:29 pm

Great. Once the bug report is filed on tracker, post a link in this thread for future forum goers and searchers.
Thanks.

Rob · Post by **Rob** » Wed Nov 06, 2013 6:16 pm

Good point:

http://tracker.nagios.org/view.php?id=529
