Yesterday I upgraded from something ancient and venerable to the latest stable release, 4.0.1. This is on a Red Hat 5 system. The build went fine. The config files needed some minor cleanup of obsolete options, but otherwise transferred well. The web interface is up and responsive. Checks are being made and are generating emails.
Only one problem. /var/log/messages is generating a lot of messages like:
Nov 1 08:01:58 acadmonitor nagios: wproc: Core Worker 23413: Failed to reap child with pid 27893. Next attempt @ 1383318118.386616
And by a lot of these, I mean approximately one every 10 microseconds:
Nov 1 08:01:58 acadmonitor nagios: wproc: Core Worker 23413: Failed to reap child with pid 27893. Next attempt @ 1383318118.386521
Nov 1 08:01:58 acadmonitor nagios: wproc: Core Worker 23413: Failed to reap child with pid 27893. Next attempt @ 1383318118.386532
Nov 1 08:01:58 acadmonitor nagios: wproc: Core Worker 23413: Failed to reap child with pid 27893. Next attempt @ 1383318118.386542
Nov 1 08:01:58 acadmonitor nagios: wproc: Core Worker 23413: Failed to reap child with pid 27893. Next attempt @ 1383318118.386552
Nov 1 08:01:58 acadmonitor nagios: wproc: Core Worker 23413: Failed to reap child with pid 27893. Next attempt @ 1383318118.386563
Nov 1 08:01:58 acadmonitor nagios: wproc: Core Worker 23413: Failed to reap child with pid 27893. Next attempt @ 1383318118.386573
Nov 1 08:01:58 acadmonitor nagios: wproc: Core Worker 23413: Failed to reap child with pid 27893. Next attempt @ 1383318118.386583
Nov 1 08:01:58 acadmonitor nagios: wproc: Core Worker 23413: Failed to reap child with pid 27893. Next attempt @ 1383318118.386595
Nov 1 08:01:58 acadmonitor nagios: wproc: Core Worker 23413: Failed to reap child with pid 27893. Next attempt @ 1383318118.386606
Nov 1 08:01:58 acadmonitor nagios: wproc: Core Worker 23413: Failed to reap child with pid 27893. Next attempt @ 1383318118.386616
I haven't yet managed to catch it in the act fast enough to see which child it's trying to reap, but as you can guess, the disk partition containing the log is filling up rather rapidly...
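One way to get a handle on the volume without catching the child live is to summarize the messages after the fact. Below is a minimal Python sketch, assuming the message format shown in the excerpt above. The function name `summarize_reap_failures` is made up for illustration and is not part of Nagios.

```python
import re
from collections import Counter

# Matches the "Failed to reap child" lines as they appear in the excerpt above.
REAP_RE = re.compile(
    r"Core Worker (?P<worker>\d+): Failed to reap child with pid (?P<pid>\d+)\. "
    r"Next attempt @ (?P<next>\d+\.\d+)"
)


def summarize_reap_failures(lines):
    """Return {(worker_pid, child_pid): (message_count, mean_gap_seconds)}.

    mean_gap_seconds is the average spacing between the "Next attempt"
    timestamps, or None when only one message was seen for that pair.
    Timestamps are parsed as floats, so gaps are only approximate at
    microsecond resolution.
    """
    counts = Counter()
    attempts = {}
    for line in lines:
        m = REAP_RE.search(line)
        if not m:
            continue  # not a reap-failure message
        key = (int(m["worker"]), int(m["pid"]))
        counts[key] += 1
        attempts.setdefault(key, []).append(float(m["next"]))

    summary = {}
    for key, n in counts.items():
        ts = sorted(attempts[key])
        mean_gap = (ts[-1] - ts[0]) / (n - 1) if n > 1 else None
        summary[key] = (n, mean_gap)
    return summary


if __name__ == "__main__":
    # Assumed log location, taken from the post; adjust as needed.
    with open("/var/log/messages", errors="replace") as fh:
        for (worker, pid), (n, gap) in summarize_reap_failures(fh).items():
            print(f"worker {worker} child {pid}: {n} messages, mean gap {gap}")
```

Run against a copy of the log, this shows how many messages each worker/child pair produced and roughly how tightly the retries are spaced, which may be useful detail for a bug report.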
There doesn't appear to be a current tracker issue for this. Are you getting check results returned for any workers? Have you also verified that this process is actually running?
Nagios-Plugins maintainer exclusively, unless you have other C language bugs with open-source nagios projects, then I am happy to help! Please pm or use other communication to alert me to issues as I no longer track the forum.
My problem is that the child pid that it's trying to reap changes every couple of seconds, and generally by the time I can read it and run a search for it, there's no longer anything there to see. I do seem to be getting checks back for some workers, or at least I seem to be getting statuses for some of my services, and notifications where appropriate. There is a block of messages that seems to precede the lines I mentioned before (apologies for leaving it out; it got buried in the other messages):
Nov 1 15:17:42 acadmonitor nagios: wproc: Core Worker 2804: job 422 (pid=6828) timed out. Killing it
Nov 1 15:17:42 acadmonitor nagios: wproc: CHECK job 422 from worker Core Worker 2804 timed out after 30.01s
Nov 1 15:17:42 acadmonitor nagios: wproc: command: /usr/local/nagios/libexec/check_ping -H 134.114.248.235 -w 3000.0,80% -c 5000.0,100% -p 5
Nov 1 15:17:42 acadmonitor nagios: wproc: host=anthroprt5; service=(null);
Nov 1 15:17:42 acadmonitor nagios: wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
Nov 1 15:17:42 acadmonitor nagios: Warning: Check of host 'anthroprt5' timed out after 30.01 seconds
I've had to throttle logging from nagios rather aggressively to keep from getting paged (by nagios) several times a night about disk usage, but if you have suggestions for other things to try or look for, I can turn it back on to gather data.
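Since the child pid changes too quickly to inspect by hand, one thing that could be tried is matching the pids in the "Failed to reap" lines against the pids in the preceding "timed out" blocks. The Python sketch below does that offline against the log text. It rests on an unconfirmed assumption: that the child which cannot be reaped is the same process that was just killed for timing out. The function name `map_unreaped_children` is hypothetical.

```python
import re

# Patterns follow the message formats quoted in this thread.
TIMEOUT_RE = re.compile(r"Core Worker (\d+): job (\d+) \(pid=(\d+)\) timed out")
COMMAND_RE = re.compile(r"wproc:\s+command: (.*)$")
REAP_RE = re.compile(r"Failed to reap child with pid (\d+)")


def map_unreaped_children(lines):
    """Return {child_pid: command or None} for every reap-failure pid.

    The command is taken from the most recent "timed out" block that
    reported the same pid. None means no such block was seen, which
    would argue against the assumption stated above.
    """
    commands = {}    # pid of a timed-out job -> command line, once seen
    pending = None   # pid whose "command:" line has not been read yet
    unreaped = {}
    for line in lines:
        line = line.rstrip("\n")
        m = TIMEOUT_RE.search(line)
        if m:
            pending = int(m.group(3))
            commands.setdefault(pending, None)
            continue
        m = COMMAND_RE.search(line)
        if m and pending is not None:
            commands[pending] = m.group(1)
            pending = None
            continue
        m = REAP_RE.search(line)
        if m:
            pid = int(m.group(1))
            unreaped[pid] = commands.get(pid)
    return unreaped
```

If every unreaped pid maps to a timed-out check, that points at the timeout/kill path in the worker; if none do, the two sets of messages are probably unrelated. Either result is worth including in a tracker report.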
It takes about 30 seconds to time out, and the machine genuinely is unreachable. The issue is not the timeout or the lack of response from the machine being checked; it's the thousands of log messages generated by the lack of response.
Great. Once the bug report is filed on tracker, post a link in this thread for future forum goers and searchers.
Thanks.
Former Nagios employee