Possible Bug in 1.3 / Strange wproc timeouts
Posted: Mon Jul 28, 2014 12:12 pm
Hi Guys / Gals
I've been noticing some strange behavior and a lot of false positives with my installation since upgrading to the latest 2014R1.3 patch. Mainly, random service checks are timing out and causing false alarms when the application is not experiencing any outages. Looking back thru my syslog server, before the 2014R1.3 upgrade, I wasn't seeing any noise from wproc in the logs and we weren't experiencing a flood of false alarms. Since the upgrade.. These gems have popped up in logs since upgrading...
All of the false alarms are related to socket timeouts.
Some data from my system profile...
nagios (pid 29145) is running...
NPCD running (pid 20190).
ndo2db (pid 22159) is running...
CPU Load 15: 0.42
Total Hosts: 422
Total Services: 3134
Any thoughts on the cause of this?
Thanks in advance.
-nick
I've been noticing some strange behavior and a lot of false positives with my installation since upgrading to the latest 2014R1.3 patch. Mainly, random service checks are timing out and causing false alarms when the application is not experiencing any outages. Looking back thru my syslog server, before the 2014R1.3 upgrade, I wasn't seeing any noise from wproc in the logs and we weren't experiencing a flood of false alarms. Since the upgrade.. These gems have popped up in logs since upgrading...
Code: Select all
Jul 26 00:13:55 nagiosxi nagios: wproc: Core Worker 4429: job 17959 (pid=11784) timed out. Killing it
Jul 26 00:13:55 nagiosxi nagios: wproc: CHECK job 17959 from worker Core Worker 4429 timed out after 60.01s
Jul 26 00:13:55 nagiosxi nagios: wproc: host=middleearth; service=app - dp system logs;
Jul 26 00:13:55 nagiosxi nagios: wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
Jul 26 00:13:55 nagiosxi nagios: wproc: Core Worker 4429: tv.tv_sec is currently 1406355235
Jul 26 00:13:55 nagiosxi nagios: wproc: Core Worker 4429: Failed to reap child with pid 11784. Next attempt @ 1406355240.501255
Jul 26 00:13:55 nagiosxi nagios: wproc: Core Worker 4426: job 17961 (pid=11808) timed out. Killing it
Jul 26 00:13:55 nagiosxi nagios: wproc: CHECK job 17961 from worker Core Worker 4426 timed out after 60.00s
Jul 26 00:13:55 nagiosxi nagios: wproc: host=prism01; service=sys - syslog;
Jul 26 00:13:55 nagiosxi nagios: wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
Jul 26 00:13:55 nagiosxi nagios: wproc: stdout line 01: CHECK_NRPE: Socket timeout after 60 seconds.
Jul 26 00:13:55 nagiosxi nagios: wproc: Core Worker 4426: job 17961 (pid=11808): Dormant child reaped
Jul 26 00:13:59 nagiosxi nagios_log [1406355231] wproc: Core Worker 4431: job 17952 (pid=11672) timed out. Killing it
Jul 26 00:13:59 nagiosxi nagios_log [1406355231] wproc: Core Worker 4431: job 28545328 with pid 11672 reaped at timeout. timeouts=1; started=18019
Jul 26 00:13:59 nagiosxi nagios_log [1406355235] wproc: Core Worker 4429: job 17959 (pid=11784) timed out. Killing it
Jul 26 00:13:59 nagiosxi nagios_log [1406355235] wproc: CHECK job 17959 from worker Core Worker 4429 timed out after 60.01s
Jul 26 00:13:59 nagiosxi nagios_log [1406355235] wproc: host=middleearth; service=app - dp system logs;
Jul 26 00:13:59 nagiosxi nagios_log [1406355235] wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
Jul 26 00:13:59 nagiosxi nagios_log [1406355235] wproc: Core Worker 4429: tv.tv_sec is currently 1406355235
Jul 26 00:13:59 nagiosxi nagios_log [1406355235] wproc: Core Worker 4429: Failed to reap child with pid 11784. Next attempt @ 1406355240.501255
Jul 26 00:13:59 nagiosxi nagios_log [1406355235] wproc: Core Worker 4426: job 17961 (pid=11808) timed out. Killing it
Jul 26 00:13:59 nagiosxi nagios_log [1406355235] wproc: CHECK job 17961 from worker Core Worker 4426 timed out after 60.00s
Jul 26 00:13:59 nagiosxi nagios_log [1406355235] wproc: host=prism01; service=sys - syslog;
Jul 26 00:13:59 nagiosxi nagios_log [1406355235] wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
Jul 26 00:13:59 nagiosxi nagios_log [1406355235] wproc: stdout line 01: CHECK_NRPE: Socket timeout after 60 seconds.
Jul 26 00:13:59 nagiosxi nagios_log [1406355235] wproc: Core Worker 4426: job 17961 (pid=11808): Dormant child reaped
Jul 26 00:14:00 nagiosxi nagios: wproc: Core Worker 4429: job 17959 (pid=11784): Dormant child reaped
Jul 26 00:14:00 nagiosxi nagios: wproc: Core Worker 4424: job 17965 (pid=11860) timed out. Killing it
Jul 26 00:14:00 nagiosxi nagios: wproc: Core Worker 4424: job 32306000 with pid 11860 reaped at timeout. timeouts=1; started=18029
Jul 26 00:14:05 nagiosxi nagios: wproc: Core Worker 4434: job 17971 (pid=12072) timed out. Killing it
Jul 26 00:14:05 nagiosxi nagios: wproc: Core Worker 4434: job 14199232 with pid 12072 reaped at timeout. timeouts=2; started=18031
Jul 26 00:14:07 nagiosxi nagios: wproc: Core Worker 4430: job 17973 (pid=12094) timed out. Killing it
Jul 26 00:14:07 nagiosxi nagios: wproc: CHECK job 17973 from worker Core Worker 4430 timed out after 60.00s
Jul 26 00:14:07 nagiosxi nagios: wproc: host=nachobase01; service=sys - syslog;
Jul 26 00:14:07 nagiosxi nagios: wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
Jul 26 00:14:07 nagiosxi nagios: wproc: stdout line 01: CHECK_NRPE: Socket timeout after 60 seconds.
Jul 26 00:14:07 nagiosxi nagios: wproc: Core Worker 4430: job 17973 (pid=12094): Dormant child reaped
Jul 26 00:14:09 nagiosxi nagios_log [1406355240] wproc: Core Worker 4429: job 17959 (pid=11784): Dormant child reaped
Jul 26 00:14:09 nagiosxi nagios_log [1406355240] wproc: Core Worker 4424: job 17965 (pid=11860) timed out. Killing it
Jul 26 00:14:09 nagiosxi nagios_log [1406355240] wproc: Core Worker 4424: job 32306000 with pid 11860 reaped at timeout. timeouts=1; started=18029
Jul 26 00:14:09 nagiosxi nagios_log [1406355245] wproc: Core Worker 4434: job 17971 (pid=12072) timed out. Killing it
Jul 26 00:14:09 nagiosxi nagios_log [1406355245] wproc: Core Worker 4434: job 14199232 with pid 12072 reaped at timeout. timeouts=2; started=18031
Jul 26 00:14:09 nagiosxi nagios_log [1406355247] wproc: Core Worker 4430: job 17973 (pid=12094) timed out. Killing it
Jul 26 00:14:09 nagiosxi nagios_log [1406355247] wproc: CHECK job 17973 from worker Core Worker 4430 timed out after 60.00s
Jul 26 00:14:09 nagiosxi nagios_log [1406355247] wproc: host=nachobase01; service=sys - syslog;
Jul 26 00:14:09 nagiosxi nagios_log [1406355247] wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
Jul 26 00:14:09 nagiosxi nagios_log [1406355247] wproc: stdout line 01: CHECK_NRPE: Socket timeout after 60 seconds.
Jul 26 00:14:09 nagiosxi nagios_log [1406355247] wproc: Core Worker 4430: job 17973 (pid=12094): Dormant child reaped
Jul 26 00:14:15 nagiosxi nagios: wproc: Core Worker 4424: job 17985 (pid=12257) timed out. Killing it
Jul 26 00:14:15 nagiosxi nagios: wproc: CHECK job 17985 from worker Core Worker 4424 timed out after 60.00s
Jul 26 00:14:15 nagiosxi nagios: wproc: host=prism02; service=app - hbase loader logs;
Jul 26 00:14:15 nagiosxi nagios: wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
Jul 26 00:14:15 nagiosxi nagios: wproc: Core Worker 4424: tv.tv_sec is currently 1406355255
Jul 26 00:14:15 nagiosxi nagios: wproc: Core Worker 4424: Failed to reap child with pid 12257. Next attempt @ 1406355260.286074
Jul 26 00:14:19 nagiosxi nagios_log [1406355255] wproc: Core Worker 4424: job 17985 (pid=12257) timed out. Killing it
Jul 26 00:14:19 nagiosxi nagios_log [1406355255] wproc: CHECK job 17985 from worker Core Worker 4424 timed out after 60.00sSome data from my system profile...
nagios (pid 29145) is running...
NPCD running (pid 20190).
ndo2db (pid 22159) is running...
CPU Load 15: 0.42
Total Hosts: 422
Total Services: 3134
Any thoughts on the cause of this?
Thanks in advance.
-nick