Possible Bug in 1.3 / Strange wproc timeouts

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
Locked
technick
Posts: 49
Joined: Tue Feb 04, 2014 10:30 am
Location: Denver, CO

Possible Bug in 1.3 / Strange wproc timeouts

Post by technick »

Hi Guys / Gals

I've been noticing some strange behavior and a lot of false positives with my installation since upgrading to the latest 2014R1.3 patch. Mainly, random service checks are timing out and causing false alarms when the application is not experiencing any outages. Looking back thru my syslog server, before the 2014R1.3 upgrade, I wasn't seeing any noise from wproc in the logs and we weren't experiencing a flood of false alarms. Since the upgrade.. These gems have popped up in logs since upgrading...

Code: Select all

Jul 26 00:13:55 nagiosxi nagios: wproc: Core Worker 4429: job 17959 (pid=11784) timed out. Killing it
Jul 26 00:13:55 nagiosxi nagios: wproc: CHECK job 17959 from worker Core Worker 4429 timed out after 60.01s
Jul 26 00:13:55 nagiosxi nagios: wproc:   host=middleearth; service=app - dp system logs;
Jul 26 00:13:55 nagiosxi nagios: wproc:   early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
Jul 26 00:13:55 nagiosxi nagios: wproc: Core Worker 4429: tv.tv_sec is currently 1406355235
Jul 26 00:13:55 nagiosxi nagios: wproc: Core Worker 4429: Failed to reap child with pid 11784. Next attempt @ 1406355240.501255
Jul 26 00:13:55 nagiosxi nagios: wproc: Core Worker 4426: job 17961 (pid=11808) timed out. Killing it
Jul 26 00:13:55 nagiosxi nagios: wproc: CHECK job 17961 from worker Core Worker 4426 timed out after 60.00s
Jul 26 00:13:55 nagiosxi nagios: wproc:   host=prism01; service=sys - syslog;
Jul 26 00:13:55 nagiosxi nagios: wproc:   early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
Jul 26 00:13:55 nagiosxi nagios: wproc:   stdout line 01: CHECK_NRPE: Socket timeout after 60 seconds.
Jul 26 00:13:55 nagiosxi nagios: wproc: Core Worker 4426: job 17961 (pid=11808): Dormant child reaped
Jul 26 00:13:59 nagiosxi nagios_log [1406355231] wproc: Core Worker 4431: job 17952 (pid=11672) timed out. Killing it
Jul 26 00:13:59 nagiosxi nagios_log [1406355231] wproc: Core Worker 4431: job 28545328 with pid 11672 reaped at timeout. timeouts=1; started=18019
Jul 26 00:13:59 nagiosxi nagios_log [1406355235] wproc: Core Worker 4429: job 17959 (pid=11784) timed out. Killing it
Jul 26 00:13:59 nagiosxi nagios_log [1406355235] wproc: CHECK job 17959 from worker Core Worker 4429 timed out after 60.01s
Jul 26 00:13:59 nagiosxi nagios_log [1406355235] wproc:   host=middleearth; service=app - dp system logs;
Jul 26 00:13:59 nagiosxi nagios_log [1406355235] wproc:   early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
Jul 26 00:13:59 nagiosxi nagios_log [1406355235] wproc: Core Worker 4429: tv.tv_sec is currently 1406355235
Jul 26 00:13:59 nagiosxi nagios_log [1406355235] wproc: Core Worker 4429: Failed to reap child with pid 11784. Next attempt @ 1406355240.501255
Jul 26 00:13:59 nagiosxi nagios_log [1406355235] wproc: Core Worker 4426: job 17961 (pid=11808) timed out. Killing it
Jul 26 00:13:59 nagiosxi nagios_log [1406355235] wproc: CHECK job 17961 from worker Core Worker 4426 timed out after 60.00s
Jul 26 00:13:59 nagiosxi nagios_log [1406355235] wproc:   host=prism01; service=sys - syslog;
Jul 26 00:13:59 nagiosxi nagios_log [1406355235] wproc:   early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
Jul 26 00:13:59 nagiosxi nagios_log [1406355235] wproc:   stdout line 01: CHECK_NRPE: Socket timeout after 60 seconds.
Jul 26 00:13:59 nagiosxi nagios_log [1406355235] wproc: Core Worker 4426: job 17961 (pid=11808): Dormant child reaped
Jul 26 00:14:00 nagiosxi nagios: wproc: Core Worker 4429: job 17959 (pid=11784): Dormant child reaped
Jul 26 00:14:00 nagiosxi nagios: wproc: Core Worker 4424: job 17965 (pid=11860) timed out. Killing it
Jul 26 00:14:00 nagiosxi nagios: wproc: Core Worker 4424: job 32306000 with pid 11860 reaped at timeout. timeouts=1; started=18029
Jul 26 00:14:05 nagiosxi nagios: wproc: Core Worker 4434: job 17971 (pid=12072) timed out. Killing it
Jul 26 00:14:05 nagiosxi nagios: wproc: Core Worker 4434: job 14199232 with pid 12072 reaped at timeout. timeouts=2; started=18031
Jul 26 00:14:07 nagiosxi nagios: wproc: Core Worker 4430: job 17973 (pid=12094) timed out. Killing it
Jul 26 00:14:07 nagiosxi nagios: wproc: CHECK job 17973 from worker Core Worker 4430 timed out after 60.00s
Jul 26 00:14:07 nagiosxi nagios: wproc:   host=nachobase01; service=sys - syslog;
Jul 26 00:14:07 nagiosxi nagios: wproc:   early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
Jul 26 00:14:07 nagiosxi nagios: wproc:   stdout line 01: CHECK_NRPE: Socket timeout after 60 seconds.
Jul 26 00:14:07 nagiosxi nagios: wproc: Core Worker 4430: job 17973 (pid=12094): Dormant child reaped
Jul 26 00:14:09 nagiosxi nagios_log [1406355240] wproc: Core Worker 4429: job 17959 (pid=11784): Dormant child reaped
Jul 26 00:14:09 nagiosxi nagios_log [1406355240] wproc: Core Worker 4424: job 17965 (pid=11860) timed out. Killing it
Jul 26 00:14:09 nagiosxi nagios_log [1406355240] wproc: Core Worker 4424: job 32306000 with pid 11860 reaped at timeout. timeouts=1; started=18029
Jul 26 00:14:09 nagiosxi nagios_log [1406355245] wproc: Core Worker 4434: job 17971 (pid=12072) timed out. Killing it
Jul 26 00:14:09 nagiosxi nagios_log [1406355245] wproc: Core Worker 4434: job 14199232 with pid 12072 reaped at timeout. timeouts=2; started=18031
Jul 26 00:14:09 nagiosxi nagios_log [1406355247] wproc: Core Worker 4430: job 17973 (pid=12094) timed out. Killing it
Jul 26 00:14:09 nagiosxi nagios_log [1406355247] wproc: CHECK job 17973 from worker Core Worker 4430 timed out after 60.00s
Jul 26 00:14:09 nagiosxi nagios_log [1406355247] wproc:   host=nachobase01; service=sys - syslog;
Jul 26 00:14:09 nagiosxi nagios_log [1406355247] wproc:   early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
Jul 26 00:14:09 nagiosxi nagios_log [1406355247] wproc:   stdout line 01: CHECK_NRPE: Socket timeout after 60 seconds.
Jul 26 00:14:09 nagiosxi nagios_log [1406355247] wproc: Core Worker 4430: job 17973 (pid=12094): Dormant child reaped
Jul 26 00:14:15 nagiosxi nagios: wproc: Core Worker 4424: job 17985 (pid=12257) timed out. Killing it
Jul 26 00:14:15 nagiosxi nagios: wproc: CHECK job 17985 from worker Core Worker 4424 timed out after 60.00s
Jul 26 00:14:15 nagiosxi nagios: wproc:   host=prism02; service=app - hbase loader logs;
Jul 26 00:14:15 nagiosxi nagios: wproc:   early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
Jul 26 00:14:15 nagiosxi nagios: wproc: Core Worker 4424: tv.tv_sec is currently 1406355255
Jul 26 00:14:15 nagiosxi nagios: wproc: Core Worker 4424: Failed to reap child with pid 12257. Next attempt @ 1406355260.286074
Jul 26 00:14:19 nagiosxi nagios_log [1406355255] wproc: Core Worker 4424: job 17985 (pid=12257) timed out. Killing it
Jul 26 00:14:19 nagiosxi nagios_log [1406355255] wproc: CHECK job 17985 from worker Core Worker 4424 timed out after 60.00s
All of the false alarms are related to socket timeouts.

Some data from my system profile...

nagios (pid 29145) is running...
NPCD running (pid 20190).
ndo2db (pid 22159) is running...
CPU Load 15: 0.42
Total Hosts: 422
Total Services: 3134

Any thoughts on the cause of this?

Thanks in advance.

-nick
----------------------
Nagios Jedi in training.
tmcdonald
Posts: 9117
Joined: Mon Sep 23, 2013 8:40 am

Re: Possible Bug in 1.3 / Strange wproc timeouts

Post by tmcdonald »

Can I get a copy of your profile? In the XI web interface, go to Admin -> System Profile and click the blue "Download Profile" button. Then send me that profile.zip in a PM.
Former Nagios employee
Locked