Core Worker timed out and failed to reap child
Posted: Wed Sep 16, 2020 7:02 am
We experienced an issue this morning with one of our Nagios XI instances (5.6.10 running RHEL 7 VM's. 8 CPU with 16GB of RAM). We have the DB offloaded, and are running with a RAMDisk.
What we saw is a bunch of services start timing out, get reaped, and then suddenly, the Nagios install could no longer reap the core worker.
In each case, it seems to be an agentless check (LDAP Monitor, Website Check, etc, SNMP Polling, Host ICMP checks etc).
The timeouts seems to have happened at 6:40:40AM, and about 11 seconds later, we got a notice of a stuck CPU from VMWare. The system then started being unable to reap child workers at 6:47:11AM:
Any help or suggestions would be appreciated.
What we saw is a bunch of services start timing out, get reaped, and then suddenly, the Nagios install could no longer reap the core worker.
Code: Select all
[1600238831] Warning: Service performance data file processing command '/bin/mv /var/nagiosramdisk/service-perfdata /var/nagiosramdisk/spool/xidpe/1600238824.perfdata.service' timed out after 5 seconds
[1600238831] wproc: Core Worker 24332: job 108547 (pid=20424) timed out. Killing it
[1600238831] wproc: GLOBAL HOST EVENTHANDLER job 108547 from worker Core Worker 24332 timed out after 30.96s
[1600238831] wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
[1600238831] Warning: Global host event handler command '/usr/bin/php /usr/local/nagiosxi/scripts/handle_nagioscore_event.php --handler-type=host --host="cipolpre504" --hostaddress="XXXXXXXXXXX" --hoststate=UP --hoststateid=0 --lasthoststate=DOWN --lasthoststateid=1 --hoststatetype=SOFT --currentattempt=1 --maxattempts=4 --hosteventid=864175 --hostproblemid=0 --hostoutput="OK - XXXXXXXXXXX: rta 9.617ms, lost 0%" --longhostoutput="" --hostdowntime=0' timed out after 0.00 seconds
[1600238831] wproc: Core Worker 24332: tv.tv_sec is currently 1600238828
[1600238831] wproc: Core Worker 24332: Failed to reap child with pid 20424. Next attempt @ 1600238833.577880
[1600238831] wproc: Core Worker 24334: job 108547 (pid=20423) timed out. Killing it
[1600238831] wproc: GLOBAL HOST EVENTHANDLER job 108547 from worker Core Worker 24334 timed out after 32.55s
[1600238831] wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
[1600238831] Warning: Global host event handler command '/usr/bin/php /usr/local/nagiosxi/scripts/handle_nagioscore_event.php --handler-type=host --host="XXXXXXXXXXXXXXX" --hostaddress="XXXXXXXXX" --hoststate=UP --hoststateid=0 --lasthoststate=DOWN --lasthoststateid=1 --hoststatetype=SOFT --currentattempt=1 --maxattempts=4 --hosteventid=864174 --hostproblemid=0 --hostoutput="OK - XXXXXXXXXX: rta 15.214ms, lost 0%" --longhostoutput="" --hostdowntime=0' timed out after 0.00 seconds
[1600238831] wproc: Core Worker 24334: job 108547 (pid=20423): Dormant child reaped
[1600238831] wproc: Core Worker 24333: job 108528 (pid=20107) timed out. Killing it
[1600238831] wproc: GLOBAL SERVICE EVENTHANDLER job 108528 from worker Core Worker 24333 timed out after 35.93s
[1600238831] wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
[1600238831] Warning: Global service event handler command '/usr/bin/php /usr/local/nagiosxi/scripts/handle_nagioscore_event.php --handler-type=service --host="XXXXXXXXXXXXX" --service="XXXXXXXXXXXX" --hostaddress="XXXXXXXXXXX" --hoststate=UP --hoststateid=0 --hosteventid=832188 --hostproblemid=0 --servicestate=CRITICAL --servicestateid=2 --lastservicestate=OK --lastservicestateid=0 --servicestatetype=SOFT --currentattempt=1 --maxattempts=3 --serviceeventid=864167 --serviceproblemid=427842 --serviceoutput="CRITICAL - Socket timeout" --longserviceoutput="" --servicedowntime=0' timed out after 0.00 seconds
[1600238831] wproc: Core Worker 24333: tv.tv_sec is currently 1600238828
[1600238831] wproc: Core Worker 24333: Failed to reap child with pid 20107. Next attempt @ 1600238833.951373The timeouts seems to have happened at 6:40:40AM, and about 11 seconds later, we got a notice of a stuck CPU from VMWare. The system then started being unable to reap child workers at 6:47:11AM:
Code: Select all
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.658692] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [nagios:14492]
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.694377] Modules linked in: vsock_diag tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag binfmt_misc nfsv3 nfs_acl rpcsec_gss_krb5 n
fsv4 dns_resolver nfs lockd grace fscache ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 n
f_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_mangle iptable_security iptable_raw nf_conntrack li
bcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter vmw_vsock_vmci_transport vsock sb_edac iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf
128mul vmw_balloon glue_helper ablk_helper cryptd joydev pcspkr sg parport_pc parport vmw_vmci i2c_piix4 b9k_744109(POE) cbproxy_cbp_744_20200413(POE) auth_rpcgss sunrpc ip_tables ext4 mbcache jbd
2 sr_mod cdrom ata_generic pata_acpi vmwgfx sd_mod crc_t10dif crct10dif_generic drm_kms_helper syscopyarea ahci sysfillrect sysimgblt fb_sys_fops ttm libahci ata_piix drm nfit crct10dif_pclmul crc
t10dif_common crc32c_intel libata serio_raw vmxnet3 libnvdimm vmw_pvscsi drm_panel_orientation_quirks dm_mirror dm_region_hash dm_log dm_mod
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741148] CPU: 0 PID: 14492 Comm: nagios Kdump: loaded Tainted: P OE ------------ 3.10.0-1127.13.1.el7.x86_64 #1
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741185] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 12/12/2018
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741218] task: ffff9234066a8000 ti: ffff9232770bc000 task.ti: ffff9232770bc000
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741224] RIP: 0010:[<ffffffff94d966b9>] [<ffffffff94d966b9>] __write_lock_failed+0x9/0x20
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741355] RSP: 0000:ffff9232770bf998 EFLAGS: 00000297
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741358] RAX: ffff9232770bffd8 RBX: 0000000022a5ad9c RCX: 000000001367ca6e
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741360] RDX: ffffffffc0613860 RSI: 0000000050785f50 RDI: ffff9235eb9712f4
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741361] RBP: ffff9232770bf998 R08: 00000000001499b2 R09: 0000000000000000
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741362] R10: 00000000006aaaab R11: ffff9232770bf8ce R12: ffffffffc06101b8
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741374] R13: 0000000000000001 R14: ffffffffc06101b8 R15: 0000000000000800
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741376] FS: 00007f0a7b09d740(0000) GS:ffff9235ffc00000(0000) knlGS:0000000000000000
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741378] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741379] CR2: 00007fd6daffd410 CR3: 000000003870c000 CR4: 00000000001607f0
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.742085] Call Trace:
Sep 16 06:44:53 XXXXXXXXXXXXXXXXXX kernel: [1350084.742303] [<ffffffff9518854a>] _raw_write_lock_bh+0x2a/0x30
Sep 16 06:44:53 XXXXXXXXXXXXXXXXXX kernel: [1350084.816896] [<ffffffffc04fb953>] bit9_rwlock+0x72/0x88 [b9k_744109]
Sep 16 06:44:53 XXXXXXXXXXXXXXXXXX kernel: [1350084.818024] [<ffffffffc0478710>] _ZN12NativeRWLock4LockEv+0x1c/0x26 [b9k_744109]
Sep 16 06:44:53 XXXXXXXXXXXXXXXXXX kernel: [1350084.818234] [<ffffffffc04a433b>] _ZN8HoldLock4LockEv+0xb3/0x118 [b9k_744109]
Sep 16 06:44:53 XXXXXXXXXXXXXXXXXX kernel: [1350084.818366] [<ffffffffc04fbaeb>] ? bit9_readunlock+0x72/0x88 [b9k_744109]
Sep 16 06:44:53 XXXXXXXXXXXXXXXXXX kernel: [1350084.838025] [<ffffffffc05014a5>] _ZN8HoldLockC2ER10LockObjectNS0_13DesiredAccessEb+0x53/0x56 [b9k_744109]
Sep 16 06:44:53 XXXXXXXXXXXXXXXXXX kernel: [1350084.839252] [<ffffffffc0487fd8>] _ZN15ProcessTracking5CloneEP11ProcessInfo+0x44/0x7e8 [b9k_744109]
Sep 16 06:44:53 XXXXXXXXXXXXXXXXXX kernel: [1350084.839352] [<ffffffffc047b44f>] ? _ZN10RuleEngine20SearchForTrustedUserEPK13UnicodeString+0x6b/0x74 [b9k_744109]
Sep 16 06:44:53 XXXXXXXXXXXXXXXXXX kernel: [1350084.839561] [<ffffffffc04f32d6>] ? _ZN11ProcessInfo21PlatformSpecificSetupE21ProcessCreationSourceP14bit9_process_sPK15_UNICODE_STRINGPKcS5_PS_+0x2a4/0x364 [b9k_744109]
Sep 16 06:44:53 XXXXXXXXXXXXXXXXXX kernel: [1350084.841312] [<ffffffffc04f15fc>] _ZN15ProcessTracking33CreateProcessNotificationPostExecEP14bit9_process_sP17bit9_filestream_sPKcP11ProcessInfo+0x4aa/0x7a2 [b9k_744109]
Sep 16 06:44:53 XXXXXXXXXXXXXXXXXX kernel: [1350084.841448] [<ffffffffc04f3c77>] cbp_common_process_execute+0x235/0x25c [b9k_744109]
Sep 16 06:44:54 XXXXXXXXXXXXXXXXXX kernel: [1350084.841626] [<ffffffffc04f5fe1>] ? LSMSecurityHooksIsEnabled+0x7c/0xde [b9k_744109]
Sep 16 06:44:54 XXXXXXXXXXXXXXXXXX kernel: [1350084.841668] [<ffffffffc04f78e4>] init_def_mntns+0x18a1/0x1e05 [b9k_744109]
Sep 16 06:44:54 XXXXXXXXXXXXXXXXXX kernel: [1350084.841824] [<ffffffffc046ac92>] cbstub_lsm_bprm_set_creds+0x85/0xb9 [cbproxy_cbp_744_20200413]
Sep 16 06:44:54 XXXXXXXXXXXXXXXXXX kernel: [1350084.841943] [<ffffffff94d07419>] security_bprm_set_creds+0x19/0x20
Sep 16 06:44:54 XXXXXXXXXXXXXXXXXX kernel: [1350084.842010] [<ffffffff94c5428f>] prepare_binprm+0x9f/0x270
Sep 16 06:44:54 XXXXXXXXXXXXXXXXXX kernel: [1350084.842110] [<ffffffff94cb1ae5>] load_script+0x255/0x2a0
Sep 16 06:44:54 XXXXXXXXXXXXXXXXXX kernel: [1350084.842168] [<ffffffff94d2c709>] ? ima_bprm_check+0x49/0x50
Sep 16 06:44:54 XXXXXXXXXXXXXXXXXX kernel: [1350084.842171] [<ffffffff94c544fa>] search_binary_handler+0x9a/0x1c0
Sep 16 06:44:54 XXXXXXXXXXXXXXXXXX kernel: [1350084.842174] [<ffffffff94c55c06>] do_execve_common.isra.24+0x616/0x880
Sep 16 06:44:54 XXXXXXXXXXXXXXXXXX kernel: [1350084.842183] [<ffffffff94c56109>] SyS_execve+0x29/0x30
Sep 16 06:44:54 XXXXXXXXXXXXXXXXXX kernel: [1350084.842345] [<ffffffff95193478>] stub_execve+0x48/0x80
Sep 16 06:44:54 XXXXXXXXXXXXXXXXXX kernel: [1350084.842346] Code: 00 00 e9 03 00 00 00 41 ff e7 e8 07 00 00 00 f3 90 0f ae e8 eb f9 4c 89 3c 24 c3 90 90 90 90 90 90 90 55 48 89 e5 f0 ff 07 f3 90 <83> 3f 01 75 f9 f0 ff 0f 75 f1 5d c3 90 66 2e 0f 1f 84 00 00 00
Any help or suggestions would be appreciated.