What we saw is a bunch of services start timing out, get reaped, and then suddenly, the Nagios install could no longer reap the core worker.
Code: Select all
[1600238831] Warning: Service performance data file processing command '/bin/mv /var/nagiosramdisk/service-perfdata /var/nagiosramdisk/spool/xidpe/1600238824.perfdata.service' timed out after 5 seconds
[1600238831] wproc: Core Worker 24332: job 108547 (pid=20424) timed out. Killing it
[1600238831] wproc: GLOBAL HOST EVENTHANDLER job 108547 from worker Core Worker 24332 timed out after 30.96s
[1600238831] wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
[1600238831] Warning: Global host event handler command '/usr/bin/php /usr/local/nagiosxi/scripts/handle_nagioscore_event.php --handler-type=host --host="cipolpre504" --hostaddress="XXXXXXXXXXX" --hoststate=UP --hoststateid=0 --lasthoststate=DOWN --lasthoststateid=1 --hoststatetype=SOFT --currentattempt=1 --maxattempts=4 --hosteventid=864175 --hostproblemid=0 --hostoutput="OK - XXXXXXXXXXX: rta 9.617ms, lost 0%" --longhostoutput="" --hostdowntime=0' timed out after 0.00 seconds
[1600238831] wproc: Core Worker 24332: tv.tv_sec is currently 1600238828
[1600238831] wproc: Core Worker 24332: Failed to reap child with pid 20424. Next attempt @ 1600238833.577880
[1600238831] wproc: Core Worker 24334: job 108547 (pid=20423) timed out. Killing it
[1600238831] wproc: GLOBAL HOST EVENTHANDLER job 108547 from worker Core Worker 24334 timed out after 32.55s
[1600238831] wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
[1600238831] Warning: Global host event handler command '/usr/bin/php /usr/local/nagiosxi/scripts/handle_nagioscore_event.php --handler-type=host --host="XXXXXXXXXXXXXXX" --hostaddress="XXXXXXXXX" --hoststate=UP --hoststateid=0 --lasthoststate=DOWN --lasthoststateid=1 --hoststatetype=SOFT --currentattempt=1 --maxattempts=4 --hosteventid=864174 --hostproblemid=0 --hostoutput="OK - XXXXXXXXXX: rta 15.214ms, lost 0%" --longhostoutput="" --hostdowntime=0' timed out after 0.00 seconds
[1600238831] wproc: Core Worker 24334: job 108547 (pid=20423): Dormant child reaped
[1600238831] wproc: Core Worker 24333: job 108528 (pid=20107) timed out. Killing it
[1600238831] wproc: GLOBAL SERVICE EVENTHANDLER job 108528 from worker Core Worker 24333 timed out after 35.93s
[1600238831] wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
[1600238831] Warning: Global service event handler command '/usr/bin/php /usr/local/nagiosxi/scripts/handle_nagioscore_event.php --handler-type=service --host="XXXXXXXXXXXXX" --service="XXXXXXXXXXXX" --hostaddress="XXXXXXXXXXX" --hoststate=UP --hoststateid=0 --hosteventid=832188 --hostproblemid=0 --servicestate=CRITICAL --servicestateid=2 --lastservicestate=OK --lastservicestateid=0 --servicestatetype=SOFT --currentattempt=1 --maxattempts=3 --serviceeventid=864167 --serviceproblemid=427842 --serviceoutput="CRITICAL - Socket timeout" --longserviceoutput="" --servicedowntime=0' timed out after 0.00 seconds
[1600238831] wproc: Core Worker 24333: tv.tv_sec is currently 1600238828
[1600238831] wproc: Core Worker 24333: Failed to reap child with pid 20107. Next attempt @ 1600238833.951373The timeouts seems to have happened at 6:40:40AM, and about 11 seconds later, we got a notice of a stuck CPU from VMWare. The system then started being unable to reap child workers at 6:47:11AM:
Code: Select all
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.658692] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [nagios:14492]
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.694377] Modules linked in: vsock_diag tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag binfmt_misc nfsv3 nfs_acl rpcsec_gss_krb5 n
fsv4 dns_resolver nfs lockd grace fscache ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 n
f_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_mangle iptable_security iptable_raw nf_conntrack li
bcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter vmw_vsock_vmci_transport vsock sb_edac iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf
128mul vmw_balloon glue_helper ablk_helper cryptd joydev pcspkr sg parport_pc parport vmw_vmci i2c_piix4 b9k_744109(POE) cbproxy_cbp_744_20200413(POE) auth_rpcgss sunrpc ip_tables ext4 mbcache jbd
2 sr_mod cdrom ata_generic pata_acpi vmwgfx sd_mod crc_t10dif crct10dif_generic drm_kms_helper syscopyarea ahci sysfillrect sysimgblt fb_sys_fops ttm libahci ata_piix drm nfit crct10dif_pclmul crc
t10dif_common crc32c_intel libata serio_raw vmxnet3 libnvdimm vmw_pvscsi drm_panel_orientation_quirks dm_mirror dm_region_hash dm_log dm_mod
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741148] CPU: 0 PID: 14492 Comm: nagios Kdump: loaded Tainted: P OE ------------ 3.10.0-1127.13.1.el7.x86_64 #1
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741185] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 12/12/2018
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741218] task: ffff9234066a8000 ti: ffff9232770bc000 task.ti: ffff9232770bc000
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741224] RIP: 0010:[<ffffffff94d966b9>] [<ffffffff94d966b9>] __write_lock_failed+0x9/0x20
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741355] RSP: 0000:ffff9232770bf998 EFLAGS: 00000297
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741358] RAX: ffff9232770bffd8 RBX: 0000000022a5ad9c RCX: 000000001367ca6e
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741360] RDX: ffffffffc0613860 RSI: 0000000050785f50 RDI: ffff9235eb9712f4
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741361] RBP: ffff9232770bf998 R08: 00000000001499b2 R09: 0000000000000000
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741362] R10: 00000000006aaaab R11: ffff9232770bf8ce R12: ffffffffc06101b8
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741374] R13: 0000000000000001 R14: ffffffffc06101b8 R15: 0000000000000800
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741376] FS: 00007f0a7b09d740(0000) GS:ffff9235ffc00000(0000) knlGS:0000000000000000
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741378] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741379] CR2: 00007fd6daffd410 CR3: 000000003870c000 CR4: 00000000001607f0
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.742085] Call Trace:
Sep 16 06:44:53 XXXXXXXXXXXXXXXXXX kernel: [1350084.742303] [<ffffffff9518854a>] _raw_write_lock_bh+0x2a/0x30
Sep 16 06:44:53 XXXXXXXXXXXXXXXXXX kernel: [1350084.816896] [<ffffffffc04fb953>] bit9_rwlock+0x72/0x88 [b9k_744109]
Sep 16 06:44:53 XXXXXXXXXXXXXXXXXX kernel: [1350084.818024] [<ffffffffc0478710>] _ZN12NativeRWLock4LockEv+0x1c/0x26 [b9k_744109]
Sep 16 06:44:53 XXXXXXXXXXXXXXXXXX kernel: [1350084.818234] [<ffffffffc04a433b>] _ZN8HoldLock4LockEv+0xb3/0x118 [b9k_744109]
Sep 16 06:44:53 XXXXXXXXXXXXXXXXXX kernel: [1350084.818366] [<ffffffffc04fbaeb>] ? bit9_readunlock+0x72/0x88 [b9k_744109]
Sep 16 06:44:53 XXXXXXXXXXXXXXXXXX kernel: [1350084.838025] [<ffffffffc05014a5>] _ZN8HoldLockC2ER10LockObjectNS0_13DesiredAccessEb+0x53/0x56 [b9k_744109]
Sep 16 06:44:53 XXXXXXXXXXXXXXXXXX kernel: [1350084.839252] [<ffffffffc0487fd8>] _ZN15ProcessTracking5CloneEP11ProcessInfo+0x44/0x7e8 [b9k_744109]
Sep 16 06:44:53 XXXXXXXXXXXXXXXXXX kernel: [1350084.839352] [<ffffffffc047b44f>] ? _ZN10RuleEngine20SearchForTrustedUserEPK13UnicodeString+0x6b/0x74 [b9k_744109]
Sep 16 06:44:53 XXXXXXXXXXXXXXXXXX kernel: [1350084.839561] [<ffffffffc04f32d6>] ? _ZN11ProcessInfo21PlatformSpecificSetupE21ProcessCreationSourceP14bit9_process_sPK15_UNICODE_STRINGPKcS5_PS_+0x2a4/0x364 [b9k_744109]
Sep 16 06:44:53 XXXXXXXXXXXXXXXXXX kernel: [1350084.841312] [<ffffffffc04f15fc>] _ZN15ProcessTracking33CreateProcessNotificationPostExecEP14bit9_process_sP17bit9_filestream_sPKcP11ProcessInfo+0x4aa/0x7a2 [b9k_744109]
Sep 16 06:44:53 XXXXXXXXXXXXXXXXXX kernel: [1350084.841448] [<ffffffffc04f3c77>] cbp_common_process_execute+0x235/0x25c [b9k_744109]
Sep 16 06:44:54 XXXXXXXXXXXXXXXXXX kernel: [1350084.841626] [<ffffffffc04f5fe1>] ? LSMSecurityHooksIsEnabled+0x7c/0xde [b9k_744109]
Sep 16 06:44:54 XXXXXXXXXXXXXXXXXX kernel: [1350084.841668] [<ffffffffc04f78e4>] init_def_mntns+0x18a1/0x1e05 [b9k_744109]
Sep 16 06:44:54 XXXXXXXXXXXXXXXXXX kernel: [1350084.841824] [<ffffffffc046ac92>] cbstub_lsm_bprm_set_creds+0x85/0xb9 [cbproxy_cbp_744_20200413]
Sep 16 06:44:54 XXXXXXXXXXXXXXXXXX kernel: [1350084.841943] [<ffffffff94d07419>] security_bprm_set_creds+0x19/0x20
Sep 16 06:44:54 XXXXXXXXXXXXXXXXXX kernel: [1350084.842010] [<ffffffff94c5428f>] prepare_binprm+0x9f/0x270
Sep 16 06:44:54 XXXXXXXXXXXXXXXXXX kernel: [1350084.842110] [<ffffffff94cb1ae5>] load_script+0x255/0x2a0
Sep 16 06:44:54 XXXXXXXXXXXXXXXXXX kernel: [1350084.842168] [<ffffffff94d2c709>] ? ima_bprm_check+0x49/0x50
Sep 16 06:44:54 XXXXXXXXXXXXXXXXXX kernel: [1350084.842171] [<ffffffff94c544fa>] search_binary_handler+0x9a/0x1c0
Sep 16 06:44:54 XXXXXXXXXXXXXXXXXX kernel: [1350084.842174] [<ffffffff94c55c06>] do_execve_common.isra.24+0x616/0x880
Sep 16 06:44:54 XXXXXXXXXXXXXXXXXX kernel: [1350084.842183] [<ffffffff94c56109>] SyS_execve+0x29/0x30
Sep 16 06:44:54 XXXXXXXXXXXXXXXXXX kernel: [1350084.842345] [<ffffffff95193478>] stub_execve+0x48/0x80
Sep 16 06:44:54 XXXXXXXXXXXXXXXXXX kernel: [1350084.842346] Code: 00 00 e9 03 00 00 00 41 ff e7 e8 07 00 00 00 f3 90 0f ae e8 eb f9 4c 89 3c 24 c3 90 90 90 90 90 90 90 55 48 89 e5 f0 ff 07 f3 90 <83> 3f 01 75 f9 f0 ff 0f 75 f1 5d c3 90 66 2e 0f 1f 84 00 00 00
Any help or suggestions would be appreciated.