Core Worker timed out and failed to reap child

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
hbouma
Posts: 483
Joined: Tue Feb 27, 2018 9:31 am

Core Worker timed out and failed to reap child

Post by hbouma »

We experienced an issue this morning with one of our Nagios XI instances (5.6.10, running on RHEL 7 VMs with 8 CPUs and 16 GB of RAM). We have the DB offloaded and are running with a RAMDisk.

What we saw is a bunch of services start timing out and get reaped, and then suddenly the core workers could no longer reap their children.

Code: Select all

[1600238831] Warning: Service performance data file processing command '/bin/mv /var/nagiosramdisk/service-perfdata /var/nagiosramdisk/spool/xidpe/1600238824.perfdata.service' timed out after 5 seconds
[1600238831] wproc: Core Worker 24332: job 108547 (pid=20424) timed out. Killing it
[1600238831] wproc: GLOBAL HOST EVENTHANDLER job 108547 from worker Core Worker 24332 timed out after 30.96s
[1600238831] wproc:   early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
[1600238831] Warning: Global host event handler command '/usr/bin/php /usr/local/nagiosxi/scripts/handle_nagioscore_event.php --handler-type=host --host="cipolpre504" --hostaddress="XXXXXXXXXXX" --hoststate=UP --hoststateid=0 --lasthoststate=DOWN --lasthoststateid=1 --hoststatetype=SOFT --currentattempt=1 --maxattempts=4 --hosteventid=864175 --hostproblemid=0 --hostoutput="OK - XXXXXXXXXXX: rta 9.617ms, lost 0%" --longhostoutput="" --hostdowntime=0' timed out after 0.00 seconds
[1600238831] wproc: Core Worker 24332: tv.tv_sec is currently 1600238828
[1600238831] wproc: Core Worker 24332: Failed to reap child with pid 20424. Next attempt @ 1600238833.577880
[1600238831] wproc: Core Worker 24334: job 108547 (pid=20423) timed out. Killing it
[1600238831] wproc: GLOBAL HOST EVENTHANDLER job 108547 from worker Core Worker 24334 timed out after 32.55s
[1600238831] wproc:   early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
[1600238831] Warning: Global host event handler command '/usr/bin/php /usr/local/nagiosxi/scripts/handle_nagioscore_event.php --handler-type=host --host="XXXXXXXXXXXXXXX" --hostaddress="XXXXXXXXX" --hoststate=UP --hoststateid=0 --lasthoststate=DOWN --lasthoststateid=1 --hoststatetype=SOFT --currentattempt=1 --maxattempts=4 --hosteventid=864174 --hostproblemid=0 --hostoutput="OK - XXXXXXXXXX: rta 15.214ms, lost 0%" --longhostoutput="" --hostdowntime=0' timed out after 0.00 seconds
[1600238831] wproc: Core Worker 24334: job 108547 (pid=20423): Dormant child reaped
[1600238831] wproc: Core Worker 24333: job 108528 (pid=20107) timed out. Killing it
[1600238831] wproc: GLOBAL SERVICE EVENTHANDLER job 108528 from worker Core Worker 24333 timed out after 35.93s
[1600238831] wproc:   early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
[1600238831] Warning: Global service event handler command '/usr/bin/php /usr/local/nagiosxi/scripts/handle_nagioscore_event.php --handler-type=service --host="XXXXXXXXXXXXX" --service="XXXXXXXXXXXX" --hostaddress="XXXXXXXXXXX" --hoststate=UP --hoststateid=0 --hosteventid=832188 --hostproblemid=0 --servicestate=CRITICAL --servicestateid=2 --lastservicestate=OK --lastservicestateid=0 --servicestatetype=SOFT --currentattempt=1 --maxattempts=3 --serviceeventid=864167 --serviceproblemid=427842 --serviceoutput="CRITICAL - Socket timeout" --longserviceoutput="" --servicedowntime=0' timed out after 0.00 seconds
[1600238831] wproc: Core Worker 24333: tv.tv_sec is currently 1600238828
[1600238831] wproc: Core Worker 24333: Failed to reap child with pid 20107. Next attempt @ 1600238833.951373
In each case, it seems to be an agentless check (LDAP monitor, website check, SNMP polling, host ICMP checks, etc.).

The timeouts seem to have happened at 6:40:40 AM, and about 11 seconds later we got a notice of a stuck CPU from VMware. The system then started being unable to reap child workers at 6:47:11 AM:

Code: Select all

Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.658692] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [nagios:14492]
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.694377] Modules linked in: vsock_diag tcp_diag udp_diag inet_diag unix_diag af_packet_diag netlink_diag binfmt_misc nfsv3 nfs_acl rpcsec_gss_krb5 nfsv4 dns_resolver nfs lockd grace fscache ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat iptable_mangle iptable_security iptable_raw nf_conntrack libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter vmw_vsock_vmci_transport vsock sb_edac iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul vmw_balloon glue_helper ablk_helper cryptd joydev pcspkr sg parport_pc parport vmw_vmci i2c_piix4 b9k_744109(POE) cbproxy_cbp_744_20200413(POE) auth_rpcgss sunrpc ip_tables ext4 mbcache jbd2 sr_mod cdrom ata_generic pata_acpi vmwgfx sd_mod crc_t10dif crct10dif_generic drm_kms_helper syscopyarea ahci sysfillrect sysimgblt fb_sys_fops ttm libahci ata_piix drm nfit crct10dif_pclmul crct10dif_common crc32c_intel libata serio_raw vmxnet3 libnvdimm vmw_pvscsi drm_panel_orientation_quirks dm_mirror dm_region_hash dm_log dm_mod
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741148] CPU: 0 PID: 14492 Comm: nagios Kdump: loaded Tainted: P           OE  ------------   3.10.0-1127.13.1.el7.x86_64 #1
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741185] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 12/12/2018
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741218] task: ffff9234066a8000 ti: ffff9232770bc000 task.ti: ffff9232770bc000
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741224] RIP: 0010:[<ffffffff94d966b9>]  [<ffffffff94d966b9>] __write_lock_failed+0x9/0x20
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741355] RSP: 0000:ffff9232770bf998  EFLAGS: 00000297
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741358] RAX: ffff9232770bffd8 RBX: 0000000022a5ad9c RCX: 000000001367ca6e
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741360] RDX: ffffffffc0613860 RSI: 0000000050785f50 RDI: ffff9235eb9712f4
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741361] RBP: ffff9232770bf998 R08: 00000000001499b2 R09: 0000000000000000
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741362] R10: 00000000006aaaab R11: ffff9232770bf8ce R12: ffffffffc06101b8
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741374] R13: 0000000000000001 R14: ffffffffc06101b8 R15: 0000000000000800
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741376] FS:  00007f0a7b09d740(0000) GS:ffff9235ffc00000(0000) knlGS:0000000000000000
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX  kernel: [1350084.741378] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.741379] CR2: 00007fd6daffd410 CR3: 000000003870c000 CR4: 00000000001607f0
Sep 16 06:44:51 XXXXXXXXXXXXXXXXXX kernel: [1350084.742085] Call Trace:
Sep 16 06:44:53 XXXXXXXXXXXXXXXXXX kernel: [1350084.742303]  [<ffffffff9518854a>] _raw_write_lock_bh+0x2a/0x30
Sep 16 06:44:53 XXXXXXXXXXXXXXXXXX kernel: [1350084.816896]  [<ffffffffc04fb953>] bit9_rwlock+0x72/0x88 [b9k_744109]
Sep 16 06:44:53 XXXXXXXXXXXXXXXXXX kernel: [1350084.818024]  [<ffffffffc0478710>] _ZN12NativeRWLock4LockEv+0x1c/0x26 [b9k_744109]
Sep 16 06:44:53 XXXXXXXXXXXXXXXXXX  kernel: [1350084.818234]  [<ffffffffc04a433b>] _ZN8HoldLock4LockEv+0xb3/0x118 [b9k_744109]
Sep 16 06:44:53 XXXXXXXXXXXXXXXXXX kernel: [1350084.818366]  [<ffffffffc04fbaeb>] ? bit9_readunlock+0x72/0x88 [b9k_744109]
Sep 16 06:44:53 XXXXXXXXXXXXXXXXXX  kernel: [1350084.838025]  [<ffffffffc05014a5>] _ZN8HoldLockC2ER10LockObjectNS0_13DesiredAccessEb+0x53/0x56 [b9k_744109]
Sep 16 06:44:53 XXXXXXXXXXXXXXXXXX kernel: [1350084.839252]  [<ffffffffc0487fd8>] _ZN15ProcessTracking5CloneEP11ProcessInfo+0x44/0x7e8 [b9k_744109]
Sep 16 06:44:53 XXXXXXXXXXXXXXXXXX kernel: [1350084.839352]  [<ffffffffc047b44f>] ? _ZN10RuleEngine20SearchForTrustedUserEPK13UnicodeString+0x6b/0x74 [b9k_744109]
Sep 16 06:44:53 XXXXXXXXXXXXXXXXXX kernel: [1350084.839561]  [<ffffffffc04f32d6>] ? _ZN11ProcessInfo21PlatformSpecificSetupE21ProcessCreationSourceP14bit9_process_sPK15_UNICODE_STRINGPKcS5_PS_+0x2a4/0x364 [b9k_744109]
Sep 16 06:44:53 XXXXXXXXXXXXXXXXXX kernel: [1350084.841312]  [<ffffffffc04f15fc>] _ZN15ProcessTracking33CreateProcessNotificationPostExecEP14bit9_process_sP17bit9_filestream_sPKcP11ProcessInfo+0x4aa/0x7a2 [b9k_744109]
Sep 16 06:44:53 XXXXXXXXXXXXXXXXXX kernel: [1350084.841448]  [<ffffffffc04f3c77>] cbp_common_process_execute+0x235/0x25c [b9k_744109]
Sep 16 06:44:54 XXXXXXXXXXXXXXXXXX kernel: [1350084.841626]  [<ffffffffc04f5fe1>] ? LSMSecurityHooksIsEnabled+0x7c/0xde [b9k_744109]
Sep 16 06:44:54 XXXXXXXXXXXXXXXXXX kernel: [1350084.841668]  [<ffffffffc04f78e4>] init_def_mntns+0x18a1/0x1e05 [b9k_744109]
Sep 16 06:44:54 XXXXXXXXXXXXXXXXXX kernel: [1350084.841824]  [<ffffffffc046ac92>] cbstub_lsm_bprm_set_creds+0x85/0xb9 [cbproxy_cbp_744_20200413]
Sep 16 06:44:54 XXXXXXXXXXXXXXXXXX kernel: [1350084.841943]  [<ffffffff94d07419>] security_bprm_set_creds+0x19/0x20
Sep 16 06:44:54 XXXXXXXXXXXXXXXXXX kernel: [1350084.842010]  [<ffffffff94c5428f>] prepare_binprm+0x9f/0x270
Sep 16 06:44:54 XXXXXXXXXXXXXXXXXX kernel: [1350084.842110]  [<ffffffff94cb1ae5>] load_script+0x255/0x2a0
Sep 16 06:44:54 XXXXXXXXXXXXXXXXXX kernel: [1350084.842168]  [<ffffffff94d2c709>] ? ima_bprm_check+0x49/0x50
Sep 16 06:44:54 XXXXXXXXXXXXXXXXXX kernel: [1350084.842171]  [<ffffffff94c544fa>] search_binary_handler+0x9a/0x1c0
Sep 16 06:44:54 XXXXXXXXXXXXXXXXXX  kernel: [1350084.842174]  [<ffffffff94c55c06>] do_execve_common.isra.24+0x616/0x880
Sep 16 06:44:54 XXXXXXXXXXXXXXXXXX kernel: [1350084.842183]  [<ffffffff94c56109>] SyS_execve+0x29/0x30
Sep 16 06:44:54 XXXXXXXXXXXXXXXXXX kernel: [1350084.842345]  [<ffffffff95193478>] stub_execve+0x48/0x80
Sep 16 06:44:54 XXXXXXXXXXXXXXXXXX kernel: [1350084.842346] Code: 00 00 e9 03 00 00 00 41 ff e7 e8 07 00 00 00 f3 90 0f ae e8 eb f9 4c 89 3c 24 c3 90 90 90 90 90 90 90 55 48 89 e5 f0 ff 07 f3 90 <83> 3f 01 75 f9 f0 ff 0f 75 f1 5d c3 90 66 2e 0f 1f 84 00 00 00

Any help or suggestions would be appreciated.
benjaminsmith
Posts: 5324
Joined: Wed Aug 22, 2018 4:39 pm
Location: saint paul

Re: Core Worker timed out and failed to reap child

Post by benjaminsmith »

Hi Henry,

What type of check volume (hosts + services) are you running on this system? For the most part, agentless checks are going to require more system resources than checks executed using an agent like NCPA.

I would take a look at the settings in nagios.cfg and review any modifications. You can also try tweaking the reaper settings to improve performance, for example:

Code: Select all

check_result_reaper_frequency=3
max_check_result_reaper_time=10
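If you want to confirm what your running config actually contains before changing anything, something like the following should work. This is just a sketch: it greps a throwaway sample file, so on a real system point CFG at your actual nagios.cfg (typically /usr/local/nagios/etc/nagios.cfg on a default XI install):

```shell
# Sketch: show the current reaper-related settings in a nagios.cfg-style file.
# Replace the sample file with /usr/local/nagios/etc/nagios.cfg on a real box.
CFG=$(mktemp)
cat > "$CFG" <<'EOF'
check_result_reaper_frequency=3
max_check_result_reaper_time=10
EOF
grep -E '^(check_result_reaper_frequency|max_check_result_reaper_time)' "$CFG"
rm -f "$CFG"
```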
Reference:
Maximizing Performance In Nagios XI
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.

Be sure to check out our Knowledgebase for helpful articles and solutions!
hbouma
Posts: 483
Joined: Tue Feb 27, 2018 9:31 am

Re: Core Worker timed out and failed to reap child

Post by hbouma »

Thank you for the response.

Our systems are already set up with those values. We also went through the document you provided when we first configured our Nagios XI instances. All changes have been applied, with the exception of the Backend Cache settings, as real-time data is extremely important to us. RAMDisk, offloaded DB, and rrdcached are all implemented.

We have 721 hosts and 6,369 services. About 98% of our checks are active, with around 600 of those being agentless; the remaining checks are all passive.

nagios.cfg:
max_check_result_reaper_time=10
check_result_reaper_frequency=3

Load looks like this:
2020-09-17 11_59_32-Nagios XI.png
Last edited by hbouma on Thu Sep 17, 2020 12:05 pm, edited 2 times in total.
hbouma
Posts: 483
Joined: Tue Feb 27, 2018 9:31 am

Re: Core Worker timed out and failed to reap child

Post by hbouma »

Monitoring and Server stats
2020-09-17 12_13_33-Nagios XI.png
hbouma
Posts: 483
Joined: Tue Feb 27, 2018 9:31 am

Re: Core Worker timed out and failed to reap child

Post by hbouma »

CPU usage on our box has always been tricky. The most recent upgrade of Nagios Fusion has helped by only polling when users are logged in (previously, it would take between 40% and 60% of the CPU when Fusion polled for our 300+ users).
2020-09-17 12_21_23-Nagios XI.png
hbouma
Posts: 483
Joined: Tue Feb 27, 2018 9:31 am

Re: Core Worker timed out and failed to reap child

Post by hbouma »

Results of a top command. The b9daemon is the Carbon Black agent that our IT Security department has insisted we install.
2020-09-17 12_24_25-mRemoteNG - confCons.xml - Nagios UAT OK 0101.png
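If you want numbers rather than a screenshot, you can capture the top CPU consumers in batch mode and log them over time. A minimal sketch (the b9daemon process name is taken from your top output, so adjust if yours differs):

```shell
# List the top CPU consumers, sorted by CPU share, in a scriptable form.
# Run periodically (e.g. from cron) to see whether b9daemon spikes
# around the times the workers fail to reap children.
ps -eo pcpu,pmem,comm --sort=-pcpu | head -n 10
```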
benjaminsmith
Posts: 5324
Joined: Wed Aug 22, 2018 4:39 pm
Location: saint paul

Re: Core Worker timed out and failed to reap child

Post by benjaminsmith »

Hi Henry,

We'd have to go through the Nagios logs to see if there's some kind of pattern here, but the error may just have been related to a specific network outage that happened at 6:40:40 AM.

Alternatively, there may be too many checks scheduled close together, loading up the server. I would review your check intervals and spread them out as much as possible to decrease the load.

Regards,
Benjamin
hbouma
Posts: 483
Joined: Tue Feb 27, 2018 9:31 am

Re: Core Worker timed out and failed to reap child

Post by hbouma »

Would you like me to send you a profile so you can review the logs? I didn't see anything.

Checks cannot be spaced out any more than they currently are; the timeframe for reporting on issues does not allow for that.

We had no network outage reported and nothing in our logs to show network issues.
benjaminsmith
Posts: 5324
Joined: Wed Aug 22, 2018 4:39 pm
Location: saint paul

Re: Core Worker timed out and failed to reap child

Post by benjaminsmith »

Hi Henry,

I could certainly look at the Nagios logs from that day. Past Nagios logs are located in the following folder:

Code: Select all

/usr/local/nagios/var/archives
These logs are rotated every 24 hours around midnight, so if the date on the log is 09-09-2020, it will contain the entries for 09-08-2020.
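To pull just the reap failures out of an archived log, a grep along these lines should do it. This sketch builds a throwaway sample directory so it runs anywhere; on a real system, point ARCHIVES at /usr/local/nagios/var/archives and narrow the glob to the date you care about:

```shell
# Sketch: find worker reap failures in archived Nagios logs.
# On a real system, use ARCHIVES=/usr/local/nagios/var/archives instead.
ARCHIVES=$(mktemp -d)
cat > "$ARCHIVES/nagios-09-16-2020-00.log" <<'EOF'
[1600238831] wproc: Core Worker 24332: Failed to reap child with pid 20424. Next attempt @ 1600238833.577880
[1600238831] wproc: Core Worker 24334: job 108547 (pid=20423): Dormant child reaped
EOF
grep -h 'Failed to reap child' "$ARCHIVES"/*.log
rm -rf "$ARCHIVES"
```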

Additionally, you can download a fresh system profile and I can review it for any current errors.

Regards,
Benjamin

To send us your system profile:
1. Log in to the Nagios XI GUI using a web browser.
2. Click the "Admin" > "System Profile" menu.
3. Click the "Download Profile" button.
4. Save the profile.zip file and share it in a private message or upload it to the post/ticket, then reply to this post to bring it up in the queue.
hbouma
Posts: 483
Joined: Tue Feb 27, 2018 9:31 am

Re: Core Worker timed out and failed to reap child

Post by hbouma »

PM sent
Locked