Re: Core Worker timed out and failed to reap child

Posted: Tue Sep 22, 2020 9:06 am
by hbouma
If we do have a situation where the server is just overloaded, what types of symptoms would we be seeing?

Re: Core Worker timed out and failed to reap child

Posted: Wed Sep 23, 2020 10:49 am
by benjaminsmith
Hi Henry,

You would see a high CPU load and high check latency, since Nagios is unable to schedule the checks on time. This is what I'm seeing in the system profile.

Top Command Shows High CPU Load

Code:

top - 07:54:43 up 5 days, 23:21,  1 user,  load average: 320.75, 313.22, 406.45
Tasks: 592 total, 177 running, 411 sleeping,   0 stopped,   4 zombie

Kernel Message Queues (slow to write results to the database)

Code:

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages    
0xd7000002 17         nagios     600        12948480     12645

I noticed that you have quite a few NCPA checks set up with an interval of 3 minutes. If you can increase it to 5 minutes, that would help substantially.
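
To put a rough number on the interval change: if the same checks run every 5 minutes instead of every 3, the average number of check executions per minute drops by 40%. The sketch below only illustrates that arithmetic. The check count is a made-up placeholder (the thread does not say how many NCPA checks exist), and it assumes the checks are spread evenly over the interval.

```python
def checks_per_minute(num_checks: int, interval_minutes: float) -> float:
    """Average check executions per minute for checks run on a fixed interval."""
    return num_checks / interval_minutes


# Hypothetical count: the actual number of NCPA checks is not given in the thread.
num_checks = 3000

before = checks_per_minute(num_checks, 3)  # current 3-minute interval
after = checks_per_minute(num_checks, 5)   # suggested 5-minute interval
reduction = 1 - after / before             # fraction of executions saved

print(f"{before:.0f} -> {after:.0f} checks/min ({reduction:.0%} fewer)")
```

The 40% figure does not depend on the placeholder count, since it is just 1 - 3/5.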

The system has 8 single-core CPUs. If you're able to add more, that would help, since scheduling active host and service checks is CPU intensive.
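
As a rough way to read the load figure against the CPU count, the load average can be divided by the number of cores. On Linux, a sustained value well above 1 per core generally means tasks are waiting for CPU time (or stuck in uninterruptible I/O). The sketch below parses the `top` header line quoted earlier in this post. The function name and regular expression are illustrative, not part of any Nagios tooling.

```python
import re

# Header line copied from the top output earlier in this post.
TOP_LINE = "top - 07:54:43 up 5 days, 23:21,  1 user,  load average: 320.75, 313.22, 406.45"
CPU_CORES = 8  # from the thread: 8 single-core CPUs


def load_averages(top_header: str) -> tuple:
    """Return the (1-min, 5-min, 15-min) load averages from a top header line."""
    match = re.search(r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)", top_header)
    if match is None:
        raise ValueError("no load average found in line")
    return tuple(float(value) for value in match.groups())


one_min, five_min, fifteen_min = load_averages(TOP_LINE)
per_core = one_min / CPU_CORES

print(f"1-minute load per core: {per_core:.1f}")
```

With the numbers from this system, that works out to about 40 per core, so adding CPUs alone may not be enough without also reducing the check load.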

Other options to reduce load would be to set up passive checks or integrate Mod-Gearman.

Using NCPA For Passive Check
Integrating Mod-Gearman With Nagios XI

Re: Core Worker timed out and failed to reap child

Posted: Thu Sep 24, 2020 6:42 am
by hbouma
I have reached out to our server team. They will be adding CPUs next week Friday and the following Friday.

Re: Core Worker timed out and failed to reap child

Posted: Thu Sep 24, 2020 3:27 pm
by benjaminsmith
Hi Henry,

"I have reached out to our server team. They will be adding CPUs next week Friday and the following Friday."

Great. Let us know the results.