Core Worker XXX seems to be choked

Posted: Wed Jan 11, 2017 10:59 pm
by Fred Kroeger
I'm regularly seeing these "Core Worker ... seems to be choked" entries in the messages log.

Code:

Jan 12 11:26:21 nagios-wp2 nagios: wproc: 'Core Worker 43306' seems to be choked. ret = -1; bufsize = 131: errno = 11 (Resource temporarily unavailable)
Jan 12 11:26:21 nagios-wp2 nagios: wproc: 'Core Worker 43306' seems to be choked. ret = -1; bufsize = 131: errno = 11 (Resource temporarily unavailable)
Jan 12 11:26:21 -nagios: wproc: 'Core Worker 43306' seems to be choked. ret = -1; bufsize = 140: errno = 11 (Resource temporarily unavailable)
Jan 12 11:26:21 -nagios: wproc: 'Core Worker 43306' seems to be choked. ret = -1; bufsize = 120: errno = 11 (Resource temporarily unavailable)
Jan 12 11:26:22 -nagios: wproc: 'Core Worker 43306' seems to be choked. ret = -1; bufsize = 145: errno = 11 (Resource temporarily unavailable)
Jan 12 11:26:22 -nagios: wproc: 'Core Worker 43306' seems to be choked. ret = -1; bufsize = 129: errno = 11 (Resource temporarily unavailable)
Jan 12 11:26:22 -nagios: wproc: 'Core Worker 43306' seems to be choked. ret = -1; bufsize = 132: errno = 11 (Resource temporarily unavailable)
Jan 12 11:26:23 -nagios: wproc: 'Core Worker 43306' seems to be choked. ret = -1; bufsize = 131: errno = 11 (Resource temporarily unavailable)
Jan 12 11:26:23 -nagios: wproc: 'Core Worker 43306' seems to be choked. ret = -1; bufsize = 145: errno = 11 (Resource temporarily unavailable)
Jan 12 11:26:23 -nagios: wproc: 'Core Worker 43306' seems to be choked. ret = -1; bufsize = 831: errno = 11 (Resource temporarily unavailable)
Jan 12 11:26:23 -nagios: wproc: 'Core Worker 43306' seems to be choked. ret = -1; bufsize = 621: errno = 11 (Resource temporarily unavailable)

About three hours before these entries, I had roughly 100 similar entries.
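
To put a number on bursts like this, the entries can be counted per worker PID. This is only a rough sketch: the helper name `count_choked` is made up for illustration, and the log path in the usage comment is an assumption that depends on the distribution.

```shell
#!/bin/bash
# Count "seems to be choked" entries per worker PID.
# Reads syslog-style lines on stdin and prints "<pid> <count>" per worker.
count_choked() {
    grep "seems to be choked" \
        | sed -n "s/.*'Core Worker \([0-9]*\)'.*/\1/p" \
        | sort | uniq -c | awk '{ print $2, $1 }'
}

# Usage (the path is an assumption; adjust it for your system):
# count_choked < /var/log/messages
```

If a single PID accounts for nearly all of the entries, that points at one worker rather than the whole pool.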

This is the process

Code:

# ps -ef | grep 43306
nagios   43306 43300  0 Jan11 ?        00:05:37 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh

And this is the number of worker processes running:

Code:

# ps -ef | grep worker
nagios   27143 43307  0 11:54 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   43302 43300  0 Jan11 ?        00:04:04 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   43303 43300  1 Jan11 ?        00:12:08 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   43304 43300  0 Jan11 ?        00:09:06 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   43305 43300  2 Jan11 ?        00:20:48 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   43306 43300  0 Jan11 ?        00:05:39 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   43307 43300  0 Jan11 ?        00:08:54 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh

This is the server I am having performance issues with. It's a Nagios XI 5.3.3 VM with 3 CPUs and 24 GB of RAM provisioned.
Most of my checks have been offloaded to a ModGearman worker.

regards... Fred

Re: Core Worker XXX seems to be choked

Posted: Thu Jan 12, 2017 10:30 am
by rkennedy
Could you PM a profile over for us to review? This message is generally related to resources running low.

What sort of disks is the machine running on?

EDIT: profile received.

Re: Core Worker XXX seems to be choked

Posted: Fri Jan 13, 2017 10:19 am
by rkennedy
I'm seeing quite a flood of messages like these in your profile:

Code:

Jan 13 14:51:51 nagios-wp2 nagios: Error: Got check result for service 'NTP Time' on host 'log02.dms.ops'. Unable to find service

Code:

[01-13-2017 14:23:10] NPCD: WARN: MAX load reached: load 34.410000/25.000000 at i=0
[01-13-2017 14:23:25] NPCD: WARN: MAX load reached: load 34.310000/25.000000 at i=1
[01-13-2017 14:23:40] NPCD: WARN: MAX load reached: load 28.140000/25.000000 at i=1

Your RAM appears to be fine, but the CPU is what concerns me. How many CPUs do you have allocated to the machine?

top - 14:52:04 up 17 days, 4:31, 1 user, load average: 6.07, 7.57, 9.86

Re: Core Worker XXX seems to be choked

Posted: Mon Jan 23, 2017 9:24 pm
by Fred Kroeger
Running with 4 CPUs, so a load average of 6 shouldn't really be an issue.
Maybe I should raise the npcd threshold so that it doesn't keep having to catch up after it gets suspended?
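
For reference, a quick way to put the load average and the CPU count side by side. This is a minimal sketch that assumes a Linux host with `/proc/loadavg` and `nproc` available; the helper name `load_per_cpu` is made up for illustration.

```shell
#!/bin/bash
# Print LOAD divided by CPUS with two decimals.
# A value well above 1.00 means more runnable work than cores to run it.
load_per_cpu() {
    awk -v load="$1" -v cpus="$2" 'BEGIN { printf "%.2f\n", load / cpus }'
}

# Report the current 1-minute load average against the CPU count.
if [ -r /proc/loadavg ]; then
    cpus=$(nproc)
    load=$(cut -d' ' -f1 /proc/loadavg)
    echo "load=$load cpus=$cpus per_cpu=$(load_per_cpu "$load" "$cpus")"
fi
```

With the figures quoted in this thread, `load_per_cpu 6.07 4` gives 1.52, while the NPCD spike of 34.41 works out to 8.60 per CPU.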

Re: Core Worker XXX seems to be choked

Posted: Tue Jan 24, 2017 10:40 am
by rkennedy
I think the main issue here is figuring out why the load is spiking so high. Are you able to set up an event handler that monitors the Nagios machine's CPU? Something like this will help show what's spiking so drastically.

Code:

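#!/bin/bash
# Append a timestamp and the top CPU consumers to checktopcpu.txt
# (created in whatever directory the script is run from).
date=$(date)
echo "$date" >> checktopcpu.txt
# List processes sorted by CPU usage, highest first; head keeps the first ten lines.
ps -eo pcpu,args --sort=-%cpu | head >> checktopcpu.txt
# Blank lines to separate one sample from the next.
echo -e "\n" >> checktopcpu.txt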

This will create a log file for you to look at and should give some insight into what's going on.
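
If the script runs on a timer rather than from an event handler, one variation is to record a sample only while the load is above a limit, so the log stays small. This is a sketch under assumptions: the `LIMIT` and `LOG` defaults and the helper name `load_exceeds` are illustrative and not taken from this thread, and it expects a Linux host with `/proc/loadavg`.

```shell
#!/bin/bash
# Log the top CPU consumers only while the 1-minute load is above LIMIT.
# LIMIT and LOG are illustrative defaults; override them in the environment.
LIMIT=${LIMIT:-10}
LOG=${LOG:-/tmp/checktopcpu.txt}

# Succeeds (exit status 0) when the first number is greater than the second.
load_exceeds() {
    awk -v load="$1" -v limit="$2" 'BEGIN { exit !(load > limit) }'
}

if [ -r /proc/loadavg ]; then
    load=$(cut -d' ' -f1 /proc/loadavg)
    if load_exceeds "$load" "$LIMIT"; then
        {
            date
            echo "load: $load"
            ps -eo pcpu,args --sort=-%cpu | head
            echo
        } >> "$LOG"
    fi
fi
```

The comparison is done in awk because bash arithmetic cannot compare fractional values such as 34.41.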