Core Worker XXX seems to be choked

Locked
Fred Kroeger
Posts: 588
Joined: Wed Oct 19, 2011 11:36 pm
Location: Perth, Western Australia

Core Worker XXX seems to be choked

Post by Fred Kroeger »

I'm regularly seeing these "Core Worker ... seems to be choked" entries in the messages log.

Code: Select all

Jan 12 11:26:21 nagios-wp2 nagios: wproc: 'Core Worker 43306' seems to be choked. ret = -1; bufsize = 131: errno = 11 (Resource temporarily unavailable)
Jan 12 11:26:21 nagios-wp2 nagios: wproc: 'Core Worker 43306' seems to be choked. ret = -1; bufsize = 131: errno = 11 (Resource temporarily unavailable)
Jan 12 11:26:21 -nagios: wproc: 'Core Worker 43306' seems to be choked. ret = -1; bufsize = 140: errno = 11 (Resource temporarily unavailable)
Jan 12 11:26:21 -nagios: wproc: 'Core Worker 43306' seems to be choked. ret = -1; bufsize = 120: errno = 11 (Resource temporarily unavailable)
Jan 12 11:26:22 -nagios: wproc: 'Core Worker 43306' seems to be choked. ret = -1; bufsize = 145: errno = 11 (Resource temporarily unavailable)
Jan 12 11:26:22 -nagios: wproc: 'Core Worker 43306' seems to be choked. ret = -1; bufsize = 129: errno = 11 (Resource temporarily unavailable)
Jan 12 11:26:22 -nagios: wproc: 'Core Worker 43306' seems to be choked. ret = -1; bufsize = 132: errno = 11 (Resource temporarily unavailable)
Jan 12 11:26:23 -nagios: wproc: 'Core Worker 43306' seems to be choked. ret = -1; bufsize = 131: errno = 11 (Resource temporarily unavailable)
Jan 12 11:26:23 -nagios: wproc: 'Core Worker 43306' seems to be choked. ret = -1; bufsize = 145: errno = 11 (Resource temporarily unavailable)
Jan 12 11:26:23 -nagios: wproc: 'Core Worker 43306' seems to be choked. ret = -1; bufsize = 831: errno = 11 (Resource temporarily unavailable)
Jan 12 11:26:23 -nagios: wproc: 'Core Worker 43306' seems to be choked. ret = -1; bufsize = 621: errno = 11 (Resource temporarily unavailable)
About 3 hours before these entries I had about 100 similar entries.
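
For anyone wanting to quantify this, here is a rough sketch that tallies the choked entries per hour and per worker. It assumes the log lives at /var/log/messages and uses classic syslog timestamps like the excerpt above. The function name count_choked is only illustrative, and the month ordering relies on GNU sort.

```shell
#!/bin/bash
# Tally "seems to be choked" entries per hour and per worker.
# Assumes syslog lines shaped like "Jan 12 11:26:21 host nagios: wproc: ...".
count_choked() {
    local logfile="${1:-/var/log/messages}"   # default path is an assumption
    grep "seems to be choked" "$logfile" |
        awk '{
            split($3, t, ":")                   # $3 is HH:MM:SS
            match($0, /Core Worker [0-9]+/)     # worker name including its PID
            key = $1 " " $2 " " t[1] ":00 " substr($0, RSTART, RLENGTH)
            n[key]++
        }
        END { for (k in n) print k, n[k] }' |
        sort -k1,1M -k2,2n -k3,3                # month, day, hour
}
```

Running count_choked with no argument reads the default path; pass another file name to point it at a rotated log. Each output line is the date, the hour, the worker and the number of entries.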

This is the process

Code: Select all

# ps -ef | grep 43306
nagios   43306 43300  0 Jan11 ?        00:05:37 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
And this is the number of worker processes running:

Code: Select all

# ps -ef | grep worker
nagios   27143 43307  0 11:54 ?        00:00:00 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   43302 43300  0 Jan11 ?        00:04:04 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   43303 43300  1 Jan11 ?        00:12:08 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   43304 43300  0 Jan11 ?        00:09:06 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   43305 43300  2 Jan11 ?        00:20:48 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   43306 43300  0 Jan11 ?        00:05:39 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
nagios   43307 43300  0 Jan11 ?        00:08:54 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
This is on the server that I am having performance issues with. It's a Nagios XI 5.3.3 VM with 3 CPUs and 24 GB RAM provisioned.
Most of my checks have been offloaded to a ModGearman worker.

regards... Fred
rkennedy
Posts: 6579
Joined: Mon Oct 05, 2015 11:45 am

Re: Core Worker XXX seems to be choked

Post by rkennedy »

Could you PM over a profile for us to review? These messages are generally related to resources running low.

What sort of disks is the machine running on?

EDIT: profile received.
Former Nagios Employee
rkennedy
Posts: 6579
Joined: Mon Oct 05, 2015 11:45 am

Re: Core Worker XXX seems to be choked

Post by rkennedy »

Seeing quite the flood of messages like this in your profile -

Code: Select all

Jan 13 14:51:51 nagios-wp2 nagios: Error: Got check result for service 'NTP Time' on host 'log02.dms.ops'. Unable to find service

Code: Select all

[01-13-2017 14:23:10] NPCD: WARN: MAX load reached: load 34.410000/25.000000 at i=0
[01-13-2017 14:23:25] NPCD: WARN: MAX load reached: load 34.310000/25.000000 at i=1
[01-13-2017 14:23:40] NPCD: WARN: MAX load reached: load 28.140000/25.000000 at i=1
Your RAM appears to be fine, but the CPU is what concerns me. How many CPUs do you have allocated to the machine?
top - 14:52:04 up 17 days, 4:31, 1 user, load average: 6.07, 7.57, 9.86
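
To put a load average in context, it helps to divide it by the number of CPUs. The following is a minimal sketch; load_per_cpu is a hypothetical helper name, and the "above 1.0 per CPU means work is queuing" reading is only a rule of thumb, since Linux load also counts tasks in uninterruptible sleep (often disk I/O).

```shell
#!/bin/bash
# Print the 1/5/15-minute load averages divided by the CPU count.
# Usage: load_per_cpu "<load1> <load5> <load15>" <cpus>
load_per_cpu() {
    local loads="$1" cpus="$2"
    echo "$loads" |
        awk -v c="$cpus" '{ printf "%.2f %.2f %.2f\n", $1 / c, $2 / c, $3 / c }'
}

# On Linux the live values can be fed in like this:
#   load_per_cpu "$(cat /proc/loadavg)" "$(nproc)"
```

Values that stay well above 1.0 per CPU suggest the machine has more runnable (or I/O-blocked) work than it can service.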
Fred Kroeger
Posts: 588
Joined: Wed Oct 19, 2011 11:36 pm
Location: Perth, Western Australia

Re: Core Worker XXX seems to be choked

Post by Fred Kroeger »

Running with 4 CPUs, so a load average of 6 shouldn't really be an issue.
Maybe I should raise the NPCD load threshold a bit so that it doesn't keep having to catch up after it gets suspended?
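
For reference, the 25.000000 in the NPCD warnings earlier in the thread looks like NPCD's configured load limit. Below is a minimal sketch for reading it back. It assumes the setting is called load_threshold and that the file is /usr/local/nagios/etc/pnp/npcd.cfg; both are assumptions about this install, so check your own npcd.cfg. npcd_setting is a hypothetical helper name.

```shell
#!/bin/bash
# Print the value of a "key = value" setting from an NPCD-style config file.
npcd_setting() {
    local key="$1"
    local cfg="${2:-/usr/local/nagios/etc/pnp/npcd.cfg}"   # path is an assumption
    awk -F'=' -v k="$key" '
        /^[[:space:]]*#/ { next }                          # skip comment lines
        { name = $1; gsub(/[[:space:]]/, "", name) }       # key without whitespace
        name == k {
            val = $2
            gsub(/^[[:space:]]+|[[:space:]]+$/, "", val)   # trim the value
            print val
        }
    ' "$cfg"
}

# Example: npcd_setting load_threshold
```

Raising the limit only stops NPCD from pausing; it does not explain why the load gets that high in the first place.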
rkennedy
Posts: 6579
Joined: Mon Oct 05, 2015 11:45 am

Re: Core Worker XXX seems to be choked

Post by rkennedy »

I think the main issue here is figuring out why the load is spiking so high. Are you able to set up an event handler on a check that monitors the Nagios machine's CPU? Something like this will help to see what's spiking so drastically.

Code: Select all

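#!/bin/bash
# Append a timestamp and the top CPU consumers to a log file.
# A relative path is resolved against the caller's working directory.
log=checktopcpu.txt
{
    date                                   # when the sample was taken
    ps -eo pcpu,args --sort=-%cpu | head   # header plus top 9 processes by CPU
    printf '\n\n'                          # blank lines between samples
} >> "$log"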
This will create a log file for you to look at, and should give some insight into what's going on.
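
Once a few samples have been collected, the log can be summarised to see which commands keep showing up at the top. This is a rough sketch, assuming the log has the layout the script above produces (a date line, the ps header, then lines starting with a %CPU figure); summarise_topcpu is a hypothetical helper name.

```shell
#!/bin/bash
# Sum the %CPU column per command across all samples in the log.
# Output columns: total %CPU, number of appearances, command.
summarise_topcpu() {
    local log="${1:-checktopcpu.txt}"
    awk '
        # Only process lines whose first field is a number (the ps rows).
        $1 ~ /^[0-9]+(\.[0-9]+)?$/ && NF >= 2 { cpu[$2] += $1; seen[$2]++ }
        END { for (c in cpu) printf "%.1f %d %s\n", cpu[c], seen[c], c }
    ' "$log" | sort -rn | head
}
```

Only the first word of each command line is used as the key, so different invocations of the same binary are grouped together.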