Continuous nagios: wproc: 'Core Worker XXXX' seems to be choked

tonyleatwork · Post by **tonyleatwork** » Wed Jan 21, 2015 3:46 pm

Hi -

After a reboot I started getting a lot of:

Jan 21 15:27:19 nwd2ng01 nagios: wproc: 'Core Worker 1835' seems to be choked. ret = -1; bufsize = 180: errno = 11 (Resource temporarily unavailable)
Jan 21 15:27:19 nwd2ng01 nagios: Unable to run check for service 'Page File Usage' on host 'hostname.corp.com'

The issue persists for a while, goes away, and then comes back. I'm not sure what the trigger is. /var/log/messages grew to 32 MB today alone.
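For readers unfamiliar with the error code in that log line: on Linux, errno 11 is EAGAIN ("Resource temporarily unavailable"), which a non-blocking write returns when the receiving side is not draining its buffer fast enough. The sketch below is not Nagios code; it only reproduces the same errno with an ordinary pipe, under the assumption that the "choked" message reflects a similar full-buffer condition between the core and a worker. The function name `fill_pipe_until_eagain` is made up for this illustration.

```python
import errno
import os


def fill_pipe_until_eagain():
    """Write to a non-blocking pipe that nobody reads until the kernel refuses more data.

    Returns (errno_value, bytes_written_before_failure).
    """
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)  # writes now fail instead of waiting
    written = 0
    try:
        while True:
            # 4096 bytes is within PIPE_BUF on Linux, so each write is all-or-nothing.
            written += os.write(write_fd, b"x" * 4096)
    except BlockingIOError as exc:
        # The pipe buffer is full and nothing is reading from it.
        return exc.errno, written
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    code, written = fill_pipe_until_eagain()
    print(f"errno = {code} ({os.strerror(code)}) after {written} bytes")
```

If the same mechanism is at work here, the message would point at workers that are too busy to read new jobs rather than at a failure of the check itself.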

Profile is below:

Nagios XI Installation Profile
System:
Nagios XI Version : 2014R1.4
nwd2ng01.corp.analog.com 2.6.32-358.2.1.el6.x86_64 x86_64
CentOS release 6.5 (Final)
Gnome is not installed
Apache Information
PHP Version: 5.3.3
Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:35.0) Gecko/20100101 Firefox/35.0
Server Name: nwd2ng01.corp.analog.com
Server Address: 10.64.52.120
Server Port: 80
Date/Time
PHP Timezone: America/New_York
PHP Time: Wed, 21 Jan 2015 15:40:32 -0500
System Time: Wed, 21 Jan 2015 15:40:32 -0500
Nagios XI Data
License ends in: MSTNQS

nagios (pid 1827) is running...
NPCD running (pid 1776).
ndo2db (pid 1851) is running...
CPU Load 15: 17.55
Total Hosts: 444
Total Services: 126
Function 'get_base_uri' returns: http://nwd2ng01.corp.analog.com/nagiosxi/
Function 'get_base_url' returns: http://nwd2ng01.corp.analog.com/nagiosxi/
Function 'get_backend_url(internal_call=false)' returns: http://nwd2ng01.corp.analog.com/nagiosx ... rofile.php
Function 'get_backend_url(internal_call=true)' returns: http://localhost/nagiosxi/backend/
Ping Test localhost
Running:

/bin/ping -c 3 localhost 2>&1

PING localhost (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.015 ms
64 bytes from localhost (127.0.0.1): icmp_seq=2 ttl=64 time=0.014 ms
64 bytes from localhost (127.0.0.1): icmp_seq=3 ttl=64 time=0.017 ms

--- localhost ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.014/0.015/0.017/0.003 ms
Test wget To localhost
WGET From URL: http://localhost/nagiosxi/includes/components/ccm/
Running:

/usr/bin/wget http://localhost/nagiosxi/includes/components/ccm/

--2015-01-21 15:40:34-- http://localhost/nagiosxi/includes/components/ccm/
Resolving localhost... ::1, 127.0.0.1
Connecting to localhost|::1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: "/usr/local/nagiosxi/tmp/ccm_index.tmp"

0K ......... 16.7M=0.001s

2015-01-21 15:40:34 (16.7 MB/s) - "/usr/local/nagiosxi/tmp/ccm_index.tmp" saved [9666]

tonyleatwork · Post by **tonyleatwork** » Wed Jan 21, 2015 4:17 pm

I realized that the messages relate to actual processes. I didn't check for zombie processes until now, but there are 17 right after boot.

The zombie processes are mostly check_wmi_plus, probably waiting for a response?

Questions: will this impact my monitoring? Will Nagios try again? Should this be a concern and how do we address it? Thanks in advance.
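For anyone wanting to repeat the zombie check described above, here is a minimal sketch that lists processes in state Z by reading /proc directly. It assumes a Linux host; the helper names `parse_stat` and `list_zombies` are made up for this example and are not part of Nagios.

```python
import os


def parse_stat(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so locate the LAST ')' instead of splitting on spaces.
    open_paren = stat_line.index("(")
    close_paren = stat_line.rindex(")")
    pid = int(stat_line[:open_paren])
    comm = stat_line[open_paren + 1:close_paren]
    state = stat_line[close_paren + 1:].split()[0]
    return pid, comm, state


def list_zombies(proc_root="/proc"):
    """Return a list of (pid, comm) for every process currently in state Z."""
    zombies = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited while we were scanning, or unreadable
        if state == "Z":
            zombies.append((pid, comm))
    return zombies


if __name__ == "__main__":
    for pid, comm in list_zombies():
        print(pid, comm)
```

Running it a few times a minute apart shows whether the same PIDs linger or whether the zombies are reaped and replaced by new ones.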

abrist · Post by **abrist** » Wed Jan 21, 2015 4:32 pm

tonyleatwork wrote:Will Nagios try again?

It indeed should be rescheduled. How often do you see these errors/warnings?

tonyleatwork · Post by **tonyleatwork** » Wed Jan 21, 2015 8:45 pm

It's happening every 10 minutes or so, or about every 2 intervals (most of my checks are at 5 min intervals). There are about 10-15 messages referencing about 4-5 unique PIDs.
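The message and PID counts above can be reproduced from the log with a short script. This is a sketch that assumes the syslog line format quoted in the first post; the regular expression and the function name `count_choked_workers` are written for this example only.

```python
import re
from collections import Counter

# Matches the worker PID in lines such as:
#   nagios: wproc: 'Core Worker 1835' seems to be choked. ret = -1; ...
CHOKED = re.compile(r"wproc: 'Core Worker (\d+)' seems to be choked")


def count_choked_workers(lines):
    """Count 'choked' messages per worker PID in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = CHOKED.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    # Path taken from the first post; reading it usually requires root.
    with open("/var/log/messages", errors="replace") as log:
        for pid, hits in count_choked_workers(log).most_common():
            print(f"worker {pid}: {hits} messages")
```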

My hunch was that it is performance related. Fortunately this is on a VM and I was able to just add more processors to it. That fixed the problem (or just further masked it?).

My concern is that we "only" have 2700 checks; should the system be this pegged already? We gave it 4 processors at 3+ GHz each and 8 GB of RAM.

How can I test for my sizing requirements and see if this isn't just a 'gremlin' in the system?
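As a rough starting point for the sizing question, the figures above can be turned into a back-of-envelope load estimate. The sketch assumes all 2700 checks run on the 5-minute interval (the thread only says "most" do), and the 2-second average execution time is a placeholder, not a measured value; the real number would come from the monitoring engine's own performance statistics.

```python
def checks_per_second(total_checks, interval_seconds):
    """Average check launch rate if every check runs once per interval."""
    return total_checks / interval_seconds


def average_concurrency(rate_per_second, avg_execution_seconds):
    """Little's law: checks in flight = arrival rate x time each one takes."""
    return rate_per_second * avg_execution_seconds


if __name__ == "__main__":
    rate = checks_per_second(2700, 5 * 60)  # figures from the thread
    in_flight = average_concurrency(rate, 2.0)  # 2.0 s is a hypothetical average
    print(f"{rate:.1f} checks/s, about {in_flight:.0f} running at any moment")
```

Slow plugins raise the in-flight count in direct proportion to their run time, so a handful of checks that wait many seconds for a remote response can weigh more than hundreds of fast ones.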

cmerchant · Post by **cmerchant** » Thu Jan 22, 2015 12:31 pm

I would suggest you look at the Monitoring Engine Status page: Admin --> System Information --> Monitoring Engine Status

Look at the event queue (how many concurrent checks),
Check Statistics (quantity and rate of checks), and
Performance (avg time to complete checks).

tonyleatwork · Post by **tonyleatwork** » Wed Feb 04, 2015 4:33 pm

Please close this. The workaround was to increase the processing capacity of the system. I have opened a separate request regarding system performance.

Nagios Support Forum
