Continuous nagios: wproc: 'Core Worker XXXX' seems to be choked

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
tonyleatwork
Posts: 91
Joined: Mon Jul 07, 2014 8:55 am

Continuous nagios: wproc: 'Core Worker XXXX' seems to be choked

Post by tonyleatwork »

Hi -

After a reboot I started getting a lot of:

Jan 21 15:27:19 nwd2ng01 nagios: wproc: 'Core Worker 1835' seems to be choked. ret = -1; bufsize = 180: errno = 11 (Resource temporarily unavailable)
Jan 21 15:27:19 nwd2ng01 nagios: Unable to run check for service 'Page File Usage' on host 'hostname.corp.com'

The issue persists for a while, goes away, and then comes back. I'm not sure what the trigger is. /var/log/messages grew to 32 MB today alone.
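For reference, errno 11 on Linux is EAGAIN, the error a non-blocking write returns when the receiving side's buffer is full. The sketch below reproduces that condition with an ordinary pipe. It is an illustration only: it assumes (this thread does not confirm it) that the wproc message comes from a non-blocking write to a worker that is not draining its input fast enough, and the function name is made up.

```python
import errno
import os


def fill_nonblocking_pipe():
    """Write to a non-blocking pipe that nobody reads until the kernel refuses.

    Returns (bytes_written, errno_of_failure).
    """
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)  # writes now fail instead of waiting
    written = 0
    chunk = b"x" * 180  # same size as the bufsize in the log line, purely illustrative
    try:
        while True:
            try:
                written += os.write(write_fd, chunk)
            except BlockingIOError as exc:
                # The pipe buffer is full because the reader is not keeping up.
                return written, exc.errno
    finally:
        os.close(read_fd)
        os.close(write_fd)


if __name__ == "__main__":
    written, err = fill_nonblocking_pipe()
    print(f"wrote {written} bytes, then {errno.errorcode[err]}: {os.strerror(err)}")
```

If that assumption holds, the message points at workers that cannot keep up rather than at a specific plugin failing.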

Profile is below:

Nagios XI Installation Profile
System:
Nagios XI Version : 2014R1.4
nwd2ng01.corp.analog.com 2.6.32-358.2.1.el6.x86_64 x86_64
CentOS release 6.5 (Final)
Gnome is not installed
Apache Information
PHP Version: 5.3.3
Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:35.0) Gecko/20100101 Firefox/35.0
Server Name: nwd2ng01.corp.analog.com
Server Address: 10.64.52.120
Server Port: 80
Date/Time
PHP Timezone: America/New_York
PHP Time: Wed, 21 Jan 2015 15:40:32 -0500
System Time: Wed, 21 Jan 2015 15:40:32 -0500
Nagios XI Data
License ends in: MSTNQS

nagios (pid 1827) is running...
NPCD running (pid 1776).
ndo2db (pid 1851) is running...
CPU Load 15: 17.55
Total Hosts: 444
Total Services: 126
Function 'get_base_uri' returns: http://nwd2ng01.corp.analog.com/nagiosxi/
Function 'get_base_url' returns: http://nwd2ng01.corp.analog.com/nagiosxi/
Function 'get_backend_url(internal_call=false)' returns: http://nwd2ng01.corp.analog.com/nagiosx ... rofile.php
Function 'get_backend_url(internal_call=true)' returns: http://localhost/nagiosxi/backend/
Ping Test localhost
Running:

/bin/ping -c 3 localhost 2>&1

PING localhost (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.015 ms
64 bytes from localhost (127.0.0.1): icmp_seq=2 ttl=64 time=0.014 ms
64 bytes from localhost (127.0.0.1): icmp_seq=3 ttl=64 time=0.017 ms

--- localhost ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.014/0.015/0.017/0.003 ms
Test wget To localhost
WGET From URL: http://localhost/nagiosxi/includes/components/ccm/
Running:

/usr/bin/wget http://localhost/nagiosxi/includes/components/ccm/

--2015-01-21 15:40:34-- http://localhost/nagiosxi/includes/components/ccm/
Resolving localhost... ::1, 127.0.0.1
Connecting to localhost|::1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: "/usr/local/nagiosxi/tmp/ccm_index.tmp"

0K ......... 16.7M=0.001s

2015-01-21 15:40:34 (16.7 MB/s) - "/usr/local/nagiosxi/tmp/ccm_index.tmp" saved [9666]
tonyleatwork
Posts: 91
Joined: Mon Jul 07, 2014 8:55 am

Re: Continuous nagios: wproc: 'Core Worker XXXX' seems to be c

Post by tonyleatwork »

I realized that the messages relate to actual processes. I didn't check for zombie processes until now, but there are 17 right after boot.

The zombie processes are mostly check_wmi_plus, probably waiting for a response?

Questions: will this impact my monitoring? Will Nagios try again? Should this be a concern and how do we address it? Thanks in advance.
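One way to see which commands the zombies belong to, and which parent has not reaped them, is to filter `ps` output on the process state column. A minimal sketch, assuming a Linux `ps` that accepts `-eo pid,ppid,stat,comm`; the helper names are hypothetical:

```python
import subprocess
from collections import Counter


def parse_zombies(ps_output):
    """Return (pid, ppid, command) for every line whose state starts with Z."""
    zombies = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 3)
        if len(fields) == 4 and fields[2].startswith("Z"):
            pid, ppid, _stat, comm = fields
            zombies.append((int(pid), int(ppid), comm.strip()))
    return zombies


def list_zombies():
    """Run ps and return the zombie entries found on this machine."""
    out = subprocess.run(
        ["ps", "-eo", "pid,ppid,stat,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_zombies(out)


if __name__ == "__main__":
    found = list_zombies()
    for command, count in Counter(c for _, _, c in found).most_common():
        print(f"{count:4d}  {command}")
    print(f"{len(found)} zombie process(es) in total")
```

A zombie has already exited and only occupies a process table slot until its parent collects the exit status, so the PPID column is usually the more useful clue.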
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Continuous nagios: wproc: 'Core Worker XXXX' seems to be c

Post by abrist »

tonyleatwork wrote: Will Nagios try again?
Yes, the check should be rescheduled. How often do you see these errors/warnings?
Former Nagios employee
tonyleatwork
Posts: 91
Joined: Mon Jul 07, 2014 8:55 am

Re: Continuous nagios: wproc: 'Core Worker XXXX' seems to be c

Post by tonyleatwork »

It's happening every 10 minutes or so, or about every 2 intervals (most of my checks run at 5-minute intervals). Each time there are about 10-15 messages referencing 4-5 unique PIDs.
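Counts like these can be pulled out of the log rather than estimated by eye. The sketch below groups the "seems to be choked" lines into 10-minute buckets and lists the worker PIDs seen in each. It assumes the syslog line format quoted in the first post; the function name and bucket size are illustrative.

```python
import re
from collections import defaultdict

CHOKED = re.compile(
    r"^(?P<mon>\w{3})\s+(?P<day>\d+) (?P<hh>\d{2}):(?P<mm>\d{2}):\d{2} \S+ "
    r"nagios: wproc: 'Core Worker (?P<pid>\d+)' seems to be choked"
)


def summarize_choked(lines, bucket_minutes=10):
    """Group 'seems to be choked' messages into time buckets.

    Returns {bucket_label: (message_count, sorted_unique_worker_pids)}.
    bucket_minutes should divide 60 evenly.
    """
    counts = defaultdict(int)
    pids = defaultdict(set)
    for line in lines:
        match = CHOKED.match(line)
        if not match:
            continue
        minute = int(match["mm"]) // bucket_minutes * bucket_minutes
        label = f"{match['mon']} {int(match['day']):2d} {match['hh']}:{minute:02d}"
        counts[label] += 1
        pids[label].add(int(match["pid"]))
    return {label: (counts[label], sorted(pids[label])) for label in counts}


if __name__ == "__main__":
    # /var/log/messages is the file named in the thread; reading it usually needs root.
    with open("/var/log/messages", errors="replace") as log:
        for label, (count, workers) in summarize_choked(log).items():
            print(f"{label}  {count:3d} messages  workers: {workers}")
```

Comparing the bucket labels against the check schedule should show whether the bursts line up with the 5-minute interval.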

My hunch was that it was performance-related. Fortunately this is a VM, so I was able to simply add more processors. That fixed the problem (or perhaps just masked it further?).

The concern is that we "only" have 2700 checks. Should the system be this pegged already? We gave it 4 processors at 3+ GHz each and 8 GB of RAM.

How can I test for my sizing requirements and see if this isn't just a 'gremlin' in the system?
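As a rough starting point for sizing, the figures above can be turned into a load estimate: 2700 checks on a 5-minute interval means about 9 checks have to start every second, and the number running at once is that rate multiplied by how long each check takes. The sketch below does that arithmetic. The execution times in it are placeholders, not measurements from this system, and it treats all 2700 checks as being on the 5-minute interval even though the post says only most of them are.

```python
def check_rate(total_checks, interval_seconds):
    """Average number of checks that must start each second."""
    return total_checks / interval_seconds


def average_concurrency(rate_per_second, avg_execution_seconds):
    """Little's law: checks in flight = start rate x time each one takes."""
    return rate_per_second * avg_execution_seconds


if __name__ == "__main__":
    rate = check_rate(2700, 5 * 60)  # figures from the post above
    print(f"{rate:.1f} checks/second")
    # Hypothetical execution times; replace with measured averages.
    for seconds in (0.5, 2.0, 10.0):
        in_flight = average_concurrency(rate, seconds)
        print(f"avg {seconds:4.1f}s per check -> {in_flight:.1f} checks running at once")
```

Slow plugins (long WMI queries, for example) raise the in-flight count quickly, which is why average execution time matters as much as the raw number of checks.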
cmerchant
Posts: 546
Joined: Wed Sep 24, 2014 11:19 am

Re: Continuous nagios: wproc: 'Core Worker XXXX' seems to be c

Post by cmerchant »

I would suggest you look at the Monitoring Engine Status page: Admin --> System Information --> Monitoring Engine Status

Look at the event queue (how many concurrent checks),
Check Statistics (quantity and rate of checks), and
Performance (avg time to complete checks).
tonyleatwork
Posts: 91
Joined: Mon Jul 07, 2014 8:55 am

Re: Continuous nagios: wproc: 'Core Worker XXXX' seems to be c

Post by tonyleatwork »

Please close this. The workaround was to increase the processing capacity of the system. I have a separate request regarding system performance.