
nagios: wproc: Core Worker 35151: job 15496 (pid=35200) tim

Posted: Fri Aug 11, 2017 7:51 am
by bran3b
Nagios version 5.3.2
Redhat 7.0

My Nagios instance hung and drove the host load to over 2000. Rebooting the host resolved the issue, but what caused the problem in the first place? The last service checks seem to have happened at 22:36. At 22:33 these messages were written to the OS logs:

Aug 10 22:33:00 nagios: wproc: Core Worker 35151: job 15496 (pid=35200) timed out. Killing it
Aug 10 22:33:00 nagios: wproc: CHECK job 15496 from worker Core Worker 35151 timed out after 30.02s
Aug 10 22:33:00 nagios: wproc: host=<<hostname>>; service=(null);
Aug 10 22:33:00 nagios: wproc: early_timeout=1; exited_ok=0; wait_status=0; error_code=62;
Aug 10 22:33:00 nagios: Warning: Check of host 'is-backup700' timed out after 30.02 seconds
Aug 10 22:33:00 nagios: wproc: Core Worker 35151: job 15496 (pid=35200): Dormant child reaped

Immediately after those messages I get a steady flow (16 or more per minute, all starting in the first second of that minute) of these:

Aug 10 22:33:01 systemd: Starting Session 1203871 of user nagios.
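For anyone triaging a similar log, here is a minimal Python sketch that pulls out the worker timeouts and tallies the systemd session starts per minute. The function name `summarize` and both regular expressions are illustrative assumptions built only from the lines quoted above (an optional hostname field after the timestamp is tolerated); this is not a Nagios tool.

```python
import re
from collections import Counter

# Both patterns are assumptions derived from the sample lines quoted in this post.
TIMEOUT_RE = re.compile(
    r"wproc: CHECK job (?P<job>\d+) from worker (?P<worker>.+?) "
    r"timed out after (?P<secs>[\d.]+)s"
)
SESSION_RE = re.compile(
    r"^(?P<minute>\w{3}\s+\d+ \d{2}:\d{2}):\d{2} (?:\S+ )?"
    r"systemd: Starting Session \d+ of user (?P<user>\S+?)\.?$"
)


def summarize(lines):
    """Return (timeouts, sessions) for an iterable of syslog lines.

    timeouts: list of (job_id, worker_name, seconds) tuples.
    sessions: Counter keyed by (minute, user) counting session starts.
    """
    timeouts = []
    sessions = Counter()
    for raw in lines:
        line = raw.strip()
        match = TIMEOUT_RE.search(line)
        if match:
            timeouts.append(
                (int(match.group("job")), match.group("worker"), float(match.group("secs")))
            )
            continue
        match = SESSION_RE.match(line)
        if match:
            sessions[(match.group("minute"), match.group("user"))] += 1
    return timeouts, sessions
```

Feeding it the system log (for example `summarize(open("/var/log/messages"))`, assuming that is where the messages land on this host) would show whether the session starts line up with a cron schedule for the nagios user.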

Re: nagios: wproc: Core Worker 35151: job 15496 (pid=35200)

Posted: Fri Aug 11, 2017 2:05 pm
by dwhitfield
We are going to need a bit more information.

Can you PM me your Profile? You can download it by going to Admin > System Config > System Profile and clicking the ***Download Profile*** button towards the top. If for whatever reason you *cannot* download the profile, please put the output of View System Info (5.3.4+, Show Profile if older) in the thread (that will at least get us some info). This will give us access to many of the logs we would otherwise ask for individually. If security is a concern, you can unzip the profile, take out what you like, and then zip it up again. We may end up needing something you remove, but we can ask for that specifically.

After you PM the profile, please update this thread. Updating this thread is the only way for it to show back up on our dashboard.

If you can't send the profile, how many hosts do you have and how many services? What's the output of df -h and df -i?
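If it is easier to collect those numbers from a script, the block and inode usage that `df -h` and `df -i` report for a single mount point can be approximated with a short Python sketch. The helper names `percent_used` and `disk_and_inode_usage` are hypothetical, and the default path `/` is an assumption; run it against whichever filesystem holds the Nagios data.

```python
import os


def percent_used(used, available):
    """Percentage of used out of (used + available); 0.0 when both are zero."""
    total = used + available
    return round(100.0 * used / total, 1) if total else 0.0


def disk_and_inode_usage(path="/"):
    """Report space and inode usage for the filesystem containing `path`."""
    st = os.statvfs(path)  # Unix only
    used_bytes = (st.f_blocks - st.f_bfree) * st.f_frsize
    avail_bytes = st.f_bavail * st.f_frsize  # space available to non-root users
    used_inodes = st.f_files - st.f_ffree
    return {
        "path": path,
        "size_bytes": st.f_blocks * st.f_frsize,
        "used_bytes": used_bytes,
        "avail_bytes": avail_bytes,
        "space_used_pct": percent_used(used_bytes, avail_bytes),
        "inodes": st.f_files,
        "inodes_used": used_inodes,
        "inodes_free": st.f_ffree,
        # Some filesystems report zero inodes; percent_used then yields 0.0.
        "inodes_used_pct": percent_used(used_inodes, st.f_ffree),
    }


if __name__ == "__main__":
    for key, value in disk_and_inode_usage("/").items():
        print(f"{key}: {value}")
```

A filesystem at or near 100% on either space or inodes is worth mentioning in the thread, since both can stall a busy monitoring host.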

Re: nagios: wproc: Core Worker 35151: job 15496 (pid=35200)

Posted: Fri Aug 11, 2017 3:55 pm
by bran3b
System Profile
A system profile makes it easier for our support techs to understand the system that you are running on. Including a downloaded system profile with your support ticket is always a good idea.
Nagios XI Installation Profile

System:

Nagios XI Version : 5.2.7
is-nagios.gwl.com 3.10.0-123.el7.x86_64 x86_64
Red Hat Enterprise Linux Server release 7.0 (Maipo)
Gnome is not installed
Apache Information

PHP Version: 5.4.16
Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36
Server Name: is-nagios
Server Address: xxx.xxx.xxx.xxx
Server Port: 443
Date/Time

PHP Timezone: US/Mountain
PHP Time: Fri, 11 Aug 2017 14:35:35 -0600
System Time: Fri, 11 Aug 2017 14:35:35 -0600
Nagios XI Data

License ends in: QUOTRQ

nagios (pid 30131) is running...
NPCD running (pid 4847).
ndo2db (pid 4868) is running...
CPU Load 15: 0.78
Total Hosts: 1027
Total Services: 6828
Function 'get_base_uri' returns: https://is-nagios/nagiosxi/
Function 'get_base_url' returns: https://is-nagios/nagiosxi/
Function 'get_backend_url(internal_call=false)' returns: https://is-nagios/nagiosxi/includes/com ... rofile.php
Function 'get_backend_url(internal_call=true)' returns: http://localhost/nagiosxi/backend/
Ping Test localhost

Running:
/bin/ping -c 3 localhost 2>&1
PING localhost (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.052 ms
64 bytes from localhost (127.0.0.1): icmp_seq=2 ttl=64 time=0.041 ms
64 bytes from localhost (127.0.0.1): icmp_seq=3 ttl=64 time=0.036 ms

--- localhost ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2000ms
rtt min/avg/max/mdev = 0.036/0.043/0.052/0.006 ms
Test wget To localhost

WGET From URL: http://localhost/nagiosxi/includes/components/ccm/
Running:
/usr/bin/wget http://localhost/nagiosxi/includes/components/ccm/
--2017-08-11 14:35:37-- http://localhost/nagiosxi/includes/components/ccm/
Resolving localhost (localhost)... 127.0.0.1, ::1
Connecting to localhost (localhost)|127.0.0.1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: '/usr/local/nagiosxi/tmp/ccm_index.tmp'

0K ......... 754K=0.01s

2017-08-11 14:35:37 (754 KB/s) - '/usr/local/nagiosxi/tmp/ccm_index.tmp' saved [9836]

Network Settings

1: lo: mtu 65536 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
2: em1: mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether b8:2a:72:df:66:83 brd ff:ff:ff:ff:ff:ff
3: em2: mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether b8:2a:72:df:66:85 brd ff:ff:ff:ff:ff:ff
4: em3: mtu 1500 qdisc mq master bond0 state UP qlen 1000
    link/ether b8:2a:72:df:66:87 brd ff:ff:ff:ff:ff:ff
5: em4: mtu 1500 qdisc mq state UP qlen 1000
    link/ether b8:2a:72:df:66:89 brd ff:ff:ff:ff:ff:ff
6: p3p1: mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 00:0a:f7:76:4c:f0 brd ff:ff:ff:ff:ff:ff
7: p3p2: mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 00:0a:f7:76:4c:f2 brd ff:ff:ff:ff:ff:ff
8: bond0: mtu 1500 qdisc noqueue state UP
    link/ether b8:2a:72:df:66:87 brd ff:ff:ff:ff:ff:ff
    inet xxx.xxx.xxx.xxx/24 brd xxx.xxx.xxx.255 scope global bond0
       valid_lft forever preferred_lft forever

default via 143.199.99.1 dev bond0
xxx.xxx.xxx.0/24 dev bond0 proto kernel scope link src xxx.xxx.xxx.xxx
169.254.0.0/16 dev em4 scope link metric 1005
169.254.0.0/16 dev bond0 scope link metric 1008

Re: nagios: wproc: Core Worker 35151: job 15496 (pid=35200)

Posted: Fri Aug 11, 2017 4:45 pm
by dwhitfield
It looks like you missed the last question: What's the output of df -h and df -i?

Also, are you able to send the zip file rather than just the text file? That should be much more helpful.