Re: Nagios 4.1.1 too many zombie process and 100% cpu usage
Posted: Fri Sep 23, 2016 1:50 am
by Box293
Can you please post your nagios.cfg.
I've been testing Core 4.2.1 on a fresh build of Solaris 11.2 and have seen some of this high CPU utilization. It seems to coincide with these errors in the log file:
Code:
[1474613162] wproc: Core Worker 2028: job 1 (pid=2057) timed out. Killing it
[1474613162] wproc: Core Worker 2028: job 1 with pid 2057 reaped at timeout. timeouts=1; started=2
[1474613205] wproc: Core Worker 2029: job 1 (pid=2058) timed out. Killing it
[1474613205] wproc: Core Worker 2029: job 1 with pid 2058 reaped at timeout. timeouts=1; started=2
[1474613292] wproc: Core Worker 2030: job 1 (pid=2064) timed out. Killing it
[1474613292] wproc: Core Worker 2030: job 1 with pid 2064 reaped at timeout. timeouts=1; started=2
[1474613309] wproc: Core Worker 2028: job 2 (pid=2069) timed out. Killing it
[1474613309] wproc: Core Worker 2028: job 2 with pid 2069 reaped at timeout. timeouts=2; started=3
This is just a standard install of Nagios with the sample configs, so it's basically 1 host and 7 services.
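A rough way to see how often each worker hits the timeout is to tally the "timed out. Killing it" lines per Core Worker. This is only a sketch: the function name count_worker_timeouts is made up for illustration, and it assumes the log line layout shown above. On Solaris 10 the default /usr/bin/awk is the legacy one, so nawk or /usr/xpg4/bin/awk may be needed.

```shell
# Hypothetical helper (not part of Nagios): tally timeout kills per Core Worker.
# Reads Nagios log lines on stdin, prints "<worker pid> <count>" sorted by pid.
count_worker_timeouts() {
    awk '
        index($0, "timed out. Killing it") {
            # The worker pid is the field right after the word "Worker".
            for (i = 1; i < NF; i++) {
                if ($i == "Worker") {
                    w = $(i + 1)
                    sub(/:$/, "", w)   # drop the trailing colon
                    count[w]++
                    break
                }
            }
        }
        END { for (w in count) print w, count[w] }
    ' | sort -n
}

# Example (log path as used elsewhere in this thread):
# count_worker_timeouts < /usr/local/nagios/var/nagios.log
```

Fed the eight log lines above, this should report two kills for worker 2028 and one each for 2029 and 2030.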
Re: Nagios 4.1.1 too many zombie process and 100% cpu usage
Posted: Mon Sep 26, 2016 7:40 pm
by sainudani
Hi,
I could not work out exactly what you required, but I assume you need to find out which Nagios process is consuming the most CPU.
Below is how I identified it on Solaris (I hope 'truss' on Solaris is the equivalent of 'strace' on Linux).
--------------------------Top--------------------------
bash-3.2# top
load averages: 9.21, 9.42, 9.43; up 24+18:05:30 10:14:03
368 processes: 48 sleeping, 2 running, 310 zombie, 8 on cpu
CPU states: 0.0% idle, 79.8% user, 20.2% kernel, 0.0% iowait, 0.0% swap
Memory: 32G phys mem, 3280M free mem, 20G total swap, 20G free swap
-------------------------------------------------------------------------
==>~80% CPU is consumed by Nagios & ~20% by Kernel
-------------------------------------------------------------------------
----------------------Prstat-------------------------------
bash-3.2# prstat -u nagios
PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP
17845 nagios 7816K 5752K run 0 0 79:34:19 12% nagios/1
17847 nagios 7840K 5840K cpu3 0 0 79:31:10 12% nagios/1
17846 nagios 7776K 5800K cpu5 0 0 79:31:35 12% nagios/1
17841 nagios 7792K 5792K cpu0 0 0 79:34:44 12% nagios/1
17839 nagios 7856K 5816K cpu7 0 0 79:34:06 11% nagios/1
17840 nagios 7864K 5824K cpu2 0 0 79:32:32 11% nagios/1
17842 nagios 7792K 5680K cpu4 0 0 79:33:23 11% nagios/1
17850 nagios 7880K 5824K cpu1 0 0 79:32:35 11% nagios/1
17837 nagios 15M 12M sleep 59 0 1:10:03 0.1% nagios/1
19146 nagios 6824K 4352K sleep 59 0 0:00:00 0.0% ssh/1
19787 nagios 190M 49M sleep 59 0 1:04:00 0.0% java/27
19116 nagios 4504K 2688K sleep 59 0 0:00:00 0.0% psu_check/1
19126 nagios 4680K 3096K sleep 59 0 0:00:00 0.0% get_alom_data.e/1
17526 nagios 0K 0K zombie 0 - 0:00:00 0.0% /0
17630 nagios 0K 0K zombie 0 - 0:00:00 0.0% /0
17843 nagios 0K 0K zombie 0 - 0:00:00 0.0% /0
17624 nagios 0K 0K zombie 0 - 0:00:00 0.0% /0
18682 nagios 0K 0K zombie 0 - 0:00:00 0.0% /0
17594 nagios 0K 0K zombie 0 - 0:00:00 0.0% /0
----------------------------------------------------------------------------------
==>From above, we can see PIDs 17845, 17847, 17846, 17841, 17839, 17840, 17842 and 17850 together consume ~90% CPU.
----------------------------------------------------------------------------------
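The ~90% figure can be cross-checked by adding up the CPU column for the nagios processes. The sketch below does that with awk; nagios_cpu_total is a hypothetical name, and it assumes the prstat column layout shown above (CPU in column 9, PROCESS/NLWP in column 10). On Solaris 10, nawk or /usr/xpg4/bin/awk may be needed instead of the legacy /usr/bin/awk.

```shell
# Hypothetical helper: total the CPU column for processes named "nagios".
# Reads prstat output on stdin; header, zombie and non-nagios rows are skipped.
nagios_cpu_total() {
    awk '
        $1 ~ /^[0-9]+$/ && $10 ~ /^nagios\// {
            cpu = $9
            sub(/%$/, "", cpu)   # "12%" -> "12"
            total += cpu
            procs++
        }
        END { printf "nagios_procs=%d nagios_cpu=%.1f%%\n", procs, total }
    '
}

# Example: take one prstat sample and summarize it.
# prstat -u nagios 1 1 | nagios_cpu_total
```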
--------------------------------Ptree--------------------------------------------
bash-3.2# ptree 17845
19186 zsched
17837 /usr/local/nagios/bin/nagios -d /usr/local/nagios/etc/nagios.cfg
17845 /usr/local/nagios/bin/nagios --worker /usr/local/nagios/var/rw/nagios.qh
24091 <defunct>
24174 <defunct>
24869 <defunct>
25504 <defunct>
25313 <defunct>
24296 <defunct>
25519 <defunct>
25399 <defunct>
24916 <defunct>
24959 <defunct>
24072 <defunct>
25479 <defunct>
25285 <defunct>
24691 <defunct>
25166 <defunct>
24180 <defunct>
24861 <defunct>
24490 <defunct>
24301 <defunct>
25293 <defunct>
24426 <defunct>
24698 <defunct>
24627 <defunct>
25188 <defunct>
25204 <defunct>
24311 <defunct>
24714 <defunct>
25172 <defunct>
24150 <defunct>
25536 <defunct>
24135 <defunct>
24894 <defunct>
25157 <defunct>
25552 <defunct>
25496 <defunct>
25377 <defunct>
25212 <defunct>
24931 <defunct>
24989 <defunct>
24505 <defunct>
24664 <defunct>
24671 <defunct>
24967 <defunct>
25323 <defunct>
25180 <defunct>
25512 <defunct>
24124 <defunct>
25300 <defunct>
25196 <defunct>
24115 <defunct>
24326 <defunct>
24530 <defunct>
bash-3.2#
-------------------------------------------------------------------------------
===>From above we can see there are more than 50 defunct/zombie processes associated with each of these PIDs.
-------------------------------------------------------------------------------
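Instead of running ptree against each worker in turn, the zombies can be grouped by parent PID from a single ps listing. This is a sketch under assumptions: zombies_per_parent is a made-up name, and it expects "ps -e -o pid,ppid,s" style input where the third column is the one-letter state (Z for zombie).

```shell
# Hypothetical helper: count zombie children per parent pid.
# Reads "PID PPID S" rows on stdin; prints "<parent pid> <zombie count>",
# largest count first. The header row is ignored because its state is not "Z".
zombies_per_parent() {
    awk '
        $3 == "Z" { n[$2]++ }
        END { for (p in n) print p, n[p] }
    ' | sort -k2,2nr -k1,1n
}

# Example:
# ps -e -o pid,ppid,s | zombies_per_parent
```

If the parents listed are all Nagios core workers, that would match what the ptree output above suggests.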
------------------------truss -p 17845--------------------------------------
===>Output file attached for your reference.
-------------------------------------------------------------------------------
Re: Nagios 4.1.1 too many zombie process and 100% cpu usage
Posted: Mon Sep 26, 2016 7:42 pm
by sainudani
I am also wondering whether I should upgrade the client NRPE package as part of the Nagios Core update.
I am not sure what version of NRPE is installed on the Nagios clients. How can I check that?
Re: Nagios 4.1.1 too many zombie process and 100% cpu usage
Posted: Mon Sep 26, 2016 7:47 pm
by sainudani
I can also see similar messages in /var/adm/messages:
Sep 27 10:43:56 nagios[17837]: [ID 702911 user.info] wproc: early_timeout=0; exited_ok=1; wait_status=512; error_code=0;
Sep 27 10:43:56 nagios[17837]: [ID 702911 user.info] wproc: Core Worker 17847: job 301011 with pid 13069 reaped at timeout. timeouts=300819; started=300865
Sep 27 10:44:05 nagios[17837]: [ID 702911 user.info] wproc: Core Worker 17845: job 300993 (pid=12866) timed out. Killing it
Sep 27 10:44:05 nagios[17837]: [ID 702911 user.info] wproc: Core Worker 17845: job 300993 with pid 12866 reaped at timeout. timeouts=300821; started=300868
Sep 27 10:44:05 nagios[17837]: [ID 702911 user.info] wproc: Core Worker 17841: job 300994 (pid=12873) timed out. Killing it
Sep 27 10:44:05 nagios[17837]: [ID 702911 user.info] wproc: Core Worker 17841: job 300994 with pid 12873 reaped at timeout. timeouts=300827; started=300868
Sep 27 10:44:07 nagios[17837]: [ID 702911 user.info] wproc: Core Worker 17846: job 300901 (pid=11732) timed out. Killing it
Sep 27 10:44:07 nagios[17837]: [ID 702911 user.info] wproc: Core Worker 17846: job 300901 with pid 11732 reaped at timeout. timeouts=300836; started=300879
Sep 27 10:44:08 nagios[17837]: [ID 702911 user.info] wproc: Core Worker 17842: job 300902 (pid=11737) timed out. Killing it
Sep 27 10:44:08 nagios[17837]: [ID 702911 user.info] wproc: Core Worker 17845: job 300902 (pid=11738) timed out. Killing it
Sep 27 10:44:08 nagios[17837]: [ID 702911 user.info] wproc: Core Worker 17842: job 300902 with pid 11737 reaped at timeout. timeouts=300841; started=300877
Sep 27 10:44:08 nagios[17837]: [ID 702911 user.info] wproc: Core Worker 17845: job 300902 with pid 11738 reaped at timeout. timeouts=300822; started=300869
Sep 27 10:44:08 nagios[17837]: [ID 702911 user.info] wproc: Core Worker 17846: job 300902 (pid=11740) timed out. Killing it
Sep 27 10:44:08 nagios[17837]: [ID 702911 user.info] wproc: Core Worker 17839: job 300902 (pid=11733) timed out. Killing it
Sep 27 10:44:08 nagios[17837]: [ID 702911 user.info] wproc: Core Worker 17847: job 300901 (pid=11739) timed out. Killing it
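The timeouts= and started= counters in these lines are almost equal, which suggests nearly every job a worker starts ends in a timeout. A small sketch to print that ratio per worker is below; timeout_ratio is a hypothetical name, and it assumes the "reaped at timeout. timeouts=N; started=M" layout shown above. It keeps the last counters seen for each worker.

```shell
# Hypothetical helper: per worker, show the latest timeouts/started counters
# and the percentage of started jobs that timed out.
timeout_ratio() {
    awk '
        /reaped at timeout/ {
            w = ""; t = ""; s = ""
            for (i = 1; i <= NF; i++) {
                if ($i == "Worker") { w = $(i + 1); sub(/:$/, "", w) }
                if ($i ~ /^timeouts=/) { t = $i; sub(/^timeouts=/, "", t); sub(/;$/, "", t) }
                if ($i ~ /^started=/)  { s = $i; sub(/^started=/, "", s) }
            }
            # Keep only the most recent counters for each worker.
            if (w != "" && s + 0 > 0) { T[w] = t; S[w] = s }
        }
        END {
            for (w in S) printf "%s %d/%d %.2f%%\n", w, T[w], S[w], 100 * T[w] / S[w]
        }
    ' | sort -n
}

# Example:
# grep wproc /var/adm/messages | timeout_ratio
```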
Re: Nagios 4.1.1 too many zombie process and 100% cpu usage
Posted: Mon Sep 26, 2016 7:54 pm
by sainudani
Please find the nagios.cfg file as attached.
Re: Nagios 4.1.1 too many zombie process and 100% cpu usage
Posted: Tue Sep 27, 2016 3:59 pm
by lmiltchev
In your first post you say:
I've been testing Core 4.2.1 on a fresh build of Solaris 11.2...
The title says:
Nagios 4.1.1 too many zombie process and 100% cpu usage
...and in your nagios.cfg I see this:
NAGIOS.CFG - Sample Main Config File for Nagios 3.3.1
Can you clarify which version of Nagios Core you are using, and why Nagios 3.3.1 is listed in the main config? Can you run the following command and show the output?
Code:
/usr/local/nagios/bin/nagios | head -2
Re: Nagios 4.1.1 too many zombie process and 100% cpu usage
Posted: Tue Sep 27, 2016 6:23 pm
by sainudani
Hi,
OK. Apologies if it caused any confusion.
Let me explain the scenario first; the requested output is given at the end.
OS: Solaris 10 SPARC 64bit
Until last month it was Nagios Core 3.3.1, and then I upgraded to Nagios Core 4.1.1.
(Procedure followed: ./configure, make all & make install)
From then on, CPU usage started shooting up to 100% and all the other issues began.
As per the suggestions from this forum, I upgraded to 4.2 and 4.2.1, and it didn't help.
So I restored 4.1.1 again from the tar backup taken prior to the upgrade to 4.2.
Now it is 4.1.1, with the 100% CPU usage issue and too many zombie processes.
I can see /usr/local/nagios/etc/nagios.cfg is still the old 3.3.1 file (I haven't done anything to it, as no steps to change it manually were mentioned in the upgrade process).
I hope it is clear now.
Please find the requested output below.
bash-3.2# cat /usr/local/nagios/etc/nagios.cfg |head -5
##############################################################################
#
# NAGIOS.CFG - Sample Main Config File for Nagios 3.3.1
#
# Read the documentation for more information on this configuration
-------------------------------------------------------------------------------------------------------------------------------------------------------------
bash-3.2# /usr/local/nagios/bin/nagios | head -2
Nagios Core 4.1.1
bash-3.2#
Thanks,
Re: Nagios 4.1.1 too many zombie process and 100% cpu usage
Posted: Wed Sep 28, 2016 9:39 am
by jfrickson
Run your ./configure with whatever options you usually use. At the very end will be a line that says something like:
IOBroker Method: epoll
The 'epoll' could also be 'poll' or 'select'. If it says 'epoll', run
Code:
./configure --with-iobroker=poll [your-other-options here]
If it says 'poll', run
Code:
./configure --with-iobroker=select [your-other-options here]
Let us know if that helps at all.
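The steps above can be sketched as two small shell helpers: one pulls the IOBroker method out of the configure summary, the other picks the next method to try (epoll to poll, poll to select). Both function names are made up for illustration, and the parsing assumes the summary line is labelled "IOBroker Method:". On Solaris 10, nawk or /usr/xpg4/bin/awk may be needed instead of the legacy /usr/bin/awk.

```shell
# Hypothetical helper: print the IOBroker method from ./configure output on stdin.
iobroker_method() {
    awk -F': *' '
        /IOBroker Method/ {
            sub(/[ \t]+$/, "", $2)   # trim trailing whitespace
            print $2
            exit
        }
    '
}

# Hypothetical helper: next method to try, following the advice above.
# Prints nothing when there is no further fallback.
next_iobroker() {
    case "$1" in
        epoll) echo poll ;;
        poll)  echo select ;;
        *)     : ;;
    esac
}

# Example:
# current=$(./configure [your-other-options here] | iobroker_method)
# echo "try: ./configure --with-iobroker=$(next_iobroker "$current")"
```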
Re: Nagios 4.1.1 too many zombie process and 100% cpu usage
Posted: Wed Sep 28, 2016 7:17 pm
by sainudani
Hi,
Below is the ./configure summary:
*** Configuration summary for nagios 4.2.1 09-06-2016 ***:
General Options:
-------------------------
Nagios executable: nagios
Nagios user/group: nagios,nagios
Command user/group: nagios,nagios
Event Broker: yes
Install ${prefix}: /usr/local/nagios
Install ${includedir}: /usr/local/nagios/include/nagios
Lock file: ${prefix}/var/nagios.lock
Check result directory: ${prefix}/var/spool/checkresults
Init directory: /etc/init.d
Apache conf.d directory: /etc/httpd/conf.d
Mail program: /bin/mail
Host OS: solaris2.10
IOBroker Method: poll
As suggested, I ran ./configure --with-iobroker=select
General Options:
-------------------------
Nagios executable: nagios
Nagios user/group: nagios,nagios
Command user/group: nagios,nagios
Event Broker: yes
Install ${prefix}: /usr/local/nagios
Install ${includedir}: /usr/local/nagios/include/nagios
Lock file: ${prefix}/var/nagios.lock
Check result directory: ${prefix}/var/spool/checkresults
Init directory: /etc/init.d
Apache conf.d directory: /etc/httpd/conf.d
Mail program: /bin/mail
Host OS: solaris2.10
IOBroker Method: select
It looks like this managed to stop the zombies and the high CPU utilization.
load averages: 10.1, 10.2, 15.8;
103 processes: 54 sleeping, 38 zombie, 11 on cpu
CPU states: 84.3% idle, 15.6% user, 0.1% kernel, 0.0% iowait, 0.0% swap
Memory: 64G phys mem, 8809M free mem, 20G total swap, 20G free swap
But after this, polling stopped (service status has not been updated for the past hour) and all client services show the previous timestamp.
A new error has started to appear in /var/adm/messages:
[ID 702911 user.info] wproc: 'Core Worker 13757' seems to be choked. ret = -1; bufsize = 5658: errno = 11 (Resource temporarily unavailable)
[ID 702911 user.info] wproc: 'Core Worker 13759' seems to be choked. ret = -1; bufsize = 5464: errno = 11 (Resource temporarily unavailable)
[ID 702911 user.info] wproc: 'Core Worker 13762' seems to be choked. ret = -1; bufsize = 5746: errno = 11 (Resource temporarily unavailable)
Re: Nagios 4.1.1 too many zombie process and 100% cpu usage
Posted: Wed Sep 28, 2016 7:18 pm
by sainudani
bash-3.2# tail -40 /usr/local/nagios/var/nagios.log
[1475108262] wproc: 'Core Worker 13800' seems to be choked. ret = -1; bufsize = 5334: errno = 11 (Resource temporarily unavailable)
[1475108262] Unable to send check for host 'hostnsit1ctm02' to worker (ret=-2)
[1475108262] wproc: 'Core Worker 13757' seems to be choked. ret = -1; bufsize = 5600: errno = 11 (Resource temporarily unavailable)
[1475108262] Unable to run check for service 'Var Partition' on host 'hostnsit1ctm01'
[1475108262] wproc: 'Core Worker 13758' seems to be choked. ret = -1; bufsize = 5674: errno = 11 (Resource temporarily unavailable)
[1475108262] Unable to run check for service 'u01_oraredo_sit1ctsv' on host 'hostnsit1ctm01'
[1475108263] wproc: 'Core Worker 13759' seems to be choked. ret = -1; bufsize = 5212: errno = 11 (Resource temporarily unavailable)
[1475108263] Unable to send check for host 'hostnsit1dm01' to worker (ret=-2)
[1475108263] wproc: 'Core Worker 13760' seems to be choked. ret = -1; bufsize = 5519: errno = 11 (Resource temporarily unavailable)
[1475108263] Unable to run check for service 'NIC' on host 'hostnsit1ctm02'
[1475108263] wproc: 'Core Worker 13762' seems to be choked. ret = -1; bufsize = 5457: errno = 11 (Resource temporarily unavailable)
[1475108263] Unable to run check for service 'LUN_Connectivity_3' on host 'hostnsit1dm01'
[1475108263] wproc: 'Core Worker 13798' seems to be choked. ret = -1; bufsize = 5503: errno = 11 (Resource temporarily unavailable)
[1475108263] Unable to run check for service 'Fibre Connectivity' on host 'hostnsit1dm02'
[1475108264] wproc: 'Core Worker 13799' seems to be choked. ret = -1; bufsize = 5476: errno = 11 (Resource temporarily unavailable)
[1475108264] Unable to run check for service 'Var Partition' on host 'hostnsit1dm02'
[1475108264] wproc: 'Core Worker 13802' seems to be choked. ret = -1; bufsize = 5666: errno = 11 (Resource temporarily unavailable)
[1475108264] Unable to run check for service 'Swap Usage' on host 'hostnsit1inapp01'
[1475108265] wproc: 'Core Worker 13801' seems to be choked. ret = -1; bufsize = 5212: errno = 11 (Resource temporarily unavailable)
[1475108265] Unable to send check for host 'hostnsit1dm02' to worker (ret=-2)
[1475108265] wproc: 'Core Worker 13800' seems to be choked. ret = -1; bufsize = 5745: errno = 11 (Resource temporarily unavailable)
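To get an idea of which hosts are losing checks while the workers are choked, the "Unable to send check" and "Unable to run check" lines can be counted per host. This is a rough sketch: dropped_checks_by_host is a hypothetical name, and it relies on the single-quoted names in the log lines above, so a service name that itself contains an apostrophe would be miscounted.

```shell
# Hypothetical helper: count dropped host and service checks per host.
# Reads nagios.log lines on stdin; splits each line on single quotes.
dropped_checks_by_host() {
    awk -F"'" '
        /Unable to send check for host/   { n[$2]++ }   # host is the 1st quoted name
        /Unable to run check for service/ { n[$4]++ }   # host is the 2nd quoted name
        END { for (h in n) print h, n[h] }
    ' | sort
}

# Example:
# dropped_checks_by_host < /usr/local/nagios/var/nagios.log
```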