I've been testing Core 4.2.1 on a fresh build of Solaris 11.2 and have seen some of this high CPU utilization. It seems to coincide with these errors in the log file:
Hi,
I could not find exactly what you asked for.
But assuming you need to find out which Nagios processes are consuming the most CPU,
please find below the Solaris way of identifying them. (I hope 'truss' on Solaris is equivalent to 'strace' on Linux.)
--------------------------Top--------------------------
bash-3.2# top
load averages: 9.21, 9.42, 9.43; up 24+18:05:30 10:14:03
368 processes: 48 sleeping, 2 running, 310 zombie, 8 on cpu
CPU states: 0.0% idle, 79.8% user, 20.2% kernel, 0.0% iowait, 0.0% swap
Memory: 32G phys mem, 3280M free mem, 20G total swap, 20G free swap
-------------------------------------------------------------------------
==>~80% CPU is consumed by Nagios & ~20% by Kernel
-------------------------------------------------------------------------
----------------------------------------------------------------------------------
==>From the above, we can see that PIDs 17845, 17847, 17846, 17841, 17839, 17840, 17842 and 17850 consume ~90% CPU.
----------------------------------------------------------------------------------
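Since top reports 310 zombies, it may also help to see which parent process is leaving them behind. The following is only a sketch: the function name `zombies_by_parent` is made up for illustration, and it assumes `ps -eo s,pid,ppid,comm` prints the process state in the first column (as it does on Solaris and Linux) and that an awk with associative arrays is available.

```shell
# Summarise zombie processes per parent PID.
# Input: lines of "STATE PID PPID COMMAND", as printed by `ps -eo s,pid,ppid,comm`.
# Output: "PPID COUNT", busiest parent first.
zombies_by_parent() {
  awk '$1 == "Z" { n[$3]++ } END { for (p in n) print p, n[p] }' |
    sort -k2,2nr -k1,1n
}

# Usage on a live system (the header line is skipped because its state field is "S", not "Z"):
# ps -eo s,pid,ppid,comm | zombies_by_parent
```

If most zombies share one parent PID, that parent (for example a Nagios worker) is the process that is not reaping its children.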
------------------------truss -p 17845--------------------------------------
===>Output file attached for your reference.
-------------------------------------------------------------------------------
I am also wondering whether I should upgrade the client NRPE package as part of the Nagios Core update.
I am not sure which version of NRPE is installed on the Nagios clients. How can I check that?
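One common way to check this from the Nagios server is to call `check_nrpe` with only `-H` and no `-c`; the daemon then normally answers with a banner such as `NRPE v2.15`. The sketch below just pulls the version number out of that banner. The helper name `nrpe_version`, the plugin path and the host name are assumptions for illustration, not taken from this setup.

```shell
# Extract the version number from check_nrpe's banner line, e.g. "NRPE v2.15" -> "2.15".
nrpe_version() {
  sed -n 's/^NRPE v\([0-9][0-9.]*\).*/\1/p'
}

# On the Nagios server (path and host name are assumptions):
# /usr/local/nagios/libexec/check_nrpe -H client.example.com | nrpe_version
```

If the client was built without SSL, the same call may need `-n` added; an empty result means the banner was not returned and the raw `check_nrpe` output should be inspected instead.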
I've been testing Core 4.2.1 on a fresh build of Solaris 11.2...
The title says:
Nagios 4.1.1 too many zombie process and 100% cpu usage
...and in your nagios.cfg I see this:
NAGIOS.CFG - Sample Main Config File for Nagios 3.3.1
Can you clarify which version of Nagios Core you are using, and why Nagios 3.3.1 is listed in your main config? Can you run the following commands and show the output?
Hi,
OK. Apologies if it caused any confusion.
Let me explain the scenario first; the requested output is given at the end.
OS: Solaris 10 SPARC 64bit
Until last month it was Nagios Core 3.3.1, and then I upgraded to Nagios Core 4.1.1.
(Procedure followed: ./configure, make all & make install)
From then on, CPU usage started hitting 100% and all the other issues began.
As per the suggestions from this forum, I upgraded to 4.2 and then 4.2.1, and it didn't help.
So I restored 4.1.1 again from the tar backup taken prior to the upgrade to 4.2.
Now it is 4.1.1, with the 100% CPU usage issue and too many zombie processes.
I can see that /usr/local/nagios/etc/nagios.cfg is still the old 3.3.1 file (I haven't changed it, as the upgrade procedure mentioned no steps for updating it manually).
Hope it is clear now.
Please find the requested output below.
bash-3.2# cat /usr/local/nagios/etc/nagios.cfg |head -5
##############################################################################
#
# NAGIOS.CFG - Sample Main Config File for Nagios 3.3.1
#
# Read the documentation for more information on this configuration
-------------------------------------------------------------------------------------------------------------------------------------------------------------
bash-3.2# /usr/local/nagios/bin/nagios | head -2
bash-3.2# tail -40 /usr/local/nagios/var/nagios.log
[1475108262] wproc: 'Core Worker 13800' seems to be choked. ret = -1; bufsize = 5334: errno = 11 (Resource temporarily unavailable)
[1475108262] Unable to send check for host 'hostnsit1ctm02' to worker (ret=-2)
[1475108262] wproc: 'Core Worker 13757' seems to be choked. ret = -1; bufsize = 5600: errno = 11 (Resource temporarily unavailable)
[1475108262] Unable to run check for service 'Var Partition' on host 'hostnsit1ctm01'
[1475108262] wproc: 'Core Worker 13758' seems to be choked. ret = -1; bufsize = 5674: errno = 11 (Resource temporarily unavailable)
[1475108262] Unable to run check for service 'u01_oraredo_sit1ctsv' on host 'hostnsit1ctm01'
[1475108263] wproc: 'Core Worker 13759' seems to be choked. ret = -1; bufsize = 5212: errno = 11 (Resource temporarily unavailable)
[1475108263] Unable to send check for host 'hostnsit1dm01' to worker (ret=-2)
[1475108263] wproc: 'Core Worker 13760' seems to be choked. ret = -1; bufsize = 5519: errno = 11 (Resource temporarily unavailable)
[1475108263] Unable to run check for service 'NIC' on host 'hostnsit1ctm02'
[1475108263] wproc: 'Core Worker 13762' seems to be choked. ret = -1; bufsize = 5457: errno = 11 (Resource temporarily unavailable)
[1475108263] Unable to run check for service 'LUN_Connectivity_3' on host 'hostnsit1dm01'
[1475108263] wproc: 'Core Worker 13798' seems to be choked. ret = -1; bufsize = 5503: errno = 11 (Resource temporarily unavailable)
[1475108263] Unable to run check for service 'Fibre Connectivity' on host 'hostnsit1dm02'
[1475108264] wproc: 'Core Worker 13799' seems to be choked. ret = -1; bufsize = 5476: errno = 11 (Resource temporarily unavailable)
[1475108264] Unable to run check for service 'Var Partition' on host 'hostnsit1dm02'
[1475108264] wproc: 'Core Worker 13802' seems to be choked. ret = -1; bufsize = 5666: errno = 11 (Resource temporarily unavailable)
[1475108264] Unable to run check for service 'Swap Usage' on host 'hostnsit1inapp01'
[1475108265] wproc: 'Core Worker 13801' seems to be choked. ret = -1; bufsize = 5212: errno = 11 (Resource temporarily unavailable)
[1475108265] Unable to send check for host 'hostnsit1dm02' to worker (ret=-2)
[1475108265] wproc: 'Core Worker 13800' seems to be choked. ret = -1; bufsize = 5745: errno = 11 (Resource temporarily unavailable)
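To see whether the "choked" messages are spread evenly or concentrated on a few workers, the log can be tallied per worker PID. This is a rough sketch that assumes the message format shown in the excerpt above; the function name `choked_per_worker` is made up for illustration.

```shell
# Count "seems to be choked" messages per Core Worker PID.
# Input: nagios.log lines on stdin. Output: "WORKER_PID COUNT", sorted by PID.
choked_per_worker() {
  sed -n "s/.*wproc: 'Core Worker \([0-9][0-9]*\)' seems to be choked.*/\1/p" |
    sort -n | uniq -c | awk '{ print $2, $1 }'
}

# Usage against the log file from this thread:
# choked_per_worker < /usr/local/nagios/var/nagios.log
```

In the excerpt above nearly every worker appears, which suggests the problem is not limited to a single stuck worker.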