NagiosXI consuming a large amount of CPU
Posted: Thu Jul 16, 2015 8:29 am
by Frédéric GRANAT
Hi,
We allocated 4 vCPUs to Nagios XI (version 2014R2.0).
The server is consuming a large share of these resources (please see the attached files).
The attached files are:
top.txt result of command top
cpu.txt result of command ps aux --sort -%cpu
Is there a way to tune Nagios XI?
Regards,
Frederic
Re: NagiosXI consuming a large amount of CPU
Posted: Thu Jul 16, 2015 9:13 am
by jdalrymple
How large is your environment?
What sort of checks are you running?
Here are some documents that may be helpful to you:
https://assets.nagios.com/downloads/nag ... ios-XI.pdf
https://assets.nagios.com/downloads/nag ... zation.pdf
Your top output does make it appear that something is haywire. The core Nagios process should not be that busy unless your environment is monstrous.
Re: NagiosXI consuming a large amount of CPU
Posted: Thu Jul 16, 2015 10:32 am
by Frédéric GRANAT
We have 291 hosts and 725 services.
We monitor host availability (Windows servers, VMware ESX servers, routers, switches), and for the Windows servers we also monitor disk, CPU, RAM, and the Windows services set to start automatically.
Re: NagiosXI consuming a large amount of CPU
Posted: Thu Jul 16, 2015 11:10 am
by jdalrymple
Not a particularly large environment. At this point I'd recommend restarting the Nagios monitoring engine process. Stop the process (either in the GUI or on the command line with `/etc/init.d/nagios stop`), then run `ps -ef | grep nagios.cfg` to make sure no stray Nagios processes are left behind. If there are any, kill them with `kill` (`kill -9` if you have to). Then restart the Nagios process and see if it calms down.
If you look at the historical load (it should be there by default for localhost in the Nagios XI interface), is it always high, or is this a recent or recurring problem?
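The stop, check, kill, restart sequence described in this post could be sketched roughly as follows. The init script path and the `nagios.cfg` pattern come from the post; the function names, the 5-second waits, and the choice to try a plain `kill` before `kill -9` are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: meant to be run as root on the XI server.

# Read `ps -ef`-style text on stdin and print the PIDs (2nd column) of
# processes whose command line mentions nagios.cfg. grep itself is skipped.
stray_nagios_pids() {
    grep 'nagios\.cfg' | grep -v grep | awk '{print $2}'
}

# Stop the engine, clean up anything left behind, then start it again.
restart_nagios_clean() {
    /etc/init.d/nagios stop
    sleep 5                                  # assumed grace period

    pids=$(ps -ef | stray_nagios_pids)
    if [ -n "$pids" ]; then
        echo "Leftover Nagios processes: $pids"
        kill $pids                           # plain TERM first
        sleep 5
        pids=$(ps -ef | stray_nagios_pids)
        if [ -n "$pids" ]; then
            kill -9 $pids                    # last resort
        fi
    fi

    /etc/init.d/nagios start
}

# Nothing runs when this file is loaded; call restart_nagios_clean when ready.
```

Keeping the PID filter in its own function makes it possible to try it on saved `ps -ef` output before touching the live system.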
Re: NagiosXI consuming a large amount of CPU
Posted: Fri Jul 17, 2015 4:06 am
by Frédéric GRANAT
Hi,
I restarted the Nagios monitoring engine and I still have the problem.
Yes, the problem occurred in the past, and it was due to the monitoring of VMware ESX servers.
More precisely, it was due to the monitoring of services using check_esx3.pl.
Please see the post:
"NagiosXI consuming a large amount of CPU
by Frédéric GRANAT » Mon Dec 10, 2012 9:11 am in Nagios XI "
It seems to be the same cause this time: I added monitoring for 5 ESX servers, and CPU consumption has increased sharply since then.
The difference is that now I only check the hosts, not the services, so I don't use check_esx3.pl.
Do you have any idea?
Re: NagiosXI consuming a large amount of CPU
Posted: Fri Jul 17, 2015 9:25 am
by jdalrymple
I'm going to offer my solution, and also point you to my coworker's site, which has volumes of information on the topic.
I personally have not seen check_esx3.pl consume such a ridiculous amount of CPU, but I've also never monitored a very large VMware environment. At most 30 hosts and about 1000 VMs. That said, your environment is likely much larger, so therein lies the additional load. My suggestion would be to offload the ESX checks to a gearman worker. Let it handle the brunt of that Perl script, and then your XI box can focus on its normal day-to-day routine. Here is the documentation:
https://assets.nagios.com/downloads/nag ... ios_XI.pdf
It's foolproof to set up and generally just works.
Now, on to what my coworker Box293 would recommend: switch to his check.
https://exchange.nagios.org/directory/P ... re/details
He works on it a lot and it is configured in such a fashion that the checks are performed on a vMA (VMware Management Assistant) instead of directly on your XI box.
Both are great options, and either one is guaranteed to reduce the load on your XI box, although it just moves that load to another host. The only other suggestion I can offer is to tidy up your VMware checks. Make sure you're not needlessly monitoring the same NFS datastores on each and every host, and make sure you're not needlessly monitoring your dedicated vMotion or svMotion networks (assuming you don't care about saturation there).
Re: NagiosXI consuming a large amount of CPU
Posted: Fri Jul 17, 2015 10:05 am
by Frédéric GRANAT
You said:
"At most 30 hosts and about 1000 VMs. That said your environment is likely much larger so therein lies the additional load"
My answer: no, as I said, we have 291 hosts (16 ESX servers) and 725 services.
Before I added the 5 new ESX servers, the CPU consumption was fine.
Re: NagiosXI consuming a large amount of CPU
Posted: Fri Jul 17, 2015 10:31 am
by jdalrymple
Frédéric GRANAT wrote: Before I added the 5 new ESX servers, the CPU consumption was fine.
Are the checks succeeding on these 5 new hosts, or are they timing out? Are you monitoring a WHOLE LOT MORE on the 5 new ESX boxes than on the ones that already existed?
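One way to answer the timeout question is to look for timeout warnings in the Nagios log. The sketch below only assumes that such warnings contain the words "timed out" and the host name; the log path is the usual location on a source-based install and may differ on your system, and the function name and host names are made up for the example.

```shell
#!/bin/sh
# Sketch only. The default log path is an assumption; override NAGIOS_LOG if needed.
NAGIOS_LOG="${NAGIOS_LOG:-/usr/local/nagios/var/nagios.log}"

# Print "<host> <count>" for each host: the number of log lines that mention
# both "timed out" and the host name. Matching is by substring, so "esx1"
# would also count lines for "esx12".
# Usage: count_timeouts LOGFILE HOST [HOST...]
count_timeouts() {
    log=$1
    shift
    for h in "$@"; do
        # grep -c exits non-zero on a count of 0, hence the "|| true"
        n=$(grep -i 'timed out' "$log" | grep -c -F -- "$h" || true)
        echo "$h $n"
    done
}

# Example with hypothetical host names:
# count_timeouts "$NAGIOS_LOG" esx12 esx13 esx14 esx15 esx16
```

A count of 0 for all five new hosts would suggest the checks are completing, which points away from timeouts as the source of the load.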
Re: NagiosXI consuming a large amount of CPU
Posted: Wed Jul 22, 2015 2:07 am
by Frédéric GRANAT
The checks are succeeding, and I do nothing more for them than for the other ESX servers (only host availability).
Re: NagiosXI consuming a large amount of CPU
Posted: Wed Jul 22, 2015 4:32 pm
by jdalrymple
Run this script and post the results:
Code:
#!/usr/bin/perl
#
# ============================== SUMMARY =====================================
#
# Program : profile_nagios_executiontime.pl
# Version : 0.21
# Date : Jan 15, 2012
# Author : William Leibzon - [email protected]
# Summary : This is a nagios profiler to find which checks take longer
# time to execute. Run it directly from unix shell, not as
# a plugin. There are no parameters, but you may want to
# change the file with path to your nagios status file
# if its different than /var/log/nagios/status.dat
# Licence : GPL - summary below, text at http://www.fsf.org/licenses/gpl.txt
# Version History: 0.1 - November 2008 : original release for nagios 2.x
# 0.2 - Dec 15, 2010 : support for nagios 3.0, simple summary header added
# 0.21 - Jan 15, 2012 : if nagios is not running, don't give an exception
# =========================== PROGRAM LICENSE ================================
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
#
# ========================== START OF PROGRAM CODE ===========================
use strict;
my %service_data = ();
my %host_data = ();
my $file="/usr/local/nagios/var/status.dat";
if (!open (FL, $file)) {
    print "Could not open file $file - $!";
    print "\nPerhaps Nagios is not running?\n";
    exit(1);
}
# Parse status.dat: each "name { key=value ... }" block is collected into a hash
my $block="";
my $bdata;
while (<FL>) {
    # '{' is escaped because newer Perl releases may reject an unescaped literal brace
    if ( !$block && /\s*(\w+)\s+\{/ ) {
        $block=$1;
        $bdata={};
    }
    elsif ( $block && /\s*}/) {
        if (($block eq "host" || $block eq "hoststatus") && defined($bdata->{'host_name'})) {
            $host_data{$bdata->{'host_name'}}=$bdata;
        }
        if (($block eq "service" || $block eq "servicestatus") && defined($bdata->{'host_name'}) && defined($bdata->{'service_description'})) {
            $service_data{$bdata->{'host_name'}.'_____'.$bdata->{'service_description'}}=$bdata;
        }
        $block="";
    }
    elsif ( $block && /\s*(\w+)=(.*)/ ) {
        $bdata->{$1}=$2;
    }
}
close(FL);
my %stats=('_all_'=>{tnum=>0,texec=>0});
my $host;
my $service;
# List active service checks, slowest first, and accumulate per-service totals
foreach (sort { $service_data{$b}{check_execution_time} <=> $service_data{$a}{check_execution_time} } keys %service_data) {
    if ($service_data{$_}{active_checks_enabled}==1) {
        $host=$service_data{$_}{host_name};
        $service=$service_data{$_}{service_description};
        print "Host: $host Service: $service Check Time: ".$service_data{$_}{check_execution_time}."\n";
        $stats{_all_}{texec}+=$service_data{$_}{check_execution_time};
        $stats{_all_}{tnum}++;
        $stats{$service}={texec=>0,tnum=>0} if !defined($stats{$service});
        $stats{$service}{texec}+=$service_data{$_}{check_execution_time};
        $stats{$service}{tnum}++;
    }
}
print "\n";
# Summary: average execution time per service description, then overall totals
if ($stats{'_all_'}{'tnum'}>0) {
    printf "Service: $_ Average Execution Time: %.3f (sec) NumChecks: %d\n",($stats{$_}{texec}/$stats{$_}{tnum}),$stats{$_}{tnum} foreach (sort { $stats{$a}{texec}/$stats{$a}{tnum} <=> $stats{$b}{texec}/$stats{$b}{tnum} } keys %stats);
    printf "\nTotal Execution Time: %d (sec) NumChecks: %d Average Time: %.3f (sec)\n",$stats{'_all_'}{texec},$stats{'_all_'}{tnum},($stats{'_all_'}{texec}/$stats{'_all_'}{tnum});
}
else {
    print "\nCould not find data on actively executed checks. Is your nagios configured and running?\n";
}
It's a handy dandy script from here:
https://exchange.nagios.org/directory/P ... me/details
modified for XI though
** EDIT **
FYI - I don't expect this script to have the answers, I expect clues...
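Note that the script above reports service checks only, while the new ESX servers are monitored with host checks alone. As a rough complement, host check times can be pulled from the same status file. This is a sketch, not part of the original script: the function name is made up, and it assumes the `hoststatus` blocks carry `host_name` and `check_execution_time` fields like the `servicestatus` blocks the Perl script reads.

```shell
#!/bin/sh
# Sketch only. Same status file path as in the Perl script above.
STATUS_FILE="${STATUS_FILE:-/usr/local/nagios/var/status.dat}"

# Print "<seconds> <host>" for every hoststatus block, slowest first.
# Usage: host_check_times [STATUS_FILE]
host_check_times() {
    awk '
        # entering a hoststatus block: reset the collected fields
        /^[ \t]*hoststatus[ \t]*[{]/ { in_host = 1; name = ""; secs = ""; next }
        # leaving it: emit one line per host
        in_host && /^[ \t]*[}]/ { if (name != "") print secs, name; in_host = 0; next }
        in_host {
            line = $0
            sub(/^[ \t]+/, "", line)
            if (index(line, "host_name=") == 1)            name = substr(line, 11)
            if (index(line, "check_execution_time=") == 1) secs = substr(line, 22)
        }
    ' "${1:-$STATUS_FILE}" | LC_ALL=C sort -rn
}

# Example: host_check_times | head -20
```

If the five new ESX hosts sit at the top of that list with times close to the host check timeout, that would be one of the clues worth posting.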