NagiosXI consuming a large amount of CPU
-
Frédéric GRANAT
- Posts: 445
- Joined: Mon Nov 19, 2012 11:36 am
NagiosXI consuming a large amount of CPU
Hi,
We allocated 4 vCPUs to Nagios XI (Nagios XI 2014R2.0).
The server is consuming a large share of these resources (please see the attached files):
top.txt: output of the top command
cpu.txt: output of ps aux --sort -%cpu
Is there a way to tune Nagios XI?
rgds,
Frederic
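For readers who cannot open the attachments, one way to condense `ps aux` output is to total the %CPU column per command, which shows quickly whether the load comes from the Nagios engine itself or from the plugins it spawns. This is a minimal sketch, not part of the original post: the function name `cpu_by_command` is made up for illustration, and it assumes the usual `ps aux` layout with %CPU in column 3 and the command starting in column 11.

```shell
#!/bin/sh
# Sum the %CPU column of `ps aux` output per command name, highest first.
# Reads `ps aux` output on stdin; assumes %CPU is field 3 and the
# command path is field 11 (arguments after it are ignored).
cpu_by_command() {
    awk 'NR > 1 { cpu[$11] += $3 }
         END { for (c in cpu) printf "%.1f %s\n", cpu[c], c }' | sort -rn
}

# Usage: ps aux | cpu_by_command | head
```

Grouping by the command path alone is a deliberate simplification: all Nagios worker processes collapse into one line, at the cost of hiding which arguments each was started with.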
-
jdalrymple
- Skynet Drone
- Posts: 2620
- Joined: Wed Feb 11, 2015 1:56 pm
Re: NagiosXI consuming a large amount of CPU
How large is your environment?
What sort of checks are you running?
Here are some documents that may be helpful to you:
https://assets.nagios.com/downloads/nag ... ios-XI.pdf
https://assets.nagios.com/downloads/nag ... zation.pdf
Your top output does make it appear that something is haywire. The core Nagios process should not be that busy unless your environment is monstrous.
-
Frédéric GRANAT
- Posts: 445
- Joined: Mon Nov 19, 2012 11:36 am
Re: NagiosXI consuming a large amount of CPU
We have 291 hosts and 725 services
We monitor the availability of hosts (Windows servers, VMware ESX servers, routers, switches), and for Windows servers we also monitor disk, CPU, RAM, and Windows services set to automatic start.
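These totals can be cross-checked against what the engine actually has loaded by counting the `hoststatus` and `servicestatus` blocks in the status file. A sketch only: `count_blocks` is an illustrative helper name, and the default path `/usr/local/nagios/var/status.dat` is an assumption that may not match every install.

```shell
#!/bin/sh
# Count status.dat blocks of a given type, e.g. hoststatus or servicestatus.
# $1 = block type, $2 = path to status.dat (the default path is an assumption).
count_blocks() {
    grep -c "^[[:space:]]*$1[[:space:]]*{" "${2:-/usr/local/nagios/var/status.dat}"
}

# Usage:
#   count_blocks hoststatus
#   count_blocks servicestatus
```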
-
jdalrymple
- Skynet Drone
- Posts: 2620
- Joined: Wed Feb 11, 2015 1:56 pm
Re: NagiosXI consuming a large amount of CPU
Not a particularly large environment. From where you are I'd recommend restarting the Nagios monitoring engine process. I would stop the process (either in the GUI or on the command line with `/etc/init.d/nagios stop`), then run `ps -ef | grep nagios.cfg` to make sure you don't have stray Nagios processes still running. If you do, kill them with `kill`, using `-9` if you have to. Then restart the Nagios process and see if it calms down.
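The stop-and-verify step described above might look like the following. This is a sketch under stated assumptions: `stray_nagios_pids` is an illustrative helper, not a Nagios tool, and it expects `ps -ef` output with the PID in the second column. The `grep -v grep` filter keeps the search command itself out of the result.

```shell
#!/bin/sh
# Print the PIDs of processes whose command line mentions nagios.cfg.
# Reads `ps -ef` output on stdin; the PID is assumed to be field 2.
stray_nagios_pids() {
    grep 'nagios\.cfg' | grep -v grep | awk '{ print $2 }'
}

# Intended sequence (run as root on the XI server; nothing here runs it):
#   /etc/init.d/nagios stop
#   ps -ef | stray_nagios_pids                 # prints nothing after a clean stop
#   ps -ef | stray_nagios_pids | xargs kill    # escalate to kill -9 only if needed
#   /etc/init.d/nagios start
```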
If you look at the historical load (should be a default in the NagiosXI interface for localhost) is it always high or is this a recent or recurring problem?
-
Frédéric GRANAT
- Posts: 445
- Joined: Mon Nov 19, 2012 11:36 am
Re: NagiosXI consuming a large amount of CPU
Hi,
I restarted the nagios monitoring engine and I still have the problem.
Yes, the problem occurred in the past, and it was due to the monitoring of VMware ESX servers, more precisely the monitoring of services using check_esx3.pl.
Please see the post:
"NagiosXI consuming a large amount of CPU
by Frédéric GRANAT » Mon Dec 10, 2012 9:11 am in Nagios XI"
It seems to be the same cause this time: I added monitoring of 5 ESX servers, and since then the CPU consumption has increased sharply.
The difference is that now I only check the hosts, not the services, so I don't use check_esx3.pl.
Do you have any ideas?
-
jdalrymple
- Skynet Drone
- Posts: 2620
- Joined: Wed Feb 11, 2015 1:56 pm
Re: NagiosXI consuming a large amount of CPU
I'm going to offer my solution, and also point you to my coworker's site, which has volumes of information on the topic.
I personally have not seen check_esx3.pl consume such a ridiculous amount of CPU, but I've also never monitored a truly huge VMware environment. At most 30 hosts and about 1000 VMs. That said your environment is likely much larger so therein lies the additional load. My suggestion would be to offload the ESX checks to a Gearman worker. Let it handle the brunt of that Perl script, and then your XI box can focus on its normal day-to-day routine. Here is the documentation:
https://assets.nagios.com/downloads/nag ... ios_XI.pdf
It's foolproof to set up and generally just works.
Now - onto what my coworker Box293 would recommend - switch to his check.
https://exchange.nagios.org/directory/P ... re/details
He works on it a lot, and it is configured so that the checks are performed on a vMA (vSphere Management Assistant) instead of directly on your XI box.
Both are great options, and either one is guaranteed to reduce the load on your XI box, although it just moves that load to another host. The only other suggestion I can offer is to tidy up your VMware checks. Make sure you're not needlessly monitoring the same NFS datastores on each and every host, and make sure you're not needlessly monitoring your dedicated vMotion or svMotion networks (assuming you don't care about saturation there).
-
Frédéric GRANAT
- Posts: 445
- Joined: Mon Nov 19, 2012 11:36 am
Re: NagiosXI consuming a large amount of CPU
You said:
"At most 30 hosts and about 1000 VMs. That said your environment is likely much larger so therein lies the additional load"
My answer: no, as I said, we have 291 hosts (16 ESX servers) and 725 services.
Before I added the 5 new ESX servers, the CPU consumption was fine.
-
jdalrymple
- Skynet Drone
- Posts: 2620
- Joined: Wed Feb 11, 2015 1:56 pm
Re: NagiosXI consuming a large amount of CPU
Frédéric GRANAT wrote: Before I add the 5 new ESX servers, the CPU consumption was fine.
Are the checks succeeding on these 5 new hosts, or are they timing out? Are you monitoring a WHOLE LOT MORE on the new 5 ESX boxes than the ones that already existed?
-
Frédéric GRANAT
- Posts: 445
- Joined: Mon Nov 19, 2012 11:36 am
Re: NagiosXI consuming a large amount of CPU
The checks are succeeding, and I monitor nothing more than on the other ESX servers (only the availability of the host).
-
jdalrymple
- Skynet Drone
- Posts: 2620
- Joined: Wed Feb 11, 2015 1:56 pm
Re: NagiosXI consuming a large amount of CPU
Run this script and post the results:
It's a handy dandy script from here:
https://exchange.nagios.org/directory/P ... me/details
modified for XI though
** EDIT **
FYI - I don't expect this script to have the answers, I expect clues...
Code: Select all
#!/usr/bin/perl
#
# ============================== SUMMARY =====================================
#
# Program : profile_nagios_executiontime.pl
# Version : 0.21
# Date : Jan 15, 2012
# Author : William Leibzon - [email protected]
# Summary : This is a nagios profiler to find which checks take longer
# time to execute. Run it directly from unix shell, not as
# a plugin. There are no parameters, but you may want to
# change the file with path to your nagios status file
# if its different than /var/log/nagios/status.dat
# Licence : GPL - summary below, text at http://www.fsf.org/licenses/gpl.txt
# Version History: 0.1 - November 2008 : original release for nagios 2.x
# 0.2 - Dec 15, 2010 : support for nagios 3.0, simple summary header added
# 0.21 - Jan 15, 2012 : if nagios is not running, don't give an exception
# =========================== PROGRAM LICENSE ================================
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
#
# ========================== START OF PROGRAM CODE ===========================
use strict;

my %service_data = ();
my %host_data = ();
my $file="/usr/local/nagios/var/status.dat";

if (!open (FL, $file)) {
    print "Could not open file $file - $!";
    print "\nPerhaps Nagios is not running?\n";
    exit(1);
}

# Parse status.dat: collect the key=value pairs of each "name { ... }" block.
# The patterns are anchored so a brace inside plugin output cannot end a block,
# and the opening brace is escaped because newer Perl versions reject a bare "{".
my $block="";
my $bdata;
while (<FL>) {
    if ( !$block && /^\s*(\w+)\s+\{/ ) {
        $block=$1;
        $bdata={};
    }
    elsif ( $block && /^\s*}/) {
        if (($block eq "host" || $block eq "hoststatus") && defined($bdata->{'host_name'})) {
            $host_data{$bdata->{'host_name'}}=$bdata;
        }
        if (($block eq "service" || $block eq "servicestatus") && defined($bdata->{'host_name'}) && defined($bdata->{'service_description'})) {
            $service_data{$bdata->{'host_name'}.'_____'.$bdata->{'service_description'}}=$bdata;
        }
        $block="";
    }
    elsif ( $block && /^\s*(\w+)=(.*)/ ) {
        $bdata->{$1}=$2;
    }
}
close(FL);

my %stats=('_all_'=>{tnum=>0,texec=>0});
my $host;
my $service;

# Print active service checks, slowest first, and accumulate per-service totals
foreach (sort { $service_data{$b}{check_execution_time} <=> $service_data{$a}{check_execution_time} } keys %service_data) {
    if ($service_data{$_}{active_checks_enabled}==1) {
        $host=$service_data{$_}{host_name};
        $service=$service_data{$_}{service_description};
        print "Host: $host Service: $service Check Time: ".$service_data{$_}{check_execution_time}."\n";
        $stats{_all_}{texec}+=$service_data{$_}{check_execution_time};
        $stats{_all_}{tnum}++;
        $stats{$service}={texec=>0,tnum=>0} if !defined($stats{$service});
        $stats{$service}{texec}+=$service_data{$_}{check_execution_time};
        $stats{$service}{tnum}++;
    }
}
print "\n";

if ($stats{'_all_'}{'tnum'}>0) {
    # Per-service averages in ascending order, then the overall total
    printf "Service: $_ Average Execution Time: %.3f (sec) NumChecks: %d\n",($stats{$_}{texec}/$stats{$_}{tnum}),$stats{$_}{tnum} foreach (sort { $stats{$a}{texec}/$stats{$a}{tnum} <=> $stats{$b}{texec}/$stats{$b}{tnum} } keys %stats);
    printf "\nTotal Execution Time: %d (sec) NumChecks: %d Average Time: %.3f (sec)\n",$stats{'_all_'}{texec},$stats{'_all_'}{tnum},($stats{'_all_'}{texec}/$stats{'_all_'}{tnum});
}
else {
    print "\nCould not find data on actively executed checks. Is your nagios configured and running?\n";
}
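Note that the script stores host blocks in `%host_data` but only reports service timings, so it shows nothing for hosts that are monitored for availability alone, which is the case for the ESX servers in this thread. A rough equivalent for host checks could be sketched with awk as below. It reads the same `check_execution_time` field the script uses for services; `host_check_times` is an illustrative name, and the default path is the one hard-coded in the script, which may differ on other installs.

```shell
#!/bin/sh
# List host check execution times from status.dat, slowest first.
# $1 = path to status.dat (defaults to the path used by the Perl script).
host_check_times() {
    awk '
        # Start of a hoststatus block: reset the fields we collect
        /^[ \t]*hoststatus[ \t]*[{]/ { in_host = 1; name = ""; t = ""; next }
        # Inside the block, strip "key=" and keep the value
        in_host && /^[ \t]*host_name=/ { sub(/^[^=]*=/, ""); name = $0 }
        in_host && /^[ \t]*check_execution_time=/ { sub(/^[^=]*=/, ""); t = $0 }
        # End of block: emit "seconds hostname"
        in_host && /^[ \t]*[}]/ { if (name != "") print t, name; in_host = 0 }
    ' "${1:-/usr/local/nagios/var/status.dat}" | sort -rn
}

# Usage: host_check_times | head -n 20
```

If the five new ESX hosts sit at the top of this list with times close to the host check timeout, that would be one of the clues the script was meant to surface.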