NagiosXI consuming a large amount of CPU

Frédéric GRANAT
Posts: 445
Joined: Mon Nov 19, 2012 11:36 am

Post by Frédéric GRANAT »

Hi,
We allocated 4 vCPUs to Nagios XI (version 2014R2.0).

The server is consuming a large share of these resources. Please see the attached files:
top.txt: output of `top`
cpu.txt: output of `ps aux --sort -%cpu`

Is there a way to tune Nagios XI?

Regards,

Frederic
jdalrymple
Skynet Drone
Posts: 2620
Joined: Wed Feb 11, 2015 1:56 pm

Post by jdalrymple »

How large is your environment?
What sort of checks are you running?

Here are some documents that may be helpful to you:

https://assets.nagios.com/downloads/nag ... ios-XI.pdf
https://assets.nagios.com/downloads/nag ... zation.pdf

Your top output does make it appear that something is haywire. The core Nagios process should not be that busy unless your environment is monstrous.

Post by Frédéric GRANAT »

We have 291 hosts and 725 services.
We monitor the availability of hosts (Windows servers, VMware ESX servers, routers, switches), and for Windows servers we also monitor disk, CPU, RAM, and Windows services set to automatic start.

Post by jdalrymple »

Not a particularly large environment. As a first step I'd recommend restarting the Nagios monitoring engine process. Stop the process (either in the GUI or on the command line with `/etc/init.d/nagios stop`), then run `ps -ef | grep nagios.cfg` to make sure no stray Nagios processes are still running. If there are, kill them with `kill`, using `-9` only if you have to. Then start the Nagios process again and see if it calms down.
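
As a rough sketch, the stop / check / kill / start sequence above could be scripted like this. It assumes the SysV init script at `/etc/init.d/nagios` and root privileges; the function names `nagios_pids` and `restart_nagios_cleanly` and the 5-second waits are arbitrary choices for illustration, not part of Nagios XI.

```shell
#!/bin/sh
# Sketch only: function names and wait times are illustrative assumptions.

# Print the PIDs of processes whose command line mentions nagios.cfg.
# Reads `ps -ef`-style output on stdin (the PID is the second column).
nagios_pids() {
    awk '/nagios\.cfg/ && !/awk/ { print $2 }'
}

# Stop Nagios, clean up anything left behind, then start it again.
restart_nagios_cleanly() {
    /etc/init.d/nagios stop
    sleep 5                              # give the daemon time to exit
    pids=$(ps -ef | nagios_pids)
    if [ -n "$pids" ]; then
        echo "Leftover Nagios processes: $pids"
        kill $pids                       # plain kill (SIGTERM) first
        sleep 5
        pids=$(ps -ef | nagios_pids)
        if [ -n "$pids" ]; then
            kill -9 $pids                # force only if still running
        fi
    fi
    /etc/init.d/nagios start
}
```

Nothing runs until `restart_nagios_cleanly` is called. Running `ps -ef | nagios_pids` on its own first shows which processes would be killed, which is worth reviewing since anything with nagios.cfg on its command line will match.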

If you look at the historical load (there should be a default graph for localhost in the Nagios XI interface), is it always high, or is this a recent or recurring problem?

Post by Frédéric GRANAT »

Hi,
I restarted the Nagios monitoring engine and I still have the problem.
Yes, the problem occurred in the past, and it was due to the monitoring of VMware ESX servers, more precisely the monitoring of services using check_esx3.pl.

Please see the post:
"NagiosXI consuming a large amount of CPU
by Frédéric GRANAT » Mon Dec 10, 2012 9:11 am in Nagios XI"

It seems to be the same cause this time, because I added monitoring for 5 ESX servers and CPU consumption has increased sharply since then.
The difference is that now I only check the hosts, not the services, so I don't use check_esx3.pl.
Do you have any idea?

Post by jdalrymple »

I'm going to offer my solution, and also point you to my coworker's site, which has volumes of information on the topic.

I personally have not seen check_esx3.pl consume such a ridiculous amount of CPU, but I've also never monitored a very large VMware environment. At most 30 hosts and about 1000 VMs. That said your environment is likely much larger so therein lies the additional load. My suggestion would be to offload the ESX checks to a gearman worker. Let it handle the brunt of that Perl script so your XI box can focus on its normal day-to-day routine. Here is the documentation:

https://assets.nagios.com/downloads/nag ... ios_XI.pdf

It's foolproof to set up and generally just works.

Now - onto what my coworker Box293 would recommend - switch to his check.

https://exchange.nagios.org/directory/P ... re/details

He works on it a lot, and it is configured so that the checks are performed on a vMA (vSphere Management Assistant) instead of directly on your XI box.

Both options are great, and either one is guaranteed to reduce the load on your XI box, although it just displaces that load to another host. The only other suggestion I can offer is to tidy up your VMware checks. Make sure you're not needlessly monitoring the same NFS datastores on each and every host, and make sure you're not needlessly monitoring your vMotion or svMotion dedicated networks (assuming you don't care about saturation there).

Post by Frédéric GRANAT »

You said:
"At most 30 hosts and about 1000 VMs. That said your environment is likely much larger so therein lies the additional load"
My answer: No, as I said, we have 291 hosts (including 16 ESX servers) and 725 services.
Before I added the 5 new ESX servers, the CPU consumption was fine.

Post by jdalrymple »

Frédéric GRANAT wrote: Before I added the 5 new ESX servers, the CPU consumption was fine.

Are the checks succeeding on these 5 new hosts, or are they timing out? Are you monitoring a WHOLE LOT MORE on the 5 new ESX boxes than on the ones that already existed?

Post by Frédéric GRANAT »

The checks are succeeding, and I monitor nothing more on them than on the other ESX servers (only the availability of the host).

Post by jdalrymple »

Run this script and post the results:

Code:

#!/usr/bin/perl
#
# ============================== SUMMARY =====================================
#
# Program : profile_nagios_executiontime.pl
# Version : 0.21
# Date    : Jan 15, 2012
# Author  : William Leibzon - [email protected]
# Summary : This is a nagios profiler to find which checks take longer
#           time to execute. Run it directly from unix shell, not as
#           a plugin. There are no parameters, but you may want to
#           change the file with path to your nagios status file
#           if it's different than /usr/local/nagios/var/status.dat
# Licence : GPL - summary below, text at http://www.fsf.org/licenses/gpl.txt
# Version History: 0.1 - November 2008 : original release for nagios 2.x
#                  0.2  - Dec 15, 2010 : support for nagios 3.0, simple summary header added
#                  0.21 - Jan 15, 2012 : if nagios is not running, don't give an exception
# =========================== PROGRAM LICENSE ================================
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
#
# ========================== START OF PROGRAM CODE ===========================

use strict;

my %service_data = ();
my %host_data = ();
my $file="/usr/local/nagios/var/status.dat";

if (!open (FL, $file)) {
        print "Could not open file $file - $!";
        print "\nPerhaps Nagios is not running?\n";
        exit(1);
}
my $block="";
my $bdata;
while (<FL>) {
        # Literal "{" is escaped: newer Perl versions warn about or reject an unescaped one in a regex
        if ( !$block && /\s*(\w+)\s+\{/ ) {
                $block=$1;
                $bdata={};
        }
        elsif ( $block && /\s*}/) {
                if (($block eq "host" || $block eq "hoststatus") && defined($bdata->{'host_name'})) {
                        $host_data{$bdata->{'host_name'}}=$bdata;
                }
                if (($block eq "service" || $block eq "servicestatus") && defined($bdata->{'host_name'}) && defined($bdata->{'service_description'})) {
                        $service_data{$bdata->{'host_name'}.'_____'.$bdata->{'service_description'}}=$bdata;
                }
                $block="";
        }
        elsif ( $block && /\s*(\w+)=(.*)/ ) {
                $bdata->{$1}=$2;
        }
}
close(FL);

my %stats=('_all_'=>{tnum=>0,texec=>0});
my $host;
my $service;
foreach (sort { $service_data{$b}{check_execution_time} <=> $service_data{$a}{check_execution_time} } keys %service_data) {
        if ($service_data{$_}{active_checks_enabled}==1) {
                $host=$service_data{$_}{host_name};
                $service=$service_data{$_}{service_description};
                print "Host: $host Service: $service Check Time: ".$service_data{$_}{check_execution_time}."\n";
                $stats{_all_}{texec}+=$service_data{$_}{check_execution_time};
                $stats{_all_}{tnum}++;
                $stats{$service}={texec=>0,tnum=>0} if !defined($stats{$service});
                $stats{$service}{texec}+=$service_data{$_}{check_execution_time};
                $stats{$service}{tnum}++;
        }
}
print "\n";
if ($stats{'_all_'}{'tnum'}>0) {
  printf "Service: $_   Average Execution Time: %.3f (sec)  NumChecks: %d\n",($stats{$_}{texec}/$stats{$_}{tnum}),$stats{$_}{tnum} foreach (sort { $stats{$a}{texec}/$stats{$a}{tnum} <=> $stats{$b}{texec}/$stats{$b}{tnum} } keys %stats);
  printf "\nTotal Execution Time: %d (sec)   NumChecks: %d   Average Time: %.3f (sec)\n",$stats{'_all_'}{texec},$stats{'_all_'}{tnum},($stats{'_all_'}{texec}/$stats{'_all_'}{tnum});
}
else {
  print "\nCould not find data on actively executed checks. Is your nagios configured and running?\n";
}

It's a handy dandy script from here:
https://exchange.nagios.org/directory/P ... me/details
modified for XI though

** EDIT **

FYI - I don't expect this script to have the answers, I expect clues...
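
One way to run it, sketched under a few assumptions: the script is saved as `profile_nagios_executiontime.pl` (the name in its header) in the current directory, and `slowest_checks` is a hypothetical helper, not part of the script. The script prints one `Host: ... Service: ... Check Time: ...` line per active check, sorted slowest first, so the helper just keeps the first few of those lines.

```shell
#!/bin/sh
# Hypothetical helper: keep the first N per-check lines (default 10).
# The profiler already prints them sorted by execution time, slowest first.
slowest_checks() {
    grep '^Host: ' | head -n "${1:-10}"
}

# Assumed usage, with the script saved in the current directory:
#   perl profile_nagios_executiontime.pl | slowest_checks 20
```

The per-service averages printed at the end of the full output are still worth posting too; the helper is only a quick way to see the worst offenders.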