Re: Slow NagiosXI VM

Posted: Thu Dec 22, 2011 8:21 pm
by Fred Kroeger
Thanks Mike - I've done/checked just about everything - but keep asking the questions - we may just crack it!
No - nothing in the nagios.log file related to errors, no multiple nagios processes running, no extra notifications going out.
Load average is higher while npcd is running (~9-10 with npcd started, down to ~6-7 with npcd stopped), but this shouldn't be a problem for a 4xCPU VM.
Also tried running the VM with 2xCPU & 3xCPU - again no real difference - Load average was higher ~12.
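
As a rough way to read these numbers, load average can be divided by the number of CPUs; on Linux, a sustained value above 1.0 per CPU generally means processes are queuing for CPU or waiting on disk I/O. The sketch below uses the approximate figures from this post. The function name `load_per_cpu` is made up for illustration, and pairing the ~12 load with the 3xCPU run is an assumption, since that figure is reported for the 2xCPU and 3xCPU tests together.

```python
def load_per_cpu(load_average, cpu_count):
    """Return the load average normalised by the number of CPUs."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return load_average / cpu_count


# Approximate figures taken from the post above (midpoints of the quoted ranges).
scenarios = [
    ("4 CPUs, npcd running", 9.5, 4),
    ("4 CPUs, npcd stopped", 6.5, 4),
    ("3 CPUs (assumed pairing)", 12.0, 3),
]

for label, load, cpus in scenarios:
    print(f"{label}: {load_per_cpu(load, cpus):.2f} per CPU")
```

Every scenario comes out above 1.0 per CPU, which fits the picture of checks waiting their turn rather than running on schedule.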

I did have a look at the link you sent from Daniel's presentation, but I overlooked the part you were referring to.
He had a one-liner there to renice the perfdata run command in npcd.cfg:

Code:

perfdata_file_run_cmd = /bin/nice -n 19 /usr/local/nagios/libexec/process_perfdata.pl

This has brought my latency down to ~45 secs.

I set the reaper values back to default as you suggested, and latency improved a little more, to ~30-35 secs.

The office is closing over the Christmas/New Year period - so if you respond to this I will probably only see it when I get back.
However by then, I'm sure you will have had a breakthrough and I will be able to implement the solution to my performance problems ;-)

Thanks for your help! regards... Fred

(My latest stats)

Code:

Nagios Stats 3.2.3
Copyright (c) 2003-2008 Ethan Galstad (www.nagios.org)
Last Modified: 10-03-2010
License: GPL

CURRENT STATUS DATA
------------------------------------------------------
Status File:                            /var/nagiosramdisk/status.dat
Status File Age:                        0d 0h 0m 14s
Status File Version:                    3.2.3

Program Running Time:                   0d 0h 47m 6s
Nagios PID:                             7774
Used/High/Total Command Buffers:        0 / 0 / 4096

Total Services:                         5507
Services Checked:                       5507
Services Scheduled:                     5507
Services Actively Checked:              5507
Services Passively Checked:             0
Total Service State Change:             0.000 / 37.170 / 0.051 %
Active Service Latency:                 10.221 / 63.098 / 32.035 sec
Active Service Execution Time:          0.018 / 46.974 / 1.236 sec
Active Service State Change:            0.000 / 37.170 / 0.051 %
Active Services Last 1/5/15/60 min:     666 / 4282 / 5471 / 5507
Passive Service Latency:                0.000 / 0.000 / 0.000 sec
Passive Service State Change:           0.000 / 0.000 / 0.000 %
Passive Services Last 1/5/15/60 min:    0 / 0 / 0 / 0
Services Ok/Warn/Unk/Crit:              5287 / 55 / 66 / 99
Services Flapping:                      8
Services In Downtime:                   0

Total Hosts:                            679
Hosts Checked:                          679
Hosts Scheduled:                        679
Hosts Actively Checked:                 679
Host Passively Checked:                 0
Total Host State Change:                0.000 / 16.840 / 0.081 %
Active Host Latency:                    0.000 / 59.670 / 30.981 sec
Active Host Execution Time:             0.038 / 10.129 / 0.182 sec
Active Host State Change:               0.000 / 16.840 / 0.081 %
Active Hosts Last 1/5/15/60 min:        41 / 492 / 657 / 679
Passive Host Latency:                   0.000 / 0.000 / 0.000 sec
Passive Host State Change:              0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min:       0 / 0 / 0 / 0
Hosts Up/Down/Unreach:                  678 / 1 / 0
Hosts Flapping:                         1
Hosts In Downtime:                      0

Active Host Checks Last 1/5/15 min:     102 / 707 / 1900
   Scheduled:                           61 / 513 / 1351
   On-demand:                           41 / 194 / 549
   Parallel:                            61 / 517 / 1361
   Serial:                              0 / 0 / 0
   Cached:                              41 / 190 / 539
Passive Host Checks Last 1/5/15 min:    0 / 0 / 0
Active Service Checks Last 1/5/15 min:  1033 / 4605 / 13387
   Scheduled:                           1033 / 4605 / 13387
   On-demand:                           0 / 0 / 0
   Cached:                              0 / 0 / 0
Passive Service Checks Last 1/5/15 min: 0 / 0 / 0

External Commands Last 1/5/15 min:      0 / 0 / 0
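
For anyone wanting to track these latency figures over time, here is a minimal sketch that pulls the three-value lines out of output like the above. It assumes the three numbers are in min / max / avg order, which is consistent with the values shown here; the names `TRIPLE` and `parse_triples` are made up for illustration and are not part of Nagios.

```python
import re

# Matches lines such as
# "Active Service Latency:   10.221 / 63.098 / 32.035 sec"
# Lines with four values (e.g. "Services Ok/Warn/Unk/Crit") do not match.
TRIPLE = re.compile(
    r"^(?P<label>[^:]+):\s+"
    r"(?P<min>[\d.]+) / (?P<max>[\d.]+) / (?P<avg>[\d.]+) (?:sec|%)\s*$"
)


def parse_triples(text):
    """Return {label: (min, max, avg)} for each three-value line in the text."""
    result = {}
    for line in text.splitlines():
        match = TRIPLE.match(line.strip())
        if match:
            result[match.group("label").strip()] = (
                float(match.group("min")),
                float(match.group("max")),
                float(match.group("avg")),
            )
    return result


# Sample lines copied from the stats in this post.
sample = """\
Active Service Latency:                 10.221 / 63.098 / 32.035 sec
Active Service Execution Time:          0.018 / 46.974 / 1.236 sec
Active Host Latency:                    0.000 / 59.670 / 30.981 sec
Services Ok/Warn/Unk/Crit:              5287 / 55 / 66 / 99
"""

stats = parse_triples(sample)
for label, (low, high, avg) in stats.items():
    print(f"{label}: min={low} max={high} avg={avg}")
```

Logging the average of "Active Service Latency" after each configuration change makes it easier to see which change actually helped.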


Re: Slow NagiosXI VM

Posted: Tue Dec 27, 2011 10:48 am
by mguthrie
My guess is that we still need to give the CPU a little more breathing room. Disk I/O could be the issue as well, since that could be delaying the checks. You could look at trying out rrdcached to reduce the disk writes to the RRD files:
http://assets.nagios.com/downloads/nagi ... ios_XI.pdf

If you haven't already, you could offload mysql to a 2nd server, and that tends to cut CPU usage almost in half, as well as disk writes.

Another thing to look at would be some of the following checks: either move some of them to slave servers or run them as passive checks, since they take up a lot of CPU. Or, if you want to keep them all centralized, you could space them a little farther apart if your response time requirements allow it.

ESX checks -> check_esx3.pl (this can steal up to 40% CPU while it runs)
SNMP checks
WMI checks

Re: Slow NagiosXI VM

Posted: Tue Dec 27, 2011 12:51 pm
by jtata
I experienced similar problems with my VM when my number of checks started getting into the hundreds. The following seemed to help:

-Resource allocation in ESX. My VM has 4 virtual procs and 2GB RAM. With the default VM image I was maxing out around 100 checks; with multiple CPUs I am running over 500 checks on a single VM (which also houses mysql).
-Check host assignment for the VM; if possible, keep Nagios on a different ESX host than other high-CPU servers. Once I set a DRS exclusion to keep Nagios separate from my EPO server, I got far fewer CPU alarms in ESX.
-Extend the interval between checks wherever possible. Not everything needs check_interval=5 and retry_interval=1. I find the retry interval to be very important, as you can really tax the system if a lot of hosts are down; consequently I reserve 5/1 for services I need to report on for SLA purposes.
-Minimize notifications where possible. Switching most of my notifications from 5 separate contacts to a single distribution list means 1 email per alert instead of 5. Big difference when you have 40 checks alarm at once.
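
To put rough numbers on the last two points, here is a back-of-the-envelope sketch. It assumes checks are spread evenly across their interval and ignores retries; the function names are made up for illustration, and the 10-minute interval is just an example value, not a recommendation from this thread.

```python
def checks_per_minute(num_checks, interval_minutes):
    """Average checks executed per minute if checks are spread evenly."""
    return num_checks / interval_minutes


def emails_sent(num_alerts, recipients_per_alert):
    """Total emails generated when each alert goes to every recipient."""
    return num_alerts * recipients_per_alert


# 500 checks, as in the VM described above.
rate_at_5 = checks_per_minute(500, 5)    # every check on a 5-minute interval
rate_at_10 = checks_per_minute(500, 10)  # hypothetical 10-minute interval

# 40 checks alarming at once, 5 separate contacts vs 1 distribution list.
emails_separate = emails_sent(40, 5)
emails_list = emails_sent(40, 1)

print(f"checks/min at 5-minute interval:  {rate_at_5:.0f}")
print(f"checks/min at 10-minute interval: {rate_at_10:.0f}")
print(f"emails with 5 contacts: {emails_separate}")
print(f"emails with 1 list:     {emails_list}")
```

Doubling the interval halves the steady check rate, and the single distribution list cuts the burst from 200 emails to 40.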

Re: Slow NagiosXI VM

Posted: Wed Dec 28, 2011 10:12 am
by mguthrie
Thanks for the tips jtata, good stuff!