Performance issues

Post by darkweaver87 »

Hi all,

I have performance issues on my distributed monitoring architecture.

Here is my architecture (a fairly classic one, but a diagram makes it clearer):
Nagios poller --> NDOMOD <---> SSH tunnel <---> NDO2DB --> Centreon with MySQL

On all my Nagios servers, the setup is:
- OS: Debian Lenny, 64-bit, kernel 2.6.26
- Nagios version: 3.2.1
- NDO: 1.4b7

My hardware is a Dell PowerEdge R610 with:
- 12 GB of memory (800 MHz)
- RAID 10 with 4 SAS disks
- processor: 2 x Intel(R) Xeon(R) CPU X5560 @ 2.80GHz (4 cores, multi-threaded)

I have performance issues on the biggest Nagios poller.
There are 172 hosts with 1625 services.
The machine is almost idle, and Nagios appears to be idle as well:

top - 10:07:50 up 10 days, 22:17,  2 users,  load average: 0.06, 0.11, 0.09
Tasks: 206 total,   1 running, 186 sleeping,   0 stopped,  19 zombie
Cpu(s):  0.1%us,  0.2%sy,  0.7%ni, 99.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  12326768k total,  2033152k used, 10293616k free,   119816k buffers
Swap:  3997688k total,        0k used,  3997688k free,   416784k cached

I searched for days and followed the recommendations on the Nagios performance tuning page (except removing Perl plugins, because I need some of them), but it did not solve the problem.
My latency dropped from nearly 2000 seconds to 300, but no further.

I set max_concurrent_checks to 0 (no limit), but I only see spikes of up to 80 processes.
It seems I have scheduling issues.

Here is my Nagios config:

cfg_file=/etc/nagios3/hostTemplates.cfg
cfg_file=/etc/nagios3/hosts.cfg
cfg_file=/etc/nagios3/serviceTemplates.cfg
cfg_file=/etc/nagios3/services.cfg
cfg_file=/etc/nagios3/misccommands.cfg
cfg_file=/etc/nagios3/checkcommands.cfg
cfg_file=/etc/nagios3/contactgroups.cfg
cfg_file=/etc/nagios3/contacts.cfg
cfg_file=/etc/nagios3/hostgroups.cfg
cfg_file=/etc/nagios3/servicegroups.cfg
cfg_file=/etc/nagios3/timeperiods.cfg
cfg_file=/etc/nagios3/escalations.cfg
cfg_file=/etc/nagios3/dependencies.cfg
resource_file=/etc/nagios3//resource.cfg
log_file=/var/log/nagios3/nagios.log
object_cache_file=/var/cache/nagios3/objects.cache
temp_file=/var/cache/nagios3/nagios.tmp
status_file=/var/cache/nagios3/status.dat
p1_file=/usr/lib/nagios3/p1.pl
status_update_interval=15
nagios_user=nagios
nagios_group=nagios
enable_notifications=1
execute_service_checks=1
accept_passive_service_checks=1
enable_event_handlers=1
log_rotation_method=d
log_archive_path=/var/log/nagios3/archives/
check_external_commands=1
command_check_interval=1s
command_file=/var/lib/nagios3/rw/nagios.cmd
lock_file=/var/run/nagios3/nagios3.pid
retain_state_information=1
state_retention_file=/var/lib/nagios3/retention.dat
retention_update_interval=60
use_retained_program_state=1
use_retained_scheduling_info=1
use_syslog=0
log_notifications=1
log_service_retries=1
log_host_retries=1
log_event_handlers=1
log_initial_states=1
log_external_commands=1
sleep_time=0.1
service_inter_check_delay_method=s
host_inter_check_delay_method=s
service_interleave_factor=s
max_concurrent_checks=0
max_service_check_spread=1
check_result_reaper_frequency=2
interval_length=60
use_agressive_host_checking=0
enable_flap_detection=0
low_service_flap_threshold=25.0
high_service_flap_threshold=50.0
low_host_flap_threshold=25.0
high_host_flap_threshold=50.0
soft_state_dependencies=0
service_check_timeout=60
host_check_timeout=10
event_handler_timeout=30
notification_timeout=30
ocsp_timeout=5
ochp_timeout=5
perfdata_timeout=5
obsess_over_services=0
process_performance_data=1
service_perfdata_command=process-service-perfdata
host_perfdata_file_mode=2
service_perfdata_file_mode=2
check_for_orphaned_services=0
check_service_freshness=1
date_format=euro
illegal_object_name_chars=~!$%^&*"|'<>?,()=
illegal_macro_output_chars=`~$^&"|'<>
admin_email=admin
admin_pager=admin@localhost
broker_module=/usr/lib/ndoutils/ndomod-mysql-3x.o config_file=/etc/nagios3/ndomod.cfg
event_broker_options=-1
enable_predictive_host_dependency_checks=1
enable_predictive_service_dependency_checks=1
cached_host_check_horizon=120
cached_service_check_horizon=120
use_large_installation_tweaks=1
free_child_process_memory=0
child_processes_fork_twice=0
enable_environment_macros=0
enable_embedded_perl=1
use_embedded_perl_implicitly=1
debug_level=-1
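
When comparing a long main config like the one above against the tuning recommendations, it can help to pull out only the scheduling-related directives. The following is a minimal Python sketch, not part of Nagios: `parse_nagios_cfg` and `SCHEDULING_KEYS` are hypothetical names, the sample text is a shortened copy of the config above, and the choice of which directives count as "scheduling-related" is an assumption.

```python
# Hypothetical helper: parse nagios.cfg-style "key=value" lines into a dict.
# Directives that can appear several times are collected into lists.

REPEATABLE = {"cfg_file", "cfg_dir", "broker_module"}

# Assumed selection of directives worth reviewing for latency problems.
SCHEDULING_KEYS = [
    "sleep_time",
    "max_concurrent_checks",
    "max_service_check_spread",
    "check_result_reaper_frequency",
    "use_large_installation_tweaks",
]


def parse_nagios_cfg(text):
    """Return a dict of directives; repeatable ones map to a list of values."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Split on the first "=" only, since values may contain "=" themselves.
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key=value line
        key, value = key.strip(), value.strip()
        if key in REPEATABLE:
            settings.setdefault(key, []).append(value)
        else:
            settings[key] = value
    return settings


# Shortened copy of the config posted above.
SAMPLE = """\
cfg_file=/etc/nagios3/hosts.cfg
cfg_file=/etc/nagios3/services.cfg
sleep_time=0.1
max_concurrent_checks=0
max_service_check_spread=1
check_result_reaper_frequency=2
use_large_installation_tweaks=1
broker_module=/usr/lib/ndoutils/ndomod-mysql-3x.o config_file=/etc/nagios3/ndomod.cfg
"""

cfg = parse_nagios_cfg(SAMPLE)
for key in SCHEDULING_KEYS:
    print(f"{key} = {cfg.get(key, '<not set>')}")
```

To run it against the real file, read `/etc/nagios3/nagios.cfg` and pass its contents to `parse_nagios_cfg` instead of `SAMPLE`.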

Here is what nagiostats reports:

# nagios3stats -c /etc/nagios3/nagios.cfg 

Nagios Stats 3.2.1
Copyright (c) 2003-2008 Ethan Galstad (www.nagios.org)
Last Modified: 03-09-2010
License: GPL

CURRENT STATUS DATA
------------------------------------------------------
Status File:                            /var/cache/nagios3/status.dat
Status File Age:                        0d 0h 0m 17s
Status File Version:                    3.2.1

Program Running Time:                   0d 0h 31m 49s
Nagios PID:                             6583
Used/High/Total Command Buffers:        0 / 0 / 4096

Total Services:                         1625
Services Checked:                       1625
Services Scheduled:                     1625
Services Actively Checked:              1625
Services Passively Checked:             0
Total Service State Change:             0.000 / 27.300 / 0.026 %
Active Service Latency:                 245.374 / 1378.759 / 704.176 sec
Active Service Execution Time:          0.004 / 181.324 / 0.995 sec
Active Service State Change:            0.000 / 27.300 / 0.026 %
Active Services Last 1/5/15/60 min:     0 / 0 / 676 / 1625
Passive Service Latency:                0.000 / 0.000 / 0.000 sec
Passive Service State Change:           0.000 / 0.000 / 0.000 %
Passive Services Last 1/5/15/60 min:    0 / 0 / 0 / 0
Services Ok/Warn/Unk/Crit:              1493 / 19 / 23 / 90
Services Flapping:                      0
Services In Downtime:                   0

Total Hosts:                            172
Hosts Checked:                          172
Hosts Scheduled:                        172
Hosts Actively Checked:                 172
Host Passively Checked:                 0
Total Host State Change:                0.000 / 12.110 / 0.413 %
Active Host Latency:                    0.000 / 1378.761 / 705.959 sec
Active Host Execution Time:             0.012 / 0.520 / 0.031 sec
Active Host State Change:               0.000 / 12.110 / 0.413 %
Active Hosts Last 1/5/15/60 min:        0 / 0 / 72 / 172
Passive Host Latency:                   0.000 / 0.000 / 0.000 sec
Passive Host State Change:              0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min:       0 / 0 / 0 / 0
Hosts Up/Down/Unreach:                  170 / 2 / 0
Hosts Flapping:                         0
Hosts In Downtime:                      0

Active Host Checks Last 1/5/15 min:     0 / 4 / 132
   Scheduled:                           0 / 0 / 79
   On-demand:                           0 / 4 / 53
   Parallel:                            0 / 0 / 80
   Serial:                              0 / 0 / 0
   Cached:                              0 / 4 / 52
Passive Host Checks Last 1/5/15 min:    0 / 0 / 0
Active Service Checks Last 1/5/15 min:  0 / 0 / 746
   Scheduled:                           0 / 0 / 746
   On-demand:                           0 / 0 / 0
   Cached:                              0 / 0 / 0
Passive Service Checks Last 1/5/15 min: 0 / 0 / 0

External Commands Last 1/5/15 min:      0 / 0 / 0
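
As a rough sanity check on these statistics, the required check rate can be compared with the observed one. This is a minimal Python sketch using only figures quoted in this post. It assumes every service is checked every 300 seconds and every host every 60 seconds (the average intervals shown in the scheduling output further down), ignores retries, and treats the 15-minute counter as a plain average, so the result is only an estimate.

```python
# Back-of-the-envelope check rates, using only figures quoted in this post.
# Assumption: the average check intervals apply to every host and service.

TOTAL_SERVICES = 1625
TOTAL_HOSTS = 172
SERVICE_INTERVAL_S = 300.0          # "Average service check interval"
HOST_INTERVAL_S = 60.0              # "Average host check interval"
SERVICE_CHECKS_LAST_15_MIN = 746    # "Active Service Checks Last 1/5/15 min"
AVG_SERVICE_EXEC_S = 0.995          # average "Active Service Execution Time"


def checks_per_second(count, period_s):
    """Average checks per second for `count` checks spread over `period_s` seconds."""
    return count / period_s


required_services = checks_per_second(TOTAL_SERVICES, SERVICE_INTERVAL_S)
required_hosts = checks_per_second(TOTAL_HOSTS, HOST_INTERVAL_S)
observed_services = checks_per_second(SERVICE_CHECKS_LAST_15_MIN, 15 * 60)

# Little's law: average concurrency = arrival rate x average time per check.
needed_concurrency = required_services * AVG_SERVICE_EXEC_S

print(f"required service checks/s: {required_services:.2f}")
print(f"required host checks/s:    {required_hosts:.2f}")
print(f"observed service checks/s: {observed_services:.2f}")
print(f"average concurrent service checks needed: {needed_concurrency:.1f}")
```

With these inputs the sketch gives roughly 5.4 service checks per second required against roughly 0.8 per second observed, and an average of only about 5 concurrent service checks needed. If the assumptions hold, that is far below the 80-process peak, which is consistent with a scheduling problem rather than a limit on concurrent processes.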

# nagios3 -ux -s /etc/nagios3/nagios.cfg 

Nagios Core 3.2.1
Copyright (c) 2009-2010 Nagios Core Development Team and Community Contributors
Copyright (c) 1999-2009 Ethan Galstad
Last Modified: 03-09-2010
License: GPL

Website: http://www.nagios.org
Timing information on object configuration processing is listed
below.  You can use this information to see if precaching your
object configuration would be useful.

Object Config Source: Pre-cached config file

OBJECT CONFIG PROCESSING TIMES      (* = Potential for precache savings with -u option)
----------------------------------
Read:                 0.079060 sec
Resolve:              0.000000 sec  *
Recomb Contactgroups: 0.000000 sec  *
Recomb Hostgroups:    0.000000 sec  *
Dup Services:         0.000000 sec  *
Recomb Servicegroups: 0.000000 sec  *
Duplicate:            0.000000 sec  *
Inherit:              0.000000 sec  *
Recomb Contacts:      0.000000 sec  *
Sort:                 0.000000 sec  *
Register:             0.010980 sec
Free:                 0.000526 sec
                      ============
TOTAL:                0.090566 sec  


RETENTION DATA TIMES
----------------------------------
Read and Process:     0.115329 sec
                      ============
TOTAL:                0.115329 sec


Timing information on configuration verification is listed below.

CONFIG VERIFICATION TIMES          (* = Potential for speedup with -x option)
----------------------------------
Object Relationships: 0.006147 sec
Circular Paths:       0.000000 sec  *
Misc:                 0.000474 sec
                      ============
TOTAL:                0.006621 sec  * = 0.000000 sec (0.0%) estimated savings


EVENT SCHEDULING TIMES
-------------------------------------
Get service info:        0.002322 sec
Get host info info:      0.000250 sec
Get service params:      0.000005 sec
Schedule service times:  0.006411 sec
Schedule service events: 0.003627 sec
Get host params:         0.000001 sec
Schedule host times:     0.000675 sec
Schedule host events:    0.000830 sec
                         ============
TOTAL:                   0.014121 sec


Projected scheduling information for host and service checks
is listed below.  This information assumes that you are going
to start running Nagios with your current config files.

HOST SCHEDULING INFORMATION
---------------------------
Total hosts:                     172
Total scheduled hosts:           172
Host inter-check delay method:   SMART
Average host check interval:     60.00 sec
Host inter-check delay:          0.35 sec
Max host check spread:           30 min
First scheduled check:           Fri Nov  5 09:57:45 2010
Last scheduled check:            Fri Nov  5 09:58:44 2010


SERVICE SCHEDULING INFORMATION
-------------------------------
Total services:                     1625
Total scheduled services:           1625
Service inter-check delay method:   SMART
Average service check interval:     300.00 sec
Inter-check delay:                  0.04 sec
Interleave factor method:           SMART
Average services per host:          9.45
Service interleave factor:          10
Max service check spread:           1 min
First scheduled check:              Fri Nov  5 09:57:51 2010
Last scheduled check:               Fri Nov  5 09:58:51 2010


CHECK PROCESSING INFORMATION
----------------------------
Check result reaper interval:       2 sec
Max concurrent service checks:      Unlimited


PERFORMANCE SUGGESTIONS
-----------------------
I have no suggestions - things look okay.
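
The inter-check delays in the scheduling output above can be reproduced with simple arithmetic. The sketch below is an assumption about how the SMART method arrives at its value (spread the checks evenly over the average check interval, but never over more than the maximum check spread); `smart_inter_check_delay` is a hypothetical name, not a Nagios function. It does match the reported 0.04 and 0.35 seconds.

```python
def smart_inter_check_delay(avg_interval_s, scheduled, max_spread_min):
    """Assumed SMART rule: even spacing over the average check interval,
    capped so that all checks fit inside the maximum check spread."""
    even_delay = avg_interval_s / scheduled
    capped_delay = (max_spread_min * 60.0) / scheduled
    return min(even_delay, capped_delay)


# Services: 1625 scheduled, 300 s average interval, max spread 1 min.
service_delay = smart_inter_check_delay(300.0, 1625, 1)

# Hosts: 172 scheduled, 60 s average interval, max spread 30 min.
host_delay = smart_inter_check_delay(60.0, 172, 30)

print(f"service inter-check delay: {service_delay:.2f} sec")
print(f"host inter-check delay:    {host_delay:.2f} sec")
```

Under this assumption, the service delay is set by the 1-minute spread (60 / 1625), not by the 300-second interval (300 / 1625, about 0.18 seconds), while the host delay comes straight from the interval (60 / 172).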

If anyone has any ideas, I would appreciate it.

Thanks.

Rémi