Suspected Memory Leak
Posted: Tue Nov 27, 2012 5:04 am
Hello,
(Host spec at the bottom)
I'm supporting a production Nagios instance (v3.2.1, which I appreciate is an old version), and I'm starting to see what looks like a memory leak when I increase the maximum concurrent checks setting (max_concurrent_checks) in nagios.cfg. For 18 months our instance has happily supported ~3000 service checks across ~400 hosts while running at 200 max concurrent checks. When I started to see check latency increase to 15 seconds and service checks being nudged in the log, I increased the maximum to 400, and since then our instance has crashed every few weeks.
Looking at the memory usage for the Nagios host, I can see it growing by about 3GB every 2 weeks (it seems to crash at around 5GB), and this has been happening ever since the max checks setting was increased. For the moment I have decreased the check limit back to 200 and it's now stable.
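One way to tell whether the growth sits in the Nagios daemon itself or in a helper process would be to sample per-process resident memory at intervals (from cron, say) and compare the numbers over a few days. Below is a minimal sketch, assuming a Linux /proc filesystem. The function names are made up for illustration, and the process names it reports are simply whatever runs on the host.

```python
import os
import re
import time


def parse_rss_kb(status_text):
    """Return VmRSS in kB from the text of /proc/<pid>/status, or None.

    Kernel threads have no VmRSS line, so None is a normal result.
    """
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def parse_name(status_text):
    """Return the process name from the text of /proc/<pid>/status, or None."""
    match = re.search(r"^Name:\s+(.+)$", status_text, re.MULTILINE)
    return match.group(1).strip() if match else None


def rss_by_name(proc_root="/proc"):
    """Sum resident memory (kB) per process name across all numeric PID dirs."""
    totals = {}
    if not os.path.isdir(proc_root):
        return totals  # not a Linux-style /proc; nothing to report
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "status")) as fh:
                text = fh.read()
        except OSError:
            continue  # process exited between listdir() and open()
        name, rss = parse_name(text), parse_rss_kb(text)
        if name and rss is not None:
            totals[name] = totals.get(name, 0) + rss
    return totals


if __name__ == "__main__":
    # Print a timestamp and the five largest consumers by summed RSS.
    top = sorted(rss_by_name().items(), key=lambda kv: kv[1], reverse=True)[:5]
    print(time.strftime("%Y-%m-%d %H:%M:%S"), top)
```

Appending this output to a file every few minutes would show which process name accounts for the increase. Summing RSS across forked children overstates shared pages, so treat the figures as a trend rather than an exact total.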
I've looked on the Nagios docs site (http://nagios.sourceforge.net/docs/3_0/tuning.html) and I can't see any advice on the relationship between max concurrent checks and memory usage, so I'm posting here for advice or suggestions. I suspect that other items in the Nagios configuration may also need amending, but I'm reluctant to change them as this is a production instance and I'm not 100% certain of all the configuration options.
I'm planning to have this instance cloned to a VM so that I can test the upgrade process, as we're overdue, and it may well be that an upgrade fixes it. However, I thought I'd check here as well to cover all bases. I should also mention that we only get a load of 3 and there's plenty of CPU resource.
Here's my current nagios.cfg config:
cfg_file=/etc/nagios3/hostTemplates.cfg
cfg_file=/etc/nagios3/hosts.cfg
cfg_file=/etc/nagios3/serviceTemplates.cfg
cfg_file=/etc/nagios3/services.cfg
cfg_file=/etc/nagios3/misccommands.cfg
cfg_file=/etc/nagios3/checkcommands.cfg
cfg_file=/etc/nagios3/contactgroups.cfg
cfg_file=/etc/nagios3/contacts.cfg
cfg_file=/etc/nagios3/hostgroups.cfg
cfg_file=/etc/nagios3/servicegroups.cfg
cfg_file=/etc/nagios3/timeperiods.cfg
cfg_file=/etc/nagios3/escalations.cfg
cfg_file=/etc/nagios3/dependencies.cfg
cfg_file=/etc/nagios3/meta_commands.cfg
cfg_file=/etc/nagios3/meta_contact.cfg
cfg_file=/etc/nagios3/meta_contactgroup.cfg
cfg_file=/etc/nagios3/meta_dependencies.cfg
cfg_file=/etc/nagios3/meta_escalations.cfg
cfg_file=/etc/nagios3/meta_host.cfg
cfg_file=/etc/nagios3/meta_hostgroup.cfg
cfg_file=/etc/nagios3/meta_services.cfg
cfg_file=/etc/nagios3/meta_timeperiod.cfg
resource_file=/etc/nagios3/resource.cfg
log_file=/var/log/nagios3/nagios.log
status_file=/var/cache/nagios3/status.dat
object_cache_file=/var/cache/nagios3/objects.cache
temp_file=/var/cache/nagios3/nagios.tmp
p1_file=/usr/lib/nagios3/p1.pl
nagios_user=nagios
nagios_group=nagios
enable_notifications=1
execute_service_checks=1
accept_passive_service_checks=1
enable_event_handlers=1
log_rotation_method=d
log_archive_path=/var/log/nagios3/archives/
check_external_commands=1
command_check_interval=1s
command_file=/var/lib/nagios3/rw/nagios.cmd
lock_file=/var/run/nagios3/nagios3.pid
retain_state_information=1
state_retention_file=/var/lib/nagios3/retention.dat
retention_update_interval=60
use_retained_program_state=1
use_retained_scheduling_info=1
use_syslog=0
log_notifications=1
log_service_retries=1
log_host_retries=1
log_event_handlers=1
log_initial_states=1
log_external_commands=1
sleep_time=1
service_inter_check_delay_method=s
service_interleave_factor=s
max_concurrent_checks=200
max_service_check_spread=5
check_result_reaper_frequency=5
interval_length=60
enable_flap_detection=1
low_service_flap_threshold=25.0
high_service_flap_threshold=50.0
low_host_flap_threshold=25.0
high_host_flap_threshold=50.0
soft_state_dependencies=0
service_check_timeout=60
host_check_timeout=10
event_handler_timeout=30
notification_timeout=30
ocsp_timeout=5
ochp_timeout=5
perfdata_timeout=5
obsess_over_services=0
process_performance_data=1
service_perfdata_command=process-service-perfdata
service_perfdata_file_mode=2
check_for_orphaned_services=0
check_service_freshness=1
date_format=euro
illegal_object_name_chars=~!$%^"&*|'<>?,()=
illegal_macro_output_chars=`~$^"&|'<>
admin_email=admin
admin_pager=admin@localhost
broker_module=/usr/lib/ndoutils/ndomod-mysql-3x.o config_file=/etc/nagios3/ndomod.cfg
event_broker_options=-1
debug_level=0
debug_verbosity=2
use_aggressive_host_checking=0
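For comparing this configuration between the production host and a test clone, it may help to pull out just a handful of directives rather than diffing the whole file. Here is a minimal sketch of a parser for the key=value format above. The function names and the shortlist in TUNING_KEYS are assumptions made for illustration, not an official list of memory-related settings.

```python
# Directives picked out for illustration; this shortlist is an assumption.
TUNING_KEYS = (
    "max_concurrent_checks",
    "check_result_reaper_frequency",
    "sleep_time",
    "service_check_timeout",
    "check_for_orphaned_services",
    "broker_module",
)


def parse_main_cfg(text):
    """Parse key=value lines into {key: [values...]}.

    Directives such as cfg_file repeat, so every key maps to a list.
    Only the first '=' splits the line, because values like the
    broker_module arguments can contain '=' themselves.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key=value line
        settings.setdefault(key.strip(), []).append(value.strip())
    return settings


def tuning_summary(settings):
    """Return the last value seen for each shortlisted directive."""
    return {k: settings[k][-1] for k in TUNING_KEYS if k in settings}


# A few lines copied from the configuration above, used as sample input.
SAMPLE = """\
cfg_file=/etc/nagios3/hosts.cfg
cfg_file=/etc/nagios3/services.cfg
max_concurrent_checks=200
check_result_reaper_frequency=5
broker_module=/usr/lib/ndoutils/ndomod-mysql-3x.o config_file=/etc/nagios3/ndomod.cfg
"""

if __name__ == "__main__":
    for key, value in tuning_summary(parse_main_cfg(SAMPLE)).items():
        print(f"{key} = {value}")
```

Running the same summary against the real file on both machines would make it easy to confirm that only max_concurrent_checks differs between a stable and an unstable run.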
Here's my host:
OS: Ubuntu Server 10.10
Dell PowerEdge R610
CPU: 2x INTEL XEON E5620 PROCESSOR 2.40GHZ Quadcore
Memory: 8GB 1066 MHZ
HDD: 2x 146GB, SAS 6GBPS, 15K Mirror RAID
Any advice is welcome
Chris