Latencies increase drastically after 33 hours of uptime

Support forum for Nagios Core, Nagios Plugins, NCPA, NRPE, NSCA, NDOUtils and more. Engage with the community of users including those using the open source solutions.
dnelson
Posts: 14
Joined: Mon Dec 23, 2013 11:07 am

Latencies increase drastically after 33 hours of uptime

Post by dnelson »

Hi All,

I'm refreshing our Nagios environment (hardware and OS upgrade) and, during my testing, I happened upon an event that occurs after roughly 33 hours of daemon uptime, in which both the host and service latencies begin to increase at a rate of approximately 140 seconds per day (see graphs). Although not shown on the graphs, the host latencies follow the same trends as the service latencies.
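For scale, that drift rate works out to a bit over 1.6 milliseconds of added latency per second of uptime (simple arithmetic, just restating the figure from the post):

```shell
# 140 seconds of added latency per day, expressed per second of uptime
awk 'BEGIN { printf "%.6f\n", 140 / 86400 }'   # -> 0.001620
```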

[Image: host and service latency graphs]

I'm out of ideas and could use some help on this topic if anybody has any pointers they would care to offer. Hardware, OS, and nagios.cfg details are below.

Thank you,
David Nelson

Code:

OS: RHEL6 x86_64
Kernels: 2.6.32-279.14.1.el6.x86_64 and 2.6.39-200.24.1.el6uek.x86_64
Hardware: PowerEdge R720 (2 X Six-Core Hyper-Threaded Intel(R) Xeon(R) E5-2640 0 @ 2.50GHz) w/ 16 GB RAM
Filesystems: ext4 and tmpfs
Nagios Cores: 3.3.1, 3.4.4, 3.5.1; all compiled from source
Total Services: 15222 (this will increase to ~18,000 at go-live)
Total Hosts: 908 (this will increase to ~1200 at go-live)
I recently enabled debugging with debug_level=31 (Functions, Configuration, Process information, Scheduled events, and Host/service checks). Let me know if that information would be of interest and I'll post portions of it here.
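As a sanity check on that setting: debug_level is a bitmask, and the five categories listed (per the option values documented in the stock nagios.cfg: 1 = function enter/exit, 2 = configuration, 4 = process information, 8 = scheduled events, 16 = host/service checks) sum to exactly 31:

```shell
# debug_level bitmask: Functions + Configuration + Process info
# + Scheduled events + Host/service checks
echo $((1 + 2 + 4 + 8 + 16))   # -> 31
```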

Code:

Current nagios.cfg is:
accept_passive_host_checks=1
accept_passive_service_checks=1
additional_freshness_latency=15
auto_reschedule_checks=1
auto_rescheduling_interval=30
auto_rescheduling_window=180
bare_update_check=1
cached_host_check_horizon=30
cached_service_check_horizon=30
cfg_dir=/opt/apps/nagios/etc/groups/unix/etc
cfg_dir=/opt/apps/nagios/etc/groups/unix/hosts
cfg_dir=/opt/apps/nagios/etc/groups/web/etc
cfg_dir=/opt/apps/nagios/etc/objects
check_external_commands=1
check_for_orphaned_hosts=1
check_for_orphaned_services=1
check_for_updates=1
check_host_freshness=1
check_result_path=/dev/shm
check_result_reaper_frequency=5
check_service_freshness=1
child_processes_fork_twice=1
command_check_interval=-1
command_file=/opt/apps/nagios-private/var/rw/nagios.cmd
daemon_dumps_core=0
date_format=us
debug_file=/opt/apps/nagios/var/nagios.debug
debug_level=31
debug_verbosity=2
enable_embedded_perl=1
enable_environment_macros=1
enable_event_handlers=0
enable_flap_detection=1
enable_notifications=0
enable_predictive_host_dependency_checks=1
enable_predictive_service_dependency_checks=1
event_broker_options=0
event_handler_timeout=30
execute_host_checks=1
execute_service_checks=1
external_command_buffer_slots=4096
high_host_flap_threshold=20.0
high_service_flap_threshold=20.0
host_check_timeout=30
host_freshness_check_interval=60
host_inter_check_delay_method=s
illegal_macro_output_chars=`~$&|'"<>
illegal_object_name_chars=`~!$%^&*|'"<>?,()=
interval_length=60
lock_file=/dev/shm/nagios.lock
log_archive_path=/opt/apps/nagios-private/var/archives
log_event_handlers=1
log_external_commands=1
log_file=/opt/apps/nagios-private/var/nagios.log
log_host_retries=1
log_initial_states=0
log_notifications=1
log_passive_checks=1
log_rotation_method=d
log_service_retries=1
low_host_flap_threshold=5.0
low_service_flap_threshold=15.0
max_check_result_file_age=3600
max_check_result_reaper_time=30
max_concurrent_checks=0
max_debug_file_size=1000000000
max_host_check_spread=30
max_service_check_spread=30
nagios_group=nagios
nagios_user=nagios
notification_timeout=30
object_cache_file=/dev/shm/objects.cache
obsess_over_hosts=0
obsess_over_services=0
ocsp_timeout=5
p1_file=/opt/apps/nagios/bin/p1.pl
passive_host_checks_are_soft=0
perfdata_timeout=5
precached_object_file=/dev/shm/objects.precache
process_performance_data=0
resource_file=/opt/apps/nagios/etc/resource.cfg
retain_state_information=1
retained_contact_host_attribute_mask=0
retained_contact_service_attribute_mask=0
retained_host_attribute_mask=0
retained_process_host_attribute_mask=0
retained_process_service_attribute_mask=0
retained_service_attribute_mask=0
retention_update_interval=5
service_check_timeout=30
service_freshness_check_interval=60
service_inter_check_delay_method=s
service_interleave_factor=s
sleep_time=0.25
soft_state_dependencies=0
state_retention_file=/opt/apps/nagios/var/retention.dat
status_file=/dev/shm/status.dat
status_update_interval=10
temp_file=/dev/shm/nagios.tmp
temp_path=/opt/apps/nagios/tmp
translate_passive_host_checks=0
use_aggressive_host_checking=0
use_embedded_perl_implicitly=1
use_large_installation_tweaks=1
use_regexp_matching=0
use_retained_program_state=1
use_retained_scheduling_info=1
use_syslog=0
use_true_regexp_matching=0
Last edited by dnelson on Fri Dec 27, 2013 10:26 am, edited 1 time in total.
sreinhardt
-fno-stack-protector
Posts: 4366
Joined: Mon Nov 19, 2012 12:10 pm

Re: Latencies increase drastically after 33 hours of uptime

Post by sreinhardt »

Interesting issue. I know that we have clients with similarly sized installations that are not seeing this same issue, and your system certainly seems to have enough power to handle the load you are looking to run. A couple of questions to get started:

Do you have plugins or other handlers that need environment macros? This is an extremely resource-hungry option to have enabled in large installs such as yours. If you do not use them, I would highly suggest disabling it.
Could you move or remove the precache file? This is just to be 100% sure that it is populated from your current configs and not holding old settings that may be inaccurate.
Could you post images of the load, memory, and on-demand check statistics from the same machine and time periods, if possible?
Nagios-Plugins maintainer exclusively, unless you have other C language bugs with open-source nagios projects, then I am happy to help! Please pm or use other communication to alert me to issues as I no longer track the forum.
dnelson
Posts: 14
Joined: Mon Dec 23, 2013 11:07 am

Re: Latencies increase drastically after 33 hours of uptime

Post by dnelson »

Hi Spenser,
sreinhardt wrote:Do you have plugins or other handlers that need environment macros? This is an extremely resource hungry option to have on in large installs such as yours. If you do not use it I would highly suggest disabling it.
I am going to go out on a limb and say 'yes' because it looks like the SNMP4Nagios plugins use them and we do use these plugins. Otherwise, I'm not seeing any other usage of environment macros.
sreinhardt wrote:Could you move or remove the precache file? This is just to be 100% sure that it is populating from your current configs and not holding old settings that may be inaccurate.
The precached object file (/dev/shm/objects.precache) didn't exist, and the daemon is not started with the '-p' option. I didn't realize this until after I had stopped the daemon.
sreinhardt wrote:Could you post images of the load, memory, and on-demand check statistics from the same machine and time periods, if possible?
Here's the load graph for the past week. The values are normalized by the number of CPUs as reported by /proc/cpuinfo; in this case, 24. The maximum fifteen-minute load average of 11.47% thus translates to 2.69. You can see that last Friday, just after the noon hour, I had restarted the daemon. You can also just see the restart earlier today on the right side. Looking at the graph, this problem kinda smells like a kernel issue because sys time gradually increases and then levels off. I'm going to create/post a couple of graphs to see if there might be a correlation.

[Image: load graph for the past week]

Here's the memory graph for the past week. I suspect that the saw-toothed application memory usage beginning on Friday is caused by enabling 'debug_level=31'. The debug file size was capped at 1 GB, which nearly corresponds to the height of the saw teeth.

[Image: memory graph for the past week]

I don't have historical records of the on-demand check statistics, but since restarting Nagios ~1.5 hours ago I do have the stats below. Is this good enough?

Code:

Status File:                            /dev/shm/status.dat
Status File Age:                        0d 0h 0m 6s
Status File Version:                    3.3.1

Program Running Time:                   0d 1h 25m 27s
Nagios PID:                             26363
Used/High/Total Command Buffers:        0 / 0 / 4096

Total Services:                         15222
Services Checked:                       14821
Services Scheduled:                     14821
Services Actively Checked:              15222
Services Passively Checked:             0
Total Service State Change:             0.000 / 34.010 / 0.006 %
Active Service Latency:                 0.000 / 455.357 / 0.574 sec
Active Service Execution Time:          0.000 / 11.873 / 0.093 sec
Active Service State Change:            0.000 / 34.010 / 0.006 %
Active Services Last 1/5/15/60 min:     1667 / 10000 / 14046 / 14813
Passive Service Latency:                0.000 / 0.000 / 0.000 sec
Passive Service State Change:           0.000 / 0.000 / 0.000 %
Passive Services Last 1/5/15/60 min:    0 / 0 / 0 / 0
Services Ok/Warn/Unk/Crit:              15189 / 7 / 0 / 26
Services Flapping:                      1
Services In Downtime:                   0

Total Hosts:                            908
Hosts Checked:                          908
Hosts Scheduled:                        908
Hosts Actively Checked:                 908
Host Passively Checked:                 0
Total Host State Change:                0.000 / 0.000 / 0.000 %
Active Host Latency:                    0.089 / 518.481 / 4.084 sec
Active Host Execution Time:             0.006 / 0.294 / 0.020 sec
Active Host State Change:               0.000 / 0.000 / 0.000 %
Active Hosts Last 1/5/15/60 min:        119 / 778 / 848 / 902
Passive Host Latency:                   0.000 / 0.000 / 0.000 sec
Passive Host State Change:              0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min:       0 / 0 / 0 / 0
Hosts Up/Down/Unreach:                  908 / 0 / 0
Hosts Flapping:                         0
Hosts In Downtime:                      0

Active Host Checks Last 1/5/15 min:     134 / 841 / 2570
   Scheduled:                           133 / 828 / 2514
   On-demand:                           1 / 13 / 56
   Parallel:                            133 / 828 / 2523
   Serial:                              0 / 0 / 0
   Cached:                              1 / 13 / 47
Passive Host Checks Last 1/5/15 min:    0 / 0 / 0
Active Service Checks Last 1/5/15 min:  1956 / 10389 / 30266
   Scheduled:                           1956 / 10389 / 30266
   On-demand:                           0 / 0 / 0
   Cached:                              0 / 0 / 0
Passive Service Checks Last 1/5/15 min: 0 / 0 / 0

External Commands Last 1/5/15 min:      0 / 0 / 0
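Not asserting this is how the stats above were gathered, but one low-effort way to keep a latency record going forward would be nagiostats' MRTG mode from cron (a sketch; the binary path, config path, and log path are assumptions based on the paths in this thread):

```shell
# crontab fragment (sketch): append the average active service check latency
# (in ms) once a minute, so latency growth can be correlated with other graphs
* * * * * /opt/apps/nagios/bin/nagiostats -c /opt/apps/nagios/etc/nagios.cfg -m -d AVGACTSVCLAT >> /opt/apps/nagios/var/latency.log
```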
Regards,
David Nelson
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: Latencies increase drastically after 33 hours of uptime

Post by slansing »

Another thing to check on: do you have a large number of files in /tmp with names like "check*"? Do you have a ballpark estimate of how many service checks run active SNMP GETs? SNMP checks are notorious load throttlers, and they would have an impact on latency as well. The fact that this occurs around 33 hours after a restart is a bit sticky... are you able to narrow down any system-level tasks (processes, etc.) at this time that are grabbing a lot of resources?
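One hedged way to get that file count (the path and "check*" pattern are from the post; the one-liner itself is just a sketch):

```shell
# Count leftover plugin temp files matching the pattern mentioned above.
# /tmp is the directory in question on the Nagios server.
dir=/tmp
count=$(find "$dir" -maxdepth 1 -name 'check*' 2>/dev/null | wc -l)
echo "found $count check* files in $dir"
```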
dnelson
Posts: 14
Joined: Mon Dec 23, 2013 11:07 am

Additional data and thoughts.

Post by dnelson »

Additional notes/thoughts:
- The current production environment runs on a Sun Fire X4150 (2 X Quad-Core Intel(R) Xeon(R) X5460 @ 3.16GHz) w/ 8 GB of RAM on RHEL5 x86_64. The daemon has been running since Dec 18, and the latencies look as follows.
[Image: latency graphs from the production server]

- Originally, this project was to design and implement an HA Nagios solution. To achieve that, I had configured Gluster and Linux-HA (Heartbeat). Since discovering the latency behavior, I've stripped those components out from under Nagios, but they still exist on the system. I am contemplating rebuilding the RHEL6 server to remove any trace of the HA components, dropping in the production configuration as-is (or as close to production as possible), and then observing the system for a couple of days.
- Using a very low-pass filter, there does seem to be a subtle correlation between kernel time and service latency. Pay no attention to the y-axis values, as the data has been scaled to get the graphs to display nicely. The other CPU times (I/O wait and user) are very level when compared to kernel time. I don't yet know what the kernel is doing that causes it to gradually rise, though. Restarting the Nagios daemon causes kernel time to reset. It is also interesting that the point where kernel time knees over (around sample 4950) is when service latency begins to climb.
[Image: scaled kernel time vs. service latency]
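For reference, an N-sample moving average is one simple way to implement the kind of very low-pass filter described above (a sketch; N and the sample values here are purely illustrative, not data from the graphs):

```shell
# Smooth a column of samples with an N-sample moving average; with a large N,
# this passes only the slow trend, which is what makes the correlation visible.
printf '1\n2\n3\n4\n5\n6\n7\n' | awk -v N=5 '
    { buf[NR % N] = $1 }
    NR >= N { s = 0; for (i = 0; i < N; i++) s += buf[i]; print s / N }'
```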

Regards,
David Nelson
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: Latencies increase drastically after 33 hours of uptime

Post by slansing »

Very interesting data sets. I hope your above proposal is not going to be too much work. Are you running anything else on the same kernel configuration as this Nagios server? You had mentioned HA; is there a backup server you could bring online to see if it has the same quirky latency?
dnelson
Posts: 14
Joined: Mon Dec 23, 2013 11:07 am

Re: Latencies increase drastically after 33 hours of uptime

Post by dnelson »

Hi slansing,
slansing wrote:Another thing to check on, do you have a large amount of files in /tmp under the guise of "check*"?
I've got one named 'check_ftp' whose timestamp is being updated regularly. There are remnants of temp files left behind by some /bin/sh check scripts, but only because they don't properly clean up after themselves. An 'lsof /tmp' doesn't show anything interesting.
slansing wrote:Do you have a ballpark estimate of how many service checks run active SNMP get's? SNMP checks are notorious load throttlers, and it would have an impact on latency as well. The fact that this occurs around 33 hours after a restart is a bit sticky... are you able to narrow down any system level tasks "processes, etc.." at this time that are grabbing a lot of resources?
It's kind of hard to give you a single value. Instead: over a 5-minute period (our standard check interval), there were approximately 13 outbound SNMP requests per second.
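For a rough sense of volume, that rate implies on the order of a few thousand SNMP requests per 5-minute check cycle (simple arithmetic from the figures above):

```shell
# ~13 SNMP requests/sec sustained over a 300-second (5-minute) check cycle
echo $((13 * 300))   # -> 3900 requests per cycle
```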

Keep the suggestions coming - something's gotta shake out eventually.

Regards,
David Nelson
dnelson
Posts: 14
Joined: Mon Dec 23, 2013 11:07 am

Re: Latencies increase drastically after 33 hours of uptime

Post by dnelson »

Hi slansing,
slansing wrote:Very interesting data sets, I hope your above proposal is not going to be too much work, are you running anything else on the same kernel configuration as this nagios server? You had mentioned HA, is there a backup server you could bring online to see if it has the same quirky latency?
We've had a fair number of RHEL6 servers in our env for some time now. Although none run Nagios (except for the ones that I'm working on), nothing this bizarre has been reported. I've decided that I'm going to rebuild the passive node of the HA cluster without installing any of the Gluster/HA stuff. In the end, I'll have a fairly generic/stock system, and the question of whether Gluster or HA is affecting things can be answered. That seems to be a logical direction to pursue. If all goes well, I should have something up and running by the end of the day. I'm not rebuilding the "badly behaving system" because, after all, something that is behaving badly is easier to diagnose than something that used to behave badly.

Regards,
David Nelson
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Latencies increase drastically after 33 hours of uptime

Post by abrist »

dnelson wrote:I'm not rebuilding the "badly behaving system" because, after all, something that is behaving badly is easier to diagnose than something that used to behave badly.
Well stated. Do you know if this kernel was a release candidate, or a full release? Which one are you running - 2.6.18 or 2.6.32? Was the kernel upgraded or any other major changes done before you started experiencing the issue?

Let us know how the newly provisioned server behaves.
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
dnelson
Posts: 14
Joined: Mon Dec 23, 2013 11:07 am

Re: Latencies increase drastically after 33 hours of uptime

Post by dnelson »

abrist wrote:Do you know if this kernel was a release candidate, or a full release? Which one are you running - 2.6.18 or 2.6.32? Was the kernel upgraded or any other major changes done before you started experiencing the issue?
Hi abrist,

AFAIK, these kernels are the full release. They are the ones released through the standard update channels, either through Red Hat or Oracle. Come to think of it, I don't really know how I would tell a "full release" from a "release candidate". The OEL/RHEL6 kernels that I have used are:

2.6.32-279.14.1.el6.x86_64
2.6.39-200.24.1.el6uek.x86_64
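One rough heuristic, not authoritative: release-candidate kernels typically carry an "rc" marker in the version string (e.g. 3.13-rc5), which neither of the strings above does:

```shell
# Heuristic check: flag kernel version strings that look like release candidates
for k in 2.6.32-279.14.1.el6.x86_64 2.6.39-200.24.1.el6uek.x86_64; do
    case "$k" in
        *rc*) echo "$k: looks like a release candidate" ;;
        *)    echo "$k: full release" ;;
    esac
done
```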

Things didn't go as smoothly as I had hoped on the rebuild. Nevertheless, it is up and running. I'll check in on it on Monday, Dec 30.

Regards,
David Nelson
Locked