Latencies increase drastically after 33 hours of uptime

Support forum for Nagios Core, Nagios Plugins, NCPA, NRPE, NSCA, NDOUtils and more. Engage with the community of users including those using the open source solutions.
dnelson
Posts: 14
Joined: Mon Dec 23, 2013 11:07 am

Latencies increase drastically after 33 hours of uptime

Post by dnelson »

Hi All,

I'm refreshing our Nagios environment (hardware and OS upgrade) and, during my testing, I happened upon an event that occurs after roughly 33 hours of daemon uptime, in which both the host and service latencies begin to increase at a rate of approximately 140 seconds per day (see graphs). Although not shown on the graphs, the host latencies follow the same trends as the service latencies.
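For scale, that drift rate works out to a bit over 1.6 milliseconds of added latency per second of uptime (simple arithmetic, just restating the figure from the post):

```shell
# 140 seconds of added latency per day, expressed per second of uptime
awk 'BEGIN { printf "%.6f\n", 140 / 86400 }'   # -> 0.001620
```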

[Image: host and service latency graphs]

I'm out of ideas and could use some help on this topic if anybody has any pointers they would care to offer. Hardware, OS, and nagios.cfg details are below.

Thank you,
David Nelson

Code:

OS: RHEL6 x86_64
Kernels: 2.6.32-279.14.1.el6.x86_64 and 2.6.39-200.24.1.el6uek.x86_64
Hardware: PowerEdge R720 (2 X Six-Core Hyper-Threaded Intel(R) Xeon(R) E5-2640 0 @ 2.50GHz) w/ 16 GB RAM
Filesystems: ext4 and tmpfs
Nagios Cores: 3.3.1, 3.4.4, 3.5.1; all compiled from source
Total Services: 15222 (this will increase to ~18,000 at go-live)
Total Hosts: 908 (this will increase to ~1200 at go-live)
I recently enabled debugging with debug_level=31 (Functions, Configuration, Process information, Scheduled events, and Host/service checks). Let me know if that information would be of interest and I'll post portions of it here.
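As a sanity check on that setting: debug_level is a bitmask, and the five categories listed (per the option values documented in the stock nagios.cfg: 1 = function enter/exit, 2 = configuration, 4 = process information, 8 = scheduled events, 16 = host/service checks) sum to exactly 31:

```shell
# debug_level bitmask: Functions + Configuration + Process info
# + Scheduled events + Host/service checks
echo $((1 + 2 + 4 + 8 + 16))   # -> 31
```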

Code:

Current nagios.cfg is:
accept_passive_host_checks=1
accept_passive_service_checks=1
additional_freshness_latency=15
auto_reschedule_checks=1
auto_rescheduling_interval=30
auto_rescheduling_window=180
bare_update_check=1
cached_host_check_horizon=30
cached_service_check_horizon=30
cfg_dir=/opt/apps/nagios/etc/groups/unix/etc
cfg_dir=/opt/apps/nagios/etc/groups/unix/hosts
cfg_dir=/opt/apps/nagios/etc/groups/web/etc
cfg_dir=/opt/apps/nagios/etc/objects
check_external_commands=1
check_for_orphaned_hosts=1
check_for_orphaned_services=1
check_for_updates=1
check_host_freshness=1
check_result_path=/dev/shm
check_result_reaper_frequency=5
check_service_freshness=1
child_processes_fork_twice=1
command_check_interval=-1
command_file=/opt/apps/nagios-private/var/rw/nagios.cmd
daemon_dumps_core=0
date_format=us
debug_file=/opt/apps/nagios/var/nagios.debug
debug_level=31
debug_verbosity=2
enable_embedded_perl=1
enable_environment_macros=1
enable_event_handlers=0
enable_flap_detection=1
enable_notifications=0
enable_predictive_host_dependency_checks=1
enable_predictive_service_dependency_checks=1
event_broker_options=0
event_handler_timeout=30
execute_host_checks=1
execute_service_checks=1
external_command_buffer_slots=4096
high_host_flap_threshold=20.0
high_service_flap_threshold=20.0
host_check_timeout=30
host_freshness_check_interval=60
host_inter_check_delay_method=s
illegal_macro_output_chars=`~$&|'"<>
illegal_object_name_chars=`~!$%^&*|'"<>?,()=
interval_length=60
lock_file=/dev/shm/nagios.lock
log_archive_path=/opt/apps/nagios-private/var/archives
log_event_handlers=1
log_external_commands=1
log_file=/opt/apps/nagios-private/var/nagios.log
log_host_retries=1
log_initial_states=0
log_notifications=1
log_passive_checks=1
log_rotation_method=d
log_service_retries=1
low_host_flap_threshold=5.0
low_service_flap_threshold=15.0
max_check_result_file_age=3600
max_check_result_reaper_time=30
max_concurrent_checks=0
max_debug_file_size=1000000000
max_host_check_spread=30
max_service_check_spread=30
nagios_group=nagios
nagios_user=nagios
notification_timeout=30
object_cache_file=/dev/shm/objects.cache
obsess_over_hosts=0
obsess_over_services=0
ocsp_timeout=5
p1_file=/opt/apps/nagios/bin/p1.pl
passive_host_checks_are_soft=0
perfdata_timeout=5
precached_object_file=/dev/shm/objects.precache
process_performance_data=0
resource_file=/opt/apps/nagios/etc/resource.cfg
retain_state_information=1
retained_contact_host_attribute_mask=0
retained_contact_service_attribute_mask=0
retained_host_attribute_mask=0
retained_process_host_attribute_mask=0
retained_process_service_attribute_mask=0
retained_service_attribute_mask=0
retention_update_interval=5
service_check_timeout=30
service_freshness_check_interval=60
service_inter_check_delay_method=s
service_interleave_factor=s
sleep_time=0.25
soft_state_dependencies=0
state_retention_file=/opt/apps/nagios/var/retention.dat
status_file=/dev/shm/status.dat
status_update_interval=10
temp_file=/dev/shm/nagios.tmp
temp_path=/opt/apps/nagios/tmp
translate_passive_host_checks=0
use_aggressive_host_checking=0
use_embedded_perl_implicitly=1
use_large_installation_tweaks=1
use_regexp_matching=0
use_retained_program_state=1
use_retained_scheduling_info=1
use_syslog=0
use_true_regexp_matching=0
Last edited by dnelson on Fri Dec 27, 2013 10:26 am, edited 1 time in total.
sreinhardt
-fno-stack-protector
Posts: 4366
Joined: Mon Nov 19, 2012 12:10 pm

Re: Latencies increase drastically after 33 hours of uptime

Post by sreinhardt »

Interesting issue. I know that we have clients with similarly sized installations that are not seeing this same issue, and your system certainly seems to have enough power to handle the load you are looking to run. A couple of questions to get started:

Do you have plugins or other handlers that need environment macros? This is an extremely resource-hungry option to have enabled in large installs such as yours. If you do not use them, I would highly suggest disabling it.
Could you move or remove the precache file? This is just to be 100% sure that it is populated from your current configs and not holding old settings that may be inaccurate.
Could you post images of the load, memory, and on-demand check statistics from the same machine and time periods, if possible?
Nagios-Plugins maintainer exclusively, unless you have other C language bugs with open-source nagios projects, then I am happy to help! Please pm or use other communication to alert me to issues as I no longer track the forum.
dnelson
Posts: 14
Joined: Mon Dec 23, 2013 11:07 am

Re: Latencies increase drastically after 33 hours of uptime

Post by dnelson »

Hi Spenser,
sreinhardt wrote:Do you have plugins or other handlers that need environment macros? This is an extremely resource hungry option to have on in large installs such as yours. If you do not use it I would highly suggest disabling it.
I am going to go out on a limb and say 'yes' because it looks like the SNMP4Nagios plugins use them and we do use these plugins. Otherwise, I'm not seeing any other usage of environment macros.
sreinhardt wrote:Could you move or remove the precache file? This is just to be 100% sure that it is populating from your current configs and not holding old settings that may be inaccurate.
The precached object file (/dev/shm/objects.precache) didn't exist, and the daemon is not started with the '-p' option. I didn't realize this until after I had stopped the daemon.
sreinhardt wrote:Could you post images of the load, memory, and on-demand check statistics from the same machine and time periods, if possible?
Here's the load graph for the past week. The values are normalized by the number of CPUs as reported by /proc/cpuinfo; in this case, 24. The maximum fifteen-minute load average of 11.47% thus translates to 2.69. You can see that last Friday, just after the noon hour, I had restarted the daemon. You can also just see the restart earlier today on the right side. Looking at the graph, this problem kinda smells like a kernel issue because sys time gradually increases and then levels off. I'm going to create/post a couple of graphs to see if there might be a correlation.

[Image: load graph for the past week]

Here's the memory graph for the past week. I suspect that the saw-toothed application memory usage beginning on Friday is caused by enabling 'debug_level=31'. The debug file size was capped at 1 GB, which nearly corresponds to the height of the saw teeth.

[Image: memory graph for the past week]

I don't have historical records of the on-demand check statistics, but since restarting Nagios ~1.5 hours ago I do have the stats below. Is this good enough?

Code:

Status File:                            /dev/shm/status.dat
Status File Age:                        0d 0h 0m 6s
Status File Version:                    3.3.1

Program Running Time:                   0d 1h 25m 27s
Nagios PID:                             26363
Used/High/Total Command Buffers:        0 / 0 / 4096

Total Services:                         15222
Services Checked:                       14821
Services Scheduled:                     14821
Services Actively Checked:              15222
Services Passively Checked:             0
Total Service State Change:             0.000 / 34.010 / 0.006 %
Active Service Latency:                 0.000 / 455.357 / 0.574 sec
Active Service Execution Time:          0.000 / 11.873 / 0.093 sec
Active Service State Change:            0.000 / 34.010 / 0.006 %
Active Services Last 1/5/15/60 min:     1667 / 10000 / 14046 / 14813
Passive Service Latency:                0.000 / 0.000 / 0.000 sec
Passive Service State Change:           0.000 / 0.000 / 0.000 %
Passive Services Last 1/5/15/60 min:    0 / 0 / 0 / 0
Services Ok/Warn/Unk/Crit:              15189 / 7 / 0 / 26
Services Flapping:                      1
Services In Downtime:                   0

Total Hosts:                            908
Hosts Checked:                          908
Hosts Scheduled:                        908
Hosts Actively Checked:                 908
Host Passively Checked:                 0
Total Host State Change:                0.000 / 0.000 / 0.000 %
Active Host Latency:                    0.089 / 518.481 / 4.084 sec
Active Host Execution Time:             0.006 / 0.294 / 0.020 sec
Active Host State Change:               0.000 / 0.000 / 0.000 %
Active Hosts Last 1/5/15/60 min:        119 / 778 / 848 / 902
Passive Host Latency:                   0.000 / 0.000 / 0.000 sec
Passive Host State Change:              0.000 / 0.000 / 0.000 %
Passive Hosts Last 1/5/15/60 min:       0 / 0 / 0 / 0
Hosts Up/Down/Unreach:                  908 / 0 / 0
Hosts Flapping:                         0
Hosts In Downtime:                      0

Active Host Checks Last 1/5/15 min:     134 / 841 / 2570
   Scheduled:                           133 / 828 / 2514
   On-demand:                           1 / 13 / 56
   Parallel:                            133 / 828 / 2523
   Serial:                              0 / 0 / 0
   Cached:                              1 / 13 / 47
Passive Host Checks Last 1/5/15 min:    0 / 0 / 0
Active Service Checks Last 1/5/15 min:  1956 / 10389 / 30266
   Scheduled:                           1956 / 10389 / 30266
   On-demand:                           0 / 0 / 0
   Cached:                              0 / 0 / 0
Passive Service Checks Last 1/5/15 min: 0 / 0 / 0

External Commands Last 1/5/15 min:      0 / 0 / 0
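Not asserting this is how the stats above were gathered, but one low-effort way to keep a latency record going forward would be nagiostats' MRTG mode from cron (a sketch; the binary path, config path, and log path are assumptions based on the paths in this thread):

```shell
# crontab fragment (sketch): append the average active service check latency
# (in ms) once a minute, so latency growth can be correlated with other graphs
* * * * * /opt/apps/nagios/bin/nagiostats -c /opt/apps/nagios/etc/nagios.cfg -m -d AVGACTSVCLAT >> /opt/apps/nagios/var/latency.log
```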
Regards,
David Nelson
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: Latencies increase drastically after 33 hours of uptime

Post by slansing »

Another thing to check on: do you have a large number of files in /tmp with names like "check*"? Do you have a ballpark estimate of how many service checks run active SNMP GETs? SNMP checks are notorious load throttlers, and they would have an impact on latency as well. The fact that this occurs around 33 hours after a restart is a bit sticky... are you able to narrow down any system-level tasks (processes, etc.) at this time that are grabbing a lot of resources?
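One hedged way to get that file count (the path and "check*" pattern are from the post; the one-liner itself is just a sketch):

```shell
# Count leftover plugin temp files matching the pattern mentioned above.
# /tmp is the directory in question on the Nagios server.
dir=/tmp
count=$(find "$dir" -maxdepth 1 -name 'check*' 2>/dev/null | wc -l)
echo "found $count check* files in $dir"
```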
dnelson
Posts: 14
Joined: Mon Dec 23, 2013 11:07 am

Additional data and thoughts.

Post by dnelson »

Additional notes/thoughts:
- The current production environment runs on a Sun Fire X4150 (2 X Quad-Core Intel(R) Xeon(R) X5460 @ 3.16GHz) w/ 8 GB of RAM on RHEL5 x86_64. The daemon has been running since Dec 18, and the latencies look as follows.
[Image: latency graphs from the production server]

- Originally, this project was to design and implement an HA Nagios solution. To achieve that, I had configured Gluster and Linux-HA (Heartbeat). Since discovering the latency behavior, I've stripped those components out from under Nagios, but they still exist on the system. I am contemplating rebuilding the RHEL6 server to remove any trace of the HA components, dropping in the production configuration as-is (or as close to production as possible), and then observing the system for a couple of days.
- Using a very low-pass filter, there does seem to be a subtle correlation between kernel time and service latency. Pay no attention to the y-axis values, as the data has been scaled to get the graphs to display nicely. The other CPU times (I/O wait and user) are very level when compared to kernel time. I don't yet know what the kernel is doing that causes it to gradually rise, though. Restarting the Nagios daemon causes kernel time to reset. It is also interesting that the point where kernel time knees over (around sample 4950) is when service latency begins to climb.
[Image: scaled kernel time vs. service latency]
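For reference, an N-sample moving average is one simple way to implement the kind of very low-pass filter described above (a sketch; N and the sample values here are purely illustrative, not data from the graphs):

```shell
# Smooth a column of samples with an N-sample moving average; with a large N,
# this passes only the slow trend, which is what makes the correlation visible.
printf '1\n2\n3\n4\n5\n6\n7\n' | awk -v N=5 '
    { buf[NR % N] = $1 }
    NR >= N { s = 0; for (i = 0; i < N; i++) s += buf[i]; print s / N }'
```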

Regards,
David Nelson
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: Latencies increase drastically after 33 hours of uptime

Post by slansing »

Very interesting data sets. I hope your above proposal is not going to be too much work. Are you running anything else on the same kernel configuration as this Nagios server? You had mentioned HA; is there a backup server you could bring online to see if it has the same quirky latency?
dnelson
Posts: 14
Joined: Mon Dec 23, 2013 11:07 am

Re: Latencies increase drastically after 33 hours of uptime

Post by dnelson »

Hi slansing,
slansing wrote:Another thing to check on, do you have a large amount of files in /tmp under the guise of "check*"?
I've got one named 'check_ftp' whose timestamp is being updated regularly. There are remnants of temp files left behind by some /bin/sh check scripts, but only because they don't properly clean up after themselves. An 'lsof /tmp' doesn't show anything interesting.
slansing wrote:Do you have a ballpark estimate of how many service checks run active SNMP get's? SNMP checks are notorious load throttlers, and it would have an impact on latency as well. The fact that this occurs around 33 hours after a restart is a bit sticky... are you able to narrow down any system level tasks "processes, etc.." at this time that are grabbing a lot of resources?
It's kind of hard to give you a single value. Instead: over a 5-minute period (our standard check interval), there were approximately 13 outbound SNMP requests per second.
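For a rough sense of volume, that rate implies on the order of a few thousand SNMP requests per 5-minute check cycle (simple arithmetic from the figures above):

```shell
# ~13 SNMP requests/sec sustained over a 300-second (5-minute) check cycle
echo $((13 * 300))   # -> 3900 requests per cycle
```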

Keep the suggestions coming - something's gotta shake out eventually.

Regards,
David Nelson
dnelson
Posts: 14
Joined: Mon Dec 23, 2013 11:07 am

Re: Latencies increase drastically after 33 hours of uptime

Post by dnelson »

Hi slansing,
slansing wrote:Very interesting data sets, I hope your above proposal is not going to be too much work, are you running anything else on the same kernel configuration as this nagios server? You had mentioned HA, is there a backup server you could bring online to see if it has the same quirky latency?
We've had a fair number of RHEL6 servers in our env for some time now. Although none run Nagios (except for the ones that I'm working on), nothing this bizarre has been reported. I've decided that I'm going to rebuild the passive node of the HA cluster without installing any of the Gluster/HA stuff. In the end, I'll have a fairly generic/stock system, and the question of whether Gluster or HA is affecting things can be answered. That seems to be a logical direction to pursue. If all goes well, I should have something up and running by the end of the day. I'm not rebuilding the "badly behaving system" because, after all, something that is behaving badly is easier to diagnose than something that used to behave badly.

Regards,
David Nelson
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: Latencies increase drastically after 33 hours of uptime

Post by abrist »

dnelson wrote:I'm not rebuilding the "badly behaving system" because, after all, something that is behaving badly is easier to diagnose than something that used to behave badly.
Well stated. Do you know if this kernel was a release candidate, or a full release? Which one are you running - 2.6.18 or 2.6.32? Was the kernel upgraded or any other major changes done before you started experiencing the issue?

Let us know how the newly provisioned server behaves.
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
dnelson
Posts: 14
Joined: Mon Dec 23, 2013 11:07 am

Re: Latencies increase drastically after 33 hours of uptime

Post by dnelson »

abrist wrote:Do you know if this kernel was a release candidate, or a full release? Which one are you running - 2.6.18 or 2.6.32? Was the kernel upgraded or any other major changes done before you started experiencing the issue?
Hi abrist,

AFAIK, these kernels are the full release. They are the ones released through the standard update channels, either through Red Hat or Oracle. Come to think of it, I don't really know how I would tell a "full release" from a "release candidate". The OEL/RHEL6 kernels that I have used are:

2.6.32-279.14.1.el6.x86_64
2.6.39-200.24.1.el6uek.x86_64
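One rough heuristic, not authoritative: release-candidate kernels typically carry an "rc" marker in the version string (e.g. 3.13-rc5), which neither of the strings above does:

```shell
# Heuristic check: flag kernel version strings that look like release candidates
for k in 2.6.32-279.14.1.el6.x86_64 2.6.39-200.24.1.el6uek.x86_64; do
    case "$k" in
        *rc*) echo "$k: looks like a release candidate" ;;
        *)    echo "$k: full release" ;;
    esac
done
```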

Things didn't go as smoothly as I had hoped on the rebuild. Nevertheless, it is up and running. I'll check in on it on Monday, Dec 30.

Regards,
David Nelson
Locked