
Re: Latencies increase drastically after 33 hours of uptime

Posted: Fri Dec 27, 2013 4:45 pm
by dnelson
The shutdown and rebuild of the 2nd node looks to have had an interesting effect on the service latencies of the 1st node. The drop in service latency after the daemon restart at 9:42 AM yesterday is not surprising. However, when I shut down the 2nd node at 1:39 PM yesterday to begin rebuilding the OS, latencies took another dive for the better. Strange stuff. Not sure why they started climbing again. Had I noticed this earlier, I would have left things running. Note that the y-axis is log base 10 and the units are in ms.

Image

This upcoming weekend's data from the rebuild node (w/o any gluster/HA) may be very interesting.

Latencies climb with a generic OS build

Posted: Mon Dec 30, 2013 9:54 am
by dnelson
It turns out that a fresh generic OS build behaves the same as one with Gluster and Linux HA (heartbeat) installed. After ~33 hours of daemon uptime, both service and host latencies increase at a rate of approximately 160 seconds per day. Kernel time behaves similarly in that it gradually rises, and when it keels over, latencies start to climb. Again, the y-scale values are normalized and do not reflect the actual values.

The good news is that Gluster and Linux HA don't contribute to or trigger the problem, so I don't have to rethink my HA design.

Image

Today I'm going to reconfigure nagios.cfg and place all Nagios-related files under /opt/apps/nagios/... This will bring the configuration one step closer to actual production.
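For anyone following along, a rough sketch of the kind of path changes involved. The directive names are standard nagios.cfg options, but the specific /opt/apps/nagios layout is just this site's convention:

```
# nagios.cfg -- relocating Nagios files under /opt/apps/nagios (site-specific paths)
log_file=/opt/apps/nagios/var/nagios.log
cfg_dir=/opt/apps/nagios/etc/objects
object_cache_file=/opt/apps/nagios/var/objects.cache
status_file=/opt/apps/nagios/var/status.dat
check_result_path=/opt/apps/nagios/var/spool/checkresults
state_retention_file=/opt/apps/nagios/var/retention.dat
```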

Does anybody know of anyone running Nagios 3.x on RHEL/OEL 6 with a daemon uptime greater than 33 hours who can report on their service/host latencies?

Thanks,
David Nelson

Re: Latencies increase drastically after 33 hours of uptime

Posted: Mon Dec 30, 2013 12:46 pm
by abrist
dnelson wrote: Does anybody know of anybody that is running Nagios 3.x w/ RHEL/OEL 6 with a daemon uptime greater than 33 hours that can report on service/host latencies?
I have a few different core systems running 3.5.1 and I have not noticed this behavior, though the checks/5min are low on those servers (around 1,000). In the past, this behavior has been caused by:
1) Latency/lack of resources (ram/disk io/load)
2) System ulimits (open files is usually the culprit here)
3) Improper configuration (checks running at too small of an interval or too large of timeouts)
4) Specific checks that are load/disk intensive (vmware, oracle, sql queries, etc)

Re: Latencies increase drastically after 33 hours of uptime

Posted: Thu Jan 02, 2014 1:38 pm
by dnelson
abrist wrote: I have a few different core systems running 3.5.1 and I have not noticed this behavior, though the checks/5min are low on those servers (around 1,000). In the past, this behavior has been caused by:
1) Latency/lack of resources (ram/disk io/load)
2) System ulimits (open files is usually the culprit here)
3) Improper configuration (checks running at too small of an interval or too large of timeouts)
4) Specific checks that are load/disk intensive (vmware, oracle, sql queries, etc)
Hi abrist,

While I don't believe #1, #3, or #4 are at play here, since current production runs on weaker hardware with more monitors, item #2 did pique my interest. I confirmed that, while not identical, the ulimits appear to be reasonably configured. Of interest is the 'max processes' soft limit of 1024 on OEL/RHEL6. I don't think this is affecting the behavior, for two reasons: 1) Why after 33 hours? Nagios should have things sorted out within 30 minutes if I understand 'max_service_check_spread' and 'max_host_check_spread' correctly. 2) The number of check results maxes out around 250 every 5 seconds, and 'ps -eLf' run in a continuous while loop rarely shows the count exceeding 15. I'm just not seeing any clues that would indicate that 'max processes' is being reached.
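To double-check the limits a process actually inherits (as opposed to what the login shell reports), here's a small sketch using Python's standard resource module. On Linux you could equivalently read /proc/&lt;pid&gt;/limits for the running nagios daemon, which is where the tables below come from:

```python
import resource

# Query the limits this process inherited; RLIM_INFINITY means "unlimited".
nproc = resource.getrlimit(resource.RLIMIT_NPROC)    # max processes
nofile = resource.getrlimit(resource.RLIMIT_NOFILE)  # max open files

def show(name, pair):
    soft, hard = pair
    fmt = lambda v: "unlimited" if v == resource.RLIM_INFINITY else v
    print(f"{name}: soft={fmt(soft)} hard={fmt(hard)}")

show("Max processes", nproc)
show("Max open files", nofile)
```

Run this from the same service account (or su to it) so the values reflect what the daemon sees, not your interactive session.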

Current production on RHEL5 - 2.6.18-348.6.1.0.1.el5

Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            10485760             unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             63844                63844                processes
Max open files            1024                 1024                 files
Max locked memory         32768                32768                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       63844                63844                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
OEL/RHEL6 - 2.6.32-279.14.1.el6.x86_64

Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            10485760             unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             1024                 127056               processes
Max open files            1024                 4096                 files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       127056               127056               signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
On a related note, I received some feedback that some other folks are experiencing this same issue and they are just as puzzled as to why. I reconfigured some things and set up some additional data collection last night. I'll report on what I find.

Regards,
David Nelson

Re: Latencies increase drastically after 33 hours of uptime

Posted: Thu Jan 02, 2014 1:49 pm
by abrist
Great, thanks for your due diligence. There are a few batch processes that can occasionally spike open-file counts; you should see errors in the system logs if such an issue occurs. Have you done any parsing of /var/log/messages to see if it holds any clues?
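A minimal sketch of the kind of log scan this suggests. "Too many open files" (EMFILE/ENFILE) and fork failures ("Resource temporarily unavailable", "Cannot allocate memory") are the strings worth counting; the sample lines here are made up for illustration, not real log output:

```python
import re
from collections import Counter

# Error strings that typically accompany exhausted ulimits.
PATTERNS = {
    "open_files": re.compile(r"Too many open files"),
    "fork_fail": re.compile(r"Resource temporarily unavailable|Cannot allocate memory"),
}

def scan(lines):
    """Count limit-related error messages in an iterable of log lines."""
    hits = Counter()
    for line in lines:
        for name, pat in PATTERNS.items():
            if pat.search(line):
                hits[name] += 1
    return hits

# Made-up sample standing in for open("/var/log/messages"):
sample = [
    "Jan  2 13:40:01 mon1 nagios: Warning: fork() failed: Resource temporarily unavailable",
    "Jan  2 13:40:05 mon1 nrpe[1234]: accept(): Too many open files",
    "Jan  2 13:41:00 mon1 kernel: eth0: link up",
]
print(scan(sample))
```

Pointing `scan` at the real file is a one-liner: `scan(open("/var/log/messages"))`.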

Small Nagios Config

Posted: Mon Jan 06, 2014 1:54 pm
by dnelson
Over the New Year's break, I configured a simple Nagios setup consisting of ~650 host checks, each with a simple NRPE service check. Since there was no need for SNMP checks, I also set 'enable_environment_macros=0', as that was brought up in an earlier post and warranted some investigation. After ~38.5 hours of daemon uptime, kernel time was steadily growing while latencies remained constant.

Image

Normally, latencies climb after 33 hours of daemon uptime. Why, after 38.5 hours of daemon uptime, hadn't latencies started to climb?

Some observations:
- The slope of the "small nagios config" kernel time line is ~0.000279. In comparison, the slope of the original data's kernel time, before service latencies began to increase, was ~0.002988 (a ratio of about 10.7, just over an order of magnitude).
- Could the slopes of the kernel times be influenced by the number of checks? 650 hosts * 2 (one host check and one service check each) * 10.7 ~ 13,924. Compared to the original data of 908 hosts and 15,222 services, a total of 16,130 checks, it seems plausible that the number of checks influences how long it takes before service latencies begin to climb. This would be easy to investigate with a bunch of simple service checks.
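The back-of-the-envelope arithmetic above, spelled out with the numbers from the two runs:

```python
# Kernel-time slopes observed in the two test runs (normalized units).
slope_small = 0.000279   # small config: ~650 hosts, 650 services
slope_orig  = 0.002988   # original config: 908 hosts, 15222 services

ratio = slope_orig / slope_small
print(round(ratio, 1))              # ~10.7, just over an order of magnitude

# If slope scales with check count, the small config scaled up by the
# ratio should land near the original config's total check count:
checks_small = 650 * 2              # one host check + one service check each
print(round(checks_small * ratio))  # ~13923

checks_orig = 908 + 15222           # original total checks
print(checks_orig)                  # 16130
```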

use_large_installation_tweaks and child_processes_fork_twice

Posted: Mon Jan 06, 2014 3:52 pm
by dnelson
Reading up on what use_large_installation_tweaks does (http://nagios.sourceforge.net/docs/3_0/ ... weaks.html), I interpreted it to mean that, if set, child_processes_fork_twice would automatically be set to 0. This seemed to be reinforced in the docs for child_processes_fork_twice, which state ".... However, if the use_large_installation_tweaks option is enabled, it will only fork() once. ..."

As a test, I configured the test server with 1334 hosts and 52026 service checks (way more than what would be seen in production).

With use_large_installation_tweaks=1 and child_processes_fork_twice=0, the test server was able to process 52,000 service checks in a 5-minute interval fairly well. Latencies eventually increased to just over 1.1 seconds after nearly 3 days of daemon uptime.

With use_large_installation_tweaks=1 and child_processes_fork_twice=1, the test server was only able to process 19,389 service checks in a 5-minute interval, and after just 17 minutes of daemon uptime, service latencies had increased to 360 seconds.

What's going on when use_large_installation_tweaks=1 and child_processes_fork_twice=1?
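For context on what fork-once versus fork-twice means mechanically, here is a hedged Python sketch of the two patterns. Nagios does this in C and exec()s the actual plugin; the function names and structure below are my illustration of the general technique, not the Nagios source:

```python
import os

def run_check_fork_once():
    """Single fork: the forking process must wait on the worker itself."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)          # worker; a real daemon would exec the plugin here
    os.waitpid(pid, 0)       # parent blocks until the worker is reaped
    return pid

def run_check_fork_twice():
    """Double fork: an intermediate child forks the worker and exits
    immediately, so the worker is reparented to init and the parent's
    wait is always short."""
    pid = os.fork()
    if pid == 0:
        if os.fork() == 0:   # grandchild does the actual work
            os._exit(0)
        os._exit(0)          # intermediate child exits at once
    os.waitpid(pid, 0)       # reaps only the short-lived intermediate child
    return pid

run_check_fork_once()
run_check_fork_twice()
print("both paths reaped cleanly")
```

The double fork buys detachment from the parent at the cost of an extra fork() per check, which is one plausible reason 52,000 checks per 5 minutes is so much more expensive with fork-twice enabled.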

Re: Latencies increase drastically after 33 hours of uptime

Posted: Mon Jan 06, 2014 5:18 pm
by abrist
dnelson wrote: What's going on when use_large_installation_tweaks=1 and child_processes_fork_twice=1?
What is going on indeed. You may have enough information for the core devs to take a look - open a ticket at http://tracker.nagios.org

Use the source, Luke.

Posted: Tue Jan 07, 2014 3:34 pm
by dnelson
Before submitting a case w/ the devs regarding the use_large_installation_tweaks and child_processes_fork_twice tunables, I needed to verify that the problem existed in 3.5.1 (I had been using 3.3.1). Sure enough, I can reproduce the problem with 3.5.1.

Next, I went reading the source code to see how things work when use_large_installation_tweaks is defined, and this is when I saw the light. The logic reads:

If use_large_installation_tweaks=1 and either child_processes_fork_twice or free_child_process_memory is not defined in nagios.cfg, then the undefined option is set to 0. If child_processes_fork_twice or free_child_process_memory is explicitly defined, that value is used regardless.
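In sketch form, that precedence logic looks like the following (my paraphrase of the C code; variable and function names are assumed, with None standing in for "not set in nagios.cfg"):

```python
def resolve_tweaks(use_large_installation_tweaks,
                   child_processes_fork_twice=None,   # None = not set in nagios.cfg
                   free_child_process_memory=None):
    """Explicit settings always win; the tweaks flag only supplies
    defaults of 0 for options the admin left unset."""
    if use_large_installation_tweaks:
        if child_processes_fork_twice is None:
            child_processes_fork_twice = 0
        if free_child_process_memory is None:
            free_child_process_memory = 0
    return child_processes_fork_twice, free_child_process_memory

# Tweaks on, nothing explicit: both default to 0 (fork once, skip the free).
print(resolve_tweaks(1))
# Tweaks on, but fork_twice explicitly set to 1: the explicit value wins.
print(resolve_tweaks(1, child_processes_fork_twice=1))
```

This explains the earlier result: explicitly setting child_processes_fork_twice=1 in nagios.cfg overrides the fork-once behavior the tweaks flag would otherwise provide.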

I don't believe any of this answers the '33 hour' question, but it may be a good step in the right direction. Armed with this new understanding, I'm going to review our config and checks and see about unsetting both child_processes_fork_twice and free_child_process_memory and allowing use_large_installation_tweaks=1 to do its thing.

Re: Latencies increase drastically after 33 hours of uptime

Posted: Tue Jan 07, 2014 4:20 pm
by abrist
Sounds good, keep us informed of your progress. Just FYI, there are some caveats to large installation tweaks (primarily with env vars):
http://nagios.sourceforge.net/docs/nagi ... weaks.html