Nagios 4 Load issues

liquidcool · Post by **liquidcool** » Tue May 20, 2014 9:00 am

From today's testing I am almost certain that something is going on with Nagios version 4. I installed 3.5.1 this morning and let it run the whole day, and there were no spikes at all like the ones in 4.0.6.

I have attached the graph showing this. All those spikes are with version 4.0.6 installed and running. Between Tue 00:00 and Tue 06:00 I compiled and installed 3.5.1 (no embedded Perl), and you can clearly see that the load is as it should be: pretty constant, with very little fluctuation. The config file is the same for both versions.

I really would like to know if there is anyone from Nagios that can help with this or if they are at least aware of this.

abrist · Post by **abrist** » Tue May 20, 2014 11:03 am

We were not aware of this. I am attempting to reproduce the issue. My curiosity concerning processes with heavy load was due to my attempt to reproduce the environment. My tests currently do not show these spikes. It could be related to one of your checks and core 4, a core 4 config option, or a problem with core 4 itself. The next time load starts spiking, can you get a few top and ps outputs?

liquidcool · Post by **liquidcool** » Wed May 21, 2014 5:49 am

From the start I have checked top and ps output and there is nothing untoward. Because this is a VM, I can monitor all its different aspects, and CPU utilization, disk and network show no change. Everything looks like it is ticking along as normal. All the checks run at their regular intervals, and none are hung or zombied. The only metric I can't get from inside the VM is what is happening in memory. Something makes me think there is some sort of flush happening, something being cleared out of a cache. So unfortunately top and ps don't show anything that points to the cause of the issue.

Just so you are aware, this is a server running about 4500 checks over 48 devices, with check intervals between 5 and 10 minutes. We only start seeing the spikes at this scale (roughly a thousand checks or more). With only a few checks, they are not visible at all.
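
To put that scale in numbers, here is a rough back-of-the-envelope sketch. It assumes the checks are spread evenly across their interval, which is a simplification and not something measured on this server; the function name `checks_per_second` is made up for illustration.

```shell
#!/bin/sh
# Rough estimate of the average number of check executions per second.
# Assumption: checks are spread evenly over the interval (a simplification).
checks_per_second() {
    # $1 = number of checks, $2 = check interval in minutes
    awk -v n="$1" -v m="$2" 'BEGIN { printf "%.1f\n", n / (m * 60) }'
}

checks_per_second 4500 5    # every check on a 5-minute interval
checks_per_second 4500 10   # every check on a 10-minute interval
```

Under that assumption the server would be starting somewhere between 7.5 and 15 checks per second on average.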

tmcdonald · Post by **tmcdonald** » Wed May 21, 2014 4:53 pm

liquidcool wrote: From the start I have checked top and ps outputs and there is nothing untoward.

Can you clarify what you mean by "untoward"? Also, are you using the exact same configs and plugins with 3.5.1 as you are with 4.0.6?

The pattern seems to be predictable enough that we can do some targeted testing during a peak. Could you try running the following during a peak and posting the output?

Code:

/usr/local/nagios/libexec/check_load -w 10,8,6 -c 20,16,12
w
free -m
ps -ef > /tmp/ps.out

Post the output of the first three, and PM me the last one.
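
If catching a peak by hand is awkward, the capture could be scripted along these lines. This is only a sketch: the threshold of 10, the `/tmp/nagios-peak` directory and the function names are assumptions for illustration, and it relies on `/proc/loadavg` being available (Linux).

```shell
#!/bin/sh
# Sketch: wait until the 1-minute load average crosses a threshold,
# then save the same diagnostics requested above.
THRESHOLD=10            # assumption: same value as the check_load warning level
OUT=/tmp/nagios-peak    # hypothetical output directory

# Succeeds (exit 0) when load $1 is greater than or equal to threshold $2.
load_exceeds() {
    awk -v l="$1" -v t="$2" 'BEGIN { exit !(l >= t) }'
}

# Polls every 10 seconds and captures one set of outputs at the first peak.
capture_peak() {
    mkdir -p "$OUT"
    while :; do
        read -r load1 rest < /proc/loadavg
        if load_exceeds "$load1" "$THRESHOLD"; then
            ts=$(date +%Y%m%d-%H%M%S)
            w        > "$OUT/w.$ts"
            free -m  > "$OUT/free.$ts"
            ps -ef   > "$OUT/ps.$ts"
            return 0
        fi
        sleep 10
    done
}
```

Calling `capture_peak` (for example from a screen session) blocks until the load reaches the threshold and then writes one timestamped set of files.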

liquidcool · Post by **liquidcool** » Thu May 22, 2014 6:59 am

By "untoward" I mean nothing out of the ordinary: the same checks processing, nothing stuck, nothing hogging CPU or memory.

The configs are exactly the same. All I have done is a make install-base, so all it overwrites are the binaries.

I will get those to you as soon as I can.

tmcdonald · Post by **tmcdonald** » Thu May 22, 2014 4:44 pm

Thanks. We are seeing one other thread with a similar issue, but the load is constant. It will be interesting to compare the results.

liquidcool · Post by **liquidcool** » Fri May 23, 2014 1:29 am

Here we go:

Code:

top - 02:23:05 up 23 days, 14:38,  3 users,  load average: 18.35, 15.14, 10.
Tasks: 163 total,   1 running, 162 sleeping,   0 stopped,   0 zombie
Cpu(s): 33.0%us,  6.5%sy,  0.0%ni, 60.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%
Mem:   3927988k total,  3217176k used,   710812k free,   162656k buffers
Swap:  4200956k total,      288k used,  4200668k free,  2818844k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 8645 root      20   0  8876 1268  852 R    1  0.0   0:00.16 top
26454 nagios    20   0 29376  12m 1108 S    1  0.3  21:44.84 nagios
 6109 root      20   0  232m 1300  916 S    0  0.0  54:26.11 nscd
26457 nagios    20   0 13388 1156  656 S    0  0.0   1:29.30 nagios
26458 nagios    20   0 13420 1260  656 S    0  0.0   1:30.85 nagios
26459 nagios    20   0 13420 1264  656 S    0  0.0   1:32.23 nagios
    1 root      20   0 10388  776  640 S    0  0.0   0:22.72 init
    2 root      20   0     0    0    0 S    0  0.0   0:00.00 kthreadd
    3 root      RT   0     0    0    0 S    0  0.0   0:51.43 migration/0
    4 root      20   0     0    0    0 S    0  0.0   4:24.32 ksoftirqd/0
    5 root      RT   0     0    0    0 S    0  0.0   0:53.43 migration/1
    6 root      20   0     0    0    0 S    0  0.0   4:33.82 ksoftirqd/1
    7 root      RT   0     0    0    0 S    0  0.0   0:53.87 migration/2
    8 root      20   0     0    0    0 S    0  0.0   4:40.62 ksoftirqd/2
    9 root      RT   0     0    0    0 S    0  0.0   0:54.51 migration/3
   10 root      20   0     0    0    0 S    0  0.0   4:40.98 ksoftirqd/3
   11 root      20   0     0    0    0 S    0  0.0   1:48.58 events/0
   12 root      20   0     0    0    0 S    0  0.0   1:53.65 events/1
   13 root      20   0     0    0    0 S    0  0.0   2:00.41 events/2
   14 root      20   0     0    0    0 S    0  0.0   2:14.94 events/3
   15 root      20   0     0    0    0 S    0  0.0   0:00.00 cpuset
   16 root      20   0     0    0    0 S    0  0.0   0:00.00 khelper
   17 root      20   0     0    0    0 S    0  0.0   0:00.00 netns
   18 root      20   0     0    0    0 S    0  0.0   0:00.00 async/mgr
   19 root      20   0     0    0    0 S    0  0.0   0:00.00 pm
   20 root      20   0     0    0    0 S    0  0.0   0:02.46 sync_supers
   21 root      20   0     0    0    0 S    0  0.0   0:03.14 bdi-default
   22 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/0
   23 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/1
   24 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/2
   25 root      20   0     0    0    0 S    0  0.0   0:00.00 kintegrityd/3
   26 root      20   0     0    0    0 S    0  0.0   0:00.41 kblockd/0
   27 root      20   0     0    0    0 S    0  0.0   0:00.95 kblockd/1
   28 root      20   0     0    0    0 S    0  0.0   0:00.65 kblockd/2
   29 root      20   0     0    0    0 S    0  0.0   0:00.61 kblockd/3
   30 root      20   0     0    0    0 S    0  0.0   0:00.00 kacpid
   31 root      20   0     0    0    0 S    0  0.0   0:00.00 kacpi_notify
   32 root      20   0     0    0    0 S    0  0.0   0:00.00 kacpi_hotplug
   33 root      20   0     0    0    0 S    0  0.0   0:00.00 kseriod
   38 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/0
   39 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/1
   40 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/2
   41 root      20   0     0    0    0 S    0  0.0   0:00.00 kondemand/3
   42 root      20   0     0    0    0 S    0  0.0   0:00.00 khungtaskd

Code:

Every 2.0s: free -m                               Fri May 23 02:23:20 2014

             total       used       free     shared    buffers     cached
Mem:          3835       3149        686          0        158       2752
-/+ buffers/cache:        237       3598
Swap:         4102          0       4102

Code:

 02:22:40 up 23 days, 14:38,  3 users,  load average: 14.95, 14.24, 10.30
USER     TTY        LOGIN@   IDLE   JCPU   PCPU WHAT
jpope    pts/0     02:20    1:40   0.14s  0.02s sshd: jpope [priv]
jpope    pts/1     02:21    1:04   0.17s  0.02s sshd: jpope [priv]
jpope    pts/2     02:22    0.00s  0.08s  0.01s sshd: jpope [priv]

Code:

WARNING - load average: 15.44, 15.22, 11.02|load1=15.440;10.000;20.000;0; load5=15.220;8.000;16.000;0; load15=11.020;6.000;12.000;0;
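
On Linux the load average counts tasks that are runnable (state R) as well as tasks in uninterruptible sleep (state D), so a high load next to a mostly idle CPU can be broken down by process state. The sketch below is one hedged way to do that; `count_states` is a made-up helper name, and a single sample can easily miss short-lived check processes.

```shell
#!/bin/sh
# Sketch: count processes in the two states that feed the Linux load average.
# Reads one process state per line on stdin and prints "R=<n> D=<n>".
count_states() {
    awk '{ s = substr($1, 1, 1)
           if (s == "R") r++
           else if (s == "D") d++ }
         END { printf "R=%d D=%d\n", r, d }'
}

# Take one sample of the current process table.
ps -eo state= | count_states
```

Running the last line repeatedly during a spike (for example under `watch`) would give a series of samples to compare against the reported load.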

tmcdonald · Post by **tmcdonald** » Fri May 23, 2014 10:54 am

I guess we should have asked this earlier, but what OS and version are you on? Architecture?

liquidcool · Post by **liquidcool** » Fri May 23, 2014 12:36 pm

Running on VMware
SUSE Linux Enterprise 11 SP1
Kernel 2.6.32.29
Quad 3 GHz
4 GB RAM
30 GB HDD

tmcdonald · Post by **tmcdonald** » Fri May 23, 2014 3:19 pm

We've seen this internally, though we can't seem to force it. We're going to do some digging on our end. In the meantime, can you provide some insight into your "top" output? Normally you would see something like "1.0" in the %CPU column, and yours is just "1". Similarly, it looks like your 15-minute load average got cut off, but I want to make sure that's not a weird display error.
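
One way to rule out a display problem is to read the load averages straight from the kernel instead of from a terminal-width-limited screen. A minimal sketch, assuming a Linux `/proc/loadavg`; the helper name `parse_loadavg` is hypothetical.

```shell
#!/bin/sh
# Sketch: print the three load averages from a /proc/loadavg-style line.
# The first three fields are the 1-, 5- and 15-minute averages.
parse_loadavg() {
    awk '{ printf "load1=%s load5=%s load15=%s\n", $1, $2, $3 }'
}

# Only read the file if it exists (it is Linux-specific).
if [ -r /proc/loadavg ]; then
    parse_loadavg < /proc/loadavg
fi
```

If these values are complete while the pasted output is not, the truncation is most likely down to terminal width or the copy and paste.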

Nagios Support Forum
