Nagios 4 Load issues
-
- Posts: 59
- Joined: Tue Feb 21, 2012 6:08 am
Re: Nagios 4 Load issues
From today's testing I am almost certain there is something going on with Nagios version 4. I installed 3.5.1 this morning and let it run the whole day, and there were no spikes at all like the ones in 4.0.6.
I have attached the graph showing this. All those spikes are with version 4.0.6 installed and running. Between Tue 00:00 and Tue 06:00 I compiled and installed 3.5.1 (no embedded Perl), and you can clearly see that the load is as it should be: pretty constant with very little fluctuation. The config file is the same for both versions.
I really would like to know if there is anyone from Nagios that can help with this or if they are at least aware of this.
Re: Nagios 4 Load issues
We were not aware of this. I am attempting to reproduce the issue. My curiosity concerning processes with heavy load was due to my attempt to reproduce the environment. My tests currently do not show these spikes. It could be related to one of your checks and core 4, a core 4 config option, or a problem with core 4 itself. The next time load starts spiking, can you get a few top and ps outputs?
Former Nagios employee
"It is turtles. All. The. Way. Down. . . .and maybe an elephant or two."
VI VI VI - The editor of the Beast!
Come to the Dark Side.
-
- Posts: 59
- Joined: Tue Feb 21, 2012 6:08 am
Re: Nagios 4 Load issues
From the start I have checked top and ps outputs and there is nothing untoward. Because this is a VM, I can monitor all its different aspects, and CPU utilization, disk, and network show no change. Everything looks like it is ticking along as normal. All the checks run at their regular intervals, and none are hung or zombied. The only metric I can't get from inside the VM is what is happening in memory. Something makes me think there is some sort of flush happening, with something being cleared out of a cache. So unfortunately top and ps don't show anything that points to the cause of the issue.
Just so you are aware, this is a server running about 4500 checks over 48 devices with check intervals between 5 and 10 minutes. We only start seeing the problem at this scale (a thousand or more checks). When there are only a few checks it is not visible at all.
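One thing that may be worth ruling out: on Linux the load average counts tasks in uninterruptible sleep (state D) as well as runnable ones (state R), so load can climb while CPU stays mostly idle. The sketch below tallies both states from a process-state listing. The function name `count_load_states` is hypothetical, and a single sample can easily miss short-lived check processes, so it would need to be run repeatedly during a spike.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nagios): reads one process state per line
# on stdin, as printed by `ps -eo state=`, and reports how many entries are
# runnable (R) or in uninterruptible sleep (D).
count_load_states() {
    awk '
        { s = substr($1, 1, 1) }   # first character is the state code
        s == "R" { r++ }
        s == "D" { d++ }
        END { printf "R=%d D=%d\n", r, d }
    '
}

# Example use during a spike:
#   ps -eo state= | count_load_states
```

If the D count is consistently high during a spike, that would point toward I/O or memory pressure rather than CPU; if both counts stay low, the spike is probably driven by processes too short-lived for sampling to catch.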
Re: Nagios 4 Load issues
liquidcool wrote: From the start I have checked top and ps outputs and there is nothing untoward.
Can you clarify what you mean by "untoward"? Also, are you using the exact same configs and plugins with 3.5 as you are with 4.0?
The pattern seems to be predictable enough that we can do some targeted testing during a peak. Could you try running the following during a peak and posting the output?
Code:
/usr/local/nagios/libexec/check_load -w 10,8,6 -c 20,16,12
w
free -m
ps -ef > /tmp/ps.out
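Since the peaks may be hard to catch by hand, the same commands could be wrapped in a small script that only captures output while the 1-minute load is above a threshold. This is a rough sketch: the function names, the threshold, and the output directory are illustrative assumptions, and `top -b -n 1` is added here as a batch-mode snapshot that was not in the list above.

```shell
#!/bin/sh
# Hypothetical helpers; names and paths are assumptions for illustration.

# Succeeds (exit 0) when the 1-minute load average in the given file
# (default /proc/loadavg) is above the threshold.
load_exceeds() {
    threshold=$1
    loadavg_file=${2:-/proc/loadavg}
    awk -v t="$threshold" '
        NR == 1 { found = ($1 > t) }
        END { exit !found }
    ' "$loadavg_file"
}

# Writes w, free, and top output plus a full ps listing into out_dir,
# using a timestamp so repeated captures do not overwrite each other.
capture_snapshot() {
    out_dir=$1
    stamp=$(date +%Y%m%d-%H%M%S)
    mkdir -p "$out_dir" || return 1
    {
        w
        free -m
        top -b -n 1
    } > "$out_dir/snapshot-$stamp.txt" 2>&1
    ps -ef > "$out_dir/ps-$stamp.out"
}

# Example loop (threshold 10 and /tmp/nagios-load are placeholders):
#   while :; do
#       load_exceeds 10 && capture_snapshot /tmp/nagios-load
#       sleep 30
#   done
```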
Former Nagios employee
-
- Posts: 59
- Joined: Tue Feb 21, 2012 6:08 am
Re: Nagios 4 Load issues
"Untoward" meaning nothing out of the ordinary. The same checks are processing, nothing is stuck, and nothing is hogging CPU or memory.
The configs are exactly the same. All I have done is a make base-install, so all it overwrites are the binaries.
I will get those to you as soon as I can.
Re: Nagios 4 Load issues
Thanks. We are seeing one other thread with a similar issue, but the load is constant. It will be interesting to compare the results.
Former Nagios employee
-
- Posts: 59
- Joined: Tue Feb 21, 2012 6:08 am
Re: Nagios 4 Load issues
Here we go:
Code:
top - 02:23:05 up 23 days, 14:38, 3 users, load average: 18.35, 15.14, 10.
Tasks: 163 total, 1 running, 162 sleeping, 0 stopped, 0 zombie
Cpu(s): 33.0%us, 6.5%sy, 0.0%ni, 60.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%
Mem: 3927988k total, 3217176k used, 710812k free, 162656k buffers
Swap: 4200956k total, 288k used, 4200668k free, 2818844k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
8645 root 20 0 8876 1268 852 R 1 0.0 0:00.16 top
26454 nagios 20 0 29376 12m 1108 S 1 0.3 21:44.84 nagios
6109 root 20 0 232m 1300 916 S 0 0.0 54:26.11 nscd
26457 nagios 20 0 13388 1156 656 S 0 0.0 1:29.30 nagios
26458 nagios 20 0 13420 1260 656 S 0 0.0 1:30.85 nagios
26459 nagios 20 0 13420 1264 656 S 0 0.0 1:32.23 nagios
1 root 20 0 10388 776 640 S 0 0.0 0:22.72 init
2 root 20 0 0 0 0 S 0 0.0 0:00.00 kthreadd
3 root RT 0 0 0 0 S 0 0.0 0:51.43 migration/0
4 root 20 0 0 0 0 S 0 0.0 4:24.32 ksoftirqd/0
5 root RT 0 0 0 0 S 0 0.0 0:53.43 migration/1
6 root 20 0 0 0 0 S 0 0.0 4:33.82 ksoftirqd/1
7 root RT 0 0 0 0 S 0 0.0 0:53.87 migration/2
8 root 20 0 0 0 0 S 0 0.0 4:40.62 ksoftirqd/2
9 root RT 0 0 0 0 S 0 0.0 0:54.51 migration/3
10 root 20 0 0 0 0 S 0 0.0 4:40.98 ksoftirqd/3
11 root 20 0 0 0 0 S 0 0.0 1:48.58 events/0
12 root 20 0 0 0 0 S 0 0.0 1:53.65 events/1
13 root 20 0 0 0 0 S 0 0.0 2:00.41 events/2
14 root 20 0 0 0 0 S 0 0.0 2:14.94 events/3
15 root 20 0 0 0 0 S 0 0.0 0:00.00 cpuset
16 root 20 0 0 0 0 S 0 0.0 0:00.00 khelper
17 root 20 0 0 0 0 S 0 0.0 0:00.00 netns
18 root 20 0 0 0 0 S 0 0.0 0:00.00 async/mgr
19 root 20 0 0 0 0 S 0 0.0 0:00.00 pm
20 root 20 0 0 0 0 S 0 0.0 0:02.46 sync_supers
21 root 20 0 0 0 0 S 0 0.0 0:03.14 bdi-default
22 root 20 0 0 0 0 S 0 0.0 0:00.00 kintegrityd/0
23 root 20 0 0 0 0 S 0 0.0 0:00.00 kintegrityd/1
24 root 20 0 0 0 0 S 0 0.0 0:00.00 kintegrityd/2
25 root 20 0 0 0 0 S 0 0.0 0:00.00 kintegrityd/3
26 root 20 0 0 0 0 S 0 0.0 0:00.41 kblockd/0
27 root 20 0 0 0 0 S 0 0.0 0:00.95 kblockd/1
28 root 20 0 0 0 0 S 0 0.0 0:00.65 kblockd/2
29 root 20 0 0 0 0 S 0 0.0 0:00.61 kblockd/3
30 root 20 0 0 0 0 S 0 0.0 0:00.00 kacpid
31 root 20 0 0 0 0 S 0 0.0 0:00.00 kacpi_notify
32 root 20 0 0 0 0 S 0 0.0 0:00.00 kacpi_hotplug
33 root 20 0 0 0 0 S 0 0.0 0:00.00 kseriod
38 root 20 0 0 0 0 S 0 0.0 0:00.00 kondemand/0
39 root 20 0 0 0 0 S 0 0.0 0:00.00 kondemand/1
40 root 20 0 0 0 0 S 0 0.0 0:00.00 kondemand/2
41 root 20 0 0 0 0 S 0 0.0 0:00.00 kondemand/3
42 root 20 0 0 0 0 S 0 0.0 0:00.00 khungtaskd
Code:
Every 2.0s: free -m Fri May 23 02:23:20 2014
total used free shared buffers cached
Mem: 3835 3149 686 0 158 2752
-/+ buffers/cache: 237 3598
Swap: 4102 0 4102
Code:
02:22:40 up 23 days, 14:38, 3 users, load average: 14.95, 14.24, 10.30
USER TTY LOGIN@ IDLE JCPU PCPU WHAT
jpope pts/0 02:20 1:40 0.14s 0.02s sshd: jpope [priv]
jpope pts/1 02:21 1:04 0.17s 0.02s sshd: jpope [priv]
jpope pts/2 02:22 0.00s 0.08s 0.01s sshd: jpope [priv]
Code:
WARNING - load average: 15.44, 15.22, 11.02|load1=15.440;10.000;20.000;0; load5=15.220;8.000;16.000;0; load15=11.020;6.000;12.000;0;
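For comparing several captures, the numbers after the `|` in the check_load output (the performance data) can be pulled out mechanically. The sketch below assumes the `label=value;warn;crit;min;` layout seen in the line above; the function name `perfdata_value` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads plugin output on stdin and prints the value of
# one performance-data label (for example load1, load5, or load15).
perfdata_value() {
    label=$1
    sed 's/^[^|]*|//' |      # drop the human-readable text before the pipe
        tr ' ' '\n' |        # one label=value;... entry per line
        awk -F'[=;]' -v l="$label" '$1 == l { print $2 }'
}

# Example use:
#   /usr/local/nagios/libexec/check_load -w 10,8,6 -c 20,16,12 | perfdata_value load1
```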
Re: Nagios 4 Load issues
I guess we should have asked this earlier, but what OS and version are you on? Architecture?
Former Nagios employee
-
- Posts: 59
- Joined: Tue Feb 21, 2012 6:08 am
Re: Nagios 4 Load issues
Running on VMWare
SuSe Ent 11 SP1
Kernel 2.6.32.29
Quad 3Ghz
4GB RAM
30GB HDD
Re: Nagios 4 Load issues
We've seen this internally, though we can't seem to force it. We're going to do some digging on our end. In the meantime, can you provide some insight into your "top" output? Normally you would see something like "1.0" in the %CPU column, and yours is just 1. Similarly, it looks like your 15-minute average got cut off, but I want to make sure that's not a weird display error.
Former Nagios employee