extremely high load average
Posted: Mon Mar 24, 2014 11:14 am
Well, I've just seen my highest-ever load average: 48. I didn't capture top at that point, but I'm concerned because I've got quite a bit of CPU power available to me and I'm not really monitoring that many hosts.
Can somebody please take a look and let me know what I can do to make better use of the hardware? At this rate I'm going to need to somehow persuade the customer site to purchase an Earth Simulator to run it all on.
I do notice that a lot of the checks run at pretty much the same time. Maybe there is a way to automatically reschedule the checks to space them out a bit?
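To illustrate what I mean by spacing the checks out, here is a minimal Python sketch. It is not how Nagios schedules checks internally; it only shows the arithmetic of spreading 799 checks evenly across one check interval. The 300-second interval and the function name `staggered_offsets` are my own assumptions for the example.

```python
def staggered_offsets(num_checks, interval_seconds):
    """Return one start offset (in seconds) per check, spread evenly over the interval."""
    if num_checks <= 0:
        return []
    spacing = interval_seconds / num_checks
    return [i * spacing for i in range(num_checks)]


# 799 checks is the figure from this install; 300 s is an assumed check interval.
offsets = staggered_offsets(799, 300)
spacing = offsets[1] - offsets[0]
print(f"gap between consecutive check starts: {spacing:.3f} s")
print(f"last check starts at: {offsets[-1]:.1f} s into the interval")
```

With an even spread like this, each check would start a fraction of a second after the previous one instead of all of them firing together.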
Nagios XI 2012R2.8c
141 hosts
799 checks
I'm not 100% sure, but I strongly suspect that I'm getting regular false positives because some of the scripts are not completing under the extremely high load.
Please do not double post; edit your previous post to add more information. Double posting will only bump you lower on our reply list.
[root@localhost ~]# uname -a
Linux localhost.localdomain 2.6.32-358.2.1.el6.x86_64 #1 SMP Wed Mar 13 00:26:49 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
top - 16:06:44 up 11 days, 9 min, 1 user, load average: 24.22, 14.27, 10.65
Tasks: 204 total, 1 running, 203 sleeping, 0 stopped, 0 zombie
Cpu(s): 8.6%us, 1.5%sy, 0.0%ni, 89.4%id, 0.4%wa, 0.0%hi, 0.1%si, 0.0%st
Mem: 3923956k total, 1386432k used, 2537524k free, 62580k buffers
Swap: 2064376k total, 21216k used, 2043160k free, 317276k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4711 apache 20 0 439m 31m 4488 S 3.7 0.8 0:13.93 httpd
13750 apache 20 0 432m 23m 4664 S 3.7 0.6 0:35.82 httpd
23625 apache 20 0 432m 24m 4668 S 3.7 0.6 3:30.46 httpd
11138 apache 20 0 439m 31m 4692 S 3.3 0.8 5:15.51 httpd
15506 apache 20 0 439m 31m 4668 S 3.3 0.8 5:14.99 httpd
20040 apache 20 0 439m 30m 4232 S 3.3 0.8 0:00.35 httpd
27970 apache 20 0 431m 23m 4660 S 3.3 0.6 3:52.59 httpd
4727 apache 20 0 440m 31m 4652 S 3.0 0.8 0:16.53 httpd
19704 apache 20 0 439m 31m 4660 S 3.0 0.8 5:10.13 httpd
1576 mysql 20 0 2180m 65m 4104 S 0.7 1.7 81:49.00 mysqld
13676 nagios 20 0 34708 3856 1248 S 0.7 0.1 0:01.68 nagios
20 root 20 0 0 0 0 S 0.3 0.0 5:30.86 events/1
10005 apache 20 0 432m 23m 4672 S 0.3 0.6 5:30.51 httpd
11137 apache 20 0 431m 23m 4680 S 0.3 0.6 5:26.62 httpd
11473 postgres 20 0 210m 6628 4260 S 0.3 0.2 0:08.37 postmaster
14281 postgres 20 0 210m 6604 4224 S 0.3 0.2 0:00.85 postmaster
18548 postgres 20 0 210m 5672 3476 S 0.3 0.1 0:00.01 postmaster
20107 postgres 20 0 210m 6100 3756 S 0.3 0.2 0:08.28 postmaster
20118 postgres 20 0 210m 5736 3464 S 0.3 0.1 0:00.01 postmaster
28161 postgres 20 0 210m 6860 4472 S 0.3 0.2 0:05.84 postmaster
1 root 20 0 19232 1104 892 S 0.0 0.0 0:01.98 init
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
3 root RT 0 0 0 0 S 0.0 0.0 0:22.67 migration/0
4 root 20 0 0 0 0 S 0.0 0.0 0:04.70 ksoftirqd/0
5 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
6 root RT 0 0 0 0 S 0.0 0.0 0:03.07 watchdog/0
7 root RT 0 0 0 0 S 0.0 0.0 0:24.54 migration/1
8 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/1
9 root 20 0 0 0 0 S 0.0 0.0 0:04.85 ksoftirqd/1
10 root RT 0 0 0 0 S 0.0 0.0 0:03.63 watchdog/1
11 root RT 0 0 0 0 S 0.0 0.0 0:24.68 migration/2
12 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/2
13 root 20 0 0 0 0 S 0.0 0.0 0:04.26 ksoftirqd/2
14 root RT 0 0 0 0 S 0.0 0.0 0:02.89 watchdog/2
15 root RT 0 0 0 0 S 0.0 0.0 0:22.83 migration/3
16 root RT 0 0 0 0 S 0.0 0.0 0:00.00 migration/3
17 root 20 0 0 0 0 S 0.0 0.0 0:03.91 ksoftirqd/3
18 root RT 0 0 0 0 S 0.0 0.0 0:03.58 watchdog/3
19 root 20 0 0 0 0 S 0.0 0.0 0:50.56 events/0
21 root 20 0 0 0 0 S 0.0 0.0 0:54.42 events/2
22 root 20 0 0 0 0 S 0.0 0.0 1:30.12 events/3
23 root 20 0 0 0 0 S 0.0 0.0 0:00.00 cgroup
24 root 20 0 0 0 0 S 0.0 0.0 0:00.00 khelper
25 root 20 0 0 0 0 S 0.0 0.0 0:00.00 netns
26 root 20 0 0 0 0 S 0.0 0.0 0:00.00 async/mgr
27 root 20 0 0 0 0 S 0.0 0.0 0:00.00 pm
28 root 20 0 0 0 0 S 0.0 0.0 0:03.15 sync_supers
29 root 20 0 0 0 0 S 0.0 0.0 0:04.23 bdi-default
30 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kintegrityd/0
31 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kintegrityd/1
32 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kintegrityd/2
33 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kintegrityd/3
34 root 20 0 0 0 0 S 0.0 0.0 1:41.91 kblockd/0
[root@localhost ~]# cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU X5560 @ 2.80GHz
stepping : 4
cpu MHz : 2800.099
cache size : 8192 KB
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts xtopology tsc_reliable nonstop_tsc aperfmperf unfair_spinlock pni ssse3 cx16 sse4_1 sse4_2 popcnt hypervisor lahf_lm ida dts
bogomips : 5600.19
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU X5560 @ 2.80GHz
stepping : 4
cpu MHz : 2800.099
cache size : 8192 KB
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts xtopology tsc_reliable nonstop_tsc aperfmperf unfair_spinlock pni ssse3 cx16 sse4_1 sse4_2 popcnt hypervisor lahf_lm ida dts
bogomips : 5600.19
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:
processor : 2
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU X5560 @ 2.80GHz
stepping : 4
cpu MHz : 2800.099
cache size : 8192 KB
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts xtopology tsc_reliable nonstop_tsc aperfmperf unfair_spinlock pni ssse3 cx16 sse4_1 sse4_2 popcnt hypervisor lahf_lm ida dts
bogomips : 5600.19
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:
processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU X5560 @ 2.80GHz
stepping : 4
cpu MHz : 2800.099
cache size : 8192 KB
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts xtopology tsc_reliable nonstop_tsc aperfmperf unfair_spinlock pni ssse3 cx16 sse4_1 sse4_2 popcnt hypervisor lahf_lm ida dts
bogomips : 5600.19
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:
Server Statistics

Metric        Value
Load
  1-min       32.05
  5-min       14.26
  15-min      11.32
CPU Stats
  User        11.87%
  Nice        0.00%
  System      1.65%
  I/O Wait    0.40%
  Steal       0.00%
  Idle        86.08%
Memory
  Total       3831 MB
  Used        1375 MB
  Free        2456 MB
  Shared      0 MB
  Buffers     61 MB
  Cached      315 MB
Swap
  Total       2015 MB
  Used        20 MB
  Free        1995 MB
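
To put the load figures in perspective against the four processors listed in /proc/cpuinfo, here is a rough Python sketch that divides each load average by the CPU count. The helper name `load_per_cpu` is made up for the example, and the numbers are copied from the top output earlier in this post rather than read live from the system.

```python
def load_per_cpu(load_averages, cpu_count):
    """Divide each load average by the number of CPUs."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return [load / cpu_count for load in load_averages]


# 1-, 5- and 15-minute load averages from the top output; 4 CPUs per /proc/cpuinfo.
per_cpu = load_per_cpu([24.22, 14.27, 10.65], 4)
for label, value in zip(("1-min", "5-min", "15-min"), per_cpu):
    print(f"{label} load per CPU: {value:.2f}")
```

On a live box the same inputs could come from `os.getloadavg()` and `os.cpu_count()` instead of pasted values.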