extremely high load average

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
nottheadmin
Posts: 53
Joined: Thu Dec 19, 2013 9:51 am
Location: Amsterdam, NL

extremely high load average

Post by nottheadmin »

Well, I've just seen my highest-ever load average of 48. I didn't capture top at that point, but I'm concerned because I've got quite a bit of CPU power available to me and I'm not really monitoring that many hosts.

Can somebody please take a look and let me know what I can do to make better use of the hardware? At this rate I'm going to need to somehow persuade the customer site to purchase an earth simulator to run it all on :)

I do notice a lot of the checks run at pretty much the same time. Maybe there is a way to automatically reschedule the checks to space them out a bit?

Nagios XI 2012R2.8c
141 hosts
799 checks

I'm not 100% sure, but I am highly suspicious that I'm getting regular false positives because some of the scripts are not completing due to the extremely high load.

Code: Select all

[root@localhost ~]# uname -a
Linux localhost.localdomain 2.6.32-358.2.1.el6.x86_64 #1 SMP Wed Mar 13 00:26:49 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

Code: Select all

top - 16:06:44 up 11 days, 9 min,  1 user,  load average: 24.22, 14.27, 10.65
Tasks: 204 total,   1 running, 203 sleeping,   0 stopped,   0 zombie
Cpu(s):  8.6%us,  1.5%sy,  0.0%ni, 89.4%id,  0.4%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:   3923956k total,  1386432k used,  2537524k free,    62580k buffers
Swap:  2064376k total,    21216k used,  2043160k free,   317276k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 4711 apache    20   0  439m  31m 4488 S  3.7  0.8   0:13.93 httpd
13750 apache    20   0  432m  23m 4664 S  3.7  0.6   0:35.82 httpd
23625 apache    20   0  432m  24m 4668 S  3.7  0.6   3:30.46 httpd
11138 apache    20   0  439m  31m 4692 S  3.3  0.8   5:15.51 httpd
15506 apache    20   0  439m  31m 4668 S  3.3  0.8   5:14.99 httpd
20040 apache    20   0  439m  30m 4232 S  3.3  0.8   0:00.35 httpd
27970 apache    20   0  431m  23m 4660 S  3.3  0.6   3:52.59 httpd
 4727 apache    20   0  440m  31m 4652 S  3.0  0.8   0:16.53 httpd
19704 apache    20   0  439m  31m 4660 S  3.0  0.8   5:10.13 httpd
 1576 mysql     20   0 2180m  65m 4104 S  0.7  1.7  81:49.00 mysqld
13676 nagios    20   0 34708 3856 1248 S  0.7  0.1   0:01.68 nagios
   20 root      20   0     0    0    0 S  0.3  0.0   5:30.86 events/1
10005 apache    20   0  432m  23m 4672 S  0.3  0.6   5:30.51 httpd
11137 apache    20   0  431m  23m 4680 S  0.3  0.6   5:26.62 httpd
11473 postgres  20   0  210m 6628 4260 S  0.3  0.2   0:08.37 postmaster
14281 postgres  20   0  210m 6604 4224 S  0.3  0.2   0:00.85 postmaster
18548 postgres  20   0  210m 5672 3476 S  0.3  0.1   0:00.01 postmaster
20107 postgres  20   0  210m 6100 3756 S  0.3  0.2   0:08.28 postmaster
20118 postgres  20   0  210m 5736 3464 S  0.3  0.1   0:00.01 postmaster
28161 postgres  20   0  210m 6860 4472 S  0.3  0.2   0:05.84 postmaster
    1 root      20   0 19232 1104  892 S  0.0  0.0   0:01.98 init
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd
    3 root      RT   0     0    0    0 S  0.0  0.0   0:22.67 migration/0
    4 root      20   0     0    0    0 S  0.0  0.0   0:04.70 ksoftirqd/0
    5 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0
    6 root      RT   0     0    0    0 S  0.0  0.0   0:03.07 watchdog/0
    7 root      RT   0     0    0    0 S  0.0  0.0   0:24.54 migration/1
    8 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/1
    9 root      20   0     0    0    0 S  0.0  0.0   0:04.85 ksoftirqd/1
   10 root      RT   0     0    0    0 S  0.0  0.0   0:03.63 watchdog/1
   11 root      RT   0     0    0    0 S  0.0  0.0   0:24.68 migration/2
   12 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/2
   13 root      20   0     0    0    0 S  0.0  0.0   0:04.26 ksoftirqd/2
   14 root      RT   0     0    0    0 S  0.0  0.0   0:02.89 watchdog/2
   15 root      RT   0     0    0    0 S  0.0  0.0   0:22.83 migration/3
   16 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/3
   17 root      20   0     0    0    0 S  0.0  0.0   0:03.91 ksoftirqd/3
   18 root      RT   0     0    0    0 S  0.0  0.0   0:03.58 watchdog/3
   19 root      20   0     0    0    0 S  0.0  0.0   0:50.56 events/0
   21 root      20   0     0    0    0 S  0.0  0.0   0:54.42 events/2
   22 root      20   0     0    0    0 S  0.0  0.0   1:30.12 events/3
   23 root      20   0     0    0    0 S  0.0  0.0   0:00.00 cgroup
   24 root      20   0     0    0    0 S  0.0  0.0   0:00.00 khelper
   25 root      20   0     0    0    0 S  0.0  0.0   0:00.00 netns
   26 root      20   0     0    0    0 S  0.0  0.0   0:00.00 async/mgr
   27 root      20   0     0    0    0 S  0.0  0.0   0:00.00 pm
   28 root      20   0     0    0    0 S  0.0  0.0   0:03.15 sync_supers
   29 root      20   0     0    0    0 S  0.0  0.0   0:04.23 bdi-default
   30 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kintegrityd/0
   31 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kintegrityd/1
   32 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kintegrityd/2
   33 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kintegrityd/3
   34 root      20   0     0    0    0 S  0.0  0.0   1:41.91 kblockd/0

Code: Select all

[root@localhost ~]# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 26
model name      : Intel(R) Xeon(R) CPU           X5560  @ 2.80GHz
stepping        : 4
cpu MHz         : 2800.099
cache size      : 8192 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts xtopology tsc_reliable nonstop_tsc aperfmperf unfair_spinlock pni ssse3 cx16 sse4_1 sse4_2 popcnt hypervisor lahf_lm ida dts
bogomips        : 5600.19
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 26
model name      : Intel(R) Xeon(R) CPU           X5560  @ 2.80GHz
stepping        : 4
cpu MHz         : 2800.099
cache size      : 8192 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts xtopology tsc_reliable nonstop_tsc aperfmperf unfair_spinlock pni ssse3 cx16 sse4_1 sse4_2 popcnt hypervisor lahf_lm ida dts
bogomips        : 5600.19
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 2
vendor_id       : GenuineIntel
cpu family      : 6
model           : 26
model name      : Intel(R) Xeon(R) CPU           X5560  @ 2.80GHz
stepping        : 4
cpu MHz         : 2800.099
cache size      : 8192 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts xtopology tsc_reliable nonstop_tsc aperfmperf unfair_spinlock pni ssse3 cx16 sse4_1 sse4_2 popcnt hypervisor lahf_lm ida dts
bogomips        : 5600.19
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 3
vendor_id       : GenuineIntel
cpu family      : 6
model           : 26
model name      : Intel(R) Xeon(R) CPU           X5560  @ 2.80GHz
stepping        : 4
cpu MHz         : 2800.099
cache size      : 8192 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts xtopology tsc_reliable nonstop_tsc aperfmperf unfair_spinlock pni ssse3 cx16 sse4_1 sse4_2 popcnt hypervisor lahf_lm ida dts
bogomips        : 5600.19
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

Code: Select all

Server Statistics

Load
  1-min     32.05
  5-min     14.26
  15-min    11.32

CPU Stats
  User      11.87%
  Nice       0.00%
  System     1.65%
  I/O Wait   0.40%
  Steal      0.00%
  Idle      86.08%

Memory
  Total     3831 MB
  Used      1375 MB
  Free      2456 MB
  Shared       0 MB
  Buffers     61 MB
  Cached     315 MB

Swap
  Total     2015 MB
  Used        20 MB
  Free      1995 MB
Please do not double post; edit your previous post to add more information. Double posting will only bump you lower on our reply list.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: extremely high load average

Post by scottwilkerson »

Something is definitely off, as you have high load with low CPU usage and relatively low-looking iowait.

Is this XI server attached to a SAN for disks?

What is the output of these commands?

Code: Select all

df -h
df -i
ls -l /usr/local/nagios/var/spool/perfdata|wc -l
ls -l /usr/local/nagios/var/spool/xidpe|wc -l
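High load combined with low CPU and low iowait is often caused by processes stuck in uninterruptible sleep (state D), which count toward the load average without using any CPU. A quick way to list them (a sketch; assumes Linux procps):

```shell
# List processes in uninterruptible sleep ("D"); they inflate the load
# average while consuming no CPU -- a classic sign of stalled storage.
ps -eo state,pid,comm | awk '$1 ~ /^D/'
```

If this shows check plugins or httpd workers piling up in D state, the storage back end is the first suspect.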
Former Nagios employee
nottheadmin
Posts: 53
Joined: Thu Dec 19, 2013 9:51 am
Location: Amsterdam, NL

Re: extremely high load average

Post by nottheadmin »

It's a VM, the appliance I downloaded from you. It essentially has its own physical host right now, but yes, the storage is an iSCSI-connected disk pack. The ESX monitoring tools do not paint the same picture of doom as top on the VM does.

We're moving it to less populated hardware now to see if there is any change.
Nope, switching to new hardware did not make any difference.

load average: 23.22, 21.38, 11.94
scottwilkerson wrote:Something is definitely off, as you have high load with low CPU usage and relatively low-looking iowait.

Is this XI server attached to a SAN for disks?

What is the output of these commands?

Code: Select all

df -h
df -i
ls -l /usr/local/nagios/var/spool/perfdata|wc -l
ls -l /usr/local/nagios/var/spool/xidpe|wc -l

Code: Select all

[root@localhost ntp]# df -h
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root  7.5G  5.5G  1.6G  78% /
tmpfs                         1.9G     0  1.9G   0% /dev/shm
/dev/sda1                     485M   50M  410M  11% /boot
[root@localhost ntp]# df -i
Filesystem                   Inodes IUsed  IFree IUse% Mounted on
/dev/mapper/VolGroup-lv_root 494832 88607 406225   18% /
tmpfs                        490494     1 490493    1% /dev/shm
/dev/sda1                    128016    44 127972    1% /boot
[root@localhost ntp]# ls -l /usr/local/nagios/var/spool/perfdata|wc -l
34
[root@localhost ntp]# ls -l /usr/local/nagios/var/spool/xidpe|wc -l
1
Please do not double post; edit your previous post to add more information. Replying to yourself will only bump you lower on our list...
krobertson71
Posts: 444
Joined: Tue Feb 11, 2014 10:16 pm

Re: extremely high load average

Post by krobertson71 »

Just a question...

Are these all SNMP checks by chance?

I am running a Nagios XI demo on a gimpy 1-CPU box with 4 GB of RAM. I am only doing about 100 checks, but this VM is barely showing any slowness whatsoever.

Also, is the Nagios XI interface sluggish in any way? Does it take a long time to, say, pull up a dashboard or view?

Sorry for the general questions, but it seems like one of these things:

1) Assuming you are running a VM solution on dedicated hardware - are you sure you have assigned enough CPUs to the instance?
2) Try offloading MySQL to its own instance and configure NDOUtils to point to it.
3) I do see a lot of httpd instances running. Do you have a lot of dedicated users connected?
4) If the interface is running smoothly, then maybe top is having issues reporting accurately due to the VMware configuration?

Just some ideas.
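On point 4, one way to cross-check what top reports from inside the guest (a sketch; relies on standard Linux /proc files):

```shell
# The load average counts runnable plus uninterruptibly-sleeping tasks;
# compare it against the kernel's instantaneous counters:
cat /proc/loadavg
grep -E '^procs_(running|blocked)' /proc/stat
```

A high load average alongside a large procs_blocked count points at I/O waits rather than CPU starvation, which VMware's CPU graphs would not show.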
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: extremely high load average

Post by slansing »

Excellent questions krobertson71,

In addition, as a side note while you reply: if there are a number of users in the web interface, are they running reports? Reports can be quite intensive depending on the time frame a user is pulling data from.
nottheadmin
Posts: 53
Joined: Thu Dec 19, 2013 9:51 am
Location: Amsterdam, NL

Re: extremely high load average

Post by nottheadmin »

Hi. Well, it has 4 x Intel(R) Xeon(R) X5560 @ 2.80GHz CPUs assigned to it; top on the server shows that it is using them all, and the load average suggests that is nowhere near enough.
However, the ESX server / vSphere VMware monitoring tool shows that it is barely using half of the available CPU resources.

I am the only person using the web interface; I know this seems strange. I usually have 3 or 4 tabs open and that is it.

Nobody is running reports.

There are some SNMP checks but not that many; the majority of the checks are check_wmi_plus. They all seem to kick off at the same time, and that is when the load shoots through the roof. At that point the interface does slow down a bit; it is not exactly snappy at the best of times, but it is usable.

I'm thinking that the checks need to be spread out so they do not all run at the same time.

At the same time I'm looking into converting this to a physical machine, but I would rather not if it can be helped.
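On the check-spreading idea: Nagios Core's nagios.cfg has scheduling knobs for exactly this. A sketch with illustrative values (the directive names are standard Core options, but check your local file before changing anything):

```
# Spread service checks "smartly" across the interval instead of in bursts
service_inter_check_delay_method=s
# Interleave checks across hosts rather than running one host's checks back to back
service_interleave_factor=s
# Cap how many checks may run in parallel (0 = unlimited); tune to the box
max_concurrent_checks=60
```

Restart nagios after editing for the new scheduling to take effect.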
lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: extremely high load average

Post by lmiltchev »

Do you have any errors in the apache error log?

Code: Select all

tail -50 /var/log/httpd/error_log
Let's take a look at the mysqld log as well, just to rule out crashed database tables:

Code: Select all

tail -50 /var/log/mysqld.log
nottheadmin
Posts: 53
Joined: Thu Dec 19, 2013 9:51 am
Location: Amsterdam, NL

Re: extremely high load average

Post by nottheadmin »

There are too many URLs in the httpd error log; the forum won't let me post it.

Code: Select all

[root@localhost ~]# tail -50 /var/log/mysqld.log
140218 10:25:12  InnoDB: Completed initialization of buffer pool
140218 10:25:12  InnoDB: Started; log sequence number 0 44233
140218 10:25:12 [Note] Event Scheduler: Loaded 0 events
140218 10:25:12 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.71'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Source distribution
140219 12:32:16 [Note] /usr/libexec/mysqld: Normal shutdown

140219 12:32:16 [Note] Event Scheduler: Purging the queue. 0 events
140219 12:32:16  InnoDB: Starting shutdown...
140219 12:32:19  InnoDB: Shutdown completed; log sequence number 0 44233
140219 12:32:19 [Note] /usr/libexec/mysqld: Shutdown complete

140219 12:32:19 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
140219 12:33:50 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
140219 12:33:51  InnoDB: Initializing buffer pool, size = 8.0M
140219 12:33:51  InnoDB: Completed initialization of buffer pool
140219 12:33:52  InnoDB: Started; log sequence number 0 44233
140219 12:33:52 [Note] Event Scheduler: Loaded 0 events
140219 12:33:52 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.71'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Source distribution
140219 14:21:46 [Note] /usr/libexec/mysqld: Normal shutdown

140219 14:21:46 [Note] Event Scheduler: Purging the queue. 0 events
140219 14:21:48  InnoDB: Starting shutdown...
140219 14:21:48  InnoDB: Shutdown completed; log sequence number 0 44233
140219 14:21:48 [Note] /usr/libexec/mysqld: Shutdown complete

140219 14:21:48 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
140219 14:23:11 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
140219 14:23:11  InnoDB: Initializing buffer pool, size = 8.0M
140219 14:23:11  InnoDB: Completed initialization of buffer pool
140219 14:23:12  InnoDB: Started; log sequence number 0 44233
140219 14:23:12 [Note] Event Scheduler: Loaded 0 events
140219 14:23:12 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.71'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Source distribution
140313 15:56:27 [Note] /usr/libexec/mysqld: Normal shutdown

140313 15:56:27 [Note] Event Scheduler: Purging the queue. 0 events
140313 15:56:28  InnoDB: Starting shutdown...
140313 15:56:35  InnoDB: Shutdown completed; log sequence number 0 44233
140313 15:56:35 [Note] /usr/libexec/mysqld: Shutdown complete

140313 15:56:36 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
140313 15:58:11 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
140313 15:58:12  InnoDB: Initializing buffer pool, size = 8.0M
140313 15:58:12  InnoDB: Completed initialization of buffer pool
140313 15:58:13  InnoDB: Started; log sequence number 0 44233
140313 15:58:13 [Note] Event Scheduler: Loaded 0 events
140313 15:58:13 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.71'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Source distribution
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: extremely high load average

Post by abrist »

nottheadmin wrote:There are too many URLs in the httpd error log; the forum won't let me post it.
You should be able to attach the httpd error log file to your post instead of pasting it in.
1) Do you have a large number of failing checks? I ask because this can spike load if they are failing with timeouts, as the nagios forks will stay open until the timeout is reached.
2) How often are your checks run?
3) Also check /var/log/messages for fork or orphan errors.
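For (3), a quick search along these lines might help (a sketch; log path as on a stock RHEL/CentOS syslog setup):

```shell
# Case-insensitive search of the syslog for fork/orphan complaints;
# show only the most recent 20 hits.
grep -Ei 'fork|orphan' /var/log/messages | tail -n 20
```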
Former Nagios employee
nottheadmin
Posts: 53
Joined: Thu Dec 19, 2013 9:51 am
Location: Amsterdam, NL

Re: extremely high load average

Post by nottheadmin »

Sorry for the delay; I had a few days off work. The httpd error log is attached.

Well, I do have a fair volume of warnings from check_wmi_plus on account of hard disks filling up, but it's not usually more than 4 or 5 at a time.

One thing that I am about to try is to delete the individual checks per host and apply a single check/command to the hostgroup instead. I'm hoping that this will not set off 140 instances of check_wmi at the same time. I did this with one of the smaller hostgroups last week, and the load average is lower.

I'll do it with the largest hostgroup and see what happens.

Checks are run on the default 5 minute cycle.

There is no mention of fork or orphan errors in /var/log/messages.