extremely high load average

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
nottheadmin
Posts: 53
Joined: Thu Dec 19, 2013 9:51 am
Location: Amsterdam, NL

extremely high load average

Post by nottheadmin »

Well, I've just seen my highest-ever load average of 48. I didn't capture top at that point, but I'm concerned because I've got quite a bit of CPU power available to me and I'm not really monitoring that many hosts.

Can somebody please take a look and let me know what I can do to make better use of the hardware? At this rate I'm going to need to somehow persuade the customer site to purchase an earth simulator to run it all on :)

I do notice a lot of the checks run at pretty much the same time. Maybe there is a way to automatically reschedule the checks to space them out a bit?

Nagios XI 2012R2.8c
141 hosts
799 checks

I'm not 100% sure, but I am highly suspicious that I'm getting regular false positives because some of the scripts are not completing due to the extremely high load.

Code: Select all

[root@localhost ~]# uname -a
Linux localhost.localdomain 2.6.32-358.2.1.el6.x86_64 #1 SMP Wed Mar 13 00:26:49 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

Code: Select all

top - 16:06:44 up 11 days, 9 min,  1 user,  load average: 24.22, 14.27, 10.65
Tasks: 204 total,   1 running, 203 sleeping,   0 stopped,   0 zombie
Cpu(s):  8.6%us,  1.5%sy,  0.0%ni, 89.4%id,  0.4%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:   3923956k total,  1386432k used,  2537524k free,    62580k buffers
Swap:  2064376k total,    21216k used,  2043160k free,   317276k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 4711 apache    20   0  439m  31m 4488 S  3.7  0.8   0:13.93 httpd
13750 apache    20   0  432m  23m 4664 S  3.7  0.6   0:35.82 httpd
23625 apache    20   0  432m  24m 4668 S  3.7  0.6   3:30.46 httpd
11138 apache    20   0  439m  31m 4692 S  3.3  0.8   5:15.51 httpd
15506 apache    20   0  439m  31m 4668 S  3.3  0.8   5:14.99 httpd
20040 apache    20   0  439m  30m 4232 S  3.3  0.8   0:00.35 httpd
27970 apache    20   0  431m  23m 4660 S  3.3  0.6   3:52.59 httpd
 4727 apache    20   0  440m  31m 4652 S  3.0  0.8   0:16.53 httpd
19704 apache    20   0  439m  31m 4660 S  3.0  0.8   5:10.13 httpd
 1576 mysql     20   0 2180m  65m 4104 S  0.7  1.7  81:49.00 mysqld
13676 nagios    20   0 34708 3856 1248 S  0.7  0.1   0:01.68 nagios
   20 root      20   0     0    0    0 S  0.3  0.0   5:30.86 events/1
10005 apache    20   0  432m  23m 4672 S  0.3  0.6   5:30.51 httpd
11137 apache    20   0  431m  23m 4680 S  0.3  0.6   5:26.62 httpd
11473 postgres  20   0  210m 6628 4260 S  0.3  0.2   0:08.37 postmaster
14281 postgres  20   0  210m 6604 4224 S  0.3  0.2   0:00.85 postmaster
18548 postgres  20   0  210m 5672 3476 S  0.3  0.1   0:00.01 postmaster
20107 postgres  20   0  210m 6100 3756 S  0.3  0.2   0:08.28 postmaster
20118 postgres  20   0  210m 5736 3464 S  0.3  0.1   0:00.01 postmaster
28161 postgres  20   0  210m 6860 4472 S  0.3  0.2   0:05.84 postmaster
    1 root      20   0 19232 1104  892 S  0.0  0.0   0:01.98 init
    2 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kthreadd
    3 root      RT   0     0    0    0 S  0.0  0.0   0:22.67 migration/0
    4 root      20   0     0    0    0 S  0.0  0.0   0:04.70 ksoftirqd/0
    5 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/0
    6 root      RT   0     0    0    0 S  0.0  0.0   0:03.07 watchdog/0
    7 root      RT   0     0    0    0 S  0.0  0.0   0:24.54 migration/1
    8 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/1
    9 root      20   0     0    0    0 S  0.0  0.0   0:04.85 ksoftirqd/1
   10 root      RT   0     0    0    0 S  0.0  0.0   0:03.63 watchdog/1
   11 root      RT   0     0    0    0 S  0.0  0.0   0:24.68 migration/2
   12 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/2
   13 root      20   0     0    0    0 S  0.0  0.0   0:04.26 ksoftirqd/2
   14 root      RT   0     0    0    0 S  0.0  0.0   0:02.89 watchdog/2
   15 root      RT   0     0    0    0 S  0.0  0.0   0:22.83 migration/3
   16 root      RT   0     0    0    0 S  0.0  0.0   0:00.00 migration/3
   17 root      20   0     0    0    0 S  0.0  0.0   0:03.91 ksoftirqd/3
   18 root      RT   0     0    0    0 S  0.0  0.0   0:03.58 watchdog/3
   19 root      20   0     0    0    0 S  0.0  0.0   0:50.56 events/0
   21 root      20   0     0    0    0 S  0.0  0.0   0:54.42 events/2
   22 root      20   0     0    0    0 S  0.0  0.0   1:30.12 events/3
   23 root      20   0     0    0    0 S  0.0  0.0   0:00.00 cgroup
   24 root      20   0     0    0    0 S  0.0  0.0   0:00.00 khelper
   25 root      20   0     0    0    0 S  0.0  0.0   0:00.00 netns
   26 root      20   0     0    0    0 S  0.0  0.0   0:00.00 async/mgr
   27 root      20   0     0    0    0 S  0.0  0.0   0:00.00 pm
   28 root      20   0     0    0    0 S  0.0  0.0   0:03.15 sync_supers
   29 root      20   0     0    0    0 S  0.0  0.0   0:04.23 bdi-default
   30 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kintegrityd/0
   31 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kintegrityd/1
   32 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kintegrityd/2
   33 root      20   0     0    0    0 S  0.0  0.0   0:00.00 kintegrityd/3
   34 root      20   0     0    0    0 S  0.0  0.0   1:41.91 kblockd/0

Code: Select all

[root@localhost ~]# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 26
model name      : Intel(R) Xeon(R) CPU           X5560  @ 2.80GHz
stepping        : 4
cpu MHz         : 2800.099
cache size      : 8192 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts xtopology tsc_reliable nonstop_tsc aperfmperf unfair_spinlock pni ssse3 cx16 sse4_1 sse4_2 popcnt hypervisor lahf_lm ida dts
bogomips        : 5600.19
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 26
model name      : Intel(R) Xeon(R) CPU           X5560  @ 2.80GHz
stepping        : 4
cpu MHz         : 2800.099
cache size      : 8192 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts xtopology tsc_reliable nonstop_tsc aperfmperf unfair_spinlock pni ssse3 cx16 sse4_1 sse4_2 popcnt hypervisor lahf_lm ida dts
bogomips        : 5600.19
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 2
vendor_id       : GenuineIntel
cpu family      : 6
model           : 26
model name      : Intel(R) Xeon(R) CPU           X5560  @ 2.80GHz
stepping        : 4
cpu MHz         : 2800.099
cache size      : 8192 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts xtopology tsc_reliable nonstop_tsc aperfmperf unfair_spinlock pni ssse3 cx16 sse4_1 sse4_2 popcnt hypervisor lahf_lm ida dts
bogomips        : 5600.19
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

processor       : 3
vendor_id       : GenuineIntel
cpu family      : 6
model           : 26
model name      : Intel(R) Xeon(R) CPU           X5560  @ 2.80GHz
stepping        : 4
cpu MHz         : 2800.099
cache size      : 8192 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts xtopology tsc_reliable nonstop_tsc aperfmperf unfair_spinlock pni ssse3 cx16 sse4_1 sse4_2 popcnt hypervisor lahf_lm ida dts
bogomips        : 5600.19
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

Code: Select all

Server Statistics

Load
  1-min     32.05
  5-min     14.26
  15-min    11.32

CPU Stats
  User      11.87%
  Nice       0.00%
  System     1.65%
  I/O Wait   0.40%
  Steal      0.00%
  Idle      86.08%

Memory
  Total     3831 MB
  Used      1375 MB
  Free      2456 MB
  Shared       0 MB
  Buffers     61 MB
  Cached     315 MB

Swap
  Total     2015 MB
  Used        20 MB
  Free      1995 MB
Please do not double post; edit your previous post to add more information. Double posting will only bump you lower on our reply list.
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: extremely high load average

Post by scottwilkerson »

Something is definitely off, as you have high load with low CPU usage and relatively low-looking iowait.

Is this XI server attached to a SAN for disks?

What is the output of these commands?

Code: Select all

df -h
df -i
ls -l /usr/local/nagios/var/spool/perfdata|wc -l
ls -l /usr/local/nagios/var/spool/xidpe|wc -l
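High load combined with low CPU and low iowait is often caused by processes stuck in uninterruptible sleep (state D), which count toward the load average without using any CPU. A quick way to list them (a sketch; assumes Linux procps):

```shell
# List processes in uninterruptible sleep ("D"); they inflate the load
# average while consuming no CPU -- a classic sign of stalled storage.
ps -eo state,pid,comm | awk '$1 ~ /^D/'
```

If this shows check plugins or httpd workers piling up in D state, the storage back end is the first suspect.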
Former Nagios employee
nottheadmin
Posts: 53
Joined: Thu Dec 19, 2013 9:51 am
Location: Amsterdam, NL

Re: extremely high load average

Post by nottheadmin »

It's a VM, the appliance I downloaded from you. It essentially has its own physical host right now, but yes, the storage is an iSCSI-connected disk pack. The ESX monitoring tools do not paint the same picture of doom as top on the VM does.

We're moving it to less populated hardware now to see if there is any change.
Nope, switching to new hardware did not make any difference.

load average: 23.22, 21.38, 11.94
scottwilkerson wrote:Something is definitely off, as you have high load with low CPU usage and relatively low-looking iowait.

Is this XI server attached to a SAN for disks?

What is the output of these commands?

Code: Select all

df -h
df -i
ls -l /usr/local/nagios/var/spool/perfdata|wc -l
ls -l /usr/local/nagios/var/spool/xidpe|wc -l

Code: Select all

[root@localhost ntp]# df -h
Filesystem                    Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root  7.5G  5.5G  1.6G  78% /
tmpfs                         1.9G     0  1.9G   0% /dev/shm
/dev/sda1                     485M   50M  410M  11% /boot
[root@localhost ntp]# df -i
Filesystem                   Inodes IUsed  IFree IUse% Mounted on
/dev/mapper/VolGroup-lv_root 494832 88607 406225   18% /
tmpfs                        490494     1 490493    1% /dev/shm
/dev/sda1                    128016    44 127972    1% /boot
[root@localhost ntp]# ls -l /usr/local/nagios/var/spool/perfdata|wc -l
34
[root@localhost ntp]# ls -l /usr/local/nagios/var/spool/xidpe|wc -l
1
Please do not double post; edit your previous post to add more information. Replying to yourself will only bump you lower on our list...
krobertson71
Posts: 444
Joined: Tue Feb 11, 2014 10:16 pm

Re: extremely high load average

Post by krobertson71 »

Just a question...

Are these all SNMP checks by chance?

I am running a Nagios XI demo on a gimpy 1-CPU box with 4 GB of RAM. I am only doing about 100 checks, but this VM is barely showing any slowness whatsoever.

Also, is the Nagios XI interface sluggish in any way? Does it take a long time to, say, pull up a dashboard or view?

Sorry for the general questions, but it seems like one of these things:

1) Assuming you are running a VM solution on dedicated hardware - are you sure you have assigned enough CPUs to the instance?
2) Try offloading MySQL to its own instance and configure NDOUtils to point to it.
3) I do see a lot of httpd instances running. Do you have a lot of dedicated users connected?
4) If the interface is running smoothly, then maybe top is having issues reporting accurately due to the VMware configuration?

Just some ideas.
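On point 4, one way to cross-check what top reports from inside the guest (a sketch; relies on standard Linux /proc files):

```shell
# The load average counts runnable plus uninterruptibly-sleeping tasks;
# compare it against the kernel's instantaneous counters:
cat /proc/loadavg
grep -E '^procs_(running|blocked)' /proc/stat
```

A high load average alongside a large procs_blocked count points at I/O waits rather than CPU starvation, which VMware's CPU graphs would not show.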
slansing
Posts: 7698
Joined: Mon Apr 23, 2012 4:28 pm
Location: Travelling through time and space...

Re: extremely high load average

Post by slansing »

Excellent questions krobertson71,

In addition, as a side note while you reply: if there are a number of users in the web interface, are they running reports? Reports can be quite intensive depending on the time frame a user is pulling data from.
nottheadmin
Posts: 53
Joined: Thu Dec 19, 2013 9:51 am
Location: Amsterdam, NL

Re: extremely high load average

Post by nottheadmin »

Hi. Well, it has 4 x Intel(R) Xeon(R) X5560 @ 2.80GHz CPUs assigned to it; top on the server shows that it is using them all, and the load average suggests that is nowhere near enough.
However, the ESX server / vSphere VMware monitoring tool shows that it is barely using half of the available CPU resources.

I am the only person using the web interface; I know this seems strange. I usually have 3 or 4 tabs open and that is it.

Nobody is running reports.

There are some SNMP checks but not that many; the majority of the checks are check_wmi_plus. They all seem to kick off at the same time, and that is when the load shoots through the roof. At that point the interface does slow down a bit; it is not exactly snappy at the best of times, but it is usable.

I'm thinking that the checks need to be spread out so they do not all run at the same time.

At the same time I'm looking into converting this to a physical machine, but I would rather not if it can be helped.
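On the check-spreading idea: Nagios Core's nagios.cfg has scheduling knobs for exactly this. A sketch with illustrative values (the directive names are standard Core options, but check your local file before changing anything):

```
# Spread service checks "smartly" across the interval instead of in bursts
service_inter_check_delay_method=s
# Interleave checks across hosts rather than running one host's checks back to back
service_interleave_factor=s
# Cap how many checks may run in parallel (0 = unlimited); tune to the box
max_concurrent_checks=60
```

Restart nagios after editing for the new scheduling to take effect.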
lmiltchev
Bugs find me
Posts: 13589
Joined: Mon May 23, 2011 12:15 pm

Re: extremely high load average

Post by lmiltchev »

Do you have any errors in the apache error log?

Code: Select all

tail -50 /var/log/httpd/error_log
Let's take a look at the mysqld log as well, just to rule out crashed database tables:

Code: Select all

tail -50 /var/log/mysqld.log
nottheadmin
Posts: 53
Joined: Thu Dec 19, 2013 9:51 am
Location: Amsterdam, NL

Re: extremely high load average

Post by nottheadmin »

There are too many URLs in the httpd error log; the forum won't let me post it.

Code: Select all

[root@localhost ~]# tail -50 /var/log/mysqld.log
140218 10:25:12  InnoDB: Completed initialization of buffer pool
140218 10:25:12  InnoDB: Started; log sequence number 0 44233
140218 10:25:12 [Note] Event Scheduler: Loaded 0 events
140218 10:25:12 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.71'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Source distribution
140219 12:32:16 [Note] /usr/libexec/mysqld: Normal shutdown

140219 12:32:16 [Note] Event Scheduler: Purging the queue. 0 events
140219 12:32:16  InnoDB: Starting shutdown...
140219 12:32:19  InnoDB: Shutdown completed; log sequence number 0 44233
140219 12:32:19 [Note] /usr/libexec/mysqld: Shutdown complete

140219 12:32:19 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
140219 12:33:50 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
140219 12:33:51  InnoDB: Initializing buffer pool, size = 8.0M
140219 12:33:51  InnoDB: Completed initialization of buffer pool
140219 12:33:52  InnoDB: Started; log sequence number 0 44233
140219 12:33:52 [Note] Event Scheduler: Loaded 0 events
140219 12:33:52 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.71'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Source distribution
140219 14:21:46 [Note] /usr/libexec/mysqld: Normal shutdown

140219 14:21:46 [Note] Event Scheduler: Purging the queue. 0 events
140219 14:21:48  InnoDB: Starting shutdown...
140219 14:21:48  InnoDB: Shutdown completed; log sequence number 0 44233
140219 14:21:48 [Note] /usr/libexec/mysqld: Shutdown complete

140219 14:21:48 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
140219 14:23:11 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
140219 14:23:11  InnoDB: Initializing buffer pool, size = 8.0M
140219 14:23:11  InnoDB: Completed initialization of buffer pool
140219 14:23:12  InnoDB: Started; log sequence number 0 44233
140219 14:23:12 [Note] Event Scheduler: Loaded 0 events
140219 14:23:12 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.71'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Source distribution
140313 15:56:27 [Note] /usr/libexec/mysqld: Normal shutdown

140313 15:56:27 [Note] Event Scheduler: Purging the queue. 0 events
140313 15:56:28  InnoDB: Starting shutdown...
140313 15:56:35  InnoDB: Shutdown completed; log sequence number 0 44233
140313 15:56:35 [Note] /usr/libexec/mysqld: Shutdown complete

140313 15:56:36 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
140313 15:58:11 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
140313 15:58:12  InnoDB: Initializing buffer pool, size = 8.0M
140313 15:58:12  InnoDB: Completed initialization of buffer pool
140313 15:58:13  InnoDB: Started; log sequence number 0 44233
140313 15:58:13 [Note] Event Scheduler: Loaded 0 events
140313 15:58:13 [Note] /usr/libexec/mysqld: ready for connections.
Version: '5.1.71'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  Source distribution
abrist
Red Shirt
Posts: 8334
Joined: Thu Nov 15, 2012 1:20 pm

Re: extremely high load average

Post by abrist »

nottheadmin wrote:There are too many URLs in the httpd error log; the forum won't let me post it.
You should be able to attach the httpd error log file to your post instead of pasting it in.
1) Do you have a large number of failing checks? I ask because this can spike load if they are failing with timeouts, as the nagios forks will stay open until the timeout is reached.
2) How often are your checks run?
3) Also check /var/log/messages for fork or orphan errors.
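For (3), a quick search along these lines might help (a sketch; log path as on a stock RHEL/CentOS syslog setup):

```shell
# Case-insensitive search of the syslog for fork/orphan complaints;
# show only the most recent 20 hits.
grep -Ei 'fork|orphan' /var/log/messages | tail -n 20
```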
Former Nagios employee
nottheadmin
Posts: 53
Joined: Thu Dec 19, 2013 9:51 am
Location: Amsterdam, NL

Re: extremely high load average

Post by nottheadmin »

Sorry for the delay; I had a few days off work. The httpd error log is attached.

Well, I do have a fair volume of warnings from check_wmi_plus on account of hard disks filling up, but it's not usually more than 4 or 5 at a time.

One thing that I am about to try is to delete the individual checks per host and apply a single check/command to the hostgroup instead. I'm hoping that this will not set off 140 instances of check_wmi at the same time. I did this with one of the smaller hostgroups last week, and the load average is lower.

I'll do it with the largest hostgroup and see what happens.

Checks are run on the default 5 minute cycle.

There is no mention of fork or orphan errors in /var/log/messages.