Re: extremely high load average
Posted: Fri Apr 04, 2014 11:10 am
by abrist
Let us know how it works out. You have an error in the log:
Code: Select all
/usr/local/nagios/libexec/check_smb_share: line 8: /usr/bin/basename: No such file or directory
I would presume your Samba checks will not work until you resolve that error. First, check whether basename lives at a different path on your system; something like this should show where it actually is:
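Code: Select all
# a quick sketch -- confirm where basename actually lives on this box
which basename
ls -l /usr/bin/basename /bin/basename
If it turns out to live at /bin/basename, point line 8 of check_smb_share at that path instead.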
Re: extremely high load average
Posted: Mon Apr 07, 2014 10:32 am
by nottheadmin
Yeah, I know about that. It's actually at /bin/basename, and I have modified the script to reflect that, but it still logs the error. The script runs fine though, so I'm not bothered about it right now, but thanks for pointing it out.
OK, so I went ahead and deleted several hundred individual wmi_plus checks, then created five or six new checks and applied them to a host group: my largest, about 62 hosts with 8 wmi_plus checks each.
Unfortunately it has not helped:
load average: 33.68, 19.44, 16.58
I realise that it is probably best to run this on a physical machine, but wow, how can a few WMI and SNMP scripts create this much load?
Re: extremely high load average
Posted: Mon Apr 07, 2014 10:42 am
by abrist
What interval are you running these WMI checks at? If you disable the new WMI checks and wait 15 minutes, what are the 1, 5, and 15 minute load averages?
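If you are not sure where the interval is set, grepping the object configuration should show every interval directive; a rough sketch, assuming the default /usr/local/nagios install path:
Code: Select all
# list every configured check interval (matches normal_check_interval too)
grep -rn "check_interval" /usr/local/nagios/etc/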
Re: extremely high load average
Posted: Tue Apr 15, 2014 5:02 am
by nottheadmin
I use the default 5-minute check interval, with a retry after 1 minute.
I sat and watched top and iotop side by side for a while. Every 5 minutes, top fills up with check_wmi_plus processes and the load average shoots up to anywhere between 25 and 48. iotop looked fairly quiet, writing at only about 250 K/s and reading at 0 K/s.
I don't think this is a disk I/O problem.
Is it really necessary for a separate check_wmi_plus process to run for each check on each host? I must have something in the region of 800 check_wmi_plus scripts running at roughly the same time every 5 minutes (a rough way to count them is sketched below).
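Here is a rough, untested sketch for counting them as the spike hits, in case anyone wants to reproduce this on their own box:
Code: Select all
# sample the concurrent check_wmi_plus count and 1-minute load once a second
for i in $(seq 1 60); do
    echo "$(date +%T) procs=$(pgrep -fc check_wmi_plus) load=$(cut -d' ' -f1 /proc/loadavg)"
    sleep 1
done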
I can't be the only one having this problem, can I? Or have I committed a schoolboy configuration error?
atop is revealing that check_wmi_plus uses between 1% and 2% of the CPU per instance. If the several hundred instances that launch at the same time use similar amounts, that would explain the massive load: 800 instances at even 1% each want 8 cores' worth of CPU, and on a 4-core box everything that cannot run just piles up in the run queue.
Code: Select all
ATOP - localhost 2014/04/15 12:34:23 --------- 10s elapsed
PRC | sys 0.20s | user 1.69s | #proc 210 | #tslpi 270 | #tslpu 0 | #zombie 0 | #exit 64 |
CPU | sys 3% | user 19% | irq 0% | idle 377% | wait 1% | curf 2.80GHz | curscal ?% |
cpu | sys 1% | user 7% | irq 0% | idle 91% | cpu000 w 0% | curf 2.80GHz | curscal ?% |
cpu | sys 1% | user 5% | irq 0% | idle 94% | cpu002 w 0% | curf 2.80GHz | curscal ?% |
cpu | sys 1% | user 5% | irq 0% | idle 94% | cpu001 w 0% | curf 2.80GHz | curscal ?% |
cpu | sys 0% | user 2% | irq 0% | idle 98% | cpu003 w 0% | curf 2.80GHz | curscal ?% |
CPL | avg1 16.61 | avg5 11.80 | avg15 10.56 | csw 2778 | intr 5217 | | numcpu 4 |
MEM | tot 3.7G | free 2.6G | cache 354.7M | dirty 1.0M | buff 53.7M | slab 191.5M | |
SWP | tot 2.0G | free 1.9G | | | | vmcom 1.1G | vmlim 3.8G |
LVM | roup-lv_root | busy 2% | read 0 | write 598 | MBr/s 0.00 | MBw/s 0.23 | avio 0.37 ms |
DSK | sda | busy 2% | read 0 | write 70 | MBr/s 0.00 | MBw/s 0.23 | avio 3.20 ms |
NET | transport | tcpi 464 | tcpo 543 | udpi 18 | udpo 18 | tcpao 43 | tcppo 0 |
NET | network | ipi 518 | ipo 570 | ipfrw 0 | deliv 482 | icmpi 0 | icmpo 0 |
NET | eth0 0% | pcki 597 | pcko 557 | si 205 Kbps | so 82 Kbps | erri 0 | erro 0 |
NET | lo ---- | pcki 18 | pcko 18 | si 5 Kbps | so 5 Kbps | erri 0 | erro 0 |
PID TID RUID EUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/3
32044 - nagios - 0 0.01s 0.18s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
32046 - nagios - 0 0.01s 0.17s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
32058 - nagios - 0 0.01s 0.17s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
32075 - nagios - 0 0.01s 0.17s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
32042 - nagios - 0 0.01s 0.16s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
32054 - nagios - 0 0.01s 0.16s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
32073 - nagios - 0 0.00s 0.17s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
32025 - nagios - 0 0.01s 0.12s 0K 0K - - -E 0 E - 1% <check_wmi_pl>
32026 - nagios - 0 0.00s 0.09s 0K 0K - - -E 0 E - 1% <check_wmi_pl>
32027 - nagios - 0 0.01s 0.08s 0K 0K - - -E 0 E - 1% <check_wmi_pl>
32019 - nagios - 0 0.01s 0.06s 0K 0K - - -E 0 E - 1% <check_wmi_pl>
32021 - nagios - 0 0.01s 0.06s 0K 0K - - -E 0 E - 1% <check_wmi_pl>
32017 - nagios - 0 0.00s 0.05s 0K 0K - - -E 0 E - 1% <check_wmi_pl>
19593 - root root 1 0.03s 0.01s 0K 0K 0K 0K -- - R 0 0% atop
8133 - nagios nagios 2 0.02s 0.01s 0K 0K 0K 1728K -- - S 1 0% nagios
22562 - root root 1 0.01s 0.01s 0K 0K 0K 0K -- - S 2 0% top
1576 - mysql mysql 56 0.00s 0.01s 0K 0K 0K 48K -- - S 3 0% mysqld
1749 - ajaxterm ajaxterm 2 0.00s 0.01s 0K 0K 0K 0K -- - S 2 0% python
7 - root root 1 0.01s 0.00s 0K 0K 0K 0K -- - S 1 0% migration/1
18 - root root 1 0.01s 0.00s 0K 0K 0K 0K -- - S 3 0% watchdog/3
20 - root root 1 0.01s 0.00s 0K 0K 0K 0K -- - S 1 0% events/1
22 - root root 1 0.01s 0.00s 0K 0K 0K 0K -- - S 3 0% events/3
31684 - nagios nagios 1 0.00s 0.00s 0K 0K 0K 0K -- - S 2 0% php
31807 - postgres postgres 1 0.00s 0.00s 0K 0K 0K 32K -- - S 3 0% postmaster
31717 - postgres postgres 1 0.00s 0.00s 0K 0K 0K 80K -- - S 3 0% postmaster
31718 - postgres postgres 1 0.00s 0.00s 0K 12K 0K 8K -- - S 2 0% postmaster
8576 - root root 1 0.00s 0.00s 0K 0K 0K 0K -- - S 0 0% sshd
32083 - nagios nagios 1 0.00s 0.00s 185.3M 4396K 0K 0K N- - S 0 0% smbclient
32066 - nagios nagios 1 0.00s 0.00s 185.3M 4392K 0K 0K N- - S 0 0% smbclient
32059 - nagios nagios 1 0.00s 0.00s 34976K 3120K 0K 0K N- - S 0 0% nagios
32076 - nagios nagios 1 0.00s 0.00s 34976K 3120K 0K 0K N- - S 3 0% nagios
32060 - nagios nagios 1 0.00s 0.00s 103.6M 1376K 0K 0K N- - S 1 0% check_smb_shar
32077 - nagios nagios 1 0.00s 0.00s 103.6M 1376K 0K 0K N- - S 2 0% check_smb_shar
32065 - nagios nagios 1 0.00s 0.00s 103.6M 588K 0K 0K N- - S 2 0% check_smb_shar
32082 - nagios nagios 1 0.00s 0.00s 103.6M 588K 0K 0K N- - S 3 0% check_smb_shar
373 - root root 1 0.00s 0.00s 0K 0K 0K 48K -- - S 0 0% jbd2/dm-0-8
934 - root root 1 0.00s 0.00s 0K 0K 0K 28K -- - S 0 0% flush-253:0
32029 - nagios - 0 0.00s 0.00s 0K 0K - - NE 0 E - 0% <wmic>
Here is atop's output during a load of 24
Code: Select all
PRC | sys 1.56s | user 38.75s | #proc 476 | #tslpu 0 | #zombie 2 | #exit 531 |
CPU | sys 32% | user 367% | irq 1% | idle 0% | wait 0% | curscal ?% |
cpu | sys 10% | user 89% | irq 1% | idle 0% | cpu000 w 0% | curscal ?% |
cpu | sys 8% | user 92% | irq 0% | idle 0% | cpu001 w 0% | curscal ?% |
cpu | sys 7% | user 93% | irq 0% | idle 0% | cpu002 w 0% | curscal ?% |
cpu | sys 7% | user 93% | irq 0% | idle 0% | cpu003 w 0% | curscal ?% |
CPL | avg1 26.43 | avg5 12.50 | avg15 10.63 | csw 21217 | intr 47509 | numcpu 4 |
MEM | tot 3.7G | free 1.4G | cache 353.0M | dirty 2.8M | buff 54.1M | slab 198.9M |
SWP | tot 2.0G | free 1.9G | | | vmcom 3.9G | vmlim 3.8G |
LVM | roup-lv_root | busy 6% | read 0 | write 415 | MBw/s 0.16 | avio 1.51 ms |
DSK | sda | busy 6% | read 0 | write 232 | MBw/s 0.16 | avio 2.71 ms |
NET | transport | tcpi 4400 | tcpo 5184 | udpi 70 | udpo 74 | tcpao 386 |
NET | network | ipi 4499 | ipo 5306 | ipfrw 0 | deliv 4487 | icmpo 17 |
NET | eth0 0% | pcki 4506 | pcko 5265 | si 1270 Kbps | so 673 Kbps | erro 0 |
NET | lo ---- | pcki 85 | pcko 85 | si 20 Kbps | so 20 Kbps | erro 0 |
PID TID RUID THR SYSCPU USRCPU VGROW RGROW RDDSK WRDSK ST EXC S CPUNR CPU CMD 1/22
17183 - apache 1 0.01s 0.24s 0K 76K 0K 0K -- - R 0 2% httpd
30258 - apache 1 0.02s 0.21s 2452K 2520K 0K 0K -- - R 0 2% httpd
3084 - nagios 0 0.02s 0.21s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
3086 - nagios 0 0.01s 0.22s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
3090 - nagios 0 0.01s 0.21s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
3109 - nagios 0 0.01s 0.21s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
3115 - nagios 0 0.01s 0.21s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
3077 - nagios 0 0.01s 0.21s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
3106 - nagios 0 0.01s 0.19s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
3113 - nagios 0 0.01s 0.18s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
3110 - nagios 0 0.01s 0.18s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
3111 - nagios 0 0.01s 0.18s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
3117 - nagios 0 0.01s 0.18s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
3116 - nagios 0 0.01s 0.18s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
3105 - nagios 0 0.01s 0.18s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
3108 - nagios 0 0.01s 0.18s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
3096 - nagios 0 0.00s 0.19s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
3112 - nagios 0 0.00s 0.19s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
3107 - nagios 0 0.00s 0.19s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
3114 - nagios 0 0.00s 0.19s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
3093 - nagios 0 0.01s 0.18s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
3092 - nagios 0 0.01s 0.18s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
3118 - nagios 0 0.00s 0.19s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
3165 - nagios 0 0.01s 0.18s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
3152 - nagios 0 0.01s 0.18s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
3151 - nagios 0 0.01s 0.18s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
3182 - nagios 0 0.01s 0.18s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
3193 - nagios 0 0.01s 0.18s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
3150 - nagios 0 0.01s 0.18s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
3190 - nagios 0 0.01s 0.18s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
3233 - nagios 0 0.01s 0.18s 0K 0K - - -E 0 E - 2% <check_wmi_pl>
3075 - nagios 0 0.01s 0.17s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
3091 - nagios 0 0.00s 0.18s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
3094 - nagios 0 0.00s 0.18s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
3181 - nagios 0 0.01s 0.17s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
3197 - nagios 0 0.01s 0.17s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
3160 - nagios 0 0.01s 0.17s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
3154 - nagios 0 0.00s 0.18s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
3192 - nagios 0 0.01s 0.17s 0K 0K - - NE 0 E - 2% <check_wmi_pl>
Code: Select all
[root@localhost ~]# tail -f /usr/local/nagios/var/npcd.log
[04-15-2014 12:56:57] NPCD: WARN: MAX load reached: load 11.710000/10.000000 at i=0
[04-15-2014 12:57:58] NPCD: WARN: MAX load reached: load 17.280000/10.000000 at i=0
[04-15-2014 12:58:13] NPCD: WARN: MAX load reached: load 14.260000/10.000000 at i=1
[04-15-2014 12:58:28] NPCD: WARN: MAX load reached: load 11.100000/10.000000 at i=1
[04-15-2014 12:58:43] NPCD: WARN: MAX load reached: load 30.990000/10.000000 at i=1
[04-15-2014 12:58:58] NPCD: WARN: MAX load reached: load 24.120000/10.000000 at i=1
[04-15-2014 12:59:13] NPCD: WARN: MAX load reached: load 24.070000/10.000000 at i=1
[04-15-2014 12:59:28] NPCD: WARN: MAX load reached: load 18.800000/10.000000 at i=1
[04-15-2014 12:59:43] NPCD: WARN: MAX load reached: load 14.640000/10.000000 at i=1
[04-15-2014 12:59:58] NPCD: WARN: MAX load reached: load 11.390000/10.000000 at i=1
Please do not double post; if you are the last poster, edit your previous post to add your information. Adding additional posts will only serve to push you lower on our "to-be-replied-to" list.
Re: extremely high load average
Posted: Tue Apr 15, 2014 5:12 pm
by tmcdonald
WMI is a pretty heavy beast, and 800 is on the higher end of the spectrum for the number of WMI checks we usually see. Have you considered implementing any Gearman workers or anything like that?
As for spacing the checks out, I know there are some nagios.cfg-level options for interleaving the checks, but I do not know enough about them at a low level to make any firm recommendations.
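From memory the relevant directives look something like the following; treat the values as illustrative and check the documentation before changing anything:
Code: Select all
# nagios.cfg -- spread checks out instead of firing them in bursts
# "smart" inter-check delay spreads initial checks across the whole interval
service_inter_check_delay_method=s
# "smart" interleaving alternates checks between hosts rather than
# hammering one host's services back to back
service_interleave_factor=s
# cap the number of plugins running in parallel (0 means unlimited)
max_concurrent_checks=0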
Re: extremely high load average
Posted: Wed Apr 16, 2014 12:49 am
by Box293
Evaluate each type of check to see if a 5 minute interval is necessary.
For example, a disk space check might only need to occur once every hour.
Also, try to use odd-numbered intervals.
For example, a 60-minute interval is likely to coincide with the 5-minute checks once an hour. Try 57 minutes or 62 minutes instead; this spreads checks on different intervals apart so they do not all fire at once.
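For an hourly disk check the change is a single directive in the service definition. A sketch only; the host, service, and command names here are examples, not your config:
Code: Select all
define service {
    use                     generic-service   ; example template
    host_name               winserver01       ; example host
    service_description     Disk Space C:
    check_command           your_disk_check   ; whatever command you already use
    check_interval          57                ; minutes, with the default interval_length of 60
    retry_interval          1
    }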
Re: extremely high load average
Posted: Wed Apr 16, 2014 4:27 am
by nottheadmin
Hi, thanks for the suggestions. I was able to add 4 more cores and an extra 12 GB of RAM to the VM today; it made no difference whatsoever though.
I'm currently on 8 × 2.8 GHz Xeon cores with 16 GB of RAM. The load average just peaked at 75.
So I looked at the checks in more detail, and yes, you're right: I don't really need to check disk space every 5 minutes, so I changed that to 57 minutes.
Sadly the high load continues. I'm now looking into finding a physical machine to host this, and I'll convert the VM to a physical box. Gearman is a nice idea, but I don't have the resources at the moment.
Re: extremely high load average
Posted: Wed Apr 16, 2014 4:41 pm
by lmiltchev
I'm now looking into finding a physical machine to host this, and I'll convert the VM to a physical box.
Let us know if this fixed your issue.
I would also recommend reviewing the "Boosting Performance" section in our Administrator's Guide.
http://assets.nagios.com/downloads/nagi ... p#boosting
Re: extremely high load average
Posted: Wed Apr 16, 2014 4:43 pm
by Box293
Another option is to cut back what is being monitored (temporarily).
You could use the CCM to disable all the disk space checks and apply the configuration, then watch the load average over time and see if that makes any improvement. Then repeat the exercise by disabling all the CPU usage checks and observing the load average again.
It's like drawing a line in the sand: sometimes this approach can help identify whether one particular type of check is causing the issue.
Re: extremely high load average
Posted: Wed Apr 16, 2014 4:46 pm
by slansing
Have you tried updating a test VM to the 2014 beta? It has some pretty big performance improvements. You can also install gearman on your local server, without needing remote workers, and you should still see a dramatic load difference. Heck, I noticed a big difference with under 100 service checks on a test machine. I'm going to go over what you have reported to us so far and see if we missed anything glaringly obvious. This is definitely not an issue that is affecting a large number of users; you are currently the only one reporting it, which makes me suspect a problem with either the VM image or the hardware, which you would probably notice on any VMs sharing it anyway.
Good points, Box293. If possible, you could try disabling a sweeping number of similar checks at once via the CCM to see if it is a certain type of check, e.g. Perl VMware checks, SNMP get requests, etc. If you are using a large number of VMware checks, I'd highly recommend Box293's latest release, which is a completely different way of checking against ESX(i) and vCenter systems. I tested it out on Monday and it was excellent. I think he is working on a wizard for it at the moment.
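If you do try gearman locally, the worker side only needs a few lines. A rough sketch of the worker config (the file path and option names here are from memory, so verify them against the sample file that ships with mod_gearman):
Code: Select all
# /etc/mod_gearman/worker.cfg -- single local-worker setup
server=localhost:4730
# handle both host and service checks on this worker
hosts=yes
services=yes
eventhandler=no
# size the worker pool for the five-minute burst
min-worker=5
max-worker=50
# the key must match the one the NEB module uses, or disable
# encryption on both ends
encryption=yes
key=change_me_everywhere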