Page 2 of 2

Re: extremely high load average

Posted: Fri Apr 04, 2014 11:10 am
by abrist
Let us know how it works out. You have an error in the log:

Code: Select all

/usr/local/nagios/libexec/check_smb_share: line 8: /usr/bin/basename: No such file or directory
I would presume your Samba checks will not work until you resolve the error. First, check whether basename lives at a different path on your system:

Code: Select all

which basename
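If the plugin hard-codes the wrong path, one quick workaround (a sketch only, not the plugin's documented fix) is to rewrite the path in place with sed. The snippet below works on a scratch copy in /tmp for illustration rather than the real plugin:

```shell
# Illustration only: patch a hard-coded basename path in a plugin script.
# This scratch file stands in for the real check_smb_share.
printf '%s\n' '#!/bin/sh' 'me=$(/usr/bin/basename "$0")' 'echo "$me"' > /tmp/check_smb_share

real=$(command -v basename)   # wherever basename actually lives on this box
sed -i "s|/usr/bin/basename|$real|" /tmp/check_smb_share

sh /tmp/check_smb_share       # runs without the "No such file or directory" error
```

On the real plugin you would point sed at /usr/local/nagios/libexec/check_smb_share instead, after taking a backup.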

Re: extremely high load average

Posted: Mon Apr 07, 2014 10:32 am
by nottheadmin
Yeah, I know about that.

It's actually in /bin/basename. I modified the script to reflect that, but it still throws the error. The script runs fine, though, so I'm not worried about it right now; thanks for pointing it out.

OK, so I went ahead and deleted several hundred individual wmi_plus checks, then created 5 or 6 new checks and applied them to a host group: my largest, about 62 hosts with 8 wmi_plus checks each.

Unfortunately, it has not helped:

load average: 33.68, 19.44, 16.58

I realise it is probably best to run this on a physical machine, but wow: how can a few WMI and SNMP scripts create this much load?

Re: extremely high load average

Posted: Mon Apr 07, 2014 10:42 am
by abrist
What interval are you running these WMI checks at? If you disable the new WMI checks and wait 15 minutes, what are the 1-, 5-, and 15-minute load averages?

Re: extremely high load average

Posted: Tue Apr 15, 2014 5:02 am
by nottheadmin
I use the default 5-minute check interval, with a 1-minute retry.

I sat and watched top and iotop side by side for a while. Every 5 minutes, top fills up with check_wmi_plus processes and the load average shoots up to anywhere between 25 and 48. iotop looked fairly quiet, writing at only about 250 K/s and reading at 0 K/s.

I don't think this is a disk I/O problem.

Is it really necessary for a separate check_wmi_plus process to run for each check on each host? I must have something in the region of 800 check_wmi_plus processes running at roughly the same time every 5 minutes.

I can't be the only one having this problem. Or have I committed a schoolboy configuration error?

atop reveals that check_wmi_plus uses between 1% and 2% of a CPU per instance; if the several hundred instances launching at the same time use similar amounts, that would explain the massive load.
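A quick back-of-envelope calculation supports that reading. The inputs below are taken from the figures in this thread (800 checks, 1-2% CPU each, 4 cores), with the per-check cost assumed at the midpoint:

```python
# Rough CPU demand from one burst of concurrent checks.
n_checks = 800          # check_wmi_plus instances per 5-minute burst
cpu_per_check = 0.015   # ~1.5% of one core each, midpoint of atop's 1-2%
cores = 4               # numcpu from the atop output

demand = n_checks * cpu_per_check   # cores' worth of CPU wanted at once
print(demand, demand / cores)       # 12.0 3.0 -> roughly 3x oversubscribed
```

Twelve cores of demand on a 4-core box lines up with the load averages in the 25-48 range once run-queue pile-up is counted.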


Code: Select all

ATOP - localhost                      2014/04/15  12:34:23                      ---------                      10s elapsed
PRC |  sys    0.20s  |  user   1.69s |  #proc    210  |  #tslpi   270  |  #tslpu     0  | #zombie    0  |  #exit     64  |
CPU |  sys       3%  |  user     19% |  irq       0%  |  idle    377%  |  wait      1%  | curf 2.80GHz  |  curscal   ?%  |
cpu |  sys       1%  |  user      7% |  irq       0%  |  idle     91%  |  cpu000 w  0%  | curf 2.80GHz  |  curscal   ?%  |
cpu |  sys       1%  |  user      5% |  irq       0%  |  idle     94%  |  cpu002 w  0%  | curf 2.80GHz  |  curscal   ?%  |
cpu |  sys       1%  |  user      5% |  irq       0%  |  idle     94%  |  cpu001 w  0%  | curf 2.80GHz  |  curscal   ?%  |
cpu |  sys       0%  |  user      2% |  irq       0%  |  idle     98%  |  cpu003 w  0%  | curf 2.80GHz  |  curscal   ?%  |
CPL |  avg1   16.61  |  avg5   11.80 |  avg15  10.56  |  csw     2778  |  intr    5217  |               |  numcpu     4  |
MEM |  tot     3.7G  |  free    2.6G |  cache 354.7M  |  dirty   1.0M  |  buff   53.7M  | slab  191.5M  |                |
SWP |  tot     2.0G  |  free    1.9G |                |                |                | vmcom   1.1G  |  vmlim   3.8G  |
LVM |  roup-lv_root  |  busy      2% |  read       0  |  write    598  |  MBr/s   0.00  | MBw/s   0.23  |  avio 0.37 ms  |
DSK |           sda  |  busy      2% |  read       0  |  write     70  |  MBr/s   0.00  | MBw/s   0.23  |  avio 3.20 ms  |
NET |  transport     |  tcpi     464 |  tcpo     543  |  udpi      18  |  udpo      18  | tcpao     43  |  tcppo      0  |
NET |  network       |  ipi      518 |  ipo      570  |  ipfrw      0  |  deliv    482  | icmpi      0  |  icmpo      0  |
NET |  eth0      0%  |  pcki     597 |  pcko     557  |  si  205 Kbps  |  so   82 Kbps  | erri       0  |  erro       0  |
NET |  lo      ----  |  pcki      18 |  pcko      18  |  si    5 Kbps  |  so    5 Kbps  | erri       0  |  erro       0  |

  PID   TID  RUID     EUID       THR  SYSCPU   USRCPU  VGROW   RGROW   RDDSK  WRDSK  ST EXC  S CPUNR   CPU  CMD        1/3
32044     -  nagios   -            0   0.01s    0.18s     0K      0K       -      -  NE   0  E     -    2%  <check_wmi_pl>
32046     -  nagios   -            0   0.01s    0.17s     0K      0K       -      -  NE   0  E     -    2%  <check_wmi_pl>
32058     -  nagios   -            0   0.01s    0.17s     0K      0K       -      -  NE   0  E     -    2%  <check_wmi_pl>
32075     -  nagios   -            0   0.01s    0.17s     0K      0K       -      -  NE   0  E     -    2%  <check_wmi_pl>
32042     -  nagios   -            0   0.01s    0.16s     0K      0K       -      -  NE   0  E     -    2%  <check_wmi_pl>
32054     -  nagios   -            0   0.01s    0.16s     0K      0K       -      -  NE   0  E     -    2%  <check_wmi_pl>
32073     -  nagios   -            0   0.00s    0.17s     0K      0K       -      -  NE   0  E     -    2%  <check_wmi_pl>
32025     -  nagios   -            0   0.01s    0.12s     0K      0K       -      -  -E   0  E     -    1%  <check_wmi_pl>
32026     -  nagios   -            0   0.00s    0.09s     0K      0K       -      -  -E   0  E     -    1%  <check_wmi_pl>
32027     -  nagios   -            0   0.01s    0.08s     0K      0K       -      -  -E   0  E     -    1%  <check_wmi_pl>
32019     -  nagios   -            0   0.01s    0.06s     0K      0K       -      -  -E   0  E     -    1%  <check_wmi_pl>
32021     -  nagios   -            0   0.01s    0.06s     0K      0K       -      -  -E   0  E     -    1%  <check_wmi_pl>
32017     -  nagios   -            0   0.00s    0.05s     0K      0K       -      -  -E   0  E     -    1%  <check_wmi_pl>
19593     -  root     root         1   0.03s    0.01s     0K      0K      0K     0K  --   -  R     0    0%  atop
 8133     -  nagios   nagios       2   0.02s    0.01s     0K      0K      0K  1728K  --   -  S     1    0%  nagios
22562     -  root     root         1   0.01s    0.01s     0K      0K      0K     0K  --   -  S     2    0%  top
 1576     -  mysql    mysql       56   0.00s    0.01s     0K      0K      0K    48K  --   -  S     3    0%  mysqld
 1749     -  ajaxterm ajaxterm     2   0.00s    0.01s     0K      0K      0K     0K  --   -  S     2    0%  python
    7     -  root     root         1   0.01s    0.00s     0K      0K      0K     0K  --   -  S     1    0%  migration/1
   18     -  root     root         1   0.01s    0.00s     0K      0K      0K     0K  --   -  S     3    0%  watchdog/3
   20     -  root     root         1   0.01s    0.00s     0K      0K      0K     0K  --   -  S     1    0%  events/1
   22     -  root     root         1   0.01s    0.00s     0K      0K      0K     0K  --   -  S     3    0%  events/3
31684     -  nagios   nagios       1   0.00s    0.00s     0K      0K      0K     0K  --   -  S     2    0%  php
31807     -  postgres postgres     1   0.00s    0.00s     0K      0K      0K    32K  --   -  S     3    0%  postmaster
31717     -  postgres postgres     1   0.00s    0.00s     0K      0K      0K    80K  --   -  S     3    0%  postmaster
31718     -  postgres postgres     1   0.00s    0.00s     0K     12K      0K     8K  --   -  S     2    0%  postmaster
 8576     -  root     root         1   0.00s    0.00s     0K      0K      0K     0K  --   -  S     0    0%  sshd
32083     -  nagios   nagios       1   0.00s    0.00s 185.3M   4396K      0K     0K  N-   -  S     0    0%  smbclient
32066     -  nagios   nagios       1   0.00s    0.00s 185.3M   4392K      0K     0K  N-   -  S     0    0%  smbclient
32059     -  nagios   nagios       1   0.00s    0.00s 34976K   3120K      0K     0K  N-   -  S     0    0%  nagios
32076     -  nagios   nagios       1   0.00s    0.00s 34976K   3120K      0K     0K  N-   -  S     3    0%  nagios
32060     -  nagios   nagios       1   0.00s    0.00s 103.6M   1376K      0K     0K  N-   -  S     1    0%  check_smb_shar
32077     -  nagios   nagios       1   0.00s    0.00s 103.6M   1376K      0K     0K  N-   -  S     2    0%  check_smb_shar
32065     -  nagios   nagios       1   0.00s    0.00s 103.6M    588K      0K     0K  N-   -  S     2    0%  check_smb_shar
32082     -  nagios   nagios       1   0.00s    0.00s 103.6M    588K      0K     0K  N-   -  S     3    0%  check_smb_shar
  373     -  root     root         1   0.00s    0.00s     0K      0K      0K    48K  --   -  S     0    0%  jbd2/dm-0-8
  934     -  root     root         1   0.00s    0.00s     0K      0K      0K    28K  --   -  S     0    0%  flush-253:0
32029     -  nagios   -            0   0.00s    0.00s     0K      0K       -      -  NE   0  E     -    0%  <wmic>
Here is atop's output during a load of 24

Code: Select all

PRC |  sys    1.56s  |  user  38.75s  |  #proc    476  |   #tslpu     0  |  #zombie    2  |  #exit    531  |
CPU |  sys      32%  |  user    367%  |  irq       1%  |   idle      0%  |  wait      0%  |  curscal   ?%  |
cpu |  sys      10%  |  user     89%  |  irq       1%  |   idle      0%  |  cpu000 w  0%  |  curscal   ?%  |
cpu |  sys       8%  |  user     92%  |  irq       0%  |   idle      0%  |  cpu001 w  0%  |  curscal   ?%  |
cpu |  sys       7%  |  user     93%  |  irq       0%  |   idle      0%  |  cpu002 w  0%  |  curscal   ?%  |
cpu |  sys       7%  |  user     93%  |  irq       0%  |   idle      0%  |  cpu003 w  0%  |  curscal   ?%  |
CPL |  avg1   26.43  |  avg5   12.50  |  avg15  10.63  |   csw    21217  |  intr   47509  |  numcpu     4  |
MEM |  tot     3.7G  |  free    1.4G  |  cache 353.0M  |   dirty   2.8M  |  buff   54.1M  |  slab  198.9M  |
SWP |  tot     2.0G  |  free    1.9G  |                |                 |  vmcom   3.9G  |  vmlim   3.8G  |
LVM |  roup-lv_root  |  busy      6%  |  read       0  |   write    415  |  MBw/s   0.16  |  avio 1.51 ms  |
DSK |           sda  |  busy      6%  |  read       0  |   write    232  |  MBw/s   0.16  |  avio 2.71 ms  |
NET |  transport     |  tcpi    4400  |  tcpo    5184  |   udpi      70  |  udpo      74  |  tcpao    386  |
NET |  network       |  ipi     4499  |  ipo     5306  |   ipfrw      0  |  deliv   4487  |  icmpo     17  |
NET |  eth0      0%  |  pcki    4506  |  pcko    5265  |   si 1270 Kbps  |  so  673 Kbps  |  erro       0  |
NET |  lo      ----  |  pcki      85  |  pcko      85  |   si   20 Kbps  |  so   20 Kbps  |  erro       0  |

  PID   TID RUID      THR   SYSCPU  USRCPU  VGROW  RGROW   RDDSK  WRDSK ST EXC  S CPUNR  CPU CMD        1/22
17183     - apache      1    0.01s   0.24s     0K    76K      0K     0K --   -  R     0   2% httpd
30258     - apache      1    0.02s   0.21s  2452K  2520K      0K     0K --   -  R     0   2% httpd
 3084     - nagios      0    0.02s   0.21s     0K     0K       -      - NE   0  E     -   2% <check_wmi_pl>
 3086     - nagios      0    0.01s   0.22s     0K     0K       -      - NE   0  E     -   2% <check_wmi_pl>
 3090     - nagios      0    0.01s   0.21s     0K     0K       -      - NE   0  E     -   2% <check_wmi_pl>
 3109     - nagios      0    0.01s   0.21s     0K     0K       -      - NE   0  E     -   2% <check_wmi_pl>
 3115     - nagios      0    0.01s   0.21s     0K     0K       -      - NE   0  E     -   2% <check_wmi_pl>
 3077     - nagios      0    0.01s   0.21s     0K     0K       -      - NE   0  E     -   2% <check_wmi_pl>
 3106     - nagios      0    0.01s   0.19s     0K     0K       -      - NE   0  E     -   2% <check_wmi_pl>
 3113     - nagios      0    0.01s   0.18s     0K     0K       -      - NE   0  E     -   2% <check_wmi_pl>
 3110     - nagios      0    0.01s   0.18s     0K     0K       -      - NE   0  E     -   2% <check_wmi_pl>
 3111     - nagios      0    0.01s   0.18s     0K     0K       -      - NE   0  E     -   2% <check_wmi_pl>
 3117     - nagios      0    0.01s   0.18s     0K     0K       -      - NE   0  E     -   2% <check_wmi_pl>
 3116     - nagios      0    0.01s   0.18s     0K     0K       -      - NE   0  E     -   2% <check_wmi_pl>
 3105     - nagios      0    0.01s   0.18s     0K     0K       -      - NE   0  E     -   2% <check_wmi_pl>
 3108     - nagios      0    0.01s   0.18s     0K     0K       -      - NE   0  E     -   2% <check_wmi_pl>
 3096     - nagios      0    0.00s   0.19s     0K     0K       -      - NE   0  E     -   2% <check_wmi_pl>
 3112     - nagios      0    0.00s   0.19s     0K     0K       -      - NE   0  E     -   2% <check_wmi_pl>
 3107     - nagios      0    0.00s   0.19s     0K     0K       -      - NE   0  E     -   2% <check_wmi_pl>
 3114     - nagios      0    0.00s   0.19s     0K     0K       -      - NE   0  E     -   2% <check_wmi_pl>
 3093     - nagios      0    0.01s   0.18s     0K     0K       -      - NE   0  E     -   2% <check_wmi_pl>
 3092     - nagios      0    0.01s   0.18s     0K     0K       -      - NE   0  E     -   2% <check_wmi_pl>
 3118     - nagios      0    0.00s   0.19s     0K     0K       -      - NE   0  E     -   2% <check_wmi_pl>
 3165     - nagios      0    0.01s   0.18s     0K     0K       -      - NE   0  E     -   2% <check_wmi_pl>
 3152     - nagios      0    0.01s   0.18s     0K     0K       -      - NE   0  E     -   2% <check_wmi_pl>
 3151     - nagios      0    0.01s   0.18s     0K     0K       -      - NE   0  E     -   2% <check_wmi_pl>
 3182     - nagios      0    0.01s   0.18s     0K     0K       -      - NE   0  E     -   2% <check_wmi_pl>
 3193     - nagios      0    0.01s   0.18s     0K     0K       -      - NE   0  E     -   2% <check_wmi_pl>
 3150     - nagios      0    0.01s   0.18s     0K     0K       -      - NE   0  E     -   2% <check_wmi_pl>
 3190     - nagios      0    0.01s   0.18s     0K     0K       -      - NE   0  E     -   2% <check_wmi_pl>
 3233     - nagios      0    0.01s   0.18s     0K     0K       -      - -E   0  E     -   2% <check_wmi_pl>
 3075     - nagios      0    0.01s   0.17s     0K     0K       -      - NE   0  E     -   2% <check_wmi_pl>
 3091     - nagios      0    0.00s   0.18s     0K     0K       -      - NE   0  E     -   2% <check_wmi_pl>
 3094     - nagios      0    0.00s   0.18s     0K     0K       -      - NE   0  E     -   2% <check_wmi_pl>
 3181     - nagios      0    0.01s   0.17s     0K     0K       -      - NE   0  E     -   2% <check_wmi_pl>
 3197     - nagios      0    0.01s   0.17s     0K     0K       -      - NE   0  E     -   2% <check_wmi_pl>
 3160     - nagios      0    0.01s   0.17s     0K     0K       -      - NE   0  E     -   2% <check_wmi_pl>
 3154     - nagios      0    0.00s   0.18s     0K     0K       -      - NE   0  E     -   2% <check_wmi_pl>
 3192     - nagios      0    0.01s   0.17s     0K     0K       -      - NE   0  E     -   2% <check_wmi_pl>

Code: Select all

[root@localhost ~]# tail -f /usr/local/nagios/var/npcd.log
[04-15-2014 12:56:57] NPCD: WARN: MAX load reached: load 11.710000/10.000000 at i=0
[04-15-2014 12:57:58] NPCD: WARN: MAX load reached: load 17.280000/10.000000 at i=0
[04-15-2014 12:58:13] NPCD: WARN: MAX load reached: load 14.260000/10.000000 at i=1
[04-15-2014 12:58:28] NPCD: WARN: MAX load reached: load 11.100000/10.000000 at i=1
[04-15-2014 12:58:43] NPCD: WARN: MAX load reached: load 30.990000/10.000000 at i=1
[04-15-2014 12:58:58] NPCD: WARN: MAX load reached: load 24.120000/10.000000 at i=1
[04-15-2014 12:59:13] NPCD: WARN: MAX load reached: load 24.070000/10.000000 at i=1
[04-15-2014 12:59:28] NPCD: WARN: MAX load reached: load 18.800000/10.000000 at i=1
[04-15-2014 12:59:43] NPCD: WARN: MAX load reached: load 14.640000/10.000000 at i=1
[04-15-2014 12:59:58] NPCD: WARN: MAX load reached: load 11.390000/10.000000 at i=1


Please do not double post; if you are the last poster, edit your previous post to add your information. Adding additional posts will only serve to push you lower on our "to-be-replied-to" list.

Re: extremely high load average

Posted: Tue Apr 15, 2014 5:12 pm
by tmcdonald
WMI is a pretty heavy beast, and 800 is at the higher end of the spectrum for the number of WMI checks we usually see. Have you considered implementing any gearman workers?

As for spacing the checks out: there are some nagios.cfg-level options for interleaving checks, but I do not know them well enough at a low level to make specific recommendations.
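For reference, these are the Nagios Core nagios.cfg options usually meant by "interleaving"; the values shown are illustrative only, not tuned recommendations:

```
# nagios.cfg (Nagios Core) - spreading and capping check execution
max_concurrent_checks=30            # limit simultaneous plugin processes
service_inter_check_delay_method=s  # "smart" delay spreads checks over time
service_interleave_factor=s         # interleave services across hosts
```

Capping max_concurrent_checks trades check latency for a hard ceiling on how many check_wmi_plus processes can pile up at once.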

Re: extremely high load average

Posted: Wed Apr 16, 2014 12:49 am
by Box293
Evaluate each type of check to see if a 5 minute interval is necessary.

For example, a disk space check might only need to occur once every hour.


Also, try to use odd-numbered intervals.

For example, a 60-minute interval is likely to coincide with the 5-minute checks once an hour. Try 57 or 62 minutes instead; that spreads the checks out so the intervals rarely line up.
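In a service definition that might look like the sketch below (host, description, and command names are hypothetical; check_interval is in time units, minutes by default):

```
define service {
    host_name           winserver01           ; hypothetical host
    service_description Disk Space
    check_command       check_disk_space      ; hypothetical command name
    check_interval      57   ; odd interval: rarely aligns with 5-min checks
    retry_interval      1
    }
```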

Re: extremely high load average

Posted: Wed Apr 16, 2014 4:27 am
by nottheadmin
Hi, thanks for the suggestions. I was able to add 4 more cores and an extra 12 GB of RAM to the VM today, but it made no difference whatsoever.

I'm currently on eight 2.8 GHz Xeon cores and 16 GB of RAM. The load average just peaked at 75.

So I looked at the checks in more detail, and yes, you're right: I don't really need to check disk space every 5 minutes. I changed that to 57 minutes.

Sadly, the high load continues. I'm now looking into finding a physical machine to host this, and I'll convert the VM to physical. Gearman is a nice idea, but I don't have the resources at the moment.

Re: extremely high load average

Posted: Wed Apr 16, 2014 4:41 pm
by lmiltchev
"I'm now looking into finding a physical machine to host this, and I'll convert the VM to physical."
Let us know if this fixed your issue.

I would also recommend reviewing the "Boosting Performance" section in our Administrator's Guide.
http://assets.nagios.com/downloads/nagi ... p#boosting

Re: extremely high load average

Posted: Wed Apr 16, 2014 4:43 pm
by Box293
Another option is to cut back what is being monitored (temporarily).

You could use CCM to disable all the disk space checks and apply the configuration. Then watch the load average over time and see if this makes any improvement. Then repeat the exercise by disabling all the CPU usage checks and then observing the load average.

It's like drawing a line in the sand. Sometimes this approach can help identify whether one particular type of check is causing the issue.

Re: extremely high load average

Posted: Wed Apr 16, 2014 4:46 pm
by slansing
Have you tried updating a test VM to the 2014 beta? It includes some pretty big performance improvements. You can also install gearman on your local server, without needing remote workers, and you should still see a dramatic load difference; heck, I noticed a big difference with under 100 service checks on a test machine. I'm going to ponder what you have reported to us so far and see if we missed anything glaringly obvious. This is definitely not an issue affecting a large number of users; you are currently the only one reporting it, which makes me suspect an issue with either the VM image or the hardware, and a hardware problem is something you would probably notice in any VMs sharing it anyway.

Good points, Box. If possible, you could try disabling a sweeping number of similar checks at once via the CCM to see if it is a certain type of check, e.g. Perl VMware checks, SNMP get requests, etc. If you are using a large number of VMware checks, I'd highly recommend Box293's latest release, which is a completely different way of checking against ESX(i) and vCenter systems. I tested it on Monday and it was excellent. I think he is working on a wizard for it at the moment.