Total number of concurrent checks exceeds 15, aborting!

bsivavani · Post by **bsivavani** » Tue Jan 19, 2016 12:50 pm

Hi Support,

We are receiving the error below in production. It is impacting monitoring because we get a flood of alerts within a minute.

Total number of concurrent checks exceeds 15, aborting!

We followed the technote below and increased the concurrent checks limit to 20, but no luck.

https://support.nagios.com/forum/viewto ... 3&p=108791

Please suggest how to proceed on this.

rkennedy · Post by **rkennedy** » Tue Jan 19, 2016 12:52 pm

I know you mentioned increasing it, but what is the result once you increase the checks to 20? --concurrent_checks 20

bsivavani · Post by **bsivavani** » Tue Jan 19, 2016 1:17 pm

After the change we receive the error below:

Total number of concurrent checks exceeds 20, aborting!

rkennedy · Post by **rkennedy** » Tue Jan 19, 2016 2:22 pm

How many checks do you have in your environment running against VMware? What happens if concurrent_checks is increased to, say, 50?
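
To illustrate why raising the value only changes the number in the message, here is a minimal sketch of how a concurrency guard of this kind might behave. It is not the plugin's actual code: the function name `check_limit` and its arguments are hypothetical, and the exit code the real plugin returns is not stated in this thread.

```shell
#!/bin/sh
# Hypothetical guard, for illustration only (not taken from the plugin).
# Refuses to start a new check when too many are already running.
check_limit() {
    running=$1   # number of checks currently running
    limit=$2     # value passed as --concurrent_checks
    if [ "$running" -gt "$limit" ]; then
        echo "Total number of concurrent checks exceeds $limit, aborting!"
        return 3   # 3 is the Nagios plugin code for UNKNOWN (assumed here)
    fi
    echo "OK to run ($running of $limit)"
    return 0
}

# With 22 checks in flight, a limit of 20 still aborts.
check_limit 22 20 || echo "exit code: $?"
```

If the number of checks in flight stays above the limit, the same abort is expected at any lower setting, which would match the result seen at 15 and at 20.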

bsivavani · Post by **bsivavani** » Wed Jan 20, 2016 4:52 am

rkennedy wrote: How many checks do you have in your environment running against VMware? What happens if concurrent_checks is increased to, say, 50?

Please let us know whether there is a way to find the total number of checks running against VMware.
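
One possible way to get a rough count is to search the Nagios object configuration for service definitions whose check_command refers to the plugin. This is only a sketch under assumptions: the directory `/usr/local/nagios/etc` and the idea that the command names contain `box293` are guesses, not something confirmed in this thread, and `count_vmware_services` is a hypothetical helper name.

```shell
#!/bin/sh
# Count check_command lines that mention a pattern in a config directory.
# Assumptions: config location and the "box293" naming are NOT confirmed here.
count_vmware_services() {
    cfg_dir=$1
    pattern=${2:-box293}
    # -r: recurse, -h: no file names, -E: extended regex
    grep -rhE "^[[:space:]]*check_command[[:space:]]+.*$pattern" "$cfg_dir" 2>/dev/null \
        | wc -l | tr -d ' '
}

# Prints 0 if the directory does not exist or nothing matches.
count_vmware_services "${NAGIOS_ETC:-/usr/local/nagios/etc}"
```

The result counts definitions, not scheduled checks: a service applied to a host group runs once per member host, and templates are counted too, so treat the number as a rough starting point.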

bsivavani · Post by **bsivavani** » Wed Jan 20, 2016 10:10 am

Below is the vMA server information.

Please find the top command output attached.

------------------------------------------------------------
Free memory
vi-admin@aushosvmaprd00:~> free -m
             total       used       free     shared    buffers     cached
Mem:         12041        705      11336          0         12         81
-/+ buffers/cache:        610      11430
Swap:          133         89         44
vi-admin@aushosvmaprd00:~>
---------------------------------------------------------

---------------------------
Disk usage
vi-admin@aushosvmaprd00:~> df -kh
Filesystem Size Used Avail Use% Mounted on
/dev/sda3 2.7G 1.5G 1.1G 59% /
udev 5.9G 88K 5.9G 1% /dev
tmpfs 5.9G 0 5.9G 0% /dev/shm
/dev/sda1 128M 37M 85M 31% /boot
vi-admin@aushosvmaprd00:~>
------------------------

This is impacting production on all servers. Please suggest next steps.

rkennedy · Post by **rkennedy** » Wed Jan 20, 2016 11:24 am

I believe you have quite a few checks running against your VMware environment, and that is the underlying issue. The script is not running because of the number of concurrent checks you have running. As I said in my previous post, you'll need to try increasing the limit to a higher number.

What is the result of -

Code: Select all

ps -ef|grep box293_check_vm|wc -l
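
Note that the pipeline above can also count the grep process itself, so the number may be one too high. A small variant that avoids this, as a sketch (the helper name `count_running_checks` is hypothetical):

```shell
#!/bin/sh
# Count running plugin processes without counting grep itself.
count_running_checks() {
    # The [b] bracket expression matches a literal "b", but the grep
    # command line then contains "[b]ox..." and no longer matches itself.
    ps -ef | grep -c '[b]ox293_check_vm'
}

# grep -c exits non-zero when the count is 0, hence the "|| true".
count_running_checks || true
```

Running it a few times during a busy minute gives a feel for how far above the configured limit the vMA actually goes.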

WillemDH · Post by **WillemDH** » Wed Jan 20, 2016 11:29 am

OK, just a few thoughts that could help you in the long run:

Do you monitor your vMA servers? How many vMA servers do you have, and how many CPUs does each have? If you are not yet monitoring them, please start doing so: install the NRPE agent on the vMA servers. You might consider using this plugin: https://exchange.nagios.org/directory/P ... ss/details to monitor the process count, average CPU usage and memory usage of the box293_check_vm process.

You might consider deactivating a number of the services that use your vMA. Start by deactivating 50% of them. Then it might be a good idea to reboot the vMA and watch whether the load calms down. If everything seems to work, re-activate another 25% of your services and check again. It might be necessary to add more CPUs or even set up another vMA server to split the load (that's what I had to do).

Also, if you use the overall datastore performance checks, deactivate these first, as they generate a lot of load.

EDIT: Did you add 'nice' in your commands?

I hope this helps! Grtz
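
As a sketch of what adding 'nice' can look like: the wrapper below runs any command at a lower scheduling priority. The helper name `run_niced` and the increment of 10 are illustrative assumptions, and the `echo` stands in for the real plugin command line, which is not shown in this thread.

```shell
#!/bin/sh
# Run a command with a niceness increment of 10 (lower CPU priority).
# The value 10 is an arbitrary example, not a recommendation from this thread.
run_niced() {
    nice -n 10 "$@"
}

# Placeholder command; in practice this would be the plugin invocation.
run_niced echo "check finished"
```

In a Nagios command definition the same `nice -n 10` prefix would presumably go at the start of the `command_line` value. Lowering priority does not reduce the number of concurrent checks; it only makes them yield CPU to other processes.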

bsivavani · Post by **bsivavani** » Wed Jan 20, 2016 12:56 pm

Thank you. We have the concerns below.

1. How many checks can one vMA server handle?

2. We have more than 100 ESXi servers to add to Nagios, including datastore adapter and storage adapter performance monitoring, all through one vMA.

3. We are using one vMA server to on-board 100 servers (about 1000 services, taking an average of 10 per server). Please advise whether we can proceed this way.

4. Please give us an example command with nice.

Please find the CPU information from the vMA server below:
vi-admin@aushosvmaprd00:~> lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 6
On-line CPU(s) list: 0-5
Thread(s) per core: 1
Core(s) per socket: 3
Socket(s): 2
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 26
Stepping: 4
CPU MHz: 2793.000
BogoMIPS: 5586.00
Hypervisor vendor: VMware
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 12288K
NUMA node0 CPU(s): 0-5
vi-admin@aushosvmaprd00:~>
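
For the sizing question above (100 servers, about 10 services each), a back-of-the-envelope estimate may help. If checks are spread evenly, the average number running at once is roughly services × check duration ÷ check interval. The 10-second duration and 300-second interval below are assumed example values, not measurements from this environment, and `avg_concurrency` is a hypothetical helper.

```shell
#!/bin/sh
# Rough average concurrency: services * seconds_per_check / interval_seconds.
# Duration and interval are ASSUMED example values, not measured ones.
avg_concurrency() {
    LC_ALL=C awk -v n="$1" -v d="$2" -v i="$3" \
        'BEGIN { printf "%.1f\n", n * d / i }'
}

# 1000 services, 10 s per check, checked every 300 s.
avg_concurrency 1000 10 300
```

Under these assumptions the average alone is about 33 checks in flight, above both 15 and 20, and real scheduling is burstier than an even spread. Measuring the actual check duration on the vMA would make the estimate meaningful.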

bsivavani · Post by **bsivavani** » Wed Jan 20, 2016 1:13 pm

We are receiving a flood of alerts for CPU Usage and Memory as well.

Shall we increase concurrent checks to 50 for those services?

Nagios Support Forum