Nagios scale out options?
Posted: Mon Jul 18, 2016 3:21 pm
Hi -
I recently added a few dozen ESX hosts, and the extra checks caused Nagios to hit our ESX cluster hard. It's currently consuming about 30% of the resources on one node, which is surprising since this is a brand-new Dell PowerEdge R730xd with 24 cores and 784GB of RAM. So we need to understand what we can do to scale up or scale out.
Is separating the workloads (production and dev environments) the right approach? I've run through a number of the performance tuning docs, but what else can we do if we need to add, say, 2,000 more checks? As you may know, our ESX version limits a VM to 8 virtual CPUs, so we're constrained there, but going physical would cost us our HA capability.
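For reference, the performance tuning docs I went through mostly come down to a handful of nagios.cfg directives. A sketch of the ones that seem relevant here (sample values for illustration, not what we're currently running):

```ini
# nagios.cfg tuning directives from the Nagios performance docs.
# Values below are illustrative starting points, not our current settings.

# Enables several internal optimizations for large installs
use_large_installation_tweaks=1

# Process queued check results more often than the 10s default
check_result_reaper_frequency=3
max_check_result_reaper_time=30

# 0 = no cap on parallel active checks; a finite value would
# throttle how hard the poller hits the hypervisors at once
max_concurrent_checks=0

# Environment macros add per-check fork/exec overhead; keep off
enable_environment_macros=0
```

Happy to post our actual values if that helps narrow things down.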
Code:
Nagios XI Installation Profile
System:
Nagios XI Version : 5.2.5
nwd2ng01.corp.analog.com 2.6.32-504.3.3.el6.x86_64 x86_64
CentOS release 6.6 (Final)
Gnome is not installed
Apache Information
PHP Version: 5.3.3
Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:47.0) Gecko/20100101 Firefox/47.0
Server Name: nwd2ng01.corp.analog.com
Server Address: 10.64.52.120
Server Port: 80
Date/Time
PHP Timezone: America/New_York
PHP Time: Mon, 18 Jul 2016 16:16:18 -0400
System Time: Mon, 18 Jul 2016 16:16:18 -0400
Nagios XI Data
License ends in: MSTNQS
nagios (pid 13597) is running...
NPCD running (pid 1999).
ndo2db (pid 2011) is running...
CPU Load 15: 9.79
Total Hosts: 600
Total Services: 4116
Function 'get_base_uri' returns: http://nwd2ng01.corp.analog.com/nagiosxi/
Function 'get_base_url' returns: http://nwd2ng01.corp.analog.com/nagiosxi/
Function 'get_backend_url(internal_call=false)' returns: http://nwd2ng01.corp.analog.com/nagiosxi/includes/components/profile/profile.php
Function 'get_backend_url(internal_call=true)' returns: http://localhost/nagiosxi/backend/
Ping Test localhost
Running:
/bin/ping -c 3 localhost 2>&1
PING localhost (127.0.0.1) 56(84) bytes of data.
64 bytes from localhost (127.0.0.1): icmp_seq=1 ttl=64 time=0.025 ms
64 bytes from localhost (127.0.0.1): icmp_seq=2 ttl=64 time=0.037 ms
64 bytes from localhost (127.0.0.1): icmp_seq=3 ttl=64 time=0.026 ms
--- localhost ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.025/0.029/0.037/0.007 ms
Test wget To localhost
WGET From URL: http://localhost/nagiosxi/includes/components/ccm/
Running:
/usr/bin/wget http://localhost/nagiosxi/includes/components/ccm/
--2016-07-18 16:16:20-- http://localhost/nagiosxi/includes/components/ccm/
Resolving localhost... ::1, 127.0.0.1
Connecting to localhost|::1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: "/usr/local/nagiosxi/tmp/ccm_index.tmp"
0K ......... 511M=0s
2016-07-18 16:16:20 (511 MB/s) - "/usr/local/nagiosxi/tmp/ccm_index.tmp" saved [9836]
Network Settings
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 00:50:56:9f:52:ef brd ff:ff:ff:ff:ff:ff
inet 10.64.52.120/24 brd 10.64.52.255 scope global eth0
inet6 fe80::250:56ff:fe9f:52ef/64 scope link
valid_lft forever preferred_lft forever
10.64.52.0/24 dev eth0 proto kernel scope link src 10.64.52.120
169.254.0.0/16 dev eth0 scope link metric 1002
default via 10.64.52.1 dev eth0