Basics:
Debian Etch with the latest OS updates
Nagios 3.0.4 (I know it's outdated; we can't upgrade due to the nature of the setup)
Dell Server
2x Xeon E5430 (4 cores per CPU, 8 cores total)
32GB RAM, 8GB swap
LSI Logic SAS1068E RAID controller with an unknown number of drives, total drive size visible to OS is ~140GB
Box runs nagios, apache, memcached and mysql
Disk usage is minimal; I tested this myself by watching iostat while running a dd of /dev/zero to a file: blocks written per second jumped tenfold when I started the dd and dropped back down when I killed it.
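A minimal sketch of that disk test, in case anyone wants to reproduce it (the path and size here are arbitrary choices, not what I actually used; watch the write rate in a second terminal with `iostat -m 1`):

```shell
# Write a bounded amount of zeros, forcing a flush to disk at the end,
# then confirm how much actually landed in the file.
dd if=/dev/zero of=/tmp/ddtest bs=1M count=64 conv=fsync 2>/dev/null
stat -c %s /tmp/ddtest   # prints the byte size of the file just written
```

While the dd runs, the MB_wrtn/s column in iostat should spike, then fall back once it finishes.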
We have roughly 3000 remote clients running NRPE, with anywhere from 8 to 25 services to check per client; the average is 22, based on the number of services versus the number of hosts.
We also have a handful of passive checks sending results in through NSCA.
The server doesn't act overloaded at the console, though it carries a constant load of around 4.0 to 7.0.
Checking sar didn't reveal any spikes, so CPU, memory, bandwidth, and disk usage are all steady.
All of the hosts are outside our network. We traverse the public internet to do the service checks, but iptables and the network firewalls on the remote end are set up to allow access to those machines only from our IP.
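For context, the lockdown on the remote end looks something like the following iptables fragment (203.0.113.10 is a placeholder standing in for our monitoring server's IP; 5666 is NRPE's default port):

```
# iptables-save style fragment on each remote client -- illustrative only
-A INPUT -p tcp --dport 5666 -s 203.0.113.10 -j ACCEPT
-A INPUT -p tcp --dport 5666 -j DROP
```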
In total, according to tac.cgi, Nagios is monitoring the following numbers of hosts and services:
2556 hosts
57595 services
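As a sanity check, the "average of 22 services per client" figure falls straight out of those tac.cgi totals:

```shell
# Services per host from the totals above: 57595 services / 2556 hosts
awk 'BEGIN { printf "%.1f\n", 57595 / 2556 }'
```

That comes out to roughly 22.5 services per host, matching the average quoted earlier.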
It takes anywhere from 10 to 30 seconds, sometimes longer, to load any page in Nagios, which is why I'm convinced there is a fixable performance problem. It got so bad that a custom service status display was written to show just the pertinent information we need, in as small a space as possible, for when services or hosts are down; it is cached and updated at regular intervals.
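The idea behind that custom display can be sketched as a periodic job that pulls problem states straight out of Nagios's status.dat instead of going through the CGIs. The file layout below is a simplified assumption (real status.dat entries are tab-indented and carry many more fields), and the paths are invented for the example:

```shell
# Build a tiny sample status file in the (simplified) status.dat shape.
cat > /tmp/status.dat <<'EOF'
servicestatus {
host_name=web01
service_description=HTTP
current_state=2
}
servicestatus {
host_name=web02
service_description=SSH
current_state=0
}
EOF

# Print only non-OK services (current_state != 0); a cron job could dump
# this into a static page that the browser loads instantly.
awk -F= '
/^servicestatus/                     { inblk = 1 }
inblk && $1 == "host_name"           { h = $2 }
inblk && $1 == "service_description" { s = $2 }
inblk && $1 == "current_state"       { if ($2 != 0) print h, s; inblk = 0 }
' /tmp/status.dat
```

Run on the sample above, this prints only the critical service (web01 HTTP) and skips the OK one.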
vmstat -a -S m is as follows:
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free inact active si so bi bo in cs us sy id wa
30 0 0 3913 1843 10682 0 0 2 1021 0 0 22 26 50 2
iostat -m is as follows:
Linux 2.6.18-5-amd64 (hou-nagios-01) 11/06/2011
avg-cpu: %user %nice %system %iowait %steal %idle
22.15 0.00 25.64 2.24 0.00 49.97
Device: tps MB_read/s MB_wrtn/s MB_read MB_wrtn
sda 105.15 0.01 7.97 182212 108281113
Some tidbits from sar:
sar -q
07:35:01 AM runq-sz plist-sz ldavg-1 ldavg-5 ldavg-15
07:45:01 AM 13 309 5.94 5.20 3.94
07:55:01 AM 27 384 5.00 6.33 5.29
08:05:01 AM 29 361 5.11 5.20 5.18
08:15:01 AM 22 373 5.26 5.49 5.38
08:25:01 AM 24 364 4.76 5.26 5.33
08:35:01 AM 25 374 4.61 5.29 5.36
08:45:01 AM 25 367 5.07 4.90 5.09
08:55:01 AM 21 334 6.33 5.78 5.36
09:05:01 AM 25 365 5.47 6.57 6.10
09:15:01 AM 18 401 5.32 5.54 5.78
09:25:01 AM 20 346 8.53 8.77 7.05
Average: 24 341 3.47 3.95 4.02
sar -r:
07:35:01 AM kbmemfree kbmemused %memused kbbuffers kbcached kbswpfree kbswpused %swpused kbswpcad
07:45:01 AM 2426028 30516680 92.64 2125480 3637164 7815540 72 0.00 0
07:55:01 AM 5160228 27782480 84.34 2130264 3442540 7815540 72 0.00 0
08:05:01 AM 3926600 29016108 88.08 2014836 3640432 7815540 72 0.00 0
08:15:01 AM 4351956 28590752 86.79 2016936 3568480 7815540 72 0.00 0
08:25:01 AM 6942488 26000220 78.93 2018784 3157736 7815540 72 0.00 0
08:35:01 AM 4382444 28560264 86.70 2020500 3508536 7815540 72 0.00 0
08:45:01 AM 6334820 26607888 80.77 2022524 3351132 7815540 72 0.00 0
08:55:01 AM 5171932 27770776 84.30 2024292 3448148 7815540 72 0.00 0
09:05:01 AM 3895784 29046924 88.17 2025908 3630532 7815540 72 0.00 0
09:15:01 AM 6744584 26198124 79.53 2027852 3305528 7815540 72 0.00 0
09:25:01 AM 4127708 28815000 87.47 2029540 3402320 7815540 72 0.00 0
Average: 2564441 30378267 92.22 2138068 4511420 7815540 72 0.00 0
sar -d:
07:35:01 AM DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
07:45:01 AM dev8-0 129.71 2.51 23335.02 179.92 26.26 202.42 1.60 20.82
07:55:01 AM dev8-0 152.66 0.73 29636.60 194.14 34.00 222.69 1.72 26.23
08:05:01 AM dev8-0 154.73 0.81 30662.14 198.17 40.60 262.39 2.02 31.20
08:15:01 AM dev8-0 108.70 0.00 17673.97 162.59 18.84 173.28 1.44 15.61
08:25:01 AM dev8-0 110.64 0.01 17636.29 159.40 19.26 174.06 1.44 15.88
08:35:01 AM dev8-0 115.07 0.00 17787.41 154.57 20.53 178.45 1.49 17.17
08:45:01 AM dev8-0 112.28 0.00 17735.88 157.95 19.89 177.16 1.48 16.59
08:55:01 AM dev8-0 105.79 0.04 17572.45 166.11 18.25 172.55 1.43 15.15
09:05:01 AM dev8-0 110.68 0.00 17701.31 159.93 18.13 163.85 1.40 15.48
09:15:01 AM dev8-0 113.03 3.55 17760.53 157.16 18.91 167.27 1.40 15.87
09:25:01 AM dev8-0 114.16 0.01 17780.57 155.75 19.44 170.31 1.42 16.18
Average: dev8-0 113.06 43.41 18208.13 161.43 20.07 177.50 1.48 16.70
sar -u:
07:35:01 AM CPU %user %nice %system %iowait %steal %idle
07:45:01 AM all 20.24 0.00 24.68 3.76 0.00 51.33
07:55:01 AM all 20.59 0.00 24.54 4.88 0.00 49.99
08:05:01 AM all 19.46 0.00 22.39 8.20 0.00 49.95
08:15:01 AM all 20.44 0.00 25.42 2.67 0.00 51.47
08:25:01 AM all 20.54 0.00 26.03 3.03 0.00 50.39
08:35:01 AM all 21.00 0.00 25.85 3.14 0.00 50.01
08:45:01 AM all 20.31 0.00 24.83 3.52 0.00 51.33
08:55:01 AM all 20.55 0.00 26.11 2.27 0.00 51.06
09:05:01 AM all 20.50 0.00 26.16 1.93 0.00 51.41
09:15:01 AM all 20.25 0.00 26.00 2.23 0.00 51.52
09:25:01 AM all 34.09 0.00 24.00 2.02 0.00 39.89
Average: all 21.54 0.00 25.30 2.75 0.00 50.41
Now, when I check top and show per-core CPU usage, I see something interesting:
top - 09:33:36 up 157 days, 4:12, 4 users, load average: 6.82, 8.38, 7.60
Tasks: 277 total, 5 running, 271 sleeping, 0 stopped, 1 zombie
Cpu0 : 6.0%us, 10.0%sy, 0.0%ni, 82.4%id, 0.3%wa, 0.0%hi, 1.3%si, 0.0%st
Cpu1 : 31.9%us, 26.6%sy, 0.0%ni, 41.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 20.9%us, 30.6%sy, 0.0%ni, 48.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu3 : 4.0%us, 13.7%sy, 0.0%ni, 82.3%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4 : 24.3%us, 20.9%sy, 0.0%ni, 54.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu5 : 11.6%us, 33.9%sy, 0.0%ni, 54.5%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6 : 4.7%us, 32.7%sy, 0.0%ni, 62.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7 : 99.7%us, 0.3%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 32942708k total, 29316272k used, 3626436k free, 2030984k buffers
Swap: 7815612k total, 72k used, 7815540k free, 3497496k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
21420 www-data 25 0 450m 365m 5604 R 100 1.1 2:14.45 apache-ssl
29854 nagios 25 0 157m 138m 880 R 75 0.4 1581:17 nagios
29011 nagios 21 0 16440 4572 748 S 1 0.0 5:38.80 nrpe
8131 www-data 15 0 451m 365m 5604 S 1 1.1 1:41.86 apache-ssl
31316 nagios 24 0 4828 584 480 S 1 0.0 0:00.02 check_nrpe
It looks like we have Nagios checking up to 50 services at a time across various hosts:
pgrep check_nrpe |wc -l
50
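For reference, that concurrency is governed by Nagios 3's max_concurrent_checks directive; the fragment below shows the related nagios.cfg tunables with illustrative values only, not our actual settings:

```
# nagios.cfg fragment -- illustrative values, not our running config
max_concurrent_checks=0            # 0 = no cap on parallel check processes
check_result_reaper_frequency=10   # seconds between check-result reaps
use_large_installation_tweaks=1    # skips some per-check overhead on big setups
```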
If additional stats from the server would help, let me know and I'll gather them, permissions allowing.