Troubleshooting nagios performance

Speeddymon
Posts: 3
Joined: Sun Nov 06, 2011 10:06 am

Troubleshooting nagios performance

Post by Speeddymon »

Hi there. I want to start out by saying that I am not the nagios admin for my site, just a tech, so I have to propose any changes to the nagios admin. This also means I don't have root access to the nagios server, only a standard user shell, so any files that can only be read by root are off-limits to me. That said, I'm digging into this because I can't seem to get anyone else to; they would rather just live with it.

Basics:

Debian Etch with the latest OS updates
Nagios 3.0.4 (I know, outdated, not able to get them to upgrade due to the nature of the setup)
Dell Server
2x Xeon E5430 (4 cores per CPU, 8 cores total)
32GB RAM, 8GB swap
LSI Logic SAS1068E RAID controller with an unknown number of drives, total drive size visible to OS is ~140GB

Box runs nagios, apache, memcached and mysql
Disk usage is minimal; I tested it myself by watching iostat while doing a dd of /dev/zero to a file: blocks written per second jumped ten-fold when I started the dd and dropped back down when I killed it.
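That test was just something along these lines (file path and size here are illustrative, not exactly what I ran):

iostat -m 5 &                                    # watch MB written per second in 5s intervals
dd if=/dev/zero of=/tmp/ddtest bs=1M count=2048  # generate a burst of writes
rm -f /tmp/ddtest
kill %1                                          # stop the background iostat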

We have roughly 3000 remote clients running NRPE, with anywhere from 8 to 25 services to check per client; the average is about 22, based on the total number of services divided by the number of hosts.
We also have a handful of passive checks sending results through NSCA.
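For context, the active checks are plain check_nrpe calls from the Nagios server, and the passives push results in with send_nsca; roughly like this (host names, command names and paths are illustrative, not our real ones):

# active check, initiated by the Nagios server
/usr/lib/nagios/plugins/check_nrpe -H client.example.com -c check_load

# passive result, pushed from a remote client to our NSCA daemon
# fields are tab separated: host, service, return code, plugin output
printf 'client.example.com\tSome Service\t0\tOK - all good\n' | \
    send_nsca -H nagios.example.com -c /etc/send_nsca.cfg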

The server doesn't act overloaded at the console, though it has a constant load of around 4.0 to 7.0.
Checking sar didn't reveal any spikes; CPU, memory, bandwidth and disk usage are all fairly constant.

All of the hosts are outside of our network. We traverse the public internet to do the service checks, but iptables and the network firewalls on the remote end are set up to allow access to the machines only from our IP.

In total, according to tac.cgi, nagios is monitoring the following numbers of hosts and services:
2556 hosts
57595 services

It takes anywhere from 10 to 30 seconds to load any page in nagios, sometimes longer, which is why I'm convinced there is a performance problem that can be fixed. It got so bad that they wrote a custom service status display that shows just the pertinent information we need, in as small a space as possible, when services or hosts are down; it is cached and updated at regular intervals.
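That workaround amounts to something along the lines of the sketch below; the URL, paths and credentials are assumptions based on a stock Debian layout, not our actual script (servicestatustypes=28 limits the output to warning/unknown/critical services):

# crontab entry: snapshot the slow CGI view once a minute and serve the static copy
* * * * * wget -q -O /var/www/status-snapshot.html \
    --http-user=nagiosadmin --http-password=secret \
    'http://localhost/nagios3/cgi-bin/status.cgi?host=all&servicestatustypes=28'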

Code:

vmstat -a -S m is as follows:
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free  inact active   si   so    bi    bo   in   cs us sy id wa
30  0      0   3913   1843  10682    0    0     2  1021    0    0 22 26 50  2

iostat -m is as follows:
Linux 2.6.18-5-amd64 (hou-nagios-01)    11/06/2011

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          22.15    0.00   25.64    2.24    0.00   49.97

Device:            tps    MB_read/s    MB_wrtn/s    MB_read    MB_wrtn
sda             105.15         0.01         7.97     182212  108281113

Some tidbits from sar:

sar -q
07:35:01 AM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15
07:45:01 AM        13       309      5.94      5.20      3.94
07:55:01 AM        27       384      5.00      6.33      5.29
08:05:01 AM        29       361      5.11      5.20      5.18
08:15:01 AM        22       373      5.26      5.49      5.38
08:25:01 AM        24       364      4.76      5.26      5.33
08:35:01 AM        25       374      4.61      5.29      5.36
08:45:01 AM        25       367      5.07      4.90      5.09
08:55:01 AM        21       334      6.33      5.78      5.36
09:05:01 AM        25       365      5.47      6.57      6.10
09:15:01 AM        18       401      5.32      5.54      5.78
09:25:01 AM        20       346      8.53      8.77      7.05
Average:           24       341      3.47      3.95      4.02

sar -r:
07:35:01 AM kbmemfree kbmemused  %memused kbbuffers  kbcached kbswpfree kbswpused  %swpused  kbswpcad
07:45:01 AM   2426028  30516680     92.64   2125480   3637164   7815540        72      0.00         0
07:55:01 AM   5160228  27782480     84.34   2130264   3442540   7815540        72      0.00         0
08:05:01 AM   3926600  29016108     88.08   2014836   3640432   7815540        72      0.00         0
08:15:01 AM   4351956  28590752     86.79   2016936   3568480   7815540        72      0.00         0
08:25:01 AM   6942488  26000220     78.93   2018784   3157736   7815540        72      0.00         0
08:35:01 AM   4382444  28560264     86.70   2020500   3508536   7815540        72      0.00         0
08:45:01 AM   6334820  26607888     80.77   2022524   3351132   7815540        72      0.00         0
08:55:01 AM   5171932  27770776     84.30   2024292   3448148   7815540        72      0.00         0
09:05:01 AM   3895784  29046924     88.17   2025908   3630532   7815540        72      0.00         0
09:15:01 AM   6744584  26198124     79.53   2027852   3305528   7815540        72      0.00         0
09:25:01 AM   4127708  28815000     87.47   2029540   3402320   7815540        72      0.00         0
Average:      2564441  30378267     92.22   2138068   4511420   7815540        72      0.00         0

sar -d:
07:35:01 AM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
07:45:01 AM    dev8-0    129.71      2.51  23335.02    179.92     26.26    202.42      1.60     20.82
07:55:01 AM    dev8-0    152.66      0.73  29636.60    194.14     34.00    222.69      1.72     26.23
08:05:01 AM    dev8-0    154.73      0.81  30662.14    198.17     40.60    262.39      2.02     31.20
08:15:01 AM    dev8-0    108.70      0.00  17673.97    162.59     18.84    173.28      1.44     15.61
08:25:01 AM    dev8-0    110.64      0.01  17636.29    159.40     19.26    174.06      1.44     15.88
08:35:01 AM    dev8-0    115.07      0.00  17787.41    154.57     20.53    178.45      1.49     17.17
08:45:01 AM    dev8-0    112.28      0.00  17735.88    157.95     19.89    177.16      1.48     16.59
08:55:01 AM    dev8-0    105.79      0.04  17572.45    166.11     18.25    172.55      1.43     15.15
09:05:01 AM    dev8-0    110.68      0.00  17701.31    159.93     18.13    163.85      1.40     15.48
09:15:01 AM    dev8-0    113.03      3.55  17760.53    157.16     18.91    167.27      1.40     15.87
09:25:01 AM    dev8-0    114.16      0.01  17780.57    155.75     19.44    170.31      1.42     16.18
Average:       dev8-0    113.06     43.41  18208.13    161.43     20.07    177.50      1.48     16.70

sar -u:
07:35:01 AM       CPU     %user     %nice   %system   %iowait    %steal     %idle
07:45:01 AM       all     20.24      0.00     24.68      3.76      0.00     51.33
07:55:01 AM       all     20.59      0.00     24.54      4.88      0.00     49.99
08:05:01 AM       all     19.46      0.00     22.39      8.20      0.00     49.95
08:15:01 AM       all     20.44      0.00     25.42      2.67      0.00     51.47
08:25:01 AM       all     20.54      0.00     26.03      3.03      0.00     50.39
08:35:01 AM       all     21.00      0.00     25.85      3.14      0.00     50.01
08:45:01 AM       all     20.31      0.00     24.83      3.52      0.00     51.33
08:55:01 AM       all     20.55      0.00     26.11      2.27      0.00     51.06
09:05:01 AM       all     20.50      0.00     26.16      1.93      0.00     51.41
09:15:01 AM       all     20.25      0.00     26.00      2.23      0.00     51.52
09:25:01 AM       all     34.09      0.00     24.00      2.02      0.00     39.89
Average:          all     21.54      0.00     25.30      2.75      0.00     50.41

Now, when I check top and show per-core cpu usage, I see something interesting:

top - 09:33:36 up 157 days,  4:12,  4 users,  load average: 6.82, 8.38, 7.60
Tasks: 277 total,   5 running, 271 sleeping,   0 stopped,   1 zombie
Cpu0  :  6.0%us, 10.0%sy,  0.0%ni, 82.4%id,  0.3%wa,  0.0%hi,  1.3%si,  0.0%st
Cpu1  : 31.9%us, 26.6%sy,  0.0%ni, 41.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 20.9%us, 30.6%sy,  0.0%ni, 48.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  4.0%us, 13.7%sy,  0.0%ni, 82.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  : 24.3%us, 20.9%sy,  0.0%ni, 54.8%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  : 11.6%us, 33.9%sy,  0.0%ni, 54.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  4.7%us, 32.7%sy,  0.0%ni, 62.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  : 99.7%us,  0.3%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  32942708k total, 29316272k used,  3626436k free,  2030984k buffers
Swap:  7815612k total,       72k used,  7815540k free,  3497496k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                                                         
21420 www-data  25   0  450m 365m 5604 R  100  1.1   2:14.45 apache-ssl                                                                                                                                                                      
29854 nagios    25   0  157m 138m  880 R   75  0.4   1581:17 nagios                                                                                                                                                                          
29011 nagios    21   0 16440 4572  748 S    1  0.0   5:38.80 nrpe                                                                                                                                                                            
 8131 www-data  15   0  451m 365m 5604 S    1  1.1   1:41.86 apache-ssl                                                                                                                                                                      
31316 nagios    24   0  4828  584  480 S    1  0.0   0:00.02 check_nrpe                                                                                                                                                                      

It looks like nagios is checking up to 50 services at a time across various hosts:

pgrep check_nrpe |wc -l
50
What I gather from the top output above is that the work is not being spread across the cores as evenly as it could be.
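One thing any standard user can check is which core each of the heavy processes was last scheduled on, and how the per-core load actually sits over time; something like:

ps -eo pid,psr,pcpu,comm | egrep 'nagios|nrpe|apache' | sort -k2 -n   # psr = last CPU the process ran on
mpstat -P ALL 5 3                                                     # per-core utilization, three 5-second samples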

If additional stats from the server would help, let me know and I'll get them if I have permissions.
jsmurphy
Posts: 989
Joined: Wed Aug 18, 2010 9:46 pm

Re: Troubleshooting nagios performance

Post by jsmurphy »

I ran into the exact same problem once... I took the windows solution route and just restarted the server, and that made it go away. It hasn't returned yet, so if you haven't already, give that a go. I still have no idea what caused it.
Speeddymon
Posts: 3
Joined: Sun Nov 06, 2011 10:06 am

Re: Troubleshooting nagios performance

Post by Speeddymon »

We restart nagios regularly. Do you mean reboot the server itself? Its uptime is 151 days and it has been like this the whole time, but I'll see if we can.
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: Troubleshooting nagios performance

Post by mguthrie »

This might be worth a look: it's a presentation from the Nagios World Conference 2011 by a guy who's monitoring 1.4 million services. His suggestions are as good as any I've read.
http://exchange.nagios.org/directory/Mu ... ny/details

The only thing I'd mention that's specific to your scenario is to offload mysql onto a different machine. (This doc is for Nagios XI, but most of it is the same for a Core install).
http://assets.nagios.com/downloads/nagi ... p#boosting
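If the MySQL traffic on that box is coming from NDOUtils (an assumption; the post only says mysql is running), pointing it at a remote DB server is just a handful of lines in ndo2db.cfg. Hostnames and credentials below are placeholders:

# ndo2db.cfg (path varies by install)
db_servertype=mysql
db_host=db.example.com      # remote MySQL server instead of localhost
db_port=3306
db_name=nagios
db_user=ndoutils
db_pass=changeme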
jsmurphy
Posts: 989
Joined: Wed Aug 18, 2010 9:46 pm

Re: Troubleshooting nagios performance

Post by jsmurphy »

Speeddymon wrote:We restart nagios regularly. Do you mean reboot the server itself? Its uptime is 151 days and it has been like this the whole time, but I'll see if we can.
Yep, reboot the server itself, but you then have to go and take a long shower to try and wash off the filth of using a windows fix :lol:.

I also just noticed you had 57,000 services, not 5,700 like I initially read... so mguthrie's suggestions are also a pretty great idea; I would even venture that they are in fact a better idea :D.
jfreund
Posts: 2
Joined: Thu Jan 19, 2012 9:52 pm

Re: Troubleshooting nagios performance

Post by jfreund »

Created an account just to respond to this, as I haven't found an answer anywhere (and am wondering if there is one).

I have a similar setup to yours in terms of service checks, currently running around 37k, on a similarly spec'd system running CentOS 5.7 and Nagios Core 3.2.3. I find this high page load time problem manifests when the service check count gets... high. Below 10k you don't notice much of an issue, maybe 5 seconds, which everyone can tolerate, but at 37k my page load times for the Hostgroups page are abysmal, around 18-20 seconds. Unusable for operations teams, in my opinion. What's worse is we're looking at upping checks to over 100k, which means I need a solution if I'm to continue using Nagios.

My problem became apparent when my customer (I run Nagios as a service for projects within my company) reported that his page load times for the Hostgroups page had become abysmal, around 28 seconds. I played with all the tweaks I could think of, like moving files to ram disks, turning on the large installation tweaks, and others that I found in the docs or via googling, and the fastest I could get the pages to load was around 18-20 seconds (at least some improvement!).
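For anyone else chasing the same thing, the tweaks in question are mostly nagios.cfg directives plus putting the status and check-result files on tmpfs; roughly like this (paths are the common source-install defaults, adjust to your layout):

# /etc/fstab - keep status.dat and check results in RAM
tmpfs  /usr/local/nagios/var/spool  tmpfs  size=512m  0  0

# nagios.cfg
use_large_installation_tweaks=1
enable_environment_macros=0
status_update_interval=30
status_file=/usr/local/nagios/var/spool/status.dat
check_result_path=/usr/local/nagios/var/spool/checkresults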

I then began looking at what resource limit I was hitting on the system. The Hostgroups page is served by status.cgi, as are quite a few of Nagios' web pages. You can watch status.cgi load the status.dat file where nagios keeps its status, which makes sense since it basically has to parse that file to display your data. I checked I/O wait and there was none, which also makes sense because I'm using a ram disk for storing status.dat. By watching the open file handles on the status.cgi process while it runs, I can see it reads my status.dat in only about 3 seconds, maybe less. What I then see is status.cgi chug on one core and max it out; I imagine it's parsing the data, albeit slowly. So there's the limit: status.cgi is single-threaded, and no matter how many cores you have, you're limited by its efficiency and the speed of a single core.
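To take Apache out of the picture entirely, the CGI can be timed straight from a shell; roughly like this (the CGI path and the contact name in REMOTE_USER depend on the install):

# time the same query the Hostgroups summary page makes
cd /usr/local/nagios/sbin      # Debian packages keep the CGIs under /usr/lib/cgi-bin/nagios3
time env REMOTE_USER=nagiosadmin QUERY_STRING='hostgroup=all&style=summary' \
    ./status.cgi > /dev/null

# in a second terminal, while the CGI is still running, confirm it has status.dat open:
ls -l /proc/$(pgrep -n -f status.cgi)/fd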

I don't imagine there's a solution for this via configuration or hardware... you've got a program opening a file, then parsing it, and it's doing it serially. The only fix I can see is for Nagios' status.cgi to be rewritten to perform better. Perhaps pre-parsing status.dat, or formatting it in some way that improves parsing performance, but eventually you'll still hit the single-core limit of your system as long as status.cgi is single-threaded.

I still need to update to the latest Nagios but I see nothing in the release notes that indicates improvements to status.cgi.

Friendly word of advice on other front-ends for this: avoid vshell. Using vshell, the page load times for Hostgroups went into the 6-minute-plus range on my system.
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: Troubleshooting nagios performance

Post by mguthrie »

I think this post would make an interesting discussion on the nagios-devel mailing list on SourceForge. The Core Development Team maintains the Core CGIs, and they can probably give the best discussion on this. These are good insights though, and I appreciate you posting what you did.

jfreund wrote:Friendly word of advice on other front-ends for this: avoid vshell. Using vshell, the page load times for Hostgroups went into the 6-minute-plus range on my system.
Agreed. I wrote V-Shell, and I can't say I would currently recommend it for installs over 10k checks. Scaling issues are my next TODO for that project, but even with lots of tweaking of the code, I'm doubtful whether PHP will be able to compete with a compiled CGI on a system as large as yours. Thanks for posting the numbers though; it's good to know as I continue development on it.
jfreund
Posts: 2
Joined: Thu Jan 19, 2012 9:52 pm

Re: Troubleshooting nagios performance

Post by jfreund »

I have played with the Thruk and check_mk Multisite front-ends, which use a module that gives direct access to Nagios status data (and I imagine other stuff, like config, too) in memory rather than parsing it from a file on disk, and these are markedly faster.
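The module being referred to is MK Livestatus, which exposes the live status over a unix socket and a simple query language; a minimal query looks roughly like this (the socket path depends on how the broker module was configured):

# unixcat ships with mk-livestatus
printf 'GET hostgroups\nColumns: name num_hosts num_services_crit\n' | \
    unixcat /usr/local/nagios/var/rw/live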

I'll throw a little data out, though I've not been terribly scientific; my investigation has only been cursory so far. I mentioned my Nagios setup's Host Groups page via status.cgi loads, at best, in around 18-20 seconds. I found that Multisite loads that view in about 6-8 seconds, which is a big improvement but not great, since I plan to triple my service checks soon, which might put me back in the bad 18-20 second range I had before, assuming linear scaling. However, Multisite also gives me a Host Group summary page that provides a slightly different view: a list of host groups with a quick summary of their status, including the name, total host count, and counts per status type. You can click on them individually to drill down into the groups. This is the view I can see ops guys using most of the time... quick glances at service or host group summaries, then drilling down into problem areas. This view comes up in under a second, which is awesome.

I have yet to play with anything that uses a DB back-end for storing data which... might be fast as well.

I'm still playing around but I'm a little excited at the potential of these and similar interfaces.
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: Troubleshooting nagios performance

Post by mguthrie »

Yeah, from what I'm hearing, mklivestatus + multisite is the best way to go for really large implementations. See the link above for Dan Wittenburg's presentation notes. NDOUtils is good if you need a standard DB backend and historical data for reports, but it doesn't run quite as lean as the MK stuff.
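For reference, hooking Livestatus into a Core install is a one-line broker module entry in nagios.cfg; the module and socket paths below are the common defaults and will vary by install:

# nagios.cfg
event_broker_options=-1      # let broker modules receive all event data
broker_module=/usr/local/lib/mk-livestatus/livestatus.o /usr/local/nagios/var/rw/live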