Switching to distributed monitoring

Post by **cwscribner** » Mon Aug 08, 2011 11:28 am

Are there any server spec guidelines for switching to a distributed monitoring system? What would be a recommended hardware setup for offloading MySQL from a primary Nagios server and only having two servers running?

Post by **nscott** » Mon Aug 08, 2011 6:29 pm

The unfortunate thing is that there is no hard-and-fast metric, because different checks make different demands on the machine. These requirements are therefore largely conjecture, but still worth considering. Monitoring switches is very taxing on the CPU and disk, whereas receiving passive checks is significantly easier on the CPU. However, I would venture to guess that the 'average' install of Nagios XI has about 8 active checks per host. In your situation, that's about 12800 service checks. Assuming the check interval for those checks is 5 minutes, that's 12800 / 5 / 60 = 42.67 checks per second.
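The arithmetic in this estimate can be reproduced with a small sketch. The function name `checks_per_second` is hypothetical, and the calculation assumes every check runs exactly once per interval, which is the same simplification the post makes.

```python
def checks_per_second(total_checks: int, interval_minutes: float) -> float:
    """Average check rate if every check runs once per interval."""
    return total_checks / (interval_minutes * 60)


# Figures from the post: 12800 service checks on a 5-minute interval.
rate = checks_per_second(12800, 5)
print(f"{rate:.2f} checks per second")  # prints "42.67 checks per second"
```

Changing the interval is the quickest way to see how sensitive the load is: doubling it to 10 minutes halves the rate.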

Now, on our benchmark we had ~3500 checks before the UI became sluggish enough to deem it borderline. Our benchmark box is a single-core, 3 GHz, 32-bit Pentium 4 with 3 GB of RAM. That boils down to 8.6 checks per second. Upon offloading the MySQL database, the CPU usage halved, so conceivably, taking into account possible diminishing returns, we could (conservatively) ramp that up to 13 checks per second. All of this on a single-core, 32-bit Pentium 4 machine.
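To put the benchmark figures next to the estimated load, here is a rough sketch. It assumes capacity scales linearly with the number of benchmark-class machines, which the post does not claim, so the result is only a relative yardstick and not a sizing recommendation. The name `boxes_needed` is hypothetical.

```python
import math

BENCHMARK_RATE = 8.6   # checks/s with MySQL on the same box (from the post)
OFFLOADED_RATE = 13.0  # conservative checks/s with MySQL offloaded (from the post)
REQUIRED_RATE = 12800 / (5 * 60)  # estimated load from the previous paragraph


def boxes_needed(required: float, per_box: float) -> int:
    """Benchmark-class machines needed, ASSUMING linear scaling."""
    return math.ceil(required / per_box)


print(boxes_needed(REQUIRED_RATE, BENCHMARK_RATE))  # prints 5
print(boxes_needed(REQUIRED_RATE, OFFLOADED_RATE))  # prints 4
```

A modern multi-core server is not a single-core Pentium 4, so these counts overstate what real hardware would need.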

Now for the MySQL slave: that would take significantly less horsepower, though I don't have any hard numbers for it. If I had to venture a guess, I would say that a dual-core processor with a hefty amount of RAM would more than cover your needs, as the biggest things you'd need to worry about are I/O times and filling the cache.

However, I am currently in the middle of researching this issue for a presentation, so I'd expect a doc on it sometime in mid-September.

> only having two servers running?

Do you mean one Nagios XI server with a MySQL server slaved?

Post by **cwscribner** » Mon Aug 08, 2011 8:08 pm

Yes. One for checks, one for MySQL.

My system specs are in my signature. It's a quad-core with several gigs of RAM checking ~2150 hosts and ~750 services. The monitoring widget is showing ~300 (+/- 15) checks per minute. Something seems amiss here, as your benchmark machine is much more modest but is outperforming the server that I'm on. I don't entirely understand why that's happening.
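For a like-for-like comparison, the widget's per-minute figure can be converted to the per-second unit used for the benchmark. This sketch assumes the widget counts the same kind of checks as the benchmark did, which the thread does not confirm; the function name is hypothetical.

```python
BENCHMARK_RATE = 8.6  # checks/s reported for the benchmark box earlier in the thread


def per_minute_to_per_second(checks_per_minute: float) -> float:
    return checks_per_minute / 60


# ~300 (+/- 15) checks per minute, as shown by the monitoring widget.
low, mid, high = (per_minute_to_per_second(v) for v in (285, 300, 315))
print(f"{low:.2f} to {high:.2f} checks per second")  # prints "4.75 to 5.25 checks per second"
print(f"{mid / BENCHMARK_RATE:.0%} of the benchmark rate")  # prints "58% of the benchmark rate"
```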

Post by **mguthrie** » Tue Aug 09, 2011 9:27 am

Did you increase the reaper frequency in the nagios.cfg to process a higher volume of check results?

check_result_reaper_frequency=3
max_check_result_reaper_time=10
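To confirm which values are actually set, a quick sketch like the following can read them back. It assumes the main config file uses simple `key=value` lines with `#` comments; the function name `parse_main_config` and the sample text are hypothetical, and on a real system you would pass in the contents of your own nagios.cfg.

```python
def parse_main_config(text: str) -> dict[str, str]:
    """Parse key=value lines, skipping blanks and '#' comments.

    Directives that may legitimately repeat (for example cfg_file)
    keep only their last value here, which is fine for this purpose.
    """
    settings: dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


# Illustrative fragment, not a full nagios.cfg.
SAMPLE = """\
# check result processing
check_result_reaper_frequency=3
max_check_result_reaper_time=10
"""

settings = parse_main_config(SAMPLE)
for name in ("check_result_reaper_frequency", "max_check_result_reaper_time"):
    print(name, "=", settings.get(name, "<not set>"))
```

Both values are in seconds: the first is how often Nagios processes queued check results, the second caps how long a single processing pass may run.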

If you're using checks with SNMP, WMI, or ESX, these take substantially more CPU than the checks used for the benchmark (check_icmp, check_http, check_dns).

Also, make sure you tweak the settings on the Admin->Performance settings page. Using the unified views and increasing the refresh multiplier can make a substantial difference in performance. If you have multiple users accessing XI at once, this will also multiply the number of AJAX calls being made to the server and take its toll on performance.

Post by **cwscribner** » Tue Aug 09, 2011 9:38 am

The reaper changes have been made. I do have a very long list of SNMP checks for UPS devices: ~600 services. Is there any way I can get around that to speed things up?

Post by **mguthrie** » Tue Aug 09, 2011 9:56 am

Hmm, yeah, that's a big CPU grab. What's your average CPU load, and do you see anything consistently showing up in the top 5 by CPU% when you run top? (MySQL should be #1, and httpd #2.)
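If you would rather capture this in a script than watch top interactively, a sketch along these lines can rank processes by CPU. It assumes a Linux `ps` that accepts `-eo pcpu,comm`; the helper names are hypothetical. Note that the %CPU reported by `ps` is averaged over each process's lifetime, so it will not match the instantaneous figures in top.

```python
import subprocess


def top_cpu(ps_output: str, n: int = 5) -> list[tuple[float, str]]:
    """Return the n highest (cpu_percent, command) rows from `ps -eo pcpu,comm` output."""
    rows = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue
        try:
            rows.append((float(parts[0]), parts[1].strip()))
        except ValueError:
            continue  # ignore rows whose first column is not a number
    return sorted(rows, key=lambda row: row[0], reverse=True)[:n]


def read_ps() -> str:
    """Run ps and return its raw text output (requires ps on the PATH)."""
    result = subprocess.run(
        ["ps", "-eo", "pcpu,comm"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

On the monitoring server, `top_cpu(read_ps())` returns the five rows to compare against the expected ordering.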

Post by **cwscribner** » Tue Aug 09, 2011 10:35 am

MySQL, httpd, nagios, ndo2db, and setroubleshootd are the top 5. When I ran top, setroubleshootd was hogging 50% of the CPU and then bounced around between 30% and 50%.

Post by **mguthrie** » Tue Aug 09, 2011 10:39 am

What is setroubleshootd? I'm not familiar with that process; it might be your CPU culprit.

Post by **cwscribner** » Tue Aug 09, 2011 10:46 am

From my brief research, it looks like it's related to web serving. I haven't the slightest idea what to do about it, though. Here's a semi-helpful link: http://www.linuxquestions.org/questions ... ry-634347/

Another: http://www.ezlinuxadmin.com/2011/05/set ... th-cpanel/

Post by **mguthrie** » Tue Aug 09, 2011 11:04 am

Here's a thought that I hadn't considered for your CPU usage. I don't know why I didn't think of it earlier, but it's possible that this is something that comes bundled with GNOME. I would consider uninstalling it. Also, if you want to get probably 30-40% of your CPU and memory back, I would consider restarting your server at runlevel 3 so that GNOME doesn't load at all. GNOME will eat your system resources and serves no real purpose on your monitoring server. All of our documentation and instructions are written for systems without it, so there isn't a real reason to use it.
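Before rebooting, it can help to check which runlevel the server boots into by default. The sketch below assumes a SysV-init style `/etc/inittab` with an `id:N:initdefault:` line, as found on Red Hat family systems of this era; the thread does not state the OS, and systemd-based systems do not use this file. The function name is hypothetical.

```python
import re
from typing import Optional

# Matches an uncommented line such as "id:5:initdefault:".
INITDEFAULT = re.compile(r"^id:(\d):initdefault:", re.MULTILINE)


def default_runlevel(inittab_text: str) -> Optional[int]:
    """Return the default runlevel from inittab text, or None if absent."""
    match = INITDEFAULT.search(inittab_text)
    return int(match.group(1)) if match else None


# Illustrative inittab fragment; read /etc/inittab on a real system.
SAMPLE = """\
# Default runlevel (3 = multi-user text mode, 5 = graphical)
id:5:initdefault:
"""

print(default_runlevel(SAMPLE))  # prints 5
```

A result of 5 means the graphical desktop starts at boot; changing that line to `id:3:initdefault:` makes runlevel 3 the default on such systems.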

Nagios Support Forum
