Performance Problems

Posted: Fri Sep 03, 2010 10:37 am
by afitch
I've got Nagios XI monitoring about 800 hosts with 3300 services. Most of the services are the provided Perl SNMP checks, along with a lot of SSH checks that run Perl or bash scripts and return information. I've thrown lots of CPU and RAM at the problem, but that hasn't fixed it. I now have it set up with 8 CPUs and 16GB of RAM, with an upgraded PAE kernel to utilize the RAM. Anyway, mysqld runs at 80-210% CPU all the time, and Nagios runs at 50-100% as well. There are times when I click on a page and it takes forever to get a response, and I can never view a detail page.

I'm running the SSH checks because of SNMP problems with some hosts. I've made various tweaks to no avail. Any suggestions?

Re: Performance Problems

Posted: Fri Sep 03, 2010 11:14 am
by mguthrie
I think a server that's running that kind of load might be a good candidate for a distributed monitoring setup. There are a few options for that.

DNX is a load distributor for Nagios. I haven't used it myself yet, but we just got it tested and documented.
http://library.nagios.com/library/speci ... ith-nagios

Also, we're coming out with a new product called Nagios Fusion, which is used for a distributed monitoring setup. Here's the info page for it, and I believe the release date is set for Oct. 1st.
http://www.nagios.com/products/nagiosfusion

Our other techs might have some other ideas for tweaking your existing server, but I'll have to defer to them for performance tweaking ideas.

Re: Performance Problems

Posted: Fri Sep 03, 2010 11:51 am
by afitch
Fusion looks like the way to go. So for each server (or node) you have, you need an XI license, right? And then one or two Fusion licenses, depending on the level of HA? I agree with your diagnosis; I probably need one server per 400-500 hosts, depending on the number of checks. Eventually our server team will be handing off Nagios to the operations team, so Fusion would help.

Re: Performance Problems

Posted: Fri Sep 03, 2010 12:12 pm
by mguthrie
Yeah, I couldn't tell you offhand what the licensing situation would be, but I do know we offer a discount that goes up with the number of licenses you buy. Feel free to fire any questions on that to our sales team (sales(at)nagios.com).

Re: Performance Problems

Posted: Fri Sep 03, 2010 4:36 pm
by mmestnik
You need faster/more disks. Try disabling MySQL sync and flushes; with the RAM you have, many applications won't make use of it because they insist on transactional concurrency. Nagios can also be told to flush less often.

How are you getting these usage statistics? I'd assume that your I/O wait time is the leading usage of CPU.
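
One way to check the I/O wait assumption is to sample the aggregate "cpu" line of /proc/stat twice and look at how much of the elapsed CPU time landed in the iowait column. The following is only a minimal sketch: the function names are made up for illustration, and it assumes a Linux kernel that exposes the usual user, nice, system, idle, iowait, irq, softirq, steal field order.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of tick counts."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Field order as documented for /proc/stat; older kernels may omit the last ones.
    names = ["user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal"]
    values = [int(v) for v in fields[1:1 + len(names)]]
    return dict(zip(names, values))


def iowait_percent(before, after):
    """Percentage of elapsed CPU time spent in iowait between two samples."""
    deltas = {key: after[key] - before[key] for key in before}
    total = sum(deltas.values())
    return 100.0 * deltas["iowait"] / total if total else 0.0


def read_cpu_sample(path="/proc/stat"):
    """Read one sample; the aggregate 'cpu' line is the first line of the file."""
    with open(path) as stat_file:
        return parse_cpu_line(stat_file.readline())
```

Calling read_cpu_sample() twice a few seconds apart and passing both results to iowait_percent() gives a rough figure. A consistently high value would point at the disks rather than at the CPUs.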

Try using a ramfs and calling rsync every 5min or so to flush that to disk backing store that's loaded into ram on boot.

Re: Performance Problems

Posted: Tue Sep 07, 2010 3:06 pm
by afitch
All of your responses make sense. We have an enormous VMware cluster and petabytes' worth of storage. VMware is ultimately managing all the disk. EMC screwed us with pricing of their fiber driver and we only have a single path to all our Tier 1 storage, so that's the bottleneck. I'll find the sync and flush variables and make some adjustments.

I don't have a good enough understanding of the nagios backend to understand the below statement.

"Try using a ramfs and calling rsync every 5min or so to flush that to disk backing store that's loaded into ram on boot."

Does the ndo module automatically import and export the config and state (using rsync) to the database?

Thanks again. -jb

Re: Performance Problems

Posted: Wed Sep 08, 2010 10:22 am
by mmestnik
"Try using a ramfs and calling rsync every 5min or so to flush that to disk backing store that's loaded into ram on boot."
This comment was actually unrelated to Nagios; see these links for more information.

http://www.thegeekstuff.com/2008/11/ove ... -on-linux/
http://sial.org/howto/rsync/ OR http://oreilly.com/pub/h/41
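
The ramfs-plus-rsync idea quoted above can be sketched roughly as follows. This is an illustration only, not a tested recipe for Nagios: the function names and the example paths are hypothetical, and the only external tool assumed is rsync with its standard -a and --delete options.

```python
import subprocess


def build_rsync_command(source_dir, dest_dir):
    """Build an rsync call that mirrors source_dir into dest_dir."""
    # A trailing slash on the source makes rsync copy the directory's
    # contents rather than the directory itself.
    source = source_dir.rstrip("/") + "/"
    dest = dest_dir.rstrip("/") + "/"
    # -a preserves permissions and timestamps; --delete removes files
    # from the destination that no longer exist in the source.
    return ["rsync", "-a", "--delete", source, dest]


def flush_to_disk(ram_dir, disk_dir, runner=subprocess.run):
    """Copy the RAM-backed directory to its on-disk backing store."""
    return runner(build_rsync_command(ram_dir, disk_dir), check=True)


def restore_from_disk(disk_dir, ram_dir, runner=subprocess.run):
    """Reload the RAM-backed directory from disk, for example at boot."""
    return runner(build_rsync_command(disk_dir, ram_dir), check=True)
```

A cron job could call something like flush_to_disk("/mnt/ramdisk/data", "/var/backup/ramdisk-data") every five minutes, with restore_from_disk run once at boot after the RAM filesystem is mounted; both paths are placeholders. Anything written since the last flush is lost if the machine goes down, so this only suits data that can tolerate that.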

Re: Performance Problems

Posted: Wed Sep 15, 2010 10:21 am
by afitch
I installed two more XI machines, for a total of three, and installed the DNX extension (http://dnx.sourceforge.net/). I now have one master and two clients. This dropped the load average from 12 down to 4 on the master, with the 2nd and 3rd machines running around 2.50. Big help. I'm still implementing the other changes and chasing down plugin timeouts. I currently have 800 hosts and 2300 checks. I've got 500 Windows servers to add yet. I'll probably add another two or three DNX clients for that. Thanks everyone.