
Interesting Load on Nagios Server

Posted: Mon Apr 29, 2013 11:08 am
by Smark
Hello everyone,

We're ramping up our Nagios XI deployment, and performance has been fine so far. Nagios is running on a VM with 4 cores and 4GB of RAM.

Here is our current number of checks. Ignore the unhandled/critical issues; we haven't tuned our thresholds yet. The majority of the service checks are some sort of WMI query via check_wmi_plus.pl.
[image]
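For anyone following along, a WMI-based service check like this is typically wired up through a command definition. A rough sketch, with hypothetical macro slots and thresholds (check_wmi_plus.pl's actual arguments vary by version and mode):

```
# Hypothetical command definition -- adjust paths, credentials, and thresholds.
define command {
    command_name    check_wmi_cpu
    command_line    $USER1$/check_wmi_plus.pl -H $HOSTADDRESS$ -u $USER7$ -p $USER8$ -m checkcpu -w 80 -c 90
}
```

Here `-m checkcpu` selects the CPU mode and `-w`/`-c` set warning/critical thresholds; `$USER7$`/`$USER8$` are assumed to hold the WMI credentials.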

We're looking to potentially decrease the time between service checks (i.e., check more often) and were wondering what impact that would have on the system. Here are our current system load graphs:
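For reference, the check frequency is set per service (or via a template). A sketch with hypothetical names; note that `check_interval` is multiplied by `interval_length` (60 seconds by default), so 5 means every five minutes:

```
define service {
    use                  generic-service
    host_name            winserver01        ; hypothetical host
    service_description  CPU Load
    check_command        check_wmi_cpu
    check_interval       5                  ; normal_check_interval in older versions
    retry_interval       1
}
```

Halving `check_interval` roughly doubles the check throughput Nagios has to schedule, which is why load impact is worth measuring first.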

Localhost: Current_Load (12 hours):
[image]

Localhost: Current_Load (3 days):
[image]

This looks indicative of some sort of garbage collection or scheduled cleanup task. Can anyone explain why the graphs would look like this? I can always throw more CPU at it if necessary. From what I've read, if the load is greater than the number of cores, you may see performance issues.
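The rule of thumb above is easy to check from a shell on the Nagios host. A minimal Linux sketch comparing the 1-minute load average to the core count:

```shell
# Compare the 1-minute load average to the number of CPU cores (Linux).
cores=$(nproc)
load=$(cut -d' ' -f1 /proc/loadavg)
echo "1-min load: ${load}, cores: ${cores}"

# Rule of thumb: sustained load above the core count suggests CPU contention.
if awk -v l="$load" -v c="$cores" 'BEGIN { exit !(l > c) }'; then
    echo "load exceeds core count"
else
    echo "load within core count"
fi
```

Brief spikes above the core count (like scheduled cleanup jobs) are usually harmless; it's sustained load above it that hurts check latency.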

I was also curious to know if there is a way to parallelize the check tasks. I have looked at the Nagios performance write-up that says to enable the "large environment tweaks" variable, but it appears to already be set. I'm using the Nagios XI Enterprise OVA VM download. Is there any way to gather more information or visually inspect the check queue? I'd be interested in seeing upcoming checks.
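One way to get hard numbers on the scheduler without the UI is the `nagiostats` utility that ships with Nagios Core (the path below is assumed from a default XI/source install; the exact MRTG variable names are documented in the Nagios Core manual and may differ by version):

```
# Full summary, including check latency and execution times:
/usr/local/nagios/bin/nagiostats

# Or pull specific metrics in machine-readable form:
/usr/local/nagios/bin/nagiostats --mrtg --data=AVGACTSVCLAT,AVGACTSVCEXT
```

High average active-service latency is the usual sign that checks are queuing up faster than the scheduler can run them.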

Thanks,
Smark

Edit: Added additional info about our deployment.

Re: Interesting Load on Nagios Server

Posted: Mon Apr 29, 2013 11:31 am
by slansing
These load spikes could be due to Nagios XI doing its scheduled cleaning, database trimming, etc. I notice the same spikes on one of my local test boxes. One thing I can suggest for your expansion is integrating Mod Gearman into your environment, either locally or remotely; it will dramatically reduce your load. Have a look at the document here:

http://assets.nagios.com/downloads/nagi ... ios_XI.pdf
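For a rough sense of what that integration involves, Mod Gearman is loaded as an event-broker module plus a small worker config. A sketch with assumed paths (package locations and option names vary by version; the linked PDF is authoritative):

```
# In nagios.cfg (module path varies by package):
broker_module=/usr/lib/mod_gearman/mod_gearman.o config=/etc/mod_gearman/module.conf

# In /etc/mod_gearman/module.conf (minimal):
server=localhost:4730
services=yes
hosts=yes
eventhandler=yes
```

Checks are then handed off to gearman workers (local or on separate boxes), which takes check execution load off the Nagios server itself.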

That said, your spikes occur more frequently, which could be attributed to a more intense check, or a group of checks that require more resources. Have you looked at the spike times and tried to correlate them with specific checks?
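One hedged way to hunt for the culprits: Nagios writes per-check execution times into status.dat. Assuming the default path and the usual block layout (field names may vary by version), something like this lists the slowest services:

```
# List the ten slowest service checks from status.dat (path/format assumed).
awk -F= '
  /^servicestatus/            { svc = ""; t = 0 }
  $1 ~ /service_description/  { svc = $2 }
  $1 ~ /check_execution_time/ { t = $2 }
  /^[\t ]*}/ && svc           { print t, svc; svc = "" }
' /usr/local/nagios/var/status.dat | sort -rn | head
```

If the slowest checks line up with the spike times on the graph, that points at the checks rather than housekeeping.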