Load Testing Nagios XI - Problems w/ Large Service Count

mp4783 · Post by **mp4783** » Sun Feb 22, 2015 3:36 pm

I'll start with the question for those of you who don't want to read below:

What is the appropriate way to load test the Nagios XI server? By that I mean, how can I put a service check load on it that takes it to the point where it can no longer handle any more (assume checks that consume similar resources)?

My attempts to do my own load testing have resulted in problems on the server as detailed below.

I have been attempting to load test my Nagios XI server by creating a very large number of TCP service checks on a single host. I did this by writing a script that created service check entries in an import file, using a previously configured service check definition created via the GUI. The script simply incremented the TCP port number checked and the TCP port named in the description for a fixed number of loops.

My initial run produced a file with 1000 checks for TCP ports 2 through 1001. This file was then placed into the /usr/local/nagios/etc/import directory using the appropriate filename for the host. I then executed the reconfigure_nagios.sh utility.
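A generator script of the kind described above might look roughly like the following. This is only a minimal sketch, not the original script: the host name, the `generic-service` template, and the `check_tcp!<port>` command form are assumptions for illustration, and the service definition actually created through the GUI may use different names.

```python
def render_tcp_services(host, first_port, last_port, template="generic-service"):
    """Return Nagios object definitions, one TCP check per port (inclusive range).

    Assumptions (not from the original post): the template name and the
    check_tcp!<port> command form are placeholders for whatever the
    GUI-created service definition really uses.
    """
    blocks = []
    for port in range(first_port, last_port + 1):
        blocks.append(
            "define service {\n"
            f"    host_name            {host}\n"
            f"    service_description  TCP Port {port}\n"
            f"    use                  {template}\n"
            f"    check_command        check_tcp!{port}\n"
            "}\n"
        )
    # Blank line between definitions keeps the import file readable.
    return "\n".join(blocks)
```

Calling `render_tcp_services("loadtest-host", 2, 1001)` yields 1000 definitions, matching the first run; the resulting text would then be written to a file in the import directory before running the reconfigure step.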

The utility churned away for a considerable amount of time. The Nagios monitoring processes were down for over 2 minutes and 45 seconds while it imported the 1000 new services. It finally finished, and the new configuration was in place. The Nagios server then hummed along with about 1300 service checks without any issues.

I then wanted to increase the number of service checks by 9000. So I reran my script and created service check definitions for TCP ports 1002 through 10003. I followed the same procedure as above. This time the import failed and the service configuration file for the host was removed from /usr/local/nagios/etc/services. However, the entries still appeared in the Nagios GUI and were still present in the database.

I attempted to delete them through the GUI, which accepted the request, applied the changes, and then returned control. The services were still in place in the GUI and database and no service configuration file was written out for the host, which is what I would expect because I deleted all of the host's services.

I then decided to use the nagiosql_delete_service.php utility with the '--config' option to delete the approximately 9000 services on the host. The process took almost 50 minutes to complete on a very lightly loaded server. It appears to be deleting services one at a time. A review of the MySQL log shows a repetitive (with applicable variation) series of statements. That said, upon finally completing the deletion and attempting to reconfigure Nagios, all of the service checks were still in place. I'm not sure if the absence of the service configuration file (removed during the failed import) is at the root of this. I would, in theory, be able to recreate it from the object.cache file.

I could go on and on through many permutations and attempts to fix this, but won't bore you with that. However, this brings up other questions, the first of which is:

Do the nagiosql_NNN_NNN.php utilities "reproduce" the actions one would take through the GUI and output them to a temporary file that is "run" against the Nagios XI server?

The reason I ask this is, when I examined the nagiosql.delete.service file for the deletion session above, I saw that it was being constantly recreated. I also saw that it contained what appeared to be an entire page from the GUI. I even took the file to my local workstation and opened it up in a browser.

I will not presume to understand the mechanics of how these utilities are supposed to work with Nagios XI or the reasons why they operate as they do, but it seems a rather crude way of accomplishing these tasks. It almost feels like screen scraping.

I don't think I did anything not explicitly allowed in the Nagios XI documentation. What concerns me is Nagios' ability to handle large imports and deletions like this while still maintaining monitoring. As it stands, I may have to perform "surgery" on my Nagios XI server to restore it to full function.

mp4783 · Post by **mp4783** » Sun Feb 22, 2015 3:46 pm

I'll amend the previous post to say that I was able to roll back to the configuration with only 1000 TCP service checks, thus restoring the Nagios XI server to normal operation.

Post by **Box293** » Sun Feb 22, 2015 4:41 pm

I suspect the problem you are having with adding a lot of services has to do with the php max_execution_time.

Edit your /etc/php.ini and set:

max_execution_time = 600

Then:

service httpd restart

I experienced similar issues when trying to create a bunch of services. In my case I had an Ubuntu virtual machine running Open vSwitch configured with 2000 ports. Running the Network Switch / Router wizard to monitor these would bomb out when finishing the wizard. Increasing the max_execution_time corrected the problem.

mp4783 wrote: What is the appropriate way to load test the Nagios XI server? By that I mean, how can I put a service check load on it that takes it to the point where it can no longer handle any more (assume checks that consume similar resources)?

Good question, this really comes down to what you are actually monitoring. In my case I wanted to push MRTG hard and see how it coped. I found that the real test was when I took the virtual switch offline, the event handlers kicked in for all the service checks and started consuming crap loads of memory.

But this testing is really only for MRTG. What about other checks like WMI, NRPE, Passive?

You are on the right path, but I would be focussing on creating a lot of services which are specific to your implementation.

mp4783 · Post by **mp4783** » Mon Feb 23, 2015 8:14 am

My goal was to load up the monitor daemon and its attendant worker processes to see how many services it could handle. I wanted to hold as many variables constant as possible, thus the use of a single type of service check.

We are just about at the beginning of a massive rollout of Nagios (100,000+ hosts eventually) and we have no hard data on how many hosts can be monitored from a single Nagios XI host of a specific configuration. Deterministic performance measurements are admittedly very hard to achieve in the short run because we have limited control of external variables like network latency. Nagios does provide guidelines for system sizing, but we can't just trust those.

I have said to my peers, and your reply supports, the notion that we cannot know capacity until we put real service checks on a server for an extended period. Predictable behavior is critical in our business, so I'm just fumbling around, trying to find some answers.

That said, Nagios should have accounted for situations like this when building the utility. Either explicitly limit the number of objects that can be deleted or break the deletion process into PHP "manageable" chunks to prevent timeouts. I still don't know exactly what is happening "under the covers", which is another source of frustration. As a commercial (paying) customer, I would expect a bit more technical depth in the documentation.

As always, thanks for the prompt reply.

tmcdonald · Post by **tmcdonald** » Mon Feb 23, 2015 5:26 pm

Just to address some points specifically:

mp4783 wrote:We are just about at the beginning of a massive rollout of Nagios (100,000+ hosts eventually) and we have no hard data on how many hosts can be monitored from a single Nagios XI host of a specific configuration. Deterministic performance measurements are admittedly very hard to achieve in the short run because we have limited control of external variables like network latency. Nagios does provide guidelines for system sizing, but we can't just trust those.

We've had plans to rework our hardware requirements doc for a little while, particularly since we switched to using Core 4. In terms of hard numbers currently - and bearing in mind this depends very much on type of check, frequency, etc. - we see slowdown around 10k total checks. Optimizations can be made such as implementing a ramdisk, offloading the DB, using remote workers, etc. but for the most part 25k total checks for a single server is about the most I have seen still running well.
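To put those figures next to the 100,000+ host rollout, here is a rough back-of-the-envelope sketch. The checks-per-host value is purely hypothetical (it is not stated anywhere in this thread), and the per-server ceilings are the approximate 10k and 25k numbers quoted above, so treat the result as an order-of-magnitude estimate only.

```python
import math


def servers_needed(hosts, checks_per_host, checks_per_server):
    """Estimate how many monitoring servers a rollout would need.

    checks_per_host is a hypothetical planning input; checks_per_server
    is the rough ceiling observed for a single server.
    """
    total_checks = hosts * checks_per_host
    # Round up: a partially filled server is still a server.
    return math.ceil(total_checks / checks_per_server)


# Hypothetical example: 100,000 hosts with 5 checks each.
at_slowdown = servers_needed(100_000, 5, 10_000)   # 50 servers
at_tuned_max = servers_needed(100_000, 5, 25_000)  # 20 servers
```

Under that assumed 5 checks per host, the estimate lands somewhere between 20 and 50 servers, which is why real check mixes and intervals matter more than any single headline number.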

mp4783 wrote:That said, Nagios should have accounted for situations like this when building the utility. Either explicitly limit the number of objects that can be deleted or break the deletion process into PHP "manageable" chunks to prevent timeouts. I still don't know exactly what is happening "under the covers", which is another source of frustration. As a commercial (paying) customer, I would expect a bit more technical depth in the documentation.

We've made some changes in recent releases to, for example, how the Core Config Manager works. It used to write out all the files to disk whenever any one of them was applied, and now it detects the changes (as you know from your other thread) and only writes those out. We are always looking for ways to improve the product both from a usability perspective and an efficiency perspective. If you have any particular suggestions, requests, bugs, etc. please don't hesitate to let us know. Our products are largely customer-driven so we always like feedback.

As for documentation on a technical backend level, I will agree that it is lacking. We're doing in-house training where the devs will walk through some of the underlying gears that are in place (for training new hires and review for current employees). At some point this may become documentation in the classical sense, though it would have to go through some editing, review, etc. before it could be made public.

mp4783 · Post by **mp4783** » Tue Feb 24, 2015 8:57 am

Firstly, as always, I very much appreciate the prompt responses I receive on the forum. Whether they answer my question or not, I can always depend on someone trying to lend a hand.

I just tried a 5000 check import after raising the PHP max_execution_time value to 600. I got the same result as before. The checks show up in the database/GUI, but the service definition file in /usr/local/nagios/etc/services is empty now.
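One way to catch this condition right after an import is to count the service definitions that actually made it into the written config file and compare that with the number expected. The helper below is a minimal sketch under that assumption; it only inspects text that has already been read from the file and does not know anything about how Nagios XI itself writes or validates the file.

```python
import re

# Matches the start of an uncommented "define service {" block.
DEFINE_SERVICE = re.compile(r"^[ \t]*define[ \t]+service[ \t]*\{", re.MULTILINE)


def count_service_definitions(cfg_text):
    """Count 'define service' blocks in the text of a Nagios config file."""
    return len(DEFINE_SERVICE.findall(cfg_text))


def import_looks_complete(cfg_text, expected):
    """True if the file holds at least the expected number of services."""
    return count_service_definitions(cfg_text) >= expected
```

An empty file gives a count of 0, so `import_looks_complete(text, 5000)` would return `False` for the result described above, flagging the mismatch between the database/GUI and the file on disk.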

It appears that adding 1000 checks (for a single host at least) seems to work pretty consistently. So I guess that's the number we'll go with.
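If 1000 per import is the working limit, a larger port range could be split into batches of at most that size, with one import and reconfigure run per batch. The following is a minimal sketch of the splitting step only; the batch size of 1000 comes from the observation above, and whether batching actually avoids the failed import is untested here.

```python
def port_batches(first_port, last_port, batch_size=1000):
    """Split an inclusive port range into (start, end) batches.

    Each batch covers at most batch_size ports; the last one may be smaller.
    """
    batches = []
    start = first_port
    while start <= last_port:
        end = min(start + batch_size - 1, last_port)
        batches.append((start, end))
        start = end + 1
    return batches
```

For example, `port_batches(1002, 6001)` covers 5000 ports in five batches, the first being `(1002, 2001)` and the last `(5002, 6001)`.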

I know that you protect your Nagios XI PHP code with Source Guard. Are there provisions for commercial clients to obtain that source code? I have (had) questions that might have been answered independently if I had access to that source code.

Thanks.

tmcdonald · Post by **tmcdonald** » Tue Feb 24, 2015 10:28 am

mp4783 wrote:I know that you protect your Nagios XI PHP code with Source Guard. Are there provisions for commercial clients to obtain that source code? I have (had) questions that might have been answered independently if I had access to that source code.

Only certain portions of our XI codebase are encrypted; the vast majority of it is open. I am not aware of any provisions for releasing the code that is encrypted.

Nagios Support Forum