Load Testing Nagios XI - Problems w/ Large Service Count
Posted: Sun Feb 22, 2015 3:36 pm
I'll start with the question for those of you who don't want to read below:
What is the appropriate way to load test the Nagios XI server? By that I mean, how can I put a service check load on it that takes it to the point where it can no longer handle any more (assume checks that consume similar resources)?
My attempts to do my own load testing have resulted in problems on the server as detailed below.
I have been attempting to load test my Nagios XI server by creating a very large number of TCP service checks on a single host. I did this by writing a script that created service check entries in an import file, using a previously configured service check definition created via the GUI. The script simply incremented the TCP port number checked and the TCP port named in the description for a fixed number of loops.
My initial run produced a file with 1000 checks for TCP ports 2 through 1001. This file was then placed into the /usr/local/nagios/etc/import directory using the appropriate filename for the host. I then executed the reconfigure_nagios.sh utility.
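For anyone who wants to reproduce this, here is a minimal sketch of the kind of generator script described above. It is not my exact script: the host name "loadtest-host", the template name "generic-tcp-service", and the "check_tcp!PORT" command form are placeholders, so substitute whatever your own GUI-created service definition uses.

```python
def service_block(host, port, template="generic-tcp-service"):
    """Return one Nagios 'define service' block for a TCP check on `port`.

    The template name and the check_tcp!PORT command form are assumptions;
    replace them with the values from your own service definition.
    """
    return (
        "define service {\n"
        f"    host_name               {host}\n"
        f"    service_description     TCP Port {port}\n"
        f"    use                     {template}\n"
        f"    check_command           check_tcp!{port}\n"
        "}\n"
    )


def build_import(host, first_port, last_port):
    """Concatenate service blocks for every port in [first_port, last_port]."""
    return "\n".join(
        service_block(host, port) for port in range(first_port, last_port + 1)
    )


if __name__ == "__main__":
    # Ports 2 through 1001 give the 1000 checks from the first run.
    text = build_import("loadtest-host", 2, 1001)
    print(text.count("define service {"), "service definitions generated")
```

The returned text would then be written to a file in the import directory before running the reconfigure step.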
The utility churned away for a considerable amount of time. The Nagios monitoring processes were down for over 2 minutes and 45 seconds while it imported the 1000 new services. It finally finished, and the new configuration was in place. The Nagios server then hummed along with about 1300 service checks without any issues.
I then wanted to increase the number of service checks by roughly 9000, so I reran my script and created service check definitions for TCP ports 1002 through 10003. I followed the same procedure as above. This time the import failed and the service configuration file for the host was removed from /usr/local/nagios/etc/services. However, the entries still appeared in the Nagios GUI and were still present in the database.
I attempted to delete them through the GUI, which accepted the request, applied the changes, and then returned control. However, the services were still in place in the GUI and database. No service configuration file was written out for the host, which is what I would expect given that I had asked to delete all of the host's services.
I then decided to use the nagiosql_delete_service.php utility with the '--config' option to delete the approximately 9000 services on the host. The process took almost 50 minutes to complete on a very lightly loaded server. It appears to delete services one at a time: a review of the MySQL log shows a repetitive (with applicable variation) series of statements. Even so, after the deletion finally completed and I attempted to reconfigure Nagios, all of the service checks were still in place. I'm not sure if the absence of the service configuration file (removed during the failed import) is at the root of this. I would, in theory, be able to recreate it from the objects.cache file.
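As a rough illustration of that last idea, the sketch below pulls the "define service" blocks for one host out of the text of the object cache file. It assumes the usual cache layout (one "define service {" block per service, closed by a "}" on its own line, with a host_name line inside) and host names without spaces. I have not verified that Nagios XI will accept the extracted blocks as a service configuration file, so treat it as a starting point only.

```python
import re


def services_for_host(cache_text, host):
    """Return the 'define service' blocks in cache_text whose host_name is `host`."""
    # Non-greedy match from 'define service {' to the first closing brace
    # that sits on its own line; 'define servicegroup' etc. do not match.
    blocks = re.findall(
        r"define service\s*\{.*?\n\s*\}", cache_text, flags=re.DOTALL
    )
    wanted = []
    for block in blocks:
        match = re.search(r"^\s*host_name\s+(\S+)\s*$", block, flags=re.MULTILINE)
        if match and match.group(1) == host:
            wanted.append(block)
    return wanted
```

Joining the returned blocks with blank lines gives a candidate replacement for the missing per-host file.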
I could go on and on through many permutations and attempts to fix this, but won't bore you with that. However, this brings up other questions, the first of which is:
Do the nagiosql_NNN_NNN.php utilities "reproduce" the actions one would take through the GUI and output them to a temporary file that is "run" against the Nagios XI server?
The reason I ask is that when I examined the nagiosql.delete.service file for the deletion session above, I saw that it was being constantly recreated. I also saw that it contained what appeared to be an entire page from the GUI. I even took the file to my local workstation and opened it up in a browser.
I will not presume to understand the mechanics of how these utilities are supposed to work with Nagios XI or the reasons why they operate as they do, but it seems a rather crude way of accomplishing these tasks. It almost feels like screen scraping.
I don't think I did anything not explicitly allowed in the Nagios XI documentation. What concerns me is Nagios' ability to handle large imports and deletions like this while still maintaining monitoring. As it stands, I may have to perform "surgery" on my Nagios XI server to restore it to full function.