I've got a total of six (6) Nagios servers: five (5) are active hosts, and the sixth is a passive host for displaying the results from the active hosts (without the perfdata). The passive server has just shy of 35,000 passive services configured, spread across just shy of 2,400 passive hosts. Each of the servers (including the passive server) is a self-contained machine with a fairly vanilla config (I added the ramdisk, but the MySQL server is still local to the machine). On the passive server, the average host check latency is 0.01 sec and the average service check latency is 0.14 sec. The server itself has about 2 GB of free memory (out of 8 GB total) and hovers at a load between 0.6 and 1.2 on a 4-CPU system, with CPU usage peaking at about 50% even when applying configuration. The CPUWait on the box sits at a flat 0 most of the time and tops out at about 0.7 (in other words, it's not disk bound).
My environment makes extensive use of the REST API to handle configuration: creating hosts, services, hostgroups, and contacts, assigning hosts to hostgroups, etc. The REST API is pressed into service via ChatOps commands that I've created to allow my team to easily manage the Nagios environment in a very standard way across all of the servers. Since we lean so heavily on the REST API, I should state that I don't consider applying configuration to be complete until the messages queue is back to zero (0), because that's when you can be absolutely confident about interacting with the REST API. If you interact with the REST API before the messages queue has hit zero after applying configuration, you will get limited subsets of data (kinda like what you see in the Nagios XI GUI when you apply config and hosts/services stay grey until it finishes).
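To make the "wait until the queue is back to zero" rule concrete, here's a rough sketch of the kind of polling loop I mean. It's not my actual ChatOps code: `get_depth` is a hypothetical callable that returns the current messages queue depth however you happen to measure it, and the timeout and interval values are placeholders.

```python
import time
from typing import Callable


def wait_for_queue_drain(
    get_depth: Callable[[], int],   # hypothetical: returns current queue depth
    timeout: float = 300.0,         # placeholder: give up after this many seconds
    interval: float = 5.0,          # placeholder: seconds between polls
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the queue depth reaches zero; return False on timeout."""
    deadline = clock() + timeout
    while True:
        if get_depth() == 0:
            return True             # queue drained, API results are trustworthy
        if clock() >= deadline:
            return False            # still busy, let the caller decide what to do
        sleep(interval)
```

The `sleep` and `clock` parameters are only there so the loop can be exercised without actually waiting; in normal use you'd leave them at their defaults.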
The problem is that applying configuration on the passive server takes a good deal of time (just over 2 minutes), especially compared to the much smaller active servers (just under 30 seconds). Most of the API commands accept an "applyconfig=" flag, which you set to 0 or 1 to control whether the API applies configuration after the change. However, this flag is not respected by the API call that posts a system/user. This means that if a command to create a user is issued, you have to wait better than 2 minutes before another user can be created. Mind you, I did a bit of coding to at least post back into the chat channel when the messages queue has hit zero, which has helped.
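For anyone who hasn't used the flag, this is roughly how a config call with "applyconfig=" gets put together. Treat it as a sketch: the `/nagiosxi/api/v1/config/<type>` path and the `apikey` parameter are written from memory of the XI API docs, the server name and key are made up, and the function only builds the URL and form body without sending anything.

```python
from urllib.parse import urlencode


def build_config_request(base_url: str, api_key: str, object_type: str,
                         fields: dict, apply_config: bool) -> tuple[str, str]:
    """Return (url, form_body) for a config POST; nothing is sent here."""
    # Assumed endpoint layout; check it against your own XI API reference.
    url = f"{base_url.rstrip('/')}/nagiosxi/api/v1/config/{object_type}"
    url += "?" + urlencode({"apikey": api_key})
    # applyconfig=0 queues the change; applyconfig=1 applies config right away.
    body = urlencode({**fields, "applyconfig": 1 if apply_config else 0})
    return url, body


# Hypothetical server and key, for illustration only.
url, body = build_config_request(
    "https://passive.example.com", "SECRETKEY", "host",
    {"host_name": "web01", "address": "10.0.0.5"}, apply_config=False,
)
```

With `apply_config=False` the body ends in `applyconfig=0`, which is what lets several changes be batched before a single apply. The system/user call is the one where, in my experience, this flag has no effect.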
I'd ideally like to figure out a way to prevent the API from applying configuration on the passive server when a user is created so I can then schedule the apply config at the top of the hour. Or perhaps I'm trying to slay the wrong dragon here and a better option would be to determine how to speed up applying configuration on the passive server itself. Either way, I'm pretty open to ideas on how to proceed.
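If the first option turns out to be possible, the scheduling side would look something like the sketch below. It assumes user creation can somehow be made to skip the apply (which is exactly the part I don't have), and `apply_fn` is a hypothetical stand-in for whatever actually triggers the apply config.

```python
from datetime import datetime, timedelta
from typing import Callable


def next_top_of_hour(now: datetime) -> datetime:
    """Return the start of the hour following `now`."""
    return now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)


class DeferredApply:
    """Remember that changes are pending and apply them once, later."""

    def __init__(self) -> None:
        self.pending = False

    def mark(self) -> None:
        # Call after each change made with the apply step skipped.
        self.pending = True

    def flush(self, apply_fn: Callable[[], None]) -> bool:
        # Run from the hourly job; applies only if something changed.
        if not self.pending:
            return False
        apply_fn()                  # hypothetical: triggers the real apply config
        self.pending = False
        return True
```

The hourly job would call `flush()` at the time given by `next_top_of_hour()`, so an hour with no changes costs nothing and an hour with twenty new users costs one apply instead of twenty.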
Thoughts / suggestions / pie recipes?