Support forum for Nagios Core, Nagios Plugins, NCPA, NRPE, NSCA, NDOUtils and more. Engage with the community of users including those using the open source solutions.
We have configured our 9 Nagios servers to send data to a MySQL server using NDOUtils. Everything is working well so far, and we are able to collect data from all 9 Nagios servers. The issue is that 4 of the 9 Nagios servers seem to be sending more information than the ndo2db application can handle. This results in the IPC message queue limit being maxed out (kernel.msgmnb = 524288000), causing the Nagios daemon to stall and ultimately stop on those 4 servers.
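For anyone wanting to watch how close a queue is getting to that limit before Nagios stalls, here is a minimal Python sketch. It assumes the usual Linux `ipcs -q` column layout (key, msqid, owner, perms, used-bytes, messages) and treats kernel.msgmnb as the per-queue byte ceiling; the function names are made up for illustration and are not part of NDOUtils.

```python
def parse_ipcs_queues(text):
    """Return a list of (msqid, used_bytes, messages) parsed from `ipcs -q` output.

    Assumption: data rows have six whitespace-separated columns and start
    with a hex key such as 0x00000000. Header and blank lines are skipped.
    """
    queues = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 6 and fields[0].startswith("0x"):
            queues.append((int(fields[1]), int(fields[4]), int(fields[5])))
    return queues


def queue_fill_ratios(text, msgmnb):
    """Map each queue id to used_bytes / msgmnb (1.0 means the queue is full)."""
    return {msqid: used / msgmnb for msqid, used, _ in parse_ipcs_queues(text)}


def read_msgmnb(path="/proc/sys/kernel/msgmnb"):
    """Read the current kernel.msgmnb value from procfs (Linux only)."""
    with open(path) as handle:
        return int(handle.read().strip())
```

Feeding it the captured output of `ipcs -q` together with `read_msgmnb()` gives a ratio per queue. A ratio that keeps climbing toward 1.0 would suggest ndo2db is not draining the queue as fast as the broker module fills it.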
We have tried using ndoutils v2.0.0 and v2.1.b2 but we get the same result.
Moving to NDO 2.1b2 was the right first step, as there were some fixes in there specifically to handle the slow processing.
How many hosts+services are being handled by each of these servers? It honestly might just be too much for the hardware. For that matter, what's the hardware look like on the servers?
delfin wrote:Do we need to think about offloading some of the hosts/service checks from the 4 servers? Or can you think of other options?
I always recommend in split setups like this keeping the load as evenly-distributed as possible. Looking at your numbers, the servers are pretty uniform in resources. With that in mind, the breaking point is likely around the 20k service mark. If you can shift some of the load into your barely-used NAG06 that could alleviate some of the stress on 5/7/8/9.
Other options would include spinning up another server, *possibly* increasing the kernel queue limits (though that might just mask/delay the problem), working on adjusting the check/retry intervals (this can actually have a big impact if done properly and thoroughly), and tweaking various performance-related options in nagios.cfg.
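To get a rough feel for why the check/retry intervals matter so much, here is a small Python sketch that estimates the steady-state check rate. It assumes every object is in an OK state (so only the normal check interval applies, no retries) and that intervals are given in Nagios time units of `interval_length` seconds (60 by default). The function name and the numbers in the usage note are illustrative only.

```python
def checks_per_second(groups, interval_length=60):
    """Estimate the steady-state check rate.

    groups: iterable of (object_count, check_interval) pairs, where
            check_interval is in Nagios time units.
    interval_length: seconds per time unit (the Nagios default is 60).

    Assumption: everything is OK, so retry intervals never apply.
    """
    return sum(count / (interval * interval_length) for count, interval in groups)
```

For example, 3000 services checked every 5 minutes work out to 10 checks per second, and stretching the same services to every 10 minutes halves that to 5. Each check result is also data the broker module has to hand to ndo2db, so a lower check rate should mean less pressure on the queue.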
If the server is monitoring 10k hosts and has 10k service checks, does this mean we have already hit the 20k breaking point? I'm thinking that the monitored hosts generate host checks, and since we are also gathering host check data, do we need to add the number of hosts and the number of service checks together to work out whether we're already reaching the 20k breaking point?
Sorry, yeah, you're correct. Adding the hosts and the services, the new breaking point seems to be between 25k (highest of the stable servers) and 30k (lowest of the problem servers). There are many other factors as well, such as check frequency, % in non-OK states, etc., that will cause more frequent checking, but looking at the numbers you provided there is some sort of barrier related to the total count.
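The hosts+services arithmetic above can be sketched in a few lines of Python. The 25k threshold is just the lower bound of the range mentioned in this thread, not a documented limit, and the server names in the usage note are hypothetical.

```python
def load_report(servers, threshold=25_000):
    """Summarize combined host+service load per server.

    servers: dict mapping server name -> (host_count, service_count).
    threshold: combined count above which a server is considered at risk
               (25k is an assumption taken from this thread).

    Returns a dict mapping name -> (total, excess), where excess is how many
    objects would have to move elsewhere to get back under the threshold.
    """
    report = {}
    for name, (hosts, services) in servers.items():
        total = hosts + services
        report[name] = (total, max(0, total - threshold))
    return report
```

With `{"SRV_A": (10000, 10000), "SRV_B": (12000, 18000)}`, SRV_A totals 20000 with nothing to move, while SRV_B totals 30000 and would need to shed about 5000 objects.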
bphl wrote:Hi, are there any documents or best practice guidelines for scaling NDO in large environments?
Not that I know of. First and foremost, move it off of your Nagios server. Secondly, give both your monitoring server and the MySQL servers the fastest disks you can afford. If you need to scale extremely large, you should be thinking about a RAID array of SSD drives.
Finally, you can also limit the information you send to the database via the data_processing_options in the ndomod.cfg
# DATA PROCESSING OPTION
# This option determines what data the NDO NEB module will process.
# Do not mess with this option unless you know what you're doing!!!!
# Read the source code (include/ndbxtmod.h) to determine what values
# to use here. Values from source code should be OR'ed to get the
# value to use here. A value of -1 will cause all data to be processed.
# Read the source code (include/ndomod.h) and look for "NDOMOD_PROCESS_"
# to determine what values to use here. Values from source code should
# be OR'ed to get the value to use here. A value of -1 will cause all
# data to be processed.
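Since the comment block says to OR together the "NDOMOD_PROCESS_" values from the source, one way to avoid hard-coding them is to read the `#define` lines straight out of your own copy of include/ndomod.h. The Python sketch below does that. The helper names are invented for illustration, and it assumes the header defines each flag as a plain decimal integer on a single line.

```python
import re

# Matches lines such as: #define NDOMOD_PROCESS_SOMETHING 64
# Assumption: flag values are written as plain decimal integers.
DEFINE_RE = re.compile(
    r"^[ \t]*#define[ \t]+(NDOMOD_PROCESS_\w+)[ \t]+(\d+)\b", re.MULTILINE
)


def load_flags(header_text):
    """Return {flag_name: value} for every NDOMOD_PROCESS_ define found."""
    return {name: int(value) for name, value in DEFINE_RE.findall(header_text)}


def combine(flags, wanted):
    """Bitwise-OR the named flags into one data_processing_options value."""
    value = 0
    for name in wanted:
        value |= flags[name]  # raises KeyError if the name is not in the header
    return value
```

Read the header with `open("include/ndomod.h").read()`, pass the text to `load_flags`, then call `combine` with only the flag names you want processed. The resulting integer is what would go into data_processing_options in place of -1. Check the flag names against the header for your NDOUtils version before relying on the result.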
scottwilkerson wrote:Finally, you can also limit the information you send to the database via the data_processing_options in the ndomod.cfg
# DATA PROCESSING OPTION
# This option determines what data the NDO NEB module will process.
# Do not mess with this option unless you know what you're doing!!!!
# Read the source code (include/ndbxtmod.h) to determine what values
# to use here. Values from source code should be OR'ed to get the
# value to use here. A value of -1 will cause all data to be processed.
# Read the source code (include/ndomod.h) and look for "NDOMOD_PROCESS_"
# to determine what values to use here. Values from source code should
# be OR'ed to get the value to use here. A value of -1 will cause all
# data to be processed.