I run a fairly large Nagios XI environment consisting of 1800+ hosts and 24k+ active checks on a single server that has the ramdrive optimizations enabled and runs both Nagios Core / XI and the database portions of the application. The server itself is a bit of a beast, running in VMware with 15 total processors and 24 GB of RAM (based on vRealize recommendations). Before anyone comments that I should move the database to another server, I'll go ahead and mention that I'm soon going to be retiring Nagios in my environment completely and have no interest in investing any more time in my Nagios server. The only reason for this post is that I want to say thank you to the community for your support over the years and offer a final tidbit of help where I can.
I was recently faced with a problem that appeared whenever more than about 350-400 services were in a non-OK state during our monthly patching cycle (even when the hosts and all services were in scheduled downtime). Nagios XI would become completely unresponsive quickly (in less than about 20 minutes), and errors such as "ndo2db: Warning: queue send error, retrying... " would appear repeatedly in the eventlog for as long as the services were offline.
After a quick Google search, I followed the instructions available here: https://support.nagios.com/kb/article.php?id=139. Unfortunately, the instructions did not resolve the problem, but they did give me a clue as to what to look for. Specifically, running "watch ipcs -q" let me see that the queue was filling, and after about 50k messages or so the system would steadily lose ground and eventually become unresponsive. That observation is what led me to the actual fix. Increasing the queue size (kernel.msgmnb and kernel.msgmax) as the article indicated allowed more messages to sit in the queue, but it didn't resolve the fact that the messages weren't being consumed fast enough.
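For anyone who wants a single number to watch instead of the full "ipcs -q" table, here is a rough sketch. The function name queue_depth is made up for illustration, and it assumes the usual util-linux column layout (key, msqid, owner, perms, used-bytes, messages); check the header on your own system before trusting the total.

```shell
#!/bin/sh
# Hypothetical helper: sum the "messages" column of `ipcs -q` output read from stdin.
# Assumes the 6th field is the message count and that data rows start with a 0x... key.
queue_depth() {
    awk '$1 ~ /^0x/ { total += $6 } END { print total + 0 }'
}

# Example usage (left commented so the sketch does nothing when sourced):
#   while true; do ipcs -q | queue_depth; sleep 5; done
# The current kernel queue limits can be printed with:
#   sysctl kernel.msgmnb kernel.msgmax
```

A total that keeps climbing while services are down suggests the consumer side is falling behind, which is the pattern described above.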
The limiting factor turned out to be the number of files the Nagios user could have open at once, and simply tuning the nofile option in /etc/security/limits.conf was the answer in this case. Specifically, here are the settings that I used:
nagios hard nofile 65536
nagios soft nofile 65536
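To confirm that the new limit is actually in effect for the running daemons, the values can be read from /proc. This is only a sketch: the function name open_file_limits is hypothetical, and the process names nagios and ndo2db in the usage comment are assumptions about this kind of setup. As far as I know, limits.conf is applied when a session is started, so an already running process keeps its old limits until it is restarted, and a service started by an init system may not pick the file up at all.

```shell
#!/bin/sh
# Hypothetical helper: print the soft and hard "Max open files" values
# from a /proc/<pid>/limits file passed as the first argument.
open_file_limits() {
    awk '/^Max open files/ { print $4, $5 }' "$1"
}

# Example usage (process names are assumptions; left commented on purpose):
#   for pid in $(pgrep -x nagios; pgrep -x ndo2db); do
#       open_file_limits "/proc/$pid/limits"
#   done
```

If the output still shows the old values after the change, restarting the services (or logging the nagios user in again) is the first thing to try.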
In my testing, I was able to have up to 3k services offline without the Nagios server breaking a sweat. I did, however, find a limit at around 3500 services being offline when the nagios_statehistory table crashed and had to be repaired.
Large Nagios XI environment - NDOUtils not responding
- Box293
Re: Large Nagios XI environment - NDOUtils not responding
Thanks for that information, much appreciated.
I'm currently putting together some KB articles to address these exact issues you've highlighted.