Hello all,
We are running Nagios XI 5.6.5: CentOS Linux nagios-b 2.6.32-754.17.1.el6.centos.plus.x86_64 #1 SMP Tue Jul 2 20:09:16 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
Due to the massive growth and use of Nagios XI, we recently replaced our servers and added 64GB RAM to these new servers.
We have 326 hosts and nearly 29,000 services.
Nagios XI was very stable until we hit around 25K services. I have no idea what to do or even where to troubleshoot.
I have a script that checks that the number of lines returned by "ps -eaf | grep nagios.cfg | grep -v grep" is equal to 2. If it is not, the script restarts the nagios service.
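For anyone following along, a watchdog of this kind might look roughly like the sketch below. It is not the actual script from this system, only an illustration of the check described above. The function names, the `service nagios restart` command (assuming SysV init on this CentOS 6 box) and the use of `logger` are assumptions.

```shell
#!/bin/sh
# Hypothetical watchdog sketch; names and restart command are assumptions.

EXPECTED=2                            # process count treated as healthy in this thread
RESTART_CMD="service nagios restart"  # assumption: SysV init service named "nagios"

# Count running processes whose command line mentions nagios.cfg.
count_nagios_procs() {
    ps -eaf | grep -v grep | grep -cF 'nagios.cfg'
}

# Decide what to do for a given count: prints "ok" or "restart".
decide_action() {
    if [ "$1" -eq "$EXPECTED" ]; then
        echo "ok"
    else
        echo "restart"
    fi
}

# Run one check; restarts Nagios only when the count is not the expected one.
check_once() {
    count="$(count_nagios_procs)"
    if [ "$(decide_action "$count")" = "restart" ]; then
        logger -t nagios-watchdog "found $count nagios.cfg processes, expected $EXPECTED; restarting"
        $RESTART_CMD
    fi
}
```

Nothing runs until `check_once` is called, for example from a cron job that sources this file. Logging the observed count before each restart makes it easier to see afterwards whether the count was too low or too high.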
[root@nagios-b ]# free
                      total       used       free     shared    buffers     cached
Mem:               66067412   59720752    6346660      70844     601948   56079904
-/+ buffers/cache:             3038900   63028512
Swap:              33038332      22436   33015896
Please help!
Regards,
JLu
Number of "nagios.cfg" instances failing.
Re: Number of "nagios.cfg" instances failing.
There are a number of things you could do to "tune up" your Nagios XI system and improve performance. You can find information on the topic here:
https://assets.nagios.com/downloads/nag ... p#boosting
Having said that, with so many services, you should really consider purchasing a second Nagios XI server. Tweaking the settings and throwing more hardware at the server can only go so far.
Be sure to check out our Knowledgebase for helpful articles and solutions!
Re: Number of "nagios.cfg" instances failing.
Thanks for the recommendation. However, we have had similar issues since we started using Nagios in 2014, and we did all the fine-tuning that we could. We've just deployed two new beefed-up servers to run this Nagios. The large bulk of the 29K services are passive.
We ended up brainstorming about this issue and discovered that we were getting flooded by a node in Amsterdam. We experienced this a year or two ago and found we were getting thousands of SNMP messages per hour, and some by the minute. Nagios couldn't cope with that, and I doubt anyone's hardware could.
Is there a custom service out there that could detect this kind of flooding? I've already got a script in my head of how I would write my own. But why reinvent the wheel?
Thanks again for your help. Now that we've updated the iptables rules to prevent future flooding, the system is running just fine on the new hardware.
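To make the idea concrete, here is a rough sketch of what such a flood check could look like as a Nagios plugin. It is only an illustration, not an existing plugin. It assumes the messages for the time window of interest have already been extracted to one line per message, with the source address as the first field; that input format and the function names are hypothetical. The exit codes follow the usual plugin convention (0 OK, 1 WARNING, 2 CRITICAL).

```shell
#!/bin/sh
# Hypothetical flood check; the input format is an assumption (see above).

# Print "COUNT SOURCE" for the busiest source seen on stdin.
# Expects one message per line with the source address as the first field.
top_talker() {
    awk '{ n[$1]++ }
         END {
             max = 0; src = "none"
             for (s in n) if (n[s] > max) { max = n[s]; src = s }
             print max, src
         }'
}

# Map a count to a plugin state using warning and critical thresholds.
# Usage: flood_state COUNT WARN CRIT   -> prints OK, WARNING or CRITICAL
flood_state() {
    if [ "$1" -ge "$3" ]; then
        echo "CRITICAL"
    elif [ "$1" -ge "$2" ]; then
        echo "WARNING"
    else
        echo "OK"
    fi
}

# Full check: reads messages on stdin, prints one status line with
# performance data, and returns the matching plugin exit code.
# Usage: check_flood WARN CRIT
check_flood() {
    result="$(top_talker)"
    count="${result%% *}"   # text before the first space
    src="${result#* }"      # text after the first space
    state="$(flood_state "$count" "$1" "$2")"
    echo "FLOOD $state - busiest source $src sent $count messages | max_msgs=$count"
    case "$state" in
        OK)      return 0 ;;
        WARNING) return 1 ;;
        *)       return 2 ;;
    esac
}
```

A wrapper would pipe the recent messages into `check_flood`, for example `check_flood 500 2000` to warn at 500 and go critical at 2000 messages from a single source. How the source address is pulled out of the trap log depends on the local logging setup, so that step is deliberately left out.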
Re: Number of "nagios.cfg" instances failing.
It makes a big difference that the majority of your checks are passive. In this case, you may be fine, especially with the new hardware.
I don't think there is anything included in XI that would help you with that. You are on the right track, though, as you could use your firewall to throttle these messages. If you prefer to use your custom script, I would recommend that you try it in a test environment first, before using it in production. Thanks!
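As a rough illustration of throttling at the firewall, the sketch below prints a pair of iptables rules that rate-limit inbound traffic per source address using the hashlimit match. The rules actually deployed on this system are not shown in the thread, so everything here is an assumption: that the flood arrives as SNMP traps on UDP port 162, the function name, and the example rate and burst values. The function only prints the commands so they can be reviewed on a test system before anything is applied.

```shell
#!/bin/sh
# Prints (does not run) iptables rules that rate-limit inbound SNMP traps
# per source address. Port 162 and the limits are assumptions; adjust first.

# Usage: trap_rate_limit_rules RATE BURST
#   RATE  e.g. 100/minute    BURST  e.g. 200
trap_rate_limit_rules() {
    rate="$1"
    burst="$2"
    # Accept traps from each source address up to RATE, tracked per source IP.
    echo "iptables -A INPUT -p udp --dport 162 -m hashlimit --hashlimit-name snmptrap --hashlimit-mode srcip --hashlimit-upto $rate --hashlimit-burst $burst -j ACCEPT"
    # Drop whatever exceeds that rate.
    echo "iptables -A INPUT -p udp --dport 162 -j DROP"
}
```

Calling `trap_rate_limit_rules 100/minute 200` prints the two commands. Because `-A` appends to the end of the INPUT chain, their position relative to any existing ACCEPT rules for the same port matters and should be checked before use.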