Number of "nagios.cfg" instances failing.

Posted: Tue Mar 24, 2020 8:50 am
by luczynj
Hello all,

We are running Nagios XI 5.6.5: CentOS Linux nagios-b 2.6.32-754.17.1.el6.centos.plus.x86_64 #1 SMP Tue Jul 2 20:09:16 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Due to the massive growth and use of Nagios XI, we recently replaced our servers and added 64GB RAM to these new servers.

We have 326 hosts and nearly 29,000 services.

Nagios XI was very stable until we hit around 25K services. I have no idea what to do or even where to troubleshoot.

I have a script that checks that "ps -eaf | grep nagios.cfg | grep -v grep" returns exactly 2 lines (2 running instances). If not, it restarts the nagios service.
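
For readers who want to reproduce that kind of watchdog, here is a minimal sketch. It is not the poster's actual script: the function names count_nagios_procs and check_nagios_procs are made up for illustration, and the expected count of 2 is taken from the post above.

```shell
#!/bin/sh
# Count the lines of a process listing (read from stdin) that mention
# nagios.cfg, ignoring any grep processes that show up in the listing.
count_nagios_procs() {
    # grep -c prints the number of matching lines; it exits non-zero
    # when that number is 0, hence the "|| true".
    grep -v grep | grep -c 'nagios\.cfg' || true
}

# Print OK when the count equals the expected value, RESTART otherwise.
# The expected value of 2 comes from the post; adjust it for your setup.
check_nagios_procs() {
    expected=2
    actual=$1
    if [ "$actual" -eq "$expected" ]; then
        echo "OK"
    else
        echo "RESTART"
    fi
}
```

On the server this could be driven with count=$(ps -eaf | count_nagios_procs), followed by a restart when check_nagios_procs "$count" prints RESTART. The restart command itself (for example "service nagios restart" on a SysV init system) is an assumption and is deliberately left out of the sketch.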

[root@nagios-b ]# free
             total       used       free     shared    buffers     cached
Mem:      66067412   59720752    6346660      70844     601948   56079904
-/+ buffers/cache:    3038900   63028512
Swap:     33038332      22436   33015896
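
As a reading aid for the output above, the "-/+ buffers/cache" line can be re-derived from the "Mem:" line. The sketch below only repeats that arithmetic with the posted figures (free reports KiB by default); the variable names are made up for illustration.

```shell
# Figures copied from the "Mem:" line of the posted output, in KiB.
used=59720752
free_kb=6346660
buffers=601948
cached=56079904

# Memory actually held by applications: used minus buffers and page cache.
app_used=$((used - buffers - cached))

# Memory applications could still claim: free plus reclaimable buffers/cache.
available=$((free_kb + buffers + cached))

echo "used by applications:      $app_used KiB"
echo "available to applications: $available KiB"
```

Both results match the second line of the free output (3038900 and 63028512), so most of the "used" memory is page cache, and applications hold only about 3 GB of the 64 GB.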


Please help!

Regards,
JLu

Re: Number of "nagios.cfg" instances failing.

Posted: Tue Mar 24, 2020 4:12 pm
by lmiltchev
There are a number of things you could do to "tune up" your Nagios XI system and improve performance. You can find information on the topic here:

https://assets.nagios.com/downloads/nag ... p#boosting

Having said that, with so many services, you should really consider purchasing a second Nagios XI server. Tweaking the settings, and throwing more hardware at the server can only go so far.

Re: Number of "nagios.cfg" instances failing.

Posted: Fri Mar 27, 2020 11:29 am
by luczynj
Thanks for the recommendation. However, we have had similar issues since we started using Nagios in 2014, and we did all the fine-tuning that we could. We've just deployed two new beefed-up servers to run this Nagios. A large bulk of the 29K services are passive.

We ended up brainstorming about this issue and discovered that we were getting flooded by a node in Amsterdam. We experienced this a year or two ago, when we found we were getting thousands of SNMP messages per hour, and at times per minute, and Nagios couldn't cope with that. I doubt anyone's hardware could handle that.

Is there a custom service out there that could detect this kind of flooding? I've already got a script in my head of how I would write my own. But why reinvent the wheel?

Thanks again for your help. Now that we've updated the iptables rules to prevent future flooding, the system is running just fine on the new hardware.
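
One possible shape for such a flood detector is sketched below. It rests on assumptions that are not from the thread: the input is a text stream with one line per received message and the source address in the first field, and the function names find_flooders and check_flood are hypothetical. The exit codes follow the usual Nagios plugin convention (0 = OK, 2 = CRITICAL).

```shell
#!/bin/sh
# Print every source that sent more than $1 messages, with its count.
# Input on stdin: one line per message, source address in field 1
# (an assumed format; adapt the field number to your own log).
find_flooders() {
    threshold=$1
    awk -v max="$threshold" '
        { count[$1]++ }
        END {
            for (src in count)
                if (count[src] > max)
                    print src, count[src]
        }
    ' | sort
}

# Plugin-style wrapper: one status line, exit code 2 if any source floods.
check_flood() {
    flooders=$(find_flooders "$1")
    if [ -n "$flooders" ]; then
        echo "CRITICAL - flooding from: $(echo "$flooders" | tr '\n' ' ')"
        return 2
    fi
    echo "OK - no source above $1 messages"
    return 0
}
```

Feeding it the messages received during the last interval (for example the last hour of a trap log) and a threshold, as in check_flood 1000 < recent_messages.txt, would flag a single noisy node. The file name and the threshold are placeholders.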

Re: Number of "nagios.cfg" instances failing.

Posted: Fri Mar 27, 2020 11:53 am
by lmiltchev
It makes a big difference that the majority of your checks are passive. In this case, you may be fine, especially with the new hardware.
"Is there a custom service out there that could detect this kind of flooding? I've already got a script in my head of how I would write my own. But why reinvent the wheel?"
I don't think there is anything included in XI that would help you with that. You are on the right track, though, as you could use your firewall to throttle these messages. If you prefer to use your custom script, I would recommend that you try it in a test environment first, before using it in production. Thanks!
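
To illustrate the firewall throttling mentioned above, here is a hedged sketch that only prints candidate iptables rules so they can be reviewed in a test environment before anyone applies them. The function name print_throttle_rules is hypothetical, and the port (162/udp, the usual SNMP trap port), rate and burst values in the usage note are assumptions, not settings from this thread. The rules rely on the standard iptables "limit" match.

```shell
#!/bin/sh
# Print (do not apply) two iptables rules: accept UDP traffic to the
# given port up to a rate limit, then drop whatever exceeds it.
#   $1 = destination port, $2 = rate (e.g. 100/minute), $3 = burst size
print_throttle_rules() {
    port=$1
    rate=$2
    burst=$3
    echo "iptables -A INPUT -p udp --dport $port -m limit --limit $rate --limit-burst $burst -j ACCEPT"
    echo "iptables -A INPUT -p udp --dport $port -j DROP"
}
```

For example, print_throttle_rules 162 100/minute 200 prints one ACCEPT rule and one DROP rule. Note that the limit match throttles all sources together; limiting each source separately would need a different match and is outside this sketch.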