Nagios XI stability issues

Posted: Thu Mar 19, 2020 5:39 pm
by vappukuttan
Hello,

System: CentOS 7.7, 8 CPUs, 16 GB, enough disk space
Total checks: 8000

I have been having Nagios XI stability issues and am trying to figure out what needs to be done or changed to get it more stable.

I have set max concurrent jobs to 60 and repaired the databases. I have tweaked parameters for a large setup (such as reaper frequency/time), switched to the unified tactical overview, increased the refresh multiplier to 10 times the default, and disabled auto-running reports and metrics on page load. ndo2db has been showing "max retries exceeded", so I set the msgmni value.

It's still not stable. I am also seeing wproc-related messages:
nagios[26891]: wproc: iocache_read() from Core Worker 26901 returned -1: Bad address

The monitoring engine stops every night at midnight, and I have also seen it stop at random times during the day.

I am not sure what my best approach is to get this stable.

Thank you,
Vinod

Re: Nagios XI stability issues

Posted: Fri Mar 20, 2020 11:27 am
by jdunitz
Sorry to hear that you're having trouble!

Could we get a system profile from you?

Once we have that, we can start investigating.


Thanks!

Re: Nagios XI stability issues

Posted: Fri Mar 20, 2020 12:02 pm
by vappukuttan
Thank you for your response. Once I get the system profile, what do I do with it? Should I upload it to this thread? Is there anything in that file that I need to remove, such as the server IP address?

Here is the latest error, which I got a few minutes ago in /var/log/messages. I wasn't sure which README it is referring to.

ndo2db: Warning: Retrying message send. This can occur because you have too few messages allowed or too few total bytes allowed in message queues. You are currently using 256000 of 512000 messages and 262144000 of 262144000 bytes in the queue. See README for kernel tuning options.


Thank you,
Vinod

Re: Nagios XI stability issues

Posted: Fri Mar 20, 2020 1:49 pm
by jdunitz
You could PM me your profile. It contains a lot of IP addresses and other configuration info, so if your local security policy doesn't let you share that, you may not be able to send us the profile after all.

However...

The error you mentioned could be related to this problem:
https://support.nagios.com/kb/article.php?id=139

Have a look at that document, try what it says, and see if that helps your issue.

Thanks!

--Jeffrey

Re: Nagios XI stability issues

Posted: Fri Mar 20, 2020 3:04 pm
by vappukuttan
Thank you Jeffrey. I made the kernel changes two days ago, then reverted kernel.msgmnb and kernel.msgmax to 131072000 after Nagios stopped again. kernel.msgmni was set to 512000.

Let me increase kernel.msgmnb and kernel.msgmax to 262144000 and see if it helps stabilize ndo2db.
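
Before and after changing these values, it can help to confirm what the running kernel actually has in effect. The following is a minimal sketch, assuming a Linux host where the three limits are exposed as files under /proc/sys/kernel; the function name `read_msg_limits` and its `base` parameter are hypothetical and only there for illustration.

```python
from pathlib import Path

# The three System V message queue limits discussed in this thread.
PARAMS = ("msgmnb", "msgmax", "msgmni")


def read_msg_limits(base="/proc/sys/kernel"):
    """Return a dict of the current limits, with None for any unreadable value.

    `base` is an assumption for illustration: on Linux the kernel.* sysctl
    values are normally readable as plain files under /proc/sys/kernel.
    """
    limits = {}
    for name in PARAMS:
        path = Path(base) / name
        try:
            limits[name] = int(path.read_text().strip())
        except (OSError, ValueError):
            # File missing, unreadable, or not an integer.
            limits[name] = None
    return limits


if __name__ == "__main__":
    for name, value in read_msg_limits().items():
        print(f"kernel.{name} = {value}")
```

Comparing this output with the values written to the sysctl configuration shows whether a change was actually applied or was lost on a revert or reboot.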

Sending the profile may be an issue. Is there anything specific that I can grab from it and provide?

Thank you,
Vinod

Re: Nagios XI stability issues

Posted: Fri Mar 20, 2020 4:26 pm
by vappukuttan
It stopped again today, even with the recommended values for msgmnb, msgmax, and msgmni.

Mar 20 17:15:23 ndo2db: Warning: Retrying message send. This can occur because you have too few messages allowed or too few total bytes allowed in message queues. You are currently using 256000 of 512000 messages and 262144000 of 262144000 bytes in the queue. See README for kernel tuning options.

I'm not sure whether I can increase the values any further, or why the message queue is getting maxed out.
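
To track how full the queue is over time, the counters in that warning can be pulled out of /var/log/messages and turned into fractions. This is only a rough sketch: the regular expression is written against the wording of the warning quoted above, and the function name `queue_usage` is hypothetical.

```python
import re

# Matches the counters in the ndo2db "Retrying message send" warning,
# based on the wording of the log line quoted in this thread.
USAGE_RE = re.compile(
    r"using (\d+) of (\d+) messages and (\d+) of (\d+) bytes"
)


def queue_usage(line):
    """Return (message_fraction, byte_fraction), or None if the line has no counters."""
    match = USAGE_RE.search(line)
    if match is None:
        return None
    msgs_used, msgs_max, bytes_used, bytes_max = map(int, match.groups())
    if msgs_max == 0 or bytes_max == 0:
        # Avoid dividing by zero on a malformed or unexpected line.
        return None
    return msgs_used / msgs_max, bytes_used / bytes_max


sample = (
    "ndo2db: Warning: Retrying message send. You are currently using "
    "256000 of 512000 messages and 262144000 of 262144000 bytes in the queue."
)

usage = queue_usage(sample)
if usage is not None:
    msg_frac, byte_frac = usage
    print(f"messages: {msg_frac:.0%} used, bytes: {byte_frac:.0%} used")
```

For the warning above, the byte counter is at its limit while the message counter is at half, which suggests the byte limit is the one being hit.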

Thank you,
Vinod

Re: Nagios XI stability issues

Posted: Fri Mar 20, 2020 4:50 pm
by scottwilkerson
Locking thread as it has moved to ticket #820717