Nagios XI crashed due to out of memory - APP is DOWN

Posted: Thu Dec 19, 2019 8:39 am
by brucej543
Nagios XI crashed because the server ran out of memory. After a hard halt and restart, I ran the repair_databases.sh fix script and it completed successfully. The sysstat.log file is recording:
bcnagios01 ndo2db[59710]: Error: max retries exceeded sending message to queue. Kernel queue parameters may need to be tuned. See README.
Dec 19 08:25:05 bcnagios01 ndo2db[59710]: Warning: queue send error, retrying...
The web GUI will not stay up and the system is running at 100% utilization.

Posted: Thu Dec 19, 2019 11:14 am
by brucej543
I was able to clear the system by using the manage_services.sh script to stop all the services and then rerunning repair_databases.sh. I rebooted the server after the repair. I do not see any errors in the log files and performance seems to be back to normal.

Could the cause of this issue be that I scheduled downtime for our production Windows environment, which currently consists of 773 servers and 4585 services, and that this overloaded the application?

Posted: Thu Dec 19, 2019 1:29 pm
by mbellerue
It looks like the message queue was overrun. In theory that could be from scheduling downtime on so many objects at once. Can you run this command and show me the output?

ipcs -l
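
For reference, the three values under "Messages Limits" in the ipcs -l output correspond to the kernel tunables msgmni (max queues), msgmax (max message size) and msgmnb (default max queue size). A minimal sketch for reading them directly, assuming a Linux host that exposes them under /proc/sys/kernel; the function name show_msg_limits is made up for illustration:

```shell
#!/bin/sh
# Print the kernel message-queue limits that `ipcs -l` summarizes.
# The directory is a parameter so the function can be pointed at test files.
show_msg_limits() {
    proc_dir="${1:-/proc/sys/kernel}"
    for name in msgmni msgmax msgmnb; do
        if [ -r "$proc_dir/$name" ]; then
            printf '%s = %s\n' "$name" "$(cat "$proc_dir/$name")"
        else
            # Not fatal: the file may be missing on non-Linux systems.
            printf '%s = (unreadable)\n' "$name"
        fi
    done
}

show_msg_limits
```

If ndo2db keeps logging queue send errors, these are the values the "Kernel queue parameters may need to be tuned" message refers to.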

Posted: Thu Dec 19, 2019 1:44 pm
by brucej543
[root@bcnagios01 logrotate.d]# ipcs -l

------ Messages Limits --------
max queues system wide = 32768
max size of message (bytes) = 131072000
default max size of queue (bytes) = 131072000

------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 4194303
max total shared memory (kbytes) = 1073741824
min seg size (bytes) = 1

------ Semaphore Limits --------
max number of arrays = 128
max semaphores per array = 250
max semaphores system wide = 32000
max ops per semop call = 32
semaphore max value = 32767

Posted: Thu Dec 19, 2019 3:05 pm
by mbellerue
That looks good. How much memory is your system currently using? Please post the output of free -h.

Posted: Thu Dec 19, 2019 3:15 pm
by brucej543
At the time of the issue, 20 GB was allocated to this server. While it was down for the reboot, the allocation was changed to 32 GB.
Here is the current display:
[root@bcnagios01 ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:            31G        1.5G         23G        131M        6.0G         29G
Swap:          1.9G          0B        1.9G
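
The "available" column estimates how much memory can be used by new work without swapping. A rough sketch that expresses it as a percentage of total memory, assuming a Linux kernel that reports MemAvailable in /proc/meminfo; the function name mem_available_pct is hypothetical:

```shell
#!/bin/sh
# Print available memory as a whole-number percentage of total memory.
# The meminfo path is a parameter so the function can be tested on a copy.
mem_available_pct() {
    meminfo="${1:-/proc/meminfo}"
    awk '
        /^MemTotal:/     { total = $2 }
        /^MemAvailable:/ { avail = $2 }
        END {
            if (total > 0) { printf "%d\n", avail * 100 / total }
            else { exit 1 }
        }
    ' "$meminfo"
}

if [ -r /proc/meminfo ]; then
    echo "available: $(mem_available_pct)%"
fi
```

Run periodically (for example from cron), this gives a simple trend to compare against the time of the next large downtime schedule.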

Posted: Thu Dec 19, 2019 4:45 pm
by mbellerue
That all seems reasonable. Let's grab a system profile, just so we don't lose the logs at the very least. Just head over to Admin -> System Profile -> Download System Profile. Then PM that to me.

Posted: Fri Dec 20, 2019 6:52 am
by brucej543
Profile Attached.
Question: we are about to add 1500 new Windows servers to Nagios XI. Is there a health check process to confirm that there are no configuration or performance issues with the current environment?

Support update: Downloaded Baycare_Nagios_profile.zip and shared with team.

Posted: Fri Dec 20, 2019 11:47 am
by mbellerue
As far as verifying that there aren't configuration issues, the easiest thing to do is run through a Delete/Write/Verify. Go to Configure -> Core Config Manager -> Config File Management, and hit the Delete Configs button, followed by the Write Configs button, and then the Verify Files button. That will make sure that your configuration is in good order. If you're adding 1500 servers, it might be best to do it in batches. Add a few hundred, apply config (which checks configuration as well), add a few hundred, apply config.
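
The batching suggestion can be sketched as follows. This is only an illustration: hosts.txt (one hostname per line), the batch size of 300, the batch_ prefix and the function name split_into_batches are all assumptions, and the actual add/apply-config step is left as a comment because it depends on how the hosts are being added.

```shell
#!/bin/sh
# Split a host list into fixed-size batch files so each batch can be
# added and checked with Apply Configuration before the next one.
# Usage: split_into_batches HOSTS_FILE BATCH_SIZE OUT_PREFIX
split_into_batches() {
    hosts_file="$1"
    batch_size="$2"
    out_prefix="$3"
    # split -l writes OUT_PREFIXaa, OUT_PREFIXab, ... with at most
    # BATCH_SIZE lines in each file.
    split -l "$batch_size" "$hosts_file" "$out_prefix"
}

# Example run (hypothetical file name): 1500 hosts in batches of 300.
if [ -f hosts.txt ]; then
    split_into_batches hosts.txt 300 batch_
    for f in batch_*; do
        echo "$f: $(wc -l < "$f") hosts"
        # Add the hosts listed in "$f", apply configuration, and look at
        # the localhost checks before moving on to the next batch.
    done
fi
```

With 1500 hosts and a batch size of 300 this produces five files, batch_aa through batch_ae.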

As far as performance issues go, your best bet is to continue monitoring the localhost checks that are added to Nagios XI by default. Checks, either active or passive, don't take a consistently measurable amount of CPU processing time, memory, or disk IO, so there's no good general rule like "for every 500 checks, add a CPU core." Performance is tied very closely to the environment. Keep an eye on the localhost checks, and watch for any unacceptable slowdowns in your environment as you're adding hosts.

Regarding your profile, it does look like you have some crashed database tables, so we should definitely run through the repair script. Run the following, paste the output into a text file, and post that back here.

/usr/local/nagiosxi/scripts/repair_databases.sh

Posted: Fri Dec 20, 2019 1:09 pm
by brucej543
Database repair after Delete/write/verify plus new profile