Nagios XI crashed due to out of memory - APP is DOWN

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
brucej543
Posts: 134
Joined: Thu Jun 21, 2018 9:33 am

Nagios XI crashed due to out of memory - APP is DOWN

Post by brucej543 »

Nagios XI crashed because the server ran out of memory. After a hard halt and restart, I ran the repair_databases.sh fix script, and it completed successfully. The sysstat.log file is now recording:
bcnagios01 ndo2db[59710]: Error: max retries exceeded sending message to queue. Kernel queue parameters may need to be tuned. See README.
Dec 19 08:25:05 bcnagios01 ndo2db[59710]: Warning: queue send error, retrying...
The web GUI will not stay up, and the system is running at 100% utilization.
brucej543
Posts: 134
Joined: Thu Jun 21, 2018 9:33 am

Re: Nagios XI crashed due to out of memory - APP is DOWN

Post by brucej543 »

I was able to clear the problem by using the manage_services.sh script to stop all the services and then rerunning repair_databases.sh. I rebooted the server after the repair script finished. I do not see any errors in the log files, and performance seems to be back to normal.

Could the cause of this issue be that I scheduled downtime for our production Windows environment, which currently consists of 773 servers and 4,585 services, and that it overloaded the application?
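
For anyone scheduling downtime across this many hosts, one thing that might reduce the burst is submitting the external commands in small batches with a pause between them. The sketch below is only an illustration under assumptions: the function name, the batch size of 50, and the 5-second pause are made up, and whether batching actually avoids the queue overrun is untested. It uses the standard Nagios Core SCHEDULE_HOST_DOWNTIME external command with a fixed window and no trigger.

```shell
# Emit SCHEDULE_HOST_DOWNTIME external commands for host names read from
# stdin (one per line), pausing after every batch.
# Usage: emit_downtime START END AUTHOR COMMENT [BATCH_SIZE] [PAUSE_SECONDS]
# START and END are Unix timestamps. Batch size and pause are hypothetical
# defaults chosen for illustration, not tuned values.
emit_downtime() {
    start=$1
    end=$2
    author=$3
    comment=$4
    batch=${5:-50}
    pause=${6:-5}
    count=0
    while IFS= read -r host; do
        [ -n "$host" ] || continue      # skip blank lines
        # Fields: host;start;end;fixed=1;trigger_id=0;duration;author;comment
        printf '[%s] SCHEDULE_HOST_DOWNTIME;%s;%s;%s;1;0;%s;%s;%s\n' \
            "$(date +%s)" "$host" "$start" "$end" \
            "$((end - start))" "$author" "$comment"
        count=$((count + 1))
        if [ $((count % batch)) -eq 0 ]; then
            sleep "$pause"              # let the backlog drain between batches
        fi
    done
}
```

A possible invocation, assuming a hypothetical host list windows_hosts.txt and the default command file location, would be: emit_downtime 1700000000 1700003600 admin "patch window" < windows_hosts.txt > /usr/local/nagios/var/rw/nagios.cmd. This covers hosts only; SCHEDULE_HOST_SVC_DOWNTIME takes the same arguments and applies to every service on a host.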
mbellerue
Posts: 1403
Joined: Fri Jul 12, 2019 11:10 am

Re: Nagios XI crashed due to out of memory - APP is DOWN

Post by mbellerue »

It looks like the message queue was overrun. In theory, that could be from scheduling downtime on so many objects at once. Can you run this command and show me the output?

ipcs -l
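
While ipcs -l reports the limits, ipcs -q lists the queues themselves along with how many bytes and messages each one currently holds. As a rough sketch (the function name is made up, and it assumes the util-linux column layout where used-bytes and messages are the fifth and sixth fields), the totals can be summed like this:

```shell
# Sum used bytes and queued messages across all System V message queues.
# Expects the output of `ipcs -q` on stdin. Assumes the util-linux layout:
# key, msqid, owner, perms, used-bytes, messages.
queue_usage() {
    awk '$1 ~ /^0x/ { bytes += $5; msgs += $6 }
         END { printf "%d bytes in %d messages\n", bytes, msgs }'
}
# Typical use on the Nagios XI server:  ipcs -q | queue_usage
```

If the byte total sits close to the "default max size of queue" value reported by ipcs -l, that would be consistent with the queue send errors in the log.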
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.

Be sure to check out our Knowledgebase for helpful articles and solutions!
brucej543
Posts: 134
Joined: Thu Jun 21, 2018 9:33 am

Re: Nagios XI crashed due to out of memory - APP is DOWN

Post by brucej543 »

[root@bcnagios01 logrotate.d]# ipcs -l

------ Messages Limits --------
max queues system wide = 32768
max size of message (bytes) = 131072000
default max size of queue (bytes) = 131072000

------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 4194303
max total shared memory (kbytes) = 1073741824
min seg size (bytes) = 1

------ Semaphore Limits --------
max number of arrays = 128
max semaphores per array = 250
max semaphores system wide = 32000
max ops per semop call = 32
semaphore max value = 32767
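
The three message limits above come from the kernel parameters msgmni (max queues system wide), msgmax (max size of message), and msgmnb (default max size of queue). As a small sketch for checking them directly, the function below reads those values from /proc/sys/kernel; the function name is made up, and the directory argument exists only so it can be pointed at a scratch directory for testing.

```shell
# Print the three message queue tunables that `ipcs -l` summarizes.
# The directory defaults to /proc/sys/kernel; passing another directory
# is only useful for testing against scratch files.
show_msg_limits() {
    dir=${1:-/proc/sys/kernel}
    for name in msgmni msgmax msgmnb; do
        if [ -r "$dir/$name" ]; then
            printf 'kernel.%s = %s\n' "$name" "$(cat "$dir/$name")"
        else
            printf 'kernel.%s = (unreadable)\n' "$name"
        fi
    done
}
# Typical use:  show_msg_limits
```

On this server the expected result would be 32768 for msgmni and 131072000 for both msgmax and msgmnb, matching the ipcs -l output.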
mbellerue
Posts: 1403
Joined: Fri Jul 12, 2019 11:10 am

Re: Nagios XI crashed due to out of memory - APP is DOWN

Post by mbellerue »

That looks good. How much memory is your system currently using? Please run free -h and post the output.
brucej543
Posts: 134
Joined: Thu Jun 21, 2018 9:33 am

Re: Nagios XI crashed due to out of memory - APP is DOWN

Post by brucej543 »

At the time of the issue, 20 GB was allocated to this server. While it was down for the reboot, the allocation was increased to 32 GB.
Here is the current output:
[root@bcnagios01 ~]# free -h
              total        used        free      shared  buff/cache   available
Mem:            31G        1.5G         23G        131M        6.0G         29G
Swap:          1.9G          0B        1.9G
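
To keep an eye on memory headroom over time rather than from a single snapshot, the "available" column can be turned into a percentage of total. This is just a sketch: the function name is made up, and it assumes the procps-ng layout shown above, where "available" is the seventh field of the Mem: line.

```shell
# Print the percentage of total memory that is still available,
# truncated to a whole number.
# Expects the output of `free -m` on stdin (procps-ng layout, where
# "available" is the seventh field of the Mem: line).
mem_available_pct() {
    awk '/^Mem:/ && $2 > 0 { printf "%d\n", $7 * 100 / $2 }'
}
# Typical use:  free -m | mem_available_pct
```

With the figures posted here (29G available out of 31G total), the result would be somewhere above 90 percent.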
mbellerue
Posts: 1403
Joined: Fri Jul 12, 2019 11:10 am

Re: Nagios XI crashed due to out of memory - APP is DOWN

Post by mbellerue »

That all seems reasonable. Let's grab a system profile, just so we don't lose the logs at the very least. Just head over to Admin -> System Profile -> Download System Profile. Then PM that to me.
brucej543
Posts: 134
Joined: Thu Jun 21, 2018 9:33 am

Re: Nagios XI crashed due to out of memory - APP is DOWN

Post by brucej543 »

Profile attached.
Question: we are about to add 1,500 new Windows servers to Nagios XI. Is there a health check process to confirm that there are no configuration or performance issues with the current environment?

Support update: Downloaded Baycare_Nagios_profile.zip and shared with team.
mbellerue
Posts: 1403
Joined: Fri Jul 12, 2019 11:10 am

Re: Nagios XI crashed due to out of memory - APP is DOWN

Post by mbellerue »

As far as verifying that there aren't configuration issues, the easiest thing to do is run through a Delete/Write/Verify. Go to Configure -> Core Config Manager -> Config File Management, and hit the Delete Configs button, followed by the Write Configs button, and then the Verify Files button. That will make sure that your configuration is in good order. If you're adding 1500 servers, it might be best to do it in batches. Add a few hundred, apply config (which checks configuration as well), add a few hundred, apply config.
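
If the new servers are listed in a text file, one simple way to prepare the batches is to split that file into chunks of a few hundred lines each. The sketch below assumes a hypothetical host list with one host name per line; the function name, file names, and the batch size of 300 are illustrative only.

```shell
# Split a host list (one host name per line) into batch files.
# Usage: split_hosts HOST_FILE BATCH_SIZE OUTPUT_PREFIX
# Output files are named OUTPUT_PREFIXaa, OUTPUT_PREFIXab, and so on.
split_hosts() {
    split -l "$2" "$1" "$3"
}
# Example with hypothetical names:
#   split_hosts new_windows_hosts.txt 300 batch_
# Then import one batch file, apply configuration, check the localhost
# checks, and only then move on to the next batch.
```

With 1500 hosts and a batch size of 300, this produces five files, batch_aa through batch_ae.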

As far as performance issues go, your best bet is to keep monitoring the localhost checks that are added to Nagios XI by default. Checks, whether active or passive, don't take a consistently measurable amount of CPU time, memory, or disk IO, so there's no good general rule like "add a CPU core for every 500 checks." Performance is tied very closely to the environment. Keep an eye on the localhost checks, and watch for any unacceptable slowdowns in your environment as you're adding hosts.

Regarding your profile, it does look like you have some crashed database tables, so we should definitely run through the repair script. Run the following, paste the output into a text file, and post that back here.

/usr/local/nagiosxi/scripts/repair_databases.sh
brucej543
Posts: 134
Joined: Thu Jun 21, 2018 9:33 am

Re: Nagios XI crashed due to out of memory - APP is DOWN

Post by brucej543 »

Attached: the database repair output, run after the Delete/Write/Verify, plus a new profile.