Monitoring Engine will not start after upgrading to 5.8

Post by **murdock** » Thu May 27, 2021 9:41 am

Hi,

We're heading into a long weekend and monitoring is down: I have 300+ people counting on me / Nagios XI to monitor ~2600 hosts.

We need to decide soon whether this can be fixed or whether we need to roll back.

My recovery plan is to wipe our VM, reinstall CentOS 7, reinstall XI 5.5, and restore from the last backup. If anyone has a better idea, please let me know.

Please let me know if you need anything from the downed 5.8 instance before I have to wipe it clean.

Rob

Post by **vtrac** » Thu May 27, 2021 1:40 pm

Hi,
How are you doing?

Please follow this KB article and see if you need to increase the "Max" connections setting:
https://support.nagios.com/kb/article/n ... s-513.html

I noticed this warning:

Code:

WARNING: RLIMIT_NPROC is 64090, total max estimated processes is 71016! You should increase your limits (ulimit -u, or limits.conf)

I found this page on ulimits settings:
https://serverfault.com/questions/62861 ... n-centos-7
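
To put a number on the warning quoted above, here is a minimal sketch that extracts the two values from the message and prints the gap between them. The variable names are illustrative, and the last command only shows the limit inherited by the current shell; for the running daemon you would read `/proc/<pid>/limits` with its PID instead.

```shell
#!/usr/bin/env bash
# The warning text is copied verbatim from the post above.
warning='WARNING: RLIMIT_NPROC is 64090, total max estimated processes is 71016! You should increase your limits (ulimit -u, or limits.conf)'

# Pull the current limit and the estimated requirement out of the message.
current=$(printf '%s\n' "$warning" | sed -E 's/.*RLIMIT_NPROC is ([0-9]+).*/\1/')
needed=$(printf '%s\n' "$warning" | sed -E 's/.*estimated processes is ([0-9]+).*/\1/')
shortfall=$((needed - current))
echo "limit=$current needed=$needed shortfall=$shortfall"

# Linux only: show the process limit that applies to this shell's children.
grep 'Max processes' /proc/self/limits 2>/dev/null || true
```

The limit would need to rise by at least the printed shortfall for the warning to go away.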

Also, please follow this KB for message queue:
https://support.nagios.com/kb/article.php?id=139

Please also make sure "ndo2db" is NOT running, since Nagios XI 5.8.3 uses NDO3:

Code:

systemctl stop ndo2db

Please run the command below to list your database tables by size:

Code:

echo "SELECT table_schema as 'Database', table_name AS 'Table', round(((data_length + index_length) / 1024 / 1024), 2) 'Size in MB' FROM information_schema.TABLES ORDER BY (data_length + index_length) DESC;" |mysql -t -u root -pnagiosxi

If your "nagios_logentries" table is large, please run the command below to truncate it:

Code:

mysql -u ndoutils -pn@gweb nagios -e 'TRUNCATE TABLE nagios_logentries'

Please run the following to restart all your services:

Code:

systemctl stop crond
systemctl stop npcd
systemctl stop nagios
pkill -9 -u nagios
for i in $(ipcs -q | grep nagios |awk '{print $2}'); do ipcrm -q $i; done
rm -rf /usr/local/nagiosxi/var/dbmaint.lock
rm -rf /usr/local/nagiosxi/var/event_handler.lock
rm -rf /usr/local/nagiosxi/scripts/reconfigure_nagios.lock
systemctl restart mariadb
systemctl start nagios
systemctl start npcd
systemctl start crond
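
Before running the `ipcrm` loop above, it can help to preview which queue IDs it would remove. The sketch below does that against a made-up sample of `ipcs -q` output (the IDs and owners are illustrative, not taken from Rob's system) and matches on the owner column exactly rather than grepping the whole line.

```shell
#!/usr/bin/env bash
# Illustrative sample in the usual `ipcs -q` layout; replace with real output.
sample='------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages
0x00000001 32768      nagios     600        0            0
0x00000002 32769      apache     600        0            0
0x00000003 32770      nagios     600        131072       512'

# Keep only rows whose third column (owner) is exactly "nagios".
ids=$(printf '%s\n' "$sample" | awk '$3 == "nagios" {print $2}')

# Dry run: print the commands instead of executing them.
for id in $ids; do
    echo "would run: ipcrm -q $id"
done
```

On a live system the same `awk` filter can be fed from `ipcs -q` directly.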

Best Regards,
Vinh

Post by **murdock** » Thu May 27, 2021 1:47 pm

Hi Vinh,

Message received, I'll start working on this.

Rob

Post by **vtrac** » Thu May 27, 2021 1:48 pm

Hi,
Also, were you able to do "Apply Configuration"?

Regards,
Vinh

Post by **murdock** » Thu May 27, 2021 2:32 pm

Hi Vinh,

Yes, I did everything you suggested yesterday including the "apply" and it says it worked; but the nagios service would terminate after about 1 minute (i.e., no change).

I'm working on today's suggestions from you now & will follow up.

Rob

Post by **vtrac** » Thu May 27, 2021 2:40 pm

Hi Rob,
Great to hear that "Apply Configuration" worked.

So I suspect that you are hitting the database connection limit, or that some DB tables are too large.

Best Regards,
Vinh

Post by **murdock** » Thu May 27, 2021 3:24 pm

Vinh,

On the ulimit issue, regarding the serverfault article you referred to, systemd ignores anything/everything in /etc/security/limits*.

So in /usr/lib/systemd/system/*.service, for which service(s) do I need to create an override.conf to increase the user's nproc ("max user processes")?

Rob

PS, with regard to MariaDB, I increased max_connections from the default (151) to the maximum of 818 and restarted; nagios crashed again shortly after restarting.
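
For reference, a systemd drop-in that raises the process limit for a single unit could look like the sketch below. The thread does not confirm which unit needs it, so `nagios.service` and the value 80000 (chosen only because it is above the 71016 estimate in the warning) are assumptions. The helper name `write_nproc_dropin` is hypothetical, and the script writes into a scratch directory so the result can be inspected first.

```shell
#!/usr/bin/env bash
# Hypothetical helper: write an override.conf that sets LimitNPROC.
write_nproc_dropin() {
    local dir="$1" limit="$2"
    mkdir -p "$dir"
    printf '[Service]\nLimitNPROC=%s\n' "$limit" > "$dir/override.conf"
}

# Dry run into a temporary directory (assumed unit name and value).
scratch=$(mktemp -d)
write_nproc_dropin "$scratch/nagios.service.d" 80000
cat "$scratch/nagios.service.d/override.conf"

# To apply for real (as root), target /etc/systemd/system/nagios.service.d,
# then run: systemctl daemon-reload && systemctl restart nagios
# and check the result with: systemctl show nagios -p LimitNPROC
```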

Post by **murdock** » Thu May 27, 2021 4:46 pm

Hi Vinh,

I followed all of the suggestions you made and Nagios still crashes shortly after [re]starting.

This has been a P1 (Priority / Severity One) high-visibility catastrophic production failure for us.

At this point, we have completely run out of time; we simply cannot have production down for this long, now more than 24 hours.

I'll note also that I have not yet heard back from Sales on my questions / request for information.

I did not receive any request for any additional information from the 5.8 production instance; so here's the plan:

1. I need to recover / restore our old 5.5 instance so we have a working monitoring system in place for the holiday weekend.

2. Early next week I will clone our 5.5 instance, leaving our 5.5 production system running, thus creating a stand-alone, separate 5.5 instance where I will re-run the upgrade procedure and we can pick this issue back up.

Rob

Post by **vtrac** » Thu May 27, 2021 4:51 pm

Hi Rob,
I talked to Sean, and he thinks it is an NDO3 issue.

Let's downgrade to the previous version of ndo2db (instructions below).

Code:

systemctl stop nagios
cd /tmp
rm -rf /tmp/nagiosxi
wget https://assets.nagios.com/downloads/nagiosxi/5/xi-5.6.14.tar.gz
tar zxf xi-5.6.14.tar.gz
cd /tmp/nagiosxi/subcomponents/ndoutils
./install
systemctl enable ndo2db

Then edit your /usr/local/nagios/etc/nagios.cfg and make sure this line is uncommented:
broker_module=/usr/local/nagios/bin/ndomod.o config_file=/usr/local/nagios/etc/ndomod.cfg

Make sure this line is commented:
#broker_module=/usr/local/nagios/bin/ndo.so /usr/local/nagios/etc/ndo.cfg
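
The two edits above can also be scripted. The sketch below applies them with `sed` to a temporary copy holding the two lines from this post, so nothing on the real system is touched; the starting state of the sample (ndomod commented out, ndo.so active) is an assumption about what a 5.8 install looks like. Point `cfg` at the real nagios.cfg only after checking the result.

```shell
#!/usr/bin/env bash
# Temporary stand-in for /usr/local/nagios/etc/nagios.cfg (assumed start state).
cfg=$(mktemp)
cat > "$cfg" <<'EOF'
#broker_module=/usr/local/nagios/bin/ndomod.o config_file=/usr/local/nagios/etc/ndomod.cfg
broker_module=/usr/local/nagios/bin/ndo.so /usr/local/nagios/etc/ndo.cfg
EOF

# Uncomment the ndomod.o line, comment out the ndo.so line, keep a .bak copy.
sed -i.bak -E \
    -e 's@^#[[:space:]]*(broker_module=.*/ndomod\.o .*)@\1@' \
    -e 's@^(broker_module=.*/ndo\.so .*)@#\1@' "$cfg"

# Show the resulting broker_module lines for review.
grep 'broker_module' "$cfg"
```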

Then start the nagios service:

Code:

systemctl start nagios

Best Regards,
Vinh

Post by **vtrac** » Thu May 27, 2021 4:56 pm

Hi Rob,
I'm very sorry for the delay!

We have confirmed that it is an NDO3 issue based on the debug log:

Code:

[1622088804] NDO-3: Ended acknowledgement thread
[1622088804] NDO-3: Ended flapping thread
[1622088804] NDO-3: Ended statechange thread
[1622088804] NDO-3: Ended event_handler thread
[1622088804] NDO-3: Ended notification thread
[1622088804] NDO-3: Ended timed_event thread
[1622088805] NDO-3: Ended service_check thread
[1622088805] NDO-3: Ended downtime thread
[1622088806] Caught SIGTERM, shutting down...

So please downgrade to ndo2db using the instructions I posted in my last reply.

Hoping for good news from you!
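
The bracketed values in the log excerpt above are Unix epoch timestamps. As a small sketch, the script below takes three lines copied from that excerpt, extracts the first and last timestamps, and prints the number of seconds between the first thread ending and the SIGTERM. The final command assumes GNU `date` (as shipped with CentOS 7) to show the first timestamp as a readable UTC time.

```shell
#!/usr/bin/env bash
# Three lines copied from the debug log quoted above.
log='[1622088804] NDO-3: Ended acknowledgement thread
[1622088805] NDO-3: Ended service_check thread
[1622088806] Caught SIGTERM, shutting down...'

# Extract the epoch value from the first and last lines.
first=$(printf '%s\n' "$log" | sed -n -E '1s/^\[([0-9]+)\].*/\1/p')
last=$(printf '%s\n' "$log" | sed -n -E '$s/^\[([0-9]+)\].*/\1/p')
elapsed=$((last - first))
echo "first=$first last=$last elapsed=${elapsed}s"

# GNU date only: convert the first timestamp to a human-readable UTC time.
date -u -d "@$first" 2>/dev/null || true
```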

Best Regards,
Vinh

Nagios Support Forum