Re: Monitoring Engine will not start after upgrading to 5.8
Posted: Thu May 27, 2021 9:41 am
by murdock
Hi,
We're heading into a long weekend and monitoring is down: I have 300+ people counting on me / Nagios XI to monitor ~2600 hosts.
We need to make the call soon: can this be fixed, or do we need to roll back?
My recovery plan is to wipe our VM, reinstall CentOS 7, reinstall XI 5.5, and restore from the last backup. If anyone has a better idea, please let me know.
Please let me know if you need anything from the downed 5.8 instance before I have to wipe it clean.
Rob
Re: Monitoring Engine will not start after upgrading to 5.8
Posted: Thu May 27, 2021 1:40 pm
by vtrac
Hi,
How are you doing?
Please follow this KB and see if you need to increase the "Max" connections:
https://support.nagios.com/kb/article/n ... s-513.html
I noticed this warning:
Code: Select all
WARNING: RLIMIT_NPROC is 64090, total max estimated processes is 71016! You should increase your limits (ulimit -u, or limits.conf)
I found this page on ulimits settings:
https://serverfault.com/questions/62861 ... n-centos-7
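To see which limit actually applies to a given process, here is a minimal sketch. It assumes a Linux host with /proc mounted, and the function name is only illustrative. It prints the soft "Max processes" value (RLIMIT_NPROC) for a PID, which can then be compared with the 71016 estimate in the warning above.

```shell
# Print the soft "Max processes" limit (RLIMIT_NPROC) of a running process.
# Reads /proc/<pid>/limits, so this is Linux-only.
# $1 = process ID
max_processes_for_pid() {
    awk '/^Max processes/ { print $3 }' "/proc/$1/limits"
}

# Example: the limit of the current shell
max_processes_for_pid "$$"
```

For the monitoring engine itself, something like max_processes_for_pid "$(pgrep -o -x nagios)" should work while the service is up, assuming the process is named "nagios".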
Also, please follow this KB for message queue:
https://support.nagios.com/kb/article.php?id=139
Please also make sure "ndo2db" is NOT running, since Nagios XI 5.8.3 uses NDO3.
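One way to check for a leftover ndo2db process is sketched below. It assumes pgrep (from procps) is installed, and the helper name is hypothetical.

```shell
# Succeeds (exit status 0) if a process with exactly this name is running.
# $1 = process name
is_running() {
    pgrep -x "$1" >/dev/null
}

if is_running ndo2db; then
    echo "ndo2db is still running"
else
    echo "ndo2db is not running"
fi
```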
Please run the command below to list your database tables by size:
Code: Select all
echo "SELECT table_schema as 'Database', table_name AS 'Table', round(((data_length + index_length) / 1024 / 1024), 2) 'Size in MB' FROM information_schema.TABLES ORDER BY (data_length + index_length) DESC;" |mysql -t -u root -pnagiosxi
If your "nagios_logentries" table is large, please run the command below to truncate it:
Code: Select all
mysql -u ndoutils -pn@gweb nagios -e 'TRUNCATE TABLE nagios_logentries'
Please run the following to restart all your services:
Code: Select all
systemctl stop crond
systemctl stop npcd
systemctl stop nagios
pkill -9 -u nagios
for i in $(ipcs -q | grep nagios |awk '{print $2}'); do ipcrm -q $i; done
rm -rf /usr/local/nagiosxi/var/dbmaint.lock
rm -rf /usr/local/nagiosxi/var/event_handler.lock
rm -rf /usr/local/nagiosxi/scripts/reconfigure_nagios.lock
systemctl restart mariadb
systemctl start nagios
systemctl start npcd
systemctl start crond
Best Regards,
Vinh
Re: Monitoring Engine will not start after upgrading to 5.8
Posted: Thu May 27, 2021 1:47 pm
by murdock
Hi Vinh,
Message received, I'll start working on this.
Rob
Re: Monitoring Engine will not start after upgrading to 5.8
Posted: Thu May 27, 2021 1:48 pm
by vtrac
Hi,
Also, were you able to do "Apply Configuration"?
Regards,
Vinh
Re: Monitoring Engine will not start after upgrading to 5.8
Posted: Thu May 27, 2021 2:32 pm
by murdock
Hi Vinh,
Yes, I did everything you suggested yesterday including the "apply" and it says it worked; but the nagios service would terminate after about 1 minute (i.e., no change).
I'm working on today's suggestions from you now & will follow up.
Rob
Re: Monitoring Engine will not start after upgrading to 5.8
Posted: Thu May 27, 2021 2:40 pm
by vtrac
Hi Rob,
Great to hear that "Apply Config" worked.
So I suspect that either your database connection limit is too low or some DB tables are too large.
Best Regards,
Vinh
Re: Monitoring Engine will not start after upgrading to 5.8
Posted: Thu May 27, 2021 3:24 pm
by murdock
Vinh,
On the ulimit issue, regarding the serverfault article you referred to, systemd ignores anything/everything in /etc/security/limits*.
So in /usr/lib/systemd/system/*.service, for which service(s) do I need to create an override.conf to increase the user's nproc ("max user processes")?
Rob
PS, with regard to MariaDB, I increased max_connections from the default (151) to the maximum of 818 and restarted; nagios crashed again shortly after restarting.
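On the systemd override question above, here is a minimal sketch of what such a drop-in could contain. It rests on assumptions: that the unit to override is nagios.service, and that 80000 (a value chosen only because it is above the 71016 estimate in the warning) is acceptable on this host. LimitNPROC= is the systemd directive that corresponds to "ulimit -u". The helper name is hypothetical.

```shell
# Write a systemd drop-in that raises the max-processes limit for one unit.
# $1 = drop-in directory, $2 = LimitNPROC value
write_nproc_override() {
    mkdir -p "$1" &&
    printf '[Service]\nLimitNPROC=%s\n' "$2" > "$1/override.conf"
}

# Dry run into a scratch directory to inspect the result first.
# The directory name and the value 80000 are assumptions, see above.
write_nproc_override "${TMPDIR:-/tmp}/nagios.service.d" 80000
cat "${TMPDIR:-/tmp}/nagios.service.d/override.conf"
```

For a real change, the directory would be /etc/systemd/system/nagios.service.d (local overrides go under /etc/systemd/system rather than /usr/lib/systemd/system), followed by systemctl daemon-reload and a restart of the service.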
Re: Monitoring Engine will not start after upgrading to 5.8
Posted: Thu May 27, 2021 4:46 pm
by murdock
Hi Vinh,
I followed all of the suggestions you made and Nagios still crashes shortly after [re]starting.
This has been a P1 (Priority / Severity One) high-visibility catastrophic production failure for us.
At this point, we have completely run out of time. We simply cannot have production down for this long; it has now been more than 24 hours.
I'll note also that I have not yet heard back from Sales on my questions / request for information.
I did not receive any request for any additional information from the 5.8 production instance; so here's the plan:
1. I need to recover / restore our old 5.5 instance so we have a working monitoring system in place for the holiday weekend.
2. Early next week I will clone our 5.5 instance, leaving our 5.5 production system running, thus creating a stand-alone, separate 5.5 instance where I will re-run the upgrade procedure and we can pick this issue back up.
Rob
Re: Monitoring Engine will not start after upgrading to 5.8
Posted: Thu May 27, 2021 4:51 pm
by vtrac
Hi Rob,
I talked to Sean, and he thinks it is an NDO3 issue.
Let's downgrade to the previous version, ndo2db (instructions below).
Code: Select all
systemctl stop nagios
cd /tmp
rm -rf /tmp/nagiosxi
wget https://assets.nagios.com/downloads/nagiosxi/5/xi-5.6.14.tar.gz
tar zxf xi-5.6.14.tar.gz
cd /tmp/nagiosxi/subcomponents/ndoutils
./install
systemctl enable ndo2db
Then edit your /usr/local/nagios/etc/nagios.cfg and make sure this line is uncommented:
broker_module=/usr/local/nagios/bin/ndomod.o config_file=/usr/local/nagios/etc/ndomod.cfg
Make sure this line is commented:
#broker_module=/usr/local/nagios/bin/ndo.so /usr/local/nagios/etc/ndo.cfg
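To double-check the edit, here is a small sketch that lists only the broker_module lines that are active (not commented out) in a config file. The function name is hypothetical.

```shell
# List the broker_module lines that are active (not commented out) in a
# Nagios main config file.
# $1 = path to nagios.cfg
active_broker_modules() {
    grep -E '^[[:space:]]*broker_module=' "$1"
}

# Usage (path taken from the instructions above):
#   active_broker_modules /usr/local/nagios/etc/nagios.cfg
```

After the edit, the output should show the ndomod.o line and not the ndo.so line.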
Then start the nagios service (systemctl start nagios).
Best Regards,
Vinh
Re: Monitoring Engine will not start after upgrading to 5.8
Posted: Thu May 27, 2021 4:56 pm
by vtrac
Hi Rob,
I'm very sorry for the long delay!
We have confirmed that it is an NDO3 issue based on the debug log:
Code: Select all
[1622088804] NDO-3: Ended acknowledgement thread
[1622088804] NDO-3: Ended flapping thread
[1622088804] NDO-3: Ended statechange thread
[1622088804] NDO-3: Ended event_handler thread
[1622088804] NDO-3: Ended notification thread
[1622088804] NDO-3: Ended timed_event thread
[1622088805] NDO-3: Ended service_check thread
[1622088805] NDO-3: Ended downtime thread
[1622088806] Caught SIGTERM, shutting down...
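The bracketed numbers in this log are Unix epoch seconds. To line them up with other logs, here is a minimal sketch that rewrites them as UTC timestamps. It assumes GNU date (for the -d @seconds syntax), and the function name is hypothetical.

```shell
# Rewrite a leading "[<epoch>]" on each input line as a UTC timestamp.
# Lines without such a prefix are passed through unchanged.
humanize_nagios_log() {
    while IFS= read -r line; do
        case "$line" in
            \[*\]*) ts=${line#\[}; ts=${ts%%\]*} ;;  # text between "[" and first "]"
            *) ts= ;;
        esac
        case "$ts" in
            ''|*[!0-9]*)
                printf '%s\n' "$line" ;;
            *)
                printf '[%s]%s\n' \
                    "$(date -u -d "@$ts" '+%Y-%m-%d %H:%M:%S UTC')" \
                    "${line#*\]}" ;;
        esac
    done
}

# Usage: humanize_nagios_log < some_log_file
```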
So please downgrade to ndo2db, following the instructions I posted in my last reply.
Hoping for good news from you!
Best Regards,
Vinh