Unit nagios.service entered failed state

idemia-cl
Posts: 13
Joined: Fri Jun 12, 2020 12:19 pm

Unit nagios.service entered failed state

Post by idemia-cl »

Hello,

I just migrated from Nagios XI 5.6.14 (CentOS 6) to 5.8.7 (CentOS 7). Since then, Nagios XI has been going down repeatedly, approximately every 15 minutes. The only thing I see in the log is the following:

Feb 7 00:39:20 s1t2nagios02 nagios: wproc: iocache_capacity() is -1048576 for worker Core Worker 20673.
Feb 7 00:39:33 s1t2nagios02 nagios: Caught SIGSEGV, shutting down...
Feb 7 00:39:33 s1t2nagios02 systemd: nagios.service: main process exited, code=exited, status=254/n/a
Feb 7 00:39:33 s1t2nagios02 nagios: Caught SIGTERM, shutting down...
Feb 7 00:39:33 s1t2nagios02 systemd: Unit nagios.service entered failed state.
Feb 7 00:39:33 s1t2nagios02 systemd: nagios.service failed.


The migration and update were done offline.
If you need more information, let me know. We would appreciate your support.

Pablo.
idemia-cl
Posts: 13
Joined: Fri Jun 12, 2020 12:19 pm

Re: Unit nagios.service entered failed state

Post by idemia-cl »

Hello, I have performed the following actions:

- Ran the database repair script:
/usr/local/nagiosxi/scripts/repair_databases.sh

- Restarted the full Nagios stack:
systemctl stop crond
systemctl stop npcd
systemctl stop nagios
pkill -9 -u nagios
rm -rf /usr/local/nagiosxi/var/dbmaint.lock
rm -rf /usr/local/nagiosxi/var/event_handler.lock
rm -rf /usr/local/nagiosxi/scripts/reconfigure_nagios.lock
rm -rf /usr/local/nagios/var/ndo.sock
rm -f /usr/local/nagiosxi/var/event_handler.lock
rm -rf /var/run/nagios.lock
rm -rf /usr/local/nagios/var/nagios.lock
systemctl start nagios
systemctl start npcd
systemctl start crond
systemctl restart httpd


Nagios no longer crashes as often, but we still have the problem, with the following messages:

[1644244397] wproc: iocache_capacity() is -1048576 for worker Core Worker 28830.
[1644244409] NDO-3: mysql_ping: Unknown error. Is the database running?
[1644244409] Caught SIGSEGV, shutting down...
[1644244409] Caught SIGTERM, shutting down...
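
The nagios.log entries above carry Unix epoch timestamps, which makes them hard to line up with the dated entries in /var/log/messages. The following is a minimal sketch that rewrites the leading epoch value as a UTC date; the function name nagios_log_utc is made up for illustration, and it assumes GNU date (for the -d "@epoch" form), as shipped on CentOS 7.

```shell
# Rewrite leading "[epoch]" timestamps of nagios.log lines as UTC dates.
# Reads log lines on stdin; lines without a leading "[digits]" pass through unchanged.
# Assumes GNU date (the -d "@<epoch>" form is a GNU extension).
nagios_log_utc() {
    while IFS= read -r line; do
        case $line in
            \[[0-9]*\]*)
                ts=${line#\[}        # drop the leading "["
                ts=${ts%%\]*}        # keep everything before the first "]"
                rest=${line#*\] }    # message text after "] "
                printf '[%s] %s\n' \
                    "$(date -u -d "@$ts" '+%Y-%m-%d %H:%M:%S UTC')" "$rest"
                ;;
            *)
                printf '%s\n' "$line"
                ;;
        esac
    done
}

# Usage (path as mentioned in this thread):
# nagios_log_utc < /usr/local/nagios/var/nagios.log
```

The output is in UTC, so the server's local offset still has to be applied when comparing with syslog.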
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: Unit nagios.service entered failed state

Post by ssax »

Please PM me a copy of your profile.zip, you can download it from Admin > System Profile by clicking the Download Profile button.

Please PM me this file as well:

/usr/local/nagios/var/nagios.log
idemia-cl
Posts: 13
Joined: Fri Jun 12, 2020 12:19 pm

Re: Unit nagios.service entered failed state

Post by idemia-cl »

Hello, I have attached the requested files.
pbroste
Posts: 1288
Joined: Tue Jun 01, 2021 1:27 pm

Re: Unit nagios.service entered failed state

Post by pbroste »

Hello @idemia-cl

Thanks for sending over the System Profile and nagios.log.

We want to start by verifying the Core config (pre-flight check) and correcting any errors, as we see an issue with the ARAF host config:

/usr/local/nagios/bin/nagios -vvv /usr/local/nagios/etc/nagios.cfg

Then run the following script to verify permissions:

su - nagios
cd /usr/local/nagiosxi/scripts
./reconfigure_nagios.sh

We also see that you are using Postgres and would like you to go through this:
https://support.nagios.com/kb/article/n ... r-754.html

Restart nagios.service (by running systemctl restart nagios) and then verify the Core config again:

/usr/local/nagios/bin/nagios -vvv /usr/local/nagios/etc/nagios.cfg

Please let us know how things look,
Perry
idemia-cl
Posts: 13
Joined: Fri Jun 12, 2020 12:19 pm

Re: Unit nagios.service entered failed state

Post by idemia-cl »

Hi, we executed the instructions, and I have attached the output of the command
"/usr/local/nagios/bin/nagios -vvv /usr/local/nagios/etc/nagios.cfg" from before and after.
We also increased the open files limit in limits.conf, because it was still at the default value.

However, the message "NDO-3: mysql_ping: Unknown error. Is the database running?" persists.

Regards.
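
One thing worth checking after a limits.conf change is the limit the running nagios process actually has: limits.conf is applied by pam_limits to login sessions, and a service started by systemd generally does not pick it up. The following is a minimal sketch for Linux that reads the effective value from /proc; the function name open_files_limit is made up for illustration, and the pgrep call in the usage comment assumes the main process is named nagios.

```shell
# Print the "Max open files" row (soft and hard limit) for a given PID.
# Linux only: reads /proc/<pid>/limits.
open_files_limit() {
    grep '^Max open files' "/proc/$1/limits"
}

# Usage on the Nagios XI host (assumes pgrep is installed and nagios is running):
# open_files_limit "$(pgrep -o -x nagios)"
```

If the value shown is still the default, setting LimitNOFILE for nagios.service in a systemd drop-in is one way to raise it for the service itself.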
ssax
Dreams In Code
Posts: 7682
Joined: Wed Feb 11, 2015 12:54 pm

Re: Unit nagios.service entered failed state

Post by ssax »

Make sure this is done on the EL7 system:

https://support.nagios.com/kb/article-754.html

Edit this file:

/usr/local/nagiosxi/html/config.inc.php

Above this (line 32):

// db-specific connection information

Add this:

// Database connection type
// 1 = persistent, 0 = normal
$cfg['db_conn_persistent'] = 1;

// db-specific connection information

I would also make sure your /etc/my.cnf has these under the [mysqld] section:

max_connections=1000
max_allowed_packet=512M

Then restart the services and test again:

systemctl restart mariadb httpd nagios crond

See if those changes resolve it.
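
Before restarting, it can help to confirm that the two settings really ended up under [mysqld] and not under another section. The following is a minimal sketch that prints only the max_connections and max_allowed_packet lines found in the [mysqld] section of a config file; the function name show_mysqld_settings is made up for illustration. It only reads the file it is given, so it does not follow !includedir directives and does not prove what the running server uses.

```shell
# Print max_connections / max_allowed_packet lines found under [mysqld].
# Argument: path to the config file (defaults to /etc/my.cnf).
show_mysqld_settings() {
    awk '
        # Track whether the current section header is exactly [mysqld].
        /^\[/ { in_mysqld = ($0 ~ /^\[mysqld\]/) }
        in_mysqld && /^[ \t]*(max_connections|max_allowed_packet)[ \t]*=/ {
            gsub(/[ \t]/, "")   # normalize "key = value" to "key=value"
            print
        }
    ' "${1:-/etc/my.cnf}"
}

# Usage:
# show_mysqld_settings /etc/my.cnf
```

An empty result means neither setting was found in that section of the file.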
idemia-cl
Posts: 13
Joined: Fri Jun 12, 2020 12:19 pm

Re: Unit nagios.service entered failed state

Post by idemia-cl »

Hello, we have executed the instructions you gave us. The message "NDO-3: mysql_ping: Unknown error. Is the database running?" no longer appears, but the platform still goes down.

The following are the messages in /var/log/messages that we see when the platform goes down:

Feb 25 09:51:29 s1t2nagios02 nagios: Caught SIGTERM, shutting down...
Feb 25 09:51:29 s1t2nagios02 systemd: Unit nagios.service entered failed state.
Feb 25 10:29:55 s1t2nagios02 nagios: wproc: iocache_capacity() is -1048576 for worker Core Worker 13984.
Feb 25 10:59:51 s1t2nagios02 nagios: wproc: iocache_capacity() is -1048576 for worker Core Worker 13991.
Feb 25 10:59:51 s1t2nagios02 nagios: wproc: iocache_capacity() is -795155 for worker Core Worker 13991.
Feb 25 10:59:51 s1t2nagios02 nagios: Caught SIGTERM, shutting down...
Feb 25 10:59:51 s1t2nagios02 systemd: Unit nagios.service entered failed state.
Feb 25 11:03:28 s1t2nagios02 nagios: wproc: iocache_capacity() is -1048576 for worker Core Worker 3634.
Feb 25 11:33:29 s1t2nagios02 nagios: wproc: iocache_capacity() is -1048576 for worker Core Worker 3638.
Feb 25 11:33:34 s1t2nagios02 nagios: wproc: iocache_capacity() is -795155 for worker Core Worker 3638.
Feb 25 11:33:34 s1t2nagios02 nagios: Caught SIGTERM, shutting down...
Feb 25 11:33:34 s1t2nagios02 systemd: Unit nagios.service entered failed state.
Feb 25 11:49:15 s1t2nagios02 nagios: wproc: iocache_capacity() is -1048576 for worker Core Worker 6493.
Feb 25 12:19:15 s1t2nagios02 nagios: wproc: iocache_capacity() is -1048576 for worker Core Worker 6490.
Feb 25 12:49:40 s1t2nagios02 nagios: Caught SIGTERM, shutting down...
Feb 25 12:51:10 s1t2nagios02 systemd: Unit nagios.service entered failed state.
Feb 25 12:58:14 s1t2nagios02 nagios: wproc: iocache_capacity() is -1048576 for worker Core Worker 4279.
Feb 25 12:58:14 s1t2nagios02 nagios: wproc: iocache_capacity() is -795154 for worker Core Worker 4279.
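
To see how often each of these events occurs over a longer period, the relevant lines can be counted. The following is a minimal sketch using awk; the function name summarize_nagios_log is made up for illustration, and the patterns are taken from the messages quoted in this thread. It only counts occurrences and does not diagnose the cause.

```shell
# Count crash-related nagios messages in syslog-style input.
# Reads the files given as arguments, or stdin if none are given.
summarize_nagios_log() {
    awk '
        /Caught SIGSEGV/            { segv++ }
        /Caught SIGTERM/            { term++ }
        /iocache_capacity\(\) is -/ { iocache++ }
        /entered failed state/      { failed++ }
        END {
            printf "SIGSEGV=%d SIGTERM=%d iocache_warnings=%d failed_state=%d\n",
                   segv, term, iocache, failed
        }
    ' "$@"
}

# Usage (path as mentioned in the post above):
# summarize_nagios_log /var/log/messages
```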

We would appreciate your support.
kfanselow
Posts: 247
Joined: Tue Aug 31, 2021 3:25 pm

Re: Unit nagios.service entered failed state

Post by kfanselow »

Hi Pablo,

Could you walk us through the steps of your migration?

Thanks and Best Regards,
Keith
idemia-cl
Posts: 13
Joined: Fri Jun 12, 2020 12:19 pm

Re: Unit nagios.service entered failed state

Post by idemia-cl »

Yes. Previously we had an EL6 (CentOS 6) server running Nagios XI 5.6.14. We decided to upgrade because of vulnerabilities detected in that version, as well as problems with Nagios crashing.

We installed a new server with EL7 (CentOS 7) and installed Nagios XI 5.6.14 on it. After that, we migrated a backup to the new server, following the procedure (https://assets.nagios.com/downloads/nag ... ios-XI.pdf).

Then we upgraded the new server to version 5.8.7. The whole installation was done offline, following the procedure (https://assets.nagios.com/downloads/nag ... onment.pdf).