Nagios process crashing, looking for debugging suggestions

Posted: Tue Feb 16, 2021 9:13 pm
by acheesem
Hiya, as per the subject, the Nagios process appears to be crashing / bombing out without any obvious reason (to me).

I'm looking for suggestions / help in finding why this is bombing out:

Our Setup:
Nagios process running on physical CentOS Linux 7.9.2009 (Core)
We distribute jobs to workers (4 of them) using mod_gearman

Nagios XI 5.8.1

nagios --version
Nagios Core 4.4.6
gearmand --version
gearmand 1.1.19.1 - https://github.com/gearman/gearmand/issues
mod_gearman_worker --version
mod_gearman_worker: version 3.3.0 running on libgearman 1.1.19.1



nagios.log excerpt:
[1613526089] NDO-3: Ended flapping thread
[1613526089] NDO-3: Ended acknowledgement thread
[1613526089] NDO-3: Ended statechange thread
[1613526089] NDO-3: Ended event_handler thread
[1613526089] NDO-3: Ended notification thread
...
[1613526089] Caught SIGSEGV, shutting down...

...
...
...
[1613526323] NDO-3: Ended acknowledgement thread
[1613526323] NDO-3: Ended flapping thread
[1613526323] NDO-3: Ended statechange thread
[1613526323] NDO-3: Ended event_handler thread
[1613526323] Caught SIGSEGV, shutting down...
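
To gauge how often the daemon is dying, the SIGSEGV timestamps can be pulled out of nagios.log and differenced. This is a minimal sketch only: the helper names `sigsegv_times` and `crash_gaps` are hypothetical, and the log path in the usage comment is an assumed default location that the post does not state.

```shell
#!/bin/sh
# sigsegv_times LOGFILE
# Print the epoch timestamp of every "Caught SIGSEGV" line in LOGFILE.
sigsegv_times() {
    sed -n 's/^\[\([0-9][0-9]*\)] Caught SIGSEGV.*/\1/p' "$1"
}

# crash_gaps
# Read epoch timestamps on stdin (one per line) and print the number of
# seconds between each consecutive pair.
crash_gaps() {
    awk 'NR > 1 { print $1 - prev } { prev = $1 }'
}

# Usage (log path is an assumption):
#   sigsegv_times /usr/local/nagios/var/nagios.log | crash_gaps
```

For the two SIGSEGV lines shown in the excerpt, this prints 234, so those crashes were just under four minutes apart. On a system with GNU date, `date -d @1613526089` converts one of these epoch values to local time.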


I'm relatively happy to poke around and try things out to see if we can debug the fault. At the moment I've enabled a process-watching script that restarts Nagios if it detects it down, and it manages to keep it running for now. But this is of course not ideal, as the crashes are happening far too frequently.
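
A process-watching script of the kind described above might look roughly like this. It is a hypothetical sketch, not the script actually used here: the function name `ensure_running` is invented for illustration, and the usage comment assumes a systemd unit named `nagios`, as the `systemctl` commands later in this thread suggest.

```shell
#!/bin/sh
# ensure_running CHECK_CMD RESTART_CMD
# Run CHECK_CMD; if it fails, report that and run RESTART_CMD.
ensure_running() {
    check_cmd=$1
    restart_cmd=$2
    if sh -c "$check_cmd" >/dev/null 2>&1; then
        echo "up"
    else
        echo "down, restarting"
        sh -c "$restart_cmd"
    fi
}

# Usage, e.g. from a cron job (unit name is an assumption):
#   ensure_running "systemctl is-active --quiet nagios" "systemctl start nagios"
```

Passing the check and restart commands as arguments keeps the function easy to try out with harmless stand-ins such as `true` and `echo` before pointing it at the real service.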

I've read in the past that this can be related to mod_gearman and a header mismatch, so perhaps I need to rebuild mod_gearman. At present I am using the Nagios install scripts for workers / server to set up the two, and they usually work well enough for me.



cheers
--Aaron

Re: Nagios process crashing, looking for debugging suggestio

Posted: Wed Feb 17, 2021 2:31 pm
by benjaminsmith
Hi Aaron,

When did you start noticing this behavior? Please try running the database repair script and let me know if you notice any improvement.

Code:

/usr/local/nagiosxi/scripts/repair_databases.sh

If not, please PM a fresh system profile so we can take a closer look at the log files. Also, do you have a test server set up?

To send us your system profile:
Log in to the Nagios XI GUI using a web browser.
Click the "Admin" > "System Profile" menu.
Click the "Download Profile" button.

You could temporarily disable the Mod Gearman broker module to see if that is related to the issue or not. However, please keep in mind that if you have a really large environment this may cause issues.

Re: Nagios process crashing, looking for debugging suggestio

Posted: Wed Feb 17, 2021 3:21 pm
by acheesem
benjaminsmith wrote:Hi Aaron,

When did you start noticing this behavior? Please try running the database repair script and let me know if you notice any improvement.

Code:

/usr/local/nagiosxi/scripts/repair_databases.sh
Repair run (I have run this a few times). We noticed the problems when we received an error on the front end saying to run a repair on MySQL. On further investigation we found that the Nagios daemon was crashing out, so we proceeded to upgrade / debug from there.
benjaminsmith wrote: If not, please PM a fresh system profile so we can take a closer look at the log files. Also, do you have a test server set up?
Sent through our profile. Unfortunately we don't have a test server set up.
benjaminsmith wrote: You could temporarily disable the Mod Gearman broker module to see if that is related to the issue or not. However, please keep in mind that if you have a really large environment this may cause issues.
I think I will run into too big a backlog of checks. I might be able to do this outside of hours, when it's less service affecting, but I would like to exhaust any other options before doing that if at all possible.

cheers
--Aaron

Re: Nagios process crashing, looking for debugging suggestio

Posted: Thu Feb 18, 2021 11:02 am
by benjaminsmith
Hi,

Thanks for the profile. While I don't see any Nagios Core segfaults in the current log, I did notice the following error(s) in the command subsystem logs.

Code:

Database Error: Could not connect to database
Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
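
The `(2)` at the end of that message is the OS error number, and errno 2 is ENOENT ("No such file or directory"), meaning the socket file did not exist when the connection was attempted. A minimal way to check for the socket is sketched below; the function name `check_mysql_socket` is hypothetical, and the default path is simply the one from the error message.

```shell
#!/bin/sh
# check_mysql_socket [PATH]
# Report whether PATH exists and is a UNIX socket.
# Returns non-zero when the socket is missing.
check_mysql_socket() {
    sock=${1:-/var/lib/mysql/mysql.sock}
    if [ -S "$sock" ]; then
        echo "socket present: $sock"
    else
        echo "socket missing: $sock"
        return 1
    fi
}

# Usage:
#   check_mysql_socket
#   check_mysql_socket /var/lib/mysql/mysql.sock
```

If the socket is missing, the database service is most likely stopped or restarting; checking its status with `systemctl status mariadb` would be a natural next step, assuming the unit is named `mariadb` on this host.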
Since this is your production monitoring system, I would recommend stepping this back down to NDO-2 to see if that resolves the segfault issue. Since you are running a local database, it's a relatively simple procedure to downgrade to ndo2.

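Code:

# Stop Nagios before swapping the broker module
systemctl stop nagios
# Fetch the XI 5.6.14 tarball, which includes the ndoutils (NDO-2) subcomponent
cd /tmp
rm -rf /tmp/nagiosxi
wget https://assets.nagios.com/downloads/nagiosxi/5/xi-5.6.14.tar.gz
tar zxf xi-5.6.14.tar.gz
# Install ndoutils and enable the ndo2db daemon at boot
cd /tmp/nagiosxi/subcomponents/ndoutils
./install
systemctl enable ndo2db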
Then edit your /usr/local/nagios/etc/nagios.cfg and make sure this line is uncommented:

Code:

broker_module=/usr/local/nagios/bin/ndomod.o config_file=/usr/local/nagios/etc/ndomod.cfg
Make sure this line is commented:

Code:

#broker_module=/usr/local/nagios/bin/ndo.so /usr/local/nagios/etc/ndo.cfg
Then start the nagios service:

Code:

systemctl start nagios
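
The two nagios.cfg edits above could also be scripted. The following is a rough sketch under stated assumptions: the helper names `comment_module` and `uncomment_module` are hypothetical, the pattern is a basic regular expression that must not contain `/`, and it is worth backing up nagios.cfg before trying anything like this.

```shell
#!/bin/sh
# comment_module CFG PATTERN
# Prefix '#' to every active broker_module line in CFG that matches PATTERN.
comment_module() {
    tmp=$(mktemp) &&
    sed "/^broker_module=.*$2/ s/^/#/" "$1" > "$tmp" &&
    cat "$tmp" > "$1" &&   # overwrite in place, keeping CFG's owner and mode
    rm -f "$tmp"
}

# uncomment_module CFG PATTERN
# Remove the leading '#' from commented broker_module lines matching PATTERN.
uncomment_module() {
    tmp=$(mktemp) &&
    sed "/^#broker_module=.*$2/ s/^#//" "$1" > "$tmp" &&
    cat "$tmp" > "$1" &&
    rm -f "$tmp"
}

# Usage for the NDO-2 downgrade described above:
#   cp /usr/local/nagios/etc/nagios.cfg /usr/local/nagios/etc/nagios.cfg.bak
#   uncomment_module /usr/local/nagios/etc/nagios.cfg 'ndomod\.o'
#   comment_module   /usr/local/nagios/etc/nagios.cfg 'ndo\.so'
```

Swapping the two patterns reverses the change, which is what a later move back to NDO-3 would need.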
Otherwise, try increasing the max connections on the database. We have a KB article with instructions:

Nagios XI - MySQL/MariaDB - Max Connections
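
Before raising the limit, it can help to compare the configured `max_connections` with the `Max_used_connections` high-water mark the server reports. A minimal sketch follows; the helper name `value_of` is hypothetical, and the usage comment assumes the `mysql` client can log in without prompting (for example through `~/.my.cnf`).

```shell
#!/bin/sh
# value_of NAME
# Read tab-separated "name<TAB>value" rows on stdin, as printed by
# `mysql -N -B`, and print the value for NAME (case-insensitive match).
value_of() {
    awk -F '\t' -v name="$1" 'tolower($1) == tolower(name) { print $2 }'
}

# Usage (credentials handling is an assumption):
#   mysql -N -B -e "SHOW GLOBAL VARIABLES LIKE 'max_connections';
#                   SHOW GLOBAL STATUS LIKE 'Max_used_connections';" > /tmp/conn.txt
#   value_of max_connections      < /tmp/conn.txt
#   value_of Max_used_connections < /tmp/conn.txt
```

If the second number is at or near the first, the connection limit has been reached at some point since the database last started.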

Lastly, I noticed you're running Nagflux + ModGearman, and I would highly recommend setting up a test server for testing out any major or minor releases of XI. Your license allows for 3 separate activations.

https://support.nagios.com/kb/article.php?id=145

Regards,
Benjamin

Re: Nagios process crashing, looking for debugging suggestio

Posted: Thu Feb 18, 2021 4:52 pm
by acheesem
benjaminsmith wrote:Hi,

Thanks for the profile. While I don't see any Nagios Core segfaults in the current log, I did notice the following error(s) in the command subsystem logs.

Code:

Database Error: Could not connect to database
Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
Since this is your production monitoring system, I would recommend stepping this back down to NDO-2 to see if that resolves the segfault issue. Since you are running a local database, it's a relatively simple procedure to downgrade to ndo2.
Awesome, I've done this. We had already increased the max connections when we noticed some database performance issues as well.
benjaminsmith wrote: Lastly, I noticed you're running Nagflux + ModGearman, and I would highly recommend setting up a test server for testing out any major or minor releases of XI. Your license allows for 3 separate activations.

https://support.nagios.com/kb/article.php?id=145
This is something we've been talking about doing, considering this build has grown and developed over the years, so we may build up a secondary system for DR and fail over to it. Thanks for the information; that will help us get it up and running.

cheers for the help
--Aaron

Re: Nagios process crashing, looking for debugging suggestio

Posted: Fri Feb 19, 2021 3:18 pm
by benjaminsmith
Hi Aaron,

From your last reply, it sounds like the downgrade is working out. Let us know if you'd like to keep this thread open in case anything new comes up, or if you have any further questions.

Benjamin

Re: Nagios process crashing, looking for debugging suggestio

Posted: Sun Feb 21, 2021 2:34 pm
by acheesem
Everything seems to be working now; we haven't had any recurring crashes.

Is there anything obvious about the NDO change that we need to be aware of, for future releases for example? Setting up the test environment might have shown these issues and at least alerted us to them. I guess I can keep downgrading NDO if it fails?

thank you for the help, you can close this

cheers
--Aaron

Re: Nagios process crashing, looking for debugging suggestio

Posted: Mon Feb 22, 2021 3:32 pm
by benjaminsmith
Hi,
acheesem wrote: thank you for the help, you can close this
You're welcome!

Just to let you know, the upgrade script will not force upgrade ndo, so you can take advantage of new features without having to downgrade.
To upgrade again in the future, just run the following commands:

Code:

cd nagiosxi/subcomponents/ndo
./upgrade -f
Then edit your /usr/local/nagios/etc/nagios.cfg and make sure this line is commented:

Code:

broker_module=/usr/local/nagios/bin/ndomod.o config_file=/usr/local/nagios/etc/ndomod.cfg
Make sure this line is uncommented:

Code:

broker_module=/usr/local/nagios/bin/ndo.so /usr/local/nagios/etc/ndo.cfg
Then start the nagios service:

Code:

systemctl start nagios
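
After switching broker modules in either direction, it may be worth confirming which ones are actually active before starting the service. The sketch below is illustrative only: the helper name `active_broker_modules` is hypothetical, and the binary path in the usage comment is an assumed default install location.

```shell
#!/bin/sh
# active_broker_modules CFG
# Print the module path of every uncommented broker_module line in CFG.
active_broker_modules() {
    grep -E '^[[:space:]]*broker_module=' "$1" |
        sed 's/^[[:space:]]*broker_module=//' |
        awk '{ print $1 }'
}

# Usage:
#   active_broker_modules /usr/local/nagios/etc/nagios.cfg
# Nagios can also verify the whole configuration before a start
# (binary path is an assumption):
#   /usr/local/nagios/bin/nagios -v /usr/local/nagios/etc/nagios.cfg
```

Seeing both `ndomod.o` and `ndo.so` in the output would suggest that one of the two lines still needs to be commented out.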