Nagios process crashing, looking for debugging suggestions
Hiya, as per the subject, the Nagios process appears to be crashing / bombing out without any obvious reason (to me).
I'm looking for suggestions / help in finding why this is bombing out:
Our Setup:
Nagios process running on physical CentOS Linux 7.9.2009 (Core)
We distribute jobs to workers (4 of them) using mod_gearman
Nagios XI 5.8.1
nagios --version
Nagios Core 4.4.6
gearmand --version
gearmand 1.1.19.1 - https://github.com/gearman/gearmand/issues
mod_gearman_worker --version
mod_gearman_worker: version 3.3.0 running on libgearman 1.1.19.1
nagios.log excerpt:
[1613526089] NDO-3: Ended flapping thread
[1613526089] NDO-3: Ended acknowledgement thread
[1613526089] NDO-3: Ended statechange thread
[1613526089] NDO-3: Ended event_handler thread
[1613526089] NDO-3: Ended notification thread
...
[1613526089] Caught SIGSEGV, shutting down...
...
...
...
[1613526323] NDO-3: Ended acknowledgement thread
[1613526323] NDO-3: Ended flapping thread
[1613526323] NDO-3: Ended statechange thread
[1613526323] NDO-3: Ended event_handler thread
[1613526323] Caught SIGSEGV, shutting down...
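To see how often the daemon is segfaulting, the crash lines in the excerpt above can be pulled out and converted to readable times. This is a minimal sketch, not part of the original post: the function names are made up for illustration, the log path in the example is an assumption, and the `date -d @...` form assumes GNU date (as shipped with CentOS).

```shell
#!/usr/bin/env bash
# Hypothetical helpers for counting SIGSEGV shutdowns in a Nagios log.
# Matching lines look like: [1613526089] Caught SIGSEGV, shutting down...

segv_epochs() {
    # Print the epoch timestamp of every "Caught SIGSEGV" line in file $1.
    sed -n 's/^\[\([0-9][0-9]*\)] Caught SIGSEGV.*/\1/p' "$1"
}

segv_report() {
    # Convert each crash timestamp to a readable UTC time (GNU date syntax).
    local epoch
    segv_epochs "$1" | while read -r epoch; do
        date -u -d "@${epoch}" '+%Y-%m-%d %H:%M:%S UTC'
    done
}

# Example (the log path is an assumption; adjust to your install):
#   segv_report /usr/local/nagios/var/nagios.log
```

Comparing these times against MySQL or mod_gearman logs may help show whether the crashes line up with another event.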
I'm relatively happy to poke around and try things out to see if we can debug the fault. At the moment I've enabled a process-watching script that restarts nagios if it detects it is down, and that manages to keep it running for now. But this is of course not ideal, as the crashes are happening far too frequently.
I've read in the past that this can be related to mod_gearman and a header mismatch, so perhaps I need to rebuild mod_gearman. At present I am using the nagios install scripts for workers / server to set up the two, and they usually work well enough for me.
cheers
--Aaron
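The process-watching script mentioned in the post is not shown, so the following is only a sketch of what such a watchdog might look like, assuming nagios runs as a systemd service named `nagios`. The function name and the `SYSTEMCTL` override variable are hypothetical and exist only to make the sketch easy to try out.

```shell
#!/usr/bin/env bash
# Hypothetical watchdog: restart a systemd service if it is not active.
# SYSTEMCTL can be overridden for dry runs; it defaults to the real systemctl.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

ensure_running() {
    local service="$1"
    # "is-active --quiet" exits 0 only when the unit is currently active.
    if "$SYSTEMCTL" is-active --quiet "$service"; then
        return 0
    fi
    echo "$(date '+%Y-%m-%d %H:%M:%S') ${service} is down, restarting"
    "$SYSTEMCTL" restart "$service"
}

# Example, e.g. run from cron every minute:
#   ensure_running nagios
```

A watchdog like this only masks the crash, so logging each restart (as above) keeps a record of how often it fires.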
-
benjaminsmith
- Posts: 5324
- Joined: Wed Aug 22, 2018 4:39 pm
- Location: saint paul
Re: Nagios process crashing, looking for debugging suggestio
Hi Aaron,
When did you start noticing this behavior? Please try running the database repair script and let me know if you notice any improvement.
Code:
/usr/local/nagiosxi/scripts/repair_databases.sh
If not, please PM a fresh system profile so we can take a closer look at the log files. Also, do you have a test server set up?
To send us your system profile:
Log in to the Nagios XI GUI using a web browser.
Click the "Admin" > "System Profile" menu.
Click the "Download Profile" button.
You could temporarily disable the Mod Gearman broker module to see whether it is related to the issue. However, please keep in mind that if you have a really large environment, this may cause issues.
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
Be sure to check out our Knowledgebase for helpful articles and solutions!
Re: Nagios process crashing, looking for debugging suggestio
benjaminsmith wrote: When did you start noticing this behavior? Please try running the database repair script and let me know if you notice any improvement.
Code:
/usr/local/nagiosxi/scripts/repair_databases.sh
Repair run (have run this a few times). We noticed the problems when we received an error on the front end saying to run a repair on MySQL. On further investigation we found that the nagios daemon was crashing out, so we proceeded to upgrade / debug from there.
benjaminsmith wrote: If not, please PM a fresh system profile so we can take a closer look at the log files. Also, do you have a test server set up?
Sent through our profile. Unfortunately we don't have a test server set up.
benjaminsmith wrote: You could temporarily disable the Mod Gearman broker module to see if that is related to the issue or not. However, please keep in mind if you have a really large environment this may cause issues.
I think I would run into too big a backlog of checks. I might be able to do this outside of hours when it's less service affecting, but I would like to exhaust any other options before doing that if at all possible.
cheers
--Aaron
-
benjaminsmith
Re: Nagios process crashing, looking for debugging suggestio
Hi,
Thanks for the profile. While I don't see any Nagios Core segfaults in the current log, I did notice the following error(s) in the command subsystem logs.
Code:
Database Error: Could not connect to database
Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)
Since this is your production monitoring system, I would recommend stepping this back down to NDO-2 to see if that resolves the segfault issue. Since you are running a local database, it's a relatively simple procedure to downgrade to ndo2.
Code:
systemctl stop nagios
cd /tmp
rm -rf /tmp/nagiosxi
wget https://assets.nagios.com/downloads/nagiosxi/5/xi-5.6.14.tar.gz
tar zxf xi-5.6.14.tar.gz
cd /tmp/nagiosxi/subcomponents/ndoutils
./install
systemctl enable ndo2db
Then edit your /usr/local/nagios/etc/nagios.cfg and make sure this line is uncommented:
Code:
broker_module=/usr/local/nagios/bin/ndomod.o config_file=/usr/local/nagios/etc/ndomod.cfg
Make sure this line is commented:
Code:
#broker_module=/usr/local/nagios/bin/ndo.so /usr/local/nagios/etc/ndo.cfg
Then start the nagios service:
Code:
systemctl start nagios
Otherwise, try to increase the max connections on the database. We have a KB article with instructions.
Nagios XI - MySQL/MariaDB - Max Connections
Lastly, I noticed you're running Nagflux + ModGearman, and I would highly recommend setting up a test server for testing out any major or minor releases of XI. Your license allows for 3 separate activations.
https://support.nagios.com/kb/article.php?id=145
Regards,
Benjamin
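After editing nagios.cfg as described above, it may be worth confirming that only one of the two NDO broker modules is left uncommented before starting nagios. The sketch below is one possible check; the function names and the pattern are assumptions for illustration and are not part of Nagios XI.

```shell
#!/usr/bin/env bash
# Hypothetical check: which NDO broker module lines are active in nagios.cfg?
# Matches uncommented broker_module lines loading ndomod.o (NDO-2) or ndo.so (NDO-3).
NDO_PATTERN='^[[:space:]]*broker_module=.*/(ndomod\.o|ndo\.so)([[:space:]]|$)'

active_ndo_brokers() {
    # Print every active (uncommented) NDO broker_module line in file $1.
    grep -E "$NDO_PATTERN" "$1"
}

check_single_ndo() {
    # Succeed only if exactly one NDO broker module is active.
    local count
    count=$(grep -cE "$NDO_PATTERN" "$1" || true)
    if [ "$count" -eq 1 ]; then
        echo "OK: one NDO broker module is active"
    else
        echo "WARNING: $count NDO broker modules are active (expected 1)"
        return 1
    fi
}

# Example:
#   active_ndo_brokers /usr/local/nagios/etc/nagios.cfg
#   check_single_ndo /usr/local/nagios/etc/nagios.cfg
```

Lines starting with `#` are ignored because the pattern only allows whitespace before `broker_module=`.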
Re: Nagios process crashing, looking for debugging suggestio
benjaminsmith wrote: Thanks for the profile. While I don't see any Nagios Core segfaults in the current log, I did notice the following error(s) in the command subsystem logs: "Database Error: Could not connect to database. Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)". Since this is your production monitoring system, I would recommend stepping this back down to NDO-2 to see if that resolves the segfault issue. Since you are running a local database, it's a relatively simple procedure to downgrade to ndo2.
Awesome, I've done this. We had already increased the max connections when we noticed some database performance issues as well.
benjaminsmith wrote: Lastly, I noticed you're running Nagflux + ModGearman, and I would highly recommend setting up a test server for testing out any major or minor releases of XI. Your license allows for 3 separate activations.
https://support.nagios.com/kb/article.php?id=145
This is something we've been talking about doing, considering this build has grown and developed over the years, so we may build up a secondary system for DR and fail over to it. Thanks for the information; that will help us get it up and running.
cheers for the help
--Aaron
-
benjaminsmith
Re: Nagios process crashing, looking for debugging suggestio
Hi Aaron,
From your last reply, it sounds like the downgrade is working out. Let us know if you'd like to keep this open for now in case anything new comes up, or if you have any new questions.
Benjamin
Re: Nagios process crashing, looking for debugging suggestio
Everything seems to be working now; we haven't had any constant crashes.
Is there anything obvious about the NDO change that we need to be aware of, for future releases for example? Setting up a test environment might have shown these issues, or at least alerted us to them. I guess I can keep downgrading NDO if it fails?
Thank you for the help, you can close this.
cheers
--Aaron
-
benjaminsmith
Re: Nagios process crashing, looking for debugging suggestio
Hi,
Just to let you know, the upgrade script will not force-upgrade NDO, so you can take advantage of new features without having to downgrade again.
To upgrade again in the future, just run the following commands:
Code:
cd nagiosxi/subcomponents/ndo
./upgrade -f
Then edit your /usr/local/nagios/etc/nagios.cfg and make sure this line is commented:
Code:
broker_module=/usr/local/nagios/bin/ndomod.o config_file=/usr/local/nagios/etc/ndomod.cfg
Make sure this line is uncommented:
Code:
broker_module=/usr/local/nagios/bin/ndo.so /usr/local/nagios/etc/ndo.cfg
Then start the nagios service:
Code:
systemctl start nagios
You're welcome!