Hi
We have a deployment with a Nagios XI server, a separate MariaDB host and 7 mod-gearman hosts.
All checks are farmed out to the mod-gearman hosts; Nagios XI itself does very little monitoring.
However, the Monitoring Engine is really unstable. It keeps stopping and we can't find a reason why.
Can you please outline the steps we need to take to determine what is causing the Monitoring Engine to stop?
This happens especially after the configuration is applied, but it can also happen at any other time.
rgds
George
Monitoring Engine Stops
-
benjaminsmith
- Posts: 5324
- Joined: Wed Aug 22, 2018 4:39 pm
- Location: saint paul
Re: Monitoring Engine Stops
Hi George,
When did you notice the instability and did this coincide with any system changes? You'll find any error messages related to the Nagios Core process in the nagios.log.
Since you noticed issues during apply configuration, let's run the following tail command, then apply configuration and then post the output to the thread.
Also, send us the system profile and we'll take a closer look at the log files. Since the database is on a separate host, please retrieve the database log as there could be some issues with connectivity or corrupted tables. Thanks, Benjamin
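Before digging into the database log, it can help to rule out basic connectivity from the XI server to the offloaded database. The following is a minimal sketch, not an official Nagios tool: `check_db_port` is an illustrative name, `db.example.com` is a placeholder host, and 3306 is assumed because it is the default MariaDB port. It only tests that the TCP port accepts connections; it says nothing about credentials or table health.

```shell
# Return 0 if a TCP connection to host:port succeeds within 3 seconds.
# Uses bash's /dev/tcp redirection, so bash must be installed.
check_db_port() {
  host=$1
  port=${2:-3306}   # 3306 is the default MariaDB/MySQL port (assumed here)
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Example (db.example.com is a placeholder for the offloaded database host):
# check_db_port db.example.com && echo "port reachable" || echo "port unreachable"
```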
To Download a System Profile
To send us your system profile:
Log in to the Nagios XI GUI using a web browser.
Click the "Admin" > "System Profile" Menu
Click the "Download Profile" button
Code: Select all
/usr/local/nagios/var/nagios.log
Code: Select all
tail -f /usr/local/nagiosxi/var/cmdsubsys.log /usr/local/nagios/var/nagios.log
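To keep a copy of what the tail command prints while the configuration is applied, the output can be saved to a file and attached to the thread. This is a minimal sketch assuming GNU coreutils (`timeout`, `tail -F`, `tee`); the function name `capture_logs`, the duration and the output path are illustrative and not part of Nagios XI.

```shell
# Follow the given logs for a fixed number of seconds and save what is printed.
# capture_logs, its arguments and the output path are illustrative names.
capture_logs() {
  duration=$1
  out=$2
  shift 2
  # -n 0 skips lines already in the logs; -F keeps following across rotation.
  # timeout stops tail after the given duration; tee writes a copy to $out.
  timeout "$duration" tail -n 0 -F "$@" 2>/dev/null | tee "$out"
  return 0
}

# Example: start this, then apply configuration in the GUI within 5 minutes.
# capture_logs 300 /tmp/apply-config.log \
#   /usr/local/nagiosxi/var/cmdsubsys.log /usr/local/nagios/var/nagios.log
```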
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
Be sure to check out our Knowledgebase for helpful articles and solutions!
Re: Monitoring Engine Stops
Hi
Today we can't get Nagios monitoring to start at all.
The logs go from: (I've replaced the actual alerts with "blah" as I cannot post hostnames/IP addresses)
[1626721525] NDO-3: Started notification thread
[1626721526] NDO-3: Ended contact_status thread
[1626721851] Successfully launched command file worker with pid 29070
[1626721855] HOST NOTIFICATION: blah
[1626721855] SERVICE ALERT:blah
[1626721869] NDO-3: Ended host_check thread
[1626721895] HOST ALERT: blah
[1626721895] NDO-3: Ended host_status thread
[1626721907] SERVICE NOTIFICATION: blah
<<A few more service and host notifications>>
[1626721959] NDO-3: Ended acknowledgement thread
[1626721959] NDO-3: Ended downtime thread
[1626721959] NDO-3: Ended flapping thread
[1626721959] NDO-3: Ended statechange thread
[1626721959] NDO-3: Ended event_handler thread
[1626721960] NDO-3: Ended notification thread
[1626721960] NDO-3: Ended service_check thread
[1626721961] Caught SIGSEGV, shutting down...
[1626721961] Caught SIGTERM, shutting down...
so NDO stops the monitoring straight away.
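The timestamps in the excerpt above are Unix epoch seconds, and the key lines are the two "Caught SIG..." entries. A SIGSEGV means the nagios process, which includes any loaded broker module, made an invalid memory access. The sketch below pulls those entries out of a log and prints them with a readable UTC time. `find_signals` is an illustrative name, and `date -d "@..."` assumes GNU date.

```shell
# Print every "Caught SIG..." line from a Nagios log with a readable UTC time.
# find_signals is an illustrative name; date -d "@..." requires GNU date.
find_signals() {
  grep -E '^\[[0-9]+\] Caught SIG' "$1" | while IFS= read -r line; do
    ts=${line#\[}      # drop the leading "["
    ts=${ts%%\]*}      # keep only the epoch seconds before "]"
    msg=${line#*\] }   # text after the "] " separator
    printf '%s %s\n' "$(date -u -d "@$ts" '+%Y-%m-%d %H:%M:%S')" "$msg"
  done
}

# Example:
# find_signals /usr/local/nagios/var/nagios.log
```

Run against the excerpt above, it reports both the SIGSEGV and the SIGTERM at 2021-07-19 19:12:41 UTC.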
We ran the database repair script, which completed successfully but has not made a difference.
We are now pretty much in big trouble.
We have 5000 hosts and about 12k services.
But we have an offloaded database, and 7 mod-gearman workers. At 5 minute check intervals this should be very easy to handle.
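As a rough sanity check on that load estimate, the average check rate can be worked out from the figures above. This is only a sketch under stated assumptions: hosts are assumed to be checked at the same 5 minute interval as services, checks are assumed to spread evenly across the 7 workers, and retries are ignored. `check_rate` is an illustrative name.

```shell
# Average check rate if every host and service is checked once per interval.
# check_rate is an illustrative name; arguments: hosts services interval_s workers
check_rate() {
  awk -v h="$1" -v s="$2" -v i="$3" -v w="$4" 'BEGIN {
    total = (h + s) / i   # checks per second across the whole deployment
    printf "%.1f checks/s total, %.1f checks/s per worker\n", total, total / w
  }'
}

check_rate 5000 12000 300 7
# prints: 56.7 checks/s total, 8.1 checks/s per worker
```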
The Nagios server CPU does not rise above 20%.
The only relevant information I can find refers to downgrading NDO!
Please respond ASAP if this is what we need to do, and explain how to do it with an offloaded database.
rgds
George
Re: Monitoring Engine Stops
I would also mention that we are running v 5.8.1 and have about 40 top level BPI objects with another 100+ lower level ones plus a couple with 60 or so metrics.
I am mentioning this as the upgrade release notes mentioned a fix to ndo for BPI sync.
-
benjaminsmith
- Posts: 5324
- Joined: Wed Aug 22, 2018 4:39 pm
- Location: saint paul
Re: Monitoring Engine Stops
Hi,
Here are the steps to roll back the NDO version. It's relatively simple to switch versions, but with an offloaded DB you'll need to update all the config files with the database connection info. You can still upgrade to the latest version afterwards, as the installer will not force-update NDO on systems that have been rolled back.
Take a full backup/snapshot before proceeding and/or try this out on your test instance first. Let us know if you have any questions.
--Benjamin
Code: Select all
### DOWNGRADE NDO WITH OFFLOADED DB
service nagios stop
cd /tmp
rm -rf /tmp/nagiosxi
wget https://assets.nagios.com/downloads/nagiosxi/5/xi-5.6.14.tar.gz
tar zxf xi-5.6.14.tar.gz
cd /tmp/nagiosxi
# START OFFLOADED DB SECTION - if you have an offloaded DB, make these edits by hand before running ./install:
#   1. Edit /tmp/nagiosxi/xi-sys.cfg and update the 'mysqlpass' value.
#   2. Edit /tmp/nagiosxi/subcomponents/ndoutils/mods/cfg/ndo2db.cfg and update the 'db_host', 'db_user', and 'db_pass' values.
#   3. Edit /tmp/nagiosxi/subcomponents/ndoutils/install and /tmp/nagiosxi/subcomponents/ndoutils/post-install so that every call to mysql includes -h <db_ip>
# END OFFLOADED DB SECTION
cd /tmp/nagiosxi/subcomponents/ndoutils
./install
chkconfig ndo2db on
service ndo2db start
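For the step in the offloaded DB section that adds -h <db_ip> to every mysql call, it may help to list those calls first so none are missed. This is a minimal sketch that only lists lines for manual editing; `list_mysql_calls` is an illustrative name and is not part of the Nagios XI tarball. The pattern also matches comments that mention mysql and deliberately skips other tools such as mysqldump, so review the results by hand.

```shell
# List lines that invoke the mysql client so each can be given -h <db_ip>.
# list_mysql_calls is an illustrative name, not part of the Nagios XI tarball.
list_mysql_calls() {
  # -n prints line numbers and -H the file name; the pattern matches "mysql"
  # as a separate word followed by whitespace, so mysqldump is not reported.
  grep -nHE '(^|[^[:alnum:]_])mysql[[:space:]]' "$@"
}

# Example, run from /tmp/nagiosxi/subcomponents/ndoutils before ./install:
# list_mysql_calls install post-install
```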