Monitoring Engine Stops

Posted: Wed Jul 14, 2021 10:05 am
by rexmundo
Hi

We have a deployment with a Nagios XI server, a separate MariaDB host, and 7 mod-gearman hosts.

All checks are farmed out to the mod-gearman hosts. Nagios XI itself does very little monitoring.

However, the Monitoring Engine is really unstable. It keeps stopping and we can't find a reason why.

Can you please outline the steps we need to take to determine what is causing the Monitoring Engine to stop?
This happens especially after the configuration is applied, but it can also happen at any time.

rgds
George

Re: Monitoring Engine Stops

Posted: Thu Jul 15, 2021 2:54 pm
by benjaminsmith
Hi George,

When did you notice the instability and did this coincide with any system changes? You'll find any error messages related to the Nagios Core process in the nagios.log.

Code:

/usr/local/nagios/var/nagios.log

Since you noticed issues when applying the configuration, please run the following tail command, apply the configuration, and then post the output to the thread.

Code:

tail -f /usr/local/nagiosxi/var/cmdsubsys.log /usr/local/nagios/var/nagios.log

Also, send us the system profile and we'll take a closer look at the log files. Since the database is on a separate host, please also retrieve the database log, as there could be connectivity issues or corrupted tables.

Thanks, Benjamin

To Download a System Profile
To send us your system profile:
1. Log in to the Nagios XI GUI using a web browser.
2. Click the "Admin" > "System Profile" menu.
3. Click the "Download Profile" button.
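
When reviewing nagios.log as suggested above, it can help to jump straight to the points where the Core process reports a shutdown signal. The following is only a minimal sketch: the function name shutdown_context and the default of 10 context lines are illustrative assumptions, not part of Nagios XI, and it assumes a grep that supports -B (GNU and BSD grep both do).

```shell
# Print every "Caught SIG..." line in a Nagios log, with line numbers,
# together with the lines that immediately precede it.
# Usage: shutdown_context LOGFILE [LINES_BEFORE]
shutdown_context() {
    logfile=$1
    before=${2:-10}   # hypothetical default: 10 lines of leading context
    grep -n -B "$before" 'Caught SIG' "$logfile"
}
```

For example, `shutdown_context /usr/local/nagios/var/nagios.log 20` would show the 20 lines leading up to each shutdown message, which is usually the part worth posting to the thread.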

Re: Monitoring Engine Stops

Posted: Mon Jul 19, 2021 2:26 pm
by rexmundo
Hi

Today we can't get Nagios monitoring to start at all.

The logs read as follows (I've replaced the actual alerts with "blah", as I cannot post hostnames/IP addresses):

[1626721525] NDO-3: Started notification thread
[1626721526] NDO-3: Ended contact_status thread
[1626721851] Successfully launched command file worker with pid 29070
[1626721855] HOST NOTIFICATION: blah
[1626721855] SERVICE ALERT:blah
[1626721869] NDO-3: Ended host_check thread
[1626721895] HOST ALERT: blah
[1626721895] NDO-3: Ended host_status thread
[1626721907] SERVICE NOTIFICATION: blah

<<A few more service and host notifications>>

[1626721959] NDO-3: Ended acknowledgement thread
[1626721959] NDO-3: Ended downtime thread
[1626721959] NDO-3: Ended flapping thread
[1626721959] NDO-3: Ended statechange thread
[1626721959] NDO-3: Ended event_handler thread
[1626721960] NDO-3: Ended notification thread
[1626721960] NDO-3: Ended service_check thread
[1626721961] Caught SIGSEGV, shutting down...
[1626721961] Caught SIGTERM, shutting down...

So NDO stops the monitoring straight away.
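
The bracketed numbers at the start of each log line above are Unix epoch timestamps, which makes it hard to see how much time passes between startup and the SIGSEGV. As a rough sketch, the helper below rewrites them as readable UTC times. The function name humanize_nagios_log is made up for illustration, and the `date -d @SECONDS` form assumes GNU date (as found on typical RHEL/CentOS hosts).

```shell
# Read a Nagios log on stdin and rewrite each leading "[epoch]" stamp
# as a UTC date and time. Lines without a stamp are passed through unchanged.
humanize_nagios_log() {
    while IFS= read -r line; do
        case "$line" in
            \[[0-9]*\]*)
                ts=${line#\[}      # drop the leading "["
                ts=${ts%%\]*}      # keep only the text before the first "]"
                rest=${line#*\]}   # remainder of the line after the stamp
                # Assumption: GNU date, which accepts "@SECONDS" with -d
                printf '[%s]%s\n' "$(date -u -d "@$ts" '+%Y-%m-%d %H:%M:%S')" "$rest"
                ;;
            *)
                printf '%s\n' "$line"
                ;;
        esac
    done
}
```

It could be used as `humanize_nagios_log < /usr/local/nagios/var/nagios.log | tail -n 50`.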

We ran the database repair script, which completed successfully but has not made a difference.

We are now pretty much in big trouble.

We have 5000 hosts and about 12k services.
But we have an offloaded database and 7 mod-gearman workers. At 5-minute check intervals this should be very easy to handle.
The Nagios server CPU does not rise above 20%.

The only relevant information I can find refers to downgrading NDO!

Please respond asap if this is what we need to do and how to do it to an offloaded database.

rgds
George

Re: Monitoring Engine Stops

Posted: Tue Jul 20, 2021 2:00 am
by rexmundo
I would also mention that we are running version 5.8.1 and have about 40 top-level BPI objects, another 100+ lower-level ones, and a couple with 60 or so metrics.

I am mentioning this because the upgrade release notes mentioned a fix to NDO for BPI sync.

Re: Monitoring Engine Stops

Posted: Tue Jul 20, 2021 10:25 am
by benjaminsmith
Hi,

Here are the steps to roll back the NDO version. It's relatively simple to switch versions, but with an offloaded DB you'll need to update all the config files with the database connection info. You can still upgrade to the latest version afterwards, as the installer will not force an NDO update on systems that have been rolled back.

Take a full backup/snapshot before proceeding and/or try this out on your test instance first. Let us know if you have any questions.

Code:

### DOWNGRADE NDO WITH OFFLOADED DB
service nagios stop
cd /tmp
rm -rf /tmp/nagiosxi
wget https://assets.nagios.com/downloads/nagiosxi/5/xi-5.6.14.tar.gz
tar zxf xi-5.6.14.tar.gz
cd /tmp/nagiosxi
# START OFFLOADED DB SECTION - If you have an offloaded DB you'll need to do these things:
# 1. Edit /tmp/nagiosxi/xi-sys.cfg and update the 'mysqlpass' value.
# 2. Edit /tmp/nagiosxi/subcomponents/ndoutils/mods/cfg/ndo2db.cfg and update the 'db_host', 'db_user', and 'db_pass' values.
# 3. Edit /tmp/nagiosxi/subcomponents/ndoutils/install and /tmp/nagiosxi/subcomponents/ndoutils/post-install so that every call to mysql includes -h <db_ip>.
# END OFFLOADED DB SECTION
cd /tmp/nagiosxi/subcomponents/ndoutils
./install
chkconfig ndo2db on
service ndo2db start
--Benjamin
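
The ndo2db.cfg edit in the offloaded DB section above can be done by hand, but it could also be scripted. The sketch below is an assumption-heavy illustration rather than an official procedure: set_cfg_value is a hypothetical helper, it assumes the file uses plain key=value lines with no spaces around the equals sign, and it does not report an error if the key is missing. It uses awk with environment variables instead of sed so that passwords containing characters such as / or & are written literally.

```shell
# Replace the value of KEY in a key=value config file, in place.
# Usage: set_cfg_value FILE KEY VALUE
set_cfg_value() {
    file=$1
    key=$2
    value=$3
    tmp=$(mktemp) || return 1
    # Pass key and value through the environment so awk treats them literally.
    if CFG_KEY="$key" CFG_VALUE="$value" awk '
        index($0, ENVIRON["CFG_KEY"] "=") == 1 {
            print ENVIRON["CFG_KEY"] "=" ENVIRON["CFG_VALUE"]
            next
        }
        { print }
    ' "$file" > "$tmp"; then
        # Write back with cat so the original file keeps its owner and mode.
        cat "$tmp" > "$file"
        status=$?
    else
        status=1
    fi
    rm -f "$tmp"
    return "$status"
}
```

With the extracted tarball in place, a call might look like `set_cfg_value /tmp/nagiosxi/subcomponents/ndoutils/mods/cfg/ndo2db.cfg db_host <db_ip>`, repeated for 'db_user' and 'db_pass'. Check the resulting file before running ./install.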