Automatically Remove Services That Are No Longer Reporting

Posted: Wed Dec 16, 2020 10:32 am
by luczynj
Hello all,

We are monitoring traffic routes on our network and have found that when a route stops reporting to Nagios, we are not notified and have to remove these route services from Nagios manually.

For example, today I've run a script to see when the last report was received from some of these services and there are over 6,000 of them. We would like to have an automatic way to remove them. We can take care of the perfdata files via a script, but we're wondering if there's a way to remove these services automatically from Nagios XI itself.

Any help would be appreciated.

Regards,
JL

Re: Automatically Remove Services That Are No Longer Reporting

Posted: Wed Dec 16, 2020 3:10 pm
by benjaminsmith
Hi JL,

Nagios XI has a Deadpool process that can be enabled. It automatically removes or deactivates problem hosts or services after a defined period of time.

For setup instructions, please see the following guide:

How To Use Deadpool In Nagios XI

And let us know if you have questions or need assistance setting this up.

--Benjamin

Re: Automatically Remove Services That Are No Longer Reporting

Posted: Wed Dec 30, 2020 7:12 pm
by luczynj
Hello there.

Thanks for the quick response!

We jumped in with both feet and did the following.

Since the documentation says it doesn't work retroactively, we decided to do a test on about 6,000 (4 x 1500) services that are decommissioned routes/trunks.

I wrote a pretty handy script a while ago that scans the perfdata files for .xml/.rrd files that have not been updated in X days. We manage multiple platforms and follow a naming convention, so the script now takes the command-line parameters -p <PLATFORM> and -d <NUMBER OF DAYS SINCE UPDATED>. It also produces a report that includes how much disk space these old files use. This was before we upgraded our servers, when we had to babysit the Nagios servers.
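For readers who want to reproduce this kind of scan, here is a minimal sketch in Python. It is not the script described above. The function name, the `platform_prefix` parameter, and the idea that the platform can be recognised from the start of the host directory name are all hypothetical, and the perfdata root you pass in depends on your installation.

```python
import time
from pathlib import Path

# File types produced for performance data, as named in the post.
PERFDATA_SUFFIXES = (".xml", ".rrd")


def find_stale_perfdata(root, days, platform_prefix="", now=None):
    """Return (paths, total_bytes) for perfdata files not modified in `days` days.

    platform_prefix is a hypothetical stand-in for a naming convention:
    only files whose parent directory name starts with it are considered.
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400  # seconds per day
    stale = []
    total_bytes = 0
    for path in Path(root).rglob("*"):
        if path.suffix not in PERFDATA_SUFFIXES or not path.is_file():
            continue
        if not path.parent.name.startswith(platform_prefix):
            continue
        info = path.stat()
        if info.st_mtime < cutoff:
            stale.append(path)
            total_bytes += info.st_size
    return sorted(stale), total_bytes
```

Calling `find_stale_perfdata(perfdata_root, 30, platform_prefix="PLATFORM")` would list the files untouched for 30 days and the disk space they occupy. The sketch only reports; it deletes nothing.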

I adapted a copy of the script to run send_nrdp against localhost, setting the service state to 3 and the output to "DEADPOOL" for any service that hadn't reported in X days.
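As an illustration of what such a submission carries, the sketch below builds a passive check result in the XML layout that NRDP accepts, with state 3 (UNKNOWN in Nagios) and the output "DEADPOOL". It does not send anything. The function names and the batch size are hypothetical. Splitting the list into batches is only one possible way to spread the load, and nothing in this thread establishes that it would have prevented the freeze.

```python
import xml.etree.ElementTree as ET


def build_nrdp_payload(services, state=3, output="DEADPOOL"):
    """Build an NRDP check-result XML document for (host, service) pairs.

    Assumes the <checkresults>/<checkresult> layout used for NRDP
    passive check submissions; checktype "1" marks a passive check.
    """
    root = ET.Element("checkresults")
    for host, service in services:
        result = ET.SubElement(root, "checkresult", type="service", checktype="1")
        ET.SubElement(result, "hostname").text = host
        ET.SubElement(result, "servicename").text = service
        ET.SubElement(result, "state").text = str(state)
        ET.SubElement(result, "output").text = output
    return ET.tostring(root, encoding="unicode")


def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]
```

With 1,500 stale services and a hypothetical batch size of 100, `batches(services, 100)` yields 15 payload-sized groups that could be submitted with a pause between them.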

When we ran about 1,500 of these send_nrdp calls, we observed that Nagios XI would freeze up or crash. We wanted to see the DEADPOOL services being deactivated while we were testing, so we set the time limits for stages 1 and 2 to just a few minutes. We received the Deadpool email notifications, but we aren't sure what caused Nagios to crash. We ended up having to completely restart all Nagios-related services on the platform.

My colleague works overseas and was ready to quit, so I told him I would undo what we did and document what we've done.

Are there limits on the settings we choose? We set stage one to 3 minutes and stage two to 5 minutes.

We picked the low values because we've been trying to figure out how to clean up dead services and were anxious to see it work.

Thanks,
JL

Re: Automatically Remove Services That Are No Longer Reporting

Posted: Mon Jan 04, 2021 12:27 pm
by benjaminsmith
Hi JL,

"Are there limits on the settings we choose? We set stage one to 3 minutes and stage two to 5 minutes."

Stage 2 must be 5 minutes greater than stage 1 (see page 3 of the doc). What version of Nagios XI are you running?
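The rule can be checked before saving the settings. The sketch below is not part of Nagios XI and the function name is hypothetical. It also assumes that "5 minutes greater" means a gap of at least 5 minutes, which the guide should confirm.

```python
# Minimum gap between the two stages, per the rule quoted above.
MIN_STAGE_GAP_MINUTES = 5


def check_deadpool_stages(stage1_minutes, stage2_minutes):
    """Return a list of problems with the stage settings (empty if none)."""
    problems = []
    if stage1_minutes <= 0:
        problems.append("stage 1 must be a positive number of minutes")
    gap = stage2_minutes - stage1_minutes
    if gap < MIN_STAGE_GAP_MINUTES:
        problems.append(
            f"stage 2 is only {gap} minute(s) after stage 1; "
            f"it should be at least {MIN_STAGE_GAP_MINUTES}"
        )
    return problems
```

For the values in the question, `check_deadpool_stages(3, 5)` reports a gap of 2 minutes, which is too small under this reading of the rule.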

Did you see any errors in the Deadpool log? You can watch it by running the following command:

Code:

tail -f /usr/local/nagiosxi/var/deadpool.log

With the times set so low and so many objects, there may not have been enough time to process the objects and update the configurations. Can you send the system profile?

To send us your system profile:
1. Log in to the Nagios XI GUI using a web browser.
2. Click the "Admin" > "System Profile" menu.
3. Click the "Download Profile" button.

Also, can you retrieve the nagios.log from that day? It would be in the following directory:

Code:

 /usr/local/nagios/var/archives

The logs are rotated every 24 hours, so a log labeled 1-03 contains the data from 1-02.
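The one-day offset is easy to get wrong around month and year boundaries, so here is a small sketch that works out which archive to look for. The `nagios-MM-DD-YYYY-00.log` file name pattern is an assumption based on the usual Nagios Core archive naming; check it against the files actually present in the archives directory.

```python
from datetime import date, timedelta


def archive_name_for(day):
    """Return the archive file name expected to hold the log data for `day`.

    The archive is labeled with the date of rotation, which is the day
    after the data was written.
    """
    rotated_on = day + timedelta(days=1)
    return f"nagios-{rotated_on:%m-%d-%Y}-00.log"
```

For data from January 2, 2021, this returns the name of the archive dated January 3, matching the example above.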

Thanks,
Benjamin

Re: Automatically Remove Services That Are No Longer Reporting

Posted: Sun Jan 24, 2021 4:47 pm
by luczynj
Here's what's in the deadpool log. It doesn't have date/time stamps (can we fix that, please)?

[root@nagios-p var]# cat deadpool.log | sort -u
CONNECTED TO DATABASES
deadpool init cancelled
deadpool reaper is disabled
<h3>Database Error</h3>A database connection error has been detected, please follow the repair prompt below. If the issue persists, please contact Nagios support.<p>Run the following from the CLI as root to attempt to repair the DB:<br><pre>/usr/local/nagiosxi/scripts/repair_databases.sh</pre></p>CONNECTED TO DATABASES
<h3>Database Error</h3>A database connection error has been detected, please follow the repair prompt below. If the issue persists, please contact Nagios support.<p>Run the following from the CLI as root to attempt to repair the DB:<br><pre>/usr/local/nagiosxi/scripts/repair_databases.sh</pre></p><h3>Database Error</h3>A database connection error has been detected, please follow the repair prompt below. If the issue persists, please contact Nagios support.<p>Run the following from the CLI as root to attempt to repair the DB:<br><pre>/usr/local/nagiosxi/scripts/repair_databases.sh</pre></p>CONNECTED TO DATABASES

[root@nagios-p var]# tail deadpool.log
deadpool init cancelled
CONNECTED TO DATABASES
deadpool reaper is disabled
deadpool init cancelled
CONNECTED TO DATABASES
deadpool reaper is disabled
deadpool init cancelled
CONNECTED TO DATABASES
deadpool reaper is disabled
deadpool init cancelled
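Since the log has no timestamps and `sort -u` discards how often each message occurs, a count per distinct line can give a rough sense of proportion. The sketch below is a generic helper, not a Nagios XI tool, and its function names are hypothetical. It treats the "<h3>Database Error</h3>" heading seen in the output above as the marker for a database error.

```python
from collections import Counter

# Marker taken from the HTML error text that appears in the log output.
DB_ERROR_MARKER = "<h3>Database Error</h3>"


def count_log_lines(lines):
    """Return (line, count) pairs for non-empty lines, most frequent first."""
    counts = Counter(line.strip() for line in lines if line.strip())
    return counts.most_common()


def count_db_errors(lines):
    """Count database error notices, including several on a single line."""
    return sum(line.count(DB_ERROR_MARKER) for line in lines)
```

To use it on the real file, open `/usr/local/nagiosxi/var/deadpool.log` and pass the file object to each function in turn.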

Moderator's Note: The profile has been shared with the support team but has been removed from the public forum.

Re: Automatically Remove Services That Are No Longer Reporting

Posted: Mon Jan 25, 2021 6:09 pm
by benjaminsmith
Hi,

It looks like you do not have the Deadpool feature enabled. The first thing the script does when the cron job runs is check the database for the settings and decide whether to continue.
CONNECTED TO DATABASES
deadpool init cancelled
deadpool reaper is disabled
I would widen the times out significantly and test this once more and let me know if it works then. I believe initially it may have hung the database, as I see a large number of these entries in the messages.

Code:

Another reconfigure process is still running, sleeping...
Another reconfigure process is still running, sleeping...
.Another reconfigure process is still running, sleeping...
Another reconfigure process is still running, sleeping..

Best Regards,
Benjamin