Service recovery emails generated without status change

Posted: Tue Apr 16, 2019 12:21 pm
by rferebee
Hello,

Every day, for reasons we can't identify, Nagios XI generates service recovery emails for a subset of our Linux hosts. This sends out a lot of spam and wastes administrators' time, because they have to check whether their hosts went Critical.

It seems to only affect Linux hosts and one Contact Group in particular. I opened a similar request several months ago and was told that version 5.5.7 would resolve the bug, but we're now on version 5.5.11 and it's still happening.

Any help would be appreciated. Thank you!

Re: Service recovery emails generated without status change

Posted: Tue Apr 16, 2019 1:05 pm
by rferebee
Here's a screenshot of some of the recovery notifications being generated. At no point did these services change state in a way that would explain the recovery.

Re: Service recovery emails generated without status change

Posted: Tue Apr 16, 2019 1:27 pm
by benjaminsmith
Hello @rferebee,

Nagios Core 4.4.3 fixed a few issues with notifications and recoveries. Could you send your system profile for us to review?

To send us your system profile:
Log in to the Nagios XI GUI using a web browser.
Click the "Admin" > "System Profile" menu.
Click the "Download Profile" button.
Save the profile.zip file and share it in a private message or upload it to the ticket.

Thanks.

Re: Service recovery emails generated without status change

Posted: Tue Apr 16, 2019 1:45 pm
by rferebee
PM with system profile sent. Thank you.

Re: Service recovery emails generated without status change

Posted: Tue Apr 16, 2019 3:42 pm
by benjaminsmith
Hi @rferebee,

Thanks for sending over the system profile. I noticed that the nagios_logentries table is corrupted. Please run the following as root from a terminal to repair the database:

Code:

/usr/local/nagiosxi/scripts/repair_databases.sh

Next, I see that recovery notifications are enabled (the r flag in notification_options). Is this intentional?

Code:

define host {
    host_name                   pug
    use                         xiwizard_generic_host
    address                     ip-address
    hostgroups                  AIXDWSSProdServer
    max_check_attempts          5
    check_interval              5
    retry_interval              1
    check_period                xi_timeperiod_24x7
    contact_groups              SUGContact,Welfare DBA,Welfare Websphere Group
    notification_interval       1440
    notification_period         xi_timeperiod_24x7
    first_notification_delay    7
    notification_options        d,u,r,f
    notifications_enabled       1
    _xiwizard                   autodiscovery
    register                    1
}
define service {
    host_name                   pug
    service_description         Disk Check /db/database
    use                         AIXDiskServiceOra
    check_command               check_nrpe!check_disk1!20!20% 10% "/db/database"!!!!!
    max_check_attempts          5
    check_interval              5
    retry_interval              1
    check_period                xi_timeperiod_24x7
    notification_interval       1440
    notification_period         xi_timeperiod_24x7
    notification_options        w,c,u,r,f
    notifications_enabled       1
    contact_groups              SUGContact,Welfare DBA,Welfare Websphere Group
    register                    1
}
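
For readers unfamiliar with the single-letter flags in notification_options, here is a minimal Python sketch that expands them into readable names. The function name decode_notification_options and the two lookup tables are hypothetical helpers written for illustration. They are not part of Nagios, and the sketch does not validate unknown flags.

```python
# Illustrative helper (not part of Nagios): expand notification_options flags.
SERVICE_OPTIONS = {
    "w": "warning", "u": "unknown", "c": "critical",
    "r": "recovery", "f": "flapping", "s": "scheduled downtime", "n": "none",
}
HOST_OPTIONS = {
    "d": "down", "u": "unreachable",
    "r": "recovery", "f": "flapping", "s": "scheduled downtime", "n": "none",
}


def decode_notification_options(value, kind="service"):
    """Return readable names for a comma-separated notification_options value."""
    table = SERVICE_OPTIONS if kind == "service" else HOST_OPTIONS
    return [table[flag.strip()] for flag in value.split(",") if flag.strip()]


# The service definition above uses "w,c,u,r,f"; the "r" entry is what
# enables recovery notifications.
service_flags = decode_notification_options("w,c,u,r,f", kind="service")
host_flags = decode_notification_options("d,u,r,f", kind="host")
```

Removing r from the list would suppress recovery emails entirely, which is not what is wanted here if recoveries are meant to be reported.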
The next step would be to pull the state history report for the host and service in question to determine whether it experienced a hard recovery. Go to Reports > State History and limit the report to pug for 04-16-2019, then select State Type as Both and State as Any.

If the services in question did experience a hard recovery, then Nagios is notifying as expected.

Reference
State Types

Re: Service recovery emails generated without status change

Posted: Tue Apr 16, 2019 3:57 pm
by rferebee
Ok, I actually ran a database repair this morning before I sent you the system profile. So, that means you're still seeing the entries as corrupt after the repair. Perhaps my database repairs aren't working?

We do have recoveries enabled intentionally; if something goes critical, we want to know when it recovers.

I read the reference article you supplied, but I'm still having trouble understanding what a HARD recovery is. The state of the service hasn't changed since January, so I don't understand why it would need to send a notice that it recovered.

Can you elaborate?

Re: Service recovery emails generated without status change

Posted: Tue Apr 16, 2019 5:01 pm
by npolovenko
@rferebee, On the report screenshot you sent us I'm seeing that the Disk Check /db/database service was in a critical hard state until today and then it recovered. So I'd expect to receive a recovery email notification.

Can you run the following commands to truncate the email tables in the database and let us know if that fixes the problem?

echo "truncate table xi_events; truncate table xi_meta; truncate table xi_eventqueue;" | mysql -uroot -pnagiosxi nagiosxi
mysqlcheck -r -f -uroot -pnagiosxi --all-databases --use_frm

Re: Service recovery emails generated without status change

Posted: Tue Apr 16, 2019 5:16 pm
by rferebee
Well, I can say beyond a shadow of a doubt that this service check has not been critical since January. I monitor our Nagios environment almost all day and send weekly reports of which services are in critical and warning states.

I don't dispute that that's what the report says, but it definitely wasn't in a critical state for 3 months. There must be a disconnect somewhere in our environment.
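
One way to look for that disconnect is to compare the report against the raw Nagios log. The Python sketch below lists the HARD state changes recorded for one service. It assumes the usual SERVICE ALERT line layout (host;service;state;state type;attempt;output), and the sample lines and timestamps are made up for illustration, not taken from our system. The function name hard_state_changes is hypothetical.

```python
import re
from datetime import datetime, timezone

# Assumed layout: [epoch] SERVICE ALERT: host;service;STATE;TYPE;attempt;output
ALERT_RE = re.compile(r"^\[(\d+)\] SERVICE ALERT: (.*)$")


def hard_state_changes(lines, host, service):
    """Return (UTC time, state) for each HARD alert logged for host/service."""
    changes = []
    for line in lines:
        match = ALERT_RE.match(line.strip())
        if not match:
            continue
        # Split at most 5 times so semicolons in plugin output stay intact.
        fields = match.group(2).split(";", 5)
        if len(fields) < 5:
            continue
        log_host, log_service, state, state_type = fields[:4]
        if (log_host, log_service, state_type) == (host, service, "HARD"):
            when = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
            changes.append((when, state))
    return changes


# Hypothetical sample lines; real input would be read from the Nagios log file.
SAMPLE_LOG = [
    "[1547000000] SERVICE ALERT: pug;Disk Check /db/database;CRITICAL;SOFT;1;DISK CRITICAL",
    "[1547000240] SERVICE ALERT: pug;Disk Check /db/database;CRITICAL;HARD;5;DISK CRITICAL",
    "[1555430000] SERVICE ALERT: pug;Disk Check /db/database;OK;HARD;5;DISK OK",
    "[1555430100] SERVICE ALERT: otherhost;PING;WARNING;HARD;3;PING WARNING",
]

sample_changes = hard_state_changes(SAMPLE_LOG, "pug", "Disk Check /db/database")
```

If the log shows no HARD CRITICAL entry for the service while the State History report does, that would point at the database rather than at the checks themselves.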

Re: Service recovery emails generated without status change

Posted: Tue Apr 16, 2019 5:46 pm
by rferebee
Also, I got an error when I tried to run the first command provided:

root@nagiosxi> echo "truncate table xi_events; truncate table xi_meta; truncate table xi_eventqueue;" | mysql -uroot -pnagiosxi nagiosxi
ERROR 1049 (42000): Unknown database 'nagiosxi'

Re: Service recovery emails generated without status change

Posted: Wed Apr 17, 2019 10:37 am
by npolovenko
@rferebee, it seems you're using Postgres for the nagiosxi database. Please run the following commands instead:

echo "truncate table xi_events; truncate table xi_meta; truncate table xi_eventqueue;" | psql nagiosxi nagiosxi
echo "vacuum;vacuum analyze;"|psql nagiosxi postgres
service postgresql restart

After that, can you generate the state history report for the same host and service, making sure that you select "Type" -> Both?

Finally, could you send in your Nagios XI System Profile so I can review it?
To send us your system profile:
Log in to the Nagios XI GUI using a web browser.
Click the "Admin" > "System Profile" menu.
Click the "Download Profile" button.
Save the profile.zip file and send it to me in a private message, or upload it to the thread.