
Check never goes to Critical Hard state

Posted: Tue Dec 18, 2018 9:24 am
by hbouma
We have a check that verifies a process is running, and that process was stopped. The check went through the 3 retries, but stayed in a Critical Soft state instead of moving to Critical Hard. We forced the check into an OK state (changed the check to expect the process to be stopped, rechecked, then changed it back to expect the process to be running). After the changes, the check ran again and went through all the retries, so it should have gone to Critical Hard. Instead, the logs showed Critical Soft with the correct number of tries.

nagios.log was showing

........;Service status for: ftpsvc;CRITICAL;SOFT;1;CRITICAL: ftpsvc is stopped (should be running)
........;Service status for: ftpsvc;CRITICAL;SOFT;2;CRITICAL: ftpsvc is stopped (should be running)
........;Service status for: ftpsvc;CRITICAL;SOFT;3;CRITICAL: ftpsvc is stopped (should be running)
........;Service status for: ftpsvc;CRITICAL;SOFT;3;CRITICAL: ftpsvc is stopped (should be running)
........;Service status for: ftpsvc;CRITICAL;SOFT;3;CRITICAL: ftpsvc is stopped (should be running)
........;Service status for: ftpsvc;CRITICAL;SOFT;3;CRITICAL: ftpsvc is stopped (should be running)
........;Service status for: ftpsvc;CRITICAL;SOFT;3;CRITICAL: ftpsvc is stopped (should be running)
........;Service status for: ftpsvc;CRITICAL;SOFT;3;CRITICAL: ftpsvc is stopped (should be running)
I see that this problem has been reported (https://support.nagios.com/forum/viewto ... ft#p268187 and https://github.com/NagiosEnterprises/na ... issues/576).

Are there any updates on fixing this issue? We already fixed the problem with this specific check by deleting and recreating the service, but we will need a lasting fix to prevent this from happening again in our production environment.

Nagios XI 5.5.7 on RHEL 7.6 64-bit VMs.

Re: Check never goes to Critical Hard state

Posted: Tue Dec 18, 2018 11:25 am
by npolovenko
@hbouma, you're right. That bug is a known issue. While our developers are working on a fix, I recommend downgrading Nagios Core. That will work around this issue temporarily.
https://support.nagios.com/kb/article/n ... e-823.html

Please keep an eye on the XI changelog for updates on this bug fix:
https://www.nagios.com/downloads/nagios-xi/change-log/

Re: Check never goes to Critical Hard state

Posted: Tue Dec 18, 2018 11:28 am
by hbouma
What version should I downgrade to? Also, what features in Core/XI would we lose by downgrading?

Re: Check never goes to Critical Hard state

Posted: Tue Dec 18, 2018 11:44 am
by npolovenko
@hbouma, the tutorial walks you through downgrading Core to version 4.2.4. The list of changes between 4.2.4 and 4.4.2 can be found here:
https://www.nagios.org/projects/nagios-core/history/4x/
Core 4.2.4 was used in XI 5.4.13. It's still functional and will handle all the regular tasks you do in XI.
Once we release the soft state bug fix, I recommend upgrading XI; Core will be upgraded automatically as part of that.

Re: Check never goes to Critical Hard state

Posted: Tue Dec 18, 2018 12:28 pm
by hbouma
Is Nagios Core 4.2.4 the last working version we could go to? I ask because I see 4 CVE vulnerabilities listed as fixed since then, and our Security Department will not allow us to downgrade to known vulnerabilities.

Re: Check never goes to Critical Hard state

Posted: Tue Dec 18, 2018 2:55 pm
by lmiltchev
"...our Security Department will not allow us to downgrade to known vulnerabilities"
If this is the case, you wouldn't be able to downgrade even to Nagios Core 4.4.1 as there are some CVEs that were fixed in 4.4.2... See the Nagios Core changelog here:

https://github.com/NagiosEnterprises/na ... /Changelog

You may not experience this particular bug in older versions of Nagios Core, but you would experience other bugs that were fixed in 4.4.2. Our developers will be fixing current issues in Nagios Core 4.4.2 as soon as they can. I would recommend that you stick with 4.4.2 and delete and recreate affected services as a temporary workaround until the issue is permanently resolved.
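
Until a fix is available, it may help to spot affected services before recreating them. The following is a minimal Python sketch, not an official Nagios tool. It scans log lines shaped like the excerpt in the first post and reports services that stay in a SOFT problem state at the maximum attempt number. The function name `find_stuck_soft` and its parameters are hypothetical, and the field positions are an assumption based on that excerpt, whose line prefix is elided; a real nagios.log line carries a prefix that is not shown in this thread, so the parsing may need adjusting.

```python
from collections import defaultdict

PREFIX = "Service status for: "  # as it appears in the excerpt in the first post


def find_stuck_soft(lines, max_attempts=3, min_repeats=2):
    """Return a sorted list of services that logged a non-OK SOFT state at
    attempt >= max_attempts at least min_repeats times in a row.

    Assumes semicolon-separated fields in the order shown in the excerpt:
    <elided prefix>;Service status for: <name>;<state>;<type>;<attempt>;<output>
    """
    streak = defaultdict(int)  # consecutive suspicious lines per service
    stuck = set()
    for line in lines:
        # Split at most 5 times so semicolons in the plugin output are kept.
        parts = line.strip().split(";", 5)
        if len(parts) < 6 or not parts[1].startswith(PREFIX):
            continue  # not a line of the assumed shape
        service = parts[1][len(PREFIX):]
        state, state_type, attempt = parts[2], parts[3], parts[4]
        if not attempt.isdigit():
            continue
        if state != "OK" and state_type == "SOFT" and int(attempt) >= max_attempts:
            streak[service] += 1
            if streak[service] >= min_repeats:
                stuck.add(service)
        else:
            streak[service] = 0  # any other state resets the streak
    return sorted(stuck)


if __name__ == "__main__":
    sample = [
        "........;Service status for: ftpsvc;CRITICAL;SOFT;1;CRITICAL: ftpsvc is stopped (should be running)",
        "........;Service status for: ftpsvc;CRITICAL;SOFT;2;CRITICAL: ftpsvc is stopped (should be running)",
        "........;Service status for: ftpsvc;CRITICAL;SOFT;3;CRITICAL: ftpsvc is stopped (should be running)",
        "........;Service status for: ftpsvc;CRITICAL;SOFT;3;CRITICAL: ftpsvc is stopped (should be running)",
    ]
    print(find_stuck_soft(sample))
```

Here `max_attempts` should match the max check attempts configured for the services being examined (3 in this thread), and `min_repeats` is only a conservative threshold to avoid flagging a single line.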