SOFT recovery always with #attempts=1

op-team · Post by **op-team** » Mon Dec 03, 2018 7:38 am

Hi guys,

Since we upgraded our XI to 5.5.7, we have noticed some strange behaviour with the service state type.

When a service recovers from a soft state, the recovery is logged as SOFT with #attempts always equal to 1. We were expecting a value equal to the actual number of attempts made.
Has anyone experienced the same issue?

This is an example:
[1543791600] CURRENT HOST STATE: 1287000010_00_switch-04;UP;HARD;1;OK - 10.2.0.59: rta 8.311ms, lost 0%
[1543797638] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;UNKNOWN;SOFT;1;UNKNOWN: Script timed out
[1543797879] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;OK;SOFT;1;'PSE_10011002' is ON(poe_usage:3.04%) 'PSE_10011001' is ON(poe_usage:5.67%) 'PSE_10011003' is ON(poe_usage:1.50%)
[1543798746] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;UNKNOWN;SOFT;1;UNKNOWN: Script timed out
[1543799015] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;UNKNOWN;SOFT;2;UNKNOWN: Script timed out
[1543799256] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;OK;SOFT;1;'PSE_10011002' is ON(poe_usage:3.04%) 'PSE_10011001' is ON(poe_usage:5.67%) 'PSE_10011003' is ON(poe_usage:1.50%)
[1543803119] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;UNKNOWN;SOFT;1;UNKNOWN: Script timed out
[1543803388] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;UNKNOWN;SOFT;2;UNKNOWN: Script timed out
[1543803657] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;UNKNOWN;SOFT;3;UNKNOWN: Script timed out
[1543803898] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;OK;SOFT;1;'PSE_10011002' is ON(poe_usage:3.04%) 'PSE_10011001' is ON(poe_usage:5.74%) 'PSE_10011003' is ON(poe_usage:1.50%)
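
The pattern in the excerpt above can be checked mechanically. The following is a minimal Python sketch, not part of the original post, that pairs each SOFT OK alert with the SOFT problem alert preceding it for the same host and service, so that the logged recovery attempt can be compared with the last problem attempt. The names `parse_alerts` and `soft_recoveries` are hypothetical, and the regular expression assumes the `SERVICE ALERT` field order seen in the excerpt.

```python
import re
from typing import Dict, Iterable, Iterator, NamedTuple, Tuple

# Field order assumed from the excerpt: host;service;state;state type;attempt;output
ALERT_RE = re.compile(
    r"^\[(?P<ts>\d+)\] SERVICE ALERT: "
    r"(?P<host>[^;]+);(?P<service>[^;]+);(?P<state>[^;]+);"
    r"(?P<state_type>SOFT|HARD);(?P<attempt>\d+);"
)


class Alert(NamedTuple):
    ts: int
    host: str
    service: str
    state: str
    state_type: str
    attempt: int


def parse_alerts(lines: Iterable[str]) -> Iterator[Alert]:
    """Yield one Alert per SERVICE ALERT line; all other lines are skipped."""
    for line in lines:
        m = ALERT_RE.match(line.strip())
        if m:
            yield Alert(int(m["ts"]), m["host"], m["service"],
                        m["state"], m["state_type"], int(m["attempt"]))


def soft_recoveries(alerts: Iterable[Alert]) -> Iterator[Tuple[Alert, int]]:
    """Yield (recovery, attempt of the preceding SOFT problem) pairs."""
    last: Dict[Tuple[str, str], Alert] = {}
    for a in alerts:
        key = (a.host, a.service)
        prev = last.get(key)
        if (a.state == "OK" and a.state_type == "SOFT" and prev is not None
                and prev.state != "OK" and prev.state_type == "SOFT"):
            yield a, prev.attempt
        last[key] = a


# Four lines copied from the excerpt above
SAMPLE = [
    "[1543803119] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;UNKNOWN;SOFT;1;UNKNOWN: Script timed out",
    "[1543803388] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;UNKNOWN;SOFT;2;UNKNOWN: Script timed out",
    "[1543803657] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;UNKNOWN;SOFT;3;UNKNOWN: Script timed out",
    "[1543803898] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;OK;SOFT;1;'PSE_10011002' is ON(poe_usage:3.04%) 'PSE_10011001' is ON(poe_usage:5.74%) 'PSE_10011003' is ON(poe_usage:1.50%)",
]

for recovery, last_attempt in soft_recoveries(parse_alerts(SAMPLE)):
    # Prints: 1543803898 1 3
    print(recovery.ts, recovery.attempt, last_attempt)
```

A recovery whose logged attempt is 1 while the preceding problem attempt was greater than 1 matches the behaviour described in this post.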

op-team · Post by **op-team** » Mon Dec 03, 2018 10:01 am

Another behaviour that needs investigation.
As you can see from the logs below, the state type never switches to HARD, although the HOST is DOWN and max_check_attempts (6) has been reached.

[root@nagios-01: /usr/local/nagios/libexec]# grep 1287060042_01_switch-02 ../var/nagios.log | egrep "HOST|PoE Status"| grep -v "EVENT HANDLER:" | perl -pe 's/(\d+)/localtime($1)/e'
[Mon Dec 3 00:00:00 2018] CURRENT HOST STATE: 1287060042_01_switch-02;UP;HARD;1;OK - 10.2.130.171: rta 25.176ms, lost 0%
[Mon Dec 3 00:00:00 2018] CURRENT SERVICE STATE: 1287060042_01_switch-02;PoE Status;OK;HARD;1;'PSE_1' is ON(poe_usage:10.00%)
[Mon Dec 3 12:56:26 2018] HOST ALERT: 1287060042_01_switch-02;DOWN;SOFT;1;CRITICAL - 10.2.130.171: rta nan, lost 100%
[Mon Dec 3 13:02:35 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;1;UNKNOWN: Script timed out
[Mon Dec 3 13:07:03 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;2;UNKNOWN: Script timed out
[Mon Dec 3 13:11:32 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;3;UNKNOWN: Script timed out
[Mon Dec 3 13:11:45 2018] HOST NOTIFICATION: alerts.nwuc;1287060042_01_switch-02;DOWN;xi_host_notification_handler;CRITICAL - 10.2.130.171: rta nan, lost 100%
[Mon Dec 3 13:11:45 2018] HOST ALERT: 1287060042_01_switch-02;DOWN;HARD;6;CRITICAL - 10.2.130.171: rta nan, lost 100%
[Mon Dec 3 13:16:01 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;4;UNKNOWN: Script timed out
[Mon Dec 3 13:20:29 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;5;UNKNOWN: Script timed out
[Mon Dec 3 13:24:58 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 13:28:46 2018] Warning: The results of service 'PoE Status' on host '1287060042_01_switch-02' are stale by 0d 0h 0m 3s (threshold=0d 0h 4m 15s). I'm forcing an immediate check of the service.
[Mon Dec 3 13:29:16 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 13:33:45 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 13:38:14 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 13:42:43 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 13:47:11 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 13:51:40 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 13:56:09 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 14:00:38 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 14:05:07 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 14:09:36 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 14:14:05 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 14:18:34 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 14:21:28 2018] HOST ALERT: 1287060042_01_switch-02;UP;HARD;1;OK - 10.2.130.171: rta 25.641ms, lost 0%
[Mon Dec 3 14:22:34 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;OK;SOFT;1;'PSE_1' is ON(poe_usage:10.00%)
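
To find other services showing the same symptom, one could count the problem alerts that are still logged as SOFT once the attempt number has reached max_check_attempts. The sketch below is a hypothetical helper in Python, not something from this thread; `stuck_soft_alerts` is an invented name, and the value 6 is the max_check_attempts mentioned in this post, which would need adjusting per service.

```python
import re
from collections import Counter
from typing import Iterable

# Accepts raw epoch timestamps as well as the localtime() form produced by the perl filter
ALERT_RE = re.compile(
    r"^\[[^\]]+\] SERVICE ALERT: "
    r"(?P<host>[^;]+);(?P<service>[^;]+);(?P<state>[^;]+);"
    r"(?P<state_type>SOFT|HARD);(?P<attempt>\d+);"
)


def stuck_soft_alerts(lines: Iterable[str], max_check_attempts: int) -> Counter:
    """Count, per (host, service), non-OK alerts that stayed SOFT at or beyond the limit."""
    counts: Counter = Counter()
    for line in lines:
        m = ALERT_RE.match(line.strip())
        if not m:
            continue
        if (m["state"] != "OK" and m["state_type"] == "SOFT"
                and int(m["attempt"]) >= max_check_attempts):
            counts[(m["host"], m["service"])] += 1
    return counts


# Five lines copied from the log above
SAMPLE = [
    "[Mon Dec 3 13:20:29 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;5;UNKNOWN: Script timed out",
    "[Mon Dec 3 13:24:58 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out",
    "[Mon Dec 3 13:29:16 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out",
    "[Mon Dec 3 13:33:45 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out",
    "[Mon Dec 3 14:22:34 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;OK;SOFT;1;'PSE_1' is ON(poe_usage:10.00%)",
]

stuck = stuck_soft_alerts(SAMPLE, max_check_attempts=6)
# Three of the five sample lines are SOFT at attempt 6
print(stuck[("1287060042_01_switch-02", "PoE Status")])
```

Any non-zero count points to a service that reached its attempt limit without being promoted to HARD.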

npolovenko · Post by **npolovenko** » Tue Dec 04, 2018 11:28 am

Hello, @op-team. Do you mean when a service is in a soft Warning state (3) and it recovers to a soft Ok state (1)? The fact that it goes from 3 to 1 is reasonable because the state changed. Each time a state changes the counter resets.

As for the second issue, it's a known bug; our developers are already working on a fix and plan to release an update for XI shortly.
If this is affecting your production system and you can't wait a couple of weeks, I suggest downgrading Nagios Core to version 4.2.4.
https://support.nagios.com/kb/article/n ... e-823.html

op-team · Post by **op-team** » Wed Dec 05, 2018 4:55 am

Hi,
Thanks for your reply.

Do you mean when a service is in a soft Warning state (3) and it recovers to a soft Ok state (1)? The fact that it goes from 3 to 1 is reasonable because the state changed. Each time a state changes the counter resets.

Yes, this is what I meant, but as explained in "https://assets.nagios.com/downloads/nag ... types.html":
In the example below, looking from time #8 to #10, when a soft recovery occurs we should first see a SOFT state with check# equal to the actual retry number, at which point the event handler is executed; only afterwards should the state become HARD with check# reset to 1.

statetype1.PNG

In my case, however, the event handler is executed on the SOFT state after check# has already been reset to 1.

I am trying to distinguish two cases: a service's soft recovery from a soft error, and a soft recovery from a hard error caused by the corresponding HOST being DOWN or UNREACHABLE.
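
For illustration, the classification an event handler would make could look like the Python sketch below. It assumes the handler receives the standard `$SERVICESTATE$`, `$SERVICESTATETYPE$` and `$SERVICEATTEMPT$` macros as arguments; the names `Event` and `classify_event` are hypothetical. It relies on the documented behaviour only, so on a Core version where the state type never becomes HARD, both cases would still come out as a soft recovery.

```python
from typing import NamedTuple


class Event(NamedTuple):
    label: str    # e.g. "soft-recovery" or "hard-problem"
    attempt: int  # attempt number as passed by Nagios


def classify_event(state: str, state_type: str, attempt: str) -> Event:
    """Label an event from the service state, state type and attempt macros."""
    kind = "recovery" if state.upper() == "OK" else "problem"
    return Event(f"{state_type.lower()}-{kind}", int(attempt))


# Values taken from alert lines quoted in this thread
print(classify_event("UNKNOWN", "SOFT", "3"))
print(classify_event("OK", "SOFT", "1"))
```

Keeping the attempt number next to the label makes it possible to log how many retries preceded a recovery once Core reports that value as documented.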

As regards my second issue, do you think that, instead of downgrading Nagios Core, the solution suggested in the following topic might solve the problem?

https://support.nagios.com/forum/viewto ... k&start=10

I believe I found the cause in Core, and it is fixed in the maint branch on GitHub
https://github.com/NagiosEnterprises/na ... tree/maint

wget https://github.com/NagiosEnterprises/na ... nt.tar.gz
tar xzf maint.tar.gz
cd nagioscore-maint
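configureflags="--with-command-group=nagcmd"
# Use the SysV init type when systemctl is missing or a SysV init script already exists
if ! command -v systemctl >/dev/null 2>&1 || [ -f /etc/init.d/nagios ]; then
    configureflags="--with-init-type=sysv $configureflags"
fi
# Left unquoted on purpose so that each flag reaches configure as a separate argument
./configure $configureflags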
make -j 2 all
make install

service nagios restart

npolovenko · Post by **npolovenko** » Wed Dec 05, 2018 1:02 pm

@op-team. I see. You're right. The first observation is related to a bug in Core 4.4.2. Our developers are currently working on a fix. If this issue is critical for your environment I recommend downgrading the Core version to 4.2.4.
https://support.nagios.com/kb/article/n ... e-823.html

The fix for the second bug hasn't been released on the Core branch yet, so updating Core is unlikely to resolve the problem.

op-team · Post by **op-team** » Thu Dec 06, 2018 3:50 am

Thanks for your quick reply. I am going to downgrade Core.
I will let you know if I need any further help.

npolovenko · Post by **npolovenko** » Thu Dec 06, 2018 12:45 pm

@op-team, Sounds good.

op-team · Post by **op-team** » Mon Jan 07, 2019 4:32 am

Hi guys,

Good news! The Core downgrade has fixed both issues, so we are now running Nagios XI 5.5.7 with Core 4.2.4.

According to the changelog, the latest release, 5.5.8, doesn't address the Core 4.4.2 bugs, right?

5.5.8 - 12/11/2018
Fixed tmp directory for exporting RRD performance data -JO
Fixed UTF-8 characters in host/service names not allowing for external commands from the GUI to be processed [TPS#13833] -JO
Fixed upgrading Config Wizards due to wizards with the same directory name [TPS#13857] -JO
Fixed XSS security vulnerabilities in rss_dashlet -JO
Fixed an issue where importing configuration from files/REST API would sometimes cause duplicate service definitions [TPS#13871] - SAW, JO
Fixed Availability dashlet to work like a normal dashlet and lookback period is properly set based on the report it's created from [TPS#13841] -JO
Fixed issue with nmap multiple IP addresses causing problems running because of security fix -JO,SS
Fixed issue with specific configurations in ndoutils causing Core to crash by updating ndoutils to 2.1.3 -JO
Fixed lock file permissions for Core 4.2.4 (if users are using mod_gearman or had to downgrade to XI's old version of Core) -JO
Core Config Manager (CCM) - 2.7.4
Added icon to relationship popup for host/services that are inactive [TPS#13852] -JO
Fixed missing hosts/service from relationships popup when applied to groups that are set as inactive [TPS#13852] -JO

B.Regards

npolovenko · Post by **npolovenko** » Mon Jan 07, 2019 2:07 pm

@op-team, that's right. I suggest waiting to upgrade until XI 5.5.9 comes out; that update will include Core 4.4.3 with both bug fixes.

Nagios Support Forum
