SOFT recovery always with #attemps=1

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
op-team
Posts: 50
Joined: Fri Jun 02, 2017 6:19 am

SOFT recovery always with #attemps=1

Post by op-team »

Hi Guys

Since we upgraded our XI to 5.5.7, we have noticed some strange behaviour with the service state type.

When a service recovers from a soft state, the attempt number logged with the SOFT recovery is always 1. We were expecting a value equal to the actual attempt number (#actual_attempts).
Has anyone experienced the same issue?

This is an example:
[1543791600] CURRENT HOST STATE: 1287000010_00_switch-04;UP;HARD;1;OK - 10.2.0.59: rta 8.311ms, lost 0%
[1543797638] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;UNKNOWN;SOFT;1;UNKNOWN: Script timed out
[1543797879] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;OK;SOFT;1;'PSE_10011002' is ON(poe_usage:3.04%) 'PSE_10011001' is ON(poe_usage:5.67%) 'PSE_10011003' is ON(poe_usage:1.50%)
[1543798746] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;UNKNOWN;SOFT;1;UNKNOWN: Script timed out
[1543799015] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;UNKNOWN;SOFT;2;UNKNOWN: Script timed out
[1543799256] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;OK;SOFT;1;'PSE_10011002' is ON(poe_usage:3.04%) 'PSE_10011001' is ON(poe_usage:5.67%) 'PSE_10011003' is ON(poe_usage:1.50%)
[1543803119] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;UNKNOWN;SOFT;1;UNKNOWN: Script timed out
[1543803388] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;UNKNOWN;SOFT;2;UNKNOWN: Script timed out
[1543803657] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;UNKNOWN;SOFT;3;UNKNOWN: Script timed out
[1543803898] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;OK;SOFT;1;'PSE_10011002' is ON(poe_usage:3.04%) 'PSE_10011001' is ON(poe_usage:5.74%) 'PSE_10011003' is ON(poe_usage:1.50%)
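
A minimal sketch of how the pattern above could be checked automatically, assuming the raw nagios.log format shown in this post (epoch timestamp, then host;service;state;state type;attempt;output). The names ALERT_RE, soft_recoveries and SAMPLE_LOG are hypothetical, and the plugin output in the sample lines is shortened. For each SOFT recovery it reports the attempt number of the preceding problem alert next to the attempt number logged with the recovery.

```python
import re

# Raw nagios.log SERVICE ALERT entry:
# [epoch] SERVICE ALERT: host;service;state;state_type;attempt;plugin output
ALERT_RE = re.compile(
    r"^\[(?P<ts>\d+)\] SERVICE ALERT: "
    r"(?P<host>[^;]+);(?P<service>[^;]+);(?P<state>[^;]+);"
    r"(?P<state_type>SOFT|HARD);(?P<attempt>\d+);(?P<output>.*)$"
)


def soft_recoveries(lines):
    """Return (timestamp, last_problem_attempt, logged_attempt) per SOFT recovery."""
    last_problem = {}  # (host, service) -> attempt of the latest non-OK alert
    found = []
    for line in lines:
        match = ALERT_RE.match(line.strip())
        if match is None:
            continue  # not a SERVICE ALERT entry
        key = (match["host"], match["service"])
        attempt = int(match["attempt"])
        if match["state"] != "OK":
            last_problem[key] = attempt
        elif match["state_type"] == "SOFT" and key in last_problem:
            found.append((int(match["ts"]), last_problem.pop(key), attempt))
        else:
            last_problem.pop(key, None)  # hard recovery or no known problem
    return found


# Lines taken from the log above (plugin output shortened for readability).
SAMPLE_LOG = [
    "[1543797638] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;UNKNOWN;SOFT;1;UNKNOWN: Script timed out",
    "[1543797879] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;OK;SOFT;1;'PSE_10011002' is ON(poe_usage:3.04%)",
    "[1543798746] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;UNKNOWN;SOFT;1;UNKNOWN: Script timed out",
    "[1543799015] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;UNKNOWN;SOFT;2;UNKNOWN: Script timed out",
    "[1543799256] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;OK;SOFT;1;'PSE_10011002' is ON(poe_usage:3.04%)",
]

for ts, problem_attempt, logged_attempt in soft_recoveries(SAMPLE_LOG):
    print(ts, "last problem attempt:", problem_attempt, "logged on recovery:", logged_attempt)
```

In the second recovery the last problem alert was at attempt 2 while the recovery is logged with attempt 1, which is the behaviour questioned here.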
op-team
Posts: 50
Joined: Fri Jun 02, 2017 6:19 am

Re: SOFT recovery always with #attemps=1

Post by op-team »

Another behaviour that needs investigation.
As you can see from the logs below, the state type never switches to HARD, although the host is DOWN and max_check_attempts (6) has been reached.

[root@nagios-01: /usr/local/nagios/libexec]# grep 1287060042_01_switch-02 ../var/nagios.log | egrep "HOST|PoE Status"| grep -v "EVENT HANDLER:" | perl -pe 's/(\d+)/localtime($1)/e'
[Mon Dec 3 00:00:00 2018] CURRENT HOST STATE: 1287060042_01_switch-02;UP;HARD;1;OK - 10.2.130.171: rta 25.176ms, lost 0%
[Mon Dec 3 00:00:00 2018] CURRENT SERVICE STATE: 1287060042_01_switch-02;PoE Status;OK;HARD;1;'PSE_1' is ON(poe_usage:10.00%)
[Mon Dec 3 12:56:26 2018] HOST ALERT: 1287060042_01_switch-02;DOWN;SOFT;1;CRITICAL - 10.2.130.171: rta nan, lost 100%
[Mon Dec 3 13:02:35 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;1;UNKNOWN: Script timed out
[Mon Dec 3 13:07:03 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;2;UNKNOWN: Script timed out
[Mon Dec 3 13:11:32 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;3;UNKNOWN: Script timed out
[Mon Dec 3 13:11:45 2018] HOST NOTIFICATION: alerts.nwuc;1287060042_01_switch-02;DOWN;xi_host_notification_handler;CRITICAL - 10.2.130.171: rta nan, lost 100%
[Mon Dec 3 13:11:45 2018] HOST ALERT: 1287060042_01_switch-02;DOWN;HARD;6;CRITICAL - 10.2.130.171: rta nan, lost 100%
[Mon Dec 3 13:16:01 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;4;UNKNOWN: Script timed out
[Mon Dec 3 13:20:29 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;5;UNKNOWN: Script timed out
[Mon Dec 3 13:24:58 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 13:28:46 2018] Warning: The results of service 'PoE Status' on host '1287060042_01_switch-02' are stale by 0d 0h 0m 3s (threshold=0d 0h 4m 15s). I'm forcing an immediate check of the service.
[Mon Dec 3 13:29:16 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 13:33:45 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 13:38:14 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 13:42:43 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 13:47:11 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 13:51:40 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 13:56:09 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 14:00:38 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 14:05:07 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 14:09:36 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 14:14:05 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 14:18:34 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 14:21:28 2018] HOST ALERT: 1287060042_01_switch-02;UP;HARD;1;OK - 10.2.130.171: rta 25.641ms, lost 0%
[Mon Dec 3 14:22:34 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;OK;SOFT;1;'PSE_1' is ON(poe_usage:10.00%)
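
A rough sketch of how the stuck entries could be picked out of a converted log like the one above, assuming the human-readable timestamps produced by the perl one-liner and max_check_attempts = 6 as stated in this post. The names HUMAN_ALERT_RE, stuck_soft_alerts and SAMPLE_LOG are hypothetical. It lists every non-OK alert that is still SOFT although the attempt counter has already reached the maximum.

```python
import re

# SERVICE ALERT entry after the timestamp was converted with localtime():
# [Mon Dec 3 13:24:58 2018] SERVICE ALERT: host;service;state;state_type;attempt;output
HUMAN_ALERT_RE = re.compile(
    r"^\[(?P<stamp>[^\]]+)\] SERVICE ALERT: "
    r"(?P<host>[^;]+);(?P<service>[^;]+);(?P<state>[^;]+);"
    r"(?P<state_type>SOFT|HARD);(?P<attempt>\d+);"
)


def stuck_soft_alerts(lines, max_check_attempts):
    """Return timestamps of problem alerts still SOFT at or beyond max_check_attempts."""
    stuck = []
    for line in lines:
        match = HUMAN_ALERT_RE.match(line.strip())
        if match is None:
            continue  # host alerts, notifications, warnings, etc.
        if (
            match["state"] != "OK"
            and match["state_type"] == "SOFT"
            and int(match["attempt"]) >= max_check_attempts
        ):
            stuck.append(match["stamp"])
    return stuck


# Lines copied from the log above.
SAMPLE_LOG = [
    "[Mon Dec 3 13:20:29 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;5;UNKNOWN: Script timed out",
    "[Mon Dec 3 13:24:58 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out",
    "[Mon Dec 3 13:29:16 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out",
    "[Mon Dec 3 13:33:45 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out",
    "[Mon Dec 3 14:22:34 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;OK;SOFT;1;'PSE_1' is ON(poe_usage:10.00%)",
]

for stamp in stuck_soft_alerts(SAMPLE_LOG, max_check_attempts=6):
    print("still SOFT at max attempts:", stamp)
```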
npolovenko
Support Tech
Posts: 3457
Joined: Mon May 15, 2017 5:00 pm

Re: SOFT recovery always with #attemps=1

Post by npolovenko »

Hello, @op-team. Do you mean when a service is in a soft Warning state (3) and it recovers to a soft Ok state (1)? The fact that it goes from 3 to 1 is reasonable because the state changed. Each time a state changes the counter resets.

As for the second issue, it's a known bug. Our developers are already working on a fix and plan to release an update for XI shortly.
If this is affecting your production system and you can't wait a couple of weeks, I suggest downgrading the Nagios Core version to 4.2.4.
https://support.nagios.com/kb/article/n ... e-823.html
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
op-team
Posts: 50
Joined: Fri Jun 02, 2017 6:19 am

Re: SOFT recovery always with #attemps=1

Post by op-team »

Hi,
Thanks for your reply.
npolovenko wrote: Do you mean when a service is in a soft Warning state (3) and it recovers to a soft Ok state (1)? The fact that it goes from 3 to 1 is reasonable because the state changed. Each time a state changes the counter resets.
Yes, this is what I meant. However, the documentation at "https://assets.nagios.com/downloads/nag ... types.html" describes it differently.
In the example below, from time #8 to #10, when a soft recovery occurs we should first get a SOFT state with check# equal to the actual retry number, then the event handler is executed, and only later does the state become HARD with check# reset to 1.
statetype1.PNG
In my case, however, the event handler is executed on the SOFT state after check# has already been reset to 1.

I am trying to distinguish a service's soft recovery from a soft error from its recovery from a hard error caused by the corresponding host being DOWN or UNREACHABLE.
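
A minimal sketch of the distinction described above, assuming an event handler that receives the standard $SERVICESTATE$, $SERVICESTATETYPE$ and $SERVICEATTEMPT$ macros as arguments and that Core behaves as documented. The function name classify_service_event is hypothetical and this is not the handler actually used in this thread.

```python
def classify_service_event(state, state_type, attempt):
    """Classify an event from $SERVICESTATE$, $SERVICESTATETYPE$, $SERVICEATTEMPT$.

    Assumption: an OK result logged as SOFT follows a soft error, while an OK
    result logged as HARD follows a hard error (for example one caused by the
    host being DOWN or UNREACHABLE).
    """
    attempt = int(attempt)  # macros arrive as strings on the command line
    if state == "OK":
        if state_type == "SOFT":
            return "soft recovery"
        return "hard recovery"
    if state_type == "SOFT":
        return f"soft problem (attempt {attempt})"
    return "hard problem"


# Values as they would be passed for the last line of the first log excerpt.
print(classify_service_event("OK", "SOFT", "1"))
```

With the behaviour reported in this thread, the attempt value passed on a soft recovery is always 1, so it cannot be used to tell how many retries preceded the recovery.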


Regarding my second issue, do you think that, instead of downgrading Nagios Core, the solution suggested in the following topic might solve the problem?

https://support.nagios.com/forum/viewto ... k&start=10

I believe I found the cause in Core, and it is fixed in the maint branch on GitHub:
https://github.com/NagiosEnterprises/na ... tree/maint

wget https://github.com/NagiosEnterprises/na ... nt.tar.gz
tar xzf maint.tar.gz
cd nagioscore-maint
# Force a SysV init script when systemd is absent or a legacy init script exists
configureflags="--with-command-group=nagcmd"
if ! command -v systemctl >/dev/null 2>&1 || [ -f /etc/init.d/nagios ]; then
    configureflags="--with-init-type=sysv $configureflags"
fi
# Left unquoted on purpose so that each flag is passed as a separate argument
./configure $configureflags
make -j 2 all
make install

service nagios restart
npolovenko
Support Tech
Posts: 3457
Joined: Mon May 15, 2017 5:00 pm

Re: SOFT recovery always with #attemps=1

Post by npolovenko »

@op-team. I see. You're right. The first observation is related to a bug in Core 4.4.2. Our developers are currently working on a fix. If this issue is critical for your environment I recommend downgrading the Core version to 4.2.4.
https://support.nagios.com/kb/article/n ... e-823.html

The fix for the second bug hasn't been released on the Core branch yet, so updating Core is unlikely to resolve the problem.
op-team
Posts: 50
Joined: Fri Jun 02, 2017 6:19 am

Re: SOFT recovery always with #attemps=1

Post by op-team »

Thanks for your quick reply. I am going to downgrade Core.
I will let you know if I need any further help.
npolovenko
Support Tech
Posts: 3457
Joined: Mon May 15, 2017 5:00 pm

Re: SOFT recovery always with #attemps=1

Post by npolovenko »

@op-team, Sounds good.
op-team
Posts: 50
Joined: Fri Jun 02, 2017 6:19 am

Re: SOFT recovery always with #attemps=1

Post by op-team »

Hi Guys,

Good news! The Core downgrade has fixed both issues. We are now running Nagios XI 5.5.7 with Core 4.2.4.

According to the changelog, the latest release, 5.5.8, doesn't address the Core 4.4.2 bugs, right?

5.5.8 - 12/11/2018
Fixed tmp directory for exporting RRD performance data -JO
Fixed UTF-8 characters in host/service names not allowing for external commands from the GUI to be processed [TPS#13833] -JO
Fixed upgrading Config Wizards due to wizards with the same directory name [TPS#13857] -JO
Fixed XSS security vulnerabilities in rss_dashlet -JO
Fixed an issue where importing configuration from files/REST API would sometimes cause duplicate service definitions [TPS#13871] - SAW, JO
Fixed Availability dashlet to work like a normal dashlet and lookback period is properly set based on the report it's created from [TPS#13841] -JO
Fixed issue with nmap multiple IP addresses causing problems running because of security fix -JO,SS
Fixed issue with specific configurations in ndoutils causing Core to crash by updating ndoutils to 2.1.3 -JO
Fixed lock file permissions for Core 4.2.4 (if users are using mod_gearman or had to downgrade to XI's old version of Core) -JO
Core Config Manager (CCM) - 2.7.4
Added icon to relationship popup for host/services that are inactive [TPS#13852] -JO
Fixed missing hosts/service from relationships popup when applied to groups that are set as inactive [TPS#13852] -JO

B.Regards
npolovenko
Support Tech
Posts: 3457
Joined: Mon May 15, 2017 5:00 pm

Re: SOFT recovery always with #attemps=1

Post by npolovenko »

@op-team, that's right. I suggest waiting to upgrade until XI 5.5.9 comes out. That update will include Core 4.4.3 with both bug fixes.