SOFT recovery always with #attemps=1
Posted: Mon Dec 03, 2018 7:38 am
by op-team
Hi guys,
Since we upgraded our XI to 5.5.7, we have noticed some strange behaviour with the service state type.
When a service recovers from a soft state, the logged #attempts is always 1 with state type SOFT. We were expecting a value equal to the actual number of attempts.
Has anyone else experienced the same issue?
This is an example:
[1543791600] CURRENT HOST STATE: 1287000010_00_switch-04;UP;HARD;1;OK - 10.2.0.59: rta 8.311ms, lost 0%
[1543797638] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;UNKNOWN;SOFT;1;UNKNOWN: Script timed out
[1543797879] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;OK;SOFT;1;'PSE_10011002' is ON(poe_usage:3.04%) 'PSE_10011001' is ON(poe_usage:5.67%) 'PSE_10011003' is ON(poe_usage:1.50%)
[1543798746] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;UNKNOWN;SOFT;1;UNKNOWN: Script timed out
[1543799015] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;UNKNOWN;SOFT;2;UNKNOWN: Script timed out
[1543799256] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;OK;SOFT;1;'PSE_10011002' is ON(poe_usage:3.04%) 'PSE_10011001' is ON(poe_usage:5.67%) 'PSE_10011003' is ON(poe_usage:1.50%)
[1543803119] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;UNKNOWN;SOFT;1;UNKNOWN: Script timed out
[1543803388] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;UNKNOWN;SOFT;2;UNKNOWN: Script timed out
[1543803657] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;UNKNOWN;SOFT;3;UNKNOWN: Script timed out
[1543803898] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;OK;SOFT;1;'PSE_10011002' is ON(poe_usage:3.04%) 'PSE_10011001' is ON(poe_usage:5.74%) 'PSE_10011003' is ON(poe_usage:1.50%)
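To make the pattern in the log above easier to spot, here is a minimal sketch that reads nagios.log lines on stdin and prints, for every SOFT OK recovery, the attempt number of the preceding non-OK alert next to the attempt number logged with the recovery. The function name soft_recoveries is hypothetical, and the sketch assumes the usual "SERVICE ALERT: host;service;state;statetype;attempt;output" layout shown above.

```shell
#!/bin/sh
# soft_recoveries (hypothetical helper): reads nagios.log lines on stdin and
# prints "host;service;previous_attempt;logged_attempt" for each SOFT OK
# recovery that follows a non-OK alert of the same service.
soft_recoveries() {
    awk -F';' '
        /SERVICE ALERT:/ {
            # Field 1 looks like "[timestamp] SERVICE ALERT: <host>"
            host = $1
            sub(/^.*SERVICE ALERT: /, "", host)
            key = host ";" $2
            state = $3; type = $4; attempt = $5
            if (state == "OK" && type == "SOFT" && (key in prev)) {
                print key ";" prev[key] ";" attempt
            }
            # Remember the attempt number of the latest non-OK alert only
            if (state == "OK") delete prev[key]
            else prev[key] = attempt
        }
    '
}
```

Usage would be something like `soft_recoveries < /usr/local/nagios/var/nagios.log`. For the ten example lines above, the third column comes out as 1, 2 and 3 while the last column is always 1, which is the behaviour being reported.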
Re: SOFT recovery always with #attemps=1
Posted: Mon Dec 03, 2018 10:01 am
by op-team
Here is another behaviour that needs investigation.
As you can see from the logs below, the service state type never switches to HARD, although the host is DOWN and #max_check_attempts (6) has been reached:
[root@nagios-01: /usr/local/nagios/libexec]# grep 1287060042_01_switch-02 ../var/nagios.log | egrep "HOST|PoE Status"| grep -v "EVENT HANDLER:" | perl -pe 's/(\d+)/localtime($1)/e'
[Mon Dec 3 00:00:00 2018] CURRENT HOST STATE: 1287060042_01_switch-02;UP;HARD;1;OK - 10.2.130.171: rta 25.176ms, lost 0%
[Mon Dec 3 00:00:00 2018] CURRENT SERVICE STATE: 1287060042_01_switch-02;PoE Status;OK;HARD;1;'PSE_1' is ON(poe_usage:10.00%)
[Mon Dec 3 12:56:26 2018] HOST ALERT: 1287060042_01_switch-02;DOWN;SOFT;1;CRITICAL - 10.2.130.171: rta nan, lost 100%
[Mon Dec 3 13:02:35 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;1;UNKNOWN: Script timed out
[Mon Dec 3 13:07:03 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;2;UNKNOWN: Script timed out
[Mon Dec 3 13:11:32 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;3;UNKNOWN: Script timed out
[Mon Dec 3 13:11:45 2018] HOST NOTIFICATION: alerts.nwuc;1287060042_01_switch-02;DOWN;xi_host_notification_handler;CRITICAL - 10.2.130.171: rta nan, lost 100%
[Mon Dec 3 13:11:45 2018] HOST ALERT: 1287060042_01_switch-02;DOWN;HARD;6;CRITICAL - 10.2.130.171: rta nan, lost 100%
[Mon Dec 3 13:16:01 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;4;UNKNOWN: Script timed out
[Mon Dec 3 13:20:29 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;5;UNKNOWN: Script timed out
[Mon Dec 3 13:24:58 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 13:28:46 2018] Warning: The results of service 'PoE Status' on host '1287060042_01_switch-02' are stale by 0d 0h 0m 3s (threshold=0d 0h 4m 15s). I'm forcing an immediate check of the service.
[Mon Dec 3 13:29:16 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 13:33:45 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 13:38:14 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 13:42:43 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 13:47:11 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 13:51:40 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 13:56:09 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 14:00:38 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 14:05:07 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 14:09:36 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 14:14:05 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 14:18:34 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 14:21:28 2018] HOST ALERT: 1287060042_01_switch-02;UP;HARD;1;OK - 10.2.130.171: rta 25.641ms, lost 0%
[Mon Dec 3 14:22:34 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;OK;SOFT;1;'PSE_1' is ON(poe_usage:10.00%)
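A rough way to find other services showing the same symptom is to count non-OK SERVICE ALERT lines that are still SOFT even though the attempt number has reached the maximum. The sketch below assumes a single max_check_attempts value passed as an argument (6 here, as in the log above); the function name stuck_soft is hypothetical, and it works on both raw and localtime-converted log lines.

```shell
#!/bin/sh
# stuck_soft MAX (hypothetical helper): reads nagios.log lines on stdin and
# prints "host;service;count", where count is the number of non-OK SOFT
# alerts logged with an attempt number >= MAX. Normally such alerts would
# be HARD, so any output suggests the state type is stuck.
stuck_soft() {
    awk -F';' -v max="$1" '
        /SERVICE ALERT:/ {
            # Field 1 looks like "[timestamp] SERVICE ALERT: <host>"
            host = $1
            sub(/^.*SERVICE ALERT: /, "", host)
            key = host ";" $2
            if ($3 != "OK" && $4 == "SOFT" && $5 + 0 >= max + 0) count[key]++
        }
        END { for (k in count) print k ";" count[k] }
    '
}
```

For example: `grep 1287060042_01_switch-02 ../var/nagios.log | stuck_soft 6`. The output order is not defined when several services match.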
Re: SOFT recovery always with #attemps=1
Posted: Tue Dec 04, 2018 11:28 am
by npolovenko
Hello,
@op-team, do you mean when a service is in a soft WARNING state (attempt 3) and it recovers to a soft OK state (attempt 1)? It is reasonable for the counter to go from 3 to 1 because the state changed: each time the state changes, the counter resets.
As for the second issue, it is a known bug. Our developers are already working on a fix and plan to release an update for XI shortly.
If this is affecting your production system and you can't wait a couple of weeks, I suggest downgrading Nagios Core to version 4.2.4.
https://support.nagios.com/kb/article/n ... e-823.html
Re: SOFT recovery always with #attemps=1
Posted: Wed Dec 05, 2018 4:55 am
by op-team
Hi,
Thanks for your reply.
"Do you mean when a service is in a soft Warning state (3) and it recovers to a soft Ok state (1)? The fact that it goes from 3 to 1 is reasonable because the state changed. Each time a state changes the counter resets."
Yes, this is what I meant, but it does not match what is explained in:
https://assets.nagios.com/downloads/nag ... types.html
In the example below, looking at times #8 to #10, when a soft recovery occurs we should first get a SOFT state with check# equal to the actual retry number, then the event handler is executed, and only afterwards does the state become HARD with check# reset to 1.
[Attachment: statetype1.PNG]
In my case, however, the event handler is executed on the SOFT state after check# has already been reset to 1.
I am trying to distinguish a service soft recovery from a soft error from a soft recovery from a hard error caused by the corresponding host being DOWN or UNREACHABLE.
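To illustrate what the event handler is meant to do, here is a minimal sketch that classifies an event from the service state, state type and attempt number. It assumes the documented behaviour (a soft recovery arrives as OK/SOFT with the actual retry number, and a recovery from a hard problem arrives as OK/HARD), so it is not expected to give useful results on a Core version affected by the behaviour described above. The function name classify_service_event is hypothetical.

```shell
#!/bin/sh
# classify_service_event STATE STATETYPE ATTEMPT (hypothetical helper)
# Prints a label for the event, assuming the documented state type
# behaviour; on an affected Core version ATTEMPT is always 1 on recovery.
classify_service_event() {
    state=$1; type=$2; attempt=$3
    case "$state/$type" in
        OK/SOFT) echo "soft-recovery attempt=$attempt" ;;
        OK/HARD) echo "hard-recovery" ;;
        */SOFT)  echo "soft-problem attempt=$attempt" ;;
        *)       echo "hard-problem" ;;
    esac
}
```

In an event handler command definition, the three arguments would typically be filled from the standard macros $SERVICESTATE$, $SERVICESTATETYPE$ and $SERVICEATTEMPT$.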
Regarding my second issue, do you think that, instead of downgrading Nagios Core, the solution suggested in the following topic may solve the problem?
https://support.nagios.com/forum/viewto ... k&start=10
"I believe I found the cause in Core, and it is fixed in the maint branch on GitHub:"
https://github.com/NagiosEnterprises/na ... tree/maint
wget https://github.com/NagiosEnterprises/na ... nt.tar.gz
tar xzf maint.tar.gz
cd nagioscore-maint
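# Flags for ./configure; the init type is prepended below when needed
configureflags="--with-command-group=nagcmd"
# Use SysV init if systemctl is missing or a SysV init script already exists
if ! command -v systemctl >/dev/null 2>&1 || [ -f /etc/init.d/nagios ]; then
    configureflags="--with-init-type=sysv $configureflags"
fi
# Left unquoted on purpose so that each flag is passed as a separate argument
./configure $configureflags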
make -j 2 all
make install
service nagios restart
Re: SOFT recovery always with #attemps=1
Posted: Wed Dec 05, 2018 1:02 pm
by npolovenko
@op-team, I see, you're right. The first observation is related to a bug in Core 4.4.2. Our developers are currently working on a fix. If this issue is critical for your environment, I recommend downgrading Core to version 4.2.4.
https://support.nagios.com/kb/article/n ... e-823.html
The fix for the second bug hasn't been released on the Core branch yet, so updating Core is unlikely to resolve the problem.
Re: SOFT recovery always with #attemps=1
Posted: Thu Dec 06, 2018 3:50 am
by op-team
Thanks for your quick reply. I am going to downgrade Core.
I will let you know if I need any further help.
Re: SOFT recovery always with #attemps=1
Posted: Thu Dec 06, 2018 12:45 pm
by npolovenko
Re: SOFT recovery always with #attemps=1
Posted: Mon Jan 07, 2019 4:32 am
by op-team
Hi guys,
Good news! The Core downgrade has fixed both issues. We are now running Nagios XI 5.5.7 with Core 4.2.4.
According to the changelog, the latest release, 5.5.8, doesn't address the Core 4.4.2 bugs, right?
5.5.8 - 12/11/2018
Fixed tmp directory for exporting RRD performance data -JO
Fixed UTF-8 characters in host/service names not allowing for external commands from the GUI to be processed [TPS#13833] -JO
Fixed upgrading Config Wizards due to wizards with the same directory name [TPS#13857] -JO
Fixed XSS security vulnerabilities in rss_dashlet -JO
Fixed an issue where importing configuration from files/REST API would sometimes cause duplicate service definitions [TPS#13871] - SAW, JO
Fixed Availability dashlet to work like a normal dashlet and lookback period is properly set based on the report it's created from [TPS#13841] -JO
Fixed issue with nmap multiple IP addresses causing problems running because of security fix -JO,SS
Fixed issue with specific configurations in ndoutils causing Core to crash by updating ndoutils to 2.1.3 -JO
Fixed lock file permissions for Core 4.2.4 (if users are using mod_gearman or had to downgrade to XI's old version of Core) -JO
Core Config Manager (CCM) - 2.7.4
Added icon to relationship popup for host/services that are inactive [TPS#13852] -JO
Fixed missing hosts/service from relationships popup when applied to groups that are set as inactive [TPS#13852] -JO
Best regards
Re: SOFT recovery always with #attemps=1
Posted: Mon Jan 07, 2019 2:07 pm
by npolovenko
@op-team, that's right. I suggest waiting to upgrade until XI 5.5.9 comes out. That update will include Core 4.4.3 with both bug fixes.
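For anyone scripting a pre-upgrade check, the version information in this thread can be summarised in a small sketch. It only encodes what was said here: Core 4.2.4 worked after the downgrade, 4.4.2 showed both bugs, and 4.4.3 or later is expected to contain the fixes. Other versions were not discussed, so they are reported as unknown. The function name core_status is hypothetical, and the comparison relies on `sort -V`, which is assumed to be available (GNU coreutils).

```shell
#!/bin/sh
# core_status VERSION (hypothetical helper): maps a Nagios Core version
# string to what this thread says about it. Nothing here is verified
# beyond the statements made in the thread.
core_status() {
    v=$1
    if [ "$v" = "4.2.4" ]; then
        echo "unaffected"        # downgrade target that fixed both issues
    elif [ "$v" = "4.4.2" ]; then
        echo "affected"          # version where both bugs were observed
    elif [ "$(printf '%s\n' "4.4.3" "$v" | sort -V | head -n 1)" = "4.4.3" ]; then
        echo "expected-fixed"    # 4.4.3 or later, per the staff reply
    else
        echo "unknown"           # not discussed in this thread
    fi
}
```

The version string itself would have to come from the installed binary or package; how to obtain it is left out because it depends on the installation.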