Hi guys,
Since we upgraded our XI to 5.5.7, we have noticed some strange behaviour with the service state type.
When a service recovers from a soft state, the #attempts is always 1 while still in the SOFT state. We were expecting a value equal to the actual number of attempts (#actual_attempts).
Does anyone else experience the same issue?
This is an example:
[1543791600] CURRENT HOST STATE: 1287000010_00_switch-04;UP;HARD;1;OK - 10.2.0.59: rta 8.311ms, lost 0%
[1543797638] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;UNKNOWN;SOFT;1;UNKNOWN: Script timed out
[1543797879] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;OK;SOFT;1;'PSE_10011002' is ON(poe_usage:3.04%) 'PSE_10011001' is ON(poe_usage:5.67%) 'PSE_10011003' is ON(poe_usage:1.50%)
[1543798746] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;UNKNOWN;SOFT;1;UNKNOWN: Script timed out
[1543799015] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;UNKNOWN;SOFT;2;UNKNOWN: Script timed out
[1543799256] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;OK;SOFT;1;'PSE_10011002' is ON(poe_usage:3.04%) 'PSE_10011001' is ON(poe_usage:5.67%) 'PSE_10011003' is ON(poe_usage:1.50%)
[1543803119] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;UNKNOWN;SOFT;1;UNKNOWN: Script timed out
[1543803388] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;UNKNOWN;SOFT;2;UNKNOWN: Script timed out
[1543803657] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;UNKNOWN;SOFT;3;UNKNOWN: Script timed out
[1543803898] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;OK;SOFT;1;'PSE_10011002' is ON(poe_usage:3.04%) 'PSE_10011001' is ON(poe_usage:5.74%) 'PSE_10011003' is ON(poe_usage:1.50%)
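For reference, the semicolon-separated fields after "SERVICE ALERT:" in nagios.log are host;service;state;state_type;attempt;plugin_output, so the attempt counter under discussion can be pulled out directly (a small illustrative command using one of the lines above):

```shell
# Extract state, state type and attempt number from a sample SERVICE ALERT line.
# Fields after "SERVICE ALERT:" are: host;service;state;state_type;attempt;output
line='[1543799015] SERVICE ALERT: 1287000010_00_switch-04;PoE Status;UNKNOWN;SOFT;2;UNKNOWN: Script timed out'
echo "$line" | awk -F';' '{print "state=" $3, "type=" $4, "attempt=" $5}'
# prints: state=UNKNOWN type=SOFT attempt=2
```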
SOFT recovery always with #attempts=1
Re: SOFT recovery always with #attempts=1
Another behaviour that needs investigation: as you can see from the logs below, the service state type never switches to HARD, although the host is DOWN and max_check_attempts (6) has been reached.
[root@nagios-01: /usr/local/nagios/libexec]# grep 1287060042_01_switch-02 ../var/nagios.log | egrep "HOST|PoE Status"| grep -v "EVENT HANDLER:" | perl -pe 's/(\d+)/localtime($1)/e'
[Mon Dec 3 00:00:00 2018] CURRENT HOST STATE: 1287060042_01_switch-02;UP;HARD;1;OK - 10.2.130.171: rta 25.176ms, lost 0%
[Mon Dec 3 00:00:00 2018] CURRENT SERVICE STATE: 1287060042_01_switch-02;PoE Status;OK;HARD;1;'PSE_1' is ON(poe_usage:10.00%)
[Mon Dec 3 12:56:26 2018] HOST ALERT: 1287060042_01_switch-02;DOWN;SOFT;1;CRITICAL - 10.2.130.171: rta nan, lost 100%
[Mon Dec 3 13:02:35 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;1;UNKNOWN: Script timed out
[Mon Dec 3 13:07:03 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;2;UNKNOWN: Script timed out
[Mon Dec 3 13:11:32 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;3;UNKNOWN: Script timed out
[Mon Dec 3 13:11:45 2018] HOST NOTIFICATION: alerts.nwuc;1287060042_01_switch-02;DOWN;xi_host_notification_handler;CRITICAL - 10.2.130.171: rta nan, lost 100%
[Mon Dec 3 13:11:45 2018] HOST ALERT: 1287060042_01_switch-02;DOWN;HARD;6;CRITICAL - 10.2.130.171: rta nan, lost 100%
[Mon Dec 3 13:16:01 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;4;UNKNOWN: Script timed out
[Mon Dec 3 13:20:29 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;5;UNKNOWN: Script timed out
[Mon Dec 3 13:24:58 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 13:28:46 2018] Warning: The results of service 'PoE Status' on host '1287060042_01_switch-02' are stale by 0d 0h 0m 3s (threshold=0d 0h 4m 15s). I'm forcing an immediate check of the service.
[Mon Dec 3 13:29:16 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 13:33:45 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 13:38:14 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 13:42:43 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 13:47:11 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 13:51:40 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 13:56:09 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 14:00:38 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 14:05:07 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 14:09:36 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 14:14:05 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 14:18:34 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;UNKNOWN;SOFT;6;UNKNOWN: Script timed out
[Mon Dec 3 14:21:28 2018] HOST ALERT: 1287060042_01_switch-02;UP;HARD;1;OK - 10.2.130.171: rta 25.641ms, lost 0%
[Mon Dec 3 14:22:34 2018] SERVICE ALERT: 1287060042_01_switch-02;PoE Status;OK;SOFT;1;'PSE_1' is ON(poe_usage:10.00%)
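As an aside, the perl one-liner above converts the epoch timestamps; on systems with GNU date (assuming GNU coreutils is available), the same conversion can be done per timestamp with date:

```shell
# Convert an epoch timestamp from nagios.log to a readable date.
# -u prints UTC so the output is timezone-independent.
date -u -d @1543791600 +'%Y-%m-%d %H:%M:%S UTC'
# prints: 2018-12-02 23:00:00 UTC
```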
npolovenko (Support Tech)
Re: SOFT recovery always with #attempts=1
Hello, @op-team. Do you mean when a service is in a soft Warning state (attempt 3) and it recovers to a soft OK state (attempt 1)? Going from 3 to 1 is reasonable because the state changed; each time the state changes, the counter resets.
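That counter-reset rule can be sketched with a deliberately simplified toy model (hypothetical code, not Nagios's actual implementation): the attempt number increments while the same state repeats, caps at max_check_attempts, and resets to 1 on any state change:

```shell
# Toy model of the attempt counter (NOT Nagios source): increments while the
# state repeats, caps at max_check_attempts, resets to 1 on any state change.
simulate() {
    max=3; attempt=0; last=""
    for s in "$@"; do
        if [ "$s" = "$last" ]; then
            if [ "$attempt" -lt "$max" ]; then attempt=$((attempt + 1)); fi
        else
            attempt=1
        fi
        last="$s"
        if [ "$attempt" -ge "$max" ]; then type=HARD; else type=SOFT; fi
        echo "$s;$type;$attempt"
    done
}
simulate UNKNOWN UNKNOWN OK   # last line is "OK;SOFT;1", like the recovery entries above
```

Under this model a recovery always arrives with attempt 1, which matches the logged "OK;SOFT;1" lines, so the question is whether the event handler should fire before or after that reset.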
As for the second issue: it's a known bug, and our developers are already working on a fix; they plan to release an update for XI shortly.
If this is affecting your production system and you can't wait a couple of weeks, I suggest downgrading the Nagios Core version to 4.2.4.
https://support.nagios.com/kb/article/n ... e-823.html
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
Re: SOFT recovery always with #attempts=1
Hi,
Thanks for your reply.
Yes, this is what I meant, but as explained in "https://assets.nagios.com/downloads/nag ... types.html", the behaviour we observe does not match the documented one.
In the example above (entries #8 to #10), when a soft recovery occurs we would expect first a SOFT state with check# equal to the actual retry number, at which point the event handler is executed, and only then a HARD state with check# reset to 1. In our case, however, the event handler is executed on the SOFT state after check# has already been reset to 1.
I am trying to distinguish a service's soft recovery from a soft error from its soft recovery from a hard error caused by the corresponding host being DOWN or UNREACHABLE.
As regards my second issue: do you think that, instead of downgrading Nagios Core, the solution suggested in the following topic may solve the problem?
https://support.nagios.com/forum/viewto ... k&start=10
I believe I found the cause in Core; it is fixed in the maint branch on GitHub:
https://github.com/NagiosEnterprises/na ... tree/maint
# download and unpack the maint branch
wget https://github.com/NagiosEnterprises/na ... nt.tar.gz
tar xzf maint.tar.gz
cd nagioscore-maint
# use SysV init when systemd is absent or a legacy init script is present
configureflags="--with-command-group=nagcmd"
if ! command -v systemctl >/dev/null 2>&1 || [ -f /etc/init.d/nagios ]; then
    configureflags="--with-init-type=sysv $configureflags"
fi
# $configureflags is intentionally unquoted so each flag is passed as a separate argument
./configure $configureflags
make -j 2 all
make install
service nagios restart
npolovenko (Support Tech)
Re: SOFT recovery always with #attempts=1
@op-team. I see; you're right. The first observation is related to a bug in Core 4.4.2, and our developers are currently working on a fix. If this issue is critical for your environment, I recommend downgrading the Core version to 4.2.4.
https://support.nagios.com/kb/article/n ... e-823.html
The fix for the second bug hasn't been released on the Core branch yet, so updating Core is unlikely to resolve the problem.
Re: SOFT recovery always with #attempts=1
Thanks for your quick reply. I am going to downgrade the core.
I will let you know if I need any further help.
npolovenko (Support Tech)
Re: SOFT recovery always with #attempts=1
@op-team, sounds good.
Re: SOFT recovery always with #attempts=1
Hi guys,
Good news! The core downgrade has fixed both issues, so we are now running Nagios XI 5.5.7 with Core 4.2.4.
According to the changelog, the latest release, 5.5.8, doesn't address the Core 4.4.2 bugs, right?
5.5.8 - 12/11/2018
Fixed tmp directory for exporting RRD performance data -JO
Fixed UTF-8 characters in host/service names not allowing for external commands from the GUI to be processed [TPS#13833] -JO
Fixed upgrading Config Wizards due to wizards with the same directory name [TPS#13857] -JO
Fixed XSS security vulnerabilities in rss_dashlet -JO
Fixed an issue where importing configuration from files/REST API would sometimes cause duplicate service definitions [TPS#13871] - SAW, JO
Fixed Availability dashlet to work like a normal dashlet and lookback period is properly set based on the report it's created from [TPS#13841] -JO
Fixed issue with nmap multiple IP addresses causing problems running because of security fix -JO,SS
Fixed issue with specific configurations in ndoutils causing Core to crash by updating ndoutils to 2.1.3 -JO
Fixed lock file permissions for Core 4.2.4 (if users are using mod_gearman or had to downgrade to XI's old version of Core) -JO
Core Config Manager (CCM) - 2.7.4
Added icon to relationship popup for host/services that are inactive [TPS#13852] -JO
Fixed missing hosts/service from relationships popup when applied to groups that are set as inactive [TPS#13852] -JO
Best regards
npolovenko (Support Tech)
Re: SOFT recovery always with #attempts=1
@op-team, that's right. I suggest waiting to upgrade until XI 5.5.9 comes out; that update will include Core 4.4.3 with both bug fixes.