Service Availability Report seems not accurate for my servic

source888 · Post by **source888** » Fri Jan 10, 2020 12:12 am

Dear Nagios Core experts,

I use Nagios Core to generate an availability report for one service. In the availability report chart, I see that this service's total critical time is 1d 1h 52m 43s (this seems much longer than what we observed in real life over the last month), but from the detailed Service Log Entries I really can't work out how this time is calculated. Can you help point it out?

source888 · Post by **source888** » Fri Jan 10, 2020 1:15 am

In the whole of December there were only 3 occurrences of the critical alert, and 1 of them was in the maintenance window, so I think it was not counted.

source888 · Post by **source888** » Fri Jan 10, 2020 1:24 am

When I click into the "Service State Breakdowns" diagram, I see the following detail, and in this diagram the critical time for the whole of last month is only: Critical : (0.148%) 0d 1h 6m 12s. Why is it so much smaller here?

Post by **tacolover101** » Fri Jan 10, 2020 2:09 pm

can you please post your host config for an0vm020 and service config for 'Message exchanged' from the /usr/local/nagios/var/objects.cache file?

a few things i'm noticing:
- the host is in maintenance mode at times
- the checks could have dependency
- depending on how often these checks run (which looks to be every 24h), there could be complications with what your retries and max check attempts are set to.
i.e. if nagios is only set to check daily, then your interval could be incorrect here

source888 · Post by **source888** » Mon Jan 13, 2020 1:46 am

Here's the content of host and service config from objects.cache:

define host {
host_name an0
alias an0
address 10.1.*.* (sensitive content processed)
check_command check_tcp!55555
event_handler change_host_svc_notification
contact_groups eai_* (sensitive content processed)
initial_state o
importance 0
check_interval 5.000000
retry_interval 5.000000
max_check_attempts 3
active_checks_enabled 1
passive_checks_enabled 1
obsess 1
event_handler_enabled 1
low_flap_threshold 0.000000
high_flap_threshold 0.000000
flap_detection_enabled 1
flap_detection_options a
freshness_threshold 0
check_freshness 0
notification_options d
notifications_enabled 1
notification_interval 0.000000
first_notification_delay 0.000000
stalking_options n
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
_USER nagios
_PASS o*** (sensitive content processed)
_SEMPURL http://an0.***.com:8080/SEMP (sensitive content processed)
_SEC_PORT 10** (sensitive content processed)
}

define service {
host_name an0
service_description Message exchanged
check_period E24x7
check_command check_solace!com*** (sensitive content processed)
contact_groups e_***_group (sensitive content processed)
notification_period E_24x7
initial_state o
importance 0
check_interval 5.000000
retry_interval 5.000000
max_check_attempts 3
is_volatile 0
parallelize_check 1
active_checks_enabled 1
passive_checks_enabled 1
obsess 1
event_handler_enabled 1
low_flap_threshold 0.000000
high_flap_threshold 0.000000
flap_detection_enabled 1
flap_detection_options a
freshness_threshold 0
check_freshness 0
notification_options r,w,u,c
notifications_enabled 1
notification_interval 0.000000
first_notification_delay 0.000000
stalking_options n
process_perf_data 1
action_url /nagiosil/cgi-bin/show.cgi?host=$HOSTNAME$&service=$SERVICEDESC$' onMouseOver='showGraphPopup(this)' onMouseOut='hideGraphPopup()' rel='/nagiosil/cgi-bin/showgraph.cgi?host=$HOSTNAME$&service=$SERVICEDESC$&period=week&rrdopts=-w+450+-j
retain_status_information 1
retain_nonstatus_information 1
}
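
As a rough reading of the timing directives in the service definition above, the sketch below estimates how long a problem has to persist before it is recorded as a HARD state. It assumes the default `interval_length` of 60 seconds (the main config file is not shown in this thread) and ignores scheduling latency and check execution time. `minutes_until_hard` is a hypothetical helper name, not part of Nagios.

```python
def minutes_until_hard(retry_interval, max_check_attempts, interval_length=60):
    """Estimate minutes from the first failed check to the HARD state.

    After the first non-OK result, the service is rechecked every
    retry_interval time units until max_check_attempts results are in.
    interval_length (seconds per time unit) is ASSUMED to be the default 60.
    """
    retries = max_check_attempts - 1
    return retries * retry_interval * interval_length / 60


# Values taken from the service definition posted above.
print(minutes_until_hard(retry_interval=5.0, max_check_attempts=3))
```

With these values, a service would have to stay non-OK across three consecutive checks, roughly 10 minutes, before the problem becomes HARD.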

scottwilkerson · Post by **scottwilkerson** » Mon Jan 13, 2020 11:53 am

source888 wrote: In the whole of December there were only 3 occurrences of the critical alert, and 1 of them was in the maintenance window, so I think it was not counted.

From 12/10 - 12/11 when the service went down during downtime the recovery didn't come until 15h 4m 32s after the downtime ended.

This time counts and is added to the other times.
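
To illustrate the point above, here is a minimal sketch of how one problem interval can be split into a part covered by scheduled downtime and a part that falls outside it. The timestamps are made up for illustration (the real ones are in the screenshot and are not reproduced here); only the 15h 4m 32s gap between the end of the downtime and the recovery comes from this thread. `split_by_downtime` is a hypothetical helper, not the report's actual code.

```python
from datetime import datetime, timedelta


def split_by_downtime(problem_start, problem_end, downtime_start, downtime_end):
    """Return (scheduled, unscheduled) durations for one problem interval."""
    overlap_start = max(problem_start, downtime_start)
    overlap_end = min(problem_end, downtime_end)
    # No overlap yields a negative span, so clamp it to zero.
    scheduled = max(overlap_end - overlap_start, timedelta(0))
    unscheduled = (problem_end - problem_start) - scheduled
    return scheduled, unscheduled


# HYPOTHETICAL downtime window and problem start, for illustration only.
downtime_start = datetime(2019, 12, 10, 20, 0, 0)
downtime_end = datetime(2019, 12, 10, 22, 0, 0)
problem_start = datetime(2019, 12, 10, 21, 0, 0)
# The recovery arrives 15h 4m 32s after the downtime ended (from the thread).
problem_end = downtime_end + timedelta(hours=15, minutes=4, seconds=32)

scheduled, unscheduled = split_by_downtime(
    problem_start, problem_end, downtime_start, downtime_end
)
print("scheduled:", scheduled)
print("unscheduled:", unscheduled)
```

The problem starts inside the maintenance window, but everything after the window closes is unscheduled critical time and counts against the service.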

source888 · Post by **source888** » Sun Feb 16, 2020 10:35 pm

Hello Scott,

Yes, you're right. Following your reasoning, I now find that the availability calculation is correct.

What makes the report look wrong to me now seems to be an issue with a "soft recovery" not going to a "hard recovery". As you can see in the following screenshot, there is one soft recovery on 2019-11-26 11:03:19, but the whole period from then until 2019-11-27 00:00:00 is counted as service critical. I assume that during this time the service showed green in the service view, since it was in a soft recovery state.

scottwilkerson · Post by **scottwilkerson** » Mon Feb 17, 2020 7:40 am

The downtime had already ended before that 9-hour critical period, so that time counts.

source888 · Post by **source888** » Mon Feb 17, 2020 8:27 am

For that 9-hour critical period, the information I can get from the availability report and the service event log is the following:

On 2019-11-26 10:53:19, a new service CRITICAL (HARD) starts.

On 2019-11-26 11:03:19, an OK (SOFT) occurs.

On 2019-11-26 20:00:01, the Nagios process restarts.

On 2019-11-27 00:00:00, the availability report finally shows the service becoming OK (HARD).

So my confusion now is why this service didn't get an OK (HARD) after the first OK (SOFT) on 2019-11-26 11:03:19. Is it possible that the bug mentioned as fixed in version 4.4.3 still exists, namely: * Fixed services in soft states sometimes not switching into hard states (#576) (Jake Omann)
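
For reference, the gaps between the four events listed in the post above can be computed with a short sketch like this one. The timestamps and labels are copied from the post; `gaps` is a hypothetical helper name.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Events exactly as listed in the post above.
EVENTS = [
    ("2019-11-26 10:53:19", "CRITICAL (HARD)"),
    ("2019-11-26 11:03:19", "OK (SOFT)"),
    ("2019-11-26 20:00:01", "Nagios process restart"),
    ("2019-11-27 00:00:00", "OK (HARD)"),
]


def gaps(events):
    """Return (from_label, to_label, duration) for each consecutive pair."""
    parsed = [(datetime.strptime(ts, FMT), label) for ts, label in events]
    return [
        (a_label, b_label, b_time - a_time)
        for (a_time, a_label), (b_time, b_label) in zip(parsed, parsed[1:])
    ]


for start, end, delta in gaps(EVENTS):
    print(f"{start} -> {end}: {delta}")
```

The gap between the OK (SOFT) entry and the process restart is 8h 56m 42s, which appears to be the roughly 9-hour period under discussion. The full span from CRITICAL (HARD) to OK (HARD) is 13h 6m 41s.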

scottwilkerson · Post by **scottwilkerson** » Mon Feb 17, 2020 8:34 am

If it was in a SOFT CRITICAL, the recovery that is logged should be SOFT. It will switch to HARD (in memory), but it is recorded as SOFT so event handlers trigger appropriately.

Here are the docs on state types, with examples:
https://assets.nagios.com/downloads/nag ... types.html
Re:

Service experiences a SOFT recovery. Event handlers execute, but notification are not sent, as this wasn't a "real" problem. State type is set HARD and check # is reset to 1 immediately after this happens.
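
The behaviour quoted above can be sketched as a small state machine. This is a simplified model based on the state types documentation, not the actual Nagios Core implementation: it ignores host state, passive checks, and volatile services, and the `ServiceState` class and its fields are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ServiceState:
    max_check_attempts: int = 3
    state: str = "OK"
    state_type: str = "HARD"
    attempt: int = 1

    def process(self, result):
        """Apply one check result; return (state, state_type) as logged."""
        was_ok = self.state == "OK"
        was_soft_problem = (not was_ok) and self.state_type == "SOFT"
        if result == "OK":
            # A recovery from a SOFT problem is logged as SOFT, but the
            # in-memory state type becomes HARD and the attempt resets to 1.
            logged_type = "SOFT" if was_soft_problem else "HARD"
            self.state, self.state_type, self.attempt = "OK", "HARD", 1
            return ("OK", logged_type)
        if was_ok:
            self.attempt = 1
        elif was_soft_problem:
            self.attempt += 1
        # An existing HARD problem stays HARD; its attempt count is unchanged.
        self.state = result
        reached_max = self.attempt >= self.max_check_attempts
        self.state_type = "HARD" if reached_max else "SOFT"
        return (self.state, self.state_type)


svc = ServiceState(max_check_attempts=3)
for result in ["UNKNOWN", "OK", "OK"]:
    print(result, "->", svc.process(result), "| in memory:", svc.state_type)
```

In this model the SOFT recovery is only what gets logged; the next OK result is reported as HARD because the in-memory state type was already switched.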

Nagios Support Forum
