Service Availability Report seems not accurate for my servic
Dear Nagios Core experts,
I use Nagios Core to generate an availability report for one service. In the availability report chart, this service's total critical time is 1d 1h 52m 43s, which is much larger than what we observed in real life last month. From the detailed Service Log Entries I can't work out how this time is calculated. Can you help point it out?
Last edited by source888 on Mon Feb 17, 2020 12:21 am, edited 1 time in total.
Re: Service Availability Report seems not accurate for my se
In the whole of December there were only 3 occurrences of the critical alert, and 1 of them was in the maintenance window, so I think it was not counted.
Re: Service Availability Report seems not accurate for my se
When I click into the "Service State Breakdowns" diagram, I see the following detail. In this diagram the critical time for the whole of last month is only: Critical : (0.148%) 0d 1h 6m 12s. Why is it so much smaller here?
Last edited by source888 on Mon Feb 17, 2020 12:22 am, edited 1 time in total.
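To compare the two figures on the same scale, a small sketch like the one below can convert Nagios-style duration strings into a share of the report window. The helper names (`parse_duration`, `percent_of_window`) are made up for illustration, and the 31-day window is an assumption about the report period; it happens to reproduce the 0.148% shown in the breakdown.

```python
import re
from datetime import timedelta


def parse_duration(text):
    """Parse a duration such as '1d 1h 52m 43s' into a timedelta."""
    m = re.fullmatch(r"\s*(\d+)d\s+(\d+)h\s+(\d+)m\s+(\d+)s\s*", text)
    if m is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    days, hours, minutes, seconds = map(int, m.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def percent_of_window(duration, window):
    """Return the duration as a percentage of the report window."""
    return 100.0 * duration.total_seconds() / window.total_seconds()


# Assumption: the report covers a 31-day month.
WINDOW = timedelta(days=31)

breakdown = parse_duration("0d 1h 6m 12s")   # figure from the breakdown diagram
summary = parse_duration("1d 1h 52m 43s")    # figure from the report chart

print(round(percent_of_window(breakdown, WINDOW), 3))  # 0.148, as in the diagram
print(round(percent_of_window(summary, WINDOW), 3))
```

If the two views really cover the same window, the difference between the two durations is the time that one view counts and the other does not.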
- tacolover101
- Posts: 432
- Joined: Mon Apr 10, 2017 11:55 am
Re: Service Availability Report seems not accurate for my se
can you please post your host config for an0vm020 and service config for 'Message exchanged' from the /usr/local/nagios/var/objects.cache file?
a few things i'm noticing:
- the host is in maintenance mode at times
- the checks could have dependency
- depending on how often these checks run (which looks to be every 24h), there could be complications with what your retry interval and max check attempts are set to.
ie. if nagios is only set to check daily, then your interval could be incorrect here
Re: Service Availability Report seems not accurate for my se
Here's the content of host and service config from objects.cache:
define host {
host_name an0
alias an0
address 10.1.*.* (sensitive content processed)
check_command check_tcp!55555
event_handler change_host_svc_notification
contact_groups eai_* (sensitive content processed)
initial_state o
importance 0
check_interval 5.000000
retry_interval 5.000000
max_check_attempts 3
active_checks_enabled 1
passive_checks_enabled 1
obsess 1
event_handler_enabled 1
low_flap_threshold 0.000000
high_flap_threshold 0.000000
flap_detection_enabled 1
flap_detection_options a
freshness_threshold 0
check_freshness 0
notification_options d
notifications_enabled 1
notification_interval 0.000000
first_notification_delay 0.000000
stalking_options n
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
_USER nagios
_PASS o*** (sensitive content processed)
_SEMPURL http://an0.***.com:8080/SEMP (sensitive content processed)
_SEC_PORT 10** (sensitive content processed)
}
define service {
host_name an0
service_description Message exchanged
check_period E24x7
check_command check_solace!com*** (sensitive content processed)
contact_groups e_***_group (sensitive content processed)
notification_period E_24x7
initial_state o
importance 0
check_interval 5.000000
retry_interval 5.000000
max_check_attempts 3
is_volatile 0
parallelize_check 1
active_checks_enabled 1
passive_checks_enabled 1
obsess 1
event_handler_enabled 1
low_flap_threshold 0.000000
high_flap_threshold 0.000000
flap_detection_enabled 1
flap_detection_options a
freshness_threshold 0
check_freshness 0
notification_options r,w,u,c
notifications_enabled 1
notification_interval 0.000000
first_notification_delay 0.000000
stalking_options n
process_perf_data 1
action_url /nagiosil/cgi-bin/show.cgi?host=$HOSTNAME$&service=$SERVICEDESC$' onMouseOver='showGraphPopup(this)' onMouseOut='hideGraphPopup()' rel='/nagiosil/cgi-bin/showgraph.cgi?host=$HOSTNAME$&service=$SERVICEDESC$&period=week&rrdopts=-w+450+-j
retain_status_information 1
retain_nonstatus_information 1
}
Last edited by source888 on Mon Feb 17, 2020 12:23 am, edited 1 time in total.
-
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
- Contact:
Re: Service Availability Report seems not accurate for my se
source888 wrote:In the whole Dec ,only 3 occurance of the critical alert, and 1 occurance is in the maintenance window, so i think it haven't be calculated in.
From 12/10 - 12/11, when the service went down during downtime, the recovery didn't come until 15h 4m 32s after the downtime ended.
This time counts and is added to the other times.
Re: Service Availability Report seems not accurate for my se
Hello Scott,
Yes, you're right. Following your reasoning, I now find that the availability calculation is correct.
What makes the report look wrong to me now seems to be an issue with a "soft recovery" not turning into a "hard recovery". As you can see in the following screenshot, there is one soft recovery on 2019-11-26 11:03:19, but the whole period from then until 2019-11-27 00:00:00 is counted as service critical. I assume that during this time the service shows green in the service view, since it is in a soft recovery state.
-
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
- Contact:
Re: Service Availability Report seems not accurate for my se
The downtime had already ended before that 9-hour critical period, so that time counts.
Re: Service Availability Report seems not accurate for my se
For that 9-hour critical period, the info I can get from the availability report and the service event log is the following:
on 2019-11-26 10:53:19, a new service CRITICAL (HARD) starts
on 2019-11-26 11:03:19, an OK (SOFT) occurs
on 2019-11-26 20:00:01, the nagios process restarts
on 2019-11-27 00:00:00, the availability report finally shows the service becoming OK (HARD)
So my confusion now is why this service doesn't get an OK (HARD) after the first OK (SOFT) on 2019-11-26 11:03:19. Is it possible that the bug listed as fixed in version 4.4.3 still exists, namely: * Fixed services in soft states sometimes not switching into hard states (#576) (Jake Omann)
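The effect of that timeline on the reported critical time can be sketched as follows. This is only an illustration, not how the availability CGI is implemented: it assumes the report ends a critical period only at the next state entry it considers, and that SOFT entries are ignored unless explicitly included. The `critical_time` helper is hypothetical, and the process restart is left out because it is not a state change.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# State entries as listed in the post: (timestamp, state, state type).
events = [
    ("2019-11-26 10:53:19", "CRITICAL", "HARD"),
    ("2019-11-26 11:03:19", "OK", "SOFT"),
    ("2019-11-27 00:00:00", "OK", "HARD"),
]


def critical_time(events, include_soft):
    """Sum the time spent in CRITICAL between the entries that are considered."""
    total = timedelta(0)
    current_state = None
    since = None
    for stamp, state, state_type in events:
        if state_type == "SOFT" and not include_soft:
            continue  # assumption: SOFT entries do not change the reported state
        when = datetime.strptime(stamp, FMT)
        if current_state == "CRITICAL":
            total += when - since
        current_state, since = state, when
    return total


print(critical_time(events, include_soft=False))  # 13:06:41
print(critical_time(events, include_soft=True))   # 0:10:00
```

Under these assumptions, whether the 11:03:19 entry is counted makes the difference between ten minutes and roughly thirteen hours of critical time for this one incident.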
-
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
- Contact:
Re: Service Availability Report seems not accurate for my se
If it was in a SOFT CRITICAL, the recovery that is logged should be SOFT. It will switch to HARD (in memory) but is recorded as SOFT so that event handlers trigger appropriately.
Here's the docs on state types with examples
https://assets.nagios.com/downloads/nag ... types.html
Service experiences a SOFT recovery. Event handlers execute, but notifications are not sent, as this wasn't a "real" problem. State type is set HARD and check # is reset to 1 immediately after this happens.
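The soft/hard rules described above can be modelled with a few lines of Python. This is a simplified sketch, not Nagios code: the `state_types` function is hypothetical, it ignores host state, dependencies and flapping, and it uses `max_check_attempts` of 3 as in the service config posted earlier in the thread.

```python
def state_types(results, max_check_attempts=3):
    """Return (check #, state, logged state type) for a series of check results."""
    attempt = 1
    prev_state, prev_type = "OK", "HARD"
    logged = []
    for state in results:
        if state == "OK":
            if prev_state != "OK" and prev_type == "SOFT":
                # Recovery from a SOFT problem is logged as a SOFT recovery.
                attempt += 1
                state_type = "SOFT"
            else:
                # Steady OK, or recovery from a HARD problem.
                attempt = 1
                state_type = "HARD"
            logged.append((attempt, state, state_type))
            # Right after any recovery the state type is HARD and check # is 1.
            attempt = 1
            prev_state, prev_type = "OK", "HARD"
        else:
            if prev_state == "OK":
                attempt = 1            # first failing check
            elif prev_type == "SOFT":
                attempt += 1           # still retrying
            # In a HARD problem the check # stays at max_check_attempts.
            state_type = "HARD" if attempt >= max_check_attempts else "SOFT"
            logged.append((attempt, state, state_type))
            prev_state, prev_type = state, state_type
    return logged


for row in state_types(["OK", "UNKNOWN", "OK", "OK"]):
    print(row)
```

In this model the recovery after a single UNKNOWN is logged as SOFT at check # 2, and the next OK result is already HARD at check # 1, which is the behaviour the quoted documentation describes.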