Inconsistent Nagios Report
Posted: Wed Feb 20, 2013 11:27 am
by fran.pastor
Hello, I've just seen some data that I think is wrong. See the attached screenshots.
Last night we had a small network problem that caused the service check to fail, but it recovered immediately on the next check. Something curious happened: when we run a report (trend or availability, for example), this service check shows as down for 13 to 15 hours. Why?
Re: Inconsistent Nagios Report
Posted: Wed Feb 20, 2013 2:49 pm
by abrist
I wonder if this host was flapping for 13 hours...
Re: Inconsistent Nagios Report
Posted: Thu Feb 21, 2013 2:44 am
by fran.pastor
No. If you look at the "instantanea2.png" screenshot, which is the "Alert History" of that service, there are no further events for it; nothing changed during the remaining hours.
It is strange.
Re: Inconsistent Nagios Report
Posted: Thu Feb 21, 2013 10:31 am
by slansing
I just want to verify that you do have flapping detection enabled, correct?
Re: Inconsistent Nagios Report
Posted: Thu Feb 21, 2013 10:44 am
by fran.pastor
slansing wrote: I just want to verify that you do have flapping detection enabled, correct?
Yes slansing, we have flap detection configured correctly. Look at the events in the "Alert History" screenshot: if the service had flapped, we would see it there.
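Since the question is whether one short failure could look like flapping, here is a simplified Python sketch of percent-state-change flap detection. It assumes a window of the last 21 check results (the number Nagios keeps), the thresholds from the sample nagios.cfg (low_service_flap_threshold=5.0, high_service_flap_threshold=20.0), and no weighting. Nagios Core weights recent state changes more heavily than older ones, so its real percentage will differ somewhat; the function names here are hypothetical.

```python
"""Simplified, unweighted sketch of percent-state-change flap detection.

Assumptions: a 21-result window and the sample nagios.cfg thresholds
(low 5.0, high 20.0). Nagios Core itself weights recent changes more
heavily, so this is an approximation, not its exact calculation.
"""


def percent_state_change(history):
    """Percentage of consecutive result pairs in which the state changed."""
    if len(history) < 2:
        return 0.0
    changes = sum(1 for a, b in zip(history, history[1:]) if a != b)
    return 100.0 * changes / (len(history) - 1)


def update_flapping(was_flapping, pct, low=5.0, high=20.0):
    """Start flapping above the high threshold, stop below the low one."""
    if not was_flapping and pct > high:
        return True
    if was_flapping and pct < low:
        return False
    return was_flapping


if __name__ == "__main__":
    # 21 results with a single CRITICAL in the middle (one short blip).
    history = ["OK"] * 10 + ["CRITICAL"] + ["OK"] * 10
    pct = percent_state_change(history)
    print(pct, update_flapping(False, pct))  # 10.0 False
```

In this unweighted version, one CRITICAL in an otherwise OK window gives two state changes out of 20 transitions, or 10%, which stays under a 20% high threshold.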
Re: Inconsistent Nagios Report
Posted: Thu Feb 21, 2013 11:52 am
by abrist
Could you post the host configuration and the main nagios.cfg file wrapped in code tags?
Re: Inconsistent Nagios Report
Posted: Thu Feb 21, 2013 11:53 am
by fran.pastor
I've thought about it and I see no logical explanation. Where can I report a bug?
Re: Inconsistent Nagios Report
Posted: Thu Feb 21, 2013 11:56 am
by abrist
You are welcome to post a bug report to http://tracker.nagios.org, but I am not convinced it is a bug yet. Posting the host config and the main nagios config will allow us to check over your configuration and help verify whether it is indeed a bug. You are welcome to obfuscate any sensitive information in those files.
Re: Inconsistent Nagios Report
Posted: Thu Feb 21, 2013 12:03 pm
by fran.pastor
abrist wrote: You are welcome to post a bug report to http://tracker.nagios.org, but I am not convinced it is a bug yet. Posting the host config and the main nagios config will allow us to check over your configuration and help verify whether it is indeed a bug. You are welcome to obfuscate any sensitive information in those files.
Thanks for the support, abrist.
What I find suspicious is that the Trend Report shows the service recovering at 00:00, even though the check ran every 5 minutes all day and every result was OK, except for a single CRITICAL at around 00:20.
This is the resulting configuration from objects.cache:
define service {
    host_name Watchmouse
    service_description Check Hotelopia
    check_period 24x7
    check_command check_watchmouse!Check Hotelopia!
    contact_groups datacenter-administrators-tic
    notification_period 24x7
    initial_state o
    check_interval 300.000000
    retry_interval 300.000000
    max_check_attempts 3
    is_volatile 0
    parallelize_check 1
    active_checks_enabled 1
    passive_checks_enabled 1
    obsess_over_service 1
    event_handler_enabled 1
    low_flap_threshold 0.000000
    high_flap_threshold 0.000000
    flap_detection_enabled 1
    flap_detection_options o,w,u,c
    freshness_threshold 0
    check_freshness 0
    notification_options u,w,c,r,s
    notifications_enabled 1
    notification_interval 0.000000
    first_notification_delay 0.000000
    stalking_options n
    process_perf_data 1
    failure_prediction_enabled 1
    retain_status_information 1
    retain_nonstatus_information 1
}
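The check timing in the definition above is easier to reason about in seconds. This is a minimal Python sketch, assuming hypothetical helpers that read one `define service { ... }` block in the whitespace-separated form shown above and multiply check_interval and retry_interval by interval_length. interval_length is a nagios.cfg setting that objects.cache does not record (60 seconds by default), so it is passed in explicitly; the trimmed SAMPLE block is illustrative only.

```python
"""Sketch: read one service block from objects.cache and convert its
check timing into seconds.

Assumption: interval_length comes from nagios.cfg (default 60 seconds);
it is passed in here because objects.cache does not record it.
"""
from __future__ import annotations


def parse_object_block(text: str) -> dict[str, str]:
    """Parse a 'define service { ... }' block into a directive -> value dict."""
    fields: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and the opening/closing lines of the block.
        if not line or line.startswith("define") or line == "}":
            continue
        # Directive name, then whitespace (tabs or spaces), then the value.
        parts = line.split(None, 1)
        fields[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return fields


def timing_in_seconds(fields: dict[str, str], interval_length: int = 60) -> dict[str, float]:
    """Convert check_interval/retry_interval (interval_length units) to seconds."""
    return {
        "check_interval_s": float(fields["check_interval"]) * interval_length,
        "retry_interval_s": float(fields["retry_interval"]) * interval_length,
    }


SAMPLE = """define service {
    host_name Watchmouse
    service_description Check Hotelopia
    check_interval 300.000000
    retry_interval 300.000000
    max_check_attempts 3
}"""

if __name__ == "__main__":
    svc = parse_object_block(SAMPLE)
    for length in (60, 1):
        t = timing_in_seconds(svc, length)
        print(f"interval_length={length}: check every {t['check_interval_s'] / 60:g} min, "
              f"retry every {t['retry_interval_s'] / 60:g} min")
    # interval_length=60: check every 300 min, retry every 300 min
    # interval_length=1: check every 5 min, retry every 5 min
```

With the default interval_length of 60, the posted value of 300 means 300 minutes; the same 300 would mean 5 minutes only if interval_length were set to 1 in nagios.cfg.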
Re: Inconsistent Nagios Report
Posted: Thu Feb 21, 2013 12:39 pm
by slansing
One issue I noticed right away was this:
check_interval 300.000000
retry_interval 300.000000
You have your check_interval and retry_interval set to 300, and Nagios interprets these values in units of interval_length (60 seconds by default), so they mean 300 minutes. Setting them each to 5, for example, would mean the service is checked every 5 minutes, and if its state changes it is rechecked every 5 minutes until it has failed three consecutive times (max_check_attempts 3), at which point it goes into a HARD state and an alert is generated.
This way it is entirely possible that it detected the state change but did not check the service again until 300 minutes later, and it may have needed several of these 300-minute intervals before it found the service back up and switched to an OK state.
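The scheduling described above can be sketched in Python. This is a simplified model, assuming the documented Nagios behavior that retry_interval applies while a service is in a non-OK state that has not yet reached max_check_attempts, and check_interval applies otherwise, with both values in interval_length units (60 seconds by default). The function check_times is hypothetical and ignores scheduling jitter, check latency, and the rest of Nagios' state handling.

```python
"""Simplified sketch of when a service's next check runs.

Assumption: retry_interval is used while the service is non-OK and has
not yet failed max_check_attempts times in a row; check_interval is used
otherwise. Both are in interval_length units (seconds each, default 60).
"""


def check_times(results, check_interval, retry_interval,
                max_check_attempts, interval_length=60):
    """Return (minutes_since_first_check, result, consecutive_failures) per result."""
    timeline = []
    minutes = 0.0
    failures = 0  # consecutive non-OK results so far
    for result in results:
        failures = 0 if result == "OK" else failures + 1
        timeline.append((minutes, result, failures))
        # A non-OK result that has not yet reached max_check_attempts is a
        # SOFT problem, so the next check uses retry_interval.
        soft_problem = 0 < failures < max_check_attempts
        interval = retry_interval if soft_problem else check_interval
        minutes += interval * interval_length / 60
    return timeline


if __name__ == "__main__":
    blip = ["CRITICAL", "OK", "OK"]  # one failure, then recovery
    for label, ci, ri in (("300/300", 300, 300), ("5/5", 5, 5)):
        times = [t for t, _, _ in check_times(blip, ci, ri, 3)]
        print(label, times)
    # 300/300 [0.0, 300.0, 600.0]
    # 5/5 [0.0, 5.0, 10.0]
```

With the posted 300/300 values and the default interval_length, nothing re-evaluates the service for 300 minutes after the CRITICAL result, whereas the 5/5 values suggested above would recheck it after 5 minutes.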