Monitor a service EXCEPT for a window of time?

GaWd
Posts: 51
Joined: Wed Dec 15, 2010 1:45 pm

Monitor a service EXCEPT for a window of time?

Post by GaWd »

I have 2 mission critical servers that experience a timeout every night between 3:01 and 3:10 AM. It's always a single failure and an immediate recovery, likely caused by network congestion or availability (backups and such).

I would like to eliminate this false positive, and was wondering what my options are.

Disabling the service for a 5-10 minute window, or raising the threshold for that window, seem like the best approaches to the issue, but I wanted to find out what you all thought.

Many thanks in advance.
GaWd
Posts: 51
Joined: Wed Dec 15, 2010 1:45 pm

Re: Monitor a service EXCEPT for a window of time?

Post by GaWd »

Here are the settings for the service:

max_check_attempts 5
normal_check_interval 30
retry_interval 10
active_checks_enabled 1
passive_checks_enabled 1
check_period 24x7
parallelize_check 1
obsess_over_service 1
check_freshness 1
freshness_threshold 15
event_handler_enabled 1
flap_detection_enabled 1
process_perf_data 1
retain_status_information 1
retain_nonstatus_information 1
notification_interval 0
notification_period 24x7
notifications_enabled 1
failure_prediction_enabled 1

I'm thinking that simply setting first_notification_delay to 1 will buy me the 60 seconds or less that I need to allow the network to start responding again.
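
For illustration, here is a minimal sketch of where that directive would sit in the service definition. The host name, service description and check command are placeholders that do not come from the post, and the contact directives are left out, so treat it as a fragment rather than a complete object. first_notification_delay is counted in time units, so with interval_length=60 a value of 1 should mean roughly 60 seconds.

```
define service {
    host_name                   critical-server-1   ; placeholder, not from the post
    service_description         Example Check       ; placeholder, not from the post
    check_command               check_example       ; placeholder, not from the post
    max_check_attempts          5
    normal_check_interval       30                  ; time units between regular checks
    retry_interval              10                  ; time units between retries
    check_period                24x7
    notification_period         24x7
    notification_interval       0                   ; 0 = do not re-notify
    first_notification_delay    1                   ; wait 1 time unit before the first problem notification
}
```

Note that this only delays the notification. The failed check is still recorded and still shows up in the state history and availability reports.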

Since I'm new to Nagios monitoring and configuration I would like to know what the experts think.
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: Monitor a service EXCEPT for a window of time?

Post by mguthrie »

You can define a custom time period for this check. Create a timeperiod that excludes 3:00-3:15 AM (or whenever the window actually is), and use that timeperiod as the check period for the affected services on those hosts.
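
A minimal sketch of such a timeperiod, assuming the window to skip is 03:00-03:15 every day. The name 24x7_except_0300 is made up for this example. Each day simply lists two ranges that leave the gap out, which avoids relying on the exclude directive.

```
define timeperiod {
    timeperiod_name     24x7_except_0300            ; hypothetical name
    alias               24x7 except 03:00-03:15
    sunday              00:00-03:00,03:15-24:00
    monday              00:00-03:00,03:15-24:00
    tuesday             00:00-03:00,03:15-24:00
    wednesday           00:00-03:00,03:15-24:00
    thursday            00:00-03:00,03:15-24:00
    friday              00:00-03:00,03:15-24:00
    saturday            00:00-03:00,03:15-24:00
}
```

The affected services would then use check_period 24x7_except_0300 in place of check_period 24x7, and optionally the same value for notification_period.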
GaWd
Posts: 51
Joined: Wed Dec 15, 2010 1:45 pm

Re: Monitor a service EXCEPT for a window of time?

Post by GaWd »

I actually went with the idea of a delay of one time unit (first_notification_delay 1) before sending alerts. I think it's less complicated and carries less risk than turning the service off for a specific time period.

I had thought there was a way to program custom time periods, but I couldn't remember.
mguthrie
Posts: 4380
Joined: Mon Jun 14, 2010 10:21 am

Re: Monitor a service EXCEPT for a window of time?

Post by mguthrie »

That works too. Some users prefer the timeperiods method if they need to do reporting to upper management and they want their uptime to look the best it can. Sounds like you've got a working solution though.
GaWd
Posts: 51
Joined: Wed Dec 15, 2010 1:45 pm

Re: Monitor a service EXCEPT for a window of time?

Post by GaWd »

Thank you very much for your assistance. Hopefully this resolves the issue, otherwise I'll need to go with disabling it for a small window.
GaWd
Posts: 51
Joined: Wed Dec 15, 2010 1:45 pm

Re: Monitor a service EXCEPT for a window of time?

Post by GaWd »

OK, so the problem never went away, and I am in need of additional assistance.

I can confirm that there are always socket timeouts at a specific time of day, but I cannot yet pinpoint the reason why. In trying to better-tune this host check, however, I have noticed that all of the intervals listed in the host check are in SECONDS, and not in minutes like the person who set the system up had expected.

I had thought that the intervals were set by the interval_length in nagios.cfg. My nagios.cfg is set as 'interval_length=60'. To me that seems to indicate that an interval is 60 seconds.

The host check for this particular server has a freshness setting of '15', which was thought to mean '15 time units', but I see freshness warnings every 15 seconds in the nagios log.

Are there any other settings that can override the interval_length setting?
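
For reference, as far as the Nagios Core documentation describes it, nothing overrides interval_length here. The freshness settings are just specified in seconds rather than in time units: freshness_threshold in the object definition, and host_freshness_check_interval and service_freshness_check_interval in nagios.cfg. Directives such as normal_check_interval, retry_interval and first_notification_delay are the ones multiplied by interval_length. A sketch of what a 15-minute freshness threshold might look like under that reading (the values 60 and 900 are illustrative, not taken from the post):

```
# In nagios.cfg (main config file); values are in seconds:
service_freshness_check_interval=60
host_freshness_check_interval=60

# In the service or host object definition; value is in seconds:
#     check_freshness         1
#     freshness_threshold     900
```

With freshness_threshold 15, a result would count as stale after 15 seconds, which would be consistent with the freshness warnings appearing in the log at that rate.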

I posted the host check setting in my first post, and below are the nagios.cfg settings. I'm sure that I am missing something very basic in my newb-ness to Nagios administration, but I can't tell for sure.

log_file=/usr/local/nagios/var/nagios.log
temp_file=/usr/local/nagios/var/nagios.tmp
status_file=/usr/local/nagios/var/status.dat
status_update_interval=15
nagios_user=nagios
nagios_group=nagcmd
enable_notifications=1
execute_service_checks=1
accept_passive_service_checks=1
enable_event_handlers=1
check_external_commands=1
command_check_interval=-1
command_file=/usr/local/nagios/var/rw/nagios.cmd
lock_file=/usr/local/nagios/var/nagios.pid
retain_state_information=1
state_retention_file=/usr/local/nagios/var/retention.dat
retention_update_interval=0
use_retained_program_state=1
use_syslog=0
log_notifications=0
log_service_retries=1
log_host_retries=1
log_event_handlers=1
log_initial_states=1
log_external_commands=1
log_passive_checks=0
sleep_time=0.25
service_interleave_factor=s
max_concurrent_checks=20
service_reaper_frequency=10
interval_length=60
use_aggressive_host_checking=0
enable_flap_detection=1
soft_state_dependencies=0
service_check_timeout=30
host_check_timeout=30
event_handler_timeout=39
notification_timeout=30
ocsp_timeout=5
perfdata_timeout=5
obsess_over_services=1
process_performance_data=0
check_for_orphaned_services=0
check_service_freshness=1
date_format=us
illegal_object_name_chars=()=
illegal_macro_output_chars=`~$^&
admin_email=XXXXXXXXX
execute_host_checks=1
service_inter_check_delay_method=s
use_retained_scheduling_info=0
accept_passive_host_checks=1
max_service_check_spread=5
host_inter_check_delay_method=s
max_host_check_spread=5
auto_reschedule_checks=0
obsess_over_hosts=1
check_host_freshness=1
host_freshness_check_interval=15
service_freshness_check_interval=15
use_regexp_matching=0
use_true_regexp_matching=0
event_broker_options=-1
daemon_dumps_core=0
host_perfdata_file_mode=a
service_perfdata_file_mode=a
host_perfdata_file_processing_interval=0
service_perfdata_file_processing_interval=0
object_cache_file=/usr/local/nagios/var/objects.cache
p1_file=/usr/bin/p1.pl
resource_file=/usr/local/nagios/etc/resource.cfg
cfg_dir=/usr/local/nagios/etc/objects