Hard State Reached prior to 3/3 check

Post by **JakeHatMacys** » Tue May 26, 2015 8:50 pm

I've never seen this before, and it's causing off-hours pages:

I have the store switch suppressing checks for the store host group (I can send you a screenshot of that, preferably by PM). But for some reason the host hit a hard state on its first check:

Capture.JPG

Is there a setting I might have missed that would prevent this? We never want this to happen.

Post by **Box293** » Tue May 26, 2015 11:22 pm

Can you post the host definition for this object please. Also, does this host have a parent and is that parent currently down?

Here is some reading material on the UNREACHABLE state:

http://nagios.sourceforge.net/docs/3_0/ ... ility.html
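As a rough illustration of why the parent's state matters, here is a minimal Python sketch of the documented reachability rule: a host whose check fails is DOWN if at least one parent is UP (or it has no parents), and UNREACHABLE if every parent is DOWN or UNREACHABLE. The function name `classify_host` is hypothetical, and this is a simplified model, not the Nagios Core implementation.

```python
from typing import Iterable


def classify_host(check_ok: bool, parent_states: Iterable[str]) -> str:
    """Classify a host as UP, DOWN, or UNREACHABLE.

    check_ok      -- True if the host's own check returned OK
    parent_states -- current states of the host's parents ("UP", "DOWN", "UNREACHABLE")
    """
    if check_ok:
        return "UP"
    parents = list(parent_states)
    # A host without parents can only be DOWN when its check fails.
    if not parents:
        return "DOWN"
    # If any parent is UP, the path to the host works, so the host itself is DOWN.
    if any(state == "UP" for state in parents):
        return "DOWN"
    # Every parent is DOWN or UNREACHABLE: the host cannot be reached at all.
    return "UNREACHABLE"


if __name__ == "__main__":
    print(classify_host(False, ["DOWN"]))  # parent down, child check failed
    print(classify_host(False, ["UP"]))    # parent up, child check failed
```

Under this model, a child whose only parent is down is reported as UNREACHABLE rather than DOWN.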

Post by **JakeHatMacys** » Wed May 27, 2015 7:53 am

Box293 wrote:Can you post the host definition for this object please. Also, does this host have a parent and is that parent currently down?

Here is some reading material on the UNREACHABLE state:

http://nagios.sourceforge.net/docs/3_0/ ... ility.html

Yes, the host has a parent and it was down. The server itself went down again later but still alerted on 1 attempt, straight into a hard state. Also, our trap sender is set to alert only on host down, but UNREACHABLE still triggered an alert. So it looks like UNREACHABLE is still considered down in the SNMP trap sender's eyes?

Pic of Hard state hit again on 1/3 attempts on host down:

Capture2.JPG

Object file:

###############################################################################
#
# Host configuration file
#
# Created by: Nagios Core Config Manager 2.3.1
# Date: 2015-05-27 08:51:36
# Version: Nagios 3.x config file
#
# --- DO NOT EDIT THIS FILE BY HAND ---
# Nagios CCM will overwrite all manual settings during the next update if you
# would like to edit files manually, place them in the 'static' directory or
# import your configs into the CCM by placing them in the 'import' directory.
#
###############################################################################

define host {
	host_name			ME639ASRFI21
	use				xiwizard_windowsserver_host
	address				ME639ASRFI21
	parents				ME639****************************
	hostgroups			Windows Servers
	check_command			check_tcp_445!!!!!!!!
	max_check_attempts		3
	check_interval			5
	retry_interval			5
	check_period			xi_timeperiod_24x7
	notification_interval		60
	notification_period		xi_timeperiod_24x7
	notifications_enabled		0
	icon_image			win_server.png
	statusmap_image			win_server.png
	_xiwizard			windowsserver
	register			1
	}

###############################################################################
#
# Host configuration file
#
# END OF FILE
#
###############################################################################

Post by **jolson** » Wed May 27, 2015 4:48 pm

I would like to request some additional information from you.

First, the host check command in question (check_tcp_445).
Next, your nagios settings:

Code: Select all

cat /usr/local/nagios/etc/nagios.cfg

Last, your template configuration: xiwizard_windowsserver_host

Please provide us the above. Obviously this isn't expected behavior, and I'm wondering if there's a misconfigured setting somewhere.
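For reference, the behavior normally expected with max_check_attempts set to 3 can be sketched as below. This is a simplified model of the documented soft/hard rule only: it ignores on-demand checks, parents, dependencies, and passive results, and the function name `failure_state_types` is hypothetical.

```python
from typing import Iterable, List


def failure_state_types(results: Iterable[bool], max_check_attempts: int) -> List[str]:
    """Label each check result in sequence (True = OK, False = non-OK).

    A non-OK result stays SOFT until the host has been checked
    max_check_attempts times in a row without recovering, then turns HARD.
    """
    labels = []
    attempt = 0
    for ok in results:
        if ok:
            # Any OK result resets the attempt counter.
            attempt = 0
            labels.append("OK")
            continue
        attempt = min(attempt + 1, max_check_attempts)
        kind = "HARD" if attempt >= max_check_attempts else "SOFT"
        labels.append(f"{kind} {attempt}/{max_check_attempts}")
    return labels


if __name__ == "__main__":
    print(failure_state_types([False, False, False], 3))
```

In this model a hard state on attempt 1/3 should not occur from scheduled checks alone, which is why the configuration is worth reviewing.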

Post by **JakeHatMacys** » Thu May 28, 2015 7:56 am

jolson wrote:I would like to request some additional information from you.

First, the host check command in question (check_tcp_445).
Next, your nagios settings:
Code: Select all
cat /usr/local/nagios/etc/nagios.cfg
Last, your template configuration: xiwizard_windowsserver_host

Please provide us the above. Obviously this isn't expected behavior, and I'm wondering if there's a misconfigured setting somewhere.

Code: Select all

define command {
       command_name                  		check_tcp_445
       command_line                  		$USER1$/check_tcp -H $HOSTADDRESS$ -p 445 $ARG2$
}

Is there an easy way to change the timeout on this command? Do I need to set it on the command line, or is it a global setting somewhere, like in nagios.cfg? (I'm using check_tcp_445 as the host check command, replacing ping.)

I saw host_check_timeout=30 but wasn't sure whether it applies to a command we put in place of the default host check.

Code: Select all

[root@esu2v238 b161172]# cat /usr/local/nagios/etc/nagios.cfg
# MODIFIED
admin_email=root@localhost
admin_pager=root@localhost
translate_passive_host_checks=1
log_event_handlers=0
use_large_installation_tweaks=1
enable_environment_macros=0


# NDOUtils module
broker_module=/usr/local/nagios/bin/ndomod.o config_file=/usr/local/nagios/etc/ndomod.cfg


# PNP settings - bulk mode with NCPD
process_performance_data=1
# service performance data
service_perfdata_file=/usr/local/nagios/var/service-perfdata
service_perfdata_file_template=DATATYPE::SERVICEPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tSERVICEDESC::$SERVICEDESC$\tSERVICEPERFDATA::$SERVICEPERFDATA$\tSERVICECHECKCOMMAND::$SERVICECHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tSERVICESTATE::$SERVICESTATE$\tSERVICESTATETYPE::$SERVICESTATETYPE$\tSERVICEOUTPUT::$SERVICEOUTPUT$
service_perfdata_file_mode=a
service_perfdata_file_processing_interval=15
service_perfdata_file_processing_command=process-service-perfdata-file-bulk
# host performance data
host_perfdata_file=/usr/local/nagios/var/host-perfdata
host_perfdata_file_template=DATATYPE::HOSTPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tHOSTPERFDATA::$HOSTPERFDATA$\tHOSTCHECKCOMMAND::$HOSTCHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tHOSTOUTPUT::$HOSTOUTPUT$
host_perfdata_file_mode=a
host_perfdata_file_processing_interval=15
host_perfdata_file_processing_command=process-host-perfdata-file-bulk


# OBJECTS - UNMODIFIED
#cfg_file=/usr/local/nagios/etc/objects/commands.cfg
#cfg_file=/usr/local/nagios/etc/objects/contacts.cfg
#cfg_file=/usr/local/nagios/etc/objects/localhost.cfg
#cfg_file=/usr/local/nagios/etc/objects/templates.cfg
#cfg_file=/usr/local/nagios/etc/objects/timeperiods.cfg


# STATIC OBJECT DEFINITIONS (THESE DON'T GET EXPORTED/IMPORTED BY NAGIOSQL)
cfg_dir=/usr/local/nagios/etc/static

# OBJECTS EXPORTED FROM NAGIOSQL
cfg_file=/usr/local/nagios/etc/contacttemplates.cfg
cfg_file=/usr/local/nagios/etc/contactgroups.cfg
cfg_file=/usr/local/nagios/etc/contacts.cfg
cfg_file=/usr/local/nagios/etc/timeperiods.cfg
cfg_file=/usr/local/nagios/etc/commands.cfg
cfg_file=/usr/local/nagios/etc/hostgroups.cfg
cfg_file=/usr/local/nagios/etc/servicegroups.cfg
cfg_file=/usr/local/nagios/etc/hosttemplates.cfg
cfg_file=/usr/local/nagios/etc/servicetemplates.cfg
cfg_file=/usr/local/nagios/etc/servicedependencies.cfg
cfg_file=/usr/local/nagios/etc/serviceescalations.cfg
cfg_file=/usr/local/nagios/etc/hostdependencies.cfg
cfg_file=/usr/local/nagios/etc/hostescalations.cfg
cfg_file=/usr/local/nagios/etc/hostextinfo.cfg
cfg_file=/usr/local/nagios/etc/serviceextinfo.cfg
cfg_dir=/usr/local/nagios/etc/hosts
cfg_dir=/usr/local/nagios/etc/services

# GLOBAL EVENT HANDLERS
global_host_event_handler=xi_host_event_handler
global_service_event_handler=xi_service_event_handler



# UNMODIFIED
accept_passive_host_checks=1
accept_passive_service_checks=1
additional_freshness_latency=15
auto_reschedule_checks=1
auto_rescheduling_interval=30
auto_rescheduling_window=45
bare_update_check=0
cached_host_check_horizon=15
cached_service_check_horizon=15
check_external_commands=1
check_for_orphaned_hosts=1
check_for_orphaned_services=1
check_for_updates=1
check_host_freshness=0
check_result_path=/usr/local/nagios/var/spool/checkresults
check_result_reaper_frequency=10
check_service_freshness=1
command_file=/usr/local/nagios/var/rw/nagios.cmd
daemon_dumps_core=0
date_format=us
debug_file=/usr/local/nagios/var/nagios.debug
debug_level=0
debug_verbosity=1
enable_event_handlers=1
enable_flap_detection=1
enable_notifications=1
enable_predictive_host_dependency_checks=1
enable_predictive_service_dependency_checks=1
event_broker_options=-1
event_handler_timeout=30
execute_host_checks=1
execute_service_checks=1
high_host_flap_threshold=20.0
high_service_flap_threshold=20.0
host_check_timeout=30
host_freshness_check_interval=60
host_inter_check_delay_method=s
illegal_macro_output_chars=`~$&|'"<>
illegal_object_name_chars=`~!$%^&*|'"<>?,()=
interval_length=60
lock_file=/usr/local/nagios/var/nagios.lock
log_archive_path=/usr/local/nagios/var/archives
log_external_commands=0
log_file=/usr/local/nagios/var/nagios.log
log_host_retries=1
log_initial_states=0
log_notifications=1
log_passive_checks=0
log_rotation_method=d
log_service_retries=1
low_host_flap_threshold=5.0
low_service_flap_threshold=5.0
max_check_result_file_age=3600
max_check_result_reaper_time=30
max_concurrent_checks=0
max_debug_file_size=1000000
max_host_check_spread=30
max_service_check_spread=30
nagios_group=nagios
nagios_user=nagios
notification_timeout=30
object_cache_file=/usr/local/nagios/var/objects.cache
obsess_over_hosts=0
obsess_over_services=0
ocsp_timeout=5
passive_host_checks_are_soft=0
perfdata_timeout=5
precached_object_file=/usr/local/nagios/var/objects.precache
resource_file=/usr/local/nagios/etc/resource.cfg
retained_contact_host_attribute_mask=0
retained_contact_service_attribute_mask=0
retained_host_attribute_mask=0
retained_process_host_attribute_mask=0
retained_process_service_attribute_mask=0
retained_service_attribute_mask=0
retain_state_information=1
retention_update_interval=60
service_check_timeout=60
service_freshness_check_interval=60
service_inter_check_delay_method=s
service_interleave_factor=s
soft_state_dependencies=0
state_retention_file=/usr/local/nagios/var/retention.dat
status_file=/usr/local/nagios/var/status.dat
status_update_interval=10
temp_file=/usr/local/nagios/var/nagios.tmp
temp_path=/tmp
use_aggressive_host_checking=0
use_regexp_matching=0
use_retained_program_state=1
use_retained_scheduling_info=1
use_syslog=1
use_true_regexp_matching=0
broker_module=/opt/OV/HPBsmIntNagios/lib64/libbsmintneb4.so

And just a random question, if I want to disable flapping can I just change this guy:

enable_flap_detection=0

???

Code: Select all

define host {
       name                          		xiwizard_windowsserver_host
       check_command                 		check_xi_host_ping!3000.0!80%!5000.0!100%!!!!
       use                           		xiwizard_generic_host
       active_checks_enabled         		1
       register                    		0

}

Post by **lmiltchev** » Thu May 28, 2015 12:51 pm

command_line $USER1$/check_tcp -H $HOSTADDRESS$ -p 445 $ARG2$

I wonder why you have $ARG2$, not $ARG1$ here... Why do you have arg at all if you are not passing an argument?

On the timeout issue - the "default" timeout of the "check_tcp" plugin is 10 seconds. You can increase it to 30, i.e.:

Code: Select all

command_line                        $USER1$/check_tcp -H $HOSTADDRESS$ -p 445 -t 30

It won't make sense to increase it more than 30 as the "global" timeout is set at 30 in the nagios.cfg (unless you bump up this one as well).

Code: Select all

host_check_timeout=30
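To make the relationship between the two timeouts concrete, here is a small Python sketch that reads the -t value from a command_line and compares it against host_check_timeout. The helper names `plugin_timeout` and `fits_global_timeout` are hypothetical, and the 10-second default is the check_tcp default mentioned above; treat this as an illustration, not a Nagios tool.

```python
import shlex


def plugin_timeout(command_line: str, default: int = 10) -> int:
    """Return the plugin timeout (seconds) given by -t / --timeout, else the default."""
    tokens = shlex.split(command_line)
    for i, tok in enumerate(tokens):
        if tok in ("-t", "--timeout") and i + 1 < len(tokens):
            return int(tokens[i + 1])
        if tok.startswith("--timeout="):
            return int(tok.split("=", 1)[1])
    return default


def fits_global_timeout(command_line: str, host_check_timeout: int) -> bool:
    """True if the plugin's own timeout does not exceed the global host check timeout."""
    return plugin_timeout(command_line) <= host_check_timeout


if __name__ == "__main__":
    cmd = "$USER1$/check_tcp -H $HOSTADDRESS$ -p 445 -t 30"
    print(plugin_timeout(cmd), fits_global_timeout(cmd, 30))
```

With -t 30 and host_check_timeout=30 the check fits; a larger -t value would be cut off by the global timeout unless that is raised too.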

Yes, you can disable flapping globally by setting up:

Code: Select all

enable_flap_detection=0

and restarting nagios.

...or you can do it from the GUI by clicking on the "x" (Disable) action button next to "Flap Detection" under the "Monitoring Engine Status/Monitoring Engine Process".

Post by **JakeHatMacys** » Mon Jun 01, 2015 2:21 pm

lmiltchev wrote:
command_line $USER1$/check_tcp -H $HOSTADDRESS$ -p 445 $ARG2$
I wonder why you have $ARG2$, not $ARG1$ here... Why do you have arg at all if you are not passing an argument?

On the timeout issue - the "default" timeout of the "check_tcp" plugin is 10 seconds. You increase it to 30, i.e.:
Code: Select all
command_line                        $USER1$/check_tcp -H $HOSTADDRESS$ -p 445 -t 30
It won't make sense to increase it more than 30 as the "global" timeout is set at 30 in the nagios.cfg (unless you bump up this one as well).
Code: Select all
host_check_timeout=30
Yes, you can disable flapping globally by setting up:
Code: Select all
enable_flap_detection=0
and restarting nagios.

...or you can do it from the GUI by clicking on the "x" (Disable) action button next to "Flap Detection" under the "Monitoring Engine Status/Monitoring Engine Process".

I got rid of the ARG on the command and put in the 30-second timeout. I didn't set that up originally, so I couldn't tell you why the ARG was put in; I think they were just mimicking another command. I also disabled flapping since we shouldn't need it. Really I'm just trying to eliminate anything that would change the behavior of the monitor, as some of them were making checks sooner than the retry interval.

Another thing my boss wants to explore: is there a way to reset the soft counter on the child when the parent goes down or a host dependency is triggered? We had parent/child set up, but we also set up a host group with all the store servers that is dependent on the store switch coming back good. Otherwise it holds off on checking the store servers. The problem is that the switch will bounce around a bit, and servers being held at 2/3 will sometimes resume at 2/3 and then hit another bad check, going to 3/3 and ticketing. We want to know if there's a way to reset the child to 1/3 any time that group dependency is triggered or the parent goes down. (We did both since parent/child wasn't working 100% of the time.)

Post by **tmcdonald** » Tue Jun 02, 2015 1:51 pm

The first thing that comes to mind would be event handlers submitting a passive result of OK (or whatever state you want, maybe UNKNOWN is better) but that would get messy quickly with hostgroups and dependencies/parents in the way.
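A minimal sketch of what such an event handler might submit, assuming it writes a PROCESS_HOST_CHECK_RESULT external command to the command file listed in the nagios.cfg posted earlier. The helper names `passive_host_result` and `submit` and the output text are hypothetical. For hosts, the status codes are 0 (UP), 1 (DOWN), and 2 (UNREACHABLE). Whether a passive result resets the attempt counter the way you want is not confirmed here and would need testing.

```python
import time
from typing import Optional


def passive_host_result(host_name: str, status_code: int, output: str,
                        now: Optional[int] = None) -> str:
    """Format a PROCESS_HOST_CHECK_RESULT external command line."""
    timestamp = int(time.time()) if now is None else now
    return f"[{timestamp}] PROCESS_HOST_CHECK_RESULT;{host_name};{status_code};{output}\n"


def submit(command: str,
           command_file: str = "/usr/local/nagios/var/rw/nagios.cmd") -> None:
    """Write the command to the Nagios external command file (a named pipe)."""
    with open(command_file, "w") as pipe:
        pipe.write(command)


if __name__ == "__main__":
    # Print instead of submitting so the sketch is safe to run anywhere.
    print(passive_host_result("ME639ASRFI21", 0, "Reset by parent event handler"), end="")
```

Note that the posted nagios.cfg has passive_host_checks_are_soft=0, so passive host results are treated as hard states.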

Post by **JakeHatMacys** » Thu Jun 04, 2015 12:52 pm

Back to the topic at hand, I have a similar issue again. I changed our switches to alert on the first check, but the state is still showing up as Soft:

Capture.JPG

The goal was to keep the dependent servers from being checked for 1 hour if this switch goes down, since we're getting failed over to the DSL backup and then it comes back, but we sometimes still can't get through to the servers. So I want to give them an hour before we check again. Right now that's not happening:

Capture1.JPG

As you can see, the switch is being checked again before 60 minutes have passed.

Could you tell me what I'm doing wrong?

Code: Select all

###############################################################################
#
# Host configuration file
#
# Created by: Nagios Core Config Manager 2.3.1
# Date:	      2015-06-04 13:50:08
# Version:    Nagios 3.x config file
#
# --- DO NOT EDIT THIS FILE BY HAND --- 
# Nagios CCM will overwrite all manual settings during the next update if you 
# would like to edit files manually, place them in the 'static' directory or 
# import your configs into the CCM by placing them in the 'import' directory.
#
###############################################################################

define host {
	host_name			ME514ANGWY.network.federated.fds
	use				xiwizard_genericnetdevice_host
	alias				ME514-Store-Switch
	address				ME514ANGWY.network.federated.fds
	hostgroups			Store-Switches
	max_check_attempts		1
	check_interval			3
	retry_interval			60
	check_period			xi_timeperiod_24x7
	flap_detection_enabled		0
	notification_interval		60
	notification_period		xi_timeperiod_24x7
	notifications_enabled		0
	icon_image			network_node.png
	statusmap_image			network_node.png
	_xiwizard			genericnetdevice
	register			1
	}	

###############################################################################
#
# Host configuration file
#
# END OF FILE
#
###############################################################################

Post by **jdalrymple** » Thu Jun 04, 2015 1:02 pm

Are there any other services that depend on these? I noticed this in your nagios.cfg:

Code: Select all

enable_predictive_host_dependency_checks=1
enable_predictive_service_dependency_checks=1

So if anything depends on a host or service and Nagios notices that it's down, an on-demand check will be spawned. My guess is that on-demand checks are causing this behavior, if I'm understanding the problem fully.
