Hard State Reached prior to 3/3 check
-
JakeHatMacys
- Posts: 281
- Joined: Thu Sep 25, 2014 3:21 pm
Hard State Reached prior to 3/3 check
I've never seen this before, and it's causing off-hours pages.
I have the store switch shutting down checks for the store host group; I can send you a screenshot of that (preferably by PM). But for some reason, on its first check it hit a hard state:
Is there a setting I might have missed to avoid this at all costs? We never want this to happen.
- Box293
- Too Basu
- Posts: 5126
- Joined: Sun Feb 07, 2010 10:55 pm
- Location: Deniliquin, Australia
Re: Hard State Reached prior to 3/3 check
Can you post the host definition for this object please. Also, does this host have a parent and is that parent currently down?
Here is some reading material on the UNREACHABLE state:
http://nagios.sourceforge.net/docs/3_0/ ... ility.html
-
JakeHatMacys
- Posts: 281
- Joined: Thu Sep 25, 2014 3:21 pm
Re: Hard State Reached prior to 3/3 check
Box293 wrote:Can you post the host definition for this object please. Also, does this host have a parent and is that parent currently down?
Here is some reading material on the UNREACHABLE state:
http://nagios.sourceforge.net/docs/3_0/ ... ility.html
Yes, the host has a parent and it was down. The server itself went down again later but still alerted on attempt 1, going straight into a hard state. Also, our trap sender is set to alert only on host down, but UNREACHABLE still triggered an alert. So it looks like UNREACHABLE is still considered down in the SNMP trap sender's eyes?
Pic of Hard state hit again on 1/3 attempts on host down:
Object file:
###############################################################################
#
# Host configuration file
#
# Created by: Nagios Core Config Manager 2.3.1
# Date: 2015-05-27 08:51:36
# Version: Nagios 3.x config file
#
# --- DO NOT EDIT THIS FILE BY HAND ---
# Nagios CCM will overwrite all manual settings during the next update if you
# would like to edit files manually, place them in the 'static' directory or
# import your configs into the CCM by placing them in the 'import' directory.
#
###############################################################################
define host {
host_name ME639ASRFI21
use xiwizard_windowsserver_host
address ME639ASRFI21
parents ME639****************************
hostgroups Windows Servers
check_command check_tcp_445!!!!!!!!
max_check_attempts 3
check_interval 5
retry_interval 5
check_period xi_timeperiod_24x7
notification_interval 60
notification_period xi_timeperiod_24x7
notifications_enabled 0
icon_image win_server.png
statusmap_image win_server.png
_xiwizard windowsserver
register 1
}
###############################################################################
#
# Host configuration file
#
# END OF FILE
#
###############################################################################
Re: Hard State Reached prior to 3/3 check
I would like to request some additional information from you.
First, the host check command in question (check_tcp_445).
Next, your nagios settings:
Code: Select all
cat /usr/local/nagios/etc/nagios.cfg
Last, your template configuration: xiwizard_windowsserver_host
Please provide us the above. Obviously this isn't expected behavior, and I'm wondering if there's a misconfigured setting somewhere.
-
JakeHatMacys
- Posts: 281
- Joined: Thu Sep 25, 2014 3:21 pm
Re: Hard State Reached prior to 3/3 check
jolson wrote:I would like to request some additional information from you.
First, the host check command in question (check_tcp_445).
Next, your nagios settings:
Code: Select all
cat /usr/local/nagios/etc/nagios.cfg
Last, your template configuration: xiwizard_windowsserver_host
Please provide us the above. Obviously this isn't expected behavior, and I'm wondering if there's a misconfigured setting somewhere.
Code: Select all
define command {
command_name check_tcp_445
command_line $USER1$/check_tcp -H $HOSTADDRESS$ -p 445 $ARG2$
}
I saw host_check_timeout=30 in there, but wasn't sure whether it applies to a command we put in place of the default host check.
Code: Select all
[root@esu2v238 b161172]# cat /usr/local/nagios/etc/nagios.cfg
# MODIFIED
admin_email=root@localhost
admin_pager=root@localhost
translate_passive_host_checks=1
log_event_handlers=0
use_large_installation_tweaks=1
enable_environment_macros=0
# NDOUtils module
broker_module=/usr/local/nagios/bin/ndomod.o config_file=/usr/local/nagios/etc/ndomod.cfg
# PNP settings - bulk mode with NCPD
process_performance_data=1
# service performance data
service_perfdata_file=/usr/local/nagios/var/service-perfdata
service_perfdata_file_template=DATATYPE::SERVICEPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tSERVICEDESC::$SERVICEDESC$\tSERVICEPERFDATA::$SERVICEPERFDATA$\tSERVICECHECKCOMMAND::$SERVICECHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tSERVICESTATE::$SERVICESTATE$\tSERVICESTATETYPE::$SERVICESTATETYPE$\tSERVICEOUTPUT::$SERVICEOUTPUT$
service_perfdata_file_mode=a
service_perfdata_file_processing_interval=15
service_perfdata_file_processing_command=process-service-perfdata-file-bulk
# host performance data
host_perfdata_file=/usr/local/nagios/var/host-perfdata
host_perfdata_file_template=DATATYPE::HOSTPERFDATA\tTIMET::$TIMET$\tHOSTNAME::$HOSTNAME$\tHOSTPERFDATA::$HOSTPERFDATA$\tHOSTCHECKCOMMAND::$HOSTCHECKCOMMAND$\tHOSTSTATE::$HOSTSTATE$\tHOSTSTATETYPE::$HOSTSTATETYPE$\tHOSTOUTPUT::$HOSTOUTPUT$
host_perfdata_file_mode=a
host_perfdata_file_processing_interval=15
host_perfdata_file_processing_command=process-host-perfdata-file-bulk
# OBJECTS - UNMODIFIED
#cfg_file=/usr/local/nagios/etc/objects/commands.cfg
#cfg_file=/usr/local/nagios/etc/objects/contacts.cfg
#cfg_file=/usr/local/nagios/etc/objects/localhost.cfg
#cfg_file=/usr/local/nagios/etc/objects/templates.cfg
#cfg_file=/usr/local/nagios/etc/objects/timeperiods.cfg
# STATIC OBJECT DEFINITIONS (THESE DON'T GET EXPORTED/IMPORTED BY NAGIOSQL)
cfg_dir=/usr/local/nagios/etc/static
# OBJECTS EXPORTED FROM NAGIOSQL
cfg_file=/usr/local/nagios/etc/contacttemplates.cfg
cfg_file=/usr/local/nagios/etc/contactgroups.cfg
cfg_file=/usr/local/nagios/etc/contacts.cfg
cfg_file=/usr/local/nagios/etc/timeperiods.cfg
cfg_file=/usr/local/nagios/etc/commands.cfg
cfg_file=/usr/local/nagios/etc/hostgroups.cfg
cfg_file=/usr/local/nagios/etc/servicegroups.cfg
cfg_file=/usr/local/nagios/etc/hosttemplates.cfg
cfg_file=/usr/local/nagios/etc/servicetemplates.cfg
cfg_file=/usr/local/nagios/etc/servicedependencies.cfg
cfg_file=/usr/local/nagios/etc/serviceescalations.cfg
cfg_file=/usr/local/nagios/etc/hostdependencies.cfg
cfg_file=/usr/local/nagios/etc/hostescalations.cfg
cfg_file=/usr/local/nagios/etc/hostextinfo.cfg
cfg_file=/usr/local/nagios/etc/serviceextinfo.cfg
cfg_dir=/usr/local/nagios/etc/hosts
cfg_dir=/usr/local/nagios/etc/services
# GLOBAL EVENT HANDLERS
global_host_event_handler=xi_host_event_handler
global_service_event_handler=xi_service_event_handler
# UNMODIFIED
accept_passive_host_checks=1
accept_passive_service_checks=1
additional_freshness_latency=15
auto_reschedule_checks=1
auto_rescheduling_interval=30
auto_rescheduling_window=45
bare_update_check=0
cached_host_check_horizon=15
cached_service_check_horizon=15
check_external_commands=1
check_for_orphaned_hosts=1
check_for_orphaned_services=1
check_for_updates=1
check_host_freshness=0
check_result_path=/usr/local/nagios/var/spool/checkresults
check_result_reaper_frequency=10
check_service_freshness=1
command_file=/usr/local/nagios/var/rw/nagios.cmd
daemon_dumps_core=0
date_format=us
debug_file=/usr/local/nagios/var/nagios.debug
debug_level=0
debug_verbosity=1
enable_event_handlers=1
enable_flap_detection=1
enable_notifications=1
enable_predictive_host_dependency_checks=1
enable_predictive_service_dependency_checks=1
event_broker_options=-1
event_handler_timeout=30
execute_host_checks=1
execute_service_checks=1
high_host_flap_threshold=20.0
high_service_flap_threshold=20.0
host_check_timeout=30
host_freshness_check_interval=60
host_inter_check_delay_method=s
illegal_macro_output_chars=`~$&|'"<>
illegal_object_name_chars=`~!$%^&*|'"<>?,()=
interval_length=60
lock_file=/usr/local/nagios/var/nagios.lock
log_archive_path=/usr/local/nagios/var/archives
log_external_commands=0
log_file=/usr/local/nagios/var/nagios.log
log_host_retries=1
log_initial_states=0
log_notifications=1
log_passive_checks=0
log_rotation_method=d
log_service_retries=1
low_host_flap_threshold=5.0
low_service_flap_threshold=5.0
max_check_result_file_age=3600
max_check_result_reaper_time=30
max_concurrent_checks=0
max_debug_file_size=1000000
max_host_check_spread=30
max_service_check_spread=30
nagios_group=nagios
nagios_user=nagios
notification_timeout=30
object_cache_file=/usr/local/nagios/var/objects.cache
obsess_over_hosts=0
obsess_over_services=0
ocsp_timeout=5
passive_host_checks_are_soft=0
perfdata_timeout=5
precached_object_file=/usr/local/nagios/var/objects.precache
resource_file=/usr/local/nagios/etc/resource.cfg
retained_contact_host_attribute_mask=0
retained_contact_service_attribute_mask=0
retained_host_attribute_mask=0
retained_process_host_attribute_mask=0
retained_process_service_attribute_mask=0
retained_service_attribute_mask=0
retain_state_information=1
retention_update_interval=60
service_check_timeout=60
service_freshness_check_interval=60
service_inter_check_delay_method=s
service_interleave_factor=s
soft_state_dependencies=0
state_retention_file=/usr/local/nagios/var/retention.dat
status_file=/usr/local/nagios/var/status.dat
status_update_interval=10
temp_file=/usr/local/nagios/var/nagios.tmp
temp_path=/tmp
use_aggressive_host_checking=0
use_regexp_matching=0
use_retained_program_state=1
use_retained_scheduling_info=1
use_syslog=1
use_true_regexp_matching=0
broker_module=/opt/OV/HPBsmIntNagios/lib64/libbsmintneb4.so
enable_flap_detection=0
???
Code: Select all
define host {
name xiwizard_windowsserver_host
check_command check_xi_host_ping!3000.0!80%!5000.0!100%!!!!
use xiwizard_generic_host
active_checks_enabled 1
register 0
}
Re: Hard State Reached prior to 3/3 check
I wonder why you have $ARG2$, not $ARG1$ here... Why do you have an arg at all if you are not passing an argument?
command_line $USER1$/check_tcp -H $HOSTADDRESS$ -p 445 $ARG2$
On the timeout issue - the "default" timeout of the "check_tcp" plugin is 10 seconds. You can increase it to 30, i.e.:
Code: Select all
command_line $USER1$/check_tcp -H $HOSTADDRESS$ -p 445 -t 30
It won't make sense to increase it more than 30 as the "global" timeout is set at 30 in the nagios.cfg (unless you bump up this one as well).
Code: Select all
host_check_timeout=30
Yes, you can disable flapping globally by setting up:
Code: Select all
enable_flap_detection=0
and restarting nagios.
...or you can do it from the GUI by clicking on the "x" (Disable) action button next to "Flap Detection" under the "Monitoring Engine Status/Monitoring Engine Process".
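The nagios.cfg posted earlier in this thread sets enable_flap_detection twice (once to 1, once to 0), so before relying on either value it may help to list which directives are repeated. Below is a minimal sketch, assuming the file consists of plain key=value lines with # comments. The function name find_repeated_directives is hypothetical, and the sketch does not decide which of two conflicting values Nagios actually honors.

```python
from collections import defaultdict


def find_repeated_directives(cfg_text):
    """Return {directive: [values...]} for directives that appear more than once."""
    seen = defaultdict(list)
    for raw in cfg_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comment lines, and anything that is not key=value.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        seen[key.strip()].append(value.strip())
    return {key: values for key, values in seen.items() if len(values) > 1}


if __name__ == "__main__":
    # Small stand-in for the posted file; on a real server, read nagios.cfg instead.
    sample = "\n".join([
        "enable_flap_detection=1",
        "host_check_timeout=30",
        "# UNMODIFIED",
        "enable_flap_detection=0",
    ])
    for name, values in find_repeated_directives(sample).items():
        print(name, values)
```

Directives such as cfg_file, cfg_dir and broker_module are expected to repeat, so only the single-valued ones in the report are worth a second look.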
-
JakeHatMacys
- Posts: 281
- Joined: Thu Sep 25, 2014 3:21 pm
Re: Hard State Reached prior to 3/3 check
lmiltchev wrote:I wonder why you have $ARG2$, not $ARG1$ here... Why do you have an arg at all if you are not passing an argument?
command_line $USER1$/check_tcp -H $HOSTADDRESS$ -p 445 $ARG2$
On the timeout issue - the "default" timeout of the "check_tcp" plugin is 10 seconds. You can increase it to 30, i.e.:
Code: Select all
command_line $USER1$/check_tcp -H $HOSTADDRESS$ -p 445 -t 30
It won't make sense to increase it more than 30 as the "global" timeout is set at 30 in the nagios.cfg (unless you bump up this one as well).
Code: Select all
host_check_timeout=30
Yes, you can disable flapping globally by setting up:
Code: Select all
enable_flap_detection=0
and restarting nagios.
...or you can do it from the GUI by clicking on the "x" (Disable) action button next to "Flap Detection" under the "Monitoring Engine Status/Monitoring Engine Process".
I got rid of the ARG on the command and put in the 30 second timeout. I didn't set that up originally, so I couldn't tell you why the ARG was put in; I think they were just mimicking another command. I also disabled flapping since we shouldn't need it. Really, I'm just trying to eliminate anything that would change the behavior of the monitor, as some of the hosts were being checked sooner than the retry interval.
Another thing my boss wants to explore: is there a way to reset the soft counter on the child when the parent goes down or a host dependency is triggered? We had parent/child set up, but we also set up a host group with all the store servers that is dependent on the Store Switch coming back good. Otherwise it holds off on checking the store servers. The problem is that the switch will bounce around a bit, and a server being held at 2/3 will sometimes resume at its 2/3 interval, hit another bad check, go to 3/3 and create a ticket. We want to know if there's a way to reset the child to 1/3 any time that group dependency is triggered or the parent goes down. (We did both since parent/child wasn't working 100% of the time.)
Re: Hard State Reached prior to 3/3 check
The first thing that comes to mind would be event handlers submitting a passive result of OK (or whatever state you want, maybe UNKNOWN is better) but that would get messy quickly with hostgroups and dependencies/parents in the way.
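As a rough illustration of the passive-result idea: Nagios reads external commands from its command file (command_file=/usr/local/nagios/var/rw/nagios.cmd in the nagios.cfg posted earlier), and PROCESS_HOST_CHECK_RESULT is the external command that submits a passive host result. Below is a minimal sketch of a helper that an event handler could call. The function names and the output text are hypothetical, status 0 (UP) is used because host states are UP/DOWN/UNREACHABLE rather than OK/UNKNOWN, and whether this resets the attempt counter the way you want would need testing on a non-production host.

```python
import time


def build_passive_host_result(host_name, status_code, output, now=None):
    """Format one PROCESS_HOST_CHECK_RESULT external command line."""
    timestamp = int(time.time() if now is None else now)
    return "[{0}] PROCESS_HOST_CHECK_RESULT;{1};{2};{3}\n".format(
        timestamp, host_name, status_code, output
    )


def submit(command_file, line):
    """Write one external command to the Nagios command file (a named pipe)."""
    with open(command_file, "w") as handle:
        handle.write(line)


if __name__ == "__main__":
    # Host name taken from the host definition posted in this thread.
    line = build_passive_host_result("ME639ASRFI21", 0, "Reset by event handler")
    print(line, end="")
    # On the Nagios server itself, the line would then be submitted with:
    # submit("/usr/local/nagios/var/rw/nagios.cmd", line)
```

The posted nagios.cfg already has check_external_commands=1 and accept_passive_host_checks=1, which this approach relies on. It also has passive_host_checks_are_soft=0, so passive host results are treated as hard states, which is worth keeping in mind when choosing the state to submit.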
Former Nagios employee
-
JakeHatMacys
- Posts: 281
- Joined: Thu Sep 25, 2014 3:21 pm
Re: Hard State Reached prior to 3/3 check
Back to the topic at hand, I have a similar issue again. I changed our switches to alert on the first check, but it's still showing up as Soft:
The goal was to keep the dependent servers from being checked for 1 hour if this switch goes down, since we get thrown over to the DSL backup and then the link comes back but we sometimes still can't get through to the servers. So I want to give them an hour before we check again. Right now that's not happening:
As you can see, the switch is being checked again before 60 minutes have passed.
Could you tell me what I'm doing wrong?
Code: Select all
###############################################################################
#
# Host configuration file
#
# Created by: Nagios Core Config Manager 2.3.1
# Date: 2015-06-04 13:50:08
# Version: Nagios 3.x config file
#
# --- DO NOT EDIT THIS FILE BY HAND ---
# Nagios CCM will overwrite all manual settings during the next update if you
# would like to edit files manually, place them in the 'static' directory or
# import your configs into the CCM by placing them in the 'import' directory.
#
###############################################################################
define host {
host_name ME514ANGWY.network.federated.fds
use xiwizard_genericnetdevice_host
alias ME514-Store-Switch
address ME514ANGWY.network.federated.fds
hostgroups Store-Switches
max_check_attempts 1
check_interval 3
retry_interval 60
check_period xi_timeperiod_24x7
flap_detection_enabled 0
notification_interval 60
notification_period xi_timeperiod_24x7
notifications_enabled 0
icon_image network_node.png
statusmap_image network_node.png
_xiwizard genericnetdevice
register 1
}
###############################################################################
#
# Host configuration file
#
# END OF FILE
#
###############################################################################
-
jdalrymple
- Skynet Drone
- Posts: 2620
- Joined: Wed Feb 11, 2015 1:56 pm
Re: Hard State Reached prior to 3/3 check
Are there any other services that depend on these? I noticed this in your nagios.cfg:
Code: Select all
enable_predictive_host_dependency_checks=1
enable_predictive_service_dependency_checks=1
So if anything depends on a host or service and it's noticed that it's down, an on-demand check will be spawned. My guess is that it is on-demand checks that are causing this behavior, if I'm understanding the problem fully.