Check goes immediately to HARD Host Down State

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
IMTECH
Posts: 53
Joined: Fri Nov 25, 2011 6:35 am

Check goes immediately to HARD Host Down State

Post by IMTECH »

Hi,

We are having problems with our Nagios XI installation (System Profile output attached). We had a severe crash yesterday. Everything seems to work fine again, except that Host Down notifications are now sent on the first check instead of the fifth. I checked via SQL, and the running configuration has max_check_attempts = 5. Attached is a file showing the State History for one host.
Why would Nagios XI ignore the max_check_attempts value and go immediately to a hard state?
Do you need any additional information to look into this issue?
Kind regards
Max

Update: Adding another picture showing more State History entries, and clarifying that State Types is set to 'Both'.
We do get notifications for each of these UP/DOWN events, each on the first attempt.
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: Check goes immediately to HARD Host Down State

Post by tgriep »

Can you post what happened to the server when it crashed and how it was fixed?

Maybe the crash corrupted the running config. To verify that, can you look in the following files for the hosts that are going immediately down and see if the settings are correct?

Code:

/usr/local/nagios/var/objects.cache
/usr/local/nagios/var/status.dat
Post what you find in those files so we can view the entries.
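
Both files are large, so pulling out just the blocks for one host by script can save time. The following is a minimal sketch, assuming the block layout that Nagios writes to these files (`define host { ... }` in objects.cache, `hoststatus { ... }` in status.dat). The function name `find_host_blocks` and the example host name are hypothetical, not part of Nagios.

```python
import re


def find_host_blocks(text, host_name, block_type="define host"):
    """Return the bodies of all `block_type { ... }` blocks for one host.

    Works for objects.cache (block_type="define host", "key value" lines)
    and status.dat (block_type="hoststatus", "key=value" lines).
    """
    # Match from the opening brace up to a closing brace on its own line.
    pattern = re.compile(re.escape(block_type) + r"\s*\{(.*?)\n\s*\}", re.DOTALL)
    found = []
    for match in pattern.finditer(text):
        body = match.group(1)
        for line in body.splitlines():
            # Split the key from the value on the first "=" or whitespace run.
            parts = re.split(r"[=\s]+", line.strip(), maxsplit=1)
            if len(parts) == 2 and parts[0] == "host_name" and parts[1] == host_name:
                found.append(body.strip())
                break
    return found
```

For example, `find_host_blocks(open("/usr/local/nagios/var/status.dat").read(), "myhost", "hoststatus")` would return the status entry for a host named `myhost`.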
Be sure to check out our Knowledgebase for helpful articles and solutions!
IMTECH
Posts: 53
Joined: Fri Nov 25, 2011 6:35 am

Re: Check goes immediately to HARD Host Down State

Post by IMTECH »

Hi,

We do not fully understand what caused the problems yet. We use a setup that includes mod_gearman and several workers. It looks like there was a communication problem between the workers and our master server, or a general network outage, which resulted in checks not being accepted or processed by the workers. The master itself would have been fine, but we restarted both the master and the workers in the process of trying to fix the problems. After a while, jobs were processed again. It was after this period that we noticed the Host Down on first attempt problem.

I checked the entries in objects.cache and status.dat for some of the affected hosts, and they look fine to me:

Example host (its current state is OK, but it had the problem recently):

objects.cache:

define host {
host_name XXXX
alias F600c
address XXXX
check_period 24x7
check_command check-host-alive_custom!7000.0,80%!10000.0,100%!!!!!!
event_handler xi_host_notification_handler
contacts helpdesk
contact_groups XXXX
notification_period 24x7
initial_state o
importance 0
check_interval 5.000000
retry_interval 1.000000
max_check_attempts 5
active_checks_enabled 1
passive_checks_enabled 1
obsess 0
event_handler_enabled 1
low_flap_threshold 0.000000
high_flap_threshold 0.000000
flap_detection_enabled 1
flap_detection_options u,u
freshness_threshold 0
check_freshness 0
notification_options r,d,u,f,s
notifications_enabled 0
notification_interval 60.000000
first_notification_delay 0.000000
stalking_options n
process_perf_data 1
icon_image firewall.png
retain_status_information 1
retain_nonstatus_information 1
}

same host in status.dat:

hoststatus {
host_name=XXXX
modified_attributes=1
check_command=check-host-alive_custom!7000.0,80%!10000.0,100%!!!!!!
check_period=24x7
notification_period=24x7
check_interval=5.000000
retry_interval=1.000000
event_handler=xi_host_notification_handler
has_been_checked=1
should_be_scheduled=1
check_execution_time=0.961
check_latency=0.669
check_type=0
current_state=0
last_hard_state=0
last_event_id=7584338
current_event_id=7584339
current_problem_id=0
last_problem_id=3516020
plugin_output=OK - 10.255.200.5: rta 404.515ms, lost 0%
long_plugin_output=
performance_data=rta=404.515ms;7000.000;10000.000;0; pl=0%;80;100;;
last_check=1497371073
next_check=1497371376
check_options=0
current_attempt=1
max_attempts=5
state_type=1
last_state_change=1497370176
last_hard_state_change=1497370176
last_time_up=1497371076
last_time_down=1497370176
last_time_unreachable=1441796057
last_notification=1497370176
next_notification=1497373776
no_more_notifications=0
current_notification_number=0
current_notification_id=986343
notifications_enabled=1
problem_has_been_acknowledged=0
acknowledgement_type=0
active_checks_enabled=1
passive_checks_enabled=1
event_handler_enabled=1
flap_detection_enabled=1
process_performance_data=1
obsess=0
last_update=1497371085
is_flapping=0
percent_state_change=0.00
scheduled_downtime_depth=0
}

Is there anything specific I should look for?

Not all of our hosts are affected, but I don't see why. We use host templates to pass alert settings down to the hosts. I also checked the template and everything seems to be fine.
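
To make a hoststatus entry like the one above easier to read, here is a rough sketch that turns the numeric fields into labels and flags the symptom described in this thread. It assumes the usual status.dat encoding (state_type 0 = SOFT, 1 = HARD; host current_state 0 = UP, 1 = DOWN, 2 = UNREACHABLE). The function names are made up for illustration.

```python
STATE_TYPES = {0: "SOFT", 1: "HARD"}
HOST_STATES = {0: "UP", 1: "DOWN", 2: "UNREACHABLE"}


def parse_status_block(body):
    """Parse the key=value lines of one hoststatus block into a dict of strings."""
    fields = {}
    for line in body.splitlines():
        line = line.strip()
        if "=" in line:
            # Split on the first "=" only; values such as plugin output may contain more.
            key, value = line.split("=", 1)
            fields[key] = value
    return fields


def summarize_host(fields):
    """Return a one-line, human-readable summary of a parsed hoststatus block."""
    state = HOST_STATES[int(fields["current_state"])]
    state_type = STATE_TYPES[int(fields["state_type"])]
    return "{}: {} ({}), attempt {}/{}".format(
        fields["host_name"], state, state_type,
        fields["current_attempt"], fields["max_attempts"],
    )


def hard_before_max(fields):
    """True if the host is in a HARD problem state before reaching max_attempts."""
    return (
        int(fields["current_state"]) != 0
        and int(fields["state_type"]) == 1
        and int(fields["current_attempt"]) < int(fields["max_attempts"])
    )
```

Since the entry posted above was captured while the host was UP, `hard_before_max` would only return True for a snapshot taken during one of the problem periods.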
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: Check goes immediately to HARD Host Down State

Post by tgriep »

The strange thing I see is that the Host Down and Host Up times are exactly the same in your screen capture of the State History report, and it doesn't show check 2 of 5, 3 of 5, 4 of 5, etc.
I simulated the issue you are having and it seemed to work correctly here, but my host has obsess=1 and yours has it set to 0, which could be the issue.
What version of Nagios XI are you running?

Can you run the following as root to stop the nagios process, kill off any duplicates, and start it again, and see if that resolves the issue?

Code:

service nagios stop
killall -9 nagios
service nagios start
Then see if the issue happens again.
IMTECH
Posts: 53
Joined: Fri Nov 25, 2011 6:35 am

Re: Check goes immediately to HARD Host Down State

Post by IMTECH »

Hi,

We are on Nagios XI version 2014R2.7.

Yes, I also noticed that there is no check 2, 3, 4 or 5 of 5. It goes instantly from check 1 of 5 to a hard state.

I did as you suggested earlier today, but the problem still persisted. Right now, I don't see any host problems. I checked the hosts which had the problem several times (due to network problems or similar), but right now they are all fine.
Also, only host up/down checks are affected. All service checks work fine and go through soft states.

Do you see any reason why only some of our hosts would be affected? We use different host templates to pass settings to the hosts. Some work exactly as intended (soft states, then hard states) and others seem to be affected by this.
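
For reference, the behaviour we expect with max_check_attempts = 5 can be sketched as follows. This is a deliberately simplified model of soft/hard state handling for active checks, not the actual Nagios Core logic, and the function name `simulate` is hypothetical.

```python
def simulate(results, max_check_attempts):
    """Simplified model of host soft/hard state transitions.

    results: iterable of "UP" / "DOWN" check results, oldest first.
    Returns a list of (state, state_type, attempt) tuples, one per result.
    """
    history = []
    state, state_type, attempt = "UP", "HARD", 1
    for result in results:
        if result == "UP":
            # Recovering from a SOFT problem is a SOFT recovery;
            # anything else leaves the host in a HARD UP state.
            state_type = "SOFT" if (state != "UP" and state_type == "SOFT") else "HARD"
            state, attempt = "UP", 1
        else:
            # The first failure starts at attempt 1; each retry counts up until
            # max_check_attempts is reached, at which point the state is HARD.
            attempt = 1 if state == "UP" else min(attempt + 1, max_check_attempts)
            state = result
            state_type = "HARD" if attempt >= max_check_attempts else "SOFT"
        history.append((state, state_type, attempt))
    return history
```

In this model, five consecutive DOWN results with max_check_attempts = 5 give four SOFT entries followed by one HARD entry. What we see on the affected hosts instead looks like the result for max_check_attempts = 1, where the first DOWN result is already HARD.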
tgriep
Madmin
Posts: 9190
Joined: Thu Oct 30, 2014 9:02 am

Re: Check goes immediately to HARD Host Down State

Post by tgriep »

I think the issue is that those hosts have the Obsess Over Host option set to off in either the template or the host itself.
Try setting it to on and see if that resolves the issue.
IMTECH
Posts: 53
Joined: Fri Nov 25, 2011 6:35 am

Re: Check goes immediately to HARD Host Down State

Post by IMTECH »

Hi,

We do not use the Obsess Over Host option in any of our configurations - never did. How would activating it help?
Do you have other ideas?

Thanks for your support.
IMTECH
Posts: 53
Joined: Fri Nov 25, 2011 6:35 am

Re: Check goes immediately to HARD Host Down State

Post by IMTECH »

The problems persist. For some affected hosts, we tried disabling the host and all of its services in the Nagios XI configuration, applying that configuration, waiting a while, and enabling them again, but that didn't help.
dwhitfield
Former Nagios Staff
Posts: 4583
Joined: Wed Sep 21, 2016 10:29 am
Location: NoLo, Minneapolis, MN
Contact:

Re: Check goes immediately to HARD Host Down State

Post by dwhitfield »

Can you PM me your Profile? You can download it by going to Admin > System Config > System Profile and clicking the Download Profile button towards the top. If for whatever reason you cannot download the profile, please put the output of View System Info (5.3.4+, Show Profile if older) in the thread (that will at least get us some info). This will give us access to many of the logs we would otherwise ask for individually. If security is a concern, you can unzip the profile, take out what you like, and then zip it up again. We may end up needing something you remove, but we can ask for that specifically.

After you PM the profile, please update this thread. Updating this thread is the only way for it to show back up on our dashboard.

Alternatively, you may want to email [email protected]. That's really up to you.
IMTECH
Posts: 53
Joined: Fri Nov 25, 2011 6:35 am

Re: Check goes immediately to HARD Host Down State

Post by IMTECH »

Hi,

I sent you the profile.zip via email, as it was slightly too big for a PM.

The 'system profile' information is attached to the first post if it helps.