Failure to achieve hard OK state

This support forum board is for support questions relating to Nagios XI, our flagship commercial network monitoring solution.
Bitflogger
Posts: 226
Joined: Mon Oct 16, 2017 9:24 am

Failure to achieve hard OK state

Post by Bitflogger »

I am running XI v 5.6.7 on a 64-bit VM CentOS 7 server.

Please see the case titled "Recovery does not send notification, part 2"

Earl
tgriep
Madmin
Posts: 9177
Joined: Thu Oct 30, 2014 9:02 am

Re: Failure to achieve hard OK state

Post by tgriep »

Can you post or PM me a new system profile so we can take a look at the logs and settings for the server?
Please also post or PM this file from the Nagios server:

/usr/local/nagios/var/status.dat

Which host or service is not reaching the Hard OK state?
Be sure to check out our Knowledgebase for helpful articles and solutions!
Bitflogger
Posts: 226
Joined: Mon Oct 16, 2017 9:24 am

Re: Failure to achieve hard OK state

Post by Bitflogger »

soft_ok_2.png
soft_ok_1.png
tgriep
Madmin
Posts: 9177
Joined: Thu Oct 30, 2014 9:02 am

Re: Failure to achieve hard OK state

Post by tgriep »

Thanks for the data in the PM.

The current status for that check needs to be reset so it can rebuild the counters and statuses.
To do that, stop the nagios process:

systemctl stop nagios

Then edit the following file on the server:

/usr/local/nagios/var/retention.dat

Search the file for the entry for the w_vugen_doitnumber service. In retention.dat this is a block that begins with "service {" and ends with "}".

Delete the whole section for that service.

Save the change and start nagios:

systemctl start nagios

Let the system run for 10 minutes and go to the Service Details menu for that service.
Click on the Advanced tab to view the current state. Is it set to an OK Hard state?
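
For anyone who would rather script the edit than do it by hand, the steps above might be sketched roughly as follows in Python. This is only an illustration, not a Nagios-supplied tool: the function names remove_service_block and reset_service are made up for this example, the ".bak" backup name is an assumption, and it assumes each entry is a "service {" ... "}" block of key=value lines like the one quoted in this thread. It should only be run while the nagios process is stopped.

```python
import shutil
from typing import Tuple


def remove_service_block(text: str, host_name: str,
                         service_description: str) -> Tuple[str, int]:
    """Drop every `service { ... }` block matching host and description.

    Returns the new file text and the number of blocks removed.
    """
    kept = []        # lines that stay in the file
    block = None     # lines of the service block being read, else None
    removed = 0
    for line in text.splitlines(keepends=True):
        if block is None:
            if line.strip() == "service {":
                block = [line]
            else:
                kept.append(line)
            continue
        block.append(line)
        if line.strip() == "}":
            # Turn the key=value lines of the finished block into a dict.
            fields = dict(
                entry.strip().split("=", 1)
                for entry in block[1:-1] if "=" in entry
            )
            if (fields.get("host_name") == host_name and
                    fields.get("service_description") == service_description):
                removed += 1          # matching block: do not keep it
            else:
                kept.extend(block)
            block = None
    if block is not None:
        kept.extend(block)            # unterminated block: leave it untouched
    return "".join(kept), removed


def reset_service(path: str, host_name: str, service_description: str) -> int:
    """Back up `path` (hypothetical .bak name), then rewrite it without the block."""
    shutil.copy2(path, path + ".bak")
    with open(path, encoding="utf-8", errors="surrogateescape") as handle:
        text = handle.read()
    new_text, removed = remove_service_block(text, host_name, service_description)
    if removed:
        with open(path, "w", encoding="utf-8", errors="surrogateescape") as handle:
            handle.write(new_text)
    return removed
```

With nagios stopped, a call such as reset_service("/usr/local/nagios/var/retention.dat", "HOSTNAME", "w_vugen_doitnumber") would then stand in for the manual edit, where HOSTNAME is a placeholder for the real host name.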
Bitflogger
Posts: 226
Joined: Mon Oct 16, 2017 9:24 am

Re: Failure to achieve hard OK state

Post by Bitflogger »

Hello,

I am interested in identifying other problems like this.

Is there a combination of values I can look for (using a program) in retention.dat?

Here are the values for doitnumber:

service {
host_name=x*****
service_description=w_vugen_doitnumber
modified_attributes=3
check_command=check_nscp_vugen!read_vugen -a s=$_SERVICESCRIPT$!!!!!!!
check_period=24x7
notification_period=win_wed_0300
event_handler=
has_been_checked=1
check_execution_time=5.237
check_latency=0.000
check_type=0
current_state=0
last_state=0
last_hard_state=0
last_event_id=1272545
current_event_id=1274779
current_problem_id=0
last_problem_id=600116
current_attempt=1
max_attempts=3
check_interval=5.000000
retry_interval=5.000000
state_type=1
last_state_change=1573027441
last_hard_state_change=1573027441
last_time_ok=1573585355
last_time_warning=1573024746
last_time_unknown=1570610588
last_time_critical=1573027441
plugin_output=Transaction "doitnumber" ended with a "Pass" status (Duration: 0.5562 Wasted Time: 0.0046)
long_plugin_output=
performance_data='time'=0.5562s;40;70
last_check=1573585355
next_check=1573585655
check_options=0
notified_on_unknown=0
notified_on_warning=0
notified_on_critical=1
current_notification_number=1
current_notification_id=5815804
last_notification=0
notifications_enabled=1
active_checks_enabled=1
passive_checks_enabled=1
event_handler_enabled=1
problem_has_been_acknowledged=0
acknowledgement_type=0
flap_detection_enabled=0
process_performance_data=1
obsess=1
is_flapping=0
percent_state_change=0.00
check_flapping_recovery_notification=0
flapping_comment_id=0
state_history=0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
_ALERT_DESCRIPTION=0;Check doitnumber web page
_SCRIPT=0;doitnumber
}
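
As a starting point for looking at these values with a program, here is a minimal Python sketch that reads the "service {" blocks from retention.dat text and prints the state-related fields for each one. The function names are invented for this example. The numeric meanings used (current_state 0/1/2/3 = OK/WARNING/CRITICAL/UNKNOWN, state_type 0/1 = SOFT/HARD) are the standard Nagios Core ones. Whether any combination of these fields reliably flags the problem described in this thread is not established here; the sketch only makes the values easy to inspect.

```python
from typing import Dict, Iterator

STATE_NAMES = {"0": "OK", "1": "WARNING", "2": "CRITICAL", "3": "UNKNOWN"}
STATE_TYPES = {"0": "SOFT", "1": "HARD"}


def iter_service_blocks(text: str) -> Iterator[Dict[str, str]]:
    """Yield one dict of key=value fields per `service { ... }` block."""
    block = None  # fields of the block being read, or None outside a block
    for raw in text.splitlines():
        line = raw.strip()
        if block is None:
            if line == "service {":
                block = {}
        elif line == "}":
            yield block
            block = None
        elif "=" in line:
            key, value = line.split("=", 1)
            block[key] = value


def summarize(block: Dict[str, str]) -> str:
    """One-line summary of the state fields of a service block."""
    return "{}/{}: {} {} attempt {}/{}".format(
        block.get("host_name", "?"),
        block.get("service_description", "?"),
        STATE_NAMES.get(block.get("current_state", ""), "?"),
        STATE_TYPES.get(block.get("state_type", ""), "?"),
        block.get("current_attempt", "?"),
        block.get("max_attempts", "?"),
    )
```

Reading /usr/local/nagios/var/retention.dat into a string and printing summarize(b) for each b in iter_service_blocks(text) gives one line per service, which can then be filtered on whatever criteria turn out to be useful.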
tgriep
Madmin
Posts: 9177
Joined: Thu Oct 30, 2014 9:02 am

Re: Failure to achieve hard OK state

Post by tgriep »

The older version of Nagios Core had a bug that would show the Soft recovery state.
Deleting the entry from the retention.dat file causes it to reset the counters and status so it will show the correct data in the future.
Bitflogger
Posts: 226
Joined: Mon Oct 16, 2017 9:24 am

Re: Failure to achieve hard OK state

Post by Bitflogger »

Hello,

I am on v 5.6.7, and I thought my Nagios Core was up to date.

Fixing one problem is nice, but how do I detect others?

I will upgrade to 5.6.8 soon. Will that fix the problem?

Earl
tgriep
Madmin
Posts: 9177
Joined: Thu Oct 30, 2014 9:02 am

Re: Failure to achieve hard OK state

Post by tgriep »

Updating XI on the server brings the system to newer software, but the update does not reset the counters and data for the checks. That is why the retention data has to be edited, so that they will be re-synced.
If you do not want to edit the checks manually, you can delete the retention.dat file, and that will reset all of the checks.
However, doing so would restart all of the checks as if the system were brand new, and all of the notes and manually set downtime would be gone.

I talked to a developer and confirmed that the Hard OK state is not logged in the data that is used for the State History report.
It is stored that way because other functions and features use the check numbers and states to determine uptime, etc.
The reasoning is to keep track of the attempt number when objects change state (OK, Not-OK, etc.).

If you go to the Service Details for the service and open the Advanced tab, it should show as a Hard OK state.

Sorry about the confusion from what I posted earlier.
Bitflogger
Posts: 226
Joined: Mon Oct 16, 2017 9:24 am

Re: Failure to achieve hard OK state

Post by Bitflogger »

Hello,

So when this problem comes to my attention in the future, presumably in a random way, I should correct the problem service by editing the retention.dat file as you explained.

There is no way to proactively identify the problem services.

Additional updates of software or core will not change the problem.

Are the above statements correct?

Earl
tgriep
Madmin
Posts: 9177
Joined: Thu Oct 30, 2014 9:02 am

Re: Failure to achieve hard OK state

Post by tgriep »

The answer to all of the statements is yes, that is correct.

The State History report does show the OK state as soft, as designed.
If a check is in a down Hard state with the full attempt count reached (say it is 5 of 5 checks) and then the check recovers, it will show a Hard OK state (5 of 5).
If the check fails and only goes to a Soft state (for example, 3 of 5) and the check recovers, it will show an OK Soft state (4 of 5).

The Host and Service Details menus should show Hard OK.
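
To make the two cases above concrete, here is a tiny Python sketch of how the State History label comes out at recovery. It models only the two examples given in this post and is not taken from the Nagios Core source; the function name recovery_label is made up for illustration.

```python
from typing import Tuple


def recovery_label(problem_attempt: int, max_attempts: int) -> Tuple[str, int]:
    """Return (state type, attempt number) shown for the OK entry at recovery.

    problem_attempt is the attempt the check had reached in its problem
    state when it recovered.
    """
    if not 1 <= problem_attempt <= max_attempts:
        raise ValueError("problem_attempt must be between 1 and max_attempts")
    if problem_attempt == max_attempts:
        # The problem had reached a Hard state, so the recovery shows as Hard.
        return ("HARD", max_attempts)
    # The problem was still Soft, so the recovery is logged as the next Soft attempt.
    return ("SOFT", problem_attempt + 1)
```

Under this model, recovery_label(5, 5) gives ("HARD", 5) and recovery_label(3, 5) gives ("SOFT", 4), matching the 5 of 5 and 4 of 5 examples.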