Hi all,
Earlier this week we were patching some servers that are monitored by Nagios XI. We scheduled downtime in Nagios for these servers, as that is part of our patching regime. Two of the monitored servers have identical Nagios configurations: both use check_tcp to poll a server on port 1790. Once the maintenance was completed, one of the servers returned with "State Type OK Hard" and the other with "State Type OK Soft", and this is affecting our service availability reports.
So, two questions:
1) How can we fix the service availability report so that it accurately shows when the service was available?
2) How can we stop this issue occurring in the future?
Screenshots shown at this URL
https://photos.google.com/share/AF1QipP ... R0ajF6UUVn
Running Nagios XI Installed Version: 5.5.9
State Type Hard Up vs Soft Up
Re: State Type Hard Up vs Soft Up
The data comes from nagios.log and the files in /usr/local/nagios/var/archives/, so we could edit those if necessary, but I'm not sure why the two servers would show different results like that. I'd like to get the logs covering the 26th through the 29th, as well as a profile from Admin > System Config > System Profile > Download Profile. Please PM these to me, and compress the log files if they are not already compressed.
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
Re: State Type Hard Up vs Soft Up
The logs (excerpt below) show that the -swas machine was actually unreachable (no route to host) while it was in downtime. When a host is in a non-OK state, its services automatically go into a HARD state if they are also non-OK.
Are you using the "Hide scheduled downtime" option found under Advanced when you run the SLA reports? Does this option make a difference?
[Mon May 27 20:00:04 PDT 2019] SERVICE DOWNTIME ALERT: -swas; LOGIN;STARTED; Service has entered a period of scheduled downtime
[Mon May 27 20:00:06 PDT 2019] HOST DOWNTIME ALERT: -swas;STARTED; Host has entered a period of scheduled downtime
[Mon May 27 21:30:44 PDT 2019] HOST ALERT: -swas;DOWN;SOFT;1;CRITICAL - 172.31.120.14: rta nan, lost 100%
[Mon May 27 21:31:33 PDT 2019] SERVICE ALERT: -swas; LOGIN;CRITICAL;HARD;1;connect to address 172.31.120.14 and port 1790: No route to host
[Mon May 27 21:31:45 PDT 2019] HOST ALERT: -swas;DOWN;SOFT;2;CRITICAL - 172.31.120.14: Host unreachable @ 79.99.65.57. rta nan, lost 100%
[Mon May 27 21:32:46 PDT 2019] HOST ALERT: -swas;DOWN;SOFT;3;CRITICAL - 172.31.120.14: Host unreachable @ 79.99.65.57. rta nan, lost 100%
[Mon May 27 21:33:47 PDT 2019] HOST ALERT: -swas;DOWN;SOFT;4;CRITICAL - 172.31.120.14: Host unreachable @ 79.99.65.57. rta nan, lost 100%
[Mon May 27 21:34:45 PDT 2019] HOST ALERT: -swas;UP;SOFT;1;OK - 172.31.120.14: rta 0.819ms, lost 0%
[Mon May 27 21:41:26 PDT 2019] SERVICE ALERT: -swas; LOGIN;OK;SOFT;1;TCP OK - 0.001 second response time on 172.31.120.14 port 1790
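If you want to compare the SOFT/HARD transitions on the two machines yourself, something like the following might help. It is only a rough sketch that parses alert lines in the format shown in the excerpt above (a bracketed human-readable timestamp followed by semicolon-separated fields); the names `Alert` and `parse_alert` are made up for this example and are not part of Nagios.

```python
import re
from typing import NamedTuple, Optional


class Alert(NamedTuple):
    timestamp: str          # kept as text exactly as it appears in the log
    kind: str               # "HOST" or "SERVICE"
    host: str
    service: Optional[str]  # None for host alerts
    state: str              # e.g. UP, DOWN, OK, CRITICAL
    state_type: str         # SOFT or HARD
    attempt: int
    output: str


# Matches only "HOST ALERT:" / "SERVICE ALERT:" lines, not the
# "... DOWNTIME ALERT:" lines, which have a different field layout.
_ALERT_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<kind>HOST|SERVICE) ALERT: (?P<body>.*)$"
)


def parse_alert(line: str) -> Optional[Alert]:
    """Return an Alert for a HOST/SERVICE ALERT line, or None otherwise."""
    m = _ALERT_RE.match(line.strip())
    if m is None:
        return None
    # Host alerts carry 5 fields, service alerts 6 (extra service name).
    n_fields = 5 if m["kind"] == "HOST" else 6
    # Limit the split so semicolons inside the plugin output are preserved.
    parts = [p.strip() for p in m["body"].split(";", n_fields - 1)]
    if len(parts) != n_fields:
        return None
    if m["kind"] == "HOST":
        host, state, state_type, attempt, output = parts
        service = None
    else:
        host, service, state, state_type, attempt, output = parts
    try:
        attempt_no = int(attempt)
    except ValueError:
        return None
    return Alert(m["ts"], m["kind"], host, service, state,
                 state_type, attempt_no, output)
```

Running each line of the excerpt through `parse_alert` and printing `state` and `state_type` for the LOGIN service on both hosts should make it easier to see where the two machines diverge.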
Re: State Type Hard Up vs Soft Up
I have just tried the "Hide Scheduled Downtime" option in the report, but the output is the same.
Re: State Type Hard Up vs Soft Up
The output shouldn't be the same, unless the host is still in downtime. Both the downtime start and end times need to be known for the math in the reports to work, and I am not sure that is the case here.
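To illustrate why both ends of the downtime window matter, here is a rough sketch of availability math with and without scheduled downtime excluded. It is not how Nagios XI computes its reports internally; the function name `availability_percent` is invented for this example, the intervals are assumed not to overlap within each list, and the 23:00 downtime end is a made-up value because the actual end time does not appear in the log excerpt. Time zones are ignored.

```python
from datetime import datetime
from typing import List, Optional, Tuple

Interval = Tuple[datetime, datetime]  # (start, end), start < end


def _clip(a: Interval, b: Interval) -> Optional[Interval]:
    """Return the overlap of two intervals, or None if they do not overlap."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start < end else None


def _seconds(iv: Interval) -> float:
    return (iv[1] - iv[0]).total_seconds()


def availability_percent(report: Interval,
                         outages: List[Interval],
                         downtimes: List[Interval],
                         hide_downtime: bool = True) -> float:
    """Percent of the considered report time the service was not in an outage.

    With hide_downtime=True, scheduled downtime is removed from both the
    considered time and the outage time. Assumes the intervals inside each
    list do not overlap one another.
    """
    considered = _seconds(report)
    bad = 0.0
    for outage in outages:
        clipped = _clip(outage, report)
        if clipped is not None:
            bad += _seconds(clipped)
    if hide_downtime:
        for dt in downtimes:
            dt_in_report = _clip(dt, report)
            if dt_in_report is None:
                continue
            considered -= _seconds(dt_in_report)
            # Outage time that falls inside downtime no longer counts.
            for outage in outages:
                both = _clip(outage, dt_in_report)
                if both is not None:
                    bad -= _seconds(both)
    if considered <= 0:
        raise ValueError("no time left in the report window to evaluate")
    return 100.0 * (1.0 - bad / considered)


# Times taken from the log excerpt in this thread, except the downtime
# end (23:00:00), which is a hypothetical value for illustration.
report = (datetime(2019, 5, 27, 20, 0, 0), datetime(2019, 5, 28, 0, 0, 0))
outage = (datetime(2019, 5, 27, 21, 31, 33), datetime(2019, 5, 27, 21, 41, 26))
downtime = (datetime(2019, 5, 27, 20, 0, 4), datetime(2019, 5, 27, 23, 0, 0))

with_downtime_hidden = availability_percent(report, [outage], [downtime], True)
with_downtime_shown = availability_percent(report, [outage], [downtime], False)
```

In this sketch the outage sits entirely inside the downtime window, so hiding downtime gives 100%, while counting it leaves the figure below 100%. If the end of the downtime window were unknown, there would be no interval to subtract.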
Be sure to check out our Knowledgebase for helpful articles and solutions!
Re: State Type Hard Up vs Soft Up
Sorry - I rechecked the output and tweaked the report times and can confirm that once I isolated the period where the work took place and excluded scheduled downtime it does show 100% availability on the service.
Thanks for your assistance!
Re: State Type Hard Up vs Soft Up
Great! Let us know if it is OK to close the topic then. Thank you!