Re: Stop Service Checks When Host Down
Posted: Mon Oct 06, 2014 3:58 pm
by zaji_nms
Dear Abrist
In our scenario the HOST never goes down because ping is blocked; I mean we never block ICMP ECHO (ping). It is our MAIN/BASIC connectivity check: if it is okay, then we move on to the SNMP check or any other issue.
When the SERVICES STATUS cannot be checked via SNMP (WARNING) because the HOST is down/unreachable, DURATION should not be recalculated and should not restart for the SERVICES under that HOST.
My last screenshot has two parts: the first part shows SNMP is WARNING (yes, that is okay) and the second is from OPERATION CENTER when the HOST came up (it is green) and the SERVICE is DOWN. Yes SERVICE is down but DURATION is totally wrong, it should be 412days+
So again, SERVICE DURATION should reset between OKAY and CRITICAL only. When it is WARNING/UNKNOWN it should not reset. But please note we are talking only about STATUS (ifoperstatus).
For the bandwidth WARNING THRESHOLD the current behavior is fine. We are not talking about the BANDWIDTH SERVICE CHECK, so please don't mix them up. That check also has WARNING and CRITICAL states, and its DURATION should change between OKAY/WARNING/CRITICAL, and that is okay. Our issue is that the link/interface STATUS should not change any DURATION if the HOST is down.
Regards
Re: Stop Service Checks When Host Down
Posted: Mon Oct 06, 2014 4:27 pm
by abrist
If the host is DOWN and the service cannot be checked, the service will also change state, which resets the duration. This is the expected behavior because the service check changes state (due to the host actually being down) when it is checked.
zaji_nms wrote: Yes SERVICE is down but DURATION is totally wrong, it should be 412days+
This is only wrong if the service has been down for 412 days. If the service was up previously, and the host went down followed by the service, then it should have reset duration.
Correct me if I am wrong, but it seems like you do not want the duration of a service state to change if it changes state due to its host going down and the service being "unreachable". Is that correct?
Re: Stop Service Checks When Host Down
Posted: Mon Oct 06, 2014 4:52 pm
by zaji_nms
Yes dear, you are correct. If HOST is down, under that HOST all the SERVICES state should remain as before, especially STATUS (check_xi_service_ifoperstatus), which should not change.
HOST = TEC-MALC-3
SERVICE = 1-11-26-0 Status
(A) The above service is down and its duration starts: 1, 2, 4, 7 minutes and so on.
(B) Now the HOST is down or not reachable (say for around 15 minutes).
(C) The SERVICE shows the SNMP WARNING issue, with a duration of, say, 1, 2, 5, 9, 10, 15 minutes.
(D) The HOST is up now.
(E) SERVICE = 1-11-26-0 Status is still down; DURATION should be (7+15)=22 minutes (A + B):
7 minutes as the SERVICE was down before the HOST went down
15 minutes as the HOST was down/unreachable
DOWN TIME (DURATION) should be 22 minutes; it should not start again from 1, 6, 11 and so on. THAT IS WRONG.
Regards
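The scenario in steps (A) to (E) can be worked through as a small sketch. This is not Nagios code; it only models duration as the time since the last recorded state change, which is how the replies in this thread describe it, next to the behavior requested here. The timeline is one reading of the steps (the service is CRITICAL for 7 minutes, WARNING for 15 minutes while the host is unreachable, then CRITICAL again), and the function names are hypothetical.

```python
# Model of the timeline in steps (A)-(E). Times are minutes counted from the
# moment the interface first went down. All names here are illustrative only.
events = [
    (0, "CRITICAL"),   # (A) interface goes down
    (7, "WARNING"),    # (B)/(C) host unreachable, SNMP check returns WARNING
    (22, "CRITICAL"),  # (D)/(E) host is back, interface is still down
]

def duration_since_last_change(events, now):
    """Duration as described in the replies: time since the latest state change."""
    last_change = 0
    prev_state = None
    for t, state in events:
        if t > now:
            break
        if state != prev_state:
            last_change = t
            prev_state = state
    return now - last_change

def duration_ignoring(events, now, ignored=("WARNING", "UNKNOWN")):
    """Requested behavior: WARNING/UNKNOWN results do not restart the clock."""
    filtered = [(t, s) for t, s in events if s not in ignored]
    return duration_since_last_change(filtered, now)

now = 23  # one minute after the host came back up
print(duration_since_last_change(events, now))  # 1
print(duration_ignoring(events, now))           # 23
```

At minute 22 the second function returns 22, which is the (A + B) figure from the post, while the first has just restarted from 0.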
Re: Stop Service Checks When Host Down
Posted: Mon Oct 06, 2014 5:14 pm
by abrist
zaji_nms wrote: If HOST is down, under that HOST all the SERVICES state should remain as before
Unfortunately for you, what you are asking for is not the expected behavior. Currently, Nagios will check services when a host is down. If the services then fail, their duration will reset as it always does when they fail. This is expected and well documented. There are many good reasons why this is the case, not the least of which is that if the host is down, it may affect the provided services, and that should be reflected in the availability duration of the service.
What you are asking for would skew duration. If a service is down (even if it is due to a down host), it is no longer available and that should be reflected in duration. If you really want the behavior you are describing, you are welcome to open a feature request for core on GitHub (https://github.com/NagiosEnterprises/nagioscore) or contact [email protected] for a custom development quote.
EDIT: I think the bandwidth check may be the cause of some of the confusion. It stays green because the bandwidth check in the UI is a local check against local MRTG RRDs on the XI system; the actual bandwidth checks run on a cron and are written to separate RRDs. So as long as the checks do not exceed the bandwidth thresholds, they will always be green, even when the host is down. The port checks are SNMP checks made directly against the specified port on the host, so when the host goes down, those services are affected and in turn go critical if configured to do so.
Re: Stop Service Checks When Host Down
Posted: Mon Oct 06, 2014 5:39 pm
by zaji_nms
Thanks! Abrist.
You can close the case, but keep in mind that you could offer this as an additional feature/choice. It is up to the end user how he wants it, so give control to the end user.
One temporary solution you could provide (in the next upgrade) is additional info in the STATUS.
Currently its showing : CRITICAL: Interface 1-11-26-0 (index 2578) is down.
snmpwalk -v 2c -c public tec-malc-3 ifLastChange.2578
IF-MIB::ifLastChange.2578 = Timeticks: (3303822121) 382 days, 9:17:01.21
so it will show : CRITICAL: Interface 1-11-26-0 (index 2578) is down. Last Change 382 days, 9:17:01.21
So the NHM team will know it has been CRITICAL for 5 minutes because the DURATION got reset, but they will also see the link's Last Change from polling the host directly: 382 days, 9:17:01.21
Regards
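The status text proposed above could be assembled roughly as follows. This is a minimal sketch, not part of check_ifoperstatus: it assumes the raw TimeTicks value has already been fetched, and both function names are hypothetical. It formats the ticks in the style of the snmpwalk output shown above. Note that IF-MIB defines ifLastChange as the value of sysUpTime at the moment the interface entered its current state, so the figure is a timestamp on the agent's uptime clock rather than an elapsed time.

```python
def format_timeticks(ticks):
    """Render SNMP TimeTicks (hundredths of a second) as 'D days, H:MM:SS.hh'."""
    days, rem = divmod(ticks, 8_640_000)   # 100 ticks/s * 86400 s/day
    hours, rem = divmod(rem, 360_000)
    minutes, rem = divmod(rem, 6_000)
    seconds, hundredths = divmod(rem, 100)
    return f"{days} days, {hours}:{minutes:02d}:{seconds:02d}.{hundredths:02d}"

def status_line(name, index, last_change_ticks):
    """Build the proposed CRITICAL message with the extra field appended."""
    return (f"CRITICAL: Interface {name} (index {index}) is down. "
            f"Last Change {format_timeticks(last_change_ticks)}")

# Values taken from the snmpwalk output in this post.
print(status_line("1-11-26-0", 2578, 3303822121))
# CRITICAL: Interface 1-11-26-0 (index 2578) is down. Last Change 382 days, 9:17:01.21
```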
Re: Stop Service Checks When Host Down
Posted: Mon Oct 06, 2014 5:49 pm
by zaji_nms
Dear Abrist
Another example
tec-gsr Port aggrgation between tec-fdry and tec-gsr Status 1d 12h 54m 30s CRITICAL: Interface Port-channel1 (index 84) is down.
Nagios Duration : 1d 12h 54m 30s (as yesterday there was some issue and HOST=TEC-GSR was not reachable)
snmpwalk -v 2c -c public tec-gsr ifLastChange.84
IF-MIB::ifLastChange.84 = Timeticks: (406453132) 47 days, 1:02:11.32
NAGIOS could make this change right away (in the next version) and show the following in OPERATION CENTER:
tec-gsr Port aggrgation between tec-fdry and tec-gsr Status 1d 12h 54m 30s CRITICAL: Interface Port-channel1 (index 84) is down. ifLastChange=47 days, 1:02:11.32
Surely Nagios users will love it.
Regards
Re: Stop Service Checks When Host Down
Posted: Tue Oct 07, 2014 2:32 am
by zaji_nms
Dear Abrist/Expert
I think I can achieve this somehow; can you help me?
I tried to modify check_ifoperstatus (adding $lastc), but with no success:
## Check operational status
elsif ( $response->{$snmpIfOperStatus} == 2 ) {
    $state = 'CRITICAL';
    $answer = "Interface $name (index $snmpkey) is dOWn. $lastc";
Regards
Re: Stop Service Checks When Host Down
Posted: Tue Oct 07, 2014 9:07 am
by abrist
Have you considered just checking the oid for lastChange of the port (IF-MIB::ifLastChange.xx)?
This behavior will not change, guaranteed. The Nagios duration metric is based on the last state change from the perspective of the check itself; this is immutable behavior. If you want to report the lastChange from the perspective of the switch (not Nagios), then check that OID. It will not affect the duration for the service check though, as duration is not an uptime metric; it is a last-detected-state-change (of the check) metric.
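If the lastChange OID is checked as suggested, the device-side elapsed time can be derived from it. The following is a minimal sketch under stated assumptions: IF-MIB defines ifLastChange as the value of sysUpTime when the interface entered its current operational state, so the elapsed time is the difference between the two. The ifLastChange value is the one posted earlier in the thread; the sysUpTime value and the function name are hypothetical.

```python
TIMETICKS_MODULUS = 2 ** 32  # TimeTicks is a 32-bit count of 1/100 s

def ticks_since_last_change(sys_uptime, if_last_change):
    """Elapsed TimeTicks since the interface entered its current state.

    The modulo keeps the result correct if sysUpTime has wrapped once
    (roughly every 497 days) since the change was recorded.
    """
    return (sys_uptime - if_last_change) % TIMETICKS_MODULUS

if_last_change = 406453132   # from the thread: IF-MIB::ifLastChange.84
sys_uptime = 500000000       # hypothetical sysUpTime, not from the thread

elapsed = ticks_since_last_change(sys_uptime, if_last_change)
print(elapsed // 100, "seconds since last change")  # 935468 seconds since last change
```

This value comes from the switch and is independent of the Nagios duration, which keeps tracking the last state change seen by the check.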
EDIT: One other thing: I do not expect the behavior of service checks when a host is down to change either. The service actually needs to go critical if the host is actually down, for the sake of availability reports. Otherwise availability reports would become useless from the perspective of SLAs. It would not be correct to report that a service had high availability if its host was down for a large amount of time, making the service unusable to the network and unreachable by Nagios. It would be irresponsible and incorrect for our reports to behave this way.