Issue last week Regarding hard state / parent child (XI 2.5)

Posted: Mon May 04, 2015 9:29 am
by JakeHatMacys
We had an incident last week, on the 27th, where a store switch went down and we still got alerts for all of its children (the servers) in the store:

State History Reporting screen: (ANGWY is the store switch / parent, the rest are the servers / children)

2015-04-27 14:32:12 xxxxxASHYP02  UNREACHABLE HARD 5 of 5 CRITICAL - xxxxxASHYP02: rta nan, lost 100% 
2015-04-27 14:32:12 xxxxxASHYC01  UNREACHABLE HARD 5 of 5 CRITICAL - xxxxxASHYC01: rta nan, lost 100% 
2015-04-27 14:32:12 xxxxxASRET01  UNREACHABLE HARD 5 of 5 CRITICAL - xxxxxASRET01: rta nan, lost 100% 
2015-04-27 14:32:11 xxxxxASHYP01  UNREACHABLE HARD 5 of 5 CRITICAL - xxxxxASHYP01: rta nan, lost 100% 
2015-04-27 14:31:43 xxxxxASRFI01  UNREACHABLE HARD 5 of 5 CRITICAL - xxxxxASRFI01: Host unreachable @ xxxxx.1.162. rta nan, lost 100% 
2015-04-27 14:29:26 xxxxxANGWY  DOWN HARD 3 of 3 CRITICAL - xxxxxANGWY: Host unreachable @ xxxxx.1.162. rta nan, lost 100% 
My question: are the timestamps on these entries from the initial check or from the last check? I ask because the parent seemed to reach the hard state first, yet the children still ran their checks and alerted.

We did change the host check on the servers from its default ping to a check of port 445. Would that break any parent/child logic by chance?

Re: Issue last week Regarding hard state / parent child (XI

Posted: Mon May 04, 2015 11:59 am
by jdalrymple
JakeHatMacys wrote:My question: are the timestamps on these entries from the initial check or from the last check? I ask because the parent seemed to reach the hard state first, yet the children still ran their checks and alerted.
Our service parent logic is still not working right and there are a few bug reports out there to try to get that resolved - this sounds like it's 100% host specific though and we're not talking about service checks at all?
JakeHatMacys wrote:We did change the host check on the servers from its default ping to a check of port 445. Would that break any parent/child logic by chance?
This should not make a difference. If it does - it's a bug.

There are a couple of things missing from your information, or at least things that seem relevant in my mind. While unlikely - is it possible that your retry intervals and such put the children into HARD CRITICAL prior to the parent? Also was it definitely a DOWN notification you received, or was it UNREACHABLE? Obviously the latter would be expected behavior.
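
For reference, the DOWN versus UNREACHABLE distinction can be sketched roughly as below. This is only an illustration of the reachability rule as I understand it from the Nagios Core documentation, not the actual Nagios code; the function name `classify_failed_host` and the plain string states are made up for the example.

```python
def classify_failed_host(parent_states):
    """Classify a host whose own check has just failed.

    parent_states: list of the current states of the host's parents,
    each one of "UP", "DOWN" or "UNREACHABLE" (hypothetical encoding).
    """
    # With no parents, or with at least one parent still UP, the path to
    # the host is considered intact, so the host itself is DOWN.
    if not parent_states or "UP" in parent_states:
        return "DOWN"
    # Every parent is DOWN or UNREACHABLE, so the host is only UNREACHABLE.
    return "UNREACHABLE"


# The store switch has no parent in this sketch, so a failed check is DOWN.
print(classify_failed_host([]))
# A server behind the failed switch is UNREACHABLE rather than DOWN.
print(classify_failed_host(["DOWN"]))
```

Under that rule the UNREACHABLE entries for the servers are what the parent/child logic is supposed to produce; whether an alert goes out for them then depends on whether unreachable notifications are enabled for the hosts and contacts.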

Re: Issue last week Regarding hard state / parent child (XI

Posted: Mon May 04, 2015 12:32 pm
by JakeHatMacys
The parent was DOWN and the children were UNREACHABLE. And yes, this is only for host checks. According to the timeline above, the children hit the hard state after the parent. But again I wanted to know if those times were hard state times or first failure times... the state history report isn't really clear on that.
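
To make the ordering concrete, here is a small Python sketch that takes the rows from the report in my first post and works out how long after the switch each server entry was logged. The helper names (`parse_row`, `lag_behind_parent`) are my own, and the sketch simply takes the logged timestamps at face value; it cannot tell whether they mark the first failure or the hard-state change, which is exactly the open question.

```python
from datetime import datetime

PARENT = "xxxxxANGWY"

# Rows copied from the State History report earlier in the thread,
# with the check output after "N of N" trimmed off.
ROWS = """\
2015-04-27 14:32:12 xxxxxASHYP02  UNREACHABLE HARD 5 of 5
2015-04-27 14:32:12 xxxxxASHYC01  UNREACHABLE HARD 5 of 5
2015-04-27 14:32:12 xxxxxASRET01  UNREACHABLE HARD 5 of 5
2015-04-27 14:32:11 xxxxxASHYP01  UNREACHABLE HARD 5 of 5
2015-04-27 14:31:43 xxxxxASRFI01  UNREACHABLE HARD 5 of 5
2015-04-27 14:29:26 xxxxxANGWY  DOWN HARD 3 of 3
"""


def parse_row(line):
    """Split one report row into (host, state, state_type, timestamp)."""
    date, time, host, state, state_type = line.split()[:5]
    when = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
    return host, state, state_type, when


def lag_behind_parent(rows, parent):
    """Return {child host: seconds between the parent's row and the child's row}."""
    entries = [parse_row(line) for line in rows.splitlines() if line.strip()]
    parent_time = next(when for host, _, _, when in entries if host == parent)
    return {
        host: int((when - parent_time).total_seconds())
        for host, _, _, when in entries
        if host != parent
    }


LAGS = lag_behind_parent(ROWS, PARENT)

for host, seconds in sorted(LAGS.items(), key=lambda item: item[1]):
    print(f"{host}: logged {seconds}s after {PARENT}")
```

Every server row comes out later than the switch row, so if these are hard-state times, the parent was already DOWN HARD before any child went hard.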

Re: Issue last week Regarding hard state / parent child (XI

Posted: Mon May 04, 2015 1:31 pm
by lmiltchev
JakeHatMacys wrote:But again I wanted to know if those times were hard state times or first failure times... the state history report isn't really clear on that.
You can select "Both" from the "State Types" drop-down menu in the "State History" report to show both the hard and the soft states. Please post a screenshot of the "State History" report showing both state types for the same hosts (xxxxxANGWY, xxxxxASRFI01, xxxxxASHYP01, etc.).