Nagios XI Email Glitch

Branigan
Posts: 13
Joined: Tue Jan 14, 2020 3:28 am
Location: South Africa

Post by Branigan »

Hi Community,

I need some advice on an issue we experienced with our Nagios XI monitoring system (Nagios XI 5.6.14) a few days ago.

Alerts for "Flapping Start"/"Flapping Stop" states were sent out at least 2 hours after an outage had occurred, although the service did not appear to be in a flapping state at the time.
The timestamps in the notification emails sent to our Nagios XI contacts did not match any historical data in Nagios XI.
Neither the Nagios XI "Notification tab" nor the configured item's "service history" has any record of the results shown in those emails.

Example: The outage occurred at 04:07 and services recovered at 04:35 (at this stage the outage looks ISP related).
Flapping Start/Flapping Stop emails were still being received up until 08:45.
The service history in Nagios XI has no host/service state records matching the emails sent after the services had recovered.

I have since restarted the system (no health check errors and no system resource issues). This is the first time I have seen this particular issue, and it has not recurred. Has anyone else had the same experience, and what steps did you take to trace the cause or prevent it from happening again?

Thanks :)
scottwilkerson
DevOps Engineer
Posts: 19396
Joined: Tue Nov 15, 2011 3:11 pm
Location: Nagios Enterprises

Re: Nagios XI Email Glitch

Post by scottwilkerson »

Did the Flapping Stop email have an OK state?

If so, this would be expected behavior.
Former Nagios employee
Creator:
Human Design Website
Get Your Human Design Chart

Post by Branigan »

Yes, the Flapping Stop email had an OK state. See the Nagios timestamp:
[Attachment: tempsnip1.png]
However, I do not see corroborating state history here:
[Attachment: tempsnip.png]
Is this normal?

I went through the "State History" of other configured items and found no discrepancies like the one above.
For those items, the flapping state info corroborates the email alerts.

Thanks.

Post by scottwilkerson »

The part of the state history report you are showing contains only OK states. At some point the state changed from non-OK to OK, and if that happened during a period where flapping had started, you would not get any more notifications until the service came out of flapping.

Here's an overview of what happens during flapping:
https://assets.nagios.com/downloads/nag ... pping.html

Post by Branigan »

I completely understand what you are saying, thank you for the info.

In this instance, the service in question was last an issue at around 4:15. From a dashboard and observational perspective it was not actually flapping after 4:15: the service was online and accessible, which was confirmed by other internal system logs.
We only noticed this because our support teams were inundated with Nagios alerts for services that had recovered hours earlier.

This issue has not recurred. I logged this post to find out whether other users have experienced the same or similar issues and what preventative measures they took.

Thanks.

Post by scottwilkerson »

Once a service enters a flapping state based on the criteria in the link above, it remains there, with no notifications going out, until the criteria in the link above for exiting the flapping state are met.

After entering a flapping state, a service can return all OK results, and this stabilization is what allows it to exit the flapping state.

Conversely, it could stabilize in a CRITICAL state, which would also exit the flapping state, but it would then continue to send notifications at the defined notification interval.
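
To illustrate why a Flapping Stop can arrive long after the last real problem, here is a minimal Python sketch of the idea. It is not Nagios's actual implementation: Nagios keeps the last 21 check results and weights recent state changes more heavily, whereas this sketch counts them unweighted. The 5% and 20% thresholds, the comparison operators, the simulate() helper and the sample outage are all assumptions made for illustration.

```python
from collections import deque

HISTORY_SIZE = 21      # Nagios keeps the last 21 check results per service
LOW_THRESHOLD = 5.0    # assumed low flap threshold, in percent
HIGH_THRESHOLD = 20.0  # assumed high flap threshold, in percent


def percent_state_change(history):
    """Unweighted percentage of adjacent results in the window that differ."""
    states = list(history)
    changes = sum(1 for prev, cur in zip(states, states[1:]) if prev != cur)
    return 100.0 * changes / (HISTORY_SIZE - 1)


def simulate(results):
    """Feed check results in order; return (check_number, event) tuples."""
    # Start from a window of stable OK results (an assumption of this sketch).
    history = deque(["OK"] * HISTORY_SIZE, maxlen=HISTORY_SIZE)
    flapping = False
    events = []
    for number, state in enumerate(results, start=1):
        history.append(state)  # the oldest result drops out of the window
        pct = percent_state_change(history)
        if not flapping and pct >= HIGH_THRESHOLD:
            flapping = True
            events.append((number, "FLAPPINGSTART"))
        elif flapping and pct < LOW_THRESHOLD:
            flapping = False
            events.append((number, "FLAPPINGSTOP"))
    return events


# Hypothetical outage: three short failures, then the service stays OK.
outage = ["CRITICAL", "OK"] * 3
events = simulate(outage + ["OK"] * 25)
print(events)
```

In this hypothetical run the service starts flapping on the 4th check and only stops on the 26th, that is, after 20 consecutive OK results have pushed every earlier state change out of the 21-result window. With a 5-minute check interval (also an assumption), that would be 100 minutes of clean results before the Flapping Stop notification goes out.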

Post by Branigan »

Thanks Scott, appreciate the feedback.

Post by scottwilkerson »

No problem