Identifying bursts of alerts?

Posted: Fri Jan 29, 2021 10:11 am
by davehkent
Is anyone aware of the best way to identify bursts of alerts?

We are trying to track down an issue where we get a sudden burst of alerts that clear within a minute, which makes most of them soft alerts. It can happen up to 5 times per day and so far we have not spotted a pattern. We are trying to pin down the cause but are finding it difficult to use Nagios to identify the time. What I really want is a report showing something like the number of alerts per minute, flagging any minute over a threshold such as 50 or 100.

The best I have found so far is the Alert Stream, which cannot be exported or scheduled and has no vertical scale, so it is hard to tell busy days from quiet days. The other is the Alert Histogram, which is broken down per hour and is too coarse for what we need. The more exact the time, the better our chance of finding a smoking gun in our logs somewhere.

I'm also happy to consider making our own report from the API if someone can suggest a few attributes that might be a good starting point?

We are a little behind on Nagios XI 5.5.11, and have 1588 host checks and 5335 service checks.

Re: Identifying bursts of alerts?

Posted: Fri Jan 29, 2021 5:09 pm
by benjaminsmith
Hi Dave,

Have you tried to pull this data using the State History Report? I believe this may work well here. You can download this report in CSV format for analysis in a spreadsheet if desired. The report has the option to select either hard, soft, or both state types and can filter down to specific hosts or services.

This data can also be pulled from the API using the GET objects/statehistory endpoint, and there is the option to build limited queries based on host or time period. This is documented in the GUI at Help > API Docs.
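
As a rough sketch of how the statehistory data might be turned into a per-minute count: the code below assumes the endpoint lives under /nagiosxi/api/v1/, accepts apikey, starttime and endtime query parameters, and returns records with a state_time field formatted like 2021-01-29 10:11:05. Those details, the example host name and both helper names are assumptions, so please check them against Help > API Docs on your version before relying on them.

```python
from collections import Counter
from urllib.parse import urlencode


def statehistory_url(base_url, api_key, start_ts, end_ts):
    """Build a statehistory query URL (path and parameter names are assumptions)."""
    query = urlencode({"apikey": api_key, "starttime": start_ts, "endtime": end_ts})
    return f"{base_url.rstrip('/')}/nagiosxi/api/v1/objects/statehistory?{query}"


def bursts_per_minute(records, threshold=50):
    """Count records per minute and keep only minutes at or above the threshold."""
    # Assumes state_time looks like "YYYY-MM-DD HH:MM:SS"; the first 16 characters are the minute.
    counts = Counter(rec["state_time"][:16] for rec in records)
    return {minute: n for minute, n in sorted(counts.items()) if n >= threshold}


# Hypothetical records standing in for the parsed API response.
sample = [
    {"state_time": "2021-01-29 10:11:05"},
    {"state_time": "2021-01-29 10:11:40"},
    {"state_time": "2021-01-29 10:12:01"},
]
print(statehistory_url("https://xi.example.com", "APIKEY", 1611878400, 1611964800))
print(bursts_per_minute(sample, threshold=2))
```

Bucketing on the minute prefix of the timestamp keeps the logic independent of how the records are fetched, so the same function should work on rows from the CSV export as long as the time column has the same format.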

Lastly, if you're comfortable with bash/CLI, the Nagios logs can be really helpful for troubleshooting, as everything is logged there and never deleted. You'll find nagios.log at:

/usr/local/nagios/var/nagios.log
And all the date-stamped archives are in:

/usr/local/nagios/var/archives
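
If it helps, here is a minimal sketch of counting alerts per minute from those log files. It relies on the standard log line layout, where each entry starts with an epoch timestamp in square brackets followed by HOST ALERT: or SERVICE ALERT:. The function names, the sample lines and the thresholds are made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime, timezone

# Matches lines such as "[1611915060] SERVICE ALERT: web01;HTTP;CRITICAL;SOFT;1;timeout"
ALERT_RE = re.compile(r"^\[(\d+)\] (?:HOST|SERVICE) ALERT: ")


def alerts_per_minute(lines):
    """Return a Counter mapping each minute (epoch seconds, rounded down) to its alert count."""
    counts = Counter()
    for line in lines:
        match = ALERT_RE.match(line)
        if match:
            ts = int(match.group(1))
            counts[ts - ts % 60] += 1
    return counts


def report(counts, threshold=50):
    """Format the minutes whose alert count meets the threshold, in time order."""
    rows = []
    for minute, n in sorted(counts.items()):
        if n >= threshold:
            stamp = datetime.fromtimestamp(minute, tz=timezone.utc).strftime("%Y-%m-%d %H:%M")
            rows.append(f"{stamp} UTC  {n} alerts")
    return rows


# Hypothetical log lines; for a real run, pass an open file instead, e.g.
#   with open("/usr/local/nagios/var/nagios.log", errors="replace") as fh:
#       counts = alerts_per_minute(fh)
sample = [
    "[1611915060] SERVICE ALERT: web01;HTTP;CRITICAL;SOFT;1;timeout",
    "[1611915075] HOST ALERT: web02;DOWN;SOFT;1;no response",
    "[1611915130] SERVICE ALERT: web01;HTTP;OK;SOFT;2;HTTP OK",
    "[1611915131] SERVICE NOTIFICATION: admin;web01;HTTP;OK;notify;HTTP OK",
]
counts = alerts_per_minute(sample)
print("\n".join(report(counts, threshold=2)))
```

Because the match covers both SOFT and HARD entries, short-lived bursts that never reach a hard state are still counted. The same functions can be run over each file in the archives directory to cover earlier days.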
Hope that helps answer your question, let me know if you need assistance with anything.

--Benjamin

Re: Identifying bursts of alerts?

Posted: Tue Feb 02, 2021 10:04 am
by davehkent
Hi,

Thanks for those suggestions. Looking at it, I think the best way to achieve what I want is to write a Nagios check that looks at its own log and counts the number of alerts in a particular window. If that is above a certain threshold, it could go critical and immediately log a ticket, rather than behaving like our standard checks, which try three times, wait an hour and then log a ticket if they are still in an alert state.
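
A first sketch of what I have in mind is below. It assumes the default log path mentioned earlier, the usual plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and a 60 second window. The ALERT_BURST label, the function names and the thresholds are placeholders of my own.

```python
import re
import time

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3  # standard plugin exit codes
ALERT_RE = re.compile(r"^\[(\d+)\] (?:HOST|SERVICE) ALERT: ")


def count_recent_alerts(lines, now, window):
    """Count alert lines whose timestamp falls within the last `window` seconds."""
    total = 0
    for line in lines:
        match = ALERT_RE.match(line)
        if match and now - window <= int(match.group(1)) <= now:
            total += 1
    return total


def evaluate(count, warn, crit):
    """Map an alert count to an exit code and a one-line status with perfdata."""
    if count >= crit:
        code, label = CRITICAL, "CRITICAL"
    elif count >= warn:
        code, label = WARNING, "WARNING"
    else:
        code, label = OK, "OK"
    return code, f"ALERT_BURST {label} - {count} alerts in window | alerts={count};{warn};{crit}"


def main(path="/usr/local/nagios/var/nagios.log", window=60, warn=50, crit=100):
    """Read the log, print the status line and return the exit code."""
    try:
        with open(path, errors="replace") as fh:
            count = count_recent_alerts(fh, int(time.time()), window)
    except OSError as exc:
        print(f"ALERT_BURST UNKNOWN - cannot read {path}: {exc}")
        return UNKNOWN
    code, message = evaluate(count, warn, crit)
    print(message)
    return code


# When installed as a plugin, the script would end with: raise SystemExit(main())
# Demonstration on hypothetical log lines with a fixed "now":
sample = [
    "[1611915060] SERVICE ALERT: web01;HTTP;CRITICAL;SOFT;1;timeout",
    "[1611915075] HOST ALERT: web02;DOWN;SOFT;1;no response",
    "[1611914000] SERVICE ALERT: db01;MySQL;WARNING;SOFT;1;slow",
]
recent = count_recent_alerts(sample, now=1611915090, window=60)
print(evaluate(recent, warn=2, crit=5))
```

This reads the whole current log on every run, which should be fine as long as the log is rotated regularly. Setting max_check_attempts to 1 on the service would make the first CRITICAL result a hard state, so the ticket is logged straight away.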

Re: Identifying bursts of alerts?

Posted: Tue Feb 02, 2021 4:00 pm
by benjaminsmith
Hi Dave,

That sounds great. Here's a link to the official doc on plugin development to help get you started.

https://nagios-plugins.org/doc/guidelines.html

Let us know if you need anything else on this one.

Thanks,
Benjamin
