NagiosXI Service Alert Aggregation by Host

snapier3 · Post by **snapier3** » Thu Oct 24, 2024 11:50 am

I saw a new thread mentioning this old topic and...
Logs...

The oldest of metrics, and you too can use the power.

The litmus test
If you find yourself wanting to know how many times A had B or C happen in the last X seconds and you want to alert on that threshold, you want log aggregation.
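As a rough sketch of that litmus test (not code from any plugin), counting events inside a trailing window and comparing against a threshold could look like this. The function names and the sample timestamps are hypothetical.

```python
def count_in_window(event_times, now, window_seconds):
    """Count events whose epoch timestamp falls within the last window_seconds."""
    cutoff = now - window_seconds
    return sum(1 for t in event_times if cutoff <= t <= now)


def breaches_threshold(event_times, now, window_seconds, threshold):
    """True when the number of events in the window reaches the threshold."""
    return count_in_window(event_times, now, window_seconds) >= threshold


# Hypothetical epoch timestamps for "B or C happened on A".
events = [100, 160, 220, 400]
print(breaches_threshold(events, now=420, window_seconds=300, threshold=3))
```

Time is the only index used here, which is why a log that already carries a timestamp on every entry is a natural source for this kind of question.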

Why?
Is it because time is the most important index in the query, or is it because the log is the single source of truth that already contains all of this data? Maybe it's both?

Take this snippet

nagiosLogExample.PNG

This does seem to contain everything but wait, there's more.

NagiosXI Log entries via the v1 API
Did you know that the NagiosXI logs are available via the "objects/logentries" endpoint?

nagiosLogentriesRef.PNG

If you have access to the logs, the only remaining question is where to do the aggregation, and maybe XI isn't the place.
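For illustration, here is a minimal sketch of building a request URL for that endpoint. The "/nagiosxi/api/v1" path prefix and the "apikey" parameter are what I would expect on a default install; the "starttime" and "endtime" parameter names, the host name, and the helper name are assumptions, so check them against the API reference in your own XI instance.

```python
from urllib.parse import urlencode


def build_logentries_url(base_url, api_key, start_time, end_time):
    """Build a v1 API URL for objects/logentries limited to a time range."""
    query = urlencode({
        "apikey": api_key,
        "starttime": start_time,  # assumed parameter name, epoch seconds
        "endtime": end_time,      # assumed parameter name, epoch seconds
    })
    return f"{base_url.rstrip('/')}/nagiosxi/api/v1/objects/logentries?{query}"


# Hypothetical host and placeholder key.
url = build_logentries_url("https://xi.example.com", "YOUR_API_KEY", 1729700000, 1729740000)
print(url)
```

The sketch only builds the URL so it can be inspected or passed to whatever HTTP client you already use.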

Enter the plugin
Nagios is state driven, so there needs to be an internal scheduler that executes a method to check that state and processes the results.
No sweat: the scheduling, execution, and processing of the results are the responsibility of the monitoring engine. The method is always a plugin.

Take this one:
https://github.com/SNapier/check_nagalagg/

It gets just the crits,

Code:

(2) CRITICAL PROBLEM/S DETECTED ON (u2204ncpa) IN THE LAST (40000)s. SERVICE/S=[os.linux.cpu.utilization-percent-avg,os.linux.cpu.utilization-percent-avg] | total_count=2; ok_count=0; warn_count=0; crit_count=2;

the crits and the warns,

Code:

(2) CRITICAL PROBLEM/S AND (1) WARNING PROBLEM/S DETECTED ON (u2204ncpa) IN THE LAST (60000)s. SERVICE/S=[os.linux.cpu.utilization-percent-avg,os.linux.cpu.utilization-percent-avg,os.linux.cpu.utilization-percent-avg] | total_count=13; ok_count=10; warn_count=1; crit_count=2;

the negatives, and it's even got performance data.

Code:

NO REPORTABLE PROBLEM/S DETECTED ON (u2204ncpa)IN THE LAST (60)s | total_count=0; ok_count=0; warn_count=0; crit_count=0;

The plugin will also exit with the Nagios state that corresponds to the problems it finds. This normal mode of operation allows it to integrate with the built-in Nagios notification workflows and strategies.
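A minimal sketch of how counts could map onto the standard plugin exit codes (0 = OK, 1 = WARNING, 2 = CRITICAL) and onto a performance data string shaped like the samples above. This is an illustration, not the actual check_nagalagg code; the function names are hypothetical.

```python
OK, WARNING, CRITICAL = 0, 1, 2  # standard Nagios plugin exit codes


def exit_state(warn_count, crit_count):
    """Return the worst state found: CRITICAL beats WARNING beats OK."""
    if crit_count > 0:
        return CRITICAL
    if warn_count > 0:
        return WARNING
    return OK


def perfdata(ok_count, warn_count, crit_count):
    """Format counts in the same shape as the sample output above."""
    total = ok_count + warn_count + crit_count
    return (f"total_count={total}; ok_count={ok_count}; "
            f"warn_count={warn_count}; crit_count={crit_count};")


state = exit_state(warn_count=1, crit_count=2)
print(f"state={state} | {perfdata(10, 1, 2)}")
# A real plugin would finish by printing its message and calling sys.exit(state).
```

Because the exit code is all the engine needs, the notification handling stays in Nagios and the plugin only has to report the worst state it saw.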

More important than what it does is what it doesn't do.

It doesn't limit the start time. Go back as far as you want, but beware: that comes with consequences.

It only calculates hard states.

It does not parse or display content any deeper than the host and a list of service names for event IDs outside of the big three:

65536 = CRITICAL ALERT
32768 = WARN ALERT
262144 = OK
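To make the two limits above concrete, here is a rough sketch of counting hard-state entries per host using those three event IDs. It assumes each log entry is a dict with "logentry_type" and "logentry_data" keys and that the data follows the usual "SERVICE ALERT: host;service;state;state_type;attempt;output" layout. Skipping every other event ID is a simplification in this sketch. The sample entries are made up for the example, and none of this is the plugin's real code.

```python
# Event IDs as listed above.
EVENT_STATES = {65536: "crit", 32768: "warn", 262144: "ok"}


def aggregate(entries, host):
    """Count HARD-state service alerts for one host, keyed by state."""
    counts = {"ok": 0, "warn": 0, "crit": 0}
    services = []
    for entry in entries:
        state = EVENT_STATES.get(int(entry["logentry_type"]))
        if state is None:
            continue  # simplification: skip anything outside the big three
        _, _, payload = entry["logentry_data"].partition(": ")
        fields = payload.split(";")
        if len(fields) < 4:
            continue  # not shaped like a service alert
        entry_host, service, _, state_type = fields[:4]
        if entry_host != host or state_type != "HARD":
            continue  # other hosts and soft states are ignored
        counts[state] += 1
        if state != "ok":
            services.append(service)
    return counts, services


# Made-up sample entries (field names are assumptions).
entries = [
    {"logentry_type": "65536",
     "logentry_data": "SERVICE ALERT: u2204ncpa;os.linux.cpu.utilization-percent-avg;CRITICAL;HARD;5;cpu high"},
    {"logentry_type": "32768",
     "logentry_data": "SERVICE ALERT: u2204ncpa;os.linux.cpu.utilization-percent-avg;WARNING;SOFT;1;cpu elevated"},
    {"logentry_type": "262144",
     "logentry_data": "SERVICE ALERT: u2204ncpa;os.linux.cpu.utilization-percent-avg;OK;HARD;1;cpu ok"},
    {"logentry_type": "65536",
     "logentry_data": "SERVICE ALERT: otherhost;disk;CRITICAL;HARD;5;disk full"},
]
counts, services = aggregate(entries, "u2204ncpa")
print(counts, services)
```

Only the service name is kept from each entry, which matches the goal of keeping the output short.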

It limits the output to a list of services to try to reduce the visual chaos that long service output can cause in the XI interface.

NOTE:
When you start talking log aggregation, you're talking about horsepower in your compute resources. The more logs there are to parse and aggregate, the more horsepower you need to do the aggregation. It's a vicious cycle. Every time you execute a method, you add load to the system, and the more complex the method, the more load you add: the "observer effect". Keep in mind that you may induce more load on the system by using this plugin.

Outside the limits of a plugin
When you find that the resources, specificity, or evaluations you wish to perform get way more complex than what the on-state-change processing model of Nagios can provide, you're probably not doing the aggregation in the right place.

Happy Monitoring,
--SN

snapier3 · Post by **snapier3** » Thu Oct 24, 2024 1:14 pm

There is one small caveat...
When you apply changes in Nagios, the logs have to stack up again before you can search in the past. Until they do, a search that reaches too far back will throw an error.

nagiosLogCaveat.PNG

snapier3 · Post by **snapier3** » Fri Oct 25, 2024 2:02 pm

The Nagios restarts were bugging me...
I retooled the plugin a little and it now uses the event processing start time as a floor for the log time range. If the requested time range reaches back further than the available time, it will automatically use the max available time.
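The floor logic can be sketched roughly as below. How the plugin actually obtains the event processing start time isn't shown here, so it is passed in as a plain epoch value; the function name and the numbers are hypothetical.

```python
def effective_start(now, requested_window, processing_start):
    """Clamp the requested start time so it never precedes the processing start."""
    requested_start = now - requested_window
    return max(requested_start, processing_start)


# Requested window reaches back before event processing started: use the floor.
print(effective_start(now=1000, requested_window=600, processing_start=700))
# Requested window fits inside the available logs: use it as requested.
print(effective_start(now=1000, requested_window=200, processing_start=700))
```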

nagiosLogStartTime-fix.PNG
