Logs...
The oldest of metrics, and you too can use the power.
The litmus test
If you find yourself wanting to know how many times A had B or C happen in the last X seconds, and you want to alert when that count crosses a threshold, you want log aggregation.
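To make the litmus test concrete, here is a minimal sketch that counts matching events inside a time window and compares the count to a threshold. It assumes the events have already been reduced to (timestamp, host, state) tuples; the function names, the tuple layout and the threshold are hypothetical and not taken from any Nagios tool.

```python
import time


def count_recent(events, host, states, window_seconds, now=None):
    """Count events for `host` whose state is in `states` and whose
    timestamp falls within the last `window_seconds`."""
    now = time.time() if now is None else now
    cutoff = now - window_seconds
    return sum(
        1
        for ts, event_host, state in events
        if event_host == host and state in states and ts >= cutoff
    )


def breaches(count, threshold):
    """True when the aggregated count has reached the alert threshold."""
    return count >= threshold
```

Passing `now` explicitly keeps the window reproducible when you replay old logs.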
Why?
Is it because time is the most important index in the query, or is it because the log is the single source of truth that already contains all of this data? Maybe it's both.
Take this snippet. This does seem to contain everything, but wait, there's more.
NagiosXI Log entries via the v1 API
Did you know that the NagiosXI logs are available via the "objects/logentries" endpoint? If you have access to the logs, the only remaining question is where to do the aggregation, and maybe XI isn't the place.
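A rough sketch of pulling those entries with nothing but the Python standard library. The host name and API key are placeholders, and the `starttime` query parameter used in the usage note is an assumption about the endpoint; check the API documentation on your own XI server before relying on it. No assumption is made about the shape of the JSON that comes back.

```python
import json
import urllib.parse
import urllib.request


def logentries_url(base_url, api_key, **params):
    """Build the objects/logentries URL. `base_url` is the XI web root,
    e.g. https://xi.example.com/nagiosxi (placeholder host)."""
    query = urllib.parse.urlencode({"apikey": api_key, **params})
    return f"{base_url.rstrip('/')}/api/v1/objects/logentries?{query}"


def fetch_logentries(base_url, api_key, **params):
    """Fetch and decode the JSON response; the caller inspects the result."""
    url = logentries_url(base_url, api_key, **params)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)
```

For example, `fetch_logentries("https://xi.example.com/nagiosxi", "YOURKEY", starttime=1700000000)` would request entries from that epoch forward, if the assumed parameter holds.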
Enter the plugin
Nagios is state driven, so there needs to be an internal scheduler to execute a method to check that state and process the results.
No sweat. The scheduling, execution and processing of the results are the responsibility of the monitoring engine. The method? It's always a plugin.
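For anyone who hasn't written one, the contract between the engine and a plugin is small: one line of output, optional performance data after a pipe, and an exit code of 0, 1, 2 or 3 for OK, WARNING, CRITICAL or UNKNOWN. The sketch below shows that contract only; the function names and the greater-or-equal threshold logic are illustrative choices, not how any particular plugin does it.

```python
# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a measured value to a state using simple >= thresholds."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def format_result(state, message, perfdata):
    """One line of status text, then '|', then label=value; perfdata pairs."""
    perf = " ".join(f"{label}={value};" for label, value in perfdata.items())
    return f"{LABELS[state]}: {message} | {perf}"


def main(value, warn, crit):
    """Print the plugin line and return the exit code for the engine."""
    state = evaluate(value, warn, crit)
    print(format_result(state, f"value is {value}", {"value": value}))
    return state
```

A real plugin would end with `sys.exit(main(...))` so the engine sees the state as the process exit code.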
Take this one:
https://github.com/SNapier/check_nagalagg/
It gets just the crits,

(2) CRITICAL PROBLEM/S DETECTED ON (u2204ncpa) IN THE LAST (40000)s. SERVICE/S=[os.linux.cpu.utilization-percent-avg,os.linux.cpu.utilization-percent-avg] | total_count=2; ok_count=0; warn_count=0; crit_count=2;

the crits and the warnings,

(2) CRITICAL PROBLEM/S AND (1) WARNING PROBLEM/S DETECTED ON (u2204ncpa) IN THE LAST (60000)s. SERVICE/S=[os.linux.cpu.utilization-percent-avg,os.linux.cpu.utilization-percent-avg,os.linux.cpu.utilization-percent-avg] | total_count=13; ok_count=10; warn_count=1; crit_count=2;

or nothing at all.

NO REPORTABLE PROBLEM/S DETECTED ON (u2204ncpa)IN THE LAST (60)s | total_count=0; ok_count=0; warn_count=0; crit_count=0;

More important than what it does is what it doesn't do.
It doesn't limit the start time. Go back as far as you want, but beware: that comes with consequences.
It only calculates hard states.
It does not parse or display content any deeper than the host and a list of service names, and it does not look at event IDs outside of the big three:
- 65536 = CRITICAL ALERT
- 32768 = WARN ALERT
- 262144 = OK
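Here is a sketch of how the big three could be tallied into the counters shown in the plugin output, keeping hard states only. This is not the check_nagalagg source. The field names `logentry_type` and `logentry_data` are assumptions about what the API returns, and the parsing leans on the usual "SERVICE ALERT: host;service;state;state_type;attempt;output" log line layout.

```python
from collections import Counter

# The three type codes listed above, mapped to the perfdata counter names.
TYPE_TO_COUNTER = {65536: "crit_count", 32768: "warn_count", 262144: "ok_count"}


def tally(entries):
    """Count hard-state entries per type and collect problem service names.

    `entries` is a list of dicts with assumed keys `logentry_type`
    and `logentry_data`.
    """
    counts = Counter(total_count=0, ok_count=0, warn_count=0, crit_count=0)
    services = []
    for entry in entries:
        key = TYPE_TO_COUNTER.get(int(entry["logentry_type"]))
        if key is None:
            continue  # not one of the big three
        fields = entry["logentry_data"].split(";")
        if len(fields) < 4 or fields[3] != "HARD":
            continue  # soft states are ignored
        counts["total_count"] += 1
        counts[key] += 1
        if key != "ok_count":
            services.append(fields[1])
    return counts, services
```

The returned counts line up with the `total_count`, `ok_count`, `warn_count` and `crit_count` labels in the output above.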
NOTE:
When you start talking log aggregation, you're talking about horsepower in your compute resources. The more logs there are to parse and aggregate, the more horsepower you need to do the aggregation; it's a vicious cycle. Every time you execute a method, you add load to the system. The more complex the method, the more load you add to the system: the "observer effect". Make note that you may induce more load on the system by using this plugin.
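One cheap way to keep an eye on that cost is to time the aggregation step and report the figure alongside the other counters. A minimal sketch, where the helper name `timed` is hypothetical:

```python
import time


def timed(func, *args, **kwargs):
    """Run `func` and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed
```

Trending the elapsed value over time shows when a wider look-back window starts to hurt.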
Outside the limits of a plugin
When you find that the resources, specificity or evaluations you wish to perform get way more complex than what the on-state-change processing model of Nagios can provide, you're probably not doing the aggregation in the right place.
Happy Monitoring,
--SN