Using the remote agent to measure state duration

Post by **awilson** » Thu May 31, 2018 7:09 pm

Hi. I recognize and accept that the recheck settings give only a best-effort attempt at timing an event's duration. It seems that the alternative is to run the duration check on the remote server: for example, CPU utilization over 95% for 10 minutes.

Are there any challenges with getting something like that to work that we should consider? Thanks!
Alan

Post by **scottwilkerson** » Fri Jun 01, 2018 7:49 am

The biggest challenge comes down to what you are trying to accomplish. If you are trying to get a precise amount of time, you will need to perform the check more frequently, whether it is run from the Nagios server or on the remote server.

Even more so, if you want it to be very precise, realize that, especially in the case given, constantly running this CPU check (every few seconds, for example) in fact adds to the high CPU.
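To illustrate the trade-off: a sustained state can only be measured between the samples you actually take, so the result is accurate to roughly one check interval at each end. Here is a minimal Python sketch, assuming the samples arrive as (unix_timestamp, cpu_percent) pairs. The function name and the 95% default are illustrative only, not something Nagios provides.

```python
from typing import Iterable, Tuple


def longest_breach_seconds(samples: Iterable[Tuple[float, float]],
                           threshold: float = 95.0) -> float:
    """Return the longest span, in seconds, over which every
    consecutive sample stayed above the threshold."""
    longest = 0.0
    run_start = None  # timestamp of the first sample in the current run
    for ts, pct in samples:
        if pct > threshold:
            if run_start is None:
                run_start = ts
            longest = max(longest, ts - run_start)
        else:
            run_start = None  # any sample at or below the threshold ends the run
    return longest


# One sample per minute for ten minutes, all above 95%.
one_minute = [(t, 97.0) for t in range(0, 601, 60)]
print(longest_breach_seconds(one_minute) >= 600)
```

With a 60-second interval the true duration could be up to a minute longer on either side of what this reports. A shorter interval tightens that bound at the cost of more load on the box.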

Post by **mcapra** » Fri Jun 01, 2018 8:22 am

You could configure an event handler to kick off a sar execution on the remote machine when your CPU checks return critical (or whatever status means >95%). I think sar is about as "low profile" as you can get for this particular use case.
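As a rough, untested sketch of the handler side: the script below assumes the event handler command passes `$SERVICESTATE$`, `$SERVICESTATETYPE$` and `$HOSTADDRESS$` as its three arguments, that check_nrpe lives at the usual source-install path, and that the remote NRPE config defines a command called `start_sar_capture`. That command name is hypothetical, as is the choice to act only on HARD CRITICAL.

```python
import sys
from typing import List, Optional

# Assumed install path; adjust for your system.
CHECK_NRPE = "/usr/local/nagios/libexec/check_nrpe"


def handler_command(state: str, state_type: str,
                    host_address: str) -> Optional[List[str]]:
    """Return the NRPE call to make, or None when no action is needed."""
    # Policy assumption: only react once the problem is confirmed (HARD).
    if state == "CRITICAL" and state_type == "HARD":
        # "start_sar_capture" is a hypothetical NRPE command name.
        return [CHECK_NRPE, "-H", host_address, "-c", "start_sar_capture"]
    return None


if __name__ == "__main__" and len(sys.argv) == 4:
    cmd = handler_command(sys.argv[1], sys.argv[2], sys.argv[3])
    # Printed rather than executed so the sketch is safe to try out.
    print(" ".join(cmd) if cmd else "no action")
```

In a real handler you would hand the returned list to `subprocess.run` instead of printing it. Reacting on SOFT states would start the capture sooner, at the risk of firing on blips.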

You could get extra creative and wrap the sar execution in a script that ships the resulting data back to Nagios XI passively, but you'd need to be sure the destination RRD (Nagios XI's perfdata database) associated with the CPU service check has the necessary granularity to support the time series. Otherwise you'd have a whole bunch of useless perfdata being shipped.

Post by **scottwilkerson** » Fri Jun 01, 2018 2:26 pm

Thanks for the tip @mcapra

Post by **awilson** » Fri Jun 01, 2018 6:20 pm

Thank you for the replies. For the sar scenario, would the Nagios XI server receive a passive check threshold violation and then, using an event handler, trigger a remote NRPE check to (1) execute SAR for up to 10 minutes, (2) watch the sar data stream as long as the issue continues, (3) send a WARN or CRITICAL alert back to the Nagios XI server, and (4) then shut down?

Thanks!
Alan

Post by **mcapra** » Fri Jun 01, 2018 9:06 pm

Emphasis on the notion that, if you absolutely needed the results of the complete SAR execution shipped to Nagios XI automatically, that's where things can get pretty complex pretty fast.

One major potential pitfall of this idea is you'd need to be super duper careful about how the event handler is kicking off SAR or you could have overlapping executions which would definitely exacerbate problems on a machine clocking >95% CPU usage. A lockfile of some sort associated with the wrapper script may be a good starting point. See the official docs for the conditions that trigger event handlers:
https://assets.nagios.com/downloads/nag ... dlers.html
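For the lockfile idea, here is a minimal sketch in Python using an advisory `flock` lock (Linux/Unix only). The helper name `single_instance` and the lock path are made up for illustration; the point is just that a second invocation sees the lock and backs off instead of starting another SAR run.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def single_instance(lock_path: str):
    """Yield True if we got the exclusive lock, False if another
    run already holds it. The lock is released when the fd closes."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        try:
            # Non-blocking: fail immediately instead of queueing up.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            yield False
            return
        yield True
    finally:
        os.close(fd)


if __name__ == "__main__":
    # Hypothetical lock location; pick somewhere the NRPE user can write.
    with single_instance("/tmp/sar_capture.lock") as acquired:
        if acquired:
            print("lock acquired, safe to start the SAR wrapper here")
        else:
            print("another capture is already running, exiting")
```

Because the kernel drops the lock when the process exits, a crashed wrapper won't leave a stale lock behind the way a bare "touch a file" approach can.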

This is tricky stuff. It might be best to re-visit this when more details about the performance data changes coming with Nagios XI 6 are revealed, though that may be several months from now as XI 6 is currently slated for Q2 of 2019. Running something like Telegraf pointed at a more modern time-series database baked into XI sure would simplify things a heck of a lot in this particular situation.

Back to SAR things. I haven't tested this at all, but in my mind:

(1) Event handler calls script on remote machine which spawns a forked SAR execution. Forked because having XI or any third party be responsible for the actual collection of your "post mortem metrics" in this case seems flawed.

(2a) Wrap that forked SAR execution in a script of some sort that periodically reaps the SAR data, converts it to a perfdata-friendly format, and ships it to XI passively as a "check result". You may need more than Bash to accomplish this (like an async process execution in Python/Go/Perl that occasionally reaps STDOUT or wherever SAR is dumping the results).
Ooorrrrr....
(2b) Wrap that forked SAR execution in a script of some sort that, at the end of the SAR execution, parses the resulting SAR data from the full ~10 minute run to a perfdata-friendly format and ships a handful of "check results" to XI passively. It'd be 1 to 1 for SAR data-points and passive check results you'd need to submit to XI if you wanted the whole shebang.

(3) Attach whatever status code to the output that seems appropriate.

(4) SAR is completed, any underlying "SAR -> XI" processes are also completed.
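To make (2b) and (3) a little more concrete, here's an untested-in-production sketch of the parsing half in Python. It assumes the text comes from something like `sar -u 1 600` with the usual column layout where `%idle` is the last field, and it formats each data point as a `PROCESS_SERVICE_CHECK_RESULT` external command line with perfdata attached. The function names, the `cpu_busy` label and the 90/95 thresholds are all placeholders, and actually delivering the lines (NRDP, NSCA, or the command file) is left out.

```python
from typing import List


def busy_percentages(sar_output: str) -> List[float]:
    """Pull one busy% value (100 - %idle) per 'all' CPU row."""
    values = []
    for line in sar_output.splitlines():
        fields = line.split()
        # Skip headers, blank lines, and the trailing Average row.
        if "all" not in fields or line.startswith("Average"):
            continue
        try:
            idle = float(fields[-1])  # assumes %idle is the last column
        except ValueError:
            continue  # e.g. locales that print decimal commas
        values.append(round(100.0 - idle, 2))
    return values


def passive_result(ts: int, host: str, service: str, busy: float,
                   warn: float = 90.0, crit: float = 95.0) -> str:
    """Format one data point as a passive service check result."""
    if busy > crit:
        code, label = 2, "CRITICAL"
    elif busy > warn:
        code, label = 1, "WARNING"
    else:
        code, label = 0, "OK"
    return (f"[{ts}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};{code};"
            f"{label} - CPU busy {busy:.2f}%"
            f"|cpu_busy={busy:.2f}%;{warn:g};{crit:g};0;100")


if __name__ == "__main__":
    sample = (
        "12:00:01 AM  CPU  %user  %nice  %system  %iowait  %steal  %idle\n"
        "12:00:02 AM  all  90.00   0.00     7.50     0.00    0.00   2.50\n"
        "Average:     all  90.00   0.00     7.50     0.00    0.00   2.50\n"
    )
    for busy in busy_percentages(sample):
        print(passive_result(1527900000, "web01", "CPU Usage", busy))
```

As noted, this is 1 to 1: every SAR row becomes one passive result, so a 10 minute run at 1 second resolution is 600 submissions unless you thin them out first.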

Post by **tmcdonald** » Mon Jun 04, 2018 11:43 am

Thanks for the assist, @mcapra! OP, please let us know if you need further assistance.

Nagios Support Forum
