I need some help with alerts on Nagios Core 4.4.14. My goal is to get the lowest possible alert latency (because the VMs take so little time to recover after a reboot) and minimal repetition of failure. I am turning off networking on the Linux client to simulate an outage, and with these settings it takes Nagios Core more than 9 minutes to alert that a system is down.
Are those settings for the service or for the host? It looks like those settings are being applied to the service PING, but the host seems to have max_check_attempts at something around 10.
Thanks for responding. Here's the full context I'm using now, and it seems to be a lot quicker. (The broken pipe is due to email not being configured; Slack is preferred by all.)
11:07:28
[1714576120] SERVICE NOTIFICATION: nagiosadmin;watto;PING;CRITICAL;notify-service-by-email;CRITICAL - Plugin timed out
[1714576120] SERVICE NOTIFICATION: slack;watto;PING;CRITICAL;notify-service-by-slack;CRITICAL - Plugin timed out
[1714576120] SERVICE ALERT: watto;PING;CRITICAL;HARD;1;CRITICAL - Plugin timed out
[1714576120] wproc: NOTIFY job 3322 from worker Core Worker 1541499 is a non-check helper but exited with return code 127
[1714576120] wproc: host=watto; service=PING; contact=nagiosadmin
[1714576120] wproc: early_timeout=0; exited_ok=1; wait_status=32512; error_code=0;
[1714576120] wproc: stderr line 01: /bin/sh: /bin/mail: No such file or directory
[1714576120] wproc: stderr line 02: /usr/bin/printf: write error: Broken pipe
11:08:48
I feel like commenting out the retry_interval helped the most, but I freely admit I was fumbling around trying to find the best results.
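For anyone reading the log excerpt above, here is a minimal Python sketch that decodes two of its raw numbers: the epoch prefix on each log line and the wait_status reported by wproc. Both values are copied from the log; the conversion is shown in UTC because the server's timezone is not stated in the thread, and the wait-status decoding assumes the usual Linux layout (exit code in the high byte).

```python
from datetime import datetime, timezone

# Values copied from the log excerpt in this thread.
log_epoch = 1714576120   # bracketed prefix on each Nagios log line
wait_status = 32512      # wait_status reported by wproc for the email notify job

# Nagios log prefixes are seconds since the Unix epoch.
alert_utc = datetime.fromtimestamp(log_epoch, tz=timezone.utc)

# Assumption: standard Linux wait-status encoding, exit code in bits 8-15.
exit_code = (wait_status >> 8) & 0xFF

print(alert_utc.isoformat())  # 2024-05-01T15:08:40+00:00
print(exit_code)              # 127
```

Exit code 127 is what /bin/sh returns when a command cannot be found, which matches the "/bin/mail: No such file or directory" line. If the bare times in the post are local times at UTC-4 (an assumption), 15:08:40 UTC is 11:08:40, which falls between the two timestamps.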
rlw_nagios wrote (Wed May 01, 2024 10:11 am):
Thanks for responding, here's the full context I'm using now and it seems to be a lot quicker. I feel like commenting out the retry_interval helped the most but I freely admit I was fumbling around trying to find the best results.
What you're doing now is probably the lowest latency you can get with the default time unit (1 minute). If you're not monitoring very many VMs and aren't concerned about performance on the Nagios Core instance or about spamming the VMs with pings, you may be able to get away with reducing the time unit as well. Per the Nagios Core documentation, you can set interval_length as low as 1 in /usr/local/nagios/etc/nagios.cfg, at which point you'll be getting a latency of 1 second plus however long it takes to run your plugin. Realistically, you might set it to something more like 15, which (with a check_interval of 1) reruns the check 4 times a minute.
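To make the arithmetic concrete, here is a rough Python sketch of how these settings combine into a worst-case time before a HARD alert. It is a simplified model, not how Nagios actually schedules checks: it assumes the outage starts right after a passing check, that every retry waits a full retry_interval, and it ignores scheduler latency. The function name and the example values are hypothetical, not taken from anyone's config in this thread.

```python
def worst_case_detection_seconds(interval_length, check_interval,
                                 retry_interval, max_check_attempts,
                                 plugin_seconds=0):
    """Rough upper bound, in seconds, from outage start to a HARD state.

    Simplified model (assumption): one full check_interval passes before the
    first failing check, then each further attempt waits one retry_interval.
    """
    # Time until the next regularly scheduled check notices the problem.
    first_check_wait = check_interval * interval_length
    # Remaining attempts needed before the state goes from SOFT to HARD.
    retry_wait = (max_check_attempts - 1) * retry_interval * interval_length
    # Each attempt also costs however long the plugin runs (or times out).
    plugin_time = max_check_attempts * plugin_seconds
    return first_check_wait + retry_wait + plugin_time


# Default time unit (60 s), check every interval, alert on first failure.
print(worst_case_detection_seconds(60, 1, 1, 1))                     # 60
# Same, but with interval_length lowered to 15.
print(worst_case_detection_seconds(15, 1, 1, 1))                     # 15
# Hypothetical slower setup: 5-minute checks, 3 attempts, 1-minute retries.
print(worst_case_detection_seconds(60, 5, 1, 3))                     # 420
# A 10-second plugin timeout adds to every attempt.
print(worst_case_detection_seconds(15, 1, 1, 1, plugin_seconds=10))  # 25
```

Under this model, max_check_attempts and check_interval dominate the delay, which is why a large max_check_attempts on the host or service can stretch detection to many minutes even when the retry interval is short.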