How do I improve alert latency?

Support forum for Nagios Core, Nagios Plugins, NCPA, NRPE, NSCA, NDOUtils and more. Engage with the community of users including those using the open source solutions.
rlw_nagios
Posts: 4
Joined: Wed Apr 10, 2024 1:46 pm

How do I improve alert latency?

Post by rlw_nagios »

I need some help with alerts on Nagios Core 4.4.14. My goal is the lowest possible alert latency (VMs take very little time to recover after a reboot) with minimal repeated failure notifications. I am turning off networking on the Linux client to simulate an outage, and with these settings it takes Nagios Core more than 9 minutes to alert that a system is down.

Code:
    max_check_attempts	2
    check_interval		1
    retry_interval		1
    notification_interval	0
Here is what I see in the log:

Code:
[1714153603] SERVICE ALERT: watto;PING;CRITICAL;SOFT;1;CRITICAL - Plugin timed out
[1714153606] HOST ALERT: watto;DOWN;SOFT;1;CRITICAL - Host Unreachable (172.16.11.111)
[1714153667] HOST ALERT: watto;DOWN;SOFT;2;CRITICAL - Host Unreachable (172.16.11.111)
[1714153667] SERVICE ALERT: watto;PING;CRITICAL;HARD;2;CRITICAL - Host Unreachable (172.16.11.111)
[1714153729] HOST ALERT: watto;DOWN;SOFT;3;CRITICAL - Host Unreachable (172.16.11.111)
[1714153789] HOST ALERT: watto;DOWN;SOFT;4;CRITICAL - Host Unreachable (172.16.11.111)
[1714153849] HOST ALERT: watto;DOWN;SOFT;5;CRITICAL - Host Unreachable (172.16.11.111)
[1714153909] HOST ALERT: watto;DOWN;SOFT;6;CRITICAL - Host Unreachable (172.16.11.111)
[1714153969] HOST ALERT: watto;DOWN;SOFT;7;CRITICAL - Host Unreachable (172.16.11.111)
[1714154029] HOST ALERT: watto;DOWN;SOFT;8;CRITICAL - Host Unreachable (172.16.11.111)
[1714154089] HOST ALERT: watto;DOWN;SOFT;9;CRITICAL - Host Unreachable (172.16.11.111)
[1714154149] HOST NOTIFICATION: nagiosadmin;watto;DOWN;notify-host-by-email;CRITICAL - Host Unreachable (172.16.11.111)
[1714154149] HOST NOTIFICATION: slack;watto;DOWN;notify-host-by-slack;CRITICAL - Host Unreachable (172.16.11.111)
When I turn networking back on, the recovery notification arrives very quickly, in under a minute.

Based on my settings this may be the expected behavior, but what settings would get me closer to my goal?
danderson
Posts: 123
Joined: Wed Aug 09, 2023 10:05 am

Re: How do I improve alert latency?

Post by danderson »

Thanks for reaching out @rlw_nagios,

Are those settings on the service or on the host? It looks like they are being applied to the PING service, but the host appears to have max_check_attempts set to around 10.
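As a rough back-of-the-envelope (assuming the stock interval_length of 60 seconds, so a retry_interval of 1 means one minute between retries), the time from the first failed check to the HARD state change is roughly:

```python
# Sketch of the worst-case time from the first failed check to a HARD
# state change, assuming Nagios retries every retry_interval thereafter.
def time_to_hard_alert(max_check_attempts, retry_interval, interval_length=60):
    """Seconds from the first failed (SOFT 1) check to the HARD alert."""
    return (max_check_attempts - 1) * retry_interval * interval_length

print(time_to_hard_alert(10, 1))  # 540 s, i.e. the ~9 minutes seen in the log
print(time_to_hard_alert(2, 1))   # 60 s with max_check_attempts 2
```

The log above bears this out: the first SOFT host alert is at 1714153606 and the notification at 1714154149, a gap of 543 seconds.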
rlw_nagios
Posts: 4
Joined: Wed Apr 10, 2024 1:46 pm

Re: How do I improve alert latency?

Post by rlw_nagios »

Thanks for responding. Here's the full context I'm using now, and it seems to be a lot quicker. (The broken pipe errors are because email isn't configured; everyone prefers Slack anyway.)

Code:
11:07:28
[1714576120] SERVICE NOTIFICATION: nagiosadmin;watto;PING;CRITICAL;notify-service-by-email;CRITICAL - Plugin timed out
[1714576120] SERVICE NOTIFICATION: slack;watto;PING;CRITICAL;notify-service-by-slack;CRITICAL - Plugin timed out
[1714576120] SERVICE ALERT: watto;PING;CRITICAL;HARD;1;CRITICAL - Plugin timed out
[1714576120] wproc: NOTIFY job 3322 from worker Core Worker 1541499 is a non-check helper but exited with return code 127
[1714576120] wproc:   host=watto; service=PING; contact=nagiosadmin
[1714576120] wproc:   early_timeout=0; exited_ok=1; wait_status=32512; error_code=0;
[1714576120] wproc:   stderr line 01: /bin/sh: /bin/mail: No such file or directory
[1714576120] wproc:   stderr line 02: /usr/bin/printf: write error: Broken pipe
11:08:48
I feel like commenting out the retry_interval helped the most but I freely admit I was fumbling around trying to find the best results.

Code:
define host {
    use                     linux-server
    host_name               watto
    alias                   watto
    address                 172.16.11.111
    parents                 Site-Infra
}

define service {
    use                     local-service,nagiosgraph
    host_name               watto
    service_description     PING
    check_command           check_ping!100.0,20%!500.0,60%
    max_check_attempts      1
    check_interval          1
#   retry_interval          1
    check_period            24x7
    notification_interval   0
    notification_period     24x7
    contacts                nagiosadmin,slack
    register                1
}
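Since host DOWN notifications are governed by the host object rather than the service, and the host here still inherits everything from the linux-server template (which ships with max_check_attempts 10), a sketch of overriding the check cadence on the host itself might look like this. The directive values are illustrative assumptions, not a tested configuration:

```
# Hedged sketch: override the inherited check cadence on the host object,
# since the host (not the PING service) drives the DOWN notification.
define host {
    use                     linux-server
    host_name               watto
    alias                   watto
    address                 172.16.11.111
    parents                 Site-Infra
    max_check_attempts      2     ; override the template default (10)
    check_interval          1
    retry_interval          1
    notification_interval   0
}
```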
swolf
Developer
Posts: 312
Joined: Tue Jun 06, 2017 9:48 am

Re: How do I improve alert latency?

Post by swolf »

rlw_nagios wrote: Wed May 01, 2024 10:11 am Thanks for responding, here's the full context I'm using now and it seems to be a lot quicker. I feel like commenting out the retry_interval helped the most but I freely admit I was fumbling around trying to find the best results.

What you're doing now is probably the lowest latency you can get with the defaults (1 minute). If you're not monitoring very many VMs and aren't concerned about load on the Nagios Core instance or ping-spamming the VMs, you may be able to get away with reducing the time unit as well. Per this documentation, you can set interval_length as low as 1 in /usr/local/nagios/etc/nagios.cfg, at which point you'll get a latency of 1 second plus however long your plugin takes to run. Realistically you might set it to something like 15, which reruns the check four times a minute.
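To sketch what that looks like in nagios.cfg (the value here is an illustration, not a recommendation; test the load impact before using it in production):

```
# interval_length redefines the "time unit" that all *_interval directives
# are multiplied by. The default is 60, so check_interval 1 means 1 minute.
interval_length=15    # now check_interval 1 means 15 seconds (4 checks/min)
```

Note that every interval directive in every object definition is scaled by this value, so lowering it globally speeds up all checks, not just this host.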