How do I improve alert latency?

Support forum for Nagios Core, Nagios Plugins, NCPA, NRPE, NSCA, NDOUtils and more. Engage with the community of users including those using the open source solutions.
rlw_nagios
Posts: 4
Joined: Wed Apr 10, 2024 1:46 pm

How do I improve alert latency?

Post by rlw_nagios »

I need some help with alerts on Nagios Core 4.4.14. My goal is the lowest possible alert latency (VMs take very little time to recover after a reboot) with minimal repeated failure notifications. I am turning off networking on the Linux client to simulate an outage, and with these settings it takes Nagios Core more than 9 minutes to alert that a system is down.

Code:
    max_check_attempts	2
    check_interval		1
    retry_interval		1
    notification_interval	0
Here is what I see in the log:

Code:
[1714153603] SERVICE ALERT: watto;PING;CRITICAL;SOFT;1;CRITICAL - Plugin timed out
[1714153606] HOST ALERT: watto;DOWN;SOFT;1;CRITICAL - Host Unreachable (172.16.11.111)
[1714153667] HOST ALERT: watto;DOWN;SOFT;2;CRITICAL - Host Unreachable (172.16.11.111)
[1714153667] SERVICE ALERT: watto;PING;CRITICAL;HARD;2;CRITICAL - Host Unreachable (172.16.11.111)
[1714153729] HOST ALERT: watto;DOWN;SOFT;3;CRITICAL - Host Unreachable (172.16.11.111)
[1714153789] HOST ALERT: watto;DOWN;SOFT;4;CRITICAL - Host Unreachable (172.16.11.111)
[1714153849] HOST ALERT: watto;DOWN;SOFT;5;CRITICAL - Host Unreachable (172.16.11.111)
[1714153909] HOST ALERT: watto;DOWN;SOFT;6;CRITICAL - Host Unreachable (172.16.11.111)
[1714153969] HOST ALERT: watto;DOWN;SOFT;7;CRITICAL - Host Unreachable (172.16.11.111)
[1714154029] HOST ALERT: watto;DOWN;SOFT;8;CRITICAL - Host Unreachable (172.16.11.111)
[1714154089] HOST ALERT: watto;DOWN;SOFT;9;CRITICAL - Host Unreachable (172.16.11.111)
[1714154149] HOST NOTIFICATION: nagiosadmin;watto;DOWN;notify-host-by-email;CRITICAL - Host Unreachable (172.16.11.111)
[1714154149] HOST NOTIFICATION: slack;watto;DOWN;notify-host-by-slack;CRITICAL - Host Unreachable (172.16.11.111)
When I turn networking back on, the recovery notification arrives very quickly, in under a minute.

Based on my settings this may be the expected behavior, but what settings would get me closer to my goal?
danderson
Posts: 123
Joined: Wed Aug 09, 2023 10:05 am

Re: How do I improve alert latency?

Post by danderson »

Thanks for reaching out @rlw_nagios,

Are those settings on the service or on the host? It looks like they are being applied to the PING service, but the host appears to have max_check_attempts set to around 10.
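As a rough back-of-the-envelope (assuming the stock interval_length of 60 seconds, so a retry_interval of 1 means one minute between retries), the time from the first failed check to the HARD state change is roughly:

```python
# Sketch of the worst-case time from the first failed check to a HARD
# state change, assuming Nagios retries every retry_interval thereafter.
def time_to_hard_alert(max_check_attempts, retry_interval, interval_length=60):
    """Seconds from the first failed (SOFT 1) check to the HARD alert."""
    return (max_check_attempts - 1) * retry_interval * interval_length

print(time_to_hard_alert(10, 1))  # 540 s, i.e. the ~9 minutes seen in the log
print(time_to_hard_alert(2, 1))   # 60 s with max_check_attempts 2
```

The log above bears this out: the first SOFT host alert is at 1714153606 and the notification at 1714154149, a gap of 543 seconds.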
rlw_nagios
Posts: 4
Joined: Wed Apr 10, 2024 1:46 pm

Re: How do I improve alert latency?

Post by rlw_nagios »

Thanks for responding. Here's the full context I'm using now, and it seems to be a lot quicker. (The broken pipe errors are because email isn't configured; everyone prefers Slack anyway.)

Code:
11:07:28
[1714576120] SERVICE NOTIFICATION: nagiosadmin;watto;PING;CRITICAL;notify-service-by-email;CRITICAL - Plugin timed out
[1714576120] SERVICE NOTIFICATION: slack;watto;PING;CRITICAL;notify-service-by-slack;CRITICAL - Plugin timed out
[1714576120] SERVICE ALERT: watto;PING;CRITICAL;HARD;1;CRITICAL - Plugin timed out
[1714576120] wproc: NOTIFY job 3322 from worker Core Worker 1541499 is a non-check helper but exited with return code 127
[1714576120] wproc:   host=watto; service=PING; contact=nagiosadmin
[1714576120] wproc:   early_timeout=0; exited_ok=1; wait_status=32512; error_code=0;
[1714576120] wproc:   stderr line 01: /bin/sh: /bin/mail: No such file or directory
[1714576120] wproc:   stderr line 02: /usr/bin/printf: write error: Broken pipe
11:08:48
I feel like commenting out the retry_interval helped the most but I freely admit I was fumbling around trying to find the best results.

Code:
define host {
    use                     linux-server
    host_name               watto
    alias                   watto
    address                 172.16.11.111
    parents                 Site-Infra
}

define service {
    use                     local-service,nagiosgraph
    host_name               watto
    service_description     PING
    check_command           check_ping!100.0,20%!500.0,60%
    max_check_attempts      1
    check_interval          1
#   retry_interval          1
    check_period            24x7
    notification_interval   0
    notification_period     24x7
    contacts                nagiosadmin,slack
    register                1
}
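Since host DOWN notifications are governed by the host object rather than the service, and the host here still inherits everything from the linux-server template (which ships with max_check_attempts 10), a sketch of overriding the check cadence on the host itself might look like this. The directive values are illustrative assumptions, not a tested configuration:

```
# Hedged sketch: override the inherited check cadence on the host object,
# since the host (not the PING service) drives the DOWN notification.
define host {
    use                     linux-server
    host_name               watto
    alias                   watto
    address                 172.16.11.111
    parents                 Site-Infra
    max_check_attempts      2     ; override the template default (10)
    check_interval          1
    retry_interval          1
    notification_interval   0
}
```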
swolf
Developer
Posts: 312
Joined: Tue Jun 06, 2017 9:48 am

Re: How do I improve alert latency?

Post by swolf »

rlw_nagios wrote: Wed May 01, 2024 10:11 am Thanks for responding, here's the full context I'm using now and it seems to be a lot quicker. I feel like commenting out the retry_interval helped the most but I freely admit I was fumbling around trying to find the best results.

What you're doing now is probably the lowest latency you can get with the defaults (1 minute). If you're not monitoring very many VMs and aren't concerned about load on the Nagios Core instance or ping-spamming the VMs, you may be able to get away with reducing the time unit as well. Per this documentation, you can set interval_length as low as 1 in /usr/local/nagios/etc/nagios.cfg, at which point you'll get a latency of 1 second plus however long your plugin takes to run. Realistically you might set it to something like 15, which reruns the check four times a minute.
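To sketch what that looks like in nagios.cfg (the value here is an illustration, not a recommendation; test the load impact before using it in production):

```
# interval_length redefines the "time unit" that all *_interval directives
# are multiplied by. The default is 60, so check_interval 1 means 1 minute.
interval_length=15    # now check_interval 1 means 15 seconds (4 checks/min)
```

Note that every interval directive in every object definition is scaled by this value, so lowering it globally speeds up all checks, not just this host.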