Nagios reporting host down but it's not?

Alan · Post by **Alan** » Tue Feb 18, 2020 4:10 pm

I was actually thinking this as well. I looked in the Event Viewer on the server, though, to see the last time it went down. There are two event IDs that I used:

6008 - Logged as a dirty shutdown. It gives the message "The previous system shutdown at time on date was unexpected".
6006 - Logged as a clean shutdown. It gives the message "The Event log service was stopped".
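
For anyone who wants to repeat that lookup from a script rather than the Event Viewer GUI, here is a minimal Python sketch that builds a `wevtutil` query for the two event IDs above against the System log. The function name `build_shutdown_query` and the default of five results are made up for illustration. The command itself only runs on Windows, so the sketch just builds the argument list.

```python
# Event IDs from the post above:
# 6008 = unexpected (dirty) shutdown, 6006 = Event Log service stopped (clean shutdown).
SHUTDOWN_EVENT_IDS = (6006, 6008)


def build_shutdown_query(event_ids=SHUTDOWN_EVENT_IDS, count=5):
    """Return the argument list for a wevtutil query against the System log.

    Hypothetical helper: it only assembles the command and does not run it.
    """
    id_filter = " or ".join(f"EventID={event_id}" for event_id in event_ids)
    xpath = f"*[System[({id_filter})]]"
    return [
        "wevtutil", "qe", "System",
        f"/q:{xpath}",   # XPath filter on the event ID
        f"/c:{count}",   # return at most `count` events
        "/rd:true",      # newest events first
        "/f:text",       # plain-text output instead of XML
    ]


command = build_shutdown_query()
print(" ".join(command))
```

On the Windows server, the resulting list could be passed to `subprocess.run(command, capture_output=True, text=True)` to get the most recent shutdown events as text.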

I did not get anything from the 6008 event ID, and when I filtered by 6006 it showed the last time the server was down was 2/9/2020. That is the day we had a maintenance window, patched the servers, and restarted them. Here are the emails I got. It was very fast, which is why I was thinking maybe it had to do with the ping setting.

***** Nagios *****

Notification Type: PROBLEM
Host: Svr-Data
State: DOWN
Address: 172.16.10.4
Info: (Host check timed out after 30.01 seconds)

Date/Time: Wed Feb 12 07:57:30 PST 2020

***** Nagios *****

Notification Type: RECOVERY
Host: Svr-Data
State: UP
Address: 172.16.10.4
Info: OK: Agent_version was [2.2.0]

Date/Time: Wed Feb 12 07:57:36 PST 2020
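
As a quick sanity check on how close together the two emails are, the timestamps can be subtracted directly. This is only a sketch: the helper name `seconds_between` is made up, and the "PST" label is dropped on the assumption that both timestamps are in the same zone.

```python
from datetime import datetime

# Timestamps copied from the PROBLEM and RECOVERY emails above, without the "PST" label.
PROBLEM_TIME = "Wed Feb 12 07:57:30 2020"
RECOVERY_TIME = "Wed Feb 12 07:57:36 2020"


def seconds_between(start, end, fmt="%a %b %d %H:%M:%S %Y"):
    """Return the number of seconds from `start` to `end`."""
    started = datetime.strptime(start, fmt)
    ended = datetime.strptime(end, fmt)
    return (ended - started).total_seconds()


gap = seconds_between(PROBLEM_TIME, RECOVERY_TIME)
print(f"Recovery arrived {gap:.0f} seconds after the problem notification")
```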

scottwilkerson · Post by **scottwilkerson** » Tue Feb 18, 2020 4:16 pm

It's not the ping setting, because this server isn't even using ping for the host check command; it is using an NCPA version check.

I know this because of the output:

OK: Agent_version was [2.2.0]

scottwilkerson · Post by **scottwilkerson** » Tue Feb 18, 2020 4:18 pm

It's worth noting that before the DOWN email was sent, the Nagios server likely made several retries. I don't know for sure, as you haven't shared the host configuration, just a service, but it would depend on the max_check_attempts value.

The recovery email would be sent right away upon recovery.

Alan · Post by **Alan** » Tue Feb 18, 2020 6:15 pm

Here are the host and service settings from the ncpa.cfg file:

define host {
    host_name               Svr-Data
    address                 172.16.10.4
    hostgroups              VMs, physical_VMs
    check_command           check_ncpa!-t 'public' -P 5693 -M system/agent_version
    max_check_attempts      5
    check_interval          5
    retry_interval          1
    check_period            24x7
    contact_groups          admins2, calls
    notification_interval   60
    notification_period     24x7
    notifications_enabled   1
    notification_options    d,u,r
    icon_image              ncpa.png
    statusmap_image         ncpa.png
    register                1
}

define service {
    host_name               Svr-Data
    service_description     CPU Load
    check_command           check_ncpa!-t 'public' -P 5693 -M cpu/percent -w 70 -c 85 -q 'aggregate=avg'
    max_check_attempts      45
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contact_groups          admins2
    register                1
}

define service {
    host_name               Svr-Data
    service_description     Drive C:
    check_command           check_ncpa!-t 'public' -P 5693 -M 'disk/logical/C:|/used_percent' -w 85 -c 95
    max_check_attempts      5
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contact_groups          admins2
    register                1
}

define service {
    host_name               Svr-Data
    service_description     Drive E: DCABackups
    check_command           check_ncpa!-t 'public' -P 5693 -M 'disk/logical/E:|/used_percent' -w 85 -c 95
    max_check_attempts      5
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contact_groups          admins2
    register                1
}

define service {
    host_name               Svr-Data
    service_description     Drive G: Groups
    check_command           check_ncpa!-t 'public' -P 5693 -M 'disk/logical/G:|/used_percent' -w 85 -c 95
    max_check_attempts      5
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contact_groups          admins2
    register                1
}

define service {
    host_name               Svr-Data
    service_description     Drive I: Users
    check_command           check_ncpa!-t 'public' -P 5693 -M 'disk/logical/I:|/used_percent' -w 85 -c 95
    max_check_attempts      5
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contact_groups          admins2
    register                1
}

define service {
    host_name               Svr-Data
    service_description     Drive P: Public
    check_command           check_ncpa!-t 'public' -P 5693 -M 'disk/logical/P:|/used_percent' -w 85 -c 95
    max_check_attempts      5
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contact_groups          admins2
    register                1
}

define service {
    host_name               Svr-Data
    service_description     Drive O: Collc
    check_command           check_ncpa!-t 'public' -P 5693 -M 'disk/logical/O:|/used_percent' -w 85 -c 95
    max_check_attempts      5
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contact_groups          admins2
    register                1
}

define service {
    host_name               Svr-Data
    service_description     Memory Usage
    check_command           check_ncpa!-t 'public' -P 5693 -M memory/virtual -w 80 -c 90 -u G
    max_check_attempts      45
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contact_groups          admins2
    register                1
}

define service {
    host_name               Svr-Data
    service_description     Ping
    check_command           check_ping!60.0,5%!100.0,10%
    max_check_attempts      5
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contact_groups          admins2
    register                1
}

define service {
    host_name               Svr-Data
    service_description     System Uptime
    check_command           check_ncpa!-t 'public' -P 5693 -M 'system/uptime'
    max_check_attempts      45
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contact_groups          admins2
    register                1
}

Post by **Box293** » Tue Feb 18, 2020 9:15 pm

max_check_attempts 5
check_interval 5
retry_interval 1

1.10am host is OK, next check is 1.15am (check_interval 5)
1.11am host goes down
1.15am Nagios detects host is down, that is check #1 (max_check_attempts 5), next check is 1.16am (retry_interval 1). This is a SOFT state
1.16am host check #2, next check is 1.17am
1.17am host check #3, next check is 1.18am
1.18am host check #4, next check is 1.19am
1.19am host check #5, host is now down HARD and notifications are sent. Next check is 1.24am (check_interval 5)

So here you can see it can take up to 9 minutes before a notification is sent when a host goes down.
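
The arithmetic behind that timeline can be sketched in a few lines of Python. This is an illustration only: the function name is made up, and it assumes the intervals are in minutes (the default `interval_length` of 60) and that each check returns instantly, so check timeouts such as the 30 seconds seen in the email are ignored.

```python
def notification_delay_minutes(check_interval, retry_interval, max_check_attempts):
    """Return (best_case, worst_case) minutes from the host failing to the HARD DOWN state.

    Hypothetical helper. The first failed check counts as attempt #1, so only
    (max_check_attempts - 1) retries happen after the failure is first detected.
    """
    retries = (max_check_attempts - 1) * retry_interval
    best_case = retries                    # host fails just before a scheduled check
    worst_case = check_interval + retries  # host fails just after a scheduled check
    return best_case, worst_case


# Values from the host definition in this thread.
best, worst = notification_delay_minutes(check_interval=5, retry_interval=1, max_check_attempts=5)
print(f"best case: {best} min, worst case: {worst} min")
```

With the host settings from this thread that gives a range of 4 to 9 minutes. The 1.11am example above lands at 8 minutes because the host failed one minute after the last good check.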

Alan · Post by **Alan** » Wed Feb 19, 2020 1:14 pm

Yeah, that is weird then that I got emails saying down and up within 6 seconds. Would you recommend changing any of these settings:
max_check_attempts 5
check_interval 5
retry_interval 1

I have not had this happen again since then, so maybe this was just some weird one-off situation. That being said, 9 minutes seems long to me. If I change max_check_attempts and check_interval to something like 2 or 3, would that create more of a chance for false positives?

Post by **Box293** » Wed Feb 19, 2020 4:54 pm

It's all a balancing act. You need to decide what is best for your environment. Define exactly how long you want to wait before you get a notification and then base your intervals off that.
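
One way to work backwards from a target delay is sketched below in Python. It assumes the worst-case wait is `check_interval + (max_check_attempts - 1) * retry_interval`, matching the timeline earlier in the thread, with intervals in whole minutes. The function name is hypothetical and this is not an official Nagios formula.

```python
def max_attempts_for_target(target_minutes, check_interval, retry_interval):
    """Return the largest max_check_attempts whose worst-case delay fits the target.

    Hypothetical helper; all arguments are whole minutes.
    """
    if retry_interval <= 0:
        raise ValueError("retry_interval must be positive")
    budget = target_minutes - check_interval
    if budget < 0:
        raise ValueError("check_interval alone already exceeds the target")
    # Each attempt after the first adds one retry_interval to the wait.
    return budget // retry_interval + 1


# Example: notify within 5 minutes while checking every 2 minutes.
attempts = max_attempts_for_target(target_minutes=5, check_interval=2, retry_interval=1)
print(f"max_check_attempts = {attempts}")
```

Fewer attempts mean faster notifications but less tolerance for a single slow or timed-out check, which is the trade-off being described here.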

Alan · Post by **Alan** » Thu Feb 20, 2020 2:11 pm

Ok thanks for the response. I will tweak and play with it. Thanks for all the help you provide.

Post by **Box293** » Thu Feb 20, 2020 5:41 pm

Great, locking thread.

Nagios Support Forum
