Nagios reporting host down but its not?

Re: Nagios reporting host down but its not?

Post by Alan »

I was actually thinking this as well. I looked in the Event Viewer on the server to see the last time it went down. There are two event IDs that I used:

6008 - Logged as a dirty shutdown. It gives the message "The previous system shutdown at time on date was unexpected"
6006 - Logged as a clean shutdown. It gives the message "The Event log service was stopped".

I did not get anything for event ID 6008, and when I filtered by 6006, the last time the server was down was 2/9/2020. That is the day we had a maintenance window, patched the servers, and restarted them. Here are the emails I got. The DOWN and UP notifications came very close together, which is why I was thinking it might have to do with the ping setting.


***** Nagios *****

Notification Type: PROBLEM
Host: Svr-Data
State: DOWN
Address: 172.16.10.4
Info: (Host check timed out after 30.01 seconds)

Date/Time: Wed Feb 12 07:57:30 PST 2020


***** Nagios *****

Notification Type: RECOVERY
Host: Svr-Data
State: UP
Address: 172.16.10.4
Info: OK: Agent_version was [2.2.0]

Date/Time: Wed Feb 12 07:57:36 PST 2020

Post by scottwilkerson »

It's not the ping setting, because this host isn't even using ping for the host check command; it is using an NCPA agent version check.

I know this because of the output:

OK: Agent_version was [2.2.0]
Former Nagios employee
Creator:
ahumandesign.com
enneagrams.com

Post by scottwilkerson »

It's worth noting that before the DOWN email was sent, the Nagios server likely retried the check several times. I don't know for sure, as you have shared only a service definition and not the host configuration, but it would depend on the max_check_attempts value.

The recovery email would be sent right away upon recovery.

Post by Alan »

Here is the host setting from the ncpa.cfg file:

define host {
    host_name               Svr-Data
    address                 172.16.10.4
    hostgroups              VMs, physical_VMs
    check_command           check_ncpa!-t 'public' -P 5693 -M system/agent_version
    max_check_attempts      5
    check_interval          5
    retry_interval          1
    check_period            24x7
    contact_groups          admins2, calls
    notification_interval   60
    notification_period     24x7
    notifications_enabled   1
    notification_options    d,u,r
    icon_image              ncpa.png
    statusmap_image         ncpa.png
    register                1
}

define service {
    host_name               Svr-Data
    service_description     CPU Load
    check_command           check_ncpa!-t 'public' -P 5693 -M cpu/percent -w 70 -c 85 -q 'aggregate=avg'
    max_check_attempts      45
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contact_groups          admins2
    register                1
}

define service {
    host_name               Svr-Data
    service_description     Drive C:
    check_command           check_ncpa!-t 'public' -P 5693 -M 'disk/logical/C:|/used_percent' -w 85 -c 95
    max_check_attempts      5
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contact_groups          admins2
    register                1
}

define service {
    host_name               Svr-Data
    service_description     Drive E: DCABackups
    check_command           check_ncpa!-t 'public' -P 5693 -M 'disk/logical/E:|/used_percent' -w 85 -c 95
    max_check_attempts      5
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contact_groups          admins2
    register                1
}

define service {
    host_name               Svr-Data
    service_description     Drive G: Groups
    check_command           check_ncpa!-t 'public' -P 5693 -M 'disk/logical/G:|/used_percent' -w 85 -c 95
    max_check_attempts      5
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contact_groups          admins2
    register                1
}

define service {
    host_name               Svr-Data
    service_description     Drive I: Users
    check_command           check_ncpa!-t 'public' -P 5693 -M 'disk/logical/I:|/used_percent' -w 85 -c 95
    max_check_attempts      5
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contact_groups          admins2
    register                1
}

define service {
    host_name               Svr-Data
    service_description     Drive P: Public
    check_command           check_ncpa!-t 'public' -P 5693 -M 'disk/logical/P:|/used_percent' -w 85 -c 95
    max_check_attempts      5
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contact_groups          admins2
    register                1
}

define service {
    host_name               Svr-Data
    service_description     Drive O: Collc
    check_command           check_ncpa!-t 'public' -P 5693 -M 'disk/logical/O:|/used_percent' -w 85 -c 95
    max_check_attempts      5
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contact_groups          admins2
    register                1
}

define service {
    host_name               Svr-Data
    service_description     Memory Usage
    check_command           check_ncpa!-t 'public' -P 5693 -M memory/virtual -w 80 -c 90 -u G
    max_check_attempts      45
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contact_groups          admins2
    register                1
}

define service {
    host_name               Svr-Data
    service_description     Ping
    check_command           check_ping!60.0,5%!100.0,10%
    max_check_attempts      5
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contact_groups          admins2
    register                1
}

define service {
    host_name               Svr-Data
    service_description     System Uptime
    check_command           check_ncpa!-t 'public' -P 5693 -M 'system/uptime'
    max_check_attempts      45
    check_interval          5
    retry_interval          1
    check_period            24x7
    notification_interval   60
    notification_period     24x7
    contact_groups          admins2
    register                1
}

Post by Box293 »

max_check_attempts 5
check_interval 5
retry_interval 1

1.10am host is OK, next check is 1.15am (check_interval 5)
1.11am host goes down
1.15am Nagios detects host is down; that is check #1 (max_check_attempts 5) and this is a SOFT state. Next check is 1.16am (retry_interval 1)
1.16am host check #2, next check is 1.17am
1.17am host check #3, next check is 1.18am
1.18am host check #4, next check is 1.19am
1.19am host check #5, host is now DOWN in a HARD state and notifications are sent. Next check is 1.24am (check_interval 5)

So here you can see it can take up to 9 minutes before a notification is sent when a host goes down.
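
The timeline above comes down to simple arithmetic. As a rough sketch only (the function name is made up for illustration, and it ignores check execution time, plugin timeouts, and scheduler latency), the worst case is one full check_interval before the first failed check, plus one retry_interval for each remaining attempt:

```python
def worst_case_minutes_to_notify(max_check_attempts, check_interval, retry_interval):
    """Rough upper bound, in minutes, between a host failing and the HARD DOWN notification.

    Assumption: intervals are in minutes (interval_length left at 60 seconds)
    and the host fails just after a successful check.
    """
    # Worst case: the host fails right after an OK check, so the first
    # failed check (attempt #1, SOFT state) is a full check_interval away.
    wait_for_first_failed_check = check_interval
    # Attempts #2..#N are each scheduled one retry_interval apart.
    remaining_retries = (max_check_attempts - 1) * retry_interval
    return wait_for_first_failed_check + remaining_retries


# Values from the host definition posted earlier in this thread.
print(worst_case_minutes_to_notify(5, 5, 1))  # 9
```

In the example timeline the host failed one minute after the last OK check, so the notification arrived 8 minutes after the failure rather than the full 9.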
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.

Post by Alan »

Yeah, it is weird then that I got emails saying DOWN and then UP within 6 seconds. Would you recommend changing any of these settings:
max_check_attempts 5
check_interval 5
retry_interval 1

I have not had this happen again since, so maybe it was just a one-off. That being said, 9 minutes seems long to me. If I change max_check_attempts and check_interval to something like 2 or 3, would that increase the chance of false positives?

Post by Box293 »

It's all a balancing act. You need to decide what is best for your environment. Define exactly how long you want to wait before you get a notification and then base your intervals off that.
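
One way to work backwards from a target delay is to list the combinations that fit under it. This is only a sketch: the function name and the candidate value ranges are arbitrary choices for illustration, and the delay estimate is the same rough worst case as before (one check_interval plus one retry_interval per remaining attempt, all in minutes).

```python
from itertools import product


def candidate_settings(target_minutes,
                       check_intervals=(1, 2, 3, 5),
                       retry_intervals=(1,),
                       attempts=range(2, 6)):
    """List (delay, max_check_attempts, check_interval, retry_interval) tuples
    whose estimated worst-case delay is within target_minutes."""
    results = []
    for ci, ri, n in product(check_intervals, retry_intervals, attempts):
        # Worst case: a full check_interval until attempt #1,
        # then (n - 1) retries spaced retry_interval apart.
        delay = ci + (n - 1) * ri
        if delay <= target_minutes:
            results.append((delay, n, ci, ri))
    return sorted(results)


# Example: combinations that should notify within about 5 minutes.
for delay, n, ci, ri in candidate_settings(5):
    print(f"~{delay} min: max_check_attempts {n}, check_interval {ci}, retry_interval {ri}")
```

Keep in mind that a lower max_check_attempts leaves fewer retries to absorb a one-off timeout, and a lower check_interval means more checks are run against the host.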

Post by Alan »

OK, thanks for the response. I will tweak and play with it. Thanks for all the help you provide.

Post by Box293 »

Great, locking thread.