I was actually thinking this as well. I looked in the Event Viewer on the server, though, to see the last time it went down. There are two event IDs that I used:
6008 - Logged as a dirty shutdown. It gives the message "The previous system shutdown at time on date was unexpected"
6006 - Logged as a clean shutdown. It gives the message "The Event log service was stopped".
I did not get anything for event ID 6008, and when I filtered by 6006 it showed the last time the server was down was 2/9/2020. That is the day we had a maintenance window, patched the servers, and restarted them. Here are the emails I got. The recovery came very quickly, which is why I was thinking it might have to do with the ping setting.
***** Nagios *****
Notification Type: PROBLEM
Host: Svr-Data
State: DOWN
Address: 172.16.10.4
Info: (Host check timed out after 30.01 seconds)
Date/Time: Wed Feb 12 07:57:30 PST 2020
***** Nagios *****
Notification Type: RECOVERY
Host: Svr-Data
State: UP
Address: 172.16.10.4
Info: OK: Agent_version was [2.2.0]
Date/Time: Wed Feb 12 07:57:36 PST 2020
Nagios reporting host down but its not?
-
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
- Contact:
Re: Nagios reporting host down but its not?
It's not the ping setting, because this server isn't even using ping for the host check command; it is using an NCPA agent version check.
I know this because of the output
OK: Agent_version was [2.2.0]
-
- DevOps Engineer
- Posts: 19396
- Joined: Tue Nov 15, 2011 3:11 pm
- Location: Nagios Enterprises
- Contact:
Re: Nagios reporting host down but its not?
It's worth noting that before the DOWN email was sent, the Nagios server likely made several retries. I don't know for sure, as you haven't shared the host configuration, only a service, but it would depend on the max_check_attempts value.
The recovery email would be sent right away upon recovery
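A rough way to picture this, as a simplified sketch and not how Nagios is actually implemented: count consecutive DOWN results and only send a PROBLEM notification once the count reaches max_check_attempts, then send a RECOVERY on the first UP result after that. The function name and the "UP"/"DOWN" strings are made up for illustration.

```python
def notifications(results, max_check_attempts):
    """Return (check_index, kind) pairs for a sequence of host check results.

    Simplified model: a host only becomes HARD DOWN (and notifies) after
    max_check_attempts consecutive DOWN results. A recovery notification
    is only sent if a PROBLEM notification went out first.
    """
    events = []
    failures = 0       # consecutive DOWN results so far
    hard_down = False  # True once the HARD DOWN state has been reached
    for index, result in enumerate(results):
        if result == "DOWN":
            failures += 1
            if not hard_down and failures >= max_check_attempts:
                hard_down = True
                events.append((index, "PROBLEM"))
        else:
            if hard_down:
                events.append((index, "RECOVERY"))
            hard_down = False
            failures = 0
    return events


# Five failed checks in a row, then one good check.
print(notifications(["UP", "DOWN", "DOWN", "DOWN", "DOWN", "DOWN", "UP"], 5))
# Only four failed checks: the host recovers while still in a SOFT state.
print(notifications(["UP", "DOWN", "DOWN", "DOWN", "DOWN", "UP"], 5))
```

In this model a DOWN email followed seconds later by an UP email is possible when the final retry fails and the very next check succeeds.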
Re: Nagios reporting host down but its not?
Here is the host setting from the ncpa.cfg file:
define host {
host_name Svr-Data
address 172.16.10.4
hostgroups VMs, physical_VMs
check_command check_ncpa!-t 'public' -P 5693 -M system/agent_version
max_check_attempts 5
check_interval 5
retry_interval 1
check_period 24x7
contact_groups admins2, calls
notification_interval 60
notification_period 24x7
notifications_enabled 1
notification_options d,u,r
icon_image ncpa.png
statusmap_image ncpa.png
register 1
}
define service {
host_name Svr-Data
service_description CPU Load
check_command check_ncpa!-t 'public' -P 5693 -M cpu/percent -w 70 -c 85 -q 'aggregate=avg'
max_check_attempts 45
check_interval 5
retry_interval 1
check_period 24x7
notification_interval 60
notification_period 24x7
contact_groups admins2
register 1
}
define service {
host_name Svr-Data
service_description Drive C:
check_command check_ncpa!-t 'public' -P 5693 -M 'disk/logical/C:|/used_percent' -w 85 -c 95
max_check_attempts 5
check_interval 5
retry_interval 1
check_period 24x7
notification_interval 60
notification_period 24x7
contact_groups admins2
register 1
}
define service {
host_name Svr-Data
service_description Drive E: DCABackups
check_command check_ncpa!-t 'public' -P 5693 -M 'disk/logical/E:|/used_percent' -w 85 -c 95
max_check_attempts 5
check_interval 5
retry_interval 1
check_period 24x7
notification_interval 60
notification_period 24x7
contact_groups admins2
register 1
}
define service {
host_name Svr-Data
service_description Drive G: Groups
check_command check_ncpa!-t 'public' -P 5693 -M 'disk/logical/G:|/used_percent' -w 85 -c 95
max_check_attempts 5
check_interval 5
retry_interval 1
check_period 24x7
notification_interval 60
notification_period 24x7
contact_groups admins2
register 1
}
define service {
host_name Svr-Data
service_description Drive I: Users
check_command check_ncpa!-t 'public' -P 5693 -M 'disk/logical/I:|/used_percent' -w 85 -c 95
max_check_attempts 5
check_interval 5
retry_interval 1
check_period 24x7
notification_interval 60
notification_period 24x7
contact_groups admins2
register 1
}
define service {
host_name Svr-Data
service_description Drive P: Public
check_command check_ncpa!-t 'public' -P 5693 -M 'disk/logical/P:|/used_percent' -w 85 -c 95
max_check_attempts 5
check_interval 5
retry_interval 1
check_period 24x7
notification_interval 60
notification_period 24x7
contact_groups admins2
register 1
}
define service {
host_name Svr-Data
service_description Drive O: Collc
check_command check_ncpa!-t 'public' -P 5693 -M 'disk/logical/O:|/used_percent' -w 85 -c 95
max_check_attempts 5
check_interval 5
retry_interval 1
check_period 24x7
notification_interval 60
notification_period 24x7
contact_groups admins2
register 1
}
define service {
host_name Svr-Data
service_description Memory Usage
check_command check_ncpa!-t 'public' -P 5693 -M memory/virtual -w 80 -c 90 -u G
max_check_attempts 45
check_interval 5
retry_interval 1
check_period 24x7
notification_interval 60
notification_period 24x7
contact_groups admins2
register 1
}
define service {
host_name Svr-Data
service_description Ping
check_command check_ping!60.0,5%!100.0,10%
max_check_attempts 5
check_interval 5
retry_interval 1
check_period 24x7
notification_interval 60
notification_period 24x7
contact_groups admins2
register 1
}
define service {
host_name Svr-Data
service_description System Uptime
check_command check_ncpa!-t 'public' -P 5693 -M 'system/uptime'
max_check_attempts 45
check_interval 5
retry_interval 1
check_period 24x7
notification_interval 60
notification_period 24x7
contact_groups admins2
register 1
}
- Box293
- Too Basu
- Posts: 5126
- Joined: Sun Feb 07, 2010 10:55 pm
- Location: Deniliquin, Australia
- Contact:
Re: Nagios reporting host down but its not?
max_check_attempts 5
check_interval 5
retry_interval 1
1.10am host is OK, next check is 1.15am (check_interval 5)
1.11am host goes down
1.15am Nagios detects host is down, that is check #1 (max_check_attempts 5), next check is 1.16am (retry_interval 1) this is a SOFT state
1.16am host check #2, next check is 1.17am
1.17am host check #3, next check is 1.18am
1.18am host check #4, next check is 1.19am
1.19am host check #5, host is now down HARD and notifications are sent. Next check is 1.24am (check_interval 5)
So here you can see it can take up to 9 minutes before a notification is sent when a host goes down.
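The arithmetic behind that 9 minutes can be sketched as follows. This assumes the default interval_length of 60 seconds (so the intervals are in minutes) and ignores check timeouts and scheduling latency; the function names are illustrative only.

```python
def worst_case_delay(check_interval, retry_interval, max_check_attempts):
    """Minutes until notification if the host fails right after a passing check."""
    # Wait one full check_interval for the first failing check,
    # then (max_check_attempts - 1) retries spaced retry_interval apart.
    return check_interval + (max_check_attempts - 1) * retry_interval


def best_case_delay(retry_interval, max_check_attempts):
    """Minutes until notification if the host fails just before a scheduled check."""
    # The first failing check happens almost immediately; only the retries remain.
    return (max_check_attempts - 1) * retry_interval


print(worst_case_delay(5, 1, 5))  # 9
print(best_case_delay(1, 5))      # 4
```

In the timeline above the host fails one minute after the last good check, so the notification arrives 8 minutes after the failure, just under the 9 minute ceiling.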
As of May 25th, 2018, all communications with Nagios Enterprises and its employees are covered under our new Privacy Policy.
Re: Nagios reporting host down but its not?
Yeah, it is weird then that I got emails saying down and up within 6 seconds. Would you recommend changing any of these settings:
max_check_attempts 5
check_interval 5
retry_interval 1
I have not had this happen again since, so maybe it was just a one-off. That said, 9 minutes seems long to me. If I change max_check_attempts and check_interval to something like 2 or 3, would that increase the chance of false positives?
- Box293
- Too Basu
- Posts: 5126
- Joined: Sun Feb 07, 2010 10:55 pm
- Location: Deniliquin, Australia
- Contact:
Re: Nagios reporting host down but its not?
It's all a balancing act. You need to decide what is best for your environment. Define exactly how long you want to wait before you get a notification and then base your intervals off that.
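One way to work backwards from a target, as a hedged sketch: pick the longest delay you can tolerate, then list the check_interval and max_check_attempts combinations that fit inside it. The helper name, the default retry_interval of 1, and the floor of 3 attempts (to keep some protection against one-off failed checks) are assumptions for illustration, not recommendations from the thread.

```python
def candidate_settings(target_minutes, retry_interval=1, min_attempts=3):
    """List (check_interval, max_check_attempts, worst_case_minutes) tuples
    whose worst-case notification delay is within target_minutes."""
    candidates = []
    for check_interval in range(1, target_minutes + 1):
        for attempts in range(min_attempts, target_minutes + 1):
            # Same worst-case model as the timeline: one full check_interval
            # plus (attempts - 1) retries.
            worst = check_interval + (attempts - 1) * retry_interval
            if worst <= target_minutes:
                candidates.append((check_interval, attempts, worst))
    return candidates


# Combinations that notify within 5 minutes in the worst case.
for check_interval, attempts, worst in candidate_settings(5):
    print(f"check_interval {check_interval}, max_check_attempts {attempts}: {worst} min")
```

Fewer attempts shorten the delay but give a single timed-out check more weight, which is the false-positive trade-off being asked about.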
Re: Nagios reporting host down but its not?
Ok thanks for the response. I will tweak and play with it. Thanks for all the help you provide.
- Box293
- Too Basu
- Posts: 5126
- Joined: Sun Feb 07, 2010 10:55 pm
- Location: Deniliquin, Australia
- Contact:
Re: Nagios reporting host down but its not?
Great, locking thread.