Hi,
I've recently set up email and SMS alerting on Nagios XI for our firewalls and switches. Almost immediately I started getting alerts, at different times (but never during the working day), saying the devices are DOWN. However, on closer inspection this is not the case: I also monitor System Uptime, and none of the devices has actually been DOWN (in the truest sense) at any time. So my question is: how can I investigate and determine the cause of this issue? We're talking about 3 firewalls and 9 switches across 3 sites, all registering as DOWN at the same time and all registering as UP at around the same time too.
TIA.
False Positives
Re: False Positives
Hi @rj-admin2,
This could be caused by services going down even momentarily, depending on your monitoring settings for those services. Personally, I've noticed switches can be a particular source of these alerts, as individual ports can report outages for momentary disruptions. You can account for this on the 3rd page of a wizard, where you are asked to choose your values for rechecking a problem-reporting host/service. If that recheck window is too narrow, try increasing the number of check attempts and/or the retry interval at which the host or service is rechecked before you are notified.
To change this setting for an existing host or service, go to Configure -> Core Config Manager -> Monitoring (left navbar) -> Hosts (or Services) -> Edit on the item you'd like to modify. The settings you're looking for are on the Check Settings tab (Retry Interval, Max Check Attempts).
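To get a feel for how those two settings interact, here is a rough sketch in Python. It assumes the usual Nagios Core behaviour: the first failed check puts the object in a SOFT state, it is rechecked every Retry Interval, and it only becomes a HARD problem (which is when notifications can go out) once Max Check Attempts is reached. The function name is made up for illustration, the 60-second `interval_length` is the Nagios Core default, and check execution time and scheduling latency are ignored.

```python
def seconds_until_hard_state(max_check_attempts: int,
                             retry_interval: float,
                             interval_length: int = 60) -> float:
    """Approximate delay between the first failed check and the HARD state.

    Hypothetical helper, not part of Nagios. The first failure counts as
    attempt 1, so (max_check_attempts - 1) rechecks remain, spaced
    retry_interval time units apart. interval_length is the number of
    seconds per time unit (assumed to be the default of 60).
    """
    if max_check_attempts < 1:
        raise ValueError("max_check_attempts must be at least 1")
    return (max_check_attempts - 1) * retry_interval * interval_length


if __name__ == "__main__":
    # Compare a tight recheck window with a more forgiving one.
    for attempts, retry in [(1, 1), (5, 1), (10, 2)]:
        delay = seconds_until_hard_state(attempts, retry)
        print(f"attempts={attempts} retry={retry} min -> ~{delay / 60:.0f} min before notifying")
```

With Max Check Attempts at 1, a single missed check notifies straight away; raising it to 5 with a 1-minute Retry Interval means a blip has to persist for roughly 4 minutes before anyone is paged.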
Let us know if this doesn't fully address your issue or if you have any other questions about this!
Re: False Positives
To add to what jsimon said: are the emails you are getting about specific services on the devices? If it is just certain services going down at the same time, it's possible that something in your network is configured to disable certain ports outside working hours.
Re: False Positives
Many thanks for the replies.
I have reconfigured Nagios to use IP addresses instead of DNS names for the hosts, since it is the hosts that are reported as being DOWN. I'm also now in the process of configuring Parent/Child relationships between the devices, which I hope will reduce the number of alerts we are receiving. In fact, since making these changes we have not had an alert generated in the last 2 nights.
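For anyone wondering why Parent/Child relationships should help, this is my understanding of the logic as a small Python sketch (not Nagios code): a host that stops responding is treated as UNREACHABLE rather than DOWN when it has parents and none of them is responding either. The host names and the `classify` function are invented for the example, and whether UNREACHABLE notifications are actually suppressed depends on each host's notification options.

```python
from typing import Dict, List


def classify(host: str,
             parents: Dict[str, List[str]],
             responding: Dict[str, bool]) -> str:
    """Return 'UP', 'DOWN' or 'UNREACHABLE' for one host.

    parents maps a host to its configured parent hosts (hypothetical
    topology); responding records whether each host answered its check.
    """
    if responding[host]:
        return "UP"
    host_parents = parents.get(host, [])
    # Blocked by an upstream failure: it has parents and none of them responds.
    if host_parents and not any(responding[p] for p in host_parents):
        return "UNREACHABLE"
    return "DOWN"


if __name__ == "__main__":
    # Hypothetical site: two switches sit behind one firewall.
    parents = {"sw-site2-a": ["fw-site2"], "sw-site2-b": ["fw-site2"]}
    responding = {"fw-site2": False, "sw-site2-a": False, "sw-site2-b": False}
    for name in responding:
        print(name, classify(name, parents, responding))
```

In this made-up outage only the firewall comes out as DOWN; the switches behind it are UNREACHABLE, so one upstream problem need not turn into a separate DOWN alert for every device behind it.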
TIA.